Six Frequently Used Terms in Big Data

With the deluge of data being added to systems, it is getting difficult to collect, curate and process information. A growing middle-class population, the widespread penetration of mobile devices and increasing technology adoption are contributing to an exponential rise in the quantum of data.
This is commonly termed the Big Data problem. In this blog, I have listed and defined six frequently used terms in big data.

1.) Big Data: Uri Friedman, in his article at FP (Foreign Policy), charted the timeline of Big Data's evolution. In 1997, NASA researchers Michael Cox and David Ellsworth used the term "big data" for the first time to describe a familiar challenge in the 1990s: supercomputers generating massive amounts of information -- in Cox and Ellsworth's case, simulations of airflow around aircraft -- that could not be processed and visualized. "[D]ata sets are generally quite large, taxing the capacities of main memory, local disk, and even remote disk," they wrote. "We call this the problem of big data."

Wikipedia defines big data in information technology as a large and complex collection of data sets that is difficult to process using on-hand database management tools or traditional data processing applications.

Big Data infographic by wakeuptj.

2.) Hadoop: Apache Hadoop is an open-source software framework that supports data-intensive distributed applications. It is licensed under the Apache License 2.0.

3.) MapReduce: MapReduce is a programming model for processing large data sets, and the name of an implementation of the model by Google. MapReduce is typically used to do distributed computing on clusters of computers.
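
To make the model concrete, here is a minimal single-process sketch of the classic word-count example in Python. It is not Google's or Hadoop's implementation; the function names (map_phase, shuffle, reduce_phase, word_count) are made up for illustration, and on a real cluster the map and reduce steps would run on many machines in parallel.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # Assumption: everything runs in one process. A cluster would hand
    # each document (or split) to a different worker for the map step.
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    docs = ["big data is big", "data in memory"]
    print(word_count(docs))
```

The point of the split is that map_phase looks at one document at a time and reduce_phase looks at one key at a time, so both steps can be spread across machines without sharing state.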

4.) Cluster Analysis: Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters).
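
As an illustration, below is a small sketch of k-means, one of many clustering techniques, in plain Python. It assumes 2-D points, uses straight-line distance as the measure of similarity, and simply takes the first k points as starting centres so the result is repeatable. The function name kmeans and the sample points are hypothetical.

```python
import math


def kmeans(points, k, iterations=10):
    """Group points into k clusters using a basic k-means loop."""
    # Assumption: the first k points act as the starting centroids.
    centroids = list(points[:k])
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(
                    sum(coords) / len(members) for coords in zip(*members)
                )
    return assignments, centroids


if __name__ == "__main__":
    sample = [(1.0, 1.0), (9.0, 9.0), (1.2, 0.8), (8.8, 9.2)]
    labels, centres = kmeans(sample, k=2)
    print(labels)
    print(centres)
```

Points that end up with the same label are closer to their shared centre than to the other one, which is the "more similar to each other" idea in the definition above.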

5.) Predictive Modelling: Predictive modelling is the process by which a model is created or chosen to try to best predict the probability of an outcome.
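
One common way to predict the probability of an outcome is logistic regression. The sketch below fits a one-feature model with plain gradient descent. The data (hours of study versus pass or fail), the function names and the learning-rate settings are all made up for illustration; in practice a library such as scikit-learn would normally do this work.

```python
import math


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, learning_rate=0.1, epochs=2000):
    """Fit p(y=1 | x) = sigmoid(w*x + b) by gradient descent on log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Prediction errors for the current parameters.
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


def predict_probability(model, x):
    """Return the modelled probability of the positive outcome for x."""
    w, b = model
    return sigmoid(w * x + b)


if __name__ == "__main__":
    # Hypothetical training data: hours studied and whether the exam was passed.
    hours = [1, 2, 3, 4, 5, 6]
    passed = [0, 0, 0, 1, 1, 1]
    model = fit_logistic(hours, passed)
    print(predict_probability(model, 1))
    print(predict_probability(model, 6))
```

The fitted model gives a low probability of passing for one hour of study and a high one for six hours, which is the kind of outcome probability the definition refers to.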

6.) In-Memory Data Computing: Real-time or near-real-time information delivery is one of the defining characteristics of big data analytics. Latency is therefore avoided whenever and wherever possible. Data in memory is good; data on spinning disk at the other end of an FC SAN connection is not. The cost of a SAN at the scale needed for analytics applications is also much higher than that of other storage techniques. Source: Wikipedia.
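
The idea can be sketched in a few lines of Python: one function goes back to the file on storage for every lookup, while the other reads the file once and answers later lookups from a dictionary held in RAM. The file layout (two-column CSV), the function names and the sample records are assumptions made for this illustration only.

```python
import csv
import os
import tempfile


def lookup_on_disk(path, key):
    """Scan the file on every call, so each lookup pays the storage latency."""
    with open(path, newline="") as handle:
        for row_key, value in csv.reader(handle):
            if row_key == key:
                return value
    return None


def load_into_memory(path):
    """Read the file once and keep the records in a dictionary in RAM."""
    with open(path, newline="") as handle:
        return {row_key: value for row_key, value in csv.reader(handle)}


if __name__ == "__main__":
    # Hypothetical sample data written to a temporary file.
    with tempfile.NamedTemporaryFile(
        "w", suffix=".csv", delete=False, newline=""
    ) as tmp:
        csv.writer(tmp).writerows([("user1", "42"), ("user2", "17")])
        path = tmp.name
    try:
        in_memory = load_into_memory(path)
        print(lookup_on_disk(path, "user2"))  # reads the file again
        print(in_memory.get("user2"))         # answered from memory
    finally:
        os.remove(path)
```

Both calls return the same value. The difference is where the time goes: the first touches storage on every request, the second only once at load time.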

This is not a comprehensive list. I would appreciate your feedback on what else could be added to make it more complete.