Six Frequently Used Terminologies in Big Data

With the deluge of data being added to systems, it is getting difficult to collect, curate and process information. A growing middle-class population and the widespread penetration of mobility and technology adoption are contributing to an exponential rise in the quantum of data. This is commonly termed the Big Data problem. In this blog, I have listed and defined six frequently used terminologies in big data.

1.) Big Data: Uri Friedman, in his article at FP, charted the timeline of Big Data evolution. In 1997, NASA researchers Michael Cox and David Ellsworth used the term "big data" for the first time to describe a familiar challenge in the 1990s: supercomputers generating massive amounts of information -- in Cox and Ellsworth's case, simulations of airflow around aircraft -- that could not be processed and visualized. "[D]ata sets are generally quite large, taxing the capacities of main memory, local disk, and even remote disk," they wrote. "We call this the problem of big data."

Wikipedia defines big data in information technology as a large and complex collection of data sets that is difficult to process using on-hand database management tools or traditional data processing applications.

2.) Hadoop: Apache Hadoop is an open-source software framework that supports data-intensive distributed applications, licensed under the Apache v2 license.

3.) MapReduce: MapReduce is a programming model for processing large data sets, and the name of an implementation of the model by Google. MapReduce is typically used to do distributed computing on clusters of computers.
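To make the model concrete, here is a minimal single-machine sketch of the classic word-count example. It only imitates the map, shuffle and reduce steps in plain Python; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration and are not Hadoop's or Google's API. A real framework would run these steps in parallel across a cluster.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run the three steps in sequence (hypothetical driver function)."""
    pairs = []
    for document in documents:
        pairs.extend(map_phase(document))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data is big", "data is everywhere"])
print(counts)
```

Because each document is mapped independently and each key is reduced independently, both steps can in principle be spread over many machines.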

4.) Cluster Analysis: Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters).
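As an illustration, the sketch below groups a handful of numbers with k-means, which is only one of many clustering techniques and is chosen here as an assumption, not because the definition above prescribes it. The function name `kmeans_1d`, the sample points and the starting centers are all hypothetical.

```python
def kmeans_1d(points, centers, iterations=10):
    """Very small k-means for one-dimensional data.

    Each round assigns every point to its nearest center, then moves
    each center to the mean of the points assigned to it.
    """
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for point in points:
            nearest = min(range(len(centers)),
                          key=lambda i: abs(point - centers[i]))
            clusters[nearest].append(point)
        # Keep the old center if a cluster ended up empty.
        centers = [sum(cluster) / len(cluster) if cluster else centers[i]
                   for i, cluster in enumerate(clusters)]
    return centers, clusters


points = [1.0, 1.5, 2.0, 10.0, 10.5, 11.0]
centers, clusters = kmeans_1d(points, centers=[0.0, 5.0])
print(centers)
print(clusters)
```

Here "similar" simply means "close on the number line"; with real data the choice of distance measure is usually the hard part.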

5.) Predictive Modelling: Predictive modelling is the process by which a model is created or chosen to try to best predict the probability of an outcome.
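One possible way to picture this is a tiny logistic regression fitted with gradient descent, sketched below. Logistic regression is just one example of a predictive model, and the data (hours studied versus pass or fail), the function names and the learning settings are all invented for illustration.

```python
import math


def sigmoid(z):
    """Squash any real number into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, learning_rate=0.1, epochs=2000):
    """Fit weight w and bias b by plain gradient descent on the log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


def predict_probability(x, w, b):
    """Predicted probability of the positive outcome for input x."""
    return sigmoid(w * x + b)


# Hypothetical data: hours studied and whether the exam was passed (1) or not (0).
hours = [1, 2, 3, 4, 5, 6]
passed = [0, 0, 0, 1, 1, 1]

w, b = fit_logistic(hours, passed)
print(predict_probability(1, w, b))
print(predict_probability(6, w, b))
```

The model gives a low probability of passing for one hour of study and a high one for six hours, which is the "probability of an outcome" the definition refers to.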

6.) In-Memory Data Computing: Real or near-real time information delivery is one of the defining characteristics of big data analytics. Latency is therefore avoided whenever and wherever possible. Data in memory is good; data on spinning disk at the other end of a FC SAN connection is not. The cost of a SAN at the scale needed for analytics applications is very much higher than that of other storage techniques. Source: Wikipedia.

This is not a comprehensive list. I would appreciate your feedback on what more can be added to make it more complete.


  1. The most prominent problem businesses face following the Sarbanes-Oxley Act is the issue of constantly shrinking storage space. The Sarbanes-Oxley Act requires that all financial documents be saved, and that includes email correspondence.

  2. Information is stored and analyzed on a large number of high-performance servers. Hadoop, which is open source, is the key technology.
    Since the amount of information will only increase over time, the difficulty is not getting the data but processing it with maximum benefit. In general, the process of working with Big Data includes: collecting information, structuring it, creating insights and contexts, and developing recommendations for action. Even before the first stage, it is important to clearly define the purpose of the work: what exactly the data is for, for example defining the target audience for a product. Otherwise, there is a risk of collecting a lot of information without understanding how specifically it can be used.

