Wednesday, January 23, 2013

Six Frequently used Terminologies in Big Data

With the deluge of data being added to systems every day, it is getting difficult to collect, curate and process information. A growing middle-class population, the widespread penetration of mobility and rising technology adoption are all contributing to an exponential rise in the quantum of data.
This is commonly termed the Big Data problem. In this blog, I have listed and defined six frequently used terminologies in big data.

1.) Big Data: Uri Friedman, in his article at FP, charted the timeline of Big Data's evolution. In 1997, NASA researchers Michael Cox and David Ellsworth used the term "big data" for the first time to describe a familiar challenge of the 1990s: supercomputers generating massive amounts of information -- in Cox and Ellsworth's case, simulations of airflow around aircraft -- that cannot be processed and visualized. "[D]ata sets are generally quite large, taxing the capacities of main memory, local disk, and even remote disk," they write. "We call this the problem of big data."

Wikipedia defines big data in information technology as a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.


2.) Hadoop: Apache Hadoop is an open-source software framework that supports data-intensive distributed applications, licensed under the Apache v2 license.

3.) MapReduce: MapReduce is a programming model for processing large data sets, and the name of an implementation of the model by Google. MapReduce is typically used to do distributed computing on clusters of computers.
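
To make the model concrete, here is a minimal single-machine sketch of the MapReduce pattern in Python, using word counting as the canonical example. It only illustrates the map, shuffle and reduce phases; it is not how Hadoop or Google's implementation is actually invoked, and the function names are my own.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in a document."""
    for word in document.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group intermediate values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: sum the counts for a single word."""
    return key, sum(values)

documents = ["big data is big", "data about data"]
intermediate = [pair for doc in documents for pair in map_phase(doc)]
results = dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())
print(results)  # {'big': 2, 'data': 3, 'is': 1, 'about': 1}
```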

4.) Cluster Analysis: Cluster analysis, or clustering, is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters).
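
As a toy illustration (made-up 2D points, plain Python, no libraries), here is a tiny k-means-style clustering routine; real-world work would use a dedicated library, but the grouping idea is the same.

```python
import random

def kmeans(points, k, iterations=20):
    """Toy k-means: assign each point to its nearest centroid,
    then move each centroid to the mean of its assigned points."""
    centroids = random.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for x, y in points:
            distances = [(x - cx) ** 2 + (y - cy) ** 2 for cx, cy in centroids]
            clusters[distances.index(min(distances))].append((x, y))
        centroids = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
            if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

points = [(1, 1), (1.5, 2), (1, 1.5), (8, 8), (9, 8.5), (8.5, 9)]
centroids, clusters = kmeans(points, k=2)
print(centroids)  # two centroids, roughly (1.2, 1.5) and (8.5, 8.5)
```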

5.) Predictive Modelling: Predictive modelling is the process by which a model is created or chosen to best predict the probability of an outcome.
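
A minimal sketch of the idea, assuming nothing more than a one-variable least-squares fit on made-up numbers; real predictive models are of course far richer.

```python
def fit_line(xs, ys):
    """Ordinary least squares for a single predictor: y ~ a + b*x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    a = mean_y - b * mean_x
    return a, b

# Hypothetical data: monthly ad spend vs. units sold.
spend = [10, 20, 30, 40, 50]
sales = [12, 25, 31, 44, 52]
a, b = fit_line(spend, sales)
print(f"predicted sales at spend=60: {a + b * 60:.1f}")
```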

6.) In Memory Data Computing: Real or near-real-time information delivery is one of the defining characteristics of big data analytics, so latency is avoided whenever and wherever possible. Data in memory is good; data on spinning disk at the other end of an FC SAN connection is not. In addition, the cost of a SAN at the scale needed for analytics applications is much higher than that of other storage techniques. Source: Wikipedia.

As stated earlier, this is not a comprehensive list. I would appreciate your feedback on what more can be added to make it more complete.



Thursday, January 17, 2013

Guest Blog: Google+ to knock out Facebook! See the Maths!!


In a recent episode of Crime Patrol, an Indian soap showcasing real-life criminal cases in India, the husband confesses to murdering his wife as the sleuths close in with evidence. When asked the motive for the murder, since the two were getting along well, he responds: 'When the test results came, it was obvious that I was the reason we were childless, and the doctor opined that I was impotent but curable.' He goes on to add: 'I was scared that my wife would tell her friends about it, who in turn would tell their friends, they theirs, and so it would spread to the whole complex in a week and to the whole town in a month. And I would not be able to take the social stigma of being branded impotent.'
The director of the soap could have added a few recent social media channels to the dialogue, e.g. 'my wife would have tweeted about my impotence', 'posted it on Facebook walls with the medical report uploaded', 'sent out bulk SMS about my impotence' and so on. Well, I left out 'broadcast it to our Circles on G+.' Google+ is a recent phenomenon in social networking. Will it be the nemesis of Facebook, as Facebook was of MySpace and Orkut?
Going back to the first paragraph, how would the news about the husband have spread across the population? Let's say the wife disclosed it to 2 of her friends; an hour later those 2 friends each told 2 of their friends; an hour after that, those 4 friends each whispered it to 2 more friends, and so on. Mathematically, the total number of people who are aware after, say, 10 hours is 2,047 including his wife, and is arrived at by:
1 + 2 + 2**2 + 2**3 + ... + 2**10 = 2**11 - 1 = 2,047
This is called a Geometric Progression, or GP for short. I'm using the ** notation to represent exponentiation, or as we commonly call it, 'to the power'. You don't have to break your head understanding GPs; just assume the result is true and proceed.
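
If you would rather let a computer do the arithmetic, here is a quick Python check of that sum (assuming, as above, that everyone keeps passing the word on every hour):

```python
# Word-of-mouth spread as a geometric progression: the wife (1 person),
# 2 new people in hour 1, 4 in hour 2, ..., 2**10 in hour 10.
hours = 10
aware = sum(2 ** k for k in range(hours + 1))
print(aware)  # 2047, i.e. 2**(hours + 1) - 1
```
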
But if the wife were to use a social network, or more broadly digital or electronic means, the news could have spread much faster, like a virus. Present-day marketers have taken a leaf out of viral epidemic outbreaks and coined the term 'viral marketing' to draw a parallel between the huge number of targets their campaigns successfully reach and the spread of a virus to a large number of people in an epidemic.
The spread of a virus in an epidemic and the spread of messages in the digital world depend on the same two things: the infection period (T) and the spread factor (S). Only once you are infected can you spread the virus; so it is in the digital world - only once you get hooked, say start using Facebook, can you infect others by way of recommendations, referrals, invitations or peer pressure. The infection period T for, say, a viral fever is 3 days, i.e. you are down 3 days after catching the virus, after which you become a carrier who can infect others around you. The spread factor S is the number of people around you whom you can possibly infect; say for this viral flu S is 7, i.e. you can infect 7 people.
There is a difference between how word of mouth spreads and how a viral or digital spread works. We assumed that, in the manual mode, the wife's friends spread the word at one-hour intervals, but no such thing happens in a viral outbreak. It spreads continuously, since viruses don't wear watches and have no activity other than spreading themselves. Mathematically, while a spread that happens in discrete intervals is represented by a geometric progression, continuous viral spread is described using Euler's number, denoted by 'e'.
In the case of a viral spread, the equivalent of the GP formula above becomes:
C = I * e**[(S - 1) * N]
(As before, * denotes multiplication and ** denotes exponentiation, 'to the power'.)
C: the number of carriers, or infected persons, after N infection periods
N: [elapsed time] / T, where T is the infection period
I: the initial number of infected persons
S and T are of course the parameters already defined above.
Now let’s turn to the real McCoy.
Facebook took over from MySpace. MySpace was launched in 2003 and by August 2006 it had 100 million active users. In 2007 it was valued at US$2 billion. Its current number of users stands at 53 million, and a few weeks ago it was sold off for just US$35 million. That's the unconventional economics of the digital world, or should we say the social networking world: with active users declining by half, the net worth declines by a factor of nearly 60!!
And this decline was undoubtedly brought about by the advent of Facebook. For MySpace the mathematical equation looks something like this:
53 = 100 * e**[(S-1)*60]
The value of N is 60 because the decline to the current level happened over 5 years, and the infection period is assumed to be 1 month. So N equals 60 months divided by 1 month, which is 60.
The moot point here is why T, the infection period, is assumed to be 1 month. Why not 2 weeks or 2 months? I have no rigorous justification for it. I polled 51 friends from all continents, belonging to all the major races and various countries of the world, and the conclusion that emerged is that 1 month is the norm to get infected by a social networking site!
The spread factor S from the above equation comes to 0.9894, which is less than 1, as expected, since the number of active users is declining.
For Facebook, which started in 2004, the current number of active users, or should we say carriers, is 600 million. If we take T as 1 month, the equation for Facebook looks like:
600,000,000 = 1 * e**[(S-1)*84]
N in the above equation is 84, as seven years means 84 months and T is 1 month. So the formula for N, with an elapsed time of 84 months and T of 1 month, gives N as 84.
Computing the value of the spread factor S for Facebook we get S = 1.24.
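
For readers who want to verify the arithmetic, here is a short Python sketch that inverts the formula C = I * e**[(S-1)*N] to recover S, using the same assumed numbers (100 million down to 53 million over 60 monthly periods for MySpace; 1 to 600 million over 84 for Facebook):

```python
import math

def spread_factor(initial, current, months, infection_period=1):
    """Invert C = I * e**((S - 1) * N) to recover the spread factor S."""
    N = months / infection_period
    return 1 + math.log(current / initial) / N

# MySpace: 100M active users down to 53M over roughly 60 months.
print(round(spread_factor(100, 53, 60), 4))          # 0.9894
# Facebook: from 1 user in 2004 to 600M over 84 months.
print(round(spread_factor(1, 600_000_000, 84), 2))   # 1.24
```
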
The spread factor is, as expected, more than 1, since the Facebook virus has spread widely, from just 1 user (Mark Zuckerberg himself in 2004) to 600 million over a span of 7 years. And can you draw the obvious conclusion? You, as a Facebook aficionado, infected or brought in on average 1.24 users every month to Facebook, which made Zuckerberg billions in wealth!!
OK, now more real McCoy!
Google+ will inflict on Facebook the same trend that Facebook inflicted on MySpace. Facebook will inherit the negative spread factor of MySpace, the damage it inflicted on MySpace, while Google+ will inherit the positive spread factor of Facebook.
So when will Google+ overtake Facebook?
We can derive this by using the following equation:
600,000,000*e**[(0.9894-1)*N] = 1 * e**[(1.24-1)*N]
This gives us N as 80.656. We can round that to 81 months and conclude that in 81 months Google+ will overtake Facebook.
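
Taking logarithms on both sides reduces this to a one-line calculation; here is a quick Python check with the two spread factors derived above:

```python
import math

# Solve 600e6 * e**((0.9894 - 1) * N) = 1 * e**((1.24 - 1) * N) for N
# by taking the natural logarithm of both sides.
N = math.log(600_000_000) / ((1.24 - 1) - (0.9894 - 1))
print(round(N, 3))  # ~80.656 months
```
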
And there is still more real McCoy!
What is the value of the following quantity?
600,000,000 * e**[(0.9894 - 1) * 1]
Here I have used N as 1, i.e. just one monthly infection period for a social networking site, as assumed earlier, and have taken S as 0.9894, the same sub-1 spread factor that MySpace suffered once Facebook came into being.
The answer is approximately 593.7 million; the difference between this and 600 million is around 6.3 million. Facebook currently has 600 million users.
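
Again, a quick check (same assumed spread factor of 0.9894 applied for a single one-month period):

```python
import math

# Apply MySpace's declining spread factor to Facebook's 600M users
# for a single one-month infection period.
remaining = 600_000_000 * math.exp((0.9894 - 1) * 1)
print(round((600_000_000 - remaining) / 1e6, 1))  # ~6.3 million lost
```
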
What’s the big deal? Actually the deal is much bigger than just being big.
As reported by various sources, Facebook lost 7 million active users in the month of May. Hope you get the message now?
The decline in Facebook's user numbers is in line with the mathematics. In 81 months Facebook will not be as popular as it is now, though it may not be gone from the scene... it will be on the chopping block for, say, US$20 million, compared to its current market capitalization of US$12 billion.
I have always maintained that mathematics is a conclusive science, and it will deliver the knockout punch......

About the Author:

Chittaranjan Jena currently heads Mahindra Satyam's South Africa division. He has published two books and is an avid blogger. You can find more information at http://crjena.wordpress.com


Sunday, January 13, 2013

Four reasons that make HANA a compelling business necessity



SAP Business Suite on SAP HANA was successfully released this week - a testimony that in the days to come it will take center stage in enterprise-wide software development. Here are my four reasons why in-memory computing databases like HANA are a compelling business necessity.

More Speed:
Nimble organizations analyze information faster to stay ahead of their competitors. For example, smart organizations gauge the success of their product releases by analyzing tweets and messages about their products in real time, and even make course corrections accordingly. To analyze data sets so large and complex that they have been named Big Data for their sheer volume, fast system performance has become a basic hygiene factor for any system.

Disk scan speed has improved only about tenfold over the last 30-odd years. Compared with the rate of data growth, that is inadequate and only adds to scan time. The redundancy involved in scanning data on disk and moving it to the CPU for processing needs to be removed. In-memory data management eliminates two time-consuming steps - scanning the data from disk and moving it into main memory (DRAM). Because the CPUs have direct access to DRAM, processing throughput improves dramatically, delivering the desired efficiency.
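
As a rough, hedged illustration of the gap (a generic Python sketch, not a HANA benchmark; operating-system file caching will narrow the difference on a warm run), the following scans the same 10 million integers once from a file on disk and once from memory:

```python
import os
import time
import array

# Build a "column" of 10 million integers and persist it to disk.
values = array.array("q", range(10_000_000))
with open("column.bin", "wb") as f:
    values.tofile(f)

# Scan 1: read the column back from disk, then aggregate it.
t0 = time.perf_counter()
with open("column.bin", "rb") as f:
    on_disk = array.array("q")
    on_disk.fromfile(f, len(values))
    disk_total = sum(on_disk)
disk_time = time.perf_counter() - t0

# Scan 2: aggregate the column that is already resident in memory.
t0 = time.perf_counter()
memory_total = sum(values)
memory_time = time.perf_counter() - t0

assert disk_total == memory_total
print(f"scan from disk:   {disk_time:.3f}s")
print(f"scan from memory: {memory_time:.3f}s")
os.remove("column.bin")
```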

More Simplicity:
"With the collapse of OLTP and OLAP into one platform, a massive simplification is on the way. For long, CIOs have tried to fundamentally shift their investment away from 'keeping the lights on' towards 'innovation' that gets their business stakeholders excited. With the elimination of batch processing and the ability to do on-the-fly aggregates and extensions, we have an agile system that will reduce operational costs while powering smarter innovations. Simplification in the IT layers will also eliminate layers in the business, where we can remove overhead functions that have evolved to collect and expedite clogged information flow across the company."

Big Data ready:
Data is growing exponentially. With more automation, more structured data is being captured in our organizations. And with the widespread adoption of social media, views, expressions and opinions are being captured in Facebook status messages, Twitter tweets, Instagram pictures and the like, which are categorized as unstructured data. The explosion can be summed up by the fact that in the last decade alone humanity doubled the amount of data it had produced over the entire previous century.

In-memory data computing leverages advanced encoding methodologies that can reduce data size by a factor of up to 20, benefiting the organization through lower infrastructure requirements.
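
Dictionary encoding is a common example of such column compression; the hedged sketch below (made-up country column, plain Python) shows how repeated values collapse into a small dictionary plus compact integer codes:

```python
import array

# A column with heavy repetition: 6 million country values, only 3 distinct.
column = ["Germany", "India", "India", "USA", "Germany", "India"] * 1_000_000

# Dictionary encoding: store each distinct value once, plus one small code per row.
dictionary = sorted(set(column))                    # ['Germany', 'India', 'USA']
codes = {value: i for i, value in enumerate(dictionary)}
encoded = array.array("B", (codes[value] for value in column))  # 1 byte per row

raw_bytes = sum(len(value) for value in column)     # string payload alone
print(raw_bytes, len(encoded))                      # ~32 MB of text vs ~6 MB of codes
```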

SAP HANA is big data ready as shown in the Early Findings of Scale Out Performance Test Results.

Cloud ready:
In-memory databases such as HANA leverage the strengths of columnar databases, so that (as the sketch below illustrates):
  • New attributes can be added easily, vis-à-vis a row-based database architecture
  • Locking for changes to the data layout is required only for a short period, in contrast to a row-based architecture where the entire database or table would be completely locked to process data definition operations
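
The sketch below (plain Python, illustrative data structures only, not HANA's internals) contrasts the two layouts and shows why adding an attribute is cheap in a column store:

```python
# Row store: one tuple per record.
row_store = [
    (1, "ACME", 1200.0),
    (2, "Globex", 3400.0),
]

# Column store: one list per attribute.
column_store = {
    "id":      [1, 2],
    "name":    ["ACME", "Globex"],
    "revenue": [1200.0, 3400.0],
}

# New attribute in the column store: a single new list, existing data untouched.
column_store["region"] = [None, None]

# New attribute in the row store: every existing record must be rewritten.
row_store = [record + (None,) for record in row_store]
```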

Apart from these four compelling reasons, a faster system results in improved organizational efficiency. With faster technology, organizations can rewrite and redesign their business processes, enabling them to "run better". Don't you think so? I would appreciate your feedback on this blog piece.


Friday, January 11, 2013

Data Opportunities for the next century




As resources become scarce and demands increase, the future depends on their optimal utilization. Data becomes the key ingredient of any effective decision making. The opportunities are immense, and it is very much foreseeable that enterprises will invest in harnessing data capture and analysis to make themselves successful.


Data is what allows for correct analysis -

Data: The data should be relevant for accurate analysis. With the advent of social networking sites, we have opened the floodgates of information, which until the early part of this century was largely confined to enterprise premises. There is a deluge of information all around. The issues with this type of information are twofold:


  •     Unstructured - unstructured data is the data from news feeds, tweets, Facebook comments, images and so on. The acquisition of Radian6 by Salesforce.com and of Yammer by Microsoft, and SAP Jam, are key indicators of the industry's attempt to utilize this data, understand it and put it to business use.
  •     Large volume - the amount of data generated within enterprises is growing at an exponential rate. This data will be precious for decision making, yet systems designed for existing workloads will soon hit a roadblock in processing it (both writing it to the database and analyzing it). With more volume comes more complexity in analysis. From the beginning of human history up to 2003, some 12 exabytes of data were added to the world's systems; that much is now added in roughly 12 hours. Leading organizations are investing in upcoming technologies to unlock value from this immense amount of data. The concept of Big Data is in place, and companies are coming out with innovative software to make the data easy to process and quick to analyze. SAP's HANA, Oracle's Exalytics and IBM's Hadoop-based offerings are some relevant examples.

The opportunities for data to benefit mankind are immense. There are numerous places where we see a synergy - here are two areas out of the innumerable possibilities:


  •      Imagine a social distribution system based on analytics. We can identify the core needs of particular areas and channel resources accordingly. We have seen a subscription-based model work in software - why not in governance? We have 1.2 billion people, each falling into specific categories. With the data available and their spending patterns analyzed, we can subsidize items and offer variable benefits based on usage patterns, quantity and demographics. The outcomes of such analysis can open up a more equitable economy where benefits are appropriately passed on to citizens. Several progressive economies such as ours are certainly thinking about it, and Aadhaar is an initial step in the right direction.
  •      Healthcare: Research in genomics generates huge amounts of data. Analyzing it can lead to cures for diseases, help check epidemics and enable cost-effective medication. It can also be applied to check the proliferation of spurious medicines - the possibilities the data provides are numerous.

Man reached the moon with computers that had less processing power than my laptop. Now, with high-speed processing and evolving technologies, we are in a far better position to process this data and make the world "RUN better".



Monday, January 7, 2013

Two Key Takeaways from SAP's Benchmarking Offering



I discussed the capabilities of this free SAP offering with Shailendra Sahay, Program Director of the Value Engineering (VE) Best Practices Center and Global Benchmarking. His organization runs the SAP Benchmarking Program globally, helping SAP customers assess their business process maturity and their gap to industry best practices. He has given a similar interview to Silicon India; here are a few salient points from our discussion.


1.) Business Performance Benchmarking is the first step to understand the health of an organization's business processes.

As per Shailendra, "As a trusted vendor to the customer, our objective has always been to solve our customers' real business issues. It is always important for us to first understand the customer's pain points before offering an IT solution, and our benchmarking program is the perfect tool set to make this happen. Companies can choose to participate in one or more of our 20-plus business process assessments. At the end of the benchmarking exercise, they get a customized report which helps them understand where they are doing well and the areas they need to improve. This also helps us understand the real pain points in our customers' business, and articulate an IT solution that will help them run better as an organization."

The benchmarking program has grown exponentially over the years and has transformed the way our customers approach benchmarking. The program is available through the self-service VMC (Value Management Center) platform, which has a database of more than 10,000 benchmarking survey submissions from over 4,000 companies across 24 industries. It provides more than 20 business process assessments and 12 industry-specific business process assessments (e.g. demand planning for retail, asset management for utilities).

Organizations can use the VMC to benchmark against one of the 300-plus available industry/process combinations (e.g. supply chain planning in the retail industry). Customers and select prospects can register on the platform, choose the survey they want to take, and see real-time results after successfully completing it. The service is available absolutely free of cost, and a team of Value Engineers can help customers interpret the results.

2.) Evolution of Benchmarking Practice over the years

The benchmarking program was first launched in North America and then slowly expanded to cover the rest of the world. With experience, the quality of the program has improved, which in turn has resulted in robust benchmarking coverage across geographies and industries. Shailendra cites an example where they realized that many companies in emerging and less mature financial markets were not tracking all the KPIs asked about in the benchmarking surveys.

As per Shailendra, "It was a learning experience for both parties: these companies realized what they were missing by not tracking all the KPIs that are critical to a successful business. We also learned how to make our surveys easier, so as not to appear too overwhelming to companies that are still ramping up on the best-practices curve. Often, we learn a lot of new trends from our customers. Lately, our customers have started to track a lot of new and innovative KPIs in order to keep pace with the changing business landscape. We keep updating our surveys to maintain their relevance to the current business reality."

We ended our chat discussing the future roadmap and vision for the offering. As per Shailendra, the vision is to "go on and enable value management to be more effective and to reach a stage where it becomes a part of customers' DNA. We want to make it a self-sustaining discipline which runs on its own because customers see value in it."

The VMC functionality has been augmented with business case creation, and more services are in the pipeline. I would strongly recommend that we pass the good word around our community in our quest to make the world "run better". Please drop in your feedback and comments.


Two ways In Memory Data Management can benefit an Organization



In my earlier blog, I mentioned how In Memory Data Management is relevant to an organization. In Memory Data Management is fast and can compute large amounts of data - but what business benefits does an organization derive from using it? With faster data, the business opportunities are limitless, and we need to think creatively about how to package solutions using this groundbreaking technology to bring tangible benefits to our customers. In this blog, I delve into how In Memory Data Management can be used to solve business problems and remove system inefficiencies.


Solving Business Problems

In India, during the summer, most people savor lassi, a hot-weather refreshment made of churned yogurt. To cater to the growing demand for lassi, vendors have a unique method of churning yogurt at mass scale - they use a washing machine, an invention made for cleaning clothes by rotating them in detergent water. Why not use the rotors to churn lassi? It has even been captured in an advertisement. So is it a washing machine or a high-volume liquid churner?



Similarly, a dishwasher can double up as an oven if we make creative use of it.

Definitely, I am NOT proposing that we use an In Memory Database to produce lassi or cook hot dogs, but we can draw useful analogies from the above examples to create business cases for In Memory Databases.

In Memory Databases process huge amounts of data faster. But beyond "faster processing and quicker results", In Memory Data has a greater impact as a solution to business problems.

For example, in this video a customer talks about how they derived business value from the faster processing power of the In Memory Data Computing technology SAP HANA.

By reducing a transactional process from 5 hours to 5 seconds, In Memory Data frees up the capacity of the employee as well as the computer terminal for more value-added activities, which in turn can bring more benefits to the organization. Many such use cases for SAP HANA can be found here. You may not find an exact match, but it's extremely likely that you'll find one or more themes that closely resemble some of your business issues and/or conditions.

Removing System Inefficiency

Most people use Google to check whether their internet connection is working. Google, the name, has etched its brand as the fastest portal. It might or might not give us the most relevant result, but it consistently gives us results fast. Speed has contributed to Google's meteoric rise over the years, and it continues to do so.

As machines are connected by ever faster infrastructure, the software of the future will be the software that renders results to users faster. Longer waiting times make users lose their attention span and end up lowering productivity.

In Memory Data Management calculates and renders results for each user instruction on the fly, avoiding the creation of redundant result sets, a common phenomenon in traditional RDBMS setups. It thereby ensures there is no speed-related system inefficiency, and in turn contributes to productivity.
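
As a hedged illustration of what "redundant result sets" means (a generic reporting pattern, not HANA's actual engine), compare maintaining a pre-computed totals table with computing the same aggregate on the fly, which an in-memory column store makes cheap enough to do per request:

```python
sales = [
    {"region": "EMEA", "amount": 120.0},
    {"region": "APAC", "amount": 80.0},
    {"region": "EMEA", "amount": 40.0},
]

# Pattern 1: maintain a redundant, pre-computed result set that must be kept in sync.
materialized_totals = {}
for row in sales:
    region = row["region"]
    materialized_totals[region] = materialized_totals.get(region, 0) + row["amount"]

# Pattern 2: aggregate on the fly for each user request - no stored aggregate to maintain.
def total_for(region):
    return sum(row["amount"] for row in sales if row["region"] == region)

print(materialized_totals["EMEA"], total_for("EMEA"))  # both 160.0
```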

These are two of the many broad categories where In Memory Data Management can be utilized to bring business benefits to organizations and make the world "run better and faster". I strongly believe in this. What do you think?


Why I think In Memory Data Management was made for the cloud


http://upload.wikimedia.org/wikipedia/commons/b/b5/Cloud_computing.svg

Recently, I came across a nice blog on why Amazon thinks big data was made for the cloud. It talks about how big data and cloud computing will work hand in hand to create a central platform for communities to share huge data sets. In Memory Data Management platforms such as HANA enable this symbiotic relationship between the cloud and big data by facilitating on-the-fly reorganization of data.

Cloud applications are typically multi-tenant, and their customer data is organized in one of three ways:
  1. Separate Database for customers
  2. Shared Database, Separate Schemas for customers
  3. Shared Database, Shared Schema for customers

Of the three approaches, the shared-schema approach has the lowest costs, because it serves the largest number of tenants per database server; administrative and hardware/software costs are also drastically reduced. But it comes with one complexity: as customer isolation is minimal, stringent data management is required to ensure that tenants can never access other tenants' data, even in the event of unexpected bugs or attacks. Dynamic reorganization of data is one of the prime requirements.
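
A minimal sketch of the shared-schema idea, using SQLite purely for illustration (the table and column names are my own): every row carries a tenant_id, and the data access layer scopes every query to the calling tenant.

```python
import sqlite3

# Shared database, shared schema: all tenants' rows live in the same table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (tenant_id TEXT, order_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [("acme", 1, 120.0), ("acme", 2, 80.0), ("globex", 1, 999.0)],
)

def orders_for(tenant_id):
    """Every query is filtered by tenant_id, so one tenant never sees another's rows."""
    rows = conn.execute(
        "SELECT order_id, amount FROM orders WHERE tenant_id = ?", (tenant_id,)
    )
    return rows.fetchall()

print(orders_for("acme"))    # [(1, 120.0), (2, 80.0)]
print(orders_for("globex"))  # [(1, 999.0)]
```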

In Memory Databases such as HANA leverage the strengths of columnar databases, so that
  • New attributes can be added easily, vis-à-vis a row-based database architecture
  • Locking for changes to the data layout is required only for a short period, in contrast to a row-based architecture where the entire database or table would be completely locked to process data definition operations

I feel that In Memory Data Management has great applicability in the cloud. Do give your comments and feedback on this piece.
