With the growth of new technologies, devices and communication media, the amount of data being produced is astronomical, and OLTP-based software systems simply lack the technical wherewithal to handle it. These systems have neither the storage capacity nor the processing capability to make sense of the huge volumes of consumer data lying untapped across the internet. It is not just the untapped data but the sheer velocity at which it is produced that has given CEOs around the world sleepless nights. Firms understand that within those data sets lie answers that can help them know and understand their clients better, comprehend market trends, and forecast the future of the consumer market, all of which have the potential to help a firm beat its competition.
Hadoop is a Java-based framework for distributed storage and processing; its storage layer, HDFS (the Hadoop Distributed File System), provides high-performance access to data across all the nodes in a cluster. Hadoop addresses the shortcomings of OLTP-based systems, which is why it is seen as the next big thing in data management. It provides reliable and scalable data storage that can, for example, help a firm store and analyse consumer data such as Facebook posts and Twitter feeds about a popular product being discussed on social media. HDFS has demonstrated production scalability of up to 200 petabytes.
The main processing feature Hadoop offers is MapReduce, a Java-based programming paradigm that allows parallel processing and massive scalability across thousands of servers in a Hadoop cluster. Clusters of around 4,500 servers can support a billion files or blocks. A MapReduce job contains two tasks, Map and Reduce. Map takes a data set, structured or unstructured, and converts it into key-value pairs (tuples). The output of the Map stage acts as input to the Reduce stage, which combines those tuples into a smaller set of tuples.
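The two stages can be sketched in plain Python. This is a toy single-machine word count, not Hadoop itself; the function names are illustrative, but the shape (Map emits key-value tuples, the framework groups them by key, Reduce combines each group into a smaller set of tuples) mirrors the paradigm described above.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(document):
    # Map: emit a (key, value) tuple for every word in the input split.
    for word in document.split():
        yield (word.lower(), 1)

def reduce_phase(pairs):
    # Shuffle: group tuples by key, as the framework does between stages.
    pairs = sorted(pairs, key=itemgetter(0))
    # Reduce: combine each key's values into one summary tuple per key.
    for word, group in groupby(pairs, key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

docs = ["the product launch", "the product review"]
mapped = [pair for doc in docs for pair in map_phase(doc)]
print(dict(reduce_phase(mapped)))
# {'launch': 1, 'product': 2, 'review': 1, 'the': 2}
```

In a real Hadoop job the mappers and reducers would run on different machines, with HDFS holding the input splits and the framework handling the shuffle between them.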
The main advantage of MapReduce is the scalability of data processing: the same program extrapolates to thousands of computing nodes. Once a MapReduce program is written, scaling it to run over thousands of machines in a cluster is a mere configuration change. MapReduce has gained overwhelming popularity because its parallel computing model lets analysts and programmers run models over large distributed data sets, and apply statistical and machine-learning techniques to make predictions, deduce patterns and dig out correlations.
Why is MapReduce a technological breakthrough? It lets us process unstructured data on a very large scale, across data sets that even the most sophisticated and well-designed single-node database system cannot accommodate, and it allows the tools of data science to run over distributed data sets.
MapReduce helps businesses and other organisations run various calculations to:
- Find the best price for their product that yields the best profit margin.
- Measure the effectiveness of a marketing campaign and narrow down the areas where focus and effort are needed.
- Perform weather forecasting.
- Mine the web clicks of consumers and analyse consumer sentiment on social platforms to determine what new products a firm should produce.
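The first item on the list, for instance, maps naturally onto the paradigm: each mapper emits a (product, price) pair from a sales record, and the reducer aggregates per product. A toy single-machine sketch, with the record format and all names invented for illustration:

```python
from collections import defaultdict

sales = [
    ("widget", 10.0), ("gadget", 24.0),
    ("widget", 12.0), ("gadget", 26.0),
]

def map_sale(record):
    # Map: each sales record becomes a (product, price) tuple.
    product, price = record
    return (product, price)

def reduce_average(pairs):
    # Reduce: average the prices observed for each product key.
    grouped = defaultdict(list)
    for product, price in pairs:
        grouped[product].append(price)
    return {product: sum(p) / len(p) for product, p in grouped.items()}

print(reduce_average(map(map_sale, sales)))
# {'widget': 11.0, 'gadget': 25.0}
```

A real pricing analysis would join in costs and demand data, but the split into a per-record Map and a per-key Reduce is the same.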
Are you ready to join the movement?