Tuesday, June 2, 2015

Introduction to Big Data


What is Big Data?
In essence, big data is about liberating data that is large in volume, broad in variety and high in velocity from multiple sources in order to create efficiencies, develop new products and be more competitive. Big data encompasses "techniques and technologies that make capturing value from data at an extreme scale economical". It is often described by five Vs: volume, variety, velocity, verification and value. I believe the first three (volume, variety and velocity) are attributes and character of the data, while the last two (verification and value) are part of the process and its outcome.


Structured vs. Unstructured Data:

For the most part, structured data refers to information with a high degree of organization, such that inclusion in a relational database is seamless and it is readily searchable by simple, straightforward search-engine algorithms or other search operations; unstructured data is essentially the opposite. The lack of structure makes compilation a time- and energy-consuming task. It would benefit a company across all business strata to find a mechanism of data analysis that reduces the costs unstructured data adds to the organization.



Existing Big Data technologies:

RDBMS:

Before big data, traditional analysis involved crunching data in a traditional database. This was based on the relational database model, where data and the relationships between the data were stored in tables. The data was processed and stored in rows.

Databases have progressed over the years, however, and now use massively parallel processing (MPP) to break data up into smaller lots and process it on multiple machines simultaneously, enabling faster processing. Instead of storing the data in rows, the databases can also employ columnar architectures, which enable the processing of only the columns that hold the data needed to answer the query, and enable the storage of unstructured data.

MapReduce:

MapReduce is the combination of two functions to better process data. First, the map function separates data over multiple nodes, which then process it in parallel. The reduce function then combines the results of the calculations into a set of responses. Google used MapReduce to index the web, and has been granted a patent for its MapReduce framework. However, the MapReduce method has now become commonly used, with the most famous implementation being in an open-source project called Hadoop (see below).
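The map-then-reduce flow can be sketched in a few lines of Python. This is only a toy in-memory word count to illustrate the two functions, not a real distributed job; the chunk strings stand in for data held on separate nodes:

```python
from collections import defaultdict

# Map: each "node" turns its chunk of text into (word, 1) pairs.
def map_chunk(chunk):
    return [(word, 1) for word in chunk.split()]

# Reduce: combine all counts emitted for the same word into one answer.
def reduce_counts(pairs):
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

chunks = ["big data big", "data velocity data"]  # pretend each string sits on its own node
mapped = [pair for chunk in chunks for pair in map_chunk(chunk)]
result = reduce_counts(mapped)
print(result)  # {'big': 2, 'data': 3, 'velocity': 1}
```

In a real framework the map calls would run on different machines and the pairs would be shuffled over the network before reduction, but the logical shape of the computation is the same.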
Massively parallel processing (MPP):
Like MapReduce, MPP processes data by distributing it across a number of nodes, each of which processes an allocation of data in parallel. The output is then assembled to create a result. However, MPP products are queried with SQL, while MapReduce is natively controlled via Java code. MPP is also generally used on expensive specialised hardware (sometimes referred to as big-data appliances), while MapReduce is deployed on commodity hardware.
Complex event processing (CEP):
Complex event processing involves processing time-based information in real time from various sources; for example, location data from mobile phones or information from sensors to predict, highlight or define events of interest. For example, information from sensors might lead to predicting equipment failures, even if the information from the sensors seems completely unrelated. Conducting complex event processing on large amounts of data can be enabled using MapReduce, by splitting the data into portions that aren't related to one another. For example, the sensor data for each piece of equipment could be sent to a different node for processing.
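The "split into unrelated portions" idea from the equipment example can be sketched in Python. The sensor names, readings and alert threshold below are invented purely for illustration:

```python
from collections import defaultdict

# Time-ordered (machine, temperature) readings arriving from sensors.
readings = [
    ("pump-1", 71.0), ("pump-2", 64.5), ("pump-1", 88.2), ("pump-2", 90.1),
]

# Split the data into portions that aren't related to one another:
# group readings per machine, as if each group were sent to its own node.
per_machine = defaultdict(list)
for machine, temp in readings:
    per_machine[machine].append(temp)

# Each "node" independently flags machines whose latest reading crosses
# a threshold (a stand-in for a real failure-prediction model).
ALERT_TEMP = 85.0  # hypothetical threshold
alerts = [m for m, temps in per_machine.items() if temps[-1] > ALERT_TEMP]
print(alerts)  # ['pump-1', 'pump-2']
```

Because the per-machine groups share no data, they can be processed on different nodes in parallel, which is exactly what makes MapReduce a fit for this kind of workload.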
Hadoop
Derived from MapReduce technology, Hadoop is an open-source framework for processing large amounts of data over multiple nodes in parallel, running on inexpensive hardware.

Data is split into sections and loaded into a file store, for example the Hadoop Distributed File System (HDFS), which is made up of multiple redundant nodes on cheap storage. A name node keeps track of which data is on which nodes. The data is replicated over more than one node, so that even if a node fails, there is still a copy of the data.

The data can then be analysed using MapReduce, which discovers from the name node where the data needed for the calculations resides. Processing is then done at each node in parallel, and the results are aggregated to determine the answer to the query and loaded onto a node, where they can be analysed further using other tools. Alternatively, the data can be loaded into traditional data warehouses for use with transactional processing. The Apache distribution is considered the most noteworthy Hadoop distribution.
NoSQL
NoSQL database-management systems are unlike relational database-management systems in that they do not use SQL as their query language. The idea behind these systems is that they are better for handling data that doesn't fit easily into tables. They dispense with the overhead of indexing, schema and ACID transactional properties to create large, replicated data stores for running analytics on inexpensive hardware, which is useful for dealing with unstructured data.
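A quick sketch of the schema-less idea, using plain Python dictionaries as a stand-in for a document store (the record fields are invented for illustration):

```python
import json

# Schema-less "documents": each record can carry different fields,
# which would not fit cleanly into a single relational table.
store = {}
store["u1"] = {"name": "Alice", "email": "alice@example.com"}
store["u2"] = {"name": "Bob", "sensors": [3, 7], "last_login": "2015-06-01"}

# Lookup is by key, with no SQL query, no fixed schema, and no index to maintain.
print(json.dumps(store["u2"]["sensors"]))  # [3, 7]
```

Real NoSQL systems add replication and distribution on top of this key-to-document model, but the core trade-off is visible even here: flexible records in exchange for giving up relational querying.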
Cassandra
Cassandra is a NoSQL database alternative to Hadoop's HDFS.
Hive
Databases like Hadoop's file store make ad hoc query and analysis difficult, as writing the required map/reduce functions can be hard. Realising this when working with Hadoop, Facebook created Hive, which converts SQL queries to map/reduce jobs to be executed using Hadoop.

 
Brief introduction to Hadoop:

Apache Hadoop is a community-driven open-source project governed by the Apache Software Foundation.
It was originally implemented at Yahoo based on papers published by Google in 2003 and 2004. Hadoop committers today work at several different organizations, such as Hortonworks, Microsoft, Facebook, Cloudera and many others around the world.


What is Hadoop: 

The basic Hadoop platform consists of two primary components:
  1. Storage: Hadoop Distributed File System (HDFS)
  2. Transformation: MapReduce Engine
The Hadoop Distributed File System, or HDFS, is the storage system used by the Hadoop platform. Data stored in this file system is split into blocks that are distributed across multiple DataNodes (which hold the data itself), while a NameNode keeps the metadata describing where each block lives. At this point, however, the data is not query-able. Hadoop must now start to group and make sense of the data, which is done through the MapReduce Engine.
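The NameNode/DataNode split described above can be modelled in a few lines of Python. This is a toy model only; the block size, replication factor and node names are invented, and real HDFS blocks are tens of megabytes:

```python
# Toy model of the HDFS layout: the name node holds only metadata
# (which blocks live on which data nodes); data nodes hold the blocks.
BLOCK_SIZE = 8    # bytes, for illustration (real HDFS uses 64-128 MB blocks)
REPLICATION = 2   # each block is copied to two data nodes

data_nodes = {"dn1": {}, "dn2": {}, "dn3": {}}
name_node = {}    # filename -> list of (block id, data nodes holding a replica)

def put(filename, content):
    blocks = [content[i:i + BLOCK_SIZE] for i in range(0, len(content), BLOCK_SIZE)]
    name_node[filename] = []
    nodes = list(data_nodes)
    for n, block in enumerate(blocks):
        block_id = f"{filename}#{n}"
        # Spread replicas round-robin so no single node holds every copy.
        replicas = [nodes[(n + r) % len(nodes)] for r in range(REPLICATION)]
        for dn in replicas:
            data_nodes[dn][block_id] = block
        name_node[filename].append((block_id, replicas))

put("log.txt", "hello big data world")
print(name_node["log.txt"])
```

Because every block lives on two nodes, losing any single data node still leaves a full copy of the file reachable through the name node's metadata, which is the fault-tolerance property the prose above describes.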

MapReduce is a transformation mechanism that sits on top of HDFS. It splits data apart for parallel processing and recombines it into more understandable output. The MapReduce model was published by Google in 2004, and MapReduce programs can be written in a number of languages (such as Java, Python, and C++). Often, the outputs of MapReduce programs are understandable and query-able sets of data, which can then be queried using a number of SQL-like tools.
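Since MapReduce jobs can indeed be written in Python, here is roughly what a streaming-style mapper/reducer pair looks like. This sketch runs everything in one process; in a real Hadoop Streaming job the mapper and reducer would each read lines from stdin on separate nodes, with Hadoop doing the sort between them:

```python
import itertools

# Mapper: emit a tab-separated "word<TAB>1" record for each word in the input.
def mapper(lines):
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

# Between map and reduce, Hadoop sorts the records by key; the reducer then
# sees all records for one word together and can sum their counts.
def reducer(sorted_lines):
    keyed = (line.split("\t") for line in sorted_lines)
    for word, group in itertools.groupby(keyed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

lines = ["big data", "data tools"]
shuffled = sorted(mapper(lines))   # stands in for Hadoop's sort/shuffle phase
output = list(reducer(shuffled))
print(output)  # ['big\t1', 'data\t2', 'tools\t1']
```

The tab-separated key/value line format is the convention Hadoop Streaming uses to pass records between the framework and scripts written in languages other than Java.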


When not to use Hadoop:

In a recent survey of data scientists on the obstacles to big data analytics, vendor Paradigm4 reports that more than three-quarters (76%) of the scientists who said they have used Hadoop or Spark (the computational framework built on top of the Hadoop distributed file system) cite “significant limitations” to their use.
We can summarize the survey and conclude when not to use Hadoop in these points:

1- It takes too much effort to program.
2- It is too slow for interactive, ad hoc queries.
3- It is too slow for real-time or complex analytics.
4- Another concern, not mentioned in the survey, is cost: Hadoop is open-source freeware, but enterprises sometimes contract with a Hadoop services vendor or hire qualified Hadoop programmers and analysts to work in-house, and then launch misguided Hadoop projects that cause them to fall behind competitors.
5- It is not suitable for large sets of small files.

Note : 
After investigation, I found these projects that use Hadoop as a real-time analysis tool:


  • Impala from Cloudera uses HDFS but bypasses MapReduce altogether because there's too much overhead otherwise.
  • Apache Drill is another project that integrates with Hadoop to provide real-time query capabilities.
  • The Stinger project aims to make Hive itself more real-time.


Moving data from an RDBMS to Hadoop: is it possible?
Yes: there are multiple services and platforms for this task, such as Apache Sqoop (bulk transfer between relational databases and Hadoop) and Apache Flume (streaming data ingestion).



References:

  1. http://www.brightplanet.com/2012/06/structured-vs-unstructured-data/
  2. http://www.citeworld.com/article/2462886/big-data-analytics/when-to-use-hadoop-and-when-not-to.html
  3. http://stackoverflow.com/questions/22469934/data-moving-from-rdbms-to-hadoop-using-sqoop-and-flume
  4. https://flume.apache.org/
  5. http://sqoop.apache.org/
  6. http://stackoverflow.com/questions/19627795/why-hadoop-is-not-a-real-time-platform
  7. http://www.zdnet.com/article/big-data-all-you-need-to-know/
  8. http://www.ibmbigdatahub.com/blog/6-steps-start-your-big-data-journey
  9. http://blog.performancearchitects.com/wp/2014/01/29/what-is-hadoop-and-how-does-it-compare-to-relational-databases/
