Wednesday, June 3, 2015

Hadoop Installation .. Standalone Mode

Hadoop is meant to be installed on a Linux-based system .. installing it on Windows is not practical !!

Here I will write down the installation steps I followed on Ubuntu 14.04 LTS.

You need to install Java before installing Hadoop .. you can follow this guide to install Java: Wiki How.

I prefer to create a separate user for Hadoop.

Open the terminal and follow the steps below to install Hadoop on your Linux OS.


  1. Creating the hadoop user:
    a- Create the user with the following commands (the -m flag creates the user's home directory, which the SSH step below needs):
    > sudo useradd -m hadoop
    > sudo passwd hadoop
    Type the password and confirm it.
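    b- Switch to the new user before the next steps (a suggested extra step, not in the original tutorial; the SSH key generated below must belong to the hadoop user):
    > su - hadoop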
  2. SSH key generation (make sure sshd is installed; you can run > which ssh and > which sshd, and you should get /usr/bin/ssh and /usr/sbin/sshd if they are installed properly; if not, run > sudo apt-get install ssh to install them):
    a- Run the following command to generate an RSA key pair:
             > ssh-keygen -t rsa
    b- You will be asked for the location to store the key; accept the default (~/.ssh/id_rsa). The matching public key is written to ~/.ssh/id_rsa.pub.
    c- Append the public key to the authorized keys file, then give read and write permission to the owner only:
              > cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
              > chmod 0600 ~/.ssh/authorized_keys
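    d- Verify that passwordless SSH works (a quick check, not in the original steps):
              > ssh localhost
    You should get a shell without being asked for a password.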

  3. Download Hadoop:
    a- Go to the Apache Hadoop home page.
    b- Go to the Downloads page.
    c- On the Downloads page you will find links to the Hadoop releases; click the release you want to install (the binary package). You will be redirected to the mirrors page; pick any mirror link and use it in the following command to download the release:
    > sudo wget [ your link, e.g. http://apache.arvixe.com/hadoop/common/hadoop-2.6.0/hadoop-2.6.0.tar.gz ]
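    d- Optionally verify the download against the checksum published on the Apache download page (a suggested extra check, not in the original steps; the file name assumes the 2.6.0 release):
    > sha256sum hadoop-2.6.0.tar.gz
    Compare the output with the checksum listed for the release.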
  4. Extract the file to the /usr/local directory:
    a- Copy the downloaded file to the /usr/local directory.
    b- Extract the file with this command:
    > sudo tar xzf hadoop-2.6.0.tar.gz
    c- Rename the extracted directory from hadoop-2.6.0 to hadoop:
    > sudo mv hadoop-2.6.0 hadoop
    Note:
    If you get the following error:
       gzip: stdin: not in gzip format
       tar: Child returned status 1
       tar: Error is not recoverable: exiting now

    it means the download or the copy was not done correctly, so re-download the binary file and copy it to /usr/local properly.
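    You can check whether the copied file is a valid gzip archive before extracting (a suggested check, not part of the original steps):
    > file hadoop-2.6.0.tar.gz
    A good download is reported as "gzip compressed data"; if it is reported as HTML or ASCII text, wget fetched an error page instead of the tarball.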
  5. Define Hadoop in your PATH environment variable:
    a- Add the following line at the end of the ~/.bashrc file:
    export PATH=$PATH:/usr/local/hadoop/bin/
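    b- Reload the file so the change takes effect in the current shell:
    > source ~/.bashrc
    (Many guides also add export HADOOP_HOME=/usr/local/hadoop to ~/.bashrc; that line is a common convention, not required for standalone mode.)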
  6. At this point the installation should be done .. you can check it by running the following command: > hadoop version
    You should get something like:
    hadoop version
    Hadoop 2.6.0
    Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r e3496499ecb8d220fba99dc5ed4c99c8f9e33bb1
    Compiled by jenkins on 2014-11-13T21:10Z
    Compiled with protoc 2.5.0
    From source with checksum 18e43357c8f927c0695f1e9522859d6a
    This command was run using /usr/local/hadoop/share/hadoop/common/hadoop-common-2.6.0.jar
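  7. As a final smoke test, you can run one of the MapReduce examples that ship with the release in local (standalone) mode; the paths below assume the 2.6.0 release extracted to /usr/local/hadoop as in step 4:
    > mkdir ~/input
    > cp /usr/local/hadoop/etc/hadoop/*.xml ~/input
    > hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar grep ~/input ~/output 'dfs[a-z.]+'
    > cat ~/output/*
    The grep job counts matches of the given regular expression in the copied configuration files and writes the counts to ~/output.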
References:

1. http://www.tutorialspoint.com/hadoop/hadoop_enviornment_setup.htm

       



Tuesday, June 2, 2015

Introduction to Big Data


What is Big Data ?
In essence, big data is about liberating data that is large in volume, broad in variety and high in velocity from multiple sources in order to create efficiencies, develop new products and be more competitive. Big data encompasses "techniques and technologies that make capturing value from data at an extreme scale economical". It is often described by five Vs: volume, variety, velocity, verification (also called veracity) and value. I believe the first three (volume, variety and velocity) are attributes and characteristics of the data itself, while the last two (verification and value) are part of the process and its outcome.


STRUCTURED VS. UNSTRUCTURED DATA:

For the most part, structured data refers to information with a high degree of organization, such that inclusion in a relational database is seamless and readily searchable by simple, straightforward search-engine algorithms or other search operations; unstructured data is essentially the opposite. A customer table with name and address columns is structured, for example, while the free text of customer e-mails is not. The lack of structure makes compilation a time- and energy-consuming task, and it would benefit a company across all business strata to find a mechanism of data analysis that reduces the costs unstructured data adds to the organization.



Existing Big Data technologies :

RDBMS:

Before big data, traditional analysis involved crunching data in a traditional database. This was based on the relational database model, where data and the relationships between the data were stored in tables, and the data was processed and stored in rows. Databases have progressed over the years, however, and now use massively parallel processing (MPP) to break data up into smaller lots and process it on multiple machines simultaneously, enabling faster processing. Instead of storing the data in rows, the databases can also employ columnar architectures, which enable the processing of only the columns that hold the data needed to answer the query, and enable the storage of unstructured data.

MapReduce:

MapReduce is the combination of two functions for better processing of data. First, the map function separates data over multiple nodes, which then process it in parallel. The reduce function then combines the results of the calculations into a set of responses. Google used MapReduce to index the web, and has been granted a patent for its MapReduce framework. The MapReduce method has since become commonly used, with the most famous implementation being an open-source project called Hadoop (see below).
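To make the two functions concrete, here is a minimal run of the WordCount example that ships with a Hadoop release, in local (standalone) mode; the jar path assumes the Hadoop 2.6.0 installation from the previous post, and the file names are illustrative:

> echo "to be or not to be" > ~/wc-input.txt
> hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar wordcount ~/wc-input.txt ~/wc-output
> cat ~/wc-output/part-r-00000

The map step emits a (word, 1) pair for every word in the input, and the reduce step sums those pairs per word, so the last command prints each distinct word with its count.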
Massively parallel processing (MPP):
Like MapReduce, MPP processes data by distributing it across a number of nodes, each of which processes an allocation of data in parallel. The output is then assembled to create a result. However, MPP products are queried with SQL, while MapReduce is natively controlled via Java code. MPP is also generally used on expensive specialised hardware (sometimes referred to as big-data appliances), while MapReduce is deployed on commodity hardware.
Complex event processing (CEP):
Complex event processing involves processing time-based information in real time from various sources; for example, location data from mobile phones or information from sensors to predict, highlight or define events of interest. For example, information from sensors might lead to predicting equipment failures, even if the information from the sensors seems completely unrelated. Conducting complex event processing on large amounts of data can be enabled using MapReduce, by splitting the data into portions that aren't related to one another. For example, the sensor data for each piece of equipment could be sent to a different node for processing.
Hadoop
Derived from MapReduce technology, Hadoop is an open-source framework for processing large amounts of data over multiple nodes in parallel, running on inexpensive hardware. Data is split into sections and loaded into a file store, for example the Hadoop Distributed File System (HDFS), which is made up of multiple redundant nodes on cheap storage. A name node keeps track of which data is on which nodes, and the data is replicated over more than one node, so that even if a node fails there is still a copy of the data. The data can then be analysed using MapReduce, which discovers from the name node where the data needed for the calculations resides. Processing is done at the nodes in parallel, and the results are aggregated to determine the answer to the query, then loaded onto a node where they can be further analysed using other tools. Alternatively, the data can be loaded into traditional data warehouses for use with transactional processing. Apache's own distribution is considered the most noteworthy Hadoop distribution.
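As a taste of how data moves in and out of the file store, here are a few basic HDFS shell commands (a sketch assuming a running HDFS cluster, which the standalone installation in the previous post does not set up; the file and directory names are illustrative):

> hadoop fs -mkdir /data
> hadoop fs -put localfile.txt /data/
> hadoop fs -cat /data/localfile.txt

The put command splits and replicates the file across the data nodes behind the scenes, and cat reads it back from whichever nodes hold the replicas.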
NoSQL
NoSQL database-management systems are unlike relational database-management systems in that they do not use SQL as their query language. The idea behind these systems is that they are better for handling data that doesn't fit easily into tables. They dispense with the overhead of indexing, schemas and ACID transactional properties to create large, replicated data stores for running analytics on inexpensive hardware, which is useful for dealing with unstructured data.
Cassandra
Cassandra is a NoSQL database alternative to Hadoop's HDFS.
Hive
Data stores like Hadoop's make ad hoc query and analysis difficult, as programming the required map/reduce functions can be hard. Realising this when working with Hadoop, Facebook created Hive, which converts SQL queries into map/reduce jobs to be executed using Hadoop.
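For example, a word count that would otherwise need hand-written map and reduce functions can be expressed as a single query; a sketch, assuming Hive is installed and a table named words already exists (both are assumptions, not covered in this post):

> hive -e "SELECT word, COUNT(*) AS cnt FROM words GROUP BY word"

Hive compiles this SQL into one or more MapReduce jobs and runs them on the Hadoop cluster.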