Hadoop is an open-source, scalable, and fault-tolerant framework written in Java. Hadoop holds a dominant share of the big data market. It efficiently processes large volumes of data on a cluster of commodity hardware and provides a framework for running jobs across multiple nodes of a cluster, where a cluster means a group of systems connected via LAN.
The basic Hadoop programming language is Java, but this doesn't mean you can code only in Java. You can also code in C, C++, Perl, and other languages. You can program against the Hadoop framework in any of these, but it is usually better to code in Java, as it gives you lower-level control of the code.
Hadoop uses the MapReduce programming model together with its own distributed file system, HDFS (inspired by Google's GFS). Hadoop works in a master-slave fashion: there is one master node and n slave nodes, where n can run into the thousands. The master manages, maintains, and monitors the slaves, while the slaves are the actual worker nodes.
Daemons are processes that run in the background. There are mainly 4 types of daemons which run for Hadoop:
- Master Node – NameNode, ResourceManager
- Slave Node – DataNode, NodeManager
Hadoop consists of three major parts:
- Hadoop Distributed File System (HDFS) – Storage layer of Hadoop
- MapReduce – Data processing layer of Hadoop
- YARN – Resource Management layer of Hadoop
Let's talk about them in detail.
On the master node, a daemon called NameNode runs for HDFS. On every slave node, a daemon called DataNode runs for HDFS.
File sizes are typically in the range of GBs and TBs. These files are split into blocks (128 MB by default) and stored distributed across multiple machines. Each block is replicated according to the replication factor, and the replicas are stored on different nodes. This is what lets the cluster tolerate the failure of a node.
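The block arithmetic above can be sketched in a few lines of plain Python (this is not Hadoop code; block size and replication factor are just the illustrative defaults):

```python
# Sketch of how a file is split into 128 MB blocks and replicated.
import math

BLOCK_SIZE_MB = 128      # HDFS default block size
REPLICATION_FACTOR = 3   # HDFS default replication factor

def split_into_blocks(file_size_mb):
    """Return the number of blocks a file occupies in HDFS."""
    return math.ceil(file_size_mb / BLOCK_SIZE_MB)

def total_storage(file_size_mb):
    """Raw storage consumed once every block is replicated."""
    return file_size_mb * REPLICATION_FACTOR

print(split_into_blocks(500))   # a 500 MB file -> 4 blocks
print(total_storage(500))       # 1500 MB of raw cluster storage
```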
MapReduce is one of the most important pillars of Hadoop. Hadoop MapReduce is a programming model designed to process large volumes of data in parallel by dividing the work into a set of independent tasks. In Hadoop 1.x it consists of a JobTracker and TaskTrackers.
YARN (Yet Another Resource Negotiator) is the resource management layer of Hadoop. On the master node, the ResourceManager daemon runs for YARN, and on every slave node the NodeManager daemon runs.
How does the Mapper function work?
The mapper task is the first phase of processing: it processes each input record and generates intermediate key-value pairs, storing this intermediate output on the local disk. The generated pairs can be completely different from the input pair, and the output of the mapper task is the full collection of all these pairs.
Before the output of each mapper task is written, it is partitioned based on the key and then sorted, so that the values for each key end up grouped together. A mapper only understands key-value pairs of data, so before data is passed to the mapper it must first be converted into pairs.
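The map phase described above can be sketched conceptually in plain Python (this is not the real Hadoop Java API): each input record yields intermediate (key, value) pairs, which are then sorted so values for the same key sit together.

```python
# Conceptual sketch of the map phase: record -> (key, value) pairs.
def word_count_mapper(record):
    """Emit (word, 1) for every word in one input record."""
    for word in record.split():
        yield (word.lower(), 1)

records = ["Hadoop stores data", "Hadoop processes data"]
intermediate = []
for record in records:
    intermediate.extend(word_count_mapper(record))

# Sorting groups identical keys together before the reduce phase.
intermediate.sort(key=lambda kv: kv[0])
print(intermediate)
```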
How is Key-value pair generated?
InputSplit – A unit of work that contains the data for a single map task.
RecordReader – Communicates with the InputSplit and converts the data into key-value pairs suitable for the Mapper. By default, it uses TextInputFormat to convert data into key-value pairs, and it keeps communicating with the InputSplit until the file is fully read.
How does the Mapper work?
The InputSplit presents the physical blocks to the Hadoop mapper as a logical representation. The total number of blocks of the input files determines the number of map tasks in a program.
No. of Mapper = Total data size / Input Split size
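The formula above works out as a quick calculation (sizes here are illustrative):

```python
# Number of map tasks = total data size / input split size.
import math

def num_mappers(total_data_mb, split_size_mb=128):
    """Apply the mapper-count formula, rounding up partial splits."""
    return math.ceil(total_data_mb / split_size_mb)

print(num_mappers(1024))   # 1 GB of input with 128 MB splits -> 8 mappers
print(num_mappers(200))    # 200 MB of input -> 2 mappers
```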
How does Hadoop Reducer work?
The Reducer takes the output of the mapper, i.e. key-value pairs, and processes each of them to generate the output. The output of the reducer is the final output, which is stored in HDFS.
There are 3 phases of Reducer in Hadoop MapReduce.
- Shuffle Phase – The sorted output from the mapper is transferred to the Reducer as input.
- Sort Phase – The input from the different mappers is sorted again, so that pairs with the same key coming from different mappers are merged together. The shuffle and sort phases occur concurrently.
- Reduce Phase – After shuffling and sorting, the reduce task aggregates the key-value pairs.
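The reduce side can be sketched in plain Python as well (again conceptual, not the Hadoop API): shuffle/sort brings identical keys together, then the reduce function aggregates the values for each key.

```python
# Conceptual sketch of shuffle, sort, and reduce for a word count.
from itertools import groupby
from operator import itemgetter

mapper_output = [("hadoop", 1), ("data", 1), ("hadoop", 1), ("data", 1)]

# Shuffle + sort: bring identical keys together.
mapper_output.sort(key=itemgetter(0))

# Reduce: aggregate the values for each key.
final_output = {key: sum(v for _, v in group)
                for key, group in groupby(mapper_output, key=itemgetter(0))}
print(final_output)   # {'data': 2, 'hadoop': 2}
```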
How does Hadoop work?
- Input data is broken into blocks of 128 MB (by default) and the blocks are moved to different nodes.
- Once all the blocks of the file are stored on DataNodes, a user can process the data.
- The master then schedules the program (submitted by the user) on individual nodes.
- Once all the nodes have processed the data, the output is written back to HDFS.
What are different types of Modes in Apache Hadoop?
Apache Hadoop runs in three modes:
- Local (Standalone) Mode – By default, Hadoop runs in a single-node, non-distributed mode, as a single Java process. Local mode uses the local file system for input and output operations. It is also used for debugging purposes, and it does not support the use of HDFS. Further, in this mode there is no custom configuration required in the configuration files.
- Pseudo-Distributed Mode – Just like Standalone mode, Hadoop runs on a single node in Pseudo-distributed mode. The difference is that each daemon runs in a separate Java process in this mode. Pseudo-distributed mode requires settings in the Hadoop configuration files (core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml). In this case, all daemons run on one node, and thus both the Master and Slave node are the same.
- Fully-Distributed Mode – In this mode, the daemons run on separate nodes, forming a multi-node cluster. Thus, it allows separate nodes for Master and Slave.
Explain Data Locality Feature of Hadoop?
A major drawback of early Hadoop deployments was cross-switch network traffic caused by the huge volume of data. Data locality overcomes this drawback: it refers to the ability to move the computation close to the node where the actual data resides, instead of moving large data to the computation. Data locality increases the overall throughput of the system.
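A toy illustration of the idea: given which nodes hold a block's replicas, a locality-aware scheduler prefers to run the task on one of those nodes rather than ship the block across the network. Node names here are made up for the example.

```python
# Toy sketch of data-local task placement (not a real Hadoop scheduler).
block_locations = {"block-A": ["node1", "node3", "node5"]}
available_nodes = ["node2", "node3", "node4"]

def pick_node(block, nodes):
    """Prefer a node that already stores the block (a data-local task)."""
    local = [n for n in nodes if n in block_locations[block]]
    return local[0] if local else nodes[0]   # fall back to any free node

print(pick_node("block-A", available_nodes))   # node3: data-local choice
```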
What is Safe Mode in Hadoop?
Safe mode in Apache Hadoop is a maintenance state of the NameNode, during which the NameNode doesn't allow any modifications to the file system. In safe mode, the HDFS cluster is read-only, and blocks are neither replicated nor deleted.
How data or file is written into HDFS?
When a client wants to write a file to HDFS, it contacts the NameNode for metadata. The NameNode responds with details such as the number of blocks and the replication factor. Based on this information from the NameNode, the client splits the file into multiple blocks and starts sending them to the first DataNode: it sends block A to DataNode 1 along with the details of the other two DataNodes.
When DataNode 1 receives block A from the client, it copies the same block to DataNode 2 in the same rack. As both DataNodes are in the same rack, the block transfers via the rack switch. DataNode 2 then copies the same block to DataNode 3. As these two DataNodes are in different racks, the block transfers via an out-of-rack switch.
After the DataNodes receive the block, each sends a write confirmation to the NameNode, and a write confirmation is also sent back to the client. The same process repeats for each block of the file, and the data transfers happen in parallel for faster writes.
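The write pipeline described above can be sketched as a toy simulation (plain Python; DataNode names are illustrative): the client hands the block to the first DataNode, and each DataNode forwards a copy to the next one in the pipeline.

```python
# Toy sketch of the HDFS pipelined write, not real HDFS code.
def write_block(block, pipeline):
    """Simulate pipelined replication; return the nodes holding replicas."""
    stored = []
    for i, datanode in enumerate(pipeline):
        stored.append(datanode)   # this node stores the block...
        # ...and forwards it to pipeline[i + 1], if there is a next node
    return stored

replicas = write_block("block-A", ["datanode1", "datanode2", "datanode3"])
print(replicas)   # all three replicas written via the pipeline
```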
Can multiple clients write into an HDFS file concurrently?
Multiple clients cannot write into an HDFS file at the same time: Apache Hadoop HDFS follows a single-writer, multiple-reader model. When a client opens a file for writing, the NameNode grants it a lease. Now suppose some other client wants to write into that file and asks the NameNode for the write operation. The NameNode first checks whether it has already granted the lease for writing into that file to someone else. If someone else already holds the lease, it rejects the write request of the other client.
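The single-writer rule can be sketched as a toy lease table (plain Python; the class and file path are made up for illustration):

```python
# Toy sketch of the NameNode's write-lease bookkeeping.
class NameNodeLeases:
    def __init__(self):
        self.leases = {}   # file path -> client holding the write lease

    def request_write(self, path, client):
        """Grant the lease if free (or already held by this client)."""
        if path in self.leases and self.leases[path] != client:
            return False            # someone else holds the lease: reject
        self.leases[path] = client
        return True

nn = NameNodeLeases()
print(nn.request_write("/logs/app.log", "client-1"))   # True: lease granted
print(nn.request_write("/logs/app.log", "client-2"))   # False: rejected
```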
How data or file is read in HDFS?
To read from HDFS, the client first contacts the NameNode for metadata. The NameNode responds with the names of the file's blocks, their locations, the number of blocks, and the replication factor. The client then communicates with the DataNodes where the blocks are stored and reads data from them in parallel, based on the information received from the NameNode.
Once the client or application receives all the blocks of the file, it combines these blocks to form the file. To improve read performance, the replica locations of each block are ordered by their distance from the client, and HDFS selects the replica closest to the client. This reduces read latency and bandwidth consumption: the client first reads a replica on the same node, then another node in the same rack, and finally a DataNode in another rack.
By default, the replication factor is 3 which is configurable. Replication of data solves the problem of data loss in unfavorable conditions.
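The replica selection on read can be sketched like this (plain Python; the distance values are made up — in HDFS, same node is closer than same rack, which is closer than a remote rack):

```python
# Toy sketch of closest-replica selection for an HDFS read.
def closest_replica(replicas, distance):
    """Pick the replica with the smallest distance from the client."""
    return min(replicas, key=lambda node: distance[node])

# Illustrative distances: same node < same rack < remote rack.
distance = {"same-node": 0, "same-rack": 1, "remote-rack": 2}
print(closest_replica(["remote-rack", "same-rack", "same-node"], distance))
```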
How is indexing done in HDFS?
Hadoop has a unique way of indexing. Once the Hadoop framework stores the data according to the block size, HDFS keeps storing, with the last part of each chunk of data, a pointer that says where the next part of the data is. In fact, this is the basis of HDFS.
What is the Heartbeat in Hadoop?
A heartbeat is the signal that the NameNode receives from a DataNode to show that it is functioning (alive); this is how the NameNode and DataNodes communicate. If the NameNode does not receive a heartbeat from a DataNode within the configured timeout, it considers that node dead and schedules the creation of new replicas of its blocks on other DataNodes. The default heartbeat interval is 3 seconds.
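Heartbeat monitoring can be sketched as a toy check (plain Python; the timeout and timestamps are illustrative, not HDFS defaults):

```python
# Toy sketch of declaring DataNodes dead after missed heartbeats.
HEARTBEAT_TIMEOUT = 10   # seconds (illustrative value)

def dead_nodes(last_heartbeat, now):
    """Return DataNodes whose last heartbeat is older than the timeout."""
    return [node for node, t in last_heartbeat.items()
            if now - t > HEARTBEAT_TIMEOUT]

last_heartbeat = {"datanode1": 100, "datanode2": 95, "datanode3": 80}
print(dead_nodes(last_heartbeat, now=103))   # datanode3 missed the window
```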
What is Hadoop Archive (HAR) used for?
Hadoop Archive (HAR) basically deals with the small-files issue. A HAR packs a number of small files into a single large file, so one can still access the original files in parallel, transparently (without expanding the archive), and efficiently.
How does Hadoop handle a single point of failure?
Hadoop 2.0 overcomes this single point of failure by providing support for multiple NameNodes. The high-availability feature adds an extra NameNode (an active-standby pair) to the Hadoop architecture, configured for automatic failover. If the active NameNode fails, the standby NameNode takes over all the responsibilities of the active node and the cluster continues to work.
Explain Erasure Coding in Hadoop?
In Hadoop, HDFS by default replicates each block three times. Replication in HDFS is a very simple and robust form of redundancy that shields against the failure of a DataNode, but it is very expensive: the 3x replication scheme incurs a 200% overhead in storage space and other resources.
Hadoop 3.x therefore introduced Erasure Coding, a new feature to use in place of replication. It provides the same level of fault tolerance with far less storage, at roughly 50% storage overhead.
Erasure Coding borrows the idea behind Redundant Array of Inexpensive Disks (RAID), which implements redundancy through striping: logically sequential data (such as a file) is divided into smaller units (such as bits, bytes, or blocks), and the data is stored across different disks along with computed parity.
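The storage-overhead comparison works out as simple arithmetic. Assuming a Reed-Solomon (6, 3) scheme (6 data cells plus 3 parity cells), which is one of the layouts HDFS erasure coding supports:

```python
# Storage overhead: 3x replication vs. Reed-Solomon (6, 3) erasure coding.
def replication_overhead(replicas=3):
    """Extra storage as a percentage of the original data."""
    return (replicas - 1) * 100

def ec_overhead(data_units=6, parity_units=3):
    """Parity storage as a percentage of the original data."""
    return parity_units / data_units * 100

print(replication_overhead())   # 200% overhead for 3x replication
print(ec_overhead())            # 50.0% overhead for RS(6,3)
```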
How is Combiner different from Reducer?
The Combiner is a mini-reducer that performs a local reduce task: it runs on the map output and feeds its output into the reducer's input, and it is usually used for network optimization. The Reducer, in contrast, takes the set of intermediate key-value pairs produced by the mappers as input and runs a reduce function on each group to generate the output. The output of the reducer is the final output.
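What the combiner buys you can be sketched in plain Python: by pre-aggregating mapper output locally, far fewer (key, value) pairs need to cross the network to the reducer.

```python
# Conceptual sketch of local combining before the shuffle.
from collections import Counter

# One mapper's raw output: 1500 (word, 1) pairs.
mapper_output = [("hadoop", 1)] * 1000 + [("spark", 1)] * 500

# A combiner sums values per key locally: one pair per distinct key.
combined = list(Counter(k for k, _ in mapper_output).items())

print(len(mapper_output), "pairs shipped without a combiner")
print(len(combined), "pairs shipped with a combiner")
```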
Hadoop is based on the concept of batch processing, where processing happens on blocks of data that have already been stored over a period of time. Hadoop broke all expectations with the revolutionary MapReduce framework in 2005 and became the go-to framework for processing data in batches. This went on until 2014, when Spark overtook Hadoop. Spark's USP was that it could process data in near real time and was about 100 times faster than Hadoop MapReduce at batch processing large data sets.
Spark is a cluster-computing framework. It competes with MapReduce and the entire Hadoop ecosystem, but Spark doesn't have its own distributed filesystem; it typically uses HDFS.
Spark uses memory first and spills to disk for processing, whereas MapReduce is completely disk-based. The main difference is the storage abstraction: Spark uses RDDs, that is, Resilient Distributed Datasets. Spark is amazing for in-memory and, more importantly, iterative computing. The key benefit it offers is caching intermediate data in memory for better access times.
Some use cases where Spark outperforms Hadoop
- Real-time querying of data: querying in seconds rather than minutes
- Stream processing: fraud detection and log processing in live streams for alerts, aggregates, and analysis
- Sensor data processing: where data is fetched and joined from multiple sources, in-memory datasets really shine, as they are easy and fast to process.
Features of Spark
Polyglot: Spark provides high-level APIs in Java, Scala, Python, and R. Spark code can be written in any of these four languages. It provides a shell in Scala and Python.
Speed: Spark runs up to 100 times faster than Hadoop MapReduce for large-scale data processing. Spark achieves this speed through in-memory computation and controlled partitioning.
Multiple Formats: Spark supports multiple data sources such as Parquet, JSON, Hive and Cassandra apart from the usual formats such as text files, CSV and RDBMS tables.
Lazy Evaluation: It delays its evaluation until it is necessary. This is one of the key factors contributing to its speed.
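Lazy evaluation can be illustrated with a plain-Python analogy (this is not the Spark API): like Spark transformations, generator pipelines only describe a computation, and no work happens until a result is actually demanded (the equivalent of an "action").

```python
# Plain-Python analogy for Spark's lazy evaluation.
numbers = range(1_000_000)

# Nothing is computed yet; like map()/filter() on an RDD, these lines
# only build up the pipeline.
squares = (n * n for n in numbers)
evens = (n for n in squares if n % 2 == 0)

# The "action": pulling a value finally triggers computation, and only
# as much work as needed is done.
print(next(evens))   # 0
```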
Real-Time Computation: Spark’s computation is real time and has low latency because of its in-memory computation.
Hadoop Integration: Apache Spark provides smooth compatibility with Hadoop. This is a boon for all the big data engineers who started their careers with Hadoop. Spark is a potential replacement for the MapReduce functions of Hadoop, and it can run on top of an existing Hadoop cluster using YARN for resource scheduling.
Machine Learning: Spark’s MLlib is the Machine Learning component which is handy when it comes to big data processing. It eradicates the need to use multiple tools, one for processing and one for Machine Learning.
Spark components are what make Apache Spark fast and reliable. A lot of these Spark components were built to resolve the issues that cropped up while using Hadoop MapReduce. Apache Spark has the following components:
- Spark Core
- Spark Streaming
- Spark SQL
- GraphX (Graph Processing)
- MLlib (Machine Learning)
Let’s look at each in detail.
Spark Core is the base engine for large-scale parallel and distributed data processing. The core is the distributed execution engine and the Java, Scala, and Python APIs offer a platform for distributed ETL application development. Further, additional libraries which are built atop the core allow diverse workloads for streaming, SQL, and machine learning. It is responsible for:
- Memory management and fault recovery
- Scheduling, distributing and monitoring jobs on a cluster
- Interacting with storage systems
Spark Streaming is the component of Spark which is used to process real-time streaming data. Thus, it is a useful addition to the core Spark API. It enables high-throughput and fault-tolerant stream processing of live data streams. The fundamental stream unit is DStream which is basically a series of RDDs (Resilient Distributed Datasets) to process the real-time data.
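The DStream idea can be sketched conceptually (plain Python, not the Spark Streaming API): a live stream is chopped into small micro-batches, and each batch is processed like a regular RDD-style dataset.

```python
# Toy sketch of micro-batching, the idea behind a DStream.
def micro_batches(stream, batch_size):
    """Split an incoming stream into fixed-size batches."""
    for i in range(0, len(stream), batch_size):
        yield stream[i:i + batch_size]

# Illustrative event stream; each batch is processed independently.
events = ["click", "view", "click", "view", "click", "buy"]
counts = [batch.count("click") for batch in micro_batches(events, 2)]
print(counts)   # clicks per micro-batch: [1, 1, 1]
```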
Spark SQL is a new module in Spark which integrates relational processing with Spark’s functional programming API. It supports querying data either via SQL or via the Hive Query Language. For those of you familiar with RDBMS, Spark SQL will be an easy transition from your earlier tools where you can extend the boundaries of traditional relational data processing.
Spark SQL integrates relational processing with Spark’s functional programming. Further, it provides support for various data sources and makes it possible to weave SQL queries with code transformations thus resulting in a very powerful tool.
The following are the four libraries of Spark SQL.
- Data Source API
- DataFrame API
- Interpreter & Optimizer
- SQL Service
GraphX is the Spark API for graphs and graph-parallel computation. Thus, it extends the Spark RDD with a Resilient Distributed Property Graph.
The property graph is a directed multigraph that can have multiple edges in parallel, and every edge and vertex has user-defined properties associated with it. The parallel edges allow multiple relationships between the same vertices.
To support graph computation, GraphX exposes a set of fundamental operators (e.g., subgraph, join vertices, and mapReduceTriplets) as well as an optimized variant of the Pregel API. In addition, GraphX includes a growing collection of graph algorithms and builders to simplify graph analytics tasks.
MLlib (Machine Learning)
MLlib stands for Machine Learning Library. Spark MLlib is used to perform machine learning in Apache Spark.
Hadoop vs Spark
Spark is also an Apache project: a fast and general engine for large-scale data processing. Spark is commonly said to be faster than Hadoop: up to 10x faster on disk and up to 100x faster than Hadoop MapReduce in memory. Spark can perform batch processing, and it is also suitable for streaming workloads, interactive queries, and machine learning (data science).
Spark was created to provide real-time data-processing capability, and it is compatible with Hadoop and all its modules; Spark is even listed as a module on the Hadoop project page. Spark also has its own page and runs on Hadoop clusters through YARN (Yet Another Resource Negotiator), running as a Hadoop module. Data science experts seem positive that Spark will replace Hadoop MapReduce in the future, thanks to its faster data-processing capabilities.
That brings us to the end of the post. Hope you gained some insights about these latest Technologies. As the semester progresses I’ll keep posting such blogs as I study these intricate topics. If you found it useful, do like and share. 🙂