Apache Hadoop Tutorial for Beginners
Apache Hadoop is an open source software stack that runs on a cluster of machines. Hadoop provides distributed storage and processing for very large data sets. The objective of this tutorial is to get you started with Hadoop: it covers the fundamentals along with HDFS, MapReduce, YARN, and more.
Hadoop is open source software, released under version 2.0 of the Apache License. Here you will discover how to write, build, test, and run a simple Hadoop program. The first part of this tutorial serves as a walkthrough and covers a basic introduction; the second part invites you to write your own Hadoop program.
Hadoop Introduction
The Hadoop framework offers the power and flexibility to perform tasks that were previously impossible. Doug Cutting, the creator of the popular text search library Apache Lucene, developed Hadoop. Hadoop has its origins in Apache Nutch, an open source web search engine that was itself part of the Lucene project. The name Hadoop is a made-up name, not an acronym.
HDFS – Hadoop Distributed File System
HDFS, which stands for Hadoop Distributed File System, is the distributed file system that comes with Hadoop. HDFS is a file system designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.
HDFS blocks are larger than disk blocks in order to minimize the cost of seeks. If the block is large enough, the time to transfer the data from the disk is significantly longer than the time to seek to the start of the block. Transferring a large file made up of multiple blocks therefore operates at the disk transfer rate.
A quick calculation shows that if the seek time is around 10 ms and the transfer rate is 100 MB/s, then to make the seek time 1% of the transfer time, we need a block size of around 100 MB. The default is actually 128 MB, although many HDFS installations use larger block sizes.
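Working through that arithmetic, using the 10 ms seek time and 100 MB/s transfer rate assumed above:

transfer time = seek time / 1% = 10 ms / 0.01 = 1 s
block size = transfer rate × transfer time = 100 MB/s × 1 s = 100 MB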
File System Operations
Now that the file system is ready to use, we can perform all of the common file system operations: reading files, creating directories, moving files, deleting data, and listing directories. Type hadoop fs -help to get detailed help on every command.
Start by copying a file from the local file system to HDFS:
% hadoop fs -copyFromLocal input/docs/quangle.txt \
  hdfs://localhost/user/tom/quangle.txt
This command invokes Hadoop's file system shell command fs, which supports a number of subcommands; in this case, we are running -copyFromLocal. The local file quangle.txt is copied to the file /user/tom/quangle.txt on the HDFS instance running on localhost. In fact, we could have omitted the scheme and host of the URI and picked up the default, hdfs://localhost, as specified in core-site.xml:
% hadoop fs -copyFromLocal input/docs/quangle.txt /user/tom/quangle.txt
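For reference, that default is set with the fs.defaultFS property. A minimal core-site.xml sketch, assuming HDFS runs on localhost as in these examples:

<configuration>
  <!-- Default file system URI; lets shell commands and programs
       omit the hdfs://localhost prefix -->
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost/</value>
  </property>
</configuration>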
We also could have used a relative path and copied the file to our home directory in HDFS, which in this case is /user/tom:
% hadoop fs -copyFromLocal input/docs/quangle.txt quangle.txt
Let's copy the file back to the local file system and check whether it's the same:
% hadoop fs -copyToLocal quangle.txt quangle.copy.txt
% md5 input/docs/quangle.txt quangle.copy.txt
MD5 (input/docs/quangle.txt) = e7891a2627cf263a079fb0f18256ffb2
MD5 (quangle.copy.txt) = e7891a2627cf263a079fb0f18256ffb2
The MD5 digests are the same, showing that the file survived its trip to HDFS and is back intact.
Finally, let’s look at an HDFS file listing. We create a directory first just to see how it is displayed in the listing:
% hadoop fs -mkdir books
% hadoop fs -ls .
Found 2 items
drwxr-xr-x   - tom supergroup          0 2014-10-04 13:22 books
-rw-r--r--   1 tom supergroup        119 2014-10-04 13:21 quangle.txt
Hadoop File Systems
HDFS is just one implementation of Hadoop's abstract notion of a file system. In Java, this abstraction is the org.apache.hadoop.fs.FileSystem abstract class, which provides the client interface to a file system in Hadoop, and there are several concrete implementations of it.
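To make the abstraction concrete, here is a minimal Java sketch that reads a file through the FileSystem API; the URI is carried over from the shell examples above:

import java.io.InputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Prints the contents of an HDFS file to standard output.
public class FileSystemCat {
    public static void main(String[] args) throws Exception {
        String uri = "hdfs://localhost/user/tom/quangle.txt";
        Configuration conf = new Configuration();
        // FileSystem.get() returns the concrete implementation for the
        // URI's scheme; for hdfs:// that is the HDFS client.
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        InputStream in = null;
        try {
            in = fs.open(new Path(uri));
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}

Because the program talks only to the FileSystem interface, the same code works unchanged against the local file system or any other implementation, simply by changing the URI scheme.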
YARN
Apache YARN (Yet Another Resource Negotiator) is Hadoop's cluster resource management system. YARN was introduced in Hadoop 2 to improve the MapReduce implementation, but it is general enough to support other distributed computing paradigms as well.
Although YARN provides APIs for requesting and working with cluster resources, user code rarely uses these APIs directly. Instead, users write to higher-level APIs provided by distributed computing frameworks, which are themselves built on YARN and hide the resource management details from the user.
Some distributed computing frameworks (MapReduce among them) run as YARN applications on the cluster compute layer (YARN) and the cluster storage layer (HDFS and HBase).
YARN provides its core services through two types of long-running daemons: a resource manager (one per cluster) that manages the use of resources across the cluster, and node managers (running on every node in the cluster) that launch and monitor containers.
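A minimal yarn-site.xml sketch wires these daemons together; the hostname below is an illustrative assumption, and mapreduce_shuffle is the auxiliary service MapReduce needs on every node manager:

<configuration>
  <!-- Where node managers and clients find the resource manager
       (hostname is an assumption for this sketch) -->
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>resourcemanager.example.com</value>
  </property>
  <!-- Auxiliary service that serves map outputs to reducers -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>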
YARN Scheduling
In an ideal world, the requests that a YARN application makes would be granted immediately. In the real world, however, resources are limited, and on a busy cluster an application will often need to wait to have some of its requests fulfilled. It is the job of the YARN scheduler to allocate resources to applications according to some defined policy.
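For example, with the Capacity Scheduler, queues and their shares of the cluster are declared in capacity-scheduler.xml. The queue names and percentages below are illustrative assumptions, not defaults:

<configuration>
  <!-- Split the root queue into two sub-queues -->
  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>prod,dev</value>
  </property>
  <!-- Give prod 60% and dev 40% of cluster resources -->
  <property>
    <name>yarn.scheduler.capacity.root.prod.capacity</name>
    <value>60</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.dev.capacity</name>
    <value>40</value>
  </property>
</configuration>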
MapReduce
MapReduce is a programming model for data processing. The model is simple, yet expressive enough for many useful programs, and MapReduce programs written in a variety of languages can run on Hadoop. MapReduce works by breaking the processing into two phases: the map phase and the reduce phase. Each phase has key-value pairs as input and output, and the programmer chooses the types of these pairs.
Because its inputs and outputs are just key-value pairs, MapReduce has a straightforward data processing model that can be applied to data in a variety of formats, from plain text to complex binary objects.
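The canonical first MapReduce program is word count; the sketch below uses the standard org.apache.hadoop.mapreduce API. Input and output paths are taken from the command line, and the reducer doubles as a combiner:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every word in the input line
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts emitted for each word
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged into a JAR, the job runs with something like hadoop jar wordcount.jar WordCount books output (the JAR name and paths here are assumptions for illustration).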
This Apache Hadoop tutorial for beginners explained what Hadoop is and gave a brief introduction to its most important concepts: HDFS, YARN, and MapReduce. I will be adding more posts to the Hadoop tutorial, so please bookmark this post for future reference.