HADOOP INTERVIEW QUESTIONS
https://data-flair.training/blogs/hadoop-interview-questions-and-answers/
Hadoop
Interview Questions and Answers
This blog post on Hadoop
interview questions and answers is one of our most
important articles on the Hadoop Blog. Interviews are a very critical part of one's
career, and it is important to know the correct answers to the questions
asked in an interview so you have enough knowledge and confidence. These Hadoop
Interview Questions were prepared by the industry experts at DataFlair. We have
divided the whole post into two parts:
1. Hadoop
Interview Questions for HDFS
2. Hadoop
Interview Questions for MapReduce
Hadoop
Interview Questions for HDFS
These 50+ Hadoop Interview Questions and Answers for HDFS cover
the different components of HDFS. If you want to become a Hadoop admin or
Hadoop developer, then DataFlair is
an appropriate place to start.
We took great care while framing these Hadoop interview
questions. Do share your thoughts in the comment section below.
In this section of Hadoop interview questions and answers, we
have covered 50+ Hadoop interview questions and answers in detail: HDFS Hadoop
interview questions and answers for freshers, HDFS Hadoop interview questions
and answers for experienced candidates, as well as some advanced Hadoop
interview questions and answers.
HDFS Hadoop Interview Questions and Answers
Basic
Questions And Answers for Hadoop Interview
1) What is Hadoop HDFS – Hadoop Distributed
File System?
The Hadoop Distributed File System (HDFS) is
the primary storage system of Hadoop. HDFS stores very large files on a cluster
of commodity hardware. It works on the principle of storing a small number of large files rather than a huge
number of small files. HDFS stores data reliably even in the case of hardware
failure. It also provides high-throughput access to applications by
reading data in parallel.
Components of HDFS:
·
NameNode – It
works as the Master in a Hadoop cluster. The NameNode stores metadata, i.e. the number of
blocks, replicas, and other details. The metadata is kept in memory on the
master to provide faster retrieval of data. The NameNode maintains and manages
the slave nodes, and assigns tasks to them. It should be deployed on reliable
hardware, as it is the centerpiece of HDFS.
·
DataNode – It
works as a Slave in a
Hadoop cluster. In Hadoop HDFS, the DataNode is responsible for storing the actual data
in HDFS. It also performs read and write operations as requested by the
clients. DataNodes can be deployed on commodity hardware.
2) What are the key features of HDFS?
The various Features of HDFS are:
·
Fault Tolerance – In
Apache Hadoop HDFS, fault tolerance is
the working strength of a system in unfavorable conditions. Hadoop HDFS is highly
fault-tolerant: in HDFS, data is divided into blocks, and multiple copies of the
blocks are created on different machines in the cluster. If any machine in the
cluster goes down due to unfavorable conditions, a client can still easily
access its data from other machines which contain the same copy of the data
blocks.
·
High Availability – HDFS
is a highly available file system; data is replicated among the nodes in the
HDFS cluster by creating replicas of the blocks on the other slaves present in
the cluster. Hence, when clients want to access their data, they can
read it from the slave which contains its blocks and which is
nearest to them in the cluster. If a node fails,
a client can easily access the data from other nodes.
·
Data Reliability – HDFS
is a distributed file system which provides reliable data storage. HDFS can
store data in the range of hundreds of petabytes. It stores data reliably by creating
a replica of every block present on its nodes and hence provides
fault tolerance.
·
Replication – Data
replication is one of the most important and unique features of HDFS. In HDFS,
data replication is done to solve the problem of data loss in unfavorable
conditions like node crashes, hardware failure, and so on.
·
Scalability – HDFS
stores data on multiple nodes in the cluster; when requirements increase, we can
scale the cluster. There are two scalability mechanisms available: vertical and
horizontal.
·
Distributed Storage – In
HDFS, all the features are achieved via distributed storage and replication:
data is stored in a distributed manner across the nodes of the HDFS cluster.
3) What is the difference between NAS and
HDFS?
·
The Hadoop Distributed File
System (HDFS) is the primary storage system of Hadoop. HDFS is designed to
store very large files on a cluster of commodity hardware, while Network-Attached Storage (NAS) is a file-level
computer data storage server. NAS provides data access to a heterogeneous group
of clients.
·
HDFS distributes data blocks across all the machines in a
cluster, whereas NAS stores data on dedicated hardware.
·
Hadoop HDFS is designed to work with the MapReduce framework. In the MapReduce framework, computation
moves to the data instead of data to the computation. NAS is not suitable for
MapReduce, as it stores data separately from the computation.
·
Hadoop HDFS runs on a cluster of commodity hardware, which is cost-effective,
while NAS is a high-end storage device with a high cost.
4) List the various HDFS daemons in HDFS
cluster?
The daemons that run in an HDFS cluster are as follows:
·
NameNode – It
is the master node. It is
responsible for storing the metadata of all the files and directories. It also
has information about blocks, their locations, replicas, and other details.
·
DataNode – It
is the slave node that
contains the actual data. The DataNode also performs read and write operations
as requested by the clients.
·
Secondary NameNode – The Secondary
NameNode downloads the FsImage and EditLogs from the NameNode and periodically merges
the EditLogs with the FsImage. This keeps the edit log size within a limit.
It stores the merged FsImage in persistent storage, which can be used
in the case of NameNode failure.
5) What is NameNode and DataNode in HDFS?
NameNode – It works as the Master in a Hadoop cluster. Listed below are the main functions
performed by the NameNode:
·
Stores metadata of the actual data, e.g. filename, path, number of
blocks, block IDs, block locations, number of replicas, and slave-related
configuration.
·
It also manages the filesystem namespace.
·
Regulates client access requests to the actual file data.
·
It also assigns work to the Slaves (DataNodes).
·
Executes file system namespace operations like opening/closing
files and renaming files/directories.
·
As the NameNode keeps metadata in memory for fast retrieval, it
requires a large amount of memory for its operation. It should also be hosted on
reliable hardware.
DataNode – It works as a Slave in a Hadoop cluster. Listed below are the
main functions performed by a DataNode:
·
Stores the actual business data.
·
It is the actual worker node, so it handles read/write/data
processing.
·
Upon instruction from the Master, it performs
creation/replication/deletion of data blocks.
·
As DataNodes store all the business data, they require a large
amount of storage for their operation. They should be hosted on commodity hardware.
These were some general Hadoop interview questions and answers.
Now let us take up some Hadoop interview questions and answers especially for
freshers.
Hadoop
Interview Question and Answer for Freshers
6) What do you mean by metadata in HDFS?
In Apache Hadoop HDFS, metadata describes the structure of HDFS directories
and files. It provides various information about directories and files, such as
permissions and replication factor. The NameNode stores metadata in the following
files:
·
FsImage – FsImage
is an "image file". It
contains the entire filesystem namespace and is stored as a file in the NameNode's
local file system. It also contains a serialized form of all the directory
and file inodes in the filesystem. Each inode is
an internal representation of a file's or directory's metadata.
·
EditLogs – EditLogs
contain all the recent modifications made to the file system since the most recent
FsImage. When the NameNode receives a create/update/delete request from a client,
the request is first recorded in the edits file.
If you face any doubt while reading the Hadoop interview
questions and answers drop a comment and we will get back to you.
7) What is Block in HDFS?
This is a very important Hadoop interview question, asked in most
interviews.
A block is a contiguous
location on the hard drive where data is stored. In general, a file system stores
data as a collection of blocks. In a similar way, HDFS stores each file as
blocks and distributes them across the Hadoop cluster. The HDFS client does not have any
control over the blocks, such as block location; the NameNode decides all such things.
The default size of an HDFS block is 128 MB, which we can
configure as per requirement. All blocks of a file are of the same size
except the last block, which can be the same size or smaller.
If the data size is less than the block size, the block only occupies as much space as the data requires. For example, if the file size is 129 MB, then 2 blocks will be created for it: one block of the default 128 MB, and the other of only 1 MB and not 128 MB, as that would waste space. Hadoop is intelligent enough not to waste the remaining 127 MB, so it allocates a 1 MB block for the 1 MB of data.
The major advantage of storing data in such a large block size is that
it saves disk seek time.
8) Why is Data Block size set to 128 MB in
Hadoop?
The block size is 128 MB for the following reasons:
·
To reduce disk seeks (I/O). The larger the block size, the fewer the
file blocks and hence the fewer disk seeks, while a block can still be transferred within
a respectable time, and in parallel.
·
HDFS holds huge data sets, i.e. terabytes and petabytes of data.
If we took a 4 KB block size for HDFS, just like the Linux file system,
we would have too many blocks and therefore too much metadata. Managing
this huge number of blocks and metadata would create huge overhead, which is
something we don't want. So, the block size is set to 128 MB.
On the other hand, the block size can't be too large either, because the system would wait a very long time for the last unit of data processing to finish its work.
9) What is the difference between a MapReduce
InputSplit and HDFS block?
Tip for this type of Hadoop interview question: start with the
definitions of Block and InputSplit, answer in
comparative language, and then cover their data representation, size, and an example,
also in comparative language.
By definition-
·
Block – A block
in Hadoop is the contiguous location on the hard drive where HDFS stores data.
In general, a file system stores data as a collection of blocks. In a similar way,
HDFS stores each file as blocks and distributes them across the Hadoop cluster.
·
InputSplit – InputSplit represents
the data which an individual Mapper will process.
The split is further divided into records, and each record (a key-value pair) is processed by the map function.
Data representation-
·
Block- It
is the physical representation of data.
·
InputSplit – It
is the logical representation of data; MapReduce programs and other
processing techniques use InputSplits during data processing. Importantly,
an InputSplit does not contain the input data itself; it
is just a reference to the data.
Size-
·
Block – The
default size of an HDFS block is 128 MB, which we can configure as per our
requirement. All blocks of a file are of the same size except the last block,
which can be the same size or smaller. In Hadoop, files are split into 128
MB blocks and then stored in the Hadoop file system.
·
InputSplit- Split
size is approximately equal to block size, by default.
Example-
Consider an example where we need to store a file in HDFS. HDFS stores files as blocks; a block is the smallest unit of data that can be stored on or retrieved from the disk, with a default size of 128 MB. HDFS breaks files into blocks and stores these blocks on different nodes in the cluster. Suppose we have a file of 130 MB: HDFS will break this file into 2 blocks.
Now, if one wants to perform a MapReduce operation on the blocks,
the second block cannot be processed on its own, as it is incomplete. InputSplit solves this
problem: an InputSplit forms a logical grouping of blocks as a single unit,
since it includes the location of the next block and the
byte offset of the data needed to complete the record.
From this, we can conclude that an InputSplit is only a logical
chunk of data, i.e. it holds just the information about block addresses or
locations. Thus, during MapReduce execution, Hadoop scans through the blocks and
creates InputSplits.
10) How can one copy a file into HDFS with a different
block size to that of existing block size configuration?
By using the option below, one can copy a file into HDFS with a
different block size:
-D dfs.blocksize=block_size,
where block_size is in bytes.
Consider an example to explain it in detail:
suppose you want to copy a file called test.txt of size, say, 128 MB, into HDFS,
and for this file you want the block size to be 32 MB (33554432 bytes) in place
of the default (128 MB). You can issue the following command:
hadoop fs -D dfs.blocksize=33554432 -copyFromLocal /home/dataflair/test.txt /sample_hdfs
Now, you can check the HDFS block size associated with this file
by:
hadoop fs -stat %o /sample_hdfs/test.txt
You can also check it by browsing the HDFS directory in the NameNode
web UI.
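For illustration only (this sketch is not part of the original answer), the same thing can be done from a Java client by passing a block size to FileSystem.create(); the path names used here are assumptions.
// Minimal sketch, assuming a default Hadoop configuration on the classpath.
// Writes a file with a 32 MB block size instead of the configured default.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CustomBlockSizeWrite {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // create(path, overwrite, bufferSize, replication, blockSize)
    FSDataOutputStream out = fs.create(new Path("/sample_hdfs/test.txt"),
        true, 4096, (short) 3, 32L * 1024 * 1024);
    out.writeUTF("sample content");
    out.close();
    fs.close();
  }
}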
These are very common types of Hadoop interview questions
faced during the interview of a fresher.
Frequently
Asked Question in Hadoop Interview
11) Which one is the master node in HDFS? Can
it be commodity hardware?
The NameNode is the master node in HDFS. The NameNode stores
metadata and works as the high-availability
machine in HDFS. It requires a large amount of memory
(RAM), so the NameNode needs to be a high-end machine with good memory space.
It cannot be commodity hardware, as the entire HDFS depends on it.
12) In HDFS, how Name node determines which
data node to write on?
Answer this type of Hadoop interview question very
briefly and to the point.
The NameNode contains metadata, i.e. the number of blocks, replicas,
their locations, and other details. This metadata is available in memory on the
master for faster retrieval of data. The NameNode maintains and manages the
DataNodes, and assigns tasks to them.
13) What is a Heartbeat in Hadoop?
A heartbeat is the signal that
the NameNode receives from the DataNodes to show that they are functioning (alive).
The NameNode and DataNodes communicate using heartbeats. If, after a certain time, a
DataNode does not send a heartbeat to the NameNode, that node is considered dead, and the NameNode in HDFS will create new replicas of its
blocks on other DataNodes.
Heartbeats carry information about the total storage capacity, the
fraction of storage in use, and the number of data transfers
currently in progress.
The default heartbeat interval is 3 seconds. One can change it using dfs.heartbeat.interval in hdfs-site.xml.
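As an illustration only (not part of the original answer), a minimal Java sketch of overriding this property on a client-side Configuration object is shown below; in a real cluster the value is set in hdfs-site.xml as stated above.
import org.apache.hadoop.conf.Configuration;

public class HeartbeatIntervalConfig {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Override the heartbeat interval (in seconds); normally set in hdfs-site.xml.
    conf.setLong("dfs.heartbeat.interval", 3);
    System.out.println(conf.getLong("dfs.heartbeat.interval", 3));
  }
}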
14) Can multiple clients write into an Hadoop
HDFS file concurrently?
Multiple clients cannot write to a Hadoop HDFS file at the same time. Apache Hadoop
follows a single-writer, multiple-reader
model. When an HDFS client opens a file for writing, the NameNode grants it
a lease. Now suppose some other client wants to write to that file: it asks the
NameNode for a write operation, and the NameNode first checks whether it has
already granted the lease for writing to that file to someone else. If someone
else holds the lease, it will reject the write request of the second
client.
15) How data or file is read in Hadoop HDFS?
To read from HDFS, the
client first communicates with the NameNode for metadata. The NameNode responds with
details of the number of blocks, block IDs, block locations, and the number of replicas. Then, the
client communicates with the DataNodes where the blocks are present. The client starts
reading data in parallel from the DataNodes, based on the information
received from the NameNode.
Once the application or HDFS client receives all the blocks of
the file, it combines these blocks to form the file. To improve read
performance, the locations of each block are ordered by their distance from the
client, and HDFS selects the replica which is closest to the client. This reduces
read latency and bandwidth consumption: it first reads the block on the same
node, then from another node in the same rack, and finally from a DataNode in
another rack.
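For illustration, a minimal Java sketch of this read path using the FileSystem API is shown below; it is not part of the original answer, and the file path is an assumption. The open() call obtains block locations from the NameNode, and the returned stream then reads from the DataNodes.
import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadExample {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // NameNode supplies metadata; data is streamed from the DataNodes.
    InputStream in = fs.open(new Path("/user/dataflair/sample.txt"));
    try {
      IOUtils.copyBytes(in, System.out, 4096, false);
    } finally {
      IOUtils.closeStream(in);
      fs.close();
    }
  }
}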
16) Does HDFS allow a client to read a file
which is already opened for writing?
Yes, a client can read a file which is already open for
writing. But the problem in reading a file which is currently open for
writing lies in the consistency of the data: HDFS does not guarantee that
the data written into the file will be visible to a new reader.
For this, one can call the hflush operation.
It pushes all the data in the buffer into the write pipeline, and then the hflush
operation waits for acknowledgments from the DataNodes. Hence, by doing
this, the data that the client has written into the file before the hflush operation is guaranteed to be visible to the reader.
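A minimal sketch of using hflush() from the Java API is shown below, purely as an illustration (not part of the original answer); the path is an assumption.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HflushExample {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FSDataOutputStream out = fs.create(new Path("/user/dataflair/log.txt"));
    out.writeBytes("first batch of records\n");
    out.hflush();   // pushes buffered data down the write pipeline; new readers can now see it
    out.writeBytes("more records\n");
    out.close();
    fs.close();
  }
}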
If you encounter any doubt or query in the Hadoop interview
questions, feel free to ask us in the comment section below and our support
team will get back to you.
17) Why is Reading done in parallel and
writing is not in HDFS?
The client reads data in parallel because doing so lets it access
the data faster, and reading in parallel also makes the system fault tolerant. But
the client does not perform the write operation in parallel, because writing in
parallel might result in data inconsistency.
Suppose two nodes are trying to write data to the same file in
parallel. The first node does not know what the second node
has written and vice versa, so we cannot determine which data to store and
access.
A client in Hadoop writes data in a pipeline fashion. There are
various benefits of a pipeline write:
·
More efficient bandwidth
consumption for the client – The client only has to
transfer one replica to the first DataNode in the pipeline. Each node
then receives and sends only one replica over the network (except the last DataNode, which only
receives data). This results in balanced bandwidth consumption, compared to
the client writing three replicas to three different DataNodes.
·
Smaller sent/ack window
to maintain – The client maintains a much smaller sliding window. The sliding
window records which blocks of the replica are being sent to the DataNodes and
which blocks are waiting for acks to confirm that the write has been done.
In a pipeline write, the client appears to write data to only one DataNode.
18) What is the problem with small files in
Apache Hadoop?
Hadoop is not suitable for
small data. Hadoop HDFS lacks the ability to support random reading of
small files. A small file in HDFS is one significantly smaller than the HDFS block size
(default 128 MB). If we store a huge number of such small files, HDFS
can't handle them well: HDFS works best with a small number of large
files for storing large datasets, not a large
number of small files. A large number of small files
overloads the NameNode, since it stores the namespace of HDFS.
Solution –
·
HAR (Hadoop Archive)
files – HAR files deal with the small-file issue. HAR introduces a
layer on top of HDFS which provides an interface for file access. Using the hadoop
archive command we can create HAR files; this runs a MapReduce job to pack the archived files into a
smaller number of HDFS files. Reading through files in a HAR is, however, no more
efficient than reading through files in HDFS.
·
Sequence Files – Sequence
files also deal with the small-file problem. Here, we use the filename as the key
and the file contents as the value. Suppose we have 10,000 files, each of 100
KB; we can write a program to put them into a single sequence file and then
process them in a streaming fashion (a sketch of such a program is shown after this list).
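The sketch below illustrates the sequence-file approach; it is not part of the original answer, and the local and HDFS paths are assumptions. Each small file is appended as one (filename, bytes) record.
import java.io.File;
import java.nio.file.Files;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFilesToSequenceFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
        SequenceFile.Writer.file(new Path("/user/dataflair/smallfiles.seq")),
        SequenceFile.Writer.keyClass(Text.class),
        SequenceFile.Writer.valueClass(BytesWritable.class))) {
      // Assumes a local directory full of small files.
      for (File f : new File("/home/dataflair/smallfiles").listFiles()) {
        byte[] bytes = Files.readAllBytes(f.toPath());
        writer.append(new Text(f.getName()), new BytesWritable(bytes));
      }
    }
  }
}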
19) What is throughput in HDFS?
The amount of work done in a unit of time is known as throughput. Below are the reasons why HDFS provides
good throughput:
·
Hadoop works on the data locality principle, which states that
computation is moved to the data instead of data to the computation. This reduces network
congestion and therefore enhances the overall system throughput.
·
HDFS follows a write-once, read-many model. This simplifies
data coherency issues, as data written once cannot be modified, and thus
provides high-throughput data access.
20) Comparison between Secondary NameNode and
Checkpoint Node in Hadoop?
The Secondary NameNode downloads
the FsImage and EditLogs from the NameNode and then merges the EditLogs
with the FsImage periodically. The Secondary NameNode stores the merged FsImage
in persistent storage, so we can use that FsImage in the case of NameNode
failure. However, it does not upload the merged FsImage back to the active
NameNode. A Checkpoint node, by contrast, is a node which periodically creates
checkpoints of the namespace.
The Checkpoint node in Hadoop first downloads the FsImage and edits from
the active NameNode, then merges them (FsImage and edits) locally, and at
last uploads the new image back to the active NameNode.
Questions 7-20 above were for
freshers; however, experienced candidates can also go through these Hadoop interview
questions and answers to revise the basics.
21) What is a Backup node in Hadoop?
Backup node provides the same
checkpointing functionality as the Checkpoint node (Checkpoint node is a node
which periodically creates checkpoints of the namespace. Checkpoint Node
downloads FsImage and edits from the active NameNode, merges them locally, and
uploads the new image back to the active NameNode). In Hadoop, Backup node
keeps an in-memory, up-to-date copy of the file system namespace, which is
always synchronized with the active NameNode state.
The Backup node does not need to download the FsImage and edits
files from the active NameNode in order to create a checkpoint, as would be
required with a Checkpoint node or Secondary NameNode, since it already has an
up-to-date state of the namespace in memory. The Backup node checkpoint
process is more efficient, as it only needs to save the namespace into the local
FsImage file and reset the edits. Only one Backup node is supported by the NameNode at a
time, and no Checkpoint nodes may be registered if a Backup node is in use.
22) How does HDFS ensure Data Integrity of
data blocks stored in HDFS?
Data integrity ensures the correctness of the data. However, it is
possible that data gets corrupted during I/O operations on the disk;
corruption can occur for various reasons, such as network faults or buggy software.
The Hadoop HDFS client software therefore implements checksum checking on the contents of
HDFS files.
In Hadoop, when a client creates an HDFS file, it computes a
checksum of each block of the file and stores these checksums in a separate
hidden file in the same HDFS namespace. When a client retrieves file contents,
it verifies that the data it received from each DataNode
matches the checksum stored in the associated checksum file. If
not, the client can opt to retrieve that block from another DataNode that
has a replica of that block.
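The verification above happens automatically on read. Purely as an illustration (not part of the original answer), the checksum HDFS keeps for a file can also be fetched through the Java API; the path is an assumption.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ChecksumExample {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FileChecksum checksum = fs.getFileChecksum(new Path("/user/dataflair/sample.txt"));
    System.out.println(checksum);   // prints the checksum algorithm and value
    fs.close();
  }
}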
23) What do you mean by the NameNode High
Availability in hadoop?
In Hadoop 1.x, the NameNode is a single point of failure (SPOF). If the NameNode fails, all clients are unable to read,
write, or list files. In such an event, the whole Hadoop system would be out of
service until a new NameNode is up.
Hadoop 2.0 overcomes this SPOF: Hadoop 2.x provides support for multiple NameNodes. The high
availability feature adds
an extra (standby) NameNode to the Hadoop architecture. This extra
NameNode is configured for automatic failover: if the active NameNode fails, the
standby NameNode takes over all its responsibilities, and the cluster keeps working.
The initial implementation of NameNode high availability provided for a single
active/standby NameNode pair. However, some deployments require a higher degree of
fault tolerance. Hadoop 3.x enables this by allowing the user
to run multiple standby NameNodes; for example, by configuring 3 NameNodes and 5
JournalNodes, the cluster can tolerate the failure of 2 nodes rather than 1.
24) What is Fault Tolerance in Hadoop HDFS?
Fault tolerance in HDFS is
the working strength of a system in unfavorable conditions, such as node crashes,
hardware failures, and so on. HDFS controls faults
through replica creation. When a client stores a file in HDFS, the Hadoop
framework divides the file into blocks, distributes the data blocks across different machines present in the HDFS
cluster, and then creates a replica of each block on other machines
present in the cluster.
HDFS, by default, creates 3 copies of each block on machines
in the cluster. If any machine in the cluster goes down or fails due to
unfavorable conditions, the user can still easily access that data from
other machines on which a replica of the block is present.
25) Describe HDFS Federation.
Limitations of HDFS before federation –
·
The namespace layer and storage layer are tightly coupled. This
makes alternate implementations of the NameNode difficult. It also restricts other
services from using the block storage directly.
·
The namespace is not scalable like the DataNodes. Scaling in an HDFS cluster happens horizontally by
adding DataNodes, but we cannot add more namespaces to an existing cluster.
·
There is no separation of namespaces, so there is no
isolation among the tenant organizations using the cluster.
In Hadoop 2.0, HDFS Federation overcomes these
limitations. It supports multiple NameNodes/namespaces in order to scale the namespace
horizontally. HDFS Federation also isolates different categories of applications
and users into different namespaces, which improves read/write
throughput by adding more NameNodes.
26) What is the default replication factor in
Hadoop and how will you change it?
The default replication factor is 3. One can change the replication factor in the following three ways:
·
By adding this property
to hdfs-site.xml:
<property>
<name>dfs.replication</name>
<value>5</value>
<description>Block Replication</description>
</property>
·
One can also change the
replication factor on a per-file basis using the command:
hadoop fs -setrep -w 3 /file_location
·
One can also change
replication factor for all the files in a directory by using:
hadoop fs -setrep -w 3 -R /directory_location
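Besides the configuration property and shell commands above, the replication factor of an existing file can also be changed from Java; the sketch below is for illustration only and the path is an assumption.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplicationExample {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // Ask HDFS to keep 5 replicas of this file from now on.
    boolean accepted = fs.setReplication(new Path("/user/dataflair/sample.txt"), (short) 5);
    System.out.println("Replication change accepted: " + accepted);
    fs.close();
  }
}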
27) Why Hadoop performs replication, although
it results in data redundancy?
In HDFS, replication provides fault tolerance and is
one of the unique features of HDFS. Data replication solves the issue of
data loss in unfavorable conditions such as hardware
failure, node crashes, and so on.
HDFS by default creates 3 replicas of each block across the
cluster, and we can change this as needed. So if any node goes
down, we can recover the data on that node from another node.
Replication does lead to the consumption of a lot of space, but the user can
always add more nodes to the cluster if required. It is very rare to have free-space
issues in a practical cluster, as the very first reason to deploy HDFS is
to store huge data sets. Also, one can change the replication factor to save
HDFS space, or use one of the compression codecs provided by Hadoop to
compress the data.
28) What is Rack Awareness in Apache Hadoop?
In Hadoop, Rack
Awareness improves network traffic while reading/writing files.
With Rack Awareness, the NameNode chooses DataNodes that are on the same
rack or a nearby rack. The NameNode obtains rack information by maintaining the rack
IDs of each DataNode, and this concept chooses DataNodes based on the rack
information.
The HDFS NameNode makes sure that all the replicas are not stored on
a single rack. It follows the Rack
Awareness algorithm to reduce latency as well as to improve fault tolerance.
The default replication factor is 3. Therefore, according to the Rack
Awareness algorithm:
·
The first replica of the block is stored on the local rack.
·
The next replica is stored on another DataNode within the same
rack.
·
And the third replica is stored on a different rack.
In Hadoop, we need Rack Awareness because it improves:
·
Data high availability and reliability.
·
The performance of the cluster.
·
Network bandwidth.
29) Explain the Single point of Failure in
Hadoop?
In Hadoop 1.0, the
NameNode is a single point of failure (SPOF). If the NameNode fails, all clients
are unable to read/write files. In such an event, the whole Hadoop system would be
out of service until a new NameNode is up.
Hadoop 2.0 overcomes this SPOF
by providing support for multiple NameNodes. The high availability feature provides an
extra NameNode in the Hadoop architecture and enables automatic
failover. If the active NameNode fails, the standby NameNode takes over all the
responsibilities of the active node, and the cluster continues to work.
The initial implementation of NameNode high availability provided for a single
active/standby NameNode pair. However, some deployments require a higher degree of
fault tolerance. The new version 3.0 enables this by allowing the user to
run multiple standby NameNodes; for example, by configuring 3 NameNodes and 5
JournalNodes, the cluster can tolerate the failure of 2 nodes rather than 1.
30) Explain Erasure Coding in Apache Hadoop?
For several purposes, HDFS by default replicates each block
three times. Replication provides a very simple form of redundancy to
protect against DataNode failure, but replication is very expensive: a 3x
replication scheme results in 200% overhead in storage space and other resources.
Hadoop 3.x introduced a new feature called "Erasure Coding" to use in place of replication. It provides the same level of fault tolerance with much less storage; the storage overhead is not more than 50%.
Erasure Coding works like RAID (Redundant Array of Inexpensive Disks). RAID implements Erasure Coding through striping, in which it
divides logically sequential data (such as a file) into smaller units (such as
bits, bytes, or blocks) and then stores the data on different disks.
Encoding – Here, RAID
calculates and stores parity cells for each stripe of data cells; errors are then
recovered through the parity. Erasure coding extends a message with redundant data
for fault tolerance. Its codec operates on uniformly sized data cells: in
Erasure Coding, the codec takes a number of data cells as input and produces parity
cells as the output.
There are two algorithms available for Erasure Coding:
·
XOR Algorithm
·
Reed-Solomon Algorithm
31) What is Balancer in Hadoop?
Data may not always be distributed uniformly across the DataNodes in
HDFS for the following reasons:
·
A lot of deletes and writes
·
Disk replacement
The block allocation strategy tries hard to spread new blocks
uniformly among all the DataNodes. In a large cluster, nodes have different
capacities, and quite often you need to remove some old nodes and add new
nodes for more capacity.
The addition of a new DataNode can become a bottleneck for the following
reason:
·
When the Hadoop framework allocates all the new blocks
on, and reads from, the new DataNode, it overloads that DataNode.
HDFS provides a tool called Balancer that analyzes block
placement and balances data across the DataNodes.
These are very common types of Hadoop interview questions
faced during the interview of an experienced professional.
32) What is Disk Balancer in Apache Hadoop?
Disk Balancer is
a command line tool, which distributes data evenly on all disks of a datanode.
This tool operates against a given datanode and moves blocks from one disk to
another.
Disk Balancer works by creating and executing a plan (a set of
statements) on the DataNode; the plan describes how much data should move
between two disks. A plan is composed of multiple move steps, each with a source disk, a
destination disk, and the number of bytes to move. The plan is executed
against an operational DataNode.
By default, Disk Balancer is not enabled; to enable
it, dfs.disk.balancer.enabled must
be set to true in hdfs-site.xml.
When we write a new block in HDFS, the DataNode uses a volume-choosing
policy to select the disk for the block. Each data directory is a
volume in HDFS terminology. Two such policies are round-robin and available
space:
·
Round-robin distributes
the new blocks evenly across the available disks.
·
Available space writes
data to the disk that has maximum free space (by percentage).
33) What is active and passive NameNode in
Hadoop?
In Hadoop 1.0, the NameNode
is a single point of failure (SPOF). If the NameNode fails, all clients are
unable to read, write, or list files. In such an event, the whole Hadoop system
would be out of service until a new NameNode is up.
Hadoop 2.0 overcomes this SPOF by providing
support for multiple NameNodes. The high availability feature provides an extra
NameNode in the Hadoop architecture for automatic failover.
·
Active NameNode – It
is the NameNode which works and runs in the cluster. It is also responsible for
all client operations in the cluster.
·
Passive NameNode – It
is a standby NameNode, which has the same data as the active NameNode. It simply
acts as a slave and maintains enough state to provide a fast failover, if
necessary.
If the active NameNode fails,
the passive NameNode takes over
all the responsibilities of the active node, and the cluster keeps working continuously.
34) How is indexing done in Hadoop HDFS?
Apache Hadoop has a unique way of indexing. Once the Hadoop
framework stores the data according to the block size, HDFS keeps storing, in the last
part of each block of data, an indication of where the next part of the data is. In
fact, this is the basis of HDFS.
35) What is a Block Scanner in HDFS?
The block scanner verifies whether the data blocks stored on each
DataNode are correct or not. When the block scanner detects a corrupted data block,
the following steps occur:
·
First of all, the DataNode reports the corrupted block to the
NameNode.
·
After that, the NameNode starts the process of creating a new
replica. It creates the new replica using a correct replica of the corrupted
block present on other DataNodes.
·
When the replication count of the correct replicas matches the
replication factor (3), the corrupted block is deleted.
36) How to perform the inter-cluster data
copying work in HDFS?
HDFS uses the distributed copy command to perform inter-cluster
data copying, as below:
hadoop distcp hdfs://<source NameNode> hdfs://<target NameNode>
DistCp (distributed copy)
is a tool used for large inter-/intra-cluster copying. It uses MapReduce to effect its distribution, error
handling, recovery, and reporting. This distributed copy tool expands a list
of files and directories into the input to map tasks.
37) What are the main properties of
hdfs-site.xml file?
hdfs-site.xml – It
specifies configuration settings for the HDFS daemons in Hadoop. It also sets the
default block replication and permission checking for HDFS.
The three main hdfs-site.xml properties are:
1. dfs.name.dir gives the location where the NameNode stores
the metadata (FsImage and edit logs), and specifies whether DFS should
locate it on the local disk or on a remote directory.
2. dfs.data.dir gives the location where DataNodes
store the data.
3. fs.checkpoint.dir is the directory on
the file system where the Secondary NameNode stores the
temporary images of the edit logs.
38) How can one check whether NameNode is
working or not?
One can check the status of the HDFS NameNode in several ways.
Most commonly, one uses the jps command
to check the status of all daemons running in HDFS.
39) How would you restart NameNode?
NameNode is also known as Master node. It stores meta-data i.e.
number of blocks, replicas, and other details. NameNode maintains and manages
the slave nodes, and assigns tasks to them.
By following two methods, you can restart NameNode:
·
First stop the NameNode individually using the ./sbin/hadoop-daemon.sh stop
namenode command. Then start the NameNode again using the ./sbin/hadoop-daemon.sh start namenode
command.
·
Use the ./sbin/stop-all.sh command, which stops all the daemons first,
and then use ./sbin/start-all.sh
to start all the daemons again.
The above Hadoop interview questions and answers were for
experienced candidates, but freshers can also refer to these Hadoop interview
questions and answers for in-depth knowledge. Now let's move forward with some advanced
Hadoop interview questions and answers.
Advanced
Questions for Hadoop Interview
40) How does the NameNode tackle DataNode failures in
Hadoop?
HDFS has a master-slave architecture
in which the master is the NameNode and the slaves are the DataNodes. An HDFS cluster has a single
NameNode that manages the file system namespace (metadata) and multiple DataNodes
that are responsible for storing the actual data in HDFS and performing read-write operations as requested by the
clients.
The NameNode receives a heartbeat and a block report from each DataNode.
Receipt of a heartbeat implies that the DataNode is alive and functioning properly,
and the block report contains a list of all the blocks on that DataNode. When the NameNode
observes that a DataNode has not sent a heartbeat message after a certain amount of
time, the DataNode is marked as dead. The NameNode then replicates the blocks of the
dead node to other DataNodes using the replicas created earlier. Hence, the
NameNode can easily handle DataNode failure.
41) Is the NameNode machine the same as a DataNode
machine in terms of hardware in Hadoop?
The NameNode is a highly
available server, unlike a DataNode. The NameNode manages the file system
namespace and maintains the metadata information:
the number of blocks, their locations, replicas, and
other details. It also executes file system operations such as naming, closing,
and opening files/directories.
Because of the above reasons, the NameNode requires a large amount of RAM for
storing the metadata of millions of files. A DataNode, on the other hand, is responsible
for storing the actual data in HDFS and performs read and write operations as
requested by the clients. Therefore, a DataNode needs a large disk capacity
for storing huge data sets.
42) If DataNode increases, then do we need to
upgrade NameNode in Hadoop?
The NameNode stores metadata, i.e. the number of blocks, their locations,
and replicas. In Hadoop, this metadata is kept in memory on the master for faster
retrieval of data. The NameNode manages and maintains the slave nodes, and assigns
tasks to them. It regulates clients' access to files.
It also executes file system operations such as naming, closing, and
opening files/directories. During Hadoop installation, the framework determines the NameNode based
on the size of the cluster. Mostly we don't need to upgrade the NameNode,
because it does not store the actual data, only the metadata, so such a
requirement rarely arises.
43) Explain what happens if, during the PUT
operation, HDFS block is assigned a replication factor 1 instead of the default
value 3?
The replication factor can be set for the entire cluster to adjust
the number of replicated blocks; it ensures high data availability.
The cluster will have n-1 duplicate
blocks for every block present in HDFS. So, if the replication factor
during the PUT operation is set to 1 in place of the default value 3, there will
be only a single copy of the data. With a replication factor of 1, if the DataNode
holding it crashes under any circumstances, that single copy of the data will be
lost.
44) What are file permissions in HDFS? How
does HDFS check permissions for files/directories?
The Hadoop Distributed File System (HDFS) implements a permissions
model for files/directories.
For each file/directory, one can manage permissions for a set of 3 distinct user classes: owner, group, and others.
There are also 3 different permissions for each user
class: read (r), write
(w), and execute (x).
·
For files, the w permission
is to write to the file and the r permission
is to read the
file.
·
For directories, the w permission
is to create or delete files in the directory, and the r permission is to list
the contents of the directory.
·
The x permission
is to access a child of the
directory.
HDFS checks permissions for files or directories as follows:
·
If the user name matches the owner of the file/directory, Hadoop
tests the owner permissions.
·
If the group matches the file/directory's group, then Hadoop tests
the user's group permissions.
·
Hadoop tests the "other"
permissions when the owner and group names don't match.
·
If none of the permission checks succeeds, the client's request
is denied.
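As an illustration only (not part of the original answer), the owner, group, and permission bits of a file can be inspected and changed through the Java API; the path is an assumption.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;

public class PermissionExample {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path p = new Path("/user/dataflair/sample.txt");
    FileStatus status = fs.getFileStatus(p);
    // Owner, group and rwx bits, i.e. the values checked by the rules above.
    System.out.println(status.getOwner() + " " + status.getGroup() + " " + status.getPermission());
    fs.setPermission(p, new FsPermission((short) 0644));   // rw-r--r--
    fs.close();
  }
}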
45) How one can format Hadoop HDFS?
One can format HDFS by using the bin/hadoop namenode -format
command.
The bin/hadoop namenode -format command formats
HDFS via the NameNode.
Formatting implies initializing the directory specified by
the dfs.name.dir variable.
When you run this command on an existing file system, you will lose all
the data stored on your NameNode.
The Hadoop NameNode directory contains the FsImage and edit files,
which hold the basic information about the Hadoop file system, such as
which user created which files.
Hence, when we format the NameNode, this information is deleted from the directory, which is specified in hdfs-site.xml as dfs.namenode.name.dir. Formatting a NameNode does not format the DataNodes.
NOTE: Never format an up-and-running Hadoop file system; you will lose the data stored in HDFS.
46) What is the process to change the files at
arbitrary locations in HDFS?
HDFS doesn't support modifications at arbitrary offsets in a
file, or multiple writers. A single writer writes files in append-only
fashion: writes to a file in HDFS are always made at the end
of the file.
47) Differentiate HDFS & HBase.
Data write process
·
HDFS- Append
method
·
HBase- Bulk
incremental, random write
Data read process
·
HDFS- Table
scan
·
HBase- Table
scan/random read/small range scan
Hive SQL querying
·
HDFS- Excellent
·
HBase- Average
These are some advanced Hadoop interview questions and answers
for HDFS that will help you answer many more interview questions in the best
manner.
48) What is meant by streaming access?
HDFS works on the principle of "write once, read many". Its
focus is on fast and accurate data retrieval. Streaming access means reading the
complete data instead of retrieving a single record from the database.
49) How to transfer data from Hive to HDFS?
One can transfer data from Hive by writing the query:
hive> insert overwrite directory '/' select * from emp;
Hence, the output you receive will be stored in part files in
the specified HDFS path.
50) How to add/delete a Node to the existing
cluster?
To add a node to the existing cluster:
Add the hostname/IP address to the dfs.hosts/slaves file. Then
refresh the cluster with
hadoop dfsadmin -refreshNodes
To remove a node from the existing cluster:
Add the hostname/IP address to dfs.hosts.exclude and remove the
entry from the slaves file. Then refresh the cluster with
hadoop dfsadmin -refreshNodes
51) How to format the HDFS? How frequently it
will be done?
This type of Hadoop interview question should also be answered
very briefly and to the point; giving a very lengthy answer here is
unnecessary and may count against you.
hadoop namenode -format
Note: Format HDFS
only once, during the initial cluster setup.
52) What is the importance of
dfs.namenode.name.dir in HDFS?
dfs.namenode.name.dir contains the FsImage
file for the NameNode.
We should configure it to write to at least two filesystems on
different physical hosts (NameNode and Secondary NameNode), because if we lose the
FsImage file we will lose the entire HDFS file system, and
there is no other recovery mechanism if no FsImage file is available.
Questions 40-52 were the advanced Hadoop interview questions and
answers, meant to give you in-depth knowledge for handling difficult Hadoop interview
questions and answers.
This was all about the Hadoop Interview Questions and Answers for HDFS.
These questions are frequently asked Hadoop interview questions
and answers. You can read here some more Hadoop HDFS interview questions and answers.
After going through these top Hadoop interview questions and
answers, you will be able to confidently face an interview and
answer the Hadoop interview questions asked in your interview in the
best manner. These Hadoop interview questions are suggested by the experts at
DataFlair.
Key –
Q.1 – Q.5: Basic Hadoop interview questions
Q.6 – Q.10: HDFS Hadoop interview questions and answers for
freshers
Q.11 – Q.20: Frequently asked questions in Hadoop interviews
Q.21 – Q.39: HDFS Hadoop interview questions and answers
for experienced candidates
Q.40 – Q.52: Advanced HDFS Hadoop interview questions
and answers
These Hadoop interview questions and answers are categorized so
that you can pay more attention to the questions most relevant for you; however, it is
recommended that you go through all the Hadoop interview questions and answers
for complete understanding.
If you have any more doubts or queries on the Hadoop interview
questions and answers, drop a comment and our support team will be happy to
help you. Now let's jump to the second part of the Hadoop interview questions, i.e.
MapReduce interview questions and answers.
Hadoop
Interview Questions and Answers for MapReduce
It is difficult to pass a Hadoop
interview, as it is a fast-growing technology. To get you through
this tough path, these MapReduce Hadoop interview questions and answers will serve
as the backbone. This section contains the commonly asked MapReduce Hadoop
interview questions and answers.
In this section on MapReduce Hadoop interview questions and
answers, we have covered 50+ Hadoop interview questions and answers for
MapReduce in detail. We have covered MapReduce Hadoop interview questions
and answers for freshers, MapReduce Hadoop interview questions and answers for
experienced as well as some advanced Mapreduce Hadoop interview questions and
answers.
These 50 MapReduce Hadoop Interview Questions
are framed keeping in mind the needs of the era and the trending interview
patterns followed by companies. The Hadoop MapReduce interview questions
are carefully framed by our experts to help you
reach your goal.
All the best!!!!
Top 50 MapReduce Hadoop Interview Questions and Answers for
Hadoop Jobs.
Basic
MapReduce Hadoop Interview Questions and Answers
53) What is MapReduce in Hadoop?
Hadoop MapReduce is
the data processing layer. It is the framework for writing applications that
process the vast amounts of data stored in HDFS.
It processes a huge amount of data in parallel by dividing the
job into a set of independent tasks (sub-jobs). In Hadoop, MapReduce works by
breaking the processing into two phases: Map and Reduce.
·
Map – It
is the first phase of processing, in which we specify all the complex
logic/business rules/costly code. The map takes a set of data and converts it
into another set of data, breaking individual elements into tuples (key-value pairs).
·
Reduce – It
is the second phase of processing, in which we specify lightweight processing
like aggregation/summation. The output from the map is the input to the Reducer.
The Reducer then combines tuples (key-value pairs) based on the key and modifies
the value of the key accordingly.
54) What is the need of MapReduce in Hadoop?
In Hadoop, once we have stored the data in HDFS, the first question that arises is how to process it.
Transferring all this data to a central node for processing is not going to
work; we would have to wait forever for the data to transfer over the
network. Google faced this same problem with its distributed Google File System (GFS), and it solved the problem
using the MapReduce data processing model.
Challenges before MapReduce
·
Time-consuming – Using
a single machine, we cannot analyze terabytes of data, as it would take a
lot of time.
·
Costly – Keeping all
the data (terabytes) on one server or as a database cluster is very
expensive, and also hard to manage.
MapReduce overcomes these challenges
·
Time-efficient – If
we want to analyze the data, we can write the analysis code in the Map function
and the aggregation code in the Reduce function and execute it. This
MapReduce code goes to every machine which has a part of our data and
executes on that specific part. Hence, instead of moving terabytes of data, we
just move kilobytes of code, so this type of movement is time-efficient.
·
Cost-efficient – It
distributes the data over multiple low-configuration machines.
Hadoop MapReduce Job Execution Flow Diagram
55) What is Mapper in Hadoop?
The Mapper task processes each
input record (from the RecordReader) and generates a key-value pair. The
key-value pairs generated by the mapper are completely different from the input
pair. The Mapper stores its intermediate output on the local disk; it does not
store its output on HDFS, because it is temporary data and writing it to HDFS would create
unnecessary multiple copies. The Mapper only understands key-value pairs of data, so before
passing data to the mapper, the framework first converts the data into key-value pairs:
the InputSplit and RecordReader convert the data into key-value pairs. The input
split is the logical representation of data, and the RecordReader communicates with the
InputSplit and converts the data into key-value pairs. Hence:
·
Key is
a reference to the input value.
·
Value is
the data set on which to operate.
The number of maps depends on the total size of the input, i.e. the
total number of blocks of the input files:
Number of mappers = (total data size) / (input split size)
For example, if the data size is 1 TB and the input split size is 100 MB, then
the number of mappers = (1000 * 1000) / 100 = 10,000.
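As an illustration of a Mapper (not part of the original answer), a word-count style mapper is sketched below; it turns each (byte offset, line) record from the RecordReader into (word, 1) pairs.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, ONE);   // intermediate key-value pair, stored on local disk
    }
  }
}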
56) What is Reducer in Hadoop?
The Reducer takes the output of
the Mapper (intermediate key-value pairs) as its input. It then runs a
reduce function on each of them to generate the output. The output of the
reducer is the final output, which is stored in HDFS. Usually, in the Reducer, we do
aggregation or summation sorts of computation. The Reducer has three primary phases:
·
Shuffle – The
framework fetches the relevant partition of the output of all the Mappers for
each reducer via HTTP.
·
Sort – In
this phase, the framework groups the Reducer inputs by key. The shuffle and sort phases occur simultaneously.
·
Reduce – After
shuffling and sorting, the reduce task aggregates the key-value pairs. In this
phase, the reduce(Object, Iterator, OutputCollector, Reporter) method (in the old API) is called for
each <key, (list of values)> pair in the grouped inputs.
With the help of Job.setNumReduceTasks(int), the
user sets the number of reducers for the job.
A good rule of thumb is that the right number of reducers is 0.95 or 1.75 multiplied by (<no. of nodes> * <no. of maximum containers per node>).
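As an illustration of a Reducer (not part of the original answer), the matching word-count reducer is sketched below; it sums the values grouped under each key after the shuffle and sort phases.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  private final IntWritable result = new IntWritable();

  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) {
      sum += v.get();             // aggregate all values grouped under this key
    }
    result.set(sum);
    context.write(key, result);   // final output, stored in HDFS
  }
}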
57) How to set mappers and reducers for
MapReduce jobs?
One can configure the JobConf to set the number of mappers and reducers:
·
For the Mapper – setNumMapTasks(int) (this is only a hint; the actual number of map tasks is driven by the number of input splits)
·
For the Reducer – setNumReduceTasks(int)
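A minimal driver sketch using the newer Job API is shown below, for illustration only; it assumes the TokenizerMapper and IntSumReducer classes sketched in answers 55 and 56, and the input/output paths come from the command line.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(TokenizerMapper.class);      // mapper sketched in answer 55
    job.setReducerClass(IntSumReducer.class);       // reducer sketched in answer 56
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setNumReduceTasks(2);                       // explicit reducer count
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}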
These were some general MapReduce Hadoop interview questions and
answers. Now let us take up some MapReduce Hadoop interview questions and answers
especially for freshers.
MapReduce
Hadoop Interview Question and Answer for Freshers
58) What is the key-value pair in Hadoop
MapReduce?
Hadoop MapReduce implements
a data model which represents data as key-value pairs. Both input and output
to the MapReduce framework must be in key-value pairs. In Hadoop, if a
schema is static we can work directly on the columns instead of key-value pairs, but
if the schema is not static we work on keys and values. Keys and values are
not intrinsic properties of the data; rather, the user analyzing the data
chooses the key-value pair.
A key-value pair in Hadoop MapReduce is generated in the following way:
·
InputSplit – It is the logical representation of
data. InputSplit represents the data which individual Mapper will process.
·
RecordReader – It
converts the split into records in the form of key-value pairs,
suitable for reading by the mapper.
By default, the RecordReader uses TextInputFormat for converting data into key-value pairs:
·
Key – It
is the byte offset of the beginning of the line within the file.
·
Value – It
is the contents of the line, excluding line terminators.
For example, if the file content is: on the top of the crumpetty Tree
Key – 0
Value – on the top of the crumpetty Tree
59) What is the need of key-value pair to
process the data in MapReduce?
Hadoop MapReduce works
on unstructured and semi-structured data as well as structured data. One can
read structured data, like that stored in an RDBMS, by columns,
but handling unstructured data is feasible using key-value pairs. The very core idea
of MapReduce works on the basis of these pairs: the framework maps data into a
collection of key-value pairs in the mapper, and the reducer then works on all
the pairs with the same key.
In most computations, the map operation is applied to each
logical "record" in the input to compute a set of intermediate key-value
pairs, and the reduce operation is then applied to all the values that share the same key,
to combine the derived data properly.
Thus, we can say that key-value pairs are the best solution to
work on data problems on MapReduce.
60) What are the most common InputFormats in
Hadoop?
In Hadoop, input files store the data for a MapReduce job; they typically reside
in HDFS. In MapReduce, InputFormat defines
how these input files are split and read, and it creates the InputSplits.
The most common InputFormats are:
·
FileInputFormat – It
is the base class for all file-based InputFormats. It specifies the input
directory where the data files are present, reads all the files,
and then divides these files into one or more InputSplits.
·
TextInputFormat – It
is the default InputFormat of MapReduce. It treats each line of each input file
as a separate record and performs no parsing.
Key – the byte offset of the line.
Value – the contents of the line, excluding line terminators.
·
KeyValueTextInputFormat – It
also treats each line of input as a separate record. The main difference is
that TextInputFormat treats the entire line as the value, while
KeyValueTextInputFormat breaks the line itself into key and value at the tab
character ('\t').
Key – everything up to the tab character.
Value – the remaining part of the line after the tab character.
·
SequenceFileInputFormat – It
reads sequence files.
Key & Value- Both are user-defined.
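A minimal driver sketch, assuming an input path of your own, showing how an InputFormat such as KeyValueTextInputFormat is plugged into a job (the path and the separator property value below are only illustrative):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

public class InputFormatExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hadoop 2.x name of the key/value separator property; older releases use a different key.
    conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", "\t");
    Job job = Job.getInstance(conf, "input format example");
    // Switch the InputFormat; TextInputFormat is the default if this call is omitted.
    job.setInputFormatClass(KeyValueTextInputFormat.class);
    // Illustrative input path only.
    FileInputFormat.addInputPath(job, new Path("/user/hadoop/input"));
  }
}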
61) Explain InputSplit in Hadoop?
InputFormat creates the InputSplit, which is the logical representation of data.
The Hadoop framework further divides the InputSplit into records, and the mapper
processes each record. The size of a split is approximately equal to the HDFS
block size (128 MB) by default. In a MapReduce program, the InputSplit size is
user-definable, so the user can control the split size based on the size of the data.
An InputSplit in MapReduce has a length in bytes and a set of storage locations
(hostname strings). The framework uses the storage locations to place map tasks
as close to the split's data as possible, and map tasks are processed in order
of split size so that the largest one is processed first, which minimizes the
job runtime. An important point is that the InputSplit is just a reference to
the data; it does not contain the input data itself.
The client running the job calculates the splits for the job by calling
getSplits(), then sends them to the application master, which uses their storage
locations to schedule map tasks that process them on the cluster. The map task
passes the split to the createRecordReader() method of the InputFormat to obtain
a RecordReader for that split. The RecordReader then generates records
(key-value pairs) and passes them to the map function.
62) Explain the difference between a MapReduce
InputSplit and HDFS block.
Tip for this type of MapReduce Hadoop interview question: start with the
definitions of Block and InputSplit, answer in comparative language, and then
cover their data representation, size and an example in the same comparative style.
By definition-
·
Block – It is the smallest unit of data that the file system stores. In general,
a file system stores data as a collection of blocks. In a similar way, HDFS
stores each file as blocks and distributes them across the Hadoop cluster.
·
InputSplit – It represents the data which an individual Mapper will process. The
split is further divided into records, and each record (a key-value pair) is
processed by the map function.
Size-
·
Block – The default size of an HDFS block is 128 MB, which can be configured as
per our requirement. All blocks of a file are of the same size except the last
block, which can be the same size or smaller. In Hadoop, files are split into
128 MB blocks and then stored in the Hadoop file system.
·
InputSplit – Split
size is approximately equal to block size, by default.
Data representation-
·
Block – It
is the physical representation of data.
·
InputSplit – It is the logical representation of data. MapReduce programs and
other processing techniques use the InputSplit during data processing. An
important point is that the InputSplit does not contain the input data; it is
just a reference to the data.
63) What is the purpose of RecordReader in
hadoop?
RecordReader in Hadoop uses the data within the boundaries defined by the
InputSplit and creates key-value pairs for the mapper. The “start” is the byte
position in the file at which the RecordReader should start generating key-value
pairs, and the “end” is where it should stop reading records.
In a MapReduce job, the RecordReader loads data from its source and converts it
into key-value pairs suitable for reading by the mapper. The RecordReader keeps
working on the InputSplit until the complete split has been read. The
InputFormat defines the RecordReader instance used by the MapReduce framework;
by default the RecordReader of TextInputFormat is used for converting data into
key-value pairs.
Two commonly used RecordReaders are LineRecordReader and SequenceFileRecordReader.
LineRecordReader in Hadoop is the default RecordReader, provided by
TextInputFormat; each line of the input file becomes the value and the byte
offset becomes the key.
SequenceFileRecordReader in Hadoop reads data as specified by the header of a
sequence file and is used by SequenceFileInputFormat.
64) What is Combiner in Hadoop?
In a MapReduce job, the Mapper generates large chunks of intermediate data that
are passed to the Reducer for further processing, which can lead to enormous
network congestion. The Hadoop MapReduce framework provides a function known as
the Combiner, which plays a key role in reducing this congestion. The Combiner,
also known as a mini-reducer, performs local aggregation on the mapper's output,
which reduces the data transfer between mapper and reducer and increases efficiency.
There is no guarantee that the Combiner will execute; Hadoop may or may not run
it, and if required it may run it more than once. Hence, your MapReduce jobs
should not depend on the Combiner's execution.
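A minimal driver sketch of registering a combiner, assuming mapper and reducer classes named TokenizerMapper and IntSumReducer exist elsewhere (both class names are illustrative):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class CombinerExample {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "combiner example");
    job.setJarByClass(CombinerExample.class);
    // Illustrative classes: any reducer whose function is commutative and
    // associative (e.g. a sum) can double as the combiner.
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // local aggregation on map output
    job.setReducerClass(IntSumReducer.class);    // final aggregation
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
  }
}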
65) Explain about the partitioning, shuffle
and sort phase in MapReduce?
Partitioning Phase – Partitioning ensures that all the values for each key are
grouped together and that all the values of a single key go to the same Reducer,
thus allowing even distribution of the map output over the Reducers.
Shuffle Phase – It is the process by which the system transfers the sorted
key-value output of the map tasks to the reducers.
Sort Phase – The mapper generates intermediate key-value pairs. Before the
Reducer starts, the MapReduce framework sorts these key-value pairs by key. This
helps the reducer easily distinguish when a new reduce call should start, thus
saving time for the reducer.
66) What does a “MapReduce Partitioner” do?
The Partitioner comes into the picture when we are working with more than one
reducer. It controls the partitioning of the keys of the intermediate map
outputs: a hash function on the key (or a subset of the key) is used to derive
the partition. Partitioning ensures that all the values for each key are grouped
together and that all the values of a single key go to the same reducer, thus
allowing even distribution of the map output over the reducers. In effect, the
Partitioner redirects the mapper output to the reducers by determining which
reducer is responsible for a particular key.
The total number of partitions is equal to the number of reducers. The
Partitioner divides the data according to the number of reducers, so a single
reducer processes the data from a single partition.
67) If no custom partitioner is defined in
Hadoop then how is data partitioned before it is sent to the reducer?
By default, Hadoop MapReduce uses HashPartitioner. It uses the key's hashCode()
method to determine to which partition a given (key, value) pair will be sent.
HashPartitioner implements a method called getPartition, which computes
key.hashCode() & Integer.MAX_VALUE and takes the modulus of that value with the
number of reduce tasks. Suppose there are 10 reduce tasks; then getPartition
will return values 0 through 9 for all keys.
public class HashPartitioner<K, V> extends Partitioner<K, V> {
  public int getPartition(K key, V value, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}
These are very common types of MapReduce Hadoop interview questions faced during
the interview of a fresher.
68) How to write a custom partitioner for a
Hadoop MapReduce job?
This is one of the most common MapReduce Hadoop interview questions.
A custom partitioner distributes the results across the reducers based on a
user-defined condition. By setting a Partitioner to partition by the key, we can
guarantee that records for the same key will go to the same reducer, and that
only one reducer receives all the records for that particular key.
We can write a custom partitioner for a Hadoop MapReduce job with the following
steps (see the sketch after this list):
·
Create a new class that extends the Partitioner class.
·
Override the getPartition method, in the wrapper that runs in MapReduce.
·
Add the custom partitioner to the job by using the setPartitionerClass method,
or add it to the job as a config file.
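A minimal sketch of a custom partitioner, assuming Text keys and IntWritable values and a hypothetical rule that routes keys starting with 'A' to partition 0 (the class name and rule are illustrative):
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numReduceTasks) {
    // Guard for map-only jobs where numReduceTasks can be 0.
    if (numReduceTasks == 0) {
      return 0;
    }
    // Hypothetical rule: keys starting with 'A' go to reducer 0,
    // everything else is spread over the reducers by hash.
    if (key.toString().startsWith("A")) {
      return 0;
    }
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}
It would then be registered in the driver with job.setPartitionerClass(FirstLetterPartitioner.class).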
69) What is shuffling and sorting in Hadoop
MapReduce?
Shuffling and sorting take place after the map tasks produce output; the shuffle
and sort phases in Hadoop occur simultaneously.
·
Shuffling – Shuffling is the process by which the system transfers the key-value
output of the map tasks to the reducers. The shuffle phase is essential for the
reducers, as otherwise they would not have any input. Since shuffling can start
even before the whole map phase has finished, it saves some time and completes
the job in less time.
·
Sorting – The mapper generates intermediate key-value pairs. Before the reducer
starts, the MapReduce framework sorts these key-value pairs by key. This helps
the reducer easily distinguish when a new reduce call should start, thus saving
time for the reducer.
Shuffling and sorting are not performed at all if you specify zero reducers
(setNumReduceTasks(0)).
70) Why aggregation cannot be done in Mapper?
The mapper task processes each input record (from the RecordReader) and
generates key-value pairs. The mapper stores this intermediate output on the
local disk.
We cannot perform aggregation in mapper because:
·
Sorting takes place only on the reducer side; there is no provision for sorting
in the mapper function, and without sorting, aggregation is not possible.
·
To perform aggregation, we need the output of all the mapper functions, which
may not be possible to collect in the map phase because mappers may be running
on different machines where the data blocks are present.
·
If we try to perform aggregation of data at the mapper, it requires
communication between all the mapper functions, which may be running on
different machines. This would consume high network bandwidth and could cause a
network bottleneck.
71) Explain map-only job?
MapReduce is the data processing layer of Hadoop. It is the framework for
writing applications that process the vast amounts of data stored in HDFS. It
processes huge amounts of data in parallel by dividing the job into a set of
independent tasks (sub-jobs). In Hadoop, MapReduce has 2 phases of processing:
Map and Reduce.
In the Map phase we specify all the complex logic/business rules/costly code.
Map takes a set of data and converts it into another set of data, breaking
individual elements into tuples (key-value pairs). In the Reduce phase we
specify light-weight processing like aggregation/summation. Reduce takes the
output from the map as input, combines those tuples (key-value pairs) based on
the key, and then modifies the value of the key accordingly.
Consider a case where we just need to perform the map operation and no
aggregation is required. In such a case we prefer a “Map-Only job” in Hadoop:
the map does the entire task with its InputSplit and the reducer does no job, so
the map output is the final output.
We can achieve this by setting job.setNumReduceTasks(0) in the driver
configuration. This makes the number of reducers 0, and thus only the mapper
does the complete task.
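A minimal driver sketch for a map-only job; the mapper class name and the command-line paths are illustrative assumptions:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapOnlyJobDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "map-only job");
    job.setJarByClass(MapOnlyJobDriver.class);
    job.setMapperClass(MyMapper.class);   // illustrative mapper class, defined elsewhere
    job.setNumReduceTasks(0);             // zero reducers: map output becomes the final output
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}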
72) What is SequenceFileInputFormat in Hadoop
MapReduce?
SequenceFileInputFormat is an InputFormat which reads sequence files. Sequence
files are binary files that store sequences of binary key-value pairs and can be
block-compressed. Thus, sequence files provide direct serialization and
deserialization of several arbitrary data types.
Here, key and value are both user-defined.
SequenceFileAsTextInputFormat is a variant of SequenceFileInputFormat. It
converts the sequence file's keys and values to Text objects by calling
toString() on them. Thus, this InputFormat makes sequence files suitable input
for Streaming.
SequenceFileAsBinaryInputFormat is another variant of SequenceFileInputFormat.
By using it, we can extract the sequence file's keys and values as opaque binary
objects.
The above MapReduce Hadoop interview questions and answers (58 – 72) were for
freshers; however, experienced candidates can also go through them to revise the
basics.
MapReduce
Hadoop Interview Questions and Answers for Experienced
73) What is KeyValueTextInputFormat in Hadoop?
KeyValueTextInputFormat – It treats each line of input as a separate record and
breaks the line itself into a key and a value, using the tab character ('\t') as
the separator.
Key- Everything up to the tab character.
Value- Remaining part of the line after the tab character.
Consider the following input file, where → represents a (horizontal) tab
character:
But→his face you could not see
Account→of his beaver hat
Hence, the output is:
Key- But
Value- his face you could not see
Key- Account
Value- of his beaver hat
74) Differentiate Reducer and Combiner in
Hadoop MapReduce?
Combiner – The combiner is a mini-reducer that performs a local reduce task. It
runs on the map output and produces output that becomes the reducer's input. The
combiner is usually used for network optimization.
Reducer – The reducer takes the set of intermediate key-value pairs produced by
the mapper as input, then runs a reduce function on each of them to generate the
output. The output of the reducer is the final output.
·
Unlike a reducer, the combiner has a limitation: its input and output key and
value types must match the output types of the mapper.
·
Combiners can operate only on a subset of keys and values, i.e. combiners can be
used only for functions that are commutative and associative.
·
Combiner functions take input from a single mapper, while reducers can take data
from multiple mappers as a result of partitioning.
75) Explain the process of spilling in
MapReduce?
The map task processes each input record (from the RecordReader) and generates
key-value pairs. The mapper does not store its output on HDFS, because this is
temporary data and writing it to HDFS would create unnecessary multiple copies.
Instead, the mapper writes its output into a circular memory buffer (RAM). The
size of the buffer is 100 MB by default; we can change it by using the
mapreduce.task.io.sort.mb property.
Spilling is the process of copying the data from the memory buffer to disk. It
takes place when the content of the buffer reaches a certain threshold: by
default, a background thread starts spilling the contents after 80% of the
buffer has filled. Therefore, for a 100 MB buffer, spilling starts once the
content of the buffer reaches 80 MB.
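A minimal sketch of tuning the sort buffer and spill threshold from the driver; the values used are illustrative, not recommendations:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SpillTuningExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.setInt("mapreduce.task.io.sort.mb", 256);            // buffer size: 256 MB instead of the 100 MB default
    conf.setFloat("mapreduce.map.sort.spill.percent", 0.90f); // start spilling at 90% instead of 80%
    Job job = Job.getInstance(conf, "spill tuning example");
  }
}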
76) What happens if the number of reducers is set to
0 in Hadoop?
If we set the number of reducers to 0:
·
Then no reducer will execute and no aggregation will take place.
·
In such a case we prefer a “Map-only job” in Hadoop. In a map-only job, the map
does the entire task with its InputSplit and the reducer does no job; the map
output is the final output.
Between the map and reduce phases there is a sort and shuffle phase, which is
responsible for sorting the keys in ascending order and then grouping values
based on the same keys. This phase is very expensive, so if the reduce phase is
not required we should avoid it. Avoiding the reduce phase eliminates the sort
and shuffle phase as well, which also reduces network congestion: in shuffling,
the output of the mapper travels to the reducer, and when the data size is huge,
a large amount of data has to travel to the reducers.
77) What is Speculative Execution in Hadoop?
MapReduce breaks jobs into tasks and runs these tasks in parallel rather than
sequentially, thus reducing execution time. This model of execution is sensitive
to slow tasks, as they slow down the overall execution of a job. There are
various reasons for the slowdown of tasks, such as hardware degradation, and it
may be difficult to detect the causes since the tasks still complete
successfully, although they take more time than expected.
The Hadoop framework doesn't try to diagnose and fix slow-running tasks;
instead, it tries to detect them and runs backup tasks for them. This process is
called speculative execution in Hadoop, and these backup tasks are called
speculative tasks.
First, the Hadoop framework launches all the tasks for the job. It then launches
speculative tasks for those tasks that have been running for some time (about a
minute) and have not made much progress, on average, compared with the other
tasks of the job.
If the original task completes before the speculative task, the speculative task
is killed. On the other hand, the original task is killed if the speculative
task finishes before it.
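Speculative execution is on by default. A minimal sketch of switching it off for a job, using the Hadoop 2.x property names:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SpeculativeExecutionExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.setBoolean("mapreduce.map.speculative", false);    // disable speculative map tasks
    conf.setBoolean("mapreduce.reduce.speculative", false); // disable speculative reduce tasks
    Job job = Job.getInstance(conf, "no speculative execution");
  }
}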
78) What is a counter in Hadoop MapReduce?
Counters in MapReduce are a useful channel for gathering statistics about the
MapReduce job, for quality control or for application-level statistics. They are
also useful for problem diagnosis.
Counters validate that:
·
The number of bytes read and written within the map/reduce job is correct.
·
The number of tasks launched and successfully run in the map/reduce job is
correct.
·
The amount of CPU and memory consumed is appropriate for our job
and cluster nodes.
There are two types of counters:
·
Built-In Counters – In Hadoop there are some built-in counters for every job.
These report various metrics; for example, there are counters for the number of
bytes and records, which allow us to confirm that the job consumed the expected
amount of input and produced the expected amount of output.
·
User-Defined Counters – Hadoop MapReduce permits user code to define a set of
counters, which are then incremented as desired in the mapper or reducer. For
example, in Java an 'enum' is used to define counters (see the sketch after this list).
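A minimal sketch of a user-defined counter incremented inside a mapper; the enum, counter and class names are illustrative:
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CounterMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

  // User-defined counters are grouped under this enum's class name.
  public enum RecordQuality { GOOD_RECORDS, BAD_RECORDS }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    if (value.toString().trim().isEmpty()) {
      context.getCounter(RecordQuality.BAD_RECORDS).increment(1);
      return;   // skip empty lines
    }
    context.getCounter(RecordQuality.GOOD_RECORDS).increment(1);
    context.write(value, new IntWritable(1));
  }
}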
79) How to submit extra files(jars,static
files) for MapReduce job during runtime in Hadoop?
The MapReduce framework provides a Distributed Cache to cache files needed by
applications. It can cache read-only text files, archives, jar files, etc.
An application that needs to use the distributed cache to distribute a file
should make sure that the file is available at a URL, which can be either
hdfs:// or http://.
Once the file is present at an hdfs:// or http:// URL, the user specifies it as
a cache file to distribute. The framework copies the cache file to all the nodes
before starting tasks on those nodes. The files are only copied once per job,
and applications should not modify them.
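A minimal sketch of distributing a file with the Hadoop 2.x Job API; the HDFS path is purely illustrative:
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class DistributedCacheExample {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "distributed cache example");
    // Illustrative path: the lookup file must already exist in HDFS.
    job.addCacheFile(new URI("hdfs:///user/hadoop/lookup/countries.txt"));
    // Inside a mapper or reducer, context.getCacheFiles() returns these URIs,
    // and the localized copies can be read from the task's working directory.
  }
}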
80) What is TextInputFormat in Hadoop?
TextInputFormat is the default InputFormat. It treats each line of the input file as a
separate record. For unformatted data or line-based records like log files,
TextInputFormat is useful. By default, RecordReader also uses TextInputFormat
for converting data into key-value pairs. So,
·
Key- It
is the byte offset of the beginning of the line.
·
Value- It
is the contents of the line, excluding line terminators.
File content is- on the top of the building
so,
Key- 0
Value- on the top of the building
Two commonly used RecordReader implementations are:
·
LineRecordReader (the default, provided by TextInputFormat)
·
SequenceFileRecordReader (used by SequenceFileInputFormat)
Top
Interview Questions for Hadoop MapReduce
81) How many Mappers run for a MapReduce job?
Number of mappers depends on 2 factors:
·
The amount of data we want to process along with the block size. It is driven by
the number of InputSplits. If we have a block size of 128 MB and we expect 10 TB
of input data, we will have about 82,000 maps. Ultimately the InputFormat
determines the number of maps.
·
The configuration of the slave, i.e. the number of cores and the RAM available
on the slave. The right number of mappers per node is between 10 and 100. The
Hadoop framework should give 1 to 1.5 processor cores to each mapper, so on a
15-core processor about 10 mappers can run.
In a MapReduce job, we can control the number of mappers by changing the block
size; changing the block size increases or decreases the number of InputSplits.
By using JobConf's conf.setNumMapTasks(int num) we can suggest a higher number of map tasks, although this is only a hint to the framework.
Number of mappers = (total data size) / (input split size)
If data size = 1 TB and input split size = 100 MB:
Number of mappers = (1000 * 1000) / 100 = 10,000
82) How many Reducers run for a MapReduce job?
Answer this type of MapReduce Hadoop interview question very shortly and to the
point.
With the help of Job.setNumReduceTasks(int), the user sets the number of
reducers for the job. To set the right number of reducers, use the formula
below:
0.95 or 1.75 multiplied by (<no. of nodes> * <no. of maximum containers per node>).
With 0.95, all the reducers can launch immediately and start transferring map
output as the maps finish. With 1.75, the faster nodes finish their first round
of reduces and then launch a second wave of reduces.
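A minimal sketch applying the formula in the driver, assuming an illustrative cluster of 10 nodes with 8 containers each:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ReducerCountExample {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "reducer count example");
    int nodes = 10;                // illustrative cluster size
    int containersPerNode = 8;     // illustrative maximum containers per node
    int reducers = (int) (0.95 * nodes * containersPerNode);  // = 76 with these numbers
    job.setNumReduceTasks(reducers);
  }
}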
With the increase of number of reducers:
·
Load balancing increases.
·
Cost of failures decreases.
·
Framework overhead increases.
These are very common types of MapReduce Hadoop interview questions faced during
the interview of an experienced professional.
83) How to sort intermediate output based on
values in MapReduce?
Hadoop MapReduce automatically sorts the key-value pairs generated by the
mapper. Sorting takes place on the basis of keys. Thus, to sort the intermediate
output based on values, we need to use secondary sorting.
There are two possible approaches:
·
First, in the reducer: the reducer reads and buffers all the values for a given
key and then does an in-reducer sort on them. Since the reducer receives all the
values for a given key (a potentially huge list), this can cause the reducer to
run out of memory. Thus, this approach works well only if the number of values
is small.
·
Second, using the MapReduce framework itself: sort the reducer input values by
creating a composite key (the value-to-key conversion approach), i.e. by adding
part of the value, or the entire value, to the natural key. This approach is
scalable and will not generate out-of-memory errors.
We need a custom partitioner so that all the data with the same natural key
(even though the composite key includes the value) goes to the same reducer, and
a custom grouping comparator so that the data is grouped by the natural key once
it arrives at the reducer.
84) What is purpose of RecordWriter in Hadoop?
The reducer takes the mapper output (intermediate key-value pairs) as input,
then runs a reduce function on them to generate output (zero or more key-value
pairs). The output of the reducer is the final output.
RecordWriter writes these output key-value pairs from the reducer phase to
output files. The OutputFormat determines how the RecordWriter writes these
key-value pairs to the output files. Hadoop provides OutputFormat instances that
help write files to HDFS or to the local disk.
85) What are the most common OutputFormat in
Hadoop?
The reducer takes the mapper output as input and produces output (zero or more
key-value pairs). RecordWriter writes these output key-value pairs from the
reducer phase to output files, and the OutputFormat determines how the
RecordWriter writes these key-value pairs to the output files.
The FileOutputFormat.setOutputPath() method is used to set the output directory,
and every reducer writes a separate file in that common output directory.
The most common OutputFormats are:
·
TextOutputFormat – It is the default OutputFormat in MapReduce. TextOutputFormat
writes key-value pairs on individual lines of text files. Keys and values of
this format can be of any type, because TextOutputFormat turns them into strings
by calling toString() on them.
·
SequenceFileOutputFormat – This OutputFormat writes sequence files as its
output. It is also used to pass data between MapReduce jobs.
·
SequenceFileAsBinaryOutputFormat – It is a variant of SequenceFileOutputFormat
which writes keys and values to a sequence file in binary format.
·
DBOutputFormat – We use this for writing to relational databases and HBase. It
sends the reduce output to a SQL table and accepts key-value pairs where the key
has a type extending DBWritable.
86) What is LazyOutputFormat in Hadoop?
FileOutputFormat subclasses will create output files (part-r-nnnnn) even if they
are empty. Some applications prefer not to create empty files, which is where
LazyOutputFormat helps.
LazyOutputFormat is a wrapper OutputFormat. It makes sure that an output file is
created only when the first record is emitted for a given partition.
To use LazyOutputFormat, call its setOutputFormatClass() method with the job
configuration, as sketched below.
To enable LazyOutputFormat, Streaming and Pipes support a -lazyOutput option.
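A minimal sketch of enabling LazyOutputFormat in the driver, wrapping the usual TextOutputFormat:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class LazyOutputExample {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "lazy output example");
    // Instead of job.setOutputFormatClass(TextOutputFormat.class),
    // wrap the real OutputFormat so that empty part files are never created.
    LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
  }
}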
87) How to handle record boundaries in Text
files or Sequence files in MapReduce InputSplits?
The InputSplit's RecordReader in MapReduce must “start” and “end” at a record
boundary.
In a SequenceFile, roughly every 2 KB there is a 20-byte sync mark between the
records. These sync marks allow the RecordReader, given an InputSplit (which
holds a file, an offset and a length), to seek to the first sync mark after the
start of the split and to continue processing records until it reaches the first
sync mark after the end of the split.
Similarly, text files use newlines instead of sync marks to handle record
boundaries.
88) What are the main configuration parameters
in a MapReduce program?
The main configuration parameters are:
·
Input format of data.
·
Job’s input locations in the distributed file system.
·
Output format of data.
·
Job’s output location in the distributed file system.
·
JAR file containing the mapper, reducer and driver classes
·
Class containing the map function.
·
Class containing the reduce function.
89) Is it mandatory to set input and output
type/format in MapReduce?
No, it is not mandatory.
The Hadoop cluster, by default, takes the input and the output format as
'text'.
TextInputFormat – MapReduce
default InputFormat is TextInputFormat. It treats each
line of each input file as a separate record and also performs no parsing. For
unformatted data or line-based records like log files, TextInputFormat is also
useful. By default, RecordReader also uses TextInputFormat for converting data
into key-value pairs.
TextOutputFormat- MapReduce
default OutputFormat is TextOutputFormat. It also writes (key, value) pairs on
individual lines of text files. Its keys and values can be of any type.
90) What is Identity Mapper?
Identity Mapper is the default Mapper provided by Hadoop. When a MapReduce
program has not defined any mapper class, the Identity Mapper runs. It simply
passes the input key-value pairs on to the reducer phase; it does not perform
any computation or calculation on the input data, so it only writes the input
data to the output.
The class name is org.apache.hadoop.mapred.lib.IdentityMapper
91) What is Identity reducer?
Identity Reducer is the default Reducer provided by Hadoop. When a MapReduce
program has not defined any reducer class, the Identity Reducer runs. This does
not mean that the reduce step will not take place: it will take place, and the
related sorting and shuffling will also take place, but there will be no
aggregation. So you can use the Identity Reducer if you want to sort the data
coming from the map but don't care about any grouping.
The above MapReduce Hadoop interview questions and answers, i.e. Q. 73 – Q. 91,
were for experienced candidates, but freshers can also refer to them for
in-depth knowledge. Now let's move forward with some advanced MapReduce Hadoop
interview questions and answers.
Advanced
Interview Questions and Answers for Hadoop MapReduce
92) What is Chain Mapper?
We can use multiple Mapper classes within a single map task by using the
ChainMapper class. The Mapper classes are invoked in a chained (or piped)
fashion: the output of the first becomes the input of the second, and so on
until the last mapper. The Hadoop framework writes the output of the last mapper
to the task's output.
The key benefit of this feature is that the Mappers in the chain do not need to
be aware that they execute in a chain. This enables reusable, specialized
Mappers that we can combine to perform composite operations within a single task
in Hadoop.
Special care has to be taken when creating chains: the key/value types output by
a Mapper must be valid input for the following mapper in the chain.
The class name is org.apache.hadoop.mapred.lib.ChainMapper (a sketch using the new-API equivalent follows).
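A minimal sketch of chaining two mappers with the new-API ChainMapper (org.apache.hadoop.mapreduce.lib.chain); the mapper class names LowerCaseMapper and FilterMapper are illustrative assumptions:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.chain.ChainMapper;

public class ChainMapperExample {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "chain mapper example");
    // First mapper: (LongWritable, Text) -> (Text, Text), e.g. lower-casing each line.
    ChainMapper.addMapper(job, LowerCaseMapper.class,
        LongWritable.class, Text.class, Text.class, Text.class, new Configuration(false));
    // Second mapper consumes the first mapper's output types: (Text, Text) -> (Text, Text).
    ChainMapper.addMapper(job, FilterMapper.class,
        Text.class, Text.class, Text.class, Text.class, new Configuration(false));
  }
}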
This is one of the very important MapReduce Hadoop interview questions.
93) What are the core methods of a Reducer?
The Reducer processes the output of the mapper. After processing the data, it
produces a new set of output, which it stores in HDFS. The core methods of a
Reducer are (see the sketch after this list):
·
setup() – This method configures various parameters like the input data size,
distributed cache, heap size, etc. Function definition:
public void setup(context)
·
reduce() – The reducer calls this method once per key with the associated list
of values. Function definition: public void reduce(key, values, context)
·
cleanup() – The reducer calls this method only once, at the end of the reduce
task, to clean up temporary files. Function definition:
public void cleanup(context)
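A minimal reducer sketch showing the three core methods for a sum-style reduce (the class name is illustrative):
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  protected void setup(Context context) {
    // Called once before any reduce() call; read configuration or cache files here.
  }

  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    // Called once per key with all of that key's values.
    int sum = 0;
    for (IntWritable value : values) {
      sum += value.get();
    }
    context.write(key, new IntWritable(sum));
  }

  @Override
  protected void cleanup(Context context) {
    // Called once after the last reduce() call; release resources here.
  }
}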
94) What are the parameters of mappers and
reducers?
The parameters for Mappers are:
·
LongWritable (input key)
·
Text (input value)
·
Text (intermediate output key)
·
IntWritable (intermediate output value)
The parameters for Reducers are:
·
Text (intermediate output key)
·
IntWritable (intermediate output value)
·
Text (final output key)
·
IntWritable (final output value)
95) What is the difference between
TextInputFormat and KeyValueTextInputFormat class?
TextInputFormat – It is the default InputFormat. It treats each line of the input file as a
separate record. For unformatted data or line-based records like log files,
TextInputFormat is also useful. So,
·
Key- It
is byte offset of the beginning of the line within the file.
·
Value- It
is the contents of the line, excluding line terminators.
KeyValueTextInputFormat – It is like TextInputFormat in that it also treats each
line of input as a separate record. The main difference is that TextInputFormat
treats the entire line as the value, while KeyValueTextInputFormat breaks the
line itself into a key and a value at the tab character ('\t'). So,
·
Key- Everything
up to tab character.
·
Value- Remaining
part of the line after tab character.
For example, consider a file contents as below:
AL#Alabama
AR#Arkansas
FL#Florida
So, TextInputFormat gives:
Key    Value
0      AL#Alabama
14     AR#Arkansas
23     FL#Florida
So, KeyValueTextInputFormat gives:
Key    Value
AL     Alabama
AR     Arkansas
FL     Florida
These are some of the advanced MapReduce Hadoop interview
Questions and answers
96) How is the splitting of a file invoked in Hadoop?
InputFormat is responsible for creating the InputSplits, which are the logical
representation of the data. The Hadoop framework further divides each split into
records, and the Mapper processes each record (which is a key-value pair).
The Hadoop framework invokes the splitting of a file by running the getSplits()
method, which belongs to the InputFormat class (such as FileInputFormat)
specified by the user.
97) How many InputSplits will be made by
hadoop framework?
InputFormat is responsible for creating the InputSplits, which are the logical
representation of the data. The Hadoop framework further divides each split into
records, and the Mapper processes each record (which is a key-value pair).
The MapReduce system uses the storage locations to place map tasks as close to
the split's data as possible. By default, the split size is approximately equal
to the HDFS block size (128 MB).
For example, if the file size is 514 MB:
128 MB: 1st block, 128 MB: 2nd block, 128 MB: 3rd block, 128 MB: 4th block, 2 MB: 5th block
So, 5 InputSplits are created based on the 5 blocks.
If you have any confusion about any MapReduce Hadoop interview question, do let
us know by leaving a comment. We will be glad to solve your queries.
98) Explain the usage of Context Object.
With the help of the Context object, the Mapper can easily interact with the
rest of the Hadoop system. It also helps in updating counters, so the counters
can report progress and provide application-level status updates.
The Context object also contains the configuration details for the job.
99) When is it not recommended to use
MapReduce paradigm for large scale data processing?
MapReduce is not suggested for iterative processing use cases, as it is not
cost-effective for them; instead, Apache Pig can be used for the same.
100) What is the difference between RDBMS with
Hadoop MapReduce?
Size of Data
·
RDBMS- A traditional RDBMS can handle up to gigabytes of data.
·
MapReduce- Hadoop MapReduce can handle petabytes of data or more.
Updates
·
RDBMS- Read
and Write multiple times.
·
MapReduce- Read
many times but write once model.
Schema
·
RDBMS- Static
Schema that needs to be pre-defined.
·
MapReduce- Has
a dynamic schema
Processing Model
·
RDBMS- Supports
both batch and interactive processing.
·
MapReduce- Supports
only batch processing.
Scalability
·
RDBMS- Non-Linear
·
MapReduce- Linear
101) Define Writable data types in Hadoop
MapReduce.
Hadoop reads and writes data in a serialized form through the Writable
interface. Several classes implement the Writable interface, such as Text,
IntWritable, LongWritable, FloatWritable and BooleanWritable. Users are also
free to define their own Writable classes (see the sketch below).
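A minimal sketch of a user-defined Writable holding two fields (the class and field names are illustrative):
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

public class PageViewWritable implements Writable {
  private long pageId;
  private int viewCount;

  // A no-argument constructor is required so Hadoop can instantiate it by reflection.
  public PageViewWritable() { }

  public PageViewWritable(long pageId, int viewCount) {
    this.pageId = pageId;
    this.viewCount = viewCount;
  }

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeLong(pageId);    // serialize fields in a fixed order
    out.writeInt(viewCount);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    pageId = in.readLong();   // deserialize in exactly the same order
    viewCount = in.readInt();
  }
}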
102) Explain what does the conf.setMapper
Class does in MapReduce?
conf.setMapperClass sets the mapper class, which includes reading the data and
generating a key-value pair out of the mapper.
This was all about the Hadoop interview questions and answers for MapReduce.
These are frequently asked MapReduce Hadoop interview questions and answers; you
can read some more Hadoop MapReduce interview questions and answers here.
After going through these MapReduce Hadoop interview questions and answers, you
will be able to confidently face an interview and answer the MapReduce Hadoop
interview questions asked in your interview in the best manner. These MapReduce
Hadoop interview questions are suggested by the experts at DataFlair.
Key –
Q.53 – Q.57: Basic MapReduce Hadoop interview questions and answers
Q.58 – Q.72: MapReduce Hadoop interview questions and answers for freshers
Q.73 – Q.80: Hadoop MapReduce interview questions for experienced
Q.81 – Q.91: Top questions asked in a Hadoop interview
Q.92 – Q.102: Advanced MapReduce Hadoop interview questions and answers
These MapReduce Hadoop interview questions and answers are categorized so that
you can pay more attention to the questions specified for you; however, it is
recommended that you go through all the Hadoop interview questions and answers
for a complete understanding.
If you have any more doubts or queries on Hadoop interview questions and answers
for MapReduce, drop a comment and our support team will be happy to help you.
Hope this tutorial on Hadoop interview questions and answers was helpful to you.
http://www.bigdatatrunk.com/top-50-interview-questions-hdfs/
Q1 What does ‘jps’ command do?
Answer: It gives the status of the daemons which run the Hadoop cluster. The output mentions the status of the Namenode, Datanode, Secondary Namenode, Jobtracker and Tasktracker.
Q2.What if a Namenode has no data?
Answer: It cannot be part of the Hadoop cluster.
Q3. What happens to job tracker when Namenode is down?
Answer: When Namenode is down, your cluster is OFF, this is because Namenode is the single point of failure in HDFS.
Q4.What is a Namenode?
Answer: Namenode is the master node on which job tracker runs and consists of the metadata. It maintains and manages the blocks which are present on the datanodes. It is a high-availability machine and single point of failure in HDFS.
Q5.Replication causes data redundancy, then why is it pursued in HDFS?
Answer: HDFS works with commodity hardware (systems with average configurations) that has high chances of getting crashed any time. Thus, to make the entire system highly fault-tolerant, HDFS replicates and stores data in different places. Any data on HDFS gets stored at least 3 different locations. So, even if one of them is corrupted and the other is unavailable for some time for any reason, then data can be accessed from the third one. Hence, there is no chance of losing the data. This replication factor helps us to attain the feature of Hadoop called Fault Tolerant.
Q6. What is a Datanode?
Answer: Datanodes are the slaves which are deployed on each machine and provide the actual storage. These are responsible for serving read and write requests for the clients.
Q7. Why do we use HDFS for applications having large data sets and not when there are lot of small files?
Answer: HDFS is more suitable for large amount of data sets in a single file as compared to small amount of data spread across multiple files. This is because Namenode is a very expensive high performance system, so it is not prudent to occupy the space in the Namenode by unnecessary amount of metadata that is generated for multiple small files. So, when there is a large amount of data in a single file, name node will occupy less space. Hence for getting optimized performance, HDFS supports large data sets instead of multiple small files.
Q8.Explain the major difference between HDFS block and InputSplit.
Answer: In simple terms, block is the physical representation of data while split is the logical representation of the data present in the block. Split acts as an intermediary between the block and the mapper. Suppose we have two blocks:
Block 1: ii nntteell
Block 2: Ii ppaatt
Now, considering the map, it will read the first block from ii till ll, but does not know how to process the second block at the same time. Here Split comes into play, which will form a logical group of Block 1 and Block 2 as a single block. It then forms key-value pairs using the InputFormat and RecordReader and sends them to the map for further processing. If you have limited resources, you can increase the split size to limit the number of maps. For instance, if there is a 640 MB file stored as 10 blocks of 64 MB each and there are limited resources, you can set the 'split size' to 128 MB. This will form a logical group of 128 MB, with only 5 maps executing at a time. However, if splitting is disabled, the whole file forms one InputSplit and is processed by a single map, consuming more time when the file is bigger.
Q9.What is a ‘block’ in HDFS?
Answer: A ‘block’ is the minimum amount of data that can be read or written. In HDFS, the default block size is 64 MB as contrast to the block size of 8192 bytes in Unix/Linux. Files in HDFS are broken down into block-sized chunks, which are stored as independent units. HDFS blocks are large as compared to disk blocks, particularly to minimize the cost of seeks. If a particular file is 50 mb, will the HDFS block still consume 64 mb as the default size? No, not at all! 64 mb is just a unit where the data will be stored. In this particular situation, only 50 mb will be consumed by an HDFS block and 14 mb will be free to store something else. It is the MasterNode that does data allocation in an efficient manner.
Q10.Explain what happens if during the PUT operation, HDFS block is assigned a replication factor 1 instead of the default value 3.
Answer: Replication factor is a property of HDFS that can be set for the entire cluster to adjust the number of times the blocks are to be replicated, to ensure high data availability. For every block that is stored in HDFS, the cluster will have n-1 duplicated blocks. So, if the replication factor during the PUT operation is set to 1 instead of the default value 3, then there will be a single copy of the data. Under these circumstances, if that DataNode crashes for any reason, the only copy of the data would be lost.
Q11.What are the most common Input Formats in Hadoop?
Answer: There are three most common input formats in Hadoop:
- Text Input Format: Default input format in Hadoop
- Key Value Input Format: used for plain text files where the files are broken into lines
- Sequence File Input Format: used for reading files in sequence
Q12. What is commodity hardware?
Answer: Commodity hardware refers to inexpensive systems that do not have high availability or high quality. Commodity hardware includes RAM because there are specific services that need to be executed in RAM. Hadoop can be run on any commodity hardware and does not require any supercomputers or high-end hardware configuration to execute jobs.
Q13. What is the port number for NameNode,Secondary NameNode,DataNodes,TaskTracker and JobTracker?
Answer:
- NameNode 50070
- Secondary NameNode 50090
- DataNodes 50075
- JobTracker 50030
- TaskTracker 50060
Q14. Explain about the process of inter cluster data copying.
Answer: HDFS provides a distributed data copying facility through DistCp, from a source to a destination. When this data copying takes place between two Hadoop clusters, it is referred to as inter-cluster data copying. DistCp requires both the source and the destination to have a compatible or identical version of Hadoop.
Q15. What is a heartbeat in HDFS?
Answer: A heartbeat is a signal indicating that it is alive. A datanode sends heartbeat to Namenode and task tracker will send its heart beat to job tracker. If the Namenode or job tracker does not receive heart beat then they will decide that there is some problem in datanode or task tracker is unable to perform the assigned task.
Q16. Explain the difference between NAS and HDFS.
Answer: NAS runs on a single machine and thus there is no probability of data redundancy, whereas HDFS runs on a cluster of different machines and thus there is data redundancy because of the replication protocol. NAS stores data on dedicated hardware, whereas in HDFS all the data blocks are distributed across the local drives of the machines. In NAS, data is stored independently of the computation, and hence Hadoop MapReduce cannot be used for processing, whereas HDFS works with Hadoop MapReduce, as the computations in HDFS are moved to the data.
Q17. Explain about the indexing process in HDFS.
Answer: Indexing process in HDFS depends on the block size. HDFS stores the last part of the data that further points to the address where the next part of data chunk is stored.
Q18. What is a rack awareness and on what basis is data stored in a rack?
Answer: All the data nodes put together form a storage area, i.e. the physical location of the data nodes is referred to as a Rack in HDFS. The rack information, i.e. the rack id of each data node, is acquired by the NameNode. The process of selecting closer data nodes depending on the rack information is known as Rack Awareness. The contents of a file are divided into data blocks as soon as the client is ready to load the file into the Hadoop cluster. After consulting with the NameNode, the client allocates 3 data nodes for each data block. For each data block, there exist 2 copies in one rack and the third copy is present in another rack, to ensure that if an entire rack fails we still have one copy in another rack. This is generally referred to as the Replica Placement Policy.
Q19. How NameNode Handles data node failures?
Answer: Through heartbeats and block reports. Every DataNode periodically sends a heartbeat to the NameNode; if the NameNode stops receiving heartbeats from a DataNode, it marks that node as dead and re-replicates its blocks on other DataNodes so that the configured replication factor is maintained.
Q20. What is HDFS?
Answer: The Hadoop Distributed File System (HDFS) is a sub-project of the Apache Hadoop project. HDFS uses a master/slave architecture in which one device (the master) controls one or more other devices (the slaves). HDFS is a file system designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.
Q21. What are the key features of HDFS?
Answer: HDFS is highly fault-tolerant, with high throughput, suitable for applications with large data sets, streaming access to file system data and can be built out of commodity hardware.
Q22. What is throughput? How does HDFS get a good throughput?
Answer: Throughput is the amount of work done in a unit time. It describes how fast the data is getting accessed from the system and it is usually used to measure performance of the system. In HDFS, when we want to perform a task or an action, then the work is divided and shared among different systems. So all the systems will be executing the tasks assigned to them independently and in parallel. So the work will be completed in a very short period of time. In this way, the HDFS gives good throughput. By reading data in parallel, we decrease the actual time to read data tremendously.
Q23. What is data-integrity in HDFS?
Answer: HDFS transparently checksums all data written to it and by default verifies the checksums when reading data. A separate checksum is created for every 512 bytes of data by default; a CRC-32 checksum is 4 bytes long. Datanodes are responsible for verifying the data they receive before storing the data and its checksums. It is possible to disable checksum verification by passing false to the setVerifyChecksum() method on the FileSystem before using the open() method to read a file.
Q24. What all modes Hadoop can be run in?
Answer: Hadoop can run in three modes:
- Standalone Mode: Default mode of Hadoop, it uses the local file system for input and output operations. This mode is mainly used for debugging purposes, and it does not support the use of HDFS. Further, in this mode, there is no custom configuration required for the mapred-site.xml, core-site.xml, hdfs-site.xml files. It is much faster when compared to other modes.
- Pseudo-Distributed Mode (Single Node Cluster): In this case, you need configuration for all the three files mentioned above. In this case, all daemons are running on one node and thus, both Master and Slave node are the same.
- Fully Distributed Mode (Multiple Node Cluster): This is the production phase of Hadoop (what Hadoop is known for) where data is used and distributed across several nodes on a Hadoop cluster. Separate nodes are allotted as Master and Slaves.
Q25 What are the core components of Hadoop?
Answer: Core components of Hadoop are HDFS and MapReduce. HDFS is basically used to store large data sets and MapReduce is used to process such large data sets.
Q26. What is metadata?
Answer: Metadata is the information about the data stored in datanodes such as location of the file, size of the file and so on.
Q27. What happens when two clients try to write into the same HDFS file?
Answer:HDFS supports exclusive writes only. When the first client contacts the name-node to open the file for writing, the name-node grants a lease to the client to create this file. When the second client tries to open the same file for writing, the name-node will see that the lease for the file is already granted to another client, and will reject the open request for the second client
Q28. What is a daemon?
Answer: Daemon is a process or service that runs in background. In general, we use this word in UNIX environment. The equivalent of Daemon in Windows is “Services” and in Dos is “TSR”.
Q29.What are file permissions in HDFS?
Answer: HDFS has a permission model for files and directories that is much like POSIX. There are three types of permissions:
- read permission (r)
- write permission (w)
- execute permission (x)
Each file and directory has an owner, a group, and a mode.
Q30. What does Data Locality mean?
Answer: Data Locality means processing the data where it resides. It simply means that Hadoop MapReduce will do its best to schedule the map tasks and the reduce tasks such that most tasks read their input data from the local computer. In certain scenarios, mainly in the reduce phase, an exception to Data Locality may be needed.
Q31. What is the process to change the files at arbitrary locations in HDFS?
Answer: HDFS does not support modifications at arbitrary offsets in the file or multiple writers but files are written by a single writer in append only format i.e. writes to a file in HDFS are always made at the end of the file.
Q32. What is the process of indexing in HDFS?
Answer: Once data is stored HDFS will depend on the last part to find out where the next part of data would be stored.
Q33. Difference between Hadoop fs -copyFromLocal and Hadoop fs -moveFromLocal
Answer: Hadoop fs -put and Hadoop fs -copyFromLocal are the same: they copy the data from the local file system to HDFS while the local copy remains available, so they work like copy & paste. The Hadoop fs -moveFromLocal command works like cut & paste: it moves the file from the local file system to HDFS, and the local copy is no longer available.
Q34. What happens if one Hadoop client renames a file or a directory containing this file while another client is still writing into it?
Answer:A file will appear in the name space as soon as it is created. If a writer is writing to a file and another client renames either the file itself or any of its path components, then the original writer will get an IOException either when it finishes writing to the current block or when it closes the file.
Q35.What is Secondary NameNode?
Answer: Its main role is to periodically merge the namespace image with the edit log to prevent the edit log from becoming too large. It is not a substitute for the Namenode, so if the Namenode fails, the entire Hadoop system goes down.
Q36. What is default block size in HDFS?
Answer: In Hadoop 2.x, the default block size in HDFS is 128 MB; prior to that (Hadoop 1.x) it was 64 MB.
Q37. What are the limitations of HDFS file systems?
Answer: HDFS supports file reads, writes, appends and deletes efficiently, but it doesn't support file updates. HDFS is not suitable for a large number of small files but best suits large files, because the file system namespace maintained by the Namenode is limited by its main memory capacity (the namespace is stored in the Namenode's main memory), and a large number of files would result in a big fsimage file.
Q38. Is there an easy way to see the status and health of a cluster?
Answer: There are web-based interfaces to both the JobTracker (MapReduce master) and NameNode (HDFS master) which display status pages about the state of the entire system. The JobTracker status page will display the state of all nodes, as well as the job queue and status about all currently running jobs and tasks. The NameNode status page will display the state of all nodes and the amount of free space, and provides the ability to browse the DFS via the web.
Q39. How do you debug a performance issue or a long running job?
Answer: This is an open ended question and the interviewer is trying to see the level of hands-on experience you have in solving production issues. Use your day to day work experience to answer this question. Here are some of the scenarios and responses to help you construct your answer. On a very high level you will follow the below steps.
- Understand the symptom
- Analyze the situation
- Identify the problem areas
- Propose solution
Q40. What is a sequence file in Hadoop?
Answer: A sequence file is used to store binary key/value pairs. Sequence files support splitting even when the data inside the file is compressed, which is not possible with a regular compressed file. You can either choose to perform record-level compression, in which the value in the key/value pair is compressed, or choose block-level compression, where multiple records are compressed together. Consider this case scenario: in an M/R system, the HDFS block size is 64 MB, the input format is FileInputFormat, and we have 3 files of size 64K, 65MB and 127MB. How many input splits will be made by the Hadoop framework?
Hadoop will make 5 splits as follows:
- 1 split for the 64K file
- 2 splits for the 65MB file
- 2 splits for the 127MB file
Q41. What happens when a datanode fails?
Answer: When a datanode fails:
- Jobtracker and namenode detect the failure
- On the failed node all tasks are re-scheduled
- Namenode replicates the user's data to another node
Q42. What is the benefit of Distributed cache? Why can we just have the file in HDFS and have the application read it?
Answer: Distributed cache is much faster. It copies the file to all task trackers at the start of the job. Now if a task tracker runs 10 or 100 Mappers or Reducers, they will use the same copy from the distributed cache. On the other hand, if you put code in your MR job to read the file from HDFS, every Mapper will try to access it from HDFS, so if a TaskTracker runs 100 map tasks it will try to read this file 100 times from HDFS. Also, HDFS is not very efficient when used like this.
Q43. What happens to a NameNode that has no data?
Answer:There does not exist any NameNode without data. If it is a NameNode then it should have some sort of data in it.
Q44. What is a block and block scanner in HDFS?
Answer: Block – The minimum amount of data that can be read or written is generally referred to as a “block” in HDFS. The default size of a block in HDFS is 64 MB. Block Scanner – The Block Scanner tracks the list of blocks present on a DataNode and verifies them to find any kind of checksum errors. Block Scanners use a throttling mechanism to reserve disk bandwidth on the datanode.
Q45. Why is a block in HDFS so Large?
Answer: HDFS blocks are large compared to disk blocks, and the reason is to minimize the cost of seeks. By making a block large enough, the time to transfer the data from the disk can be significantly longer than the time to seek to the start of the block.
Q46. What is HDFS High-Availability?
Answer: The 2.x release series of Hadoop adds support for HDFS high-availability (HA). In this implementation there is a pair of namenodes in an active-standby configuration. In the event of the failure of the active namenode, the standby takes over its duties to continue servicing client requests without a significant interruption.
Q47. What are some typical functions of Job Tracker?
Answer: The following are some typical tasks of JobTracker:
- When Client applications submit map reduce jobs to the Job tracker
- The JobTracker talks to the Name node to determine the location of the data
- The JobTracker locates TaskTracker nodes with available slots at or near the data
- The JobTracker submits the work to the chosen Tasktracker nodes
- The TaskTracker nodes are monitored. If they do not submit heartbeat signals often enough, they are deemed to have failed and the work is scheduled on a different TaskTracker
- When the work is completed, the JobTracker updates its status
- Client applications can poll the JobTracker for information
Q48. How does one switch off the “SAFEMODE” in HDFS?
Answer: You use the command: hadoop dfsadmin -safemode leave
Q49. What is streaming access?
Answer: As HDFS works on the principle of ‘Write Once, Read Many’, the feature of streaming access is extremely important in HDFS. HDFS focuses not so much on storing the data but how to retrieve it at the fastest possible speed, especially while analyzing logs. In HDFS, reading the complete data is more important than the time taken to fetch a single record from the data.
Q50. Is Namenode also a commodity?
Answer: No. The Namenode can never be commodity hardware because the entire HDFS relies on it. It is the single point of failure in HDFS. The Namenode has to be a high-availability machine.
-------------------
MapReduce
Q1 What is MapReduce?
Answer: MapReduce is a parallel programming model which is used to process large data sets across hundreds or thousands of servers in a Hadoop cluster. Map/reduce brings the compute to the data at the data's location, in contrast to traditional parallelism, which brings the data to the compute location. The term MapReduce is composed of the Map and Reduce phases. The first is the map job, which takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). The reduce job takes the output from a map as input and combines those data tuples into a smaller set of tuples. As the sequence of the name MapReduce implies, the reduce job is always performed after the map job. The programming language for MapReduce is Java. All data emitted in the flow of a MapReduce program is in the form of key/value pairs.
Q2 Explain a MapReduce program.
Answer: A MapReduce program consists of 3 parts namely, Driver, Mapper, and Reducer.
The Driver code runs on the client machine and is responsible for building the configuration of the job and submitting it to the Hadoop Cluster. The Driver code will contain the main() method that accepts arguments from the command line.
The Mapper code reads the input files as <Key,Value> pairs and emits key value pairs. The Mapper class extends MapReduceBase and implements the Mapper interface. The Mapper interface expects four generics, which define the types of the input and output key/value pairs. The first two parameters define the input key and value types, the second two define the output key and value types.
The Reducer code reads the outputs generated by the different mappers as <Key,Value> pairs and emits key value pairs. The Reducer class extends MapReduceBase and implements the Reducer interface. The Reducer interface expects four generics, which define the types of the input and output key/value pairs. The first two parameters define the intermediate key and value types, the second two define the final output key and value types.
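For concreteness, here is a minimal word-count sketch written against the older org.apache.hadoop.mapred API that the answer above describes (MapReduceBase plus the Mapper and Reducer interfaces). The class names WordCountMapper and WordCountReducer are illustrative, not part of the original post.
    import java.io.IOException;
    import java.util.Iterator;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    // Mapper: input types LongWritable/Text, output types Text/IntWritable.
    class WordCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);
      private final Text word = new Text();

      public void map(LongWritable key, Text value,
                      OutputCollector<Text, IntWritable> output, Reporter reporter)
          throws IOException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
          word.set(tokens.nextToken());
          output.collect(word, ONE);   // emit (word, 1) for every token in the line
        }
      }
    }

    // Reducer: sums up the counts emitted for each word.
    class WordCountReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
      public void reduce(Text key, Iterator<IntWritable> values,
                         OutputCollector<Text, IntWritable> output, Reporter reporter)
          throws IOException {
        int sum = 0;
        while (values.hasNext()) {
          sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
      }
    }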
Q3 Mention the main configuration parameters that a user needs to specify to run a MapReduce job.
Answer: The user of the MapReduce framework needs to specify the following (a driver sketch covering these settings follows the list):
- Job’s input locations in the distributed file system
- Job’s output location in the distributed file system
- Input format
- Output format
- Class containing the map function
- Class containing the reduce function
- JAR file containing the mapper, reducer and driver classes
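As referenced above, here is a hedged driver sketch (again using the old org.apache.hadoop.mapred API) showing where each of the listed parameters is set. It reuses the illustrative WordCountMapper and WordCountReducer classes from the previous sketch; the combiner line is optional.
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.TextInputFormat;
    import org.apache.hadoop.mapred.TextOutputFormat;

    public class WordCountDriver {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCountDriver.class);  // JAR containing mapper, reducer and driver
        conf.setJobName("wordcount");

        FileInputFormat.setInputPaths(conf, new Path(args[0]));   // input location in HDFS
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));  // output location in HDFS

        conf.setInputFormat(TextInputFormat.class);    // input format
        conf.setOutputFormat(TextOutputFormat.class);  // output format

        conf.setMapperClass(WordCountMapper.class);    // class containing the map function
        conf.setCombinerClass(WordCountReducer.class); // optional local reduce (combiner)
        conf.setReducerClass(WordCountReducer.class);  // class containing the reduce function

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        JobClient.runJob(conf);                        // submit the job and wait for completion
      }
    }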
Q4 What does the Mapper do?
Answer: The Mapper is the first phase of a MapReduce job and processes the map tasks. A mapper reads key/value pairs and emits key/value pairs. Maps are the individual tasks that transform input records into intermediate records. The transformed intermediate records do not need to be of the same type as the input records, and a given input pair may map to zero or many output pairs.
Q5 Is there an easy way to see the status and health of a cluster?
Answer: There are web-based interfaces to both the JobTracker (MapReduce master) and the NameNode (HDFS master) which display status pages about the state of the entire system. The JobTracker status page displays the state of all nodes, as well as the job queue and the status of all currently running jobs and tasks. The NameNode status page displays the state of all nodes and the amount of free space, and provides the ability to browse the DFS via the web.
Q6 Which interfaces need to be implemented to create a Mapper and Reducer for Hadoop?
Answer:
- org.apache.hadoop.mapreduce.Mapper
- org.apache.hadoop.mapreduce.Reducer
(In the new API these are abstract classes that you extend; the old org.apache.hadoop.mapred API exposes them as interfaces.)
Q7 Explain what SequenceFileInputFormat is.
Answer: SequenceFileInputFormat is an input format for reading sequence files. A sequence file is a specific compressed binary file format optimized for passing data from the output of one MapReduce job to the input of another MapReduce job.
Q8 What are ‘maps’ and ‘reduces’?
Answer: ‘Maps’ and ‘Reduces’ are two phases of solving a query in HDFS. ‘Map’ is responsible for reading data from the input location and, based on the input type, generating key/value pairs, that is, an intermediate output on the local machine. ‘Reducer’ is responsible for processing the intermediate output received from the mapper and generating the final output.
Q9 What does conf.setMapperClass() do?
Answer: conf.setMapperClass() sets the mapper class for the job, i.e. everything related to the map phase, such as reading the data and generating key/value pairs out of the mapper.
Q10 What are the methods in the Reducer class and order of their invocation?
Answer: The Reducer class contains the run() method, which calls setup() only once, then calls reduce() once for each input key, and finally calls cleanup() (see the sketch below).
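The following is a simplified, conceptual sketch of the invocation order inside the new-API org.apache.hadoop.mapreduce.Reducer.run() method; it is not the verbatim Hadoop source, but it shows the setup/reduce/cleanup sequence described above.
    public void run(Context context) throws IOException, InterruptedException {
      setup(context);                   // called once, before any key is processed
      while (context.nextKey()) {       // one call per distinct intermediate key
        reduce(context.getCurrentKey(), context.getValues(), context);
      }
      cleanup(context);                 // called once, after the last key
    }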
Q11 Explain what is the purpose of RecordReader in Hadoop?
Answer: In Hadoop, the RecordReader loads the data from its source and converts it into key, value pairs suitable for reading by the Mapper.
Q12 Explain MapReduce and why it is needed when programming with Apache Pig.
Answer: Programs in Apache Pig are written in a query language now known as Pig Latin, which has some similarity to the SQL query language. To get a query executed, an execution engine is required. The Pig engine converts the queries into MapReduce jobs, so MapReduce acts as the execution engine needed to run the programs.
Q13 What are some typical functions of Job Tracker?
Answer: The following are some typical tasks of JobTracker:
- Client applications submit MapReduce jobs to the JobTracker
- The JobTracker talks to the Name node to determine the location of the data
- The JobTracker locates TaskTracker nodes with available slots at or near the data
- The JobTracker submits the work to the chosen TaskTracker nodes
- The TaskTracker nodes are monitored. If they do not submit heartbeat signals often enough, they are deemed to have failed and the work is scheduled on a different TaskTracker
- When the work is completed, the JobTracker updates its status
- Client applications can poll the JobTracker for information
Q14 What are the four basic parameters of a mapper?
Answer: The four basic parameters of a mapper are LongWritable, Text, Text, and IntWritable. The first two represent the input key/value types and the second two represent the intermediate output key/value types.
Q15 How can we change the split size if our commodity hardware has less storage space?
Answer: If our commodity hardware has less storage space, we can change the split size by writing a custom splitter. Hadoop allows this kind of customization, and it can be invoked from the driver's main method.
Q16 What is a TaskInstance?
Answer: The actual Hadoop MapReduce tasks that run on each slave node are referred to as task instances. By default, every task instance runs in its own JVM process, which is spawned for each new task.
Q17 What do the master class and the output class do?
Answer: The master class is defined to update the master (the JobTracker), and the output class is defined to write data to the output location.
Q18 What is the input type/format in MapReduce by default?
Answer: By default, the input type/format in MapReduce is ‘text’ (TextInputFormat).
Q19 Is it mandatory to set input and output type/format in MapReduce?
Answer: No, it is not mandatory to set the input and output type/format in MapReduce. By default, the cluster takes the input and the output type as ‘text’.
Q20 How is Hadoop different from other data processing tools?
Answer: In Hadoop, based upon your requirements, you can increase or decrease the number of mappers without bothering about the volume of data to be processed. This is the beauty of parallel processing in contrast to the other data processing tools available.
Q21 What does the JobConf class do?
Answer: MapReduce needs to logically separate the different jobs running on the same cluster. The JobConf class helps with job-level settings, such as declaring a job in a real environment. It is recommended that the job name be descriptive and represent the type of job being executed.
Q22 Is it important for Hadoop MapReduce jobs to be written in Java?
Answer: No, it is not necessary to write Hadoop MapReduce jobs in Java; users can write MapReduce jobs in any desired programming language, such as Ruby, Perl, Python, R, or Awk, through the Hadoop Streaming API.
Q23 What is a Combiner?
Answer: A ‘Combiner’ is a mini reducer that performs a local reduce task. It receives the input from the mapper on a particular node and sends its output on to the reducer. Combiners enhance the efficiency of MapReduce by reducing the amount of data that has to be sent to the reducers.
Q24 What do sorting and shuffling do?
Answer: Sorting and shuffling are responsible for creating a unique key and a list of values. Gathering similar keys at one location is known as sorting, and the process by which the intermediate output of the mapper is sorted and sent across to the reducers is known as shuffling.
Q25 What are the four basic parameters of a reducer?
Answer: The four basic parameters of a reducer are Text, IntWritable, Text, and IntWritable. The first two represent the intermediate input key/value types and the second two represent the final output key/value types.
Q26 What are the key differences between Pig and MapReduce?
Answer: Pig is a data-flow language; its key focus is managing the flow of data from the input source to the output store. As part of managing this data flow it moves data, feeding it to process 1, then taking the output and feeding it to process 2. Its core features include preventing execution of subsequent stages if a previous stage fails, managing temporary storage of data and, most importantly, compressing and rearranging processing steps for faster processing. Although this can be done for any kind of processing task, Pig is written specifically for managing the data flow of MapReduce-type jobs; most, if not all, jobs in Pig are MapReduce or data-movement jobs. Pig also allows custom functions to be added, and ships with default ones such as ordering, grouping, distinct, count, etc.
MapReduce, on the other hand, is a data-processing paradigm: it is a framework for application developers to write code that scales easily to petabytes of data, creating a separation between the developer who writes the application and the developer who scales it. Not all applications can be migrated to MapReduce, but a good few can, ranging from complex ones like k-means clustering to simple ones like counting unique values in a dataset.
Q27 Why can we not do aggregation or addition in a mapper? Why do we require a reducer for that?
Answer: We cannot do aggregation or addition in a mapper because sorting and grouping by key are not done on the mapper side; they happen only during the shuffle, before the reducer. Each mapper is initialized per input split and sees only the records of its own split, so while aggregating it would lose the values processed by other mappers and have no track of rows handled elsewhere. Full aggregation across all splits therefore has to be done in the reducer (a combiner can only perform partial, per-mapper aggregation).
Q28 What does a split do?
Answer: Before data is transferred from its location on disk to the map method, there is a phase called the split. A split pulls a block of data from HDFS and hands it to the framework; the split itself does not write anything, it only reads data from the block and passes it to the mapper. By default, splits are handled by the framework, the split size is equal to the block size, and splits are used to divide the input into the chunks fed to individual mappers.
Q29 What does the text input format do?
Answer: With the text input format, each line of the file becomes one record: the key is the byte offset of the line within the file and the value is the whole line of text. This is how the data gets processed by a mapper; the mapper receives the key as a LongWritable parameter and the value as a Text parameter.
Q30 What does a MapReduce partitioner do?
Answer: A MapReduce partitioner makes sure that all values for a single key go to the same reducer, thus allowing an even distribution of the map output over the reducers. It redirects the mapper output to the reducers by determining which reducer is responsible for each particular key (see the sketch below).
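As a sketch, the partitioning decision can be written like this with the old org.apache.hadoop.mapred API; the logic below mirrors the behavior of the default hash-based partitioner, and the class name WordPartitioner is illustrative.
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    public class WordPartitioner implements Partitioner<Text, IntWritable> {
      public void configure(JobConf job) { }   // no per-job setup needed here

      public int getPartition(Text key, IntWritable value, int numPartitions) {
        // The same key always maps to the same reducer, and keys spread evenly overall.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
      }
    }
The driver would register it with conf.setPartitionerClass(WordPartitioner.class).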
Q31 Can we rename the output file?
Answer: Yes, we can rename the output file by implementing a multiple-output format class (for example, by extending MultipleOutputFormat).
Q32 What is Streaming?
Answer: Streaming is a feature of the Hadoop framework that allows us to write MapReduce programs in any programming language that can read from standard input and write to standard output; it could be Perl, Python or Ruby, and does not necessarily have to be Java. However, deeper customization of MapReduce can only be done using Java and not any other programming language.
Q33 Explain what Speculative Execution is.
Answer: During speculative execution, Hadoop launches a certain number of duplicate tasks: multiple copies of the same map or reduce task can be executed on different slave nodes. In simple words, if a particular node is taking a long time to complete a task, Hadoop creates a duplicate of that task on another node; whichever copy finishes first is kept and the slower copies are killed.
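Speculative execution can be toggled per job. A minimal sketch, assuming the Hadoop 1.x property names (newer releases use mapreduce.map.speculative and mapreduce.reduce.speculative):
    import org.apache.hadoop.mapred.JobConf;

    public class SpeculationConfig {
      public static void main(String[] args) {
        JobConf conf = new JobConf(SpeculationConfig.class);
        // Allow duplicate map attempts, but disable them for reducers.
        conf.setBoolean("mapred.map.tasks.speculative.execution", true);
        conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);
      }
    }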
Q34 Is it possible to start reducers while some mappers still run? Why?
Answer: No. A reducer’s input is grouped by key, and the last mapper could theoretically produce a key that has already been consumed by a running reducer, so the reduce() function cannot start until all mappers have finished. (Reducers may, however, begin copying map outputs while mappers are still running.)
Q35 Describe a reduce-side join between tables with a one-to-one relationship.
Answer: Each mapper produces key/value pairs with the join ID as the key and the row as the value. Corresponding rows from both tables are grouped together by the framework during the shuffle-and-sort phase. The reduce method then receives a join ID and two values, each representing a row from one of the tables, and joins them (see the sketch below).
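A compact sketch of the tagging idea behind such a reduce-side join, using the old mapred API. The class names, the comma-separated input format and the assumption that the join ID is the first field are all illustrative; a second, identical mapper emitting tag "B" would be attached for the other table (for example via MultipleInputs).
    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    // Mapper for table A: emits (joinId, "A" + TAB + row); a TableBMapper would emit tag "B".
    class TableAMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {
      public void map(LongWritable key, Text value,
                      OutputCollector<Text, Text> output, Reporter reporter)
          throws IOException {
        String row = value.toString();
        String joinId = row.split(",")[0];          // assume the join id is the first field
        output.collect(new Text(joinId), new Text("A\t" + row));
      }
    }

    // Reducer: receives both tagged rows for a join id and stitches them together.
    class JoinReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {
      public void reduce(Text joinId, Iterator<Text> values,
                         OutputCollector<Text, Text> output, Reporter reporter)
          throws IOException {
        String rowA = null, rowB = null;
        while (values.hasNext()) {
          String tagged = values.next().toString();
          if (tagged.startsWith("A\t")) rowA = tagged.substring(2);
          else                          rowB = tagged.substring(2);
        }
        if (rowA != null && rowB != null) {
          output.collect(joinId, new Text(rowA + "," + rowB));  // the joined row
        }
      }
    }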
Q36 Can you run Map – Reduce jobs directly on Avro data?
Answer: Yes, Avro was specifically designed for data processing via Map-Reduce.
Q37 Can reducers communicate with each other?
Answer: Reducers always run in isolation and they can never communicate with each other as per the Hadoop MapReduce programming paradigm.
Q38 How can you set an arbitrary number of Reducers to be created for a job in Hadoop?
Answer: You can do it programmatically by calling the setNumReduceTasks() method of the JobConf class, or by setting it as a configuration setting.
Q39 What is TaskTracker?
Answer: TaskTracker is a node in the cluster that accepts tasks – MapReduce and shuffle operations – from a JobTracker. Each TaskTracker is responsible for executing and managing the individual tasks assigned by the JobTracker, and it also handles the data motion between the map and reduce phases. One prime responsibility of the TaskTracker is to constantly communicate the status of its tasks to the JobTracker. If the JobTracker fails to receive a heartbeat from a TaskTracker within a specified amount of time, it assumes the TaskTracker has crashed and resubmits the corresponding tasks to other nodes in the cluster.
Q40 How do you set the number of mappers and reducers for Hadoop jobs?
Answer: Users can configure the JobConf variable to set the number of mappers and reducers, using job.setNumMapTasks() and job.setNumReduceTasks() (see the sketch below).
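A small sketch with the old JobConf API; note that the map-task count is only a hint to the framework (the input splits ultimately determine it), whereas the reduce-task count is honoured exactly.
    import org.apache.hadoop.mapred.JobConf;

    public class TaskCountConfig {
      public static void main(String[] args) {
        JobConf conf = new JobConf(TaskCountConfig.class);
        conf.setNumMapTasks(20);      // a hint for the number of map tasks
        conf.setNumReduceTasks(5);    // the exact number of reduce tasks
      }
    }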
Q41 What is the configuration of a typical slave node on a Hadoop cluster? How many JVMs run on a slave node?
Answer: A single instance of a TaskTracker runs on each slave node, as a separate JVM process. A single instance of the DataNode daemon also runs on each slave node, again as a separate JVM process. In addition, one or more task instances run on each slave node, each in its own JVM process; the number of task instances can be controlled by configuration, and a high-end machine is typically configured to run more task instances.
Q42 What do you know about NLineOutputFormat?
Answer: NLineInputFormat splits the input so that each split contains exactly ‘n’ lines, i.e. each mapper receives ‘n’ lines of input.
Q43 True or false: Each reducer must generate the same number of key/value pairs as its input had.
Answer: False. Reducer may generate any number of key/value pairs including zero.
Q44 When are the reducers started in a MapReduce job?
Answer: In a MapReduce job, reducers do not start executing the reduce method until all the map tasks have completed. However, reducers start copying intermediate key/value pairs from the mappers as soon as they are available; the programmer-defined reduce method is called only after all the mappers have finished.
Q45 Name the job control options provided by MapReduce.
Answer: Since the framework supports chained operations, wherein the output of one map job serves as the input of another, job controls are needed to govern these complex operations. The job control options are (a sketch using the newer Job API follows the list):
- submit(): to submit the job to the cluster and immediately return
- waitForCompletion(boolean): to submit the job to the cluster and wait for its completion
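These two calls belong to the newer org.apache.hadoop.mapreduce.Job class. A minimal sketch follows (mapper, reducer and path configuration omitted for brevity; very old releases use the Job constructor instead of Job.getInstance):
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class JobControlExample {
      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "example-job");
        // ... set mapper, reducer, input/output paths here ...

        // job.submit() would submit the job and return immediately;
        // waitForCompletion(true) submits, prints progress and blocks until the job is done.
        boolean ok = job.waitForCompletion(true);
        System.exit(ok ? 0 : 1);
      }
    }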
Q46 Decide if the statement is true or false: Each combiner runs exactly once.
Answer: False. The framework decides whether a combiner runs zero, one, or multiple times.
Q47 Define a straggler.
Answer: A straggler is a map or reduce task that takes an unusually long time to complete.
Q48 Explain what the Distributed Cache is in the MapReduce framework.
Answer: The Distributed Cache is an important feature provided by the MapReduce framework. When you want to share files across all nodes in a Hadoop cluster, the DistributedCache is used; the files could be executable JAR files or simple properties files (see the sketch below).
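A short sketch of distributing a side file to every node with the old DistributedCache API; the HDFS path and the driver class name are illustrative.
    import java.net.URI;

    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.mapred.JobConf;

    public class CacheExample {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(CacheExample.class);
        // Ship an HDFS file (illustrative path) to every task's local disk.
        DistributedCache.addCacheFile(new URI("/apps/lookup.properties"), conf);
        // Inside a task, the local copies can then be located with
        // DistributedCache.getLocalCacheFiles(conf).
      }
    }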
Q49 How does the JobTracker schedule a task?
Answer: The TaskTrackers send heartbeat messages to the JobTracker, usually every few seconds, to reassure the JobTracker that they are still alive. These messages also inform the JobTracker of the number of available slots, so the JobTracker can stay up to date with where in the cluster work can be delegated. When the JobTracker tries to find somewhere to schedule a task within the MapReduce operations, it first looks for an empty slot on the same server that hosts the DataNode containing the data; if there is none, it looks for an empty slot on a machine in the same rack.
Q50 What is ChainMapper?
Answer: The ChainMapper class is a special implementation of Mapper through which a set of mapper classes can be run in a chain within a single map task. In this chained pattern of execution, the first mapper's output becomes the input of the second mapper, the second mapper's output becomes the input of the third, and so on until the last mapper (see the sketch below).
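A hedged sketch of chaining two mappers inside one map task with the old org.apache.hadoop.mapred.lib.ChainMapper API. TokenizeMapper and LowerCaseMapper are hypothetical mapper classes used only to show how the output types of one link must match the input types of the next; check the addMapper signature against your Hadoop release.
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.ChainMapper;

    public class ChainDriver {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(ChainDriver.class);

        // First mapper in the chain: (LongWritable, Text) -> (Text, Text)
        ChainMapper.addMapper(conf, TokenizeMapper.class,
            LongWritable.class, Text.class, Text.class, Text.class,
            true, new JobConf(false));

        // Second mapper consumes the first one's output: (Text, Text) -> (Text, Text)
        ChainMapper.addMapper(conf, LowerCaseMapper.class,
            Text.class, Text.class, Text.class, Text.class,
            true, new JobConf(false));

        // ... then set the reducer, input/output paths and submit as usual ...
      }
    }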