HADOOP INTERVIEW QUESTIONS










https://data-flair.training/blogs/hadoop-interview-questions-and-answers/

Hadoop Interview Questions and Answers
This blog post on Hadoop interview questions and answers is one of the most important articles on our Hadoop blog. Interviews are a critical part of one's career, and knowing the correct answers to the questions asked gives you both knowledge and confidence. These Hadoop interview questions were prepared by the industry experts at DataFlair. We have divided the post into two parts:
1.   Hadoop Interview Questions for HDFS
2.   Hadoop Interview Questions for MapReduce
Hadoop Interview Questions for HDFS
These 50+ Hadoop interview questions and answers for HDFS cover the different components of HDFS. If you want to become a Hadoop admin or Hadoop developer, DataFlair is an appropriate place to start.
We took great care while framing these Hadoop interview questions. Do share your thoughts in the comment section below.
In this section we cover 50+ HDFS Hadoop interview questions and answers in detail: questions for freshers, questions for experienced candidates, and some advanced Hadoop interview questions as well.

HDFS Hadoop Interview Questions and Answers
Basic Questions And Answers for Hadoop Interview
1) What is Hadoop HDFS – Hadoop Distributed File System?
Hadoop Distributed File System (HDFS) is the primary storage system of Hadoop. HDFS stores very large files on a cluster of commodity hardware. It works on the principle of storing a small number of large files rather than a huge number of small files. HDFS stores data reliably even in the case of hardware failure, and it provides high-throughput access to applications by serving data in parallel.
Components of HDFS:
·         NameNode – It works as the Master in a Hadoop cluster. The NameNode stores metadata, i.e. the number of blocks, their replicas and other details. The metadata is kept in memory on the master for faster retrieval. The NameNode maintains and manages the slave nodes and assigns tasks to them. It should be deployed on reliable hardware, as it is the centerpiece of HDFS.
·         DataNode – It works as a Slave in a Hadoop cluster. In Hadoop HDFS, the DataNode is responsible for storing the actual data. It also performs read and write operations as requested by the clients. DataNodes can be deployed on commodity hardware.
2) What are the key features of HDFS?
The various Features of HDFS are:
·         Fault Tolerance – Fault tolerance is the working strength of a system in unfavorable conditions. Hadoop HDFS is highly fault-tolerant: data is divided into blocks, and multiple copies of each block are created on different machines in the cluster. If any machine in the cluster goes down due to unfavorable conditions, a client can still access the data from another machine holding the same copy of the data blocks.
·         High Availability – HDFS is a highly available file system; data is replicated among the nodes of the cluster by creating replicas of the blocks on the other slaves in the HDFS cluster. When a client wants to access data, it can read it from the slave holding the block that is nearest to it in the cluster. If a node fails, the client can still access the data from other nodes.
·         Data Reliability – HDFS is a distributed file system that provides reliable data storage. HDFS can store data in the range of hundreds of petabytes. It stores data reliably by creating a replica of every block on the nodes, and hence provides fault tolerance.
·         Replication – Data replication is one of the most important and unique features of HDFS. Replication solves the problem of data loss in unfavorable conditions such as node crashes and hardware failures.
·         Scalability – HDFS stores data on multiple nodes in the cluster; when requirements increase, we can scale the cluster. Two scalability mechanisms are available: vertical and horizontal scaling.
·         Distributed Storage – All the features above are achieved via distributed storage and replication. In HDFS, data is stored in a distributed manner across the nodes of the cluster.
3) What is the difference between NAS and HDFS?
·         Hadoop Distributed File System (HDFS) is the primary storage system of Hadoop, designed to store very large files on a cluster of commodity hardware. Network-Attached Storage (NAS), on the other hand, is a file-level computer data storage server that provides data access to a heterogeneous group of clients.
·         HDFS distributes data blocks across all the machines in a cluster, whereas NAS stores data on dedicated hardware.
·         Hadoop HDFS is designed to work with the MapReduce framework, in which computation moves to the data instead of data to the computation. NAS is not suitable for MapReduce, as it stores data separately from the computations.
·         Hadoop HDFS runs on commodity hardware, which is cost-effective, while NAS is a high-end storage device with high cost.
4) List the various daemons in an HDFS cluster.
The daemons that run in an HDFS cluster are as follows:
·         NameNode – It is the master node, responsible for storing the metadata of all files and directories. It also has information about the blocks, their locations, replicas and other details.
·         DataNode – It is the slave node that contains the actual data. The DataNode also performs read and write operations as requested by the clients.
·         Secondary NameNode – The Secondary NameNode downloads the FsImage and EditLogs from the NameNode, then merges the EditLogs with the FsImage periodically. It keeps the edit log size within a limit and stores the merged FsImage in persistent storage, which can be used in the case of NameNode failure.
5) What is NameNode and DataNode in HDFS?
NameNode – It works as the Master in a Hadoop cluster. The main functions performed by the NameNode are:
·         Stores metadata of the actual data, e.g. file name, path, number of blocks, block IDs, block locations, number of replicas, and slave-related configuration.
·         Manages the filesystem namespace.
·         Regulates client access requests for the actual file data.
·         Assigns work to the slaves (DataNodes).
·         Executes filesystem namespace operations like opening/closing files and renaming files and directories.
·         The NameNode keeps metadata in memory for fast retrieval, so it requires a huge amount of memory for its operation. It should be hosted on reliable hardware.
DataNode – It works as a Slave in a Hadoop cluster. The main functions performed by the DataNode are:
·         Stores the actual business data.
·         It is the actual worker node, so it handles reads, writes and data processing.
·         Upon instruction from the Master, it performs creation, replication and deletion of data blocks.
·         As the DataNode stores all the business data, it requires a huge amount of storage for its operation. It should be hosted on commodity hardware.
These were some general Hadoop interview questions and answers. Now let us look at some Hadoop interview questions and answers especially for freshers.
Hadoop Interview Questions and Answers for Freshers
6) What do you mean by metadata in HDFS?
In Apache Hadoop HDFS, metadata describes the structure of HDFS directories and files. It provides various information about directories and files, such as permissions and replication factor. The NameNode stores metadata in the following files:
·         FsImage – FsImage is an “image file” that contains the entire filesystem namespace, stored as a file in the NameNode’s local file system. It contains a serialized form of all the directory and file inodes in the filesystem. Each inode is an internal representation of a file's or directory's metadata.
·         EditLogs – EditLogs contain all the modifications made to the file system since the most recent FsImage. When the NameNode receives a create/update/delete request from a client, the request is first recorded in the edits file.
If you face any doubt while reading the Hadoop interview questions and answers drop a comment and we will get back to you.
7) What is Block in HDFS?
This is one of the most important Hadoop interview questions, asked in most interviews.
A block is a contiguous location on the hard drive where data is stored. In general, a FileSystem stores data as a collection of blocks. In the same way, HDFS stores each file as blocks and distributes them across the Hadoop cluster. The HDFS client does not have any control over blocks, such as block location; the NameNode decides all such things.
The default size of the HDFS block is 128 MB, which we can configure as per the requirement. All blocks of the file are of the same size except the last block, which can be the same size or smaller.
If the data size is less than the block size, the block only occupies as much space as the data needs. For example, if a file is 129 MB, two blocks will be created for it: one block of the default 128 MB size, and a second of just 1 MB rather than 128 MB, since allocating a full block would waste space. Hadoop is intelligent enough not to waste the remaining 127 MB, so it allocates a 1 MB block for the 1 MB of data.
The major advantage of storing data in such a block size is that it saves disk seek time.
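The block arithmetic described above can be sketched in a few lines of Python (an illustration only; real block allocation is decided by the NameNode):

```python
# Illustrative sketch (not Hadoop code) of how a file divides into HDFS
# blocks: every block is full-size except possibly the last, which only
# occupies the space the remaining data needs.

BLOCK_SIZE = 128 * 1024 * 1024  # default HDFS block size in bytes

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes of the blocks a file of `file_size` bytes occupies."""
    blocks = []
    remaining = file_size
    while remaining > 0:
        blocks.append(min(block_size, remaining))
        remaining -= block_size
    return blocks

# A 129 MB file: one full 128 MB block plus a 1 MB block.
sizes = split_into_blocks(129 * 1024 * 1024)
print([s // (1024 * 1024) for s in sizes])  # [128, 1]
```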
8) Why is Data Block size set to 128 MB in Hadoop?
Block size is 128 MB for the following reasons:
·         To reduce disk seeks (IO). The larger the block size, the fewer the file blocks and the fewer the disk seeks, and each block can still be transferred within a respectable time, in parallel.
·         HDFS stores huge data sets, i.e. terabytes and petabytes of data. If we took a 4 KB block size, like the Linux file system, we would have too many blocks and therefore too much metadata. Managing this huge number of blocks and metadata would create huge overhead, which is something we don't want. So the block size is set to 128 MB.
On the other hand, the block size can't be too large, because then the system would wait a very long time for the last unit of data processing to finish its work.
9) What is the difference between a MapReduce InputSplit and HDFS block?
Tip for this type of Hadoop interview question: start with the definitions of block and InputSplit, answer in a comparative manner, and then cover data representation, size and an example, again comparatively.
By definition-
·         Block- A block in Hadoop is a contiguous location on the hard drive where HDFS stores data. In general, a FileSystem stores data as a collection of blocks; in the same way, HDFS stores each file as blocks and distributes them across the Hadoop cluster.
·         InputSplit- An InputSplit represents the data an individual Mapper will process. The split is further divided into records, and each record (a key-value pair) is processed by the map function.
Data representation-
·         Block- It is the physical representation of data.
·         InputSplit- It is the logical representation of data, used during data processing in a MapReduce program or other processing techniques. Importantly, an InputSplit does not contain the input data; it is just a reference to the data.
Size-
·         Block- The default size of an HDFS block is 128 MB, which we can configure as per requirement. All blocks of a file are the same size except the last block, which can be the same size or smaller. In Hadoop, files are split into 128 MB blocks and then stored in the Hadoop filesystem.
·         InputSplit- By default, the split size is approximately equal to the block size.
Example-
Consider an example where we need to store a file in HDFS. HDFS stores files as blocks, the smallest unit of data that can be stored or retrieved from disk, with a default size of 128 MB. HDFS breaks files into blocks and stores these blocks on different nodes in the cluster. Say we have a file of 130 MB; HDFS will break this file into 2 blocks.
Now, if we want to run a MapReduce operation on the blocks, it will not process correctly, as the 2nd block is incomplete. The InputSplit solves this problem: it forms a logical grouping of blocks as a single unit, since the InputSplit includes the location of the next block and the byte offset of the data needed to complete the record.
From this we can conclude that an InputSplit is only a logical chunk of data, i.e. it holds just the information about block addresses or locations. Thus, during MapReduce execution, Hadoop scans through the blocks and creates InputSplits.
10) How can one copy a file into HDFS with a block size different from the existing block size configuration?
By using the below command one can copy a file into HDFS with a different block size:
-Ddfs.blocksize=block_size, where block_size is in bytes.
Consider an example to explain it in detail:
Suppose you want to copy a file called test.txt of size, say, 128 MB into HDFS, and for this file you want the block size to be 32 MB (33554432 bytes) in place of the default (128 MB). You can issue the following command:
hadoop fs -Ddfs.blocksize=33554432 -copyFromLocal /home/dataflair/test.txt /sample_hdfs
Now, you can check the HDFS block size associated with this file by:
hadoop fs -stat %o /sample_hdfs/test.txt
You can also check it by using the NameNode web UI for seeing the HDFS directory.
These are very common types of Hadoop interview questions faced during the interview of a fresher.
Frequently Asked Question in Hadoop Interview
11) Which one is the master node in HDFS? Can it be commodity hardware?
The NameNode is the master node in HDFS. The NameNode stores metadata and must be highly available in HDFS. It requires a large amount of memory (RAM), so the NameNode needs to be a high-end machine with good memory space. It cannot be commodity hardware, as the entire HDFS depends on it.
12) In HDFS, how does the NameNode determine which DataNode to write on?
Answer this type of Hadoop interview question briefly and to the point.
The NameNode contains metadata, i.e. the number of blocks, replicas, their locations and other details. This metadata is kept in memory on the master for faster retrieval. The NameNode maintains and manages the DataNodes and assigns tasks to them.
13) What is a Heartbeat in Hadoop?
A heartbeat is the signal that each DataNode sends to the NameNode to show that it is functioning (alive).
The NameNode and DataNodes communicate using heartbeats. If after a certain time a DataNode does not send a heartbeat to the NameNode, the node is considered dead, and the NameNode creates new replicas of that node's blocks on other DataNodes.
Heartbeats carry information about total storage capacity, the fraction of storage in use, and the number of data transfers currently in progress.
The default heartbeat interval is 3 seconds. One can change it by setting dfs.heartbeat.interval in hdfs-site.xml.
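Note that the NameNode does not declare a DataNode dead after a single missed heartbeat; per the HDFS configuration defaults, the timeout is derived from both the heartbeat interval and the NameNode's recheck interval. A quick sketch of that arithmetic:

```python
# Sketch of the timeout formula the NameNode uses before declaring a
# DataNode dead (per the HDFS defaults documentation):
#   timeout = 2 * heartbeat.recheck-interval + 10 * heartbeat.interval
# Defaults: recheck-interval = 300 s (300000 ms), heartbeat = 3 s.

def dead_node_timeout(recheck_interval_s=300, heartbeat_interval_s=3):
    return 2 * recheck_interval_s + 10 * heartbeat_interval_s

print(dead_node_timeout())  # 630 seconds, i.e. 10 minutes 30 seconds
```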
14) Can multiple clients write into a Hadoop HDFS file concurrently?
Multiple clients cannot write into a Hadoop HDFS file at the same time. Apache Hadoop follows a single-writer, multiple-reader model. When an HDFS client opens a file for writing, the NameNode grants it a lease. Now suppose some other client wants to write into that file. It asks the NameNode for a write operation. The NameNode first checks whether it has already granted the lease for writing into that file to someone else. If another client holds the lease, the write request of the new client is rejected.
15) How data or file is read in Hadoop HDFS?
To read from HDFS, the client first communicates with the NameNode for metadata. The NameNode responds with the details: number of blocks, block IDs, block locations, number of replicas. Then the client communicates with the DataNodes where the blocks are present. The client starts reading data in parallel from the DataNodes, based on the information received from the NameNode.
Once the application or HDFS client has received all the blocks of the file, it combines these blocks into a file. To improve read performance, the locations of the replicas are ordered by their distance from the client, and HDFS selects the replica closest to the client. This reduces read latency and bandwidth consumption: the client first reads a block on the same node, then from another node in the same rack, and finally from a DataNode in another rack.
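The "closest replica first" ordering described above can be sketched as follows (the node/rack names and numeric distances are made-up illustrations, not Hadoop's actual network-topology computation):

```python
# Illustrative sketch of replica ordering by network distance from the
# client: same node < same rack < remote rack. Distance values are
# simplified stand-ins for illustration.

def distance(client, replica):
    if replica["node"] == client["node"]:
        return 0        # local node
    if replica["rack"] == client["rack"]:
        return 2        # same rack
    return 4            # different rack

client = {"node": "n1", "rack": "r1"}
replicas = [
    {"node": "n7", "rack": "r3"},   # remote rack
    {"node": "n2", "rack": "r1"},   # same rack
    {"node": "n1", "rack": "r1"},   # local node
]
ordered = sorted(replicas, key=lambda r: distance(client, r))
print([r["node"] for r in ordered])  # ['n1', 'n2', 'n7']
```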
16) Does HDFS allow a client to read a file which is already opened for writing?
Yes, a client can read a file which is already opened for writing. But the problem in reading a file which is currently open for writing lies in the consistency of the data: HDFS does not guarantee that the data written into the file will be visible to a new reader. For this, one can call the hflush operation, which pushes all the data in the buffer into the write pipeline and then waits for acknowledgments from the DataNodes. By doing this, the data the client has written into the file before the hflush operation is certain to be visible to readers.
If you encounter any doubt or query in the Hadoop interview questions, feel free to ask us in the comment section below and our support team will get back to you.
17) Why is reading done in parallel in HDFS, but not writing?
Clients read data in parallel because doing so gives fast access to the data, and reading in parallel makes the system fault-tolerant. But clients do not perform writes in parallel, because parallel writes might result in data inconsistency.
Suppose you have a file and two nodes are trying to write data into it in parallel. Then the first node does not know what the second node has written, and vice versa, so we cannot determine which data to store and access.
Instead, a client in Hadoop writes data in a pipeline. A pipeline write has several benefits:
·         More efficient bandwidth consumption for the client – The client only has to transfer one replica, to the first DataNode in the pipeline. Each node then receives and sends one replica over the network (except the last DataNode, which only receives data). This results in balanced bandwidth consumption, compared to the client writing three replicas to three different DataNodes itself.
·         Smaller sent/ack window to maintain – The client maintains a much smaller sliding window, which records which blocks of the replica are being sent to the DataNodes and which blocks are awaiting acks to confirm the write. In a pipeline write, the client appears to write data to only one DataNode.
18) What is the problem with small files in Apache Hadoop?
Hadoop is not suited to small data. Hadoop HDFS lacks the ability to support random reading of small files. A small file in HDFS is one significantly smaller than the HDFS block size (default 128 MB). HDFS is designed to work with a small number of large files for storing large data sets, not a large number of small files. A large number of small files overloads the NameNode, since it stores the namespace of HDFS in memory.
Solution –
·         HAR (Hadoop Archive) Files – HAR files deal with the small-file issue. HAR introduces a layer on top of HDFS, which provides an interface for file access. Using the Hadoop archive command we can create HAR files; the command runs a MapReduce job to pack the archived files into a smaller number of HDFS files. Note that reading through files in a HAR is no more efficient than reading through files in HDFS.
·         Sequence Files – Sequence files also deal with the small-file problem. Here we use the file name as the key and the file contents as the value. Suppose we have 10,000 files, each of 100 KB; we can write a program to put them into a single sequence file, and then process them in a streaming fashion.
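The sequence-file idea can be illustrated with a toy packer (real SequenceFiles are written through Hadoop's Java API; the length-prefixed format below is a simplified stand-in for illustration only):

```python
import struct

# Conceptual sketch of the SequenceFile idea: many small files are packed
# as (filename, contents) records into one large container, so the
# NameNode tracks a single file instead of thousands.

def pack(files):
    """Pack {filename: bytes} into one length-prefixed blob."""
    out = bytearray()
    for name, data in sorted(files.items()):
        key = name.encode()
        # record = key length, value length, key bytes, value bytes
        out += struct.pack(">II", len(key), len(data)) + key + data
    return bytes(out)

def unpack(blob):
    """Recover {filename: bytes} from a packed blob (streaming-friendly)."""
    files, i = {}, 0
    while i < len(blob):
        klen, vlen = struct.unpack_from(">II", blob, i)
        i += 8
        files[blob[i:i + klen].decode()] = blob[i + klen:i + klen + vlen]
        i += klen + vlen
    return files

small = {"log-0001.txt": b"first", "log-0002.txt": b"second"}
assert unpack(pack(small)) == small   # round-trip: nothing is lost
```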
19) What is throughput in HDFS?
The amount of work done in unit time is known as throughput. The reasons HDFS provides good throughput are:
·         Hadoop works on the data locality principle, which states that computation should move to the data instead of the data to the computation. This reduces network congestion and therefore enhances overall system throughput.
·         HDFS follows a write-once, read-many model. This simplifies data coherency issues, since data written once cannot be modified, and thus provides high-throughput data access.
20) Comparison between Secondary NameNode and Checkpoint Node in Hadoop?
The Secondary NameNode downloads the FsImage and EditLogs from the NameNode, merges the EditLogs with the FsImage periodically, and stores the merged FsImage in persistent storage, so we can use the FsImage in the case of NameNode failure. However, it does not upload the merged FsImage back to the active NameNode. A Checkpoint node, on the other hand, is a node which periodically creates checkpoints of the namespace.
The Checkpoint node first downloads the FsImage and edits from the active NameNode, then merges them locally, and finally uploads the new image back to the active NameNode.
The Hadoop interview questions 7-20 above were for freshers; however, experienced candidates can also go through them to revise the basics.
21) What is a Backup node in Hadoop?
The Backup node provides the same checkpointing functionality as the Checkpoint node (a node which periodically creates checkpoints of the namespace: it downloads the FsImage and edits from the active NameNode, merges them locally, and uploads the new image back to the active NameNode). In addition, the Backup node keeps an in-memory, up-to-date copy of the file system namespace, always synchronized with the active NameNode state.
The Backup node does not need to download the FsImage and edits files from the active NameNode to create a checkpoint, as a Checkpoint node or Secondary NameNode would, since it already has an up-to-date state of the namespace in memory. The Backup node's checkpoint process is therefore more efficient, as it only needs to save the namespace into the local FsImage file and reset the edits. The NameNode supports one Backup node at a time, and no Checkpoint nodes may be registered while a Backup node is in use.
22) How does HDFS ensure Data Integrity of data blocks stored in HDFS?
Data integrity ensures the correctness of the data. However, data can get corrupted during I/O operations on disk; corruption can occur for various reasons, such as network faults or buggy software. The Hadoop HDFS client software implements checksum checking on the contents of HDFS files.
In Hadoop, when a client creates an HDFS file, it computes a checksum of each block of the file and stores these checksums in a separate hidden file in the same HDFS namespace. When a client retrieves file contents, it verifies that the data it received from each DataNode matches the checksum stored in the associated checksum file. If it does not, the client can opt to retrieve that block from another DataNode that has a replica of it.
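The write-then-verify checksum flow can be sketched as below (HDFS actually uses CRC32C over small chunks, 512 bytes by default; plain CRC32 over whole blocks is used here for brevity):

```python
import zlib

# Sketch of per-block checksumming: compute checksums at write time,
# store them separately, and re-verify on read to detect corruption.

def write_with_checksums(blocks):
    """Return one checksum per block, as HDFS stores in a hidden file."""
    return [zlib.crc32(b) for b in blocks]

def verify(blocks, checksums):
    """Return indices of blocks whose data no longer matches its checksum."""
    return [i for i, (b, c) in enumerate(zip(blocks, checksums))
            if zlib.crc32(b) != c]

blocks = [b"block-0 data", b"block-1 data"]
sums = write_with_checksums(blocks)
assert verify(blocks, sums) == []     # intact data passes verification
blocks[1] = b"block-1 dXta"           # simulate on-disk corruption
assert verify(blocks, sums) == [1]    # corrupted block is detected
```

On a checksum mismatch, the real client would fetch the block from another DataNode holding a replica.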
23) What do you mean by the NameNode High Availability in hadoop?
In Hadoop 1.x, the NameNode is a single point of failure (SPOF). If the NameNode fails, all clients are unable to read or write files or list them, and the whole Hadoop system is out of service until a new NameNode is up.
Hadoop 2.x overcomes this SPOF by providing support for multiple NameNodes. The high availability feature adds an extra NameNode (an active-standby pair) to the Hadoop architecture, configured for automatic failover. If the active NameNode fails, the standby NameNode takes over all its responsibilities, and the cluster continues to work.
The initial implementation of NameNode high availability provided for a single active/standby NameNode pair. However, some deployments require a higher degree of fault tolerance. Hadoop 3.x enables this by allowing the user to run multiple standby NameNodes: by configuring 3 NameNodes and 5 JournalNodes, the cluster can tolerate the failure of 2 NameNodes rather than 1.
24) What is Fault Tolerance in Hadoop HDFS?
Fault tolerance in HDFS is the working strength of a system in unfavorable conditions, such as the crashing of a node or hardware failure. HDFS handles faults through the process of replica creation. When a client stores a file in HDFS, the Hadoop framework divides the file into blocks, distributes the data blocks across different machines in the cluster, and creates replicas of each block on other machines in the cluster.
By default, HDFS creates 3 copies of each block on other machines in the cluster. If any machine goes down or fails due to unfavorable conditions, the user can still easily access that data from the other machines on which a replica of the block is present.
25) Describe HDFS Federation.
In Hadoop 1.0, the HDFS architecture allows only a single namespace for the entire cluster.
Limitations-
·         The namespace layer and storage layer are tightly coupled. This makes an alternate implementation of the NameNode difficult, and restricts other services from using block storage directly.
·         The namespace is not scalable like the DataNodes. Scaling an HDFS cluster is done horizontally by adding DataNodes, but we can't add more namespaces to an existing cluster.
·         There is no separation of namespaces, so there is no isolation among the tenant organizations using the cluster.
In Hadoop 2.0, HDFS Federation overcomes these limitations by supporting multiple NameNodes/namespaces, scaling the namespace horizontally. HDFS Federation also isolates different categories of applications and users into different namespaces, and improves read/write throughput by adding more NameNodes.
26) What is the default replication factor in Hadoop and how will you change it?
The default replication factor is 3. One can change the replication factor in the following three ways:
·         By adding this property to hdfs-site.xml:
<property>
  <name>dfs.replication</name>
  <value>5</value>
  <description>Block Replication</description>
</property>
·         One can also change the replication factor on a per-file basis using the command:
hadoop fs -setrep -w 3 /file_location
·         One can also change the replication factor for all the files in a directory by using:
hadoop fs -setrep -w 3 -R /directory_location
27) Why Hadoop performs replication, although it results in data redundancy?
In HDFS, replication provides fault tolerance; it is one of the unique features of HDFS. Data replication solves the issue of data loss in unfavorable conditions such as hardware failure or the crashing of a node.
By default, HDFS creates 3 replicas of each block across the cluster, and we can change this as per need. So if any node goes down, we can recover the data on that node from the other nodes. Replication does lead to the consumption of a lot of space, but the user can always add more nodes to the cluster if required; free-space issues are rare in practical clusters, as the very first reason to deploy HDFS was to store huge data sets. One can also change the replication factor to save HDFS space, or use the different codecs provided by Hadoop to compress the data.
28) What is Rack Awareness in Apache Hadoop?
In Hadoop, rack awareness improves network traffic while reading/writing files. With rack awareness, the NameNode chooses DataNodes that are in the same rack or a nearby rack. The NameNode obtains rack information by maintaining the rack IDs of the DataNodes, and chooses DataNodes based on this rack information.
The HDFS NameNode makes sure that not all replicas are stored on a single rack. It follows the rack awareness algorithm to reduce latency as well as to improve fault tolerance.
The default replication factor is 3, so according to the rack awareness algorithm:
·         The first replica of a block is stored on a local rack.
·         The next replica is stored on another DataNode within the same rack.
·         The third replica is stored on a different rack.
In Hadoop, we need rack awareness because it improves:
·         Data high availability and reliability.
·         The performance of the cluster.
·         Network bandwidth usage.
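The three-replica placement rule listed above can be sketched as follows (node and rack names are hypothetical, and the real NameNode placement policy handles many more cases):

```python
# Illustrative sketch of replica placement with replication factor 3,
# following the rule described in the answer: first replica on the
# writer's node, second on another node in the same rack, third on a
# node in a different rack.

def place_replicas(writer, topology):
    """topology: dict rack -> list of nodes. Returns 3 (rack, node) choices."""
    first = (writer["rack"], writer["node"])
    same_rack = [n for n in topology[writer["rack"]] if n != writer["node"]]
    second = (writer["rack"], same_rack[0])
    other_rack = next(r for r in topology if r != writer["rack"])
    third = (other_rack, topology[other_rack][0])
    return [first, second, third]

topology = {"r1": ["n1", "n2"], "r2": ["n3", "n4"]}
writer = {"rack": "r1", "node": "n1"}
print(place_replicas(writer, topology))
# [('r1', 'n1'), ('r1', 'n2'), ('r2', 'n3')]
```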
29) Explain the Single point of Failure in Hadoop?
In Hadoop 1.0, the NameNode is a single point of failure (SPOF). If the NameNode fails, all clients are unable to read or write files, and the whole Hadoop system is out of service until a new NameNode is up.
Hadoop 2.0 overcomes this SPOF by providing support for multiple NameNodes. The high availability feature adds an extra NameNode to the Hadoop architecture and provides automatic failover: if the active NameNode fails, the standby NameNode takes over all the responsibilities of the active node, and the cluster continues to work.
The initial implementation of NameNode high availability provided for a single active/standby NameNode pair. However, some deployments require a higher degree of fault tolerance. The new version 3.0 enables this by allowing the user to run multiple standby NameNodes: by configuring 3 NameNodes and 5 JournalNodes, the cluster can tolerate the failure of 2 NameNodes rather than 1.
30) Explain Erasure Coding in Apache Hadoop?
By default, HDFS replicates each block three times. Replication provides a simple form of redundancy to protect against DataNode failure, but it is very expensive: the 3x replication scheme results in 200% overhead in storage space and other resources.
Hadoop 3.x introduced a new feature called “Erasure Coding” to use in place of replication. It provides the same level of fault tolerance with much less storage, at around 50% storage overhead.
Erasure Coding borrows from Redundant Array of Inexpensive Disks (RAID), which implements erasure coding through striping: logically sequential data (such as a file) is divided into smaller units (such as a bit, byte or block) and stored on different disks.
Encoding – For each stripe of data cells, parity cells are calculated and stored, and errors are recovered through the parity. Erasure coding extends a message with redundant data for fault tolerance. Its codec operates on uniformly sized data cells: the codec takes a number of data cells as input and produces parity cells as output.
There are two algorithms available for Erasure Coding:
·         XOR Algorithm
·         Reed-Solomon Algorithm
31) What is Balancer in Hadoop?
Data may not always be distributed uniformly across the DataNodes in HDFS, for the following reasons:
·         A lot of deletes and writes
·         Disk replacement
The block allocation strategy tries hard to spread new blocks uniformly among all the DataNodes. In a large cluster, each node has a different capacity, and quite often you need to delete some old nodes and add new nodes for more capacity.
The addition of a new DataNode can become a bottleneck because, when the Hadoop framework allocates all the new blocks to the new DataNode and reads from it, the new DataNode gets overloaded.
HDFS provides a tool called the Balancer that analyzes block placement and rebalances data across the DataNodes.
These are very common types of Hadoop interview questions faced during the interview of an experienced professional.
32) What is Disk Balancer in Apache Hadoop?
Disk Balancer is a command line tool, which distributes data evenly on all disks of a datanode. This tool operates against a given datanode and moves blocks from one disk to another.
Disk balancer works by creating and executing a plan (a set of statements) on the datanode. The plan describes how much data should move between two disks. A plan is composed of multiple move steps; each move step has a source disk, a destination disk and the number of bytes to move. The plan is executed against an operational datanode.
By default, disk balancer is not enabled; to enable it, dfs.disk.balancer.enabled must be set to true in hdfs-site.xml.
When we write a new block in HDFS, the datanode uses a volume-choosing policy to choose the disk for the block (each directory is a volume in HDFS terminology). The two such policies are: Round-robin and Available space.
·         Round-robin distributes the new blocks evenly across the available disks.
·         Available space writes data to the disk that has maximum free space (by percentage).
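A toy Python sketch of the two policies (illustrative only; the real policies are internal DataNode classes, and the actual available-space policy works on free-space percentage with a configurable threshold):

```python
import itertools

# Round-robin: cycle through the disks in order, one block per disk.
def round_robin_chooser(num_disks):
    cycler = itertools.cycle(range(num_disks))
    return lambda block_size: next(cycler)

# Available space: always pick the disk with the most free bytes left.
def available_space_chooser(free_bytes):
    def choose(block_size):
        disk = max(range(len(free_bytes)), key=lambda d: free_bytes[d])
        free_bytes[disk] -= block_size
        return disk
    return choose

rr = round_robin_chooser(3)
assert [rr(128) for _ in range(4)] == [0, 1, 2, 0]

avail = available_space_chooser([500, 100, 300])
assert [avail(128) for _ in range(3)] == [0, 0, 2]
```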
33) What is active and passive NameNode in Hadoop?
In Hadoop 1.0, the NameNode is a single point of failure (SPOF). If the namenode fails, all clients are unable to read, write or list files, and the whole Hadoop system is out of service until a new namenode is brought up.
Hadoop 2.0 overcomes this SPOF by providing support for multiple NameNodes. The high availability feature adds an extra NameNode to the Hadoop architecture for automatic failover.
·         Active NameNode – the NameNode that runs in the cluster and is responsible for all client operations.
·         Passive NameNode – a standby namenode that holds the same data as the active NameNode. It acts as a slave and maintains enough state to provide a fast failover if necessary.
If Active NameNode fails, then Passive NameNode takes all the responsibility of active node. The cluster works continuously.

34) How is indexing done in Hadoop HDFS?
Apache Hadoop has a unique way of indexing. Once data is stored as per the block size, HDFS keeps storing the last part of each chunk of data, which indicates where the next part of the data will be. In fact, this is the basis of HDFS.
35) What is a Block Scanner in HDFS?
A Block scanner verifies whether the data blocks stored on each DataNode are correct. When the Block scanner detects a corrupted data block, the following steps occur:
·         First, the DataNode reports the corrupted block to the NameNode.
·         The NameNode then starts the process of creating a new replica, using a correct replica of the corrupted block present on another DataNode.
·         When the replication count of the correct replicas matches the replication factor (3), the corrupted block is deleted.
36) How to perform the inter-cluster data copying work in HDFS?
HDFS uses the distributed copy command to perform inter-cluster data copying:
hadoop distcp hdfs://<source NameNode> hdfs://<target NameNode>
DistCp (distributed copy) is a tool used for large inter/intra-cluster copying. It uses MapReduce to effect its distribution, error handling, recovery and reporting. The tool expands a list of files and directories into the input to map tasks.
37) What are the main properties of hdfs-site.xml file?
hdfs-site.xml – It specifies configuration setting for HDFS daemons in Hadoop. It also provides default block replication and permission checking on HDFS.
The three main hdfs-site.xml properties are:
1.   dfs.name.dir gives the location where the NameNode stores its metadata (FsImage and edit logs), on local disk or on a remote directory.
2.   dfs.data.dir gives the location where a DataNode stores its data.
3.   fs.checkpoint.dir is the directory on the file system where the secondary NameNode stores temporary images of the edit logs.
38) How can one check whether NameNode is working or not?
One can check the status of the HDFS NameNode in several ways. Most commonly, one uses the jps command to check the status of all daemons running in HDFS.
39) How would you restart NameNode?
NameNode is also known as Master node. It stores meta-data i.e. number of blocks, replicas, and other details. NameNode maintains and manages the slave nodes, and assigns tasks to them.
By following two methods, you can restart NameNode:
·         Stop the NameNode individually using the ./sbin/hadoop-daemon.sh stop namenode command, then start it using ./sbin/hadoop-daemon.sh start namenode.
·         Use ./sbin/stop-all.sh followed by ./sbin/start-all.sh, which stops all the daemons first and then starts them all again.
The above Hadoop interview questions and answers were for experienced candidates, but freshers can also refer to them for in-depth knowledge. Now let's move forward with some advanced Hadoop interview questions and answers.
Advanced Questions for Hadoop Interview
40) How NameNode tackle Datanode failures in Hadoop?
HDFS has a master-slave architecture in which the master is the namenode and the slaves are datanodes. An HDFS cluster has a single namenode that manages the file system namespace (metadata) and multiple datanodes that store the actual data in HDFS and perform read-write operations as requested by clients.
The NameNode receives a Heartbeat and a block report from each Datanode. Receipt of a heartbeat implies that the datanode is alive and functioning properly, and a block report contains a list of all blocks on that datanode. When the NameNode observes that a DataNode has not sent a heartbeat message for a certain amount of time, it marks the datanode as dead and replicates the blocks of the dead node to other datanodes using the replicas created earlier. Hence, the NameNode can easily handle DataNode failure.
41) Is Namenode machine same as DataNode machine as in terms of hardware in Hadoop?
The NameNode is a highly available server, unlike a DataNode. The NameNode manages the File System Namespace and maintains the metadata: the number of blocks, their locations, replicas and other details. It also executes file system namespace operations such as opening, closing and renaming files/directories.
Because of this, the NameNode requires higher RAM for storing the metadata of millions of files, whereas a DataNode, which stores the actual data in HDFS and performs read/write operations as requested by clients, needs higher disk capacity for storing huge data sets.
42) If DataNode increases, then do we need to upgrade NameNode in Hadoop?
The Namenode stores metadata, i.e. the number of blocks, their locations and replicas. In Hadoop, metadata is kept in the master's memory for faster retrieval. The NameNode manages and maintains the slave nodes, assigns tasks to them and regulates clients' access to files.
It also executes file system namespace operations such as opening, closing and renaming files/directories. During Hadoop installation, the NameNode is sized based on the size of the cluster. Mostly we do not need to upgrade the NameNode when DataNodes increase, because it stores only the metadata, not the actual data; such a requirement rarely arises.
43) Explain what happens if, during the PUT operation, HDFS block is assigned a replication factor 1 instead of the default value 3?
The replication factor can be set for the entire cluster to adjust the number of replicas of each block; it ensures high data availability.
With a replication factor of n, the cluster holds n copies (n-1 duplicates) of every block in HDFS. So if the replication factor during the PUT operation is set to 1 in place of the default value 3, there will be only a single copy of the data, and if the DataNode holding it crashes under any circumstances, that single copy of the data is lost.
44) What are file permissions in HDFS? how does HDFS check permissions for files/directory?
Hadoop distributed file system (HDFS) implements a permissions model for files/directories.
For each file/directory, one can manage permissions for a set of 3 distinct user classes: owner, group, and others.
There are 3 different permissions for each user class: read (r), write (w) and execute (x).
·         For files, the w permission is to write to the file and the r permission is to read the file.
·         For directories, the w permission is to create or delete entries in the directory, and the r permission is to list the contents of the directory.
·         The x permission is to access a child of the directory.
HDFS check permissions for files or directory:
·         If the user name matches the owner of the file/directory, Hadoop tests the owner permissions.
·         If the group matches the directory's group, then Hadoop tests the user's group permissions.
·         Hadoop tests the "other" permissions when neither the owner nor the group names match.
·         If none of the permissions checks succeed, then client’s request is denied.
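The check order above can be sketched in toy Python (illustrative only; the field names are made up, not HDFS internals):

```python
def check_access(user, user_groups, inode, wanted):
    """Pick ONE permission class (owner -> group -> other), then test it."""
    if user == inode["owner"]:
        perms = inode["owner_perms"]
    elif inode["group"] in user_groups:
        perms = inode["group_perms"]
    else:
        perms = inode["other_perms"]
    return wanted in perms          # denied if the class lacks the permission

f = {"owner": "alice", "group": "analysts",
     "owner_perms": "rw", "group_perms": "r", "other_perms": ""}
assert check_access("alice", [], f, "w")          # owner may write
assert check_access("bob", ["analysts"], f, "r")  # group may read
assert not check_access("eve", [], f, "r")        # others denied
```

Note that only one class is tested: a group member whose group lacks a permission is denied even if "others" would have been allowed.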
45) How one can format Hadoop HDFS?
One can format HDFS by using the bin/hadoop namenode -format command.
The bin/hadoop namenode -format command formats the HDFS via the NameNode.
Formatting means initializing the directory specified by the dfs.name.dir variable. When you run this command on an existing file system, you will lose all the data stored on your NameNode.
The Hadoop NameNode directory contains the FsImage and edit files, which hold the basic information about the Hadoop file system, such as which user created which files.
When we format the NameNode, it deletes this information from the directory specified in hdfs-site.xml as dfs.namenode.name.dir. Formatting a NameNode does not format the DataNodes.
NOTE: Never format an up-and-running Hadoop filesystem; you will lose all data stored in HDFS.
46) What is the process to change the files at arbitrary locations in HDFS?
HDFS doesn't support modifications at arbitrary offsets in a file, or multiple writers. Files are written by a single writer in append-only fashion: writes to a file in HDFS are always made at the end of the file.
47) Differentiate HDFS & HBase.
Data write process
·         HDFS- Append method
·         HBase- Bulk incremental, random write
Data read process
·         HDFS- Table scan
·         HBase- Table scan/random read/small range scan
Hive SQL querying
·         HDFS- Excellent
·         HBase- Average
These are some advanced Hadoop interview questions and answers for HDFS that will help you answer many more interview questions in the best manner.

48) What is meant by streaming access?
HDFS works on the principle of "write once, read many". Its focus is on fast and accurate data retrieval. Streaming access means reading the complete data in sequence, instead of retrieving a single record from the database.
49) How to transfer data from Hive to HDFS?
One can transfer data from Hive by writing the query:
hive> insert overwrite directory '/' select * from emp;
Hence, the output you receive will be stored in part files in the specified HDFS path.
50) How to add/delete a Node to the existing cluster?
To add a Node to the existing cluster follow:
Add the hostname/IP address in the dfs.hosts/slaves file. Then, refresh the cluster with
$hadoop dfsadmin -refreshNodes
To delete a Node to the existing cluster follow:
Add the hostname/IP address to dfs.hosts.exclude and remove the entry from the slaves file. Then, refresh the cluster with
$hadoop dfsadmin -refreshNodes
51) How to format the HDFS? How frequently it will be done?
Answers to this type of Hadoop interview question should be short and to the point; a lengthy answer here is unnecessary and may count against you.
$hadoop namenode -format
Note: Format the HDFS only once, during initial cluster setup.
52) What is the importance of dfs.namenode.name.dir in HDFS?
dfs.namenode.name.dir contains the fsimage file for the namenode.
We should configure it to write to at least two filesystems, on separate physical hosts, because if we lose the FsImage file we lose the entire HDFS file system; there is no other recovery mechanism if no FsImage file is available.
Numbers 40-52 were the advanced Hadoop interview questions and answers, to give you in-depth knowledge for handling difficult Hadoop interviews.
This was all about the Hadoop Interview Questions and Answers
These questions are frequently asked Hadoop interview questions and answers. You can read here some more Hadoop HDFS interview questions and answers. 
After going through these top Hadoop interview questions and answers you will be able to confidently face an interview and answer the Hadoop interview questions asked in the best manner. These Hadoop Interview Questions are suggested by the experts at DataFlair.
Key –
Q.1 – Q.5 Basic Hadoop Interview Questions
Q.6 – Q.10 HDFS Hadoop interview questions and answers for freshers
Q. 11- Q. 20 Frequently asked Questions in Hadoop Interview
Q.21 – Q.39 were the HDFS Hadoop interview questions and answers for experienced
Q.40 – Q.52 were the advanced HDFS Hadoop interview questions and answers
These Hadoop interview questions and answers are categorized so that you can pay more attention to questions specified for you, however, it is recommended that you go through all the Hadoop interview questions and answers for complete understanding.
If you have any more doubt or query on Hadoop Interview Questions and Answers, Drop a comment and our support team will be happy to help you. Now let’s jump to our second part of Hadoop Interview Questions i.e. MapReduce Interview Questions and Answers.
Hadoop Interview Questions and Answers for MapReduce
It can be difficult to pass a Hadoop interview, as Hadoop is a fast-growing technology. To get you through this tough path, the MapReduce Hadoop interview questions and answers below will serve as the backbone. This section contains the commonly asked MapReduce Hadoop interview questions and answers.
In this section on MapReduce Hadoop interview questions and answers, we have covered 50+ Hadoop interview questions and answers for MapReduce in detail. We have covered MapReduce Hadoop interview questions and answers for freshers, MapReduce Hadoop interview questions and answers for experienced as well as some advanced Mapreduce Hadoop interview questions and answers.
These 50 MapReduce Hadoop Interview Questions are framed by keeping in mind the need of an era, and the trending pattern of the interview that is being followed by the companies. The interview questions of Hadoop MapReduce are dedicatedly framed by the company experts to help you to reach your goal.
All the best!!!!
Top 50 MapReduce Hadoop Interview Questions and Answers for Hadoop Jobs.
Basic MapReduce Hadoop Interview Questions and Answers
53) What is MapReduce in Hadoop?
Hadoop MapReduce is the data processing layer. It is the framework for writing applications that process the vast amount of data stored in the HDFS.
It processes a huge amount of data in parallel by dividing the job into a set of independent tasks (sub-job). In Hadoop, MapReduce works by breaking the processing into phases: Map and Reduce.
·         Map – It is the first phase of processing, in which we specify all the complex logic/business rules/costly code. Map takes a set of data and converts it into another set of data, breaking individual elements into tuples (key-value pairs).
·         Reduce – It is the second phase of processing, in which we specify light-weight processing like aggregation/summation. The output from the map is the input to the Reducer. The Reducer then combines tuples (key-value pairs) based on the key and modifies the value of the key accordingly.
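The two phases can be sketched as a minimal in-memory word count (illustrative Python, not the Hadoop Java API):

```python
from collections import defaultdict

def map_phase(lines):
    for line in lines:            # Map: emit a (word, 1) tuple for every word
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    groups = defaultdict(list)    # the framework groups values by key
    for key, value in pairs:
        groups[key].append(value)
    return {key: sum(vals) for key, vals in groups.items()}  # Reduce: sum per key

counts = reduce_phase(map_phase(["to be or", "not to be"]))
assert counts == {"to": 2, "be": 2, "or": 1, "not": 1}
```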
54) What is the need of MapReduce in Hadoop?
In Hadoop, once we have stored the data in HDFS, the first question that arises is how to process it. Transferring all this data to a central node for processing is not going to work; we would wait forever for the data to transfer over the network. Google faced this same problem with its distributed Google File System (GFS) and solved it using the MapReduce data processing model.
Challenges before MapReduce
·         Time-consuming – Using a single machine, we cannot analyze terabytes of data, as it would take a very long time.
·         Costly – Keeping all the data (terabytes) on one server or as a database cluster is very expensive, and also hard to manage.
MapReduce overcome these challenges
·         Time-efficient – If we want to analyze the data, we can write the analysis code in the Map function and the aggregation code in the Reduce function and execute it. This MapReduce code goes to every machine that has a part of our data and executes on that specific part. Hence, instead of moving terabytes of data, we just move kilobytes of code, which is time-efficient.
·         Cost-efficient – It distributes the data over multiple low-cost machines.
Hadoop MapReduce Job Execution Flow Diagram
55) What is Mapper in Hadoop?
A Mapper task processes each input record (from the RecordReader) and generates a key-value pair. The key-value pairs generated by the mapper are completely different from the input pair. The Mapper stores its intermediate output on the local disk, not on HDFS; it is temporary data, and writing it to HDFS would create unnecessary multiple copies.
The Mapper only understands key-value pairs of data, so before passing data to the mapper, the framework first converts the data into key-value pairs. InputSplit and RecordReader do this conversion: an InputSplit is the logical representation of data, and the RecordReader communicates with the InputSplit and converts the data into key-value pairs. Hence:
·         Key is a reference to the input value.
·         Value is the data set on which to operate.
The number of maps depends on the total size of the input, i.e. the total number of blocks of the input files:
Mappers = (total data size) / (input split size)
For example, if data size = 1 TB and input split size = 100 MB, then Mappers = (1000 * 1000) / 100 = 10,000.
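The mapper-count arithmetic can be spelled out (sizes in MB, using the decimal 1 TB = 1,000,000 MB convention the example uses):

```python
data_size_mb = 1 * 1000 * 1000   # 1 TB of input data
split_size_mb = 100              # input split size
mappers = data_size_mb // split_size_mb
assert mappers == 10_000
```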
56) What is Reducer in Hadoop?
Reducer takes the output of the Mapper (intermediate key-value pair) as the input. After that, it runs a reduce function on each of them to generate the output. Thus the output of the reducer is the final output, which it stored in HDFS. Usually, in Reducer, we do aggregation or summation sort of computation. Reducer has three primary phases-
·         Shuffle – The framework fetches the relevant partition of the output of all the Mappers for each reducer, via HTTP.
·         Sort – The framework groups the Reducer's inputs by key in this phase. The shuffle and sort phases occur simultaneously.
·         Reduce – After shuffling and sorting, the reduce task aggregates the key-value pairs: the reduce(Object, Iterator, OutputCollector, Reporter) method is called for each <key, (list of values)> pair in the grouped inputs.
With the help of Job.setNumReduceTasks(int), the user sets the number of reducers for the job.
A good number of reducers is 0.95 or 1.75 multiplied by (<no. of nodes> * <no. of maximum containers per node>).
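As arithmetic, with made-up example values for the node and container counts:

```python
nodes = 10
max_containers_per_node = 8

# 0.95x: all reducers launch at once and finish in a single wave.
low = round(0.95 * nodes * max_containers_per_node)
# 1.75x: faster nodes finish their first round and start a second wave.
high = round(1.75 * nodes * max_containers_per_node)

assert (low, high) == (76, 140)
```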
57) How to set mappers and reducers for MapReduce jobs?
One can configure JobConf to set number of mappers and reducers.
·         For Mapper – job.setNumMapTasks()
·         For Reducer – job.setNumReduceTasks()
These were some general MapReduce Hadoop interview questions and answers. Now let us take some Mapreduce Hadoop interview questions and answers specially for freshers.
MapReduce Hadoop Interview Question and Answer for Freshers
58) What is the key- value pair in Hadoop MapReduce?
Hadoop MapReduce implements a data model that represents data as key-value pairs. Both input and output to the MapReduce framework must be in key-value pairs. In Hadoop, if a schema is static we can work directly on columns instead of key-values; but when the schema is not static, we work on keys and values. Keys and values are not intrinsic properties of the data: the user analyzing the data chooses the key-value pair.
A key-value pair in Hadoop MapReduce is generated in the following way:
·         InputSplit – It is the logical representation of data. InputSplit represents the data which individual Mapper will process.
·         RecordReader – It converts the split into records which are in form of Key-value pairs. That is suitable for reading by the mapper.
By default, the RecordReader of TextInputFormat converts data into key-value pairs as follows:
·         Key – It is the byte offset of the beginning of the line within the file.
·         Value – It is the contents of the line, excluding line terminators.
For example, if the file content is: on the top of the crumpetty Tree
Key- 0
Value- on the top of the crumpetty Tree
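The offset/line derivation can be sketched in Python (illustrative, not the actual LineRecordReader; the second input line is invented for the example):

```python
def text_records(data: bytes):
    """Yield (byte offset of line start, line without terminator) pairs."""
    offset = 0
    for line in data.splitlines(keepends=True):
        yield offset, line.rstrip(b"\r\n").decode()
        offset += len(line)

records = list(text_records(b"on the top of the crumpetty Tree\nsat the Quangle\n"))
assert records[0] == (0, "on the top of the crumpetty Tree")
assert records[1][0] == 33   # second line starts after 32 chars + the newline
```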
59) What is the need of key-value pair to process the data in MapReduce?
Hadoop MapReduce works on unstructured and semi-structured data apart from structured data. One can read the Structured data like the ones stored in RDBMS by columns.
But handling unstructured data is feasible using key-value pairs. The very core of MapReduce works on the basis of these pairs: the framework maps data into a collection of key-value pairs in the mapper, and the reducer operates on all the pairs with the same key.
In most computations, the map operation is applied to each logical "record" in the input to compute a set of intermediate key-value pairs; then a reduce operation is applied to all the values that share the same key, to combine the derived data appropriately.
Thus, we can say that key-value pairs are the best solution to work on data problems on MapReduce.
60) What are the most common InputFormats in Hadoop?
In Hadoop, Input files store the data for a MapReduce job. Input files which stores data typically reside in HDFS. Thus, in MapReduce, InputFormat defines how these input files split and read. InputFormat creates InputSplit.
Most common InputFormat are:
·         FileInputFormat – It is the base class for all file-based InputFormats. It specifies the input directory where the data files are present, reads all the files, and then divides them into one or more InputSplits.
·         TextInputFormat – It is the default InputFormat of MapReduce. It uses each line of each input file as a separate record and performs no parsing.
Key- byte offset.
Value- It is the contents of the line, excluding line terminators.
·         KeyValueTextInputFormat – It also treats each line of input as a separate record. The main difference is that TextInputFormat treats the entire line as the value, while KeyValueTextInputFormat breaks the line itself into key and value at the tab character ('\t').
Key- Everything up to tab character.
Value- Remaining part of the line after tab character.
·         SequenceFileInputFormat – It reads sequence files.
Key & Value- Both are user-defined.
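The tab split performed by KeyValueTextInputFormat can be sketched in Python (illustrative; the sample line is made up):

```python
def key_value_record(line: str, separator: str = "\t"):
    """Split on the FIRST separator: key before it, value after it."""
    key, _, value = line.partition(separator)
    return key, value

# Only the first tab splits; later tabs stay inside the value.
assert key_value_record("user42\tclicked\thome") == ("user42", "clicked\thome")
# A line with no tab becomes the key, with an empty value.
assert key_value_record("lonely-line") == ("lonely-line", "")
```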
61) Explain InputSplit in Hadoop?
The InputFormat creates InputSplits. An InputSplit is the logical representation of data; the Hadoop framework further divides it into records, and the mapper processes each record. The size of a split is approximately equal to the HDFS block size (128 MB). In a MapReduce program the split size is user-configurable, so the user can control it based on the size of the data.
An InputSplit in MapReduce has a length in bytes and a set of storage locations (hostname strings). The framework uses the storage locations to place map tasks as close to the split's data as possible, and processes splits in order of size, largest first, to minimize the job runtime. The important thing is that an InputSplit is just a reference to the data; it does not contain the input data itself.
The client running the job calculates the splits by calling getSplits(), then sends them to the application master, which uses their storage locations to schedule map tasks that process them on the cluster. A map task passes its split to the createRecordReader() method to obtain a RecordReader for the split; the RecordReader generates records (key-value pairs) and passes them to the map function.
62) Explain the difference between a MapReduce InputSplit and HDFS block.
Tip for this type of MapReduce Hadoop interview question: start with the definitions of Block and InputSplit, answer in a comparative style, and then cover data representation, size and an example, also comparatively.
By definition-
·         Block – It is the smallest unit of data that the file system store. In general, FileSystem store data as a collection of blocks. In a similar way, HDFS stores each file as blocks, and distributes it across the Hadoop cluster.
·         InputSplit – InputSplit represents the data which individual Mapper will process. Further split divides into records. Each record (which is a key-value pair) will be processed by the map.
Size-
·         Block – The default size of the HDFS block is 128 MB which is configured as per our requirement. All blocks of the file are of the same size except the last block. The last Block can be of same size or smaller. In Hadoop, the files split into 128 MB blocks and then stored into Hadoop Filesystem.
·         InputSplit – Split size is approximately equal to block size, by default.
Data representation-
·         Block – It is the physical representation of data.
·         InputSplit – It is the logical representation of data. Thus, during data processing in MapReduce program or other processing techniques use InputSplit. In MapReduce, important thing is that InputSplit does not contain the input data. Hence, it is just a reference to the data.
63) What is the purpose of RecordReader in hadoop?
The RecordReader in Hadoop reads the data within the boundaries defined by the InputSplit and creates key-value pairs for the mapper. The "start" is the byte position in the file at which the RecordReader should start generating key-value pairs, and the "end" is where it should stop reading records.
The RecordReader in a MapReduce job loads data from its source and converts it into key-value pairs suitable for reading by the mapper. It communicates with the InputSplit until it has read the complete split. The MapReduce framework obtains the RecordReader instance from the InputFormat; by default, TextInputFormat's RecordReader converts data into key-value pairs.

TextInputFormat provides 2 types of RecordReader: LineRecordReader and SequenceFileRecordReader.
LineRecordReader in Hadoop is the default RecordReader that TextInputFormat provides. Hence, each line of the input file is the new value and the key is byte offset.
SequenceFileRecordReader in Hadoop reads data specified by the header of a sequence file.
64) What is Combiner in Hadoop?
In a MapReduce job, the Mapper generates large chunks of intermediate data that are passed to the Reducer for further processing, which leads to enormous network congestion. The Hadoop MapReduce framework provides a function known as a Combiner that plays a key role in reducing this congestion.
The Combiner in Hadoop is also known as a mini-reducer; it performs local aggregation on the mapper's output, which reduces the data transferred between mapper and reducer and increases efficiency.
There is no guarantee of the Combiner's execution: Hadoop may or may not execute a combiner, and if required it may execute it more than once. Hence, your MapReduce jobs should not depend on the Combiner's execution.
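What a combiner buys can be sketched in Python (illustrative, not the Hadoop API): each mapper pre-aggregates its own output, so fewer records cross the network in the shuffle.

```python
from collections import Counter

mapper_output = [("be", 1), ("to", 1), ("be", 1), ("be", 1)]

# Without a combiner, all four pairs are shuffled to the reducers.
# With one, the mapper pre-sums its own pairs first:
combined = list(Counter(k for k, _ in mapper_output).items())

assert sorted(combined) == [("be", 3), ("to", 1)]
assert len(combined) < len(mapper_output)   # fewer records shuffled
```

Because summation is associative and commutative, the final reduce result is identical whether the combiner runs zero, one or several times, which is exactly why jobs must not depend on its execution.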
65) Explain about the partitioning, shuffle and sort phase in MapReduce?
Partitioning Phase – Partitioning ensures that all the values for each key are grouped together and that all the values of a single key go to the same Reducer, thus allowing even distribution of the map output over the Reducers.
Shuffle Phase – It is the process by which the system sorts the key-value output of the map tasks and transfers it to the reducers.
Sort Phase – The mapper generates intermediate key-value pairs. Before the Reducer starts, the MapReduce framework sorts these key-value pairs by key. This helps the reducer easily distinguish when a new reduce task should start, saving time for the reducer.
66) What does a “MapReduce Partitioner” do?
The Partitioner comes into the picture when we are working with more than one reducer. It controls the partitioning of the keys of the intermediate map outputs: the key (or a subset of the key) is used to derive the partition, typically by a hash function. Partitioning ensures that all the values for each key are grouped together and that all the values of a single key go to the same reducer, allowing even distribution of the map output over the reducers. It redirects the mapper output to the reducers by determining which reducer is responsible for a particular key.
The total number of partitions is equal to the number of Reducers; the Partitioner divides the data according to the number of reducers, and a single reducer processes the data from a single partition.
67) If no custom partitioner is defined in Hadoop then how is data partitioned before it is sent to the reducer?
Hadoop MapReduce by default uses the 'HashPartitioner'.
It uses the hashCode() method to determine to which partition a given (key, value) pair will be sent, via a method called getPartition.
It computes key.hashCode() & Integer.MAX_VALUE and takes the modulus by the number of reduce tasks. For example, if there are 10 reduce tasks, getPartition will return values 0 through 9 for all keys.
public class HashPartitioner<K, V> extends Partitioner<K, V> {
  public int getPartition(K key, V value, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}
These are very common types of MapReduce Hadoop interview questions faced during the interview of a fresher.
68) How to write a custom partitioner for a Hadoop MapReduce job?
This is one of the most common MapReduce Hadoop interview questions.
A custom partitioner distributes the results across the reducers based on a user-defined condition.
By setting a Partitioner to partition by the key, we can guarantee that records for the same key will go to the same reducer; it ensures that only one reducer receives all the records for that particular key.
By the following steps, we can write Custom partitioner for a Hadoop MapReduce job:
·         Create a new class that extends Partitioner Class.
·         Then, override the getPartition method in the wrapper that runs in the MapReduce job.
·         Add the custom partitioner to the job by using the setPartitionerClass method, or add it to the job as a config file.
69) What is shuffling and sorting in Hadoop MapReduce?
Shuffling and Sorting takes place after the completion of map task. Shuffle and sort phase in Hadoop occurs simultaneously.
·         Shuffling – Shuffling is the process by which the system sorts the key-value output of the map tasks and transfers it to the reducer. The shuffle phase is necessary for the reducers; otherwise, they would not have any input. Shuffling can start even before the map phase has finished, which saves some time and completes the task in less time.
·         Sorting – The mapper generates intermediate key-value pairs. Before the reducer starts, the MapReduce framework sorts these key-value pairs by key. This helps the reducer easily distinguish when a new reduce task should start, saving time for the reducer.
Shuffling and sorting are not performed at all if you specify zero reducers (setNumReduceTasks(0)).
70) Why aggregation cannot be done in Mapper?
The Mapper processes each input record (from the RecordReader) and generates a key-value pair, storing its intermediate output on the local disk.
We cannot perform aggregation in the mapper because:
·         Sorting takes place only before the Reducer function; there is no provision for sorting in the map phase, and without sorting, aggregation is not possible.
·         To perform aggregation, we need the output of all the Mapper functions, which may not be possible to collect in the map phase, because the mappers may be running on different machines, wherever the data blocks are present.
·         If we try to aggregate data at the mappers, it requires communication between all the mapper functions, which may be running on different machines. This would consume high network bandwidth and could cause a network bottleneck.
71) Explain map-only job?
MapReduce is the data processing layer of Hadoop. It is the framework for writing applications that process the vast amounts of data stored in HDFS. It processes huge amounts of data in parallel by dividing the job into a set of independent tasks (sub-jobs). In Hadoop, MapReduce has 2 phases of processing: Map and Reduce.
In the Map phase we specify all the complex logic/business rules/costly code. Map takes a set of data and converts it into another set of data, breaking individual elements into tuples (key-value pairs). In the Reduce phase we specify light-weight processing like aggregation/summation. Reduce takes the output from the map as input, combines the tuples based on the key, and modifies the value of the key accordingly.
Consider a case where we just need to perform an operation and no aggregation is required. In such a case, we prefer a “Map-Only job” in Hadoop: the map does all the work on its InputSplit and the reducer does nothing. The map output is the final output.
We can achieve this by setting job.setNumReduceTasks(0) in the driver configuration. This sets the number of reducers to 0, so only the mappers do the complete task.
72) What is SequenceFileInputFormat in Hadoop MapReduce?
SequenceFileInputFormat is an InputFormat which reads sequence files. Sequence files are binary files that store sequences of binary key-value pairs, and they can be block-compressed. Sequence files thus provide direct serialization and deserialization of several arbitrary data types.
Here both key and value are user-defined.
SequenceFileAsTextInputFormat is a variant of SequenceFileInputFormat. It converts the sequence file's keys and values to Text objects by calling toString() on them. This InputFormat makes sequence files suitable input for Streaming.
SequenceFileAsBinaryInputFormat is another variant of SequenceFileInputFormat. Using it, we can extract the sequence file's keys and values as opaque binary objects.
The above MapReduce Hadoop interview questions and answers (58 – 72) were for freshers; however, experienced candidates can also go through them to revise the basics.
MapReduce Hadoop Interview questions and Answers for Experienced
73) What is KeyValueTextInputFormat in Hadoop?
KeyValueTextInputFormat – It treats each line of input as a separate record and breaks the line itself into a key and a value, using the tab character ('\t') to split the line into a key-value pair.
Key- Everything up to tab character.
Value- Remaining part of the line after tab character.
Consider the following input file, where the first word of each line is followed by a (horizontal) tab character:
But his face you could not see
Account of his beaver hat
Output:
Key- But
Value- his face you could not see
Key- Account
Value- of his beaver hat
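The tab-splitting behaviour described above can be sketched in plain Java. This is a self-contained illustration of the rule (key = everything up to the first tab, value = the remainder), not Hadoop's actual KeyValueLineRecordReader implementation.

```java
// Sketch of how KeyValueTextInputFormat splits one line of input:
// everything before the first tab is the key, the rest is the value.
public class KeyValueSplitSketch {
    public static String[] split(String line) {
        int tab = line.indexOf('\t');
        if (tab == -1) {
            // No tab: the whole line becomes the key, the value is empty.
            return new String[] { line, "" };
        }
        return new String[] { line.substring(0, tab), line.substring(tab + 1) };
    }
}
```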
74) Differentiate Reducer and Combiner in Hadoop MapReduce?
Combiner – The combiner is a “mini-reducer” that performs a local reduce task. It runs on the map output and feeds its output to the reducer input. Combiners are usually used for network optimization.
Reducer – The reducer takes the set of intermediate key-value pairs produced by the mappers as input, then runs a reduce function on each of them to generate the output. The output of the reducer is the final output.
·         Unlike a reducer, the combiner has a limitation: its input and output key and value types must match the output types of the mapper.
·         Combiners can operate only on a subset of keys and values, i.e., combiners can execute only functions that are commutative and associative.
·         A combiner function takes input from a single mapper, while a reducer can take data from multiple mappers as a result of partitioning.
75) Explain the process of spilling in MapReduce?
The map task processes each input record (from the RecordReader) and generates a key-value pair. The Mapper does not store its output on HDFS: this is temporary data, and writing it to HDFS would create unnecessary multiple copies. Instead, the Mapper writes its output into a circular memory buffer (RAM). The size of the buffer is 100 MB by default; we can change it using the mapreduce.task.io.sort.mb property.
Spilling is the process of copying the data from the memory buffer to disk. It takes place when the content of the buffer reaches a certain threshold size: a background thread starts spilling the contents once, by default, 80% of the buffer size is filled. Therefore, for a 100 MB buffer, spilling starts once its content reaches 80 MB.
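The spill-threshold arithmetic above can be written out directly. This is a minimal sketch of the calculation (buffer size from mapreduce.task.io.sort.mb times the spill fraction from mapreduce.map.sort.spill.percent, default 0.80), not Hadoop code itself.

```java
// Spill threshold arithmetic: spilling starts when the in-memory buffer
// (mapreduce.task.io.sort.mb, default 100 MB) is filled up to the spill
// fraction (mapreduce.map.sort.spill.percent, default 0.80).
public class SpillThresholdSketch {
    public static double thresholdMb(double bufferMb, double spillPercent) {
        return bufferMb * spillPercent;
    }
}
```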
76) What happens if the number of reducers is set to 0 in Hadoop?
If we set the number of reducers to 0:
·         No reducer will execute and no aggregation will take place.
·         In such a case we prefer a “Map-only job” in Hadoop: the map does all the work on its InputSplit and the reducer does nothing. The map output is the final output.
In between the map and reduce phases there is a sort and shuffle phase, which is responsible for sorting the keys in ascending order and grouping values based on the same keys. This phase is very expensive; if the reduce phase is not required, we should avoid it. Avoiding the reduce phase eliminates the sort and shuffle phase as well, which also saves network congestion: during shuffling the mapper output travels to the reducers, and when the data size is huge, large amounts of data travel to the reducers.
77) What is Speculative Execution in Hadoop?
MapReduce breaks jobs into tasks and runs these tasks in parallel rather than sequentially, which reduces execution time. This model of execution is sensitive to slow tasks, as they slow down the overall execution of a job. There are various reasons for tasks slowing down, such as hardware degradation, and the causes may be difficult to detect, since the tasks still complete successfully, just taking more time than expected.
The Hadoop framework doesn't try to diagnose and fix slow-running tasks. It tries to detect them and runs backup tasks for them. This process is called speculative execution in Hadoop, and the backup tasks are called speculative tasks.
First the Hadoop framework launches all the tasks for the job. It then launches speculative tasks for those tasks that have been running for some time (about a minute) and have not made much progress, on average, compared with the other tasks from the job.
If the original task completes before the speculative task, the framework kills the speculative task; on the other hand, it kills the original task if the speculative task finishes first.
78) What is a counter in Hadoop MapReduce?
Counters in MapReduce are a useful channel for gathering statistics about a MapReduce job, for quality control or at the application level. They are also useful for problem diagnosis.
Counters validate that:
·         The number of bytes read and written within the map/reduce job is correct.
·         The number of tasks launched and successfully run in the map/reduce job is correct.
·         The amount of CPU and memory consumed is appropriate for our job and cluster nodes.
There are two types of counters:
·         Built-in counters – Hadoop maintains some built-in counters for every job. These report various metrics; for example, there are counters for the number of bytes and records, which allow us to confirm that the job consumed the expected amount of input and produced the expected amount of output.
·         User-defined counters – Hadoop MapReduce permits user code to define a set of counters, which are then incremented as desired in the mapper or reducer. For example, in Java, counters are defined with an ‘enum’.
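User-defined counters can be sketched without Hadoop. In a real job you would define an enum and call context.getCounter(SomeEnum.VALUE).increment(1); below, the framework's counter store is stood in for by an EnumMap, and the enum name is a made-up example.

```java
import java.util.EnumMap;

// Simulation of user-defined counters. In a real mapper/reducer you would
// call context.getCounter(RecordQuality.MALFORMED).increment(1); here an
// EnumMap stands in for the framework's counter store.
public class CounterSketch {
    // Hypothetical application-level counters.
    public enum RecordQuality { VALID, MALFORMED }

    private final EnumMap<RecordQuality, Long> counters =
            new EnumMap<>(RecordQuality.class);

    public void increment(RecordQuality c) {
        counters.merge(c, 1L, Long::sum);
    }

    public long get(RecordQuality c) {
        return counters.getOrDefault(c, 0L);
    }
}
```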
79) How to submit extra files(jars,static files) for MapReduce job during runtime in Hadoop?
The MapReduce framework provides the Distributed Cache to cache files needed by applications. It can cache read-only text files, archives, jar files, etc.
An application which wants to distribute a file via the distributed cache should make sure that the file is available at a URL, which can be either hdfs:// or http://.
If the file is present at an hdfs:// or http:// URL, the user mentions it as a cache file to distribute. The framework copies the cache file to all the nodes before starting any tasks on those nodes. The files are copied only once per job, and applications should not modify them.
80) What is TextInputFormat in Hadoop?
TextInputFormat is the default InputFormat. It treats each line of the input file as a separate record. For unformatted data or line-based records like log files, TextInputFormat is useful. By default, RecordReader also uses TextInputFormat for converting data into key-value pairs. So,
·         Key- It is the byte offset of the beginning of the line.
·         Value- It is the contents of the line, excluding line terminators.
File content is- on the top of the building
so,
Key- 0
Value- on the top of the building
TextInputFormat also provides the following 2 types of RecordReader:
·         LineRecordReader
·         SequenceFileRecordReader
Top Interview Questions for Hadoop MapReduce
81) How many Mappers run for a MapReduce job?
Number of mappers depends on 2 factors:
·         The amount of data we want to process, along with the block size: the number of mappers is driven by the number of InputSplits. If we have a block size of 128 MB and we expect 10 TB of input data, we will have about 82,000 maps. Ultimately, the InputFormat determines the number of maps.
·         The configuration of the slave, i.e., the number of cores and amount of RAM available on the slave. The right number of maps per node is between 10 and 100. The Hadoop framework should allocate 1 to 1.5 cores of processor to each mapper, so on a 15-core processor, 10 mappers can run.
In a MapReduce job we can control the number of mappers by changing the block size: changing the block size increases or decreases the number of InputSplits.
By using JobConf's conf.setNumMapTasks(int num) we can increase the number of map tasks.
Mappers = (total data size) / (input split size)
If the data size is 1 TB and the input split size is 100 MB:
Mappers = (1000 × 1000) / 100 = 10,000
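The mapper-count formula above can be checked with a couple of lines of plain Java. This is only the arithmetic from the formula, rounding up so a final partial split still gets its own mapper; it is not how Hadoop's InputFormat is implemented.

```java
// Mapper-count arithmetic: mappers = total data size / input split size,
// rounded up because a final partial split still needs its own mapper.
public class MapperCountSketch {
    public static long mappers(long totalDataMb, long splitSizeMb) {
        return (totalDataMb + splitSizeMb - 1) / splitSizeMb;
    }
}
```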
82) How many Reducers run for a MapReduce job?
Answer these types of MapReduce Hadoop interview questions briefly and to the point.
With the help of Job.setNumReduceTasks(int), the user sets the number of reducers for the job. To set the right number of reducers, use the formula:
0.95 or 1.75 multiplied by (<no. of nodes> * <no. of maximum containers per node>).
With 0.95, all the reducers can launch immediately as the maps finish and start transferring map output. With 1.75, the faster nodes finish their first round of reduces and launch a second wave of reduces.
As the number of reducers increases:
·         Load balancing increases.
·         Cost of failures decreases.
·         Framework overhead increases.
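The reducer-count heuristic above reduces to simple arithmetic. This sketch only evaluates the formula (0.95 for one wave of reduces, 1.75 for two waves); the method and class names are illustrative.

```java
// Reducer-count heuristic: reducers = factor * (nodes * containersPerNode),
// where factor is 0.95 (one wave) or 1.75 (two waves with better balancing).
public class ReducerCountSketch {
    public static int reducers(double factor, int nodes, int containersPerNode) {
        return (int) (factor * nodes * containersPerNode);
    }
}
```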
These are very common type of MapReduce Hadoop interview questions and answers faced during the interview of an experienced professional.
83) How to sort intermediate output based on values in MapReduce?
Hadoop MapReduce automatically sorts the key-value pairs generated by the mapper. Sorting takes place on the basis of keys. Thus, to sort intermediate output based on values, we need to use secondary sorting.
There are two possible approaches:
·         First, in the reducer: the reducer reads and buffers all the values for a given key, then does an in-reducer sort on all the values. Since the reducer receives all the values for a given key (a potentially huge list), this can cause the reducer to run out of memory. This approach works well only if the number of values is small.
·         Second, using the MapReduce framework itself: sort the reducer input values by creating a composite key (the “value-to-key conversion” approach), i.e., by adding part of, or the entire, value to the natural key. This approach is scalable and will not generate out-of-memory errors.
We then need a custom partitioner to ensure that all the data with the same natural key (within the composite key) goes to the same reducer, and a custom grouping comparator so that the data is grouped by the natural key once it arrives at the reducer.
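The value-to-key conversion idea can be demonstrated standalone. This sketch sorts (key, value) pairs with a composite ordering (natural key first, then value), which is what the framework's key sort achieves once the value has been folded into the key; it is an illustration, not Hadoop's comparator API.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sketch of value-to-key conversion: folding the value into a composite
// key means the framework's key sort also orders the values per key.
public class SecondarySortSketch {
    public static List<String[]> sort(List<String[]> keyValuePairs) {
        List<String[]> sorted = new ArrayList<>(keyValuePairs);
        // Composite ordering: natural key ascending, then value ascending.
        Comparator<String[]> byComposite =
                Comparator.comparing((String[] p) -> p[0])
                          .thenComparing((String[] p) -> p[1]);
        sorted.sort(byComposite);
        return sorted;
    }
}
```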
84) What is purpose of RecordWriter in Hadoop?
The reducer takes the mapper output (intermediate key-value pairs) as input and runs a reducer function on them to generate output (zero or more key-value pairs). The output of the reducer is the final output.
RecordWriter writes these output key-value pairs from the Reducer phase to output files. OutputFormat determines how RecordWriter writes these key-value pairs to the output files. Hadoop provides OutputFormat instances which help to write files to HDFS or to the local disk.
85) What are the most common OutputFormat in Hadoop?
The reducer takes the mapper output as input and produces output (zero or more key-value pairs). RecordWriter writes these output key-value pairs from the Reducer phase to output files, and OutputFormat determines how RecordWriter writes them.
The FileOutputFormat.setOutputPath() method is used to set the output directory, and every Reducer writes a separate file in that common output directory.
Most common OutputFormat are:
·         TextOutputFormat – The default OutputFormat in MapReduce. TextOutputFormat writes key-value pairs on individual lines of text files. Its keys and values can be of any type, because TextOutputFormat turns them into strings by calling toString() on them.
·         SequenceFileOutputFormat – This OutputFormat writes sequence files for its output. It is also used between MapReduce jobs.
·         SequenceFileAsBinaryOutputFormat – A variant of SequenceFileOutputFormat which writes keys and values to a sequence file in binary format.
·         DBOutputFormat – Used for writing to relational databases and HBase. It sends the reduce output to a SQL table. It accepts key-value pairs where the key has a type extending DBWritable.
86) What is LazyOutputFormat in Hadoop?
FileOutputFormat subclasses will create output files (part-r-nnnnn) even if they are empty. Some applications prefer not to create empty files, which is where LazyOutputFormat helps.
LazyOutputFormat is a wrapper OutputFormat. It makes sure that an output file is created only when the first record for a given partition is emitted.
To use LazyOutputFormat, call its setOutputFormatClass() method with the JobConf.
To enable LazyOutputFormat, Streaming and Pipes support a -lazyOutput option.
87) How to handle record boundaries in Text files or Sequence files in MapReduce InputSplits?
An InputSplit's RecordReader in MapReduce will “start” and “end” at a record boundary.
In a SequenceFile, approximately every 2 KB there is a 20-byte sync mark between the records. These sync marks allow the RecordReader to seek to the start of the InputSplit (which holds the file name, length, and offset), find the first sync mark after the start of the split, and continue processing records until it reaches the first sync mark after the end of the split.
Similarly, text files use newlines instead of sync marks to handle record boundaries.
88) What are the main configuration parameters in a MapReduce program?
The main configuration parameters are:
·         Input format of data.
·         Job’s input locations in the distributed file system.
·         Output format of data.
·         Job’s output location in the distributed file system.
·         JAR file containing the mapper, reducer and driver classes.
·         Class containing the map function.
·         Class containing the reduce function.
89) Is it mandatory to set input and output type/format in MapReduce?
No, it is not mandatory.
Hadoop cluster, by default, takes the input and the output format as ‘text’.
TextInputFormat – MapReduce default InputFormat is TextInputFormat. It treats each line of each input file as a separate record and also performs no parsing. For unformatted data or line-based records like log files, TextInputFormat is also useful. By default, RecordReader also uses TextInputFormat for converting data into key-value pairs.
TextOutputFormat- MapReduce default OutputFormat is TextOutputFormat. It also writes (key, value) pairs on individual lines of text files. Its keys and values can be of any type.
90) What is Identity Mapper?
Identity Mapper is the default Mapper provided by Hadoop. When a MapReduce program has not defined any mapper class, the Identity Mapper runs. It simply passes the input key-value pairs on to the reducer phase. The Identity Mapper performs no computation or calculation on the input data; it only writes the input data to the output.
The class name is org.apache.hadoop.mapred.lib.IdentityMapper
91) What is Identity reducer?
Identity Reducer is the default Reducer provided by Hadoop. When a MapReduce program has not defined any reducer class, the Identity Reducer runs. That does not mean the reduce step will not take place: it will take place, and the related sorting and shuffling will also happen, but there will be no aggregation. So you can use the Identity Reducer if you want to sort the data coming from the map but don't care about any grouping.
The above MapReduce Hadoop interview questions and answers, i.e., Q. 73 – Q. 91, were for experienced candidates, but freshers can also refer to them for in-depth knowledge. Now let's move forward with some advanced MapReduce Hadoop interview questions and answers.
Advanced Interview Questions and Answers for Hadoop MapReduce
92) What is Chain Mapper?
We can use multiple Mapper classes within a single map task by using the ChainMapper class. The Mapper classes are invoked in a chained (or piped) fashion: the output of the first becomes the input of the second, and so on until the last mapper. The Hadoop framework writes the output of the last mapper to the task's output.
The key benefit of this feature is that the Mappers in the chain do not need to be aware that they execute in a chain. This enables having reusable, specialized Mappers that we can combine to perform composite operations within a single task.
Special care must be taken when creating chains: the key/value output types of a Mapper must be valid input for the following mapper in the chain.
The class name is org.apache.hadoop.mapred.lib.ChainMapper
This is one of the most important MapReduce Hadoop interview questions.
93) What are the core methods of a Reducer?
The reducer processes the output of the mapper. After processing the data, it produces a new set of output, which it stores in HDFS. The core methods of a Reducer are:
·         setup() – This method configures various parameters like the input data size, distributed cache, heap size, etc. Function definition: public void setup(Context context)
·         reduce() – The framework calls this method once per key, with the associated values, for each reduce task. Function definition: public void reduce(Key key, Iterable<Value> values, Context context)
·         cleanup() – This method is called only once, at the end of the reduce task, for clearing all the temporary files. Function definition: public void cleanup(Context context)
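The call order of these three methods can be mirrored in a standalone class. This is a lifecycle sketch only: it does not use Hadoop's Reducer API, and the class name and trace mechanism are made up for illustration.

```java
import java.util.List;

// Lifecycle sketch of a reducer: setup() once, reduce() once per key,
// cleanup() once at the end. This standalone class mirrors the call order
// of org.apache.hadoop.mapreduce.Reducer without Hadoop itself.
public class ReducerLifecycleSketch {
    private final StringBuilder trace = new StringBuilder();

    public void setup() { trace.append("setup;"); }

    public void reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;       // light-weight aggregation per key
        trace.append(key).append("=").append(sum).append(";");
    }

    public void cleanup() { trace.append("cleanup"); }

    public String runTrace() { return trace.toString(); }
}
```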
94) What are the parameters of mappers and reducers?
The parameters for Mappers are:
·         LongWritable (input)
·         Text (input)
·         Text (intermediate output)
·         IntWritable (intermediate output)
The parameters for Reducers are:
·         Text (intermediate output)
·         IntWritable (intermediate output)
·         Text (final output)
·         IntWritable (final output)
95) What is the difference between TextinputFormat and KeyValueTextInputFormat class?
TextInputFormat – It is the default InputFormat. It treats each line of the input file as a separate record. For unformatted data or line-based records like log files, TextInputFormat is also useful. So,
·         Key- It is byte offset of the beginning of the line within the file.
·         Value- It is the contents of the line, excluding line terminators.
KeyValueTextInputFormat – It is like TextInputFormat in that it also treats each line of input as a separate record. But the main difference is that TextInputFormat treats the entire line as the value, while KeyValueTextInputFormat breaks the line itself into a key and a value, separated by the tab character ('\t'). So,
·         Key- Everything up to tab character.
·         Value- Remaining part of the line after tab character.
For example, consider a file contents as below:
AL#Alabama
AR#Arkansas
FL#Florida
So, with TextInputFormat (the keys are the byte offsets of each line within the file):
Key      Value
0         AL#Alabama
11        AR#Arkansas
23        FL#Florida
So, KeyValueTextInputFormat
Key      value
AL       Alabama
AR       Arkansas
FL        Florida
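TextInputFormat's byte-offset keys for the sample file can be checked with a short standalone sketch: each key is the byte offset of the first character of the line ("AL#Alabama" is 10 characters plus a newline, so the second line starts at offset 11). This is an illustration of the rule, not Hadoop's LineRecordReader.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of how TextInputFormat assigns keys: each line's key is the byte
// offset of the start of that line, and the value is the line content.
public class ByteOffsetSketch {
    public static Map<Long, String> toRecords(String fileContent) {
        Map<Long, String> records = new LinkedHashMap<>();
        long offset = 0;
        for (String line : fileContent.split("\n")) {
            records.put(offset, line);
            offset += line.length() + 1; // +1 for the '\n' terminator
        }
        return records;
    }
}
```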
These are some of the advanced MapReduce Hadoop interview questions and answers.
96) How is the splitting of a file invoked in Hadoop?
InputFormat is responsible for creating InputSplit, which is the logical representation of data. Further Hadoop framework divides split into records. Then, Mapper process each record (which is a key-value pair).
The Hadoop framework invokes the splitting of the file by running the getSplits() method, which belongs to the InputFormat class (like FileInputFormat) used for the job.
97) How many InputSplits will be made by the Hadoop framework?
InputFormat is responsible for creating InputSplit, which is the logical representation of data. Further Hadoop framework divides split into records. Then, Mapper process each record (which is a key-value pair).
The MapReduce system uses storage locations to place map tasks as close to the split's data as possible. By default, the split size is approximately equal to the HDFS block size (128 MB).
For example, if the file size is 514 MB:
128 MB: 1st block, 128 MB: 2nd block, 128 MB: 3rd block,
128 MB: 4th block, 2 MB: 5th block
So, 5 InputSplits are created, based on the 5 blocks.
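The split breakdown for the 514 MB example can be generated mechanically: each split is at most one block size, and the remainder forms a final short split. This is a sketch of the arithmetic only, not Hadoop's split computation.

```java
import java.util.ArrayList;
import java.util.List;

// Split-size breakdown: each split is at most the block size, and the
// remainder of the file forms a final, shorter split.
public class SplitBreakdownSketch {
    public static List<Long> splits(long fileSizeMb, long blockSizeMb) {
        List<Long> sizes = new ArrayList<>();
        for (long remaining = fileSizeMb; remaining > 0; remaining -= blockSizeMb) {
            sizes.add(Math.min(remaining, blockSizeMb));
        }
        return sizes;
    }
}
```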
If you have any confusion about any of these MapReduce Hadoop interview questions, do let us know by leaving a comment. We will be glad to solve your queries.
98) Explain the usage of Context Object.
With the help of the Context object, the Mapper can easily interact with other Hadoop systems. It also helps in updating counters, so the counters can report progress and provide any application-level status updates.
It contains the configuration details for the job.
99) When is it not recommended to use MapReduce paradigm for large scale data processing?
For iterative processing use cases, it is not suggested to use MapReduce, as it is not cost-effective; instead, Apache Pig can be used for the same.
100) What is the difference between RDBMS and Hadoop MapReduce?
Size of Data
·         RDBMS- A traditional RDBMS can handle up to gigabytes of data.
·         MapReduce- Hadoop MapReduce can handle petabytes of data or more.
Updates
·         RDBMS- Read and Write multiple times.
·         MapReduce- Read many times but write once model.
Schema
·         RDBMS- Static Schema that needs to be pre-defined.
·         MapReduce- Has a dynamic schema
Processing Model
·         RDBMS- Supports both batch and interactive processing.
·         MapReduce- Supports only batch processing.
Scalability
·         RDBMS- Non-Linear
·         MapReduce- Linear
101) Define Writable data types in Hadoop MapReduce.
Hadoop reads and writes data in serialized form using the Writable interface. The Writable interface has several implementation classes like Text, IntWritable, LongWritable, FloatWritable, and BooleanWritable. Users are also free to define their own Writable classes.
102) Explain what conf.setMapperClass does in MapReduce?
conf.setMapperClass() sets the mapper class for the job, which includes reading the data and generating a key-value pair out of the mapper.
Q. 92 – Q. 102 were the advanced MapReduce Hadoop interview questions and answers to help you gain in-depth knowledge for handling difficult Hadoop interview questions.
This was all about the Hadoop Interview Questions and Answers
These questions are frequently asked MapReduce Hadoop interview questions and answers. You can read some more Hadoop MapReduce interview questions and answers here.
After going through these MapReduce Hadoop interview questions and answers, you will be able to confidently face an interview and answer the MapReduce Hadoop interview questions asked in your interview in the best manner. These MapReduce Hadoop interview questions are suggested by the experts at DataFlair.
Key –
Q.53 – Q57 Basic MapReduce Hadoop interview questions and answers
Q.58 – Q72 MapReduce Hadoop interview questions and answer for Freshers
Q.73 -Q. 80 Hadoop MapReduce Interview Questions for Experienced
Q.81 – Q.91 Top questions asked in Hadoop Interview
Q.92 – Q.102 were the advanced  MapReduce Hadoop interview questions and answers
These MapReduce Hadoop interview questions and answers are categorized so that you can pay more attention to questions specified for you, however, it is recommended that you go through all the Hadoop interview questions and answers for complete understanding.
If you have any more doubt or query on Hadoop Interview Questions and Answers for Mapreduce, Drop a comment and our support team will be happy to help you.
Hope the tutorial on Hadoop interview questions and answers was helpful to you.


----------------------------
http://www.bigdatatrunk.com/top-50-interview-questions-hdfs/

Q1 What does ‘jps’ command do?
Answer: It gives the status of the daemons which run the Hadoop cluster. It gives output mentioning the status of the Namenode, Datanode, Secondary Namenode, Jobtracker and Tasktracker.
Q2.What if a Namenode has no data?
Answer: It cannot be part of the Hadoop cluster.
Q3. What happens to job tracker when Namenode is down?
Answer: When the Namenode is down, your cluster is OFF, because the Namenode is the single point of failure in HDFS.
Q4.What is a Namenode?
Answer: Namenode is the master node on which job tracker runs and consists of the metadata. It maintains and manages the blocks which are present on the datanodes. It is a high-availability machine and single point of failure in HDFS.
Q5.Replication causes data redundancy, then why is it pursued in HDFS?
Answer: HDFS works with commodity hardware (systems with average configurations) that has high chances of getting crashed any time. Thus, to make the entire system highly fault-tolerant, HDFS replicates and stores data in different places. Any data on HDFS gets stored at least 3 different locations. So, even if one of them is corrupted and the other is unavailable for some time for any reason, then data can be accessed from the third one. Hence, there is no chance of losing the data. This replication factor helps us to attain the feature of Hadoop called Fault Tolerant.
Q6.  What is a Datanode?
Answer: Datanodes are the slaves which are deployed on each machine and provide the actual storage. These are responsible for serving read and write requests for the clients.
Q7.  Why do we use HDFS for applications having large data sets and not when there are lot of small files?
Answer: HDFS is more suitable for a large amount of data in a single file than for small amounts of data spread across multiple files. This is because the Namenode is a very expensive, high-performance system, so it is not prudent to occupy its space with the unnecessary amount of metadata generated for multiple small files. When there is a large amount of data in a single file, the Namenode occupies less space. Hence, for optimized performance, HDFS supports large data sets rather than multiple small files.
Q8.Explain the major difference between HDFS block and InputSplit.
Answer: In simple terms, a block is the physical representation of data, while a split is the logical representation of the data present in the block. A split acts as an intermediary between the block and the mapper. Suppose we have two blocks:
Block 1: ii nntteell
Block 2: Ii ppaatt
Now, considering the map, it will read the first block from ii till ll, but does not know how to process the second block at the same time. Here Split comes into play: it forms a logical group of Block 1 and Block 2 as a single block. It then forms key-value pairs using the InputFormat and RecordReader and sends the map for further processing with the InputSplit. If you have limited resources, you can increase the split size to limit the number of maps. For instance, if there are 10 blocks of 64 MB (640 MB in total) and limited resources, you can set the split size to 128 MB. This forms a logical group of 128 MB, with only 5 maps executing at a time. However, if the split-size property is set to false, the whole file forms one InputSplit and is processed by a single map, consuming more time when the file is bigger.
Q9.What is a ‘block’ in HDFS?
Answer: A ‘block’ is the minimum amount of data that can be read or written. In HDFS, the default block size is 64 MB, in contrast to the block size of 8192 bytes in Unix/Linux. Files in HDFS are broken down into block-sized chunks, which are stored as independent units. HDFS blocks are large compared to disk blocks, particularly to minimize the cost of seeks. If a particular file is 50 MB, will the HDFS block still consume 64 MB as the default size? No, not at all! 64 MB is just a unit where the data will be stored. In this particular situation, only 50 MB will be consumed by an HDFS block and 14 MB will be free to store something else. It is the MasterNode that does data allocation in an efficient manner.
Q10.Explain what happens if during the PUT operation, HDFS block is assigned a replication factor 1 instead of the default value 3.
Answer: Replication factor is a property of HDFS that can be set accordingly for the entire cluster to adjust the number of times the blocks are to be replicated to ensure high data availability. For every block that is stored in HDFS, the cluster will have n-1 duplicated blocks. So, if the replication factor during the PUT operation is set to 1 instead of the default value 3, then it will have a single copy of data. Under these circumstances when the replication factor is set to 1, if the DataNode crashes under any circumstances, then only single copy of the data would be lost.
Q11.What are the most common Input Formats in Hadoop?
Answer: There are three most common input formats in Hadoop:
  • Text Input Format: Default input format in Hadoop
  • Key Value Input Format: used for plain text files where the files are broken into lines
  • Sequence File Input Format: used for reading files in sequence
Q12.  What is commodity hardware?
Answer: Commodity hardware refers to inexpensive systems that do not have high availability or high quality. Commodity hardware includes RAM, because there are specific services that need to be executed in RAM. Hadoop can run on any commodity hardware and does not require any supercomputers or high-end hardware configuration to execute jobs.
Q13. What is the port number for the NameNode, Secondary NameNode, DataNodes, TaskTracker and JobTracker?
Answer:
  • NameNode: 50070
  • Secondary NameNode: 50090
  • DataNodes: 50075
  • JobTracker: 50030
  • TaskTracker: 50060
Q14. Explain about the process of inter cluster data copying.
Answer: HDFS provides a distributed data copying facility through DistCp, which copies data from a source to a destination. When the copy spans two Hadoop clusters, it is referred to as inter-cluster data copying. DistCp requires the source and destination to run the same or a compatible version of Hadoop.
Q15. What is a heartbeat in HDFS?
Answer: A heartbeat is a signal indicating that a node is alive. A DataNode sends heartbeats to the NameNode, and a TaskTracker sends heartbeats to the JobTracker. If the NameNode or JobTracker does not receive heartbeats, it concludes that the DataNode has a problem or that the TaskTracker is unable to perform its assigned tasks.
Q16. Explain the difference between NAS and HDFS.
Answer:
  • NAS runs on a single machine, so there is no data redundancy, whereas HDFS runs on a cluster of machines and provides redundancy through its replication protocol.
  • NAS stores data on dedicated hardware, whereas in HDFS all data blocks are distributed across the local drives of the machines in the cluster.
  • In NAS, data is stored independently of the computation, so Hadoop MapReduce cannot be used for processing; HDFS works with Hadoop MapReduce because the computation is moved to the data.
Q17. Explain about the indexing process in HDFS.
Answer: Indexing process in HDFS depends on the block size. HDFS stores the last part of the data that further points to the address where the next part of data chunk is stored.
Q18. What is a rack awareness and on what basis is data stored in a rack?
Answer: All the DataNodes put together form a storage area; the physical location of the DataNodes is referred to as a rack in HDFS. The rack information, i.e. the rack id of each DataNode, is acquired by the NameNode. The process of selecting closer DataNodes based on rack information is known as rack awareness. The contents of a file are divided into data blocks as soon as the client is ready to load the file into the Hadoop cluster. After consulting the NameNode, the client allocates 3 DataNodes for each data block. For each data block, two copies exist in one rack and the third copy is placed in another rack, to ensure that if an entire rack fails we still have one copy elsewhere. This is generally referred to as the replica placement policy.
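The replica placement policy described above can be illustrated with a toy Python sketch (not Hadoop code; rack and node names are made up for illustration):

```python
# Toy sketch of the replica placement policy: one replica on the writer's
# rack, the remaining two replicas together on one different rack.
def place_replicas(local_rack, racks):
    """racks: dict rack_id -> list of DataNodes. Returns the 3 chosen nodes."""
    first = racks[local_rack][0]                          # replica 1: local rack
    other_rack = next(r for r in racks if r != local_rack)
    second, third = racks[other_rack][:2]                 # replicas 2+3: remote rack
    return [first, second, third]

racks = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"]}
print(place_replicas("rack1", racks))  # ['dn1', 'dn3', 'dn4']
```

A rack failure then still leaves at least one replica: either the local one or the two remote ones survive.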
Q19. How does the NameNode handle DataNode failures?
Answer: Through heartbeats. Every DataNode periodically sends a heartbeat and a block report to the NameNode. If the NameNode stops receiving heartbeats from a DataNode, it marks that node as dead, stops directing client I/O to it, and re-replicates the blocks it held onto other DataNodes so that the replication factor is restored.
Q20. What is HDFS?
Answer: The Hadoop Distributed File System (HDFS) is a sub-project of the Apache Hadoop project. HDFS uses a master/slave architecture in which one device (the master) controls one or more other devices (the slaves). HDFS is a file system designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.
Q21. What are the key features of HDFS?
Answer: HDFS is highly fault-tolerant, with high throughput, suitable for applications with large data sets, streaming access to file system data and can be built out of commodity hardware.
Q22.  What is throughput? How does HDFS get a good throughput?
Answer: Throughput is the amount of work done in a unit time. It describes how fast the data is getting accessed from the system and it is usually used to measure performance of the system. In HDFS, when we want to perform a task or an action, then the work is divided and shared among different systems. So all the systems will be executing the tasks assigned to them independently and in parallel. So the work will be completed in a very short period of time. In this way, the HDFS gives good throughput. By reading data in parallel, we decrease the actual time to read data tremendously.
Q23. What is data-integrity in HDFS?
Answer: HDFS transparently checksums all data written to it and by default verifies checksums when reading data. A separate checksum is created for every chunk of data (512 bytes by default; a CRC-32 checksum is 4 bytes). DataNodes are responsible for verifying the data they receive before storing the data and its checksums. It is possible to disable checksum verification by passing false to the setVerifyChecksum() method on the FileSystem before calling open() to read a file.
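The per-chunk checksum idea can be sketched in plain Python (not Hadoop code; `zlib.crc32` stands in for HDFS's CRC implementation):

```python
import zlib

# Sketch of HDFS-style data integrity: a 4-byte CRC-32 checksum for every
# 512-byte chunk of data (the chunk size is configurable in real HDFS).
CHUNK = 512

def checksum_chunks(data: bytes):
    return [zlib.crc32(data[i:i + CHUNK]) for i in range(0, len(data), CHUNK)]

def verify(data: bytes, sums):
    return checksum_chunks(data) == sums

payload = b"x" * 1300           # 3 chunks: 512 + 512 + 276 bytes
sums = checksum_chunks(payload)
print(len(sums))                # 3
print(verify(payload, sums))    # True
corrupted = payload[:-1] + b"y" # flip one byte
print(verify(corrupted, sums))  # False
```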
Q24.  What all modes Hadoop can be run in?
Answer: Hadoop can run in three modes:
  1. Standalone Mode: Default mode of Hadoop, it uses the local file system for input and output operations. This mode is mainly used for debugging purposes, and it does not support the use of HDFS. Further, in this mode, there is no custom configuration required for the mapred-site.xml, core-site.xml and hdfs-site.xml files. It is much faster than the other modes.
  2. Pseudo-Distributed Mode (Single Node Cluster): In this case, you need configuration for all three files mentioned above. All daemons run on one node, so the Master and Slave node are the same.
  3. Fully Distributed Mode (Multi-Node Cluster): This is the production phase of Hadoop (what Hadoop is known for), where data is used and distributed across several nodes in a Hadoop cluster. Separate nodes are allotted as Master and Slaves.
Q25 What are the core components of Hadoop?
Answer: Core components of Hadoop are HDFS and MapReduce. HDFS is basically used to store large data sets and MapReduce is used to process such large data sets.
Q26. What is a metadata?
Answer: Metadata is the information about the data stored in datanodes such as location of the file, size of the file and so on.
Q27. What happens when two clients try to write into the same HDFS file?
Answer:HDFS supports exclusive writes only. When the first client contacts the name-node to open the file for writing, the name-node grants a lease to the client to create this file. When the second client tries to open the same file for writing, the name-node will see that the lease for the file is already granted to another client, and will reject the open request for the second client
Q28.  What is a daemon?
Answer: Daemon is a process or service that runs in background. In general, we use this word in UNIX environment. The equivalent of Daemon in Windows is “Services” and in Dos is “TSR”.
Q29.What are file permissions in HDFS?
Answer: HDFS has a permission model for files and directories that is much like POSIX. There are three types of permissions:
  • read permission (r)
  • write permission (w)
  • execute permission (x)
Each file and directory has an owner, a group and a mode.
Q30. What does Data Locality mean?
Answer: Data locality means processing the data where it resides. It simply means that Hadoop MapReduce will do its best to schedule the map and reduce tasks such that most tasks read their input data from the local machine. In certain scenarios, mainly in the reduce phase, an exception to data locality may be needed.
Q31. What is the process to change the files at arbitrary locations in HDFS?
Answer: HDFS does not support modifications at arbitrary offsets in the file or multiple writers but files are written by a single writer in append only format i.e. writes to a file in HDFS are always made at the end of the file.
Q32.  What is the process of indexing in HDFS?
Answer: Once data is stored, HDFS depends on the last part of the data to find out where the next part of the data is stored.
Q33. Difference between Hadoop fs -copyFromLocal and Hadoop fs -moveFromLocal
Answer: Hadoop fs -put and Hadoop fs -copyFromLocal are the same: both copy data from the local file system to HDFS while the local copy remains available, working like copy & paste. Hadoop fs -moveFromLocal works like cut & paste: it moves the file from the local file system to HDFS, so the local copy is no longer available.
Q34. What happens if one Hadoop client renames a file or a directory containing this file while another client is still writing into it?
Answer:A file will appear in the name space as soon as it is created. If a writer is writing to a file and another client renames either the file itself or any of its path components, then the original writer will get an IOException either when it finishes writing to the current block or when it closes the file.
Q35.What is Secondary NameNode?
Answer: Its main role is to periodically merge the namespace image with the edit log to prevent the edit log from becoming too large.It is not a substitute to the Namenode, so if the Namenode fails, the entire Hadoop system goes down.
Q36. What is default block size in HDFS?
Answer: As of the Hadoop 2.x releases, the default block size in HDFS is 128 MB; prior to that it was 64 MB.
Q37. What are the limitations of HDFS file systems?
Answer: HDFS supports read, write, append and delete operations efficiently, but it doesn't support file updates. HDFS is not suitable for large numbers of small files but best suits large files, because the file system namespace maintained by the NameNode is limited by its main memory capacity: the namespace is stored in the NameNode's main memory, and a large number of files results in a big fsimage file.
Q38. Is there an easy way to see the status and health of a cluster?
Answer: There are web-based interfaces to both the JobTracker (MapReduce master) and NameNode (HDFS master) which display status pages about the state of the entire system.The JobTracker status page will display the state of all nodes, as well as the job queue and status about all currently running jobs and tasks. The NameNode status page will display the state of all nodes and the amount of free space, and provides the ability to browse the DFS via the web.
Q39. How do you debug a performance issue or a long running job?
Answer: This is an open ended question and the interviewer is trying to see the level of hands-on experience you have in solving production issues. Use your day to day work experience to answer this question. Here are some of the scenarios and responses to help you construct your answer. On a very high level you will follow the below steps.
  • Understand the symptom
  • Analyze the situation
  • Identify the problem areas
  • Propose solution
Q40. What is a sequence file in Hadoop?
Answer: A sequence file is used to store binary key/value pairs. Sequence files support splitting even when the data inside the file is compressed, which is not possible with a regular compressed file. You can choose record-level compression, in which only the value in each key/value pair is compressed, or block-level compression, in which multiple records are compressed together.
Consider this scenario: in an M/R system, the HDFS block size is 64 MB, the input format is FileInputFormat, and we have three files of size 64 KB, 65 MB and 127 MB. How many input splits will the Hadoop framework make?
Hadoop will make 5 splits as follows:
  • 1 split for the 64 KB file
  • 2 splits for the 65 MB file
  • 2 splits for the 127 MB file
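The split arithmetic can be checked with a short sketch (conceptual Python, not Hadoop code; with FileInputFormat each file is split independently, one split per started block):

```python
import math

# Sketch of the split count above: 64 MB block size, one split per
# started block, files never share a split.
BLOCK_MB = 64

def num_splits(file_sizes_mb):
    return sum(max(1, math.ceil(size / BLOCK_MB)) for size in file_sizes_mb)

# 64 KB -> 1 split, 65 MB -> 2 splits, 127 MB -> 2 splits
print(num_splits([0.0625, 65, 127]))  # 5
```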
Q41. What happens when a datanode fails?
Answer: When a datanode fails:
  • Jobtracker and namenode detect the failure
  • On the failed node all tasks are re-scheduled
  • Namenode replicates the users data to another node
Q42.  What is the benefit of Distributed cache? Why can we just have the file in HDFS and have the application read it?
Answer: Distributed cache is much faster. It copies the file to all TaskTrackers at the start of the job, so whether a TaskTracker runs 10 or 100 mappers or reducers, they all use the same local copy from the distributed cache. On the other hand, if the MapReduce job reads the file from HDFS, every mapper accesses it from HDFS independently, so a TaskTracker running 100 map tasks would read the file 100 times from HDFS. HDFS is also not very efficient when used like this.
Q43.  What happens to a NameNode that has no data?
Answer:There does not exist any NameNode without data. If it is a NameNode then it should have some sort of data in it.
Q44.  What is a block and block scanner in HDFS?
Answer:Block – The minimum amount of data that can be read or written is generally referred to as a “block” in HDFS. The default size of a block in HDFS is 64MB.Block Scanner – Block Scanner tracks the list of blocks present on a DataNode and verifies them to find any kind of checksum errors. Block Scanners use a throttling mechanism to reserve disk bandwidth on the datanode.
 Q45. Why is a block in HDFS so Large?
Answer: HDFS blocks are large compared to disk blocks, and the reason is to minimize the cost of seeks. By making a block large enough, the time to transfer the data from the disk can be significantly longer than the time to seek to the start of the block.
Q46. What is HDFS High-Availability?
Answer: The 2.x release series of Hadoop adds support for HDFS high-availability (HA). In this implementation there is a pair of namenodes in an active-standby configuration. In the event of the failure of the active namenode, the standby takes over its duties to continue servicing client requests without a significant interruption.
Q47. What are some typical functions of Job Tracker?
Answer: The following are some typical tasks of JobTracker:
  • When Client applications submit map reduce jobs to the Job tracker
  • The JobTracker talks to the Name node to determine the location of the data
  • The JobTracker locates TaskTtracker nodes with available slots at or near the data
  • The JobTracker submits the work to the chosen Tasktracker nodes
  • The TaskTracker nodes are monitored. If they do not submit heartbeat signals often enough, they are deemed to have failed and the work is scheduled on a different TaskTracker
  • When the work is completed, the JobTracker updates its status
  • Client applications can poll the JobTracker for information
Q48. How does one switch off the “SAFEMODE” in HDFS?
Answer: You use the command: hadoop dfsadmin -safemode leave
Q49. What is streaming access?
Answer: As HDFS works on the principle of ‘Write Once, Read Many’, the feature of streaming access is extremely important in HDFS. HDFS focuses not so much on storing the data but how to retrieve it at the fastest possible speed, especially while analyzing logs. In HDFS, reading the complete data is more important than the time taken to fetch a single record from the data.
Q50.  Is Namenode also a commodity?
Answer: No. The NameNode can never be commodity hardware because the entire HDFS relies on it. It is the single point of failure in HDFS. The NameNode has to be a high-availability machine.

-------------------


Hadoop Interview Questions for MapReduce


Q1 What is MapReduce?
Answer: MapReduce is a parallel programming model used to process large data sets across hundreds or thousands of servers in a Hadoop cluster. MapReduce brings the compute to the data's location, in contrast to traditional parallelism, which brings the data to the compute location. The term MapReduce is composed of the Map and Reduce phases. The first is the map job, which takes a set of data and converts it into another set of data in which individual elements are broken down into tuples (key/value pairs). The reduce job takes the output from a map as input and combines those data tuples into a smaller set of tuples. As the name MapReduce implies, the reduce job is always performed after the map job. The primary programming language for MapReduce is Java. All data emitted in the flow of a MapReduce program is in the form of key/value pairs.
Q2 Explain a MapReduce program.
Answer: A MapReduce program consists of 3 parts namely, Driver, Mapper, and Reducer.
The Driver code runs on the client machine and is responsible for building the configuration of the job and submitting it to the Hadoop Cluster. The Driver code will contain the main() method that accepts arguments from the command line.
The Mapper code reads the input files as <Key,Value> pairs and emits key value pairs. The Mapper class extends MapReduceBase and implements the Mapper interface. The Mapper interface expects four generics, which define the types of the input and output key/value pairs. The first two parameters define the input key and value types, the second two define the output key and value types.
The Reducer code reads the outputs generated by the different mappers as <Key,Value> pairs and emits key value pairs. The Reducer class extends MapReduceBase and implements the Reducer interface. The Reducer interface expects four generics, which define the types of the input and output key/value pairs. The first two parameters define the intermediate key and value types, the second two define the final output key and value types.
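The Driver/Mapper/Reducer contract above is Java in real Hadoop; this language-neutral Python sketch just mirrors the key/value flow (a tiny in-memory "driver" does the map, shuffle and reduce steps for a word count):

```python
from collections import defaultdict

# Conceptual sketch (not Hadoop code) of the Mapper/Reducer contract.
def mapper(key, value):
    """key: line offset, value: line text -> emits (word, 1) pairs."""
    for word in value.split():
        yield word, 1

def reducer(key, values):
    """key: word, values: all counts for that word -> emits (word, total)."""
    yield key, sum(values)

# A tiny "driver": run map, group by key (the shuffle), then reduce.
lines = {0: "hadoop map reduce", 18: "map reduce"}
grouped = defaultdict(list)
for offset, text in lines.items():
    for word, one in mapper(offset, text):
        grouped[word].append(one)
result = dict(kv for word in grouped for kv in reducer(word, grouped[word]))
print(result)  # {'hadoop': 1, 'map': 2, 'reduce': 2}
```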
Q3 What are the main configuration parameters that the user needs to specify to run a MapReduce job?
Answer: The user of the MapReduce framework needs to specify the following:
  • Job’s input locations in the distributed file system
  • Job’s output location in the distributed file system
  • Input format
  • Output format
  • Class containing the map function
  • Class containing the reduce function
  • JAR file containing the mapper, reducer and driver classes
Q4 What does the Mapper do?
Answer: The Mapper is the first phase of MapReduce, which processes the map task. A mapper reads key/value pairs and emits key/value pairs. Maps are the individual tasks that transform input records into intermediate records. The transformed intermediate records do not need to be of the same type as the input records; a given input pair may map to zero or many output pairs.
Q5  Is there an easy way to see the status and health of a cluster?
Answer: There are web-based interfaces to both the JobTracker (MapReduce master) and NameNode (HDFS master) which display status pages about the state of the entire system.The JobTracker status page will display the state of all nodes, as well as the job queue and status about all currently running jobs and tasks. The NameNode status page will display the state of all nodes and the amount of free space, and provides the ability to browse the DFS via the web.
Q6 Which interface needs to be implemented to create Mapper and Reducer for the Hadoop?
Answer:
  • org.apache.hadoop.mapreduce.Mapper
  • org.apache.hadoop.mapreduce.Reducer
Q7 Explain what is Sequencefileinputformat?
Answer: Sequencefileinputformat is used for reading files in sequence. It is a specific compressed binary file format which is optimized for passing data between the output of one MapReduce job to the input of some other MapReduce job.
Q8 What are ‘maps’ and ‘reduces’?
Answer: ‘Maps’ and ‘Reduces’ are two phases of solving a query in HDFS. ‘Map’ is responsible to read data from input location, and based on the input type, it will generate a key value pair,that is, an intermediate output in local machine.’Reducer’ is responsible to process the intermediate output received from the mapper and generate the final output.
Q9 What does conf.setMapper Class do?
Answer: conf.setMapperClass() sets the mapper class and everything related to the map job, such as reading the data and generating key/value pairs out of the mapper.
Q10 What are the methods in the Reducer class and order of their invocation?
Answer: The Reducer class contains a run() method, which calls its setup() method once, then calls reduce() once for each input key, and finally calls its cleanup() method.
Q11 Explain what is the purpose of RecordReader in Hadoop?
Answer: In Hadoop, the RecordReader loads the data from its source and converts it into key, value pairs suitable for reading by the Mapper.
Q12 Explain MapReduce and its needs while programming with Apache Pig
Answer: Programs in Apache Pig are written in a query language called Pig Latin, which has some similarity to the SQL query language. To get a query executed, an execution engine is required. Pig's engine converts the queries into MapReduce jobs, so MapReduce acts as the execution engine needed to run the programs.
Q13 What are some typical functions of Job Tracker?
Answer: The following are some typical tasks of JobTracker:-
  • When Client applications submit map reduce jobs to the Job tracker
  • The JobTracker talks to the Name node to determine the location of the data
  • The JobTracker locates TaskTracker nodes with available slots at or near the data
  • The JobTracker submits the work to the chosen TaskTracker nodes
  • The TaskTracker nodes are monitored. If they do not submit heartbeat signals, they are deemed to have failed and the work is scheduled on a different TaskTracker
  • When the work is completed, the JobTracker updates its status
  • Client applications can poll the JobTracker for information
Q14 What are the four basic parameters of a mapper?
Answer: The four basic parameters of a mapper are LongWritable, Text, Text, and IntWritable. The first two represent the input key and value types, and the second two represent the intermediate output key and value types.
Q15 How can we change the split size if our commodity hardware has less storage space?
Answer: If our commodity hardware has less storage space, we can change the split size by writing the ‘custom splitter’. There is a feature of customization in Hadoop which can be called from the main method.
Q16 What is a TaskInstance?
Answer: The actual Hadoop MapReduce jobs that run on each slave node are referred to as Task instances. Every task instance has its own JVM process. For every new task instance, a JVM process is spawned by default for a task.
Q17 What do the master class and the output class do?
Answer: Master is defined to update the Master or the job tracker and the output class is defined to write data onto the output location.
Q18 What is the input type/format in MapReduce by default?
Answer: By default, the input type in MapReduce is ‘text’.
Q19 Is it mandatory to set input and output type/format in MapReduce?
Answer: No, it is not mandatory to set the input and output type/format in MapReduce. By default, the cluster takes the input and the output type as ‘text’.
Q20 How is Hadoop different from other data processing tools?
Answer: In Hadoop, based upon your requirements, you can increase or decrease the number of mappers without bothering about the volume of data to be processed. This is the beauty of parallel processing in contrast to the other data processing tools available.
Q21 What does job conf class do?
Answer: MapReduce needs to logically separate different jobs running on the same cluster. ‘Job conf class’ helps to do job level settings such as declaring a job in real environment. It is recommended that Job name should be descriptive and represent the type of job that is being executed.
Q22 Is it important for Hadoop MapReduce jobs to be written in Java?
Answer: It is not necessary to write Hadoop MapReduce jobs in Java; users can write MapReduce jobs in any desired programming language, such as Ruby, Perl, Python, R or Awk, through the Hadoop Streaming API.
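With Streaming, a mapper is just an executable that reads lines on stdin and writes tab-separated key/value pairs on stdout. Here is a hypothetical word-count Streaming mapper as a sketch (the file name `mapper.py` and the driving loop are illustrative assumptions):

```python
import sys

# Hypothetical Hadoop Streaming word-count mapper (mapper.py): Streaming
# talks to any executable via stdin/stdout, one tab-separated pair per line.
def stream_map(lines):
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

# In a real job this would be driven by stdin:
#   for record in stream_map(sys.stdin): print(record)
demo = list(stream_map(["hello world", "hello"]))
print(demo)  # ['hello\t1', 'world\t1', 'hello\t1']
```

Such a script would typically be submitted with the hadoop-streaming jar, passing it as the `-mapper` option alongside a similar `-reducer` script; the exact jar path varies by distribution.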
Q23 What is a Combiner?
Answer: A ‘Combiner’ is a mini reducer that performs the local reduce task. It receives the input from the mapper on a particular node and sends the output to the reducer. Combiners help in enhancing the efficiency of MapReduce by reducing the quantum of data that is required to be sent to the reducers.
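What the combiner buys you can be shown with a conceptual Python sketch (not Hadoop code): local aggregation on the map side shrinks the number of records shuffled to the reducers.

```python
from collections import Counter

# Sketch of a combiner: a mini-reduce run locally on one mapper's output.
def combine(pairs):
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())

map_output = [("a", 1), ("b", 1), ("a", 1), ("a", 1), ("b", 1)]
combined = combine(map_output)
print(combined)                               # [('a', 3), ('b', 2)]
print(len(map_output), "->", len(combined))   # 5 -> 2 records shuffled
```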
Q24 What do sorting and shuffling do?
Answer: Sorting and shuffling are responsible for creating a unique key and a list of values. Bringing identical keys together in one place is known as sorting, and the process by which the mapper's intermediate output is sorted and sent across to the reducers is known as shuffling.
Q25 What are the four basic parameters of a reducer?
Answer: The four basic parameters of a reducer are Text, IntWritable, Text, and IntWritable.The first two represent intermediate output parameters and the second two represent final output parameters.
Q26 What are the key differences between Pig vs MapReduce?
Answer: Pig is a data-flow language; its key focus is managing the flow of data from an input source to an output store. As part of managing this flow, it moves data along, feeding it to process 1, then taking the output and feeding it to process 2. Its core features are preventing execution of subsequent stages if a previous stage fails, managing temporary storage of data and, most importantly, compressing and rearranging processing steps for faster processing. While this could be done for any kind of processing task, Pig is written specifically for managing the data flow of MapReduce-type jobs; most, if not all, jobs in Pig are MapReduce or data-movement jobs. Pig also allows custom functions to be added, and ships with defaults such as ordering, grouping, distinct and count.
MapReduce, on the other hand, is a data-processing paradigm: a framework in which application developers write code so that it scales easily to petabytes of data. This creates a separation between the developer who writes the application and the developer who scales it. Not all applications can be migrated to MapReduce, but a good few can, from complex ones like k-means to simple ones like counting uniques in a dataset.
Q27 Why we cannot do aggregation or addition in a mapper? Why we require reducer for that?
Answer: We cannot do aggregation or addition in a mapper because sorting is not done in a mapper; sorting happens only on the reducer side. A mapper is initialized per input split, so while aggregating we would lose the values from previous splits: each split gets its own mapper instance, and no mapper has a view of the rows processed by the others.
Q28 What does a split do?
Answer: Before transferring the data from its hard-disk location to the map method, there is a phase called the split method. The split pulls a block of data from HDFS into the framework. The split class does not write anything; it reads data from the block and passes it to the mapper. By default, splitting is taken care of by the framework: the split size equals the block size, and it is used to divide a block into a bunch of splits.
Q29 What does the text input format do?
Answer: In text input format, each line gets a line offset, which is the byte offset of the line within the file. The key is that line offset and the value is the whole line of text. This is how the data reaches the mapper: the mapper receives the key as a LongWritable parameter and the value as a Text parameter.
Q30 What does a MapReduce partitioner do?
Answer: A MapReduce partitioner makes sure that all the values of a single key go to the same reducer, allowing even distribution of the map output over the reducers. It redirects the mapper output to the reducers by determining which reducer is responsible for a particular key.
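Hadoop's default HashPartitioner picks the reducer as `key.hashCode() % numReduceTasks`. This conceptual Python sketch (with CRC-32 standing in for Java's `hashCode()`) shows why every occurrence of a key lands on the same reducer:

```python
import zlib

# Sketch of hash partitioning: a stable hash of the key, modulo the
# number of reducers, picks the destination reducer.
def partition(key: str, num_reducers: int) -> int:
    return zlib.crc32(key.encode()) % num_reducers

# The same key is always routed to the same reducer:
print(partition("hadoop", 4) == partition("hadoop", 4))  # True
print([partition(k, 2) for k in ["a", "b", "c", "a", "b"]])
```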
Q31 Can we rename the output file?
Answer: Yes, we can rename the output file by implementing a multiple-format output class.
Q32 What is Streaming?
Answer: Streaming is a feature with Hadoop framework that allows us to do programming using MapReduce in any programming language which can accept standard input and can produce standard output. It could be Perl, Python, Ruby and not necessarily be Java. However, customization in MapReduce can only be done using Java and not any other programming language.
Q33 Explain what is Speculative Execution?
Answer: In Hadoop, Speculative Execution launches a certain number of duplicate tasks: multiple copies of the same map or reduce task can be executed on different slave nodes. In simple terms, if a particular node is taking a long time to complete a task, Hadoop creates a duplicate of that task on another node; whichever copy finishes first is retained and the slower copies are killed.
Q34 Is it possible to start reducers while some mappers still run? Why?
Answer: No. A reducer's input is grouped by key, and the last mapper could theoretically produce a key already consumed by a running reducer.
Q35 Describe reduce side join between tables with one-on-one relationship?
Answer: The mapper produces key/value pairs with the join ids as keys and the rows as values. Corresponding rows from both tables are grouped together by the framework during the shuffle and sort phase. The reduce method then receives a join id and two values, each representing a row from one table, and joins the data.
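The steps above can be sketched in conceptual Python (not Hadoop code; the two toy tables and tags "A"/"B" are made up for illustration):

```python
from collections import defaultdict

# Sketch of a reduce-side join for a one-to-one relationship.
users = [(1, "alice"), (2, "bob")]    # table A: (join_id, name)
orders = [(1, "book"), (2, "lamp")]   # table B: (join_id, item)

# "Map" phase: emit (join_id, (table_tag, row)); the dict plays the shuffle.
shuffled = defaultdict(list)
for uid, name in users:
    shuffled[uid].append(("A", name))
for oid, item in orders:
    shuffled[oid].append(("B", item))

# "Reduce" phase: each join id now has exactly one row per table.
joined = {}
for jid, tagged in shuffled.items():
    row = dict(tagged)
    joined[jid] = (row["A"], row["B"])
print(joined)  # {1: ('alice', 'book'), 2: ('bob', 'lamp')}
```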
Q36 Can you run Map – Reduce jobs directly on Avro data?
Answer: Yes, Avro was specifically designed for data processing via Map-Reduce.
Q37 Can reducers communicate with each other?
Answer: Reducers always run in isolation and they can never communicate with each other as per the Hadoop MapReduce programming paradigm.
Q38 How can you set an arbitrary number of Reducers to be created for a job in Hadoop?
Answer: You can do it programmatically by calling the setNumReduceTasks() method on the JobConf class, or set it as a configuration setting.
Q39 What is TaskTracker?
Answer:TaskTracker is a node in the cluster that accepts tasks like MapReduce and Shuffle operations – from a JobTracker.Each Task Tracker is responsible to execute and manage the individual tasks assigned by Job Tracker. Task Tracker also handles the data motion between the map and reduce phases.One Prime responsibility of Task Tracker is to constantly communicate with the Job Tracker the status of the Task.If the JobTracker fails to receive a heartbeat from a TaskTracker within a specified amount of time, it will assume the TaskTracker has crashed and will resubmit the corresponding tasks to other nodes in the cluster.
Q40 How to set mappers and reducers for Hadoop jobs?
Answer: Users can configure the JobConf variable to set the number of mappers and reducers: job.setNumMapTasks() and job.setNumReduceTasks().
Q41 What is configuration of a typical slave node on Hadoop cluster? How many JVMs run on a slave node?
Answer:Single instance of a Task Tracker is run on each Slave node. Task tracker is run as a separate JVM process.Single instance of a DataNode daemon is run on each Slave node. DataNode daemon is run as a separate JVM process.One or Multiple instances of Task Instance is run on each slave node. Each task instance is run as a separate JVM process. The number of Task instances can be controlled by configuration. Typically a high end machine is configured to run more task instances.
Q42 What do you know about NLineInputFormat?
Answer: NLineInputFormat splits ‘n’ lines of input as one split.
Q43 True or false: Each reducer must generate the same number of key/value pairs as its input had.
Answer: False. Reducer may generate any number of key/value pairs including zero.
Q44 When are the reducers started in a MapReduce job?
Answer: In a MapReduce job, reducers do not start executing the reduce method until all map jobs have completed. Reducers start copying intermediate key/value pairs from the mappers as soon as they are available, but the programmer-defined reduce method is called only after all the mappers have finished.
Q45 Name the job control options specified by MapReduce.
Answer: Since this framework supports chained operations, wherein the input of one map job serves as the output for another, there is a need for job controls to govern these complex operations. The various job control options are:
  • submit(): submit the job to the cluster and return immediately
  • waitForCompletion(boolean): submit the job to the cluster and wait for its completion
Q46  Decide if the statement is true or false: Each combiner runs exactly once.
Answer: False. The framework decides whether combiner runs zero, once or multiple times.
Q47 Define a straggler.
Answer: Straggler is either map or reduce task that takes unusually long time to complete.
Q48 Explain what is distributed Cache in MapReduce Framework ?
Answer: Distributed Cache is an important feature provided by map reduce framework. When you want to share some files across all nodes in Hadoop Cluster, DistributedCache is used. The files could be an executable jar files or simple properties file.
Q49 How JobTracker schedules a task?
Answer: The TaskTrackers send heartbeat messages to the JobTracker every few seconds to reassure the JobTracker that they are still alive. These messages also inform the JobTracker of the number of available slots, so the JobTracker can stay up to date with where in the cluster work can be delegated. When the JobTracker tries to find somewhere to schedule a task within the MapReduce operations, it first looks for an empty slot on the same server that hosts the DataNode containing the data, and if there is none, it looks for an empty slot on a machine in the same rack.
Q50 What is chain Mapper?
Answer: The ChainMapper class is a special implementation of the Mapper class through which a set of mapper classes can be run in a chained fashion within a single map task. In this chained execution pattern, the first mapper's output becomes the input of the second mapper, the second mapper's output the input of the third, and so on until the last mapper.
 MAY 24, 2016
