allinone: Hadoop Developer interview questions and answers

Tuesday, July 16, 2013

What programming languages are supported for MapReduce?

The most common programming language is Java, but scripting languages are also supported via Hadoop Streaming.

Java was the original supported language. As Hadoop became more popular, Hadoop Streaming made it possible to write mappers and reducers in other languages, including scripting languages, by reading records from standard input and writing results to standard output.
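As a rough illustration of the streaming model described above, here is a minimal word-count sketch in Python. The function names `map_lines` and `reduce_sorted` are hypothetical, and the sort between the two steps is simulated locally; on a cluster the framework would sort the mapper output before it reaches the reducer.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one 'word<TAB>1' record per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_sorted(records):
    """Reducer: sum the counts per word. Input must be sorted by key."""
    pairs = (record.rstrip("\n").split("\t", 1) for record in records)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Local simulation of map -> sort -> reduce.
mapped = list(map_lines(["to be or not to be"]))
counts = list(reduce_sorted(sorted(mapped)))
```

In a real streaming job the two functions would typically live in separate scripts that read `sys.stdin` and print to standard output.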

How does Hadoop process large volumes of data?

Hadoop ships the code to the data instead of sending the data to the code.

A basic design principle of Hadoop is to minimize the copying of data between different DataNodes.

What are sequence files and why are they important?

Sequence files are flat binary files in the Hadoop framework that store data as key/value pairs.

They support compression and remain splittable, which makes them a common choice for passing data between MapReduce jobs.

Hadoop is able to split data between different nodes gracefully while keeping the data compressed. Sequence files contain special sync markers that allow the data to be split across the entire cluster.

What are map files and why are they important?

Map files are sorted sequence files that also have an index. The index allows fast data lookup.

The Hadoop map file is a variation of the sequence file. Map files are very important for the map-side join design pattern.
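To see why a sorted file plus an index gives fast lookups, consider this small Python sketch. It is a conceptual model only, not the Hadoop MapFile API: the functions `build_index` and `lookup` and the `interval` parameter are assumptions made for illustration. The index keeps only every few keys, so a lookup jumps to the nearest indexed key and scans a short run of records.

```python
from bisect import bisect_right


def build_index(sorted_records, interval=2):
    """Keep every `interval`-th key together with its position in the data."""
    return [(sorted_records[i][0], i) for i in range(0, len(sorted_records), interval)]


def lookup(sorted_records, index, key):
    """Jump to the nearest indexed key <= key, then scan forward."""
    index_keys = [k for k, _ in index]
    slot = bisect_right(index_keys, key) - 1
    if slot < 0:
        return None  # key sorts before the first record
    start = index[slot][1]
    for k, value in sorted_records[start:]:
        if k == key:
            return value
        if k > key:
            break  # records are sorted, so the key is absent
    return None


records = [("a", 1), ("c", 3), ("e", 5), ("g", 7), ("i", 9)]
index = build_index(records)
found = lookup(records, index, "g")
missing = lookup(records, index, "d")
```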

How can you use binary data in MapReduce?

Binary data can be used directly by a MapReduce job. Often binary data is added to a sequence file.

Binary data can be packaged in sequence files. A Hadoop cluster does not work very well with large numbers of small files, so small files should be combined into bigger ones.

What is a map-side join?

A map-side join is done in the map phase and in memory.

The map-side join is a technique in which the join is performed inside the mappers. The smaller dataset, for example a map file, is distributed to the nodes and loaded into memory, so each mapper can join its input records without a reduce step. This technique allows very fast joins.
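The idea can be sketched in a few lines of Python. This is a conceptual model rather than Hadoop code: `map_side_join` is a hypothetical function, and the `users` and `orders` datasets are made up for the example. The small table is loaded into a dictionary once, and every record of the large dataset is joined with a single in-memory lookup.

```python
def map_side_join(small_table, large_records):
    """Join each large-side record against an in-memory copy of the small table."""
    in_memory = dict(small_table)  # must fit in the mapper's memory
    for key, value in large_records:
        if key in in_memory:
            yield key, in_memory[key], value


users = [(1, "ann"), (2, "bob")]                              # small side
orders = [(1, "book"), (2, "pen"), (3, "cup"), (1, "ink")]    # large side
joined = list(map_side_join(users, orders))
```

Because no sorting or shuffling is involved, the output keeps the order of the large input.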

What is a reduce-side join?

A reduce-side join is a technique for merging data from different sources based on a specific key. There are no memory restrictions.

The reduce-side join is a technique for joining data of any size in the reduce step. It is much slower than a map-side join, but it places no requirements on data size.
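A minimal Python sketch of the reduce-side approach follows. The names `tag`, `shuffle`, and `reduce_join` and the source labels "L" and "R" are assumptions for illustration. Note that this toy `shuffle` groups everything in memory, whereas a real cluster sorts and spills to disk, which is what removes the size limit.

```python
from collections import defaultdict
from itertools import chain


def tag(records, source):
    """Map step: label each record with the dataset it came from."""
    for key, value in records:
        yield key, (source, value)


def shuffle(tagged):
    """Stand-in for the framework's shuffle: group tagged values by key."""
    groups = defaultdict(list)
    for key, tagged_value in tagged:
        groups[key].append(tagged_value)
    return sorted(groups.items())


def reduce_join(key, tagged_values):
    """Reduce step: pair every left value with every right value for one key."""
    left = [v for source, v in tagged_values if source == "L"]
    right = [v for source, v in tagged_values if source == "R"]
    return [(key, l, r) for l in left for r in right]


users = [(1, "ann"), (2, "bob")]
orders = [(1, "book"), (2, "pen"), (3, "cup"), (1, "ink")]
grouped = shuffle(chain(tag(users, "L"), tag(orders, "R")))
joined = [row for key, values in grouped for row in reduce_join(key, values)]
```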

What is Hive?

Hive is a part of the Apache Hadoop ecosystem that provides a SQL-like interface for data processing.

Hive is a project initially developed by Facebook specifically for people with very strong SQL skills and not very strong Java skills who want to query data in Hadoop.

What is Pig?

Pig is a part of the Apache Hadoop ecosystem that provides a scripting language interface, Pig Latin, for data processing.

Pig is a project that was developed by Yahoo for people with very strong skills in scripting languages. It automatically translates scripts into MapReduce jobs.

How can you disable the reduce step?

A developer can always set the number of reducers to zero. That will completely disable the reduce step.

A developer using the MapReduce API can set the number of reducers for a job directly. The number of mappers, by contrast, is driven by the number of input splits.

Why would a developer create a MapReduce job without the reduce step?

There is a CPU-intensive step, the shuffle and sort, that occurs between the map and reduce steps. Disabling the reduce step skips it and speeds up data processing.

Map-only MapReduce jobs are very common. They are normally used to perform transformations on data without sorting or aggregation.
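The difference between a map-only job and a full job can be modeled with a toy runner in Python. The function `run_job` is hypothetical and only mimics the data flow: when no reducer is given, mapper output is returned as-is, with no sorting or grouping.

```python
def run_job(records, mapper, reducer=None):
    """Toy job runner: skips sort and grouping when there is no reducer."""
    mapped = [pair for record in records for pair in mapper(record)]
    if reducer is None:
        return mapped  # map-only: output is written in input order
    groups = {}
    for key, value in sorted(mapped):  # the costly sort/shuffle stand-in
        groups.setdefault(key, []).append(value)
    return [reducer(key, values) for key, values in groups.items()]


def upper_mapper(line):
    return [(line.upper(), 1)]


def sum_reducer(key, values):
    return key, sum(values)


map_only = run_job(["b", "a", "b"], upper_mapper)
full_job = run_job(["b", "a", "b"], upper_mapper, sum_reducer)
```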

What is the default input format?

The default input format is TextInputFormat, with the byte offset as the key and the entire line as the value.

Hadoop permits a large range of input formats. The default is the text input format, which is the simplest way to access data as lines of text.
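The key/value shape of the default format can be pictured with a short Python sketch. The function `text_input_records` is a hypothetical stand-in, not the Hadoop implementation; it only shows that the key is the byte offset at which each line starts and the value is the line without its terminator.

```python
def text_input_records(data: bytes):
    """Yield (byte offset, line text) pairs, one per line of input."""
    offset = 0
    for raw_line in data.splitlines(keepends=True):
        yield offset, raw_line.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw_line)  # advance by the full line, terminator included


records = list(text_input_records(b"hello\nworld\nhi"))
```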

How can you override the default input format?

To override the default input format, a developer has to set a new input format on the job configuration before submitting the job to a cluster.

A developer can always set a different input format on the job configuration (e.g. sequence files, binary files, compressed formats).

What are the common problems with map-side join?

The most common problems with map-side joins are out-of-memory exceptions on slave nodes.

A map-side join uses memory for joining the data based on a key. As a result, the data size is limited to the size of the available memory. If the data exceeds the available memory, an out-of-memory error will occur.

Which is faster: map-side join or reduce-side join? Why?

The map-side join is faster because the join operation is done in memory.

The map-side join is faster, primarily because it works in memory. In-memory operations avoid disk I/O, and the join also avoids the shuffle and sort that a reduce-side join requires.

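Will settings made through the Java API override values in configuration files?

Yes. Configuration settings made through the Java API take precedence.

A developer has broad control over job settings on a Hadoop cluster. Configuration values can be changed via the Java API, except for properties that an administrator has marked as final in the configuration files.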
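The precedence rule can be modeled as a layered merge. The Python sketch below is a conceptual illustration and not the Hadoop Configuration API: `effective_config` is a hypothetical function, the values are made up, and the `final_keys` parameter is an assumption that stands in for properties an administrator has locked.

```python
def effective_config(site_file, job_overrides, final_keys=frozenset()):
    """Apply job-level settings on top of file settings, skipping locked keys."""
    merged = dict(site_file)
    for key, value in job_overrides.items():
        if key not in final_keys:
            merged[key] = value  # the job-level setting wins
    return merged


site = {"dfs.replication": "3", "mapred.reduce.tasks": "1"}
job = {"dfs.replication": "1", "mapred.reduce.tasks": "8"}

unlocked = effective_config(site, job)
locked = effective_config(site, job, final_keys={"dfs.replication"})
```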

What is Avro?

Avro is a data serialization system with libraries for Java and other languages.

Avro is an Apache project that bridges the gap between unstructured data and structured data. The Avro file format is highly optimized for network transmission and is splittable between different DataNodes.

Can you run MapReduce jobs directly on Avro data?

Yes, Avro was specifically designed for data processing via MapReduce.

Avro implements all the interfaces necessary for MapReduce processing, so Avro data can be processed directly on a Hadoop cluster.

What is the distributed cache?

The distributed cache is a component that allows developers to deploy jars for MapReduce processing.

The distributed cache is the Hadoop answer to the problem of deploying third-party libraries. It allows libraries to be deployed to all DataNodes.

What is the best performance one can expect from a Hadoop cluster?

The best performance one can expect is measured in seconds, because Hadoop can only be used for batch processing.

Hadoop was designed specifically for batch processing. There are a few additional components that allow better performance. Near real-time and real-time Hadoop performance are not currently possible but are in the works.

What is Writable?

Writable is a Java interface that needs to be implemented for MapReduce processing.

Hadoop performs a lot of data transmission between different DataNodes. Writable is needed for MapReduce processing in order to improve the performance of those transmissions.

The Hadoop API uses types such as LongWritable, Text, and IntWritable in place of basic Java types. They have almost the same features as the default Java classes. What are these Writable data types optimized for?

Writable data types are specifically optimized for network transmission.

Data needs to be represented in a format optimized for network transmission. Hadoop is based on the ability to send data between DataNodes very quickly, and Writable data types are used for this purpose.

Can a custom data type for MapReduce processing be implemented?

Yes, custom data types can be implemented as long as they implement the Writable interface.

Developers can easily implement new data types for any objects. It is common practice to take existing classes and extend them with the Writable interface. Types used as keys must also be comparable, which means implementing WritableComparable.
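The contract behind a custom type is simply "write your fields to a binary stream, then read them back in the same order". The Python sketch below mirrors that idea outside Hadoop: `PointWritable` and its `write` and `read_fields` methods are hypothetical analogues of the Java interface, and the two big-endian 4-byte integers are an assumed layout chosen for the example.

```python
import io
import struct


class PointWritable:
    """Toy analogue of a custom Writable holding two 32-bit integers."""

    def __init__(self, x=0, y=0):
        self.x, self.y = x, y

    def write(self, out):
        # Serialize the fields in a fixed order as big-endian 4-byte ints.
        out.write(struct.pack(">ii", self.x, self.y))

    def read_fields(self, inp):
        # Deserialize in exactly the same order the fields were written.
        self.x, self.y = struct.unpack(">ii", inp.read(8))


buffer = io.BytesIO()
PointWritable(3, -4).write(buffer)

buffer.seek(0)
copy = PointWritable()
copy.read_fields(buffer)
```

The compact fixed layout, with no field names or type information in the stream, is what keeps such records cheap to send between nodes.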

What happens if mapper output does not match reducer input?

A run-time exception will be thrown and the MapReduce job will fail.

Reducers consume the mappers' output, and Java is a strongly typed language. Therefore, an exception will be thrown at run time if the types do not match.

Can you provide multiple input paths to a MapReduce job?

Yes, developers can add any number of input paths.

The Hadoop framework is capable of taking different input paths and assigning different mappers for each one. This is a very convenient way of writing different mappers to handle various datasets.

Can you assign different mappers to different input paths?

Yes, different mappers can be assigned to different directories.

Assigning different mappers to different data sources is the way to quickly and efficiently create code for processing multiple formats.
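The pattern amounts to a table that maps each input path to the mapper that understands its format. Here is a conceptual Python sketch; the paths, the sample lines, and the names `csv_mapper`, `tsv_mapper`, and `mappers_by_path` are all made up for illustration. Both mappers emit the same key/value shape so that a single reducer could consume either.

```python
def csv_mapper(line):
    """Parse a comma-separated 'user,amount' line."""
    user, amount = line.split(",")
    return user, int(amount)


def tsv_mapper(line):
    """Parse a tab-separated 'user<TAB>amount' line."""
    user, amount = line.split("\t")
    return user, int(amount)


# Hypothetical input directories, each bound to its own mapper.
mappers_by_path = {"/data/csv": csv_mapper, "/data/tsv": tsv_mapper}
inputs = {"/data/csv": ["ann,5"], "/data/tsv": ["bob\t7"]}

output = [
    mappers_by_path[path](line)
    for path, lines in inputs.items()
    for line in lines
]
```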

Can you suppress reducer output?

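Yes, there are special types for this: NullOutputFormat discards job output entirely, and NullWritable can stand in for a key or value that should not be written.

There are a number of scenarios where output is not required from reducers. For instance, jobs such as web crawling or image processing may do their work as a side effect and need no reducer output.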

Is there a map file input format?

No, but the sequence file input format can read map files.

Map files are just a variation of sequence files. They store data in sorted order.

What is the most important feature of MapReduce?

The ability to process data on a cluster of machines without copying all the data over.

The fundamental difference in the Hadoop framework is that multiple machines are used to process the same dataset, and the data is readily available for processing in a distributed file system.

What is HBase?

HBase is a part of the Apache Hadoop ecosystem that provides an interface for scanning large amounts of data using Hadoop infrastructure.

HBase is one of the Hadoop ecosystem projects that allows real-time data scans across big data volumes. It is very often used to serve data from a cluster.