
Tuesday, July 16, 2013

Hadoop Developer interview questions and answers



What programming languages are supported for MapReduce?

The most common programming language is Java, but scripting languages are also supported via Hadoop Streaming.
Java was the original supported language. As Hadoop became more popular, Hadoop Streaming opened MapReduce to scripting languages: any program that reads records from standard input and writes them to standard output can act as a mapper or reducer.
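
As a rough illustration of the streaming model described above, here is a minimal word-count sketch in Python. The function names `map_lines` and `reduce_lines` are made up for this example, and the tab-separated "key, tab, value" line layout is assumed as the convention between the two steps. A real streaming script would read `sys.stdin` and print each result; here the functions take and return plain lines so the logic can be tried locally.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one "word<TAB>1" line per word in the input."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(sorted_lines):
    """Reducer: sum the counts of consecutive lines that share a key.

    The reducer receives its input sorted by key, so grouping
    adjacent lines is enough to see every value for a word together.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"
```

Locally, `reduce_lines(sorted(map_lines(lines)))` stands in for the whole job, with `sorted` playing the part of the sort between the two steps.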

How does Hadoop process large volumes of data?
Hadoop ships the code to the data instead of sending the data to the code.
A basic design principle of Hadoop is to minimize copying data between datanodes, so each task is scheduled as close as possible to the data it reads.


What are sequence files and why are they important?
Sequence files are binary files that store data as key-value pairs.
They are often used to pass data between chained MapReduce jobs.
Hadoop is able to split data between different nodes gracefully while keeping it compressed. Sequence files contain special markers that allow a file to be split across the entire cluster.
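
The role of those markers can be sketched in a few lines of Python. This is a toy layout, not the real sequence file format: the `SYNC` value, the length-prefixed records, and the functions `write_records` and `read_split` are all assumptions made for illustration. The point is only that a reader dropped at an arbitrary byte offset can skip forward to the next marker and start on a clean record boundary.

```python
import struct

SYNC = b"\xffSYNC\xff"  # made-up marker for this sketch


def write_records(records, sync_every=2):
    """Serialize length-prefixed records, inserting a marker every few records."""
    out = bytearray()
    for i, record in enumerate(records):
        if i % sync_every == 0:
            out += SYNC
        out += struct.pack(">I", len(record)) + record
    return bytes(out)


def read_split(data, start, end):
    """Return the records of every marker-delimited block that begins in [start, end)."""
    pos = data.find(SYNC, start)
    records = []
    while pos != -1 and pos < end:
        pos += len(SYNC)
        next_sync = data.find(SYNC, pos)
        stop = len(data) if next_sync == -1 else next_sync
        while pos < stop:
            (length,) = struct.unpack_from(">I", data, pos)
            pos += 4
            records.append(data[pos:pos + length])
            pos += length
        pos = next_sync
    return records
```

Two readers given adjacent byte ranges of the same file each pick up whole blocks, so no record is lost or read twice even though the ranges were cut at an arbitrary byte.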



What are map files and why are they important?
Map files are sorted sequence files that also have an index. The index allows fast data lookup.
The Hadoop map file is a variation of the sequence file. Map files are very important for the map-side join design pattern.
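
To show why sorted data plus an index gives fast lookups, here is a small Python sketch. It assumes a sparse index that remembers every second key and its position; the interval and the names `build_index` and `lookup` are chosen only for illustration and say nothing about the real on-disk layout.

```python
from bisect import bisect_right


def build_index(sorted_records, interval=2):
    """Keep every `interval`-th key together with its position in the data."""
    return [(key, pos) for pos, (key, _) in enumerate(sorted_records)
            if pos % interval == 0]


def lookup(sorted_records, index, key):
    """Binary-search the sparse index, then scan forward a few records."""
    index_keys = [k for k, _ in index]
    slot = bisect_right(index_keys, key) - 1
    if slot < 0:
        return None  # key sorts before the first record
    start = index[slot][1]
    for record_key, value in sorted_records[start:]:
        if record_key == key:
            return value
        if record_key > key:
            break  # passed the place where the key would be
    return None
```

Because the records are sorted, the scan after the index hit never has to look at more than a handful of entries before it either finds the key or knows it is absent.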


How can you use binary data in MapReduce?
Binary data can be used directly by a map-reduce job. Often binary data is added to a sequence file.
Binary data can be packaged in sequence files. A Hadoop cluster does not work well with large numbers of small files, so small files should be combined into bigger ones.


What is a map-side join?
A map-side join is performed in the map phase, in memory.
The map-side join is a technique that allows a map file to be split between different datanodes. The data is loaded into memory, which makes the join very fast.
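
The in-memory idea can be sketched in plain Python. This is not Hadoop code: `map_side_join`, the table contents, and the choice of an inner join are assumptions for the example. One side is loaded into a dictionary, and the other side is streamed past it one record at a time, as a mapper would see it.

```python
def map_side_join(small_table, large_records):
    """Join each streamed record against an in-memory lookup table."""
    lookup = dict(small_table)        # the whole small side must fit in memory
    for key, value in large_records:  # the large side is read once, record by record
        if key in lookup:             # inner join: unmatched keys are dropped
            yield key, value, lookup[key]
```

No sorting or grouping is needed, which is where the speed comes from; the cost is that `lookup` has to fit in the memory of every task.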



What is a reduce-side join?
A reduce-side join is a technique for merging data from different sources based on a specific key. There are no memory restrictions.
The reduce-side join is a technique for joining data of any size in the reduce step. It is much slower than a map-side join, but it places no requirements on data size.
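
A reduce-side join can be sketched the same way. In this toy Python version, the map step tags each record with its source, a stand-in `shuffle` function groups the tagged values by key, and the reduce step pairs up the two sides. The names `tag`, `shuffle`, `reduce_join` and the "left"/"right" labels are invented for the example.

```python
from collections import defaultdict
from itertools import product


def tag(records, source):
    """Map step: tag every record with the dataset it came from."""
    for key, value in records:
        yield key, (source, value)


def shuffle(tagged):
    """Stand-in for the shuffle: group all tagged values by key."""
    groups = defaultdict(list)
    for key, tagged_value in tagged:
        groups[key].append(tagged_value)
    return sorted(groups.items())


def reduce_join(key, tagged_values):
    """Reduce step: pair every left value with every right value for one key."""
    left = [v for source, v in tagged_values if source == "left"]
    right = [v for source, v in tagged_values if source == "right"]
    for l, r in product(left, right):
        yield key, l, r
```

Only the values for a single key are held at once in the reducer, which is why the technique scales to inputs of any size, and the grouping step in the middle is what makes it slower.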


What is Hive?
Hive is a part of the Apache Hadoop ecosystem that provides a SQL-like interface for data processing.
Hive was initially developed by Facebook for people with strong SQL skills and weaker Java skills who want to query data in Hadoop.


What is Pig?
Pig is a part of the Apache Hadoop ecosystem that provides a scripting language interface, Pig Latin, for data processing.
Pig was developed by Yahoo for people with strong skills in scripting languages. It translates scripts into MapReduce jobs automatically.


How can you disable the reduce step?
A developer can always set the number of reducers to zero. That completely disables the reduce step.
Through the MapReduce API, a developer controls the number of reducers used for job execution; the number of mappers follows from how the input is split.


Why would a developer create a MapReduce job without the reduce step?

There is a CPU-intensive step, the shuffle and sort, that occurs between the map and reduce steps. Disabling the reduce step skips it and speeds up data processing.
Map-only MapReduce jobs are very common. They are normally used to perform transformations on data that need no sorting or aggregation.
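
The difference can be seen in a toy job runner written in Python. Nothing here is Hadoop API: `run_job`, `upper_mapper`, and the use of `sort` as a stand-in for the step between map and reduce are all assumptions for the sketch.

```python
from itertools import groupby


def run_job(records, mapper, reducer=None):
    """Run a toy job; with no reducer the sort and grouping are skipped."""
    mapped = [pair for record in records for pair in mapper(record)]
    if reducer is None:
        return mapped  # map-only: mapper output is the job output, in input order
    mapped.sort(key=lambda kv: kv[0])  # stand-in for the sort before the reduce step
    return [reducer(key, [v for _, v in group])
            for key, group in groupby(mapped, key=lambda kv: kv[0])]


def upper_mapper(line):
    """Hypothetical transformation: key each line by its first letter."""
    yield line[0], line.upper()
```

Calling `run_job(lines, upper_mapper)` returns the transformed records untouched by any sort, while passing a reducer forces the extra pass over all the mapped data.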

What is the default input format?
The default input format is TextInputFormat, with the byte offset as the key and the entire line as the value.
Hadoop permits a large range of input formats. The default is the text input format, which is the simplest way to access data as lines of text.
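
To make the key and value concrete, here is a short Python sketch that turns a block of bytes into the same kind of pairs: the byte offset at which each line starts, and the text of the line. The function name `text_records` and the UTF-8 decoding are assumptions for the example.

```python
def text_records(data: bytes):
    """Yield (byte_offset, line) pairs for each line in the data."""
    offset = 0
    for raw in data.splitlines(keepends=True):
        # The key is where the line starts; the value is the line without its newline.
        yield offset, raw.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw)
```

The offsets count bytes, not characters or line numbers, so the second line of a file gets the length of the first line (including its newline) as its key.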


How can you override the default input format?
To override the default input format, a developer has to set a new input format on the job configuration before submitting the job to a cluster.
A developer can always set a different input format on the job configuration (e.g. sequence files, binary files, compressed formats).


What are the common problems with map-side join?
The most common problems with map-side joins are out-of-memory exceptions on slave nodes.
A map-side join uses memory to join the data based on a key. As a result, the data size is limited by the available memory. If the data exceeds it, an out-of-memory error will occur.

Which is faster: Map-side join or Reduce-side join? Why?
Map-side join is faster because the join operation is done in memory.
The map-side join is faster, primarily because it works in memory: no disk I/O is involved, and the data does not have to be shuffled to reducers.

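Will settings made through the Java API override values in configuration files?
Yes. Configuration settings made through the Java API take precedence.
A developer has control over the settings of a job on a Hadoop cluster, and most of them can be changed via the Java API. The exception is a property that an administrator has marked as final in the configuration files, which a job cannot override.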
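
The precedence order can be pictured as layers of dictionaries. The Python sketch below is only a model of that idea: `effective_config` and its parameters are invented for illustration, and the `final_keys` argument assumes the cluster protects some properties from job-level changes.

```python
def effective_config(defaults, site, job_settings, final_keys=frozenset()):
    """Layer settings: defaults, then site files, then job-level values."""
    config = {**defaults, **site}
    for key, value in job_settings.items():
        if key in final_keys:
            continue  # a protected property keeps its configuration-file value
        config[key] = value
    return config
```

A job-level value wins over the file value for any key that is not protected, which matches the "Java API takes precedence" answer above.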


What is Avro?
Avro is a data serialization library with a Java implementation.
Avro is an Apache project that bridges the gap between unstructured and structured data. The Avro file format is highly optimized for network transmission and is splittable between different datanodes.


Can you run MapReduce jobs directly on Avro data?

Yes, Avro was specifically designed for data processing via MapReduce.
Avro implements all the interfaces necessary for MapReduce processing, so Avro data can be processed directly on a Hadoop cluster.


What is the distributed cache?
The distributed cache is a component that allows developers to deploy jars for MapReduce processing.
The distributed cache is Hadoop's answer to the problem of deploying third-party libraries. It allows libraries to be deployed to all datanodes.

What is the best performance one can expect from a Hadoop cluster?

The best performance one can expect is measured in seconds, because Hadoop can only be used for batch processing.
Hadoop was designed specifically for batch processing. A few additional components allow better performance. Near real-time and real-time Hadoop performance are not currently possible but are in the works.


What is Writable?
Writable is a Java interface that needs to be implemented for MapReduce processing.
Hadoop performs a lot of data transmission between datanodes. Writable defines a compact serialization for keys and values, which improves the performance of those transmissions.


The Hadoop API uses types such as LongWritable, Text, and IntWritable in place of basic Java types such as Long, String, and Integer. They have almost the same features as the default Java classes. What are these Writable data types optimized for?

Writable data types are specifically optimized for network transmission.
Data needs to be represented in a format optimized for network transmission. Hadoop is based on the ability to send data between datanodes very quickly. Writable data types are used for this purpose.

Can a custom type for MapReduce data processing be implemented?

Yes, custom data types can be implemented as long as they implement the Writable interface.
Developers can easily implement new data types for any objects. It is common practice to take existing classes and extend them with the Writable interface. A type that is used as a key must also be comparable, which Hadoop expresses through WritableComparable.
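
The shape of such a type can be sketched in Python. Hadoop's Writable interface pairs a write method with a readFields method; the class below imitates that pairing with `write` and `read_fields`, using the standard `struct` module. `PointWritable` and its two integer fields are hypothetical, and this is an analogy rather than code that Hadoop can load.

```python
import io
import struct


class PointWritable:
    """Hypothetical custom type with a write / read_fields pair."""

    def __init__(self, x=0, y=0):
        self.x = x
        self.y = y

    def write(self, out):
        # Two big-endian 32-bit signed integers: a compact, fixed layout.
        out.write(struct.pack(">ii", self.x, self.y))

    def read_fields(self, inp):
        # Read the fields back in exactly the order they were written.
        self.x, self.y = struct.unpack(">ii", inp.read(8))
```

A round trip looks like writing `PointWritable(3, -4)` into an `io.BytesIO()` buffer, rewinding it, and calling `read_fields` on a fresh instance. The serialized form is eight bytes with no field names or type tags, which is the kind of compactness the answers above refer to.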

What happens if mapper output does not match reducer input?

A run-time exception will be thrown and the MapReduce job will fail.
Reducers consume the mappers' output, and Java is a strongly typed language. Therefore, an exception will be thrown at run time if the types do not match.



Can you provide multiple input paths to a MapReduce job?

Yes, developers can add any number of input paths.
The Hadoop framework is capable of taking different input paths and assigning a different mapper to each one. This is a very convenient way of writing different mappers to handle various datasets.


Can you assign different mappers to different input paths?

Yes, different mappers can be assigned to different directories, for example through the MultipleInputs class.
Assigning different mappers to different data sources is a quick and efficient way to write code that processes multiple formats.
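
As a plain Python sketch of the idea, each input path can be registered with its own parsing function, and both mappers emit the same kind of key-value pair so that later steps do not care where a record came from. The paths, the two file layouts, and the names `parse_csv`, `parse_tsv`, and `run_mappers` are all made up for the example.

```python
def parse_csv(line):
    """Hypothetical mapper for comma-separated input: user first, then amount."""
    user, amount = line.split(",")
    return user, int(amount)


def parse_tsv(line):
    """Hypothetical mapper for tab-separated input with the columns swapped."""
    amount, user = line.split("\t")
    return user, int(amount)


def run_mappers(inputs, mappers):
    """Apply the mapper registered for each path to that path's lines."""
    for path, lines in inputs.items():
        mapper = mappers[path]
        for line in lines:
            yield mapper(line)
```

The format-specific code stays inside each mapper, so adding a third format means adding one function and one registration rather than touching the existing ones.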


Can you suppress reducer output?

Yes, there is a special output format, NullOutputFormat, that suppresses job output.
There are a number of scenarios where output is not required from reducers. For instance, a web crawling or image processing job may store its results in an external system itself, leaving nothing for the reducers to emit.


Is there a map input format?

No, but the sequence file input format can read map files.
Map files are just a variation of sequence files. They store data in sorted order.


What is the most important feature of MapReduce?
The ability to process data on a cluster of machines without copying all the data over.
The fundamental difference of the Hadoop framework is that multiple machines are used to process the same data, and the data is readily available for processing in a distributed file system.


What is HBase?

HBase is a part of the Apache Hadoop ecosystem that provides an interface for scanning large amounts of data using Hadoop infrastructure.
HBase is one of the Hadoop ecosystem projects that allows real-time data scans across big data volumes. It is very often used to serve data from a cluster.
