What are the supported
programming languages for MapReduce?
The most common programming language is Java, but scripting
languages are also supported via Hadoop Streaming.
Java was the original supported language. As Hadoop became more
popular, support for other languages, including scripting languages,
was added.
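As a rough illustration of the streaming model, here is a minimal word-count sketch in Python. It assumes the usual streaming convention of tab-separated key/value lines and that reducer input arrives sorted by key. The function names `map_lines`, `reduce_sorted` and `run_locally` are hypothetical, and the local `sorted` call merely stands in for the sorting Hadoop does between the two steps.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one 'word<TAB>1' line per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_sorted(pairs):
    """Reducer: sum counts per word; input must be sorted by key."""
    parsed = (p.split("\t", 1) for p in pairs)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run_locally(lines):
    # sorted() stands in for the shuffle-and-sort step Hadoop performs.
    return list(reduce_sorted(sorted(map_lines(lines))))
```

In a real streaming job the mapper and the reducer would be separate scripts that read from standard input and print to standard output.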
How does Hadoop
process large volumes of data?
Hadoop ships the code to the data instead of sending the
data to the code.
A basic design principle of Hadoop is to minimize
data copying between datanodes.
What are sequence
files and why are they important?
Sequence files are flat binary files in the Hadoop
framework that store data as key-value pairs.
Hadoop also uses the sequence file format for the intermediate
output of the map step.
Hadoop is able to split data between different nodes
gracefully while keeping data compressed. Sequence files contain
sync markers that allow a file to be split across the cluster.
What are map files
and why are they important?
Map files are sorted sequence files that also have an index.
The index allows fast data lookup.
The Hadoop map file is a variation of the sequence file.
Map files are very important for the map-side join design pattern.
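The idea of "sorted records plus an index" can be sketched in a few lines of Python. This is only an illustration of the lookup principle, not Hadoop's on-disk format: the class name `TinyMapFile` and the `interval` parameter are hypothetical, and the data lives in memory rather than in files.

```python
from bisect import bisect_right


class TinyMapFile:
    """Sorted key/value records plus a sparse index of every Nth key."""

    def __init__(self, records, interval=2):
        self.data = sorted(records)  # the "data file", sorted by key
        # The "index file": (key, position) for every `interval`-th record.
        self.index = [(self.data[i][0], i)
                      for i in range(0, len(self.data), interval)]
        self.interval = interval

    def get(self, key):
        index_keys = [k for k, _ in self.index]
        slot = bisect_right(index_keys, key) - 1
        if slot < 0:
            return None  # key sorts before the first record
        start = self.index[slot][1]
        # Scan at most `interval` records starting at the indexed position.
        for k, v in self.data[start:start + self.interval]:
            if k == key:
                return v
        return None
```

The binary search over the sparse index narrows the lookup to a short run of records, which is why an indexed, sorted file is much cheaper to query than a plain sequence file.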
How can you use
binary data in MapReduce?
Binary data can be used directly by a map-reduce job. Often
binary data is added to a sequence file.
Binary data can be packaged in sequence files. A Hadoop
cluster does not work very well with large numbers of small files. Therefore,
small files should be combined into bigger ones.
What is a map-side
join?
A map-side join is performed in the map phase, in memory.
The map-side join is a technique that splits a map file
between different datanodes. The data is then loaded into memory.
This technique allows very fast join performance.
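One common variation of the in-memory approach can be sketched as follows. This is a simplified illustration that assumes the smaller dataset fits entirely in a mapper's memory; the function name `map_side_join` and the sample data in the usage note are hypothetical.

```python
def map_side_join(small_table, large_records):
    """Join each large-side record against an in-memory lookup table."""
    lookup = dict(small_table)  # must fit in the mapper's memory
    for key, value in large_records:
        if key in lookup:
            # Emit the joined record straight from the map step.
            yield key, (lookup[key], value)
```

For example, joining `[(1, "ann"), (2, "bob")]` against a stream of order records needs only one pass over the orders, with no reduce step at all.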
What is a reduce-side
join?
A reduce-side join is a technique for merging data from
different sources based on a specific key. There are no memory restrictions.
The reduce-side join is a technique for joining data of any
size in the reduce step. It is much slower than a map-side join,
but it places no requirements on data size.
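The three stages of a reduce-side join can be sketched in plain Python. This is a local simulation under simplifying assumptions, not Hadoop code: the function name `reduce_side_join` and the "L"/"R" tags are hypothetical, and the grouping that Hadoop performs across the cluster is done here with an in-memory dictionary.

```python
from collections import defaultdict


def reduce_side_join(left, right):
    """Tag records by source, group them by key, then join per key."""
    # Map phase: tag every record with the dataset it came from.
    tagged = [(k, ("L", v)) for k, v in left] + [(k, ("R", v)) for k, v in right]

    # Shuffle phase: group tagged values by key (Hadoop does this on disk).
    groups = defaultdict(list)
    for key, tagged_value in tagged:
        groups[key].append(tagged_value)

    # Reduce phase: pair every left value with every right value per key.
    joined = []
    for key in sorted(groups):
        lefts = [v for tag, v in groups[key] if tag == "L"]
        rights = [v for tag, v in groups[key] if tag == "R"]
        joined.extend((key, (l, r)) for l in lefts for r in rights)
    return joined
```

Every record from both inputs passes through the grouping stage, which is where the extra cost compared with a map-side join comes from.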
What is Hive?
Hive is a part of the Apache Hadoop ecosystem that provides
a SQL-like interface for data processing.
Hive was initially developed by Facebook
for people with strong SQL skills and weaker Java
skills who want to query data in Hadoop.
What is Pig?
Pig is a part of the Apache Hadoop ecosystem that provides
a scripting language interface (Pig Latin) for data processing.
Pig was developed by Yahoo for people with
strong skills in scripting languages. It translates scripts
into MapReduce jobs automatically.
How can you disable
the reduce step?
A developer can always set the number of reducers to
zero. That completely disables the reduce step.
A developer using the MapReduce API can configure the number of
reducers for a job; the number of mappers follows from the input splits.
Why would a developer
create a MapReduce job without the reduce step?
There is a CPU-intensive step that occurs between the map
and reduce steps (the shuffle and sort). Disabling the reduce step skips it and speeds up data processing.
Map-only MapReduce jobs are very common.
They are normally used to transform data without sorting or aggregation.
What is the default
input format?
The default input format is TextInputFormat, with the byte offset
as the key and the entire line as the value.
Hadoop permits a large range of input formats. The default
is the text input format, which is the simplest way to access data as text
lines.
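To make the key/value shape concrete, the following Python sketch produces the same kind of records from a byte string: the key is the byte offset at which a line starts and the value is the line's text. It is a local illustration only; the function name `text_input_records` is hypothetical and UTF-8 input is assumed.

```python
def text_input_records(data: bytes):
    """Yield (byte_offset, line) pairs, one per line of the input."""
    offset = 0
    for raw_line in data.splitlines(keepends=True):
        # Key: offset of the line start. Value: line without its terminator.
        yield offset, raw_line.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw_line)
```

Mappers frequently ignore the offset key and work only with the line value.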
How can you override
the default input format?
To override the default input format, a developer has
to set a new input format on the job configuration before submitting the job to a cluster.
A developer can always set a different input format on the job
configuration (e.g. sequence files, binary files, compressed formats).
What are the common
problems with map-side join?
The most common problems with map-side joins are out-of-memory
exceptions on slave nodes.
A map-side join uses memory to join the data based on a
key. As a result, the data size is limited by the available memory.
If the data exceeds available memory, an out-of-memory error will occur.
Which is faster:
Map-side join or Reduce-side join? Why?
A map-side join is faster because the join operation is done in
memory.
The map-side join is faster, primarily because it works in
memory: the join itself involves no disk I/O and no data has to be
shuffled to reducers.
Will settings made through
the Java API override values in configuration files?
Yes. Configuration settings made through the Java API take
precedence, except for properties that an administrator has marked as final.
A developer has broad control over the settings of a Hadoop
job. Most configuration values can be changed via the Java API.
What is Avro?
Avro is a data serialization library with bindings for Java, among other languages.
Avro is an Apache project that bridges the gap between
unstructured data and structured data. The Avro file format is highly optimized
for network transmission and is splittable between different datanodes.
Can you run MapReduce
jobs directly on Avro data?
Yes, Avro was specifically designed for data processing via
MapReduce.
Avro implements all the interfaces necessary for MapReduce
processing, so Avro data can be processed directly on a Hadoop cluster.
What is the distributed
cache?
The distributed cache is a component that allows developers
to deploy jars and other files needed for MapReduce processing.
The distributed cache is the Hadoop answer to the problem of
deploying third-party libraries. It copies the libraries to
every node that runs tasks for the job.
What is the best
performance one can expect from a Hadoop cluster?
The best performance one can expect is measured in
seconds. This is because Hadoop can only be used for batch processing.
Hadoop was specifically designed for batch processing. There
are a few additional components that allow better performance. Near
real-time and real-time Hadoop performance are not currently possible but are
in the works.
What is Writable?
Writable is a Java interface that needs to be implemented
for MapReduce processing.
Hadoop performs a lot of data transmission between
different datanodes. Writable is needed for MapReduce processing because it
gives data a compact serialized form, which improves the performance of those transmissions.
The Hadoop API uses
its own basic types such as LongWritable, Text, and IntWritable. They have almost the
same features as the default Java classes. What are these Writable data types
optimized for?
Writable data types are specifically optimized for network
transmission.
Data needs to be represented in a format optimized for
network transmission. Hadoop is based on the ability to send data between
datanodes very quickly. Writable data types are used for this purpose.
Can a custom type for
MapReduce data processing be implemented?
Yes, custom data types can be implemented as long as they
implement the Writable interface.
Developers can easily implement new data types for any
objects. It is common practice to take existing classes and have them implement the
Writable interface. Types used as keys must also be comparable (WritableComparable).
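The contract behind a custom type can be sketched in Python. In Java the Writable interface has two methods, `write(DataOutput)` and `readFields(DataInput)`; the class below mirrors that pair with `write` and `read_fields`. The class name `PointWritable`, its two integer fields and the helper `round_trip` are hypothetical and exist only for illustration.

```python
import io
import struct


class PointWritable:
    """Hypothetical custom type with Writable-style write/read_fields."""

    def __init__(self, x=0, y=0):
        self.x, self.y = x, y

    def write(self, out):
        # Two big-endian 32-bit ints: a compact, fixed 8-byte encoding.
        out.write(struct.pack(">ii", self.x, self.y))

    def read_fields(self, src):
        # Overwrite this instance's fields so the object can be reused.
        self.x, self.y = struct.unpack(">ii", src.read(8))


def round_trip(point):
    """Serialize a point, then restore it into a fresh instance."""
    buffer = io.BytesIO()
    point.write(buffer)
    restored = PointWritable()
    restored.read_fields(io.BytesIO(buffer.getvalue()))
    return buffer.getvalue(), restored
```

The type itself decides exactly which bytes go on the wire, which keeps the serialized form small and makes it cheap to move between nodes.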
What happens if
mapper output does not match reducer input?
A run-time exception will be thrown and the MapReduce job will
fail.
Reducers consume the mappers' output, and Java is a
strongly typed language. Therefore, an exception will be thrown at run time if
the types do not match.
Can you provide
multiple input paths to a MapReduce job?
Yes, developers can add any number of input paths.
The Hadoop framework is capable of taking different input
paths and assigning different mappers for each one. This is a very convenient
way of writing different mappers to handle various datasets.
Can you assign
different mappers to different input paths?
Yes, different mappers can be assigned to different
directories.
Assigning different mappers to different data sources is a
quick and efficient way to write code for processing multiple formats.
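The dispatch idea can be sketched in Python. In Hadoop's Java API this is the role of the MultipleInputs class; the code below is only a local simulation, and the mapper functions, the paths in the usage note and the `run_mappers` helper are all hypothetical.

```python
def csv_mapper(line):
    """Parse 'user,amount' records."""
    user, amount = line.split(",")
    yield user, int(amount)


def tsv_mapper(line):
    """Parse 'user<TAB>amount' records."""
    user, amount = line.split("\t")
    yield user, int(amount)


def run_mappers(inputs, mapper_for_path):
    """Apply the mapper registered for each path to that path's lines."""
    output = []
    for path, lines in inputs.items():
        mapper = mapper_for_path[path]
        for line in lines:
            output.extend(mapper(line))
    return output
```

Because both mappers emit the same (user, amount) shape, a single reducer can consume their combined output regardless of the source format.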
Can you suppress
reducer output?
Yes, there is a special data type that will suppress job
output.
There are a number of scenarios where output is not required
from reducers. For instance, jobs whose useful work is a side effect, such as
web crawling or image processing, may have nothing to write as reducer output.
Is there a map input
format?
No, but the sequence file input format can read map files.
Map files are just a variation of sequence files. They store
data in sorted order.
What is the most
important feature of map-reduce?
The ability to process data on a cluster of machines
without copying all the data over.
The fundamental difference of the Hadoop framework is that
multiple machines are used to process the same data, and the data is readily available
for processing in a distributed file system.
What is HBase?
HBase is a part of the Apache Hadoop ecosystem that provides
an interface for scanning large amounts of data using Hadoop infrastructure.
HBase is one of the Hadoop framework projects that allows
real-time data scans across big data volumes. It is very often used to serve
data from a cluster.