What are the supported programming languages for MapReduce?
The most common programming language is Java, but scripting
languages are also supported via Hadoop Streaming.
The original language supported is Java. However, as Hadoop
grew in popularity, various alternative scripting languages were
incorporated through Hadoop Streaming.
How does Hadoop
process large volumes of data?
Hadoop ships the code to the data instead of sending the
data to the code.
One of the basic design principles of Hadoop is to eliminate
copying data between datanodes; the code is moved to where the data resides.
What are sequence files and why are they important?
Sequence files are a binary key-value file format in the Hadoop
framework that allows data to be stored compactly and sorted.
Sequence files are also used as intermediate files that are created by
Hadoop after the map step.
Hadoop is able to split data between different nodes
gracefully while keeping the data compressed. Sequence files contain special
sync markers that allow the data to be split across the entire cluster.
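As an illustration, here is a minimal sketch of writing and reading a sequence file with the SequenceFile API; the path and key/value types are arbitrary examples, not part of any particular application:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path path = new Path("/tmp/example.seq");   // hypothetical path

    // Write a few key-value pairs into a sequence file.
    SequenceFile.Writer writer = SequenceFile.createWriter(conf,
        SequenceFile.Writer.file(path),
        SequenceFile.Writer.keyClass(LongWritable.class),
        SequenceFile.Writer.valueClass(Text.class));
    try {
      for (long i = 0; i < 10; i++) {
        writer.append(new LongWritable(i), new Text("record-" + i));
      }
    } finally {
      IOUtils.closeStream(writer);
    }

    // Read the pairs back.
    SequenceFile.Reader reader = new SequenceFile.Reader(conf,
        SequenceFile.Reader.file(path));
    try {
      LongWritable key = new LongWritable();
      Text value = new Text();
      while (reader.next(key, value)) {
        System.out.println(key + "\t" + value);
      }
    } finally {
      IOUtils.closeStream(reader);
    }
  }
}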
What are map files
and why are they important?
Map files are sorted sequence files that also have an index.
The index allows fast data lookup.
The Hadoop map file is a variation of the sequence file.
Map files are very important for the map-side join design pattern.
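A minimal sketch of creating a map file and using its index for a lookup; the directory name and key/value types are arbitrary examples (keys must be appended in sorted order):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class MapFileExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path dir = new Path("/tmp/example.map");   // a map file is a directory holding "data" and "index"

    MapFile.Writer writer = new MapFile.Writer(conf, dir,
        MapFile.Writer.keyClass(Text.class),
        SequenceFile.Writer.valueClass(Text.class));
    try {
      writer.append(new Text("a"), new Text("apple"));    // keys appended in sorted order
      writer.append(new Text("b"), new Text("banana"));
      writer.append(new Text("c"), new Text("cherry"));
    } finally {
      writer.close();
    }

    MapFile.Reader reader = new MapFile.Reader(dir, conf);
    try {
      Text value = new Text();
      reader.get(new Text("b"), value);                    // the index makes this lookup fast
      System.out.println(value);                           // prints "banana"
    } finally {
      reader.close();
    }
  }
}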
How can you use
binary data in MapReduce?
Binary data can be used directly by a MapReduce job. Often
binary data is added to a sequence file.
Binary data can be packaged in sequence files. A Hadoop
cluster does not work very well with large numbers of small files; therefore,
small files should be combined into bigger ones.
What is a map-side join?
A map-side join is done in the map phase and performed in memory.
The map-side join is a technique that allows a map file to be distributed
between different data nodes and loaded into memory, so each mapper can join
records as it reads its input. This technique allows very fast join performance.
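A minimal sketch of the idea: a hypothetical small lookup file, countries.txt with id<TAB>name lines, has been shipped to every node via the distributed cache and is loaded into a HashMap, so the join happens entirely inside the mapper:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Joins each input record against a small table held entirely in memory.
public class MapSideJoinMapper extends Mapper<LongWritable, Text, Text, Text> {

  private final Map<String, String> countryById = new HashMap<>();

  @Override
  protected void setup(Context context) throws IOException {
    // "countries.txt" is a hypothetical cached file of "id<TAB>name" lines,
    // added to the job with job.addCacheFile(...) and available locally by name.
    try (BufferedReader in = new BufferedReader(new FileReader("countries.txt"))) {
      String line;
      while ((line = in.readLine()) != null) {
        String[] parts = line.split("\t", 2);
        countryById.put(parts[0], parts[1]);
      }
    }
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Input records look like "userId<TAB>countryId"; join in memory, no reducer needed.
    String[] parts = value.toString().split("\t");
    String countryName = countryById.get(parts[1]);
    if (countryName != null) {
      context.write(new Text(parts[0]), new Text(countryName));
    }
  }
}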
What is a reduce-side join?
A reduce-side join is a technique for merging data from
different sources based on a specific key. There are no memory restrictions.
The reduce-side join is a technique for joining data of any
size in the reduce step. The technique is much slower than a map-side join.
However, this technique does not place any requirements on data size.
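A minimal sketch of the reduce side, assuming the mappers emit the join key together with values tagged by source (the prefixes "U:" for users and "O:" for orders are hypothetical); for brevity this sketch buffers the values for a single key in memory:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Receives all values for one join key, regardless of which dataset they came from,
// and emits the cross product of the two sides.
public class ReduceSideJoinReducer extends Reducer<Text, Text, Text, Text> {

  @Override
  protected void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    List<String> users = new ArrayList<>();
    List<String> orders = new ArrayList<>();

    for (Text value : values) {
      String v = value.toString();
      if (v.startsWith("U:")) {
        users.add(v.substring(2));
      } else if (v.startsWith("O:")) {
        orders.add(v.substring(2));
      }
    }

    for (String user : users) {
      for (String order : orders) {
        context.write(key, new Text(user + "\t" + order));
      }
    }
  }
}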
What is HIVE?
Hive is a part of the Apache Hadoop project that provides a
SQL-like interface for data processing.
Hive is a project initially developed by Facebook
specifically for people with very strong SQL skills and not very strong Java
skills who want to query data in Hadoop.
What is PIG?
Pig is a part of the Apache Hadoop project that provides a high-level
scripting language (Pig Latin) for data processing.
Pig is a project that was developed by Yahoo for people with
very strong skills in scripting languages. The Pig scripts are
automatically compiled into MapReduce jobs.
How can you disable
the reduce step?
A developer can always set the number of reducers to
zero. That will completely disable the reduce step.
A developer using the MapReduce API has full control over the
number of reducers used for job execution.
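For example, the reduce step can be disabled on the job object before submission; a minimal sketch that uses the base identity Mapper and placeholder input/output paths:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapOnlyJob {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "map-only example");
    job.setJarByClass(MapOnlyJob.class);
    job.setMapperClass(Mapper.class);   // base Mapper is an identity mapper; a real job supplies its own

    // Zero reducers turns this into a map-only job; mapper output is written directly to HDFS.
    job.setNumReduceTasks(0);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}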
Why would a developer
create a map-reduce job without the reduce step?
There is a CPU-intensive sort-and-shuffle step that occurs between the map
and reduce phases. Disabling the reduce step speeds up data processing.
Map-only MapReduce jobs are very common. They are normally used to perform
transformations on data that do not require sorting or aggregation.
What is the default
input format?
The default input format is TextInputFormat, with the byte offset
as the key and the entire line as the value.
Hadoop permits a large range of input formats. The default
is TextInputFormat. This format is the simplest way to access data as text
lines.
How can you override
the default input format?
In order to override the default input format, a developer has
to set the new input format on the job configuration before submitting the job
to the cluster.
A developer can always set a different input format on the job
configuration (e.g., sequence files, binary files, compressed formats).
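For example, switching from the default TextInputFormat to an input format that reads sequence files is a single call in the job driver; a minimal fragment, assuming a job object has already been created as in the other examples:

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;

// Inside the job driver, before submitting the job:
job.setInputFormatClass(SequenceFileInputFormat.class);   // replaces the default TextInputFormat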
What are the common
problems with map-side join?
The most common problem with map-side joins is an out-of-memory
exception on the slave nodes.
A map-side join uses memory for joining the data based on a
key. As a result, the data size is limited by the size of the available memory.
If the data exceeds the available memory, an out-of-memory error will occur.
Which is faster:
Map-side join or Reduce-side join? Why?
The map-side join is faster because the join operation is done in
memory.
The map-side join is faster, primarily due to its use
of memory. In-memory operations are always faster since there is no disk I/O
involved.
Will settings made using the
Java API overwrite values in configuration files?
Yes. Configuration settings made through the Java API take
precedence.
A developer has full control over the job settings on a Hadoop
cluster. All configuration values can be changed via the Java API.
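For example, a value from the XML configuration files can be overridden in the driver before the job is created; mapreduce.job.reduces is used here as the example property:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// The Configuration object loads the cluster's XML files (core-site.xml, mapred-site.xml, ...).
Configuration conf = new Configuration();
conf.set("mapreduce.job.reduces", "4");   // overrides whatever the XML files specify
// Note: properties marked <final> in the XML files cannot be overridden this way.
Job job = Job.getInstance(conf, "override example");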
What is AVRO?
Avro is a data serialization framework with a strong Java API.
AVRO is an Apache project that bridges the gap between
unstructured data and structured data. The Avro file format is highly optimized
for network transmission and is splittable between different datanodes.
Can you run MapReduce jobs directly on Avro data?
Yes, Avro was specifically designed with MapReduce data processing in mind.
AVRO implements all necessary interfaces for MapReduce
processing, and Avro data can be processed directly on a Hadoop cluster.
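A minimal driver fragment, assuming the avro-mapred module is on the classpath; AvroKeyInputFormat and AvroJob come from org.apache.avro.mapreduce, and the schema string is a hypothetical example:

import org.apache.avro.Schema;
import org.apache.avro.mapreduce.AvroJob;
import org.apache.avro.mapreduce.AvroKeyInputFormat;
import org.apache.hadoop.mapreduce.Job;

// Inside the job driver: declare the Avro schema of the input and use the Avro input format.
Schema schema = new Schema.Parser().parse(
    "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
    + "{\"name\":\"id\",\"type\":\"long\"},{\"name\":\"name\",\"type\":\"string\"}]}");
AvroJob.setInputKeySchema(job, schema);
job.setInputFormatClass(AvroKeyInputFormat.class);
// The mapper then receives AvroKey<GenericRecord> keys and NullWritable values.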
What is distributed
cache?
The distributed cache is a facility that allows developers
to deploy jars and other files for MapReduce processing.
The distributed cache is the Hadoop answer to the problem of
deploying third-party libraries. It allows libraries and other read-only files
to be copied to all the datanodes that run a job's tasks.
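For example, a lookup file and a third-party jar can be pushed to every node through the distributed cache in the job driver; the paths below are hypothetical:

import java.net.URI;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

// Inside the job driver, before submission:
job.addCacheFile(new URI("/apps/lookup/countries.txt#countries.txt"));  // available locally as "countries.txt"
job.addFileToClassPath(new Path("/apps/lib/third-party.jar"));          // jar added to the task classpath

The fragment after the # sign gives the cached file its local symlink name, which is how the map-side join sketch above opens it.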
What is the best
performance one can expect from a Hadoop cluster?
The best performance expectation one can have is measured in
seconds. This is because Hadoop can only be used for batch processing.
Hadoop was specifically designed for batch processing. There
are a few additional components that allow better performance. Near
real-time and real-time Hadoop performance are not currently possible but are
in the works.
What is writable?
Writable is a Java interface that needs to be implemented
for MapReduce processing.
Hadoop performs a lot of data transmission between
different datanodes. Writable provides the compact, fast serialization needed
to improve the performance of those transmissions.
The Hadoop API uses
basic types such as LongWritable, Text, and IntWritable. They have almost the
same features as the default Java classes. What are these writable data types
optimized for?
Writable data types are specifically optimized for network
transmission.
Data needs to be represented in a format optimized for
network transmission. Hadoop is based on the ability to send data between
datanodes very quickly, and writable data types are used for this purpose.
Can a custom data type for Map-Reduce processing be implemented?
Yes, custom data types can be implemented as long as they
implement the Writable interface.
Developers can easily implement new data types for any
objects. It is common practice to wrap existing classes and implement the
Writable interface on top of them.
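A minimal sketch of a custom type; PointWritable is a made-up example implementing the Writable interface:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

// A custom value type: a point with an x and y coordinate.
public class PointWritable implements Writable {
  private double x;
  private double y;

  public PointWritable() { }                      // required no-arg constructor

  public PointWritable(double x, double y) {
    this.x = x;
    this.y = y;
  }

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeDouble(x);                            // serialize fields in a fixed order
    out.writeDouble(y);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    x = in.readDouble();                           // deserialize in the same order
    y = in.readDouble();
  }
}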
What happens if
mapper output does not match reducer input?
A runtime exception will be thrown and the MapReduce job will
fail.
Reducers consume the mappers' output, and Java is a
strongly typed language. Therefore, an exception will be thrown at run time if
the types do not match.
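The types are declared in two places that must agree: the generic parameters of the Mapper/Reducer classes and the job configuration. A driver fragment, with MyReducer as a hypothetical placeholder:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

// Inside the job driver: these must match the mapper's output and the reducer's input types.
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
job.setReducerClass(MyReducer.class);   // a Reducer<Text, IntWritable, ...>; anything else fails at run time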
Can you provide
multiple input paths to a map-reduce job?
Yes, developers can add any number of input paths.
The Hadoop framework is capable of taking different input
paths and assigning different mappers for each one. This is a very convenient
way of writing different mappers to handle various datasets.
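For example, input paths can be added one by one or as a comma-separated list in the driver; the paths are hypothetical:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Inside the job driver:
FileInputFormat.addInputPath(job, new Path("/data/logs/2014"));
FileInputFormat.addInputPath(job, new Path("/data/logs/2015"));
// Or equivalently, as a comma-separated list:
FileInputFormat.setInputPaths(job, "/data/logs/2014,/data/logs/2015");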
Can you assign
different mappers to different input paths?
Yes, different mappers can be assigned to different
directories.
Assigning different mappers to different data sources is the
way to quickly and efficiently create code for processing multiple formats.
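This is done with the MultipleInputs helper; a driver fragment where the paths, input formats, and mapper classes (UserMapper, OrderMapper) are hypothetical examples:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Inside the job driver: each path gets its own input format and mapper.
MultipleInputs.addInputPath(job, new Path("/data/users"),
    TextInputFormat.class, UserMapper.class);
MultipleInputs.addInputPath(job, new Path("/data/orders"),
    SequenceFileInputFormat.class, OrderMapper.class);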
Can you suppress
reducer output?
Yes, there is a special output format (and a null data type) that will
suppress job output.
There are a number of scenarios where output is not required
from the reducers. For instance, in web crawling or image processing the
results may be stored elsewhere, so no reducer output needs to be written.
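For example, NullOutputFormat discards the job output entirely, and NullWritable can be used to drop one side of a key-value pair; a driver fragment:

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

// Inside the job driver: nothing is written to the output path.
job.setOutputFormatClass(NullOutputFormat.class);
// Alternatively, keep the output format but emit NullWritable for the unwanted side:
job.setOutputValueClass(NullWritable.class);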
Is there a map input
format?
No, but the sequence file input format can read map files.
Map files are just a variation of sequence files; they store
data in sorted order.
What is the most
important feature of map-reduce?
The ability to process data on a cluster of machines
without copying all the data over.
The fundamental difference of the Hadoop framework is that
multiple machines are used to process the same data, and the data is readily
available for processing in the distributed file system.
What is HBASE?
HBase is a part of the Apache Hadoop project that provides an
interface for scanning large amounts of data using the Hadoop infrastructure.
HBase is one of the Hadoop framework projects that allows
real-time data scans across big data volumes. It is very often used to serve
data from a cluster.
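A minimal client-side scan sketch, assuming the newer HBase client API (Connection/Table; older releases used HTable) and a hypothetical table "pages" with a column family "content":

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseScanExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table table = connection.getTable(TableName.valueOf("pages"))) {   // hypothetical table
      Scan scan = new Scan();
      scan.addFamily(Bytes.toBytes("content"));                             // hypothetical column family
      try (ResultScanner scanner = table.getScanner(scan)) {
        for (Result row : scanner) {
          System.out.println(Bytes.toString(row.getRow()));                 // print each row key
        }
      }
    }
  }
}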