Tuesday, July 16, 2013

Hadoop admin interview questions and answers



Which operating system(s) are supported for production Hadoop deployment?

The main supported operating system is Linux. However, with some additional software Hadoop can be deployed on Windows.

What is the role of the namenode?

The namenode is the "brain" of the Hadoop cluster, responsible for managing the distribution of blocks across the system based on the replication policy. The namenode also supplies the specific block locations to clients on request.

What happens on the namenode when a client tries to read a data file?

The namenode will look up the information about the file in the edit log and then retrieve the remaining information from the in-memory filesystem snapshot.
Since the namenode needs to support a large number of clients, it only sends back the locations of the data. The datanode itself is responsible for serving the actual data.

What are the hardware requirements for a Hadoop cluster (primary and secondary namenodes and datanodes)?

There are no strict requirements for datanodes. However, the namenodes require a sufficient amount of RAM to hold the filesystem image in memory.
By design, both the primary and secondary namenode keep the entire filesystem information in memory. Therefore, both need enough memory to contain the entire filesystem image.

What mode(s) can Hadoop code be run in?

Hadoop can be deployed in standalone mode, pseudo-distributed mode, or fully-distributed mode.
Hadoop was specifically designed to be deployed on a multi-node cluster. However, it can also be deployed on a single machine, as a single process, for testing purposes.

How would a Hadoop administrator deploy the various components of Hadoop in production?

Deploy the namenode and jobtracker on the master node, and deploy datanodes and tasktrackers on multiple slave nodes.
Only one namenode and one jobtracker are needed on the system. The number of datanodes depends on the available hardware.

What is the best practice to deploy the secondary namenode?

Deploy secondary namenode on a separate standalone machine
The secondary namenode needs to be deployed on a separate machine so that it does not interfere with primary namenode operations. The secondary namenode has the same memory requirements as the primary namenode.


Is there a standard procedure to deploy Hadoop?

No, there are some differences between the various distributions. However, they all require that the Hadoop jars be installed on the machine.
There are some common requirements for all Hadoop distributions, but the specific procedures differ between vendors since each includes some degree of proprietary software.


What is the role of the secondary namenode?

The secondary namenode performs the CPU-intensive operation of combining the edit log with the current filesystem snapshot.
The secondary namenode was separated out as its own process because of this CPU-intensive work and the additional requirement to back up the metadata.

What are the side effects of not running a secondary name node?

The cluster's performance will degrade over time because the edit log will grow larger and larger.

If the secondary namenode is not running at all, the edit log will grow significantly and slow the system down. Also, the system will stay in safe mode for an extended time after a restart, since the namenode needs to combine the edit log with the current filesystem checkpoint image.


What happens if a datanode loses network connection for a few minutes?

The namenode will detect that a datanode is not responsive and will start replicating the data from the remaining replicas. When the datanode comes back online, the extra replicas will be deleted.
The replication factor is actively maintained by the namenode. The namenode monitors the status of all datanodes and keeps track of which blocks are located on each node. The moment a datanode becomes unavailable, the namenode triggers replication of its data from the existing replicas. If the datanode later comes back up, the over-replicated data will be deleted. Note: the data might be deleted from the original datanode.




What happens if one of the datanodes has a much slower CPU?

The task execution will only be as fast as the slowest worker. However, if speculative execution is enabled, the slowest worker will not have such a big impact.
Hadoop was specifically designed to work with commodity hardware, and speculative execution helps offset slow workers: multiple instances of the same task are created, the jobtracker takes the first result returned, and the remaining instances are killed.

What is speculative execution?
If speculative execution is enabled, the jobtracker issues multiple instances of the same task on multiple nodes and takes the result of the instance that finishes first. The other instances are killed.
Speculative execution is used to offset the impact of slow workers in the cluster. The jobtracker creates multiple instances of the same task, keeps the first successful result, and discards the rest.



How many racks do you need to create a Hadoop cluster in order to make sure that the cluster operates reliably?

To ensure reliable operation, it is recommended to have at least two racks with rack placement configured.
Hadoop has a built-in rack awareness mechanism that allows data distribution between different racks based on the configuration.


Are there any special requirements for namenode?

Yes, the namenode holds information about all files in the system and needs to be extra reliable

The namenode is a single point of failure. It needs to be extra reliable, and its metadata needs to be replicated in multiple places. Note that the community is working on solving the namenode's single-point-of-failure issue.


If you have a file 128M size and replication factor is set to 3, how many blocks can you find on the cluster that will correspond to that file (assuming the default apache and cloudera configuration)?

6
Based on the configuration settings, the file will be divided into blocks according to the default block size of 64 MB: 128 MB / 64 MB = 2 blocks. Each block will then be replicated according to the replication factor setting (default 3): 2 * 3 = 6 blocks.
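The arithmetic above can be checked with a quick shell calculation:

```shell
# Worked example: 128 MB file, 64 MB default block size, replication factor 3
blocks=$(( 128 / 64 ))       # the file splits into 2 blocks
total=$(( blocks * 3 ))      # each block is stored 3 times
echo "$total"
```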


What is distributed copy (distcp)?

Distcp is a Hadoop utility for launching MapReduce jobs to copy data. The primary usage is for copying a large amount of data
One of the major challenges in a Hadoop environment is copying data across multiple clusters; distcp allows multiple datanodes to be leveraged for parallel copying of the data.
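A typical invocation looks like the following; the namenode hostnames, port, and paths are illustrative, not taken from any real cluster:

```shell
# Launch a MapReduce copy job that moves a directory tree between clusters
hadoop distcp hdfs://nn1.example.com:8020/data hdfs://nn2.example.com:8020/backup/data
```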

What is replication factor?

Replication factor controls how many copies of each individual block are stored.
Data is replicated across the Hadoop cluster based on the replication factor. A higher replication factor guarantees data availability in the event of failure.
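The replication factor can also be changed per path from the command line; the path below is illustrative:

```shell
# Set replication to 3 for everything under /user/data;
# -w waits until replication actually reaches the requested factor
hadoop fs -setrep -w 3 /user/data
```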


What daemons run on Master nodes?

NameNode, Secondary NameNode and JobTracker
Hadoop comprises five separate daemons, and each daemon runs in its own JVM. NameNode, Secondary NameNode and JobTracker run on master nodes. DataNode and TaskTracker run on each slave node.
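On a stock Hadoop 1.x tarball installation, these daemons are typically started from the master node with the bundled helper scripts (script names per the standard distribution; a managed distribution may use its own service wrappers):

```shell
start-dfs.sh      # starts the NameNode, DataNodes, and Secondary NameNode
start-mapred.sh   # starts the JobTracker and TaskTrackers
```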

What is rack awareness?

Rack awareness is the way the namenode decides how to place blocks based on the rack definitions.
Hadoop tries to keep network traffic between datanodes within the same rack and contacts remote racks only when it has to. The namenode can control this thanks to rack awareness.
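Rack definitions come from an administrator-supplied topology script, which the namenode calls with datanode addresses and which prints one rack path per address. A minimal sketch follows; mapping everything to a single default rack is an assumption for illustration, since a production script would consult a lookup table:

```shell
#!/bin/sh
# Minimal topology-script sketch: map every address to one rack.
resolve_rack() {
  for host in "$@"; do
    # a production script would look the host up in a rack table here
    echo "/default-rack"
  done
}
resolve_rack "$@"
```

In Hadoop 1.x the script is registered via the `topology.script.file.name` property in core-site.xml.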

What is the role of the jobtracker in a Hadoop cluster?

The jobtracker is responsible for scheduling tasks on slave nodes, collecting results, retrying failed tasks
The jobtracker is the main component of MapReduce execution. It controls the division of the job into smaller tasks, submits tasks to individual tasktrackers, tracks the progress of each job, and reports results back to the calling code.


How does the Hadoop cluster tolerate datanode failures?

Since Hadoop is designed to run on commodity hardware, datanode failures are expected. The namenode keeps track of all available datanodes and actively maintains the replication factor on all data.
The namenode actively tracks the status of all datanodes and acts immediately if the datanodes become non-responsive. The namenode is the central "brain" of the HDFS and starts replication of the data the moment a disconnect is detected.



What is the procedure for namenode recovery?
A namenode can be recovered in two ways: starting new namenode from backup metadata or promoting secondary namenode to primary namenode
The namenode recovery procedure is very important to ensure the reliability of the data. It can be accomplished by starting a new namenode using backup metadata or by promoting the secondary namenode to primary.


Web-UI shows that half of the datanodes are in decommissioning mode. What does that mean? Is it safe to remove those nodes from the network?

This means the namenode is retiring those datanodes by moving their replicas to the remaining datanodes. Data can be lost if the administrator removes those datanodes before decommissioning has finished.
Because of the replication strategy, removing datanodes en masse before the decommissioning process completes can cause data loss. Decommissioning means the namenode moves the replicas held on those datanodes onto the remaining datanodes.
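On Hadoop 1.x, decommissioning is typically driven through the excludes file; the file path and hostname below are illustrative:

```shell
# 1. add the hosts to retire to the file referenced by dfs.hosts.exclude
echo "dn07.example.com" >> /etc/hadoop/conf/excludes
# 2. make the namenode re-read its host lists and begin decommissioning
hadoop dfsadmin -refreshNodes
# 3. remove the nodes only after the web UI reports them as Decommissioned
```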


What does the Hadoop administrator have to do after adding new datanodes to the Hadoop cluster?

Since the new nodes will not have any data on them, the administrator needs to start the balancer to redistribute data evenly between all nodes.
The Hadoop cluster will detect new datanodes automatically. However, to optimize cluster performance it is recommended to start the balancer to redistribute the data evenly between datanodes.
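The balancer can be started with the stock script; the threshold value (allowed percentage deviation from average utilization) shown here is illustrative:

```shell
# Move blocks until every datanode is within 10% of average utilization
start-balancer.sh -threshold 10
```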


If the Hadoop administrator needs to make a change, which configuration file does he need to change?


It depends on the nature of the change. Each node has its own set of configuration files, and they are not always the same on each node.
Each node in the Hadoop cluster has its own configuration files, and the changes need to be made in every file. One of the reasons for this is that the configuration can be different for every node.
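On a Hadoop 1.x installation the per-node configuration typically lives in the conf directory; the listing below assumes a standard tarball layout with `HADOOP_HOME` set:

```shell
ls "$HADOOP_HOME/conf"
# typically includes core-site.xml, hdfs-site.xml, mapred-site.xml,
# hadoop-env.sh, and the masters and slaves files
```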

Map Reduce jobs are failing on a cluster that was just restarted. They worked before restart. What could be wrong?

The cluster is in safe mode. The administrator needs to wait for the namenode to exit safe mode before restarting the jobs.
This is a very common situation when there is no secondary namenode on the cluster and the cluster has not been restarted in a long time. The namenode goes into safe mode while it combines the edit log with the current filesystem checkpoint.
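Safe mode can be checked and waited on from the command line (Hadoop 1.x dfsadmin syntax):

```shell
hadoop dfsadmin -safemode get    # reports whether safe mode is ON or OFF
hadoop dfsadmin -safemode wait   # blocks until the namenode leaves safe mode
```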

Map Reduce jobs take too long. What can be done to improve the performance of the cluster?


One of the most common reasons for performance problems on a Hadoop cluster is an uneven distribution of tasks. The number of tasks has to match the number of available slots on the cluster.
Hadoop is not a hardware aware system. It is the responsibility of the developers and the administrators to make sure that the resource supply and demand match.

How often do you need to reformat the namenode?
Never. The namenode needs to be formatted only once, in the beginning. Reformatting the namenode will lead to loss of the data on the entire cluster.

The namenode is formatted only once, at initial setup. Formatting creates the directory structure for the filesystem metadata and generates a namespaceID for the entire filesystem.

After increasing the replication level, I still see that data is under replicated. What could be wrong?

Data replication takes time due to the large quantities of data involved. The Hadoop administrator should allow sufficient time for replication to complete.
Depending on the data size, replication will take some time. The cluster still needs to copy the data around, and if the data set is big enough it is not uncommon for replication to take from a few minutes to a few hours.
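Replication progress can be monitored with fsck, which reports under-replicated blocks:

```shell
# fsck walks the namespace and prints totals, including under-replicated blocks
hadoop fsck / -blocks
```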
