Tuesday, July 23, 2013

HOW TO Become an Efficient Support Engineer in the IT industry?

I have been a Production Support Engineer for over 7 years now. Over the years I have used several tools to help me work efficiently and be productive. In this post I list a few things that I have learned as a Support Engineer.

Use a good text editor and master it (Sublime Text 2)

A text editor that is easy to use is a very good tool in the hands of a Support Engineer. No matter what product you support, you will always need a workspace to copy things quickly, make notes, etc. There have been several instances wherein I have had to transform text using simple actions such as find and replace. I also use my text editor to read/write code.
The text editor I recommend and use is Sublime Text 2. This is easily the best text editor I have used over the years. It has a fantastic set of nifty shortcuts, very good syntax highlighting that helps in reading code and also has a very simple word completion feature (really useful when writing code). The one feature I really like about Sublime Text 2 is that the unsaved files are not lost in case you accidentally close the editor and will be available when you open Sublime the next time. The default theme is very soothing too.
Some of the other good text editors I have used are :
Using the various utilities that come with these powerful text editors will speed up your work drastically.

Manage passwords using a Password Manager (KeePass)

As a support engineer you will have access to several systems to get your work done. Most of these systems are password protected, and managing the access credentials can be really difficult.
KeePass is a free tool that can be used to manage your passwords and login information.

Use a Desktop Automation Software (AutoHotKey)

My work involves a lot of emailing. There are a few sentences that I use in almost all my emails. Writing the same line repeatedly in every email is not an efficient way to work. I use AutoHotKey to auto-complete the sentences that I use repeatedly.
For example, I usually end my emails with, “Please do let me know if you need any information.”
The tool is easy to use and allows you to set a combination of keys which when typed, expands to the required sentence. In the above example “Please do let me know if you need any information.” is expanded when I type “pdl”.
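The idea behind this kind of text expansion can be sketched in a few lines of Python. This is only an illustration of the concept, not how AutoHotKey itself works: the `ABBREVIATIONS` table and the `expand` function are hypothetical names, and the sketch expands abbreviations in a finished string rather than while you type.

```python
import re

# Hypothetical abbreviation table; "pdl" is the example from this post.
ABBREVIATIONS = {
    "pdl": "Please do let me know if you need any information.",
}


def expand(text, table=ABBREVIATIONS):
    """Replace each whole-word abbreviation in text with its expansion."""
    if not table:
        return text
    # \b anchors keep "pdl" from matching inside longer words such as "pdls".
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, table)) + r")\b")
    return pattern.sub(lambda match: table[match.group(1)], text)
```

Calling `expand("Thanks for the update. pdl")` returns the sentence with the closing line written out in full.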

Use Microsoft Excel or a similar spreadsheet application

Microsoft Excel is something that I use for almost all kinds of data transformation and analysis. This is a must-have tool whether you are a Support Engineer or not. For me it helps to put my initial analysis on a spreadsheet and look at it when working on an issue. You could use other spreadsheet solutions that come as part of LibreOffice or Google Docs, but I personally prefer Microsoft Excel.

Use a Cloud Storage (Dropbox)

As a Support Engineer you may have to work on something that is critical in nature even when you don’t have access to your own laptop/PC. It is good to have some storage on the cloud to keep your documents and reference notes so that they are easily accessible from another PC over the internet. I personally recommend Dropbox, but these days there are several other good solutions like Box, SkyDrive and Google Drive.

Track your time

Make it a point to track your time correctly. You can use any tool that works for you. I use a simple time tracking tool that I built for myself; however, you can use Microsoft Excel, any text editor, etc. Tracking your time will give you good feedback on how much time you are taking for a particular task, and this will help you give better effort estimates for tasks assigned to you.
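As a rough illustration of how little is needed, here is a minimal sketch of a time log summary in Python. It is not the tool mentioned above; the `minutes_per_task` function and the `(task, start, end)` entry layout are assumptions made for this example.

```python
from collections import defaultdict
from datetime import datetime


def minutes_per_task(entries):
    """Sum elapsed minutes per task from (task, start, end) ISO-8601 strings."""
    totals = defaultdict(float)
    for task, start, end in entries:
        elapsed = datetime.fromisoformat(end) - datetime.fromisoformat(start)
        totals[task] += elapsed.total_seconds() / 60
    return dict(totals)


# Hypothetical log for one working day.
log = [
    ("ticket-1", "2013-07-23T09:00:00", "2013-07-23T09:45:00"),
    ("email", "2013-07-23T10:00:00", "2013-07-23T10:30:00"),
    ("ticket-1", "2013-07-23T13:00:00", "2013-07-23T13:15:00"),
]
summary = minutes_per_task(log)
```

Comparing totals like these against your original estimates is what makes the next estimate better.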

Learn a scripting language

Try to learn a scripting language, for example Shell scripting, PowerShell scripting for Windows, Ruby or Perl. I personally prefer Shell scripting and Ruby. Knowing a scripting language will help you automate tasks or work on ad hoc requirements more efficiently. In my opinion, learning to program in any one language helps one think in a more structured manner.

Develop basic SQL Skills

Learn the basics of SQL like SELECT, UPDATE, INSERT, DELETE and TRUNCATE.
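A quick way to practise these statements is an in-memory SQLite database, which ships with Python. The sketch below uses a made-up `tickets` table. Note that SQLite has no TRUNCATE statement, so a DELETE without a WHERE clause stands in for it here; on databases that do support it, TRUNCATE TABLE empties a table in one step.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical table used only for this exercise.
cur.execute("CREATE TABLE tickets (id INTEGER PRIMARY KEY, status TEXT)")

# INSERT: add three rows.
cur.executemany(
    "INSERT INTO tickets (id, status) VALUES (?, ?)",
    [(1, "open"), (2, "open"), (3, "closed")],
)

# UPDATE: close ticket 2.
cur.execute("UPDATE tickets SET status = 'closed' WHERE id = ?", (2,))

# DELETE: remove ticket 1 only.
cur.execute("DELETE FROM tickets WHERE id = ?", (1,))

# SELECT: read back what is left.
remaining = cur.execute("SELECT id, status FROM tickets ORDER BY id").fetchall()

# SQLite has no TRUNCATE; DELETE without WHERE empties the table instead.
cur.execute("DELETE FROM tickets")
count_after = cur.execute("SELECT COUNT(*) FROM tickets").fetchone()[0]

conn.close()
```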

Use a Screen Shot Tool (Greenshot)

I use Greenshot to take screenshots of selected portions of the screen. This avoids taking a screenshot of the complete screen and then using an image editor like Microsoft Paint to crop the image before sharing it.

Ask “How can it be done?”

This is something that I picked up from my father. He always says that if you are presented with a problem, the first obvious question should be: “How can it be done?”. This instills a problem-solving attitude in you, and the only way you would go is forward. This is simple but very powerful advice.
Revisit your routine and try to simplify your work. Look for monotonous work wherever you can find it, simplify it and, if possible, automate it.
Doing the above will help you be efficient, which will give you ample time and resources to help your customers solve their issues and problems.
Let me know your comments and also let me know if you are using any good tool that has helped you be more efficient and productive.

Tuesday, July 16, 2013

Hadoop Developer interview questions and answers

What are the supported programming languages for MapReduce?

The most common programming language is Java, but scripting languages are also supported via Hadoop streaming.
The original language supported is Java. However, as Hadoop became more popular, various alternative scripting languages were incorporated.
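To make the streaming idea concrete, here is a minimal word-count sketch in Python. Hadoop streaming feeds records to a script on standard input and reads tab-separated key/value lines from standard output. The functions below keep that line format but work on plain lists so they can be run without a cluster. The names `mapper`, `reducer` and `run_locally` are chosen for this example, and `sorted()` is only a stand-in for the framework's shuffle and sort.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line per word, as a streaming mapper would."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum the counts per word; input lines must already be sorted by key."""
    pairs = (line.split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run_locally(lines):
    """Simulate map, shuffle/sort and reduce in a single process."""
    return list(reducer(sorted(mapper(lines))))
```

On a real cluster the mapper and reducer would be two separate scripts reading `sys.stdin` and printing to standard output.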

How does Hadoop process large volumes of data?
Hadoop ships the code to the data instead of sending the data to the code.
A basic design principle of Hadoop is to eliminate data copying between different datanodes.

What are sequence files and why are they important?
Sequence files are a type of file in the Hadoop framework that allow data to be sorted.
Sequence files are intermediate files that are created by Hadoop after the map step.
Hadoop is able to split data between different nodes gracefully while keeping data compressed. The sequence files have special markers that allow data to be split across the entire cluster.

What are map files and why are they important?
Map files are sorted sequence files that also have an index. The index allows fast data look up.
The Hadoop map file is a variation of the sequence file. They are very important for map-side join design pattern.

How can you use binary data in MapReduce?
Binary data can be used directly by a map-reduce job. Often binary data is added to a sequence file.
Binary data can be packaged in sequence files. A Hadoop cluster does not work very well with large numbers of small files. Therefore, small files should be combined into bigger ones.

What is a map-side join?
A map-side join is done in the map phase and in memory.
The map-side join is a technique that allows for splitting a map file between different datanodes. The data will be loaded into memory. This technique allows very fast performance for the join.
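The in-memory part of the technique can be sketched in plain Python. This is a single-process illustration rather than Hadoop code: `map_side_join` and the sample tables are hypothetical, and the dictionary plays the role of the data set that each mapper holds in memory.

```python
def map_side_join(small_table, large_records):
    """Join streamed records against a lookup table held entirely in memory."""
    # The whole small table must fit in memory; this is the technique's limit.
    lookup = dict(small_table)
    for key, value in large_records:
        if key in lookup:
            yield key, lookup[key], value


# Hypothetical data: a small users table and a larger stream of orders.
users = [(1, "ann"), (2, "bob")]
orders = [(1, "book"), (3, "pen"), (2, "lamp"), (1, "mug")]
joined = list(map_side_join(users, orders))
```

Each streamed record needs only one dictionary lookup, with no sorting or shuffling, which is where the speed comes from.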

What is a reduce-side join?
A reduce-side join is a technique for merging data from different sources based on a specific key. There are no memory restrictions.
The reduce-side join is a technique for joining data of any size in the reduce step. The technique is much slower than a map-side join. However, it does not have any requirements on data size.
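A reduce-side join can be sketched in the same single-process style. Here the mappers tag every record with its source, the sort stands in for Hadoop's shuffle, and the reducer pairs up records that share a key. The names `tag` and `reduce_side_join` and the "L"/"R" tags are assumptions made for this illustration.

```python
from itertools import groupby
from operator import itemgetter


def tag(records, source):
    """Map step: label each record with the data set it came from."""
    for key, value in records:
        yield key, (source, value)


def reduce_side_join(left, right):
    """Group tagged records by key, then pair left and right values per key."""
    tagged = list(tag(left, "L")) + list(tag(right, "R"))
    tagged.sort(key=itemgetter(0))  # stand-in for the shuffle and sort phase
    for key, group in groupby(tagged, key=itemgetter(0)):
        values = [tagged_value for _, tagged_value in group]
        lefts = [v for source, v in values if source == "L"]
        rights = [v for source, v in values if source == "R"]
        for left_value in lefts:
            for right_value in rights:
                yield key, left_value, right_value


# Hypothetical data, same shape as a users/orders join.
users = [(1, "ann"), (2, "bob")]
orders = [(1, "book"), (3, "pen"), (2, "lamp"), (1, "mug")]
joined = list(reduce_side_join(users, orders))
```

The extra tagging and sorting is the price paid for not needing either data set to fit in memory.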

What is HIVE?
Hive is a part of the Apache Hadoop project that provides a SQL-like interface for data processing.
Hive is a project initially developed by Facebook specifically for people with very strong SQL skills and not very strong Java skills who want to query data in Hadoop.

What is PIG?
Pig is a part of the Apache Hadoop project that provides a C-like scripting language interface for data processing.
Pig is a project that was developed by Yahoo for people with very strong skills in scripting languages. From a script, it automatically creates MapReduce jobs.

How can you disable the reduce step?
A developer can always set the number of reducers to zero. That will completely disable the reduce step.
A developer who uses the MapReduce API has full control over the number of mappers and reducers for job execution.

Why would a developer create a map-reduce job without the reduce step?

There is a CPU-intensive step that occurs between the map and reduce steps. Disabling the reduce step speeds up data processing.
Map-only MapReduce jobs are very common. They are normally used to perform transformations on data without sorting and aggregation.

What is the default input format?
The default input format is TextInputFormat with byte offset as a key and entire line as a value.
Hadoop permits a large range of input formats. The default is the text input format. This format is the simplest way to access data as text lines.
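The shape of the records a mapper receives from the text input format can be shown with a short Python sketch. It is not Hadoop code: `text_input_records` is a hypothetical helper that walks over a block of bytes and yields the byte offset of each line as the key and the line text, without its line terminator, as the value.

```python
def text_input_records(data: bytes):
    """Yield (byte offset, line text) pairs, one per line of the input."""
    offset = 0
    for raw_line in data.splitlines(keepends=True):
        # The value excludes the line terminator; the key is where the line starts.
        yield offset, raw_line.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw_line)


records = list(text_input_records(b"hello\nworld\nhi\n"))
```

The offsets are 0, 6 and 12 because each line, including its newline byte, pushes the next key forward by its length.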

How can you override the default input format?
In order to override the default input format, a developer has to set a new input format on the job config before submitting the job to a cluster.
A developer can always set different input formats on the job configuration (e.g. sequence files, binary files, compressed formats).

What are the common problems with map-side join?
The most common problems with map-side joins are out of memory exceptions on slave nodes.
Map-side join uses memory for joining the data based on a key. As a result, the data size is limited to the size of the available memory. If the data exceeds the available memory, an out of memory error will occur.

Which is faster: Map-side join or Reduce-side join? Why?
Map-side join is faster because join operation is done in memory.
The map-side join is faster. This is primarily due to usage of memory. Memory operations are always faster since there is no disk I/O involved.

Will settings made using the Java API override values in configuration files?
Yes. The configuration settings made using the Java API take precedence.
The developer has full control over the settings on a Hadoop cluster. All configurations can be changed via the Java API.

What is AVRO?
Avro is a Java serialization library.
Avro is an Apache project that is bridging the gap between unstructured data and structured data. The Avro file format is highly optimized for network transmissions and is splittable between different datanodes.

Can you run Map-Reduce jobs directly on Avro data?

Yes, Avro was specifically designed for data processing via Map-Reduce.
Avro implements all necessary interfaces for MapReduce processing, and Avro data can be processed directly on a Hadoop cluster.

What is distributed cache?
The distributed cache is a component that allows developers to deploy jars for Map-Reduce processing.
Distributed cache is the Hadoop answer to the problem of deploying third-party libraries. Distributed cache will allow libraries to be deployed to all datanodes

What is the best performance one can expect from a Hadoop cluster?

The best performance expectation one can have is measured in seconds. This is because Hadoop can only be used for batch processing.
Hadoop specifically was designed for batch processing. There are a few additional components that will allow better performance. Near real-time and real-time Hadoop performance are not currently possible but are in the works.

What is writable?
Writable is a Java interface that needs to be implemented for MapReduce processing.
Hadoop performs a lot of data transmissions between different datanodes. Writable is needed for MapReduce processing in order to improve the performance of these data transmissions.

The Hadoop API uses its own basic types such as LongWritable, Text and IntWritable. They have almost the same features as the default Java classes. What are these Writable data types optimized for?

Writable data types are specifically optimized for network transmissions.
Data needs to be represented in a format optimized for network transmission. Hadoop is based on the ability to send data between datanodes very quickly. Writable data types are used for this purpose.

Can a custom type for data Map-Reduce processing be implemented?

Yes, custom data types can be implemented as long as they implement the Writable interface.
Developers can easily implement new data types for any objects. It is common practice to use existing classes and extend them with the Writable interface.
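The real Writable interface is Java and asks a class to provide two methods, `write(DataOutput)` and `readFields(DataInput)`. The Python sketch below only mirrors that contract to show the idea of a type that serializes itself into a compact binary form. `PointWritable` and its two integer fields are invented for this example, and the big-endian 4-byte layout is chosen to resemble what Java's `DataOutput.writeInt` produces.

```python
import io
import struct


class PointWritable:
    """Hypothetical custom type that knows how to write and read itself."""

    def __init__(self, x=0, y=0):
        self.x = x
        self.y = y

    def write(self, out):
        # Two signed 32-bit integers, big-endian: 8 bytes in total.
        out.write(struct.pack(">ii", self.x, self.y))

    def read_fields(self, inp):
        # Overwrite this object's fields from the next 8 bytes of the stream.
        self.x, self.y = struct.unpack(">ii", inp.read(8))


buffer = io.BytesIO()
PointWritable(3, -7).write(buffer)
payload = buffer.getvalue()

restored = PointWritable()
restored.read_fields(io.BytesIO(payload))
```

The point of the pattern is that the object fills in an existing instance from a stream, so the framework can reuse one object for many records.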

What happens if mapper output does not match reducer input?

A run-time exception will be thrown and the map-reduce job will fail.
Reducers are based on the mappers' output and Java is a strongly typed language. Therefore, an exception will be thrown at run-time if the types do not match.

Can you provide multiple input paths to a map-reduce job?

Yes, developers can add any number of input paths.
The Hadoop framework is capable of taking different input paths and assigning different mappers for each one. This is a very convenient way of writing different mappers to handle various datasets.

Can you assign different mappers to different input paths?

Yes, different mappers can be assigned to different directories

Assigning different mappers to different data sources is the way to quickly and efficiently create code for processing multiple formats.

Can you suppress reducer output?

Yes, there is a special data type that will suppress job output.
There are a number of scenarios where output is not required from reducers. For instance, web crawling or image processing does not require external fetch or data processing.

Is there a map input format?

No, but sequence file input format can read map files
Map files are just a variation of sequence files. They store data in sorted order

What is the most important feature of map-reduce?
Ability to process data on the cluster of the machines without copying all the data over.
The fundamental difference of the Hadoop framework is that multiple machines will be used to process the same data and data is readily available for processing in distributed file system.

What is HBASE?

HBase is a part of the Apache Hadoop project that provides an interface for scanning large amounts of data using the Hadoop infrastructure.
HBase is one of the Hadoop framework projects that allows real-time data scans across big data volumes. It is very often used to serve data from a cluster.

Hadoop admin interview questions and answers

Which operating system(s) are supported for production Hadoop deployment?

The main supported operating system is Linux. However, with some additional software Hadoop can be deployed on Windows.

What is the role of the namenode?

The namenode is the "brain" of the Hadoop cluster and responsible for managing the distribution blocks on the system based on the replication policy. The namenode also supplies the specific addresses for the data based on the client requests.

What happens on the namenode when a client tries to read a data file?

The namenode will look up the information about the file in the edit file and then retrieve the remaining information from the filesystem memory snapshot.
Since the namenode needs to support a large number of clients, the primary namenode will only send back information about the data location. The datanode itself is responsible for the retrieval.

What are the hardware requirements for a Hadoop cluster (primary and secondary namenodes and datanodes)?

There are no requirements for datanodes. However, the namenodes require a specified amount of RAM to store filesystem image in memory
Based on the design of the primary namenode and secondary namenode, entire filesystem information will be stored in memory. Therefore, both namenodes need to have enough memory to contain the entire filesystem image.

What mode(s) can Hadoop code be run in?

Hadoop can be deployed in stand-alone mode, pseudo-distributed mode or fully-distributed mode.
Hadoop was specifically designed to be deployed on a multi-node cluster. However, it can also be deployed on a single machine and as a single process for testing purposes.

How would a Hadoop administrator deploy various components of Hadoop in production?

Deploy the namenode and jobtracker on the master node, and deploy datanodes and tasktrackers on multiple slave nodes.
There is a need for only one namenode and jobtracker on the system. The number of datanodes depends on the available hardware.

What is the best practice to deploy the secondary namenode?

Deploy secondary namenode on a separate standalone machine
The secondary namenode needs to be deployed on a separate machine. It will not interfere with primary namenode operations in this way. The secondary namenode must have the same memory requirements as the main namenode.

Is there a standard procedure to deploy Hadoop?

No, there are some differences between various distributions. However, they all require that Hadoop jars be installed on the machine.
There are some common requirements for all Hadoop distributions, but the specific procedures will be different for different vendors since they all have some degree of proprietary software.

What is the role of the secondary namenode?

Secondary namenode performs CPU intensive operation of combining edit logs and current filesystem snapshots
The secondary namenode was separated out as a process due to having CPU intensive operations and additional requirements for metadata back-up

What are the side effects of not running a secondary name node?

The cluster performance will degrade over time since edit log will grow bigger and bigger

If the secondary namenode is not running at all, the edit log will grow significantly and it will slow the system down. Also, the system will go into safemode for an extended time since the namenode needs to combine the edit log and the current filesystem checkpoint image.

What happens if a datanode loses network connection for a few minutes?

The namenode will detect that a datanode is not responsive and will start replication of the data from the remaining replicas. When the datanode comes back online, the extra replicas will be deleted.
The replication factor is actively maintained by the namenode. The namenode monitors the status of all datanodes and keeps track of which blocks are located on each node. The moment a datanode is not available, the namenode will trigger replication of the data from the existing replicas. However, if the datanode comes back up, over-replicated data will be deleted. Note: the data might be deleted from the original datanode.

What happens if one of the datanodes has a much slower CPU?

The task execution will be as fast as the slowest worker. However, if speculative execution is enabled, the slowest worker will not have such a big impact.
Hadoop was specifically designed to work with commodity hardware. Speculative execution helps to offset the slow workers. Multiple instances of the same task will be created; the job tracker will take the first result into consideration and the second instance of the task will be killed.

What is speculative execution?
If speculative execution is enabled, the job tracker will issue multiple instances of the same task on multiple nodes and it will take the result of the task that finished first. The other instances of the task will be killed.
The speculative execution is used to offset the impact of the slow workers in the cluster. The jobtracker creates multiple instances of the same task and takes the result of the first successful task. The rest of the tasks will be discarded.

How many racks do you need to create a Hadoop cluster in order to make sure that the cluster operates reliably?

In order to ensure reliable operation it is recommended to have at least 2 racks with rack placement configured.
Hadoop has a built-in rack awareness mechanism that allows data distribution between different racks based on the configuration.

Are there any special requirements for namenode?

Yes, the namenode holds information about all files in the system and needs to be extra reliable

The namenode is a single point of failure. It needs to be extra reliable and its metadata needs to be replicated in multiple places. Note that the community is working on solving the single point of failure issue with the namenode.

If you have a file of 128M in size and the replication factor is set to 3, how many blocks can you find on the cluster that will correspond to that file (assuming the default Apache and Cloudera configuration)?

6
Based on the configuration settings the file will be divided into multiple blocks according to the default block size of 64M. 128M / 64M = 2 . Each block will be replicated according to replication factor settings (default 3). 2 * 3 = 6 .
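The same arithmetic can be written as a small Python helper, which also covers file sizes that are not an exact multiple of the block size. The function name `block_replica_count` is made up for this example; the defaults of 64M blocks and a replication factor of 3 are the ones assumed in the answer above.

```python
import math


def block_replica_count(file_size_mb, block_size_mb=64, replication=3):
    """Return how many block replicas the cluster stores for one file."""
    # A partly filled last block still counts as a block, hence the ceiling.
    blocks = math.ceil(file_size_mb / block_size_mb)
    return blocks * replication
```

For the 128M file this gives 2 blocks times 3 replicas, so `block_replica_count(128)` returns 6, while a 130M file would need a third block and return 9.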

What is distributed copy (distcp)?

Distcp is a Hadoop utility for launching MapReduce jobs to copy data. The primary usage is for copying a large amount of data
One of the major challenges in the Hadoop environment is copying data across multiple clusters, and distcp allows multiple datanodes to be leveraged for parallel copying of the data.

What is replication factor?

Replication factor controls how many times each individual block can be replicated.
Data is replicated in the Hadoop cluster based on the replication factor. The high replication factor guarantees data availability in the event of failure.

What daemons run on Master nodes?

NameNode, Secondary NameNode and JobTracker
Hadoop is comprised of five separate daemons, and each of these daemons runs in its own JVM. NameNode, Secondary NameNode and JobTracker run on Master nodes. DataNode and TaskTracker run on each Slave node.

What is rack awareness?

Rack awareness is the way in which the namenode decides how to place blocks based on the rack definitions
Hadoop will try to keep network traffic between datanodes within the same rack and will only contact remote racks if it has to. The namenode is able to control this due to rack awareness.

What is the role of the jobtracker in a Hadoop cluster?

The jobtracker is responsible for scheduling tasks on slave nodes, collecting results and retrying failed tasks.
The job tracker is the main component of map-reduce execution. It controls the division of the job into smaller tasks, submits tasks to individual tasktrackers, tracks the progress of the jobs and reports results back to the calling code.

How does the Hadoop cluster tolerate datanode failures?

Since Hadoop is designed to run on commodity hardware, datanode failures are expected. The namenode keeps track of all available datanodes and actively maintains the replication factor on all data.
The namenode actively tracks the status of all datanodes and acts immediately if the datanodes become non-responsive. The namenode is the central "brain" of the HDFS and starts replication of the data the moment a disconnect is detected.

What is the procedure for namenode recovery?
A namenode can be recovered in two ways: starting new namenode from backup metadata or promoting secondary namenode to primary namenode
The namenode recovery procedure is very important to ensure the reliability of the data. It can be accomplished by starting a new namenode using backup data or by promoting the secondary namenode to primary.

Web-UI shows that half of the datanodes are in decommissioning mode. What does that mean? Is it safe to remove those nodes from the network?

This means that the namenode is trying to retrieve data from those datanodes by moving replicas to the remaining datanodes. There is a possibility that data can be lost if the administrator removes those datanodes before decommissioning has finished.
Due to the replication strategy, it is possible to lose some data if datanodes are removed en masse before the decommissioning process completes. Decommissioning refers to the namenode trying to retrieve data from datanodes by moving replicas to the remaining datanodes.

What does the Hadoop administrator have to do after adding new datanodes to the Hadoop cluster?

Since the new nodes will not have any data on them, the administrator needs to start the balancer to redistribute data evenly between all nodes.
Hadoop cluster will detect new datanodes automatically. However, in order to optimize the cluster performance it is recommended to start rebalancer to redistribute the data between datanodes evenly.

If the Hadoop administrator needs to make a change, which configuration file does he need to change?

It depends on the nature of the change. Each node has its own set of configuration files and they are not always the same on each node.
Each node in the Hadoop cluster has its own configuration files and the changes need to be made in every file. One of the reasons for this is that configuration can be different for every node.

Map Reduce jobs are failing on a cluster that was just restarted. They worked before restart. What could be wrong?

The cluster is in safe mode. The administrator needs to wait for the namenode to exit safe mode before restarting the jobs.
This is a very common mistake by Hadoop administrators when there is no secondary namenode on the cluster and the cluster has not been restarted in a long time. The namenode will go into safemode and combine the edit log and the current file system checkpoint image.

Map Reduce jobs take too long. What can be done to improve the performance of the cluster?

One of the most common reasons for performance problems on a Hadoop cluster is uneven distribution of the tasks. The number of tasks has to match the number of available slots on the cluster.
Hadoop is not a hardware aware system. It is the responsibility of the developers and the administrators to make sure that the resource supply and demand match.

How often do you need to reformat the namenode?
Never. The namenode needs to be formatted only once, in the beginning. Reformatting the namenode will lead to loss of the data on the entire file system.

The namenode is the only system that needs to be formatted, and only once. Formatting will create the directory structure for file system metadata and create a namespaceID for the entire file system.

After increasing the replication level, I still see that data is under replicated. What could be wrong?

Data replication takes time due to large quantities of data. The Hadoop administrator should allow sufficient time for data replication
Depending on the data size, the data replication will take some time. A Hadoop cluster still needs to copy data around, and if the data size is big enough it is not uncommon for replication to take from a few minutes to a few hours.