allinone: How to run Pig Latin

Friday, June 14, 2013

A Pig Latin program is made up of a series of operations, or transformations, that are applied to the input data to produce output.
Under the covers, Pig turns the transformations into a series of MapReduce jobs, but as a programmer you are mostly unaware of this, which lets you focus on the data rather than on how the program is executed.
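
To make "a series of transformations" concrete, here is a minimal sketch that writes a small Pig Latin script from the shell. The file names (sample.txt, max_temp.pig), the field names, and the sample data are all made up for illustration. Each statement takes the relation produced by the previous one and transforms it. The script is only run if a pig executable happens to be on the PATH, using local mode, which is described below.

```shell
#!/bin/sh
# Sketch only: file names, field names, and data are hypothetical.
workdir=$(mktemp -d)

# Tab-separated sample input: year, temperature.
printf '1950\t0\n1950\t22\n1949\t111\n1949\t78\n' > "$workdir/sample.txt"

# Each Pig Latin statement is one transformation of the previous relation.
cat > "$workdir/max_temp.pig" <<'EOF'
records = LOAD 'sample.txt' AS (year:chararray, temperature:int);
grouped = GROUP records BY year;
max_temp = FOREACH grouped GENERATE group, MAX(records.temperature);
DUMP max_temp;
EOF

# Run the script only if pig is installed; otherwise just report where it is.
if command -v pig >/dev/null 2>&1; then
    (cd "$workdir" && pig -x local max_temp.pig)
else
    echo "pig not found; script written to $workdir/max_temp.pig"
fi
```

The script loads the records, groups them by year, and computes the maximum temperature per group; Pig decides how to turn those three steps into jobs.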

Pig runs in two modes:
1) Local Mode
2) Hadoop Mode

1) Local Mode: In local mode, Pig runs in a single JVM and accesses the local file system. This mode is suitable only for small datasets and for trying out Pig. Local mode doesn't use Hadoop, and it doesn't use Hadoop's local job runner either; instead, Pig translates queries into a physical plan that it executes itself. The execution type is set using the -x or -exectype option. To run in local mode, set the option to local:
$ pig -x local

2) Hadoop Mode: In Hadoop mode, Pig translates queries into MapReduce jobs and runs them on a Hadoop cluster. To use Hadoop mode, you need to tell Pig which version of Hadoop you are using and where your cluster is running.
The environment variable PIG_HADOOP_VERSION tells Pig which version of Hadoop it is connecting to. Note that the shell does not allow spaces around the = sign:
$ export PIG_HADOOP_VERSION=20

Next, we need to point Pig at the cluster's namenode and jobtracker. If you already have a Hadoop site file (or files) that defines fs.default.name and mapred.job.tracker, you can simply add Hadoop's configuration directory to Pig's classpath:
$ export PIG_CLASSPATH=$HADOOP_INSTALL/conf/
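
If the configuration directory is wrong, Pig has no way to find the cluster settings, so it can be worth checking before exporting. The following is a rough sketch, not part of Pig itself: add_hadoop_conf is a hypothetical helper name, and the site file names it looks for (core-site.xml and mapred-site.xml for Hadoop 0.20, hadoop-site.xml for older releases) are an assumption about your installation.

```shell
#!/bin/sh
# Sketch: add_hadoop_conf is a hypothetical helper, not a Pig or Hadoop command.
# It appends a directory to PIG_CLASSPATH only if a Hadoop site file is present.
add_hadoop_conf() {
    conf_dir=$1
    for f in core-site.xml mapred-site.xml hadoop-site.xml; do
        if [ -f "$conf_dir/$f" ]; then
            # Keep any existing classpath entries and append the new one.
            PIG_CLASSPATH="${PIG_CLASSPATH:+$PIG_CLASSPATH:}$conf_dir"
            export PIG_CLASSPATH
            return 0
        fi
    done
    echo "no Hadoop site file found in $conf_dir" >&2
    return 1
}
```

You would call it as add_hadoop_conf "$HADOOP_INSTALL/conf" in place of the plain export.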

Alternatively, you can create a pig.properties file in Pig's "conf" directory that sets these two properties. Here is an example for a pseudo-distributed setup:
fs.default.name=hdfs://localhost/
mapred.job.tracker=localhost:8021
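
The properties file can also be generated from the shell. This is a sketch under a few assumptions: PIG_CONF_DIR is a hypothetical variable (it defaults to a temporary directory here so that nothing real is overwritten), prop is a throwaway helper for reading a value back, and the jobtracker property is written as mapred.job.tracker, which is the name Hadoop 0.20 reads.

```shell
#!/bin/sh
# Sketch: PIG_CONF_DIR is a hypothetical variable; point it at Pig's conf
# directory to use the file for real. It defaults to a scratch directory.
PIG_CONF_DIR=${PIG_CONF_DIR:-$(mktemp -d)}

cat > "$PIG_CONF_DIR/pig.properties" <<'EOF'
# Pseudo-distributed cluster: namenode and jobtracker on localhost.
fs.default.name=hdfs://localhost/
mapred.job.tracker=localhost:8021
EOF

# Hypothetical helper: print the value stored for a given key.
prop() { sed -n "s/^$1=//p" "$PIG_CONF_DIR/pig.properties"; }

prop fs.default.name
prop mapred.job.tracker
```

Reading the values back is a cheap way to catch a stray space around the = sign before Pig does.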

Once you have configured Pig to connect to a Hadoop cluster, you can launch Pig, setting the -x option to mapreduce or omitting it entirely, as Hadoop mode is the default:
$ pig -x mapreduce
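
The choice between the two modes can be wrapped in a small shell function. This is only a sketch: run_pig and PIG_BIN are hypothetical names introduced here, and the function does nothing more than validate the mode and pass everything else through to pig's -x option.

```shell
#!/bin/sh
# Sketch: run_pig and PIG_BIN are hypothetical names, not part of Pig.
# Usage: run_pig [local|mapreduce] [pig arguments...]
run_pig() {
    # Default to Hadoop (mapreduce) mode when no mode is given.
    mode=${1:-mapreduce}
    [ $# -gt 0 ] && shift
    case $mode in
        local|mapreduce) ;;
        *) echo "unknown exectype: $mode" >&2; return 2 ;;
    esac
    # PIG_BIN lets you point at a specific pig launcher; defaults to "pig".
    "${PIG_BIN:-pig}" -x "$mode" "$@"
}
```

For example, run_pig local myscript.pig runs a script against the local file system, while run_pig on its own starts Pig in Hadoop mode.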

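A wrapper script such as /bin/pigscr can set these variables and launch Pig:
#!/bin/sh
# Set Pig's environment for this one command and forward all arguments to pig.
# Adjust the paths and jar name to match your own installation.
PIG_PATH=$HADOOP_HOME/bin/pig-0.7.0
PIG_CLASSPATH=$PIG_PATH/pig-0.3.0-core.jar:$HADOOP_HOME/conf \
PIG_HADOOP_VERSION=0.20.2 \
$PIG_PATH/bin/pig "$@"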