Tuesday, June 18, 2013

Import NOAA data into Apache HBase


Apache HBase provides tools to import data from flat files into HBase tables. We recently used these tools to load NOAA station data into HBase. The HBase table is called “station” and has one column family, “d”. The NOAA station data is in the input file 201212station.txt, which has 15 fields separated by “|”. The first field, the station ID, is used as the row key in HBase. The remaining 14 fields are added as HBase columns. The following is a walk-through of the steps.
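Before loading, it can help to confirm that every line of the input really has 15 fields, since a line with a different count would not match the column mapping used later. The following is only a sketch: the two-line sample file and its contents are made up for illustration, and in practice the check would be pointed at a local copy of 201212station.txt.

```shell
# Hypothetical two-line sample standing in for 201212station.txt.
# The first line has 15 "|"-separated fields, the second only 3.
sample=$(mktemp)
printf '%s\n' \
  'ID1|f2|f3|f4|f5|f6|f7|f8|f9|f10|f11|f12|f13|f14|f15' \
  'ID2|f2|f3' > "$sample"

# Count the lines whose field count is not 15.
bad=$(awk -F'|' 'NF != 15 { n++ } END { print n + 0 }' "$sample")
echo "lines with unexpected field count: $bad"

rm -f "$sample"
```

A result of 0 on the real file would mean every record fits the row key plus 14 columns layout.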
1. Create an HBase table
hbase> create 'station', 'd'
2. Create an HDFS folder to hold the temporary data for bulk load
$ hdfs dfs -mkdir /user/john/hbase
3. Run importtsv to generate the temporary HFiles for the bulk load
$ hadoop jar /usr/lib/hbase/hbase.jar importtsv '-Dimporttsv.separator=|' -Dimporttsv.bulk.output=/user/john/hbase/tmp -Dimporttsv.columns=HBASE_ROW_KEY,d:c1,d:c2,d:c3,d:c4,d:c5,d:c6,d:c7,d:c8,d:c9,d:c10,d:c11,d:c12,d:c13,d:c14 station /user/john/noaa/201212station.txt
4. Change the permissions on the temporary folder so that the HBase service user can access the generated files
$ hdfs dfs -chmod -R +rwx /user/john/hbase
5. Run the bulk load
$ hadoop jar /usr/lib/hbase/hbase.jar completebulkload /user/john/hbase/tmp station
The station data is now loaded from the file into the HBase table. This can be confirmed by running an HBase shell command that retrieves the row for station ID 94994.
hbase> get 'station', '94994'
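The column mapping passed to importtsv in step 3 is long and easy to mistype. As a small sketch, it could be generated in the shell instead. The variable names `cols` and `i` are illustrative, and the column names d:c1 to d:c14 simply follow the naming used above.

```shell
# Build "HBASE_ROW_KEY,d:c1,...,d:c14": the row key followed by
# one column in family "d" for each of the 14 remaining fields.
cols="HBASE_ROW_KEY"
i=1
while [ "$i" -le 14 ]; do
  cols="$cols,d:c$i"
  i=$((i + 1))
done

# Print the option in the form used on the importtsv command line.
echo "-Dimporttsv.columns=$cols"
```

The printed option can then be pasted into the importtsv command, or `$cols` can be referenced there directly.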