Hadoop Distributed File System:
HDFS, the storage layer of Hadoop, is a distributed, scalable,
Java-based file system adept at storing large volumes of unstructured
data.
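As a concrete flavor, here is a minimal sketch of writing and reading a file through HDFS's Java FileSystem API (the path and contents are invented for illustration; HDFS splits the file into blocks and replicates them across DataNodes behind the scenes):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsSketch {
        public static void main(String[] args) throws Exception {
            // Picks up fs.defaultFS from core-site.xml on the classpath
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Hypothetical path inside the distributed file system
            Path path = new Path("/user/demo/hello.txt");
            FSDataOutputStream out = fs.create(path);
            out.writeUTF("hello, hdfs");
            out.close();

            FSDataInputStream in = fs.open(path);
            System.out.println(in.readUTF());
            in.close();
        }
    }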
MapReduce:
MapReduce is a software framework that serves as the compute layer of
Hadoop. MapReduce jobs are divided into two (obviously named) parts. The
“Map” function divides a query into multiple parts and processes data
at the node level. The “Reduce” function aggregates the results of the
“Map” function to determine the “answer” to the query.
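The canonical illustration is word count: the map step emits a count of 1 for each word it sees on its node's slice of the data, and the reduce step sums the counts per word. A compressed sketch using the org.apache.hadoop.mapreduce API (the job-submission driver is omitted for brevity):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {
        // "Map": runs on each node against local data, emitting (word, 1)
        public static class TokenMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    if (!token.isEmpty()) {
                        word.set(token);
                        context.write(word, ONE);
                    }
                }
            }
        }

        // "Reduce": aggregates the mappers' partial results into the answer
        public static class SumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values,
                    Context context) throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }
    }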
Hive:
Hive is a Hadoop-based data warehouse framework originally
developed by Facebook. It allows users to write queries in a SQL-like
language called HiveQL, which are then converted into MapReduce jobs.
This allows SQL programmers with no MapReduce experience to use the
warehouse and makes it easier to integrate with business intelligence
and visualization tools such as MicroStrategy, Tableau, Revolution
Analytics, etc.
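For instance, a Java client can submit HiveQL through Hive's JDBC driver and Hive compiles the query into MapReduce work behind the scenes. This sketch assumes a HiveServer2-style endpoint; the connection URL, credentials, table and column names are all placeholders:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveSketch {
        public static void main(String[] args) throws Exception {
            // HiveServer2 JDBC driver; host and port are placeholders
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            Connection conn = DriverManager.getConnection(
                    "jdbc:hive2://localhost:10000/default", "hive", "");

            // Plain SQL-like HiveQL, no MapReduce code in sight
            Statement stmt = conn.createStatement();
            ResultSet rs = stmt.executeQuery(
                    "SELECT page, COUNT(*) AS hits FROM pageviews GROUP BY page");
            while (rs.next()) {
                System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
            }
            conn.close();
        }
    }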
Pig:
Pig is a Hadoop-based platform developed by Yahoo, and Pig Latin is
its language. Pig Latin is relatively easy to learn and is well suited
to very deep, very long data pipelines (a limitation of SQL).
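A sketch of such a pipeline, driven from Java through Pig's PigServer API. Each statement is one step in the chain, and Pig compiles the whole pipeline into MapReduce jobs; the input path and field layout are invented for illustration:

    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class PigSketch {
        public static void main(String[] args) throws Exception {
            PigServer pig = new PigServer(ExecType.MAPREDUCE);

            // Hypothetical log data: load, group, then count per user
            pig.registerQuery(
                "raw = LOAD '/logs/access' AS (user:chararray, url:chararray);");
            pig.registerQuery("grouped = GROUP raw BY user;");
            pig.registerQuery(
                "counts = FOREACH grouped GENERATE group, COUNT(raw);");
            pig.store("counts", "/logs/counts_by_user");
        }
    }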
HBase:
HBase is a non-relational database that allows for low-latency, quick
lookups in Hadoop. It adds transactional capabilities to Hadoop,
allowing users to conduct updates, inserts and deletes. eBay and
Facebook use HBase heavily.
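A sketch of a single-row insert and point lookup with the classic HBase Java client; the table name, column family and values are made up, and the table is assumed to already exist:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            // Assumes a table "users" with column family "info"
            HTable table = new HTable(conf, "users");

            // Insert (or update) one cell keyed by row
            Put put = new Put(Bytes.toBytes("user42"));
            put.add(Bytes.toBytes("info"), Bytes.toBytes("name"),
                    Bytes.toBytes("Alice"));
            table.put(put);

            // Low-latency point lookup by row key
            Get get = new Get(Bytes.toBytes("user42"));
            Result result = table.get(get);
            System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));

            table.close();
        }
    }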
Flume:
Flume is a framework for populating Hadoop with data. Agents are
populated throughout one's IT infrastructure – inside web servers,
application servers and mobile devices, for example – to collect data
and integrate it into Hadoop.
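Each agent is wired together in a Flume configuration file as a source, a channel and a sink. A minimal hypothetical agent that tails a web server log and lands the events in HDFS might look like this (agent name, log path and HDFS path are placeholders):

    # Agent "a1": one source, one in-memory channel, one HDFS sink
    a1.sources = r1
    a1.channels = c1
    a1.sinks = k1

    # Source: follow a (hypothetical) web server log
    a1.sources.r1.type = exec
    a1.sources.r1.command = tail -F /var/log/httpd/access_log
    a1.sources.r1.channels = c1

    # Channel: buffer events between source and sink
    a1.channels.c1.type = memory

    # Sink: write the collected events into Hadoop
    a1.sinks.k1.type = hdfs
    a1.sinks.k1.hdfs.path = /flume/web/events
    a1.sinks.k1.channel = c1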
Oozie:
Oozie is a workflow processing system that lets users define a series
of jobs written in multiple languages – such as MapReduce, Pig and Hive
– then intelligently link them to one another. Oozie allows users to
specify, for example, that a particular query is to be initiated only
after the previous jobs on which it relies for data have completed.
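Workflows are declared in an XML file, where each action names the job to run and the ok/error transitions that link it to the next step. A skeletal, hypothetical example that runs one Pig job and starts a second only if the first succeeds:

    <workflow-app name="demo-wf" xmlns="uri:oozie:workflow:0.2">
      <start to="cleanse"/>
      <action name="cleanse">
        <pig>
          <job-tracker>${jobTracker}</job-tracker>
          <name-node>${nameNode}</name-node>
          <script>cleanse.pig</script>
        </pig>
        <!-- "report" starts only after this job completes successfully -->
        <ok to="report"/>
        <error to="fail"/>
      </action>
      <action name="report">
        <pig>
          <job-tracker>${jobTracker}</job-tracker>
          <name-node>${nameNode}</name-node>
          <script>report.pig</script>
        </pig>
        <ok to="end"/>
        <error to="fail"/>
      </action>
      <kill name="fail">
        <message>demo-wf failed</message>
      </kill>
      <end name="end"/>
    </workflow-app>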
Ambari:
Ambari is a web-based set of tools for deploying, administering and
monitoring Apache Hadoop clusters. Its development is led by
engineers from Hortonworks, which includes Ambari in its Hortonworks
Data Platform.
Avro:
Avro is a data serialization system that allows for encoding the schema
of Hadoop files. It is adept at parsing data and performing remote
procedure calls.
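A sketch of defining a schema and binary-encoding a record against it with Avro's generic Java API; the schema and field values are invented for illustration:

    import java.io.ByteArrayOutputStream;
    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.BinaryEncoder;
    import org.apache.avro.io.EncoderFactory;

    public class AvroSketch {
        public static void main(String[] args) throws Exception {
            // Avro schemas are plain JSON; this one is hypothetical
            Schema schema = new Schema.Parser().parse(
                    "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
                  + "{\"name\":\"name\",\"type\":\"string\"},"
                  + "{\"name\":\"age\",\"type\":\"int\"}]}");

            GenericRecord user = new GenericData.Record(schema);
            user.put("name", "Alice");
            user.put("age", 30);

            // Encode the record in Avro's compact binary form
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
            new GenericDatumWriter<GenericRecord>(schema).write(user, encoder);
            encoder.flush();
            System.out.println(out.size() + " bytes");
        }
    }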
Mahout:
Mahout is a data mining library. It takes the most popular data mining
algorithms for performing clustering, regression and statistical
modeling and implements them using the MapReduce model.
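As a flavor of the library, here is a sketch using Mahout's Taste collaborative-filtering API, which runs in-process rather than as a MapReduce job; the ratings file (userID,itemID,preference rows) and user ID are hypothetical:

    import java.io.File;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
    import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;
    import org.apache.mahout.cf.taste.recommender.Recommender;
    import org.apache.mahout.cf.taste.similarity.UserSimilarity;

    public class RecommenderSketch {
        public static void main(String[] args) throws Exception {
            // Hypothetical CSV of userID,itemID,preference ratings
            DataModel model = new FileDataModel(new File("ratings.csv"));
            UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
            UserNeighborhood neighborhood =
                    new NearestNUserNeighborhood(10, similarity, model);
            Recommender recommender =
                    new GenericUserBasedRecommender(model, neighborhood, similarity);

            // Recommend three items for user 42
            for (RecommendedItem item : recommender.recommend(42, 3)) {
                System.out.println(item);
            }
        }
    }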
Sqoop:
Sqoop is a connectivity tool for moving data from non-Hadoop data
stores – such as relational databases and data warehouses – into Hadoop.
It allows users to specify the target location inside Hadoop and
instruct Sqoop to move data from Oracle, Teradata or other relational
databases to the target.
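In practice that is a single command. A hypothetical import of an orders table from MySQL into an HDFS directory (connection details and paths are placeholders; Sqoop runs the import as parallel map tasks, four here):

    sqoop import \
      --connect jdbc:mysql://db.example.com/sales \
      --username etl -P \
      --table orders \
      --target-dir /user/etl/orders \
      --num-mappers 4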
HCatalog:
HCatalog is a centralized metadata management and sharing service for
Apache Hadoop. It allows for a unified view of all data in Hadoop
clusters and allows diverse tools, including Pig and Hive, to process
any data elements without needing to know where in the cluster the
data is physically stored.
BigTop:
BigTop is an effort to create a more formal process or framework for
packaging and interoperability testing of Hadoop's sub-projects and
related components, with the goal of improving the Hadoop platform as a
whole.