Map and reduce
The functional concepts of map, filter, and reduce
Parallelizing that concept
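In Python, the three look like this. Note that map and filter apply a function to each element independently, which is what makes them easy to parallelize, while reduce folds everything into a single result:

```python
from functools import reduce

nums = [1, 2, 3, 4, 5]
squares = list(map(lambda x: x * x, nums))           # apply a function to each element
evens = list(filter(lambda x: x % 2 == 0, squares))  # keep elements matching a predicate
total = reduce(lambda a, b: a + b, evens)            # fold the list down to one value
print(squares, evens, total)  # [1, 4, 9, 16, 25] [4, 16] 20
```

Since each map (and filter) application touches only one element, the elements can be handed to different workers; the reduce step is the part that has to combine their results.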
Hadoop notes
Hadoop is a Java implementation of concepts strongly inspired by the Google File System (GFS) and Google's MapReduce paper.
If that doesn't tell you how Hadoop is useful, read the two relevant papers now: MapReduce, GFS.
Hadoop mostly consists of the distributed filesystem and the Map-Reduce framework that interacts with it. These two are relatively separate, in that each could be used without the other.
Note that, as in many scaling systems, you are easily IO-bound. Unless you're Google, you probably call half a dozen nodes a cluster, and still have relatively few hard drives compared to the map jobs waiting on input from them, and compared to the amount of map-to-reduce transfer.
Consider the numbers when planning. If you have gigs of data and it has to come off a few disks at ~75MB/s each, then even if you're alone on the system, just reading it will take minutes by direct implication.
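For example (made-up but typical numbers):

```python
# back-of-envelope: reading 500GB spread over 6 disks at ~75MB/s each
data_gb = 500
disks = 6
mb_per_s = 75

seconds = (data_gb * 1024) / (disks * mb_per_s)
print(f"~{seconds / 60:.0f} minutes just to read the input")  # ~19 minutes
```

And that is the best case, with no seeks, no contention from other jobs, and nothing else on the disks.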
Hadoopy terms and details:
HDFS
The Hadoop Distributed Filesystem, a.k.a. Hadoop DFS, a.k.a. HDFS.
As the name suggests it is tied to Hadoop, though from the map/reduce platform's point of view it is just one storage option; there are others beyond HDFS.
You often use HDFS by referring to it from jobs, which means all the work of storing data is done for you. You can also use the hdfs shell for metadata management, and for copying files between the local filesystem and HDFS.
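For example (paths made up):

```shell
# list, upload, download, and inspect files via the Hadoop filesystem shell
hadoop fs -ls /user/me
hadoop fs -put ./input.txt /user/me/input.txt   # local -> HDFS
hadoop fs -get /user/me/output ./output         # HDFS -> local
hadoop fs -cat /user/me/input.txt               # print file contents
```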
In slight technical detail, HDFS consists of:
- one NameNode, which basically keeps track of filesystem names/metadata, and of which parts of the data are kept on which further nodes
- one or (preferably) more DataNodes, among which data is distributed.
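A toy sketch of that division of labor (an illustration only, not HDFS code; block size and placement policy are simplified):

```python
# toy model: a "namenode" maps block numbers to datanode locations,
# while the datanodes would hold the actual block bytes
BLOCK_SIZE = 64 * 1024 * 1024  # HDFS's classic default block size
REPLICATION = 3

def place_blocks(file_size, datanodes, block_size=BLOCK_SIZE, replication=REPLICATION):
    n_blocks = -(-file_size // block_size)  # ceiling division
    namenode_table = {}
    for b in range(n_blocks):
        # naive round-robin placement; real HDFS also considers racks and load
        replicas = [datanodes[(b + r) % len(datanodes)] for r in range(replication)]
        namenode_table[b] = replicas
    return namenode_table

table = place_blocks(200 * 1024 * 1024, ["dn1", "dn2", "dn3", "dn4"])
print(table)  # a 200MB file becomes 4 blocks, each on 3 of the 4 datanodes
```

The useful intuition: clients ask the NameNode where blocks live, then talk to the DataNodes directly for the bytes, which is why the NameNode itself is not a data bottleneck.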
Map/Reduce
An execution framework that scales well (assuming a non-bottlenecked data source/sink, such as HDFS).
Note that by its nature it can be very high-throughput, but won't be real-time.
Terms and concepts in Hadoop's implementation:
- jobs: A collection of code, data, and metadata that the job tracker expects - the things to complete as a whole
- tasks: pieces of jobs, the units that handle individual map and reduce steps
- management around it (e.g. task distribution that chooses nodes near the data for execution)
A Hadoop job is a JAR file. In many cases this contains Java code (which is run in isolated VMs(verify)).
That is not the only option. You can also:
- use Hadoop Streaming, in which you pack into the JAR command-line executables that act as the tasks (mapper, reducer, optionally others). Note that these have to be actually executable on the nodes, which means that the more complex or scripty they are, the more you may need to prepare the nodes, e.g. installing libraries and scripting environments, or bundle all of these with the job (see e.g. python freezing)
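As a minimal sketch, a streaming-style wordcount mapper/reducer pair in Python; the tab-separated key/value lines and the sorted reducer input mimic the streaming protocol, and in a real job the two functions would be separate executables reading stdin and printing:

```python
from itertools import groupby

def mapper(lines):
    # a streaming mapper reads raw input lines and emits key<TAB>value lines
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(lines):
    # a streaming reducer receives the mapper output sorted by key
    pairs = (line.split("\t") for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

# locally you can emulate the framework's sort-and-shuffle step:
output = list(reducer(sorted(mapper(["the quick fox", "the fox"]))))
print(output)  # ['fox\t2', 'quick\t1', 'the\t2']
```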
- use Hadoop Pipes to write map/reduce tasks in C/C++, which is like streaming but differs e.g. in how data is fed to the processes
- do everything yourself and interface with HDFS (e.g. using libhdfs)
See also
- http://wiki.apache.org/hadoop/HadoopStreaming
- http://hadoop.apache.org/core/docs/current/streaming.html
Related apache projects (mostly subprojects)
Pig
A higher-level addition which should make creation of complex processes on Hadoop simpler.
Pig contains a language (Pig Latin) and a bunch of planning. Compilation produces sequences of Map-Reduce programs.
See e.g.
http://hadoop.apache.org/pig/
http://www.cloudera.com/hadoop-training-pig-introduction
Hive
A SQL-like wrapper accessing HDFS data, running on Hadoop. Makes relatively structured (and table-like) data act like tables, regardless of its stored format. Quite useful when your needs can be expressed in terms of Hive expressions.
Similar to Pig in that it usually compiles to a number of MapReduce steps.
Apparently developed for flexible log aggregation, but has various other uses.
Zookeeper
Supports coordination, mostly in the form of synchronization, providing in-order writes with eventual consistency, and has configurable redundancy.
Can be made the primary means of controlling a lot of your code. More generally, it lets you avoid reinventing (and debugging) a bunch of common distributed boilerplate code/services.
Some simple examples of where ZooKeeper is useful include shared-memory-style computation, message queues, and simpler things like notifications or storing configuration.
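The lock recipe, for example, relies on ZooKeeper's sequential nodes: every client creates a numbered node under a lock directory, and the client with the lowest number holds the lock. A toy in-process model of that idea (an illustration, not the ZooKeeper API):

```python
import itertools

class ToyZnodeDir:
    def __init__(self):
        self._seq = itertools.count()
        self.children = {}  # node name -> client

    def create_sequential(self, client):
        # ZooKeeper appends a monotonically increasing number to sequential nodes
        name = f"lock-{next(self._seq):010d}"
        self.children[name] = client
        return name

    def lock_holder(self):
        # the client whose node sorts first holds the lock; real clients
        # watch the node just before theirs rather than polling
        return self.children[min(self.children)] if self.children else None

d = ToyZnodeDir()
a = d.create_sequential("client-a")
b = d.create_sequential("client-b")
print(d.lock_holder())  # client-a
del d.children[a]       # client-a releases (or its ephemeral node vanishes on disconnect)
print(d.lock_holder())  # client-b
```

In real ZooKeeper the nodes would be ephemeral, so a crashed client's lock disappears with its session, which is much of the point.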
Mahout
A machine learning system, meant to be scalable where possible, and easy to use.
Currently has clustering, classification, some collaborative filtering, some evolutionary stuff, and has other things planned (e.g. SVM, linear regression, easier interoperability)
Chukwa and Flume (logging)
Chukwa and Flume are two distinct projects that collect logs and system metrics, letting you do log analytics on recent data, with some tuneable fault tolerance.
Hama
Avro (data serialization)
See Avro
Whirr
Dumbo
A project that makes it easier to use Hadoop's MapReduce from Python. Uses hadoop streaming under the covers, but saves you from most manual work.
Can use HDFS. Even without HDFS it can be handy, e.g. to execute simpler tasks on multiple nodes.
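In dumbo, mappers and reducers are plain Python generator functions over keys and values. A minimal wordcount in that style (the commented-out run() wiring follows dumbo's documented pattern(verify)):

```python
def mapper(key, value):
    # for plain text input, key is a byte offset and value is a line of text
    for word in value.split():
        yield word, 1

def reducer(key, values):
    # values is an iterable of all counts emitted for this word
    yield key, sum(values)

# with dumbo installed, such a module would typically end with:
# if __name__ == "__main__":
#     import dumbo
#     dumbo.run(mapper, reducer)
```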