Map and reduce
The functional concepts of map, filter, and reduce
Parallelizing that concept
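In Python, the three look like this. Note that map and filter apply a function to each element independently, which is what makes them easy to parallelize, while reduce folds everything into a single result:

```python
from functools import reduce

nums = [1, 2, 3, 4, 5]
squares = list(map(lambda x: x * x, nums))           # apply a function to each element
evens = list(filter(lambda x: x % 2 == 0, squares))  # keep elements matching a predicate
total = reduce(lambda a, b: a + b, evens)            # fold the list down to one value
print(squares, evens, total)  # [1, 4, 9, 16, 25] [4, 16] 20
```

Since each map (and filter) application touches only one element, the elements can be handed to different workers; the reduce step is the part that has to combine their results.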
Hadoop notes
Hadoop is a Java implementation of concepts strongly inspired by the Google File System (GFS) and Google's MapReduce paper.
If that doesn't tell you how Hadoop is useful, read the two relevant papers now: MapReduce, GFS.
Hadoop mostly consists of the distributed filesystem and the Map-Reduce framework that interacts with it. These two are relatively separate, in that each could be used without the other.
Note that, as in many scaling systems, you are easily IO-bound. Unless you're Google, you probably call half a dozen nodes a cluster, and still have relatively few hard drives compared to the map jobs waiting on input from them, and compared to the amount of map-to-reduce transfer.
Consider the numbers when planning. If you have gigs of data and it has to come off a few disks at ~75MB/s each, then even if you're alone on the system, just reading it will take minutes by direct implication.
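For example (made-up but typical numbers):

```python
# back-of-envelope: reading 500GB spread over 6 disks at ~75MB/s each
data_gb = 500
disks = 6
mb_per_s = 75

seconds = (data_gb * 1024) / (disks * mb_per_s)
print(f"~{seconds / 60:.0f} minutes just to read the input")  # ~19 minutes
```

And that is the best case, with no seeks, no contention from other jobs, and nothing else on the disks.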
Hadoopy terms and details:
HDFS
The Hadoop Distributed Filesystem, a.k.a. Hadoop DFS, a.k.a. HDFS.
As the name suggests it is tied to Hadoop, though from the map/reduce platform's point of view it is just one storage option; there are others beyond HDFS.
You often use HDFS by referring to it from jobs, which means all the work of storing data is done for you. You can also use the hdfs shell for metadata management, and for copying files between the local filesystem and HDFS.
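For example (paths made up):

```shell
# list, upload, download, and inspect files via the Hadoop filesystem shell
hadoop fs -ls /user/me
hadoop fs -put ./input.txt /user/me/input.txt   # local -> HDFS
hadoop fs -get /user/me/output ./output         # HDFS -> local
hadoop fs -cat /user/me/input.txt               # print file contents
```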
In slight technical detail, HDFS consists of:
- one NameNode, which basically keeps track of filesystem names/metadata, and of which parts of the data are kept on which further nodes
- one or (preferably) more DataNodes, among which data is distributed.
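A toy sketch of that division of labor (an illustration only, not HDFS code; block size and placement policy are simplified):

```python
# toy model: a "namenode" maps block numbers to datanode locations,
# while the datanodes would hold the actual block bytes
BLOCK_SIZE = 64 * 1024 * 1024  # HDFS's classic default block size
REPLICATION = 3

def place_blocks(file_size, datanodes, block_size=BLOCK_SIZE, replication=REPLICATION):
    n_blocks = -(-file_size // block_size)  # ceiling division
    namenode_table = {}
    for b in range(n_blocks):
        # naive round-robin placement; real HDFS also considers racks and load
        replicas = [datanodes[(b + r) % len(datanodes)] for r in range(replication)]
        namenode_table[b] = replicas
    return namenode_table

table = place_blocks(200 * 1024 * 1024, ["dn1", "dn2", "dn3", "dn4"])
print(table)  # a 200MB file becomes 4 blocks, each on 3 of the 4 datanodes
```

The useful intuition: clients ask the NameNode where blocks live, then talk to the DataNodes directly for the bytes, which is why the NameNode itself is not a data bottleneck.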
Map/Reduce
An execution framework that scales well (assuming a non-bottlenecked data source/sink, such as HDFS).
Note that by its nature it can be very high-throughput, but won't be real-time.
Terms and concepts in Hadoop's implementation:
- jobs: A collection of code, data, and metadata that the job tracker expects - the things to complete as a whole
- tasks: pieces of jobs, the units that handle individual map and reduce steps
- management around it (e.g. task distribution that chooses nodes near the data for execution)
A Hadoop job is a JAR file. In many cases this contains Java code (which is run in isolated VMs(verify)).
That is not the only option. You can also:
- use Hadoop Streaming, in which you pack into the JAR command-line executables that act as the tasks (mapper, reducer, optionally others). Note that these have to be actually executable on the nodes, which means that the more complex or scripty they are, the more you may need to prepare the nodes, e.g. installing libraries and scripting environments, or bundle all of these with the job (see e.g. python freezing)
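As a minimal sketch, a streaming-style wordcount mapper/reducer pair in Python; the tab-separated key/value lines and the sorted reducer input mimic the streaming protocol, and in a real job the two functions would be separate executables reading stdin and printing:

```python
from itertools import groupby

def mapper(lines):
    # a streaming mapper reads raw input lines and emits key<TAB>value lines
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(lines):
    # a streaming reducer receives the mapper output sorted by key
    pairs = (line.split("\t") for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

# locally you can emulate the framework's sort-and-shuffle step:
output = list(reducer(sorted(mapper(["the quick fox", "the fox"]))))
print(output)  # ['fox\t2', 'quick\t1', 'the\t2']
```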
- use Hadoop Pipes to write map/reduce tasks in C/C++, which is like streaming but differs e.g. in how data is fed to the processes
- do everything yourself and interface with HDFS (e.g. using libhdfs)
See also
- http://wiki.apache.org/hadoop/HadoopStreaming
- http://hadoop.apache.org/core/docs/current/streaming.html
Related apache projects (mostly subprojects)
Pig
A higher-level addition which should make creation of complex processes on Hadoop simpler.
Pig contains a language (Pig Latin) and a bunch of planning. Compilation produces sequences of Map-Reduce programs.
See e.g.
http://hadoop.apache.org/pig/
http://www.cloudera.com/hadoop-training-pig-introduction
Hive
A SQL-like wrapper accessing HDFS data, running on Hadoop. Makes relatively structured (and table-like) data act like tables, regardless of its stored format. Quite useful when your needs can be expressed in terms of Hive expressions.
Similar to Pig in that it usually compiles to a number of MapReduce steps.
Apparently developed for flexible log aggregation, but has various other uses.
Zookeeper
Supports coordination, mostly in the form of synchronization, providing in-order writes with eventual consistency, and has configurable redundancy.
Can be made the primary means of controlling a lot of your code. More generally, it lets you avoid reinventing (and debugging) a bunch of common distributed boilerplate code/services.
Some simple examples of where ZooKeeper is useful include shared-memory-style computation, message queues, and simpler things like notifications or storing configuration.
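The lock recipe, for example, relies on ZooKeeper's sequential nodes: every client creates a numbered node under a lock directory, and the client with the lowest number holds the lock. A toy in-process model of that idea (an illustration, not the ZooKeeper API):

```python
import itertools

class ToyZnodeDir:
    def __init__(self):
        self._seq = itertools.count()
        self.children = {}  # node name -> client

    def create_sequential(self, client):
        # ZooKeeper appends a monotonically increasing number to sequential nodes
        name = f"lock-{next(self._seq):010d}"
        self.children[name] = client
        return name

    def lock_holder(self):
        # the client whose node sorts first holds the lock; real clients
        # watch the node just before theirs rather than polling
        return self.children[min(self.children)] if self.children else None

d = ToyZnodeDir()
a = d.create_sequential("client-a")
b = d.create_sequential("client-b")
print(d.lock_holder())  # client-a
del d.children[a]       # client-a releases (or its ephemeral node vanishes on disconnect)
print(d.lock_holder())  # client-b
```

In real ZooKeeper the nodes would be ephemeral, so a crashed client's lock disappears with its session, which is much of the point.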
Mahout
A machine learning system, meant to be scalable where possible, and easy to use.
Currently has clustering, classification, some collaborative filtering, some evolutionary stuff, and has other things planned (e.g. SVM, linear regression, easier interoperability)
Chukwa and Flume (logging)
Chukwa and Flume are two distinct projects that collect logs and system metrics, letting you do log analytics on recent data, with some tuneable fault tolerance.
Hama
Avro (data serialization)
See Avro
Whirr
Dumbo
A project that makes it easier to use Hadoop's MapReduce from Python. Uses hadoop streaming under the covers, but saves you from most manual work.
Can use HDFS. Even without HDFS it can be handy, e.g. to execute simpler tasks on multiple nodes.
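In dumbo, mappers and reducers are plain Python generator functions over keys and values. A minimal wordcount in that style (the commented-out run() wiring follows dumbo's documented pattern(verify)):

```python
def mapper(key, value):
    # for plain text input, key is a byte offset and value is a line of text
    for word in value.split():
        yield word, 1

def reducer(key, values):
    # values is an iterable of all counts emitted for this word
    yield key, sum(values)

# with dumbo installed, such a module would typically end with:
# if __name__ == "__main__":
#     import dumbo
#     dumbo.run(mapper, reducer)
```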