Relational data is often normalized to retain its integrity and remove redundancy.
Normalization poses problems for MapReduce because it makes reading a record a
nonlocal operation, and one of the central assumptions that MapReduce makes is that
it is possible to perform (high-speed) streaming reads and writes.

A web server log is a good example of a set of records that is not normalized (for example,
the client hostnames are specified in full each time, even though the same client
may appear many times), and this is one reason that logfiles are particularly
well-suited to analysis with MapReduce.
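
As a rough illustration of why self-contained records suit streaming reads, the sketch below counts requests per client in a single sequential pass. The log layout (hostname as the first whitespace-separated field), the sample lines, and the function name `count_requests_by_host` are assumptions made for this example, not taken from any particular server.

```python
from collections import Counter

# Hypothetical log lines. The client hostname is repeated in full on every
# record, so each line can be interpreted without consulting another table.
LOG_LINES = [
    'host-a.example.com - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326',
    'host-b.example.com - - [10/Oct/2000:13:55:40 -0700] "GET /about.html HTTP/1.0" 200 1120',
    'host-a.example.com - - [10/Oct/2000:13:56:02 -0700] "GET /logo.png HTTP/1.0" 200 5120',
]


def count_requests_by_host(lines):
    """Count requests per client in one sequential pass over the records."""
    counts = Counter()
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        # Assumed layout: the hostname is the first whitespace-separated field.
        host = line.split(maxsplit=1)[0]
        counts[host] += 1
    return counts


if __name__ == "__main__":
    for host, n in sorted(count_requests_by_host(LOG_LINES).items()):
        print(host, n)
```

Had the log stored a client ID that pointed into a separate hosts table, each record would need a lookup elsewhere before it could be interpreted, which is the nonlocal read that normalization introduces.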

MapReduce is a linearly scalable programming model. The programmer writes two
functions, a map function and a reduce function, each of which defines a mapping
from one set of key-value pairs to another. These functions are oblivious to the size of
the data or the cluster that they are operating on, so they can be used unchanged for a
small dataset and for a massive one. More importantly, if you double the size of the input
data, a job will take twice as long to run. But if you also double the size of the cluster, the job
will run as fast as the original one. This is not generally true of SQL queries.
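
The shape of the two functions can be sketched as follows. This is a minimal in-memory illustration, not Hadoop's API: `map_fn`, `reduce_fn`, and the toy driver `run_job` are hypothetical names, and the driver only imitates the grouping step that a real framework would perform across a cluster.

```python
from collections import defaultdict


def map_fn(key, line):
    """Map: (record offset, log line) -> (hostname, 1)."""
    fields = line.split()
    if fields:
        yield fields[0], 1


def reduce_fn(host, counts):
    """Reduce: (hostname, [1, 1, ...]) -> (hostname, total)."""
    yield host, sum(counts)


def run_job(records, mapper, reducer):
    """Toy single-process driver (an assumption for illustration only)."""
    groups = defaultdict(list)
    # Map phase: apply the mapper to every input pair and group by output key.
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    # Reduce phase: apply the reducer to each key and its grouped values.
    results = {}
    for key in sorted(groups):
        for out_key, out_value in reducer(key, groups[key]):
            results[out_key] = out_value
    return results


# The same two functions run unchanged on a small and on a larger input.
small = [(0, "host-a GET /"), (1, "host-b GET /x"), (2, "host-a GET /y")]
large = [(i, line) for i, (_, line) in enumerate(small * 1000)]

if __name__ == "__main__":
    print(run_job(small, map_fn, reduce_fn))
    print(run_job(large, map_fn, reduce_fn))
```

Neither `map_fn` nor `reduce_fn` refers to the input size or to the number of machines. That independence is what lets a framework split the map calls across more nodes as the data grows.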


