Hadoop's Map Reduce component and example

  • I wanted to share a very popular post from Oracle DBA Fernando Garcia in the NoSQL and Cloud Databases Community.
    Fernando is from Argentina and wrote his post in Spanish. A post this good deserved to be shared with the English-speaking community as well.

An example of using Map Reduce

Map Reduce is a component of the Hadoop ecosystem.
It consists of a programming model that allows large volumes of data to be processed in parallel.
We have already dedicated several articles to Map Reduce.
It was time to put this programming model into practice.
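Before diving into Hadoop, a tiny plain-Python sketch may help fix the model in mind. It walks through the three phases (map, shuffle, reduce) on a toy dataset; the records and names below are illustrative only and are not part of any Hadoop API:

```python
from itertools import groupby
from operator import itemgetter

# Map: turn each input record into a (key, value) pair.
records = ["madrid 10", "toledo 4", "madrid 7"]
pairs = [(city, int(n)) for city, n in (r.split() for r in records)]

# Shuffle: sort by key so that equal keys become adjacent,
# which is the guarantee Hadoop gives the reduce phase.
pairs.sort(key=itemgetter(0))

# Reduce: aggregate the values of each key group.
totals = {key: sum(v for _, v in group)
          for key, group in groupby(pairs, key=itemgetter(0))}
print(totals)  # {'madrid': 17, 'toledo': 4}
```

On a cluster the same three phases run distributed across many nodes; the sketch only shows the logical flow.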

Just as "Hello World" is the classic first example implemented when learning a new programming language, "Word Count" is the example par excellence for introducing Map Reduce.

Today we will use Map Reduce to count the words that appear in the classic "Don Quixote de la Mancha".
How do we do it?
As always, we will work on our Cloudera virtual machine.
Here is the command we will use to execute Map Reduce:

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
   -input /user/toadworld/input \
   -output /user/toadworld/output \
   -mapper /home/cloudera/toadworld/mapper_cuentapalabras.py \
   -reducer /home/cloudera/toadworld/reducer_cuentapalabras.py

Let's take a moment to analyze the arguments:

-input: the HDFS directory that houses the text of "Don Quixote".
-output: the HDFS directory where Map Reduce will leave the results.
-mapper: the map function, which we will write in Python.
-reducer: the reduce function, which we will write in Python.
Let's move on to the steps needed to provide these arguments to Map Reduce.

1) Bring the texts of "Don Quixote" to a directory in HDFS
2) Develop a Python program with the "Map"
3) Develop a Python program with the "Reduce"


Bring the texts of "Don Quixote" to a directory in HDFS

To simplify the example, we take a few lines of Cervantes' classic.

[cloudera@quickstart toadworld]$ pwd
/home/cloudera/toadworld

[cloudera@quickstart toadworld]$ ls -l
total 12

-rw-rw-r-- 1 cloudera cloudera 390 Jun 26 08:50 quijote1.txt
-rw-rw-r-- 1 cloudera cloudera 420 Jun 26 08:51 quijote2.txt
-rw-rw-r-- 1 cloudera cloudera 431 Jun 26 08:52 quijote3.txt

[cloudera@quickstart toadworld]$ cat quijote1.txt

In a place of La Mancha, whose name I do not want to remember, there has not been a long time that lived a lord of the lance in the shipyard, old pork, skinny rocín and greyhound runner. A pot of something more cow than ram, spatter the most nights, duels and breaks on Saturdays, giblets on Fridays, some palomino add on Sundays, consumed the three parts of his estate.

[cloudera@quickstart toadworld]$ cat quijote2.txt

The rest of the party concluded the evening dress, the hairy shorts for the parties, their slippers of the same, and the days of midweek were honored with their vellorí of the finest. He had in his house a maid who was in his forties and a niece who was not in his twenties, and a boy in the country and in the square who saddled the rocin as he took the pruning sheat. It was the age of our hidalgo at the age of fifty.

[cloudera@quickstart toadworld]$ cat quijote3.txt

He was of a hard complexion, dry of flesh, thin of face, great early bird and friend of the hunt. They mean that he had the nickname "Quijada" or "Quesada," that there is some difference in the authors of this case, although by verisimilar conjectures it is understood that it was called Quijana. But this matters little to our story: it is enough that in the narration of him does not leave a point of truth.

[cloudera@quickstart toadworld]$

We bring the files to the HDFS distributed file system:

[cloudera@quickstart toadworld]$ hdfs dfs -mkdir /user/toadworld
[cloudera@quickstart toadworld]$ hdfs dfs -mkdir /user/toadworld/input
[cloudera@quickstart toadworld]$ hdfs dfs -put /home/cloudera/toadworld/quijote*.txt /user/toadworld/input

[cloudera@quickstart toadworld]$ hdfs dfs -ls /user/toadworld/input

Found 3 items

-rw-r--r-- 1 cloudera supergroup 390 2016-06-26 09:02 /user/toadworld/input/quijote1.txt
-rw-r--r-- 1 cloudera supergroup 420 2016-06-26 09:02 /user/toadworld/input/quijote2.txt
-rw-r--r-- 1 cloudera supergroup 431 2016-06-26 09:02 /user/toadworld/input/quijote3.txt

[cloudera@quickstart toadworld]$

Develop a Python program with the "Map"

Here is the code of the mapper, written in Python:

[cloudera@quickstart toadworld]$ cat mapper_cuentapalabras.py

#!/usr/bin/env python
import sys

# For every word read from stdin, emit a tab-separated (word, 1) pair.
for line in sys.stdin:
    line = line.strip()
    keys = line.split()
    for key in keys:
        value = 1
        print('{0}\t{1}'.format(key, value))
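Since the mapper only reads stdin and writes stdout, its logic is easy to exercise by hand. The snippet below (a hypothetical test harness, not part of the article's scripts) runs the same logic over an in-memory buffer and collects what would be printed:

```python
import io

# Run the mapper logic over an in-memory buffer instead of real stdin,
# collecting the tab-separated pairs it would print.
stdin = io.StringIO("In a place of La Mancha\n")

emitted = []
for line in stdin:
    for key in line.strip().split():
        emitted.append('{0}\t{1}'.format(key, 1))

print(emitted[0])  # first pair: "In\t1"
```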

Develop a Python program with the "Reduce"

Here is the code of the reducer, written in Python:

[cloudera@quickstart toadworld]$ cat reducer_cuentapalabras.py

#!/usr/bin/env python
import sys

# Input pairs arrive sorted by key, so identical words are adjacent.
last_key = None
total = 0
for line in sys.stdin:
    line = line.strip()
    key, value = line.split('\t', 1)
    value = int(value)
    if last_key == key:
        total += value
    else:
        if last_key:
            print('{0}\t{1}'.format(last_key, total))
        total = value
        last_key = key
if last_key == key:
    print('{0}\t{1}'.format(last_key, total))

Once we have built the programs in Python, we assign execution privileges:

[cloudera@quickstart toadworld]$ chmod +x *.py

Execution of the word counter

Now we invoke the framework and execute our solution to count the words of Don Quixote:

[cloudera@quickstart toadworld]$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar -input /user/toadworld/input -output /user/toadworld/output -mapper /home/cloudera/toadworld/mapper_cuentapalabras.py -reducer /home/cloudera/toadworld/reducer_cuentapalabras.py

packageJobJar: [] [/usr/jars/hadoop-streaming-2.6.0-cdh5.4.2.jar] /tmp/streamjob7912780591052138691.jar tmpDir=null
16/06/26 09:32:04 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
16/06/26 09:32:05 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
16/06/26 09:32:05 INFO mapred.FileInputFormat: Total input paths to process : 3
16/06/26 09:32:06 INFO mapreduce.JobSubmitter: number of splits: 3
16/06/26 09:32:06 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1466955622287_0002
16/06/26 09:32:06 INFO impl.YarnClientImpl: Submitted application application_1466955622287_0002
16/06/26 09:32:06 INFO mapreduce.Job: The url to track the job: http://quickstart.cloudera:8088/proxy/application_1466955622287_0002/
16/06/26 09:32:06 INFO mapreduce.Job: Running job: job_1466955622287_0002
16/06/26 09:32:17 INFO mapreduce.Job: Job job_1466955622287_0002 running in uber mode : false
16/06/26 09:32:17 INFO mapreduce.Job:  map 0% reduce 0%
16/06/26 09:32:35 INFO mapreduce.Job:  map 33% reduce 0%
16/06/26 09:32:37 INFO mapreduce.Job:  map 100% reduce 0%
16/06/26 09:32:45 INFO mapreduce.Job:  map 100% reduce 100%
16/06/26 09:32:45 INFO mapreduce.Job: Job job_1466955622287_0002 completed successfully
16/06/26 09:32:45 INFO mapreduce.Job: Counters: 49
               File System Counters
                              FILE: Number of bytes read = 2115
                              FILE: Number of bytes written = 454143
                              FILE: Number of read operations = 0
                              FILE: Number of large read operations = 0
                              FILE: Number of write operations = 0
                              HDFS: Number of bytes read = 1592
                              HDFS: Number of bytes written = 1257
                              HDFS: Number of read operations = 12
                              HDFS: Number of large read operations = 0
                              HDFS: Number of write operations = 2
               Job Counters
                              Launched map tasks = 3
                              Launched reduce tasks = 1
                              Data-local map tasks = 3
                              Total time spent by all maps in occupied slots (ms) = 51775
                              Total time spent by all reduces in occupied slots (ms) = 6869
                              Total time spent by all map tasks (ms) = 51775
                              Total time spent by all reduce tasks (ms) = 6869
                              Total vcore-seconds taken by all map tasks = 51775
                              Total vcore-seconds taken by all reduce tasks = 6869
                              Total megabyte-seconds taken by all map tasks = 53017600
                              Total megabyte-seconds taken by all reduce tasks = 7033856
               Map-Reduce Framework
                              Map input records = 3
                              Map output records = 217
                              Map output bytes = 1675
                              Map output materialized bytes = 2127
                              Input split bytes = 351
                              Combine input records = 0
                              Combine output records = 0
                              Reduce input groups = 140
                              Reduce shuffle bytes = 2127

Viewing results

The following are the results obtained:

[cloudera@quickstart toadworld]$ hdfs dfs -ls /user/toadworld/output
Found 2 items

-rw-r--r-- 1 cloudera supergroup 0 2016-06-26 09:32 /user/toadworld/output/_SUCCESS
-rw-r--r-- 1 cloudera supergroup 1257 2016-06-26 09:32 /user/toadworld/output/part-00000

[cloudera@quickstart toadworld]$ hdfs dfs -cat /user/toadworld/output/part-00000

The 1
In 1
It was 1
...
...
Out of 21
To say 1
...
...
the 2
in 5
...
The 9
Place 1
...
...
I lived 1
And 7
«Quesada», 1
«Quijada», 1
Quijana. 1

[cloudera@quickstart toadworld]$

That is all! We have completed a practical example of using Map Reduce to count the words of Cervantes' classic, "Don Quixote de la Mancha."

Visit and share your knowledge with our ToadWorld database community!

About the Author
Danny Torres
Since 2000, Danny Torres has been an Enterprise Tech Support Advisor supporting database tools. Within the KCS (Knowledge Centered Support) groups, Danny is certified as a Social Media and Community Professional...