MapReduce Distributed Data Processing
I am currently consulting on, designing, and implementing data analysis infrastructure (Data Warehouse, Lakehouse) for individuals and organizations in need. You can see and try a system I have built here. Please contact me via email: lakechain.nguyen@gmail.com. Thank you!
Overview
In the previous article, I introduced HDFS, a distributed file system (you can review it here). With HDFS, we have a system that can store practically unlimited data, not bounded by the hardware of any single machine. To process these extremely large amounts of data, stored in a distributed fashion across HDFS nodes, we need a computation model called MapReduce.
Instead of gathering the data on one node for processing (costly, if not impossible), the MapReduce program is sent to the nodes that hold the data and uses each node's resources to compute and store the results. This whole process happens automatically; the user (programmer) only needs to define two functions, map and reduce:
- map: a transformation function; it takes a <Key, Value> pair as input and must return one or more new <Key, Value> pairs.
- reduce: an aggregation function; it takes a <Key, Value[]> pair as input, where the value list contains all values sharing the same key, and must return a single <Key, Value> pair as the result.
WordCount Problem
Problem: Given a data file (a log) stored on HDFS, count the number of times each word appears in the file and write the result back to HDFS. Words are separated by spaces.
Pseudocode
In this program, the map function splits the input text into words and, for each word, emits a <word, 1> pair. The reduce function receives a word together with the list of all counts emitted for that word and sums them to obtain the number of occurrences of that word.
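A sketch of this logic, following the style of the pseudocode in Google's MapReduce paper (linked in the conclusion), where EmitIntermediate and Emit stand for the framework's output calls:

map(String key, String value):
    // key: document name; value: a line of the document
    for each word w in value:
        EmitIntermediate(w, "1")

reduce(String key, Iterator values):
    // key: a word; values: the list of counts emitted for that word
    int result = 0
    for each v in values:
        result += ParseInt(v)
    Emit(key, AsString(result))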
Install and run on Hadoop cluster
I will use the Docker containers built in this article to install and run the WordCount program.
Start HDFS
root@node01:~# $HADOOP_HOME/sbin/start-dfs.sh
Create a file on HDFS as input for the WordCount problem
root@node01:~# hdfs dfs -D dfs.replication=2 -appendToFile /lib/hadoop/logs/*.log /input_wordcount.log
Source code Java
Save the file as WordCount.java in the wordcount folder.
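For reference, a WordCount.java that works with the commands below is the canonical example from the official Hadoop MapReduce tutorial; a lightly commented version of it:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: split each input line into words and emit <word, 1>.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce phase: sum all counts for the same word and emit <word, total>.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    // The reducer doubles as a combiner to pre-aggregate counts on the map side.
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // args[0]: input path on HDFS; args[1]: output directory (must not exist yet).
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}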
Add environment variables
export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar
Compile code
root@node01:~/wordcount# hadoop com.sun.tools.javac.Main WordCount.java
root@node01:~/wordcount# jar cf wc.jar WordCount*.class
Run and check the results
root@node01:~/wordcount# hadoop jar wc.jar WordCount /input_wordcount.log /output_wordcount
root@node01:~/wordcount# hdfs dfs -cat /output_wordcount/part-r-00000
"script". 1
#1 10
#110 1
#129 1
...
Overview of MapReduce program operations
The MapReduce program execution process can be summarized in the following stages:
- Init: An ApplicationMaster (AM) is initialized and maintained until the program ends. Its task is to manage and coordinate the execution of tasks on the nodes. We will learn more about the AM in a following article.
- Map: The AM identifies the nodes holding the input data on HDFS and requests them to transform that data according to the instructions the user wrote in the map function.
- Shuffle and Sort: The <Key, Value> pairs produced by the map function are shuffled between nodes and sorted by key. Pairs with the same key are grouped together.
- Reduce: The grouped <Key, Value> pairs with the same key are processed according to the instructions in the reduce function to produce the final result.
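To make these stages concrete, here is how a single input line would flow through the WordCount job (illustrative values, not taken from the log file above):

Input split:      "hello world hello"
Map emits:        <hello, 1>, <world, 1>, <hello, 1>
Shuffle and Sort: <hello, [1, 1]>, <world, [1]>
Reduce emits:     <hello, 2>, <world, 1>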
When processing large amounts of data, the AM creates many tasks (each task processes one block) and executes them on many different nodes to increase performance. If a node fails while the job is running, the AM transfers that node's tasks to another node without affecting the other tasks or the program as a whole.
Conclusion
Through this article, I have introduced the basics of the MapReduce computation model through the WordCount example. You can learn more about MapReduce in this article from Google. See you in the next articles!