Spark guidelines for beginners


In the previous article, I introduced the MapReduce computational model, which makes it possible to process large amounts of data stored across many computers. However, MapReduce has a drawback: it must continuously read and write data to disk, which makes MapReduce programs slow. Spark was born to solve this problem, based on a simple idea: data should be read only once from the input and written only once to the output, while intermediate data stays in memory instead of being repeatedly written to and read from disk. This significantly improves performance over MapReduce; experiments show that Spark can process data up to 100 times faster.

Contents

  1. Overview
  2. Setting up a development environment
  3. Getting started with Spark
  4. Running a Spark program on a Hadoop cluster
  5. Conclusion

🙏 I currently consult on, design, and implement data analysis infrastructure (Data Warehouse, Lakehouse) for individuals and organizations that need it. You can see and try a system I have built here. Please contact me via email: lakechain.nguyen@gmail.com. Thank you!

Overview

Spark is a fast, general-purpose, distributed computing engine that can process huge amounts of data in parallel across multiple computers at once. Below is a diagram describing how a Spark application operates when running on a cluster:

[Figure: how a Spark application runs on a cluster]

Similar to MapReduce, the Spark program is sent to the nodes that hold the data. One node is selected as the master and runs the driver process, while the other nodes become workers. The driver creates tasks and divides them among the workers following the data-locality principle: data on a node is processed by the executor running on that same node. To increase processing efficiency, data is loaded and kept in the workers' memory in a data structure called a Resilient Distributed Dataset (RDD).

An RDD supports two types of operations: transformations and actions.

All transformations are lazy: they do not process any data when they are called. Only when an action is called, and the result of a transformation is needed, does the transformation actually compute. This design makes Spark run more efficiently, because it does not have to keep unused intermediate data in memory. After each action completes, the intermediate data is released from memory; if you want to keep it for the next action, use the persist or cache method.
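
To make the laziness concrete, here is a minimal self-contained sketch (the object name and the sample numbers are my own illustration, not from any project mentioned here). The two transformations only build a plan; nothing runs until the first action is called, and cache keeps the intermediate result in memory for the second action:

import org.apache.spark.sql.SparkSession

object RddLazinessExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("rdd-laziness")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Transformations (filter, map) are lazy: nothing executes here yet.
    val numbers = sc.parallelize(1 to 1000000)
    val evens   = numbers.filter(_ % 2 == 0)
    val squared = evens.map(n => n.toLong * n)

    // cache() keeps the intermediate RDD in memory, so the second
    // action below does not recompute the whole chain.
    squared.cache()

    // Actions trigger the actual computation.
    println(s"count = ${squared.count()}") // first action: computes and caches
    println(s"sum   = ${squared.sum()}")   // second action: served from cache

    spark.stop()
  }
}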

Along with RDDs, Spark also provides DataFrames and Datasets, which are both distributed data structures. In a DataFrame, data is organized into named columns, similar to a table in a database, while in a Dataset each element is a type-safe JVM object, which makes both very convenient for handling structured data.
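
A short sketch of the difference (the Person case class and the sample rows are my own examples):

import org.apache.spark.sql.SparkSession

// A case class gives the Dataset its compile-time (type-safe) schema.
case class Person(name: String, age: Int)

object DataframeDatasetExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("df-ds-demo")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._ // enables .toDF / .toDS on Scala collections

    // DataFrame: rows organized into named columns, like a database table.
    val df = Seq(("Alice", 30), ("Bob", 25)).toDF("name", "age")
    df.filter($"age" > 26).show()

    // Dataset: each element is a typed JVM object, checked at compile time.
    val ds = Seq(Person("Alice", 30), Person("Bob", 25)).toDS()
    ds.filter(_.age > 26).show()

    spark.stop()
  }
}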

Setting up a development environment

That’s the basics; now let’s get started with the installation and write some test code!

  1. First, download the latest version of Spark here
  2. Next, extract the archive and move it to an installation directory on your computer (on my machine, the ~/Install folder):
    $ tar -xzvf spark-3.1.2-bin-hadoop3.2.tgz
    $ mv spark-3.1.2-bin-hadoop3.2 ~/Install/spark
    
  3. Add the environment variables (for example, to your ~/.bash_profile):
    export SPARK_HOME=~/Install/spark
    export PATH=$PATH:$SPARK_HOME/bin
    
  4. Reload environment variables
    $ source ~/.bash_profile
    
  5. You also need to install Java and Scala. Note that you must choose Java and Scala versions that are compatible with your Spark version; here I use Java 11 and Scala 2.12.

  6. Check whether Spark was installed successfully by running spark-shell:
$ spark-shell
Spark context Web UI available at http://localhost:4041
Spark context available as 'sc' (master = local[*], app id = local-1676799060327).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.1.2
      /_/
         
Using Scala version 2.12.15 (OpenJDK 64-Bit Server VM, Java 11.0.17)
Type in expressions to have them evaluated.
Type :help for more information.

scala> 
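
Once the prompt appears, you can run a small computation directly in the shell as a sanity check, for example:

scala> sc.parallelize(1 to 100).sum()
res0: Double = 5050.0

scala> spark.range(5).count()
res1: Long = 5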

Getting started with Spark

With Spark, I use the Scala language and the IntelliJ IDE. For convenience, I have created a sample Spark project here that you can clone or fork. A few notes on this project:

# --master          : cluster manager (local[*] = run locally using all cores)
# --deploy-mode     : client (driver on the current node) or cluster (driver on any node of the cluster)
# --driver-memory   : RAM for the driver
# --executor-memory : RAM for each executor
# --executor-cores  : number of cores (threads) per executor
# --num-executors   : number of executors
# --packages        : additional library dependencies
spark-submit --class Main \
--master local[*] \
--deploy-mode client \
--driver-memory 2g \
--executor-memory 2g \
--executor-cores 2 \
--num-executors 1 \
--conf spark.dynamicAllocation.enabled=false \
--conf spark.scheduler.mode=FAIR \
--packages org.web3j:abi:4.5.10,org.postgresql:postgresql:42.5.0 \
--files ./application.properties \
../spark_template_dev.jar \
application.properties &
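
To connect this command to the code it launches, below is a rough sketch of what such a Main entry point could look like. It is illustrative only (the property-file handling and app name are my assumptions), not the actual Main of the template project:

import java.io.FileInputStream
import java.util.Properties
import org.apache.spark.sql.SparkSession

// Illustrative sketch only: the real Main in the template project may differ.
object Main {
  def main(args: Array[String]): Unit = {
    // The properties file is passed as the first program argument
    // (application.properties in the spark-submit command above).
    val props = new Properties()
    props.load(new FileInputStream(args(0)))

    val spark = SparkSession.builder()
      .appName("spark-template")
      .getOrCreate() // master and deploy mode come from spark-submit

    // ... job logic goes here ...

    spark.stop()
  }
}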

Running a Spark program on a Hadoop cluster

First, you need to install Spark on the Hadoop cluster following the instructions here. Note that because Spark will use YARN as the cluster manager, you only need to install Spark on one node, and you will submit Spark jobs from that node.

Next, pull the code from git to the node where Spark is installed and run the job in the run/product directory. Note that you need to change the --master configuration in the run.sh file from local[*] to yarn, as sketched below.
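
A minimal sketch of how the spark-submit invocation in run.sh might look after that change (assuming the same flags as the local example above; the other --conf and --packages flags can stay as they were):

spark-submit --class Main \
--master yarn \
--deploy-mode client \
--driver-memory 2g \
--executor-memory 2g \
--executor-cores 2 \
--num-executors 1 \
--files ./application.properties \
../spark_template_dev.jar \
application.properties &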

While a Spark application is running, you can monitor it in the Spark UI:

[Figure: the Spark UI while an application is running]

Conclusion

Today’s article stops at introducing the most basic concepts of Spark, along with initial instructions for installing and getting acquainted with this framework. In the following articles, I will go deeper into Spark programming, using modules such as Spark SQL, Spark Streaming… See you in the next articles!

Lake Nguyen

Founder of Chainslake
