How Apache Spark works
Spark Architecture
Introduction
Apache Spark is a distributed compute engine used to process large volumes of data. In this article I am going to explain how it works behind the scenes.
What Spark Does
If you are new to the big data ecosystem, the information on Apache Spark can be a bit overwhelming. In this section I am going to walk through a basic scenario that makes the use case for Apache Spark easier to understand.
Consider a new retail store that opens in your neighborhood. In the early days, the shop owner keeps track of all the store transactions in a spreadsheet on a personal computer. As the business grows, the computer stops performing well because the data keeps growing, so the owner upgrades the computer's specs (in technical terms, vertical scaling). That solves the issue for a while. A few years later, the owner has 10 stores in the city and still wants to track everything on a single computer. Since a single machine has a limited amount of resources (RAM, disk, CPU), it will take a very long time to process such a large volume of data, and in most cases the spreadsheet program will crash while handling such a scenario. A single machine simply does not have enough capacity to process that much data.
In order to counter this issue, we have to use a cluster of machines (more than one machine working in harmony to accomplish the data processing). A framework is needed that coordinates the work and makes sure the tasks are performed without issues, and Apache Spark does just that. The cluster of machines which Spark uses to execute tasks is managed by a cluster manager such as YARN or Mesos.
Spark Architecture
Any Spark application consists of two major components:
- driver process
- executor process
Driver Process
The driver process runs the application. It is mainly responsible for three things:
- Maintaining information about the application
- Responding to input
- Analyzing, distributing, and scheduling work across the executors
Executor Process
Executors are responsible for actually carrying out the work the driver assigns to them. An executor mainly does two things:
- Perform the tasks assigned to it by the driver
- Report the state of the computation back to the driver
Here are key points:
- Spark uses cluster managers to track available cluster resources.
- The driver process distributes the tasks and commands across the executors for execution.
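To make this a bit more concrete, here is a minimal PySpark sketch of how these pieces show up in code. The app name, master URL, and resource values are placeholder assumptions; in local mode the driver and executors share one JVM, so the executor settings mainly matter when running against a cluster manager such as YARN.

```python
from pyspark.sql import SparkSession

# Hypothetical application and resource settings; the values are placeholders.
spark = (
    SparkSession.builder
    .appName("architecture-demo")
    .master("local[4]")                      # assumption: local mode; on a cluster this would point at YARN/Mesos
    .config("spark.executor.memory", "2g")   # memory requested per executor process
    .config("spark.executor.cores", "2")     # cores requested per executor process
    .getOrCreate()
)

# The driver splits this job into tasks; the executors run them and report
# their results back to the driver, which assembles the final list.
squares = spark.sparkContext.parallelize(range(10)).map(lambda x: x * x).collect()
print(squares)

spark.stop()
```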
Languages Supported By Spark
- Scala
- Java
- Python
- SQL
- R
For every supported programming language there is a SparkSession object available, which is the entry point for running Spark code. For non-JVM languages like Python or R, the code you write is translated into work that runs on the executor JVMs.
Spark APIs
As mentioned in the previous articles, Spark has two main sets of APIs:
- Low-level (Unstructured) API
- High-level (Structured) API
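To make the distinction concrete, here is a small sketch (assuming a local PySpark session) contrasting the two levels: the low-level RDD API works with raw Python objects, while the high-level DataFrame API works with structured rows and named columns.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("api-levels").getOrCreate()

# Low-level (unstructured) API: an RDD of plain Python tuples.
rdd = spark.sparkContext.parallelize([("a", 1), ("b", 2), ("a", 3)])
print(rdd.reduceByKey(lambda x, y: x + y).collect())   # e.g. [('a', 4), ('b', 2)]

# High-level (structured) API: a DataFrame with named columns and a schema.
df = spark.createDataFrame([("a", 1), ("b", 2), ("a", 3)], ["key", "value"])
df.groupBy("key").sum("value").show()

spark.stop()
```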
SparkSession
As we discussed in the architecture section, a Spark application is controlled by a driver process called the SparkSession. There is just one SparkSession for every Spark application.
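A minimal sketch of obtaining the SparkSession in PySpark: getOrCreate() returns the already-running session if one exists, which reflects the one-SparkSession-per-application rule. The app name is a placeholder.

```python
from pyspark.sql import SparkSession

# Create (or retrieve) the single SparkSession for this application.
spark = SparkSession.builder.appName("my-spark-app").getOrCreate()

# Calling getOrCreate() again returns the same session object.
same_spark = SparkSession.builder.getOrCreate()
print(spark is same_spark)   # True

print(spark.version)         # the Spark version this application is running on
```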
Partitions
To perform parallel execution, Spark splits the data into chunks called partitions. Each partition sits on one physical machine in the cluster, and the set of partitions represents how the data is distributed across the cluster.
With DataFrames, we do not need to specify partitions manually; Spark takes care of it. This is an important concept when we cover performance optimization.
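A small sketch of inspecting and changing the number of partitions in PySpark; the dataset size and the partition count of 8 are illustrative assumptions.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitions-demo").getOrCreate()

df = spark.range(0, 1_000_000)                # a DataFrame of numbers 0..999999

# Spark picked a default number of partitions for us.
print(df.rdd.getNumPartitions())

# We can explicitly redistribute the data into, say, 8 partitions.
repartitioned = df.repartition(8)
print(repartitioned.rdd.getNumPartitions())   # 8

spark.stop()
```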
Transformations and actions
In Spark, the core data structures are immutable; in other words, they cannot be modified after creation. For instance, a Spark DataFrame, once created, cannot be changed. This may sound a bit strange at first, but it becomes clearer once transformations and actions are understood together.
Now, any operation that needs to be performed on a DataFrame is called a transformation. This can be filtering the data, performing some arithmetic operation, and so on. You can apply multiple transformations to a DataFrame or any other Spark data structure. Since Spark is lazily evaluated, the operations are not executed immediately; behind the scenes, Spark builds a logical plan for the transformations.
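Here is a minimal sketch of chaining transformations in PySpark; none of these lines triggers a job, because each one only adds a step to the logical plan.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("transformations-demo").getOrCreate()

df = spark.range(0, 1000)

# Transformations: each returns a new DataFrame; the original is untouched.
filtered = df.where(F.col("id") % 2 == 0)                  # keep even ids
doubled = filtered.withColumn("doubled", F.col("id") * 2)

# Nothing has been computed yet; Spark has only built a plan.
doubled.explain()   # prints the plan without running the job
```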
Actions trigger the computation. While the transformations only build up a logical plan, an action makes Spark execute that plan. Until we perform an action, no actual computation takes place.
Here are some commonly used kinds of actions (illustrated in the sketch after this list):
- View Data
- Collect data to native objects
- Write output to some data source
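A minimal PySpark sketch showing one example of each of the three kinds of action listed above; the output path is a placeholder assumption.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("actions-demo").getOrCreate()

df = spark.range(0, 100).withColumnRenamed("id", "value")

# View data: print the first rows to the console.
df.show(5)

# Collect data to native objects: returns a list of Row objects to the driver.
rows = df.collect()
print(len(rows))   # 100

# Write output to a data source (the path is a placeholder).
df.write.mode("overwrite").parquet("/tmp/actions_demo_output")

spark.stop()
```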
There is another concept associated with transformations which I want to mention here (see the sketch after the list):
- Narrow transformations - one input partition contributes to exactly one output partition (1 -> 1). Narrow transformations happen in memory and are therefore faster.
- Wide transformations - one input partition contributes to many output partitions (1 -> n). A wide transformation redistributes data across partitions, so intermediate data needs to be written to disk, which makes the operation slower. This is also referred to as shuffling.
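A minimal PySpark sketch contrasting the two: where() is a narrow transformation, since each partition can be filtered on its own, while groupBy().count() is a wide transformation that shuffles rows with the same key into the same partition.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("narrow-vs-wide").getOrCreate()

df = spark.range(0, 100_000).withColumn("key", F.col("id") % 10)

# Narrow transformation: filtering never moves rows between partitions.
narrow = df.where(F.col("id") > 50_000)

# Wide transformation: grouping requires a shuffle so that all rows with
# the same key end up in the same partition.
wide = df.groupBy("key").count()

# The physical plan for the wide transformation contains an Exchange (shuffle) step.
wide.explain()

spark.stop()
```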