Apache Spark

  1. What is Apache Spark?
  2. History
  3. Why Apache Spark
  4. Apache Spark Structure and Architecture
  5. Spark APIs
  6. Demo
What is Apache Spark?

Apache Spark is a general-purpose platform for fast processing of large-scale data, written in the Scala programming language.
    • A framework for distributed computing
    • In-memory, fault tolerant data structures
    • APIs for Scala, Java, Python, R, and SQL
    • Open source
Apache Spark is a unified computing engine and a set of libraries for parallel data processing on computer clusters.
    • Unified: a single platform for developing big data applications. Within the same platform you can perform data loading, SQL queries, data transformation, machine learning, and streaming computation, all built on the same core abstraction (RDD).
    • Computing engine: Spark performs computations over data that lives elsewhere (cloud storage, files, SQL databases, Hadoop, Amazon S3, etc.). It does not store the data itself.
    • Libraries: built-in and external libraries. The core engine changes little; the built-in libraries are Spark SQL, MLlib, Streaming, and GraphX.
Map: capture data in a structure that is convenient for analysis.
Reduce: run the analysis on the captured data.
Analogy to an RDBMS (Relational Database Management System):
    • Map: select the wanted fields with the SELECT clause and filter rows with the WHERE clause.
    • Reduce: compute results over the selected data with COUNT, SUM, HAVING, etc.
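The map/reduce analogy above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not Spark code; the `orders` records and their field names are made up for the example.

```python
from functools import reduce

# Hypothetical sample records; the field names are assumptions for illustration.
orders = [
    {"customer": "ada", "country": "TR", "amount": 120},
    {"customer": "bob", "country": "UK", "amount": 80},
    {"customer": "cem", "country": "TR", "amount": 50},
]

# Map step: keep only the rows and fields of interest
# (comparable to SELECT amount ... WHERE country = 'TR').
tr_amounts = [o["amount"] for o in orders if o["country"] == "TR"]

# Reduce step: fold the mapped values into a single result
# (comparable to SUM(amount)).
total = reduce(lambda a, b: a + b, tr_amounts, 0)
print(total)
```

The map step only reshapes and filters; all aggregation happens in the reduce step, which is what lets the map step run independently on each piece of the data.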
Why Apache Spark?
Apache Spark General Usage Areas
Interactive Query
  • Enterprise-scale data volumes accessible to interactive query for business intelligence (BI)
  • Faster time to job completion allows analysts to ask the “next” question about their data & business
Large-Scale Batch
  • Data cleaning to improve data quality (missing data, entity resolution, unit mismatch, etc.)
  • Nightly ETL processing from production systems
Complex Analytics
  • Forecasting vs. “Nowcasting” (e.g. Google Search queries analyzed en masse for Google Flu Trends to predict outbreaks)
  • Data mining across various types of data
Event Processing
  • Web server log file analysis (human-readable file formats that are rarely read by humans) in near-real time
  • Responsive monitoring of RFID-tagged devices
Model Building
  • Predictive modeling answers questions of “what will happen?”
  • Self-tuning machine learning, continually updating algorithms, and predictive modeling
Spark is implemented inside many types of products, across a multitude of industries and organizations.

Apache Spark Structure and Architecture
  • Driver Process: runs the main function; the user code lives in this process.
    • Maintains all relevant information about the Spark application
    • Responds to the user program and its inputs
    • Distributes jobs to the executors and schedules them
  • Executor Process: carries out the work assigned by the driver
    • Runs the code assigned by the driver
    • Reports the status of its assigned jobs back to the driver
  • Cluster Manager: controls the physical machines and allocates resources to Spark applications
    • Cluster managers: Spark standalone, YARN, Mesos, Kubernetes
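The division of labor between driver and executors can be pictured with a small plain-Python sketch. It is an analogy only, not how Spark is implemented: a thread pool stands in for the executors, and the function names `driver` and `run_task` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def run_task(task_id, partition):
    """Executor side: run the assigned code and report the status back."""
    result = sum(x * x for x in partition)
    return {"task": task_id, "status": "SUCCESS", "result": result}


def driver(data, num_executors=2):
    """Driver side: split the job into tasks, hand them out, collect the reports."""
    # One partition per executor (round-robin split).
    partitions = [data[i::num_executors] for i in range(num_executors)]
    with ThreadPoolExecutor(max_workers=num_executors) as pool:
        futures = [pool.submit(run_task, i, p) for i, p in enumerate(partitions)]
        reports = [f.result() for f in futures]
    # The driver combines the partial results reported by the executors.
    return reports, sum(r["result"] for r in reports)


reports, total = driver(list(range(1, 7)))
print(reports)
print(total)
```

In a real cluster the pool of workers is provided by the cluster manager, and the driver and executors run as separate processes, usually on different machines.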
Spark APIs
Spark APIs - Low-Level APIs - RDD
  • Resilient Distributed Dataset
  • Basic Spark abstraction
  • Virtually everything in Spark is built on RDDs
  • Data operations are performed on RDDs
  • An immutable, partitioned collection of records that can be operated on in parallel.
  • No schema; a row is a Java/Python object.
  • Can be persisted in memory or on disk.
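To make "immutable, partitioned collection" concrete, here is a toy model in plain Python. `ToyRDD` is a hypothetical class written for this post; it is not part of Spark and leaves out fault tolerance, laziness, and persistence.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class ToyRDD:
    """Hypothetical stand-in for an RDD: an immutable tuple of partitions."""
    partitions: Tuple[tuple, ...]

    @staticmethod
    def parallelize(records, num_partitions):
        # Round-robin split of the records into num_partitions partitions.
        return ToyRDD(tuple(tuple(records[i::num_partitions])
                            for i in range(num_partitions)))

    def map(self, func: Callable) -> "ToyRDD":
        # Partitions are independent, so each could be processed in parallel.
        # A new ToyRDD is returned; the original is never modified.
        return ToyRDD(tuple(tuple(func(x) for x in part)
                            for part in self.partitions))

    def collect(self):
        # Gather the records of all partitions into one list.
        return [x for part in self.partitions for x in part]


numbers = ToyRDD.parallelize([1, 2, 3, 4, 5], num_partitions=2)
doubled = numbers.map(lambda x: x * 2)
print(numbers.partitions)
print(doubled.collect())
```

Because every operation returns a new collection instead of changing the old one, a lost partition can in principle be recomputed from its parent, which is the idea behind the "resilient" in the name.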
Spark Data Sources
Core Data Sources
  • CSV
  • JSON
  • Parquet
  • ORC
  • JDBC/ODBC connections
  • Plain-text files
Community-Created Data Sources
  • Cassandra
  • HBase
  • MongoDB
  • AWS Redshift
  • XML
  • Redis
  • And many, many others
Spark Data Operations - Transformations


map(func)

Return a new distributed dataset formed by passing each element of the source through a function func.

filter(func)

Return a new dataset formed by selecting those elements of the source on which func returns true.

flatMap(func)

Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).

groupByKey([numPartitions])

When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs.

union(otherDataset)

Return a new dataset that contains the union of the elements in the source dataset and the argument.

distinct([numPartitions])

Return a new dataset that contains the distinct elements of the source dataset.
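The semantics of these transformations (map, filter, flatMap, groupByKey, union, distinct) can be mimicked on ordinary Python lists. The sketch below runs locally and eagerly, so it shows only what each operation computes, not how Spark distributes or defers the work; the sample `lines` are made up.

```python
from itertools import chain

lines = ["spark is fast", "spark is unified"]

# map(func): exactly one output element per input element.
lengths = list(map(len, lines))

# filter(func): keep the elements for which func returns true.
with_fast = list(filter(lambda s: "fast" in s, lines))

# flatMap(func): func returns a sequence per element; the results are flattened.
words = list(chain.from_iterable(s.split() for s in lines))

# groupByKey(): (K, V) pairs become (K, [V, ...]).
pairs = [(w, 1) for w in words]
grouped = {}
for key, value in pairs:
    grouped.setdefault(key, []).append(value)

# union(otherDataset): elements of both datasets, duplicates kept.
combined = words + ["spark"]

# distinct(): unique elements; sorted here only to make the order deterministic.
unique = sorted(set(combined))

print(lengths, with_fast, words, grouped, unique, sep="\n")
```

Note the difference between map and flatMap: mapping `split` over the two lines would give two lists of words, while flatMap gives one flat list of six words.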

Spark Data Operations - Actions

reduce(func)

Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel.

collect()

Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.

count()

Return the number of elements in the dataset.

first()

Return the first element of the dataset (similar to take(1)).

take(n)

Return an array with the first n elements of the dataset.

takeSample(withReplacement, num, [seed])

Return an array with a random sample of num elements of the dataset, with or without replacement, optionally pre-specifying a random number generator seed.

saveAsTextFile(path)

Write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system. Spark will call toString on each element to convert it to a line of text in the file.

saveAsSequenceFile(path)

Write the elements of the dataset as a Hadoop SequenceFile in a given path in the local filesystem, HDFS or any other Hadoop-supported file system. This is available on RDDs of key-value pairs that implement Hadoop’s Writable interface.
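As a rough local analogy, the actions above correspond to the following operations on a plain Python list. This is not Spark code; the comments name the Spark action each line imitates, and the output file name `part-00000` is used only because Spark names its output files in that style.

```python
import os
import random
import tempfile
from functools import reduce

data = [5, 3, 8, 1]

total = reduce(lambda a, b: a + b, data)  # reduce(func): addition is commutative and associative
everything = list(data)                   # collect(): bring all elements to the caller
count = len(data)                         # count()
first = data[0]                           # first()
first_two = data[:2]                      # take(2)

# takeSample(False, 2, seed): random sample without replacement, with a fixed seed.
sample = random.Random(42).sample(data, 2)

# saveAsTextFile(path): one line of text per element, using str() on each element.
out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, "part-00000")
with open(out_path, "w") as f:
    for x in data:
        f.write(f"{x}\n")

print(total, count, first, first_two)
```

Unlike transformations, each of these returns a value to the caller or writes output, which in Spark is what triggers the actual computation.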

Spark APIs - Low-Level APIs - DAG
  • Transformations and actions define an application’s Directed Acyclic Graph (DAG).
  • A physical execution plan is derived from the DAG:
  • The DAG scheduler splits the DAG into multiple stages. Stage boundaries fall at wide (shuffle) transformations; narrow transformations are pipelined together within a stage.
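The way a chain of transformations is cut into stages can be sketched as follows. This is a deliberately simplified model that assumes a linear chain of operations and a fixed set of "wide" operations; the function `split_into_stages` is hypothetical, and the real DAG scheduler works on a graph of RDD dependencies rather than on a list of names.

```python
# Transformations that need a shuffle (assumed set, for illustration only).
WIDE = {"groupByKey", "distinct"}


def split_into_stages(lineage):
    """Pipeline consecutive narrow operations; a wide operation starts a new stage."""
    stages, current = [], []
    for op in lineage:
        if op in WIDE and current:
            # The shuffle ends the current stage; the wide op opens the next one.
            stages.append(current)
            current = []
        current.append(op)
    if current:
        stages.append(current)
    return stages


lineage = ["flatMap", "map", "groupByKey", "map", "filter"]
stages = split_into_stages(lineage)
print(stages)
```

Here flatMap and map run back to back on each partition in the first stage, and the data is only exchanged between machines once, at the groupByKey boundary.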
Spark APIs - Dataset and DataFrame
  • Dataset
    • distributed collection of data
    • strongly typed
    • uses the Spark SQL engine
    • uses an Encoder to optimize filtering, sorting, and hashing without deserializing the object
  • DataFrame
    • is a Dataset with named columns, Dataset[Row]
    • equivalent to a relational database table (has a schema)
    • not strongly typed
    • uses the Catalyst optimizer on the logical plan, for example by pushing down filters and aggregations
    • uses the Tungsten optimizer on the physical plan by optimizing memory usage
  • In Spark 1.6 the DataFrame API was stable (it dates from Spark 1.3) and the Dataset API was introduced as experimental. In Spark 2.x the Dataset API became stable.
Spark APIs - Spark SQL
  • Provides relational queries expressed in SQL, HiveQL, and Scala
  • Seamlessly mix SQL queries with Spark programs
  • DataFrame/Dataset provide a single interface for efficiently working with structured data including Apache Hive, Parquet and JSON files
  • Leverages Hive frontend and metastore
    • Compatibility with Hive data, queries, and UDFs
    • HiveQL limitations may apply
    • Not ANSI SQL compliant
    • Little to no query rewrite optimization, automatic memory management or sophisticated workload management
  • Graduated from alpha status with Spark 1.3
  • Standard connectivity through JDBC/ODBC
Spark APIs - Streaming
  • Component of Spark
    • Project started in 2012
    • First alpha release in Spring 2013
    • Out of alpha with Spark 0.9.0
  • Discretized Stream (DStream) programming abstraction
    • Represented as a sequence of RDDs (micro-batches)
    • RDD: set of records for a specific time interval
    • Supports Scala, Java, and Python (with limitations)
  • Fundamental architecture: batch processing of datasets
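The micro-batch idea behind DStreams can be illustrated in plain Python: events are grouped by the time interval they fall into, and each group is then processed as an ordinary batch. The function `micro_batches` and the sample events are hypothetical; unlike a real DStream, this sketch simply skips intervals that contain no events.

```python
def micro_batches(events, batch_interval):
    """Group (timestamp, value) events into per-interval batches, ordered by time."""
    batches = {}
    for timestamp, value in events:
        # Start of the interval that this event falls into.
        batch_start = (timestamp // batch_interval) * batch_interval
        batches.setdefault(batch_start, []).append(value)
    return [(start, batches[start]) for start in sorted(batches)]


# Hypothetical events: (timestamp in seconds, payload).
events = [(0, "a"), (1, "b"), (3, "c"), (4, "d")]
stream = micro_batches(events, batch_interval=2)

# Each micro-batch is processed like a small batch job, here a simple count.
counts = [(start, len(batch)) for start, batch in stream]
print(stream)
print(counts)
```

This is why the slide calls the fundamental architecture "batch processing of datasets": the stream is handled as a sequence of small batch jobs rather than record by record.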
Current Spark Streaming I/O

  • Input Sources
    • Kafka, Flume, Twitter, ZeroMQ, MQTT, TCP sockets
    • Basic sources: sockets, files, Akka actors
    • Other sources require receiver threads
  • Time-based stream processing
    • requires application-level adjustments, which adds complexity
Structured Streaming API

  • New streaming engine built on the Spark SQL engine
    • Alpha release in Spark 2.0
    • Marked production-ready in Spark 2.2
    • Reduces programming differences between Spark Streaming and batch Spark
    • Includes improvements in state maintenance
Spark APIs - Machine Learning
  • Spark ML is Spark's machine learning library
    • The RDD-based package spark.mllib is now in maintenance mode
    • The primary API is now the DataFrame-based package spark.ml
  • Provides common algorithms and utilities
    • Classification
    • Regression
    • Clustering
    • Collaborative filtering
    • Dimensionality reduction
  • In essence, Spark ML provides a toolset for creating “pipelines” of machine-learning transformations on your data.
    • For example, it makes it easy to chain feature extraction, dimensionality reduction, and the training of a classifier into one model, which can later be used as a whole for classification.
  • MLlib is older and has been in development longer, so it has more features.
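The pipeline idea can be shown with a minimal plain-Python sketch: a feature stage and a classifier are chained and then used as a single model. The classes `Scale`, `ThresholdClassifier`, and `Pipeline` below are hypothetical and written for this post; they only imitate the fit/transform pattern and are not the spark.ml classes.

```python
class Scale:
    """Hypothetical feature stage: divide each feature by the training maximum."""

    def fit(self, rows):
        self.max_value = max(x for x, _ in rows)
        return self

    def transform(self, rows):
        return [(x / self.max_value, label) for x, label in rows]


class ThresholdClassifier:
    """Hypothetical classifier: threshold halfway between the two class means."""

    def fit(self, rows):
        pos = [x for x, label in rows if label == 1]
        neg = [x for x, label in rows if label == 0]
        self.threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
        return self

    def predict(self, x):
        return 1 if x > self.threshold else 0


class Pipeline:
    """Chains the feature stages and the classifier into one model."""

    def __init__(self, transformers, classifier):
        self.transformers = transformers
        self.classifier = classifier

    def fit(self, rows):
        # Fit each stage on the output of the previous one.
        for stage in self.transformers:
            rows = stage.fit(rows).transform(rows)
        self.classifier.fit(rows)
        return self

    def predict(self, x):
        # New data passes through the same fitted stages before classification.
        for stage in self.transformers:
            x = stage.transform([(x, None)])[0][0]
        return self.classifier.predict(x)


# Made-up training data: (feature, label).
training = [(2, 0), (4, 0), (8, 1), (10, 1)]
model = Pipeline([Scale()], ThresholdClassifier()).fit(training)
predictions = [model.predict(9), model.predict(3)]
print(predictions)
```

The benefit of the pattern is that the preprocessing learned during training is stored inside the model, so it is applied identically at prediction time.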