Apache Spark
- What is Apache Spark?
- History
- Why Apache Spark
- Apache Spark Structure and Architecture
- Spark APIs
- Demo
- A framework for distributed computing
- In-memory, fault tolerant data structures
- API that supports Scala, Java, Python, R, SQL
- Open source
- Unified: a unified platform for developing big data applications. Within the same platform you can load data, run SQL queries, transform data, train machine learning models, and process streams (all built on the RDD abstraction).
- Computing Engine: Spark performs computations over data stored elsewhere (cloud storage such as Amazon S3, files, SQL databases, Hadoop, etc.); it does not store the data itself.
- Libraries: built-in and external libraries. The core engine changes little; Spark SQL, MLlib, Spark Streaming and GraphX are built on top of it, and third-party packages are listed at spark-packages.org.
- Map: select the desired fields, as with a SQL SELECT clause, and filter rows, as with a WHERE clause.
- Reduce: compute aggregates over the data, as with COUNT, SUM, HAVING and similar clauses (a minimal sketch of the analogy follows below).
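As a rough illustration of this analogy, here is a minimal Scala sketch. It assumes an existing SparkContext `sc` (e.g. the spark-shell default) and a made-up `Sale` record type:

```scala
// Hypothetical sales records; sc is an existing SparkContext.
case class Sale(country: String, amount: Double)
val sales = sc.parallelize(Seq(Sale("TR", 10.0), Sale("DE", 25.0), Sale("TR", 5.0)))

// "Map" side ~ SELECT amount FROM sales WHERE country = 'TR'
val trAmounts = sales.filter(_.country == "TR").map(_.amount)

// "Reduce" side ~ SELECT SUM(amount) ...
val trTotal = trAmounts.reduce(_ + _)   // 15.0
```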
- Enterprise-scale data volumes accessible to interactive query for business intelligence (BI)
- Faster time to job completion allows analysts to ask the “next” question about their data & business
- Data cleaning to improve data quality (missing data, entity resolution, unit mismatch, etc.)
- Nightly ETL processing from production systems
- Forecasting vs. “Nowcasting” (e.g. Google Search queries analyzed en masse for Google Flu Trends to predict outbreaks)
- Data mining across various types of data
- Web server log file analysis (human-readable file formats that are rarely read by humans) in near-real time
- Responsive monitoring of RFID-tagged devices
- Predictive modeling answers questions of “what will happen?”
- Self-tuning machine learning, continually updating algorithms, and predictive modeling
- Driver Process: runs the main() function; the user's code lives in this process.
- Maintains all relevant information about the Spark application
- Responds to the user's program and its input
- Distributes jobs to the executors and schedules their order
- Executor Process: runs the work assigned to it by the driver
- Executes the code the driver assigns
- Reports the status of its assigned tasks back to the driver
- Cluster Manager: controls the physical machines and allocates resources to Spark applications
- Cluster Managers: Spark standalone, YARN, Mesos, Kubernetes (a driver-side sketch follows below)
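A minimal driver-side sketch of how these pieces connect; the application name and resource settings are illustrative, and the master URL is what selects the cluster manager:

```scala
import org.apache.spark.sql.SparkSession

// The driver process starts here: building a SparkSession connects the
// application to a cluster manager via the master URL.
val spark = SparkSession.builder()
  .appName("architecture-sketch")          // illustrative name
  .master("local[4]")                      // or "spark://host:7077", "yarn", "mesos://...", "k8s://..."
  .config("spark.executor.memory", "2g")   // resources the cluster manager grants to each executor
  .getOrCreate()

// Driver-side handle used to build jobs that the executors run.
val sc = spark.sparkContext
```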
- Resilient Distributed Dataset
- Basic Spark Abstraction
- Virtually everything in Spark is built on RDDs
- Data operations are performed on RDDs
- An immutable, partitioned collection of records that can be operated on in parallel
- No schema; each row is a plain Java/Python object
- Can be persisted in memory or on disk (see the sketch below)
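A minimal RDD sketch, assuming an existing SparkSession `spark` (e.g. the spark-shell default or the one built in the architecture sketch above):

```scala
import org.apache.spark.storage.StorageLevel

// An RDD: an immutable, partitioned collection of plain objects.
val numbers = spark.sparkContext.parallelize(1 to 1000, numSlices = 8)

// Transformations never mutate the source; they return new RDDs.
val squares = numbers.map(x => x * x)

// Keep the result in memory, spilling to disk if it does not fit.
squares.persist(StorageLevel.MEMORY_AND_DISK)

println(squares.getNumPartitions)   // 8
```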
- CSV
- JSON
- Parquet
- ORC
- JDBC/ODBC connections
- Plain-text files
- Cassandra
- HBase
- MongoDB
- AWS Redshift
- XML
- Redis
- And many, many others (a reading sketch follows below)
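A reading sketch for a few of the built-in sources above; all paths and JDBC connection details are placeholders. Sources such as Cassandra, HBase, MongoDB or Redis additionally require their connector packages from spark-packages.org.

```scala
// Built-in file formats.
val csvDF  = spark.read.option("header", "true").csv("data/people.csv")
val jsonDF = spark.read.json("data/events.json")
val pqDF   = spark.read.parquet("data/warehouse/sales.parquet")

// JDBC source (placeholder connection details).
val jdbcDF = spark.read.format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/shop")
  .option("dbtable", "public.orders")
  .option("user", "reader")
  .option("password", "secret")
  .load()
```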
Spark Data Operations - Transformations
- map(func): Return a new distributed dataset formed by passing each element of the source through the function func.
- filter(func): Return a new dataset formed by selecting those elements of the source on which func returns true.
- flatMap(func): Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).
- groupByKey([numTasks]): When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs.
- union(otherDataset): Return a new dataset that contains the union of the elements in the source dataset and the argument.
- distinct([numTasks]): Return a new dataset that contains the distinct elements of the source dataset.
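A sketch of these transformations, assuming an existing SparkSession `spark` and a placeholder input path. Transformations are lazy: they only record lineage, and nothing executes until an action is called.

```scala
// Transformations only build up lineage; no data is read or processed yet.
val lines = spark.sparkContext.textFile("data/sample.txt")   // placeholder path
val words = lines.flatMap(_.split("\\s+"))                   // flatMap: 0..n outputs per line
val longWords = words.filter(_.length > 3)                   // filter: keep matching elements
val pairs = words.map(w => (w, 1))                           // map: one output per input
val grouped = pairs.groupByKey()                             // (K, Iterable[V]) pairs
val all = longWords.union(words.distinct())                  // union + distinct
```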
Spark Data Operations - Actions
- reduce(func): Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel.
- collect(): Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.
- count(): Return the number of elements in the dataset.
- first(): Return the first element of the dataset (similar to take(1)).
- take(n): Return an array with the first n elements of the dataset.
- takeSample(withReplacement, num, [seed]): Return an array with a random sample of num elements of the dataset, with or without replacement, optionally pre-specifying a random number generator seed.
- saveAsTextFile(path): Write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system. Spark will call toString on each element to convert it to a line of text in the file.
- saveAsSequenceFile(path): Write the elements of the dataset as a Hadoop SequenceFile in a given path in the local filesystem, HDFS or any other Hadoop-supported file system. This is available on RDDs of key-value pairs that implement Hadoop's Writable interface.
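Continuing the transformation sketch above, a few actions that actually trigger execution (the output path is a placeholder):

```scala
// Actions trigger execution of the lineage built above and return results
// to the driver (or write them out).
println(words.count())                               // number of words
println(words.first())                               // first element, like take(1)
words.take(5).foreach(println)                       // first five words as a local array
val totalChars = words.map(_.length).reduce(_ + _)   // reduce with a commutative, associative function
words.distinct().saveAsTextFile("out/distinct-words") // one text file per partition
```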
Spark APIs - Low-Level APIs - DAG
- Transformations and actions define an application's Directed Acyclic Graph (DAG).
- Using the DAG, a physical execution plan is defined:
- The DAG Scheduler splits the DAG into multiple stages (stages are based on transformations; narrow transformations are pipelined together), as in the lineage sketch below.
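A small lineage sketch: narrow transformations (flatMap, map, filter) are pipelined into one stage, while a wide transformation such as reduceByKey introduces a shuffle and therefore a stage boundary. The input path is a placeholder.

```scala
// Narrow transformations are pipelined into a single stage; reduceByKey
// forces a shuffle, so the DAG scheduler starts a new stage there.
val counts = spark.sparkContext.textFile("data/sample.txt")  // placeholder path
  .flatMap(_.split("\\s+"))   // narrow
  .map(w => (w, 1))           // narrow: pipelined with flatMap
  .reduceByKey(_ + _)         // wide: stage boundary (shuffle)

// Print the lineage that the DAG scheduler splits into stages.
println(counts.toDebugString)
```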
- Dataset
- distributed collection of data
- strongly typed
- uses the Spark SQL engine
- uses Encoders to optimize filtering, sorting and hashing without deserializing the objects
- DataFrame
- is a Dataset with named columns: Dataset[Row]
- equivalent of a relational database table (has a schema)
- not strongly typed
- uses the Catalyst optimizer on the logical plan, pushing down filters and aggregations
- uses the Tungsten engine on the physical plan, optimizing memory usage
- The Dataset API was introduced in Spark 1.6 as experimental, alongside the already stable DataFrame API; in Spark 2.x the APIs were unified and the Dataset API became stable (a sketch follows below).
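A minimal DataFrame vs. Dataset sketch, assuming an existing SparkSession `spark`, a placeholder JSON path, and an illustrative Person case class:

```scala
import spark.implicits._

case class Person(name: String, age: Long)

// DataFrame: untyped rows plus a schema, i.e. Dataset[Row].
val df = spark.read.json("data/people.json")   // placeholder path
df.filter($"age" > 21).select("name").show()

// Dataset: the same data, strongly typed through an Encoder for Person.
val ds = df.as[Person]
ds.filter(_.age > 21).map(_.name).show()
```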
- Provides relational queries expressed in SQL, HiveQL and Scala
- Seamlessly mix SQL queries with Spark programs (see the sketch after this list)
- DataFrame/Dataset provide a single interface for efficiently working with structured data including Apache Hive, Parquet and JSON files
- Leverages Hive frontend and metastore
- Compatibility with Hive data, queries, and UDFs
- HiveQL limitations may apply
- Not ANSI SQL compliant
- Little to no query rewrite optimization, automatic memory management or sophisticated workload management
- Graduated from alpha status with Spark 1.3
- Standard connectivity through JDBC/ODBC
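A sketch of mixing SQL with DataFrame code, assuming an existing SparkSession `spark` and a placeholder path:

```scala
// Register a DataFrame as a temporary view, then mix SQL and DataFrame code.
val people = spark.read.json("data/people.json")    // placeholder path
people.createOrReplaceTempView("people")

val adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")

// The SQL result is an ordinary DataFrame, so the program continues in Scala.
adults.groupBy("age").count().orderBy("age").show()
```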
- Component of Spark
- Project started in 2012
- First alpha release in Spring 2013
- Out of alpha with Spark 0.9.0
- Discretized Stream (DStream) programming abstraction
- Represented as a sequence of RDDs (micro-batches)
- RDD: set of records for a specific time interval
- Supports Scala, Java, and Python (with limitations)
- Fundamental architecture: batch processing of datasets (a word-count sketch follows below)
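A classic DStream word-count sketch over a TCP socket source; the host, port, application name and batch interval are illustrative:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// DStream sketch: 5-second micro-batches over a TCP socket source.
val conf = new SparkConf().setAppName("dstream-sketch").setMaster("local[2]")
val ssc  = new StreamingContext(conf, Seconds(5))

val lines  = ssc.socketTextStream("localhost", 9999)   // placeholder host/port
val counts = lines.flatMap(_.split(" "))
                  .map(w => (w, 1))
                  .reduceByKey(_ + _)
counts.print()          // each batch is one RDD covering its time interval

ssc.start()
ssc.awaitTermination()
```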
Current Spark Streaming I/O
- Input Sources
- Kafka, Flume, Twitter, ZeroMQ, MQTT, TCP sockets
- Basic sources: sockets, files, Akka actors
- Other sources require receiver threads
- Time-based stream processing
- requires application-level adjustments
- adding complexity
Structured Streaming API
- New streaming engine built on the Spark SQL engine
- Alpha release in Spark 2.0
- Marked production-ready in Spark 2.2
- Reduces the programming differences between Spark Streaming and batch Spark code
- Includes improvements to state maintenance (see the sketch below)
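The same word count written as a Structured Streaming sketch; the socket source and trigger interval are illustrative, and an existing SparkSession `spark` is assumed:

```scala
import org.apache.spark.sql.streaming.Trigger
import spark.implicits._

// Structured Streaming: the DataFrame API over an unbounded input table.
val lines = spark.readStream
  .format("socket")                 // illustrative source for a demo
  .option("host", "localhost")
  .option("port", "9999")
  .load()

val counts = lines.as[String]
  .flatMap(_.split(" "))
  .groupBy("value")
  .count()                          // stateful aggregation handled by the engine

val query = counts.writeStream
  .outputMode("complete")
  .format("console")
  .trigger(Trigger.ProcessingTime("5 seconds"))
  .start()

query.awaitTermination()
```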
- Spark ML is the machine learning library
- The RDD-based package spark.mllib is now in maintenance mode
- The primary API is now the DataFrame-based package spark.ml
- Provides common algorithms and utilities:
- Classification
- Regression
- Clustering
- Collaborative filtering
- Dimensionality reduction
- Essentially, Spark ML provides a toolset for building "pipelines" of machine-learning transformations over your data.
- It makes it easy to chain, for example, feature extraction, dimensionality reduction, and the training of a classifier into a single model, which can later be used as a whole for classification (see the sketch below).
- MLlib is older and has been in development longer, so it has more features.
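A minimal spark.ml Pipeline sketch that chains tokenization, feature hashing and a logistic-regression classifier into one model. The tiny training set is purely illustrative, and an existing SparkSession `spark` is assumed:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// Illustrative labeled text data.
val training = spark.createDataFrame(Seq(
  (0L, "spark rdd dataset", 1.0),
  (1L, "completely unrelated text", 0.0)
)).toDF("id", "text", "label")

// Stages: feature extraction followed by a classifier.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr        = new LogisticRegression().setMaxIter(10)

// One pipeline, one fitted model for the whole chain.
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
val model    = pipeline.fit(training)

model.transform(training).select("text", "prediction").show()
```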