### Advanced Analytics with Spark: https://github.com/sryza/aas
## Preamble
  The following are tasks that simply could not have been accomplished 5 or 10 years ago.
    1. Build a model to detect credit card fraud using thousands of features and billions of transactions
    2. Intelligently recommend millions of products to millions of users
    3. Estimate financial risk through simulations of portfolios that include millions of instruments
    4. Easily manipulate data from thousands of human genomes to detect genetic associations with disease
 We now live in an age of big data.
  We now have tools for collecting, storing, and processing information at a scale previously unheard of.
  Behind these capabilities is an ecosystem of software that can leverage clusters of computers to chug through massive amounts of data.
    Examples of these software systems: Hadoop, Spark, Flink
  Data science tries to bridge the gap between having access to these tools and all this data and doing something useful with it.
## Spark
  Spark is an open source framework that combines an engine for distributing programs across clusters of machines with an elegant model for writing programs atop it. It maintains MapReduce’s linear scalability and fault tolerance, but extends it in three important ways.
    1. Rather than relying on a rigid map-then-reduce format, its engine can execute a more general DAG of operators
    2. It complements this capability with a rich set of transformations that enable users to express computation more naturally
    3. Spark extends its predecessors with in-memory processing.
   Spark's Dataset and DataFrame abstractions enable developers to materialize any point in a processing pipeline into memory across the cluster, meaning that future steps that want to deal with the same data set need not recompute it or reload it from disk.
    This capability opens up use cases that distributed processing engines could not previously approach.
   Spark is well suited for highly iterative algorithms that require multiple passes over a data set, as well as reactive applications that quickly respond to user queries by scanning large in-memory data sets.
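
   The in-memory point above can be sketched in Scala roughly as follows. This is a minimal illustration, not code from the book: it assumes Spark SQL is on the classpath, and the object name `CacheSketch`, the column names, and the row count are made up for the example. The idea is that `cache()` marks a DataFrame for in-memory storage, the first action materializes it, and later actions reuse the cached data instead of recomputing it.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Hypothetical example object; names and sizes are illustrative only.
object CacheSketch {
  // Returns (total row count, rows whose bucket equals 3).
  def countsOverCachedData(spark: SparkSession): (Long, Long) = {
    // A derived data set that several later steps want to reuse.
    val derived = spark.range(0, 1000000).toDF("id")
      .withColumn("bucket", col("id") % 10)

    // Mark the DataFrame for caching; nothing is computed yet.
    derived.cache()

    // First action: computes the data and populates the cache.
    val total = derived.count()

    // Second action: can scan the cached data rather than recompute it.
    val inBucketThree = derived.where(col("bucket") === 3).count()

    // Release the cached blocks once they are no longer needed.
    derived.unpersist()
    (total, inBucketThree)
  }

  def main(args: Array[String]): Unit = {
    // local[*] is an assumption so the sketch runs without a cluster.
    val spark = SparkSession.builder()
      .appName("cache-sketch")
      .master("local[*]")
      .getOrCreate()
    try {
      val (total, inBucketThree) = countsOverCachedData(spark)
      println(s"total = $total, bucket 3 = $inBucketThree")
    } finally {
      spark.stop()
    }
  }
}
```

   Whether caching pays off depends on how often the data is reused and whether it fits in cluster memory, so treat this as a starting point rather than a rule.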

   Spark boasts strong integration with the variety of tools in the Hadoop ecosystem.
    1. It can read and write data in all of the data formats supported by MapReduce, allowing it to interact with formats commonly used to store data on Hadoop, like Apache Avro and Apache Parquet (and good old CSV).
    2. It can read from and write to NoSQL databases like Apache HBase and Apache Cassandra.
    3. Its stream-processing library, Spark Streaming, can ingest data continuously from systems like Apache Flume and Apache Kafka.
    4. Its SQL library, Spark SQL, can interact with the Apache Hive Metastore, and the Hive on Spark initiative enabled Spark to be used as an underlying execution engine for Hive, as an alternative to MapReduce.
    5. It can run inside YARN, Hadoop’s scheduler and resource manager, allowing it to share cluster resources dynamically and to be managed with the same policies as other processing engines, like MapReduce and Apache Impala.
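
   As a rough illustration of the first point, the sketch below writes a tiny DataFrame as Parquet and as CSV and reads both back. It is only a sketch under assumptions: Spark SQL is on the classpath, the data and the `FormatRoundTripSketch` name are invented, and a local temporary directory stands in for HDFS.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Hypothetical example object; data and paths are illustrative only.
object FormatRoundTripSketch {
  // Returns the row counts read back from Parquet and from CSV.
  def roundTrip(spark: SparkSession): (Long, Long) = {
    import spark.implicits._

    // A local temp directory stands in for a distributed file system.
    val dir = Files.createTempDirectory("spark-formats-").toString
    val people = Seq(("alice", 34), ("bob", 45)).toDF("name", "age")

    // Write the same data in two formats.
    people.write.mode("overwrite").parquet(s"$dir/people.parquet")
    people.write.mode("overwrite").option("header", "true").csv(s"$dir/people.csv")

    // Read both back; Parquet keeps the schema, CSV columns come back as strings
    // unless a schema is supplied or inferred.
    val fromParquet = spark.read.parquet(s"$dir/people.parquet")
    val fromCsv = spark.read.option("header", "true").csv(s"$dir/people.csv")

    (fromParquet.count(), fromCsv.count())
  }

  def main(args: Array[String]): Unit = {
    // local[*] is an assumption so the sketch runs without a cluster.
    val spark = SparkSession.builder()
      .appName("format-round-trip-sketch")
      .master("local[*]")
      .getOrCreate()
    try {
      println(roundTrip(spark))
    } finally {
      spark.stop()
    }
  }
}
```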

## Data Analysis with Scala and Spark
  The Spark Programming Model: Spark programming starts with a dataset, usually residing in some form of distributed, persistent storage like HDFS. Writing a Spark program typically consists of a few related steps:
  1. Define a set of transformations on the input data set.
  2. Invoke actions that output the transformed data sets to persistent storage or return results to the driver’s local memory.
  3. Run local computations that operate on the results computed in a distributed fashion. These can help you decide what transformations and actions to undertake next.
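
  The three steps above can be sketched in Scala along these lines. This is a minimal example under assumptions, not the book's code: Spark SQL is on the classpath, a small generated range replaces input that would normally come from HDFS, and `ProgrammingModelSketch` and `evenSquares` are hypothetical names.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Hypothetical example object; the input is a stand-in for real data.
object ProgrammingModelSketch {
  // Steps 1 and 2: define transformations, then run an action.
  def evenSquares(spark: SparkSession): Array[Long] = {
    // Stand-in input: the numbers 1 through 10 in a column named "n".
    val numbers = spark.range(1, 11).toDF("n")

    // Step 1: transformations are lazy, so nothing executes here.
    val squares = numbers
      .where(col("n") % 2 === 0)
      .select((col("n") * col("n")).as("square"))

    // Step 2: collect() is an action; it runs the job and
    // returns the rows to the driver's local memory.
    squares.collect().map(_.getLong(0))
  }

  def main(args: Array[String]): Unit = {
    // local[*] is an assumption so the sketch runs without a cluster.
    val spark = SparkSession.builder()
      .appName("programming-model-sketch")
      .master("local[*]")
      .getOrCreate()
    try {
      val result = evenSquares(spark)
      // Step 3: ordinary local Scala code over the collected results.
      val total = result.sum
      println(s"even squares: ${result.sorted.mkString(", ")}; total = $total")
    } finally {
      spark.stop()
    }
  }
}
```

  Note that `collect()` pulls every row to the driver, which is fine for a small result like this one but not for large data sets; writing to persistent storage is the usual alternative in step 2.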