---
layout: home
title: Home
custom_title: Apache Spark™ - Unified Engine for large-scale data analytics
description: Apache Spark is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters.
type: page
navigation:
  weight: 1
  show: true
---
Simple.
Fast.
Scalable.
Unified.
Key features
Batch/streaming data
Unify the processing of your data in batches and real-time streaming, using your preferred language: Python, SQL, Scala, Java or R.
SQL analytics
Execute fast, distributed ANSI SQL queries for dashboarding and ad-hoc reporting. Runs faster than most data warehouses.
Data science at scale
Perform Exploratory Data Analysis (EDA) on petabyte-scale data without having to resort to downsampling.
Machine learning
Train machine learning algorithms on a laptop and use the same code to scale to fault-tolerant clusters of thousands of machines.
Python SQL Scala Java R
Run now
Install with 'pip'

$ pip install pyspark

$ pyspark

Use the official Docker image

$ docker run -it --rm spark:python3 /opt/spark/bin/pyspark

Quickstart Machine learning Analytics & data science
{% highlight python %}
df = spark.read.json("logs.json")
df.where("age > 21").select("name.first").show()
{% endhighlight %}
{% highlight python %}
from pyspark.ml.regression import RandomForestRegressor

# Every record contains a label and feature vector
df = spark.createDataFrame(data, ["label", "features"])

# Split the data into train/test datasets
train_df, test_df = df.randomSplit([.80, .20], seed=42)

# Set hyperparameters for the algorithm
rf = RandomForestRegressor(numTrees=100)

# Fit the model to the training data
model = rf.fit(train_df)

# Generate predictions on the test dataset
model.transform(test_df).show()
{% endhighlight %}

{% highlight python %}
df = spark.read.csv("accounts.csv", header=True)

# Select subset of features and filter for balance > 0
filtered_df = df.select("AccountBalance", "CountOfDependents").filter("AccountBalance > 0")

# Generate summary statistics
filtered_df.summary().show()
{% endhighlight %}

Run now

$ docker run -it --rm spark /opt/spark/bin/spark-sql

spark-sql>

{% highlight sql %}
SELECT name.first AS first_name, name.last AS last_name, age
FROM json.`logs.json`
WHERE age > 21;
{% endhighlight %}
Run now

$ docker run -it --rm spark /opt/spark/bin/spark-shell

scala>

{% highlight scala %}
val df = spark.read.json("logs.json")
df.where("age > 21")
  .select("name.first").show()
{% endhighlight %}
Run now

$ docker run -it --rm spark /opt/spark/bin/spark-shell

scala>

{% highlight java %}
Dataset<Row> df = spark.read().json("logs.json");
df.where("age > 21")
  .select("name.first").show();
{% endhighlight %}
Run now

$ docker run -it --rm spark:r /opt/spark/bin/sparkR

>

{% highlight r %}
df <- read.json(path = "logs.json")
df <- filter(df, df$age > 21)
head(select(df, df$name.first))
{% endhighlight %}

The most widely used engine for scalable computing
Thousands of companies, including 80% of the Fortune 500, use Apache Spark.
Over 2,000 contributors to the open source project from industry and academia.
Ecosystem
Apache Spark integrates with your favorite frameworks, helping to scale them to thousands of machines.
Data science and Machine learning
scikit learn
pandas
TensorFlow
PyTorch
mlflow
R
NumPy
SQL analytics and BI
Apache Superset
PowerBI
Looker
Redash
Tableau
dbt
Storage and Infrastructure
Elasticsearch
mongoDB
Apache Kafka
Delta Lake
Kubernetes
Apache Airflow
Parquet
SQL Server
Cassandra
Apache Iceberg
Apache Orc
Spark SQL engine: under the hood
Apache Spark is built on an advanced distributed SQL engine for large-scale data.
Adaptive Query Execution

Spark SQL adapts the execution plan at runtime, for example by automatically setting the number of reducers and choosing join algorithms based on runtime statistics.
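
This runtime re-planning is governed by a handful of configuration properties. As a sketch (property names come from Spark's configuration reference; the threshold value here is illustrative, not a recommendation), the relevant `spark-defaults.conf` entries look like:

```properties
# Enable Adaptive Query Execution (on by default in recent Spark releases)
spark.sql.adaptive.enabled                      true
# Coalesce shuffle partitions at runtime, i.e. pick the number of reducers
spark.sql.adaptive.coalescePartitions.enabled   true
# Allow AQE to switch a sort-merge join to a broadcast join when the
# runtime size of one side falls under this threshold (illustrative value)
spark.sql.adaptive.autoBroadcastJoinThreshold   10MB
```

The same properties can also be set per session via `spark.conf.set(...)` before running a query.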

Support for ANSI SQL

Use the same SQL you’re already comfortable with.

Structured and unstructured data

Spark SQL works on structured tables as well as semi-structured and unstructured data such as JSON files or images.

Chart: TPC-DS 1TB (no statistics), with vs. without Adaptive Query Execution. AQE accelerates TPC-DS queries by up to 8x.
Join the community
Spark has a thriving open source community, with contributors from around the globe building features, writing documentation, and assisting other users.