Merged
@@ -3,10 +3,6 @@
description: Learn how to deploy, configure, and benchmark Apache Spark SQL on Azure Cobalt 100 Arm64 VMs using the Gluten plugin and Velox backend for native query acceleration, with step-by-step setup and performance validation.
title: Run Apache Spark SQL workloads on Azure Cobalt 100 Arm64 using Gluten and Velox for accelerated analytics

draft: true
cascade:
draft: true

minutes_to_complete: 120

who_is_this_for: This is an advanced topic for data engineers, platform engineers, and developers who want to build and optimize high-performance Spark SQL workloads using native execution engines on Arm-based cloud environments.
@@ -16,7 +12,7 @@ learning_objectives:
- Build and integrate Gluten with the Velox backend for native query execution
- Configure Spark SQL for columnar and vectorized execution
- Generate and load TPC-DS datasets for benchmarking
- Run Spark SQL workloads and compare performance between vanilla Spark and Gluten + Velox
- Run Spark SQL workloads and compare performance between vanilla Spark and Gluten with Velox

prerequisites:
- A [Microsoft Azure account](https://azure.microsoft.com/) with access to Cobalt 100 based instances (Dpsv6)
@@ -1,5 +1,5 @@
---
title: "Overview of Azure Cobalt 100 and Apache Spark with Gluten and Velox"
title: "Understand Azure Cobalt 100 and Apache Spark with Gluten and Velox"

weight: 2

@@ -8,56 +8,22 @@ layout: "learningpathall"

## Azure Cobalt 100 Arm-based processor

Azure’s Cobalt 100 is Microsoft’s first-generation, in-house Arm-based processor. Built on Arm Neoverse N2, Cobalt 100 is a 64-bit CPU that delivers strong performance and energy efficiency for cloud-native, scale-out Linux workloads such as web and application servers, data analytics, open-source databases, and caching systems. Running at 3.4 GHz, Cobalt 100 allocates a dedicated physical core for each vCPU, which helps ensure consistent and predictable performance.
Azure’s Cobalt 100 is Microsoft’s first-generation, in-house Arm-based processor. Built on Arm Neoverse N2, Cobalt 100 is a 64-bit CPU that delivers strong performance and energy efficiency for cloud-native, scale-out Linux workloads. These workloads include web and application servers, data analytics, open-source databases, and caching systems. Running at 3.4 GHz, Cobalt 100 allocates a dedicated physical core for each vCPU, which ensures consistent and predictable performance.

To learn more, see the Microsoft blog [Announcing the preview of new Azure VMs based on the Azure Cobalt 100 processor](https://techcommunity.microsoft.com/blog/azurecompute/announcing-the-preview-of-new-azure-vms-based-on-the-azure-cobalt-100-processor/4146353).

## Apache Spark with Gluten and Velox

Apache Spark is an open-source distributed data processing engine designed for large-scale data analytics. It provides high-level APIs for SQL, streaming, machine learning, and graph processing, and is widely used for building data pipelines and analytical workloads.
Apache Spark is an open-source distributed data processing engine designed for large-scale data analytics. It provides high-level APIs for SQL, streaming, machine learning, and graph processing, and is widely used for building data pipelines and analytical workloads. For more information about Apache Spark, see the [Apache Spark Documentation](https://spark.apache.org/docs/latest/).

By default, Spark executes queries using the JVM (Java Virtual Machine), which can introduce overhead in CPU-intensive workloads. To address this, modern acceleration frameworks like **Gluten** and **Velox** enable native execution for improved performance.
By default, Spark executes queries using the Java Virtual Machine (JVM), which can introduce overhead in CPU-intensive workloads. To address this, modern acceleration frameworks such as Gluten and Velox enable native execution for improved performance.

**Gluten** is an open-source Spark plugin that offloads Spark SQL execution from the JVM to native engines. It acts as a bridge between Spark and high-performance backends, enabling efficient query execution while maintaining compatibility with existing Spark workloads.
Gluten is an open-source Spark plugin that offloads Spark SQL execution from the JVM to native engines. It acts as a bridge between Spark and high-performance backends, enabling efficient query execution while maintaining compatibility with existing Spark workloads. For more information about Gluten, see [Gluten Project](https://github.com/apache/incubator-gluten).
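
As a rough illustration of how a plugin like Gluten is attached to Spark, the sketch below assembles a set of Spark properties and renders them as `--conf` flags. The helper names `gluten_conf` and `to_cli_flags` are hypothetical, and the plugin class, shuffle manager class, and off-heap size are assumptions that vary between Gluten releases, so check them against the release you build.

```python
# Sketch only: assemble Spark properties that would load Gluten as a plugin.
# The class names and the off-heap size are assumptions; confirm them against
# the Gluten release you actually build.

def gluten_conf(offheap_size="8g"):
    """Return illustrative Spark properties for a Gluten with Velox session."""
    return {
        # Assumed plugin entry point (older releases used a different package).
        "spark.plugins": "org.apache.gluten.GlutenPlugin",
        # Native engines allocate outside the JVM heap, so off-heap memory is enabled.
        "spark.memory.offHeap.enabled": "true",
        "spark.memory.offHeap.size": offheap_size,
        # Assumed columnar shuffle manager class name.
        "spark.shuffle.manager": "org.apache.spark.shuffle.sort.ColumnarShuffleManager",
    }


def to_cli_flags(conf):
    """Render a property dict as a flat list of --conf key=value arguments."""
    flags = []
    for key, value in conf.items():
        flags.extend(["--conf", f"{key}={value}"])
    return flags


if __name__ == "__main__":
    # Print the flags so they can be pasted after a spark-shell or spark-submit command.
    print(" ".join(to_cli_flags(gluten_conf())))
```

Keeping the properties in one place makes it easy to launch the same workload with and without them when comparing vanilla Spark against the accelerated setup.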

**Velox** is a high-performance, vectorized execution engine written in C++. It is optimized for modern hardware, including Arm64 architectures such as Azure Cobalt 100. Velox processes data in a columnar format and uses vectorized execution to significantly reduce CPU overhead and improve query performance.
Velox is a high-performance, vectorized execution engine written in C++. It is optimized for modern hardware, including Arm64 architectures such as Azure Cobalt 100. Velox processes data in a columnar format and uses vectorized execution to significantly reduce CPU overhead and improve query performance. For more information about Velox, see [Velox Engine](https://github.com/facebookincubator/velox).
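
To make the columnar idea concrete, here is a minimal sketch in plain Python that computes the same aggregate from a row layout and from a column layout. It is not Velox code: the sample data and the function names are made up for illustration, and pure Python does not use SIMD instructions. It only shows how a columnar layout lets an operator scan the columns it needs as contiguous arrays.

```python
from array import array

# Hypothetical sample data in a row-oriented layout: one dict per record.
ROWS = [
    {"item_id": 1, "quantity": 2, "price": 10.0},
    {"item_id": 2, "quantity": 1, "price": 4.5},
    {"item_id": 3, "quantity": 3, "price": 2.0},
]


def to_columns(rows):
    """Convert rows into a columnar batch: one typed, contiguous array per column."""
    return {
        "item_id": array("q", (r["item_id"] for r in rows)),
        "quantity": array("q", (r["quantity"] for r in rows)),
        "price": array("d", (r["price"] for r in rows)),
    }


def revenue_row_wise(rows):
    """Row-at-a-time evaluation: every record is touched in full."""
    total = 0.0
    for r in rows:
        total += r["quantity"] * r["price"]
    return total


def revenue_columnar(cols):
    """Column-at-a-time evaluation: only the two needed columns are scanned."""
    return sum(q * p for q, p in zip(cols["quantity"], cols["price"]))


if __name__ == "__main__":
    batch = to_columns(ROWS)
    print(revenue_row_wise(ROWS), revenue_columnar(batch))
```

Both functions return the same total. A native engine applies the same column-at-a-time pattern to typed buffers, which is where cache locality and vector instructions can help.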

Together, **Gluten + Velox** provide:
### What you've learned and what's next

- Native (off-JVM) execution of Spark SQL queries
- Vectorized processing for faster computation
- Reduced memory and CPU overhead
- Improved performance on Arm-based infrastructure

To learn more, see:
- [Apache Spark Documentation](https://spark.apache.org/docs/latest/)
- [Gluten Project](https://github.com/apache/incubator-gluten)
- [Velox Engine](https://github.com/facebookincubator/velox)


### Key Capabilities

- **Native Query Execution:**
Spark SQL queries are executed using Velox instead of JVM-based execution.

- **Columnar Processing:**
Data is processed in columnar batches, improving cache efficiency and throughput.

- **Vectorized Execution:**
Multiple data values are processed in a single CPU instruction, accelerating computation.

- **Hardware Optimization:**
Velox is optimized for modern CPUs, including Arm64 (Azure Cobalt 100), delivering better performance per core.

### In This Learning Path

In this Learning Path, you will:

- Deploy Apache Spark on an Azure Cobalt 100 Arm64 virtual machine
- Build and integrate Gluten with the Velox backend
- Configure Spark to use native execution
- Run Spark SQL workloads using Gluten + Velox
- Generate and load TPC-DS benchmark datasets
- Execute analytical queries and measure performance
- Compare accelerated workloads against vanilla Spark
You've now learned about Azure Cobalt 100 Arm-based processors and Apache Spark, and how frameworks such as Gluten and Velox improve Spark SQL performance.

In the next section, you'll create a Cobalt 100 virtual machine for building a Spark SQL workload.