diff --git a/content/learning-paths/servers-and-cloud-computing/spark-velox-cobalt/_index.md b/content/learning-paths/servers-and-cloud-computing/spark-velox-cobalt/_index.md index b22020190f..51f7af0399 100644 --- a/content/learning-paths/servers-and-cloud-computing/spark-velox-cobalt/_index.md +++ b/content/learning-paths/servers-and-cloud-computing/spark-velox-cobalt/_index.md @@ -3,10 +3,6 @@ description: Learn how to deploy, configure, and benchmark Apache Spark SQL on Azure Cobalt 100 Arm64 VMs using the Gluten plugin and Velox backend for native query acceleration, with step-by-step setup and performance validation. title: Run Apache Spark SQL workloads on Azure Cobalt 100 Arm64 using Gluten and Velox for accelerated analytics -draft: true -cascade: - draft: true - minutes_to_complete: 120 who_is_this_for: This is an advanced topic for data engineers, platform engineers, and developers who want to build and optimize high-performance Spark SQL workloads using native execution engines on Arm-based cloud environments. @@ -16,7 +12,7 @@ learning_objectives: - Build and integrate Gluten with the Velox backend for native query execution - Configure Spark SQL for columnar and vectorized execution - Generate and load TPC-DS datasets for benchmarking - - Run Spark SQL workloads and compare performance between vanilla Spark and Gluten + Velox + - Run Spark SQL workloads and compare performance between vanilla Spark and Gluten with Velox prerequisites: - A [Microsoft Azure account](https://azure.microsoft.com/) with access to Cobalt 100 based instances (Dpsv6) diff --git a/content/learning-paths/servers-and-cloud-computing/spark-velox-cobalt/background.md b/content/learning-paths/servers-and-cloud-computing/spark-velox-cobalt/background.md index 790f393337..062e52874c 100644 --- a/content/learning-paths/servers-and-cloud-computing/spark-velox-cobalt/background.md +++ b/content/learning-paths/servers-and-cloud-computing/spark-velox-cobalt/background.md @@ -1,5 +1,5 @@ --- -title: "Overview of Azure Cobalt 100 and Apache Spark with Gluten and Velox" +title: "Understand Azure Cobalt 100 and Apache Spark with Gluten and Velox" weight: 2 @@ -8,56 +8,22 @@ layout: "learningpathall" ## Azure Cobalt 100 Arm-based processor -Azure’s Cobalt 100 is Microsoft’s first-generation, in-house Arm-based processor. Built on Arm Neoverse N2, Cobalt 100 is a 64-bit CPU that delivers strong performance and energy efficiency for cloud-native, scale-out Linux workloads such as web and application servers, data analytics, open-source databases, and caching systems. Running at 3.4 GHz, Cobalt 100 allocates a dedicated physical core for each vCPU, which helps ensure consistent and predictable performance. +Azure’s Cobalt 100 is Microsoft’s first-generation, in-house Arm-based processor. Built on Arm Neoverse N2, Cobalt 100 is a 64-bit CPU that delivers strong performance and energy efficiency for cloud-native, scale-out Linux workloads. These workloads include web and application servers, data analytics, open-source databases, and caching systems. Running at 3.4 GHz, Cobalt 100 allocates a dedicated physical core for each vCPU, which ensures consistent and predictable performance. To learn more, see the Microsoft blog [Announcing the preview of new Azure VMs based on the Azure Cobalt 100 processor](https://techcommunity.microsoft.com/blog/azurecompute/announcing-the-preview-of-new-azure-vms-based-on-the-azure-cobalt-100-processor/4146353). 
## Apache Spark with Gluten and Velox -Apache Spark is an open-source distributed data processing engine designed for large-scale data analytics. It provides high-level APIs for SQL, streaming, machine learning, and graph processing, and is widely used for building data pipelines and analytical workloads. +Apache Spark is an open-source distributed data processing engine designed for large-scale data analytics. It provides high-level APIs for SQL, streaming, machine learning, and graph processing, and is widely used for building data pipelines and analytical workloads. For more information about Apache Spark, see the [Apache Spark Documentation](https://spark.apache.org/docs/latest/). -By default, Spark executes queries using the JVM (Java Virtual Machine), which can introduce overhead in CPU-intensive workloads. To address this, modern acceleration frameworks like **Gluten** and **Velox** enable native execution for improved performance. +By default, Spark executes queries using the Java Virtual Machine (JVM), which can introduce overhead in CPU-intensive workloads. To address this, modern acceleration frameworks such as Gluten and Velox enable native execution for improved performance. -**Gluten** is an open-source Spark plugin that offloads Spark SQL execution from the JVM to native engines. It acts as a bridge between Spark and high-performance backends, enabling efficient query execution while maintaining compatibility with existing Spark workloads. +Gluten is an open-source Spark plugin that offloads Spark SQL execution from the JVM to native engines. It acts as a bridge between Spark and high-performance backends, enabling efficient query execution while maintaining compatibility with existing Spark workloads. For more information about Gluten, see [Gluten Project](https://github.com/apache/incubator-gluten). -**Velox** is a high-performance, vectorized execution engine written in C++. It is optimized for modern hardware, including Arm64 architectures such as Azure Cobalt 100. Velox processes data in a columnar format and uses vectorized execution to significantly reduce CPU overhead and improve query performance. +Velox is a high-performance, vectorized execution engine written in C++. It is optimized for modern hardware, including Arm64 architectures such as Azure Cobalt 100. Velox processes data in a columnar format and uses vectorized execution to significantly reduce CPU overhead and improve query performance. For more information about Velox, see [Velox Engine](https://github.com/facebookincubator/velox). -Together, **Gluten + Velox** provide: +### What you've learned and what's next -- Native (off-JVM) execution of Spark SQL queries -- Vectorized processing for faster computation -- Reduced memory and CPU overhead -- Improved performance on Arm-based infrastructure - -To learn more, see: -- [Apache Spark Documentation](https://spark.apache.org/docs/latest/) -- [Gluten Project](https://github.com/apache/incubator-gluten) -- [Velox Engine](https://github.com/facebookincubator/velox) - - -### Key Capabilities - -- **Native Query Execution:** - Spark SQL queries are executed using Velox instead of JVM-based execution. - -- **Columnar Processing:** - Data is processed in columnar batches, improving cache efficiency and throughput. - -- **Vectorized Execution:** - Multiple data values are processed in a single CPU instruction, accelerating computation. 
- -- **Hardware Optimization:** - Velox is optimized for modern CPUs, including Arm64 (Azure Cobalt 100), delivering better performance per core. - -### In This Learning Path - -In this Learning Path, you will: - -- Deploy Apache Spark on an Azure Cobalt 100 Arm64 virtual machine -- Build and integrate Gluten with the Velox backend -- Configure Spark to use native execution -- Run Spark SQL workloads using Gluten + Velox -- Generate and load TPC-DS benchmark datasets -- Execute analytical queries and measure performance -- Compare accelerated workloads against vanilla Spark +You've now learned about Azure Cobalt 100 Arm-based processors and Apache Spark. You've also understood how frameworks such as Gluten and Velox improve Spark SQL performance. +In the next section, you'll create a Cobalt 100 virtual machine for building a Spark SQL workload. \ No newline at end of file diff --git a/content/learning-paths/servers-and-cloud-computing/spark-velox-cobalt/benchmarking.md b/content/learning-paths/servers-and-cloud-computing/spark-velox-cobalt/benchmarking.md index 9471dfc9b8..05e67fc67b 100644 --- a/content/learning-paths/servers-and-cloud-computing/spark-velox-cobalt/benchmarking.md +++ b/content/learning-paths/servers-and-cloud-computing/spark-velox-cobalt/benchmarking.md @@ -1,26 +1,18 @@ --- -title: Run TPC-DS Benchmark on Spark with Gluten + Velox (Arm64) +title: Run TPC-DS Benchmark on Spark with Gluten and Velox on Arm64 weight: 5 ### FIXED, DO NOT MODIFY layout: learningpathall --- -## Run TPC-DS Benchmark on Spark +TPC-DS is an industry-standard benchmark that simulates a decision support workload across a realistic retail data model. In this section, you'll generate a 10 GB TPC-DS dataset, load it into Spark, and run five analytical queries to measure execution time on your Arm64 virtual machine (VM). -TPC-DS is an industry-standard benchmark that simulates a decision support workload across a realistic retail data model. In this section you generate a 10 GB TPC-DS dataset, load it into Spark, and run five analytical queries to measure execution time on your Arm64 VM. - -You run Spark in local mode using Parquet-formatted data and hand-written SQL queries. This avoids the schema mismatches and resource contention that commonly affect automated benchmarking frameworks such as `spark-sql-perf`, and gives you a reproducible, stable baseline. - -## Why Parquet and local mode? - -Tools like `spark-sql-perf` often fail against raw TPC-DS data because of schema mismatches between the generated CSV files and the expected column names, missing columns in certain query templates, and YARN resource allocation instability on a single-node VM. - -To avoid these issues, you convert the raw data to Parquet before querying. Parquet is a columnar format that Spark reads more efficiently than CSV, and it preserves schema consistently across sessions. Running Spark in local mode eliminates YARN scheduling overhead, which makes query times more reproducible and directly comparable. +Automated benchmarking frameworks such as `spark-sql-perf` are commonly affected by schema mismatches and resource contention. To avoid this, you'll run Spark in local mode using Parquet-formatted data and hand-written SQL queries. Parquet is a columnar format that Spark reads more efficiently than CSV, and it preserves schema consistently across sessions. Running Spark in local mode eliminates scheduling overhead, which makes query times more reproducible and directly comparable. 
## Generate TPC-DS data -Clone the Databricks fork of `tpcds-kit` and build the `dsdgen` data generation tool. The Databricks fork is required here because the original `gregrahn/tpcds-kit` source does not build cleanly on Ubuntu 22.04 or 24.04 with GCC 10+. +Clone the Databricks fork of `tpcds-kit` and build the `dsdgen` data generation tool. The Databricks fork is required here because the original `gregrahn/tpcds-kit` source does not build cleanly on Ubuntu 22.04 or 24.04 with GCC 10+ ```console cd /opt @@ -52,7 +44,7 @@ The output is similar to: ## Upload data to HDFS -Before uploading, take HDFS out of safe mode, which it enters automatically after a restart to protect against data loss. Then create the target directory and upload all generated files. +Before uploading, take HDFS out of safe mode, which it enters automatically after a restart to protect against data loss. Then create the target directory and upload all generated files: ```console hdfs dfsadmin -safemode leave @@ -107,7 +99,7 @@ rm -rf /opt/tpcds10_parquet/* ## Configure Spark to use the Hive Metastore -Before starting `spark-shell`, you need to make two configuration changes so that Spark can communicate with the Hive Metastore that was set up in the previous section. +Before starting `spark-shell`, you need to make two configuration changes so that Spark can communicate with the Hive Metastore that you set up in the previous section. Copy the MySQL JDBC connector into Spark's JAR directory. Spark loads all JARs in this directory at startup, so placing the connector here ensures it is available when Spark connects to the MySQL-backed metastore: @@ -121,7 +113,7 @@ Create a symlink so that Spark picks up the Hive configuration automatically. Sp ln -s /opt/apache-hive-3.1.3-bin/conf/hive-site.xml /opt/spark/conf/hive-site.xml ``` -If the symlink already exists from a previous run, remove it first with `rm /opt/spark/conf/hive-site.xml` before re-creating it. +If the symlink already exists from a previous run, remove it with `rm /opt/spark/conf/hive-site.xml` before re-creating it. ## Start Spark shell @@ -138,7 +130,7 @@ $SPARK_HOME/bin/spark-shell \ ## Convert CSV to Parquet -The raw TPC-DS files are pipe-delimited CSV with no header row. This Scala snippet reads each table into a DataFrame, infers the column schema automatically, and writes the result as Parquet. Run this inside the `spark-shell` session you just started. +The raw TPC-DS files are pipe-delimited CSV with no header row. This Scala snippet reads each table into a DataFrame, infers the column schema automatically, and writes the result as Parquet. Run this inside the `spark-shell` session you started. ```scala val rawBase = "file:///opt/tpcds-data" @@ -168,7 +160,7 @@ Because the CSV files have no header row, Spark assigns generic positional colum ## Validate Parquet data -Count the rows in three of the largest fact tables to confirm the conversion completed without data loss. Run each line individually inside your `spark-shell` session. +Count the rows in three of the largest fact tables to confirm the conversion completed without data loss. Run each line individually inside your `spark-shell` session: ```scala spark.read.parquet("file:///opt/tpcds10_parquet/store_sales").count() @@ -256,9 +248,9 @@ def timedQuery(name: String, sqlText: String): Unit = { Each query uses positional column names (`_c2`, `_c22`, and so on) because the TPC-DS CSV files contain no header row. 
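+
+If you find the positional names hard to read, you can optionally alias them to real TPC-DS column names before querying. The following sketch shows the idea for `store_sales`; the first few names follow the TPC-DS schema, but treat the mapping as an assumption and confirm the full column order against the table definitions (`tpcds.sql`) in `tpcds-kit` before relying on it. The queries in this Learning Path keep the positional names so they match the converted Parquet files exactly.
+
+```scala
+// Illustrative sketch: alias positional columns to readable TPC-DS names.
+// Confirm the full column order against tpcds.sql in tpcds-kit before use.
+val storeSalesRaw = spark.read.parquet("file:///opt/tpcds10_parquet/store_sales")
+
+// First columns of store_sales according to the TPC-DS schema (assumed mapping)
+val knownNames = Seq("ss_sold_date_sk", "ss_sold_time_sk", "ss_item_sk", "ss_customer_sk")
+
+// Keep the generic _cN names for any columns not listed above
+val allNames = knownNames ++ storeSalesRaw.columns.drop(knownNames.length)
+val storeSalesNamed = storeSalesRaw.toDF(allNames: _*)
+
+storeSalesNamed.createOrReplaceTempView("store_sales_named")
+```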
The five queries cover a range of analytical patterns: single-table aggregations across each of the three sales channels, a returns aggregation, and a dimension join. -### 1. Store sales aggregation +### Store sales aggregation -Aggregate total sales by item across the `store_sales` table, which at approximately 28 million rows is the largest fact table in the 10 GB dataset. +Aggregate total sales by item across the `store_sales` table, which at approximately 28 million rows is the largest fact table in the 10 GB dataset: ```scala timedQuery("q_store_sales_by_item", @@ -276,9 +268,9 @@ The output is similar to: q_store_sales_by_item took 1.548731698 seconds ``` -### 2. Catalog sales aggregation +### Catalog sales aggregation -Aggregate total sales by item across the `catalog_sales` table. +Aggregate total sales by item across the `catalog_sales` table: ```scala timedQuery("q_catalog_sales_by_item", @@ -296,9 +288,9 @@ The output is similar to: q_catalog_sales_by_item took 0.795856122 seconds ``` -### 3. Web sales aggregation +### Web sales aggregation -Aggregate total sales by item across the `web_sales` table. +Aggregate total sales by item across the `web_sales` table: ```scala timedQuery("q_web_sales_by_item", @@ -316,9 +308,9 @@ The output is similar to: q_web_sales_by_item took 0.423602822 seconds ``` -### 4. Store returns aggregation +### Store returns aggregation -Aggregate total returns by item from the `store_returns` table. +Aggregate total returns by item from the `store_returns` table: ```scala timedQuery("q_store_returns_by_item", @@ -336,7 +328,7 @@ The output is similar to: q_store_returns_by_item took 0.264841719 seconds ``` -### 5. Dimension join +### Dimension join Join `store_sales` with the `item` dimension table to combine transaction totals with item metadata. This query exercises Spark's hash join path and involves a shuffle to co-locate matching rows, which is why it takes noticeably longer than the single-table aggregations. This query type benefits most from Velox's native join execution when Gluten is enabled. @@ -359,7 +351,7 @@ q_join_store_sales_item took 2.203225285 seconds ## Inspect sample results -To verify the query results are meaningful, display the top 10 items by total sales. Items with negative `total_sales` values appear because the TPC-DS schema includes returns and price adjustments that can reduce net sales below zero — this is expected behaviour. +To verify the query results are meaningful, display the top 10 items by total sales: ```scala spark.sql(""" @@ -389,14 +381,9 @@ The output is similar to: |12552 |-31864.710000000006| +-------+-------------------+ ``` +Items with negative `total_sales` values appear because the TPC-DS schema includes returns and price adjustments that can reduce net sales below zero. - -## Summary - -You've run a complete TPC-DS benchmark baseline on Spark with an Arm64 VM. These results represent execution with Gluten disabled. You can enable Gluten and re-run the same queries to measure the performance improvement provided by the Velox native engine on Arm64. - - -## Re-run with Gluten + Velox enabled +## Re-run with Gluten and Velox enabled Now that you have a baseline, re-run the same queries with the Gluten native engine active. Gluten intercepts Spark's physical plan and replaces JVM-based operators with equivalent Velox C++ operators. The Parquet data and SQL queries are unchanged — only the `spark-shell` launch flags differ. 
@@ -422,7 +409,7 @@ $SPARK_HOME/bin/spark-shell \ --conf spark.driver.extraClassPath=/opt/gluten-jars/* ``` -Once the shell starts, re-register the tables and re-define the timing function. These are identical to the baseline run — no changes are needed to the Scala code: +After the shell starts, re-register the tables and re-define the timing function. These are identical to the baseline run: ```scala val pqBase = "file:///opt/tpcds10_parquet" @@ -487,7 +474,7 @@ q_join_store_sales_item took 1.579735646 seconds The `GlutenFallbackReporter` warning appears for every query and is expected in this configuration. It means that the `Exchange` operator — which handles the shuffle between the partial and final aggregation stages — fell back to JVM execution. The Velox backend does not support the shuffle operator in local mode, so Gluten applies the fallback automatically rather than failing. -The query execution in this configuration follows a split path: Velox handles the Parquet scan and partial aggregation in native C++ columnar format, then converts the intermediate result to JVM row format for the shuffle, and the final aggregation runs on the JVM. This conversion at the `Exchange` boundary adds overhead for smaller queries where shuffle is cheap, but still provides a net benefit for the join query where columnar processing of the large `store_sales` table outweighs the conversion cost. +The query execution in this configuration follows a split path: Velox handles the Parquet scan and partial aggregation in native C++ columnar format, then converts the intermediate result to JVM row format for the shuffle. The final aggregation runs on the JVM. This conversion at the `Exchange` boundary adds overhead for smaller queries where shuffle is cheap, but still provides a net benefit for the join query where columnar processing of the large `store_sales` table outweighs the conversion cost. To confirm whether Gluten is executing the queries natively rather than falling back to JVM operators, inspect the executed query plan after running a query: @@ -500,7 +487,7 @@ df.count() println(df.queryExecution.executedPlan) ``` -Using `df.queryExecution.executedPlan` after calling `count()` gives you the final physical plan that was actually executed, rather than the pre-execution estimate. This is important because Spark's Adaptive Query Execution (AQE) can change the plan at runtime, and `explain()` alone — without first triggering execution — prints the pre-AQE plan with `isFinalPlan=false`. +Using `df.queryExecution.executedPlan` after calling `count()` gives you the final physical plan that was executed, rather than the pre-execution estimate. This is important because Spark's Adaptive Query Execution (AQE) can change the plan at runtime, and `explain()` alone — without first triggering execution — prints the pre-AQE plan with `isFinalPlan=false`. For reference, this is what the pre-execution plan looks like when called with `explain()` before `count()`: @@ -515,7 +502,7 @@ AdaptiveSparkPlan isFinalPlan=false PartitionFilters: [], PushedFilters: [], ReadSchema: struct<_c2:int,_c22:double> ``` -The `HashAggregate` and `Exchange` operators are standard Spark JVM operators, which indicates that Gluten is falling back to JVM execution for this aggregation. However, `Batched: true` on the `FileScan` line is significant — this means Spark is reading the Parquet file in columnar batch mode, which Gluten enables for its native Parquet reader. The scan is offloaded to Velox even when the aggregation is not. 
+The `HashAggregate` and `Exchange` operators are standard Spark JVM operators, which indicates that Gluten is falling back to JVM execution for this aggregation. However, `Batched: true` on the `FileScan` line is significant. This means Spark is reading the Parquet file in columnar batch mode, which Gluten enables for its native Parquet reader. The scan is offloaded to Velox even when the aggregation is not. When Gluten successfully takes over the full query path, the plan would instead show operators such as `VeloxColumnarToRow`, `GlutenHashAggregateExecTransformer`, and `GlutenColumnarExchange`. If you see only standard Spark operator names, the aggregation and join operators are running on the JVM. @@ -525,28 +512,21 @@ To check whether the Gluten plugin loaded at all, search the driver log for init grep -i "gluten\|velox" $SPARK_HOME/logs/spark-root-*.out | head -20 ``` -If Gluten loaded successfully you will see lines similar to `GlutenPlugin: Gluten build info` and `VeloxBackend: Velox backend initialised` near startup. - -## Compare baseline vs Gluten + Velox - +If Gluten loaded successfully, you'll see lines similar to `GlutenPlugin: Gluten build info` and `VeloxBackend: Velox backend initialised` near startup. -### Dimension join performance comparison +## Compare dimension join between baseline and Gluten with Velox -The most meaningful performance difference between JVM-only and Gluten + Velox is seen in the dimension join query, which joins the large `store_sales` fact table (28 million rows) with the `item` dimension table. This query exercises Spark's hash join and shuffle paths, and benefits most from Velox's native columnar execution before the shuffle boundary. +The most meaningful performance difference between JVM-only and Gluten with Velox is seen in the dimension join query, which joins the large `store_sales` fact table (28 million rows) with the `item` dimension table. This query exercises Spark's hash join and shuffle paths, and benefits most from Velox's native columnar execution before the shuffle boundary. | Query | Baseline (JVM) | Gluten + Velox | Change | |-------|---------------|----------------|--------| | Dimension join (store_sales × item) | 2.203 s | 1.580 s | -28% faster | -In this scenario, Velox offloads the Parquet scan and the hash join to native C++ code, while the shuffle (`Exchange`) and final aggregation still fall back to JVM execution. The result is a significant speedup for this join-heavy query, as the most expensive part—the join itself—is accelerated by Velox. Other queries in the benchmark may not show improvement or can be slower due to the overhead of converting between columnar and row formats at the shuffle boundary, but the dimension join demonstrates the clear benefit of native execution for large, complex operations. +In this scenario, Velox offloads the Parquet scan and the hash join to native C++ code, while the shuffle (`Exchange`) and final aggregation still fall back to JVM execution. The result is a significant speedup for this join-heavy query, as the most expensive part—the join itself—is accelerated by Velox. Other queries in the benchmark might not show improvement or can be slower due to the overhead of converting between columnar and row formats at the shuffle boundary, but the dimension join demonstrates the clear benefit of native execution for large, complex operations. 
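+
+If you want to confirm how much of a given query ran natively before drawing conclusions from a timing comparison, you can wrap the plan inspection shown earlier in a small helper. This is a sketch rather than an official Gluten utility: it simply searches the text of the final, post-AQE plan for Gluten and Velox operator names.
+
+```scala
+// Sketch: run a query, then report whether the final (post-AQE) physical plan
+// contains native Gluten/Velox operators or only standard JVM operators.
+def reportNativeUsage(label: String, sqlText: String): Unit = {
+  val df = spark.sql(sqlText)
+  df.count()  // force execution so Adaptive Query Execution finalizes the plan
+  val planText = df.queryExecution.executedPlan.toString
+  val native = Seq("Velox", "Gluten", "Transformer").exists(op => planText.contains(op))
+  println(s"$label: native operators present = $native")
+}
+
+reportNativeUsage("store_sales aggregation",
+  "SELECT _c2, SUM(_c22) AS total_sales FROM store_sales GROUP BY _c2")
+```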
-Full offload of the `Exchange` operator to Velox (eliminating JVM fallback) requires enabling the Gluten columnar shuffle, which is configured separately and not covered in this Learning Path.
+Fully offloading the `Exchange` operator to Velox and eliminating JVM fallback requires enabling the Gluten columnar shuffle. This is configured separately and not covered in this Learning Path.
 
-## What you've accomplished 
+## What you've accomplished
 
-- Generated an industry-standard TPC-DS benchmark dataset at 10 GB scale
-- Converted raw pipe-delimited CSV data to Parquet for efficient Spark querying
-- Registered 24 TPC-DS tables as Spark temporary views
-- Executed five analytical queries covering aggregation and join patterns on Arm64
-- Captured a reproducible JVM baseline and a Gluten + Velox accelerated result for direct comparison
+You've now run a complete TPC-DS benchmark on an Arm64 Azure Cobalt 100 VM, comparing standard JVM execution against Gluten and Velox native acceleration. You generated a 10 GB dataset, converted it to Parquet, registered all 24 TPC-DS tables as Spark views, and ran five analytical queries across both configurations. The dimension join query showed a 28% improvement with Gluten and Velox, demonstrating the benefit of native Velox execution for large join workloads on Arm64.
diff --git a/content/learning-paths/servers-and-cloud-computing/spark-velox-cobalt/instance.md b/content/learning-paths/servers-and-cloud-computing/spark-velox-cobalt/instance.md
index cd88503d77..829c62b9d9 100644
--- a/content/learning-paths/servers-and-cloud-computing/spark-velox-cobalt/instance.md
+++ b/content/learning-paths/servers-and-cloud-computing/spark-velox-cobalt/instance.md
@@ -8,81 +8,63 @@ layout: learningpathall
 
 ## Prerequisites and setup
 
-There are several common ways to create an Arm-based Cobalt 100 virtual machine, and you can choose the method that best fits your workflow or requirements:
+You can create an Arm-based Cobalt 100 virtual machine using whichever of the following methods best fits your workflow or requirements:
 
 - The Azure Portal
 - The Azure CLI
 - An infrastructure as code (IaC) tool
 
-In this section, you'll launch the Azure Portal to create a virtual machine with the Arm-based Azure Cobalt 100 processor.
+In this section, you'll launch the Azure Portal to create a virtual machine (VM) with the Arm-based Azure Cobalt 100 processor.
 
 This Learning Path focuses on general-purpose virtual machines in the Dpsv6 series. For more information, see the [Microsoft Azure guide for the Dpsv6 size series](https://learn.microsoft.com/en-us/azure/virtual-machines/sizes/general-purpose/dpsv6-series).
 
-While the steps to create this instance are included here for convenience, you can also refer to the [Deploy a Cobalt 100 virtual machine on Azure Learning Path](/learning-paths/servers-and-cloud-computing/cobalt/).
+The steps to create this instance are included here for convenience. For more detailed guidance, see the [Deploy a Cobalt 100 virtual machine on Azure Learning Path](/learning-paths/servers-and-cloud-computing/cobalt/).
 
 To learn more about Arm-based virtual machines in Azure, see "Getting Started with Microsoft Azure" in [Get started with Arm-based cloud instances](/learning-paths/servers-and-cloud-computing/csp/azure/).
 
-## Create an Arm-based Azure virtual machine
+## Create an Arm-based Azure virtual machine on the Azure portal
 
-Creating a virtual machine based on Azure Cobalt 100 is no different from creating any other virtual machine in Azure. 
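+
+If you prefer to script the VM creation with the Azure CLI instead of using the portal, a minimal sketch follows. The resource group name, VM name, region, and administrator username are placeholder values, and Dpsv6 sizes are available only in selected regions, so adjust them for your subscription. The rest of this section continues with the portal flow.
+
+```console
+az group create --name spark-velox-rg --location eastus2
+
+az vm create \
+  --resource-group spark-velox-rg \
+  --name spark-cobalt-vm \
+  --image Ubuntu2204 \
+  --size Standard_D8ps_v6 \
+  --admin-username azureuser \
+  --generate-ssh-keys
+```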
To create an Azure virtual machine: +To create an Azure virtual machine: -- Launch the Azure portal and navigate to **Virtual Machines**. -- Select **Create**, and select **Virtual Machine** from the drop-down list. -- Inside the **Basic** tab, fill in the instance details such as **Virtual machine name** and **Region**. -- Select the image for your virtual machine (for example, Ubuntu 22.04 LTS Server) and select **Arm64** as the VM architecture. -- In the **Size** field, select **See all sizes** and select the D-Series v6 family of virtual machines. -- Select **D8ps_v6** from the list as shown in the diagram below: +1. Launch the Azure portal and navigate to **Virtual Machines**. +2. Select **Create**, and select **Virtual Machine** from the drop-down list. +3. Inside the **Basic** tab, enter the instance details such as **Virtual machine name** and **Region**. +4. Select the image for your virtual machine (for example, Ubuntu 22.04 LTS Server) and select **Arm64** as the VM architecture. +5. In the **Size** field, select **See all sizes** and select the D-Series v6 family of virtual machines. +6. Select **D8ps_v6** from the list as shown in the following diagram: -![Azure Portal showing D-Series v6 VM size selection with D8ps_v6 highlighted alt-txt#center](images/instance.png "Select D8ps_v6 from the D-Series v6 family") +![Azure Portal showing D-Series v6 VM size selection with D8ps_v6 highlighted#center](images/instance.png "Select D8ps_v6 from the D-Series v6 family") -- For **Authentication type**, select **SSH public key**. +7. For **Authentication type**, select **SSH public key**. {{% notice Note %}} Azure generates an SSH key pair for you and lets you save it for future use. This method is fast, secure, and easy for connecting to your virtual machine. {{% /notice %}} -- Fill in the **Administrator username** for your VM. -- Select **Generate new key pair**, and select **RSA SSH Format** as the SSH Key Type. +8. Fill in the **Administrator username** for your VM. +9. Select **Generate new key pair**, and select **RSA SSH Format** as the SSH Key Type. {{% notice Note %}} RSA offers better security with keys longer than 3072 bits. {{% /notice %}} -- Give your SSH key a key pair name. -- In the **Inbound port rules**, select **HTTP (80)** and **SSH (22)** as the inbound ports, as shown below: +10. Give your SSH key a key pair name. +11. In the **Inbound port rules**, select **HTTP (80)** and **SSH (22)** as the inbound ports, as shown in the following screenshot: -![Azure Portal showing inbound port rules with HTTP (80) and SSH (22) selected alt-txt#center](images/instance1.png "Configure inbound port rules for HTTP and SSH access") +![Azure Portal showing inbound port rules with HTTP (80) and SSH (22) selected#center](images/instance1.png "Configure inbound port rules for HTTP and SSH access") -- Now select the **Review + Create** tab and review the configuration for your virtual machine. It should look like the following: +12. Select the **Review + Create** tab and review the configuration for your virtual machine. 
It should look like the following: -![Azure Portal Review + Create tab showing VM configuration summary ready for deployment alt-txt#center](images/ubuntu-pro.png "Review VM configuration before creation") +![Azure Portal Review + Create tab showing VM configuration summary ready for deployment#center](images/ubuntu-pro.png "Review VM configuration before creation") -- When you're happy with your selection, select the **Create** button and then **Download Private key and Create Resource** button. +13. After reviewing the configuration, select the **Create** button and then the **Download Private key and Create Resource** button. -![Azure Portal showing Create button and SSH key download dialog alt-txt#center](images/instance4.png "Download SSH key and create the virtual machine") +![Azure Portal showing Create button and SSH key download dialog#center](images/instance4.png "Download SSH key and create the virtual machine") Your virtual machine should be ready and running in a few minutes. You can SSH into the virtual machine using the private key, along with the public IP details. -![Azure Portal showing successful VM deployment with confirmation details alt-txt#center](images/final-vm.png "Successful VM deployment confirmation") - -{{% notice Note %}}To learn more about Arm-based virtual machines in Azure, see "Getting Started with Microsoft Azure" in [Get started with Arm-based cloud instances](/learning-paths/servers-and-cloud-computing/csp/azure/).{{% /notice %}} +![Azure Portal showing successful VM deployment with confirmation details#center](images/final-vm.png "Successful VM deployment confirmation") ## What you've learned and what's next -You have successfully created an Azure Cobalt 100 Arm64 virtual machine running **Ubuntu 22.04 LTS Server** with SSH authentication configured. The VM is now fully prepared for running distributed data processing workloads. - -On this VM, you have: - -- Set up a stable ARM64 environment -- Configured SSH access and hostname for cluster communication -- Prepared the system for big data stack installation (Hadoop, Spark, Hive) -- Ensured compatibility for Java 17 and ARM-based execution - -## What’s Next - -On this VM, you will now build a **high-performance Spark SQL analytics platform** using modern acceleration technologies. - -**You will:** +You've now successfully created an Azure Cobalt 100 Arm64 virtual machine running Ubuntu 22.04 LTS Server with SSH authentication configured. The VM is now fully prepared for running distributed data processing workloads. -- Install and configure **Hadoop (HDFS + YARN)** -- Install and configure **Apache Spark** -- Set up **Hive Metastore (MySQL-based)** -- Build and integrate **Gluten + Velox (native engine)** +Next, you'll build a high-performance Spark SQL analytics platform using modern acceleration technologies. 
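+
+Before you begin installing software, you can optionally confirm over SSH that the VM is running on Arm64. The key path, username, and IP address below are placeholders; replace them with the values from your own deployment.
+
+```console
+ssh -i <path-to-your-private-key> azureuser@<your-vm-public-ip>
+
+uname -m
+lscpu | grep -E 'Architecture|Vendor ID|Model name'
+```
+
+The `uname -m` command reports `aarch64` on an Arm64 VM.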
diff --git a/content/learning-paths/servers-and-cloud-computing/spark-velox-cobalt/setup-and-gluten.md b/content/learning-paths/servers-and-cloud-computing/spark-velox-cobalt/setup-and-gluten.md index d190394ef1..cf2655007b 100644 --- a/content/learning-paths/servers-and-cloud-computing/spark-velox-cobalt/setup-and-gluten.md +++ b/content/learning-paths/servers-and-cloud-computing/spark-velox-cobalt/setup-and-gluten.md @@ -1,64 +1,14 @@ --- -title: Deploy Spark SQL with Gluten + Velox on Arm64 (Stable Setup) +title: Deploy Apache Spark SQL with Gluten and Velox on Arm64 weight: 4 ### FIXED, DO NOT MODIFY layout: learningpathall --- -## Deploy Apache Spark with Gluten + Velox on Arm64 +## Before you begin -This guide helps you **set up Spark with native acceleration (Gluten + Velox)** on Arm64 (Azure Cobalt 100). - -We will build everything step-by-step from scratch. - -- Apache Hadoop -- Apache Spark -- Apache Hive Metastore -- Gluten + Velox native engine - -## Objective - -In this guide, you will: - -- Install Hadoop, Spark, and Hive -- Configure a single-node cluster -- Fix Java 17 compatibility issues -- Build Gluten with Velox backend -- Enable native execution (off-JVM) -- Prepare system for benchmarking - - -## Why Gluten + Velox? - -- Spark (default) runs on JVM ❌ -- Gluten + Velox runs queries in native C++ engine - - -**Benefits:** - -- Faster execution -- Lower CPU usage -- Better ARM performance - -## Environment - -| Component | Value | -|----------|------| -| Architecture | Arm64 | -| OS | Ubuntu 22.04 / 24.04 | -| CPU | 4–8 vCPU | -| RAM | 8–32 GB | -| Disk | ≥ 80 GB | - -## System preparation - -We install all required tools for: - -- Java (Spark/Hadoop) -- Build tools (Gluten) -- Database (Hive metastore) - -Before you begin, switch to the root user and install all required system packages. This ensures you have the correct Java version, build tools, and database dependencies for Spark, Hadoop, Hive, and Gluten on Arm64. +Switch to the root user and install all required system packages. This ensures you have the correct Java version, build tools, and database dependencies for Spark, Hadoop, Hive, and Gluten on Arm64. ```console sudo -i @@ -68,15 +18,12 @@ openjdk-17-jdk wget tar git curl unzip build-essential \ python3-pip mysql-server maven cmake ninja-build pkg-config libssl-dev ``` -These tools are required for: -- Java runtime (Spark/Hadoop) -- Building Gluten (C++ dependencies) -- Hive metastore (MySQL) +Java runtime is necessary for Spark and Hadoop. C++ dependencies are necessary for building Gluten. MySQL is necessary for Hive metastore. ## Configure hostname -Hadoop requires proper hostname for internal communication. +Hadoop requires a proper hostname for internal communication. -Set the hostname to `spark-master` so Hadoop and Spark can communicate reliably on a single-node cluster. This prevents common networking issues during service startup. +Set the hostname to `spark-master` so that Hadoop and Spark can communicate reliably on a single-node cluster. This prevents networking issues during service startup. ```console hostnamectl set-hostname spark-master @@ -85,19 +32,13 @@ exec bash ## Configure hosts -Prevents connection errors (very important) - -Append the hostname to `/etc/hosts` to ensure all Hadoop and Spark services resolve the local node correctly. +Append the hostname to `/etc/hosts` to ensure all Hadoop and Spark services resolve the local node correctly. This prevents connection errors. 
```console echo "127.0.0.1 spark-master" >> /etc/hosts ``` -## Setup passwordless SSH - -**Why?** - -- Hadoop services use SSH internally. +## Set up passwordless SSH Generate an SSH key pair for passwordless authentication. Hadoop daemons use SSH to manage services internally, so this step is required for smooth operation. @@ -105,7 +46,7 @@ Generate an SSH key pair for passwordless authentication. Hadoop daemons use SSH ssh-keygen -t rsa -P "" ``` -When prompted to enter the file location, press Enter to accept the default: +When prompted to enter the file location, press the Enter button to accept the default: ```output Enter file in which to save the key (/root/.ssh/id_rsa): @@ -121,12 +62,9 @@ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys ## Install Hadoop -**Why?** - -Hadoop provides: +Hadoop provides HDFS for storage and YARN for resource management. -- HDFS → Storage -- YARN → Resource manager +Install Hadoop: ```console cd /opt @@ -149,19 +87,15 @@ ln -s spark-3.4.2-bin-hadoop3 spark ## Install Hive -Hive provides: - -- Metadata (table structure) -- SQL layer for Spark - Download and extract Apache Hive 3.1.3. Hive provides the SQL metadata layer and metastore for Spark SQL. + ```console wget https://archive.apache.org/dist/hive/hive-3.1.3/apache-hive-3.1.3-bin.tar.gz tar -xvf apache-hive-3.1.3-bin.tar.gz ln -s apache-hive-3.1.3-bin hive ``` -## Environment variables +## Set up environment variables Set up environment variables for Java, Hadoop, Spark, and Hive. This ensures all commands and scripts can find the correct binaries and configuration files. @@ -186,7 +120,7 @@ Apply the environment changes to your current shell: source ~/.consolerc ``` -## Hadoop directory setup +## Set up Hadoop directories HDFS needs storage directories. @@ -200,11 +134,7 @@ mkdir -p /opt/dfs/data ## Configure Hadoop -Define cluster behavior (single node setup) - -**core-site.xml** - -Create a minimal `core-site.xml` to define the default HDFS URI for your single-node cluster. +Create a minimal `core-site.xml` to define the default HDFS URI for a single-node cluster. ```console cat > $HADOOP_HOME/etc/hadoop/core-site.xml < $HADOOP_HOME/etc/hadoop/core-site.xml < $HADOOP_HOME/etc/hadoop/hdfs-site.xml < $HADOOP_HOME/etc/hadoop/yarn-site.xml <> $HADOOP_HOME/etc/hadoop/hadoop-env.sh <