diff --git a/content/learning-paths/servers-and-cloud-computing/alluxio-cobalt/_index.md b/content/learning-paths/servers-and-cloud-computing/alluxio-cobalt/_index.md new file mode 100644 index 0000000000..99faf38a3e --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/alluxio-cobalt/_index.md @@ -0,0 +1,69 @@ +--- +title: Deploy Alluxio on Azure Cobalt 100 Arm64 virtual machines for data orchestration and caching + +draft: true +cascade: + draft: true + +description: Learn how to install and configure Alluxio on an Azure Cobalt 100 Arm64 virtual machine, integrate it with Apache Spark, enable data caching, and benchmark performance improvements for analytics workloads. + + +minutes_to_complete: 90 + +who_is_this_for: This is an introductory topic for developers, data engineers, and platform engineers who want to build high-performance data pipelines and analytics systems using Alluxio on Arm-based cloud environments. + +learning_objectives: + - Install and configure Alluxio on Azure Cobalt 100 Arm64 virtual machines + - Configure data caching using Alluxio memory storage + - Integrate Alluxio with Apache Spark for analytics workloads + - Benchmark data access performance and understand caching benefits + +prerequisites: + - A [Microsoft Azure account](https://azure.microsoft.com/) with access to Cobalt 100 based instances (Dpsv6) + - Basic knowledge of Linux command-line operations + - Familiarity with SSH and remote server access + - Basic understanding of data processing, storage systems, and caching concepts + +author: Pareena Verma + +### Tags +skilllevels: Introductory +subjects: Containers and Virtualization +cloud_service_providers: + - Microsoft Azure + +armips: + - Neoverse + +tools_software_languages: + - Alluxio + - Apache Spark + - Java + +operatingsystems: + - Linux + +further_reading: + - resource: + title: Alluxio Official Website + link: https://www.alluxio.io/ + type: website + - resource: + title: Alluxio Documentation + link: 
https://docs.alluxio.io/ + type: documentation + - resource: + title: Apache Spark Documentation + link: https://spark.apache.org/docs/latest/ + type: documentation + - resource: + title: Azure Cobalt 100 processors + link: https://techcommunity.microsoft.com/blog/azurecompute/announcing-the-preview-of-new-azure-vms-based-on-the-azure-cobalt-100-processor/4146353 + type: documentation + +### FIXED, DO NOT MODIFY +# ================================================================================ +weight: 1 +layout: "learningpathall" +learning_path_main_page: "yes" +--- diff --git a/content/learning-paths/servers-and-cloud-computing/alluxio-cobalt/_next-steps.md b/content/learning-paths/servers-and-cloud-computing/alluxio-cobalt/_next-steps.md new file mode 100644 index 0000000000..c3db0de5a2 --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/alluxio-cobalt/_next-steps.md @@ -0,0 +1,8 @@ +--- +# ================================================================================ +# FIXED, DO NOT MODIFY THIS FILE +# ================================================================================ +weight: 21 # Set to always be larger than the content in this path to be at the end of the navigation. +title: "Next Steps" # Always the same, html page title. +layout: "learningpathall" # All files under learning paths have this same wrapper for Hugo processing. 
+--- diff --git a/content/learning-paths/servers-and-cloud-computing/alluxio-cobalt/background.md b/content/learning-paths/servers-and-cloud-computing/alluxio-cobalt/background.md new file mode 100644 index 0000000000..2e46daad71 --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/alluxio-cobalt/background.md @@ -0,0 +1,53 @@ +--- +title: Understand Alluxio on Azure Cobalt 100 + +weight: 2 + +layout: "learningpathall" +--- + +## Why run Alluxio on Azure Cobalt 100 + +Alluxio on Arm-based Azure Cobalt 100 processors delivers high-performance data access for analytics and AI workloads. Cobalt 100's dedicated physical cores per vCPU provide consistent and predictable performance, which complements Alluxio’s in-memory caching and data orchestration capabilities. + +By combining Alluxio’s memory-centric architecture with the efficiency of Arm-based infrastructure, you can significantly reduce data access latency, accelerate compute frameworks like Apache Spark, and optimize overall data pipeline performance. + +## Azure Cobalt 100 Arm-based processor + +Azure’s Cobalt 100 is Microsoft’s first-generation, in-house Arm-based processor. Built on Arm Neoverse N2, Cobalt 100 is a 64-bit CPU that delivers strong performance and energy efficiency for cloud-native, scale-out Linux workloads such as web and application servers, data analytics, open-source databases, and caching systems. Running at 3.4 GHz, Cobalt 100 allocates a dedicated physical core for each vCPU, which helps ensure consistent and predictable performance. 
+ +To learn more, see the Microsoft blog [Announcing the preview of new Azure VMs based on the Azure Cobalt 100 processor](https://techcommunity.microsoft.com/blog/azurecompute/announcing-the-preview-of-new-azure-vms-based-on-the-azure-cobalt-100-processor/4146353). + +## Alluxio + +Alluxio is an open-source data orchestration platform that enables fast and reliable access to data across distributed storage systems. It acts as a unified layer between compute frameworks and storage systems, improving performance for data-intensive applications. + +Alluxio is widely used in modern data platforms to accelerate analytics workloads by caching frequently accessed data in memory, reducing latency and minimizing repeated reads from slower storage systems such as local disks or cloud storage. + +Alluxio integrates seamlessly with popular analytics frameworks like Apache Spark, Presto, and Hadoop, making it ideal for building high-performance data pipelines and AI/ML workloads. + +To learn more, see the official [Alluxio documentation](https://docs.alluxio.io/). + +Alluxio provides key capabilities for data orchestration and performance optimization: + +- **Data Caching:** Frequently accessed data is stored in memory, significantly reducing access time compared to disk-based reads. + +- **Unified Namespace:** Alluxio presents a single logical view of data across multiple storage systems, simplifying data access. + +- **Tiered Storage:** Supports multiple storage layers (memory, SSD, HDD), enabling efficient data management based on access patterns. + +- **Compute Integration:** Works with analytics engines like Apache Spark to accelerate data processing without modifying application logic. 
+ +Alluxio is commonly used in: + +- Big data analytics and processing +- AI and machine learning pipelines +- Data lake acceleration +- ETL and batch processing workflows +- High-performance data access layers + +In this Learning Path, you'll deploy Alluxio on an Azure Cobalt 100 Arm64 virtual machine and build a data orchestration and caching layer for analytics workloads. You will integrate Alluxio with Apache Spark and benchmark performance to understand how caching improves data access speed. + +## What you've learned and what's next + +You now have the context for why Azure Cobalt 100 and Alluxio are a strong combination for high-performance data orchestration and analytics workloads. Next, you'll create the virtual machine that will run Alluxio throughout this Learning Path. diff --git a/content/learning-paths/servers-and-cloud-computing/alluxio-cobalt/deployment.md b/content/learning-paths/servers-and-cloud-computing/alluxio-cobalt/deployment.md new file mode 100644 index 0000000000..25f40eec1e --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/alluxio-cobalt/deployment.md @@ -0,0 +1,213 @@ +--- +title: Deploy Alluxio on Azure Cobalt 100 +weight: 5 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- + +## Deploy Alluxio on Azure Cobalt 100 (Arm) + +This section guides you through installing Alluxio on an Azure Cobalt 100 Arm-based virtual machine and configuring it with local storage. + +You will set up a unified data orchestration layer that sits between compute frameworks and storage systems. + +### Why Alluxio? 
+ +- Speeds up data access using memory caching +- Reduces repeated disk I/O +- Improves performance for analytics workloads + +## Update your system + +```bash +sudo apt update && sudo apt upgrade -y +``` + +## Install required dependencies +These tools are required for downloading and extracting software: + +```bash +sudo apt install -y wget curl tar rsync nano +``` + +## Install Java 11 (Required) + +Alluxio supports **Java 8 and Java 11**. +Running it on Java 17 can cause intermittent runtime errors, so install Java 11. + +```bash +wget -qO - https://packages.adoptium.net/artifactory/api/gpg/key/public | \ +sudo gpg --dearmor -o /usr/share/keyrings/adoptium.gpg + +echo "deb [signed-by=/usr/share/keyrings/adoptium.gpg] https://packages.adoptium.net/artifactory/deb noble main" | \ +sudo tee /etc/apt/sources.list.d/adoptium.list + +sudo apt update +sudo apt install -y temurin-11-jdk +``` + +**Set Java:** + +```bash +sudo update-alternatives --config java +``` + +- Select Java 11 + +**Verify:** + +```bash +java -version +``` + +The output is similar to: + +```output +openjdk version "11.0.30" 2026-01-20 +OpenJDK Runtime Environment Temurin-11.0.30+7 (build 11.0.30+7) +OpenJDK 64-Bit Server VM Temurin-11.0.30+7 (build 11.0.30+7, mixed mode) +``` + +## Download and install Alluxio + +```bash +cd /opt +sudo wget https://downloads.alluxio.io/downloads/files/2.9.4/alluxio-2.9.4-bin.tar.gz +sudo tar -xvzf alluxio-2.9.4-bin.tar.gz +sudo mv alluxio-2.9.4 alluxio +sudo chown -R $USER:$USER /opt/alluxio +``` + +## Configure environment variables +This allows you to run Alluxio commands globally. 
+ +```bash +echo 'export ALLUXIO_HOME=/opt/alluxio' >> ~/.bashrc +echo 'export PATH=$PATH:$ALLUXIO_HOME/bin' >> ~/.bashrc +source ~/.bashrc +``` + +## Configure Alluxio +Navigate to the configuration directory and create the configuration files from the templates: + +```bash +cd /opt/alluxio/conf +cp alluxio-env.sh.template alluxio-env.sh +cp alluxio-site.properties.template alluxio-site.properties +``` + +## Configure RAM-based storage +Alluxio uses memory for fast data access. + +**Edit:** + +```bash +nano alluxio-env.sh +``` + +**Add:** + +```bash +export ALLUXIO_RAM_FOLDER=/dev/shm +``` + +`/dev/shm` is a Linux in-memory filesystem (RAM-backed storage). + +## Configure core properties + +```bash +nano alluxio-site.properties +``` + +```bash +alluxio.master.hostname=localhost +alluxio.worker.memory.size=6GB +alluxio.master.mount.table.root.ufs=/mnt/data +``` + +**Explanation:** + +- `alluxio.master.hostname` → where the Alluxio master runs +- `alluxio.worker.memory.size` → RAM allocated for caching +- `alluxio.master.mount.table.root.ufs` → underlying storage (your disk) + +## Set up the storage directory +This is your underlying file system (UFS). + +```bash +sudo mkdir -p /mnt/data +sudo chmod -R 777 /mnt/data +``` + +## Start Alluxio +Format metadata (first time only): + +```bash +alluxio format +``` + +**Start Alluxio in local mode:** + +```bash +alluxio-start.sh local NoMount +``` + +The output is similar to: + +```output +Starting to monitor all local services. + ----------------------------------------- + --- [ OK ] The master service @ alluxio-arm64.xaxcsurvhrzefjc5ihdpsf2vbc.rx.internal.cloudapp.net is in a healthy state. + --- [ OK ] The job_master service @ alluxio-arm64.xaxcsurvhrzefjc5ihdpsf2vbc.rx.internal.cloudapp.net is in a healthy state. + --- [ OK ] The worker service @ alluxio-arm64.xaxcsurvhrzefjc5ihdpsf2vbc.rx.internal.cloudapp.net is in a healthy state. 
+ --- [ OK ] The job_worker service @ alluxio-arm64.xaxcsurvhrzefjc5ihdpsf2vbc.rx.internal.cloudapp.net is in a healthy state. 
+ --- [ OK ] The proxy service @ alluxio-arm64.xaxcsurvhrzefjc5ihdpsf2vbc.rx.internal.cloudapp.net is in a healthy state. +``` + +## Verify Alluxio services + +```bash +jps +``` + +**Expected output:** + +```output +AlluxioJobWorker +AlluxioJobMaster +Jps +AlluxioMaster +AlluxioProxy +AlluxioWorker +``` + +**Open the Web UI:** +Open the following URL in your browser, using your virtual machine's public IP address as the host: + +```text +http://:19999 +``` + +![Alluxio dashboard showing cluster summary and worker status on Azure Cobalt 100 VM#center](images/alluxio-ui.png "Alluxio Web UI with cluster summary and worker details") + +## Alluxio UI Overview + +What you can see: + +- Master status (Leader node) +- Worker memory usage +- Storage capacity +- Cached data blocks +- Cluster health + +## What you've learned and what's next + +You have successfully: + +- Installed Alluxio on an Arm-based VM +- Configured compute and storage layers +- Enabled memory-based data caching +- Verified cluster health via the Web UI + +You are now ready to integrate Alluxio with analytics frameworks. diff --git a/content/learning-paths/servers-and-cloud-computing/alluxio-cobalt/firewall-setup.md b/content/learning-paths/servers-and-cloud-computing/alluxio-cobalt/firewall-setup.md new file mode 100644 index 0000000000..1842268814 --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/alluxio-cobalt/firewall-setup.md @@ -0,0 +1,51 @@ +--- +title: Create a firewall rule on Azure +weight: 4 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- + +## Configure Azure firewall for Alluxio Web UI + +To allow external traffic on port **19999** for Alluxio running on an Azure virtual machine, open the port in the Network Security Group (NSG) attached to the virtual machine's network interface or subnet. 
+ +{{% notice Note %}}For more information about Azure setup, see [Getting started with Microsoft Azure Platform](/learning-paths/servers-and-cloud-computing/csp/azure/).{{% /notice %}} + +## Create a firewall rule in Azure + +To expose TCP port **19999**, create a firewall rule. + +Navigate to the [Azure Portal](https://portal.azure.com), go to **Virtual Machines**, and select your virtual machine. + +![Azure Portal showing Virtual Machines list#center](images/virtual_machine.png "Virtual Machines") + +In the left menu, expand **Networking** and select **Network settings** for the network interface associated with the virtual machine. + +![Azure Portal showing Network settings with security group configuration#center](images/networking.png "Network settings") + +Navigate to **Create port rule**, and select **Inbound port rule**. + +![Azure Portal showing Create port rule dropdown menu#center](images/port_rule.png "Create rule") + +Configure the inbound security rule with the following settings: + +- **Source:** Any +- **Source port ranges:** * +- **Destination:** Any +- **Destination port ranges:** **19999** +- **Protocol:** TCP +- **Action:** Allow +- **Name:** allow-alluxio-port + +After filling in the details, select **Add** to save the rule. + +![Azure Portal showing inbound security rule form with port 19999 configuration#center](images/inbound_rule.png "Network settings") + +The network firewall rule is now created, allowing the Alluxio Web UI to be accessed over port **19999**. + +## What you've learned and what's next + +You've configured the Azure Network Security Group to allow incoming traffic on port 19999. This firewall rule enables external access to the Alluxio Web UI for monitoring cluster status and storage usage. + +Next, you'll install and configure Alluxio on the virtual machine. 
diff --git a/content/learning-paths/servers-and-cloud-computing/alluxio-cobalt/images/alluxio-data.png b/content/learning-paths/servers-and-cloud-computing/alluxio-cobalt/images/alluxio-data.png new file mode 100644 index 0000000000..afcc548d06 Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/alluxio-cobalt/images/alluxio-data.png differ diff --git a/content/learning-paths/servers-and-cloud-computing/alluxio-cobalt/images/alluxio-load.png b/content/learning-paths/servers-and-cloud-computing/alluxio-cobalt/images/alluxio-load.png new file mode 100644 index 0000000000..4b53c1b80d Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/alluxio-cobalt/images/alluxio-load.png differ diff --git a/content/learning-paths/servers-and-cloud-computing/alluxio-cobalt/images/alluxio-ui.png b/content/learning-paths/servers-and-cloud-computing/alluxio-cobalt/images/alluxio-ui.png new file mode 100644 index 0000000000..5661987430 Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/alluxio-cobalt/images/alluxio-ui.png differ diff --git a/content/learning-paths/servers-and-cloud-computing/alluxio-cobalt/images/final-vm.png b/content/learning-paths/servers-and-cloud-computing/alluxio-cobalt/images/final-vm.png new file mode 100644 index 0000000000..5207abfb41 Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/alluxio-cobalt/images/final-vm.png differ diff --git a/content/learning-paths/servers-and-cloud-computing/alluxio-cobalt/images/inbound_rule.png b/content/learning-paths/servers-and-cloud-computing/alluxio-cobalt/images/inbound_rule.png new file mode 100644 index 0000000000..89dcf4f7b6 Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/alluxio-cobalt/images/inbound_rule.png differ diff --git a/content/learning-paths/servers-and-cloud-computing/alluxio-cobalt/images/instance.png 
b/content/learning-paths/servers-and-cloud-computing/alluxio-cobalt/images/instance.png new file mode 100644 index 0000000000..285cd764a5 Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/alluxio-cobalt/images/instance.png differ diff --git a/content/learning-paths/servers-and-cloud-computing/alluxio-cobalt/images/instance1.png b/content/learning-paths/servers-and-cloud-computing/alluxio-cobalt/images/instance1.png new file mode 100644 index 0000000000..b9d22c352d Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/alluxio-cobalt/images/instance1.png differ diff --git a/content/learning-paths/servers-and-cloud-computing/alluxio-cobalt/images/instance4.png b/content/learning-paths/servers-and-cloud-computing/alluxio-cobalt/images/instance4.png new file mode 100644 index 0000000000..2a0ff1e3b0 Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/alluxio-cobalt/images/instance4.png differ diff --git a/content/learning-paths/servers-and-cloud-computing/alluxio-cobalt/images/networking.png b/content/learning-paths/servers-and-cloud-computing/alluxio-cobalt/images/networking.png new file mode 100644 index 0000000000..9d6d15f8a3 Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/alluxio-cobalt/images/networking.png differ diff --git a/content/learning-paths/servers-and-cloud-computing/alluxio-cobalt/images/port_rule.png b/content/learning-paths/servers-and-cloud-computing/alluxio-cobalt/images/port_rule.png new file mode 100644 index 0000000000..681dc71aa1 Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/alluxio-cobalt/images/port_rule.png differ diff --git a/content/learning-paths/servers-and-cloud-computing/alluxio-cobalt/images/ubuntu-pro.png b/content/learning-paths/servers-and-cloud-computing/alluxio-cobalt/images/ubuntu-pro.png new file mode 100644 index 0000000000..d54bd75ca6 Binary files /dev/null and 
b/content/learning-paths/servers-and-cloud-computing/alluxio-cobalt/images/ubuntu-pro.png differ diff --git a/content/learning-paths/servers-and-cloud-computing/alluxio-cobalt/images/virtual_machine.png b/content/learning-paths/servers-and-cloud-computing/alluxio-cobalt/images/virtual_machine.png new file mode 100644 index 0000000000..cf6704fcc6 Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/alluxio-cobalt/images/virtual_machine.png differ diff --git a/content/learning-paths/servers-and-cloud-computing/alluxio-cobalt/instance.md b/content/learning-paths/servers-and-cloud-computing/alluxio-cobalt/instance.md new file mode 100644 index 0000000000..bc28a377a9 --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/alluxio-cobalt/instance.md @@ -0,0 +1,76 @@ +--- +title: Create an Azure Cobalt 100 Arm64 virtual machine +weight: 3 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- + +## Provision Azure infrastructure for Alluxio + +Create an Arm-based Cobalt 100 virtual machine to host your Alluxio deployment. + +## Prerequisites and setup + +There are several common ways to create an Arm-based Cobalt 100 virtual machine, and you can choose the method that best fits your workflow or requirements: + +- The Azure Portal +- The Azure CLI +- An infrastructure as code (IaC) tool + +In this section, you'll launch the Azure Portal to create a virtual machine with the Arm-based Azure Cobalt 100 processor. + +The Learning Path focuses on general-purpose virtual machines in the Dpsv6 series. For more information, see the [Microsoft Azure guide for the Dpsv6 size series](https://learn.microsoft.com/en-us/azure/virtual-machines/sizes/general-purpose/dpsv6-series). + +While the steps to create this instance are included here for convenience, you can also refer to the [Deploy a Cobalt 100 virtual machine on Azure Learning Path](/learning-paths/servers-and-cloud-computing/cobalt/). 
+ +## Create an Arm-based Azure virtual machine + +Creating a virtual machine based on Azure Cobalt 100 is no different from creating any other virtual machine in Azure. To create an Azure virtual machine: + +- Launch the Azure portal and navigate to **Virtual Machines**. +- Select **Create**, and select **Virtual Machine** from the drop-down list. +- Inside the **Basic** tab, fill in the instance details such as **Virtual machine name** and **Region**. +- Select the image for your virtual machine (for example, Ubuntu Pro 24.04 LTS) and select **Arm64** as the VM architecture. +- In the **Size** field, select **See all sizes** and select the D-Series v6 family of virtual machines. +- Select **D4ps_v6** from the list as shown in the diagram below: + +![Azure Portal showing D-Series v6 VM size selection with D4ps_v6 highlighted#center](images/instance.png "Select D4ps_v6 from the D-Series v6 family") + +- For **Authentication type**, select **SSH public key**. + +{{% notice Note %}} +Azure generates an SSH key pair for you and lets you save it for future use. This method is fast, secure, and easy for connecting to your virtual machine. +{{% /notice %}} + +- Fill in the **Administrator username** for your VM. +- Select **Generate new key pair**, and select **RSA SSH Format** as the SSH Key Type. + +{{% notice Note %}} +RSA offers better security with keys longer than 3072 bits. +{{% /notice %}} + +- Give your SSH key a key pair name. +- In the **Inbound port rules**, select **HTTP (80)** and **SSH (22)** as the inbound ports, as shown below: + +![Azure Portal showing inbound port rules with HTTP (80) and SSH (22) selected#center](images/instance1.png "Configure inbound port rules for HTTP and SSH access") + +- Now select the **Review + Create** tab and review the configuration for your virtual machine. 
It should look like the following: + +![Azure Portal Review + Create tab showing VM configuration summary ready for deployment#center](images/ubuntu-pro.png "Review VM configuration before creation") + +- When you're happy with your selection, select **Create**, and then select **Download Private key and Create Resource**. + +![Azure Portal showing Create button and SSH key download dialog#center](images/instance4.png "Download SSH key and create the virtual machine") + +Your virtual machine should be ready and running in a few minutes. You can SSH into the virtual machine using the private key and the virtual machine's public IP address. + +![Azure Portal showing successful VM deployment with confirmation details#center](images/final-vm.png "Successful VM deployment confirmation") + +{{% notice Note %}}To learn more about Arm-based virtual machines in Azure, see "Getting Started with Microsoft Azure" in [Get started with Arm-based cloud instances](/learning-paths/servers-and-cloud-computing/csp/azure/).{{% /notice %}} + +## What you've learned and what's next + +You've created an Azure Cobalt 100 Arm64 virtual machine running Ubuntu 24.04 LTS with SSH authentication configured. The virtual machine is now ready for installing and running Alluxio workloads. + +Next, you'll open port 19999 in the Azure Network Security Group so that you can reach the Alluxio Web UI, and then install Alluxio on the VM to build a data orchestration and caching layer for analytics workloads. 
diff --git a/content/learning-paths/servers-and-cloud-computing/alluxio-cobalt/integration-caching-and-performance.md b/content/learning-paths/servers-and-cloud-computing/alluxio-cobalt/integration-caching-and-performance.md new file mode 100644 index 0000000000..2cb17a80df --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/alluxio-cobalt/integration-caching-and-performance.md @@ -0,0 +1,227 @@ +--- +title: Integrate Alluxio with Apache Spark and Optimize Performance +weight: 6 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- + +## Integrate Alluxio with Apache Spark + +This section demonstrates how to integrate Alluxio with Apache Spark, enable caching, and optimize data access performance. + +In this section, you will learn how to: + +- Connect Spark with Alluxio +- Enable in-memory caching +- Measure performance improvements + +## Why integrate Alluxio with Spark? + +**Without Alluxio:** + +```text +Spark → Disk → Slow (every time) +``` + +**With Alluxio:** + +```text +Spark → Alluxio → Memory → Fast +``` + +Alluxio caches frequently accessed data in memory, reducing repeated disk reads. 
+ +## Install Apache Spark + +```bash +cd ~ +wget https://archive.apache.org/dist/spark/spark-3.4.2/spark-3.4.2-bin-hadoop3.tgz +tar -xvzf spark-3.4.2-bin-hadoop3.tgz + +sudo mv spark-3.4.2-bin-hadoop3 /opt/spark +sudo chown -R $USER:$USER /opt/spark +``` + +## Configure Spark environment + +```bash +echo 'export SPARK_HOME=/opt/spark' >> ~/.bashrc +echo 'export PATH=$PATH:$SPARK_HOME/bin' >> ~/.bashrc +source ~/.bashrc +``` + +## Connect Spark with Alluxio +Edit Spark configuration: + +```bash +nano $SPARK_HOME/conf/spark-defaults.conf +``` + +**Add:** + +```bash +spark.hadoop.fs.alluxio.impl=alluxio.hadoop.FileSystem +spark.driver.extraClassPath=/opt/alluxio/client/alluxio-2.9.4-client.jar +spark.executor.extraClassPath=/opt/alluxio/client/alluxio-2.9.4-client.jar +``` + +**Explanation:** + +- Enables Spark to read from `alluxio://` +- Adds Alluxio client libraries to Spark + +## Create dataset + +```bash +rm -rf /mnt/data/demo +mkdir -p /mnt/data/demo +``` + +```bash +for i in {1..100000}; do + echo "record $i - alluxio spark learning" >> /mnt/data/demo/data.txt +done +``` + +**Verify:** + +```bash +wc -l /mnt/data/demo/data.txt +``` + +The output is similar to: + +```output +100000 /mnt/data/demo/data.txt +``` + +## Run Spark + +```bash +spark-shell +``` + +The output is similar to: + +```output +Welcome to + ____ __ + / __/__ ___ _____/ /__ + _\ \/ _ \/ _ `/ __/ '_/ + /___/ .__/\_,_/_/ /_/\_\ version 3.4.2 + /_/ + +Using Scala version 2.12.17 (OpenJDK 64-Bit Server VM, Java 11.0.30) + Type in expressions to have them evaluated. + Type :help for more information. 
+ +scala> +``` + +## Load data via Alluxio + +```scala +val df = spark.read.text("alluxio:///demo/data.txt") +df.count() +``` + +**Expected output:** + +```output +100000 +``` + +## Enable caching + +```scala +df.cache() +df.count() +``` + +This loads data into memory (Alluxio + Spark cache). + +## Measure performance + +**First run:** + +```scala +val t1 = System.nanoTime() +df.count() +val t2 = System.nanoTime() +println((t2 - t1)/1e9 + " seconds") +``` + +**Second run (cached):** + +```scala +val t3 = System.nanoTime() +df.count() +val t4 = System.nanoTime() +println((t4 - t3)/1e9 + " seconds") +``` + +```output +Disk read:            ~0.44 seconds +Alluxio first read:   ~0.44 seconds +Alluxio cached read:  ~0.39 seconds +``` + +**Performance analysis** + +- First read → data comes from disk +- Second read → data is served from memory (cache) +- Cached read is faster due to reduced disk I/O + + +## Verify in Alluxio UI + +**Open:** + +```text +http://:19999 +``` + +![Alluxio cluster load and worker resource usage during Spark job execution on Azure Cobalt 100 VM#center](images/alluxio-load.png "Alluxio cluster load and worker utilization during processing") + +![Alluxio data browser showing cached files and directories on Azure Cobalt 100 VM#center](images/alluxio-data.png "Alluxio data view displaying cached datasets") + +### What this shows: + +- Files stored in Alluxio namespace +- Cached dataset visibility +- Data available for fast access + +### Alluxio UI (Caching in Action) + +What to observe: + +- Increased worker memory usage +- Cached file blocks +- Active data access + +## Compare with direct file access + +```scala +val df1 = spark.read.text("file:///mnt/data/demo/data.txt") +val df2 = spark.read.text("alluxio:///demo/data.txt") +``` + +Run `df1.count()` and `df2.count()` and compare the timings to see the effect of using Alluxio as a caching layer. 
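To put a number on the comparison, the timings reported in the Measure performance step can be turned into a speedup ratio. The following is a minimal sketch, not part of Alluxio or Spark: `speedup` is a hypothetical helper name, and it assumes you pass it two timings in seconds that you measured yourself. The example call uses the sample values from this Learning Path (about 0.44 seconds for the first read and about 0.39 seconds for the cached read).

```shell
# Hypothetical helper (not an Alluxio or Spark command):
# prints baseline_seconds / cached_seconds as a ratio with two decimals.
speedup() {
  awk -v base="$1" -v cached="$2" 'BEGIN { printf "%.2fx\n", base / cached }'
}

# Sample timings from this Learning Path: first read vs. cached read.
speedup 0.44 0.39
```

With a dataset this small the ratio stays close to 1, so treat it as an illustration of the calculation. Your own timings will vary from run to run.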
+ +## Key concepts + +- Alluxio sits between compute and storage +- Frequently used data is cached in memory +- Spark reads cached data instead of disk +- This improves analytics performance significantly + +## What you've learned and what's next + +You have successfully: + +- Integrated Spark with Alluxio +- Enabled distributed caching +- Measured performance improvements +- Validated results using real data + +You are now ready to extend this setup with cloud storage, Spark SQL, and distributed clusters.