Migrate SLURM accounting data when upgrading to a newer ParallelCluster version
This guide describes how to migrate your SLURM accounting data when upgrading to a newer version of AWS ParallelCluster. The migration process has two main steps: first, creating a new cluster with the newer version, then initializing its accounting database using data from your previous cluster. After migration, the two clusters will each point to their own independent database. Note that this process requires downtime on your current cluster to ensure data integrity.
This guide focuses on ParallelCluster-specific constraints and recommendations. For additional details and recommendations specific to SLURM version upgrades, please refer to the official SLURM documentation.
If your cluster uses an external slurmdbd, please refer to the section Migration with External Slurmdbd for the differences that apply to each step.
Please refer to the official SLURM documentation to check the compatibility window of versions.
The high-level steps to migrate the SLURM accounting database are:
- Stop the current cluster and backup the accounting data
- Create a new database and import the accounting data
- Set up the new ParallelCluster pointing to the new accounting database
- Verify the migration was successful
- You have an existing cluster with SLURM accounting enabled
- You can stop the compute fleet and SLURM daemons on the current cluster
- You can install the MySQL client commands (mysql, mysqldump) on a host; this can be any EC2 instance that is able to communicate with the database. In this tutorial, the mysql commands are based on the MariaDB client.
- You have access to a database user with read/write permissions on the database cluster. This is required to create the new database and to export and import database dumps.
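For example, on an Amazon Linux 2023 instance the MariaDB client tools can be installed as follows (the package name is an assumption for that distribution; adjust it for your OS):

```shell
# Install the MariaDB client tools (assumption: Amazon Linux 2023, where the
# package is named mariadb105; on Amazon Linux 2 the package is mariadb).
sudo dnf install -y mariadb105

# Verify that the commands used in this guide are available.
mysql --version
mysqldump --version
```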
- Always create a backup of your data before migration
- Test the migration process in a non-production environment first
- Verify data integrity after migration
- Keep the old database as backup for a reasonable period
In this step we will stop using the accounting database and retrieve relevant information that we will later use to set up the new cluster.
Stop the compute fleet:
pcluster update-compute-fleet \
--region <region> \
--cluster-name <clusterA_Name> \
--status STOP_REQUESTED
Take note of the SLURM configuration ClusterName in the current cluster:
scontrol show config | grep ClusterName
ClusterName = <clusterA_Slurm_ClusterName>
Take note of the SLURM configuration StorageLoc in the current cluster:
sacctmgr show configuration | grep StorageLoc
StorageLoc = <clusterA_DatabaseName>
Take note of the ID of the last job submitted to the current cluster:
sacct --format=jobid -X | tail -n 1
<clusterA_LastJobId>
Then stop all SLURM daemons:
sudo systemctl stop slurmrestd # Only if present
sudo systemctl stop slurmctld
sudo systemctl stop slurmdbd
In this step we will backup cluster state and accounting data. The dump of accounting data will be then imported into the new database.
This is an example procedure that relies on mysqldump to take a local snapshot of the database. We recommend that users follow their preferred database backup procedure for this step, which may involve different tools or techniques depending on the specific database setup. In this example, since the backup is created locally on an instance, we recommend checking that there is enough space for the backup file and saving it to a persistent storage location for future use.
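For example, you can estimate the size of the accounting database and compare it with the free space on the instance before taking the dump (a sketch; the information_schema query reports the on-disk table size, which is an upper bound for the size of the text dump):

```shell
# Estimate the size of the accounting database in MB.
mysql -h <databaseHostname> \
      -u <databaseAdminUsername> -p \
      -e "SELECT ROUND(SUM(data_length + index_length) / 1024 / 1024, 1) AS size_mb
          FROM information_schema.tables
          WHERE table_schema = '<clusterA_DatabaseName>';"

# Check the free space in the directory where the dump will be written.
df -h .
```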
Backup the state of the current cluster:
mkdir -p pcluster-backup/var/spool
cp -R /var/spool/slurm.state pcluster-backup/var/spool/
Backup data from the accounting database:
mysqldump <clusterA_DatabaseName> \
-h <databaseHostname> \
-u <databaseAdminUsername> -p \
--routines --triggers --events > slurm_accounting_backup.sql
# Type the password when prompted
If you are not using the mariadb client, you probably need to add the option --set-gtid-purged=OFF to prevent replication-related issues.
In this step we will import the accounting data that we saved in the previous step into the new database that will be used by the new cluster. We recommend creating a new database for the new cluster as a safe approach for the migration, as it decouples the two clusters.
This is an example procedure that relies on the mysql client to import a local snapshot of the old database into the new one. We recommend that users follow their preferred database restore procedure for this step, which may involve different tools or techniques depending on the specific database setup.
Create a new database:
mysql -h <databaseHostname> \
-u <databaseAdminUsername> \
-p -e "CREATE DATABASE <clusterB_DatabaseName>;"
# Type the password when prompted
Import data to the new database:
mysql <clusterB_DatabaseName> \
-h <databaseHostname> \
-u <databaseAdminUsername> -p \
< slurm_accounting_backup.sql
# Type the password when prompted
If you are not using the mariadb client, you probably need to add the option --set-gtid-purged=OFF when creating the dump with mysqldump, to prevent replication-related issues.
Verify that the new database contains the expected data.
In particular, check that the tables in the two databases match:
TABLES_A=$(
mysql <clusterA_DatabaseName> \
-h <databaseHostname> \
-u <databaseAdminUsername> -p \
-e 'SHOW TABLES;')
TABLES_B=$(
mysql <clusterB_DatabaseName> \
-h <databaseHostname> \
-u <databaseAdminUsername> -p \
-e 'SHOW TABLES;')
diff <(echo "$TABLES_A") <(echo "$TABLES_B")
and check that the number of recorded jobs match in the two databases:
COUNT_A=$(
mysql <clusterA_DatabaseName> \
-h <databaseHostname> \
-u <databaseAdminUsername> -p \
-e 'SELECT COUNT(*) FROM `<clusterA_Slurm_ClusterName>_job_table`;')
COUNT_B=$(
mysql <clusterB_DatabaseName> \
-h <databaseHostname> \
-u <databaseAdminUsername> -p \
-e 'SELECT COUNT(*) FROM `<clusterA_Slurm_ClusterName>_job_table`;')
diff <(echo "$COUNT_A") <(echo "$COUNT_B")
Note that this validation is not comprehensive; however, it can give you enough confidence about the outcome of the copy.
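As an additional sanity check, you can compare the highest recorded job id in the two databases (a sketch, assuming the Slurm job table uses the id_job column, as in standard Slurm accounting schemas):

```shell
MAX_A=$(
mysql <clusterA_DatabaseName> \
-h <databaseHostname> \
-u <databaseAdminUsername> -p \
-e 'SELECT MAX(id_job) FROM `<clusterA_Slurm_ClusterName>_job_table`;')
MAX_B=$(
mysql <clusterB_DatabaseName> \
-h <databaseHostname> \
-u <databaseAdminUsername> -p \
-e 'SELECT MAX(id_job) FROM `<clusterA_Slurm_ClusterName>_job_table`;')
diff <(echo "$MAX_A") <(echo "$MAX_B")
```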
Create the new cluster with SLURM accounting enabled. This cluster is configured to push accounting data to the new database (which now contains a copy of the old accounting data), but using its own Slurm cluster name, so at this point accounting data from the new and old clusters is stored in separate tables.
Create the new cluster with the following settings:
HeadNode:
Networking:
AdditionalSecurityGroups:
- <database_ClientSecurityGroup>
Scheduling:
SlurmSettings:
Database:
Uri: <databaseHostname>:<databasePort>
UserName: <databaseAdminUsername>
PasswordSecretArn: <databasePasswordSecret>
DatabaseName: <clusterB_DatabaseName>
CustomSlurmSettings:
- FirstJobId: <clusterA_LastJobId + 100>
Setting FirstJobId to a number greater than the last job submitted in the old cluster is required so that the new cluster does not overwrite the old accounting data. Note that 100 here is an arbitrary number.
In this step, we will make the new cluster point to the accounting data from the old cluster, so that accounting data will be treated as coming from the same cluster.
Log into the head node to remove the cluster name guardrail. This step is needed to signal to Slurm that pointing to the old cluster data is intentional.
# From within the head node
sudo rm /var/spool/slurm.state/clustername
Then, update the new cluster configuration by adding the custom SLURM setting ClusterName and leaving everything else unchanged:
Scheduling:
SlurmSettings:
CustomSlurmSettings:
- ClusterName: <clusterA_Slurm_ClusterName>
Setting ClusterName to the old cluster name is required for Slurm to store new accounting data in the same tables where the old accounting data is stored.
Apply the new configuration through a cluster update:
pcluster update-cluster \
--region <region> \
--cluster-name <clusterB_Name> \
--cluster-configuration <clusterB_Config>
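Before proceeding, you can optionally wait for the update to complete, for example (assuming the ParallelCluster CLI v3, which supports the JMESPath --query option):

```shell
pcluster describe-cluster \
--region <region> \
--cluster-name <clusterB_Name> \
--query clusterStatus
```

The command should eventually return "UPDATE_COMPLETE".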
Log into the new cluster head node and verify that the old accounting data is visible.
sacct --format=jobid,jobname,partition,alloccpus,account,user,uid,state,exitcode --starttime now-30days
In the new cluster, submit a dummy job:
sbatch --wrap 'sleep 60'
Take note of the job id and check that the accounting data for that job is visible:
sacct --format=jobid,jobname,partition,alloccpus,account,user,uid,state,exitcode --starttime now-30days
This section applies when your cluster uses an external slurmdbd instance. In that case, follow the same migration procedure described above, with the differences noted below.
Same as Step 1: Stop the fleet and SLURM daemons on the current cluster, but the systemctl stop slurmdbd command must be run on the external slurmdbd instance rather than on the head node.
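For example, assuming you have SSH access to the external slurmdbd instance (the address placeholder below is hypothetical):

```shell
# From a host that can reach the external slurmdbd instance over SSH
ssh <slurmdbdInstanceAddress> sudo systemctl stop slurmdbd
```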
Same as Step 2: Backup data on the current cluster
Same as Step 3: Copy data to the new database
Before creating the new cluster, you must deploy a new external slurmdbd instance using the ParallelCluster AMI that matches the new cluster version. This ensures the Slurm version on the external slurmdbd matches the Slurm version on the new cluster. You can follow the external slurmdbd tutorial to deploy the new instance, pointing it to the database.
Same as Step 4: Set up the new cluster, but the cluster configuration needs the SlurmSettings/ExternalSlurmdbd section rather than the SlurmSettings/Database section. Also, you need to configure HeadNode/Networking/AdditionalSecurityGroups with the security groups required to communicate with the external slurmdbd.
HeadNode:
Networking:
AdditionalSecurityGroups:
- <AccountingClientSecurityGroup> # from the external slurmdbd stack
- <SshClientSecurityGroup> # from the external slurmdbd stack
Scheduling:
Scheduler: slurm
SlurmSettings:
ExternalSlurmdbd:
Host: <SlurmdbdPrivateIp> # from ExternalSlurmDbd stack
Port: <SlurmdbdPort> # from ExternalSlurmDbd stack
MungeKeySecretArn: <MungeKeySecretArn> # from ExternalSlurmDbd stack
CustomSlurmSettings:
- FirstJobId: <clusterA_LastJobId + 100>
Same as Step 5: Point the new cluster to the existing accounting database