Security & Credential Vending (Multi-Cloud)

Pangolin provides mechanisms to securely vend temporary credentials to clients for S3, Azure ADLS Gen2, and Google Cloud Storage, enabling direct data access while maintaining security.

Overview

Instead of sharing long-term cloud credentials with clients (e.g., Spark jobs, Dremio, Trino), Pangolin acts as a trusted intermediary. It authenticates the client and then issues temporary, scoped credentials for specific storage resources.

Benefits:

  • ✅ No long-term credentials in client configurations
  • ✅ Automatic credential rotation (STS for S3, OAuth2 for Azure/GCP)
  • ✅ Scoped access to specific table locations
  • ✅ Centralized audit trail of data access
  • ✅ Support for cross-account/cross-cloud access
  • ✅ Multi-cloud support: S3, Azure ADLS Gen2, Google Cloud Storage
  • ⚠️ Local Filesystem: Supported for dev/test (no credential vending involved)

Configuration

Prerequisites

  1. AWS Credentials: Pangolin needs AWS credentials with permissions to:

    • Call sts:AssumeRole (for STS vending)
    • Access S3 buckets (for static credential vending)
    • Generate presigned URLs
  2. IAM Role (for STS vending): Create an IAM role that Pangolin can assume with S3 access permissions.

Environment Variables

# AWS Configuration
AWS_REGION=us-east-1
AWS_ACCESS_KEY_ID=AKIA...
AWS_SECRET_ACCESS_KEY=...

# For STS Credential Vending
PANGOLIN_STS_ROLE_ARN=arn:aws:iam::123456789012:role/PangolinDataAccess
PANGOLIN_STS_SESSION_DURATION=3600  # 1 hour (default)

# For MinIO or S3-compatible storage
AWS_ENDPOINT_URL=http://minio:9000
AWS_ALLOW_HTTP=true
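
The variables above could be read and sanity-checked at startup along the following lines. This is a minimal sketch, not Pangolin's actual loader: `StsSettings` and `load_sts_settings` are hypothetical names, and the 900 to 43200 second bounds are the limits AWS STS places on `AssumeRole` durations, not something stated by Pangolin.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# AWS STS accepts AssumeRole durations between 15 minutes and 12 hours.
MIN_DURATION = 900
MAX_DURATION = 43200


@dataclass(frozen=True)
class StsSettings:  # hypothetical container, not part of Pangolin
    role_arn: str
    session_duration: int
    endpoint_url: Optional[str]


def load_sts_settings(env: Mapping[str, str] = os.environ) -> StsSettings:
    """Read the STS-related variables and reject obviously bad values."""
    role_arn = env.get("PANGOLIN_STS_ROLE_ARN", "")
    if not role_arn.startswith("arn:") or ":role/" not in role_arn:
        raise ValueError("PANGOLIN_STS_ROLE_ARN must be an IAM role ARN")
    # Fall back to the documented default of 3600 seconds.
    duration = int(env.get("PANGOLIN_STS_SESSION_DURATION", "3600"))
    if not MIN_DURATION <= duration <= MAX_DURATION:
        raise ValueError(
            f"session duration must be between {MIN_DURATION} and {MAX_DURATION} seconds"
        )
    # AWS_ENDPOINT_URL is only set for MinIO or other S3-compatible storage.
    return StsSettings(role_arn, duration, env.get("AWS_ENDPOINT_URL"))
```

Passing an explicit mapping instead of relying on `os.environ` makes the check easy to exercise in tests.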

Warehouse Configuration

To enable credential vending, configure the vending_strategy in your warehouse definition.

AWS S3 (STS Mode):

curl -X POST http://localhost:8080/api/v1/warehouses \
  -H "X-Pangolin-Tenant: <tenant-id>" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "production_warehouse",
    "storage_config": {
      "type": "s3",
      "bucket": "my-data-bucket",
      "region": "us-east-1",
      "s3.role-arn": "arn:aws:iam::123456789012:role/PangolinDataAccess"
    },
    "vending_strategy": {
       "AwsSts": {
          "role_arn": "arn:aws:iam::123456789012:role/PangolinDataAccess"
       }
    }
  }'
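
The same request can be built from Python with only the standard library. The sketch below mirrors the STS-mode curl call above; `build_warehouse_request` is a hypothetical helper, and nothing is sent until the returned object is passed to `urllib.request.urlopen`.

```python
import json
import urllib.request


def build_warehouse_request(base_url: str, tenant_id: str, name: str,
                            bucket: str, region: str,
                            role_arn: str) -> urllib.request.Request:
    """Build (but do not send) the POST that creates an STS-mode warehouse."""
    payload = {
        "name": name,
        "storage_config": {
            "type": "s3",
            "bucket": bucket,
            "region": region,
            "s3.role-arn": role_arn,
        },
        # Same shape as the curl example: the strategy name maps to its options.
        "vending_strategy": {"AwsSts": {"role_arn": role_arn}},
    }
    return urllib.request.Request(
        url=f"{base_url}/api/v1/warehouses",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "X-Pangolin-Tenant": tenant_id,
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Keeping request construction separate from sending makes it possible to inspect or log the payload before it reaches the server.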

AWS S3 (Static Mode):

curl -X POST http://localhost:8080/api/v1/warehouses \
  -H "X-Pangolin-Tenant: <tenant-id>" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "dev_warehouse",
    "storage_config": {
      "type": "s3",
      "bucket": "my-dev-bucket",
      "region": "us-east-1",
      "s3.access-key-id": "AKIA...",
      "s3.secret-access-key": "..."
    },
    "vending_strategy": "AwsStatic"
  }'

Azure ADLS Gen2 Configuration

OAuth2 Mode (Recommended):

curl -X POST http://localhost:8080/api/v1/warehouses \
  -H "X-Pangolin-Tenant: <tenant-id>" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "azure_warehouse",
    "storage_config": {
      "type": "azure",
      "azure.account-name": "mystorageaccount",
      "azure.container": "data",
      "azure.client-id": "azure-client-id",
      "azure.client-secret": "azure-client-secret",
      "azure.tenant-id": "azure-tenant-id"
    },
    "vending_strategy": {
       "AzureOAuth": {
          "client_id": "azure-client-id",
          "client_secret": "azure-client-secret",
          "tenant_id": "azure-tenant-id"
       }
    }
  }'

Account Key Mode:

curl -X POST http://localhost:8080/api/v1/warehouses \
  -H "X-Pangolin-Tenant: <tenant-id>" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "azure_warehouse",
    "storage_config": {
      "type": "azure",
      "azure.account-name": "mystorageaccount",
      "azure.container": "data",
      "azure.account-key": "your-account-key"
    },
    "vending_strategy": "AzureSas"
  }'

Google Cloud Storage Configuration

curl -X POST http://localhost:8080/api/v1/warehouses \
  -H "X-Pangolin-Tenant: <tenant-id>" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "gcp_warehouse",
    "storage_config": {
      "type": "gcs",
      "gcp.project-id": "my-gcp-project",
      "gcp.bucket": "my-data-bucket",
      "gcp.service-account-key": "{...json key...}"
    },
    "vending_strategy": "GcpDownscoped"
  }'

Features

Presigned URLs

Get a presigned URL to download a specific file (e.g., a metadata file or data file) without needing AWS credentials.

Endpoint: GET /v1/{prefix}/namespaces/{namespace}/tables/{table}/presign?location=s3://bucket/key

Request:

curl "http://localhost:8080/v1/analytics/namespaces/sales/tables/transactions/presign?location=s3://my-bucket/data/file.parquet" \
  -H "Authorization: Bearer <token>" \
  -H "X-Pangolin-Tenant: <tenant-id>"

Response:

{
  "url": "https://bucket.s3.amazonaws.com/key?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=..."
}

Use Case: Web applications, data preview tools, or clients that can't handle AWS credentials.
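
When calling this endpoint from code, the `location` value should be URL-encoded, since it contains `:` and `/`. A minimal standard-library sketch follows; the helper names `presign_endpoint` and `fetch_presigned_url` are hypothetical, and the response is assumed to have the `url` field shown above.

```python
import json
import urllib.parse
import urllib.request


def presign_endpoint(base_url: str, prefix: str, namespace: str,
                     table: str, location: str) -> str:
    """Return the presign URL with path segments and location URL-encoded."""
    segments = ("v1", prefix, "namespaces", namespace, "tables", table, "presign")
    path = "/".join(urllib.parse.quote(part, safe="") for part in segments)
    query = urllib.parse.urlencode({"location": location})
    return f"{base_url}/{path}?{query}"


def fetch_presigned_url(endpoint: str, token: str, tenant_id: str) -> str:
    """Call Pangolin and return the "url" field of the JSON response."""
    req = urllib.request.Request(endpoint, headers={
        "Authorization": f"Bearer {token}",
        "X-Pangolin-Tenant": tenant_id,
    })
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["url"]
```

The returned URL can then be fetched with any HTTP client, with no AWS SDK or credentials on the client side.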


PyIceberg Integration

For all supported cloud providers, PyIceberg uses Pangolin's vended credentials automatically once the catalog is configured with the Pangolin URI, prefix, and token.

Automatic Credential Vending (S3)

from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "pangolin",
    **{
        "uri": "http://localhost:8080",
        "prefix": "analytics",
        "token": "your-jwt-token",
        # No S3 credentials needed - Pangolin vends them automatically!
    }
)

# PyIceberg will request credentials from Pangolin for each table access
table = catalog.load_table("sales.transactions")
df = table.scan().to_pandas()  # Pangolin vends S3 credentials automatically

Automatic Credential Vending (Azure)

from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "pangolin_azure",
    **{
        "uri": "http://localhost:8080",
        "prefix": "azure_catalog",
        "token": "your-jwt-token",
        # No Azure credentials needed - Pangolin vends them automatically!
        # Pangolin provides: adls.token, adls.account-name, adls.container
    }
)

table = catalog.load_table("sales.transactions")
df = table.scan().to_pandas()  # Pangolin vends Azure credentials automatically

Automatic Credential Vending (GCP)

from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "pangolin_gcp",
    **{
        "uri": "http://localhost:8080",
        "prefix": "gcp_catalog",
        "token": "your-jwt-token",
        # No GCP credentials needed - Pangolin vends them automatically!
        # Pangolin provides: gcp-oauth-token, gcp-project-id
    }
)

table = catalog.load_table("sales.transactions")
df = table.scan().to_pandas()  # Pangolin vends GCP credentials automatically

How it works:

  1. PyIceberg requests table metadata from Pangolin
  2. Pangolin includes temporary cloud credentials in the response (based on warehouse type)
  3. PyIceberg uses these credentials to read data files from cloud storage
  4. Credentials expire after the configured duration (default: 1 hour)
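
Step 2 can be pictured as follows. In the Iceberg REST protocol, a load-table response carries a `config` map alongside the table metadata, and storage settings travel in that map. The sketch below pulls the storage-related entries out of such a response. The sample body is hypothetical, the exact keys Pangolin returns are an assumption, and `storage_properties` is an illustrative helper rather than a PyIceberg or Pangolin function.

```python
# Prefixes of the storage property names mentioned in this document
# (s3.*, adls.*, gcs.*, gcp-*). The exact keys Pangolin returns are assumed.
STORAGE_PREFIXES = ("s3.", "adls.", "gcs.", "gcp-")


def storage_properties(load_table_response: dict) -> dict:
    """Return only the storage-related entries of a load-table response."""
    config = load_table_response.get("config") or {}
    return {k: v for k, v in config.items() if k.startswith(STORAGE_PREFIXES)}


# Hypothetical response body, trimmed to the fields that matter here.
sample_response = {
    "metadata-location": "s3://my-data-bucket/sales/transactions/metadata/v1.metadata.json",
    "config": {
        "s3.access-key-id": "ASIA-EXAMPLE",
        "s3.secret-access-key": "example-secret",
        "s3.session-token": "example-token",
        "client.region": "us-east-1",
    },
}

vended = storage_properties(sample_response)
```

PyIceberg performs the equivalent merge internally, which is why no storage credentials appear in the catalog configuration above.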

Client-Provided Credentials

If you prefer to manage credentials yourself:

S3:

from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "pangolin",
    **{
        "uri": "http://localhost:8080",
        "prefix": "analytics",
        "token": "your-jwt-token",
        "s3.access-key-id": "AKIA...",
        "s3.secret-access-key": "...",
    }
)

Azure:

from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "pangolin_azure",
    **{
        "uri": "http://localhost:8080",
        "prefix": "azure_catalog",
        "token": "your-jwt-token",
        "adls.account-name": "mystorageaccount",
        "adls.account-key": "...",
    }
)

GCP:

from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "pangolin_gcp",
    **{
        "uri": "http://localhost:8080",
        "prefix": "gcp_catalog",
        "token": "your-jwt-token",
        "gcp-project-id": "my-project",
        "gcs.service-account-key": "/path/to/key.json",
    }
)

IAM Policy Examples

Pangolin Service Role Policy

Attach this permissions policy to the IAM role that Pangolin assumes when vending credentials, replacing my-data-bucket with your bucket name:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::my-data-bucket/*",
        "arn:aws:s3:::my-data-bucket"
      ]
    }
  ]
}
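
When a warehouse spans several buckets, the policy can be generated rather than edited by hand. This is a small sketch under that assumption; `data_access_policy` is a hypothetical helper that reproduces the structure of the policy above for any list of bucket names.

```python
import json
from typing import Iterable


def data_access_policy(buckets: Iterable[str]) -> dict:
    """Build the object- and bucket-level S3 permissions for each bucket."""
    resources = []
    for bucket in buckets:
        resources.append(f"arn:aws:s3:::{bucket}/*")  # objects in the bucket
        resources.append(f"arn:aws:s3:::{bucket}")    # the bucket itself (ListBucket)
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:DeleteObject",
                "s3:ListBucket",
            ],
            "Resource": resources,
        }],
    }


policy_json = json.dumps(data_access_policy(["my-data-bucket"]), indent=2)
```

Both ARN forms are needed: object actions match the `/*` resource, while `s3:ListBucket` matches the bucket ARN itself.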

Trust Policy for Pangolin

Allow Pangolin's AWS account/role to assume the data access role:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::YOUR-PANGOLIN-ACCOUNT:role/PangolinService"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

Security Best Practices

1. Use STS Credential Vending

Always use the AwsSts strategy in production so that clients receive temporary credentials rather than shared static ones.

2. Scope Credentials to Table Locations

Pangolin vends credentials scoped to specific table locations, limiting blast radius if credentials are compromised.
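
One common way to achieve this kind of scoping on AWS is an STS session policy: the `Policy` argument of `AssumeRole` can only narrow what the role already grants. Whether Pangolin builds its scoped credentials exactly this way is an assumption; the sketch only illustrates the idea, and `session_policy_for` is a hypothetical helper.

```python
import json
from urllib.parse import urlparse


def session_policy_for(table_location: str) -> str:
    """Return a JSON session policy limited to one table's S3 prefix."""
    parsed = urlparse(table_location)
    bucket = parsed.netloc
    prefix = parsed.path.strip("/")
    if parsed.scheme not in ("s3", "s3a") or not bucket or not prefix:
        raise ValueError(f"expected s3://bucket/prefix, got {table_location!r}")
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {   # Object access only under the table's prefix.
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            },
            {   # Listing is a bucket-level action, so restrict it by prefix.
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": f"{prefix}/*"}},
            },
        ],
    }
    return json.dumps(policy)
```

Credentials issued with such a policy cannot read a neighboring table even though the underlying role covers the whole bucket.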

3. Set Short Expiration Times

Configure PANGOLIN_STS_SESSION_DURATION to the minimum time needed (default: 3600 seconds / 1 hour).

4. Use IAM Conditions

Add IAM policy conditions to restrict access by IP, time, or other factors:

{
  "Condition": {
    "IpAddress": {
      "aws:SourceIp": "10.0.0.0/8"
    }
  }
}

5. Monitor Credential Usage

Review CloudTrail logs for STS AssumeRole calls to detect unusual patterns.
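
CloudTrail delivers log files as JSON objects with a top-level `Records` array. The following sketch counts `AssumeRole` calls against one role, grouped by source IP, so that an unfamiliar caller stands out. The helper name `assume_role_counts` is hypothetical; the record fields used (`eventSource`, `eventName`, `sourceIPAddress`, `requestParameters.roleArn`) are standard CloudTrail fields.

```python
from collections import Counter
from typing import Iterable


def assume_role_counts(records: Iterable[dict], role_arn: str) -> Counter:
    """Count AssumeRole calls for one role, grouped by source IP address."""
    counts: Counter = Counter()
    for rec in records:
        if rec.get("eventSource") != "sts.amazonaws.com":
            continue
        if rec.get("eventName") != "AssumeRole":
            continue
        # requestParameters can be null in CloudTrail records.
        params = rec.get("requestParameters") or {}
        if params.get("roleArn") == role_arn:
            counts[rec.get("sourceIPAddress", "unknown")] += 1
    return counts
```

Feeding it `json.load(f)["Records"]` from a downloaded log file gives a quick per-caller summary without any additional tooling.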

6. Rotate Static Credentials

If using static credentials (AwsStatic), rotate them regularly.


Troubleshooting

"Access Denied" when reading data

Cause: Vended credentials don't have permissions for the S3 location.

Solution:

  1. Verify the IAM role has S3 permissions for the table location
  2. Check the role ARN in warehouse configuration
  3. Verify Pangolin can assume the role: aws sts assume-role --role-arn <arn> --role-session-name test

"Credentials expired" errors

Cause: STS credentials expired during a long-running query.

Solution:

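  1. Increase PANGOLIN_STS_SESSION_DURATION (max: 43200 seconds / 12 hours). The IAM role's own maximum session duration setting must be at least as large, otherwise the AssumeRole call fails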
  2. Configure your client to refresh credentials automatically
  3. For very long queries, consider static credentials (AwsStatic) as a last resort, and rotate them regularly
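
For option 2, a client needs to decide when its credentials are close enough to expiry to fetch new ones. The sketch below shows that check with a safety margin. How the expiry time is obtained depends on the client, and both `needs_refresh` and the five-minute default margin are illustrative assumptions.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def needs_refresh(expires_at: datetime,
                  now: Optional[datetime] = None,
                  margin: timedelta = timedelta(minutes=5)) -> bool:
    """Return True once the credentials are within `margin` of expiring.

    Both datetimes must be timezone-aware so the comparison is unambiguous.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return now >= expires_at - margin
```

With PyIceberg, reloading the table through the catalog requests credentials again, so a long-running job can call `catalog.load_table(...)` whenever this check returns True.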

PyIceberg not using vended credentials

Cause: PyIceberg may be using client-provided credentials instead.

Solution:

  1. Remove s3.access-key-id and s3.secret-access-key from PyIceberg config
  2. Verify warehouse has vending_strategy configured correctly
  3. Check Pangolin logs for credential vending requests

"Invalid security token" errors

Cause: Clock skew between Pangolin server and AWS.

Solution:

  1. Sync server time with NTP: sudo ntpdate -s time.nist.gov
  2. Verify server timezone is set correctly
  3. Check CloudTrail for timestamp-related errors
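
Before changing the server's time configuration, it can help to measure the skew. HTTP responses carry a `Date` header, so comparing that header from any AWS endpoint with the local clock gives a rough estimate. A minimal sketch of the comparison follows; `clock_skew_seconds` is a hypothetical helper, and fetching the header is left to whichever HTTP client is available.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def clock_skew_seconds(date_header: str, local_now: datetime) -> float:
    """Seconds by which the local clock is ahead of the server.

    A negative result means the local clock is behind. `local_now` must be
    timezone-aware, for example datetime.now(timezone.utc).
    """
    server_time = parsedate_to_datetime(date_header)
    return (local_now - server_time).total_seconds()


# Example with fixed values: the local clock is 30 seconds ahead.
skew = clock_skew_seconds(
    "Mon, 01 Jan 2024 12:00:00 GMT",
    datetime(2024, 1, 1, 12, 0, 30, tzinfo=timezone.utc),
)
```

A skew of a few seconds is normal network latency; a skew of several minutes points to the clock problem described above.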

Cross-account access not working

Cause: Trust policy or permissions issue.

Solution:

  1. Verify trust policy allows Pangolin's role to assume the target role
  2. Check both the trust policy and the permissions policy
  3. Test with AWS CLI: aws sts assume-role --role-arn <target-role> --role-session-name test

Advanced Configuration

Custom Session Duration

# Set custom session duration (in seconds)
PANGOLIN_STS_SESSION_DURATION=7200  # 2 hours

External ID for Cross-Account Access

For enhanced security in cross-account scenarios, require an external ID in the trust policy of the data access role:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::PANGOLIN-ACCOUNT:role/PangolinService"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": "unique-external-id-12345"
        }
      }
    }
  ]
}

Then set the same value in Pangolin's environment:

PANGOLIN_STS_EXTERNAL_ID=unique-external-id-12345
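
The external ID in the trust policy and the value given to Pangolin must match exactly. Generating the policy from the same value avoids copy-and-paste drift. This is a sketch with a hypothetical helper, `trust_policy`, that produces either form of trust policy shown in this document.

```python
import json
from typing import Optional


def trust_policy(principal_arn: str, external_id: Optional[str] = None) -> dict:
    """Build a trust policy, adding the ExternalId condition when one is given."""
    statement = {
        "Effect": "Allow",
        "Principal": {"AWS": principal_arn},
        "Action": "sts:AssumeRole",
    }
    if external_id is not None:
        statement["Condition"] = {
            "StringEquals": {"sts:ExternalId": external_id}
        }
    return {"Version": "2012-10-17", "Statement": [statement]}


external_id = "unique-external-id-12345"  # same value as PANGOLIN_STS_EXTERNAL_ID
policy_json = json.dumps(
    trust_policy("arn:aws:iam::PANGOLIN-ACCOUNT:role/PangolinService", external_id),
    indent=2,
)
```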

Regional Endpoints

For better performance, use regional STS endpoints:

AWS_STS_REGIONAL_ENDPOINTS=regional
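
With this setting, AWS SDKs that honor it send STS calls to the endpoint of the configured region instead of the global one. The helper below is a hypothetical illustration, not something Pangolin exposes. It shows the hostname pattern for standard AWS regions; the China regions use a different domain suffix and are not covered.

```python
from typing import Optional


def sts_endpoint(region: Optional[str] = None) -> str:
    """Return the regional STS endpoint, or the global one if no region is given."""
    if not region:
        return "https://sts.amazonaws.com"
    return f"https://sts.{region}.amazonaws.com"
```

Pairing the setting with AWS_REGION=us-east-1 from the environment section would direct STS traffic to `sts_endpoint("us-east-1")`.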

Related Documentation