
Warehouse Storage

Warehouses in Pangolin define where your actual data files are stored (Parquet, Avro, ORC files), separate from the catalog metadata.

Warehouse vs Backend Storage

It's important to understand the distinction:

| Component | Purpose | Storage | Examples |
| --- | --- | --- | --- |
| Backend Storage | Catalog metadata | PostgreSQL, MongoDB, SQLite | Table schemas, partitions, snapshots |
| Warehouse Storage | Actual data files | S3, Azure Blob, GCS | Parquet files, metadata.json |
┌─────────────────────────────────────┐
│   Pangolin Catalog (Backend)        │
│   - Table schemas                   │
│   - Partition info                  │
│   - Snapshot metadata               │
│   Stored in: PostgreSQL/Mongo/SQLite│
└──────────────┬──────────────────────┘
               │ Points to
               ▼
┌─────────────────────────────────────┐
│   Warehouse (Object Storage)        │
│   - Parquet data files              │
│   - Iceberg metadata files          │
│   - Manifest files                  │
│   Stored in: S3/Azure/GCS           │
└─────────────────────────────────────┘

Warehouse Concept

A warehouse in Pangolin is a named configuration that specifies:

  • Storage type: S3, Azure Blob Storage, or Google Cloud Storage
  • Location: Bucket/container and path prefix
  • Credentials: How to authenticate (static credentials or STS/IAM roles)
  • Region: Geographic location of storage

Tip

Flat Key Support: As of v0.1.0, storage_config supports flat keys (e.g., "s3.bucket": "mybucket") as an alternative to nested objects. This is often easier to pass via CLI or environment-driven scripts.

Example Warehouse (Flat Keys)

{
  "name": "production-s3",
  "storage_config": {
    "s3.bucket": "my-company-datalake",
    "s3.region": "us-east-1"
  },
  "vending_strategy": {
    "type": "AwsSts",
    "role_arn": "arn:aws:iam::123456789:role/PangolinDataAccess"
  }
}

Warehouse Patterns

Pattern 1: Warehouse Attached to Catalog

The catalog configuration includes a warehouse reference:

{
  "name": "analytics",
  "type": "local",
  "warehouse": "production-s3",
  "properties": {}
}

Benefits:

  • Centralized credential management
  • Consistent storage configuration
  • Automatic credential vending to clients
  • Easier to manage and audit

Client Configuration: Minimal - Pangolin vends credentials automatically

Pattern 2: Catalog Without Warehouse

The catalog has no warehouse attached:

{
  "name": "analytics",
  "type": "local",
  "warehouse": null,
  "properties": {}
}

Benefits:

  • Clients control their own storage access
  • Flexible for multi-cloud scenarios
  • Useful when clients have their own credentials

Client Configuration: Clients must configure storage themselves
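
For example, a minimal PyIceberg sketch where the client supplies its own S3 credentials through standard FileIO properties (the endpoint and key values are placeholders):

from pyiceberg.catalog import load_catalog

# Client-managed storage access: credentials come from the client, not Pangolin
catalog = load_catalog(
    "pangolin",
    **{
        "uri": "http://localhost:8080/api/v1/catalogs/analytics",
        "s3.region": "us-east-1",
        "s3.access-key-id": "AKIAIOSFODNN7EXAMPLE",
        "s3.secret-access-key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
    }
)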

Authentication Methods

VendingStrategy Enum

Pangolin uses the vending_strategy field to configure credential vending. The use_sts field is deprecated but kept for backward compatibility.

Available Strategies:

1. AwsSts - AWS STS Temporary Credentials (Recommended)

{
  "name": "prod-s3",
  "storage_config": {
    "bucket": "my-datalake",
    "region": "us-east-1",
    "s3.role-arn": "arn:aws:iam::123456789:role/PangolinDataAccess"
  },
  "vending_strategy": {
    "type": "AwsSts",
    "role_arn": "arn:aws:iam::123456789:role/PangolinDataAccess"
  }
}

Note

STS Vending requires PANGOLIN_STS_ROLE_ARN server configuration.
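
For illustration, with STS vending the catalog returns short-lived credentials in the load-table response config, and REST clients such as PyIceberg apply them to their FileIO automatically. A sketch of the typical shape, following Iceberg REST property conventions (all values hypothetical):

# Hypothetical vended credentials as they appear in a load-table response
vended_config = {
    "s3.access-key-id": "ASIA...",      # temporary STS access key
    "s3.secret-access-key": "...",
    "s3.session-token": "...",          # expires with the STS session
    "s3.region": "us-east-1",
}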

2. AwsStatic - AWS Static Credentials

{
  "name": "dev-s3",
  "storage_config": {
    "bucket": "my-dev-datalake",
    "region": "us-east-1",
    "s3.access-key-id": "AKIAIOSFODNN7EXAMPLE",
    "s3.secret-access-key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
  },
  "vending_strategy": "AwsStatic"
}

Pros: Simple, works everywhere (development)
Cons: Less secure, credentials don't expire

3. Client Provided (No Vending)

{
  "name": "client-provided",
  "storage_config": {
    "bucket": "my-datalake"
  },
  "vending_strategy": "None"
}

Clients must provide their own credentials via environment variables or Spark/Iceberg config.
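
Alternatively, a sketch of the environment-variable route: PyIceberg's S3 FileIO falls back to the default AWS credential chain when no explicit keys are configured (values hypothetical):

import os

from pyiceberg.catalog import load_catalog

# Standard AWS environment variables, picked up by the default credential chain
os.environ["AWS_ACCESS_KEY_ID"] = "AKIAIOSFODNN7EXAMPLE"
os.environ["AWS_SECRET_ACCESS_KEY"] = "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
os.environ["AWS_DEFAULT_REGION"] = "us-east-1"

catalog = load_catalog(
    "pangolin",
    uri="http://localhost:8080/api/v1/catalogs/analytics",
)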

Deprecated: use_sts Field

The use_sts boolean field is deprecated. Use vending_strategy instead.

Old Format (Deprecated):

{
  "use_sts": true,
  "role_arn": "arn:aws:iam::123:role/Access"
}

New Format (Current):

{
  "vending_strategy": {
    "type": "AwsSts",
    "role_arn": "arn:aws:iam::123:role/Access",
    "external_id": null
  }
}

Supported Storage Types

| Storage | Status | Best For |
| --- | --- | --- |
| AWS S3 | ✅ Production | Most common, excellent performance |
| Azure Blob | ✅ Production | Azure-native deployments |
| Google Cloud Storage | ✅ Production | GCP-native deployments |
| Local Filesystem | ⚠️ Dev/Test | Local development & testing |

Quick Start

1. Create a Warehouse

curl -X POST http://localhost:8080/api/v1/warehouses \
  -H "X-Pangolin-Tenant: my-tenant" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "production-s3",
    "storage_config": {
      "bucket": "my-datalake",
      "region": "us-east-1"
    },
    "vending_strategy": {
      "type": "AwsSts",
      "role_arn": "arn:aws:iam::123456789:role/DataAccess",
      "external_id": null
    }
  }'
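
The same call from Python, assuming the requests library; the payload mirrors the curl example above, and the catalog-creation call in step 2 follows the same pattern:

import requests

# Create the warehouse programmatically (same endpoint and payload as the curl call)
resp = requests.post(
    "http://localhost:8080/api/v1/warehouses",
    headers={"X-Pangolin-Tenant": "my-tenant"},
    json={
        "name": "production-s3",
        "storage_config": {"bucket": "my-datalake", "region": "us-east-1"},
        "vending_strategy": {
            "type": "AwsSts",
            "role_arn": "arn:aws:iam::123456789:role/DataAccess",
            "external_id": None,
        },
    },
)
resp.raise_for_status()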

2. Create a Catalog with Warehouse

curl -X POST http://localhost:8080/api/v1/catalogs \
  -H "X-Pangolin-Tenant: my-tenant" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "analytics",
    "type": "local",
    "warehouse": "production-s3"
  }'

3. Use from PyIceberg

from pyiceberg.catalog import load_catalog
from pyiceberg.schema import Schema
from pyiceberg.types import LongType, NestedField, StringType

# Pangolin vends credentials automatically
catalog = load_catalog(
    "pangolin",
    **{
        "uri": "http://localhost:8080/api/v1/catalogs/analytics",
        "warehouse": "s3://my-datalake/analytics/"
    }
)

# A minimal example schema
schema = Schema(
    NestedField(1, "id", LongType(), required=False),
    NestedField(2, "name", StringType(), required=False),
)

# Create the namespace, then the table - Pangolin handles storage access
catalog.create_namespace("db")
catalog.create_table(
    "db.table",
    schema=schema
)
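
Once the table exists, reads and writes go straight to the warehouse using the vended credentials. A small append sketch, assuming pyiceberg >= 0.6 with pyarrow installed:

import pyarrow as pa

# Append a batch; PyIceberg writes Parquet data files to the warehouse
table = catalog.load_table("db.table")
table.append(pa.table({"id": [1, 2], "name": ["a", "b"]}))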

Client Configuration

With Warehouse (Recommended)

When a catalog has a warehouse attached, Pangolin automatically vends credentials to clients that request delegation via the X-Iceberg-Access-Delegation header.

PyIceberg: No storage configuration needed
PySpark: No storage configuration needed

Without Warehouse

When a catalog has no warehouse, clients must configure storage themselves.

PyIceberg: Configure S3/Azure/GCS credentials
PySpark: Configure Hadoop filesystem properties
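
For PySpark, a minimal sketch of client-managed storage using Hadoop S3A properties, assuming the Iceberg Spark runtime is on the classpath (catalog name, tenant, and credential values are hypothetical; the Iceberg REST header.* option forwards the tenant header):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("pangolin-client")
    # Iceberg REST catalog pointing at Pangolin
    .config("spark.sql.catalog.pangolin", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.pangolin.type", "rest")
    .config("spark.sql.catalog.pangolin.uri", "http://localhost:8080/api/v1/catalogs/analytics")
    .config("spark.sql.catalog.pangolin.header.X-Pangolin-Tenant", "my-tenant")
    # Client-supplied S3 access via Hadoop S3A (no credential vending)
    .config("spark.hadoop.fs.s3a.access.key", "AKIAIOSFODNN7EXAMPLE")
    .config("spark.hadoop.fs.s3a.secret.key", "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY")
    .getOrCreate()
)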

See the individual storage guides for details.

Best Practices

Security

  1. Use STS/IAM Roles: Prefer temporary credentials over static keys
  2. Least Privilege: Grant minimum required permissions
  3. Separate Warehouses: Use different warehouses for dev/staging/prod
  4. Audit Access: Enable CloudTrail/Azure Monitor/GCS audit logs

Performance

  1. Regional Colocation: Place warehouse in same region as compute
  2. Bucket Naming: Use descriptive, hierarchical names
  3. Lifecycle Policies: Archive old data to cheaper storage tiers
  4. Compression: Use Snappy or Zstd for Parquet files
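
For the compression item above, the codec is a standard Iceberg table property set per table; a PyIceberg sketch, reusing the catalog and schema from the Quick Start (table name hypothetical):

# Write Zstd-compressed Parquet via a standard Iceberg table property
catalog.create_table(
    "db.events",
    schema=schema,
    properties={"write.parquet.compression-codec": "zstd"},
)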

Organization

  1. Naming Convention: {environment}-{region}-{purpose}
    • Examples: prod-us-east-1-analytics, dev-eu-west-1-ml
  2. Path Structure: s3://bucket/{catalog}/{namespace}/{table}/
  3. Multi-Tenant: Use separate buckets or prefixes per tenant

Troubleshooting

Permission Denied

Error: Access Denied to s3://my-bucket/path/

Solutions:

  1. Check IAM role permissions
  2. Verify bucket policy
  3. Check STS assume role permissions
  4. Verify warehouse configuration

Credential Vending Not Working

Error: No credentials provided

Solutions:

  1. Ensure catalog has warehouse attached
  2. Check the warehouse vending_strategy setting (use_sts is deprecated)
  3. Verify IAM role ARN
  4. Check Pangolin server has permission to assume role
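
To verify item 4 independently of Pangolin, a quick boto3 sketch run with the server's AWS credentials (role ARN hypothetical):

import boto3

# Fails with AccessDenied if the caller is not allowed to assume the role
sts = boto3.client("sts")
resp = sts.assume_role(
    RoleArn="arn:aws:iam::123456789:role/PangolinDataAccess",
    RoleSessionName="pangolin-vending-check",
)
print("Temporary credentials expire at:", resp["Credentials"]["Expiration"])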

Slow Performance

Solutions:

  1. Check region - ensure compute and storage are colocated
  2. Enable S3 Transfer Acceleration
  3. Use larger instance types for compute
  4. Check network bandwidth

Next Steps