Add storage.cloud documentation and example scripts#241

Open
Sazwanismail wants to merge 1 commit into Visual-Studio-Code:main from Sazwanismail:Sazwanismail-patch-1
Conversation

@Sazwanismail

Added documentation and example scripts for Google Cloud Storage usage, including quickstart guides, data merging, and CORS configuration.

````markdown name=README.md
# storage.cloud — Google Cloud Storage (GCS) docs & quickstart

storage.cloud is a compact documentation and example repo with copy-pastable commands and small scripts for common Google Cloud Storage tasks:

- Authentication (gcloud ADC, service accounts, access tokens)
- Browser access vs programmatic access (storage.cloud.google.com vs API)
- Signed URLs and sharing
- CORS configuration for browser clients
- Practical patterns to merge many objects (CSV/text) in GCS
- Load patterns for BigQuery and recommendations for large datasets

Repository layout
- index.html — simple landing page / site overview
- docs/
  - quickstart.md — commands & short how-tos
  - merge-data.md — quick guide for combining files
  - signed-urls.md — signing approaches and examples
- examples/
  - merge_csv_gcs.py — small Python script to merge CSVs in a GCS prefix
- cors.json — CORS example
- LICENSE — suggested license (MIT)

Quickstart highlights
- Authenticate locally (ADC):
  gcloud auth application-default login

- Get an access token for Authorization header:
  gcloud auth print-access-token

- List buckets:
  gsutil ls gs://

- Copy to/from GCS:
  gsutil cp ./local-file gs://BUCKET/OBJECT
  gsutil cp gs://BUCKET/OBJECT ./local-file

- Create a signed URL:
  gsutil signurl -d 1h /path/to/key.json gs://BUCKET/OBJECT

Security reminders
- Use service accounts with least privilege in production.
- Signed URLs grant access to anyone with the URL until expiry.
- Avoid making sensitive objects public; prefer signed URLs or IAM roles.

Contributing
- Send PRs or issues with improvements.
- Keep examples minimal and targeted.
- Include tests or usage examples when adding scripts.

Possible extensions
- Expand any doc into a full tutorial (step-by-step with screenshots).
- Add a shell script to compose >32 objects safely.
- Create a Dataflow/Beam starter pipeline for large-scale merges.

````

```html name=index.html
<!doctype html>
<html lang="en">
<head>
  <meta charset="utf-8"/>
  <title>storage.cloud — Google Cloud Storage quickstart</title>
  <meta name="viewport" content="width=device-width,initial-scale=1"/>
  <style>
    body { font-family: system-ui, -apple-system, Roboto, "Segoe UI", Arial; max-width: 980px; margin: 36px auto; line-height: 1.6; padding: 0 20px; color:#111 }
    pre{ background:#f6f8fa; padding:12px; overflow:auto; border-radius:6px; }
    header {border-bottom:1px solid #eee; margin-bottom:20px; padding-bottom:12px}
    h1{margin:0;padding:0}
    nav a{margin-right:12px;color:#1a73e8;text-decoration:none}
    small { color:#666 }
  </style>
</head>
<body>
  <header>
    <h1>storage.cloud</h1>
    <p><small>Quick reference and starter content for Google Cloud Storage (GCS)</small></p>
    <nav>
      <a href="docs/quickstart.md">Quickstart</a>
      <a href="docs/merge-data.md">Combine data</a>
      <a href="docs/signed-urls.md">Signed URLs</a>
      <a href="examples/merge_csv_gcs.py">Examples</a>
    </nav>
  </header>

  <main>
    <section>
      <h2>Overview</h2>
      <p>storage.cloud collects short guides, command snippets, and small reusable scripts for common GCS workflows:</p>
      <ul>
        <li>Authenticated access (gcloud, service accounts, access tokens)</li>
        <li>Signed URLs for temporary public access</li>
        <li>CORS configuration for browser-based apps</li>
        <li>Merging objects (CSV/text) and loading to BigQuery</li>
      </ul>
    </section>

    <section>
      <h2>Helpful links</h2>
      <ul>
        <li><a href="https://storage.cloud.google.com/">storage.cloud.google.com</a> — web UI (requires Google sign-in)</li>
        <li><a href="https://console.cloud.google.com/storage/browser">Cloud Console — Storage browser</a></li>
        <li><a href="https://cloud.google.com/storage/docs">GCS documentation</a></li>
      </ul>
    </section>

    <section>
      <h2>Get started</h2>
      <p>Open <code>docs/quickstart.md</code> for commands and brief examples you can run locally. For combining files, see <code>docs/merge-data.md</code> and <code>examples/merge_csv_gcs.py</code>.</p>
    </section>
  </main>
</body>
</html>
```

````markdown name=docs/quickstart.md
# Quickstart — Google Cloud Storage (GCS)

This page collects the most-used commands and short examples for getting started with GCS.

Prerequisites
- Install Google Cloud SDK (gcloud, gsutil)
- For programmatic examples, install the relevant client libraries (Python: google-cloud-storage)

Authentication
- Local developer (Application Default Credentials - ADC):
  gcloud auth application-default login

- Service account (recommended for server apps):
  1. Create:
     gcloud iam service-accounts create my-sa --display-name="My SA"
  2. Grant roles (example: objectViewer):
     gcloud projects add-iam-policy-binding PROJECT_ID \
       --member="serviceAccount:my-sa@PROJECT_ID.iam.gserviceaccount.com" \
       --role="roles/storage.objectViewer"
  3. (Optional) Create key for local use:
     gcloud iam service-accounts keys create key.json \
       --iam-account=my-sa@PROJECT_ID.iam.gserviceaccount.com

Access tokens
- Print an access token for HTTP Authorization:
  gcloud auth print-access-token
- Use as header: Authorization: Bearer <ACCESS_TOKEN>
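
For example, the token can authenticate a raw JSON API request. The sketch below is illustrative and not part of the repo (`object_media_url` and `download_object` are made-up names); note that the object name must be percent-encoded as a single path segment, slashes included:

```python
import subprocess
import urllib.parse
import urllib.request

def object_media_url(bucket: str, object_name: str) -> str:
    """JSON API media-download URL; slashes in the object name are encoded."""
    encoded = urllib.parse.quote(object_name, safe="")
    return f"https://storage.googleapis.com/storage/v1/b/{bucket}/o/{encoded}?alt=media"

def download_object(bucket: str, object_name: str) -> bytes:
    # Short-lived token from the local gcloud credentials (requires gcloud).
    token = subprocess.check_output(
        ["gcloud", "auth", "print-access-token"], text=True
    ).strip()
    req = urllib.request.Request(
        object_media_url(bucket, object_name),
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```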

Common gsutil commands
- List buckets:
  gsutil ls gs://

- List objects in a bucket/prefix:
  gsutil ls gs://BUCKET/PREFIX/

- Download object:
  gsutil cp gs://BUCKET/OBJECT ./local-file

- Upload file:
  gsutil cp ./local-file gs://BUCKET/OBJECT

- Make object public (use sparingly):
  gsutil acl ch -u AllUsers:R gs://BUCKET/OBJECT

Signed URLs
- Short way to create an expiring URL using a service account key:
  gsutil signurl -d 1h /path/to/key.json gs://BUCKET/OBJECT
- Notes:
  - V4 signed URLs are supported; maximum expiry is 7 days.
  - Signed URLs allow access without a Google account.

CORS (for browser clients)
- Example file: cors.json (in repo)
- Apply:
  gsutil cors set cors.json gs://BUCKET

BigQuery ingestion
- BigQuery accepts wildcards — you can load many CSVs without pre-merging:
  bq load --autodetect --source_format=CSV dataset.table gs://BUCKET/PATH/*.csv

Troubleshooting
- Permission denied: check IAM roles (roles/storage.objectViewer or a custom role).
- Invalid credentials: re-run `gcloud auth application-default login` or refresh service account keys.
- CORS errors: ensure bucket CORS includes your domain and required methods/headers.

Security recommendations
- Use service accounts with least privilege.
- Prefer IAM + uniform bucket-level access over ACLs.
- Do not embed long-lived keys in client-side code; use signed URLs for browser access.

Further reading
- GCS docs: https://cloud.google.com/storage/docs
- Signed URLs: https://cloud.google.com/storage/docs/access-control/signed-urls
- gsutil reference: https://cloud.google.com/storage/docs/gsutil
````

````markdown name=docs/merge-data.md
# Combine All Data (Gabungkan Semua Data)

A short guide to combining files stored in Google Cloud Storage.

Before you start
- Make sure you have access to the bucket (roles/storage.objectViewer, or roles/storage.objectAdmin for writes).
- For large datasets (GB/TB), consider Dataflow/Dataproc, or load directly into BigQuery.

Common options

1) Quick merge (small/medium files)
- If the data is small enough to fit in memory:
  gsutil cat gs://BUCKET/PATH/*.csv | gsutil cp - gs://BUCKET/PATH/combined.csv
- Risks: memory and network usage. Use this only for small sets.

2) gsutil compose (combine objects in GCS without downloading)
- Combines up to 32 objects per operation:
  gsutil compose gs://BUCKET/part1.csv gs://BUCKET/part2.csv gs://BUCKET/combined.csv
- For more than 32 objects, compose in stages (tree compose).
- Note: compose concatenates raw bytes; make sure each object ends with a newline and that you do not concatenate duplicate headers.

3) Load directly into BigQuery (recommended for analysis)
- BigQuery can read CSVs via a wildcard:
  bq load --autodetect --source_format=CSV dataset.table gs://BUCKET/PATH/*.csv
- Advantages: scale, no pre-merging, schema handling.

4) Custom script (Python example)
- Advantage: full control (drop duplicate headers, normalize data).
- See `examples/merge_csv_gcs.py` for an example.

Example strategy for more than 32 objects using gsutil compose (pattern)
- Split the object list into groups of 32, compose each group into a temporary object, then compose the temporary results (repeat until a single final object remains).
- Or use a Dataflow pipeline for streaming and rewriting.
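
The staged pattern can be planned with a small helper (illustrative, not part of the repo): each returned step corresponds to one `gsutil compose` invocation with at most 32 sources, and the temporary object names under `tmp_prefix` are an assumption.

```python
def plan_tree_compose(objects, batch_size=32, tmp_prefix="tmp/compose"):
    """Plan (sources, destination) compose steps that reduce `objects`
    to one final object, with at most `batch_size` sources per step."""
    steps = []
    layer = list(objects)
    round_no = 0
    while len(layer) > 1:
        next_layer = []
        for i in range(0, len(layer), batch_size):
            batch = layer[i:i + batch_size]
            if len(batch) == 1:
                # A lone leftover passes through to the next round unchanged.
                next_layer.append(batch[0])
                continue
            dest = f"{tmp_prefix}/round{round_no}/part{i // batch_size}"
            steps.append((batch, dest))
            next_layer.append(dest)
        layer = next_layer
        round_no += 1
    return steps

# 100 parts -> 4 first-round composes plus 1 final compose (5 steps).
steps = plan_tree_compose([f"part{i}.csv" for i in range(100)])
```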

Best practices
- For final output that will be analyzed: prefer writing to BigQuery or Parquet (columnar) for cost and performance.
- Enable logging and object versioning if the data is important.
- Use customer-managed encryption keys (CMEK) if required.

Using the script
- If you like, provide:
  - Bucket name (example: gs://my-bucket/data/)
  - Prefix and file type (.csv)
  - Approximate size (GB/TB)
  I will generate a suitable automated script (bash or Python) for you.
````

````markdown name=docs/signed-urls.md
# Signed URLs — Quick reference

Signed URLs grant temporary access to an object without requiring the requester to have a Google account.

Key points
- Signed URLs include authentication in query parameters.
- Anyone with the URL can access the object until expiry.
- Maximum expiry for V4 signed URLs is 7 days.
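
Because the 7-day cap is easy to exceed accidentally, a small guard (illustrative, not part of the repo) can validate a requested expiry before signing:

```python
from datetime import timedelta

V4_MAX_EXPIRY = timedelta(days=7)  # documented maximum for V4 signed URLs

def check_expiry(expiry: timedelta) -> timedelta:
    """Reject non-positive or over-limit expiries before generating a URL."""
    if expiry <= timedelta(0):
        raise ValueError("expiry must be positive")
    if expiry > V4_MAX_EXPIRY:
        raise ValueError(f"V4 signed URLs allow at most {V4_MAX_EXPIRY}")
    return expiry
```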

Create with gsutil (using service account key file)
  gsutil signurl -d 1h /path/to/key.json gs://BUCKET/OBJECT

Create programmatically (Python example sketch)
- Use google-auth library to sign and generate a V4 URL, or use google-cloud-storage's blob.generate_signed_url method.
- Prefer service account credentials with least privilege.

Security tips
- Monitor signed-URL usage where possible (Cloud Storage logs).
- Shorter expiry reduces risk.
- Do not embed long-lived private keys in public repositories.

When to use
- Temporary downloads for users without Google accounts.
- Browser uploads (PUT) when combined with appropriate CORS settings.
````

```python name=examples/merge_csv_gcs.py
#!/usr/bin/env python3
"""
examples/merge_csv_gcs.py

Merge CSV files in a GCS prefix into one CSV while keeping only the first header.

Requirements:
  pip install google-cloud-storage

Usage:
  export GOOGLE_APPLICATION_CREDENTIALS="/path/to/key.json"
  python3 examples/merge_csv_gcs.py my-bucket data/prefix/ output/combined.csv
"""
import sys
import csv
from io import StringIO
from google.cloud import storage

def merge_csvs(bucket_name, prefix, output_blob_name):
    client = storage.Client()
    bucket = client.bucket(bucket_name)
    blobs = sorted(
        (b for b in bucket.list_blobs(prefix=prefix) if b.name.endswith('.csv')),
        key=lambda b: b.name,  # Blob objects are not directly orderable
    )

    if not blobs:
        print("No CSV files found with prefix:", prefix)
        return

    out_buf = StringIO()
    writer = None
    first = True

    for blob in blobs:
        print("Reading:", blob.name)
        # Use streaming download if files are large: download_as_text() may load entire object
        data = blob.download_as_text()
        reader = csv.reader(StringIO(data))
        header = next(reader, None)
        if first:
            writer = csv.writer(out_buf)
            if header:
                writer.writerow(header)
            for row in reader:
                writer.writerow(row)
            first = False
        else:
            # skip header for subsequent files
            for row in reader:
                writer.writerow(row)

    out_blob = bucket.blob(output_blob_name)
    out_blob.upload_from_string(out_buf.getvalue(), content_type='text/csv')
    print(f'Uploaded gs://{bucket_name}/{output_blob_name}')

if __name__ == "__main__":
    if len(sys.argv) != 4:
        print("Usage: merge_csv_gcs.py BUCKET PREFIX OUTPUT_BLOB")
        print("Example: merge_csv_gcs.py my-bucket data/ output/combined.csv")
        sys.exit(1)
    merge_csvs(sys.argv[1], sys.argv[2], sys.argv[3])
```
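
The header-deduplication logic in the script can be checked locally without touching GCS; this small standalone sketch (`merge_csv_texts` is illustrative, not part of the repo) mirrors it on in-memory CSV text:

```python
import csv
from io import StringIO

def merge_csv_texts(texts):
    """Concatenate CSV documents, keeping only the first document's header."""
    out = StringIO()
    writer = csv.writer(out)
    for i, text in enumerate(texts):
        reader = csv.reader(StringIO(text))
        header = next(reader, None)  # consume each document's header row
        if i == 0 and header:
            writer.writerow(header)
        writer.writerows(reader)  # data rows from every document are kept
    return out.getvalue()

merged = merge_csv_texts(["a,b\n1,2\n", "a,b\n3,4\n"])
# one header row, then both data rows
```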

```json name=cors.json
[
  {
    "origin": ["https://example.com"],
    "method": ["GET", "HEAD", "PUT", "POST"],
    "responseHeader": ["Content-Type", "x-goog-meta-custom"],
    "maxAgeSeconds": 3600
  }
]
```

```text name=LICENSE
MIT License

Copyright (c) 2025

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction...
(You can replace with your full MIT text or preferred license.)
```

Possible follow-ups for a later PR:
- a shell helper to compose >32 objects safely (tree compose),
- a Dataflow/Beam starter for large merges, or
- a fully translated Malay version of the site.

Let me know which follow-up would be most useful.