From 84825bee15337958d5942be895501e336c57e95a Mon Sep 17 00:00:00 2001
From: Muhamad Sazwan Bin Ismail
Date: Thu, 6 Nov 2025 10:28:39 +0800
Subject: [PATCH] Add storage.cloud documentation and example scripts
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Added documentation and example scripts for Google Cloud Storage usage, including quickstart guides, data merging, and CORS configuration.

````markdown name=README.md
# storage.cloud — Google Cloud Storage (GCS) docs & quickstart

storage.cloud is a compact documentation and example repo with copy-pastable commands and small scripts for common Google Cloud Storage tasks:

- Authentication (gcloud ADC, service accounts, access tokens)
- Browser access vs programmatic access (storage.cloud.google.com vs the API)
- Signed URLs and sharing
- CORS configuration for browser clients
- Practical patterns for merging many objects (CSV/text) in GCS
- Load patterns for BigQuery and recommendations for large datasets

Repository layout
- index.html — simple landing page / site overview
- docs/
  - quickstart.md — commands & short how-tos
  - merge-data.md — quick guide for combining files
  - signed-urls.md — signing approaches and examples
- examples/
  - merge_csv_gcs.py — small Python script to merge CSVs in a GCS prefix
- cors.json — CORS example
- LICENSE — suggested license (MIT)

Quickstart highlights
- Authenticate locally (ADC):
  gcloud auth application-default login
- Get an access token for the Authorization header:
  gcloud auth print-access-token
- List buckets:
  gsutil ls gs://
- Copy to/from GCS:
  gsutil cp ./local-file gs://BUCKET/OBJECT
  gsutil cp gs://BUCKET/OBJECT ./local-file
- Create a signed URL:
  gsutil signurl -d 1h /path/to/key.json gs://BUCKET/OBJECT

Security reminders
- Use service accounts with least privilege in production.
- Signed URLs grant access to anyone who has the URL until they expire.
- Avoid making sensitive objects public; prefer signed URLs or IAM roles.
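For context, the token printed by `gcloud auth print-access-token` is an ordinary bearer token for the GCS JSON API. A minimal sketch of how a client builds the download request (the helper name `object_download_request` is illustrative, not part of this repo):

```python
from urllib.parse import quote


def object_download_request(bucket: str, object_name: str, token: str):
    """Build the GCS JSON API URL and headers to download one object."""
    # Object names must be percent-encoded in the path, including slashes.
    url = (
        "https://storage.googleapis.com/storage/v1/b/"
        f"{quote(bucket, safe='')}/o/{quote(object_name, safe='')}?alt=media"
    )
    headers = {"Authorization": f"Bearer {token}"}
    return url, headers


url, headers = object_download_request("my-bucket", "data/file.csv", "ya29.EXAMPLE")
print(url)  # https://storage.googleapis.com/storage/v1/b/my-bucket/o/data%2Ffile.csv?alt=media
```

The same URL works with curl: `curl -H "Authorization: Bearer $(gcloud auth print-access-token)" "<url>"`.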
Contributing
- Send PRs or issues with improvements.
- Keep examples minimal and targeted.
- Include tests or usage examples when adding scripts.

If you want, I can:
- Expand any doc into a full tutorial (step-by-step with screenshots).
- Add a shell script to compose >32 objects safely.
- Create a Dataflow/Beam starter pipeline for large-scale merges.
````

```html name=index.html
<title>storage.cloud — Google Cloud Storage quickstart</title>
<h1>storage.cloud</h1>
<p>Quick reference and starter content for Google Cloud Storage (GCS)</p>

<h2>Overview</h2>
<p>storage.cloud collects short guides, command snippets, and small reusable scripts for common GCS workflows:</p>
<ul>
  <li>Authenticated access (gcloud, service accounts, access tokens)</li>
  <li>Signed URLs for temporary public access</li>
  <li>CORS configuration for browser-based apps</li>
  <li>Merging objects (CSV/text) and loading to BigQuery</li>
</ul>

<h2>Helpful links</h2>

<h2>Get started</h2>
<p>Open docs/quickstart.md for commands and brief examples you can run locally. For combining files, see docs/merge-data.md and examples/merge_csv_gcs.py.</p>
```

````markdown name=docs/quickstart.md
# Quickstart — Google Cloud Storage (GCS)

This page collects the most-used commands and short examples for getting started with GCS.

Prerequisites
- Install the Google Cloud SDK (gcloud, gsutil)
- For programmatic examples, install the relevant client libraries (Python: google-cloud-storage)

Authentication
- Local developer (Application Default Credentials, ADC):
  gcloud auth application-default login
- Service account (recommended for server apps):
  1. Create:
     gcloud iam service-accounts create my-sa --display-name="My SA"
  2. Grant roles (example: objectViewer):
     gcloud projects add-iam-policy-binding PROJECT_ID \
       --member="serviceAccount:my-sa@PROJECT_ID.iam.gserviceaccount.com" \
       --role="roles/storage.objectViewer"
  3. (Optional) Create a key for local use:
     gcloud iam service-accounts keys create key.json \
       --iam-account=my-sa@PROJECT_ID.iam.gserviceaccount.com

Access tokens
- Print an access token for HTTP Authorization:
  gcloud auth print-access-token
- Use it as a header: Authorization: Bearer <ACCESS_TOKEN>

Common gsutil commands
- List buckets:
  gsutil ls gs://
- List objects in a bucket/prefix:
  gsutil ls gs://BUCKET/PREFIX/
- Download an object:
  gsutil cp gs://BUCKET/OBJECT ./local-file
- Upload a file:
  gsutil cp ./local-file gs://BUCKET/OBJECT
- Make an object public (use sparingly):
  gsutil acl ch -u AllUsers:R gs://BUCKET/OBJECT

Signed URLs
- Short way to create an expiring URL using a service account key:
  gsutil signurl -d 1h /path/to/key.json gs://BUCKET/OBJECT
- Notes:
  - V4 signed URLs are supported; the maximum expiry is 7 days.
  - Signed URLs allow access without a Google account.
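The 7-day cap on V4 signed URLs is worth validating before signing anything. A minimal sketch of such a guard (the helper name `check_expiry` is an illustrative assumption, not part of the GCS API):

```python
from datetime import timedelta

MAX_V4_EXPIRY = timedelta(days=7)  # hard limit for V4 signed URLs


def check_expiry(expiry: timedelta) -> timedelta:
    """Validate a requested expiry against the V4 signed-URL limit."""
    if expiry <= timedelta(0):
        raise ValueError("expiry must be positive")
    if expiry > MAX_V4_EXPIRY:
        raise ValueError("V4 signed URLs cannot exceed 7 days")
    return expiry


print(check_expiry(timedelta(hours=1)))  # 1:00:00
```

With google-cloud-storage, the validated timedelta would typically be passed as the `expiration` argument of `blob.generate_signed_url(version="v4", ...)`.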
CORS (for browser clients)
- Example file: cors.json (in repo)
- Apply:
  gsutil cors set cors.json gs://BUCKET

BigQuery ingestion
- BigQuery accepts wildcards, so you can load many CSVs without pre-merging:
  bq load --autodetect --source_format=CSV dataset.table gs://BUCKET/PATH/*.csv

Troubleshooting
- Permission denied: check IAM roles (roles/storage.objectViewer or a custom role).
- Invalid credentials: re-run `gcloud auth application-default login` or refresh service account keys.
- CORS errors: ensure the bucket's CORS config includes your domain and the required methods/headers.

Security recommendations
- Use service accounts with least privilege.
- Prefer IAM + uniform bucket-level access over ACLs.
- Do not embed long-lived keys in client-side code; use signed URLs for browser access.

Further reading
- GCS docs: https://cloud.google.com/storage/docs
- Signed URLs: https://cloud.google.com/storage/docs/access-control/signed-urls
- gsutil reference: https://cloud.google.com/storage/docs/gsutil
````

````markdown name=docs/merge-data.md
# Gabungkan Semua Data (Combine All Data)

A short guide to combining files stored in Google Cloud Storage.

Before you start
- Make sure you have access to the bucket (roles/storage.objectViewer, or storage.objectAdmin for writes).
- For large datasets (GB/TB), consider Dataflow/Dataproc or loading directly into BigQuery.

Common options

1) Quick merge (small/medium files)
- If the data is small enough to fit in memory:
  gsutil cat gs://BUCKET/PATH/*.csv | gsutil cp - gs://BUCKET/PATH/combined.csv
- Risks: memory and network. Use this only for small sets.

2) gsutil compose (combine objects in GCS without downloading)
- Combines up to 32 objects per operation:
  gsutil compose gs://BUCKET/part1.csv gs://BUCKET/part2.csv gs://BUCKET/combined.csv
- For more than 32 objects, compose in stages (tree compose).
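The staged (tree) compose for more than 32 objects can be worked out as pure planning logic before any gsutil calls are made. A sketch under stated assumptions (the function name `plan_compose` and the `tmp/` naming are illustrative):

```python
def plan_compose(objects, limit=32):
    """Plan staged compose rounds: each round merges groups of up to `limit`
    sources into one intermediate object, repeating until one object remains.
    Returns a list of rounds; each round is a list of (sources, target) pairs."""
    rounds = []
    level = 0
    current = list(objects)
    while len(current) > 1:
        groups = [current[i:i + limit] for i in range(0, len(current), limit)]
        step = [(g, f"tmp/compose-{level}-{n}") for n, g in enumerate(groups)]
        rounds.append(step)
        current = [target for _, target in step]  # intermediates feed the next round
        level += 1
    return rounds


# 70 parts: round 1 builds 3 intermediates (32 + 32 + 6), round 2 builds the final object.
parts = [f"part-{i:04d}.csv" for i in range(70)]
plan = plan_compose(parts)
print(len(plan))               # 2
print([len(r) for r in plan])  # [3, 1]
```

Each `(sources, target)` pair maps directly to one `gsutil compose SOURCES... TARGET` invocation.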
Note: compose concatenates bytes; make sure each object ends with a suitable newline and that you do not concatenate duplicate headers.

3) Load directly into BigQuery (recommended for analytics)
- BigQuery can read wildcard CSVs:
  bq load --autodetect --source_format=CSV dataset.table gs://BUCKET/PATH/*.csv
- Advantages: scale, no pre-merging, schema handling.

4) Custom script (Python example)
- Advantage: full control (drop duplicate headers, normalize data).
- See `examples/merge_csv_gcs.py` for an example.

Example strategy for >32 objects using gsutil compose (pattern)
- Split the object list into groups of 32, compose each group into a temporary object, then compose the temporary results (repeat until one final object remains).
- Or use a Dataflow pipeline for streaming and rewriting.

Best practices
- For final output that will be analyzed: prefer writing to BigQuery or Parquet (columnar) for cost and performance.
- Enable logging and object versioning if the data matters (object versioning).
- Use customer-managed encryption keys (CMEK) if required.

Getting a tailored script
- If you want one, provide:
  - the bucket name (example: gs://my-bucket/data/)
  - the prefix and file type (.csv)
  - the approximate size (GB/TB)
- A suitable automated script (bash or Python) can then be generated for you.
````

````markdown name=docs/signed-urls.md
# Signed URLs — Quick reference

Signed URLs grant temporary access to an object without requiring the requester to have a Google account.

Key points
- Signed URLs carry authentication in their query parameters.
- Anyone with the URL can access the object until it expires.
- The maximum expiry for V4 signed URLs is 7 days.

Create with gsutil (using a service account key file)
  gsutil signurl -d 1h /path/to/key.json gs://BUCKET/OBJECT

Create programmatically (Python example sketch)
- Use the google-auth library to sign and generate a V4 URL, or use google-cloud-storage's blob.generate_signed_url method.
- Prefer service account credentials with least privilege.

Security tips
- Monitor signed-URL usage where possible (Cloud Storage logs).
- A shorter expiry reduces risk.
- Do not embed long-lived private keys in public repositories.

When to use
- Temporary downloads for users without Google accounts.
- Browser uploads (PUT) when combined with appropriate CORS settings.
````

```python name=examples/merge_csv_gcs.py
#!/usr/bin/env python3
"""
examples/merge_csv_gcs.py

Merge CSV files in a GCS prefix into one CSV while keeping only the first header.

Requirements:
    pip install google-cloud-storage
Usage:
    export GOOGLE_APPLICATION_CREDENTIALS="/path/to/key.json"
    python3 examples/merge_csv_gcs.py my-bucket data/prefix/ output/combined.csv
"""
import sys
import csv
from io import StringIO

from google.cloud import storage


def merge_csvs(bucket_name, prefix, output_blob_name):
    client = storage.Client()
    bucket = client.bucket(bucket_name)
    # Blob objects are not orderable; sort by name for a deterministic merge order.
    blobs = sorted(
        (b for b in bucket.list_blobs(prefix=prefix) if b.name.endswith('.csv')),
        key=lambda b: b.name,
    )
    if not blobs:
        print("No CSV files found with prefix:", prefix)
        return

    out_buf = StringIO()
    writer = csv.writer(out_buf)
    first = True
    for blob in blobs:
        print("Reading:", blob.name)
        # download_as_text() loads the entire object into memory;
        # use a streaming download if the files are large.
        data = blob.download_as_text()
        reader = csv.reader(StringIO(data))
        header = next(reader, None)
        if first and header:
            writer.writerow(header)
        first = False
        # next() above already consumed the header, so only data rows remain.
        for row in reader:
            writer.writerow(row)

    out_blob = bucket.blob(output_blob_name)
    out_blob.upload_from_string(out_buf.getvalue(), content_type='text/csv')
    print(f'Uploaded gs://{bucket_name}/{output_blob_name}')


if __name__ == "__main__":
    if len(sys.argv) != 4:
        print("Usage: merge_csv_gcs.py BUCKET PREFIX OUTPUT_BLOB")
        print("Example: merge_csv_gcs.py my-bucket data/ output/combined.csv")
        sys.exit(1)
    merge_csvs(sys.argv[1], sys.argv[2], sys.argv[3])
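# ---------------------------------------------------------------------------
# Local sanity check (a sketch, not required by the script above): the same
# first-header-only rule as merge_csvs, applied to in-memory CSV text so the
# logic can be exercised without GCS credentials. The helper name
# _merge_texts is illustrative.
def _merge_texts(csv_texts):
    out = StringIO()
    writer = csv.writer(out)
    for i, text in enumerate(csv_texts):
        reader = csv.reader(StringIO(text))
        header = next(reader, None)
        if i == 0 and header:
            writer.writerow(header)  # keep the header from the first file only
        writer.writerows(reader)     # copy the remaining data rows
    return out.getvalue()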
```

```json name=cors.json
[
  {
    "origin": ["https://example.com"],
    "method": ["GET", "HEAD", "PUT", "POST"],
    "responseHeader": ["Content-Type", "x-goog-meta-custom"],
    "maxAgeSeconds": 3600
  }
]
```

```text name=LICENSE
MIT License

Copyright (c) 2025

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction...

(Replace with the full MIT text or your preferred license.)
```

Would you like me to:
- generate a shell helper to compose >32 objects safely (tree compose), or
- produce a Dataflow/Beam starter for large merges, or
- create a translated Malay full-site version?

Tell me which follow-up and I will produce the exact script or pipeline.
---
 storage.cloud | 247 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 247 insertions(+)
 create mode 100644 storage.cloud