This project demonstrates end-to-end execution of MapReduce, Spark, and Dataproc Web UI tasks using Google Cloud Platform. It includes hands-on experiments, error handling, and working code samples, all structured for learning and reproducibility.
```
bigdata-dataproc-challenge/
├── Challenge-4.docx                # Combined documentation/report for all tasks
├── Task1-MapReduce/
│   ├── data/
│   │   ├── Table_A.xlsm
│   │   ├── Table_B.xlsm
│   │   └── .gitkeep
│   ├── src/
│   │   ├── mapper.py
│   │   ├── reducer.py
│   │   └── .gitkeep
│   └── README.md                   # Instructions and output summary for Task 1
├── Task2-Spark/
│   ├── data/
│   │   ├── Sample.xlsx             # Output sample of 1000 rows
│   │   └── .gitkeep
│   ├── src/
│   │   ├── random_sample_task2.py  # PySpark sampling code
│   │   └── .gitkeep
│   └── README.md                   # Step-by-step Task 2 explanation
└── Task3-WebUI/
    └── README.md                   # Task 3 explanation (Web UI)
```
- Wrote `mapper.py` and `reducer.py` to simulate a SQL JOIN of `Table_A` (students) and `Table_B` (courses) using Hadoop Streaming (see the sketch below).
- Applied filter: `DOB >= 1995-01-01`
- Ran the job both locally and on Dataproc, using:
  - `gsutil` to move files
  - `hadoop fs` to stage input
  - `hadoop-streaming.jar` to execute
- Output: records were successfully joined and filtered; the job ran with 3 reducers, but only 1 produced the final result.
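Below is a minimal sketch of how such a reduce-side join can be written for Hadoop Streaming. The actual scripts live in `Task1-MapReduce/src/`; the CSV column layout (join key in column 0, DOB as the third column of `Table_A`) and the source-file tagging are assumptions for illustration, not taken from the real code.

```python
#!/usr/bin/env python3
# mapper.py -- minimal sketch; see Task1-MapReduce/src/ for the real script.
# Assumes CSV input with the join key in column 0.
import os
import sys

# Hadoop Streaming exposes the current input file via this env var,
# letting the mapper tag each record with its source table.
source = os.environ.get("mapreduce_map_input_file", "")
tag = "A" if "Table_A" in source else "B"

for line in sys.stdin:
    fields = line.strip().split(",")
    if not fields or not fields[0]:
        continue
    # Emit: join_key \t table_tag \t remaining_fields
    print(f"{fields[0]}\t{tag}\t{','.join(fields[1:])}")
```

```python
#!/usr/bin/env python3
# reducer.py -- minimal reduce-side join sketch with the DOB >= 1995-01-01
# filter. DOB is assumed to be the second field of the Table_A payload
# (i.e. the original third CSV column).
import sys

def flush(key, a_rows, b_rows):
    # Join every Table_A row with every Table_B row sharing the key,
    # keeping only students born on or after 1995-01-01.
    for a in a_rows:
        dob = a.split(",")[1]
        if dob >= "1995-01-01":  # ISO dates compare correctly as strings
            for b in b_rows:
                print(f"{key}\t{a}\t{b}")

current, a_rows, b_rows = None, [], []
for line in sys.stdin:
    key, tag, payload = line.rstrip("\n").split("\t", 2)
    if key != current:
        if current is not None:
            flush(current, a_rows, b_rows)
        current, a_rows, b_rows = key, [], []
    (a_rows if tag == "A" else b_rows).append(payload)
if current is not None:
    flush(current, a_rows, b_rows)
```

Because Hadoop Streaming delivers keys to each reducer in sorted, contiguous groups, the reducer only needs to detect key boundaries to buffer and join the two sides.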
More details in `Task1-MapReduce/README.md`.
- Generated a 4.5 GB synthetic dataset using a Python script.
- Tested random sampling locally in RStudio using `data.table::fread()` + `sample(.N, 100)`.
- Created `random_sample_task2.py` using Gen AI (a sketch of its approach follows this list).
- Ran Spark jobs using:
  - a persistent Dataproc cluster
  - an ephemeral (serverless) cluster
  - SSH into the cluster → Jupyter → Spark + HDFS
- Output: 100 sampled records saved and shared as `Sample.xlsx`.
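A minimal sketch of what `random_sample_task2.py` might look like; the GCS paths, CSV format, and seed are illustrative assumptions, not the project's actual values:

```python
# Random-sampling sketch in PySpark -- a minimal stand-in for
# src/random_sample_task2.py; paths below are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.functions import rand

spark = SparkSession.builder.appName("random-sample").getOrCreate()

# Read the large synthetic dataset (path is a placeholder).
df = spark.read.csv("gs://my-bucket/data/synthetic.csv", header=True)

# One simple way to draw a fixed-size random sample: assign each row
# a random sort key, then keep the first 100 rows.
sample = df.orderBy(rand(seed=42)).limit(100)

# Coalesce to one partition so the result lands in a single output file.
sample.coalesce(1).write.mode("overwrite").csv(
    "gs://my-bucket/output/sample", header=True
)

spark.stop()
```

Sorting by `rand()` shuffles the full dataset, which is simple but heavyweight; for very large inputs, `df.sample(fraction=...)` followed by `limit(100)` avoids the full sort.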
See `Task2-Spark/README.md` for scripts and terminal outputs.
- Reused the Task 1 mapper/reducer scripts.
- Initially faced issues with JAR paths and GCS output permissions.
- Debugged via the `gcloud` CLI before succeeding in the Dataproc Web UI (an example invocation follows this list).
- Compared CLI vs Web UI runtimes (1.23 s vs 1.17 s).
- Learned:
  - PySpark jobs are easier to manage due to better integration.
  - Hadoop JARs require more setup (permissions, bucket prep, GCS paths).
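A typical `gcloud` invocation for the streaming job might look like the sketch below; the cluster name, region, and bucket are placeholders, while the streaming-jar path matches the one used in Task 1:

```bash
# Hypothetical CLI submission of the Task 1 streaming job.
# Cluster name, region, and gs:// paths are placeholders.
gcloud dataproc jobs submit hadoop \
  --cluster=my-cluster \
  --region=us-central1 \
  --jar=file:///usr/lib/hadoop/hadoop-streaming.jar \
  -- \
  -files gs://my-bucket/src/mapper.py,gs://my-bucket/src/reducer.py \
  -mapper mapper.py \
  -reducer reducer.py \
  -input gs://my-bucket/input/ \
  -output gs://my-bucket/output/join-result
```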
Explained in `Task3-WebUI/README.md`.
Each task folder contains a `README.md` with commands, screenshots, and helpful notes. Start from any of the following:
```bash
# Task 1: Hadoop MapReduce
cd Task1-MapReduce/
hadoop jar /usr/lib/hadoop/hadoop-streaming.jar -files src/mapper.py,src/reducer.py ...

# Task 2: PySpark
cd Task2-Spark/
gcloud dataproc jobs submit pyspark src/random_sample_task2.py ...

# Task 3: Use Dataproc Web UI
# 1. Go to https://console.cloud.google.com/dataproc
# 2. Choose "Submit Job"
# 3. Upload mapper, reducer, and dataset
```
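For the ephemeral run in Task 2, Dataproc Serverless accepts the same script as a batch job; the region and staging bucket below are placeholder assumptions:

```bash
# Hypothetical serverless (Dataproc Batches) submission of the Task 2 script.
gcloud dataproc batches submit pyspark src/random_sample_task2.py \
  --region=us-central1 \
  --deps-bucket=gs://my-bucket
```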
## Documentation
All steps, screenshots, and results are consolidated in `Challenge-4.docx`.