Skip to content

Yurhigz/kafka-spark-realtime

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

13 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

📡 kafka-spark-realtime — Real-Time Clickstream Analytics Pipeline

A complete end-to-end pipeline for real-time analytics using Kafka + Spark Structured Streaming + PostgreSQL + Elasticsearch.


📌 Overview

It simulates and processes clickstream data in real time. It uses:

  • Kafka to ingest user click events
  • Spark Structured Streaming (via PySpark) for stream processing
  • PostgreSQL to store historical data
  • Elasticsearch + Kibana for search & visualization

Optional tools like Kafka Connect and Kafdrop are also included.


🧱 Architecture


\[Simulator (Go)]
↓
\[Kafka Topic: clicks]
↓
\[Spark Structured Streaming]
↓           ↘
\[PostgreSQL]   \[Elasticsearch]
↓
\[Kibana]


🚀 Features

  • ✅ Real-time event simulation (Go)
  • ✅ Kafka producer/consumer pipeline
  • ✅ Stream processing with PySpark
  • ✅ Sink to PostgreSQL and Elasticsearch
  • ✅ Docker Compose setup for full reproducibility
  • ✅ Extensible and production-oriented

🛠️ Tech Stack

Tool Role
Kafka Message broker for real-time events
Spark Stream processing engine (PySpark)
PostgreSQL Storage of historical events
Elasticsearch Real-time search and analytics
Kibana Visualization UI (optional)
Kafka Connect Sink connectors (optional)
Docker Compose Environment orchestration

📂 Project Structure


kafka-spark-realtime/
├── docker-compose.yml
├── simulator/
│   └── main.go
├── spark/
│   └── stream\_processor.py
├── sql/
│   └── create\_tables.sql
├── connectors/
│   ├── postgres-sink.json
│   └── elastic-sink.json
├── notebooks/
│   └── exploratory\_analysis.ipynb
└── README.md


▶️ Quick Start

  1. Clone the repo
git clone [https://github.com//kafka-spark-realtime.git](https://github.com/Yurhigz/kafka-spark-realtime.git)
cd kafka-spark-realtime
  1. Launch the environment
docker-compose up -d

2.bis Setup C Bindings

Installer le compilateur GCC :

sudo apt-get update
sudo apt-get install build-essential

Modifier la variable CGO_ENABLED

  1. Start the simulator
go run simulator/main.go
  1. Launch the Spark job
spark-submit spark/stream_processor.py
  1. Explore
  • PostgreSQL: localhost:5432
  • Elasticsearch: localhost:9200
  • Kibana: localhost:5601
  • Kafdrop: localhost:9000

📊 Example Event Format

{
  "user_id": "user_42",
  "page": "/product/1234",
  "event": "click",
  "timestamp": "2025-06-04T12:00:00Z"
}

🤝 License

MIT License — feel free to use for academic or enterprise learning.


About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages