Course project for CSE 535 (Distributed Systems). This repo implements a sharded banking service with intra-shard replication and cross-shard two‑phase commit (2PC), including leader election, recovery, and client-side retry logic.
## What It Does
- Processes read-only balance queries and transfer transactions across a sharded dataset.
- Uses leader-based replication inside each shard to order and commit transactions.
- Coordinates cross-shard transfers with 2PC (prepare/commit/abort) and retryable decision delivery.
- Supports configurable consistency levels for writes: `LINEARIZABLE`, `ANY`, and `ALL`.
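One way to read the three consistency levels is as different acknowledgment quorums. The sketch below is an illustrative assumption about their semantics (majority for `LINEARIZABLE`, one replica for `ANY`, every replica for `ALL`), not the repo's actual API:

```go
package main

import "fmt"

// Consistency mirrors the write consistency levels listed above.
// The names and quorum mapping here are illustrative assumptions.
type Consistency int

const (
	Linearizable Consistency = iota // ack after a majority commits via the leader
	Any                             // ack after any single replica accepts
	All                             // ack only after every replica accepts
)

// RequiredAcks returns how many of n replicas must acknowledge a write
// before the client call returns, under the assumed semantics.
func RequiredAcks(c Consistency, n int) int {
	switch c {
	case Any:
		return 1
	case All:
		return n
	default: // Linearizable: majority quorum
		return n/2 + 1
	}
}

func main() {
	for _, c := range []Consistency{Linearizable, Any, All} {
		fmt.Println(RequiredAcks(c, 3)) // 2, 1, 3 for a 3-replica shard
	}
}
```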
## Architecture Highlights
- Intra-shard consensus: leader election with ballot numbers and replicated accept/commit log.
- Cross-shard transactions: coordinator/participant 2PC with timeouts, retries, and WAL-based recovery.
- Fault handling: node activation/deactivation, recovery flow, and checkpoint snapshots for catch-up.
- Client system: retries, leader hinting, async submission, and built-in benchmarking.
- Resharding: client-driven reshard plan based on observed cross-shard traffic.
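The ballot numbers mentioned above totally order competing election attempts. A minimal sketch of the comparison rule, assuming the common (round, node ID) pairing as a tie-breaker; field names are illustrative, not the repo's types:

```go
package main

import "fmt"

// Ballot orders competing leader-election attempts: a higher round wins,
// with the node ID breaking ties. Field names are assumptions.
type Ballot struct {
	Round  int
	NodeID int
}

// Less reports whether a precedes b in ballot order.
func Less(a, b Ballot) bool {
	if a.Round != b.Round {
		return a.Round < b.Round
	}
	return a.NodeID < b.NodeID
}

// Accepts models the usual promise rule: a replica accepts a message
// only if its ballot is at least as high as the highest ballot promised.
func Accepts(promised, incoming Ballot) bool {
	return !Less(incoming, promised)
}

func main() {
	fmt.Println(Accepts(Ballot{2, 1}, Ballot{3, 0})) // true: higher round overrides
	fmt.Println(Accepts(Ballot{3, 2}, Ballot{3, 1})) // false: stale tie-breaker rejected
}
```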
## Tech Stack
- Go 1.x
- gRPC + Protocol Buffers
- Concurrent process model (separate node processes + client/testrunner)
## Quickstart

Run a local 3-shard cluster (9 nodes total) and open the interactive test runner:

```sh
go build ./...
./scripts/run_cluster.sh --input <path_to_csv>
```

The test runner supports interactive commands such as `db`, `status`, `leader`, `performance`, `bench`, and `reshard`. See the prompt for the full menu.
## Repository Layout

- Node server entrypoint: `cmd/server/main.go`
- Client entrypoint: `cmd/client/main.go`
- Interactive test runner: `cmd/testrunner/main.go`
- Core node logic: `internal/node/*.go`
- Client logic: `internal/client/*.go`
- Protocol definitions: `proto/*`
- Configs: `config.json`, `config_4x5.json`
Cluster size, shard count, node addresses, timers, and dataset size are defined in `config.json`. You can switch configs by passing `--config` to the node and test runner binaries.
This project demonstrates practical distributed systems engineering, including:
- Designing and implementing leader-based replication.
- Coordinating atomic cross-shard transactions with 2PC.
- Building fault-tolerant recovery paths and replay-safe WAL handling.
- Instrumenting client‑side retries and benchmarking for throughput/latency.
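The "replay-safe WAL handling" point can be illustrated with a toy idempotent-replay sketch: entries carry sequence numbers, and replay skips anything already applied, so recovering twice after a crash cannot double-apply a transfer. All names here are illustrative, not the repo's implementation:

```go
package main

import "fmt"

// WALEntry is a toy write-ahead-log record for a balance change.
type WALEntry struct {
	Seq   int
	Acct  string
	Delta int
}

// Store tracks balances plus the highest sequence number applied,
// which is what makes replay idempotent.
type Store struct {
	balances    map[string]int
	lastApplied int
}

// Replay applies entries in order, skipping any at or below lastApplied,
// so running recovery multiple times is safe.
func (s *Store) Replay(entries []WALEntry) {
	for _, e := range entries {
		if e.Seq <= s.lastApplied {
			continue // already applied before the crash
		}
		s.balances[e.Acct] += e.Delta
		s.lastApplied = e.Seq
	}
}

func main() {
	s := &Store{balances: map[string]int{"alice": 100}}
	entries := []WALEntry{{1, "alice", -30}, {2, "alice", 10}}
	s.Replay(entries)
	s.Replay(entries) // second replay is a no-op
	fmt.Println(s.balances["alice"]) // 80, not 60
}
```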
If you want a walkthrough or a focused deep dive (e.g., 2PC recovery or leader election), I’m happy to explain specific design choices and tradeoffs.