Commit 5746809

Update README.md after name change
1 parent 5863426 commit 5746809

1 file changed

Lines changed: 17 additions & 6 deletions

README.md
@@ -1,9 +1,20 @@
-# OpenDataFlow
+# DataState: State-Driven ETL Orchestration
+
 NOTE This application is in BETA. It still needs some work to get to a first release. Contributors are welcome.
 
 ## Overview
-OpenDataFlow is a lightweight orchestration utility that runs and coordinates batch jobs over partitioned or time-sliced data so teams can schedule, recover, and migrate large data-processing pipelines without changing their ETL code.
-It does this by keeping track of the status of every data partition, reporting on status when needed, and using the status to determine a partition of data sets which is ready to be consumed by a job. Our one purpose is to associate that data to a particular job run and provide that info at runtime.
+
+DataState is a lightweight orchestration utility that runs and coordinates batch jobs over partitioned or time-sliced data so teams can schedule, recover, and migrate large data-processing pipelines without changing their ETL code.
+
+This is not Airflow. Airflow schedules tasks; DataState ensures data readiness.
+To Airflow, the batch cycle is a giant DAG, a tree structure representing a rigid order in which jobs will run.
+To DataState, every job is an independent task that cares only about the data it consumes and produces. Provided the data requirements are met (see the five questions below), any job can run at any time, with any number of concurrent jobs. The DataState tool makes this possible.
+
+Data warehousing is about data, not tasks. The data-centric approach is the right model, and DataState works **with** that model, not against it.
+
+It does this by keeping track of the status of every data partition, reporting that status when requested, and using the status to determine a partition of data sets that is ready to be consumed by a job. Our one purpose is to associate that data with a particular job run and provide that information at runtime.
+
+A side benefit of DataState is that it keeps connection info for every dataset, so it does not have to be hardcoded into job scripts or maintained separately by each job. This enables easy integration with data-quality tools, saves support time and cost when diagnosing failures, and prevents accidents when job code is promoted to production (no code change, no configuration change needed).
 
 ### Why this exists
 Many data teams spend disproportionate time and engineering effort on the same operational problems: recovering after platform outages, catching up missed cycles, and migrating huge datasets in stages. OpenDataFlow was born out of repeated large migrations and outages. It encodes the orchestration and state-tracking so recovery, catch-up, and phased migration are first-class, routine operations — using the same job scripts you already have.
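The readiness rule described in the overview above can be sketched in a few lines: a job may consume a set of partitions only when every one is complete and not locked by another job. This is a minimal illustration; all names here (`Partition`, `is_ready`, the status strings) are hypothetical, not DataState's actual API.

```python
# Hypothetical sketch of a per-partition readiness check.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Partition:
    dataset: str
    key: str                         # e.g. a time slice like "2024-06-01"
    status: str                      # e.g. "READY", "LOADING", "FAILED"
    locked_by: Optional[str] = None  # job id holding an exclusive lock, if any

def is_ready(inputs: list) -> bool:
    """All inputs READY and unlocked -> the job is allowed to run."""
    return all(p.status == "READY" and p.locked_by is None for p in inputs)

inputs = [
    Partition("sales", "2024-06-01", "READY"),
    Partition("refunds", "2024-06-01", "LOADING"),
]
print(is_ready(inputs))  # False: the refunds slice is still loading
```

Because the check depends only on recorded partition state, any number of jobs can run the same check concurrently without coordinating with each other.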
@@ -28,7 +39,7 @@ So we asked two natural questions:
 2. **If it’s critical for root cause, isn’t it *even more* critical *before* the job runs?**
 
 The answer to both was obvious.
-That insight birthed the **DataFlow utilities** and eventually, **OpenDataFlow**.
+That insight birthed the **DataFlow utilities** and eventually became what we now call **DataState**.
 
 ---
 
@@ -46,7 +57,7 @@ That insight birthed the **DataFlow utilities** and eventually, **OpenDataFlow**
 
 We use these questions to enforce a protocol **identical in spirit to 2-phase commit**:
 
-| 2PC Phase | OpenDataFlow Equivalent | Implementation |
+| 2PC Phase | DataState Equivalent | Implementation |
 |---------------------|----------------------------------------------------|--------------------------------------------------------------------------------|
 | **Phase 1: Prepare** | `RunJob` checks **all 5 Questions** | Scans `data_manifest`, `datastatus`, locks, paths, validation |
 | **Yes Vote** | All inputs = `READY`, no lock conflicts | Every input confirms: “I’m complete, valid, and exclusively available” |
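The 2PC-style flow in the table above — prepare by checking every input, abort on any "no" vote, then commit or roll back around the actual job — can be sketched as follows. The names (`Input`, `run_job`) and the status/lock fields are invented for illustration, not DataState's real interface.

```python
# Hedged sketch of a 2PC-style run protocol over input partitions.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Input:
    name: str
    status: str = "READY"
    locked_by: Optional[str] = None

def run_job(job_id, inputs, work):
    # Phase 1 -- Prepare: every input must vote yes (READY, no lock conflict).
    if any(i.status != "READY" or i.locked_by not in (None, job_id) for i in inputs):
        return "ABORT"           # a single "no" vote aborts before any side effects
    for i in inputs:
        i.locked_by = job_id     # take exclusive locks for this run
    try:
        work()
        return "COMMIT"          # Phase 2 -- Commit: the run succeeded
    except Exception:
        return "ROLLBACK"        # Phase 2 -- Rollback: leave partition state unchanged
    finally:
        for i in inputs:
            i.locked_by = None   # always release locks

print(run_job("job-1", [Input("sales")], lambda: None))             # COMMIT
print(run_job("job-2", [Input("sales", "LOADING")], lambda: None))  # ABORT
```

The key property mirrored from 2PC is that nothing with side effects happens until every participant has voted yes, and locks are released whether the run commits or rolls back.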
@@ -59,7 +70,7 @@ We use these questions to enforce a protocol **identical in spirit to 2-phase co
 
 ---
 
-### The Legend Moment: Autonomic Recovery
+### The Defining Moment: Autonomic Recovery
 
 After a platform outage:
 