# DataState: State-Driven ETL Orchestration
NOTE: This application is in BETA. It still needs some work to reach a first release. Contributors are welcome.
## Overview
DataState is a lightweight orchestration utility that runs and coordinates batch jobs over partitioned or time-sliced data so teams can schedule, recover, and migrate large data-processing pipelines without changing their ETL code.
This is not Airflow. Airflow schedules tasks; DataState ensures data readiness.
To Airflow, the batch cycle is one giant DAG: a dependency graph that fixes a rigid order in which jobs will run.
To DataState, every job is an independent task that cares only about the data it consumes and produces. Provided the data requirements are met (see the five questions below), any job can run at any time, with any degree of concurrency. The DataState tool makes this possible.
Data warehousing is all about data, not about tasks. The data-centric approach is the right model, and DataState works **with** that model, not against it.
It does this by tracking the status of every data partition, reporting that status when requested, and using it to determine which set of data partitions is ready to be consumed by a job. Our one purpose is to associate that data with a particular job run and provide that information at runtime.
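The partition-tracking idea above can be sketched as a tiny, hypothetical tracker. `PartitionTracker`, `Status`, and the dataset/partition names are illustrative assumptions for this README, not DataState's actual API:

```python
from enum import Enum

class Status(Enum):
    EXPECTED = "expected"     # partition announced but data not yet landed
    AVAILABLE = "available"   # data landed and ready to consume
    IN_USE = "in_use"         # claimed by a running job
    CONSUMED = "consumed"     # successfully processed

class PartitionTracker:
    """Track the status of each data partition and report readiness."""

    def __init__(self):
        self._status = {}  # (dataset, partition_key) -> Status

    def set_status(self, dataset, partition_key, status):
        self._status[(dataset, partition_key)] = status

    def ready_partitions(self, dataset):
        """Return partition keys of `dataset` that a job may consume now."""
        return sorted(
            key for (ds, key), st in self._status.items()
            if ds == dataset and st is Status.AVAILABLE
        )

tracker = PartitionTracker()
tracker.set_status("sales", "2024-06-01", Status.AVAILABLE)
tracker.set_status("sales", "2024-06-02", Status.EXPECTED)
print(tracker.ready_partitions("sales"))  # ['2024-06-01']
```

The point of the sketch is the separation of concerns: jobs ask "what is ready?" at runtime instead of encoding a fixed schedule.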
A side benefit of DataState is that it keeps connection info for every dataset, so connections do not have to be hardcoded into job scripts or maintained separately by each job. This enables easy integration with data-quality tools, saves support time and cost when diagnosing failures, and prevents accidents when job code is promoted to production (no code change, no configuration change needed).
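A minimal sketch of that idea, assuming a central registry keyed by environment and dataset (the registry shape, `connection_for`, and the host names are all hypothetical, not DataState's real configuration format):

```python
# Connection info lives in one registry, not in job scripts, so promoting
# code between environments requires no code or configuration edits.
REGISTRY = {
    ("dev",  "sales"): {"driver": "postgresql", "host": "dev-db.internal",  "schema": "staging"},
    ("prod", "sales"): {"driver": "postgresql", "host": "prod-db.internal", "schema": "warehouse"},
}

def connection_for(env: str, dataset: str) -> dict:
    """Resolve connection info at runtime; job code never hardcodes it."""
    return REGISTRY[(env, dataset)]

# The same job script works unchanged in every environment:
conn = connection_for("prod", "sales")
print(conn["host"])  # prod-db.internal
```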
### Why this exists
Many data teams spend disproportionate time and engineering effort on the same operational problems: recovering after platform outages, catching up missed cycles, and migrating huge datasets in stages. DataState was born out of repeated large migrations and outages. It encodes the orchestration and state tracking so that recovery, catch-up, and phased migration are first-class, routine operations, using the same job scripts you already have.
So we asked two natural questions:
2. **If it’s critical for root cause, isn’t it *even more* critical *before* the job runs?**
The answer to both was obvious.
That insight birthed the **DataFlow utilities** and, eventually, what we now call **DataState**.
---
We use these questions to enforce a protocol **identical in spirit to two-phase commit**:
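A two-phase protocol over partition state might look like the following sketch. Everything here is an illustrative assumption (`run_job`, the string statuses, the partition keys), not DataState's real interface; the point is the prepare/commit/rollback shape:

```python
# Phase 1 (prepare): claim every ready partition for this job run.
# Phase 2 (commit): mark the claim consumed on success; on failure,
# roll back by releasing the claim so a retry can pick it up.
state = {"2024-06-01": "available", "2024-06-02": "available"}

def run_job(job, state):
    claimed = [k for k, s in state.items() if s == "available"]
    for key in claimed:
        state[key] = "in_use"          # phase 1: prepare / claim
    try:
        job(claimed)                   # run the ETL job on the claimed slice
    except Exception:
        for key in claimed:
            state[key] = "available"   # rollback: release the claim
        raise
    for key in claimed:
        state[key] = "consumed"        # phase 2: commit
    return claimed

run_job(lambda parts: None, state)
print(state)  # every claimed partition is now 'consumed'
```

As in two-phase commit, no partition is marked consumed until the whole job succeeds, so a crash mid-run leaves a state from which recovery is mechanical.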