|
| 1 | +# OpenDataFlow |
| 2 | + |
| 3 | +## Overview |
| 4 | + |
| 5 | +OpenDataFlow is a lightweight orchestration utility that runs and coordinates batch jobs over partitioned or time-sliced data so teams can schedule, recover, and migrate large data-processing pipelines without changing their ETL code. |
| 6 | + |
| 7 | +This quickstart uses H2 for simplicity. |
| 8 | + |
| 9 | +It has been tested on Ubuntu and runs in a bash shell. |
| 10 | + |
| 11 | +1. Requirements to run the demo: |
| 12 | + |
| 13 | +- bash and the standard command line utilities |
| 14 | +- jq (`sudo apt install -y jq`) |
| 15 | +- The DataFlow jar: dataflow-1.0.0.jar |
| 16 | +- The decryption passkey in the `$PASSKEY` environment variable: |
| 17 | + ` export PASSKEY=plugh ` |
| 18 | +- The two supplied scripts: `utility.sh` and `RunJob` |
| 19 | + |
| 20 | +Have jq on your path, and put dataflow-1.0.0.jar in the same directory as the scripts. |
| 21 | + |
| 22 | +2. Set up the H2 database, schema and tables: |
| 23 | + |
| 24 | +``` ./utility.sh createtables``` |
| 25 | + |
| 26 | + The initial connection to H2 creates the database, schema, and user automatically. |
| 27 | + The createtables utility creates the standard dataflow tables in the database. |
| 28 | + |
| 29 | +3. Configure the 'loadbob' job and the datasets that it uses: |
| 30 | +``` |
| 31 | + ./utility.sh dml "insert into dataset (datasetid) values ('bobin')" |
| 32 | + ./utility.sh dml "insert into dataset (datasetid) values ('bobout')" |
| 33 | + ./utility.sh dml "insert into job (datasetid,itemtype,jobid) values ('bobout','OUT','loadbob')" |
| 34 | + ./utility.sh dml "insert into job (datasetid,itemtype,jobid) values ('bobin' ,'IN', 'loadbob')" |
| 35 | +``` |
| 36 | +These statements insert just enough test data into the schema to simulate a run. |
| 37 | + |
| 38 | +The first two commands register two datasets named 'bobin' and 'bobout'. |
| 39 | +The second two commands associate bobin and bobout, as input and output datasets respectively, with the job named 'loadbob'. |
| 40 | +These inserts should happen only once, to configure the job and datasets. |
| 41 | + |
| 42 | +4. Set a status for the input dataset |
| 43 | + |
| 44 | + ``` |
| 45 | +./utility.sh dml "insert into datastatus (dataid,datasetid,jobid,locktype,modified,status) values ('1.0','bobin','fakejob', 'OUT',now(),'READY')" |
| 46 | +``` |
| 47 | + |
| 48 | + |
| 49 | +We give it a fake dataid, and specify a fakejob that "produced" it. The status of READY for an OUT data chunk means that it is ready and safe to be consumed. |
| 50 | + |
| 51 | +5. "Write" a `loadbob.sh` to run. In this example it is just a one-liner that outputs some of the automatic environment variables. |
| 52 | + |
| 53 | +``` |
| 54 | + echo 'echo "running loadbob with dataid $dataid partition of input $bobin_DATASETID"' > loadbob.sh && chmod +x loadbob.sh |
| 55 | +``` |
| 56 | + |
| 57 | + One important note: the jobid is **inferred** from the name of the script. That means that if our jobid is 'loadbob' then the script has to be named 'loadbob.sh'. This is mandatory; it is simply the way that the RunJob script is written. The intent is to keep it simple so that the only parameter to RunJob is the script name. |
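The naming rule above can be sketched in a few lines of shell. This is an illustration only: `infer_jobid` is a hypothetical helper name, and the actual RunJob script may derive the jobid differently.

```shell
# Hypothetical helper (not taken from RunJob): derive a jobid from a script path
# by dropping the directory part and the trailing ".sh" suffix.
infer_jobid() {
    local name
    name="$(basename -- "$1")"    # ./loadbob.sh -> loadbob.sh
    printf '%s\n' "${name%.sh}"   # loadbob.sh   -> loadbob
}

infer_jobid ./loadbob.sh
```

Under this assumption, renaming the script to anything other than `loadbob.sh` would yield a different jobid, which would no longer match the rows registered in the job table.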
| 58 | + |
| 59 | +6. Run the job with RunJob |
| 60 | + |
| 61 | +``` |
| 62 | + RunJob ./loadbob.sh |
| 63 | +``` |
| 64 | + |
| 65 | +Output should look like this: |
| 66 | + |
| 67 | +```text |
| 68 | + Mon Dec 1 04:08:50 PM CST 2025: Launching ./loadbob.sh with dataid 1.0 |
| 69 | + running loadbob with dataid 1.0 partition of input bobin |
| 70 | + Mon Dec 1 04:08:50 PM CST 2025: Job ./loadbob.sh is complete. Updating status |
| 71 | + 1 rows updated to READY for loadbob and 1.0 1 IN file-local locks released |
| 72 | +``` |
| 73 | +The output contains two log-style messages confirming the start and end of the loadbob job, with the one line printed by the `loadbob.sh` script between them. |
| 74 | +The last line is an informational message indicating that DataFlow has set the final status. |
| 75 | + |
| 76 | + |
| 77 | +7. Checks: |
| 78 | + Run `RunJob ./loadbob.sh` a second time, and confirm that it refuses to do a duplicate run. |
| 79 | + Check the data with the utility: |
| 80 | + |
| 81 | +```text |
| 82 | + ./utility.sh runs |
| 83 | + DATAID DATASETID JOBID LOCKTYPE MODIFIED STATUS |
| 84 | + ------ --------- ----- -------- -------- ------ |
| 85 | + 1.0 bobout loadbob OUT 2025-12-01 16:08:49.740813 READY |
| 86 | + 1.0 bobin fakejob OUT 2025-12-01 15:46:19.56124 READY |
| 87 | +``` |
| 88 | + |
| 89 | + Check the data with direct SQL: |
| 90 | + |
| 91 | +```text |
| 92 | + utility.sh sql "select * from datastatus" |
| 93 | +DATAID DATASETID JOBID LOCKTYPE MODIFIED STATUS |
| 94 | +------ --------- ----- -------- -------- ------ |
| 95 | +1.0 bobin fakejob OUT 2025-12-01 15:46:19.56124 READY |
| 96 | +1.0 bobout loadbob OUT 2025-12-01 16:08:49.740813 READY |
| 97 | + |
| 98 | +``` |
| 99 | + |
| 100 | + |
| 101 | +## Remarks |
| 102 | + |
| 103 | +* We started with just the jar file and had to manually create the schema and tables. But if you build the package with Maven, the tests will build the H2 database, schema, tables, user and password for you, and the jar will be at utilities/target/dataflow-1.0.0.jar. |
| 104 | + |
| 105 | +* Access to the H2 database for testing is through the user ETL and a password that was encrypted using the default passkey 'plugh'. You should encrypt your own password using your own passkey and put it into dataflow.properties as soon as possible. |
| 106 | +The encrypted password and other connection information are in core/src/main/resources/dataflow.properties. You can copy that file to your working directory and modify it, and the utilities will use your copy instead of the core/ properties file if they find it. The encryption is easily done because it is one of the functions published by the utility.sh tool. |
| 107 | + |
| 108 | + |
| 109 | +* In normal day-to-day operation you **never** need to update or insert into the datastatus table, neither in your code nor manually the way we did in this example. RunJob handles that for you. The inserts to the job and dataset tables are one-time things to register and configure the datasets or to set up a test case. |
| 110 | +In exceptional cases such as handling errors, you may encounter a job in FAILED state. If in that case you want the job to run again, you can reset the job to RESUBMIT. You can do it with a dml command, like `utility.sh dml "update datastatus set status = 'RESUBMIT' where jobid = 'loadbob' and dataid = '1.0'"`, though I would just use the endjob utility command: |
| 111 | +``` |
| 112 | + utility.sh endjob loadbob 1.0 RESUBMIT |
| 113 | +``` |
| 114 | +The big advantage is that you don't have to ask the scheduling team to make any changes, and you don't have to worry about command line parameters because there are not any. If the job is scheduled to run multiple times a day, then it will just catch up the next time it runs, and there are no changes in production at all except for the RESUBMIT status. That means you avoid an enormous amount of red tape and committee meetings just to rerun the job. |
| 115 | + |
| 116 | + |
| 117 | + * The actual dataset record has fields for things like hostname, database name, schema, table, username and (***encrypted***) password. These all appear as automatic variables to your script. This avoids all issues related to the temptation of hardcoding this metadata, the headaches involved with maintaining it, possible errors in connection strings, and having to make changes when moving from development to production. |
| 118 | +It is possible, and in our opinion best practice, to ***hard-code nothing in your script***. Get it all from the metadata that DataFlow provides. For one thing, if you have one job producing data and another job consuming it, you are now using named datasets, so both jobs are guaranteed to be using the same dataset. There is almost no chance of the second job picking up the wrong data because of a misconfiguration. |
| 119 | +Not only that, the framework guarantees that the second job will not have any false starts while the first job is running or in any error state. |
| 120 | + |
| 121 | +* The dataset metadata are not restricted to JDBC connections. They can be repurposed for file system paths, web page URLs, TCP endpoints, what have you. The semantics are entirely up to the consumer (the ETL script), which can do whatever it wants with them. DataFlow doesn't use them at all. |
| 122 | + |
| 123 | +* Some of the other examples in the examples directory illustrate these points. |
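As a rough sketch of the "hard-code nothing" remark above, a job script can check that the automatic variables are present and then read everything from them. Only `dataid` and `bobin_DATASETID` appear in the quickstart; the helper names `require_vars` and `run_loadbob` are hypothetical, and a real script would do actual work where this one only echoes.

```shell
# Hypothetical guard: fail fast if any named variable is unset or empty.
require_vars() {
    local var value
    for var in "$@"; do
        # Look up the variable whose name is stored in $var (portable indirection).
        eval "value=\${$var:-}"
        if [ -z "$value" ]; then
            echo "missing required variable: $var" >&2
            return 1
        fi
    done
}

# Sketch of a job body that takes everything from the environment
# and hard-codes no dataset names, hosts, or paths.
run_loadbob() {
    require_vars dataid bobin_DATASETID || return 1
    echo "running loadbob with dataid $dataid partition of input $bobin_DATASETID"
}
```

Run outside of RunJob, where those variables are not set, such a script stops with an error message instead of silently processing the wrong data.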
| 124 | + |
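Because the meaning of the dataset metadata is left to the consuming script, one possible approach is to branch on the shape of a value. The sketch below is purely illustrative: `classify_location` is a hypothetical function, and the value formats it recognises are assumptions, not something DataFlow defines.

```shell
# Hypothetical: classify a metadata value by its shape so the script can decide
# how to use it. DataFlow itself does not interpret the value.
classify_location() {
    case "$1" in
        jdbc:*)              echo "jdbc" ;;      # e.g. a JDBC connection string
        http://*|https://*)  echo "url" ;;       # a web page or HTTP endpoint
        /*)                  echo "path" ;;      # an absolute file system path
        *:[0-9]*)            echo "endpoint" ;;  # host:port style TCP endpoint
        *)                   echo "unknown" ;;
    esac
}
```

A script could then call `classify_location` on whichever automatic variable carries the location and pick a JDBC client, `curl`, or a plain file read accordingly.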
| 125 | + |