Commit b9c272b

committed
workflows best practices
1 parent e93c6a0 commit b9c272b

4 files changed

Lines changed: 567 additions & 30 deletions

File tree

pages/Community and Best Practices/Data and Workflow Best Practices/Workflows.md

Lines changed: 0 additions & 30 deletions
This file was deleted.
Lines changed: 77 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,77 @@
1+
# Workflow Best Practices
2+
:::warning 🛠️ Page Under Development
3+
Content is being actively developed and updated for this page. EarthCODE's documentation is a living document and will be continuously updated with detailed reviews.
4+
:::
5+
6+
7+
On this page, we describe design decisions and best practices for creating and distributing scientific workflows so that they deliver the most value and impact within the EarthCODE ecosystem. A high-quality workflow is more than code that runs; it is a complete, transparent, and robust scientific narrative, packaged so that others (and your future self) can understand, reuse, and reproduce it. The recommendations here draw on widely accepted community standards and established software development principles, and are summarized in the Workflow Quality Assurance Checklist below.
8+
9+
**The effort put into quality assurance for research code should be proportionate to the complexity and risk of the analysis. While not every script needs production-level rigor, reproducibility is the minimum standard.**
10+
11+
When developing and publishing your workflow, consider these best practices:
12+
13+
- [Structure Your Project Logically](./workflow-best-practices.md#structure-your-project-logically) — Organize files consistently.
14+
- [Use Version Control Effectively](./workflow-best-practices.md#use-version-control-effectively) — Track changes using Git.
15+
- [Explicitly Define the Environment](./workflow-best-practices.md#explicitly-define-the-environment) — List dependencies and versions.
16+
- [Tell a Story (in Notebooks)](./workflow-best-practices.md#tell-a-story-in-notebooks) — Explain context/methods/results in Markdown.
17+
- [Modularize and Refactor Code](./workflow-best-practices.md#modularize-and-refactor-code) — Avoid duplication; use functions/modules.
18+
- [Adopt a Consistent Coding Style](./workflow-best-practices.md#adopt-a-consistent-coding-style) — Follow style guides (e.g., PEP 8).
19+
- [Build a Reproducible Analytical Pipeline](./workflow-best-practices.md#build-a-reproducible-analytical-pipeline) — Design for automation; configure externally.
20+
- [Implement Basic Testing](./workflow-best-practices.md#implement-basic-testing) — Include basic code checks.
21+
- [Ensure Executability](./workflow-best-practices.md#ensure-executability) — Package code and environment for reuse.
22+
- [Link Code Version to Results](./workflow-best-practices.md#link-code-version-to-results) — Link code versions to results via Experiments.
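
As an illustration of the last item, linking a code version to a result can be as simple as storing the Git commit hash next to a checksum of the output. The sketch below is a minimal, generic example and does not represent the EarthCODE Experiments mechanism; the function names `current_commit` and `build_provenance`, the record fields, and the sample result bytes are all hypothetical.

```python
import hashlib
import json
import subprocess
from datetime import datetime, timezone


def current_commit() -> str:
    """Return the current Git commit hash, or 'unknown' outside a repository."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return result.stdout.strip()
    except (subprocess.CalledProcessError, FileNotFoundError):
        # Not a Git repository, or Git is not installed.
        return "unknown"


def build_provenance(result_bytes: bytes, commit: str) -> dict:
    """Describe a result by its checksum and the code version that produced it."""
    return {
        "code_commit": commit,
        "result_sha256": hashlib.sha256(result_bytes).hexdigest(),
        "created_utc": datetime.now(timezone.utc).isoformat(),
    }


if __name__ == "__main__":
    # Hypothetical result content, used only for demonstration.
    record = build_provenance(b"mean_ndvi,0.42\n", current_commit())
    print(json.dumps(record, indent=2))
```

Saving such a record alongside each output makes it possible to check later which commit produced a given file, and whether the file has changed since.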
23+
24+
---
25+
<ClientOnly>
26+
<Checklist
27+
title="Workflow Quality Assurance Checklist"
28+
:items="[
29+
'Use a clear, standard directory structure e.g., code, environment, docs.',
30+
'Include a README.md explaining the project, setup, and usage.',
31+
'Use Git for version control from the start.',
32+
'Use .gitignore to exclude data, secrets, environment files (e.g., .env), and outputs.',
33+
'Explicitly list all software dependencies in an environment file e.g., environment.yml, requirements.txt, Dockerfile.',
34+
'Pin key dependency versions in the environment file.',
35+
'Follow a standard code style guide e.g., PEP 8 for Python.',
36+
'Refactor repetitive code into functions or classes.',
37+
'Consider moving complex/reusable code into separate modules e.g., .py files.',
38+
'Use comments to explain the why, not the what of complex code.',
39+
'Add docstrings to functions and classes.',
40+
'Separate configuration (parameters, paths, endpoints) from code, preferably using environment variables.',
41+
'Ensure the workflow runs non-interactively from start to finish.',
42+
'For notebooks, regularly test with Restart Kernel and Run All Cells.',
43+
'Include basic checks e.g., assert statements to validate data or results.',
44+
'Document input data requirements clearly in the README.md.',
45+
'Access data from discoverable sources e.g., cloud storage, OSC Products rather than committing data.',
46+
'Package the workflow for execution e.g., container image, OGC Application Package.'
47+
]"
48+
storage-key="earthcode-quality-workflow"
49+
/>
50+
</ClientOnly>
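
Two of the checklist items, keeping configuration out of the code and including basic checks, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the environment variable names (`WORKFLOW_DATA_URL`, `WORKFLOW_OUTPUT_DIR`, `WORKFLOW_MAX_CLOUD_COVER`), their defaults, and the helper functions are hypothetical and not part of any EarthCODE API.

```python
import os


def load_config(env=None) -> dict:
    """Read settings from environment variables, with defaults for local runs."""
    env = os.environ if env is None else env
    return {
        # Hypothetical variable names and placeholder defaults.
        "data_url": env.get("WORKFLOW_DATA_URL", "https://example.org/data.csv"),
        "output_dir": env.get("WORKFLOW_OUTPUT_DIR", "outputs"),
        "max_cloud_cover": float(env.get("WORKFLOW_MAX_CLOUD_COVER", "20")),
    }


def validate_config(config: dict) -> None:
    """Fail early if the configuration is implausible."""
    assert 0 <= config["max_cloud_cover"] <= 100, "cloud cover must be a percentage"
    assert config["data_url"].startswith(("http://", "https://", "s3://")), \
        "unexpected data URL scheme"


def mean_reflectance(values: list[float]) -> float:
    """Average reflectance values after checking they are physically plausible."""
    assert values, "no input values"
    assert all(0.0 <= v <= 1.0 for v in values), "reflectance outside [0, 1]"
    return sum(values) / len(values)


if __name__ == "__main__":
    config = load_config()
    validate_config(config)
    print(config)
```

Because the settings come from the environment, the same code can run unchanged on a laptop, in a container, or on a platform. Note that Python skips `assert` statements when run with the `-O` flag, so checks that must always run are better written as explicit `if` tests that raise exceptions.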
51+
52+
53+
## How Research Code Differs
54+
55+
Research code often differs significantly from traditional software development. It's frequently written by domain experts, like scientists or analysts, whose main goal is to answer a specific research question, generate insights from data, or test a hypothesis. This contrasts with building a long-lasting production service.
56+
57+
A key characteristic is its **exploratory nature**: much research code begins as an experiment and evolves rapidly as understanding grows, which can initially leave it less structured than production software. The primary focus is often on obtaining scientifically correct results and insights, which can take precedence over software engineering practices such as extensive testing or user interface design. Some research code is developed for a single analysis or publication, but workflows are increasingly designed for reuse and adaptation. Crucially, and unlike in many commercial applications, the ability for others (and the original author) to exactly **reproduce the results** from the code and data is a fundamental requirement for scientific validity. Understanding these differences helps in applying quality assurance practices appropriately.
58+
59+
60+
## Why Focus on Quality Research Code
61+
62+
Although research code has unique characteristics, focusing on its quality is vital. High-quality, well-documented code is essential for others, including your future self, to trust your results. It forms the foundation for **reproducibility** – the ability to run the same analysis with the same data and get the same outcome, which is the cornerstone of scientific validation.
63+
64+
Clean, understandable code makes it easier for peers and collaborators to review your methods, verify your implementation, and identify potential errors or improvements. Well-structured and documented code is also easier to adapt for new datasets or research questions. Investing time in quality upfront prevents "technical debt" and saves significant effort later by avoiding the need to rewrite or debug poorly written code, enabling efficient building upon previous work.
65+
66+
Sharing high-quality code alongside data supports **transparency and open science**, allowing the broader community to understand, scrutinize, and benefit from your work. It can also help meet funding agency requirements for quality and auditability. Applying proportionate quality assurance practices, even to exploratory code, ultimately increases the reliability, impact, and longevity of your research.
67+
68+
69+
<!--
70+
Key pieces of inspiration:
71+
https://arxiv.org/pdf/1810.08055
72+
https://github.com/jupyter-guide/ten-rules-jupyter/tree/master/example1
73+
https://github.com/jupyter-guide/jupyter-guide
74+
https://best-practice-and-impact.github.io/qa-of-code-guidance/managers_guide.html
75+
https://goodresearch.dev/
76+
-->
77+
Lines changed: 22 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,22 @@
1+
---
2+
order: 3
3+
---
4+
5+
# Key Terms
6+
7+
## Acronyms, Abbreviations, and Terms
8+
9+
| **Term** | **Meaning** |
10+
|-----------|-------------|
11+
| **Logging** | Automatically recording configuration settings, inputs, and outputs when a workflow runs. Logging helps ensure that analyses can be **reproduced** and errors traced efficiently. |
12+
| **Maintainability** | The ease with which code can be read, understood, and updated by others. Well-documented, modular workflows are more maintainable and easier to build upon. |
13+
| **Modularity** | Writing code in **reusable, independent components** (modules or functions). This makes workflows easier to test, debug, and share between projects. |
14+
| **Notebooks** | Interactive documents combining code, text, and results (e.g., Jupyter Notebooks). Ideal for exploration and training, but final reproducible workflows should also be submitted as **scripts or packaged code** to ensure consistent execution. |
15+
| **Packages** | Reusable code collections distributed with documentation (e.g., Python packages or R libraries). Packages promote consistency, reduce duplication, and simplify sharing code used in workflows. |
16+
| **Parameters and Arguments** | Configurable inputs that allow workflows to be flexible and reusable. Parameters define expected inputs; arguments are the specific values passed when running the workflow. |
17+
| **Pipeline** | A structured sequence of processing steps, where outputs from one stage feed into the next. Workflows should define clear **input, processing, and output stages** to ensure reproducibility. |
18+
| **Readability** | The clarity of your code — making it understandable for collaborators and reviewers. Readable code is **consistent, well-documented, and logically structured**. |
19+
| **Reproducible Analytical Pipelines (RAP)** | Workflows built using open tools, good coding practices, and automation, ensuring that results can be independently reproduced. RAPs embody the FAIR and Open Science principles promoted by EarthCODE. |
20+
| **Scripts** | Code files that perform specific analysis steps or orchestrate an entire workflow. All scripts should run **without manual intervention** to support reproducibility. |
21+
| **Version Control** | Tracking and managing changes in your code over time. Tools like **Git**, combined with platforms such as **GitHub**, **GitLab**, or **Bitbucket**, allow collaborative development and open publication of your workflows. |
22+
| **Virtual Environments** | Isolated software setups containing the specific versions of dependencies required by a workflow. They help ensure that analyses remain reproducible across systems and time. |
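
To make two of the terms above concrete, the following sketch shows parameters declared with Python's standard `argparse` module, arguments supplied at run time, and the configuration recorded with the standard `logging` module. The workflow itself (a moving average over placeholder values) and the names `build_parser`, `moving_average`, and `run` are hypothetical, chosen only for illustration.

```python
import argparse
import logging

logger = logging.getLogger("workflow")


def build_parser() -> argparse.ArgumentParser:
    """Declare the parameters the workflow accepts."""
    parser = argparse.ArgumentParser(description="Hypothetical smoothing workflow")
    parser.add_argument("--input", required=True, help="path or URL of the input data")
    parser.add_argument("--window", type=int, default=3, help="moving-average window size")
    return parser


def moving_average(values, window):
    """Return the moving average of values over the given window size."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


def run(argv=None):
    """Parse arguments, log the configuration, and run the analysis."""
    args = build_parser().parse_args(argv)
    # Record the exact arguments used so the run can be traced and repeated.
    logger.info("configuration: %s", vars(args))
    values = [1.0, 2.0, 3.0, 4.0, 5.0]  # placeholder; a real workflow would read args.input
    result = moving_average(values, args.window)
    logger.info("produced %d output values", len(result))
    return result


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
    # Demo invocation with explicit arguments; call run() with no arguments
    # to read them from the command line instead.
    print(run(["--input", "data.csv", "--window", "2"]))
```

Here `--input` and `--window` are the parameters, while `data.csv` and `2` are the arguments passed for one particular run. The log line that captures them is what allows that run to be reproduced later.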
