All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to semantic versioning.
- Change `CODEOWNERS` owners to RDSA SAIL team.
- Added `sha256_sum`, `parse_pyproject_metadata`, `validate_env_vars` to `helpers/python.py`.
- Added `create_s3_uri` & `split_s3_uri` to `cdp/helpers/s3_utils.py`.
- Reordered docstrings to place Returns above Raises in `check_year_range` in `helpers/pyspark.py`.
- Added functions `check_year_range` & `assert_same_distinct_value` to `helpers/pyspark.py`.
- Updated `insert_df_to_hive_table` function in `cdp/io/output.py`; enhanced schema mismatch error logging to show specific column differences.
- Added function `has_no_nulls` to `helpers/pyspark.py`.
- Modified `load_csv` in `helpers/pyspark.py` to improve logging with conditional inclusion of read options.
- Fix logging in `filter_out_values` in `helpers/pyspark.py`, use f-string for message formatting.
- Noted in `README.md` that the documentation uses the `ons_mkdocs_theme` package for the MkDocs theme, with a link to its GitHub repository.
- Added function `filter_out_values` to `helpers/pyspark.py`.
- Added RDSA email to `setup.cfg`, `README.md`, `index.md`.
- Added `dump_environment_requirements` function to `helpers/python.py`.
- Added support in `init_logger_advanced` to detect existing root logger handlers and prevent duplicate configuration; updated corresponding unit tests.
- Removed explicit `mkdocs` version pin from `doc` extra in `setup.cfg` because `mkdocs` is installed as a dependency of `ons-mkdocs-theme`.
- Added `smart_coalesce` function in `helpers/pyspark.py`.
- Removed `generate_coverage_badge.yaml`.
- Added `generate_coverage_badge.yaml` GitHub Action to generate pytest coverage report when pushing to the `main` branch.
- Updated `README.md` to include information about pytest coverage badge.
- Updated `cut_lineage` function in `helpers/pyspark.py` to include a fallback for PySpark 3.2.3 when accessing the `sparkSession` attribute.
- Pinned `mkdocs` to `1.6.0` in `doc` extra to resolve dependency conflict with `ons-mkdocs-theme`.
- Added `zip_local_directory_to_s3` & `zip_s3_directory_to_s3` to `cdp/helpers/s3_utils.py`.
- Added `freezegun` package to `dev` dependency in `setup.cfg`.
- Added `delete_old_objects_and_folders` function in `cdp/helpers/s3_utils.py`.
- Added house-style example functions and unit tests to `docs/contribution_guide.md`.
- Added a function in `helpers/python.py` called `file_size` to check a file size in a local drive.
- Added a function in `helpers/python.py` called `md5_sum` to create md5 hash for an object in a local drive.
- Added a function in `cdp/helpers/s3_utils.py` called `check_file` to check a file exists, is not a directory and size > 0 in an s3 bucket.
- Added a function in `helpers/python.py` called `file_exists` to check if a file exists in a local drive.
- Added a function in `helpers/python.py` called `is_local_directory` to check if a directory exists in a local drive.
- Added a function in `helpers/python.py` called `check_file` to check a file exists, is not a directory and size > 0 in a local drive.
- Added a function in `helpers/python.py` called `read_header` to print the first line of a file in a local drive.
- Added a function in `helpers/python.py` called `write_string_to_file` to write content into an existing file.
- Added a function in `helpers/python.py` called `create_folder` to create a directory.
- Changed a function in `cdp/helpers/s3_utils.py` called `create_folder_on_s3` to `create_folder`.
- Added a function in `cdp/helpers/s3_utils.py` called `file_size` to check a file size in an s3 bucket.
- Added a function in `cdp/helpers/s3_utils.py` called `md5_sum` to create md5 hash for an object in s3 bucket.
- Added a function in `cdp/helpers/s3_utils.py` called `read_header` to read the first line of a file in s3 bucket.
- Added a function in `cdp/helpers/s3_utils.py` called `write_string_to_file` to write a string into an existing file in s3 bucket.
- Added a function in `cdp/helpers/s3_utils.py` called `s3_walk` that mimics the functionality of `os.walk` in s3 bucket using long filenames with slashes.
- Fixed `setup.cfg` to include `data/*.db` files in the `pyspark_log_parser` module.
- Added `include_package_data = True` and `rdsa_utils.helpers.pyspark_log_parser = *.db, *.ipynb` to `setup.cfg` to include `.db` and `.ipynb` files in the `pyspark_log_parser` module.
- Added `include_package_data = True` and `* = *.db` to `setup.cfg` to include SQLite database files in the package.
- Added `__init__.py` to `helpers/pyspark_log_parser`.
- Added `pyspark_log_parser/` module in `helpers/`.
- Added `papermill`, `nbconvert`, `matplotlib` dependencies.
- Added `multi_line` param to `load_json` in `cdp/helpers/s3_utils`.
- Removed trailing whitespaces from `CHANGELOG.md`.
- Added `time_it`, `setdiff`, `flatten_iterable`, `convert_types_iterable`, `interleave_iterables`, `pairwise_iterable`, `merge_multi_dfs` to `helpers/python.py`.
- Added `cache_time_df`, `count_nulls`, `aggregate_col`, `get_unique`, `drop_duplicates_reproducible`, `apply_col_func`, `pyspark_random_uniform`, `cumulative_array`, `union_mismatched_dfs`, `sum_columns`, `set_nulls`, `union_multi_dfs`, `join_multi_dfs`, `map_column_values` to `helpers/pyspark.py`.
- Added `codetiming` package as a dependency.
- Added `write_excel` function to `cdp/helpers/io/s3_utils.py`.
- Added `xlsxwriter` and `openpyxl` dependency due to `write_excel` function in `cdp/helpers/io/s3_utils.py`.
- Ran `ruff check . --fix` on the codebase to comply with new PEP rules.
- Added rules to `ruff.toml` to ignore A005 warnings for `rdsa_utils/logging.py` and `rdsa_utils/typing.py`.
- Upgraded `black`, `ruff`, `gitleaks` to the latest version in `.pre-commit-config.yaml`.
- Removed module-level scope for `spark_session` fixture in `test_utils.py` to ensure test isolation.
- Updated Project Description for Python 3.12 and 3.13.
- Updated Copyright for 2025.
- Added acknowledgements to colleagues from DSC and MQD in `README.md`.
- Added link and description of `easy_pipeline_run` repo to `README.md`.
- Modified `list_files` function in `cdp/helpers/s3_utils.py` to use pagination when listing objects from S3 buckets, improving handling of large buckets.
- Added test cases for new pagination functionality in `list_files` function in `tests/cdp/helpers/test_s3_utils.py`.
- Modified `insert_df_to_hive_table` function in `cdp/io/output.py`. Added support for creating non-existent Hive tables, repartitioning by column or partition count, and handling missing columns with explicit type casting.
- Update `CODEOWNERS` file, changed email to GitHub username.
- Updated `ons-mkdocs-theme` version from `1.1.2` to `1.1.3` to fix issues with the crest not showing in the footer of documentation site.
- Updated the `ons-mkdocs-theme` version number in `doc` requirements in `setup.cfg`.
- Unpinned `pandas` version in `setup.cfg` to allow for more flexibility in dependency management.
- Removed `numpy` from `setup.cfg` as it will be installed automatically by `pandas`.
- Added `write_csv` function inside `cdp/helpers/s3_utils.py`.
- Changed `cut_lineage` function inside `helpers/pyspark.py` to make it compatible with newer PySpark versions.
- Added "How the Project is Organised" section to `README.md`.
- Fix docstring for `test_load_json_with_encoding` in `test_s3_utils.py`.
- Added `load_json` to `s3_utils.py`.
- Added `InvalidS3FilePathError` to `exceptions.py`.
- Added `validate_s3_file_path` to `s3_utils.py`.
- Fixed docstring for `load_csv` in `helpers/pyspark.py`.
- Call `validate_s3_file_path` function inside `save_csv_to_s3`.
- Call `validate_bucket_name` and `validate_s3_file_path` function inside `cdp/helpers/s3_utils/load_csv`.
- Improved `truncate_external_hive_table` to handle both partitioned and non-partitioned Hive tables, with enhanced error handling and support for table identifiers in `<database>.<table>` or `<table>` formats.
- Added `load_csv` to `helpers/pyspark.py` with kwargs parameter.
- Added `truncate_external_hive_table` to `helpers/pyspark.py`.
- Added `get_tables_in_database` to `cdp/io/input.py`.
- Added `load_csv` to `cdp/helpers/s3_utils.py`. This loads a CSV from S3 bucket into a Pandas DataFrame.
- Removed `.config("spark.shuffle.service.enabled", "true")` from `create_spark_session()` not compatible with CDP. Added `.config("spark.dynamicAllocation.shuffleTracking.enabled", "true")` & `.config("spark.sql.adaptive.enabled", "true")`.
- Change `mkdocs` theme from `mkdocs-tech-docs-template` to `ons-mkdocs-theme`.
- Added more parameters to `load_and_validate_table()` in `cdp/io/input.py`.
- Temporarily pin `numpy==1.24.4` due to https://github.com/numpy/numpy/issues/267100
- Added `zip_folder` function to `io/output.py`.
- Modified `gcp_utils.py`, added more helper functions for GCS.
- Modified docstring for `InvalidBucketNameError` in `exceptions.py`.
- Added `.isort.cfg` to configure `isort` with the `black` profile and recognize `rdsa-utils` as a local repository.
- Reformatted the entire codebase using `black` and `isort`.
- Updated `.pre-commit-config.yaml` to include `black` and `isort` as pre-commit hooks for code formatting.
- Updated `setup.cfg` to include `black` and `isort` in the `dev` requirements.
- Updated `README.md` to include `black` formatting badge.
- Updated `ruff.toml` to align with `black`'s formatting rules.
- Added `save_csv_to_s3` function in `cdp/io/output.py`.
- Modified docstrings in `cdp/helpers/s3_utils.py`; remove type-hints from docstrings, type-hints already in function signatures.
- Add Examples section in `delete_folder` function in `s3_utils.py`.
- Modified docstrings in `cdp/io/input.py` & `cdp/io/output.py`; remove type-hints from docstrings, type-hints already in function signatures.
- Updated `.gitignore` to exclude `metastore_db/` directory.
- Standardised parameter names for consistency across S3 utility functions in `s3_utils.py`.
- Added `s3_utils.py` module located in `cdp/helpers/`.
- Updated `reference.md`; included `s3_utils.py`.
- Updated `README.md`; added Ruff and Python versions badges.
- Revised the "Further Reading on Reproducible Analytical Pipelines" section in the `README.md` for clarity.
- Breaking Change: Renamed module `cdsw` to `cdp` (Cloudera Data Platform).
- Added a "Further Reading on Reproducible Analytical Pipelines" section to `README.md` to enhance resources on RAP best practices.
- Added section on synchronising the `development` branch with `main` to the `branch_and_deploy_guide.md` file.
- Updated `contribution_guide.md`; fix code block rendering issue in `mkdocs` by removing extra whitespaces.
- Updated `branch_and_deploy_guide.md`, added section titled: "Merging Development to Main: A Guide for Maintainers"
- Updated `README.md` to include new badges for Deployment Status and PyPI version.
- Added `mkdocs-mermaid2-plugin` to the `doc` extras_require in `setup.cfg`, enhancing documentation with MermaidJS diagram support.
- Added `gitleaks` and local `restrict-filenames` hooks to `.pre-commit-config.yaml`.
- Enhanced `README.md` headers with relevant emojis for improved readability and engagement.
- Modified `README.md`: Added Installation section and Git Workflow Diagram section with a MermaidJS diagram.
- Improved the `branch_and_deploy_guide.md` and `contribution_guide.md` documentation on branching strategy.
- Updated `python_requires` in `setup.cfg` to support Python versions `>=3.8` and `<3.12`, including all `3.11.x` versions.
- Modified `pull_request_workflow.yaml` to add Python `3.11` to the testing matrix.
- Moved `pyspark` from primary dependencies to `dev` section in `extras_require` to streamline installation for users with pre-installed environments, requiring manual installation where necessary.
- Renamed `isdir` function in `cdsw/helpers/hdfs_utils` to `is_dir` for improved compliance with PEP 8 naming conventions.
- Removed line stopping existing SparkSession in `create_spark_session` to prevent Py4JError and enable seamless SparkContext management on GCP.
- Refactor `save_csv_to_hdfs` to use functions in `/cdsw/helpers/hdfs_utils.py`
- Add function `delete_path` in `/cdsw/helpers/hdfs_utils.py`, and refactor docstring for `delete_file` and `delete_dir`.
- Modified `CHANGELOG.md` added note on missing `pre-v0.1.8` releases due to `deploy_pypi.yaml` issues
- Added `pyproject.toml` and `setup.cfg`.
- Removed `requirements.txt` now in `setup.cfg`.
- Added `build` dependency in `.github/workflows/deploy_pypi.yaml`
- Modified Workflow Trigger in `.github/workflows/deploy_pypi.yaml`
- Removed `.github/workflows/version_check.yaml`
- Fix GitHub Branch Reference for deployment.
- Remove check of branch for deployment.
- Take workflows out of nested folder to have PyPI listing on merge to main branch.
- Workflows to have PyPI listing on merge to main branch.
- Typo in the documentation to install Python.
- `parametrize_cases` and `Case` code for use in test scripts.
- Add in PR template.
- README with additional information and guidelines for contributors.
- Pull Request Workflow includes `test` job which installs Poetry and Run Tests.
- Add `.pre-commit-config.yaml` for pre-commit hooks.
- Add CODEOWNERS file to repository.
- Add mkdocs; `deploy_mkdocs.yaml` and `docs` Folder.
- Add the helpers_spark.py and test_helpers_spark.py modules from cprices-utils.
- Add logging.py and test_logging.py module from cprices-utils.
- Add the helpers_python.py and test_helpers_python.py modules from cprices-utils.
- Add averaging_methods.py and test_averaging_methods.py.
- Add `init_logger_advanced` in `helpers/logging.py` module.
- Add in the general validation functions from cprices-utils.
- Add `invalidate_impala_metadata` function to the `cdsw/impala.py` module.
- Add "search" Plugin and mkdocs GOV UK Theme via `mkdocs-tech-docs-template`.
- Add `pipeline_runlog.py` and `hdfs_utils.py` modules from `epds_utils`.
- Add common custom exceptions.
- Add config load class.
- Add generic IO input functions.
- Add `docs/contribution_guide.md`
- Add functions from `epds_utils` into `helpers/pyspark.py`, `io/input.py`, `io/output.py`.
- Add various I/O functions from the io.py module in cprices-utils.
- Add modules to `docs/reference.md`
- Add mkdocs Plugins: `mkdocs-git-revision-date-localized-plugin`, `mkdocs-jupyter`.
- Add better navigation to `mkdocs.yml`.
- Add `save_csv_to_hdfs` function to `cdsw/io/output.py`.
- Add `docs/branch_and_deploy_guide.md`.
- Add `.github/workflows/deploy_pypi/version_check.yaml` and `.github/workflows/deploy_pypi/deploy_pypi.yaml`.
- Renamed `_typing` module to `typing`.
- Renamed modules in helpers directory to remove `helper_` from names.
- Relocated `logging.py` and `validation.py` to root level.
- Relocated `Getting Started for Developers` into `docs/contribution_guide.md`.
- Migrated from `poetry` to `setup.py` for Python Code Packaging.
- Upgrade `mkdocs-tech-docs-template` to `0.1.2`.
- Moved CDSW related from `io/input.py` & `io/output.py` into `cdsw/io/input.py` & `cdsw/io/output.py`
- Pin `pytest` version `<8.0.0` due to TvoroG/pytest-lazy-fixture#65
- Updated the license information.
- Fix paths for `get_window_spec` in `averaging_methods.py`.
- Fix `deploy_mkdocs.yaml`, changed `mkdocs-material` to `mkdocs-tech-docs-template`.
- Fix module paths for unit test patches in `tests/cdsw/`.
- Fix `pull_request_workflow.yaml`; ensured pytest failures are accurately reported in GitHub workflow by removing `|| true` condition.
- Fix `deploy_mkdocs.yaml`, fixed Python version to `3.10`.
- Fix `deploy_mkdocs.yaml`, missing quotes for Python version.
- Remove `_version.py`.
- Remove all references to Poetry.
Note: Releases prior to v0.1.8 are not available on GitHub Releases and PyPI due to bugs in the GitHub Action `deploy_pypi.yaml`, which deploys to PyPI and GitHub Releases.
- rdsa-utils v0.16.1: GitHub Release | PyPI
- rdsa-utils v0.16.0: GitHub Release | PyPI
- rdsa-utils v0.15.0: GitHub Release | PyPI
- rdsa-utils v0.14.1: GitHub Release | PyPI
- rdsa-utils v0.14.0: GitHub Release | PyPI
- rdsa-utils v0.13.3: GitHub Release | PyPI
- rdsa-utils v0.13.2: GitHub Release | PyPI
- rdsa-utils v0.13.1: GitHub Release | PyPI
- rdsa-utils v0.13.0: GitHub Release | PyPI
- rdsa-utils v0.12.1: GitHub Release | PyPI
- rdsa-utils v0.12.0: GitHub Release | PyPI
- rdsa-utils v0.11.0: GitHub Release | PyPI
- rdsa-utils v0.10.1: GitHub Release | PyPI
- rdsa-utils v0.10.0: GitHub Release | PyPI
- rdsa-utils v0.9.4: GitHub Release | PyPI
- rdsa-utils v0.9.3: GitHub Release | PyPI
- rdsa-utils v0.9.2: GitHub Release | PyPI
- rdsa-utils v0.9.1: GitHub Release | PyPI
- rdsa-utils v0.9.0: GitHub Release | PyPI
- rdsa-utils v0.8.0: GitHub Release | PyPI
- rdsa-utils v0.7.4: GitHub Release | PyPI
- rdsa-utils v0.7.3: GitHub Release | PyPI
- rdsa-utils v0.7.2: GitHub Release | PyPI
- rdsa-utils v0.7.1: GitHub Release | PyPI
- rdsa-utils v0.7.0: GitHub Release | PyPI
- rdsa-utils v0.6.0: GitHub Release | PyPI
- rdsa-utils v0.5.0: GitHub Release | PyPI
- rdsa-utils v0.4.4: GitHub Release | PyPI
- rdsa-utils v0.4.3: GitHub Release | PyPI
- rdsa-utils v0.4.2: GitHub Release | PyPI
- rdsa-utils v0.4.1: GitHub Release | PyPI
- rdsa-utils v0.4.0: GitHub Release | PyPI
- rdsa-utils v0.3.7: GitHub Release | PyPI
- rdsa-utils v0.3.6: GitHub Release | PyPI
- rdsa-utils v0.3.5: GitHub Release | PyPI
- rdsa-utils v0.3.4: GitHub Release | PyPI
- rdsa-utils v0.3.3: GitHub Release | PyPI
- rdsa-utils v0.3.2: GitHub Release | PyPI
- rdsa-utils v0.3.1: GitHub Release | PyPI
- rdsa-utils v0.3.0: GitHub Release | PyPI
- rdsa-utils v0.2.3: GitHub Release | PyPI
- rdsa-utils v0.2.2: GitHub Release | PyPI
- rdsa-utils v0.2.1: GitHub Release | PyPI
- rdsa-utils v0.2.0: GitHub Release | PyPI
- rdsa-utils v0.1.10: GitHub Release | PyPI
- rdsa-utils v0.1.9: GitHub Release | PyPI
- rdsa-utils v0.1.8: GitHub Release | PyPI
- rdsa-utils v0.1.7 - Not available on GitHub Releases or PyPI
- rdsa-utils v0.1.6 - Not available on GitHub Releases or PyPI
- rdsa-utils v0.1.5 - Not available on GitHub Releases or PyPI
- rdsa-utils v0.1.4 - Not available on GitHub Releases or PyPI
- rdsa-utils v0.1.3 - Not available on GitHub Releases or PyPI
- rdsa-utils v0.1.2 - Not available on GitHub Releases or PyPI
- rdsa-utils v0.1.1 - Not available on GitHub Releases or PyPI
- rdsa-utils v0.1.0 - Not available on GitHub Releases or PyPI