__Breaking Changes__
- Extract has been refactored into three scripts: extract-schema, extract-data and extract-script
New feature:
- Support for Jinja templating everywhere
- area property is now ignored in YAML files
- Support for Amazon Redshift and Snowflake
- Quickstart documentation upgraded
- single command setup and run using starlake.sh / starlake.cmd
- Updated quickstart with docker use
- Infer schema now recognizes dates as dates, not timestamps
New feature:
- Domain & Jobs delivery in the REST API
Bug Fix:
- Support dynamic values for comet metadata through the REST API.
New feature:
- Add Server mode
Bug Fix:
- Extensions may be defined at the domain level
Bug Fix:
- Use Spark Project Jetty shaded class to remove extra jetty dependency in Starlake server
New feature:
- Added "serve --port 7070" to start starlake in server mode and wait for requests
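For illustration, server mode from the entry above could be started with the single-command script mentioned earlier in these notes (the flag is taken verbatim from the entry; check the CLI help for any additional options):

```
./starlake.sh serve --port 7070
```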
New feature:
- Support any source to any sink using kafkaload, including sinks and sources that are not Kafka. This was made possible at the cost of a breaking change
- Support table and column remarks extraction on DB2 iSeries databases
CI:
- Remove support for the GitHub registry
- Remove Scala 2.11 support
New feature:
- Support JINJA in autojob
- Support external views defined using JINJA
- File Splitter allows splitting a file based on the first column or on a position in the line.
New feature:
- Add ACL Graph generation
Bug Fix:
- Improve GraphViz Generation
Bug Fix:
- Generate final name in Graphviz diagram
New feature:
- Improve CLI doc generation. Extra documentation can be added in the docs/merge/cli folder
- Prepare to deprecate the xml tag in the metadata section.
Bug Fix:
- Code improvement: JDBC is handled as a generic sink
- Add extra parentheses in BQ queries only for SELECT and WITH statements
New feature:
- Reduce assembly size
- Update to sbt 1.7.1
- Add interactive mode for transform with csv, json and table output formats
- Improve FS Sink handling
Bug Fix:
- Support empty env files
Bug Fix:
- Keep backward compatibility with Scala 2.11
New feature:
- Handle Mixed XSD / YML ingestion & validation
- Support JSON / XML descriptions in XLS files
- Support arrays in XLS files
Bug Fix:
- Support file system sink options in autojob
New feature:
- Enhance XLS support for escaping characters
- Support HTTP Stream Source
- Support XSD Validation
- Transform jobs now report on the number of affected rows.
Bug Fix:
- Fix regression in the return value of an autojob
New feature:
- Support extra dsv options in conf file
- Support any option stored in metadata.options as a reader option.
- Support VSCode Development
New feature:
- Upgrade Kafka libraries
- Simplify removal of comments in autojobs SQL statements.
New feature:
- Deprecate usage of schema and schemaRefs in domains and dataset in autojobs. Prefer table and tableRefs
Bug Fix:
- fix regression on Merge mode without Timestamp option
Bug Fix:
- Xls2Yml - Get a correct sheet name based on the schema name field
New feature:
- Improve XLS support for long names
- Avoid rate limit errors (HTTP 429) on some cloud providers by setting COMET_GROUPED_MAX.
Bug Fix: reorder transformations on attributes as follows:
- rename columns
- run script fields
- apply transformations (privacy: "sql: ...")
- remove ignored fields
- remove the input filename column
Bug Fix:
- Handle field relaxation when in Append Mode and table does not exist.
Bug Fix:
- Make fields in rejected table optional
New feature:
- Roll back support for kafka.properties files; it is redundant since a server-options properties file already exists.
New feature:
- Improve XLS support for metadata
New feature:
- Autoload kafka.properties file from metadata directory.
New feature:
- Parallel copy of files when loading and archiving
- Support renaming of domains and schemas in XLS
- Fix release process
New feature:
- import step can be limited to one or more domains
New feature:
- Update Kafka / BigQuery libraries
- Add new preset env vars
- Allow renaming of domains and schemas
New feature:
- Vars in assertions are now substituted at load time
- Support SQL statement in privacy phase
- Support parameterized semantic types
- Add support for generic sink
- Allow use of custom deserializer on Kafka source
New feature:
- Drop Java 1.8 prerequisite for compilation
- Support custom database name for Hive compatible metastore
- Support custom dataset name in BQ
New feature:
- Drop support for Spark 2.3.X
- Allow table renaming on write
- Any Spark supported input is now allowed
- Env vars in env.yml files
New feature:
- Generate DDL from YML files with support for BigQuery, Snowflake, Synapse and Postgres #51 / #56
- Improve XLS handling: Add support for presql / postsql, tags, primary and foreign keys #59
- Add optional application of row & column level security
- Databricks Support
- Significant reduction of memory consumption
- Support application.conf file in metadata folder (COMET_METADATA_FS and COMET_ROOT must still be passed as env variables)
Bug Fix:
- Include env var and option when running presql in ingestion mode #58
New feature:
- Support merging dataset with updated schema
- Support publishing to github packages
- Reduce number of dependencies
- Allow Audit sink name configuration from environment variable
- Dropped support for elasticsearch 6
Bug Fix:
- Support timestamps as long values in XML & JSON files
New feature:
- Support XML Schema inference
- Support the ability to reject the whole file on error
- Improve error reporting
- Support engine on task SQL (query pushdown to BigQuery)
- Support last(n) partition on merge
- Added new env var to control partitioning: COMET_SPARK_SQL_SOURCES_PARTITION_OVERWRITE_MODE
- Added env vars to control BigQuery materialization on pushdown queries: COMET_SPARK_BIGQUERY_MATERIALIZATION_PROJECT and COMET_SPARK_BIGQUERY_MATERIALIZATION_DATASET (default: materialization)
- Added env var to control the BigQuery read data format: COMET_SPARK_BIGQUERY_READ_DATA_FORMAT (default: AVRO)
- When COMET_MERGE_OPTIMIZE_PARTITION_WRITE is set and dynamic partitioning is active, only write partitions containing new, updated or deleted records for BQ (handled by Spark by default for FS).
- Add VALIDATE_ON_LOAD (comet-validate-on-load) property to raise an exception if one of the domain/job YML files is invalid (default: false)
- Add custom file extensions property in Domain import: default-file-extensions and env var COMET_DEFAULT_FILE_EXTENSIONS
Bug Fix:
- Loading empty files when the schema contains script fields
- Applying default value for an attribute when value in the input data is null
- Transformation job with BQ engine fails when no views block is defined
- XLS2YML : remove non-breaking spaces from Excel file cells to avoid parsing errors
- Fix merge using timestamp option
- Json ingestion fails with complex array of objects
- Remove duplicates on incoming when existingDF does not exist or is empty
- Parse Sink options correctly
- Handle extreme cases where the audit lock raises an exception on creation
- Handle files without extension in the landing zone
- Store audit log with batch priority on BigQuery
Bug Fix:
- Handle Jackson bug
New feature:
- Add ability to ignore some fields (only top level fields supported)
- BREAKING CHANGE: Handle multiple schemas during extraction. Update your extract configurations before migrating to this version.
- Improve InferSchemaJob
- Include primary keys & foreign keys in JDBC2Yml
Bug Fix:
- Handle rename in JSON / XML files
- Handle timestamp fields in JSON / XML files
- Do not partition rejected files
- Add COMET_CSV_OUTPUT_EXT env var to customize filename extension after ingestion when CSV is active.
New feature:
- Use the same variable for Lock timeout
- Improve logging when locking file fails
- File sink, while still the default, is now controlled by the sink tag in the YAML file. The sink-to-file option is removed and used for testing purposes only.
- Allow custom topic name for comet_offsets
- Add ability to coalesce(int) to kafka offloading feature
- Attributes may now be declared as primary and/or foreign keys, even though no check is made.
- Export schema and relations (PK / FK) as dot (Graphviz) files.
- Support saving comet offsets to filesystem instead of kafka using the new setting comet-offsets-mode = "STREAM"
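A hedged configuration sketch for the offsets setting in the last entry above (HOCON, as used in the project's configuration files; only the comet-offsets-mode key and its "STREAM" value are taken from the entry):

```hocon
# Persist comet offsets to the filesystem instead of Kafka
comet-offsets-mode = "STREAM"
```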
Bug Fix:
- Invalid YAML files now produce an error at startup instead of a warning.
- Version skipped
New feature:
- Export all tables in JDBC2Yml generation
- Include table & column names when meeting unknown column type in JDBC source schema
- Better logging on forced conversion in JDBC2Yml
- Compute Hive Statistics on Table & Partitions
- DataGrip support with implementation of substitution for ${} in addition to {{}}
- Improve logging
- Add column type during database extraction
- The name attribute inside a job file should reflect the filename. This attribute will soon be deprecated
- Allow Templating on jobs. Useful to generate Airflow / Oozie Dags from job.comet.yml/job.sql code
- Switch from readthedocs to docusaurus
- Add local and bigquery samples
- Custom var pattern through sql-pattern-parameter in reference.conf
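The custom var pattern from the last entry is declared in reference.conf; a minimal sketch, where the regex value is purely illustrative and should be checked against the documentation:

```hocon
# Illustrative value only: override the pattern used to detect ${} variables in SQL
sql-pattern-parameter = "\\$\\{%s\\}"
```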
Bug Fix:
- Avoid computing statistics on struct fields
- Make database-extractor optional in application.conf
New feature:
- Parameterize with Domain & Schema metadata in JDBC2Yml generation
Bug Fix:
New feature:
- Auto compile with scala 2.11 for Spark 2 and with scala 2.12 for Spark 3. [457]
- Performance optimization when using Privacy Rules. [459]
- Rejected area and audit logs support can have their own write format (default-rejected-write-format and default-audit-write-format properties)
- Deep JSON & XML files are now validated against the schema
- Privacy is applied on deep JSON & XML inputs [461]
- Domains & Jobs may be defined in subdirectories, allowing better organization of metadata files [462]
- Substitute variables through CLI & env files in views, assertions, presql, main sql and post sql requests [462]
- Semantic type Date supports dates with MMM month representation [463]
- Split reference.conf into multiple files. [460]
- Support kafka Source & Sink through Spark Streaming [460]
- Add an alternative way of applying privacy on XML files. [466]
- Generate Excel files from YML files
- Generate YML file from Database Schema
Bug Fix:
- Make Jackson lib provided. [457]
- Support Spark 2.3 by not using Dataframe.isEmpty [457]
- comet_input_file_name missing when ingesting Position files [466]
- Apply postsql queries on the accepted DataFrame [466]
- Check that scripted fields are defined at the end of the schema in the YML file [#384]
New feature:
- Allow sink options to be defined in YML instead of Spark Submit. [#450] [#454]
Bug Fix:
- Parse dates with yyyyMM format correctly [#451]
- Fix error when saving a csv with an empty DataFrame [#451]
- Keep column description in BQ tables when using Overwrite mode [#453]
Bug Fix:
- Support correctly merge mode in BQ [#449]
- Fix for sinking XML to BQ [#448]
New feature:
- Kafka Support improved
New feature:
- Optionally sink to file using property sink-to-file = ${?COMET_SINK_TO_FILE}
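The entry above quotes the property verbatim; in HOCON it would read as follows (the env-var substitution syntax is standard HOCON):

```hocon
# Optionally sink to file; value can be overridden by the COMET_SINK_TO_FILE env var
sink-to-file = ${?COMET_SINK_TO_FILE}
```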
Bug Fix:
- Sink name was ignored and always considered as None
New feature:
- YML files are now renamed with the suffix .comet.yml
- Comet Schema is now published on SchemaStore. This allows Intellisense in VSCode & Intellij
- Assertions may now be executed as part of the Load and transform processes
- Shared Assertions UDF may be defined and stored in COMET_ROOT/metadata/assertions
- Views may also be defined and shared in COMET_ROOT/metadata/views.
- Views are accessible in the load and transform processes.
- Domains may now be prefixed by the "load" tag. Defining a domain without the "load" tag is now deprecated
- AutoJobs may now be prefixed by the "transform" tag. Defining an autojob without the "transform" tag is now deprecated
Breaking Changes:
- N.A.
Bug Fix:
- Use Spark Application Id for JobID information to make auditing easier
New feature:
- Expose a REST API to generate a Yaml Schema from an Excel file. [#387]
- Support ingesting multiline complex JSON. [#391]
- Support nested fields when generating schema for BigQuery tables. [#391]
- Enhancements on Spark to BigQuery schema. [#395]
- Support merging a part of a BigQuery Table, rather than all the Table. [#397]
- Enable setting BigQuery intermediate format when sinking using ${?COMET_INTERMEDIATE_BQ_FORMAT}. [#398] [#400]
- Enhancement on Merging mode: do not depend on parquet files when using BigQuery tables.
Dependencies:
- Update sbt to 1.4.4 [#385]
- Update scopt to 4.0.0 [#390]