RustedBytes/babylonify

babylonify

Babylonify filters rows in a Parquet file (or an entire directory of Parquet files) by the language detected in a text column. It wraps lingua for language detection, polars for columnar IO, and Rayon for parallelism, and writes Zstandard-compressed Parquet output.

Highlights

  • Detects the language for every row and retains only those matching one of the requested languages.
  • Accepts both ISO 639-1 codes (uk, en, ru, …) and language names (Ukrainian, English, русский, …).
  • Optional cleaning step removes numbers/emojis/symbols before detection so you can focus on alphabetic content.
  • Scales to many files: point the CLI at a directory and it mirrors the structure to an output directory.
  • Parallel row processing and streaming Parquet writers keep throughput high on large datasets.

Requirements

  • Rust toolchain with edition 2024 support (Rust 1.85 or newer), installed via rustup.
  • Input Parquet files containing at least one string column with textual data.

Install

# from a local checkout
cargo install --path .

# alternatively, build locally without installing
cargo build --release

After installation, the babylonify binary is available in your Cargo bin directory (~/.cargo/bin by default).

Quick start

Filter a single Parquet file, keeping Ukrainian rows from the default transcription column and cleaning the text before detection:

babylonify \
  --input data/transcripts.parquet \
  --output data/transcripts_uk.parquet \
  --lang uk \
  --clean

Batch-process every Parquet file in a directory by pointing --input at the folder. Outputs reuse the input file names within the provided output directory:

babylonify \
  --input data/raw/ \
  --output data/filtered/ \
  --lang english
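
The directory mapping described above can be sketched in Rust as follows. This is an illustration only, not babylonify's actual code: the helper name mirrored_output_path is hypothetical, and whether the tool descends into subdirectories is not stated here, so the sketch simply preserves whatever relative path the input file has.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical helper (not part of babylonify's public interface):
/// map a file found under `input_root` to the same relative location
/// under `output_root`, so the output reuses the input file name.
fn mirrored_output_path(
    input_root: &Path,
    output_root: &Path,
    input_file: &Path,
) -> Option<PathBuf> {
    // `strip_prefix` returns an error when `input_file` is not inside
    // `input_root`; in that case there is nothing sensible to mirror.
    let relative = input_file.strip_prefix(input_root).ok()?;
    Some(output_root.join(relative))
}
```

With the command above, data/raw/part-0.parquet would map to data/filtered/part-0.parquet under this scheme (part-0.parquet is a made-up file name).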

Keep several languages by repeating --lang. This example also writes the rejected rows to a separate directory with --output-invalid:

babylonify \
  --input cv22-opus-speech/ \
  --output cv22-opus-speech-filtered \
  --output-invalid cv22-opus-speech-rejected \
  --lang uk \
  --lang en \
  --lang ru \
  --clean

CLI reference

Run babylonify --help for the authoritative list. The most important options are:

| Flag | Description |
| --- | --- |
| -i, --input <PATH> | Parquet file to filter, or a directory of Parquet files to batch-process. |
| --input-dir <DIR> | Compatibility alias for --input <DIR>. |
| -o, --output <PATH/DIR> | Output Parquet path. When the input is a directory, this must be a directory and files are written with their original names. |
| --output-invalid <PATH/DIR> | Optional Parquet output for rejected rows. When the input is a directory, this must be a directory and mirrors the input file names. |
| -c, --column <NAME> | Name of the text column to inspect. Defaults to transcription. |
| -l, --lang <LANG> | Target language to keep. Repeat the flag to allow multiple languages. ISO codes, common aliases, and full names (case-insensitive) are accepted. Default: uk. |
| --threshold <FLOAT> | Minimum confidence required for the top detected language to be kept. Must be between 0.0 and 1.0. Default: 0.6. |
| --keep-empty | Preserve rows where the text column is NULL or an empty string. |
| --clean | Normalize whitespace and strip non-letter/non-punctuation symbols before detection; the cleaned text replaces the original column in the output. |
| --threads <N> | Set the Rayon thread pool size. Defaults to the current core count. |
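
How --lang, --threshold, and --keep-empty might combine into a per-row decision can be sketched as below. This is a minimal sketch under stated assumptions, not the tool's implementation: keep_row is a hypothetical function, the detector result is passed in as a plain (language, confidence) pair rather than a lingua type, and the comparison against the threshold is assumed to be inclusive.

```rust
/// Hypothetical per-row decision (illustrative only).
///
/// * `text`       - the cell value; `None` stands for a NULL cell.
/// * `top`        - the top detected language and its confidence, if any.
/// * `allowed`    - the languages requested via `--lang`.
/// * `threshold`  - the value of `--threshold`, in 0.0..=1.0.
/// * `keep_empty` - whether `--keep-empty` was passed.
fn keep_row(
    text: Option<&str>,
    top: Option<(&str, f64)>,
    allowed: &[&str],
    threshold: f64,
    keep_empty: bool,
) -> bool {
    match text {
        // NULL and empty strings are kept only when --keep-empty is set.
        None => keep_empty,
        Some(t) if t.is_empty() => keep_empty,
        Some(_) => match top {
            // Assumption: a confidence equal to the threshold still passes.
            Some((lang, confidence)) => {
                allowed.iter().any(|a| *a == lang) && confidence >= threshold
            }
            // No detection result: treat the row as rejected.
            None => false,
        },
    }
}
```

Rows for which such a function returns false are the ones that would end up in the --output-invalid file when that flag is given.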

The output Parquet schema matches the input schema; when --clean is supplied the specified text column is replaced with the cleaned content.
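
To make the effect of --clean concrete, here is a rough Rust sketch of that kind of cleaning step. It is an approximation, not babylonify's code: clean_text is a hypothetical name, and the set of punctuation characters that survive is an assumption chosen for the example.

```rust
/// Hypothetical cleaning step (illustrative only): keep letters,
/// whitespace, and a small assumed set of punctuation marks, then
/// collapse every run of whitespace into a single space.
fn clean_text(input: &str) -> String {
    let filtered: String = input
        .chars()
        .filter(|c| {
            c.is_alphabetic()
                || c.is_whitespace()
                // Assumption: the real tool may treat a different,
                // likely broader, set of characters as punctuation.
                || matches!(c, '.' | ',' | '!' | '?' | ';' | ':' | '\'' | '"' | '-' | '(' | ')')
        })
        .collect();

    // `split_whitespace` drops leading/trailing whitespace and splits on
    // any whitespace run, so joining with " " normalizes the spacing.
    filtered.split_whitespace().collect::<Vec<_>>().join(" ")
}
```

Digits and emojis are dropped by the filter, so an input such as "Привіт,   світ! 123 🙂" comes out as "Привіт, світ!".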

Language aliases

Short codes and localized names for Ukrainian, English, Russian, Polish, German, French, and Spanish are recognized. Any other lingua-supported language can be addressed using its English enum name (for example italian, portuguese). Unknown values yield a helpful error.
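
The alias lookup can be pictured as a case-insensitive match. The sketch below is not the tool's real table: resolve_language is a hypothetical function, it covers only a few of the spellings mentioned in this README, and the error text is made up for the example.

```rust
/// Hypothetical alias resolver (illustrative only): map a user-supplied
/// value to a canonical language name, ignoring case.
fn resolve_language(input: &str) -> Result<&'static str, String> {
    // `to_lowercase` is Unicode-aware, so "Русский" matches "русский".
    let key = input.to_lowercase();
    match key.as_str() {
        "uk" | "ukrainian" => Ok("Ukrainian"),
        "en" | "english" => Ok("English"),
        "ru" | "russian" | "русский" => Ok("Russian"),
        // Assumed error wording; the real message may differ.
        other => Err(format!("unknown language: {other}")),
    }
}
```

Under this scheme --lang UK, --lang uk, and --lang Ukrainian all resolve to the same language, while an unrecognized value is reported as an error instead of being silently ignored.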

Development

  • Format the codebase: cargo fmt --all.
  • Lint all targets and features: cargo clippy --all-targets --all-features -- -D warnings.
  • Run the full test suite: cargo test --all-features.
  • The integration tests under tests/cli.rs exercise the end-to-end CLI against synthetic Parquet fixtures.

License

Babylonify is distributed under the MIT License. See LICENSE for the full text.
