The RedPajama dataset contains 40 different precalculated metrics for each document:
- Choose the metrics that seem intuitively good (from the total of 40), but include at least the metrics used in CulturaX (12 different metrics)
- Follow the example of CulturaX and use the metric distributions to choose an appropriate threshold for each metric per language (see the sketch after this list):
  2a. Extract the values of a given metric
  2b. Compute p-quantiles from the metric value distribution
  2c. Select Q1 as the threshold for metrics that favor high values, Q3 for metrics that favor low values
- Filter the data with the resulting thresholds
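A minimal sketch of steps 2a-2c (the metric values here are made up; the actual metric choices and thresholds live in metric_analysis.ipynb):

```python
import numpy as np

# Hypothetical word counts extracted from a sample of documents (step 2a).
word_counts = np.array([12, 85, 240, 31, 560, 97, 410, 76, 150, 88])

# Step 2b: compute p-quantiles from the metric value distribution.
q1, q3 = np.quantile(word_counts, [0.25, 0.75])

# Step 2c: a metric that favors high values (e.g. number_of_words) gets Q1
# as a lower bound; a metric that favors low values (e.g. flagged_words)
# would instead get Q3 as an upper bound.
print(f"keep documents with number_of_words >= {q1}")
```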
GENERAL DIR STRUCTURE:
- everything in amanda/ is what is currently in use
- additional_scripts contains backups and old code
- logs, samples, and results are (hopefully) self-evident; some of them include a README for clarification
- test_datasets contains an HF dataset with flags for which metrics each text passed, etc. => these are used in filter_analysis
ANALYSIS SCRIPTS
metric_analysis.ipynb:
- contains the choices made
- contains instructions on how to run
- contains the script to make the rule-file.
filter_analysis.ipynb:
- contains analysis of the filter results: how many documents passed
- which metrics were used alone, which together
- runs slowly, sorry
DATA COLLECTION SCRIPTS
generate_sample_paths.sh
- generates a file listing random files for get_quantiles.py to read
get_quantiles.py and get_10th_quantiles.py
- reads a random sample of files, collects the chosen metrics, and calculates quantiles (sketched below)
- get_quantiles calculates the 10th, 25th, 75th, and 90th percentiles; get_10th_quantiles the 10th, 20th, ..., 90th
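Roughly, the collection step looks like the following sketch. The input layout is simplified here to one flat JSON object of metric values per line; the real scripts handle the actual RedPajama quality-signal format:

```python
import gzip
import json
from collections import defaultdict

import numpy as np

PERCENTILES = [10, 25, 75, 90]  # get_10th_quantiles: 10, 20, ..., 90
METRICS = ["number_of_words", "flagged_words"]  # subset for illustration

def collect_quantiles(sample_paths):
    """Read the sampled files, gather metric values, compute percentiles."""
    values = defaultdict(list)
    for path in sample_paths:
        with gzip.open(path, "rt") as f:
            for line in f:
                doc = json.loads(line)
                for metric in METRICS:
                    values[metric].append(doc[metric])
    return {metric: dict(zip(PERCENTILES, np.percentile(vals, PERCENTILES)))
            for metric, vals in values.items()}
```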
sl-quantiles.sh
- used to run generate_sample_paths.sh and get_quantiles.py on LUMI
parse_quantiles.py
- transforms the results of get_quantiles and get_10th_quantiles into the JSON rule file used for filtering (a possible shape is sketched below)
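A plausible shape for that rule file, written here purely as an illustration; the real format is whatever parse_quantiles.py emits and filter.py expects, and the file name rules.json is made up:

```python
import json

# Illustrative only: language -> metric -> threshold plus direction.
rules = {
    "en": {
        "number_of_words": {"threshold": 50, "keep": "above"},
        "flagged_words": {"threshold": 0.01, "keep": "below"},
    },
}
with open("data/redpajama-v2/rules.json", "w") as f:
    json.dump(rules, f, indent=2)
```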
FILTER
filter.py:
- the final of 4 scripts (older versions can be found in old_scripts); faster than the previous ones
- prints the document id as a JSON object if the document passes (see the sketch below)
- assumes the rule file (see above) is in data/redpajama-v2/
- the old versions do the filtering a bit more slowly and return the results in different forms (e.g. an HF dataset)
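A condensed sketch of that logic, reusing the illustrative rule-file shape from above (not necessarily how filter.py itself is structured):

```python
import json
import sys

def passes(doc, rules):
    """True if the document satisfies every metric threshold."""
    for metric, rule in rules.items():
        if rule["keep"] == "above" and doc[metric] < rule["threshold"]:
            return False
        if rule["keep"] == "below" and doc[metric] > rule["threshold"]:
            return False
    return True

def main(rule_path="data/redpajama-v2/rules.json"):
    with open(rule_path) as f:
        rules = json.load(f)["en"]
    for line in sys.stdin:  # one JSON document per line
        doc = json.loads(line)
        if passes(doc, rules):
            print(json.dumps({"id": doc["id"]}))  # emit the id as JSON

if __name__ == "__main__":
    main()
```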
sl-filter.sh:
- runs filter.py on LUMI
- Pre-computed metrics (number_of_words, number_of_lines, number_of_characters, language_identification, perplexity, stop_words, special_characters, flagged_words, words_per_line_mean, short_line_ratio, character_repetition10gram, character_repetition5gram, word_repetition, unigram_entropy, lines_end_in_punctuation) were used
- Filtering was based on 4 different threshold levels derived from p-quantiles, inspired by CulturaX:
- 0.05% of the data was used to calculate the distributions for Italian, German, French, and Spanish
- 0.02% of the data was used to calculate the distribution for English
- the lower p-th percentile was used for metrics that favor high values (e.g. number_of_words), while metrics that favor low values (e.g. flagged_words) use the upper p-th percentile (see the sketch after the level list)
- Regular: $p_1 = 10$, $p_3 = 90$
- Strict: $p_1 = 20$, $p_3 = 80$
- Even stricter: $p_1 = 30$, $p_3 = 70$
- Strictest: $p_1 = 40$, $p_3 = 60$
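In code, each level is just a pair of percentiles looked up in the collected tables. A sketch, assuming a quantile dict keyed by integer percentiles like the one produced by get_10th_quantiles.py:

```python
LEVELS = {  # (p1, p3) per threshold level
    "regular":       (10, 90),
    "strict":        (20, 80),
    "even_stricter": (30, 70),
    "strictest":     (40, 60),
}

def threshold_for(quantiles, level, favors_high):
    """Lower percentile for metrics favoring high values, upper otherwise."""
    p1, p3 = LEVELS[level]
    return quantiles[p1] if favors_high else quantiles[p3]
```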
- Of these levels, documents and signatures were filtered using:
- Regular for German, French, Italian, and Spanish
- Strict for English
- These filter levels were chosen based on the goal of 100B tokens for each language