SACX-backup

Backup for SACX keyword extraction pipeline

NOTICE SEPARATE LUMI BRANCH FOR RUNNING THIS ON LUMI!!!

What it does?

Runs multiple multilabel classification models and uses the Integrated Gradients method to explain the results of the classifier.
- i.e. IG method scores words based on their relevance in the classification
- We're trying to see which words consistently score high -> keywords
Aggregates the results over all trained models and produces lists of keywords for each class in the classification.

How to run?

run train_and_explain.py with different seed values:
- with Slurm: run sbatch sl-train-explain.bash <insert seed value here> with 10+ different seed values.
- without Slurm: run python train_and_explain.py --seed=<insert seed value here> with 10+ different seed values.
- Create folder "explanations" beforehand, unless you define the parameter --explanations=<new path> to point somewhere else.
- you can modify these parameters on the command line/slurm script or in the file arguments.py.
run kws.py with the results produced by training-explaining.
If needed, corpus information can be used. In that case, before kws.py, run count_class_words.py which calculates term and document frequencies.
- give the result as a parameter --corpus_information to kws.py and a limit for term frequency as --min_word_freq

Multilingual?

you can give multiple languages as --language='["en", "fr", "zh"]' etc.
trains with all languages, explains them separately.
for laguages that do not use white space to separate words, you can use a spacy-parser by defining for example --parse_separately="zh" and --parser_model="zh_core_web_md"
- in this case, feed the languages in separately, so that only one parser is loaded at a time. I.e. if there is only one language that needs separate parsing, this can be done:

python train_and_explain.py --language='["en", "fr", "jp"]' <other params> --parse_separately="jp" --parser_model="jp_core_news_md"

... but if you have, say "zh" and "jp", then they need to be explained separately:

python train_multilabel.py --language='["jp", "zh"]' --save_model=<path to model>
python explain_multilabel.py --trained_model=<path to model> --language='["jp"]' --parse_separately="jp" --parser_model="jp_core_news_md"
python explain_multilabel.py --trained_model=<path to model> --language='["zh"]' --parse_separately="zh" --parser_model="zh_core_web_md"

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

SACX-backup

NOTICE SEPARATE LUMI BRANCH FOR RUNNING THIS ON LUMI!!!

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 18 Commits
.gitignore		.gitignore
README.md		README.md
alignment.ipynb		alignment.ipynb
arguments.py		arguments.py
count_class_words.py		count_class_words.py
data_preparation.py		data_preparation.py
explain_multilabel.py		explain_multilabel.py
kws.py		kws.py
sl-train-explain.bash		sl-train-explain.bash
train_and_explain.py		train_and_explain.py
train_multilabel.py		train_multilabel.py

Folders and files

Latest commit

History

Repository files navigation

SACX-backup

NOTICE SEPARATE LUMI BRANCH FOR RUNNING THIS ON LUMI!!!

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages