# Label-Wise Document Pre-Training for Multi-Label Text Classification

This is the code for the NLPCC 2020 paper "Label-Wise Document Pre-Training for Multi-Label Text Classification".
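As a rough, hypothetical sketch (not the authors' exact architecture), the label-wise idea is to build one document representation per label, for example by attending over token states with a learned query vector per label:

```python
import numpy as np

def label_wise_pooling(hidden, label_queries):
    """Pool token states into one document vector per label via attention.

    hidden:        (seq_len, dim) token representations (e.g. LSTM outputs)
    label_queries: (num_labels, dim) one learned query vector per label
    returns:       (num_labels, dim) label-wise document representations
    """
    scores = label_queries @ hidden.T                    # (num_labels, seq_len)
    scores -= scores.max(axis=1, keepdims=True)          # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)        # softmax over tokens
    return weights @ hidden                              # (num_labels, dim)

rng = np.random.default_rng(0)
doc_rep = label_wise_pooling(rng.normal(size=(50, 64)), rng.normal(size=(10, 64)))
assert doc_rep.shape == (10, 64)
```

Each row of the result is a document vector specialized to one label, which a downstream classifier can score independently.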
## Requirements

- Ubuntu 16.04
- Python >= 3.6.0
- PyTorch >= 1.3.0
## `data` and `outputs`

We provide the preprocessed RMSC and AAPD datasets, together with pre-trained checkpoints of the LW-LSTM+PT+FT and HLW-LSTM+PT+FT models, to ensure reproducibility. Please download them from the link and decompress them into the root directory of this repository.
```
--data
    |--aapd
        |--label_test
        |--label_train
        ...
    |--rmsc
        |--rmsc.data.test.json
        |--rmsc.data.train.json
        |--rmsc.data.valid.json
    aapd_word2vec.model
    aapd_word2vec.model.wv.vectors.npy
    aapd.meta.json
    aapd.pkl
    rmsc_word2vec.model
    rmsc_word2vec.model.wv.vectors.npy
    rmsc.meta.json
    rmsc.pkl
--outputs
    |--aapd
    |--rmsc
```
Note that `data/aapd` and `data/rmsc` are the initial datasets. Here we provide a split of RMSC (i.e. RMSC-V2).
## Usage

- Testing on AAPD

  ```shell
  python classification.py -config=aapd.yaml -in=aapd -gpuid [GPU_ID] -test
  ```

- Testing on RMSC

  ```shell
  python classification.py -config=rmsc.yaml -in=rmsc -gpuid [GPU_ID] -test
  ```

If you want to preprocess the datasets yourself, run the following command with the name of the dataset (e.g. RMSC or AAPD):

```shell
PYTHONHASHSEED=1 python preprocess.py -data=[RMSC/AAPD]
```

Note that `PYTHONHASHSEED` fixes the hash seed used by word2vec, so preprocessing is reproducible.
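Pinning the hash seed matters because gensim's word2vec uses Python's built-in `hash` (its default `hashfxn`) when initializing per-word weights. A stdlib-only sketch (not part of this repo) shows that `hash()` only becomes stable across runs once `PYTHONHASHSEED` is fixed:

```python
import os
import subprocess
import sys

def hash_in_fresh_interpreter(word, seed):
    """Compute hash(word) in a new Python process with PYTHONHASHSEED pinned."""
    env = dict(os.environ, PYTHONHASHSEED=seed)
    out = subprocess.run(
        [sys.executable, "-c", f"print(hash({word!r}))"],
        env=env, capture_output=True, text=True, check=True,
    )
    return int(out.stdout.strip())

# With the seed pinned, hash() agrees across interpreter runs, so any
# hash-dependent step in preprocessing is reproducible.
a = hash_in_fresh_interpreter("label", "1")
b = hash_in_fresh_interpreter("label", "1")
assert a == b
```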
Pre-train the LW-PT model:

```shell
python pretrain.py -config=[CONFIG_NAME] -out=[OUT_INFIX] -gpuid [GPU_ID] -train -test
```

- `CONFIG_NAME`: `aapd.yaml` or `rmsc.yaml`
- `OUT_INFIX`: infix of the outputs directory that contains logs and checkpoints

Train the downstream model for the MLTC task:

```shell
python classification.py -config=[CONFIG_NAME] -in=[IN_INFIX] -out=[OUT_INFIX] -gpuid [GPU_ID] -train -test
```

- `IN_INFIX`: infix of the inputs directory that contains the pre-trained checkpoints
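Putting the steps together, an end-to-end run on AAPD might look like the following. The `aapd_pt` and `aapd_ft` infixes and the GPU id are illustrative examples, not values required by the repository:

```shell
# Illustrative pipeline on AAPD; infixes and GPU id are examples only.
PYTHONHASHSEED=1 python preprocess.py -data=AAPD
python pretrain.py -config=aapd.yaml -out=aapd_pt -gpuid 0 -train -test
python classification.py -config=aapd.yaml -in=aapd_pt -out=aapd_ft -gpuid 0 -train -test
```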
- Build a static document representation to facilitate downstream tasks (only use this when necessary):

  ```shell
  python build_doc_rep.py -config=[CONFIG_NAME] -in=[IN_INFIX] -gpuid [GPU_ID]
  ```

- Make the RMSC-V2 dataset: `tests/make_rmsc.py`
- Visualize document embeddings: `tests/visual_emb.py`
- Visualize per-label F1 scores: `tests/visual_label_f1.py`
- Case study: `tests/case_study.py`
## Citation

If you find our work useful, please cite the paper:

```bibtex
@inproceedings{liu2020label,
  title     = "Label-Wise Document Pre-Training for Multi-Label Text Classification",
  author    = "Han Liu and Caixia Yuan and Xiaojie Wang",
  booktitle = "CCF International Conference on Natural Language Processing and Chinese Computing",
  year      = "2020"
}
```