- Benchmark harness to evaluate models across multiple languages automatically: https://github.com/EleutherAI/lm-evaluation-harness
- Try running the evaluation with https://github.com/vllm-project/vllm to speed up the evaluation process (see the sketch after this list)
- Use Global-MMLU to evaluate our model on Portuguese (PT) (https://arxiv.org/pdf/2412.03304)
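A minimal sketch of what this could look like through the harness's Python API, using the vLLM backend for speed. The model path and the `global_mmlu_pt` task id are assumptions (check `lm-eval --tasks list` for the actual name):

```python
# Sketch: evaluate a model on the (assumed) Global-MMLU PT task via
# lm-evaluation-harness, using the vLLM backend to speed up generation.
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    # "our-org/our-pt-model" is a placeholder for our checkpoint
    model_args="pretrained=our-org/our-pt-model,dtype=bfloat16,gpu_memory_utilization=0.8",
    tasks=["global_mmlu_pt"],  # assumed task id; confirm against the task list
    batch_size="auto",
)
print(results["results"])
```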
- TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters
- Llama 2: Open Foundation and Fine-Tuned Chat Models
- Global MMLU
- Motivating papers
- "Efficiently Adapting Pretrained Language Models to New Languages"
- Very similar to what we're trying to achieve, but they only focused on the "mean" (plain unweighted averaging)
- "Getting the most out of your tokenizer for pre-training and domain adaptation"
- Good paper for understanding which procedures to follow in the pre-tokenization step (compression with regular expressions, BPE, etc.)
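The pre-tokenization choice from that paper is something we would fix when training the new tokenizer. A minimal sketch assuming Hugging Face `transformers`, where `train_new_from_iterator` reuses the base tokenizer's pre-tokenization/normalization and only relearns the vocabulary; the base model, corpus file, and vocab size are placeholders:

```python
# Sketch: train a Portuguese tokenizer that keeps the base model's
# pre-tokenization scheme (regex splitting, byte-level handling, etc.).
from transformers import AutoTokenizer

# Placeholder base model; any fast tokenizer works here.
base_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

def corpus_iterator(path="pt_corpus.txt", batch_size=1000):
    """Yield batches of raw Portuguese text lines for tokenizer training."""
    with open(path, encoding="utf-8") as f:
        batch = []
        for line in f:
            batch.append(line)
            if len(batch) == batch_size:
                yield batch
                batch = []
        if batch:
            yield batch

pt_tokenizer = base_tokenizer.train_new_from_iterator(
    corpus_iterator(), vocab_size=32_000
)
pt_tokenizer.save_pretrained("pt-tokenizer")
```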
- Tokenizer training
- Calculation of new embeddings:
- Use a weighted average based on the original sub-tokens' lengths (see the sketch after this list)
- E.g.: new token "martelo", old tokens ["mar", "telo"] -> the embedding for "martelo" is the weighted average of the embeddings for "mar" and "telo", with weight 3/7 for "mar" and 4/7 for "telo"
- Try translating the word from the target language to the original language and use the embedding of that translated word (also covered in the sketch below)
- E.g.: new token "martelo" translates to "hammer", so use the embedding for "hammer" (if the original language is English)
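A minimal sketch of both initialization strategies, assuming a Hugging Face model/tokenizer pair; the model name and the PT-to-EN lexicon are placeholder assumptions:

```python
# Sketch: two ways to initialize the embedding of a token added to the vocabulary.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder base model; swap in whatever checkpoint we are adapting.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
old_tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
embeddings = model.get_input_embeddings().weight  # (old_vocab_size, hidden_dim)

def weighted_average_init(new_token: str) -> torch.Tensor:
    """Length-weighted average: "martelo" -> ["mar", "telo"] weighted 3/7, 4/7."""
    ids = old_tok(new_token, add_special_tokens=False)["input_ids"]
    pieces = old_tok.convert_ids_to_tokens(ids)
    # Strip sub-word markers ("▁", "Ġ") so weights reflect character counts only.
    lengths = torch.tensor(
        [max(len(p.lstrip("▁Ġ")), 1) for p in pieces], dtype=torch.float
    )
    weights = lengths / lengths.sum()
    return (weights.unsqueeze(1) * embeddings[ids]).sum(dim=0)

# Hypothetical PT -> EN lexicon; stands in for an MT model or bilingual dictionary.
PT_TO_EN = {"martelo": "hammer"}

def translation_init(new_token: str) -> torch.Tensor:
    """Use the embedding of the English translation, falling back to the average."""
    translation = PT_TO_EN.get(new_token)
    if translation is None:
        return weighted_average_init(new_token)
    ids = old_tok(translation, add_special_tokens=False)["input_ids"]
    return embeddings[ids].mean(dim=0)  # plain mean if the translation splits up
```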
- Final deployment:
- Explore https://github.com/vllm-project/vllm to deploy the final LLM or to run some tests (sketch below)
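A minimal sketch of smoke-testing the final checkpoint with vLLM's offline API; the model path and prompt are placeholders. For actual deployment, vLLM's OpenAI-compatible API server is likely the right entry point:

```python
# Sketch: quick generation test of the final model with vLLM.
from vllm import LLM, SamplingParams

llm = LLM(model="our-org/our-pt-model")  # placeholder for the final checkpoint
params = SamplingParams(temperature=0.7, max_tokens=128)

# Prompt means "Write a sentence about a hammer."
outputs = llm.generate(["Escreva uma frase sobre um martelo."], params)
for out in outputs:
    print(out.outputs[0].text)
```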