All code is in the directory named `Assignment 1`.
Check out our Report.
- Dataset - the given corpus: The Atticus dataset of legal contracts
- A text file containing a list of stopwords (used to evaluate lexical density): `StopWords.txt`
- To install all dependencies, run this command in your terminal under the `src` and `tests` directories respectively: `pip install -r requirements.txt`
- Note: All the prerequisite files are placed in the archive under the `Assignment 1/src/` directory.
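Lexical density (the fraction of tokens that are content words rather than stopwords) can be sketched as follows. This is a minimal illustration, not the assignment's actual code; the one-stopword-per-line format of `StopWords.txt` and the function names are assumptions:

```python
# Hypothetical sketch: lexical density = non-stopword tokens / total tokens.
# Assumes StopWords.txt lists one stopword per line (file name from this README).

def load_stopwords(path="StopWords.txt"):
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

def lexical_density(tokens, stopwords):
    """Fraction of tokens that are content words (i.e., not stopwords)."""
    if not tokens:
        return 0.0
    content = [t for t in tokens if t.lower() not in stopwords]
    return len(content) / len(tokens)

# Example: 2 of 4 tokens are content words.
lexical_density(["the", "contract", "is", "binding"], {"the", "is", "of", "a"})  # 0.5
```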
- Implementation of our Word Tokenizer
- Extract the given corpus.
- Concatenate text files to form a corpus.
- Analyze statistics about tokens in the corpus.
- Note: Unlike the source code, which is stored in the `src` directory, unit tests are stored under the `tests` directory.
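The pipeline above (concatenate text files into a corpus, then analyze token statistics) can be sketched as below. The directory layout and the naive regex tokenizer are assumptions standing in for the assignment's own word tokenizer:

```python
# Hedged sketch of the Part 1 pipeline: join .txt files into one corpus and
# compute simple token statistics. Tokenizer is a naive placeholder.
from collections import Counter
from pathlib import Path
import re

def build_corpus(directory):
    """Concatenate all .txt files under `directory` into a single string."""
    parts = [p.read_text(encoding="utf-8") for p in sorted(Path(directory).glob("*.txt"))]
    return "\n".join(parts)

def tokenize(text):
    """Naive word tokenizer (placeholder for the assignment's tokenizer)."""
    return re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)

def token_stats(text):
    """Basic corpus statistics: token count, type count, top tokens."""
    tokens = tokenize(text)
    counts = Counter(t.lower() for t in tokens)
    return {"tokens": len(tokens), "types": len(counts),
            "most_common": counts.most_common(5)}
```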
Please run all cells in the Jupyter Notebook `Part 1 - Corpus Processing.ipynb` sequentially.
- Please run the test file named `tests/test_word_tokenizer.py`.
- Please also run the test file named `tests/test_count_occurences.py`.
Please run the test file named `tests/test_lemmatizer.py`.
This directory also contains the code and data used for evaluating sentence embedding models for semantic textual similarity tasks.
- Dataset: The SemEval-2016 Task 1 dataset (STS data) is provided in the directory `STS_Data/`.
- Output Directory: Results of the evaluation will be stored in `Part 2 - Output/`.
- Requirements: Dependencies are listed in `requirements.txt`. Install them by running: `pip install -r requirements.txt`
The dataset is preloaded in the folder `STS_Data/` and contains multiple .txt files:
- STS2016.input.answer-answer.txt
- STS2016.input.headlines.txt
- STS2016.input.plagiarism.txt
- STS2016.input.postediting.txt
- STS2016.input.question-question.txt
Concatenate these files to form a single test dataset as required by your experiment.
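The concatenation step can be sketched as follows. The filenames come from this README; the output filename and the helper itself are hypothetical:

```python
# Sketch: merge the five STS2016 input files into one test set.
# File names are from this README; the output path is an assumption.
from pathlib import Path

STS_FILES = [
    "STS2016.input.answer-answer.txt",
    "STS2016.input.headlines.txt",
    "STS2016.input.plagiarism.txt",
    "STS2016.input.postediting.txt",
    "STS2016.input.question-question.txt",
]

def concatenate_sts(data_dir="STS_Data", out_path="STS2016.all.txt"):
    """Write all input lines to a single file; return the line count."""
    lines = []
    for name in STS_FILES:
        text = Path(data_dir, name).read_text(encoding="utf-8")
        lines.extend(text.splitlines())
    Path(out_path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines)
```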
- Open the Jupyter Notebook `Part 2 - Sentence Similarity.ipynb`.
- Run all cells sequentially to evaluate the pre-trained sentence embedding models.
The notebook includes:
- Loading the dataset.
- Applying pre-trained models such as SBERT and other LLM-based embeddings.
- Evaluating similarity and computing the Pearson correlation with ground truth scores.
- Outputs of the experiments will be saved in the directory `Part 2 - Output/`.
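The evaluation step (similarity prediction scored by Pearson correlation against gold labels) can be sketched as below. This is a self-contained illustration with toy vectors; in the notebook the embeddings would come from SBERT or another pre-trained model:

```python
# Hedged sketch of the scoring step: cosine similarity between sentence
# embeddings, evaluated by Pearson correlation with gold similarity scores.
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def pearson(x, y):
    """Pearson correlation coefficient between two score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Toy example: predicted similarities for three embedding pairs vs. gold scores.
preds = [cosine(u, v) for u, v in [([1, 0], [1, 0]), ([1, 0], [0, 1]), ([1, 1], [1, 0])]]
gold = [5.0, 0.0, 3.0]
```

In the notebook, `preds` would be computed over the concatenated STS test set and compared against the SemEval gold annotations on the 0-5 similarity scale.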
Pre-trained sentence embedding models evaluated include:
- SBERT (Sentence-BERT): A model designed for semantic textual similarity tasks.
- Additional models (e.g., models based on recent generative LLMs).