This project evaluates and compares two Large Language Models (LLMs), microsoft/phi-2 and Cohere Command, on three software engineering tasks: code generation, test case generation, and documentation generation. The objective is to analyze their performance, output quality, and applicability in practical software engineering workflows.
Models evaluated:
- microsoft/phi-2: a lightweight open-source language model from Microsoft, run locally via HuggingFace Transformers, capable of both code and text generation
- Cohere Command: a commercial cloud-based model accessed through the Cohere API
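For orientation, here is a minimal sketch of how the two backends might be invoked. The actual wrappers live in src/models.py and may differ; the function names, the `COHERE_API_KEY` environment variable, and the Cohere model name `"command"` are assumptions:

```python
# Minimal sketch of the two generation backends; src/models.py may differ.
import os

import cohere
from transformers import AutoModelForCausalLM, AutoTokenizer


def generate_local(prompt: str, model_name: str = "microsoft/phi-2") -> str:
    """Generate a completion with a local HuggingFace model."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=256)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)


def generate_cohere(prompt: str) -> str:
    """Generate a completion with Cohere Command via the hosted API."""
    co = cohere.Client(os.environ["COHERE_API_KEY"])  # key read from the environment
    response = co.chat(model="command", message=prompt)  # model name assumed
    return response.text
```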
Tasks evaluated:
- Code Generation
- Test Case Generation
- Documentation Generation
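Task prompts live under data/tasks/ in the matching subdirectory. The exact schema consumed by src/tasks.py is not shown in this README, but a task file might look like the following (the filename layout and all field names are hypothetical):

```json
{
  "id": "code_generation/001",
  "task": "code_generation",
  "prompt": "Write a Python function to check if a number is prime."
}
```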
Project structure:

```
.
├── README.md
├── requirements.txt
├── src/
│   ├── __init__.py
│   ├── models.py
│   ├── tasks.py
│   └── evaluator.py
├── data/
│   └── tasks/
│       ├── code_generation/
│       ├── test_generation/
│       └── doc_generation/
└── results/
    ├── raw/
    └── reports/
```
- Create and activate a virtual environment:

```bash
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```

- Install dependencies:

```bash
pip install -r requirements.txt
```

- Run the evaluation:
```bash
# Use local models (default)
python src/evaluator.py

# Use Cohere models
python src/evaluator.py --model-type cohere

# Explicitly select the local backend, optionally naming the model
python src/evaluator.py --model-type local
python src/evaluator.py --model-type local --model-name microsoft/phi-2
```

Note: The first run may take several minutes because the Phi-2 model is downloaded from HuggingFace. Subsequent runs are significantly faster thanks to local caching.
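The commands above imply roughly the following argument handling. This is a sketch inferred from the flags shown, not the actual code in src/evaluator.py:

```python
# Sketch of the CLI implied by the commands above; the real
# argument parsing in src/evaluator.py may differ.
import argparse

parser = argparse.ArgumentParser(description="Evaluate LLMs on software engineering tasks")
parser.add_argument("--model-type", choices=["local", "cohere"], default="local",
                    help="Backend: a local HuggingFace model or the Cohere API")
parser.add_argument("--model-name", default="microsoft/phi-2",
                    help="HuggingFace model id used when --model-type is local")
args = parser.parse_args()
```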
To run the full evaluation with the default settings, run the following command:

```bash
python src/evaluator.py
```

Technologies used:
- Python
- HuggingFace Transformers
- Cohere API
- JSON
Results will be stored in the results/ directory:
- Raw outputs in JSON format: results/raw/
- Analysis reports in Markdown: results/reports/
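To skim the raw outputs programmatically, something like the following works, assuming results/raw/ holds one JSON object per file; the field names are assumptions, not the evaluator's actual schema:

```python
# Print a one-line summary per raw result file.
# Assumes each file in results/raw/ is a JSON object with
# "task" and "model" fields; the real schema may differ.
import json
from pathlib import Path

for path in sorted(Path("results/raw").glob("*.json")):
    record = json.loads(path.read_text())
    print(f"{path.name}: task={record.get('task')} model={record.get('model')}")
```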
Example evaluation result:

```
Task: Code Generation
Model: microsoft/phi-2
Input: "Write a Python function to check if a number is prime."
Output: "def is_prime(n): ..."
Evaluation Score: (example value)
```