LLM Evaluation for Software Engineering Tasks

This project evaluates and compares two Large Language Models (LLMs) — microsoft/phi-2 and Cohere Command — on several software engineering tasks, including code generation, test case generation, and documentation generation. The objective is to analyze their performance, output quality, and applicability in practical software engineering workflows.

Models Used

  • microsoft/phi-2: a lightweight open-source language model capable of code and text generation, run locally via HuggingFace Transformers
  • Cohere Command: a commercial model accessed over the cloud through the Cohere API
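
A minimal sketch of how each model might be queried is shown below. The actual wrappers live in src/models.py; the function names, the COHERE_API_KEY environment variable, and the Cohere call shown here are assumptions for illustration, and the exact Cohere endpoint signature varies by SDK version.

import os
import cohere
from transformers import AutoModelForCausalLM, AutoTokenizer

# Local model: microsoft/phi-2 loaded through HuggingFace Transformers
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")
model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2")

def generate_local(prompt: str, max_new_tokens: int = 256) -> str:
    # Tokenize the prompt, generate a continuation, and decode it back to text
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

def generate_cohere(prompt: str) -> str:
    # Cloud model: Cohere Command; the API key variable name is an assumption
    co = cohere.Client(os.environ["COHERE_API_KEY"])
    response = co.chat(model="command", message=prompt)
    return response.text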

Tasks Evaluated

  1. Code Generation
  2. Test Case Generation
  3. Documentation Generation

Project Structure

.
├── README.md
├── requirements.txt
├── src/
│   ├── __init__.py
│   ├── models.py
│   ├── tasks.py
│   ├── evaluator.py
├── data/
│   └── tasks/
│       ├── code_generation/
│       ├── test_generation/
│       └── doc_generation/
└── results/
    ├── raw/
    └── reports/
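
The flow below is an illustrative sketch of how these modules might fit together: the evaluator walks the prompt files under data/tasks/, calls a model wrapper from src/models.py, and writes raw outputs to results/raw/. The function and attribute names are assumptions, not the repository's actual interfaces.

import json
from pathlib import Path

def run_task(model, task_dir: Path, output_dir: Path) -> None:
    """Run one model over every prompt file in a task directory and save the raw outputs."""
    results = []
    for prompt_file in sorted(task_dir.glob("*.txt")):   # prompt file layout is assumed
        prompt = prompt_file.read_text()
        output = model.generate(prompt)                   # model wrapper interface is assumed
        results.append({"task": task_dir.name, "prompt": prompt, "output": output})
    output_dir.mkdir(parents=True, exist_ok=True)
    (output_dir / f"{task_dir.name}.json").write_text(json.dumps(results, indent=2))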

Setup and Installation

  1. Create a virtual environment:
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
  2. Install dependencies:
pip install -r requirements.txt
  3. Run the evaluation:
# Use the local model (default)
python src/evaluator.py

# Use the Cohere model
python src/evaluator.py --model-type cohere

# Specify a particular local model by name
python src/evaluator.py --model-type local
python src/evaluator.py --model-type local --model-name microsoft/phi-2
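
For orientation, the two flags above might be wired up in src/evaluator.py roughly as in the sketch below; this is an assumption about the command-line handling, not the actual implementation.

import argparse

def parse_args() -> argparse.Namespace:
    # --model-type selects the backend; --model-name picks a specific HuggingFace model
    parser = argparse.ArgumentParser(description="Evaluate LLMs on software engineering tasks")
    parser.add_argument("--model-type", choices=["local", "cohere"], default="local")
    parser.add_argument("--model-name", default="microsoft/phi-2")
    return parser.parse_args()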

Usage

Note: The first run may take several minutes as the Phi-2 model will be downloaded from HuggingFace. Subsequent runs will be significantly faster due to local caching.

Run the following command:

python src/evaluator.py

Technologies and Tools Used

  • Python
  • HuggingFace Transformers
  • Cohere API
  • JSON

Results

Results will be stored in the results/ directory:

  • Raw outputs in JSON format: results/raw/
  • Analysis reports in Markdown: results/reports/
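
The raw files can be inspected directly with the standard json module; the file name and field names below are illustrative, since the actual schema is defined by src/evaluator.py.

import json
from pathlib import Path

# Load one raw results file and print a short summary of each entry (schema assumed)
entries = json.loads(Path("results/raw/code_generation.json").read_text())
for entry in entries:
    print(entry["task"], "->", len(entry["output"]), "characters of output")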

Example Output

Example evaluation result:

Task: Code Generation
Model: microsoft/phi-2

Input: "Write a Python function to check if a number is prime."

Output: "def is_prime(n): ..."

Evaluation Score: (example value)
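
The score shown above comes from the project's own evaluation logic in src/evaluator.py. Purely as an illustration of one way such a score could be computed for code generation (not the repository's actual metric), a toy heuristic might check whether the output parses as valid Python:

import ast

def toy_code_score(output: str) -> float:
    # 1.0 if the generated code is syntactically valid Python, 0.0 otherwise
    try:
        ast.parse(output)
        return 1.0
    except SyntaxError:
        return 0.0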
