LLM Evaluation for Software Engineering Tasks

This project evaluates and compares two Large Language Models (LLMs) — microsoft/phi-2 and Cohere Command — on several software engineering tasks, including code generation, test case generation, and documentation generation. The objective is to analyze their performance, output quality, and applicability in practical software engineering workflows.

Models Used

  • microsoft/phi-2: a lightweight open-source language model capable of code and text generation, run locally via HuggingFace Transformers
  • Cohere Command: a commercial model accessed over the cloud through the Cohere API
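
A minimal sketch of how each model might be queried is shown below. The actual wrappers live in src/models.py; the function names, the COHERE_API_KEY environment variable, and the Cohere call shown here are assumptions for illustration, and the exact Cohere endpoint signature varies by SDK version.

import os
import cohere
from transformers import AutoModelForCausalLM, AutoTokenizer

# Local model: microsoft/phi-2 loaded through HuggingFace Transformers
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")
model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2")

def generate_local(prompt: str, max_new_tokens: int = 256) -> str:
    # Tokenize the prompt, generate a continuation, and decode it back to text
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

def generate_cohere(prompt: str) -> str:
    # Cloud model: Cohere Command; the API key variable name is an assumption
    co = cohere.Client(os.environ["COHERE_API_KEY"])
    response = co.chat(model="command", message=prompt)
    return response.text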

Tasks Evaluated

  1. Code Generation
  2. Test Case Generation
  3. Documentation Generation

Project Structure

.
├── README.md
├── requirements.txt
├── src/
│   ├── __init__.py
│   ├── models.py
│   ├── tasks.py
│   ├── evaluator.py
├── data/
│   └── tasks/
│       ├── code_generation/
│       ├── test_generation/
│       └── doc_generation/
└── results/
    ├── raw/
    └── reports/
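
The flow below is an illustrative sketch of how these modules might fit together: the evaluator walks the prompt files under data/tasks/, calls a model wrapper from src/models.py, and writes raw outputs to results/raw/. The function and attribute names are assumptions, not the repository's actual interfaces.

import json
from pathlib import Path

def run_task(model, task_dir: Path, output_dir: Path) -> None:
    """Run one model over every prompt file in a task directory and save the raw outputs."""
    results = []
    for prompt_file in sorted(task_dir.glob("*.txt")):   # prompt file layout is assumed
        prompt = prompt_file.read_text()
        output = model.generate(prompt)                   # model wrapper interface is assumed
        results.append({"task": task_dir.name, "prompt": prompt, "output": output})
    output_dir.mkdir(parents=True, exist_ok=True)
    (output_dir / f"{task_dir.name}.json").write_text(json.dumps(results, indent=2))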

Setup and Installation

  1. Create a virtual environment:
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
  2. Install dependencies:
pip install -r requirements.txt
  3. Run the evaluation:
# Use the local model (default)
python src/evaluator.py

# Use the Cohere model
python src/evaluator.py --model-type cohere

# Specify a particular local model by name
python src/evaluator.py --model-type local
python src/evaluator.py --model-type local --model-name microsoft/phi-2
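
For orientation, the two flags above might be wired up in src/evaluator.py roughly as in the sketch below; this is an assumption about the command-line handling, not the actual implementation.

import argparse

def parse_args() -> argparse.Namespace:
    # --model-type selects the backend; --model-name picks a specific HuggingFace model
    parser = argparse.ArgumentParser(description="Evaluate LLMs on software engineering tasks")
    parser.add_argument("--model-type", choices=["local", "cohere"], default="local")
    parser.add_argument("--model-name", default="microsoft/phi-2")
    return parser.parse_args()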

Usage

Note: The first run may take several minutes as the Phi-2 model will be downloaded from HuggingFace. Subsequent runs will be significantly faster due to local caching.

Run the following command:

python src/evaluator.py

Technologies and Tools Used

  • Python
  • HuggingFace Transformers
  • Cohere API
  • JSON

Results

Results will be stored in the results/ directory:

  • Raw outputs in JSON format: results/raw/
  • Analysis reports in Markdown: results/reports/
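
The raw files can be inspected directly with the standard json module; the file name and field names below are illustrative, since the actual schema is defined by src/evaluator.py.

import json
from pathlib import Path

# Load one raw results file and print a short summary of each entry (schema assumed)
entries = json.loads(Path("results/raw/code_generation.json").read_text())
for entry in entries:
    print(entry["task"], "->", len(entry["output"]), "characters of output")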

Example Output

Example evaluation result:

Task: Code Generation
Model: microsoft/phi-2

Input: "Write a Python function to check if a number is prime."

Output: "def is_prime(n): ..."

Evaluation Score: (example value)
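
The score shown above comes from the project's own evaluation logic in src/evaluator.py. Purely as an illustration of one way such a score could be computed for code generation (not the repository's actual metric), a toy heuristic might check whether the output parses as valid Python:

import ast

def toy_code_score(output: str) -> float:
    # 1.0 if the generated code is syntactically valid Python, 0.0 otherwise
    try:
        ast.parse(output)
        return 1.0
    except SyntaxError:
        return 0.0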
