Secure Programming CodeLLMExp

This repository contains a reproducible analysis pipeline for CodeLLMExp, a multi-language dataset for studying automated vulnerability localization and explanation in AI-generated code.

The project focuses on three connected secure-programming tasks:

CWE classification: predict the vulnerability category from source code.
Vulnerable line localization: identify lines that are likely to contain the vulnerability.
Faithfulness-to-fix evaluation: compare predicted vulnerable lines with lines changed by the secure fix.

The pipeline intentionally uses lightweight and interpretable methods, including TF-IDF with classical machine learning models, rule-based fix-concept extraction, diff analysis, and line-level scoring heuristics.
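As a rough illustration of the diff-analysis and faithfulness-to-fix idea, the sketch below derives the lines touched by a fix with Python's standard difflib module and compares them with a set of predicted lines. The function names changed_lines and overlap, the example snippets, and the choice of precision/recall as the overlap measure are assumptions made for illustration; the notebooks may compute this differently.

```python
import difflib


def changed_lines(vulnerable: str, fixed: str) -> set[int]:
    """Return 1-based line numbers of the vulnerable snippet that the fix replaces or deletes."""
    before = vulnerable.splitlines()
    after = fixed.splitlines()
    matcher = difflib.SequenceMatcher(a=before, b=after, autojunk=False)
    changed = set()
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        # Pure insertions have no counterpart line in the vulnerable snippet, so they are skipped.
        if tag in ("replace", "delete"):
            changed.update(range(i1 + 1, i2 + 1))
    return changed


def overlap(predicted: set[int], changed: set[int]) -> tuple[float, float]:
    """Precision and recall of predicted lines against fix-changed lines."""
    hits = len(predicted & changed)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(changed) if changed else 0.0
    return precision, recall


# Hypothetical snippet pair, not taken from the dataset.
vulnerable = "\n".join([
    'query = "SELECT * FROM users WHERE id = " + user_id',
    "cursor.execute(query)",
    "row = cursor.fetchone()",
])
fixed = "\n".join([
    'query = "SELECT * FROM users WHERE id = %s"',
    "cursor.execute(query, (user_id,))",
    "row = cursor.fetchone()",
])

fix_lines = changed_lines(vulnerable, fixed)
print(sorted(fix_lines))
print(overlap({1, 3}, fix_lines))
```

In this sketch a prediction counts as faithful to the fix only when it points at a line the fix actually rewrites or removes, which is why unchanged context lines lower precision.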

Repository Structure

.
+-- CodeLLMExp.jsonl                         # Full dataset in JSON Lines format
+-- README.md
+-- requirements.txt
+-- PROJECT_SUMMARY.md                       # Detailed experiment summary
+-- IMPLEMENTATION_PLAN (1).md               # Implementation notes
+-- Explainable_Secure_Programming_CodeLLMExp.pdf
+-- Secure_programming (7).pdf
+-- cwe_seeds/                               # Canonical vulnerable seed examples
+-- seed/                                    # Generated/augmented seed examples
+-- source/                                  # Source snippets organized by language and CWE
+-- src/                                     # Reusable Python utilities
+-- notebooks/                               # End-to-end experiment notebooks

Generated outputs (processed CSV files, trained models, figures, metrics, and storyboard images) are ignored by Git and can be regenerated by running the notebooks.

Dataset

The dataset contains vulnerable code snippets, fixed code, CWE labels, vulnerable-line annotations, and natural-language security explanations.
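A minimal sketch of reading JSON Lines records and tallying them follows. The field names language, cwe, and vulnerable_lines are assumptions for illustration and may not match the keys used in CodeLLMExp.jsonl; the two records are made up so the example runs without the dataset.

```python
import io
import json
from collections import Counter


def read_jsonl(handle):
    """Yield one dict per non-empty line of a JSON Lines stream."""
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)


# Two made-up records standing in for the dataset; the field names are assumptions.
sample = io.StringIO(
    '{"language": "Python", "cwe": "CWE-89", "vulnerable_lines": [1]}\n'
    '{"language": "C", "cwe": "CWE-787", "vulnerable_lines": []}\n'
)

records = list(read_jsonl(sample))
per_language = Counter(r["language"] for r in records)
annotated = sum(1 for r in records if r["vulnerable_lines"])

print(per_language)
print(f"rows with line annotations: {annotated / len(records):.2%}")
```

To read the real file, the same function can be given an open file handle, for example open("CodeLLMExp.jsonl", encoding="utf-8"), after adjusting the keys to the actual schema.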

Summary after cleaning:

Metric	Value
Total samples	10,403
Languages	Python, Java, C
Unique CWE labels	29
Rows with vulnerable-line annotations	97.76%

Language distribution:

Language	Samples
Python	4,610
Java	3,088
C	2,705

Setup

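Create and activate a Python environment (the activation command below is for Windows PowerShell; on macOS or Linux, run source .venv/bin/activate instead):

python -m venv .venv
.\.venv\Scripts\Activate.ps1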

Install dependencies:

pip install -r requirements.txt

Register the environment as a Jupyter kernel:

python -m ipykernel install --user --name codellmexp --display-name "Python (CodeLLMExp)"

Running the Pipeline

Run the notebooks in order:

notebooks/01_data_loading_cleaning.ipynb
notebooks/02_secure_fix_concept_extraction.ipynb
notebooks/03_cwe_classification_baseline.ipynb
notebooks/04_vulnerable_line_localization.ipynb
notebooks/05_faithfulness_to_fix_evaluation.ipynb
notebooks/06_storyboard_and_report_figures.ipynb

The notebooks write intermediate and final artifacts under data/processed/, data/splits/, and outputs/.

Main Results

The current experiment summary reports:

Task	Metric	Value
CWE classification (Linear SVM)	Accuracy	99.87%
CWE classification (Linear SVM)	Weighted F1	99.87%
Vulnerable line localization	Top-1 accuracy	21.66%
Vulnerable line localization	Top-5 accuracy	41.69%

See PROJECT_SUMMARY.md for the full analysis, including fix-concept distributions, localization results by language/CWE, faithfulness-to-fix metrics, and case studies.
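For readers unfamiliar with Top-k localization accuracy, the sketch below shows one common way to compute it: a sample counts as a hit when at least one annotated vulnerable line appears among the k highest-scoring lines. This definition, the function names, and the scores are assumptions for illustration; PROJECT_SUMMARY.md remains the reference for how the reported numbers were obtained.

```python
def top_k_hit(scores, vulnerable_lines, k):
    """True if any of the k highest-scoring lines is annotated as vulnerable.

    scores maps a 1-based line number to a heuristic score; ties are broken
    by preferring the earlier line.
    """
    ranked = sorted(scores, key=lambda line: (-scores[line], line))
    return any(line in vulnerable_lines for line in ranked[:k])


def top_k_accuracy(samples, k):
    """Fraction of samples with at least one annotated line in the top k."""
    if not samples:
        return 0.0
    hits = sum(top_k_hit(scores, gold, k) for scores, gold in samples)
    return hits / len(samples)


# Made-up scores and annotations for two snippets.
samples = [
    ({1: 0.1, 2: 0.9, 3: 0.4}, {2}),  # highest-scoring line is annotated
    ({1: 0.7, 2: 0.2, 3: 0.5}, {3}),  # annotated line is ranked second
]

print(top_k_accuracy(samples, 1))  # 0.5
print(top_k_accuracy(samples, 2))  # 1.0
```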

Notes on Version Control

The repository tracks source code, notebooks, seed/source examples, reports, and the main CodeLLMExp.jsonl dataset file.

Generated experiment artifacts are excluded through .gitignore, including:

outputs/
data/raw/
data/processed/
data/splits/
Python caches and notebook checkpoints
trained model files such as *.pkl

License

The dataset is released under the Creative Commons Attribution 4.0 International License.

Citation

If you use this dataset or analysis pipeline, please cite the accompanying report/paper and acknowledge the CodeLLMExp dataset.
