Follow the steps below to implement binary classification for detecting dead or rarely used code in the SQLite call graph.
- Project Overview
- Quick Start
- Step-by-Step Implementation
- Understanding the Features
- Model Architectures
- Evaluation Metrics
This project uses neural networks to classify functions as either frequently used or dead/rarely used based on call graph topology and function characteristics. The system is demonstrated on the SQLite database codebase.
- Call Graph Parsing: Extracts function relationships from LLVM-generated call graphs
- Multi-Modal Features: Combines structural (graph-based) and semantic (name-based) features
- Multiple Architectures: Supports MLP, Transformer, and Graph Neural Network (GNN) models
- Comprehensive Evaluation: Includes accuracy, precision, recall, F1-score, and ROC-AUC metrics
Dataset: SQLite callgraph with 2,621 functions
- Distribution:
  - ~1,648 rarely used functions (uses ≤ 2)
  - ~973 frequently used functions (uses > 2)
# Install dependencies
pip install -r requirements.txt --break-system-packages
# Or install individually
pip install torch numpy pandas scikit-learn networkx matplotlib seaborn --break-system-packages
Install dependencies:
pip install torch numpy pandas scikit-learn networkx matplotlib seaborn
For GNN support (optional):
pip install torch-geometric
Run the interactive demo to explore the data and train a quick model:
python quick_start.py
Train the MLP, Transformer, and GNN models, saving each run's output to a log file:
python -u dead_code_detector.py 2>&1 | tee results_mlp.txt
python -u run_for_transformer.py --callgraph callgraph.txt --threshold 2 --seq_mode groups 2>&1 | tee results_for_transformer.txt
python -u run_for_gnn.py --callgraph callgraph.txt --threshold 2 2>&1 | tee results_for_gnn.txt
The callgraph.txt format:
Call graph node for function: 'functionName'<<address>> #uses=X
CS<None> calls function 'calledFunction1'
CS<None> calls function 'calledFunction2'
...
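As a rough illustration of how this format can be read, here is a minimal parsing sketch. The function name `parse_callgraph`, the two regular expressions, and the sample text (including the addresses and use counts) are assumptions made for this example; they are not taken from the project's own parser.

```python
import re

# Patterns assume the text layout shown above.
NODE_RE = re.compile(r"Call graph node for function: '([^']+)'<<[^>]*>>\s+#uses=(\d+)")
CALL_RE = re.compile(r"calls function '([^']+)'")


def parse_callgraph(text):
    """Return (uses, calls): use counts and callee lists keyed by function name."""
    uses = {}
    calls = {}
    current = None
    for line in text.splitlines():
        if "Call graph node" in line:
            node = NODE_RE.search(line)
            # Unnamed nodes (no quoted function name) reset the current function.
            current = node.group(1) if node else None
            if node:
                uses[current] = int(node.group(2))
                calls.setdefault(current, [])
            continue
        call = CALL_RE.search(line)
        if call and current is not None:
            calls[current].append(call.group(1))
    return uses, calls


# Made-up sample in the format above (addresses and counts are illustrative).
sample = """\
Call graph node for function: 'sqlite3_open'<<0x55d0c8a1b2c0>>  #uses=3
  CS<None> calls function 'openDatabase'
  CS<None> calls function 'sqlite3_initialize'

Call graph node for function: 'openDatabase'<<0x55d0c8a1b300>>  #uses=1
"""
uses, calls = parse_callgraph(sample)
print(uses)
print(calls["sqlite3_open"])
```

Functions that call nothing still get an empty callee list, so every parsed function appears in both dictionaries.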
Extracted Features:
| Feature | Description | Type |
|---|---|---|
| `uses` | Number of times function is called | Numeric |
| `num_calls` | Number of functions this calls | Numeric |
| `in_degree` | How many functions call this (graph) | Numeric |
| `out_degree` | How many functions this calls (graph) | Numeric |
| `pagerank` | PageRank centrality score | Numeric |
| `func_length` | Length of function name | Numeric |
| `has_number` | Contains digits (0/1) | Binary |
| `has_underscore` | Contains underscore (0/1) | Binary |
| `is_internal` | Internal function (0/1) | Binary |
| `has_sqlite_prefix` | Starts with 'sqlite3' (0/1) | Binary |
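A few of these features can be sketched in plain Python. The helper names `name_features` and `degree_features` are hypothetical, the degree counts assume distinct caller-to-callee edges, and `is_internal` is left out because the rule behind it is not stated here. The sample edges are made up for illustration.

```python
import re


def name_features(name):
    """Name-based features from the table (is_internal omitted: its rule is not given)."""
    return {
        "func_length": len(name),
        "has_number": int(bool(re.search(r"\d", name))),
        "has_underscore": int("_" in name),
        "has_sqlite_prefix": int(name.startswith("sqlite3")),
    }


def degree_features(calls):
    """Count in_degree and out_degree over distinct caller -> callee edges."""
    names = set(calls)
    for callees in calls.values():
        names.update(callees)
    in_deg = {n: 0 for n in names}
    out_deg = {n: 0 for n in names}
    for caller, callees in calls.items():
        for callee in set(callees):  # ignore repeated calls to the same callee
            out_deg[caller] += 1
            in_deg[callee] += 1
    return in_deg, out_deg


# Made-up edges for illustration.
calls = {
    "sqlite3_open": ["openDatabase", "sqlite3_initialize"],
    "openDatabase": ["sqlite3_initialize"],
}
in_deg, out_deg = degree_features(calls)
feats = name_features("sqlite3_open")
print(feats)
print(in_deg["sqlite3_initialize"], out_deg["sqlite3_open"])
```

PageRank is not computed here; since `networkx` is among the listed dependencies, `networkx.pagerank` on a `DiGraph` built from the same edges would be one way to obtain it.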
Threshold-based labeling:
threshold = 2 # Configurable
label = 1 if uses <= threshold else 0
Justification:
- Functions with 0-2 uses → Potentially dead/rarely used
- Functions with 3+ uses → Active/frequently used
- You can adjust the threshold based on your needs
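The effect of the threshold can be checked on a handful of functions. This is a minimal sketch: `make_labels` is a hypothetical helper and the use counts are made up.

```python
from collections import Counter


def make_labels(uses, threshold=2):
    """Map each function to 1 (rarely used, uses <= threshold) or 0 (frequently used)."""
    return {name: int(count <= threshold) for name, count in uses.items()}


# Made-up use counts for illustration.
uses = {"funcA": 0, "funcB": 2, "funcC": 3, "funcD": 17}
labels = make_labels(uses, threshold=2)
distribution = Counter(labels.values())
print(labels)
print(distribution)
```

Raising the threshold moves borderline functions such as `funcC` into the rarely-used class, which shifts the class balance and should be rechecked before training.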
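from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Split first: 60% train, 20% val, 20% test (stratified to keep the class balance in each split)
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, stratify=y, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, stratify=y_temp, random_state=42)
# Normalize features (important for neural networks!)
# Fit the scaler on the training split only, so validation/test statistics do not leak into training
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)
X_test = scaler.transform(X_test)
Default Model (MLP):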
Input (10 features)
↓
Linear(10 → 64) + BatchNorm + ReLU + Dropout(0.3)
↓
Linear(64 → 32) + BatchNorm + ReLU + Dropout(0.3)
↓
Linear(32 → 16) + BatchNorm + ReLU + Dropout(0.3)
↓
Linear(16 → 2) [Output logits]
Total Parameters: ~3,600 (lightweight!)
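The parameter count can be worked out by hand from the layer sizes. The sketch below assumes every Linear layer has a bias and every BatchNorm layer has a learnable scale and shift (PyTorch's default `affine=True`); `count_mlp_parameters` is a hypothetical helper, not part of the project.

```python
def count_mlp_parameters(input_dim, hidden_dims, num_classes):
    """Count trainable parameters for Linear + BatchNorm hidden blocks and a Linear output layer."""
    total = 0
    prev = input_dim
    for width in hidden_dims:
        total += prev * width + width  # Linear weights + biases
        total += 2 * width             # BatchNorm scale and shift
        prev = width
    total += prev * num_classes + num_classes  # output Linear
    return total


n_params = count_mlp_parameters(10, [64, 32, 16], 2)
print(n_params)  # 3570
```

For the layer sizes exactly as drawn above this gives 3,570 trainable parameters; an implementation with wider or additional layers would come out higher.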
Loss Function: CrossEntropyLoss (standard for classification)
Optimizer: Adam with:
- Learning rate: 0.001
- Weight decay: 1e-5 (L2 regularization)
Learning Rate Scheduler: ReduceLROnPlateau
- Reduces LR when validation loss plateaus
- Factor: 0.5
- Patience: 10 epochs
Early Stopping:
- Patience: 20 epochs
- Monitors validation accuracy
- Saves best model
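The early-stopping rule can be sketched without any framework code. `EarlyStopping` here is a hypothetical helper, the validation accuracies are made up, and the patience is shortened to 3 so the demo stays small.

```python
class EarlyStopping:
    """Track the best validation accuracy and signal when to stop."""

    def __init__(self, patience=20):
        self.patience = patience
        self.best_acc = float("-inf")
        self.best_epoch = None
        self.bad_epochs = 0

    def step(self, epoch, val_acc):
        """Return True if training should stop after this epoch."""
        if val_acc > self.best_acc:
            self.best_acc = val_acc
            self.best_epoch = epoch
            self.bad_epochs = 0  # a real loop would save the model checkpoint here
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Made-up validation accuracies for illustration.
stopper = EarlyStopping(patience=3)
history = [0.70, 0.74, 0.78, 0.77, 0.78, 0.76, 0.75]
stopped_at = None
for epoch, acc in enumerate(history):
    if stopper.step(epoch, acc):
        stopped_at = epoch
        break
print(stopper.best_epoch, stopper.best_acc, stopped_at)  # 2 0.78 5
```

A tie with the best score counts as no improvement in this sketch. In PyTorch, the scheduler settings listed above would correspond to `torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min", factor=0.5, patience=10)`, stepped with the validation loss each epoch.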
Primary Metrics:
- Accuracy: Overall correctness
- Precision: Of predicted rarely-used, how many are correct?
- Recall: Of actual rarely-used, how many did we find?
- F1-Score: Harmonic mean of precision and recall
- ROC-AUC: Area under ROC curve (discrimination ability)
Confusion Matrix:
               Predicted
                0      1
Actual   0    [TN]   [FP]
         1    [FN]   [TP]
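The primary metrics follow directly from the four confusion-matrix cells, with class 1 (rarely used) as the positive class. This is a minimal sketch with made-up counts; `binary_metrics` is a hypothetical helper. `sklearn.metrics` provides equivalent functions, and ROC-AUC is omitted because it needs predicted scores rather than counts.

```python
def binary_metrics(tn, fp, fn, tp):
    """Compute accuracy, precision, recall, and F1 for the positive class (label 1)."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up counts for illustration only.
metrics = binary_metrics(tn=40, fp=10, fn=20, tp=30)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts, precision (0.75) is higher than recall (0.60): the model is usually right when it flags a function, but it misses 20 of the 50 rarely-used ones.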
MLP (default):
Best for: Tabular features
Pros: Fast, simple, interpretable
Cons: Doesn't use graph structure
from dead_code_detector import DeadCodeClassifier
model = DeadCodeClassifier(input_dim=10, hidden_dims=[64, 32, 16], dropout=0.3)
Transformer:
Best for: Complex feature interactions
Pros: Self-attention mechanism
Cons: More parameters, slower training
from alternative_models import TransformerClassifier
model = TransformerClassifier(input_dim=10, d_model=64, nhead=4)
GNN:
Best for: Leveraging callgraph structure
Pros: Uses actual graph topology
Cons: Requires torch-geometric, more complex setup
from alternative_models import GNNClassifier
model = GNNClassifier(input_dim=10, hidden_dim=64, num_layers=3)
Recommendation: Start with MLP (default), then try GNN if you want to leverage graph structure.
Baseline (Random Classifier): ~50% accuracy
Target Performance: 75-85% accuracy
Excellent Performance: >90% accuracy
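Because the classes are imbalanced, it helps to compare these accuracy targets against trivial baselines computed from the class counts. The sketch below uses the approximate counts from the dataset description; the variable names are chosen for this example.

```python
# Approximate class counts from the dataset description.
n_rare, n_frequent = 1648, 973
total = n_rare + n_frequent

# Uniform coin flip: right half the time regardless of class balance.
random_baseline = 0.5

# Always predict the larger class (rarely used).
majority_baseline = max(n_rare, n_frequent) / total

# Guess each label at random with the observed class frequencies.
p_rare = n_rare / total
stratified_baseline = p_rare ** 2 + (1 - p_rare) ** 2

print(f"random:     {random_baseline:.3f}")
print(f"majority:   {majority_baseline:.3f}")
print(f"stratified: {stratified_baseline:.3f}")
```

A model that always predicts "rarely used" already reaches about 63% accuracy on this distribution, so accuracy near that level says little; precision, recall, and ROC-AUC are more informative.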
Classification Report:
                 precision    recall  f1-score   support
Frequently Used       0.85      0.88      0.86       195
    Rarely Used       0.83      0.79      0.81       330
       accuracy                           0.84       525
      macro avg       0.84      0.84      0.84       525
   weighted avg       0.84      0.84      0.84       525
ROC-AUC Score: 0.91
High Precision for "Rarely Used":
- When model says code is rarely used, it's usually correct
- Good for automated code cleanup
High Recall for "Rarely Used":
- Model finds most of the rarely-used code
- Minimizes false negatives
High F1-Score:
- Balanced performance
- No extreme precision-recall tradeoff