This repository replicates the methodology (not the proprietary dataset) of the IEEE “Elephant and Mice Flow” study by rebuilding the entire pipeline with the public CIC-IDS2017 flow dataset. CIC-IDS2017 already contains bidirectional NetFlow-like features, so no packet capture or NFStream usage is required (NFStream can be applied later for live capture scenarios and is mentioned in the report only as an alternative data source).
```
elephant-mice-cicids2017/
├─ README.md
├─ requirements.txt
├─ data/
│ ├─ raw/
│ │ └─ CIC-IDS2017 # unzip the GeneratedLabelledFlows (GLF) archive here
│ ├─ intermediate/
│ │ ├─ cicids2017_merged.csv # merged GLF rows
│ │ ├─ cicids2017_paper_schema_glf.csv # NFStream-like schema
│ │ ├─ cicids2017_paper_base_glf.csv # 55,726-row working set
│ │ └─ cicids2017_labeled_chebyshev_glf.csv
│ └─ processed/
│ └─ elephant_mice_flows_paper_small.csv
├─ src/
│ ├─ config.py
│ ├─ data_prep/
│ │ ├─ merge_cicids2017.py
│ │ ├─ to_paper_schema_from_generated.py
│ │ ├─ to_paper_schema.py
│ │ ├─ label_elephants_chebyshev.py
│ │ └─ build_experiment_set_paper.py
│ ├─ models/
│ │ ├─ classical_baselines.py
│ │ └─ evaluate.py
│ └─ utils/
│ └─ logging_utils.py
├─ notebooks/
│ ├─ 01_explore_cicids2017.ipynb
│ ├─ 02_elephant_threshold_selection.ipynb
│ └─ 03_train_baselines.ipynb
├─ scripts/
│ ├─ 00_download_instructions.txt
│ ├─ 01_prepare_data.sh
│ └─ 02_train_models.sh
└─ reports/
├─ results_baselines.md
└─ plots/
```
- The GeneratedLabelledFlows (GLF) release provides CIC-IDS2017 as flow-level CSV tables. Download and unzip the archive locally.
- Place every CSV inside data/raw/CIC-IDS2017/ so scripts/01_prepare_data.sh can read them in bulk.
- No data files are tracked in this repo; only the scripts that process them are.
See scripts/00_download_instructions.txt for a concise reminder.
| Stage | Script/Notebook | Description |
|---|---|---|
| Merge raw CSVs | src/data_prep/merge_cicids2017.py | Concatenates every GLF daily capture into data/intermediate/cicids2017_merged.csv with typed columns and strict decoding modes. |
| Paper schema | src/data_prep/to_paper_schema_from_generated.py | Maps the merged file to NFStream-like fields, coerces ports/protocols, converts Flow Duration to milliseconds, and asserts the Flow Bytes/s sanity check. |
| Paper base | src/data_prep/build_experiment_set_paper.py | Samples ~55,726 flows from the schema, logs bidirectional byte quantiles, and writes data/intermediate/cicids2017_paper_base_glf.csv. |
| Chebyshev labeling | src/data_prep/label_elephants_chebyshev.py | Computes the μ+3σ Chebyshev cutoff from aggregated 4-tuple bytes, applies it per flow (online view), logs stats to artifacts/threshold.json, writes the full schema to data/intermediate/cicids2017_labeled_chebyshev_glf.csv, and publishes the trimmed modeling subset to data/processed/elephant_mice_flows_paper_small.csv. |
| Baselines | src/models/classical_baselines.py | Runs LR, SVM, RF, DT, KNN, LDA, NB over the five paper features using 5-fold stratified group CV (4-tuple groups), aggregates accuracy/precision/recall/F1, and saves a confusion matrix plot to reports/plots/confusion_matrix.png. |
| Evaluation helper | src/models/evaluate.py | Shared metrics/report helpers. |
| Online threshold | src/online/threshold_updater.py | Recomputes μ, σ, and the Chebyshev cutoff on a sliding window (default 7 days) and refreshes artifacts/threshold.json on an interval so the “periodic/online” narrative remains true in deployment. |
| Online inference | src/online/predict.py | Loads a persisted scikit-learn model plus the refreshed threshold to deliver an early ML prediction and a counting-based override for a single incoming flow JSON. |
| Notebooks 01–03 | notebooks/ | Generate the descriptive tables/figures needed for the report. |
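The Chebyshev labeling step above can be sketched as a minimal, self-contained illustration. The helper names here are made up for exposition; the real interface lives in src/data_prep/label_elephants_chebyshev.py:

```python
from collections import defaultdict
from statistics import mean, pstdev

def chebyshev_cutoff(flows):
    """flows: iterable of (four_tuple, byte_count).
    Aggregate bytes per 4-tuple, then return mu + 3*sigma
    over the aggregated totals (the Chebyshev-style cutoff)."""
    totals = defaultdict(int)
    for key, nbytes in flows:
        totals[key] += nbytes
    values = list(totals.values())
    return mean(values) + 3 * pstdev(values)

def label_flows(flows, cutoff):
    """Online view: the cutoff was learned from aggregates, but each
    flow is labeled by its own bytes to avoid future knowledge."""
    return [(key, nbytes, nbytes >= cutoff) for key, nbytes in flows]
```

The per-flow application is what keeps the labeling causal: a flow's label never depends on bytes that arrive after it.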
The final table used for modeling lives at data/processed/elephant_mice_flows_paper_small.csv, while data/intermediate/ now contains the merged file, GLF schema, sampled paper base, and the Chebyshev-labeled table for downstream analyses or alternative thresholds.
Install dependencies:

```bash
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

Prepare data (assumes CSVs are in place):

```bash
bash scripts/01_prepare_data.sh
```

Train and log model baselines:

```bash
bash scripts/02_train_models.sh
cat reports/results_baselines.md
```

Each stage writes deterministic outputs into data/intermediate/ and data/processed/, enabling notebook reuse and reproducibility.
To mirror the paper’s “periodic/online” threshold refresh, run:

```bash
python -m src.online.threshold_updater --source data/intermediate/cicids2017_paper_schema_glf.csv --window 7d --interval 300s
```

Use --run-once to recompute once (handy for CI), or leave it running to continually update artifacts/threshold.json. The updater records the window, cutoff, μ, σ, and as-of timestamp so downstream consumers know the context of the current threshold.
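The sliding-window refresh amounts to recomputing μ, σ, and the cutoff over recent flows only, then persisting them with an as-of timestamp. A minimal sketch, assuming records are (timestamp, bytes) pairs; the function and payload field names are illustrative, not the updater's actual API:

```python
import json
import time

def refresh_threshold(records, window_seconds, now=None, path=None):
    """Keep only flows inside the sliding window, recompute mu, sigma,
    and the mu + 3*sigma cutoff, and optionally persist as JSON."""
    now = time.time() if now is None else now
    recent = [nbytes for ts, nbytes in records if now - ts <= window_seconds]
    mu = sum(recent) / len(recent)
    sigma = (sum((b - mu) ** 2 for b in recent) / len(recent)) ** 0.5
    payload = {
        "mu": mu,
        "sigma": sigma,
        "cutoff": mu + 3 * sigma,
        "window_seconds": window_seconds,
        "as_of": now,  # lets consumers judge how stale the threshold is
    }
    if path:
        with open(path, "w") as f:
            json.dump(payload, f)
    return payload
```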
For a first-flow decision that matches the paper’s narrative (model + counting override):

```bash
python -m src.online.predict --flow path/to/flow.json --model-path artifacts/model.joblib --threshold-path artifacts/threshold.json
```

The predictor returns ml_pred, the counting_flag (if the 4-tuple’s bytes exceed the refreshed threshold after adding the current flow), and a final_flag that favors the counting signal when present.
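The override logic can be sketched as a small pure function (names mirror the output fields described above, but the sketch is illustrative; the actual predict module may differ):

```python
def decide(ml_pred, tuple_bytes_so_far, flow_bytes, cutoff):
    """Combine the model's early prediction with a counting-based override.
    The counting flag fires when the 4-tuple's accumulated bytes, including
    the current flow, exceed the refreshed Chebyshev cutoff; when it fires,
    it wins over the model, since byte counts are ground truth on volume."""
    counting_flag = (tuple_bytes_so_far + flow_bytes) >= cutoff
    final_flag = counting_flag or ml_pred
    return {"ml_pred": ml_pred, "counting_flag": counting_flag, "final_flag": final_flag}
```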
If you just want to exercise the pipeline without downloading CIC-IDS2017, run:

```bash
python scripts/generate_synthetic_data.py
python -m src.models.classical_baselines --export-model
python -m src.online.threshold_updater --source data/intermediate/cicids2017_paper_schema_glf.csv --window 7d --interval 300s --run-once
python -m src.online.predict --flow artifacts/sample_flow.json --model-path artifacts/model.joblib --threshold-path artifacts/threshold.json --history-path data/intermediate/cicids2017_paper_schema_glf.csv
```

The helper script populates small synthetic CSVs and a sample_flow.json, allowing you to train, export the model, refresh the Chebyshev threshold, and emit an online prediction end-to-end.
- The original paper used a private enterprise backbone trace; this project mirrors the methodology using public flows.
- Elephant detection is operationally defined as “bytes ≥ μ + 3·σ” (Chebyshev). The cutoff is learned from aggregated 4-tuples but applied to each flow’s bytes to avoid future knowledge; on CIC-IDS2017 this yields only ≈0.09% elephants (49 of 55,726 flows), far below the ≈5% reported in the proprietary trace.
- Baseline models follow the same family (classical ML with a five-feature vector) and are evaluated with 5-fold stratified group cross-validation so that no 4-tuple spans folds. Metrics are reported as means ± std. dev., and the best model’s confusion matrix is stored in reports/plots/confusion_matrix.png.
- Notebook outputs (9 figures, 3 tables) are tailored for academic reporting: dataset profiling, threshold justification, and classifier benchmarking.
- Export plots from the notebooks into reports/plots/ for insertion into papers or slide decks.
- Extend src/models/ with deep-learning or streaming classifiers if you need online detection.
- Replace the GLF dataset with your own NetFlow export by dropping new CSVs into data/raw/CIC-IDS2017/ (or pointing RAW_GLF_DIR elsewhere) and rerunning the preparation scripts.
This README intentionally states that we reproduce the structure and approach of the paper rather than exact numerical scores, making expectations clear for collaborators and reviewers.