This repository collects the material produced for Laboratory 4 of the AI and Cybersecurity course. The lab focuses on Natural Language Processing (NLP) applied to Bash command analysis. The goal is to classify sequences of Bash commands into malicious tactics (based on the MITRE ATT&CK framework) to identify attacker intent.
- Goal: Classify Bash command sessions into tactics (e.g., Discovery, Execution, Persistence).
- Data: Real-world Bash session logs labeled with tactics.
- Workflow: Tokenization → Embeddings (TF-IDF, Word2Vec) → Sequence Modeling (LSTM/GRU) → Classification.
- Deliverables: interactive notebooks under `lab/notebooks/` and a LaTeX report in `report/`.
```
Laboratory4/
├── lab/
│   ├── data/        # Dataset files (train.json, test.json)
│   ├── notebooks/   # Task-specific Jupyter notebooks (Task1–Task4)
│   └── Plots/       # Generated figures for the report
├── report/          # LaTeX report sources
├── resources/       # Logos and reference material
└── README.md        # This file
```
The dataset consists of user sessions, where each session is a sequence of Bash commands entered by a user (or attacker).
- Training Data: `train.json` (labeled sessions; see the loading sketch below).
- Test Data: `test.json` (used for final evaluation).
- Labels (Tactics):
  - Discovery: Gathering information about the system (e.g., `ls`, `whoami`, `netstat`).
  - Execution: Running malicious code or binaries (e.g., `./malware`, `bash -i`).
  - Persistence: Maintaining access across restarts (e.g., modifying `crontab` or `.bashrc`).
  - Defense Evasion: Hiding traces (e.g., `rm .bash_history`, `history -c`).
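A minimal loading sketch in Python is shown below. The field names (`session`, `tactic`) are assumptions about the JSON schema, not confirmed by the files; adjust them to whatever `train.json` actually contains.

```python
import json

# Load the labeled training sessions. The keys "session" and "tactic"
# are assumed here; check the real schema in lab/data/train.json.
with open("lab/data/train.json") as f:
    train = json.load(f)

sessions = [entry["session"] for entry in train]  # list of command sequences
labels = [entry["tactic"] for entry in train]     # one tactic label per session

print(f"{len(sessions)} sessions; first: {sessions[0][:3]} -> {labels[0]}")
```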
| Task | Notebook | Focus | Key Findings |
|---|---|---|---|
| Task 1 | `lab/notebooks/Task1.ipynb` | Exploratory Data Analysis (NLP) | Analyzed command frequency and length distributions.<br>Compared tokenization strategies: splitting on spaces vs. special characters.<br>Identified the most common commands per tactic (e.g., `wget` for Execution). |
| Task 2 | `lab/notebooks/Task2.ipynb` | Feature Extraction & Baselines | TF-IDF: effective for identifying tactic-specific keywords.<br>Word2Vec: learned semantic relationships between commands (e.g., `curl` is close to `wget`).<br>Baseline models: Logistic Regression and Random Forest on aggregated session vectors (sketches below). |
| Task 3 | `lab/notebooks/Task3.ipynb` | Sequential Models (RNNs) | LSTM/GRU: modeled the order of commands, which is crucial for intent.<br>Embedding layer: learned dense representations for command tokens.<br>Results: RNNs significantly outperformed the baselines by capturing temporal dependencies. |
| Task 4 | `lab/notebooks/Task4.ipynb` | Advanced & Explainability | Attention mechanisms: highlighted which specific commands contributed most to the classification.<br>Confusion matrix analysis: revealed overlap between Discovery and Persistence.<br>Error analysis: misclassified sessions often contained ambiguous or multi-purpose commands. |
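The sketches below illustrate the Task 1–2 techniques from the table: a Bash-aware tokenizer feeding a TF-IDF + Logistic Regression baseline, then a Word2Vec similarity check. They reuse the `sessions` and `labels` lists from the loading sketch above; the regex and hyperparameters are illustrative choices, not the notebooks' exact settings.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def bash_tokenize(command):
    # Keep pipes, redirects, and separators as standalone tokens, and
    # treat flags and paths as single tokens (illustrative regex).
    return re.findall(r"[|;&><]+|[\w./-]+", command)

docs = [" ".join(cmds) for cmds in sessions]  # one "document" per session
baseline = make_pipeline(
    TfidfVectorizer(tokenizer=bash_tokenize, token_pattern=None, lowercase=False),
    LogisticRegression(max_iter=1000),
)
baseline.fit(docs, labels)
```

```python
from gensim.models import Word2Vec

# Train static embeddings on tokenized sessions, then check that
# functionally similar tools land close together (e.g., curl ~ wget).
tokenized = [bash_tokenize(" ".join(cmds)) for cmds in sessions]
w2v = Word2Vec(sentences=tokenized, vector_size=100, window=5,
               min_count=2, seed=42)
print(w2v.wv.most_similar("curl", topn=5))  # KeyError if "curl" was filtered out
```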
- **Text Processing (Tasks 1 & 2)**
  - Custom tokenization was required to handle Bash syntax (flags, pipes `|`, redirects `>`).
  - Static embeddings (Word2Vec) captured functional similarities between tools.
- **Deep Learning for Sequences (Task 3)**
  - Treating a session as a "sentence" of commands proved highly effective.
  - Bidirectional LSTMs captured context from both past and future commands in a session (see the combined sketch after this list).
- **Model Interpretability (Task 4)**
  - Attention weights showed that the model correctly focuses on "payload" commands (e.g., downloading a file) rather than common utility commands (e.g., `cd`, `ls`).
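As referenced in the list above, here is a minimal PyTorch sketch combining the Task 3 and Task 4 ideas: a bidirectional LSTM over command tokens with attention pooling, whose weights can be inspected to see which commands drove a prediction. All sizes (`vocab_size=5000`, `embed_dim=64`, `hidden_dim=128`) are placeholder assumptions, not the notebooks' actual hyperparameters.

```python
import torch
import torch.nn as nn

class AttentiveSessionClassifier(nn.Module):
    """BiLSTM encoder with attention pooling over command tokens."""

    def __init__(self, vocab_size, num_tactics, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                            bidirectional=True)
        self.attn_score = nn.Linear(2 * hidden_dim, 1)  # one score per token
        self.classifier = nn.Linear(2 * hidden_dim, num_tactics)

    def forward(self, token_ids):                            # (batch, seq_len)
        states, _ = self.lstm(self.embedding(token_ids))     # (batch, seq, 2*hidden)
        weights = torch.softmax(self.attn_score(states), dim=1)
        pooled = (weights * states).sum(dim=1)               # weighted sum over time
        return self.classifier(pooled), weights.squeeze(-1)

model = AttentiveSessionClassifier(vocab_size=5000, num_tactics=4)
logits, attn = model(torch.randint(1, 5000, (8, 30)))  # dummy batch of 8 sessions
print(logits.shape, attn.shape)  # torch.Size([8, 4]) torch.Size([8, 30])
# Each row of attn shows which tokens the model weighted most (cf. Task 4).
```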
- Environment: install the standard data science and NLP libraries (`pandas`, `numpy`, `scikit-learn`, `matplotlib`, `seaborn`, `torch`, `gensim`, `nltk`).
- Data: ensure `train.json` and `test.json` are in `lab/data/`.
- Execution: run the notebooks in order (`Task1.ipynb` → `Task4.ipynb`). Seed setting is included for reproducibility.
| Name | GitHub |
|---|---|
| Renato Mignone | |
| Claudia Sanna | |
| Chiara Iorio | |
