This repository collects the material produced for Laboratory 4 of the AI and Cybersecurity course. The lab focuses on Natural Language Processing (NLP) applied to Bash command analysis. The goal is to classify sequences of Bash commands into malicious tactics (based on the MITRE ATT&CK framework) to identify attacker intent.
- Goal: Classify Bash command sessions into tactics (e.g., Discovery, Execution, Persistence).
- Data: Real-world Bash session logs labeled with tactics.
- Workflow: Tokenization → Embeddings (TF-IDF, Word2Vec) → Sequence Modeling (LSTM/GRU) → Classification.
- Deliverables: interactive notebooks under `lab/notebooks/` and a LaTeX report in `report/`.
```
Laboratory4/
├── lab/
│   ├── data/        # Dataset files (train.json, test.json)
│   ├── notebooks/   # Task-specific Jupyter notebooks (Task1–Task4)
│   └── Plots/       # Generated figures for the report
├── report/          # LaTeX report sources
├── resources/       # Logos and reference material
└── README.md        # This file
```
The dataset consists of user sessions, where each session is a sequence of Bash commands entered by a user (or attacker).
- Training Data: `train.json` (labeled sessions; see the loading sketch below).
- Test Data: `test.json` (used for final evaluation).
- Labels (Tactics):
  - Discovery: Gathering information about the system (e.g., `ls`, `whoami`, `netstat`).
  - Execution: Running malicious code or binaries (e.g., `./malware`, `bash -i`).
  - Persistence: Maintaining access across restarts (e.g., modifying `crontab` or `.bashrc`).
  - Defense Evasion: Hiding traces (e.g., `rm .bash_history`, `history -c`).
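A minimal loading sketch in Python is shown below. The field names (`session`, `tactic`) are assumptions about the JSON schema, not confirmed by the files; adjust them to whatever `train.json` actually contains.

```python
import json

# Load the labeled training sessions. The keys "session" and "tactic"
# are assumed here; check the real schema in lab/data/train.json.
with open("lab/data/train.json") as f:
    train = json.load(f)

sessions = [entry["session"] for entry in train]  # list of command sequences
labels = [entry["tactic"] for entry in train]     # one tactic label per session

print(f"{len(sessions)} sessions; first: {sessions[0][:3]} -> {labels[0]}")
```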
| Task | Notebook | Focus | Key Findings |
|---|---|---|---|
| Task 1 | `lab/notebooks/Task1.ipynb` | Exploratory Data Analysis (NLP) | Analyzed command frequency and length distributions.<br>Compared tokenization strategies: splitting on spaces vs. special characters.<br>Identified the most common commands per tactic (e.g., `wget` for Execution). |
| Task 2 | `lab/notebooks/Task2.ipynb` | Feature Extraction & Baselines | TF-IDF: effective for identifying tactic-specific keywords.<br>Word2Vec: learned semantic relationships between commands (e.g., `curl` is close to `wget`).<br>Baseline models: Logistic Regression and Random Forest on aggregated session vectors (sketches below). |
| Task 3 | `lab/notebooks/Task3.ipynb` | Sequential Models (RNNs) | LSTM/GRU: modeled the order of commands, which is crucial for intent.<br>Embedding layer: learned dense representations for command tokens.<br>Results: RNNs significantly outperformed the baselines by capturing temporal dependencies. |
| Task 4 | `lab/notebooks/Task4.ipynb` | Advanced & Explainability | Attention mechanisms: highlighted which specific commands contributed most to the classification.<br>Confusion matrix analysis: revealed overlap between Discovery and Persistence.<br>Error analysis: misclassified sessions often contained ambiguous or multi-purpose commands. |
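The sketches below illustrate the Task 1–2 techniques from the table: a Bash-aware tokenizer feeding a TF-IDF + Logistic Regression baseline, then a Word2Vec similarity check. They reuse the `sessions` and `labels` lists from the loading sketch above; the regex and hyperparameters are illustrative choices, not the notebooks' exact settings.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def bash_tokenize(command):
    # Keep pipes, redirects, and separators as standalone tokens, and
    # treat flags and paths as single tokens (illustrative regex).
    return re.findall(r"[|;&><]+|[\w./-]+", command)

docs = [" ".join(cmds) for cmds in sessions]  # one "document" per session
baseline = make_pipeline(
    TfidfVectorizer(tokenizer=bash_tokenize, token_pattern=None, lowercase=False),
    LogisticRegression(max_iter=1000),
)
baseline.fit(docs, labels)
```

```python
from gensim.models import Word2Vec

# Train static embeddings on tokenized sessions, then check that
# functionally similar tools land close together (e.g., curl ~ wget).
tokenized = [bash_tokenize(" ".join(cmds)) for cmds in sessions]
w2v = Word2Vec(sentences=tokenized, vector_size=100, window=5,
               min_count=2, seed=42)
print(w2v.wv.most_similar("curl", topn=5))  # KeyError if "curl" was filtered out
```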
- **Text Processing (Tasks 1 & 2)**
  - Custom tokenization was required to handle Bash syntax (flags, pipes `|`, redirects `>`).
  - Static embeddings (Word2Vec) captured functional similarities between tools.
- **Deep Learning for Sequences (Task 3)**
  - Treating a session as a "sentence" of commands proved highly effective.
  - Bidirectional LSTMs captured context from both past and future commands in a session (see the combined sketch after this list).
- **Model Interpretability (Task 4)**
  - Attention weights showed that the model correctly focuses on "payload" commands (e.g., downloading a file) rather than common utility commands (e.g., `cd`, `ls`).
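As referenced in the list above, here is a minimal PyTorch sketch combining the Task 3 and Task 4 ideas: a bidirectional LSTM over command tokens with attention pooling, whose weights can be inspected to see which commands drove a prediction. All sizes (`vocab_size=5000`, `embed_dim=64`, `hidden_dim=128`) are placeholder assumptions, not the notebooks' actual hyperparameters.

```python
import torch
import torch.nn as nn

class AttentiveSessionClassifier(nn.Module):
    """BiLSTM encoder with attention pooling over command tokens."""

    def __init__(self, vocab_size, num_tactics, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                            bidirectional=True)
        self.attn_score = nn.Linear(2 * hidden_dim, 1)  # one score per token
        self.classifier = nn.Linear(2 * hidden_dim, num_tactics)

    def forward(self, token_ids):                            # (batch, seq_len)
        states, _ = self.lstm(self.embedding(token_ids))     # (batch, seq, 2*hidden)
        weights = torch.softmax(self.attn_score(states), dim=1)
        pooled = (weights * states).sum(dim=1)               # weighted sum over time
        return self.classifier(pooled), weights.squeeze(-1)

model = AttentiveSessionClassifier(vocab_size=5000, num_tactics=4)
logits, attn = model(torch.randint(1, 5000, (8, 30)))  # dummy batch of 8 sessions
print(logits.shape, attn.shape)  # torch.Size([8, 4]) torch.Size([8, 30])
# Each row of attn shows which tokens the model weighted most (cf. Task 4).
```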
- Environment: install the standard data science and NLP libraries (`pandas`, `numpy`, `scikit-learn`, `matplotlib`, `seaborn`, `torch`, `gensim`, `nltk`).
- Data: ensure `train.json` and `test.json` are in `lab/data/`.
- Execution: run the notebooks in order (`Task1.ipynb` → `Task4.ipynb`). Seed setting is included for reproducibility.
| Name | GitHub |
|---|---|
| Renato Mignone | |
| Claudia Sanna | |
| Chiara Iorio | |
