IOC Extractor

This program extracts Indicators of Compromise (IOCs) from text and document files and produces triage-friendly output. It scans the provided directory recursively, filters obvious placeholders and local-only values, and saves both structured JSON and CSV reports.

IOCs:

Domains
Emails
IPv4 addresses
MD5 hashes
SHA1 hashes
SHA256 hashes
URLs
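
As a rough illustration of how several of these types can be matched, here is a minimal sketch using regular expressions. The patterns and the `extract_iocs` name are assumptions made for this example and are not taken from ioc_extractor.py; the project's own expressions may be stricter. Domains are left out because matching them reliably needs TLD validation, which a short pattern cannot show.

```python
import re

# Illustrative patterns only; the project's real expressions may differ.
_OCTET = r"(?:25[0-5]|2[0-4]\d|1?\d?\d)"
PATTERNS = {
    "SHA256": re.compile(r"\b[a-fA-F0-9]{64}\b"),
    "SHA1": re.compile(r"\b[a-fA-F0-9]{40}\b"),
    "MD5": re.compile(r"\b[a-fA-F0-9]{32}\b"),
    "IPv4": re.compile(rf"\b(?:{_OCTET}\.){{3}}{_OCTET}\b"),
    "Email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "URL": re.compile(r"\bhttps?://[^\s<>]+"),
}


def extract_iocs(text: str) -> dict[str, list[str]]:
    """Return the sorted, de-duplicated matches for each IOC type found."""
    found: dict[str, list[str]] = {}
    for ioc_type, pattern in PATTERNS.items():
        matches = sorted(set(pattern.findall(text)))
        if matches:
            found[ioc_type] = matches
    return found
```

The word boundaries around the hash patterns keep a 64-character SHA256 value from also being reported as an MD5 or SHA1 substring.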

Prerequisites

Python 3.12+
uv

Installation

Clone the repository:

git clone https://github.com/dfirsec/ioc_extractor.git

Navigate to the project directory:
```
cd ioc_extractor
```
Create or sync the environment with uv:
```
uv sync
```

This installs the project dependencies from pyproject.toml and uv.lock.

Usage

Run the extractor with uv:

uv run ioc_extractor.py <Directory containing potential IOCs>

If you prefer, the equivalent explicit form is:

uv run python ioc_extractor.py <Directory containing potential IOCs>

Replace <Directory containing potential IOCs> with the path to the directory where the files to be scanned are located.

Supported File Types

Text-like files:

.cfg
.conf
.config
.csv
.htm
.html
.ini
.json
.log
.md
.rtf
.txt
.xml
.yaml
.yml

Document files:

.docx
.pptx
.xlsx
.pdf
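
The recursive scan over these extensions could look roughly like the sketch below. The function name `iter_supported_files` and the case-insensitive suffix comparison are assumptions for illustration; the extractor may select files differently.

```python
from pathlib import Path

TEXT_EXTENSIONS = {
    ".cfg", ".conf", ".config", ".csv", ".htm", ".html", ".ini", ".json",
    ".log", ".md", ".rtf", ".txt", ".xml", ".yaml", ".yml",
}
DOCUMENT_EXTENSIONS = {".docx", ".pptx", ".xlsx", ".pdf"}


def iter_supported_files(root):
    """Yield every supported file under root, walking subdirectories."""
    supported = TEXT_EXTENSIONS | DOCUMENT_EXTENSIONS
    for path in sorted(Path(root).rglob("*")):
        # Assumption: suffixes are compared case-insensitively.
        if path.is_file() and path.suffix.lower() in supported:
            yield path
```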

Output

When actionable IOCs are found, the program writes:

results.json: Full structured output with a global summary and per-file context
results_summary.csv: Deduplicated IOC summary with counts and file coverage
results_hits.csv: Every retained IOC hit with file path, line number, and context
ioc_whitelist.txt: User-editable allowlist file created automatically if missing
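
To show how a deduplicated summary with counts and file coverage can be derived from individual hits, here is a small sketch. The field names (`type`, `value`, `file`, `count`, `files`) are hypothetical and chosen for this example only; they are not the documented columns of results_summary.csv or results_hits.csv.

```python
from collections import defaultdict


def summarize_hits(hits):
    """Collapse per-file hits into one row per (type, value) pair."""
    counts = defaultdict(int)
    files = defaultdict(set)
    for hit in hits:
        key = (hit["type"], hit["value"])
        counts[key] += 1               # total number of retained hits
        files[key].add(hit["file"])    # distinct files containing the IOC
    return [
        {"type": t, "value": v, "count": counts[(t, v)], "files": len(files[(t, v)])}
        for (t, v) in sorted(counts)
    ]
```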

Whitelist Format

Add one entry per line in ioc_whitelist.txt.

# Global entry
example.com

# Type-specific entry
Domain:contoso.com
URL:https://status.contoso.com/health
Email:alerts@contoso.com

Plain values apply to every IOC type. TYPE:value entries apply only to that IOC type.
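
One way to read this format is sketched below. It assumes that lines starting with # are comments, that type prefixes are matched case-insensitively, and that values are compared exactly; the type names beyond Domain, URL, and Email are guesses based on the IOC list. The function names are hypothetical and the extractor's own parser may behave differently.

```python
# Assumed set of type prefixes, lowercased for comparison.
KNOWN_TYPES = {"domain", "email", "ipv4", "md5", "sha1", "sha256", "url"}


def parse_whitelist(lines):
    """Split allowlist lines into global entries and per-type entries."""
    global_entries = set()
    typed_entries = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        prefix, sep, value = line.partition(":")
        if sep and prefix.lower() in KNOWN_TYPES:
            typed_entries.setdefault(prefix.lower(), set()).add(value.strip())
        else:
            # No recognized TYPE: prefix, so the whole line is a plain value.
            global_entries.add(line)
    return global_entries, typed_entries


def is_whitelisted(ioc_type, value, global_entries, typed_entries):
    """True if value is allowed globally or for this specific IOC type."""
    return value in global_entries or value in typed_entries.get(ioc_type.lower(), set())
```

Checking the prefix against a known set keeps a plain URL such as https://example.com from being misread as a typed entry with the type "https".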

Behavior

The extractor:

scans directories recursively
parses modern Office documents (.docx, .xlsx, .pptx) directly from their OOXML contents
parses PDFs when pypdf is installed
normalizes common obfuscation such as [.] and [@]
removes obvious false positives such as placeholder domains and file-name-like matches
suppresses private, reserved, loopback, and local-only IPv4 values
retains the first line number and context for each IOC per file
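
Three of the behaviors above (normalizing obfuscation, suppressing non-routable IPv4 values, and keeping the first line and context) can be sketched with the standard library alone. The helper names and the simplified IPv4 pattern are assumptions for illustration, not the extractor's actual code.

```python
import ipaddress
import re

# Simplified pattern for the example; octet ranges are validated separately.
IPV4_PATTERN = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def refang(text):
    """Undo the two obfuscation forms named above: [.] and [@]."""
    return text.replace("[.]", ".").replace("[@]", "@")


def is_reportable_ipv4(value):
    """False for invalid, private, loopback, link-local, or reserved values."""
    try:
        addr = ipaddress.IPv4Address(value)
    except ValueError:
        return False
    return not (
        addr.is_private or addr.is_loopback or addr.is_link_local
        or addr.is_reserved or addr.is_multicast or addr.is_unspecified
    )


def first_occurrences(text, pattern=IPV4_PATTERN):
    """Map each match to the (line number, original line) where it first appears."""
    seen = {}
    for line_no, line in enumerate(text.splitlines(), start=1):
        for match in pattern.findall(refang(line)):
            seen.setdefault(match, (line_no, line.strip()))
    return seen
```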

Limitations

Legacy Office formats such as .doc, .xls, and .ppt are not supported.
PDF parsing is disabled when pypdf is not installed; the CLI will tell you when that is the case.
OOXML extraction is text-focused and does not preserve document layout.
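
To illustrate what text-focused OOXML extraction means in practice, the sketch below reads a .docx as a ZIP archive and joins the text runs of each paragraph. It is an assumption-laden example (the `docx_text` name is hypothetical, and only word/document.xml is read), not the extractor's implementation. Tables, headers, and positioning are flattened or ignored, which is why layout is not preserved.

```python
import xml.etree.ElementTree as ET
import zipfile

# WordprocessingML namespace used by elements in word/document.xml.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_text(source):
    """Return paragraph text from a .docx path or file-like object."""
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(f"{W_NS}p"):
        # Join all text runs in the paragraph; formatting is discarded.
        # Paragraphs nested in text boxes may contribute their text twice.
        runs = [node.text or "" for node in para.iter(f"{W_NS}t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)
```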

Example

uv run ioc_extractor.py "C:\Users\name\Documents\cases"

License

This project is licensed under the MIT License.
