This program extracts Indicators of Compromise (IOCs) from text and document files and produces triage-friendly output. It scans the provided directory recursively, filters obvious placeholders and local-only values, and saves both structured JSON and CSV reports.
IOCs:
- Domains
- Emails
- IPv4 addresses
- MD5 hashes
- SHA1 hashes
- SHA256 hashes
- URLs
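As a rough illustration of how these IOC types can be matched, the sketch below uses one regular expression per type. The patterns and the `find_iocs` helper are assumptions made for this example, not the expressions used by ioc_extractor.py, and they are deliberately simple (the domain pattern, for instance, also matches file names such as `a.php`).

```python
import re

# Illustrative patterns only; the project's real expressions may differ.
OCTET = r"(?:25[0-5]|2[0-4]\d|1?\d?\d)"
PATTERNS = {
    "Domain": re.compile(r"\b(?:[a-zA-Z0-9-]+\.)+[a-zA-Z]{2,}\b"),
    "Email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "IPv4": re.compile(rf"\b(?:{OCTET}\.){{3}}{OCTET}\b"),
    "MD5": re.compile(r"\b[a-fA-F0-9]{32}\b"),
    "SHA1": re.compile(r"\b[a-fA-F0-9]{40}\b"),
    "SHA256": re.compile(r"\b[a-fA-F0-9]{64}\b"),
    "URL": re.compile(r"\bhttps?://[^\s\"'<>]+"),
}


def find_iocs(text: str) -> dict[str, set[str]]:
    """Return the unique matches for each IOC type found in the text."""
    return {name: set(pattern.findall(text)) for name, pattern in PATTERNS.items()}


sample = (
    "Beacon to 8.8.8.8 and http://evil.example.net/a.php "
    "hash d41d8cd98f00b204e9800998ecf8427e contact alerts@contoso.com"
)
for ioc_type, values in find_iocs(sample).items():
    print(ioc_type, sorted(values))
```

The word boundaries around the hash patterns keep a 40-character SHA1 from also being reported as a 32-character MD5.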
Requirements:
- Python 3.12+
- uv
- Clone the repository:
  git clone https://github.com/dfirsec/ioc_extractor.git
- Navigate to the project directory:
  cd ioc_extractor
- Create or sync the environment with uv:
  uv sync
This installs the project dependencies from pyproject.toml and uv.lock.
Run the extractor with uv:
uv run ioc_extractor.py <Directory containing potential IOCs>
If you prefer, the equivalent explicit form is:
uv run python ioc_extractor.py <Directory containing potential IOCs>
Replace <Directory containing potential IOCs> with the path to the directory that contains the files to scan.
Text-like files:
- .cfg
- .conf
- .config
- .csv
- .htm
- .html
- .ini
- .json
- .log
- .md
- .rtf
- .txt
- .xml
- .yaml
- .yml
Document files:
- .docx
- .pptx
- .xlsx
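A recursive scan limited to these extensions could be sketched as follows. The `iter_supported_files` helper and the constant names are hypothetical and only illustrate the idea; the actual file discovery in ioc_extractor.py may work differently.

```python
from collections.abc import Iterator
from pathlib import Path

# Extension sets copied from the lists above; the names are made up for this sketch.
TEXT_EXTENSIONS = {
    ".cfg", ".conf", ".config", ".csv", ".htm", ".html", ".ini", ".json",
    ".log", ".md", ".rtf", ".txt", ".xml", ".yaml", ".yml",
}
DOCUMENT_EXTENSIONS = {".docx", ".pptx", ".xlsx"}
SUPPORTED_EXTENSIONS = TEXT_EXTENSIONS | DOCUMENT_EXTENSIONS


def iter_supported_files(root: Path) -> Iterator[Path]:
    """Yield every supported file below root, walking subdirectories."""
    for path in sorted(root.rglob("*")):
        # Compare suffixes case-insensitively so REPORT.DOCX is also picked up.
        if path.is_file() and path.suffix.lower() in SUPPORTED_EXTENSIONS:
            yield path
```

Lowercasing the suffix is a design choice of this sketch; whether the real tool treats extensions case-insensitively is not stated here.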
When actionable IOCs are found, the program writes:
- results.json: Full structured output with a global summary and per-file context
- results_summary.csv: Deduplicated IOC summary with counts and file coverage
- results_hits.csv: Every retained IOC hit with file path, line number, and context
- ioc_whitelist.txt: User-editable allowlist file created automatically if missing
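To make "deduplicated summary with counts and file coverage" concrete, here is a minimal sketch that collapses individual hits into one row per IOC. The column names (`type`, `value`, `count`, `file_count`) and both helper functions are assumptions for illustration; the real results_summary.csv layout may differ.

```python
import csv
import io
from collections import defaultdict

FIELDNAMES = ["type", "value", "count", "file_count"]  # assumed column names


def summarize_hits(hits):
    """Collapse (ioc_type, value, file_path) hits into one row per unique IOC."""
    counts = defaultdict(int)
    files = defaultdict(set)
    for ioc_type, value, file_path in hits:
        key = (ioc_type, value)
        counts[key] += 1
        files[key].add(file_path)
    return [
        {"type": t, "value": v, "count": counts[(t, v)], "file_count": len(files[(t, v)])}
        for (t, v) in sorted(counts)
    ]


def write_summary_csv(rows, handle):
    """Write the summary rows as CSV to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)


hits = [
    ("Domain", "evil.example.net", "a.txt"),
    ("Domain", "evil.example.net", "b.txt"),
    ("IPv4", "8.8.8.8", "a.txt"),
]
buffer = io.StringIO()
write_summary_csv(summarize_hits(hits), buffer)
print(buffer.getvalue())
```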
Add one entry per line in ioc_whitelist.txt.
# Global entry
example.com
# Type-specific entry
Domain:contoso.com
URL:https://status.contoso.com/health
Email:alerts@contoso.com
Plain values apply to every IOC type. TYPE:value entries apply only to that IOC type.
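The two entry forms can be told apart with a small parser like the one sketched below. It assumes that lines starting with `#` are comments and that type labels are matched case-insensitively; the `KNOWN_TYPES` set and both function names are hypothetical, since only the Domain, URL, and Email labels appear above.

```python
# Assumed type labels; only Domain, URL, and Email are shown in the example entries.
KNOWN_TYPES = {"domain", "email", "ipv4", "md5", "sha1", "sha256", "url"}


def parse_allowlist(lines):
    """Split allowlist lines into global values and per-type values."""
    global_values: set[str] = set()
    typed_values: dict[str, set[str]] = {}
    for raw in lines:
        entry = raw.strip()
        if not entry or entry.startswith("#"):
            continue  # skip blank lines and comments
        prefix, sep, value = entry.partition(":")
        # Only treat the prefix as a type when it is a known label, so a plain
        # URL such as https://example.org stays a global entry.
        if sep and value and prefix.lower() in KNOWN_TYPES:
            typed_values.setdefault(prefix.lower(), set()).add(value.strip())
        else:
            global_values.add(entry)
    return global_values, typed_values


def is_allowed(ioc_type, value, global_values, typed_values):
    """Return True when the value is allowlisted globally or for its type."""
    return value in global_values or value in typed_values.get(ioc_type.lower(), set())
```

With the example entries above, `contoso.com` would be suppressed as a Domain but still reported if it appeared under another IOC type, while `example.com` would be suppressed everywhere.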
The extractor now:
- scans directories recursively
- parses modern Office documents (.docx, .xlsx, .pptx) directly from their OOXML contents
- parses PDFs when pypdf is installed
- normalizes common obfuscation such as [.] and [@]
- removes obvious false positives such as placeholder domains and file-name-like matches
- suppresses private, reserved, loopback, and local-only IPv4 values
- retains the first line number and context for each IOC per file
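Two of the behaviors listed above, normalizing obfuscation and suppressing non-routable IPv4 values, might look roughly like this. The sketch relies on the standard-library `ipaddress` module; the function names are hypothetical, and the exact set of address classes the extractor filters is an assumption based on the wording above.

```python
import ipaddress


def refang(text: str) -> str:
    """Undo the two obfuscation styles named above: [.] and [@]."""
    return text.replace("[.]", ".").replace("[@]", "@")


def is_reportable_ipv4(value: str) -> bool:
    """Return True only for IPv4 addresses that are worth triaging."""
    try:
        address = ipaddress.IPv4Address(value)
    except ValueError:
        return False  # not a valid IPv4 address at all
    return not (
        address.is_private
        or address.is_reserved
        or address.is_loopback
        or address.is_link_local
        or address.is_multicast
        or address.is_unspecified
    )


print(refang("user[@]evil[.]example[.]net"))
print(is_reportable_ipv4("8.8.8.8"), is_reportable_ipv4("10.0.0.1"))
```

Normalizing before matching means a defanged indicator such as `evil[.]example[.]net` is reported in its usable form.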
Limitations:
- Legacy Office formats such as .doc, .xls, and .ppt are not supported.
- PDF parsing is disabled when pypdf is not installed; the CLI will tell you when that is the case.
- OOXML extraction is text-focused and does not preserve document layout.
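To show what text-focused OOXML extraction means in practice, the sketch below treats a .docx, .xlsx, or .pptx file as a ZIP archive and collects the contents of every text-run element, ignoring layout. It is an assumption-laden illustration rather than the project's implementation: `extract_ooxml_text` is a hypothetical name, and the approach relies on the fact that text runs in all three formats use an element whose local name is `t`.

```python
import xml.etree.ElementTree as ET
import zipfile


def extract_ooxml_text(path) -> str:
    """Return the text runs found in the XML parts of an OOXML package."""
    chunks = []
    with zipfile.ZipFile(path) as archive:
        for name in archive.namelist():
            if not name.endswith(".xml"):
                continue
            root = ET.fromstring(archive.read(name))
            for element in root.iter():
                if not isinstance(element.tag, str):
                    continue
                # Strip the "{namespace}" prefix to get the local element name.
                local_name = element.tag.rsplit("}", 1)[-1]
                if local_name == "t" and element.text:
                    chunks.append(element.text)
    return "\n".join(chunks)
```

Paragraph breaks, tables, and slide order are all lost with this approach, which is acceptable for IOC matching but not for reconstructing the document. The standard-library XML parser is not hardened against maliciously crafted input, so a stricter parser may be preferable when scanning untrusted samples.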
Example:
uv run ioc_extractor.py "C:\Users\name\Documents\cases"
This project is licensed under the MIT License.