🛡️ Log Audit & Blacklist Cross-Referencer

Overview

This project simulates a security compliance audit workflow: it ingests raw, unstructured log data (emails and phone numbers), normalizes it, and cross-references it against a known "Blacklist" of compromised or spam actors.

The tool demonstrates an ETL (Extract, Transform, Load) pipeline in Python, using regular expressions (regex) to scrape the data and hash sets for fast matching.
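The normalization step mentioned above could look something like the following. This is a minimal sketch, not the code in audit_script.py: the function names `normalize_phone` and `normalize_email` are hypothetical, and it assumes US numbers that should reduce to ten digits with any extension dropped.

```python
import re

def normalize_phone(raw: str) -> str:
    """Reduce a US phone number to bare digits (hypothetical helper).

    Assumption: extensions are dropped and a leading country code "1" is removed.
    """
    # Cut off a trailing extension such as "x1234" or "ext. 1234" first,
    # so its digits are not mixed into the main number.
    main = re.split(r"\s*(?:ext\.?|x)\s*\d+\s*$", raw, flags=re.IGNORECASE)[0]
    # Keep digits only: "(555) 123-4567" and "555.123.4567" become identical.
    digits = re.sub(r"\D", "", main)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    return digits

def normalize_email(raw: str) -> str:
    """Trim whitespace and lowercase an email address (hypothetical helper)."""
    return raw.strip().lower()

for sample in ["(555) 123-4567", "555.123.4567", "555-123-4567 x1234"]:
    print(sample, "->", normalize_phone(sample))
```

Normalizing both the scraped values and the blacklist entries the same way matters because set membership is an exact string comparison: two spellings of the same number would otherwise never match.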

🚀 Features

Synthetic Data Generation: Uses the Faker library to generate thousands of realistic (but fake) US phone numbers (with varying formats) and email addresses.
Advanced Pattern Matching: Custom Regex patterns capable of extracting phone numbers with dashes, dots, spaces, and extensions, as well as complex email structures.
High-Performance Lookup: Converts the blacklist lists into Python sets so that each membership check during the cross-referencing phase takes O(1) time on average, which keeps the audit scalable as the number of records grows.
Data Injection: Simulates real-world "hits" by injecting known blacklist entities into the clean logs.
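The data injection feature in the list above can be sketched as follows. This is an illustrative version using only the standard library; the function name `inject_hits` and its parameters are assumptions, not the actual contents of data.py.

```python
import random

def inject_hits(clean_logs, blacklist, n_hits, seed=None):
    """Return (logs, injected): a copy of clean_logs with n_hits blacklist
    entries inserted at random positions, plus the entries that were inserted.

    Hypothetical helper; the seed makes the simulation reproducible.
    """
    rng = random.Random(seed)
    # Pick distinct blacklist entries; never ask for more than exist.
    injected = rng.sample(list(blacklist), k=min(n_hits, len(blacklist)))
    logs = list(clean_logs)  # copy so the caller's list is left untouched
    for entry in injected:
        # randint is inclusive, so an entry may also land at the very end.
        logs.insert(rng.randint(0, len(logs)), entry)
    return logs, injected

logs, injected = inject_hits(
    ["a@example.com", "b@example.com"], ["x@spam.test", "y@spam.test"], 1, seed=42
)
print(len(logs), injected)
```

Returning the injected entries alongside the logs gives the audit a ground truth to check against: the final hit count should be at least the number of injected entries.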

📂 Project Structure

.
├── audit_script.py        # The main logic: scrapes data and runs the cross-reference
├── data.py                # Generates the 'clean' log data (Phone numbers/Emails)
├── black_list.py          # Generates the 'blacklist' data (Spam domains/Burner phones)
└── README.md              # Documentation

🛠️ Prerequisites

Python 3.x
Faker library (for generating synthetic data)

📦 Installation

Clone the repository:
```
git clone <repository-url>
cd ETL_pipeline
```
Install dependencies:
```
pip install faker
```

⚡ Usage

Generate the Data: Ensure data.py and black_list.py are in the same directory as audit_script.py. (Note: include the injection logic in data.py to guarantee hits during the simulation.)
Run the Audit: Execute the main script to parse the logs and find the threats.
```
python audit_script.py
```
View Results: The script will output the loading status, scraping progress, and a final count of matching hits found in the logs.

🔍 Technical Details

The Regex Logic

The project uses re.VERBOSE to handle complex string matching:

Phone Numbers: Handles formats like (555) 123-4567, 555.123.4567, and 555-123-4567 x1234.
Emails: Validates structure including specific handling for the @ symbol and domain extensions.
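As an illustration of the approach described above, the patterns might be written like this. These are assumed patterns, not the ones in audit_script.py, and the names `PHONE_RE`, `EMAIL_RE`, and `scrape` are hypothetical. re.VERBOSE allows whitespace and comments inside the pattern, which is what makes a multi-part regex readable.

```python
import re

# Assumed pattern for US phone numbers with optional extension.
PHONE_RE = re.compile(r"""
    (?<!\d)                          # do not start in the middle of a longer number
    (?:\(\d{3}\)|\d{3})              # area code, with or without parentheses
    [\s.-]?                          # optional separator: space, dot, or dash
    \d{3}                            # exchange
    [\s.-]?                          # optional separator
    \d{4}                            # line number
    (?:\s*(?:ext\.?|x)\s*\d{1,5})?   # optional extension such as "x1234"
    (?!\d)                           # do not stop in the middle of a longer number
""", re.VERBOSE | re.IGNORECASE)

# Assumed pattern for emails: a pragmatic match, not full RFC validation.
EMAIL_RE = re.compile(r"""
    [A-Za-z0-9._%+-]+                  # local part
    @
    [A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)*  # one or more domain labels
    \.[A-Za-z]{2,}                     # domain extension
""", re.VERBOSE)

def scrape(text):
    """Return (phones, emails) found in text, in order of appearance."""
    return PHONE_RE.findall(text), EMAIL_RE.findall(text)

sample = (
    "Call (555) 123-4567 or 555.123.4567; desk: 555-123-4567 x1234. "
    "Mail jane.doe@mail.example.com or bob@spam.io."
)
phones, emails = scrape(sample)
print(phones)
print(emails)
```

Both patterns use only non-capturing groups, so `findall` returns the whole matched string rather than tuples of sub-groups.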

Performance Optimization

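Naive cross-referencing (checking a list against a list) results in O(n*m) complexity, where n is the number of scraped entries and m is the size of the blacklist. This project casts the blacklist to a set(), reducing each lookup to O(1) on average, so the whole pass costs roughly O(n + m): O(m) to build the set once and O(n) to check every entry.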

# Instead of iterating through a list, we use instant lookup
if phone_no in blacklist_phones_set:
    phone_no_hits.append(phone_no)
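For context, here is a self-contained sketch of the same idea next to the naive version. The function names `find_hits_naive` and `find_hits` and the sample data are made up for illustration; only the `in`-a-set check mirrors the snippet above.

```python
def find_hits_naive(scraped, blacklist):
    """List-against-list check: each `in` scans the blacklist, so O(n*m) overall."""
    return [item for item in scraped if item in blacklist]

def find_hits(scraped, blacklist):
    """Set-based check: build the set once (O(m)), then O(1) average per lookup."""
    blacklist_set = set(blacklist)
    return [item for item in scraped if item in blacklist_set]

# Made-up sample data, already normalized to bare digits.
scraped_phones = ["5551234567", "5550000000", "5551234567"]
blacklist_phones = ["5551234567", "5559999999"]

phone_no_hits = find_hits(scraped_phones, blacklist_phones)
print(len(phone_no_hits), "hits")
```

Both functions return the same hits in the same order; only the cost of each membership test differs. Note that the set must be built once outside the loop, since rebuilding it per entry would bring back the O(n*m) cost.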

🤝 Contributing

Contributions are welcome! Please fork the repository and submit a pull request for any enhancements (e.g., adding multithreading for larger datasets or exporting results to CSV).

📄 License

This project is open-source and available under the MIT License.
