This project simulates a Security Compliance Audit workflow: it ingests raw, unstructured log data (emails and phone numbers), normalizes it, and cross-references it against a known "Blacklist" of compromised or spam actors.
The tool demonstrates an ETL (Extract, Transform, Load) pipeline in Python, using regular expressions (regex) to extract data from the logs and hash sets for fast matching.
- Synthetic Data Generation: Uses the Faker library to generate thousands of realistic (but fake) US phone numbers (in varying formats) and email addresses.
- Advanced Pattern Matching: Custom regex patterns capable of extracting phone numbers with dashes, dots, spaces, and extensions, as well as complex email structures.
- High-Performance Lookup: Converts the blacklist lists into Python sets so that each lookup during the cross-referencing phase takes O(1) time on average, which keeps the audit scalable even with millions of records.
- Data Injection: Simulates real-world "hits" by injecting known blacklist entities into the clean logs.
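The data injection feature could be sketched roughly as follows. This is a minimal illustration using only the standard library, not the project's actual code: the function name `inject_hits`, its parameters, and the sample addresses are all hypothetical.

```python
import random

def inject_hits(clean_entries, blacklist_entries, n_hits, seed=None):
    """Hypothetical helper: return a copy of the clean log with n_hits
    blacklist entries mixed in, plus the list of injected entries."""
    rng = random.Random(seed)
    # Sample without replacement so every injected entry is distinct.
    injected = rng.sample(list(blacklist_entries), n_hits)
    logs = list(clean_entries)
    for entry in injected:
        # Insert at a random position so hits are scattered through the log.
        logs.insert(rng.randint(0, len(logs)), entry)
    return logs, injected

# Made-up sample data, for illustration only.
clean = ["alice@example.com", "bob@example.org", "carol@example.net"]
blacklist = ["spam@bad-domain.test", "phish@bad-domain.test"]
logs, injected = inject_hits(clean, blacklist, n_hits=2, seed=42)
```

Because the injected entries come straight from the blacklist, the audit is guaranteed to find at least `n_hits` matches, which gives a known lower bound to check the final count against.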
.
├── audit_script.py # The main logic: scrapes data and runs the cross-reference
├── data.py # Generates the 'clean' log data (Phone numbers/Emails)
├── black_list.py # Generates the 'blacklist' data (Spam domains/Burner phones)
└── README.md # Documentation
- Python 3.x
- Faker library (for generating synthetic data)
- Clone the repository:

      git clone <repository-url>
      cd ETL_pipeline

- Install dependencies:

      pip install faker

- Generate the Data: Ensure data.py and black_list.py are in the same directory. (Note: Ensure you include the injection logic in data.py to guarantee hits during the simulation.)
- Run the Audit: Execute the main script to parse the logs and find the threats.

      python audit_script.py

- View Results: The script will output the loading status, scraping progress, and a final count of matching hits found in the logs.
The project compiles its patterns with re.VERBOSE, which allows whitespace and comments inside a pattern so that complex expressions stay readable:
- Phone Numbers: Handles formats like (555) 123-4567, 555.123.4567, and 555-123-4567 x1234.
- Emails: Validates structure, including specific handling for the @ symbol and domain extensions.
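As a rough illustration of the approach described above, the sketch below shows what verbose patterns of this kind might look like. The names `PHONE_RE`, `EMAIL_RE`, and `scrape` are hypothetical, and the patterns are simplified assumptions rather than the ones used in audit_script.py.

```python
import re

# Hypothetical phone pattern; re.VERBOSE lets each piece carry a comment.
PHONE_RE = re.compile(r"""
    (?<!\d)                          # do not start inside a longer digit run
    (?:\(\d{3}\)|\d{3})              # area code, with or without parentheses
    [\s.-]?                          # optional separator: space, dot, or dash
    \d{3}                            # exchange
    [\s.-]?                          # optional separator
    \d{4}                            # line number
    (?:\s*(?:x|ext\.?)\s*\d{1,5})?   # optional extension such as "x1234"
    (?!\d)                           # do not stop inside a longer digit run
""", re.VERBOSE | re.IGNORECASE)

# Hypothetical email pattern (deliberately simpler than full RFC 5322).
EMAIL_RE = re.compile(r"""
    [A-Za-z0-9._%+-]+                # local part
    @                                # the @ symbol
    (?:[A-Za-z0-9-]+\.)+             # domain labels, each followed by a dot
    [A-Za-z]{2,}                     # top-level domain (extension)
""", re.VERBOSE)

def scrape(text):
    """Return (phones, emails) found in a block of raw log text."""
    return PHONE_RE.findall(text), EMAIL_RE.findall(text)

sample = (
    "Call (555) 123-4567 or 555.123.4567, fax 555-123-4567 x1234. "
    "Contact alice.smith+tag@mail.example.com or bob@example.org."
)
phones, emails = scrape(sample)
print(phones)
print(emails)
```

Neither pattern contains a capturing group, so `findall` returns the full matched strings. A real audit would probably also normalize the extracted phone numbers (for example, by stripping separators) before comparing them to the blacklist.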
Naive cross-referencing (checking a list against a list) results in O(n*m) complexity, because every membership test scans the whole blacklist. This project casts the blacklist to a set(), reducing each lookup to O(1) on average and the full pass to roughly O(n + m).
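To make the difference concrete, here is a small self-contained sketch of both strategies. The function names and the sample phone numbers are hypothetical and serve only to illustrate the technique.

```python
def cross_reference_naive(scraped, blacklist):
    """List-against-list check: each `in` is a linear scan, so O(n*m) overall."""
    return [entry for entry in scraped if entry in blacklist]

def cross_reference(scraped, blacklist):
    """Set-based check: O(m) to build the set, then O(1) average per lookup."""
    blacklist_set = set(blacklist)
    return [entry for entry in scraped if entry in blacklist_set]

# Made-up sample data, for illustration only.
scraped_phones = ["555-123-4567", "555-000-1111", "555-987-6543"]
blacklist_phones = ["555-000-1111", "555-222-3333", "555-987-6543"]

hits = cross_reference(scraped_phones, blacklist_phones)
print(hits)
```

Both functions return the same hits in log order; only the cost differs, and the gap widens as the blacklist grows.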
    # Instead of iterating through a list, we use instant lookup
    if phone_no in blacklist_phones_set:
        phone_no_hits.append(phone_no)

Contributions are welcome! Please fork the repository and submit a pull request for any enhancements (e.g., adding multithreading for larger datasets or exporting results to CSV).
This project is open-source and available under the MIT License.