added RFC on how to create a living knowledge base of owasp things #734

Open
northdpole wants to merge 1 commit into main from owasp-graph

Conversation

@northdpole
Collaborator

No description provided.

@PRAteek-singHWY
Contributor

PRAteek-singHWY commented Feb 1, 2026

@northdpole
Thanks a lot for sharing this sir, this is extremely helpful and very well structured.

I've gone through the RFC and it gives a clear architectural and experimental framework to build the proposal around. I'll spend some time digesting it in detail and start aligning my work proposal with this design and the pre-code experiments outlined here.

@PRAteek-singHWY
Contributor

@northdpole

Thanks for putting this together Sir, the experimental framework is really clear.

I’m particularly interested in Module C (The Librarian) and want to start with the suggested pre-code experiments before proposing any concrete design or implementation.

The negation problem stands out — I’ve worked on gap analysis features before (#716) and have seen how basic similarity metrics can struggle with logical inversions in requirements (e.g., “Use X” vs “Do NOT use X”).

Plan:
I’ll start with the ASVS re-classification experiment:

  • Extract 50 ASVS requirements and strip metadata
  • Baseline: vector search with cosine similarity
  • Comparison: cross-encoder re-ranking (ms-marco-MiniLM-L-6-v2)
  • Target: >20% accuracy improvement on negative requirements
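
A rough sketch of the evaluation loop this plan implies (all names are illustrative; the real runs would plug cosine similarity over embeddings into `score` for the baseline and cross-encoder scores for the comparison):

```python
def top1_accuracy(queries, candidates, gold, score):
    """Fraction of queries whose highest-scoring candidate is the gold match."""
    hits = sum(
        1 for q in queries
        if max(candidates, key=lambda c: score(q, c)) == gold[q]
    )
    return hits / len(queries)

# Toy lexical scorer that shows the negation failure mode:
# "Use X" and "Do NOT use X" share most of their tokens.
def overlap(a, b):
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)
```

Scoring the baseline and the re-ranked run with the same `top1_accuracy` keeps the >20% comparison apples-to-apples.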

If the experiment is successful, I’m also interested in exploring hybrid search (vector + BM25), especially for cases like CVE identifiers where pure vector search often underperforms.
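
For reference, a self-contained sketch of what that hybrid could look like (a minimal BM25 scorer plus reciprocal-rank fusion of the two result lists; the parameters and whitespace tokenization are placeholders, not a tuned design):

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Minimal BM25 over pre-tokenized documents."""
    n = len(docs_tokens)
    avgdl = sum(len(d) for d in docs_tokens) / n
    df = Counter(t for d in docs_tokens for t in set(d))
    scores = []
    for d in docs_tokens:
        tf = Counter(d)
        s = 0.0
        for t in query_tokens:
            if t not in tf:
                continue
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

def rrf(rankings, k=60):
    """Reciprocal-rank fusion of several ranked lists of doc indices."""
    fused = Counter()
    for ranking in rankings:
        for rank, doc in enumerate(ranking):
            fused[doc] += 1.0 / (k + rank + 1)
    return [doc for doc, _ in fused.most_common()]
```

Exact-token matches like CVE identifiers get full BM25 credit even when the embedding model has never seen the ID, which is exactly the case where pure vector search struggles.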

I'll take this up step by step.

I’ll share experiment results and observations before proposing any implementation.

I’m using AI tools (similar to Cursor/Windsurf) and have read Section 3.

Thank you.

@manshusainishab
Contributor

Hi @northdpole ,

Thanks for putting together this RFC — the structure, pre-code experiments, and CI-first mindset are exactly the kind of system I enjoy working on.

I’d like to formally express my interest in owning Module B: Noise / Relevance Filter as my primary contribution, and I’m also happy to assist with adjacent modules where needed.

Why Module B

The framing of Module B as a cheap, high-signal gate before expensive downstream processing resonates strongly with me. Getting this layer right feels critical to the quality, cost, and trustworthiness of the entire pipeline, especially given the planned regression dataset and CI enforcement.

Proposed Plan of Action (Aligned with the RFC)
I plan to follow the RFC strictly and start with experiments before any production code:

  1. Human Benchmark (Pre-Code Experiment)
    Manually label a sample of historical commits as either:
    Security Knowledge
    Noise (formatting, admin, linting, meta updates)
    This dataset will be versioned and reusable as an early “golden slice.”

  2. Prompt Iteration & Evaluation
    Start with a simple binary JSON output prompt:
    “Is this content introducing or modifying security-relevant knowledge?”
    Evaluate against the human benchmark.
    Iterate until accuracy consistently exceeds 97%, with special attention to known failure modes (e.g., Code of Conduct updates, formatting-only diffs).

  3. Regex + LLM Cost Control
    Design the regex filter to aggressively eliminate obvious noise first (lockfiles, CSS, tests, config).
    Ensure the LLM is only invoked on borderline or content-heavy diffs.
    Document false positives / negatives clearly for future contributors.

  4. CI & Dataset Readiness
    Structure outputs so they can plug cleanly into the planned golden_dataset.json.
    Ensure behavior is deterministic and testable for CI regression checks.
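
To make the regex stage concrete, here is a hypothetical path-based gate (the pattern list is a guess; the real list should be derived from the labeled commit benchmark):

```python
import re

# Assumed noise patterns -- lockfiles, stylesheets, CI config, tests.
# These are illustrative, not the final filter.
NOISE_PATTERNS = [
    r"(^|/)package-lock\.json$",
    r"(^|/)yarn\.lock$",
    r"\.(css|scss)$",
    r"(^|/)\.github/",
    r"(^|/)tests?/",
]
NOISE_RE = [re.compile(p) for p in NOISE_PATTERNS]

def needs_llm(changed_paths):
    """True only if at least one changed file falls outside the noise list,
    so the (expensive) LLM classifier runs on borderline commits only."""
    return any(not any(rx.search(p) for rx in NOISE_RE) for p in changed_paths)
```

Being deterministic and pure, a gate like this is trivially testable in CI, which matches the regression-check goal above.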

Cross-Module Contributions

While Module B would be my ownership area, I can also help with:

  • Module A: defining shared interfaces and assumptions between diff harvesting and filtering.
  • CI / Evaluation: contributing test cases and failure examples derived from Module B experiments.

I’ve read and understood Section 3 (Agent-Ready CI & AI-generated PR constraints) and I’m comfortable working within those boundaries.

Looking forward to collaborating — this project feels like a rare opportunity to build something both technically rigorous and genuinely useful.

Best,
Manshu

@PRAteek-singHWY
Contributor

PRAteek-singHWY commented Feb 10, 2026

@northdpole Module C update (pre‑code experiment complete)

I ran the RFC‑required 50‑item ASVS experiment and also a 100‑item stability check to reduce variance (the negative subset is small, so a larger sample gives a more stable signal).

Results (negative top‑1):

  • 50‑item: 0.625 → 1.0
  • 100‑item: 0.6667 → 1.0

This passes the RFC success criteria (>20% improvement on negative requirements).
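
As a sanity check on the criterion (assuming “>20%” means improvement relative to the baseline; the absolute gain of 37.5 points passes either reading):

```python
# Negative top-1 accuracy from the 50-item run reported above.
baseline_top1 = 0.625   # vector search, cosine similarity
reranked_top1 = 1.0     # after cross-encoder re-ranking

relative_gain = (reranked_top1 - baseline_top1) / baseline_top1  # 0.60, i.e. 60%
absolute_gain = reranked_top1 - baseline_top1                    # 0.375
```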

Design doc (pipeline + CI plan):

https://gist.github.com/PRAteek-singHWY/7b35f0edbd9b8354257f3f5366951dab

Hybrid search (BM25 + vector) is listed as a bonus. I have not implemented it yet; I plan to explore it after the pre‑code experiment and design are approved.

Next steps per RFC (please confirm):

  1. Finalize design + interfaces
  2. Build golden_dataset.json + evaluation harness (CI regression)
  3. Implement Module C retrieval + re‑rank + update detection
  4. Tune threshold against the golden dataset

@robvanderveer
Collaborator

Awesome, but requires some redesigning I think. Let's find out together.

  1. Start the description of the proposed solution with the functionality promise:
    We can unlock all of OWASP content as one resource in a structured way using the new technologies that have come available with AI. People will be able to get comprehensive answers to their questions and lookup queries.

  2. It seems we’re scraping everything, but that means we’ll also be scraping multiple versions, as some projects have different folders for different versions, some of which have not been published yet. I think that will lead to too much noise. A better option is to let repos have a robots.txt with the scraping folders listed and some optional metadata, like what we should call it.

  3. The module that fetches changes is trying to solve a problem that everybody has, and that must already have been solved. We shouldn’t reinvent that wheel. LlamaIndex and LangChain have solutions for this. It’s just a matter of presenting the entire new files again and letting that tech do the diffs, instead of looking at the GitHub diffs. The latter sounds more efficient, but we shouldn’t try to build a smart diff handler for chunking and embedding.
    A quick search found validatedpatterns-sandbox/vector-embedder. I don’t know if it does diffs, but it does GitHub.
    By the way, the purpose of the module doesn’t really become clear. I seem to be missing a module that does the chunking and embedding calculation.

  4. We definitely should put the early designs of parsing links to OpenCRE into the librarian module: if a source section has a link to OpenCRE, that’s the link.

  5. We also should put the early designs of defining delineation of sections into the chunking module: the source specifying patterns to search for that delineate chunks.

Let’s book time next week and work an hour on this together. Slack me options please, if you’re open.

@PRAteek-singHWY
Contributor

PRAteek-singHWY commented Feb 11, 2026

Hey @robvanderveer

Thanks for the detailed feedback. I updated the Module C design to align with your points.

Key changes:

  • Starts with the functionality promise.
  • Clarifies boundaries: Librarian now focuses on mapping/semantics only.
  • Adds link-first logic: if a source section has an OpenCRE link, that mapping is authoritative.
  • Moves chunk delineation and embedding ownership upstream (separate chunking module).
  • Assumes framework-based ingestion/change handling (LlamaIndex/LangChain style), not custom smart diff parsing in Librarian.
  • Keeps cross-encoder negation handling and CI regression gates.

Updated design:
https://gist.github.com/PRAteek-singHWY/7b35f0edbd9b8354257f3f5366951dab

Also happy to sync live for 1 hour sometime next week; I will share timing options on Slack.

@shreyakash24
Contributor

Hi @northdpole,
I would like to work on Module A. I have completed its pre-code experiment to validate the technical feasibility of extracting high-signal security knowledge from the OWASP ecosystem.

Experiment Results & Quality Metrics:

  • 73.43% Token Compression: The pipeline successfully removed bulk infrastructure noise (CI/CD YAML, lockfiles, etc.). This represents a ~73% reduction in LLM operational costs by ensuring only semantic content is processed.

  • High Semantic Density (14.41 Chunks/k-token): The system isolates a high-density stream of actionable security knowledge chunks.

  • Precision & Integrity: Critical security documentation passed the filters, while infrastructure-only files were accurately rejected.
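
For reproducibility, the two headline numbers reduce to simple ratios (the function names are mine, not from the pipeline):

```python
def token_compression(tokens_in, tokens_out):
    """Fraction of input tokens removed before any LLM call."""
    return 1 - tokens_out / tokens_in

def chunk_density(num_chunks, tokens_out):
    """Retained knowledge chunks per 1,000 surviving tokens."""
    return num_chunks * 1000 / tokens_out
```

Since LLM API pricing is per input token, the compression fraction is also the fraction saved on ingestion cost, which is where the ~73% figure comes from.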

Shall I continue to write a detailed proposal regarding this?

@manshusainishab
Contributor

manshusainishab commented Feb 21, 2026

Hi @northdpole ,

I’ve been thinking about a lightweight “Noise / Relevance Filter” (Module B). As your idea suggests, the first step is a cheap regex-based filter to discard obvious non-knowledge changes (formatting, lockfiles, minor docs), followed by a small LLM classifier to determine whether a commit actually adds meaningful security knowledge.

As the plan suggests, I’d validate this with a benchmark on ~100 historical commits to measure precision before proposing full integration.

Additionally, I’d like your thoughts on optionally adding a CodeRabbit AI layer to generate a structured diff summary before sending context to the LLM. Since CodeRabbit is free for open-source projects, it could provide higher-quality summaries and improve classification accuracy by giving the LLM better semantic context.

Would you be open to this direction, or prefer a simpler initial baseline first?

@PRAteek-singHWY
Contributor

PRAteek-singHWY commented Feb 22, 2026

Hey team @northdpole, @robvanderveer, and @Pa04rth 👋

Following up on our recent architectural discussions, I’ve spent the last 10 days deeply analyzing the end-to-end pipeline for Project OIE (#734). As conveyed to Spyros, I have 6-7 months of extended bandwidth due to my internship term and lighter academic pressure, so my goal for this GSoC period is to take ownership of creating a complete, production-ready flow across the ecosystem, under the guidance of all my mentors.

As Rob accurately stated: "We can unlock all of OWASP content as one resource in a structured way using the new technologies that have come available with AI."

To ensure complete clarity and alignment before the proposal deadline, I have physically mapped out the architectural blueprints and tool stacks for the entire project.

How the modules connect in one line:

The Upstream Ingestion Module provides clean, framework-delineated text chunks; the Librarian (Module C) intelligently maps those chunks while natively solving logical negations; and the Dashboard (Module D) acts as a high-speed human-review gate to ensure the OpenCRE graph is never corrupted.

I have broken down my blueprints into 4 detailed documents (with flow diagrams and tool selections):

🎯 1. System Goals & Architecture Flow

Mapping the Functionality Promise and visualizing exactly how the data flows from GitHub, through the three modules, to the Master Database.
📄 System_Goals_&_Architecture_Flow.pdf

📦 2. The Upstream Data Prep (Ingestion & Chunking)

Addressing Rob's feedback: Implementing robots.txt noise filtering, and delegating git-diff/state tracking to established frameworks (LlamaIndex / vector-embedder) so we don't reinvent the wheel. (3 Components explained)
📄 The_Upstream_Data_Prep_(Ingestion_&_Chunking).pdf

🧠 3. Module C: The Librarian (Semantic Intelligence)

Focusing strictly on mapping: Implementing Link-First authoritative overrides, and utilizing my successful Pre-Code Experiment (Cross-Encoders) to solve the "Negation Problem" with 100% accuracy. (2 Components explained)
📄 Module_C-The_Librarian(Semantic_Intelligence).pdf

📊 4. Module D: The Dashboard (Human-in-the-Loop)

Building a “Tinder-speed” review UI with keyboard bindings to allow maintainers to clear <0.8-confidence review queues in minutes, while logging rejections for future ML training. (3 Components explained)
📄 Module_D-The_Dashboard(Human_in_the_loop).pdf

I would love your feedback on these blueprints to ensure my final proposal hits the exact mark you envision for this living knowledge base!

@@ -0,0 +1,262 @@
# RFC: The OpenCRE Scraper & Indexer (Project OIE)
Collaborator


Change name to OWASP Agent. Position it as promise first: the why, not the how. So not: 'scraper and indexer'

Don't rely just on vectors. Use Hybrid Search (Vector + Keyword/BM25).
Why: Vectors are bad at exact keyword matches (e.g., specific CVE IDs).

### Module D: HITL & Logging
Collaborator


Please make the workflow more clear. thanks

Contributor

@PRAteek-singHWY PRAteek-singHWY Feb 24, 2026


Thank you @robvanderveer that makes a lot of sense.
I’ll rename this to OWASP Agent and adjust the introduction to focus first on the problem and the promise it delivers, before going into the implementation details.

I’ll also rework the workflow section to make the end-to-end flow clearer and more explicit, especially around module responsibilities and how data moves between ingestion, hybrid retrieval, semantic reasoning, human validation, and the master database.
I’ll iterate on the document accordingly.

@manshusainishab
Contributor

Hi @northdpole,

I wanted to share a quick update on the Noise/Relevance Filter prototype.

I’ve extracted 100 randomly sampled historical commits and manually labeled them (80 noise / 20 security knowledge) to create a gold benchmark dataset. I then implemented a batch-based LLM classifier (Gemini) with rate limiting and evaluated it against this dataset.

Current results after prompt calibration:

  • Accuracy: 87%
  • Precision: 64%
  • Recall: 80%

I have significantly reduced false positives through stricter “new security concept” criteria, but there’s still room to improve precision further before proposing integration.
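
For what it’s worth, on a 100-commit benchmark with 20 positives those three numbers pin down a single confusion matrix, which is a useful cross-check:

```python
# Recall 0.80 over 20 positives gives TP = 16, FN = 4;
# precision 0.64 then forces FP = 9, and TN = 80 - 9 = 71.
tp, fp, fn, tn = 16, 9, 4, 71

precision = tp / (tp + fp)                  # 0.64
recall = tp / (tp + fn)                     # 0.80
accuracy = (tp + tn) / (tp + fp + fn + tn)  # 0.87
```

So pushing precision higher means eliminating most of those ~9 false positives without losing the 16 true positives.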

I’ve temporarily paused experimentation due to API quota limits, but I’ll continue refining the prompt and evaluation loop to push precision higher while keeping recall stable.

Would you prefer prioritizing higher precision (fewer false positives) even at the cost of some recall?

I’d also like feedback on adding a CodeRabbit AI layer so the LLM can get a better understanding of the changes and the codebase.

This is the repo I have created, if you are interested:
https://github.com/manshusainishab/OpenCRE_test_project
