added RFC on how to create a living knowledge base of owasp things#734

Open
northdpole wants to merge 1 commit into main from owasp-graph

Conversation

@northdpole
Collaborator

No description provided.

@PRAteek-singHWY
Contributor

PRAteek-singHWY commented Feb 1, 2026

@northdpole
Thanks a lot for sharing this, Sir; it is extremely helpful and very well structured.

I've gone through the RFC and it gives a clear architectural and experimental framework to build the proposal around. I'll spend some time digesting it in detail and start aligning my work proposal with this design and the pre-code experiments outlined here.

@PRAteek-singHWY
Contributor

@northdpole

Thanks for putting this together, Sir. The experimental framework is really clear.

I’m particularly interested in Module C (The Librarian) and want to start with the suggested pre-code experiments before proposing any concrete design or implementation.

The negation problem stands out — I’ve worked on gap analysis features before (#716) and have seen how basic similarity metrics can struggle with logical inversions in requirements (e.g., “Use X” vs “Do NOT use X”).

Plan:
I’ll start with the ASVS re-classification experiment:

  • Extract 50 ASVS requirements and strip metadata
  • Baseline: vector search with cosine similarity
  • Comparison: cross-encoder re-ranking (ms-marco-MiniLM-L-6-v2)
  • Target: >20% accuracy improvement on negative requirements

If the experiment is successful, I’m also interested in exploring hybrid search (vector + BM25), especially for cases like CVE identifiers where pure vector search often underperforms.
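One simple way the hybrid combination could work is reciprocal rank fusion; the sketch below fuses two ranked lists without needing comparable scores (the document IDs and rankings are made up for illustration):

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k: int = 60):
    """Fuse ranked result lists: score(d) = sum over lists of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical rankings: vector search buries the exact CVE match,
# keyword (BM25) search puts it first.
vector_hits = ["asvs-4.1.2", "asvs-5.3.4", "cve-2021-44228"]
bm25_hits = ["cve-2021-44228", "asvs-4.1.2"]

fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
print(fused)
```

The fused list promotes the CVE document above results that only one retriever liked, which is the behavior that matters for exact-identifier queries.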

I'll take this up step by step.

I’ll share experiment results and observations before proposing any implementation.

I’m using AI tools (similar to Cursor/Windsurf) and have read Section 3.

Thank you.

@manshusainishab
Contributor

Hi @northdpole ,

Thanks for putting together this RFC — the structure, pre-code experiments, and CI-first mindset are exactly the kind of system I enjoy working on.

I’d like to formally express my interest in owning Module B: Noise / Relevance Filter as my primary contribution, and I’m also happy to assist with adjacent modules where needed.

Why Module B

The framing of Module B as a cheap, high-signal gate before expensive downstream processing resonates strongly with me. Getting this layer right feels critical to the quality, cost, and trustworthiness of the entire pipeline, especially given the planned regression dataset and CI enforcement.

Proposed Plan of Action (Aligned with the RFC)
I plan to follow the RFC strictly and start with experiments before any production code:

  1. Human Benchmark (Pre-Code Experiment)
    Collect a sample of historical diffs and manually label each as:
    Security Knowledge
    Noise (formatting, admin, linting, meta updates)
    This dataset will be versioned and reusable as an early “golden slice.”

  2. Prompt Iteration & Evaluation
    Start with a simple binary JSON output prompt:

“Is this content introducing or modifying security-relevant knowledge?”
Evaluate against the human benchmark.
Iterate until accuracy consistently exceeds 97%, with special attention to known failure modes (e.g., Code of Conduct updates, formatting-only diffs).

  3. Regex + LLM Cost Control
    Design the regex filter to aggressively eliminate obvious noise first (lockfiles, CSS, tests, config).
    Ensure the LLM is only invoked on borderline or content-heavy diffs.
    Document false positives / negatives clearly for future contributors.

  4. CI & Dataset Readiness
    Structure outputs so they can plug cleanly into the planned golden_dataset.json.
    Ensure behavior is deterministic and testable for CI regression checks.
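A minimal sketch of the regex gate described in step 3 (the path patterns and the word-count heuristic for "borderline" are assumptions, not a final design); the actual LLM call is out of scope here, so only the cheap, deterministic filtering logic is shown:

```python
import re

# Paths that are almost certainly noise: lockfiles, styles, tests, config.
NOISE_PATTERNS = [
    r"(^|/)package-lock\.json$",
    r"(^|/)poetry\.lock$",
    r"\.(css|scss)$",
    r"(^|/)tests?/",
    r"\.(ya?ml|toml|ini)$",
]
NOISE_RE = re.compile("|".join(NOISE_PATTERNS))

def needs_llm(path: str, diff_text: str, min_words: int = 30) -> bool:
    """Invoke the LLM only on content-heavy diffs that survive the regex gate."""
    if NOISE_RE.search(path):
        return False  # obvious noise: dropped before any LLM cost
    if len(diff_text.split()) < min_words:
        return False  # trivially small change: not worth classifying
    return True       # borderline or content-heavy: hand off to the LLM

print(needs_llm("docs/authz.md", "word " * 40))      # True
print(needs_llm("package-lock.json", "word " * 40))  # False
```

Because the gate is pure regex plus a threshold, its behavior is deterministic and trivially covered by CI regression tests, which is exactly the property step 4 asks for.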

Cross-Module Contributions

While Module B would be my ownership area, I can also help with:
Module A: defining shared interfaces and assumptions between diff harvesting and filtering.
CI / Evaluation: contributing test cases and failure examples derived from Module B experiments.

I’ve read and understood Section 3 (Agent-Ready CI & AI-generated PR constraints) and I’m comfortable working within those boundaries.

Looking forward to collaborating — this project feels like a rare opportunity to build something both technically rigorous and genuinely useful.

Best,
Manshu

@PRAteek-singHWY
Contributor

PRAteek-singHWY commented Feb 10, 2026

@northdpole Module C update (pre‑code experiment complete)

I ran the RFC‑required 50‑item ASVS experiment and also a 100‑item stability check to reduce variance (the negative subset is small, so a larger sample gives a more stable signal).

Results (negative top‑1):

  • 50‑item: 0.625 → 1.0
  • 100‑item: 0.6667 → 1.0

This passes the RFC success criteria (>20% improvement on negative requirements).
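The pass/fail arithmetic behind that claim can be made explicit (a tiny helper for illustration, not part of any agreed interface):

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Relative gain of the re-ranked accuracy over the baseline."""
    return (improved - baseline) / baseline

# Reported negative top-1 accuracies: baseline -> cross-encoder re-rank.
gain_50 = relative_improvement(0.625, 1.0)    # 0.60, i.e. +60%
gain_100 = relative_improvement(0.6667, 1.0)  # ~0.50, i.e. +50%

print(gain_50 > 0.20 and gain_100 > 0.20)  # both clear the >20% RFC bar
```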

Design doc (pipeline + CI plan):

https://gist.github.com/PRAteek-singHWY/7b35f0edbd9b8354257f3f5366951dab

Hybrid search (BM25 + vector) is listed as a bonus. I have not implemented it yet; I plan to explore it after the pre‑code experiment and design are approved.

Next steps per RFC (please confirm):

  1. Finalize design + interfaces
  2. Build golden_dataset.json + evaluation harness (CI regression)
  3. Implement Module C retrieval + re‑rank + update detection
  4. Tune threshold against the golden dataset
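As a sketch of what the step-2 evaluation harness could look like (the golden_dataset.json schema, the stand-in retriever interface, and the 90% floor are all assumptions to be settled during design review):

```python
import json

def top1_accuracy(dataset, retrieve):
    """Fraction of items whose expected CRE is the retriever's top hit."""
    hits = sum(1 for item in dataset
               if retrieve(item["text"])[0] == item["expected_cre"])
    return hits / len(dataset)

def ci_gate(dataset_path: str, retrieve, min_accuracy: float = 0.90) -> None:
    """Fail the CI job if retrieval accuracy regresses below the floor."""
    with open(dataset_path) as f:
        # Assumed schema: [{"text": ..., "expected_cre": ...}, ...]
        dataset = json.load(f)
    acc = top1_accuracy(dataset, retrieve)
    if acc < min_accuracy:
        raise SystemExit(f"regression: top-1 accuracy {acc:.2%} < {min_accuracy:.0%}")
    print(f"ok: top-1 accuracy {acc:.2%}")
```

The retriever is passed in as a callable so the same gate can score the baseline and the re-ranked pipeline against the same golden slice.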

@robvanderveer
Collaborator

Awesome, but requires some redesigning I think. Let's find out together.

  1. Start the description of the proposed solution with the functionality promise:
    We can unlock all of OWASP content as one resource in a structured way using the new technologies that have become available with AI. People will be able to get comprehensive answers to their questions and lookup queries.

  2. It seems we’re scraping everything, but that means we’ll also be scraping multiple versions, as some projects have different folders for different versions, of which some have not been published yet. I think that will lead to too much noise. A better option is to let repos have a robots.txt with the scraping folders listed and some optional metadata, like what we should call it.

  3. The module that fetches changes is trying to solve a problem that everybody has, and that must already have been solved. We shouldn’t reinvent that wheel. LlamaIndex and LangChain have solutions for this. It’s just a matter of presenting the entire new files again and letting that tech do the diffs, instead of looking at the GitHub diffs. The latter sounds more efficient, but we shouldn’t try to build a smart diff handler for chunking and embedding.
    A quick search found validatedpatterns-sandbox/vector-embedder. Dunno if it does diffs, but it does GitHub.
    By the way, the purpose of the module doesn’t really become clear. I seem to be missing a module that does the chunking and embedding calculation.

  4. We definitely should put the early designs of parsing links to OpenCRE into the librarian module: if a source section has a link to OpenCRE, that’s the link.

  5. We also should put the early designs of defining delineation of sections into the chunking module: the source specifying patterns to search for that delineate chunks.

Let’s book time next week and work an hour on this together. Slack me options please, if you’re open.

@PRAteek-singHWY
Contributor

PRAteek-singHWY commented Feb 11, 2026

Hey @robvanderveer

Thanks for the detailed feedback. I updated the Module C design to align with your points.

Key changes:

  • Starts with the functionality promise.
  • Clarifies boundaries: Librarian now focuses on mapping/semantics only.
  • Adds link-first logic: if a source section has an OpenCRE link, that mapping is authoritative.
  • Moves chunk delineation and embedding ownership upstream (separate chunking module).
  • Assumes framework-based ingestion/change handling (LlamaIndex/LangChain style), not custom smart diff parsing in Librarian.
  • Keeps cross-encoder negation handling and CI regression gates.

Updated design:
https://gist.github.com/PRAteek-singHWY/7b35f0edbd9b8354257f3f5366951dab

Also happy to sync live for 1 hour around next week; I will share timing on Slack.

@shreyakash24
Contributor

Hi @northdpole,
I would like to work on Module A. I have done its pre-code experiment to validate the technical feasibility of extracting high-signal security knowledge from the OWASP ecosystem.

Experiment Results & Quality Metrics:

  • 73.43% Token Compression: The pipeline successfully removed bulk infrastructure noise (CI/CD YAML, lockfiles, etc.). This represents a ~73% reduction in LLM operational costs by ensuring only semantic content is processed.

  • High Semantic Density (14.41 Chunks/k-token): The system isolates a high-density stream of actionable security knowledge chunks.

  • Precision & Integrity: Critical security documentation passed the filters, while infrastructure-only files were accurately rejected.

Shall I continue to write a detailed proposal regarding this?

@manshusainishab
Contributor

manshusainishab commented Feb 21, 2026

Hi @northdpole ,

I’ve been thinking about a lightweight “Noise / Relevance Filter” (Module B). As your idea suggests, the first step is to apply a cheap regex-based filter to discard obvious non-knowledge changes (formatting, lockfiles, minor docs), and then use a small LLM classifier to determine whether a commit actually adds meaningful security knowledge.

The plan is to validate this with a benchmark on ~100 historical commits to measure precision before proposing full integration.

Additionally, I’d like your thoughts on optionally adding a CodeRabbit AI layer to generate a structured diff summary before sending context to the LLM. Since CodeRabbit is free for open-source projects, it could provide higher-quality summaries and improve classification accuracy by giving the LLM better semantic context.

Would you be open to this direction, or prefer a simpler initial baseline first?

@PRAteek-singHWY
Contributor

PRAteek-singHWY commented Feb 22, 2026

Hey team @northdpole , @robvanderveer , and @Pa04rth 👋

Following up on our recent architectural discussions, I’ve spent the last 10 days deeply analyzing the end-to-end pipeline for Project OIE (#734). As conveyed to Spyros, since I have 6-7 months of extended bandwidth due to my internship term and less academic pressure, my goal for this GSoC period is to take ownership of creating a complete, production-ready flow across the ecosystem, under the guidance of all my mentors.

As Rob accurately stated: "We can unlock all of OWASP content as one resource in a structured way using the new technologies that have become available with AI."

To ensure complete clarity and alignment before the proposal deadline, I have physically mapped out the architectural blueprints and tool stacks for the entire project.

How the modules connect in one line:

The Upstream Ingestion Module provides clean, framework-delineated text chunks; the Librarian (Module C) intelligently maps those chunks while natively solving logical negations; and the Dashboard (Module D) acts as a high-speed human-review gate to ensure the OpenCRE graph is never corrupted.

I have broken down my blueprints into 4 detailed documents (with flow diagrams and tool selections):

🎯 1. System Goals & Architecture Flow

Mapping the Functionality Promise and visualizing exactly how the data flows from GitHub, through the three modules, to the Master Database.
📄 System_Goals_&_Architecture_Flow.pdf

📦 2. The Upstream Data Prep (Ingestion & Chunking)

Addressing Rob's feedback: Implementing robots.txt noise filtering, and delegating git-diff/state tracking to established frameworks (LlamaIndex / vector-embedder) so we don't reinvent the wheel. (3 Components explained)
📄 The_Upstream_Data_Prep_(Ingestion_&_Chunking).pdf

🧠 3. Module C: The Librarian (Semantic Intelligence)

Focusing strictly on mapping: Implementing Link-First authoritative overrides, and utilizing my successful Pre-Code Experiment (Cross-Encoders) to solve the "Negation Problem" with 100% accuracy. (2 Components explained)
📄 Module_C-The_Librarian(Semantic_Intelligence).pdf

📊 4. Module D: The Dashboard (Human-in-the-Loop)

Building a "Tinder-speed" review UI with keyboard bindings to allow maintainers to clear <0.8-confidence review queues in minutes, while logging rejections for future ML training. (3 Components explained)
📄 Module_D-The_Dashboard(Human_in_the_loop).pdf

I would love your feedback on these blueprints to ensure my final proposal hits the exact mark you envision for this living knowledge base!

@@ -0,0 +1,262 @@
# RFC: The OpenCRE Scraper & Indexer (Project OIE)
Collaborator


Change name to OWASP Agent. Position it as promise first: the why, not the how. So not: 'scraper and indexer'

Don't rely just on vectors. Use Hybrid Search (Vector + Keyword/BM25).
Why: Vectors are bad at exact keyword matches (e.g., specific CVE IDs).

### Module D: HITL & Logging
Collaborator


Please make the workflow clearer. Thanks.

Contributor

@PRAteek-singHWY PRAteek-singHWY Feb 24, 2026


Thank you @robvanderveer, that makes a lot of sense.
I’ll rename this to OWASP Agent and adjust the introduction to focus first on the problem and the promise it delivers, before going into the implementation details.

I’ll also rework the workflow section to make the end-to-end flow clearer and more explicit, especially around module responsibilities and how data moves between ingestion, hybrid retrieval, semantic reasoning, human validation, and the master database.
I’ll iterate on the document accordingly.

@manshusainishab
Contributor

Hi @northdpole,

I wanted to share a quick update on the Noise/Relevance Filter prototype.

I’ve extracted 100 randomly sampled historical commits and manually labeled them (80 noise / 20 security knowledge) to create a gold benchmark dataset. I then implemented a batch-based LLM classifier (Gemini) with rate limiting and evaluated it against this dataset.

Current results after prompt calibration:

  • Accuracy: 87%
  • Precision: 64%
  • Recall: 80%
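For reference, those three numbers are mutually consistent with the 100-commit / 20-positive split: the implied confusion matrix is TP=16, FP=9, FN=4, TN=71 (the only integer solution given 20 positives). A small sketch that checks the arithmetic:

```python
def metrics(tp: int, fp: int, fn: int, tn: int):
    """Standard binary-classification metrics from a confusion matrix."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, accuracy

# Confusion matrix implied by the reported results on the
# 100-commit benchmark (20 security-knowledge, 80 noise):
precision, recall, accuracy = metrics(tp=16, fp=9, fn=4, tn=71)
print(precision, recall, accuracy)  # 0.64 0.8 0.87
```

Seen this way, "higher precision" means driving the 9 false positives down while keeping the 4 false negatives from growing.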

I have significantly reduced false positives through stricter “new security concept” criteria, but there’s still room to improve precision further before proposing integration.

I’ve temporarily paused experimentation due to API quota limits, but I’ll continue refining the prompt and evaluation loop to push precision higher while keeping recall stable.

Would you prefer prioritizing higher precision (fewer false positives) even at the cost of some recall?

I would also like feedback on adding a CodeRabbit AI layer, so the LLM can get a better understanding of the changes and the codebase.

This is the repo I have created, if you are interested:
https://github.com/manshusainishab/OpenCRE_test_project

@ParthAggarwal16
Contributor

Hi everyone @northdpole , @robvanderveer , @Pa04rth

I’m currently exploring Module D (Human-in-the-Loop review + logging) and wanted to briefly share the direction I’m considering so I can get early feedback from the community.

My current understanding is that Module D acts as the human validation layer for AI-generated classifications coming from Module C, and its main responsibility is to allow maintainers to review flagged items quickly while generating high-quality correction logs that can later be used to improve the model.

The approach I’m currently exploring focuses on three main components:

1. Review Interface (Fast HITL workflow)

A minimal React-based admin UI designed for very fast review cycles (~3 seconds per item).

The idea is a keyboard-optimized workflow similar to a “Tinder-style” review:

  • y → accept prediction
  • n → reject prediction
  • e → edit label
  • s → skip item

The goal is to minimize clicks and allow maintainers to process review queues extremely quickly.

2. Review Queue + API Layer

A lightweight Flask backend responsible for:

  • serving items requiring review (produced by Module C)
  • managing the review queue
  • handling authentication and role-based access control (RBAC) for maintainers
  • submitting review decisions

Example API endpoints:

  • GET /review/next
  • POST /review/submit
  • GET /dashboard/stats

3. Structured Logging (JSONL → S3 / MinIO)

Instead of storing corrections directly in a database, every review action would append a structured JSON entry to JSONL logs stored in S3/MinIO.

Example log entry:

{
  "item_id": "...",
  "input_text": "...",
  "ai_prediction": "...",
  "human_label": "...",
  "reviewer": "...",
  "timestamp": "..."
}

This keeps the correction history append-only and reproducible, while also creating a clean dataset for potential model retraining later.
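The append-only property is easy to guarantee in code; a sketch (field names follow the example entry above, while the UTC ISO timestamp format and function names are assumptions):

```python
import json
from datetime import datetime, timezone
from pathlib import Path

def log_review(log_path: str, item_id: str, input_text: str,
               ai_prediction: str, human_label: str, reviewer: str) -> None:
    """Append one review decision as a single JSONL line; history is never rewritten."""
    entry = {
        "item_id": item_id,
        "input_text": input_text,
        "ai_prediction": ai_prediction,
        "human_label": human_label,
        "reviewer": reviewer,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    with open(log_path, "a") as f:  # mode "a": append-only by construction
        f.write(json.dumps(entry) + "\n")

def load_reviews(log_path: str):
    """Replay the full correction history in order (e.g. for retraining)."""
    return [json.loads(line) for line in Path(log_path).read_text().splitlines()]
```

Writing locally first and syncing the file to S3/MinIO keeps the upload concern separate from the logging concern.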

Pre-Code Experiments

Before implementing anything, I plan to validate two assumptions:

1. Review Speed Test

Build a small prototype to test whether reviewers can approve/reject items in <3 seconds using keyboard shortcuts.

2. Logging Pipeline Test

Verify the append-only JSONL logging flow and S3/MinIO upload behavior.

I’m also exploring the bonus “Loss Warehousing” idea to capture structured correction events that could later be used for model retraining.

I’ll share a small design/experiment gist shortly once I finish documenting the approach.

If there are any existing expectations around queue storage, authentication, or logging format, I’d love to align with those early.

Thanks!

@ParthAggarwal16
Contributor

Hi OpenCRE team @northdpole @robvanderveer @Pa04rth ,

I've been working on early design exploration for parts of the GSoC pipeline and wanted to share two draft design notes for feedback before moving into implementation.

These cover:

Module A — Information Harvesting

  • Nightly GitHub Actions pipeline
  • Incremental diff-based harvesting
  • Regex noise filtering (>90% file elimination before download)
  • Markdown diff parsing and raw change storage

Module D — Human-in-the-Loop Review

  • Keyboard-first review workflow (Accept / Reject / Edit)
  • Confidence-based queue routing
  • Append-only JSONL logging
  • Feedback loop for retraining the mapping model

Both modules include pre-code experiments to validate assumptions:

  • File filtering effectiveness on a real OWASP repository
  • Reviewer decision latency using a lightweight prototype

Gist:
https://gist.github.com/ParthAggarwal16/44da1185a9203da6e3114ba9d6d8c19e

This is still a draft and I'd really appreciate feedback on:

  • architectural assumptions
  • data flow boundaries between modules
  • anything that looks incompatible with the current OpenCRE pipeline.

Thanks!

@northdpole
Collaborator Author

> Hi everyone @northdpole, @robvanderveer, @Pa04rth
>
> I’m currently exploring Module D (Human-in-the-Loop review + logging) and wanted to briefly share the direction I’m considering so I can get early feedback from the community.
>
> My current understanding is that Module D acts as the human validation layer for AI-generated classifications coming from Module C, and its main responsibility is to allow maintainers to review flagged items quickly while generating high-quality correction logs that can later be used to improve the model.
>
> The approach I’m currently exploring focuses on three main components:
>
> 1. Review Interface (Fast HITL workflow)
>
> A minimal React-based admin UI designed for very fast review cycles (~3 seconds per item).
>
> The idea is a keyboard-optimized workflow similar to a “Tinder-style” review:
>
>   • y → accept prediction
>   • n → reject prediction
>   • e → edit label
>   • s → skip item
>
> The goal is to minimize clicks and allow maintainers to process review queues extremely quickly.

Yes, that's the point. Also look at the interface of git add -p, which is similar to what you're describing.
Just noting: skip should mean that the chunk/prediction returns in the future, or that you can go back to it.

> 2. Review Queue + API Layer
>
> A lightweight Flask backend responsible for:

Makes sense. Just keep in mind that it should not be a whole new application; you can make it as a blueprint/new routes of the existing one.

> 2. Logging Pipeline Test

Too bonus for now; let's nail down the basics first.

@Mahaboobunnisa123
Contributor

Hi team @northdpole, @Pa04rth @robvanderveer, I'd like to work on Module D (HITL & Logging) for GSoC.
I've been contributing to both OpenCRE and Cornucopia recently so I'm familiar with the codebase. I noticed from the existing Flask structure that Module D should be implemented as a blueprint with new routes rather than a standalone app, which aligns with how the existing application is structured.
I'm currently working on the pre-code experiment - building a minimal prototype to validate the 3-second keyboard-driven review flow. Will share results here shortly.
One question before I go deeper - should the review queue be backed by the existing Redis setup in the app, or is a simpler DB-backed queue preferred for the initial implementation?

@ParthAggarwal16
Contributor

> Hi everyone @northdpole, @robvanderveer, @Pa04rth
> I’m currently exploring Module D (Human-in-the-Loop review + logging) and wanted to briefly share the direction I’m considering so I can get early feedback from the community.
> My current understanding is that Module D acts as the human validation layer for AI-generated classifications coming from Module C, and its main responsibility is to allow maintainers to review flagged items quickly while generating high-quality correction logs that can later be used to improve the model.
> The approach I’m currently exploring focuses on three main components:
>
> 1. Review Interface (Fast HITL workflow)
>
> A minimal React-based admin UI designed for very fast review cycles (~3 seconds per item).
> The idea is a keyboard-optimized workflow similar to a “Tinder-style” review:
>
>   • y → accept prediction
>   • n → reject prediction
>   • e → edit label
>   • s → skip item
>
> The goal is to minimize clicks and allow maintainers to process review queues extremely quickly.
>
> Yes, that's the point. Also look at the interface of git add -p, which is similar to what you're describing. Just noting: skip should mean that the chunk/prediction returns in the future, or that you can go back to it.
>
> 2. Review Queue + API Layer
>
> A lightweight Flask backend responsible for:
>
> Makes sense. Just keep in mind that it should not be a whole new application; you can make it as a blueprint/new routes of the existing one.
>
> 2. Logging Pipeline Test
> Too bonus for now; let's nail down the basics first.

Hey @northdpole, thanks for the feedback!
Good point about the git add -p style interaction; that's actually very close to what I had in mind for keeping the review flow fast and keyboard-driven.

Also makes sense regarding skip. I’ll treat it more as a defer action so the item can return to the queue later or be revisited.

And noted on the architecture, I’ll keep the backend part lightweight and integrate it as routes/blueprints within the existing application instead of spinning up a separate service.

For now I’ll focus on getting the basic review and decision logging flow working first and keep the rest as future extensions.

Appreciate the guidance!

@Mahaboobunnisa123
Contributor

Mahaboobunnisa123 commented Mar 12, 2026

Hi team @northdpole @robvanderveer @Pa04rth, Pre-code experiment update for Module D. Built a minimal keyboard-driven review prototype to validate the 3-second review flow assumption.
Results from testing:

  • 5 items reviewed using Y/N/E/S keyboard shortcuts
  • Average review time: 3.15s
  • The keyboard interface works well for quick decisions - once a reviewer is familiar with the content type, sub-3s reviews are achievable
  • JSONL-style log output confirmed working (item_id, action, time, label)

Observation: First-time reviewers take slightly longer on unfamiliar content (~4s), but repeat reviewers with domain knowledge should comfortably stay under 3s. This validates the keyboard-first approach.
Prototype: https://gist.github.com/Mahaboobunnisa123/ff0f22a51d5042e66da07154666ab10f

Happy to take any feedback or suggestions on this. Thank you!
