added RFC on how to create a living knowledge base of owasp things #734
northdpole wants to merge 1 commit into main from
Conversation
@northdpole I've gone through the RFC; it gives a clear architectural and experimental framework to build the proposal around. I'll spend some time digesting it in detail and then start aligning my work proposal with this design and the pre-code experiments outlined here.
Thanks for putting this together; the experimental framework is really clear. I'm particularly interested in Module C (The Librarian) and want to start with the suggested pre-code experiments before proposing any concrete design or implementation. The negation problem stands out: I've worked on gap-analysis features before (#716) and have seen how basic similarity metrics can struggle with logical inversions in requirements (e.g., "Use X" vs "Do NOT use X"). Plan:
If the experiment is successful, I'm also interested in exploring hybrid search (vector + BM25), especially for cases like CVE identifiers where pure vector search often underperforms. I'll take this up step by step and share experiment results and observations before proposing any implementation. I'm using AI tools (similar to Cursor/Windsurf) and have read Section 3. Thank you.
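As a quick illustration of the negation problem mentioned above, here is a minimal sketch (with made-up requirement strings) showing why a plain bag-of-words similarity metric scores "Use X" and "Do NOT use X" as near-duplicates even though the meanings are opposite; this is the failure mode the cross-encoder experiment would target.

```python
# Illustrative sketch: bag-of-words cosine similarity cannot see negation,
# because "use X" and "do not use X" share almost all of their tokens.
from collections import Counter
from math import sqrt

def cosine_bow(a: str, b: str) -> float:
    """Cosine similarity over simple lowercase word counts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

# Hypothetical requirement pair, not taken from ASVS verbatim.
req = "use http basic authentication for admin endpoints"
neg = "do not use http basic authentication for admin endpoints"

print(f"{cosine_bow(req, neg):.2f}")  # high score despite inverted meaning
```

A bi-encoder built on similar surface statistics inherits this blind spot, which is why a cross-encoder (which reads both sentences jointly) is worth testing here.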
Hi @northdpole, thanks for putting together this RFC; the structure, pre-code experiments, and CI-first mindset are exactly the kind of system I enjoy working on. I'd like to formally express my interest in owning Module B (Noise / Relevance Filter) as my primary contribution, and I'm also happy to assist with adjacent modules where needed.

Why Module B: the framing of Module B as a cheap, high-signal gate before expensive downstream processing resonates strongly with me. Getting this layer right feels critical to the quality, cost, and trustworthiness of the entire pipeline, especially given the planned regression dataset and CI enforcement.

Proposed Plan of Action (Aligned with the RFC):
Cross-Module Contributions: while Module B would be my ownership area, I can also help with:

I've read and understood Section 3 (Agent-Ready CI & AI-generated PR constraints) and I'm comfortable working within those boundaries. Looking forward to collaborating; this project feels like a rare opportunity to build something both technically rigorous and genuinely useful. Best,
@northdpole Module C update (pre-code experiment complete). I ran the RFC-required 50-item ASVS experiment and also a 100-item stability check to reduce variance (the negative subset is small, so a larger sample gives a more stable signal). Results (negative top-1):

This passes the RFC success criterion (>20% improvement on negative requirements).

Design doc (pipeline + CI plan):

Hybrid search (BM25 + vector) is listed as a bonus. I have not implemented it yet; I plan to explore it after the pre-code experiment and design are approved.

Next steps per RFC (please confirm):
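For the hybrid-search bonus mentioned above, one common and simple fusion strategy is reciprocal rank fusion (RRF): merge the keyword (BM25) ranking with the vector ranking so that exact-match hits like CVE identifiers are not drowned out by semantic neighbours. The sketch below uses invented document IDs and is only an illustration of the merge step, not a committed design.

```python
# Sketch (illustrative data): reciprocal rank fusion over two ranked lists.
# Each document scores sum(1 / (k + rank)) across the rankings it appears in.
def rrf_merge(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked doc-id lists into one fused ranking (best first)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits   = ["CVE-2021-44228", "asvs-2.1.1", "asvs-6.2.3"]   # exact keyword match first
vector_hits = ["asvs-2.1.1", "asvs-6.2.3", "CVE-2021-44228"]   # semantic neighbours first

print(rrf_merge([bm25_hits, vector_hits]))
```

The `k` constant (60 is a conventional default) damps the influence of any single top rank, so a document that both retrievers agree on beats a document only one retriever loved.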
Awesome, but requires some redesigning I think. Let's find out together.
Let's book time next week and spend an hour on this together. Slack me options please, if you're open.
Hey @robvanderveer Thanks for the detailed feedback. I updated the Module C design to align with your points. Key changes:
Updated design:

Also happy to sync live for an hour next week; I'll share timing on Slack.
Hi @northdpole, Experiment Results & Quality Metrics:
Shall I go ahead and write a detailed proposal based on this?
Hi @northdpole, I've been thinking about a lightweight Noise / Relevance Filter (Module B). As your idea suggests, the first step is a cheap regex-based filter to discard obvious non-knowledge changes (formatting, lockfiles, minor docs), followed by a small LLM classifier to determine whether a commit actually adds meaningful security knowledge. The plan is to validate this with a benchmark on ~100 historical commits to measure precision before proposing full integration. Additionally, I'd like your thoughts on optionally adding a CodeRabbit AI layer to generate a structured diff summary before sending context to the LLM. Since CodeRabbit is free for open-source projects, it could provide higher-quality summaries and improve classification accuracy by giving the LLM better semantic context. Would you be open to this direction, or would you prefer a simpler initial baseline first?
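To make the first stage concrete, here is a rough sketch of what the cheap regex gate could look like. The patterns and the all-files-must-match rule are illustrative assumptions, not a proposed final rule set.

```python
# Hypothetical first-stage filter for Module B: cheap path rules that discard
# obvious non-knowledge commits before any LLM call is made.
import re

NOISE_PATTERNS = [
    r"(^|/)package-lock\.json$",   # dependency lockfiles
    r"(^|/)yarn\.lock$",
    r"\.(png|jpg|svg|ico)$",       # binary/asset churn
    r"(^|/)\.github/workflows/",   # CI plumbing
]

def is_obvious_noise(changed_files: list[str]) -> bool:
    """True only when every changed file matches a known-noise pattern."""
    if not changed_files:
        return True
    return all(
        any(re.search(p, f) for p in NOISE_PATTERNS) for f in changed_files
    )

print(is_obvious_noise(["package-lock.json"]))            # pure lockfile churn
print(is_obvious_noise(["docs/authn.md", "yarn.lock"]))   # mixed commit -> send to LLM
```

Requiring *all* files to match keeps the gate conservative: any commit touching even one non-noise file still reaches the LLM classifier, so the regex stage can only create false negatives for the expensive stage, never silently drop knowledge.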
Hey team @northdpole, @robvanderveer, and @Pa04rth 👋 Following up on our recent architectural discussions, I've spent the last 10 days deeply analyzing the end-to-end pipeline for Project OIE (#734). As conveyed to Spyros: since I have 6-7 months of extended bandwidth due to my internship term and lighter academic pressure, my goal for this GSoC period is to take ownership of creating a complete, production-ready flow across the ecosystem, under the guidance of all my mentors. As Rob accurately stated: "We can unlock all of OWASP content as one resource in a structured way using the new technologies that have come available with AI." To ensure complete clarity and alignment before the proposal deadline, I have mapped out the architectural blueprints and tool stacks for the entire project. How the modules connect, in one line:
I have broken down my blueprints into 4 detailed documents (with flow diagrams and tool selections):

🎯 1. System Goals & Architecture Flow: mapping the Functionality Promise and visualizing exactly how the data flows from GitHub, through the three modules, to the Master Database.

📦 2. The Upstream Data Prep (Ingestion & Chunking): addressing Rob's feedback: implementing

🧠 3. Module C: The Librarian (Semantic Intelligence): focusing strictly on mapping: implementing Link-First authoritative overrides, and using my successful pre-code experiment (cross-encoders) to solve the "Negation Problem" with 100% accuracy. (2 components explained)

📊 4. Module D: The Dashboard (Human-in-the-Loop): building a "Tinder-speed" review UI with keyboard bindings to allow maintainers to clear <0.8-confidence-threshold queues in minutes, while logging rejections for future ML training. (3 components explained)

I would love your feedback on these blueprints to ensure my final proposal hits the exact mark you envision for this living knowledge base!
@@ -0,0 +1,262 @@
# RFC: The OpenCRE Scraper & Indexer (Project OIE)
Change name to OWASP Agent. Position it as promise first: the why, not the how. So not: 'scraper and indexer'
Don't rely just on vectors. Use Hybrid Search (Vector + Keyword/BM25).
Why: Vectors are bad at exact keyword matches (e.g., specific CVE IDs).
### Module D: HITL & Logging
Please make the workflow more clear. thanks
Thank you @robvanderveer that makes a lot of sense.
I’ll rename this to OWASP Agent and adjust the introduction to focus first on the problem and the promise it delivers, before going into the implementation details.
I’ll also rework the workflow section to make the end-to-end flow clearer and more explicit, especially around module responsibilities and how data moves between ingestion, hybrid retrieval, semantic reasoning, human validation, and the master database.
I’ll iterate on the document accordingly.
Hi @northdpole, I wanted to share a quick update on the Noise/Relevance Filter prototype. I’ve extracted 100 randomly sampled historical commits and manually labeled them (80 noise / 20 security knowledge) to create a gold benchmark dataset. I then implemented a batch-based LLM classifier (Gemini) with rate limiting and evaluated it against this dataset. Current results after prompt calibration:
I have significantly reduced false positives through stricter "new security concept" criteria, but there's still room to improve precision before proposing integration. I've temporarily paused experimentation due to API quota limits, but I'll continue refining the prompt and evaluation loop to push precision higher while keeping recall stable. Would you prefer prioritizing higher precision (fewer false positives) even at the cost of some recall? I'd also like feedback on adding a CodeRabbit AI layer so the LLM gets a better understanding of the changes and codebase. This is the repo I created, if you're interested:
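For reproducibility of the benchmark numbers, the evaluation itself only needs a few lines. The sketch below computes precision and recall for the "security knowledge" class; the predictions are made-up stand-ins over the 80/20 label split, not the actual classifier output.

```python
# Sketch of the benchmark evaluation: gold labels vs classifier predictions,
# where True means "adds security knowledge" and False means "noise".
def precision_recall(gold: list[bool], pred: list[bool]) -> tuple[float, float]:
    tp = sum(g and p for g, p in zip(gold, pred))
    fp = sum((not g) and p for g, p in zip(gold, pred))
    fn = sum(g and (not p) for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

gold = [True] * 20 + [False] * 80                 # 20 knowledge / 80 noise, as in the benchmark
pred = [True] * 18 + [False] * 2 + [False] * 74 + [True] * 6   # invented predictions

p, r = precision_recall(gold, pred)
print(f"precision={p:.2f} recall={r:.2f}")
```

Pinning this loop (and the labeled dataset) in CI would let the precision-vs-recall trade-off question be answered with a regression test rather than re-run by hand.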
Hi everyone @northdpole, @robvanderveer, @Pa04rth, I'm currently exploring Module D (Human-in-the-Loop review + logging) and wanted to briefly share the direction I'm considering so I can get early feedback from the community. My current understanding is that Module D acts as the human validation layer for AI-generated classifications coming from Module C, and its main responsibility is to allow maintainers to review flagged items quickly while generating high-quality correction logs that can later be used to improve the model. The approach I'm currently exploring focuses on three main components:

1. Review Interface (fast HITL workflow): a minimal React-based admin UI designed for very fast review cycles (~3 seconds per item). The idea is a keyboard-optimized workflow similar to a "Tinder-style" review:
The goal is to minimize clicks and allow maintainers to process review queues extremely quickly.

2. Review Queue + API Layer: a lightweight Flask backend responsible for:
Example API endpoints:
3. Structured Logging (JSONL → S3 / MinIO): instead of storing corrections directly in a database, every review action would append a structured JSON entry to JSONL logs stored in S3/MinIO. Example log entry:

This keeps the correction history append-only and reproducible, while also creating a clean dataset for potential model retraining later.

Pre-Code Experiments: before implementing anything, I plan to validate two assumptions:

1. Review Speed Test: build a small prototype to test whether reviewers can approve/reject items in <3 seconds using keyboard shortcuts.
2. Logging Pipeline Test: verify the append-only JSONL logging flow and S3/MinIO upload behavior.

I'm also exploring the bonus "Loss Warehousing" idea to capture structured correction events that could later be used for model retraining. I'll share a small design/experiment gist shortly once I finish documenting the approach. If there are any existing expectations around queue storage, authentication, or logging format, I'd love to align with those early. Thanks!
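A minimal sketch of the append-only JSONL logging step discussed above; the field names are hypothetical, and the S3/MinIO upload would be a separate sync step on top of this local file.

```python
# Sketch: each review action appends one JSON line; earlier lines are never
# rewritten, which keeps the correction history reproducible.
import json
import tempfile
import time
from pathlib import Path

def log_review(path: Path, item_id: str, decision: str, reviewer: str) -> None:
    """Append one structured review event to the JSONL log."""
    entry = {
        "ts": int(time.time()),
        "item_id": item_id,
        "decision": decision,      # e.g. "approve" | "reject" | "defer"
        "reviewer": reviewer,
    }
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")

log_path = Path(tempfile.mkdtemp()) / "reviews.jsonl"
log_review(log_path, "cre-123", "approve", "maintainer-a")
log_review(log_path, "cre-456", "reject", "maintainer-b")
print(log_path.read_text())   # one JSON object per line
```

Because every event is a self-describing line, the same file doubles as the retraining dataset later: filter on `decision` and you have labeled corrections with no schema migration.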
Hi OpenCRE team @northdpole @robvanderveer @Pa04rth , I've been working on early design exploration for parts of the GSoC pipeline and wanted to share two draft design notes for feedback before moving into implementation. These cover: Module A — Information Harvesting
Module D — Human-in-the-Loop Review
Both modules include pre-code experiments to validate assumptions:
Gist: This is still a draft and I'd really appreciate feedback on:
Thanks!
Yes that's the point, also look at the interface of
Makes sense, just keep in mind that it should not be a whole new application, you can make it as a blueprint/new routes of the existing one.
Hi team @northdpole, @Pa04rth, @robvanderveer, I'd like to work on Module D (HITL & Logging) for GSoC.
Hey @northdpole, thanks for the feedback! The point about skip also makes sense; I'll treat it more as a defer action so the item can return to the queue later or be revisited. And noted on the architecture: I'll keep the backend lightweight and integrate it as routes/blueprints within the existing application instead of spinning up a separate service. For now I'll focus on getting the basic review and decision-logging flow working first and keep the rest as future extensions. Appreciate the guidance!
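The "skip = defer" behaviour can be sketched in a few lines: a deferred item is rotated to the back of the queue instead of being dropped. The deque-based queue and item IDs below are illustrative only.

```python
# Sketch: defer rotates an item to the back of the review queue rather than
# discarding it, so nothing is silently lost.
from collections import deque

queue = deque(["item-1", "item-2", "item-3"])

def next_item():
    """Peek at the item currently up for review."""
    return queue[0] if queue else None

def decide(action: str) -> None:
    """approve/reject removes the head item; defer sends it to the back."""
    if not queue:
        return
    item = queue.popleft()
    if action == "defer":
        queue.append(item)   # comes back around later

decide("approve")            # item-1 leaves the queue
decide("defer")              # item-2 moves to the back
print(list(queue))
```

Inside the existing Flask app this would just be the handler logic behind a `/review/decision` route rather than a standalone service, matching the blueprint/new-routes constraint above.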
Hi team @northdpole @robvanderveer @Pa04rth, Pre-code experiment update for Module D. Built a minimal keyboard-driven review prototype to validate the 3-second review flow assumption.
Observation: First-time reviewers take slightly longer on unfamiliar content (~4s), but repeat reviewers with domain knowledge should comfortably stay under 3s. This validates the keyboard-first approach.
Happy to take any feedback or suggestions on this. Thank you!

No description provided.