Skip to content

feat: update payload enrichment logic with input limits and filtering#273

Open
dwisiswant0 wants to merge 3 commits into
mainfrom
dwisiswant0/fix/issue-270
Open

feat: update payload enrichment logic with input limits and filtering#273
dwisiswant0 wants to merge 3 commits into
mainfrom
dwisiswant0/fix/issue-270

Conversation

@dwisiswant0
Copy link
Copy Markdown
Member

@dwisiswant0 dwisiswant0 commented Mar 15, 2025

Scale the number of inputs to process based on
limit or max-size options to generate additional
payloads.

Prolly fixes #270

Summary by CodeRabbit

  • New Features

    • Generation now supports responsive cancellation so long-running operations stop promptly and avoid emitting after cancel.
    • Streaming deduplication improves real-time removal of duplicate results during output.
    • Enforced output limits stop generation when line or byte caps are reached.
  • Tests

    • Added a test verifying the line-limit behavior and correct result counts.

with input limits and filtering

Scale the number of inputs to process based on
limit or max-size options to generate additional
payloads.

Signed-off-by: Dwi Siswanto <git@dw1.io>
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Mar 15, 2025

Review Change Stack

Walkthrough

ClusterBomb now accepts a context and an exported callback type; permutation generation checks ctx for cancellation. Execute streams and deduplicates results with ctx-aware draining, ExecuteWithWriter uses a cancelable context and calls cancel() when Limits or MaxSize are reached, and a unit test verifies the line limit.

Changes

Cancelable generation + streaming dedupe

Layer / File(s) Summary
ClusterBomb API
algo.go
Add context import and ClusterBombCallback type; change ClusterBomb signature to accept ctx context.Context and the new callback type.
ClusterBomb recursion & cancellation
algo.go
Add early ctx.Err() guard; check <-ctx.Done() in end-recursion and recursive steps; stop invoking callback after cancellation.
Mutator imports
mutator.go
Add math to imports to support MaxSize < math.MaxInt budget checks and remove prior draining dedupe import.
Execute: ctx-guarded emissions
mutator.go
Wrap sends of mined subdomains and template permutations with select on ctx.Done() to allow prompt cancellation.
Execute: streaming dedupe
mutator.go
Refactor Execute to stream deduplicated results via an intermediate channel and an in-memory seen map; respect ctx.Done() to stop consuming internal results.
ExecuteWithWriter: cancel-on-limit
mutator.go
Run Execute with context.WithCancel and call cancel() when Options.Limit or Options.MaxSize are reached while writing and counting payloads.
clusterBomb implementation
mutator.go
clusterBomb updated to accept context.Context, check cancellation, and avoid emitting after ctx is canceled.
Tests
mutator_test.go
Add TestMutatorLineLimit asserting that ExecuteWithWriter respects Options.Limit and that m.payloadCount matches the limit.

Sequence Diagram(s)

sequenceDiagram
    participant Caller
    participant ExecuteWithWriter
    participant Execute
    participant clusterBomb
    participant Writer

    Caller->>ExecuteWithWriter: start with Options (Limit/MaxSize)
    ExecuteWithWriter->>Execute: Execute(ctxWithCancel)
    Execute->>clusterBomb: drive permutations (passes ctx)
    clusterBomb->>Execute: emit results (checks ctx)
    Execute->>ExecuteWithWriter: stream results (dedupe stage)
    ExecuteWithWriter->>Writer: write item, increment payloadCount
    alt limit or maxsize reached
      ExecuteWithWriter->>ExecuteWithWriter: cancel() ctx
      ExecuteWithWriter-->>Execute: generation stops
    end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

I hopped through ctx and streaming lanes,
Cancelled loops and counted gains.
Deduped whispers, limits met,
Five neat lines — a tidy set.
🐰✨

🚥 Pre-merge checks | ✅ 2 | ❌ 3

❌ Failed checks (2 warnings, 1 inconclusive)

Check name Status Explanation Resolution
Linked Issues check ⚠️ Warning While the PR implements input limiting and filtering as shown in mutator.go and algo.go, it does not fully address the core performance issue in #270. The reviewer notes that input limiting alone won't improve runtime when the bottleneck is permutation generation, which requires context-based cancellation of generators. Implement context-aware cancellation in Execute() and clusterBomb() to allow ExecuteWithWriter() to cancel permutation generation when output limits are reached, directly addressing the performance issue.
Docstring Coverage ⚠️ Warning Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check ❓ Inconclusive The title references 'payload enrichment logic with input limits and filtering' which describes the PR's changes to limit inputs and add filtering, but does not address the core performance optimization (context-aware cancellation for generators) that is the primary focus of the linked issue #270. Clarify whether this PR primarily addresses input limiting (current title) or generator-level cancellation for runtime improvement (issue #270 requirement). Consider a more specific title if the latter is the main goal.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Out of Scope Changes check ✅ Passed The changes to mutator.go, algo.go, and mutator_test.go are focused on input limiting, filtering, and generator cancellation, which align with the objectives of addressing #270's performance issue.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch dwisiswant0/fix/issue-270

Tip

⚡🧪 Multi-step agentic review comment chat (experimental)
  • We're introducing multi-step agentic chat in review comments. This experimental feature enhances review discussions with the CodeRabbit agentic chat by enabling advanced interactions, including the ability to create pull requests directly from comments.
    - To enable this feature, set early_access to true under in the settings.

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (3)
mutator.go (3)

305-309: Consider clarifying the MaxSize condition logic

The condition only applies MaxSize if it's less than or equal to the number of inputs. This means larger MaxSize values won't be applied, which might not be intuitive.

Consider revising the condition to apply MaxSize whenever it's greater than zero:

-if m.Options.MaxSize > 0 && m.Options.MaxSize <= len(inputs) {
+if m.Options.MaxSize > 0 {
    maxInputsToProcess = m.Options.MaxSize
    maxWordsToExtract = m.Options.MaxSize
    maxNumbersToExtract = m.Options.MaxSize
}

335-337: Ensure number extraction limits handle zero value correctly

The implementation correctly limits the number of extracted numbers, but should verify that maxNumbersToExtract is greater than zero before slicing.

Consider adding an additional check to ensure maxNumbersToExtract is greater than zero:

-if len(numbers) > maxNumbersToExtract && maxNumbersToExtract > 0 {
+if maxNumbersToExtract > 0 && len(numbers) > maxNumbersToExtract {
    numbers = numbers[:maxNumbersToExtract]
}

344-346: Similar consideration for word extraction limits

The implementation has the same pattern as number extraction. Consider consistent ordering of conditions.

-if len(extraWords) > maxWordsToExtract && maxWordsToExtract > 0 {
+if maxWordsToExtract > 0 && len(extraWords) > maxWordsToExtract {
    extraWords = extraWords[:maxWordsToExtract]
}
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 65a13bb and f9ec0d3.

📒 Files selected for processing (1)
  • mutator.go (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: Lint Test
🔇 Additional comments (7)
mutator.go (7)

292-295: Good addition of scaling variables for input processing control

These new variables provide fine-grained control over the enrichment process, which aligns well with the PR objective of implementing input limits.


298-304: Proper implementation of scaling based on limit option

The implementation correctly scales the number of inputs to process based on the limit option, which should improve performance when generating additional payloads.


311-313: Efficient slicing of inputs based on processing limits

This slicing operation optimizes performance by limiting the inputs to process, which is especially valuable when dealing with large input sets.


326-334: Good implementation of word filtering with minimum length

The filtering of words based on minimum length is a good practice to improve the quality of enriched payloads by excluding very short, potentially meaningless words.


348-353: Robust handling of word payloads

The revised approach properly handles both existing and new words, ensuring that existing values are not lost but enhanced with the new values.


355-360: Consistent implementation for number payloads

The same logical structure is correctly applied to the number payloads, maintaining consistency in the implementation.


362-363: Valuable debug logging addition

The debug log statement is a good practice as it provides visibility into the enrichment process, which will be helpful for troubleshooting and understanding the behavior of the code.

Copy link
Copy Markdown
Member

@tarunKoyalwar tarunKoyalwar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to share some thoughts on this approach:

What this PR does well:

  • Reduces the number of inputs processed during enrichment phase
  • Adds filtering to limit extracted words/numbers
  • Code is clean and well-commented

My concerns (IMO):
I'm not entirely sure this will provide noticeable performance
improvements in practice. The issue is that --limit typically refers to
the output permutations (which can be millions), while the number of
inputs is usually much smaller (hundreds to thousands). For example:

  • Input: 1000 domains
  • Limit: 350,000 permutations
  • This PR would still process all 1000 inputs during enrichment

So the scaling still happens in the permutation phase, where we're not
addressing the early exit.

Alternative approach :
Looking at the codebase, I think the root issue is architectural - we have
an async channel-based design that requires draining (as noted in the
ExecuteWithWriter comment).

What if we modified the Execute() goroutine and clusterBomb() function to
accept a context? Then in ExecuteWithWriter(), we could cancel the context
once we hit the limit. This would stop generating new permutations
without needing to drain everything.

Something like:
ctx, cancel := context.WithCancel(context.Background())
defer cancel()

// In ExecuteWithWriter loop:
if limitReached {
cancel() // Stops permutation generation
}

Recommendation:
Given that this doesn't fully address the performance issue reported in
#270, I'd suggest we close this PR and open a new one with the
context-based approach. But I'm definitely open to other perspectives -
what do you think?

@dogancanbakir
Copy link
Copy Markdown
Member

@dwisiswant0 Please take a look at the message a

I wanted to share some thoughts on this approach:

What this PR does well:

  • Reduces the number of inputs processed during enrichment phase
  • Adds filtering to limit extracted words/numbers
  • Code is clean and well-commented

My concerns (IMO): I'm not entirely sure this will provide noticeable performance improvements in practice. The issue is that --limit typically refers to the output permutations (which can be millions), while the number of inputs is usually much smaller (hundreds to thousands). For example:

  • Input: 1000 domains
  • Limit: 350,000 permutations
  • This PR would still process all 1000 inputs during enrichment

So the scaling still happens in the permutation phase, where we're not addressing the early exit.

Alternative approach : Looking at the codebase, I think the root issue is architectural - we have an async channel-based design that requires draining (as noted in the ExecuteWithWriter comment).

What if we modified the Execute() goroutine and clusterBomb() function to accept a context? Then in ExecuteWithWriter(), we could cancel the context once we hit the limit. This would stop generating new permutations without needing to drain everything.

Something like: ctx, cancel := context.WithCancel(context.Background()) defer cancel()

// In ExecuteWithWriter loop: if limitReached { cancel() // Stops permutation generation }

Recommendation: Given that this doesn't fully address the performance issue reported in #270, I'd suggest we close this PR and open a new one with the context-based approach. But I'm definitely open to other perspectives - what do you think?

@dwisiswant0 fyi

@dogancanbakir dogancanbakir self-requested a review February 5, 2026 10:39
Mzack9999 added 2 commits May 8, 2026 22:16
Resolve mutator.go: keep pattern mining + wait-group producers from main,
apply context cancellation and streaming dedupe from this branch.
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (2)
mutator.go (2)

179-181: 💤 Low value

Redundant limit check — consider simplifying.

The limit check and cancel() call appears twice:

  • Lines 179-181: Before writing, triggers on payloadCount == Limit
  • Lines 210-212: After incrementing, triggers on payloadCount == Limit

The check at lines 210-212 is sufficient by itself — it cancels immediately after writing the last allowed item. The check at 179-181 then serves to skip subsequent iterations after cancellation.

This works correctly, but the dual-check pattern is a bit subtle. A cleaner approach might use a limitReached flag.

Also applies to: 210-212

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@mutator.go` around lines 179 - 181, The redundant dual limit check can be
simplified: remove the pre-write check that tests hasLineLimit && m.payloadCount
== m.Options.Limit (the block that calls cancel() and continue) and keep a
single post-write check that increments m.payloadCount and calls cancel() when
m.payloadCount == m.Options.Limit; alternatively, introduce a limitReached
boolean (e.g., limitReached) updated when you call cancel() and then
short-circuit the loop using that flag to avoid the duplicate condition; update
references to m.payloadCount, m.Options.Limit and cancel() to use the single
post-write check or the new limitReached flag so the loop logic is clear and not
duplicated.

122-156: 💤 Low value

Deviation from coding guideline: Uses in-memory map instead of dedupe.NewDedupe().

The coding guideline specifies using dedupe.NewDedupe() from projectdiscovery/utils for result deduplication with configurable memory limits. The streaming dedupe implementation here uses a plain map[string]struct{} which can grow unbounded.

Given the PR's goal of supporting early cancellation with limits, the streaming approach may be intentional. However, for large result sets without limits, this could consume significant memory.

Consider either:

  1. Documenting this as an intentional deviation for streaming support
  2. Adding a size cap to the seen map with LRU eviction if memory becomes a concern

As per coding guidelines: "Use dedupe.NewDedupe() from projectdiscovery/utils for result deduplication with configurable memory limits"

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@mutator.go` around lines 122 - 156, The current streaming dedupe uses an
unbounded in-memory map (seen := make(map[string]struct{})) which violates the
guideline; replace this with the deduper from dedupe.NewDedupe() (importing
projectdiscovery/utils) and use its API to check-and-add entries inside the
goroutine that reads from the results channel so streaming + early ctx
cancellation still works; ensure you call the dedupe API in place of the seen
map checks, preserve draining behavior on ctx.Done() and closed results, and
configure the deduper with the appropriate memory/size limits or document the
intentional streaming deviation if you choose not to enforce a cap.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@mutator.go`:
- Around line 170-171: The hasByteBudget flag is incorrectly true when MaxSize
is unset (0); update the logic that sets hasByteBudget to only be true when the
user explicitly provided a positive MaxSize (e.g., require m.Options.MaxSize > 0
&& m.Options.MaxSize < math.MaxInt) so that m.Options.MaxSize default 0 does not
trigger byte-budget behavior; change the assignment where hasByteBudget is
computed (referencing m.Options.MaxSize and the hasByteBudget variable in the
mutator setup) and ensure downstream logic that relies on hasByteBudget (the
maxFileSize/cancellation flow) behaves as before for unset MaxSize.

---

Nitpick comments:
In `@mutator.go`:
- Around line 179-181: The redundant dual limit check can be simplified: remove
the pre-write check that tests hasLineLimit && m.payloadCount == m.Options.Limit
(the block that calls cancel() and continue) and keep a single post-write check
that increments m.payloadCount and calls cancel() when m.payloadCount ==
m.Options.Limit; alternatively, introduce a limitReached boolean (e.g.,
limitReached) updated when you call cancel() and then short-circuit the loop
using that flag to avoid the duplicate condition; update references to
m.payloadCount, m.Options.Limit and cancel() to use the single post-write check
or the new limitReached flag so the loop logic is clear and not duplicated.
- Around line 122-156: The current streaming dedupe uses an unbounded in-memory
map (seen := make(map[string]struct{})) which violates the guideline; replace
this with the deduper from dedupe.NewDedupe() (importing projectdiscovery/utils)
and use its API to check-and-add entries inside the goroutine that reads from
the results channel so streaming + early ctx cancellation still works; ensure
you call the dedupe API in place of the seen map checks, preserve draining
behavior on ctx.Done() and closed results, and configure the deduper with the
appropriate memory/size limits or document the intentional streaming deviation
if you choose not to enforce a cap.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: af6b06b3-db5c-4237-826e-575e7bfdb3f2

📥 Commits

Reviewing files that changed from the base of the PR and between f9ec0d3 and 09e18e7.

📒 Files selected for processing (3)
  • algo.go
  • mutator.go
  • mutator_test.go

Comment thread mutator.go
Comment on lines +170 to +171
hasLineLimit := m.Options.Limit > 0
hasByteBudget := m.Options.MaxSize < math.MaxInt
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Bug: hasByteBudget incorrectly triggers when MaxSize is unset (default 0).

When MaxSize is not explicitly set, it defaults to 0 in Go. The condition m.Options.MaxSize < math.MaxInt evaluates to true, causing hasByteBudget to be set. Combined with line 183-188, this results in immediate cancellation on the first iteration since maxFileSize starts at 0.

This explains why TestMutatorResults and TestMutatorLineLimit explicitly set opts.MaxSize = math.MaxInt — without it, the test would fail. However, this is a breaking change for existing callers who don't set MaxSize.

Proposed fix
 	maxFileSize := m.Options.MaxSize
 	hasLineLimit := m.Options.Limit > 0
-	hasByteBudget := m.Options.MaxSize < math.MaxInt
+	hasByteBudget := m.Options.MaxSize > 0

This way, hasByteBudget is only true when the user explicitly sets a positive MaxSize value.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
hasLineLimit := m.Options.Limit > 0
hasByteBudget := m.Options.MaxSize < math.MaxInt
hasLineLimit := m.Options.Limit > 0
hasByteBudget := m.Options.MaxSize > 0
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@mutator.go` around lines 170 - 171, The hasByteBudget flag is incorrectly
true when MaxSize is unset (0); update the logic that sets hasByteBudget to only
be true when the user explicitly provided a positive MaxSize (e.g., require
m.Options.MaxSize > 0 && m.Options.MaxSize < math.MaxInt) so that
m.Options.MaxSize default 0 does not trigger byte-budget behavior; change the
assignment where hasByteBudget is computed (referencing m.Options.MaxSize and
the hasByteBudget variable in the mutator setup) and ensure downstream logic
that relies on hasByteBudget (the maxFileSize/cancellation flow) behaves as
before for unset MaxSize.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
mutator.go (1)

262-296: 💤 Low value

Streaming dedupe removes configurable memory limits.

The previous dedupe.NewDedupe() approach supported configurable memory limits for large permutation sets. The new in-memory seen map at line 271 can grow unbounded with the number of unique results. This is an acceptable tradeoff given the PR's goal of enabling early cancellation, but worth noting for workloads generating millions of unique permutations.

Consider adding a comment documenting this tradeoff or revisiting memory-bounded deduplication in a follow-up if memory pressure becomes an issue in practice. As per coding guidelines, "Use dedupe.NewDedupe() from projectdiscovery/utils for result deduplication with configurable memory limits".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@mutator.go` around lines 262 - 296, The streaming dedupe implementation uses
an in-memory map called seen inside the goroutine (used when DedupeResults is
true and reading from the results channel) which can grow unbounded; update
mutator.go to add a clear comment above this goroutine describing the
memory-tradeoff and that this approach removes the configurable memory limits
previously provided by dedupe.NewDedupe(), and either (a) replace the seen map
with dedupe.NewDedupe() from projectdiscovery/utils to restore bounded memory
behavior or (b) document that this is a deliberate tradeoff and add a TODO to
revisit using dedupe.NewDedupe() for large permutation workloads.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@mutator.go`:
- Around line 262-296: The streaming dedupe implementation uses an in-memory map
called seen inside the goroutine (used when DedupeResults is true and reading
from the results channel) which can grow unbounded; update mutator.go to add a
clear comment above this goroutine describing the memory-tradeoff and that this
approach removes the configurable memory limits previously provided by
dedupe.NewDedupe(), and either (a) replace the seen map with dedupe.NewDedupe()
from projectdiscovery/utils to restore bounded memory behavior or (b) document
that this is a deliberate tradeoff and add a TODO to revisit using
dedupe.NewDedupe() for large permutation workloads.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: aebc98d9-c74a-4dfc-bcf8-1b279679cad1

📥 Commits

Reviewing files that changed from the base of the PR and between 09e18e7 and 9cb466d.

📒 Files selected for processing (1)
  • mutator.go

@neo-by-projectdiscovery-dev
Copy link
Copy Markdown

neo-by-projectdiscovery-dev Bot commented May 8, 2026

Neo - PR Security Review

No security issues found

Comment @pdneo help for available commands. · Open in Neo

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Adding a limit does not improve processing time

4 participants