feat: update payload enrichment logic with input limits and filtering#273
feat: update payload enrichment logic with input limits and filtering#273dwisiswant0 wants to merge 3 commits into
Conversation
with input limits and filtering Scale the number of inputs to process based on limit or max-size options to generate additional payloads. Signed-off-by: Dwi Siswanto <git@dw1.io>
WalkthroughClusterBomb now accepts a context and an exported callback type; permutation generation checks ctx for cancellation. Execute streams and deduplicates results with ctx-aware draining, ExecuteWithWriter uses a cancelable context and calls cancel() when Limits or MaxSize are reached, and a unit test verifies the line limit. ChangesCancelable generation + streaming dedupe
Sequence Diagram(s)sequenceDiagram
participant Caller
participant ExecuteWithWriter
participant Execute
participant clusterBomb
participant Writer
Caller->>ExecuteWithWriter: start with Options (Limit/MaxSize)
ExecuteWithWriter->>Execute: Execute(ctxWithCancel)
Execute->>clusterBomb: drive permutations (passes ctx)
clusterBomb->>Execute: emit results (checks ctx)
Execute->>ExecuteWithWriter: stream results (dedupe stage)
ExecuteWithWriter->>Writer: write item, increment payloadCount
alt limit or maxsize reached
ExecuteWithWriter->>ExecuteWithWriter: cancel() ctx
ExecuteWithWriter-->>Execute: generation stops
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 3❌ Failed checks (2 warnings, 1 inconclusive)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Tip ⚡🧪 Multi-step agentic review comment chat (experimental)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (3)
mutator.go (3)
305-309: Consider clarifying the MaxSize condition logicThe condition only applies MaxSize if it's less than or equal to the number of inputs. This means larger MaxSize values won't be applied, which might not be intuitive.
Consider revising the condition to apply MaxSize whenever it's greater than zero:
-if m.Options.MaxSize > 0 && m.Options.MaxSize <= len(inputs) { +if m.Options.MaxSize > 0 { maxInputsToProcess = m.Options.MaxSize maxWordsToExtract = m.Options.MaxSize maxNumbersToExtract = m.Options.MaxSize }
335-337: Ensure number extraction limits handle zero value correctlyThe implementation correctly limits the number of extracted numbers, but should verify that maxNumbersToExtract is greater than zero before slicing.
Consider adding an additional check to ensure maxNumbersToExtract is greater than zero:
-if len(numbers) > maxNumbersToExtract && maxNumbersToExtract > 0 { +if maxNumbersToExtract > 0 && len(numbers) > maxNumbersToExtract { numbers = numbers[:maxNumbersToExtract] }
344-346: Similar consideration for word extraction limitsThe implementation has the same pattern as number extraction. Consider consistent ordering of conditions.
-if len(extraWords) > maxWordsToExtract && maxWordsToExtract > 0 { +if maxWordsToExtract > 0 && len(extraWords) > maxWordsToExtract { extraWords = extraWords[:maxWordsToExtract] }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
mutator.go(1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: Lint Test
🔇 Additional comments (7)
mutator.go (7)
292-295: Good addition of scaling variables for input processing controlThese new variables provide fine-grained control over the enrichment process, which aligns well with the PR objective of implementing input limits.
298-304: Proper implementation of scaling based on limit optionThe implementation correctly scales the number of inputs to process based on the limit option, which should improve performance when generating additional payloads.
311-313: Efficient slicing of inputs based on processing limitsThis slicing operation optimizes performance by limiting the inputs to process, which is especially valuable when dealing with large input sets.
326-334: Good implementation of word filtering with minimum lengthThe filtering of words based on minimum length is a good practice to improve the quality of enriched payloads by excluding very short, potentially meaningless words.
348-353: Robust handling of word payloadsThe revised approach properly handles both existing and new words, ensuring that existing values are not lost but enhanced with the new values.
355-360: Consistent implementation for number payloadsThe same logical structure is correctly applied to the number payloads, maintaining consistency in the implementation.
362-363: Valuable debug logging additionThe debug log statement is a good practice as it provides visibility into the enrichment process, which will be helpful for troubleshooting and understanding the behavior of the code.
tarunKoyalwar
left a comment
There was a problem hiding this comment.
I wanted to share some thoughts on this approach:
What this PR does well:
- Reduces the number of inputs processed during enrichment phase
- Adds filtering to limit extracted words/numbers
- Code is clean and well-commented
My concerns (IMO):
I'm not entirely sure this will provide noticeable performance
improvements in practice. The issue is that --limit typically refers to
the output permutations (which can be millions), while the number of
inputs is usually much smaller (hundreds to thousands). For example:
- Input: 1000 domains
- Limit: 350,000 permutations
- This PR would still process all 1000 inputs during enrichment
So the scaling still happens in the permutation phase, where we're not
addressing the early exit.
Alternative approach :
Looking at the codebase, I think the root issue is architectural - we have
an async channel-based design that requires draining (as noted in the
ExecuteWithWriter comment).
What if we modified the Execute() goroutine and clusterBomb() function to
accept a context? Then in ExecuteWithWriter(), we could cancel the context
once we hit the limit. This would stop generating new permutations
without needing to drain everything.
Something like:
ctx, cancel := context.WithCancel(context.Background())
defer cancel()
// In ExecuteWithWriter loop:
if limitReached {
cancel() // Stops permutation generation
}
Recommendation:
Given that this doesn't fully address the performance issue reported in
#270, I'd suggest we close this PR and open a new one with the
context-based approach. But I'm definitely open to other perspectives -
what do you think?
|
@dwisiswant0 Please take a look at the message a
@dwisiswant0 fyi |
Resolve mutator.go: keep pattern mining + wait-group producers from main, apply context cancellation and streaming dedupe from this branch.
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (2)
mutator.go (2)
179-181: 💤 Low valueRedundant limit check — consider simplifying.
The limit check and
cancel()call appears twice:
- Lines 179-181: Before writing, triggers on
payloadCount == Limit- Lines 210-212: After incrementing, triggers on
payloadCount == LimitThe check at lines 210-212 is sufficient by itself — it cancels immediately after writing the last allowed item. The check at 179-181 then serves to skip subsequent iterations after cancellation.
This works correctly, but the dual-check pattern is a bit subtle. A cleaner approach might use a
limitReachedflag.Also applies to: 210-212
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@mutator.go` around lines 179 - 181, The redundant dual limit check can be simplified: remove the pre-write check that tests hasLineLimit && m.payloadCount == m.Options.Limit (the block that calls cancel() and continue) and keep a single post-write check that increments m.payloadCount and calls cancel() when m.payloadCount == m.Options.Limit; alternatively, introduce a limitReached boolean (e.g., limitReached) updated when you call cancel() and then short-circuit the loop using that flag to avoid the duplicate condition; update references to m.payloadCount, m.Options.Limit and cancel() to use the single post-write check or the new limitReached flag so the loop logic is clear and not duplicated.
122-156: 💤 Low valueDeviation from coding guideline: Uses in-memory map instead of
dedupe.NewDedupe().The coding guideline specifies using
dedupe.NewDedupe()fromprojectdiscovery/utilsfor result deduplication with configurable memory limits. The streaming dedupe implementation here uses a plainmap[string]struct{}which can grow unbounded.Given the PR's goal of supporting early cancellation with limits, the streaming approach may be intentional. However, for large result sets without limits, this could consume significant memory.
Consider either:
- Documenting this as an intentional deviation for streaming support
- Adding a size cap to the
seenmap with LRU eviction if memory becomes a concernAs per coding guidelines: "Use dedupe.NewDedupe() from projectdiscovery/utils for result deduplication with configurable memory limits"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@mutator.go` around lines 122 - 156, The current streaming dedupe uses an unbounded in-memory map (seen := make(map[string]struct{})) which violates the guideline; replace this with the deduper from dedupe.NewDedupe() (importing projectdiscovery/utils) and use its API to check-and-add entries inside the goroutine that reads from the results channel so streaming + early ctx cancellation still works; ensure you call the dedupe API in place of the seen map checks, preserve draining behavior on ctx.Done() and closed results, and configure the deduper with the appropriate memory/size limits or document the intentional streaming deviation if you choose not to enforce a cap.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@mutator.go`:
- Around line 170-171: The hasByteBudget flag is incorrectly true when MaxSize
is unset (0); update the logic that sets hasByteBudget to only be true when the
user explicitly provided a positive MaxSize (e.g., require m.Options.MaxSize > 0
&& m.Options.MaxSize < math.MaxInt) so that m.Options.MaxSize default 0 does not
trigger byte-budget behavior; change the assignment where hasByteBudget is
computed (referencing m.Options.MaxSize and the hasByteBudget variable in the
mutator setup) and ensure downstream logic that relies on hasByteBudget (the
maxFileSize/cancellation flow) behaves as before for unset MaxSize.
---
Nitpick comments:
In `@mutator.go`:
- Around line 179-181: The redundant dual limit check can be simplified: remove
the pre-write check that tests hasLineLimit && m.payloadCount == m.Options.Limit
(the block that calls cancel() and continue) and keep a single post-write check
that increments m.payloadCount and calls cancel() when m.payloadCount ==
m.Options.Limit; alternatively, introduce a limitReached boolean (e.g.,
limitReached) updated when you call cancel() and then short-circuit the loop
using that flag to avoid the duplicate condition; update references to
m.payloadCount, m.Options.Limit and cancel() to use the single post-write check
or the new limitReached flag so the loop logic is clear and not duplicated.
- Around line 122-156: The current streaming dedupe uses an unbounded in-memory
map (seen := make(map[string]struct{})) which violates the guideline; replace
this with the deduper from dedupe.NewDedupe() (importing projectdiscovery/utils)
and use its API to check-and-add entries inside the goroutine that reads from
the results channel so streaming + early ctx cancellation still works; ensure
you call the dedupe API in place of the seen map checks, preserve draining
behavior on ctx.Done() and closed results, and configure the deduper with the
appropriate memory/size limits or document the intentional streaming deviation
if you choose not to enforce a cap.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: af6b06b3-db5c-4237-826e-575e7bfdb3f2
📒 Files selected for processing (3)
algo.gomutator.gomutator_test.go
| hasLineLimit := m.Options.Limit > 0 | ||
| hasByteBudget := m.Options.MaxSize < math.MaxInt |
There was a problem hiding this comment.
Bug: hasByteBudget incorrectly triggers when MaxSize is unset (default 0).
When MaxSize is not explicitly set, it defaults to 0 in Go. The condition m.Options.MaxSize < math.MaxInt evaluates to true, causing hasByteBudget to be set. Combined with line 183-188, this results in immediate cancellation on the first iteration since maxFileSize starts at 0.
This explains why TestMutatorResults and TestMutatorLineLimit explicitly set opts.MaxSize = math.MaxInt — without it, the test would fail. However, this is a breaking change for existing callers who don't set MaxSize.
Proposed fix
maxFileSize := m.Options.MaxSize
hasLineLimit := m.Options.Limit > 0
- hasByteBudget := m.Options.MaxSize < math.MaxInt
+ hasByteBudget := m.Options.MaxSize > 0This way, hasByteBudget is only true when the user explicitly sets a positive MaxSize value.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| hasLineLimit := m.Options.Limit > 0 | |
| hasByteBudget := m.Options.MaxSize < math.MaxInt | |
| hasLineLimit := m.Options.Limit > 0 | |
| hasByteBudget := m.Options.MaxSize > 0 |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@mutator.go` around lines 170 - 171, The hasByteBudget flag is incorrectly
true when MaxSize is unset (0); update the logic that sets hasByteBudget to only
be true when the user explicitly provided a positive MaxSize (e.g., require
m.Options.MaxSize > 0 && m.Options.MaxSize < math.MaxInt) so that
m.Options.MaxSize default 0 does not trigger byte-budget behavior; change the
assignment where hasByteBudget is computed (referencing m.Options.MaxSize and
the hasByteBudget variable in the mutator setup) and ensure downstream logic
that relies on hasByteBudget (the maxFileSize/cancellation flow) behaves as
before for unset MaxSize.
There was a problem hiding this comment.
🧹 Nitpick comments (1)
mutator.go (1)
262-296: 💤 Low valueStreaming dedupe removes configurable memory limits.
The previous
dedupe.NewDedupe()approach supported configurable memory limits for large permutation sets. The new in-memoryseenmap at line 271 can grow unbounded with the number of unique results. This is an acceptable tradeoff given the PR's goal of enabling early cancellation, but worth noting for workloads generating millions of unique permutations.Consider adding a comment documenting this tradeoff or revisiting memory-bounded deduplication in a follow-up if memory pressure becomes an issue in practice. As per coding guidelines, "Use dedupe.NewDedupe() from projectdiscovery/utils for result deduplication with configurable memory limits".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@mutator.go` around lines 262 - 296, The streaming dedupe implementation uses an in-memory map called seen inside the goroutine (used when DedupeResults is true and reading from the results channel) which can grow unbounded; update mutator.go to add a clear comment above this goroutine describing the memory-tradeoff and that this approach removes the configurable memory limits previously provided by dedupe.NewDedupe(), and either (a) replace the seen map with dedupe.NewDedupe() from projectdiscovery/utils to restore bounded memory behavior or (b) document that this is a deliberate tradeoff and add a TODO to revisit using dedupe.NewDedupe() for large permutation workloads.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@mutator.go`:
- Around line 262-296: The streaming dedupe implementation uses an in-memory map
called seen inside the goroutine (used when DedupeResults is true and reading
from the results channel) which can grow unbounded; update mutator.go to add a
clear comment above this goroutine describing the memory-tradeoff and that this
approach removes the configurable memory limits previously provided by
dedupe.NewDedupe(), and either (a) replace the seen map with dedupe.NewDedupe()
from projectdiscovery/utils to restore bounded memory behavior or (b) document
that this is a deliberate tradeoff and add a TODO to revisit using
dedupe.NewDedupe() for large permutation workloads.
Neo - PR Security ReviewNo security issues found Comment |
Scale the number of inputs to process based on
limit or max-size options to generate additional
payloads.
Prolly fixes #270
Summary by CodeRabbit
New Features
Tests