Wire StaxMzMLSpectraMap for random spectrum access#2
Closed
Wire StaxMzMLSpectraMap for random spectrum access#2
Conversation
Phase 1: Core optimizations - Peak.getMass() autoboxing fix (Float -> float) - Synchronized collections -> concurrent (ConcurrentSkipListMap/ConcurrentHashMap) - Pair<Integer,Integer> -> long encoding in ScoredSpectraMap - MGF parser regex -> manual parsing - DBScanner PriorityQueue optimization - Mass binning for spectrum lookup (1.0 Da bins) - FastScorer NominalMass reuse - FastScorer bounds check (replace try-catch with explicit check) Phase 2: Bug fixes - Dead while loop in MSGFPlus.java - ReverseDB null pointer fix - MZIdentMLGen resource leak fix - CompactFastaSequence duplicate close removal - Mass precision loss fix (float -> double in ScoredSpectraMap) - DBScanner stream leak fix (try-with-resources) Phase 3: Scoring hot path optimizations - Spectrum.getPeakByMass() allocation elimination (no ArrayList/Comparator per call) - Reusable search key Peak in Spectrum - NewScoredSpectrum partition precomputation + prefix/suffix ion split - FlexAminoAcidGraph nodeScore HashMap -> int[] array - GeneratingFunction edge list allocation removal - DBScanner reusable mass range list Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace the slow jmzml (JAXB-based) mzML parser with a fast StAX XMLStreamReader implementation for 3-5x faster spectrum parsing. New files: - StaxMzMLParser.java: Core parser using javax.xml.stream.XMLStreamReader - StaxMzMLSpectraIterator.java: Streaming iterator with MS level filtering - StaxMzMLSpectraMap.java: Random access via indexed mzML byte offsets Changes: - Wire StAX classes into SpectraAccessor.java for .mzML files - Replace MzMLAdapter.turnOffLogs() references with StaxMzMLParser.turnOffLogs() - Exclude old jmzml-dependent files from compilation (kept as reference) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add TIMS_TOF as instrument type (index 4, -inst 4) with TOF-based scoring parameters. Includes 1/K0 ion mobility accessor (MS:1002815 CVParam) and scorer mapping for timsTOF → TOF fallback. Validated: 820 PSMs with -inst 3 (baseline match), 1015 PSMs with -inst 4. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ed UI, misc scripts Delete 88 files across unused packages: - msdictionary/ (7 files, 930 lines) - unused dictionary-based search - msgf2d/ (8 files, 1295 lines) - unused 2D scoring - ims/ (9 files, 683 lines) - unused ion mobility (replaced by CVParam approach) - 4 deprecated UI entry points (MSDictionary, MSGFDB, MSGFDBLib, PRMSpecGen) - 60 misc/ scripts (keep only ThreadPoolExecutorWithExceptions, ProgressData, ProgressReporter, ExceptionCapturer which are used by core pipeline) Inlined TextParsingUtils.isInteger() into MgfSpectrumParser. Removed test methods referencing deleted ConvertToMgf and VennDiagram classes. Validated: 820 PSMs (baseline match), build clean. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…oxing - pom.xml: source/target 1.8 → release 17 - StringBuffer → StringBuilder across 46 files (no thread-shared usage) - Hashtable → HashMap across scorer/parser files (no concurrent access needed) - Vector → ArrayList in MZIdentMLGen and MascotParser - Autoboxing: new Float(x).hashCode() → Float.hashCode(x) in 3 files 75 files changed. Validated: 820 PSMs (baseline match), build clean. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ization Integration of: - feature/stax-mzml-parser: StAX-based mzML parser replacing jmzml JAXB - cleanup/dead-code-removal: 88 files, 15K lines of dead code removed - modernize/java17-upgrade: Java 17 target, StringBuilder, HashMap, autoboxing Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…r, add null safety - Increase random access buffer from 2MB to 16MB for large spectra - Remove MS-level filter in random access (callers already filtered via iterator) - Add null safety in ScoredSpectraMap.makePepMassSpecKeyMap() Fixes crash on full benchmark (109K spectra) where some spectra returned null. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The StAX-based random access (StaxMzMLSpectraMap) returns null for some spectra in large files, causing NPEs in ScoredSpectraMap. Revert random access to the proven jmzml-based MzMLSpectraMap while keeping StAX for the fast forward iteration (StaxMzMLSpectraIterator). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…from memory Replace jmzml MzMLSpectraMap (which re-parses XML for each random access) with a CachedSpectraMap that consumes the StAX iterator once and stores all spectra in a HashMap for O(1) lookups. This eliminates the slow jmzml unmarshaller index building (~111s on full benchmark) and all subsequent XML re-parsing during search. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ped from mzId output When GeneratingFunction.computeGeneratingFunction() fails, matches get deNovoScore=Integer.MIN_VALUE. In MZIdentMLGen, the check against minDeNovoScore used 'break' which stopped processing ALL remaining matches for that spectrum. Changed to 'continue' so only the individual failed match is skipped and valid matches are still written. Fixes: MSGFPlus#157 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove the preliminary timsTOF support (TIMS_TOF instrument type, scorer mapping, 1/K0 accessor) in preparation for a more comprehensive native timsTOF implementation with .d file reading support. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ge datasets CachedSpectraMap loaded all spectra into a HashMap for O(1) random access. While faster (740s vs 858s), it increases RSS by 1.2+ GB and would be catastrophic for datasets with millions of spectra. Revert to jmzml MzMLSpectraMap for random access until a proper memory-bounded StAX random access implementation is ready. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…aMap Replace the slow jmzml JAXB-based MzMLSpectraMap with the StAX-based StaxMzMLSpectraMap for random spectrum access in SpectraAccessor. The StAX implementation reads the mzML index for byte offsets and seeks directly, avoiding full JAXB unmarshalling on each spectrum lookup. Quick test: 820 PSMs (matches baseline), 18.3s runtime. Full benchmark pending to measure impact on larger datasets. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
Important Review skippedToo many files! This PR contains 164 files, which is 14 over the limit of 150. ⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (164)
You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
MzMLSpectraMapwith StAX-basedStaxMzMLSpectraMapfor random spectrum accessSpectraAccessor.javaBenchmark Results
Quick Test (BSA1.mzML + yeast FASTA, 2 threads)
Full ProteoBench Benchmark
Running... results will be added as a comment.
Test plan
mvn package -DskipTests🤖 Generated with Claude Code