Refactor: rename enable_profiling -> enable_l2_swimlane; extend diagnostics bitmask #652
Merged
ChaoWao merged 1 commit into hw-native-sys:main on Apr 24, 2026
Code Review
This pull request renames the --enable-profiling CLI option and its associated internal flags to --enable-perf to clarify its role as a sub-feature of the broader profiling diagnostics umbrella. The changes span across Python test configurations, documentation, C++ runtime headers, and device-side executors. Feedback highlights missing logic in the Python worker's bootstrap loop for unpacking and passing the new diagnostic flags. Additionally, it is suggested to read the performance flag from the handshake bitmask in several AICore executors to maintain consistency with other diagnostic features.
poursoul
previously approved these changes
Apr 23, 2026
…nable_l2_swimlane

Fixes hw-native-sys#641

Today the front-end uses `enable_profiling` to mean perf swimlane only, while `dump_tensor` and `pmu` are parallel one-off flags. This makes "profiling" mean two different things: an umbrella concept at the product level vs. perf-only at the API surface.

Make L2 swimlane an explicit sub-feature parallel to dump_tensor and pmu:

- Rename `enable_profiling` -> `enable_l2_swimlane` end-to-end: pytest CLI (`--enable-l2-swimlane`), Python kwargs, nanobind binding, ChipCallConfig field, C ABI param, mailbox offset, runtime struct field. No legacy alias.
- Extend the existing `enable_profiling_flag` umbrella bitmask so each sub-feature owns one bit: bit0=dump_tensor (unchanged), bit1=l2_swimlane (new), bit2=pmu (renumbered from bit1 on a2a3 for cross-arch consistency; reserved on a5).
- Wire the new l2_swimlane bit through every device_runner that publishes the bitmask to handshakes (a5/a2a3 sim+onboard).
- Rename output artifacts and helpers: `perf_swimlane_*.json` -> `l2_swimlane_*.json`; env var `SIMPLER_PERF_OUTPUT_DIR` -> `SIMPLER_L2_SWIMLANE_OUTPUT_DIR`; per-subprocess output subdir prefix `outputs/perf_*` -> `outputs/l2_swimlane_*`.
- Update docs (testing, task-flow, profiling-name-map, tensor-dump, RUNTIME_LOGIC) and add a one-line umbrella note: "Profiling is the umbrella; the three sub-features are --enable-l2-swimlane, --dump-tensor, --enable-pmu."
- Add a ChipCallConfig round-trip test guarding against drift where only two of the three sub-features are plumbed.
ChaoWao
approved these changes
Apr 24, 2026
Fixes #641
Summary
Introduce a three-layer naming scheme that cleanly separates the user-facing feature flag, the internal collection implementation, and the on-disk artifact:
| Layer | Name | Where it appears |
| --- | --- | --- |
| User-facing feature flag | `l2_swimlane` | `--enable-l2-swimlane`; Python kwarg; `ChipCallConfig` / `Runtime` field `enable_l2_swimlane`; mailbox offsets `_OFF_ENABLE_L2_SWIMLANE` / `MAILBOX_OFF_ENABLE_L2_SWIMLANE`; bitmask bit `PROFILING_FLAG_L2_SWIMLANE` |
| Internal collection implementation | `l2_perf` | `L2PerfCollector`; files `l2_perf_collector{,_aicpu,_aicore}.{h,cpp}` and `l2_perf_profiling.h`; function prefixes `l2_perf_aicpu_*` / `l2_perf_aicore_*`; data types `L2PerfRecord` / `L2PerfBuffer` / `L2PerfSetupHeader` / `L2PerfDataHeader` / `L2PerfFreeQueue` and all `L2Perf*Callback` typedefs; scheduler counters `SchedL2PerfCounters` and member `sched_l2_perf_[]`; scheduler locals `l2_perf.l2_perf_enabled` / `l2_perf.phase_*`; runtime fields `l2_perf_data_base` / `l2_perf_records_addr` / `l2_perf_buffer_status` |
| On-disk artifact | `l2_perf_records` | `l2_perf_records_<ts>.json`; per-subprocess subdir `outputs/l2_perf_records_<tag>/`; environment variable `SIMPLER_L2_PERF_RECORDS_OUTPUT_DIR`; tools CLI `--l2-perf-records-json`; Python helpers `flatten_l2_perf_records_subdirs` / `_snapshot_l2_perf_records_files` / `auto_select_l2_perf_records_json` |

The a2a3 and a5 backends are kept in lockstep: identical symbols and filenames, differing only by architecture subdirectory.
Motivation
Today the front-end uses `enable_profiling` to mean the perf swimlane feature alone, while `dump_tensor` and `pmu` are parallel one-off flags. This makes "profiling" mean two different things: an umbrella concept at the product level, and a perf-only sub-feature at the API surface. This PR makes the L2 swimlane an explicit sub-feature parallel to `dump_tensor` and `pmu`, and disentangles the naming so that a reader can tell at a glance whether a given identifier refers to the user toggle, the collection code, or the raw records on disk.
Changes
Feature flag rename (hard rename, no legacy alias)
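For orientation, the bit layout this section introduces for `enable_profiling_flag` (bit 0 = `dump_tensor`, bit 1 = `l2_swimlane`, bit 2 = `pmu`) can be sketched in Python. This is an illustration only: the real bitmask lives in the C++ runtime headers, and the `ProfilingFlag` enum, its member names, and the `pack_flags` / `unpack_flags` helpers are hypothetical. Only `PROFILING_FLAG_L2_SWIMLANE` is named in this PR.

```python
from enum import IntFlag


class ProfilingFlag(IntFlag):
    """Hypothetical Python mirror of the enable_profiling_flag bits."""

    DUMP_TENSOR = 1 << 0  # bit 0 (unchanged)
    L2_SWIMLANE = 1 << 1  # bit 1 (new in this PR)
    PMU = 1 << 2          # bit 2 (was bit 1 on a2a3; reserved on a5)


def pack_flags(dump_tensor: bool, l2_swimlane: bool, pmu: bool) -> int:
    """Fold the three sub-feature toggles into one umbrella bitmask."""
    mask = 0
    if dump_tensor:
        mask |= ProfilingFlag.DUMP_TENSOR
    if l2_swimlane:
        mask |= ProfilingFlag.L2_SWIMLANE
    if pmu:
        mask |= ProfilingFlag.PMU
    return int(mask)


def unpack_flags(mask: int) -> dict:
    """Recover the three toggles from the bitmask, one bit per sub-feature."""
    return {
        "dump_tensor": bool(mask & ProfilingFlag.DUMP_TENSOR),
        "l2_swimlane": bool(mask & ProfilingFlag.L2_SWIMLANE),
        "pmu": bool(mask & ProfilingFlag.PMU),
    }


# Swimlane and PMU on, tensor dump off: bits 1 and 2 set.
mask = pack_flags(dump_tensor=False, l2_swimlane=True, pmu=True)
print(bin(mask))  # 0b110
```

Because each sub-feature owns exactly one bit, packing and unpacking are inverses for all eight combinations, which is the same property the round-trip test in this PR guards at the binding level.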
- Rename `enable_profiling` → `enable_l2_swimlane` end-to-end: pytest CLI `--enable-l2-swimlane`, Python kwargs, nanobind binding, `ChipCallConfig` field, C ABI parameter, mailbox offset, and Runtime struct field.
- Extend the existing `enable_profiling_flag` umbrella bitmask so each sub-feature owns one bit: bit 0 = `dump_tensor` (unchanged), bit 1 = `l2_swimlane` (new), bit 2 = `pmu` (renumbered from bit 1 on a2a3 for cross-arch consistency; reserved on a5).
- Wire the new `l2_swimlane` bit through every `device_runner` that publishes the bitmask to AICore handshakes (a5 and a2a3, sim and onboard).
Collector refactor
- Rename `PerformanceCollector` → `L2PerfCollector` and move files: `performance_collector.{h,cpp}` → `l2_perf_collector.{h,cpp}`, `performance_collector_aicpu.{h,cpp}` → `l2_perf_collector_aicpu.{h,cpp}`, `performance_collector_aicore.h` → `l2_perf_collector_aicore.h`, `common/perf_profiling.h` → `common/l2_perf_profiling.h`. All include paths and header guards updated.
- Function prefixes: `perf_aicpu_*` → `l2_perf_aicpu_*`, `perf_aicore_*` → `l2_perf_aicore_*`.
- Data types: `PerfRecord` → `L2PerfRecord`, `PerfBuffer` → `L2PerfBuffer`, `PerfSetupHeader` → `L2PerfSetupHeader`, `PerfDataHeader` → `L2PerfDataHeader`, `PerfFreeQueue` → `L2PerfFreeQueue`, and every `Perf*Callback` typedef → `L2Perf*Callback`.
- Runtime fields: `perf_data_base` → `l2_perf_data_base`, `perf_records_addr` → `l2_perf_records_addr`, `perf_buffer_status` → `l2_perf_buffer_status` (the last is flagged for removal in a follow-up but renamed here for consistency).
- Scheduler: `SchedProfilingCounters` → `SchedL2PerfCounters`, `SchedulerContext` member `sched_perf_[]` → `sched_l2_perf_[]`, field `profiling_enabled` → `l2_perf_enabled`, and local alias `auto &perf = sched_l2_perf_[tid]` → `auto &l2_perf = ...` across `scheduler_dispatch.cpp`, `scheduler_completion.cpp`, and `scheduler_cold_path.cpp` (≈80 occurrences).
On-disk artifacts
- Rename `perf_swimlane_*.json` → `l2_perf_records_*.json` (the file contains raw per-task records; the swimlane visualization is produced downstream by `swimlane_converter.py` as `merged_swimlane_*.json`).
- Rename the environment variable `SIMPLER_PERF_OUTPUT_DIR` → `SIMPLER_L2_PERF_RECORDS_OUTPUT_DIR` and the per-subprocess output subdirectory prefix `outputs/perf_*` → `outputs/l2_perf_records_*`.
- Rename the Python helpers: `flatten_perf_subdirs` → `flatten_l2_perf_records_subdirs`, `_snapshot_perf_files` → `_snapshot_l2_perf_records_files`, `_wait_new_perf_file` → `_wait_new_l2_perf_records_file`.
Tools
- Update `tools/swimlane_converter.py`, `tools/perf_to_mermaid.py`, `tools/sched_overhead_analysis.py`, `tools/device_log_resolver.py`, and `tools/README.md`: CLI argument `--perf-json` → `--l2-perf-records-json`, internal function `auto_select_perf_json` → `auto_select_l2_perf_records_json`, and all references to the old file-name prefix.
- Fix a `max(key=dict.get)` pyright complaint and several pre-existing markdownlint MD060/MD033 violations in `tools/README.md`.
Docs
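- Update `docs/testing.md`, `docs/task-flow.md`, `docs/profiling-name-map.md`, the per-runtime `RUNTIME_LOGIC` and `profiling_levels` pages, and `tools/README.md` with the new umbrella/sub-feature story and renamed identifiers.
- Add a one-line umbrella note: "Profiling is the umbrella; the three sub-features are `--enable-l2-swimlane`, `--dump-tensor`, `--enable-pmu`."
Tests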
- Add a `ChipCallConfig` round-trip unit test that asserts all three diagnostics sub-feature flags travel together through the nanobind binding, guarding against drift where only two of the three fields get plumbed.
Test plan
- `pip install --no-build-isolation -e .` builds cleanly on both a2a3 and a5
- `pytest tests/ut/py -x`: 217 passed (one pre-existing unrelated failure in `test_scene_test_cache.py` on upstream/main, logged to local `KNOWN_ISSUES.md`)
- `rg '\benable_profiling\b' src/ python/ simpler_setup/ tests/ examples/ docs/ tools/ conftest.py` returns no hits outside the intentional umbrella name `enable_profiling_flag`
- Run under simulation with `--enable-l2-swimlane` (the runner sandbox has no simulator access; please verify in CI)
- Run with `--enable-l2-swimlane`, `--dump-tensor`, and `--enable-pmu` (nice-to-have; simulation coverage is the primary gate)
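The stale-name check in the test plan relies on word boundaries: `\benable_profiling\b` should flag the old flag name but not the umbrella name `enable_profiling_flag`, because `_` counts as a word character. A minimal sketch of the same idea with Python's `re` module follows. The `find_stale_lines` helper and the sample lines are hypothetical, and it assumes `rg` treats `\b` the same way for these ASCII identifiers.

```python
import re

# Same pattern as the rg invocation in the test plan.
STALE_NAME = re.compile(r"\benable_profiling\b")


def find_stale_lines(text: str) -> list:
    """Return 1-based line numbers that still mention the old flag name."""
    return [
        lineno
        for lineno, line in enumerate(text.splitlines(), start=1)
        if STALE_NAME.search(line)
    ]


# Hypothetical snippets, not taken from the repository.
sample = "\n".join(
    [
        "cfg.enable_profiling = True",         # old name: flagged
        "cfg.enable_l2_swimlane = True",       # renamed: clean
        "hs.enable_profiling_flag |= 1 << 1",  # umbrella bitmask: not matched
    ]
)

print(find_stale_lines(sample))  # [1]
```

Note that the pattern only covers the underscore spelling; a hyphenated CLI spelling such as `--enable-profiling` would need its own search.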