Refactor: rename enable_profiling -> enable_l2_swimlane; extend diagnostics bitmask #652

Merged
ChaoWao merged 1 commit into hw-native-sys:main from ChaoZheng109:fix/issue-641-unify-profiling-abstractions
Apr 24, 2026

Conversation

@ChaoZheng109
Collaborator

@ChaoZheng109 ChaoZheng109 commented Apr 22, 2026

Fixes #641

Summary

Introduce a three-layer naming scheme that cleanly separates the user-facing feature flag, the internal collection implementation, and the on-disk artifact:

| Layer | Prefix | Scope |
| --- | --- | --- |
| User-facing flag | l2_swimlane | pytest CLI --enable-l2-swimlane; Python kwarg; ChipCallConfig / Runtime field enable_l2_swimlane; mailbox offsets _OFF_ENABLE_L2_SWIMLANE / MAILBOX_OFF_ENABLE_L2_SWIMLANE; bitmask bit PROFILING_FLAG_L2_SWIMLANE |
| Internal implementation | l2_perf | class L2PerfCollector; files l2_perf_collector{,_aicpu,_aicore}.{h,cpp} and l2_perf_profiling.h; function prefixes l2_perf_aicpu_* / l2_perf_aicore_*; data types L2PerfRecord / L2PerfBuffer / L2PerfSetupHeader / L2PerfDataHeader / L2PerfFreeQueue and all L2Perf*Callback typedefs; scheduler counters SchedL2PerfCounters and member sched_l2_perf_[]; scheduler locals l2_perf.l2_perf_enabled / l2_perf.phase_*; runtime fields l2_perf_data_base / l2_perf_records_addr / l2_perf_buffer_status |
| On-disk artifact | l2_perf_records | output JSON l2_perf_records_<ts>.json; per-subprocess subdir outputs/l2_perf_records_<tag>/; environment variable SIMPLER_L2_PERF_RECORDS_OUTPUT_DIR; tools CLI --l2-perf-records-json; Python helpers flatten_l2_perf_records_subdirs / _snapshot_l2_perf_records_files / auto_select_l2_perf_records_json |

The a2a3 and a5 backends are kept in lockstep — identical symbols and filenames, differing only by architecture subdirectory.

Motivation

Today the front-end uses enable_profiling to mean the perf swimlane feature alone, while dump_tensor and pmu are parallel one-off flags. This makes "profiling" mean two different things: an umbrella concept at the product level, and a perf-only sub-feature at the API surface. This PR makes the L2 swimlane an explicit sub-feature parallel to dump_tensor and pmu, and disentangles the naming so that a reader can tell at a glance whether a given identifier refers to the user toggle, the collection code, or the raw records on disk.

Changes

Feature flag rename (hard rename, no legacy alias)

  • Rename enable_profiling → enable_l2_swimlane end-to-end: pytest CLI --enable-l2-swimlane, Python kwargs, nanobind binding, ChipCallConfig field, C ABI parameter, mailbox offset, and Runtime struct field.
  • Extend the existing enable_profiling_flag umbrella bitmask so each sub-feature owns one bit: bit 0 = dump_tensor (unchanged), bit 1 = l2_swimlane (new), bit 2 = pmu (renumbered from bit 1 on a2a3 for cross-arch consistency; reserved on a5).
  • Wire the new l2_swimlane bit through every device_runner that publishes the bitmask to AICore handshakes (a5 and a2a3, sim and onboard).
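
The bit layout described above can be sketched as follows. This is a minimal illustration in Python, not the project's actual definition: the `ProfilingFlag` enum and both helper functions are hypothetical names, and only the bit assignments (bit 0 = dump_tensor, bit 1 = l2_swimlane, bit 2 = pmu) come from this PR description.

```python
from enum import IntFlag


class ProfilingFlag(IntFlag):
    """Hypothetical mirror of the enable_profiling_flag umbrella bitmask."""

    DUMP_TENSOR = 1 << 0  # bit 0: dump_tensor (unchanged)
    L2_SWIMLANE = 1 << 1  # bit 1: l2_swimlane (new)
    PMU = 1 << 2          # bit 2: pmu (moved from bit 1 on a2a3; reserved on a5)


def pack_profiling_flag(dump_tensor: bool, enable_l2_swimlane: bool, enable_pmu: bool) -> int:
    """Fold the three sub-feature toggles into one integer bitmask."""
    flag = ProfilingFlag(0)
    if dump_tensor:
        flag |= ProfilingFlag.DUMP_TENSOR
    if enable_l2_swimlane:
        flag |= ProfilingFlag.L2_SWIMLANE
    if enable_pmu:
        flag |= ProfilingFlag.PMU
    return int(flag)


def l2_swimlane_enabled(flag: int) -> bool:
    """Test only the l2_swimlane bit, ignoring the other sub-features."""
    return bool(flag & ProfilingFlag.L2_SWIMLANE)
```

Giving each sub-feature its own bit means a consumer can test one feature without knowing which others are enabled, which is presumably why pmu was renumbered on a2a3 rather than sharing bit 1.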

Collector refactor

  • Rename the host-side class PerformanceCollector → L2PerfCollector and move files: performance_collector.{h,cpp} → l2_perf_collector.{h,cpp}, performance_collector_aicpu.{h,cpp} → l2_perf_collector_aicpu.{h,cpp}, performance_collector_aicore.h → l2_perf_collector_aicore.h, common/perf_profiling.h → common/l2_perf_profiling.h. All include paths and header guards updated.
  • Rename AICPU/AICore function prefixes perf_aicpu_* → l2_perf_aicpu_*, perf_aicore_* → l2_perf_aicore_*.
  • Rename all data types exposed by the collector: PerfRecord → L2PerfRecord, PerfBuffer → L2PerfBuffer, PerfSetupHeader → L2PerfSetupHeader, PerfDataHeader → L2PerfDataHeader, PerfFreeQueue → L2PerfFreeQueue, and every Perf*Callback typedef → L2Perf*Callback.
  • Rename Runtime fields that carry collector state: perf_data_base → l2_perf_data_base, perf_records_addr → l2_perf_records_addr, perf_buffer_status → l2_perf_buffer_status (the last is flagged for removal in a follow-up but renamed here for consistency).
  • Rename scheduler-side profiling counters: struct SchedProfilingCounters → SchedL2PerfCounters, SchedulerContext member sched_perf_[] → sched_l2_perf_[], field profiling_enabled → l2_perf_enabled, and local alias auto &perf = sched_l2_perf_[tid] → auto &l2_perf = ... across scheduler_dispatch.cpp, scheduler_completion.cpp, and scheduler_cold_path.cpp (≈80 occurrences).

On-disk artifacts

  • Rename the runtime's output file prefix perf_swimlane_*.json → l2_perf_records_*.json (the file contains raw per-task records; the swimlane visualization is produced downstream by swimlane_converter.py as merged_swimlane_*.json).
  • Rename the environment variable SIMPLER_PERF_OUTPUT_DIR → SIMPLER_L2_PERF_RECORDS_OUTPUT_DIR and the per-subprocess output subdirectory prefix outputs/perf_* → outputs/l2_perf_records_*.
  • Rename the test dispatcher helpers accordingly: flatten_perf_subdirs → flatten_l2_perf_records_subdirs, _snapshot_perf_files → _snapshot_l2_perf_records_files, _wait_new_perf_file → _wait_new_l2_perf_records_file.
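
For orientation, the renamed artifact path could be resolved roughly as below. This is a sketch under stated assumptions: `l2_perf_records_path` is a hypothetical helper, and the fallback directory `outputs` and the timestamp format are guesses. Only the environment variable name and the `l2_perf_records_<ts>.json` pattern are taken from this description.

```python
import os
import time
from pathlib import Path
from typing import Optional

# Environment variable name introduced by this PR.
ENV_VAR = "SIMPLER_L2_PERF_RECORDS_OUTPUT_DIR"


def l2_perf_records_path(default_dir: str = "outputs", timestamp: Optional[str] = None) -> Path:
    """Build the raw-records output path (hypothetical helper).

    The fallback directory and the timestamp format are assumptions;
    the runtime may choose both differently.
    """
    out_dir = Path(os.environ.get(ENV_VAR, default_dir))
    ts = timestamp or time.strftime("%Y%m%d_%H%M%S")
    return out_dir / f"l2_perf_records_{ts}.json"
```

Scripts that still export SIMPLER_PERF_OUTPUT_DIR would be silently ignored by code like this, since the rename ships without a legacy alias.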

Tools

  • Update tools/swimlane_converter.py, tools/perf_to_mermaid.py, tools/sched_overhead_analysis.py, tools/device_log_resolver.py, and tools/README.md: CLI argument --perf-json → --l2-perf-records-json, internal function auto_select_perf_json → auto_select_l2_perf_records_json, and all references to the old file-name prefix.
  • Drive-by: fix a small number of pre-existing lint issues surfaced by touching these files (missing copyright header, E501 overflows, F841 unused-variable, one max(key=dict.get) pyright complaint) and several pre-existing markdownlint MD060/MD033 violations in tools/README.md.

Docs

  • Update docs/testing.md, docs/task-flow.md, docs/profiling-name-map.md, the per-runtime RUNTIME_LOGIC and profiling_levels pages, and tools/README.md with the new umbrella/sub-feature story and renamed identifiers.
  • Add a one-line umbrella note where appropriate: "Profiling is the umbrella; the three sub-features are --enable-l2-swimlane, --dump-tensor, --enable-pmu."

Tests

  • Add a ChipCallConfig round-trip unit test that asserts all three diagnostics sub-feature flags travel together through the nanobind binding, guarding against drift where only two of the three fields get plumbed.
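
The shape of such a round-trip check might look like the sketch below. It does not use the real nanobind binding: `FakeChipCallConfig` is a hypothetical stand-in dataclass, the field names `dump_tensor` and `enable_pmu` are inferred from the CLI flags rather than confirmed, and `round_trip` merely simulates crossing the binding boundary.

```python
from dataclasses import asdict, dataclass


@dataclass
class FakeChipCallConfig:
    """Hypothetical stand-in for the nanobind-bound ChipCallConfig."""

    enable_l2_swimlane: bool = False
    dump_tensor: bool = False  # assumed field name
    enable_pmu: bool = False   # assumed field name


def round_trip(cfg: FakeChipCallConfig) -> FakeChipCallConfig:
    """Simulate crossing the binding boundary by rebuilding from plain values."""
    return FakeChipCallConfig(**asdict(cfg))


def check_all_flags_round_trip() -> None:
    """Set each flag alone so that a single dropped field is detected."""
    for name in ("enable_l2_swimlane", "dump_tensor", "enable_pmu"):
        cfg = FakeChipCallConfig(**{name: True})
        restored = round_trip(cfg)
        assert restored == cfg, f"{name} was altered in transit"
        assert getattr(restored, name) is True
```

Toggling one flag at a time, rather than all three together, is what lets a test of this kind catch the case where only two of the three fields are plumbed.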

Test plan

  • pip install --no-build-isolation -e . builds cleanly on both a2a3 and a5
  • pytest tests/ut/py -x — 217 passed (one pre-existing unrelated failure in test_scene_test_cache.py on upstream/main, logged to local KNOWN_ISSUES.md)
  • Grep gate clean: rg '\benable_profiling\b' src/ python/ simpler_setup/ tests/ examples/ docs/ tools/ conftest.py returns no hits outside the intentional umbrella name enable_profiling_flag
  • Local pre-commit clean on the full diff (clang-format, clang-tidy, cpplint, ruff check, ruff format, pyright, markdownlint, check-headers, check-english-only)
  • Simulation scene test with --enable-l2-swimlane (the runner sandbox has no simulator access; please verify in CI)
  • Hardware smoke run combining --enable-l2-swimlane, --dump-tensor, and --enable-pmu (nice-to-have; simulation coverage is the primary gate)
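
The grep gate in the test plan relies on word boundaries to spare the umbrella name. A small Python equivalent of that pattern, with a hypothetical `legacy_hits` helper and made-up sample lines, shows why `enable_profiling_flag` is not reported: the underscore after `enable_profiling` is a word character, so `\b` does not match there.

```python
import re

# Same pattern as the rg gate in the test plan.
LEGACY_NAME = re.compile(r"\benable_profiling\b")


def legacy_hits(lines):
    """Return the lines that still use the old flag name (hypothetical helper)."""
    return [line for line in lines if LEGACY_NAME.search(line)]


# Made-up sample lines, for illustration only.
sample = [
    "uint32_t enable_profiling_flag;",  # umbrella bitmask name, not flagged
    "cfg.enable_profiling = True",      # legacy name, flagged
    "cfg.enable_l2_swimlane = True",    # new name, not flagged
]
print(legacy_hits(sample))
```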

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request renames the --enable-profiling CLI option and its associated internal flags to --enable-perf to clarify its role as a sub-feature of the broader profiling diagnostics umbrella. The changes span across Python test configurations, documentation, C++ runtime headers, and device-side executors. Feedback highlights missing logic in the Python worker's bootstrap loop for unpacking and passing the new diagnostic flags. Additionally, it is suggested to read the performance flag from the handshake bitmask in several AICore executors to maintain consistency with other diagnostic features.

Comment thread python/simpler/worker.py Outdated
Comment thread python/simpler/worker.py Outdated
Comment thread src/a2a3/runtime/aicpu_build_graph/aicore/aicore_executor.cpp Outdated
Comment thread src/a2a3/runtime/host_build_graph/aicore/aicore_executor.cpp Outdated
Comment thread src/a2a3/runtime/tensormap_and_ringbuffer/aicore/aicore_executor.cpp Outdated
Comment thread src/a5/runtime/host_build_graph/aicore/aicore_executor.cpp Outdated
Comment thread src/a5/runtime/tensormap_and_ringbuffer/aicore/aicore_executor.cpp Outdated
@ChaoZheng109 ChaoZheng109 force-pushed the fix/issue-641-unify-profiling-abstractions branch 4 times, most recently from d1caf62 to c8b71b6 Compare April 23, 2026 01:43
poursoul previously approved these changes Apr 23, 2026
@ChaoZheng109 ChaoZheng109 force-pushed the fix/issue-641-unify-profiling-abstractions branch 2 times, most recently from f1ad890 to 642801e Compare April 23, 2026 06:17
@ChaoZheng109 ChaoZheng109 changed the title from "Refactor: rename enable_profiling -> enable_perf; extend diagnostics bitmask" to "Refactor: rename enable_profiling -> enable_l2_swimlane; extend diagnostics bitmask" Apr 23, 2026
@ChaoZheng109 ChaoZheng109 force-pushed the fix/issue-641-unify-profiling-abstractions branch 10 times, most recently from 3b08641 to 0be9313 Compare April 23, 2026 11:03
…nable_l2_swimlane

Fixes hw-native-sys#641

Today the front-end uses `enable_profiling` to mean perf swimlane only,
while `dump_tensor` and `pmu` are parallel one-off flags. This makes
"profiling" mean two different things: an umbrella concept at the product
level vs. perf-only at the API surface.

Make L2 swimlane an explicit sub-feature parallel to dump_tensor and pmu:

- Rename `enable_profiling` -> `enable_l2_swimlane` end-to-end: pytest
  CLI (`--enable-l2-swimlane`), Python kwargs, nanobind binding,
  ChipCallConfig field, C ABI param, mailbox offset, runtime struct
  field. No legacy alias.
- Extend the existing `enable_profiling_flag` umbrella bitmask so each
  sub-feature owns one bit: bit0=dump_tensor (unchanged),
  bit1=l2_swimlane (new), bit2=pmu (renumbered from bit1 on a2a3 for
  cross-arch consistency; reserved on a5).
- Wire the new l2_swimlane bit through every device_runner that
  publishes the bitmask to handshakes (a5/a2a3 sim+onboard).
- Rename output artifacts and helpers: `perf_swimlane_*.json` ->
  `l2_swimlane_*.json`; env var `SIMPLER_PERF_OUTPUT_DIR` ->
  `SIMPLER_L2_SWIMLANE_OUTPUT_DIR`; per-subprocess output subdir prefix
  `outputs/perf_*` -> `outputs/l2_swimlane_*`.
- Update docs (testing, task-flow, profiling-name-map, tensor-dump,
  RUNTIME_LOGIC) and add a one-line umbrella note: "Profiling is the
  umbrella; the three sub-features are --enable-l2-swimlane,
  --dump-tensor, --enable-pmu."
- Add a ChipCallConfig round-trip test guarding against drift where only
  two of the three sub-features are plumbed.
@ChaoZheng109 ChaoZheng109 force-pushed the fix/issue-641-unify-profiling-abstractions branch from 0be9313 to 2e161dd Compare April 23, 2026 11:28
@ChaoWao ChaoWao merged commit 737288d into hw-native-sys:main Apr 24, 2026
14 checks passed

Development

Successfully merging this pull request may close these issues.

[Code Health] Unify profiling abstractions across perf, dump tensor, and PMU

3 participants