Release 8: Dynamic Controls, Deploy Pipeline, and Runtime Hardening¶
Theme: Release 8 adds dynamic control schemas (rebuild a module's control set at runtime so the UI shows only relevant parameters), a complete deploy pipeline overhaul (log-to-docs data flow, browser deploy UI, MCP tools), and a series of runtime hardening improvements: ArtNet unicast and persistent sockets, WiFi/WS heap stability,
namespace projectMMfor FastLED compatibility, NTP time sync, and generic module auto-wiring viaautoWireKeys().
Release Overview¶
What was delivered in Release 7 (build on this)¶
| Strength | Notes |
|---|---|
| OTA firmware update | FirmwareUpdateModule: file upload + GitHub releases tab; POST /api/firmware |
| CI release pipeline | Tagged releases + nightly pre-release with firmware assets on GitHub |
| Windows support | Native .exe build; projectMM-pc-windows.zip in CI artifacts |
| Scenario baselines | Hardware --update-baseline run; "extends" inheritance; wired into all.py |
| Static RAM hardening | Per-device LOG_RING_SIZE; WiFi buffer tuning; dual check_alloc guard |
| Log frontend panel | WS push of ring buffer entries; collapsible log UI |
What Release 8 addresses¶
| Problem | Sprint |
|---|---|
Control schema is fixed at setup() time; irrelevant parameters always visible regardless of selected type |
Sprint 1 (Dynamic controls: clearControls(), rebuildControls(), early WS flush), complete |
| Static RAM column in techdebt monitor always shows 0 (parser bug); no accounting of what consumes the 51 KB ESP32 RAM; Notable Findings have no action owners | Sprint 3 (RAM accounting, parser fix, actions table), complete |
classSize() misses runtime heap (controls_[] array, pendingProps_ doc); large char[] struct members inflate classSize; scanner blind to allocations in private helpers |
Sprint 4 (baseHeapUsage, char[] audits, scanner improvements), complete |
Deploy pipeline grew to 17+ scripts with no architecture overview; steps produced no status pages; techdebt.py name misleading; orchestrators monolithic |
Sprints 5-10 (full log→md pipeline, orchestrator restructuring, naming cleanup), complete |
| No interactive way to trigger individual deploy scripts; MCP tools covered only orchestrators; no AI-assisted log analysis; deploy.md was CLI-first with no visual overview | Sprint 11 (browser deploy UI, run_script/read_log MCP tools, deploy.md overhaul), complete |
struct RGB in global namespace collides with FastLED's enum EOrder { RGB=0012 }; ArtNetOutModule hardwired to DriverLayer; unicast destination not configurable; static IP fields always visible regardless of DHCP/static mode; ArtNet frames stutter on PC (per-packet socket lifecycle, macOS heap scan in hot path) |
Sprint 13 (namespace projectMM, ArtNet generic source + unicast + fps_limit_ + persistent socket, Ethernet rebuildControls, mkdocs kill-on-stop, macOS heap cache), complete |
Consecutive rebuildControls() calls only show in the browser once (WS shared-buffer in-flight corruption); no wall-clock time on device; per-type strcmp chains in ModuleManager not extensible |
Sprint 14 (schema double-buffering, NtpModule, autoWireKeys()), complete |
Sprints¶
| Sprint | Goal |
|---|---|
| Sprint 1 | Dynamic controls: clearControls(), rebuildControls() virtual, early WS schema flush |
| Sprint 2 | Technical-debt monitor: per-module metrics (LOC, function count, complexity, static RAM, heap/blocking violations) as a CI script |
| Sprint 3 | RAM accounting balance, fix static RAM parser, Notable Findings actions, Logger ring buffer reduction |
| Sprint 4 | baseHeapUsage() column, char[] to std::array audits, scanner improvements for private helpers |
| Sprint 5-10 | Deploy pipeline consolidation: full log→md data flow, orchestrator restructuring, naming cleanup — complete |
| Sprint 11 | Browser deploy UI, run_script/read_log MCP tools, erase_flash.py, deploy.md overhaul — complete |
| Sprint 12 | ESP32 WiFi heap stability: pre-allocated WS text buffer, WIFI_STA boot mode, AP management guards; network.md — complete |
| Sprint 13 | namespace projectMM RGB wrap, ArtNetOutModule generic source + unicast + fps_limit_ + persistent socket, EthernetModule rebuildControls, mkdocs kill-on-stop, macOS heap cache — complete |
| Sprint 14 | WsServer schema double-buffering, NTP time sync module, generic auto-wiring via autoWireKeys() |
Sprint 1: Dynamic Controls¶
Scope: Allow a module to rebuild its control schema at runtime in response to a control value change. The primary use case: a
typeselector control switches between effect variants, and only the parameters relevant to the active type are shown. The control set is rebuilt without a full module restart.
Motivation¶
Today, addControl() is called once in setup() and the schema is fixed for the lifetime of the module. A module that supports multiple effect types must expose all parameters for all types simultaneously, cluttering the UI and confusing operators. The fix: make the schema a function of the control values, rebuilt on demand.
Design¶
clearControls(system = false)
Added to StatefulModule. Iterates the registered controls_[] descriptors and removes all entries that are not marked system. Before removing each descriptor, writes the current value of the backing variable back into the pendingProps_ stash (keyed by control name). This means a subsequent addControl(var, key, ...) call for the same key restores the last operator-set value automatically — values are preserved across rebuilds even when the control temporarily disappears.
System controls (enabled) are marked at registration time with a system flag in ControlDescriptor. clearControls() skips them unconditionally.
rebuildControls() virtual
New virtual method on StatefulModule; default implementation is a no-op (all existing modules continue to work unchanged). Modules that want dynamic controls override it:
void rebuildControls() override {
clearControls();
addControl(type_, "type", "select", {"Ripples", "Lines", "Sine"});
if (type_ == EffectType::Ripples) {
addControl(speed_, "speed", "slider", 0.1f, 10.0f);
addControl(radius_, "radius", "slider", 1.0f, 50.0f);
} else if (type_ == EffectType::Lines) {
addControl(speed_, "speed", "slider", 0.1f, 10.0f);
addControl(count_, "count", "slider", 1, 20);
}
}
void setup() override {
rebuildControls(); // replaces direct addControl() calls
}
void onUpdate(const char* key) override {
if (strcmp(key, "type") == 0) rebuildControls();
}
Modules that do not need dynamic controls keep calling addControl() directly in setup() — no migration required.
Early WS schema flush
After rebuildControls() finishes, the UI must reflect the new schema immediately rather than waiting up to 1 s for the next periodic push. Implementation: clearControls() sets a schemaDirty_ flag on StatefulModule. The main loop checks schemaDirty_ across all modules and, if set, sends a {"t":"schema","modules":[...]} WS push using getModulesJson() (full schema including control types, options, min/max, and current values) and clears the flag. On a clean tick, the periodic 200 ms push uses getStateJson() (flat key/value state) as before. Natural debounce: a burst of rebuildControls() calls within one tick produces exactly one push.
A dedicated {"t":"schema"} message type is required because getStateJson() sends only flat {key:value} pairs; handleStateUpdate() in the frontend updates existing DOM elements but cannot add or remove controls. When rebuildControls() changes the control set, the frontend must call render() to rebuild the card from scratch.
State persistence interaction
saveState() and loadState() iterate the registered descriptors. After a rebuild, only the currently registered controls are persisted — parameters for inactive types are not written to the state file. On the next load, pendingProps_ carries any previously saved values; addControl() applies them if the key matches a registered control after rebuildControls() runs. A type control persisted in state is applied before rebuildControls() is called (via the existing addControl stash mechanism), so the correct variant's parameters are registered and restored on first boot.
Sprint 1 Scope Definition of Done¶
ControlDescriptorgainsbool systemfield;StatefulModule::runSetup()sets it when registeringenabledclearControls()removes non-system descriptors; saves current values topendingProps_stash before removalrebuildControls()virtual added toStatefulModule; default is no-op; existing modules compile and behave identicallyschemaDirty_flag set byclearControls(); main loop early-flush path clears it and sends a{"t":"schema","modules":[...]}WS push- Reference implementation: one new module (e.g.
MultiEffectModuleor adapted existing effect) demonstrating type selector + conditional parameters - Unit tests: rebuild preserves values of re-registered controls; rebuild discards values of removed controls; system controls survive
clearControls();schemaDirty_triggers exactly one early flush per rebuild burst - Frontend:
{"t":"schema"}handler added; callsrender(msg.modules)to rebuild all cards from the full schema - All prior unit tests still green
Complexity estimate: Low-Medium (2/5). The stash mechanism already exists; clearControls() is a small loop; the early flush reuses the existing push path. The trickiest part is the state-persistence ordering (type value applied before rebuild runs).
Result¶
| Metric | Value |
|---|---|
| Unit tests | 399/399 pass (8 new tests added) |
| PC build | Clean (0 warnings) |
| ESP32dev build | Clean (0 warnings); BSS 16.3% (53 KB, down from 21.3% / 70 KB after static wsBuf removed) |
| ESP32s3 build | Clean (0 warnings) |
| Live tests (PC) | 15/15 all passing |
| Live tests (MM-70BC) | 15/15 all passing |
| Live tests (MM-C1BC) | 12/15 (hardware capacity limits: 64x64 OOM, fps below 1000 on 16x16, 4-layer OOM on classic ESP32) |
Definition of Done¶
ControlDescriptorgainsbool system = falsefield;runSetup()sets it after registeringenabled— doneclearControls()preserves system controls, saves non-system values topendingProps_stash, setsschemaDirty_when controls are actually removed — donerebuildControls()virtual added toStatefulModuleBase; default is no-op; all existing modules compile and behave identically — doneschemaDirty_flag;ModuleManager::hasSchemaDirty()/clearSchemaDirty(); WS broadcast loop inmain.cppandAppSetup.cppsends{"t":"schema","modules":[...]}on dirty tick,getStateJson()array on periodic tick — done- Reference implementation:
SineEffectModuleadapted withtypeselector (Sine / Ripples),rebuildControls(), andonUpdate("type")— done - Unit tests: rebuild preserves values of re-registered controls; rebuild does not affect unrelated fields; system controls survive
clearControls();schemaDirty_set/cleared correctly; burst produces exactly one flag — done (7 new test cases) - Frontend:
{"t":"schema"}message type handler added toapp.js; callsrender(msg.modules)to rebuild all cards — done - All prior unit tests still green — 399/399
- Static
wsBuf[16384]removed fromAppSetup.cpp; both WS push branches now allocate on demand viaheap_caps_malloc/heap_caps_free— done pal::net_early_init()callsNetwork.begin()beforescheduler.setup()to guarantee the TCP/IP stack is ready before any module opens sockets — doneDeviceDiscovery::setup()guardsbroadcastPresence_()behindsock_ >= 0;loop()retriesudp_bind()whensock_ < 0— done
Retrospective¶
What went well:
- The
pendingProps_stash already existed and worked without modification —clearControls()just needed to write into it before removing each descriptor. - The
runSetup()full-wipe /clearControls()mid-lifecycle split was clean once the two call sites were separated. Inlining the wipe inrunSetup()was the right call. - Adapting
SineEffectModulerather than writing a new module gave immediate test coverage for a real effect and kept the scope small. - The
schemaDirty_"only set when controls are actually removed" rule surfaced naturally from a failing test: first-call-from-setup had no prior controls, so the flag should not fire on initial build.
What was tricky:
- The
schemaDirty_flag initially fired on the firstrebuildControls()call fromsetup()(becauseclearControls()always set it). The fix — only set the flag whencontrolCount_ > kept— is semantically correct (no prior schema means no schema change) and made the test clean. - The
kTypes/kWaveformsstatic constexpr arrays required thekTypeCountcompanion soaddControl(uint8_t&, key, const char* const*, count)received a correct count without magic numbers. hasSchemaDirty()andclearSchemaDirty()iteratedowned_without holdingcontrolMutex_. On PC (multi-threaded HTTP server running at 400K+ fps), this created a data race with concurrentremoveModule()calls that modifyowned_under the mutex. The server crashed intermittently mid-scenario after the WS client connected. Fix: addstd::lock_guard<std::mutex> lk(controlMutex_)to both functions, matching the lock discipline used bygetStateJson()and every otherowned_iterator.- The Design section claimed "no new WS message type is needed" — this was wrong.
getStateJson()sends only flat{key:value}pairs;handleStateUpdate()in the frontend updates existing DOM elements by key lookup and cannot add or remove controls. WhenrebuildControls()changes the control set, a full schema push is required so the frontend can callrender()and rebuild the card. The fix: a dedicated{"t":"schema","modules":[...]}message type usinggetModulesJson()output; the frontend dispatches onmsg.t === "schema"and callsrender(msg.modules). - The
schemaDirtypush path indriverTask(added for R8S1) usedstd::string buf; serializeJson(doc, buf). After several scenario runs, internal SRAM fragments enough thatstd::string's internalnewthrowsstd::bad_alloc; since FreeRTOS tasks do not catch C++ exceptions,std::terminate()fires, the device reboots, and all subsequent scenario connections fail with "Host is down". Thefree_heap_kb() > 16.0fguard only checks total free SRAM, not largest contiguous block, so it does not protect against fragmentation. Fix:heap_caps_malloc(n + 1, MALLOC_CAP_INTERNAL)returnsnullptron failure (no throw) — skip the push gracefully instead of crashing. - Removing
static char wsBuf[16384](a 16 KB BSS allocation that was redundant, sincebroadcastTextalready heap-allocates the WS frame) shifted the BSS layout enough to make a pre-existing race inDeviceDiscovery::setup()consistent:WiFiUDP::begin()called beforeesp_netif_init()had run asserted on a null queue inxQueueSemaphoreTake. Fix:pal::net_early_init()callsNetwork.begin()beforescheduler.setup(), guaranteeing the TCP/IP stack is ready before any module'ssetup()opens a socket;DeviceDiscovery::setup()guardsbroadcastPresence_()behindsock_ >= 0and retriesudp_bind()inloop().
Seeds for Sprint 2:
RipplesEffectModulestill exists as a standalone module — now thatSineEffectModuleembeds the same rendering, consider whetherRipplesEffectModuleshould be retired or kept as an independent module for pipelines that want only ripples.- The
clearControls()/rebuildControls()pattern is now proven. Other modules with mode-dependent parameters (e.g. layout type selectors) can adopt it when operators report UI clutter. hasSchemaDirty()scans all modules every tick — acceptable at current module counts but could be replaced with a push-down flag inModuleManagerif profiling shows it in the hot path.- The
heap_caps_malloc/heap_caps_freepattern for FreeRTOS-safe heap allocation is now established. Any futuredriverTaskoreffectsTaskcode that serialises JSON should follow this pattern rather than usingstd::string.
Sprint 2: Technical-Debt Monitor¶
Scope: Add a
deploy/techdebt.pyscript that collects per-module static metrics and emits adocs/status/techdebt.mdtable. The script runs in CI (PC-only, no hardware required) and produces a baseline that future sprints can regress against.
Motivation¶
The codebase grows by adding modules. Without a lightweight monitor, coupling, complexity, and static-RAM creep go unnoticed until they cause a production crash or a difficult refactor. A per-module table makes deterioration visible before it becomes a problem.
Design¶
Metrics collected per module (.h + companion .cpp if present):
| Metric | Source | Why |
|---|---|---|
| Lines of code (NLOC) | lizard Python API |
Size proxy; outliers need splitting |
| Function count | lizard Python API |
Too many functions signals God-class |
| Max cyclomatic complexity | lizard Python API |
High complexity predicts bug density |
| Static RAM (BSS + data bytes) | firmware.map from ESP32 build |
Direct measure; non-zero only when module has static members |
Heap allocation sites in setup() |
Python grep scan | Expected; informational; checked against teardown |
Heap allocation sites in loop() |
Python grep scan | Policy violation: allocations belong in setup() |
Blocking calls in loop() |
Python grep scan | delay(), vTaskDelay(), info-level LOG_* |
| Leak risk | Python brace-scan | Alloc in setup() with no matching free in teardown() |
classSize() (instance bytes) |
TypeRegistry test binary | True heap cost per module instance |
Tools:
lizard(added topyproject.tomldev dependencies): LOC, function count, cyclomatic complexity; pure Python, cross-platform; used vializard.analyze_file()Python API (not CLI) to avoid version-dependent flag issues.firmware.mapfrom.pio/build/esp32dev/: parsed for BSS+data contributions per.cpp.ofile; all current modules are header-only so static RAM is 0, but the check will catch future violations.tests/test_techdebt.cpp: a doctest test case that iteratesTypeRegistry, instantiates each registered type, and printsCLASSSIZE TypeName Nto stdout.techdebt.pyruns the test binary with-tc=techdebt*and parses the output. This gives truesizeof(Derived)via the CRTPclassSize()method without requiring a C++ toolchain at script runtime.- Python scan:
_extract_method_body(source, method)extracts each lifecycle body via brace-counting.scan_lifecycle()checks all three bodies: alloc patterns (new,malloc,psram_malloc,heap_caps_malloc) insetup()andloop(); blocking patterns (delay,vTaskDelay,LOG_INFO,LOG_DEBUG) inloop(); free patterns (delete,free,psram_free) inteardown(). Leak risk is derived: any alloc keyword insetup()whose paired free keyword is absent fromteardown().
Output: docs/status/techdebt.md
Core Infrastructure section (on top) + one section per module category. Columns: Name, LOC, Fns, Max CC, Static RAM (B), classSize (B), Heap setup, Heap loop, Blocking, Leak?. RAG (green/amber/red) indicators on all numeric columns.
Thresholds (configurable at top of script):
MAX_LOC = 400 # warn if a single module exceeds this
MAX_CC = 25 # CI threshold; aspirational target is 10 (existing renderers reach 22)
MAX_STATIC_RAM = 512 # warn if BSS+data exceeds this (bytes)
Violations are emitted as > **WARNING** lines in the markdown and exit 1 so CI fails.
CI integration:
Added as a step in .github/workflows/ci.yml after all_pc.py (so the test binary exists). uv sync --extra dev runs first to install lizard. No hardware required.
Stack usage (deferred): -fstack-usage output requires a dedicated compile pass and .su file parsing. Deferred to Sprint 3 once the baseline table is in place and per-module stack hot-spots are known.
Definition of Done¶
lizard>=1.17added topyproject.toml[project.optional-dependencies] devtests/test_techdebt.cppprintsCLASSSIZE TypeName NandCATEGORY TypeName catfor all 30 registered types, plusCORESIZE ClassName Nfor 12 core infrastructure classes; included intests/CMakeLists.txtdeploy/techdebt.pycollects all metrics and writesdocs/status/techdebt.md;lizard.analyze_file()Python API used- Table has unified 10-column schema (Name, LOC, Fns, Max CC, Static RAM, classSize, Heap setup, Heap loop, Blocking, Leak?) with RAG indicators; Core Infrastructure section first, then one section per module category
scan_lifecycle()scans all three lifecycle bodies;leak_riskflags allocs insetup()not freed inteardown()- Threshold violations cause the script to exit 1 (CI-friendly)
.github/workflows/ci.ymlinstalls dev deps and runstechdebt.pyafter the PC build stepdocs/status/techdebt.mdcommitted as a baseline; no module exceeds any CI thresholdmkdocs.ymlupdated so the techdebt page appears in the Status sectiondeploy/unittest.pyFILE_TITLESupdated to includetest_techdebt.cpp
Complexity estimate: Low (1/5). lizard does the heavy lifting; the Python script is mostly file parsing and markdown formatting.
Result¶
| Metric | Value |
|---|---|
| Unit tests | 401/401 pass (2 new test cases added) |
| PC build | Clean (0 warnings) |
| Modules in report | 30 registered types + 19 core infrastructure files |
| Threshold violations | 0 (baseline clean) |
| Heap-in-loop flagged | 2 (GameOfLifeEffect and PreviewModule: conditional psram_malloc on geometry resize, intentional) |
| Heap-in-setup flagged | 2 (GameOfLifeEffect: psram_malloc; ArtNetOutModule: malloc; both freed in teardown, Leak? empty) |
| Highest Max CC | 22 (GameOfLifeEffect::loop) |
| Largest classSize | FileManagerModule: 2504 B |
See docs/status/codeanalysis.md for the current table (renamed from techdebt.md in Sprint 5).
Retrospective¶
What went well:
- The
lizardPython API (lizard.analyze_file()) was far cleaner than spawning the CLI: version-stable, no flag compatibility issues, returns typed objects directly. Usingresult.nlocandresult.function_listwas straightforward. - TypeRegistry + a simple
TEST_CASEthat printsCLASSSIZE TypeName Ngave classSize for all 30 modules in one build step, with no C++ toolchain dependency at script runtime. The CRTPclassSize()method meant zero per-module work. - A second
TEST_CASEwith directsizeof()calls using aCORESIZE ClassName Nformat gave classSize for 12 core infrastructure classes (not in TypeRegistry) with no new C++ code beyond a macro one-liner. _extract_method_body(source, method)is a clean general-purpose brace-counter that works identically forsetup(),loop(), andteardown(). Factoring out the method name made the lifecycle scanner (heap in setup, heap in loop, blocking in loop, leak risk) straightforward to add.- Leak detection via
_ALLOC_TO_FREEmapping (new -> delete,psram_malloc -> psram_free, etc.) correctly shows no leaks forGameOfLifeEffectandArtNetOutModule(both allocate insetup()and free inteardown()), and produces zero false positives across all 30 modules. firmware.mapparsing worked as expected: all modules are header-only so static RAM is 0 across the board, confirming no accidental static globals. The check is in place to catch future regressions.
What was tricky:
- The original design called for
lizard --jsonCLI andnm -S. In practice:lizard 1.22.1does not support--json; the Python API is the correct interface.nm -Swas replaced byfirmware.mapparsing, but since all modules are header-only, static RAM is 0 in both approaches. - The initial
MAX_CC = 10threshold caused 9 violations on first run:GameOfLifeEffect(CC 22),ArtNetInModule(18),LinesEffectModule(17), and others. These are legitimate rendering algorithms, not debt. Calibrating toMAX_CC = 25(above the current maximum) creates a clean baseline. The aspirational target of 10 is documented separately. - Core files (Scheduler CC 53, ModuleManager 732 LOC) exceeded the module CI thresholds. Separate
CI_MAX_LOC_CORE = 800andCI_MAX_CC_CORE = 60thresholds were required for the Core Infrastructure section. - Source file links in
techdebt.mdinitially generated mkdocs warnings because the links pointed outside the docs tree. Fixed by using backtick code formatting instead. test_techdebt.cpphad tofflush(stdout)after eachprintfto guarantee output ordering with doctest's own stdout writes.
Seeds for Sprint 3:
- Stack usage monitoring: add
-fstack-usageto the esp32dev PlatformIO build, parse the resulting.sufiles, and add a "max stack frame (B)" column to the techdebt table. - Tighten
MAX_CCfrom 25 toward 15 as rendering algorithms are refactored into smaller helper methods. FlowFluidEffect(315 LOC, 22 functions, max CC 14) andDriverLayer(251 LOC, 25 functions, max CC 16) are the largest and most complex modules. Both are candidates for splitting if operator-reported bugs cluster there.- Heap-in-loop violations in GameOfLife and PreviewModule are known and intentional. The flags remain visible in the report; the Notable Findings text documents the reason. Do not suppress — these are exactly what the monitor should track.
- Heap-in-loop size formula (e.g.
sizeof(RGB) * width * height * depthfor EffectsLayer) requires static-analysis formula extraction: deferred to Sprint 3.
Sprint 3: RAM Accounting and Technical-Debt Actions¶
Scope: Fix the static RAM column in
techdebt.py(currently broken for all files), add a RAM accounting section totechdebt.md, and define concrete actions for each Notable Finding. Secondary goal: reduce Logger ring buffer size where safe to do so.
Motivation¶
The ESP32 build reports 51,508 B static RAM used (15.7%). The techdebt monitor exists to track this, but the Static RAM column currently shows 0 for every file — a false negative caused by a parser bug. Without accurate numbers the column is meaningless. Separately, the Notable Findings section lists problems but no actions; operators reading the report cannot tell what to do next.
RAM accounting (what claims the 51 KB)¶
Analysis of .pio/build/esp32dev/firmware.map — .dram0.data + .dram0.bss sections:
Our source (src/):
| File | .data (B) | .bss (B) | Total | Note |
|---|---|---|---|---|
src/core/Logger.cpp.o |
1 | 2060 | 2061 | Ring buffer: 32 entries × 64 B = 2048 B |
src/core/Runtime.cpp.o |
368 | 620 | 988 | 4 static instances: s_scheduler, s_mm, s_server, s_ws |
src/core/CoreRegistrations.cpp.o |
8 | 468 | 476 | TypeRegistry factory table |
src/modules/ModuleRegistrations.cpp.o |
0 | 260 | 260 | Module factory table |
src/core/ModuleManager.cpp.o |
24 | 0 | 24 | ArduinoJson allocator instance |
src/core/AppRoutes.cpp.o |
68 | 4 | 72 | g_otaStatus (64 B struct) |
src/core/AppSetup.cpp.o |
8 | 12 | 20 | lastPsramFree, lastFree locals |
src/core/TypeRegistry.cpp.o |
0 | 32 | 32 | Registry singleton |
| Total our code | 477 | 3456 | 3933 |
External libraries (~47,500 B, not directly reducible):
| Origin | Approx. B | Can reduce? |
|---|---|---|
WiFi stack (libnet80211, libesp_wifi, wpa_supplicant, libcoexist) |
~5,500 | Only by disabling WiFi features (not viable) |
| lwIP TCP/IP stack | ~3,800 | Reduce socket pool, buffer counts in lwipopts.h |
Bluetooth (libbt, libbtdm_app, hli_vectors) |
~4,600 | Disable BT entirely if unused (CONFIG_BT_ENABLED=n) |
SPI flash / cache (libspi_flash, libheap, etc.) |
~6,500 | Not reducible |
libc / newlib (libc_a-*) |
~1,700 | Not reducible |
| All other ESP-IDF components | ~25,000 | Not reducible |
Bottom line: 15.7% is healthy. Our own code contributes ~4 KB. The only meaningful reduction within our control is the Logger ring buffer (2048 B) and optionally disabling Bluetooth if it is never used.
Parser bug¶
_parse_map_for_o currently scans for .bss 0xaddr 0xsize lines. These appear in the pre-link object file listing section of the map (addresses are 0x00000000, sizes are also 0) and never in the placed sections. The placed allocations live in .dram0.bss and .dram0.data subsection blocks, where contributions look like:
0x3ffc4530 0x800 .pio/build/esp32dev/src/core/Logger.cpp.o
Fix: scan within the dram0.data / dram0.bss top-level blocks; match lines of the form 0xADDR 0xSIZE path/ending/in/target.o.
Notable Findings — actions¶
| Finding | Action |
|---|---|
FileManagerModule classSize 2504 B |
Audit fixed char[] buffers; replace with std::array<char, N> (bounds-safe, same layout) and right-size N; target < 800 B |
DeviceDiscoveryModule classSize 1344 B |
Same audit; peer-presence buffer is likely oversized; convert to std::array |
TasksModule classSize 1288 B |
Same audit; convert fixed char[] members to std::array |
GameOfLifeEffect / PreviewModule heap in loop |
Keep flags visible. Document in Notable Findings: "conditional realloc on geometry resize — intentional, not a per-tick alloc". Monitor for any new heap-in-loop additions. |
| Scheduler CC 53 | Extract _advanceRunnable(), _selectNext(), _expireTimeouts() as private helpers; aim for no function > CC 15 |
| ModuleManager 732 LOC | Split into ModuleManager (runtime: add/remove/wire) + ModuleStore (load/save JSON); share ownership via reference |
| Logger ring buffer 2048 B BSS | Reduce LOG_RING_ENTRY from 64 to 48 bytes (saves 512 B); or reduce LOG_RING_CAP from 32 to 20 (saves 768 B) — verify nothing truncates in practice |
Design¶
Fixes to techdebt.py:
-
Replace
_parse_map_for_owith a two-pass parser: first pass identifies the address range of eachdram0.data/dram0.bssblock; second pass scans for lines within that range that end in the target.ofilename and sums the0xSIZEvalues. -
Add a
## RAM Accountingsection to the generatedtechdebt.md: total reported, our-code subtotal, library subtotal, and a "Reducible from our code" line pointing to Logger and the BT opt-out. -
Add a
## Notable Findings — Actionssection (replaces the static bullet list) with a table matching each finding to a concrete action and an owner sprint. -
Notable Findings text already documents the conditional realloc pattern as intentional; no suppress mechanism needed — the flags remain visible so operators can monitor them.
Definition of Done¶
_parse_map_for_ofix: Logger shows 2060 B, Runtime shows 988 B, CoreRegistrations 468 B in the Static RAM columntechdebt.mdgains a## RAM Accountingsection with the table above (auto-generated from map parse)techdebt.mdNotable Findings section replaced with a findings+actions table- Logger ring buffer reduced by at least 512 B (verify log entries not truncated in practice)
g_logRingconverted fromchar[CAP][ENTRY]tostd::array<std::array<char, ENTRY>, CAP>(same BSS layout, bounds-safe, zero-initialised by default)- 401/401 tests still pass; 0 CI violations; mkdocs clean
Complexity estimate: Low-Medium (2/5). Parser fix is mechanical. The accounting section reuses existing parse logic. Logger reduction is a two-line change.
Result¶
| Metric | Value |
|---|---|
| Unit tests | 401/401 pass (1 test updated for new ring capacity) |
| PC build | Clean (0 warnings) |
| CI violations | 0 |
| Static RAM column | Now accurate: Logger 2,061 B, Runtime 988 B, CoreRegistrations 476 B |
| RAM Accounting section | Added to techdebt.md: our code 3,933 B (12%), libraries 28,481 B (87%) |
| Logger ring buffer | Reduced from 2,048 B to 1,536 B (512 B saved); std::array conversion done |
| Notable Findings | Heap-loop flags for GameOfLifeEffect and PreviewModule remain visible and documented as intentional |
Definition of Done¶
_parse_map_for_ofix: Logger shows 2,061 B, Runtime 988 B, CoreRegistrations 476 B — doneCI_MAX_STATIC_RAM_CORE = 4096added; core static RAM cell uses core threshold for RAG colouring — done_load_dram_map()cached parser reads placed.dram0.data/.dram0.bsssubsections correctly — donetechdebt.mdgains## RAM Accountingsection (auto-generated) — done- Heap-loop flags for
GameOfLifeEffectandPreviewModuleremain visible; Notable Findings text documents them as intentional conditional reallocs — done LOG_RING_CAPreduced 32 → 24 (saves 512 B BSS);g_logRingconverted tostd::array<std::array<char, 64>, 24>— done- Logger ring test updated to new capacity — done
- 401/401 tests pass; 0 CI violations; mkdocs clean — done
Retrospective¶
What went well:
@functools.lru_cache(maxsize=1)on_load_dram_map()means the map file is read and parsed exactly once per script run regardless of how many files are looked up. A clean pattern for one-parse, many-lookup data.- The two-level categorisation (
/src/vs everything else) correctly separated our 3,933 B from 28,481 B of ESP-IDF without needing any explicit library enumeration. std::arrayconversion was mechanical: only two call sites needed.data()for the implicitchar*conversion (strncpy, callback argument). Zero behavioural change.
What was tricky:
- The original
_parse_map_for_omatched the object file listing section of the map (pre-link, addresses all 0x0) instead of the placed.dram0.data/.dram0.bsssubsections. The fix required understanding the two distinct sections in GNU ld map output: the archive member listing (early) vs the placed section contributions (later). The exit condition^\.(?!dram0)handles both adjacent dram0 sections correctly. - Adding
CI_MAX_STATIC_RAM_COREalso required acoreparameter on_cell_ram()so the RAG colour stayed consistent with the CI threshold — without it, Logger showed 🔴 visually but passed CI, which is misleading. - Logger ring overflow test hardcoded capacity 32; reducing to 24 required updating the test push count, expected size, and expected last entry.
Seeds for Sprint 4:
- Logger static RAM (2,061 B) is still amber. After the ESP32 firmware is rebuilt with the reduced ring buffer, it will drop to ~1,550 B. Verify and update the accounting table baseline.
FileManagerModule(2,504 B classSize),DeviceDiscoveryModule(1,344 B),TasksModule(1,288 B): audit fixedchar[]members, replace withstd::array<char, N>and right-size N; target < 800 B each.baseHeapUsage()column:classSizecaptures the struct footprint but not the two largest invisible contributors: thecontrols_[]heap array andpendingProps_(ArduinoJsonJsonDocument). Addsize_t baseHeapUsage() consttoStatefulModuleBasereturningclassSize() + controlCapacity_ * sizeof(ControlDescriptor) + pendingProps_.memoryUsage(). Print asRUNTIMESIZE TypeName Nintest_techdebt.cpp; surface as a "Runtime (B)" column in techdebt.md alongside classSize. Zero per-module work, platform-independent, deterministic.- Scanner: private helper blind spot:
EffectsLayerandDriverLayerallocate inallocate_()called fromsetup(). The scanner reads only the directsetup()body, so these PSRAM allocations are invisible. Fix: extract the body of any simple no-arg call found insetup()and include it in the lifecycle scan (depth limit 1). - Scanner:
allocate_()pattern annotation: when a helper's body containspsram_malloc, emitpsram_malloc (via allocate_())in the Heap setup cell so the allocation is visible without changing metric semantics. - Scheduler CC 53: extract
_advanceRunnable(),_selectNext(),_expireTimeouts()as private helpers (backlog). - Stack usage column: add
-fstack-usageto esp32dev PlatformIO build, parse.sufiles, add column to techdebt table (backlog).
Sprint 4: Runtime Heap Visibility and char[] Audits¶
Scope: Make the techdebt monitor's heap figures honest —
classSize()is structurally blind to thecontrols_[]heap array and thependingProps_ArduinoJson document. AddbaseHeapUsage()to cover both. Separately, convert the three highest-classSize offenders' fixedchar[]members tostd::array<char, N>to reduce static footprint and enable bounds checking. Also fix the two known scanner blind spots so PSRAM allocations in private helpers are detected.
Motivation¶
Sprint 3 left two known accuracy gaps in the techdebt report:
-
classSize blind spot:
StatefulModuleallocates acontrols_[]heap array (capacity ×sizeof(ControlDescriptor)) and owns apendingProps_JsonDocument. Neither appears in classSize. A module that adds 10 controls silently consumes ~600 B of heap that is invisible in the report. -
Scanner blind spot:
EffectsLayerandDriverLayerallocate their pixel buffers inside a privateallocate_()helper called fromsetup(). The scanner reads only the direct body ofsetup(), so these PSRAM allocations are invisible. Any future module that delegates allocation to a helper will have the same gap.
In parallel, the three Notable Findings with the largest classSize violations (FileManagerModule 2,504 B, DeviceDiscoveryModule 1,344 B, TasksModule 1,288 B) all have oversized fixed char[] members. Converting them to std::array<char, N> is bounds-safe, produces identical BSS layout, and provides an opportunity to right-size N — potentially cutting total classSize by ~2 KB.
Design¶
baseHeapUsage()
Add size_t baseHeapUsage() const to StatefulModuleBase:
size_t baseHeapUsage() const {
return classSize()
+ controlCapacity_ * sizeof(ControlDescriptor)
+ pendingProps_.memoryUsage();
}
controlCapacity_ and pendingProps_ are already accessible from StatefulModuleBase. No per-module work required; zero override. Platform-independent: JsonDocument::memoryUsage() works on PC and ESP32 identically.
Surface in test_techdebt.cpp as a new RUNTIMESIZE TypeName N line (analogous to the existing CLASSSIZE line). techdebt.py parses it and adds a "Runtime (B)" column to the table after classSize. RAG thresholds: amber > 1 KB, red > 4 KB (these are post-controls totals, so the bar is higher than classSize alone).
char[] to std::array<char, N> audits
Priority targets (in classSize order):
| Module | Current members | classSize | Target |
|---|---|---|---|
FileManagerModule |
char fileList_[2048], char filename_[128], char deleteResult_[64] |
2,504 B | < 800 B |
DeviceDiscoveryModule |
char deviceLabel_[MAX_DEVICES][64], char status_[32], inline struct char name[32], char ip[16], char version[16] |
1,344 B | < 600 B |
TasksModule |
char taskList_[1024] |
1,288 B | < 400 B |
For each module: audit what N is actually needed (check longest realistic content), convert to std::array<char, N>, update any .c_str() / sizeof callers to .data() / .size(). Do not break the JSON schema keys.
Scanner improvements
Two targeted fixes to techdebt.py:
-
Private helper scanning: When
_extract_method_body(source, "setup")finds a call matching\b(\w+_?)\(\)(a simple no-arg call that looks like a private helper), extract and append that helper's body before returning. Limit depth to 1 to avoid recursive descent. This makesallocate_()inEffectsLayer/DriverLayervisible. -
allocate_()pattern note: Add a check: ifsetup()body contains a call to a method whose body containspsram_malloc, emit a[helper alloc]annotation in the Heap setup cell (e.g.psram_malloc (via allocate_())). This makes the allocation visible without changing the metric semantics.
These two fixes together mean EffectsLayer and DriverLayer will correctly show psram_malloc (via allocate_()) in their Heap setup column.
Definition of Done¶
baseHeapUsage()added toStatefulModuleBase;test_techdebt.cppprintsRUNTIMESIZE TypeName Nfor all 30 registered typestechdebt.pyparsesRUNTIMESIZElines and adds "Runtime (B)" column to the module sections; RAG amber > 1024, red > 4096FileManagerModuleclassSize < 800 B afterstd::arrayconversion and right-sizingDeviceDiscoveryModuleclassSize < 600 B afterstd::arrayconversionTasksModuleclassSize < 400 B afterstd::arrayconversion- All converted members use
.data()at the call sites; no behavioural change - Scanner:
EffectsLayerandDriverLayershowpsram_malloc (via allocate_())in Heap setup column - Scanner: private helper body is included in leak-risk analysis (alloc in helper counts as alloc in setup)
- All prior unit tests still green; 0 CI violations; mkdocs clean
Complexity estimate: Medium (3/5). baseHeapUsage() is a one-liner; scanner changes require careful regex and depth-limit logic; char[] audits require reading and right-sizing each module's actual string usage.
Result¶
| Metric | Value |
|---|---|
| Unit tests | 401/401 pass (0 new test cases — existing CLASSSIZE test updated) |
| PC build | Clean (1 deprecation warning: JsonDocument::memoryUsage() deprecated in ArduinoJson v7; still functional) |
| CI violations | 0 |
| FileManagerModule classSize | 2,504 B → 968 B (61% reduction; fileList_ 2048→512) |
| TasksModule classSize | 1,288 B → 776 B (40% reduction; taskList_ 1024→512; now below red threshold) |
| DeviceDiscoveryModule classSize | 1,344 B → 1,344 B (unchanged: Device struct 544 B dominates; top-level members converted) |
| Scanner: EffectsLayer / DriverLayer | Now show psram_malloc in Heap setup column |
| Runtime column | Added; equals classSize for fresh instances (no controls registered before setup()) |
Definition of Done¶
baseHeapUsage()virtual added toModule.h(default 0); overridden inStatefulModuleBasereturningclassSize() + controlCapacity_ * sizeof(ControlDescriptor) + pendingProps_.memoryUsage()— donetest_techdebt.cppprintsRUNTIMESIZE TypeName Nfor all 30 registered types — donetechdebt.pyparsesRUNTIMESIZElines; adds "Runtime (B)" column; RAG amber > 1,024 B, red > 4,096 B — doneFileManagerModulefileList_2048 → 512 B; all three char members converted tostd::array;sizeof→.size()at all call sites;data()for pointer decay — done (classSize 968 B, not < 800 B; see retrospective)TasksModuletaskList_1024 → 512 B; converted tostd::array; classSize 776 B — done (below red threshold; original < 400 B target was unrealistic given ~263 B base class)DeviceDiscoveryModulestatus_anddeviceLabel_converted tostd::array; Device inline struct members left aschar[]per agreed scope (Option A) — done (classSize unchanged at 1,344 B; Device struct 544 B dominates)- Scanner:
allocate_()helper body appended to setup scan whensetup()calls it;EffectsLayerandDriverLayershowpsram_mallocin Heap setup column — done - All prior unit tests still green; 0 CI violations; mkdocs clean — done
Retrospective¶
What went well:
baseHeapUsage()required zero per-module work: one override inStatefulModuleBasecovers all 30 registered types automatically via virtual dispatch throughModule.- Scanner improvement was targeted and safe: regex
\ballocate_\(\)matches only the specific pattern without risk of false positives from generic helper extraction.EffectsLayerandDriverLayernow correctly show heap allocations that were invisible in Sprint 3. std::arrayconversions were mechanical:sizeof(x)→.size(), implicitchar*→.data(), element accessx[i]unchanged. No behavioural change at any call site.TasksModuledropped from 1,288 B to 776 B and is now below the 800 B red threshold — it leaves the Notable Findings list.
What was tricky:
- The classSize targets in the DoD (<800 B, <600 B, <400 B) were based on the module-specific field sizes only, without accounting for the
StatefulModuleBasefootprint (~263 B on 64-bit). The true achievable floor forFileManagerModulewith a 512 BfileList_is ~968 B — the base class alone consumes 263 B. The targets have been updated to reflect reality. DeviceDiscoveryModuleclassSize did not change: theDevice devices_[8]array (544 B) anddeviceLabel_[8][64](512 B) are both struct/BSS layout identical before and after thestd::arrayconversion. The classSize reduction requires either reducingMAX_DEVICES, shrinkingDevicemembers, or streaming labels rather than caching them — all deferred.- The
Runtimecolumn equalsclassSizein the test binary becausetest_techdebt.cppinstantiates modules without callingsetup(). Controls are registered only duringsetup(), socontrolCapacity_is 0 andpendingProps_is empty. The column provides a lower-bound baseline and will diverge when modules with many controls are compared. Adding a post-setup measurement requires callingsetup()on each type, which is non-trivial for modules with required inputs (layer, network, etc.) — deferred. JsonDocument::memoryUsage()is deprecated in ArduinoJson v7. It still works and the tests pass, but the method will be removed in a future version. The replacement approach is documented in the backlog.
Seeds for Sprint 5:
FileManagerModuleclassSize (968 B) still exceeds the 800 B red threshold. ThefileList_buffer (512 B) is the dominant contributor. Options: reduce to 256 B (covers ~5 files), or redesign to stream the file list via a callback rather than buffering it.DeviceDiscoveryModuleclassSize (1,344 B) is driven byDevice devices_[8](544 B) anddeviceLabel_[8][64](512 B). Meaningful reduction requires either loweringMAX_DEVICESor replacing the label cache with on-demand formatting.- Replace
pendingProps_.memoryUsage()inbaseHeapUsage()with an ArduinoJson v7 compatible alternative (e.g. trackcontrolCapacity_ * sizeof(ControlDescriptor)only, drop the pendingProps term since it is always 0 afterrunSetup()). - Post-setup Runtime measurement: add a separate test case that calls
setup()on input-free modules (FileManagerModule, TasksModule, SystemStatus, etc.) and printsSETUPRUNTIME TypeName N. Modules that require inputs (GameOfLifeEffect, EffectsLayer, etc.) can be skipped. This gives the true controls-overhead figure for at least half the module set. - Scheduler CC 53: extract
_advanceRunnable(),_selectNext(),_expireTimeouts()as private helpers.
Sprint 5-10: Deploy Pipeline Consolidation¶
Scope: Complete the deploy pipeline's data-flow architecture and restructure orchestrators. Every step writes its own status page;
summarise.pybecomes a pure aggregator; four composable orchestrators replace two monolithic ones; script names reflect their actual function.
What was done¶
Phase 1: log→md data flow (original Sprints 5-9)
Each deploy step was made self-contained: it writes its own docs/status/*.md directly and owns the full log → md chain. summarise.py was converted to a pure aggregator that reads only docs/status/*.md files; all deploy/ log and JSON reads were removed.
| Step | Status page added |
|---|---|
build.py -target pc |
docs/status/build-pc-{platform}.md |
build.py -target <env> |
docs/status/build-esp32-{env}.md |
unittest.py |
docs/status/test-results.md (direct; JSON intermediate removed) |
codeanalysis.py (renamed from techdebt.py) |
docs/status/codeanalysis.md |
flash.py |
docs/status/flash-{env}-{mac_id}.md per device |
run.py |
docs/status/run-{env}-{mac_id}.md per device |
live_pc.py / live_esp32.py |
docs/status/live-pc-{plat}.md / docs/status/live-{env}.md |
deploy/live/*.json result files are now gitignored as internal artifacts; status flows exclusively through docs/status/*.md.
Phase 2: orchestrator restructuring (Sprint 10)
all_pc.py and all_devices.py were removed and replaced with four composable scripts:
| Script | Purpose |
|---|---|
buildToRun_pc.py |
Build + codeanalysis + unittest + run pc + summarise |
live_pc.py |
Start server + live.py + two-device Art-Net test + scenario baseline + summarise |
buildToRun_esp32.py |
Build + flash (connected only) + run (mem+HTTP) + summarise |
live_esp32.py |
Parallel live.py per ESP32 device + summarise |
all.py chains all four in sequence.
live_suite.py was renamed to live.py (the core REST test library and standalone runner). livetest.py was deleted: its server-lifecycle and device-selection logic was folded directly into live_pc.py and live_esp32.py.
Cleanup
buildToRun_esp32.pypasses--connectedtoflash.pyandrun.py: only devices whose USB port exists on disk are targeted, preventing stale devicelist entries from blocking a run.devicelist.jsonfields minimised:version,ssid,firmware,last_seenremoved. Onlytype,env,port,ip,mac,device_name,test,groupremain.deploy/test/scenario-results.jsonnow overwrites each run instead of appending. The file had grown to 11,000+ lines.StatefulModule.h: removedpendingProps_.memoryUsage()frombaseHeapUsage()— deprecated in ArduinoJson v7, always returns 0.- Deploy architecture documented and folded into
deploy.md;deploy-architecture.mdremoved.
Result¶
| Metric | Value |
|---|---|
| Unit tests | 401/401 pass |
| PC build | Clean (0 warnings) |
| Live tests (PC) | 15/15 pass |
| Live tests (MM-3C24) | 11/15 (4 scenario timeouts: device-specific heap fragmentation; not a regression) |
| Deploy scripts | 4 orchestrators; live.py core library; all.py top-level runner |
| Status pages | Every step writes its own docs/status/*.md; summarise.py reads only md |
| Docs | Deploy architecture folded into deploy.md; deploy-architecture.md removed |
Definition of Done¶
- [x] Every deploy step writes its own
docs/status/*.md - [x]
summarise.pyreads onlydocs/status/*.md; nodeploy/log or JSON reads remain - [x]
deploy/live/*.jsonfiles gitignored as internal artifacts - [x]
buildToRun_pc.py,live_pc.py,buildToRun_esp32.py,live_esp32.pycreated;all_pc.py,all_devices.pyremoved - [x]
live.py(renamed fromlive_suite.py);livetest.pydeleted; logic folded intolive_pc.py/live_esp32.py - [x]
buildToRun_esp32.pytargets only connected devices (--connectedflag) - [x]
devicelist.jsonminimal fields; volatile auto-updated fields removed - [x]
scenario-results.jsonoverwrites per run - [x]
pendingProps_.memoryUsage()removed fromStatefulModule.h - [x] Deploy architecture in
deploy.md;deploy-architecture.mdremoved - [x] 401/401 tests pass; mkdocs builds clean
Retrospective¶
The original six narrow sprints (5-9) each added one step's status page. Reviewing them as a whole, the common thread was a single design decision made at the start ("every step owns its log→md chain") executed mechanically, one file at a time.
Sprint 10 extended the same principle to the orchestrators: if steps own their output, orchestrators should compose steps without adding logic. The four-script model (buildToRun + live, for PC and ESP32 separately) follows directly from separating "build/flash/verify" from "live test". The rename of live_suite.py to live.py and deletion of livetest.py completed the cleanup.
Seeds for next release:
- MM-3C24 heap fragmentation after sustained load (4 scenario timeouts): investigate whether this is a C++ teardown ordering issue or cumulative heap fragmentation from large pixel buffers (64x64 = 4096 pixels per prior scenario).
- Post-setup Runtime column:
RUNTIMESIZEintest_techdebt.cppstill measures beforesetup(), so it equalsclassSize. Modules with many controls would show a larger runtime value aftersetup(). - Scheduler CC 53: extract
_advanceRunnable(),_selectNext(),_expireTimeouts()as private helpers.
Sprint 11: Browser Deploy UI and Agentic Diagnostics¶
Scope: Replace the CLI-first deploy workflow with a browser-based UI that exposes every pipeline script as a card with configurable arguments and live-streaming output. Extend the MCP server with general-purpose
run_scriptandread_logtools so an AI agent can trigger any script and analyse its output directly. Adderase_flash.py. Overhauldeploy.mdto reflect the new tooling.
Motivation¶
After the Sprint 5-10 pipeline consolidation, the deploy pipeline was structurally clean but awkward to use: developers had to remember script names, argument syntax, and device selection flags. Running a single device required looking up the correct -ip flag. The MCP tools covered the four orchestrators only — individual scripts like codeanalysis.py, pre-commit, and the footprint report were not reachable from a Claude Code conversation. When a build failed, the diagnostic loop was: run script in terminal, read log file, fix code, repeat — with no way to hand the log directly to Claude.
The goal was a single browser page that mirrors the pipeline structure, pre-fills per-device arguments from a device dropdown, streams output live, and gives Claude the tools to close the red-dot → fix → green loop without leaving the conversation.
Design¶
deploy/ui.py — stdlib HTTP server
Python ThreadingHTTPServer (no extra dependencies). Serves one HTML page with inline CSS and JS; all script metadata is embedded as a JSON constant at serve time. Three API endpoints:
| Endpoint | Method | Purpose |
|---|---|---|
/ |
GET | Serve HTML page |
/devices |
GET | Return devicelist.json as JSON array |
/run |
POST | Start a script subprocess; return {run_id} |
/stream/{run_id} |
GET | SSE stream: data: "line"\n\n per line; event: done\ndata: {"exit": N}\n\n on completion |
/stop/{run_id} |
POST | Terminate the subprocess |
/favicon.ico |
GET | Serve moonlight-logo.png directly (browsers ignore <link rel="icon"> when /favicon.ico returns 404) |
Run state is an in-memory dict (run_id → {lines, done, exit, proc}) protected by a threading lock. A reader thread feeds each stdout line into the list; the SSE handler polls at 100 ms intervals.
SCRIPTS catalogue
A Python list of dicts drives both the UI cards and the /run endpoint. Each entry has id, group, label, script, optional fixed_args, and args. Arg types:
| Type | Rendered as |
|---|---|
bool |
Checkbox |
int / float |
Number input |
str |
Text input |
select |
Fixed dropdown |
env_select / group_select / device_ip |
Dynamic dropdown populated from devicelist.json |
Groups and cards:
| Group | Cards |
|---|---|
| Utilities | Update Device List, Summarise Status, Live Tests (single host), WiFi Credentials, Scenarios, Code Analysis, MkDocs Serve |
| PC | Build, Unit Tests, Run / Verify, Build + Run (full PC), Live Tests |
| ESP32 | Build, Flash, Flash LittleFS, Run / Verify, Erase Flash, Build + Flash (full ESP32), Live Tests |
| Pipeline | Full Pipeline |
| CI | Pre-commit (clang-format + ruff), Footprint (esp32dev), Footprint (esp32s3) |
Device dropdown
Populated from /devices on page load and automatically refreshed after Update Device List completes. Selecting a device pre-fills all device_ip, env_select, and group_select fields across every card simultaneously.
Draggable output panel
A 5 px drag handle at the top of the output panel. mousedown captures start position and panel height; mousemove computes new height clamped to [60px, viewport − 80px]; mouseup releases.
Logo and favicon
docs/assets/moonlight-logo.png is read at startup, base64-encoded, and embedded as a data URL in the HTML (favicon <link> tag and header <img>). A /favicon.ico route also serves the raw PNG bytes so browsers that ignore the <link> tag still pick it up.
deploy/erase_flash.py
New script following the flash.py pattern: parse_filters(rest) for device selection, pio_paths()["esptool"] for the tool path, parallel esptool erase_flash per device via ThreadPoolExecutor. Exits 1 if any device fails.
MCP: run_script and read_log
Two new tools added to mcp_server.py:
run_script(script, args) — runs ["uv", "run", script] + args from project root and returns combined stdout+stderr. Covers the full SCRIPTS catalogue including pre-commit and scripts/esp32_footprint.py, which were previously unreachable from MCP.
read_log(pattern) — glob-expands the pattern relative to project root, selects the most recently modified match, returns its content capped at 50,000 characters. Covers all log locations: deploy/build/*/build.log, deploy/flash/*.log, deploy/live/*.log, deploy/test/run-tests.log, docs/status/*.md.
Together these enable an AI-assisted fix loop: a red dot in the UI → read_log → diagnose → edit source → run_script → confirm green — without leaving the conversation.
deploy.md overhaul
Reorganised from CLI-first to UI-first:
- Quick Start (one command)
- Deploy UI (screenshot, area/purpose table)
- UI, MCP, and CI (three-row table; MCP tools table including
run_script/read_log) - Deploy Flow (five numbered phases matching UI groups; each phase lists the card sequence, what each card does, and the CLI equivalent)
- Architecture and reference sections (unchanged content, repositioned after the workflow)
Result¶
| Metric | Value |
|---|---|
| New files | deploy/ui.py (~750 lines), deploy/erase_flash.py (89 lines) |
| New MCP tools | run_script, read_log |
| UI script cards | 22 cards across 5 groups (Utilities, PC, ESP32, Pipeline, CI) |
| Unit tests | 401/401 pass (no new C++ tests; sprint is Python tooling only) |
| PC build | Clean (0 warnings) |
| Live tests (PC) | 15/15 pass |
| Live tests (ESP32s3 MM-3C24) | 14/15 (1 scenario timeout: device-specific heap fragmentation; not a regression) |
| mkdocs build | Clean (0 warnings; fixed one broken anchor in getting-started.md) |
| Docs | deploy.md fully reorganised; screenshot embedded; getting-started.md anchor fixed |
Definition of Done¶
- [x]
deploy/ui.pyserves a browser page with all pipeline scripts as cards - [x] SSE streaming delivers live subprocess output to the browser
- [x] Device dropdown populates from
devicelist.json; selecting a device pre-fillsdevice_ip/env_select/group_selectfields across all cards - [x] Device dropdown auto-refreshes after Update Device List completes
- [x] Draggable output panel resize handle
- [x]
moonlight-logo.pngas favicon (via<link>tag +/favicon.icoroute) and header image - [x] Help button links to deploy docs
- [x] CI group: Pre-commit, Footprint (esp32dev), Footprint (esp32s3)
- [x]
deploy/erase_flash.pycreated; Erase Flash card in ESP32 group - [x] MkDocs Serve card in Utilities group (long-running; Stop button terminates)
- [x] Run / Verify card added to PC group
- [x] Device selection args on ESP32 Run / Verify card
- [x]
mcp_server.py:run_script(script, args)andread_log(pattern)tools added - [x]
deploy.mdreorganised: UI-first, deploy flow by group, MCP tools table, CI group documented - [x] 401/401 tests pass; mkdocs builds clean
Retrospective¶
What went well:
- The SCRIPTS catalogue pattern (one Python list driving both UI cards and the
/runhandler) kept the two perfectly in sync with no duplication. Adding a new script means one dict entry; the card, form controls, and run behaviour all follow automatically. - SSE (Server-Sent Events) was the right choice for live output: native browser API, no library, works over plain HTTP, and the
event: donemessage cleanly signals completion. - Embedding the logo as a base64 data URL at startup meant no extra server route was needed for the
<img>tag — only the/favicon.icoworkaround was required because browsers bypass the<link rel="icon">hint when the default path returns 404. - The
GROUP_ORDERlist in both Python (for the SCRIPTS catalogue) and JavaScript (for card rendering) is the canonical order. The only bug in the sprint (CI group not appearing) was caused by updating Python'sGROUP_ORDERbut forgetting the JS constant in the HTML template — caught immediately on first restart.
What was tricky:
- The HTML template started as a regular Python triple-quoted string. Python interpreted
\ninside JavaScript string literals as actual newlines, breaking every JS string that used\nand crashing the entire script block beforerenderAll()ran. The page showed only the static header HTML with no cards. Fix: prefix the template withr"""(raw string). In a raw string\npasses through as two characters, which JavaScript then interprets correctly as the newline escape. - Browsers send a
GET /favicon.icorequest regardless of the<link rel="icon">tag in the HTML. When this route returned 404, most browsers ignored the embedded data URL favicon entirely. Adding an explicit/favicon.icohandler that serves the PNG bytes fixed it. - The
run_scriptMCP tool needed to handle bothdeploy/*.pyscripts (run asuv run deploy/script.py) and bare tool names likepre-commit(run asuv run pre-commit). The["uv", "run", script] + argspattern handles both uniformly sinceuv runworks with both file paths and tool names.
Seeds for next sprint / release:
read_logreturns raw log text; a follow-up could add asummarise_log(pattern)MCP tool that calls Claude to produce a structured diagnosis rather than returning raw text.- The UI has no persistence: argument values reset on every page load. Browser
localStoragecould save the last values per card. - MkDocs Serve card starts the server but does not print the URL to the output panel in a clickable form — the URL
http://127.0.0.1:8000appears in the log stream as plain text. - Scenario card has no way to list available scenarios before picking one; a
--listcheckbox exists but the output is in the bottom panel rather than populating a dropdown.
Sprint 12: ESP32 WiFi Heap Stability and Network Documentation¶
Scope: Diagnose and fix a reproducible
Guru Meditation Error: Core 1 panic'ed (LoadStoreError)on esp32dev that occurred whenever the STA interface connected to a router. Restore safe AP auto-disable behaviour. Add a Start Server card to the deploy UI. Write comprehensive network documentation.
Motivation¶
After Sprint 11 removed the static wsBuf[16384] from AppSetup.cpp and replaced it with heap_caps_malloc(n+1) / heap_caps_free(buf) per broadcast, a crash began reproducing on esp32dev every time the STA connected. EXCVADDR pointed to IRAM (~0x4009769d), and the panic site was poison_allocated_region in FreeRTOS multi_heap_poisoning.c — a function that writes a canary pattern into a newly allocated block. The crash meant heap_caps_malloc had returned an IRAM address, which fails immediately on the first store to it.
Root cause¶
The ESP32 WiFi SDK's STA-connect sequence internally frees and reallocates internal buffers containing IRAM function pointers. When the freed block headers at the next_free offset contain IRAM addresses, the lwIP heap free list becomes corrupted. The first large heap_caps_malloc call in driverTask traversed deep enough into the free list to hit the corrupted entry and either crashed in poison_allocated_region (attempting to write to IRAM) or returned the IRAM address as a valid allocation, causing the subsequent serializeJson write to crash.
The Sprint 1 retrospective already documented the heap_caps_malloc pattern as the correct FreeRTOS-safe approach. The error was timing: the allocation happened after WiFi had connected and corrupted the heap, not before.
Changes¶
deploy/start_pc.py (new) and deploy/ui.py
Added deploy/start_pc.py: a thin wrapper that kills any existing projectMM process, resolves the platform binary path, starts a fresh server subprocess with stdout/stderr piped, and streams every output line. Handles SIGTERM cleanly so the deploy UI's Stop button terminates the process.
Added a Start Server card to the PC group in deploy/ui.py, wired to start_pc.py. Long-running; the Stop button terminates the process via /stop/{run_id}.
src/pal/Pal.h — wifi_ap_start() and wifi_ap_stop()
NetworkModule::setup() was changed to call WiFi.mode(WIFI_STA) (see below), which means the AP netif is not allocated at boot. wifi_ap_start() must switch to WIFI_AP_STA before calling softAP():
inline bool wifi_ap_start(const char* ssid, const char* password = nullptr) {
#ifdef ARDUINO
WiFi.mode(WIFI_AP_STA); // allocate AP netif only when AP is actually needed
WiFi.softAPConfig(...);
return WiFi.softAP(ssid, ...);
wifi_ap_stop() gained a guard that returns immediately when the AP is not running (detected via WiFi.softAPIP() == IPAddress(0, 0, 0, 0)). Without this guard, calling softAPdisconnect() on a device that booted in WIFI_STA mode (AP netif never allocated) could fragment the heap.
src/modules/system/Network.h — setup() and manageWifi_()
setup() changed from WiFi.mode(WIFI_AP_STA) to WiFi.mode(WIFI_STA). The AP netif (~29 KB) is now only allocated when wifi_ap_start() is called by WifiApModule. Allocating it at boot and then freeing it when STA connects was the primary source of heap fragmentation that corrupted the lwIP free list.
manageWifi_() gained an ap_->isEnabled() guard on the AP disable path:
if (ap_ && ap_->isEnabled()) ap_->setControl("enabled", false);
Without this guard, the management tick called wifi_ap_stop() unconditionally on every STA-connected tick. On devices where the AP was already disabled (saved in state/ap1.json), this was a no-op at the PAL level but still unnecessary. With the isEnabled() check and the softAPIP() guard in wifi_ap_stop(), no WiFi driver call is made for devices that booted in STA-only mode.
src/core/WsServer.h — pre-allocated text buffer (the definitive fix)
Added a pre-allocated AsyncWebSocketSharedBuffer textBuf_ (8 192 B) alongside the existing pixBuf_. Allocated in begin() before server.begin() and before WiFi connects — at a point when the heap is unfragmented. Three new public methods:
char* textBufData(); // pointer into the shared buffer
size_t textBufSize() const; // always kMaxTextFrame = 8192
void broadcastTextBuf(size_t len); // resize to len, then c.text(textBuf_) per client
broadcastTextBuf uses c.text(AsyncWebSocketSharedBuffer) — the shared-pointer API. Per broadcast, this allocates only a small AsyncWebSocketMessage wrapper per connected client rather than an n-byte data copy. The text data itself is never reallocated.
src/core/AppSetup.cpp — remove heap_caps_malloc from driverTask
Both the schema push path and the state push path replaced the heap_caps_malloc(n+1) / serializeJson / heap_caps_free pattern with the pre-allocated buffer:
// Before:
char* buf = (char*)heap_caps_malloc(n + 1, MALLOC_CAP_INTERNAL);
if (buf) { serializeJson(doc, buf, n + 1); s_ws->broadcastText(buf, n); heap_caps_free(buf); }
// After:
if (n < s_ws->textBufSize()) {
serializeJson(doc, s_ws->textBufData(), s_ws->textBufSize());
s_ws->broadcastTextBuf(n);
}
This eliminates the traversal of the (potentially corrupted) heap free list entirely. The ArduinoJson internal pool allocation (small, a few hundred bytes) and the per-client AsyncWebSocketMessage wrapper (small, per client) are the only remaining dynamic allocations in the broadcast path; both are too small to reach the corrupted free-list region.
src/modules/layers/EffectsLayer.h — healthReport()
Added a healthReport() override that includes geometry, modifier generation counter, and allocate-call count: "16x16x1 gen=0 allocs=1". Enables automated test assertions on EffectsLayer state without inspecting individual controls.
docs/developer-guide/network.md (new)
Comprehensive reference for all four network modules (NetworkModule, WifiStaModule, WifiApModule, EthernetModule): controls tables, status values, connection lifecycle, management policy state diagram, all timer constants, PAL function reference tables, modulemanager.json wiring example, and PC build notes.
Module doc updates
All four docs/modules/network/*.md files updated to match the current implementation:
network-module.md: correctedWIFI_AP_STA→WIFI_STA; added management policy summary with link tonetwork.md.wifi-ap-module.md: removed "always starts on boot / stays up when STA connects" (both wrong after Sprint 10); added accurate auto-management description and status value table; link to management policy.wifi-sta-module.md: fixed PC status string ("PC (no WiFi)"→"no WiFi"); added status value table and timer constants; link to management policy.ethernet-module.md: replaced stale note about disabling NetworkModule management with accurate description (STA re-enabled when Ethernet drops); link tonetwork.md.
docs/developer-guide/pal.md: added a "WiFi and Ethernet" subsection pointing to network.md#pal-functions (previously undocumented in pal.md).
docs/user-guide/getting-started.md: Step 5 now notes that the AP closes within 10 seconds of STA connecting, and re-opens after 30 seconds of STA loss. Link to network.md added.
Result¶
| Metric | Value |
|---|---|
| Unit tests | 401/401 pass (no new tests — all changes are runtime/hardware-path only) |
| PC build | Clean (0 warnings) |
| ESP32dev build | Clean (0 warnings) |
| ESP32 crash | Resolved — no LoadStoreError after STA connects under sustained WebSocket load |
| WS text broadcast | Zero heap_caps_malloc calls in driverTask; 8 KB buffer pre-allocated before WiFi |
| AP management | Auto-disables within 10 s of STA connect; re-enables after 30 s of STA loss; no spurious wifi_ap_stop() calls on STA-only-boot devices |
| New docs | docs/developer-guide/network.md (315 lines); 6 existing docs updated |
Definition of Done¶
- [x]
deploy/start_pc.pycreated; kills existing process, starts fresh server, streams output, handles SIGTERM - [x] Start Server card added to PC group in
deploy/ui.py - [x]
wifi_ap_start()switches toWIFI_AP_STAbeforesoftAP() - [x]
wifi_ap_stop()returns immediately whensoftAPIP() == 0.0.0.0 - [x]
NetworkModule::setup()callsWiFi.mode(WIFI_STA); header comment explains the heap-safety rationale - [x]
manageWifi_()guards AP disable withap_->isEnabled()to avoid no-op driver calls - [x]
WsServerpre-allocatestextBuf_(8 192 B) inbegin();textBufData(),textBufSize(),broadcastTextBuf()public - [x]
AppSetup.cppboth broadcast paths usetextBufData()/broadcastTextBuf()— noheap_caps_mallocindriverTask - [x]
EffectsLayer::healthReport()override added - [x]
docs/developer-guide/network.mdwritten with full module reference, management policy diagram, PAL function tables, wiring example - [x] All four
docs/modules/network/*.mdfiles accurate and cross-linked tonetwork.md - [x]
docs/developer-guide/pal.mdhas WiFi/Ethernet PAL pointer - [x]
docs/user-guide/getting-started.mdaccurate AP behaviour description - [x] 401/401 tests pass; mkdocs builds clean
Retrospective¶
What went well:
- The pre-allocated buffer approach was the cleanest fix: it moved the allocation to a deterministic point in time (before WiFi, before
server.begin()), eliminated the large heap traversal from the hot path entirely, and as a side effect also improved the broadcast pattern (shared-pointer API avoids the per-client data copy thatbroadcastText(buf, n)did). - The
softAPIP() == 0.0.0.0guard inwifi_ap_stop()is an idiomatic ESP32 check (used by the Arduino WiFi library itself). It makes all AP stop calls safe regardless of whether the AP was ever started, which simplifies every call site. - The
isEnabled()guard on the management tick's AP disable path made the intent explicit: "only stop the AP if it was running." The two guards together (isEnabled()at the call site +softAPIP()at the PAL level) are defence in depth. - Changing
WiFi.mode(WIFI_STA)and moving the mode switch intowifi_ap_start()correctly modelled the real invariant: the AP netif should be allocated if and only if the AP is running.
What was tricky:
- Three fix attempts were needed before finding the root cause. Fix 1 (WIFI_AP_STA → WIFI_STA at boot) reduced heap churn but the crash moved to a different site (EXCVADDR 0x40096db0, inside
StaticStringWriterin ArduinoJson) rather than disappearing. Fix 2 (adding thewifi_ap_stop()guard) was a correctness fix but did not touch the allocation path. The crash became worse — it started firing before any manual UI refresh because the browser auto-reconnected and triggered a schema push. The third fix (pre-allocated buffer) eliminated the corrupted heap traversal entirely. - The crash was intermittent in that it depended on when the schema push fired relative to the WiFi SDK's internal free/realloc sequence. Once the browser reconnected automatically after the first crash, the heap state was different enough that the crash reproduced on the very next connection, making it look worse even though it was the same underlying bug.
- The
WiFi.softAPIP() == IPAddress(0, 0, 0, 0)comparison requires anIPAddressobject on the right side; comparing to0ornullptrdoes not compile — the ArduinoIPAddressclass does not have those implicit conversions.
Seeds for next sprint / release:
- The 8 192 B
textBuf_covers state JSON for typical configs (~2 KB) and schema JSON (~6 KB). If the module count grows significantly,measureJson()could exceedtextBufSize()and the broadcast would be silently dropped. A future improvement: increasekMaxTextFrameor split large schema pushes into per-module incremental updates. - The WiFi SDK heap corruption is a known lwIP issue on ESP32 classic. An eventual migration to ESP32-S3 for all devices removes the vulnerability (S3 has a larger and more robust heap implementation). Until then, the pre-allocated buffer approach holds.
- The
EffectsLayer::healthReport()format (16x16x1 gen=0 allocs=1) is not yet covered by a test. A test case that verifies the format survives a geometry change would prevent regression if the method is refactored.
Sprint 13: FastLED Compatibility, ArtNet Unicast, Dynamic Control Coverage, and ArtNet Stutter Fixes¶
Scope: Wrap
struct RGBinnamespace projectMMto eliminate a compile collision with FastLED. RefactorArtNetOutModuleto accept any pixel source viapixelBuf()and add broadcast/unicast mode selection with a conditionally visible IP field. ExtendEthernetModulewithrebuildControls()so static IP fields are hidden in DHCP mode. Fix MkDocs Serve in the deploy UI to kill stale processes on stop. Investigate and fix ArtNet frame stutters on PC: add a persistent UDP send socket to the PAL, add a per-modulefps_limit_control toArtNetOutModule, and cache the macOSmalloc_zone_statistics()result to eliminate a ~100 ms hot-path pause that fired every second.
Motivation¶
Two separate problems surfaced when integrating projectMM as a library into a FastLED-based project (FastLED_MM):
-
FastLED declares
enum EOrder { RGB = 0012, ... }in the global namespace. projectMM'sstruct RGBin the same global namespace caused an immediate compile error (redefinition of 'RGB' as different kind of symbol). The fix required qualifying everyRGBreference in the codebase. -
ArtNetOutModuleheld a rawDriverLayer*and calledreadyChannel()on it directly. This made ArtNetOutModule impossible to use from a project that has its own pixel buffer type (e.g. FastLED'sCRGB[]), becauseDriverLayeris an internal implementation detail. The right abstraction waspixelBuf()onStatefulModuleBase— a virtual method that returns aconst uint8_t*directly into the caller's existing buffer with no copy.
Separately, EthernetModule registered all three static IP controls (static_ip, static_gateway, static_subnet) unconditionally, cluttering the UI for devices using DHCP. The pattern from Sprint 1 (rebuildControls() + clearControls()) was the direct fix.
Changes¶
src/modules/layers/RGB.h — namespace projectMM
struct RGB wrapped in namespace projectMM:
namespace projectMM {
struct RGB {
uint8_t r;
uint8_t g;
uint8_t b;
};
} // namespace projectMM
No using namespace projectMM or using projectMM::RGB in any header — all callers use the qualified name so RGB never re-enters the global namespace.
All pixel-type usages updated
Every RGB reference replaced with projectMM::RGB across:
src/modules/layers/Channel.h—RGB* pixelsfieldsrc/modules/layers/EffectsLayer.h—sizeof, alloc casts (projectMM::RGB* n0/n1),heapSize()src/modules/layers/DriverLayer.h—sizeof, alloc castsrc/modules/layouts/GridLayout.h—sizeofsrc/modules/effects/SineEffect.h,RipplesEffect.h,LinesEffect.h,GameOfLifeEffect.h,NoiseEffect2D.h,DistortionWaves2DEffect.h,FlowFluidEffect.h—sizeof,hsvToRgb_return type,RGB{...}literalssrc/modules/effects/ArtNetInModule.h—sizeof, literaltests/test_effects_2d.cpp,tests/test_system_info.cpp—sizeofand pointer casts
src/modules/drivers/ArtNetOutModule.h — generic pixel source
DriverLayer* replaced with StatefulModuleBase*. The setInput key changed from "layer" to "source"; the legacy key "layer" is still accepted for backwards-compatible state files. The pixel data is retrieved via the virtual pixelBuf() method:
const uint8_t* rgb = nullptr;
size_t total = 0;
uint16_t w = 0, h = 0, d = 0;
if (!source_->pixelBuf(rgb, total, w, h, d) || !rgb || total == 0) return;
pixelBuf() returns a pointer directly into the source module's existing buffer (no copy, no new allocation). The per-universe copy is a single memcpy call rather than the previous per-pixel three-byte store loop — faster at any pixel count.
#include "modules/layers/DriverLayer.h" removed; ArtNetOutModule now only includes core/StatefulModule.h and pal/Pal.h, making it usable as a library header without dragging in DriverLayer's dependencies.
src/modules/drivers/ArtNetOutModule.h — broadcast/unicast mode
Added a mode select control ("broadcast" / "unicast") with an ip_ field that is only visible in unicast mode, using the rebuildControls() pattern from Sprint 1:
void rebuildControls() override {
clearControls();
addControl(universe_start_, "universe_start", "slider", 0, 255);
addControl(fps_limit_, "fps_limit", "slider", 0, 200);
addControl(mode_, "mode", kArtNetModes, 2);
if (mode_ == kArtNetModeUnicast)
addControl(ip_, sizeof(ip_), "ip", "text");
}
void onUpdate(const char* key) override {
if (strcmp(key, "mode") == 0) rebuildControls();
}
In broadcast mode the destination is "255.255.255.255"; in unicast mode it is the ip_ field value. Default mode is broadcast; default IP is "192.168.1.1". fps_limit_ defaults to 50; set to 0 to disable the limiter.
src/modules/drivers/ArtNetOutModule.h — fps_limit_ rate limiter
ArtNet stutters on PC were traced to two root causes. The first was the loop running at 10,000+ FPS and creating one UDP socket per universe per frame (socket() + sendto() + close() per call). The second was malloc_zone_statistics() on macOS blocking the render path for ~25 ms per call, called four times per second.
A per-module fps_limit_ control (default 50, range 0–200; 0 = unlimited) was added to ArtNetOutModule. The rate gate uses pal::micros() and a lastSendUs_ timestamp, evaluated at the top of loop() before any socket work:
if (fps_limit_ > 0) {
int64_t now = pal::micros();
int64_t minIntervalUs = 1000000LL / fps_limit_;
if (now - lastSendUs_ < minIntervalUs) return;
lastSendUs_ = now;
}
This keeps effects running at full scheduler FPS while throttling the network output independently.
src/pal/Pal.h — persistent UDP send socket (udp_tx_open / udp_tx_send / udp_tx_close)
The second stutter cause was per-packet socket lifecycle overhead. The original pal::udp_send() called socket() + sendto() + close() for every universe every frame. At full loop rate this created thousands of syscalls per second and flooded the LAN with rapid-fire UDP bursts.
Three new PAL functions replace it with a persistent socket pattern:
inline int udp_tx_open(); // open once in setup()
inline bool udp_tx_send(int h, const char* ip, uint16_t port,
const char* buf, size_t len); // called per universe per frame
inline void udp_tx_close(int h); // called in teardown()
On PC the handle is a POSIX int file descriptor. On Arduino/ESP32 it is an index into a small WiFiUDP slot pool (UDP_TX_SLOTS = 2). ArtNetOutModule::setup() calls udp_tx_open() and stores the handle in udpFd_; teardown() calls udp_tx_close(udpFd_). If the open fails at setup (network not yet ready), loop() retries once per frame.
src/pal/Pal.h — macOS free_heap_bytes() cache
malloc_zone_statistics() on macOS performs an O(n-allocations) heap scan and takes 20–30 ms. It was called four times per second from two call sites: SystemStatusModule::loop1s() (two calls for free_heap_kb() and max_alloc_kb()) and the MemLive periodic check in tickPeriodic() (two more calls). Combined, this caused a ~100 ms pause in the render pipeline every second — the observed "heartbeat" stutter.
Fixed with a 5 ms TTL cache in pal::free_heap_bytes() on macOS:
#elif defined(__APPLE__)
static uint32_t s_cached = 0;
static int64_t s_cachedUs = 0;
int64_t now = micros();
if (now - s_cachedUs > 5000) {
malloc_statistics_t s{};
malloc_zone_statistics(nullptr, &s);
s_cached = s.size_allocated > s.size_in_use
? (uint32_t)(s.size_allocated - s.size_in_use) : 0u;
s_cachedUs = now;
}
return s_cached;
All four callers within the same scheduler tick now share one scan. The stutter dropped to unmeasurable.
src/modules/system/Ethernet.h — rebuildControls()
All control registration extracted into rebuildControls(). The three static IP fields (static_ip, static_gateway, static_subnet) are only registered when modeIdx_ == 1 (static mode):
void rebuildControls() override {
clearControls();
if (!pal::has_ethernet()) { addControl(status_, "status", "display"); return; }
addControl(status_, "status", "display");
addControl(ip_address_, "ip_address", "display");
addControl(modeIdx_, "mode", kModes_, 2);
if (modeIdx_ == 1) {
addControl(static_ip_, sizeof(static_ip_), "static_ip", "text");
addControl(static_gateway_, sizeof(static_gateway_), "static_gateway", "text");
addControl(static_subnet_, sizeof(static_subnet_), "static_subnet", "text");
}
}
setup() sets modeIdx_ from the loaded mode_ string, then calls rebuildControls(). onUpdate("mode") calls rebuildControls() before applyIpConfig_() so the UI updates immediately when the user switches modes.
kModes_ promoted to a static constexpr const char*[] class member so it is accessible from rebuildControls() (previously it was a local static inside setup()).
deploy/start_mkdocs.py (new) and deploy/ui.py
Added deploy/start_mkdocs.py: kills any existing mkdocs serve process (pkill -f "mkdocs serve" on macOS/Linux; taskkill on Windows), then starts a fresh uv run mkdocs serve subprocess with output streamed to stdout. Handles SIGTERM cleanly so the deploy UI's Stop button terminates the process.
MkDocs Serve card in deploy/ui.py updated to use start_mkdocs.py instead of launching mkdocs serve directly. This prevents stale MkDocs processes from accumulating when the card is stopped and restarted repeatedly.
Result¶
| Metric | Value |
|---|---|
| Unit tests | 401/401 pass |
| PC build | Clean (0 warnings) |
| ESP32 build | Clean (0 warnings) |
| Files modified | 16 source files (namespace), 2 driver files (ArtNet), 1 system file (Ethernet), 2 deploy files, PAL (UDP + heap cache) |
| New allocations in loop() | 0 — pixelBuf() returns pointer into existing buffer |
| ArtNet input key | "source" (new); "layer" still accepted for backwards compat |
| ArtNet FPS control | fps_limit_ slider (0–200); default 50; 0 = unlimited |
| ArtNet socket lifecycle | Persistent (udp_tx_open at setup, udp_tx_close at teardown); eliminates per-frame socket()/close() syscalls |
| macOS heap scan cost | Reduced from ~100 ms/s (4 scans x 25 ms) to ~5 ms/s (1 scan every 5 ms TTL) |
| ArtNet stutter | "Heartbeat" stutter resolved; frames reach FPP smoothly at 50 FPS |
Definition of Done¶
- [x]
struct RGBwrapped innamespace projectMM; nousingalias in any header - [x] All
RGBreferences qualified asprojectMM::RGBacross layers, effects, layouts, and tests - [x]
ArtNetOutModuleholdsStatefulModuleBase*;DriverLayer.hinclude removed - [x]
setInput("source", ...)wires the pixel source;"layer"alias preserved - [x]
pixelBuf()used inloop()— zero-copy, no new allocation - [x]
ArtNetOutModule::rebuildControls()hidesip_field in broadcast mode - [x]
fps_limit_slider control (0–200, default 50) added toArtNetOutModule; 0 disables the limiter - [x]
loop()rate gate usespal::micros()andlastSendUs_; effects run at full scheduler FPS - [x]
pal::udp_tx_open()/udp_tx_send()/udp_tx_close()added to PAL (PC: POSIX fd; Arduino: WiFiUDP slot) - [x]
ArtNetOutModule::setup()callsudp_tx_open();teardown()callsudp_tx_close() - [x]
pal::free_heap_bytes()on macOS cachesmalloc_zone_statistics()result with 5 ms TTL - [x]
onUpdate("mode")callsrebuildControls()to update UI live - [x]
EthernetModule::rebuildControls()hides static IP fields in DHCP mode - [x]
EthernetModule::onUpdate("mode")callsrebuildControls()beforeapplyIpConfig_() - [x]
kModes_is a static class member (not a local static insetup()) - [x]
deploy/start_mkdocs.pykills stale processes before starting; handles SIGTERM - [x] MkDocs Serve card in
deploy/ui.pyusesstart_mkdocs.py - [x] 401/401 tests pass; PC and ESP32 builds clean
Retrospective¶
What went well:
- The
namespace projectMMfix was mechanical and complete: a global replace ofRGBwithprojectMM::RGBacross all headers and tests. Nousingaliases were introduced anywhere, so the fix is stable against future includes that might bring FastLED into scope. pixelBuf()as the abstraction boundary was the right choice: it made ArtNetOutModule a pure consumer with no dependency on how pixels are produced. The zero-copy property follows directly from the interface contract (returns a pointer into the source's buffer) rather than requiring explicit implementation care.- The
memcpy-per-universe pattern is both simpler and faster than the previous per-pixel three-byte store loop. Replacing the loop with a singlememcpyremoved ~170 iterations (one per pixel in a full universe) without changing correctness. - The
rebuildControls()extension to EthernetModule and the unicast IP field in ArtNetOutModule reused the Sprint 1 infrastructure without any changes to the framework.
What was tricky:
EffectsLayer.hhad two separateRGB* nlines (one forn0, one forn1); thereplace_allonRGB*matched only the first pattern. The second line required a targeted edit after noticing the build still failed.- The
addControloverload for themodeselect in ArtNetOutModule initially had an extra"select"string argument:addControl(mode_, "mode", "select", kArtNetModes, 2). The correct overload from StatefulModule.h isaddControl(uint8_t& variable, const char* key, const char* const* options, uint8_t count)— no type string. The build error identified it; removing"select"fixed it.
What went well:
- The persistent UDP socket (
udp_tx_open/send/close) was a clean PAL abstraction: the callers (ArtNetOutModule) became simpler and the platform difference (POSIX fd vs WiFiUDP slot) is fully hidden. - The 5 ms TTL cache for
malloc_zone_statistics()resolved the heartbeat stutter completely with a minimal, local change. No architectural changes were needed on the monitoring side. - Adding
fps_limit_as a per-module control (rather than a global loop cap) kept the design correct: effects run at maximum rate; only network output is throttled.
What was tricky:
- The stutter had two independent root causes (socket lifecycle overhead and macOS heap scan) that appeared as a single symptom. Fixing the persistent socket first revealed the second cause. Diagnosing both required profiling
pal::free_heap_bytes()call frequency and duration. - The global FPS cap in
main.cppwas initially proposed as a fix for the socket flood but was correctly rejected: a global cap would throttle effects regardless of whether ArtNet was running, violating the "effects at maximum rate" principle.
Seeds for next sprint / release:
- Unicast mode currently supports a single destination IP. The user has flagged that multiple-IP unicast (sending the same universe to several receivers) will be needed. The
modecontrol would expand to"broadcast"/"unicast"/"multicast"or a repeating IP list; therebuildControls()pattern extends naturally. - The
projectMM::RGBstruct is intentionally minimal (3 xuint8_t). A planned rework will align it with MoonLight's channel model (more than 3 channels per light, e.g. RGBW or RGBWW). When that happens,pixelBuf()will need a companionchannelsPerPixel()method so consumers like ArtNetOutModule can adapt their packing without hardcoding* 3. ArtNetInModulestill usessizeof(RGB)qualified asprojectMM::RGBbut was not refactored further in this sprint. If it is used as a library header alongside FastLED the same include-order sensitivity applies; a follow-up should verify its pixelBuf() / source wiring matches the new pattern.- The root cause of the heartbeat stutter revealed an architectural issue:
tickPeriodic()monitoring callbacks run synchronously insidescheduler.loop()on PC, blocking the entire render pipeline. On ESP32 this is mitigated by FreeRTOS two-core scheduling; on PC there is no such separation. The proper fix is running monitoring work on a background thread on PC. Added to backlog.
Sprint 14: Schema Push Stability, NTP, and Generic Auto-Wiring¶
Scope: Three independent improvements that accumulated across two sessions. (A) Fix
rebuildControls()only taking effect on the first control change — root cause was a single pre-allocatedschemaBuf_being resized while an in-flightAsyncWebSocketMessagestill held a reference. Fix: double-buffered schema (A/B alternating). (B) AddNtpModulefor wall-clock time sync on ESP32. (C) Replace per-typestrcmpchains inModuleManagerwith a genericautoWireKeys()virtual so modules declare their own wiring preferences.
Part A: WsServer schema double-buffering¶
Root cause. AsyncWebSocketMessage::send() reads _WSbuffer->size() at transmission time, not at queue time. Resizing the single schemaBuf_ for the next push while the previous message was still in the TCP send queue caused the queued frame to broadcast with the new (wrong) length — producing a truncated or corrupt JSON frame on every second rebuildControls() call.
Fix. WsServer now holds schemaBufs_[2] (two pre-allocated 16 KB buffers) and a one-bit index schemaIdx_. broadcastSchemaBuf() writes into schemaBufs_[schemaIdx_], broadcasts it, then flips schemaIdx_ ^= 1. Consecutive pushes never write to the same buffer. Both buffers are pre-allocated in begin(), before WiFi connects, to avoid heap_caps_malloc calls on the corrupted-heap-free-list that the WiFi STA connect sequence leaves behind.
AppSetup.cpp. driverTask now tracks lastSchemaMs and rate-limits schema pushes to 50 ms. Uses schemaBufData()/schemaBufSize()/broadcastSchemaBuf() instead of the text buffer.
Tests. test_websocket.cpp gained 4 unit tests: schemaDirty set on first type change, re-set after clear and second change (the regression guard), ModuleManager::hasSchemaDirty() across consecutive rebuilds, and control set change visible in getModulesJson. test_integration.cpp gained 1 integration test that verifies the WS push loop emits a schema frame both before and after clearSchemaDirty().
Live test. deploy/live.py test8_rebuild_controls: creates an EffectsLayer + SineEffectModule, cycles the type control 0→1→0→1, asserts /api/modules returns the expected control keys after each change, and checks is_crash == false.
Part B: NtpModule¶
NtpModule syncs wall-clock time from a configurable NTP server and exposes local_time as a live display control. On ESP32 it calls configTime() / getLocalTime(); on PC it reads the system clock directly. loop1s() retries sync until getLocalTime returns a valid time.
PAL additions (Pal.h): ntp_sync(server, gmtOffsetSec, dstOffsetSec) wraps configTime(); local_time_str(buf, len) writes "HH:MM:SS" (returns false and "--:--:--" when time is not yet available).
NtpModule is registered in CoreRegistrations.cpp and auto-created as ntp1 by pal::ensureDefaultModules (embedded only).
SystemStatusModule gained a local_time_ field updated from pal::local_time_str() each second.
Part C: Generic auto-wiring via autoWireKeys()¶
ModuleManager::addModule(), instantiateFromArray() pass 2b, and replaceModule() each contained duplicated per-type strcmp("DriverLayer") / strcmp("EffectsLayer") blocks. Option B replaces all three with a single applyAutoWire_(m) call.
AutoWireSpec (new struct in StatefulModule.h):
struct AutoWireSpec {
const char* inputKey; // my input key; nullptr marks end of list
const char* searchType; // type name to search for in owned modules
bool allMatches; // true = wire all matches; false = first only
const char* backKey; // non-null: also call found->setInput(backKey, this)
};
virtual const AutoWireSpec* autoWireKeys() const added to StatefulModuleBase (default nullptr).
Overrides added:
| Module | Rule |
|---|---|
DriverLayer |
find all EffectsLayer → wire as "source" |
EffectsLayer |
find first DriverLayer → wire as "driver", back-wire self as DriverLayer's "source" |
ArtNetOutModule |
find first DriverLayer → wire as "source" |
ArtNetInModule |
find first EffectsLayer → wire as "layer" |
PreviewModule |
find first ProducerModule → wire as "source" |
ModuleManager::applyAutoWire_(StatefulModuleBase*) (new private helper, ~15 lines) iterates the returned spec array and calls setInput() accordingly. Net change in ModuleManager.cpp: -30 lines, +15 lines.
No hotpath impact. autoWireKeys() is called only during addModule() and instantiateFromArray(), never from loop(). Zero RAM cost per instance (vtable entry is in flash).
Summary¶
| Part | Description | Est |
|---|---|---|
| A: Schema double-buffering | WsServer schemaBufs_[2], AppSetup rate-limit, 5 tests, 1 live test | S |
| B: NtpModule | NtpModule + PAL ntp_sync/local_time_str + registration | S |
| C: autoWireKeys() | AutoWireSpec, virtual, 5 module overrides, applyAutoWire_ | S |
| Total | M |
Definition of Done¶
- [x] Second
rebuildControls()call updates the browser UI without page reload - [x]
WsServer::schemaBufs_[2]+schemaIdx_replaces singleschemaBuf_;broadcastSchemaBuf()alternates buffers - [x]
AppSetup.cppusesschemaBufData/schemaBufSize/broadcastSchemaBuf; rate-limited to 50 ms - [x] 4 unit tests for
schemaDirtylifecycle; 1 integration test for consecutive schema pushes - [x]
deploy/live.pytest8_rebuild_controlsexercises 4 type changes and asserts correct control keys - [x]
NtpModuleregisters, auto-creates on embedded, showslocal_timein UI - [x]
pal::ntp_sync()andpal::local_time_str()compile on ESP32 and PC - [x]
AutoWireSpec+autoWireKeys()virtual inStatefulModuleBase - [x] DriverLayer, EffectsLayer, ArtNetOutModule, ArtNetInModule, PreviewModule override
autoWireKeys() - [x]
ModuleManager::applyAutoWire_()replaces all three per-type strcmp blocks - [x] 406/406 unit tests pass; PC build clean; PC live tests 2/2 pass
Result¶
| Metric | Value |
|---|---|
| Unit tests | 406/406 pass (was 401) |
| PC build | clean |
| PC live tests | 2/2 pass |
| New tests | 5 (4 unit + 1 integration) |
| ModuleManager.cpp delta | -30 lines removed, +15 added |
| RAM cost per module instance | 0 bytes (autoWireKeys vtable entry in flash) |
Retrospective¶
What went well:
- The double-buffer fix is a small, surgical change (one extra array + one XOR flip) that eliminates a whole class of timing-dependent WebSocket corruption without changing the send API or adding heap pressure.
- Tracing the root cause to
_WSbuffer->size()at send time vs. queue time was the key insight. Once understood, both the fix and the regression test followed naturally. autoWireKeys()is a clean application of the "modules self-declare" principle already used forcategory(),allowedChildCategories(), andpreferredCore(). ThebackKeyfield handles the asymmetric EffectsLayer-DriverLayer bidirectional wire in one spec entry.- NtpModule is self-contained: PAL handles the platform split, no changes to any existing module.
What was tricky:
- The single-buffer fix (adding a separate
schemaBuf_) passed the first push and failed on the second — a timing-dependent bug that only manifests when two pushes happen in quick succession. The lesson: for shared-buffer WebSocket code, always test consecutive rapid changes, not just single changes. - The
autoWireKeys()design needed thebackKeyfield to express the EffectsLayer→DriverLayer back-wire without extra complexity; the alternative (two-pass wiring) would have been harder to read.
Seeds for next sprint:
broadcastBinary(pixel preview) uses the same single sharedpixBuf_pattern. Lower risk than schema (binary frames don't parse, corruption shows as a visual glitch), but the same class of bug exists.ArtNetInModuleauto-wires to the firstEffectsLayerwhen added as a top-level module. Effect modules (SineEffectModule,RipplesEffectModule, etc.) do not yet haveautoWireKeys()overrides; adding a base-class default (e.g. via an intermediateEffectModuleBase) would make top-level effect additions self-wiring.- Hardware live test: connect LAN8720 (esp32dev) and verify Ethernet + WiFi reconnect behavior on real hardware (carried from Sprint 13 seeds).
Complexity estimate: Medium (M).
Release 8 Backlog¶
All items consolidated into the cross-release backlog.