Release 7: OTA, Ethernet, and Runtime Hardening (v1.7.0)
Theme: Release 7 completes the field deployment story: over-the-air firmware updates with a CI release pipeline, Windows support, and a full Ethernet + WiFi management stack with automatic reconnect. Runtime hardening spans heap/OOM safety, static RAM tuning, and WebSocket log streaming. It closes out with PAL simplification (IDF_VER removal) and deploy pipeline correctness fixes.
Release Overview
Foundation from previous releases
| Capability | Notes |
|---|---|
| Virtual/Physical layer split | Effects render in virtual space; layouts own the physical mapping |
| PhysMap | 1:0, 1:1, 1:N pixel mappings; PSRAM-backed on S3 |
| Modifier library | Mirror, Checkerboard, Scroll, Rotate, Tile |
| Non-rectangular layouts | RingLayout, WheelLayout, XmasTreeLayout |
| Memory observability | MemBoot balance sheet per module; MemLive fragmentation warnings at runtime |
| Time observability | Per-second CPU accounting with module hierarchy; REST + WS exposure |
| Scenario benchmarking | Declarative JSON pipelines shared by unit tests and live tests; fps + heap per step |
| Live test suite | 13 tests (smoke / format / behavioral / integration) on PC + ESP32 via REST |
| Deploy pipeline | all.py with post-flash mem capture (--reset), live tests, and status docs |
| 357 unit tests | All passing; smoke / format / behavioral / integration classification |
| Developer tooling | uv workspace, pre-commit hooks, PAL compile-time enforcement, MCP server |
What Release 7 delivers
| Problem / Goal | Sprint |
|---|---|
| No over-the-air firmware update path | Sprint 1 (FirmwareUpdateModule: file + GitHub) |
| No firmware assets on GitHub releases / no nightly build | Sprint 2 (CI release pipeline) |
| No Windows build or release binary | Sprint 3 (Windows build + CI) |
| Scenario baselines not populated from hardware | Sprint 5 (Scenario baseline + extends) |
| Classic ESP32 static RAM / fragmentation headroom thin | Sprint 6 (LOG_RING_SIZE tuning, WiFi buffer counts, dual check_alloc) |
| Ring buffer diagnostics not visible without serial monitor | Sprint 7 (GET /api/log WebSocket streaming + frontend log panel) |
| Heap safety and HTTP OOM crashes | Sprint 8 (per-module controlAllocBytes, HTTP OOM catch) |
| No dropdown control type | Sprint 9 (select control: backend index + frontend <select>) |
| Module creation UX and WiFi management gaps | Sprint 10 (boot module redesign, dynamic AP/STA management) |
| No Ethernet support (LAN8720 / W5500) | Sprint 11 (EthernetModule, PAL functions, static IP) |
| WiFi does not reconnect after signal loss or Ethernet drop | Sprint 12 (STA retry every 30 s, Ethernet-drop STA re-enable) |
| PAL IDF_VER dead code; status doc duplication; pipeline bugs | Sprint 13 (IDF_VER removal, deploy-summary consolidation, pipeline fixes) |
Sprints
| Sprint | Goal |
|---|---|
| Sprint 1 | FirmwareUpdateModule: OTA PAL + file upload + GitHub releases UI + env in SystemStatus |
| Sprint 2 | CI release pipeline: firmware assets on tagged releases + nightly pre-release build |
| Sprint 3 | Windows build: #ifdef _WIN32 in WsServer.h and Pal.h; ws2_32 link; CI job; .zip artifact |
| Sprint 5 | Scenario baseline: first hardware --update-baseline run; "extends" inheritance; wire into all.py |
| Sprint 6 | Static RAM hardening: LOG_RING_SIZE 2 KB on all devices, WiFi buffer tuning, dual check_alloc guard |
| Sprint 7 | GET /api/log frontend panel: WS push of ring buffer entries; collapsible log UI |
| Sprint 8 | Heap safety: per-module controlAllocBytes hook, HTTP OOM catch, live-test correctness, isPermanent() removal |
| Sprint 9 | select control type: addControl(..., "select") + addControlValue(); backend index storage; frontend <select> rendering |
| Sprint 10 | Boot module creation redesign; dynamic AP/STA WiFi management with Ethernet gating |
| Sprint 11 | EthernetModule: LAN8720 (RMII) + W5500 (SPI); PAL functions; DHCP + static IP |
| Sprint 12 | WiFi reconnect: STA retry every 30 s; re-enable STA on Ethernet drop; PAL architecture docs |
| Sprint 13 | PAL IDF_VER removal; deploy-summary.md consolidation; deploy pipeline correctness fixes |
Sprint 1: FirmwareUpdateModule
Scope: Full over-the-air firmware update: PAL plumbing, a `POST /api/firmware` endpoint, and a `FirmwareUpdateModule` that supports both a local file upload and one-click flashing from GitHub releases. The GitHub releases path fetches the public releases API in the browser, matches assets to the current device environment, and streams the binary to the device; no internet access is required on the ESP32 itself.
Deferred from: Release 5 original scope.
Summary
| Part | Description | Est |
|---|---|---|
| PAL + HTTP endpoint | `pal::ota_*` functions, `POST /api/firmware` streaming endpoint, dual-OTA partition scheme | M |
| FirmwareUpdateModule | Module lifecycle, `update_status` control, OTA state integration | S |
| Frontend: file upload | File picker tab, XHR streaming to endpoint, progress bar | S |
| Frontend: GitHub releases | Browser-direct GitHub API, asset matching by env, version badge, sessionStorage cache | M |
| SystemStatus env field | `"env": BUILD_TARGET` in `GET /api/system` and `healthReport()` | XS |
| Tests | PAL stub tests, endpoint test (ota_write call count, ota_end once) | S |
| Total | | L |
Planned scope
PAL and endpoint:
- `pal::ota_begin()`, `pal::ota_write(buf, len)`, `pal::ota_end()`, `pal::ota_abort()` in `Pal.h`. On ESP32: wraps `esp_ota_*`. On PC: writes received bytes to a temp file and prints a log line.
- `POST /api/firmware` (multipart or chunked binary body): streams bytes through `pal::ota_write`, calls `pal::ota_end()` on completion, triggers reboot. Returns `{"ok":true}` or `{"error":"..."}`.
- Partition scheme: dual-OTA layout (`partitions/esp32dev-ota.csv`, `partitions/esp32s3-ota.csv`) so a running image can be updated without erasing LittleFS.
SystemStatus `env` field:
- Add `"env": BUILD_TARGET` to the `GET /api/system` response and to `SystemStatus::healthReport()`. `BUILD_TARGET` is already a compile-time define (`esp32dev`, `esp32s3_n16r8`, `PC`, …). This field is what the update UI uses to match GitHub release assets to the current device.
FirmwareUpdateModule:
- Registered in `ModuleRegistrations.cpp`; `isPermanent() = true`.
- Exposes an `"update_status"` display control (idle / downloading / flashing X% / done / error). `disableSelf()` is NOT used here; the module stays alive through the update.
- WebSocket push: progress events `{"type":"ota","pct":42}` every ~5% so the frontend can update a progress bar without polling.
Frontend (two update paths in one card):
- File upload tab: `<input type="file" accept=".bin">` → reads file as `ArrayBuffer` → `POST /api/firmware` with `Content-Type: application/octet-stream`. Shows a progress bar driven by `XMLHttpRequest.upload.onprogress`.
- GitHub releases tab: on open, browser JS calls `https://api.github.com/repos/ewowi/projectMM/releases?per_page=5` directly (public API, no auth, no device internet required). For each release it shows tag name, release title, date, and pre-release badge. Downloads the asset matching `projectMM-{env}.bin` (where `env` comes from `GET /api/system`). Streams the downloaded `ArrayBuffer` to `POST /api/firmware`. Shows the same progress bar. If no matching asset exists for a release, that release is greyed out.
- Maximum 5 releases shown; controlled by the `per_page` query parameter.
- Error handling: network failure fetching GitHub API shows "GitHub unreachable — use file upload"; missing asset shows "No firmware for {env} in this release".
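
The asset-matching step of the GitHub releases tab can be sketched as follows. This is a Python model of logic that actually runs as browser JavaScript; the function name `find_firmware_asset`, the sample data, and the example URLs are hypothetical, while `tag_name`, `prerelease`, `assets`, `name`, and `browser_download_url` are field names from the public GitHub releases API.

```python
from typing import Optional


def find_firmware_asset(release: dict, env: str) -> Optional[dict]:
    """Return the asset named projectMM-{env}.bin, or None if the release has none."""
    wanted = f"projectMM-{env}.bin"
    for asset in release.get("assets", []):
        if asset.get("name") == wanted:
            return asset
    return None


# Hypothetical sample shaped like a GitHub releases API response.
releases = [
    {
        "tag_name": "v1.7.0",
        "prerelease": False,
        "assets": [
            {"name": "projectMM-esp32dev.bin",
             "browser_download_url": "https://example.invalid/esp32dev.bin"},
            {"name": "projectMM-pc-macos.tar.gz",
             "browser_download_url": "https://example.invalid/pc-macos.tar.gz"},
        ],
    },
    {"tag_name": "nightly", "prerelease": True, "assets": []},
]

env = "esp32dev"  # in the real UI this value comes from GET /api/system
for release in releases:
    asset = find_firmware_asset(release, env)
    state = "flashable" if asset else "greyed out"
    print(release["tag_name"], state)
```

A release without a matching asset yields `None`, which corresponds to the greyed-out row and the "No firmware for {env} in this release" message.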
Tests:
- Unit test: `pal::ota_*` PC stub writes bytes correctly and returns success.
- Unit test: `POST /api/firmware` with a 1 KB payload calls `ota_write` N times and `ota_end` once.
- Live test: flash a known-good binary via `POST /api/firmware`; assert version field in `GET /api/system` matches expected value after reboot.
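
The call-count expectation in the endpoint test can be illustrated with a small model. This is a Python sketch, not the project's C++ test: `FakeOta`, `stream_firmware`, and the 256-byte chunk size are hypothetical stand-ins for the `pal::ota_*` PC stubs and the streaming route.

```python
class FakeOta:
    """Stand-in for the pal::ota_* PC stubs; records calls instead of flashing."""

    def __init__(self) -> None:
        self.write_calls = 0
        self.end_calls = 0
        self.abort_calls = 0
        self.received = bytearray()

    def begin(self) -> bool:
        return True

    def write(self, chunk: bytes) -> bool:
        self.write_calls += 1
        self.received += chunk
        return True

    def end(self) -> bool:
        self.end_calls += 1
        return True

    def abort(self) -> None:
        self.abort_calls += 1


def stream_firmware(ota: FakeOta, body: bytes, chunk_size: int = 256) -> dict:
    """Feed the body to the OTA layer in fixed-size chunks, then finish once."""
    if not ota.begin():
        return {"error": "ota_begin failed"}
    for offset in range(0, len(body), chunk_size):
        if not ota.write(body[offset:offset + chunk_size]):
            ota.abort()
            return {"error": "ota_write failed"}
    if not ota.end():
        return {"error": "ota_end failed"}
    return {"ok": True}


ota = FakeOta()
payload = bytes(range(256)) * 4  # 1 KB payload
result = stream_firmware(ota, payload)
print(result, ota.write_calls, ota.end_calls)
```

With the assumed chunk size, a 1 KB payload produces four `write` calls and a single `end` call; the real N depends on how the HTTP server delivers body chunks.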
Definition of Done
- [x] `pal::ota_begin/write/end/abort` implemented in `Pal.h` (ESP32 wraps `esp_ota_*`; PC stubs return true and print)
- [x] `OtaHandle` type alias: `esp_ota_handle_t` on ESP32, `int` on PC
- [x] `POST /api/firmware` using `onPostBinary` (no body buffering; chunks stream directly to `pal::ota_write`)
- [x] `onPostBinary` added to `HttpServer.h` (ESP32: upload/body callback; PC: buffers once, calls chunk handler)
- [x] `OtaState.h` inline globals (`g_otaStatus`, `g_otaPct`, `g_otaHandle`) shared between AppRoutes and module
- [x] `FirmwareUpdateModule` registered in `CoreRegistrations.cpp`
- [x] `FirmwareUpdateModule` auto-created by `ensureInfraModules` on first boot (embedded only); not permanent, boot guard is the safety net
- [x] `"env": BUILD_TARGET` added to `GET /api/system` via `SystemStatus::fillSystemJson`
- [x] Frontend: File Upload tab in FirmwareUpdateModule card (file picker + XHR progress bar)
- [x] Frontend: GitHub Releases tab (fetches public API, 1 hr sessionStorage cache, matches `projectMM-{env}.bin`)
- [x] Frontend: Version badge in status bar when newer non-prerelease GitHub release has a matching asset
- [x] `flash_chip_mode()` and `psram_mode()` PAL functions added; wired into `SystemStatus` controls and `fillSystemJson` (`psram_mode` inside `totalPsramKb_ > 0` guard)
- [x] Light mode fix: `#preview-section` gets `background: #f5f5fa` override so sticky bar blends with body in day mode
- [x] 11 new unit tests (pal::ota_* stubs, OtaState globals, FirmwareUpdateModule lifecycle); 375/375 pass
- [x] PC live test: PASS; ESP32 live tests: MM-70BC PASS, MM-ESP32 PASS
- [x] esp32dev and esp32s3_n16r8 build successfully
Result
| Metric | Value |
|---|---|
| Unit tests | 375/375 pass (11 new) |
| PC live tests | 13/13 PASS |
| ESP32 live tests | MM-70BC PASS, MM-ESP32 PASS |
| esp32dev build | SUCCESS (~1.16 MB) |
| esp32s3_n16r8 build | SUCCESS (~1.16 MB) |
| `POST /api/firmware` | Returns `{"ok":true}` (PC stub); streams via body callback on ESP32 |
| Version badge | Shown when GitHub latest release tag newer than firmware_version and has matching .bin |
| `flash_chip_mode` / `psram_mode` | PAL functions + SystemStatus controls; `psram_mode` guarded by `totalPsramKb_ > 0` |
| Light mode | #preview-section background override added; day mode preview bar now white |
Backlogged from this sprint (per user Q decisions):
- Device-side WS progress events during OTA (Q1-B); XHR upload.onprogress used instead
- Nightly pre-release channel in version badge (Q2-B); only stable releases shown
- Live test: flash binary + verify version (requires hardware access with known binary on GitHub releases)
Retrospective
What went well:
- onPostBinary cleanly separated from onPost (no RAM buffering for large binaries)
- Dual-OTA partitions already in place; no CSV changes needed
- Browser-direct GitHub API (CORS OK on public repos) avoids device internet access
- checkForUpdate() uses sessionStorage to rate-limit GitHub API calls (1 hr TTL)
- OtaState.h inline globals give clean shared state between the HTTP route and module without any RTOS sync overhead
- flash_chip_mode / psram_mode PAL functions fit the existing pattern cleanly; compile-time CONFIG_SPIRAM_MODE_OCT is the reliable OPI indicator
What was tricky:
- ota_end re-fetches the next OTA partition via esp_ota_get_next_update_partition(nullptr) since set_boot_partition is not called before ota_end; easy to miss
- Light mode required an explicit #preview-section background override because sticky positioning pins the dark base color through the body override
Seeds for future sprints:
- Sprint 2 (CI Release Pipeline) is the next step: assets must be published before the GitHub tab or version badge show anything useful
- Live OTA test (flash + verify version change) is backlogged until Sprint 2 ships firmware assets
Sprint 2: CI Release Pipeline
Scope: Attach firmware binaries as GitHub release assets on every tagged release, and add a nightly pre-release that rebuilds automatically each night. These assets are what `FirmwareUpdateModule`'s GitHub tab fetches.
Depends on: Sprint 1 (asset naming convention must match what FirmwareUpdateModule expects).
Complexity: S (YAML only; no C++ or Python changes).
Summary
| Part | Description | Est |
|---|---|---|
| `release.yml` alignment | Pin Python 3.12 + PlatformIO `<7`, add PIO package cache, asset upload job | S |
| `nightly.yml` | New workflow: cron 02:00 UTC, idempotent delete+recreate nightly pre-release | S |
| Total | | S |
Asset naming convention
| Asset | Source path | Matches env |
|---|---|---|
| `projectMM-esp32dev.bin` | `.pio/build/esp32dev/firmware.bin` | `esp32dev` |
| `projectMM-esp32s3_n16r8.bin` | `.pio/build/esp32s3_n16r8/firmware.bin` | `esp32s3_n16r8` |
| `projectMM-pc-macos.tar.gz` | `deploy/build/pc/macos/projectMM` | PC (macOS CI runner) |
| `projectMM-pc-windows.zip` | `deploy/build/pc/windows/projectMM.exe` | PC (Windows, Sprint 3) |
Scope (confirmed)
`.github/workflows/release.yml` (aligned and complete):
- Triggered by `push: tags: ['v*']` or `workflow_dispatch` with tag input.
- PC build: `astral-sh/setup-uv@v5` + `uv run deploy/build.py -target pc` (aligned with `ci.yml`).
- ESP32 builds: Python 3.12 pinned, PlatformIO pinned to `<7`, PlatformIO package cache added (both gaps vs `ci.yml` fixed).
- `upload-assets` job uses `gh release upload --clobber` after all three build jobs pass.
`.github/workflows/nightly.yml` (new):
- Triggered on `schedule: cron: '0 2 * * *'` (02:00 UTC daily) and `workflow_dispatch`.
- Identical build matrix to `release.yml` (macOS PC + esp32dev + esp32s3_n16r8).
- `publish-nightly` job: deletes existing `nightly` release + tag, re-creates as pre-release titled `Nightly (YYYY-MM-DD)` with short commit SHA in notes. Idempotent: the `gh release delete ... || true` guard handles first run.
- The nightly pre-release appears in `FirmwareUpdateModule`'s GitHub releases tab with a "pre-release" badge.
Backlogged from this sprint:
- `scripts/list_pio_envs.py` + `deploy/build.py --all-envs`: not needed while only 2 ESP32 envs exist; pick up when ESP32-P4 is added (Release 8 Sprint 1).
- PC Linux build artifact: macOS binary covers the main use case for now.
Definition of Done
- [x] `release.yml`: PC build uses `uv run deploy/build.py -target pc` (was raw cmake)
- [x] `release.yml`: ESP32 jobs pin Python 3.12 and `platformio<7` (was `3.x` + unpinned)
- [x] `release.yml`: PlatformIO package cache added to both ESP32 jobs
- [x] `nightly.yml` created: cron 02:00 UTC + `workflow_dispatch`; builds macOS PC + esp32dev + esp32s3_n16r8
- [x] `nightly.yml`: `publish-nightly` job deletes and re-creates `nightly` pre-release (idempotent)
- [x] Asset names match `FirmwareUpdateModule` expectations (`projectMM-{env}.bin`)
Result
| Metric | Value |
|---|---|
| `release.yml` | Aligned with `ci.yml`; tag-triggered; 3 build jobs + upload |
| `nightly.yml` | New; cron 02:00 UTC; delete+recreate nightly pre-release |
| Asset naming | projectMM-esp32dev.bin, projectMM-esp32s3_n16r8.bin, projectMM-pc-macos.tar.gz |
| Python/PlatformIO | Pinned to 3.12 and <7 in both workflows (consistent with ci.yml) |
| Unit tests | 375/375 (no new tests; YAML-only sprint) |
Retrospective
What went well:
- release.yml already existed with the core structure; this sprint was alignment + nightly, not a rebuild from scratch
- gh release delete nightly --yes --cleanup-tag 2>/dev/null || true pattern is clean and idempotent; no third-party action needed
- Build matrix in nightly.yml is identical to release.yml so both stay in sync by copy
What was tricky:
- release.yml had python-version: '3.x' and unpinned PlatformIO; these would have broken on PlatformIO 7 release or a Python 3.13 runner update (same issue ci.yml already fixed months ago)
Seeds for future sprints:
- Sprint 3 (Windows) adds projectMM-pc-windows.zip to both release.yml and nightly.yml
- Release 8 Sprint 1 (ESP32-P4) adds a third ESP32 build job; at that point list_pio_envs.py becomes worth adding to avoid three copies of the same job
- Once a tagged release exists, run a manual workflow_dispatch to verify the upload-assets path end-to-end
Sprint 3: Windows Build
Scope: projectMM builds and runs as a native Windows binary (CMake + Clang/Ninja via llvm-mingw). CI job produces a `.zip` release artifact, and both `release.yml` and `nightly.yml` gain a `build-pc-windows` job. macOS build is unaffected.
Deferred from: Release 5 original scope.
Summary
| Part | Description | Est |
|---|---|---|
| Winsock2 guards | `WsServer.h` + `Pal.h` UDP: POSIX socket calls behind `#ifdef _WIN32` | S |
| Socket shim unification | `PcSocketShims.h` unified header; `PcSockets.h` deleted; both consumers updated | S |
| CMake + build scripts | Ninja on Windows, `pc_platform()` helper, per-platform binary paths in all deploy scripts | M |
| Windows memory stats | `VirtualQueryEx` for `free_heap_bytes()`; MemBoot/MemLive correct on Windows | M |
| Output file split | `live-results-pc-{platform}.json`, per-env MD files; `live-results-all.json` dropped | S |
| CI integration | `build-pc-windows` job in `ci.yml`, `release.yml`, `nightly.yml`; `.zip` artifact | S |
| Test + misc fixes | `/tmp/` relative-path fix, UTF-8 encoding, dangling-pointer `onInputRemoved` fix | M |
| Total | | XL |
Planned scope
- `#ifdef _WIN32` guards in `WsServer.h` and `Pal.h` (Winsock2 instead of POSIX sockets).
- `ws2_32` link in `CMakeLists.txt` and `tests/CMakeLists.txt`.
- GitHub Actions CI job on `windows-latest` (build + unit tests); adds `projectMM-pc-windows.zip` to release and nightly artifact lists.
- `deploy/build.py`: `-G Ninja` on Windows (single-config generator, binary at predictable path).
Additional work discovered during implementation:
- `src/pal/MemoryStats.h`: Windows branch using `GetDiskFreeSpaceExA` (no `sys/statvfs.h`).
- `tests/ws_test_client.h`: full rewrite with `_wstc*` socket shims for Winsock2 compatibility.
- `tests/test_module_manager.cpp`, `tests/test_reorder.cpp`: fix hardcoded `/tmp/` paths to relative paths (no `/tmp/` on Windows).
- `deploy/unittest.py`: add `encoding='utf-8'` to markdown write (Windows default codec lacks emoji support); add blank-line stripping from `run-tests.log` output.
- `deploy/_lib.py`: add `pc_platform()` helper (`"windows"` / `"macos"` / `"linux"`).
- All deploy scripts (`build.py`, `run.py`, `livetest.py`, `summarise.py`, `unittest.py`): paths updated from `deploy/build/pc/` to `deploy/build/pc/{platform}/` and logs from `*-pc.log` to `*-pc-{platform}.log`.
- Socket code sharing: `src/pal/PcSocketShims.h` created as a shared header with unified `_ws*` shim functions (open, close, accept, connect, recv, send, wait). `src/pal/PcSockets.h` merged in and deleted. `WsServer.h` and `ws_test_client.h` both include `PcSocketShims.h`; `ws_test_client.h` no longer has its own `_wstc*` duplicates.
- Windows MemBoot/MemLive: `pal::free_heap_bytes()` on Windows implemented via `VirtualQueryEx` (walks committed private virtual memory regions; "free" = 512 MB ceiling minus committed). `pal::total_heap_kb()` returns the matching ceiling. `pal::s_freeHeapCache_()` caches the last scan so `max_alloc_bytes()` avoids a second scan in the same tick. `MemBoot` and `MemLive` lines now appear in the Windows server log with correct per-module deltas.
- Output file improvements: `live-results-pc.json` renamed to `live-results-pc-{platform}.json`; `live-results-all.json` dropped entirely. `docs/status/live-results.md` split into per-env files (`live-results-pc-windows.md`, `live-results-esp32dev.md`, `live-results-esp32s3_n16r8.md`). `livetest_out.txt` deleted. `deploy/summarise.py` rewritten to read per-device JSON files directly.
- `docs/developer-guide/deploy.md`: fully updated for Windows (toolchain requirements, Ninja, llvm-mingw, `uv run` throughout, per-platform binary paths, CI table with Windows row, log file table).
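
A minimal sketch of how a `pc_platform()` helper and the per-platform paths could fit together. Only the helper name, the three slugs, and the `deploy/build/pc/{platform}/` and `*-pc-{platform}.log` patterns come from the notes above; the function bodies, the `sys_platform` parameter, and the `pc_build_dir` / `pc_log_name` helpers are assumptions for illustration.

```python
import sys
from pathlib import Path


def pc_platform(sys_platform: str = sys.platform) -> str:
    """Map a sys.platform value to the slug used in build paths and log names."""
    if sys_platform.startswith("win"):
        return "windows"
    if sys_platform == "darwin":
        return "macos"
    return "linux"


def pc_build_dir(platform: str) -> Path:
    """Per-platform build directory (hypothetical helper)."""
    return Path("deploy") / "build" / "pc" / platform


def pc_log_name(stem: str, platform: str) -> str:
    """Log file name following the *-pc-{platform}.log pattern (hypothetical helper)."""
    return f"{stem}-pc-{platform}.log"


platform = pc_platform()
print(pc_build_dir(platform).as_posix(), pc_log_name("build", platform))
```

Passing the platform string as a parameter keeps the mapping testable on any host, which matters when the same scripts run on macOS, Windows, and CI runners.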
Definition of Done
- [x] `src/core/WsServer.h`: Winsock2 shims replace POSIX socket calls under `#ifdef _WIN32`.
- [x] `src/pal/Pal.h`: UDP functions (`udp_bind`, `udp_recv`, `udp_send`, `udp_broadcast`) compile on Windows.
- [x] `src/pal/MemoryStats.h`: Windows branch provides `getMemoryStats()` via `GetDiskFreeSpaceExA`.
- [x] `CMakeLists.txt` + `tests/CMakeLists.txt`: `ws2_32` linked on Windows.
- [x] `tests/ws_test_client.h`: cross-platform socket shims; test helper compiles on Windows.
- [x] `deploy/build.py`: Ninja generator selected on Windows; binary at `deploy/build/pc/windows/projectMM.exe`.
- [x] All 375 unit tests pass on Windows (Clang 18 + llvm-mingw-ucrt + Ninja).
- [x] `ci.yml`: `build-pc-windows` job (build + unit tests).
- [x] `release.yml` + `nightly.yml`: `build-pc-windows` job; `.zip` artifact included.
- [x] Deploy scripts use `deploy/build/pc/{platform}/` paths; macOS logs remain `*-pc-macos.log`.
- [x] All 13 live test groups pass on Windows (`all_pc.py`: 4 passed, 0 failed).
- [x] `src/pal/PcSocketShims.h`: unified socket shim header; `PcSockets.h` merged in and deleted; `WsServer.h` and `ws_test_client.h` both include `PcSocketShims.h`.
- [x] `pal::free_heap_bytes()` on Windows via `VirtualQueryEx`; `MemBoot` / `MemLive` lines appear in Windows server log with correct per-module deltas.
- [x] Live result files split per platform (`live-results-pc-{platform}.json`); per-env `docs/status/live-results-*.md` files generated; `live-results-all.json` dropped.
- [x] `docs/developer-guide/deploy.md` updated: Windows toolchain requirements, `uv run` throughout, per-platform paths, CI table with Windows row.
Result
| Metric | Value |
|---|---|
| Unit tests (Windows) | 375 / 375 passed |
| Test assertions (Windows) | 1807 / 1807 passed |
| Live test groups (Windows) | 13 / 13 passed (133 assertions) |
| `all_pc.py` result | 4 / 4 steps passed |
| Toolchain | Clang 18.1.8 + llvm-mingw-20240619-ucrt-x86_64 + Ninja |
| Build target | projectMM.exe (Windows x86-64) |
| Files changed | 50 source, deploy, docs, and CI files |
| macOS tests (unaffected) | unchanged (375 pass in CI) |
| Windows MemBoot | Correct per-module deltas via VirtualQueryEx (frag% display deferred — see backlog) |
| Socket shim files | PcSocketShims.h unified; PcSockets.h deleted |
Retrospective
What went well:
- The socket shim pattern (_ws* unified in PcSocketShims.h) kept platform branches out of class bodies and eliminated the duplicate _wstc* block that had grown alongside the original _ws* set.
- pc_platform() in _lib.py gives a single source of truth for the three-way platform string; all deploy scripts and CI reference it.
- UDP broadcast loopback (Art-Net test5) works on Windows without any changes.
- VirtualQueryEx gives realistic, per-module heap deltas in MemBoot — the approach is correct even though the frag% display has a pending fix.
- Splitting live-results-all.json into per-platform files and live-results.md into per-env files removes the aggregation step and makes each device's results self-contained.
What was tricky:
- sys/statvfs.h (MemoryStats.h) and arpa/inet.h (ws_test_client.h) are not available on Windows and required additional platform guards not in the original scope.
- /tmp/ hardcoded in several test files causes silent failures on Windows (file not written, module not loaded, findById returns nullptr). Fixed by switching to relative paths.
- Python's Path.write_text (used for the markdown write in deploy/unittest.py) falls back to the system default encoding on Windows (cp1252), which cannot encode the emoji (✅) used in test-results.md. Fixed by passing encoding='utf-8'.
- The build/pc/ flat layout conflated macOS and Windows artifacts. Restructured to build/pc/{platform}/ in the same sprint.
- Latent dangling-pointer bug exposed by Windows: DriverLayer stores raw EffectsLayer* pointers in sources_[]. When delete_all_modules() freed an EffectsLayer, driver1 retained a stale pointer and crashed on the next loop() tick. macOS tolerated the dangling access; Windows terminated the process. Fixed by adding Module::onInputRemoved(Module*).
- Windows heap measurement: HeapWalk (Win32 default heap) and GlobalMemoryStatusEx (system-wide RAM) were tried and rejected before settling on VirtualQueryEx. HeapWalk walks the wrong heap (Win32 vs UCRT malloc), giving ~20 KB values that triggered check_alloc() denial and crashed the server. GlobalMemoryStatusEx returns 4 GB+ with no per-allocation granularity.
- frag% overflow: largNow * 100u overflows uint32_t at the ~500 MB values seen on Windows (the product wraps once largNow exceeds 2^32 / 100, about 42.9 MB). Fixed in pal::memEvent() with a (uint64_t) cast; the same overflow exists in StatefulModule.h and Scheduler.cpp and is deferred to the backlog (see index.md).
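
The frag% overflow can be reproduced with a short arithmetic model. This is a Python sketch that masks the product to 32 bits to imitate `uint32_t` wrap-around; the formula `100 - largest * 100 / free` and the sample sizes are assumptions for illustration, since the notes only name the `largNow * 100u` product.

```python
UINT32_MASK = 0xFFFFFFFF
MIB = 1024 * 1024


def frag_pct_u32(free_bytes: int, largest_bytes: int) -> int:
    """Buggy path: the multiply wraps at 32 bits before the divide."""
    wrapped = (largest_bytes * 100) & UINT32_MASK
    return 100 - wrapped // free_bytes


def frag_pct_u64(free_bytes: int, largest_bytes: int) -> int:
    """Fixed path: widen before multiplying so the product cannot wrap."""
    return 100 - (largest_bytes * 100) // free_bytes


free_bytes = 500 * MIB      # roughly the Windows ceiling minus committed memory
largest_bytes = 400 * MIB   # hypothetical largest free block
print("32-bit:", frag_pct_u32(free_bytes, largest_bytes))
print("64-bit:", frag_pct_u64(free_bytes, largest_bytes))
```

On ESP32-sized heaps the product stays far below 2^32, which is why the bug only surfaced once the Windows build reported values in the hundreds of megabytes.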
Seeds for future sprints:
- The Linux PC build is not yet CI-tested; CI covers macOS and Windows only. A CI leg on ubuntu-latest would close the triangle (low effort: same uv run deploy/build.py -target pc command, linux slug already in pc_platform()).
- The timing-sensitive test "Scheduler timing accumulator tracks SpinModule within 5%" occasionally flakes under heavy CI load on Windows (passes in isolation). Consider widening epsilon or moving to a dedicated timing fixture.
- Windows MemBoot frag% accuracy: apply the (uint64_t) overflow fix to StatefulModule.h and Scheduler.cpp, and fix call order (max_alloc before free_heap in Scheduler). Tracked in the backlog.
- Effects animate slowly in the WebGL preview on Windows but not on macOS. Root cause not yet identified (push-rate throttle, time-unit mismatch, or browser queue lag). Tracked in the backlog.
Sprint 5: Scenario Baseline and extends
Scope: Populate `deploy/test/scenario-baseline.json` from a real ESP32 run; add `"extends"` inheritance to scenario files; wire `--compare-baseline` into `deploy/all.py`.
Deferred from: Sprint 10 retrospective seeds.
Complexity: M
Summary
| Part | Description | Est |
|---|---|---|
| `extends` support | Single-level inheritance in `scenario.py`, `live_suite.py`, `test_scenarios.cpp` (identical logic in each) | M |
| New scenario files | `base-pipeline-64x64.json` and `four-layers.json` using `extends` | S |
| Baseline population | Run on MM-70BC hardware, commit `scenario-baseline.json` | S |
| `all_pc.py` integration | `_run_scenario_baseline()`: start server, compare baseline, non-fatal | S |
| Total | | M |
Planned scope
- Run `deploy/scenario.py --update-baseline` against MM-70BC (ESP32-S3); commit result.
- Implement `"extends"` key (single-level): load parent steps and prepend them; child metadata wins.
- `deploy/all_pc.py`: after live tests, start the PC server and run `deploy/scenario.py --compare-baseline`; print warning on regressions (non-fatal).
- Add `base-pipeline-64x64.json` and `four-layers.json` stress scenarios, both using `"extends"`.
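
The single-level `"extends"` rule (parent steps first, child steps appended, child metadata wins) can be sketched as below. This is an illustrative Python version, not the project's actual `load_scenario()`: the `steps` and `name` keys, the step shapes, and resolving the parent by file stem in the same directory are all assumptions.

```python
import json
import tempfile
from pathlib import Path


def load_scenario(path: Path) -> dict:
    """Load a scenario; if it has "extends", prepend the parent's steps (one level only)."""
    scenario = json.loads(path.read_text(encoding="utf-8"))
    parent_name = scenario.pop("extends", None)
    if parent_name is None:
        return scenario
    # Assumption: the parent is named by file stem and lives next to the child.
    parent_path = path.parent / f"{parent_name}.json"
    parent = json.loads(parent_path.read_text(encoding="utf-8"))
    # Parent metadata first; a parent's own "extends" is deliberately not followed.
    merged = {k: v for k, v in parent.items() if k not in ("steps", "extends")}
    merged.update({k: v for k, v in scenario.items() if k != "steps"})  # child wins
    merged["steps"] = parent.get("steps", []) + scenario.get("steps", [])
    return merged


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "base-pipeline-32x32.json").write_text(json.dumps({
        "name": "base-pipeline-32x32",
        "steps": [{"op": "build_pipeline"}],
    }), encoding="utf-8")
    (root / "base-pipeline-64x64.json").write_text(json.dumps({
        "name": "base-pipeline-64x64",
        "extends": "base-pipeline-32x32",
        "steps": [{"op": "resize", "width": 64, "height": 64}],
    }), encoding="utf-8")
    resolved = load_scenario(root / "base-pipeline-64x64.json")

print(resolved["name"], [step["op"] for step in resolved["steps"]])
```

Because the parent's own `"extends"` key is ignored, chains are not followed, which matches the decision to defer recursive inheritance.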
Definition of Done
- [x] `deploy/scenario.py` `load_scenario()`: single-level `"extends"` resolves parent file and prepends parent steps
- [x] `deploy/live_suite.py` `run_scenario()`: same `extends` resolution for live tests
- [x] `tests/test_scenarios.cpp` `resolve_extends()`: same resolution so C++ scenario replay handles the new files
- [x] `deploy/test/scenarios/base-pipeline-64x64.json`: extends `base-pipeline-32x32`, adds resize to 64x64
- [x] `deploy/test/scenarios/four-layers.json`: extends `two-layers`, adds GameOfLife + Noise layers
- [x] `deploy/all_pc.py`: `_run_scenario_baseline()` starts PC server, runs scenario `--compare-baseline`, non-fatal
- [x] `deploy/test/scenario-baseline.json`: populated from MM-70BC (ESP32-S3); 7 scenarios, all steps measured
- [x] 375/375 unit tests pass; all scenario replay tests include extended scenarios
Result
| Metric | Value |
|---|---|
| Unit tests | 375/375 pass (20 new assertions from extended scenario replay) |
| PC live tests | 13/13 PASS (including 2 new extended scenarios) |
| Baseline | Populated from MM-70BC: 7 scenarios, ~177 KB free heap at base pipeline |
| Scenarios | 7 files (5 pre-existing + base-pipeline-64x64, four-layers) |
Backlogged from this sprint:
- system_fps baseline threshold too tight (50%+ swings between runs on hardware); tracked in cross-release backlog.
- Recursive extends (parent can itself extend) deferred until a chain is actually needed.
Retrospective
What went well:
- Single-level extends is a clean pattern: parent steps first, child steps appended, child metadata wins. No ambiguity.
- The three places that load scenario JSON (scenario.py, live_suite.py, test_scenarios.cpp) each got identical logic in ~8 lines; no shared abstraction needed at this scale.
- _run_scenario_baseline() in all_pc.py cleanly manages its own server lifetime (start, run, terminate) as a self-contained helper.
What was tricky:
- system_fps is too volatile for a 20% threshold on hardware (WiFi task preemption causes 30-65% swings between identical runs). The baseline pass/fail signal is unreliable for fps; heap metrics are stable and useful.
- The live suite (live_suite.py) loads scenario JSON independently of scenario.py, so extends resolution had to be added in three places. A shared Python utility would reduce duplication if more scenario features are added.
Seeds for future sprints:
- Scope baseline checks to heap metrics only (heap_free, max_alloc); skip fps or widen its threshold to 50%.
- Recursive extends if scenario hierarchies deepen.
Sprint 6: Static RAM Hardening for Classic ESP32
Scope: Reduce the permanent `.bss` footprint on esp32dev (no PSRAM) to give module setup more headroom. The log ring buffer and WiFi buffer allocation are the two largest tunable levers.
Identified in: R6S8 live device analysis (esp32dev free-heap floor ~109 KB, only 19 KB above 90 KB reserve; fragmentation 55%+).
Complexity: S
Summary
| Part | Description | Est |
|---|---|---|
| Ring buffer resize | `LOG_RING_CAP=32`, `LOG_RING_ENTRY=64` (2 KB, saves 6 KB `.bss`); test updated | XS |
| `check_alloc` dual guard | Adds max-alloc block check alongside free-heap reserve; `printf` on failure reason | S |
| WiFi buffer investigation | `-DCONFIG_ESP32_WIFI_DYNAMIC_RX_BUFFER_NUM` attempted then removed (pre-compiled framework conflict) | XS |
| Total | | S |
Planned scope
- Set ring to 32 entries x 64 bytes = 2 KB on all devices. Saves 6 KB vs the original 8 KB ring on classic ESP32. Trade-off: the ring holds ~32 lines instead of ~64.
- Tune WiFi dynamic RX buffer count: `-DCONFIG_ESP32_WIFI_DYNAMIC_RX_BUFFER_NUM=16` was attempted in `build_flags` but the symbol is already defined in the framework's pre-compiled `sdkconfig.h`, causing a redefinition warning. Flag removed; WiFi buffer count cannot be overridden this way with the Arduino framework blob.
- Upgrade `pal::check_alloc` to a dual guard: `free_heap_bytes() >= bytes + reserve` AND `max_alloc_bytes() >= bytes`. Surface the failure reason via `printf` (`"check_alloc: reserve violation"` vs `"check_alloc: largest block too small"`).
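
The dual guard can be modelled in a few lines. This is a Python sketch of the decision logic only, not the C++ in `Pal.h`: the function signature, returning the reason instead of printing it, and the sample heap figures are assumptions, while the two conditions and the two messages follow the description above.

```python
from typing import Tuple

KB = 1024


def check_alloc(bytes_needed: int, free_heap: int, max_alloc: int, reserve: int) -> Tuple[bool, str]:
    """Allow an allocation only if the reserve survives AND one block is big enough."""
    if free_heap < bytes_needed + reserve:
        return False, "check_alloc: reserve violation"
    if max_alloc < bytes_needed:
        return False, "check_alloc: largest block too small"
    return True, ""


free_heap = 109 * KB   # free-heap floor quoted for esp32dev
reserve = 90 * KB      # reserve quoted for esp32dev

# Enough total heap, but fragmented: no single block can hold 16 KB.
print(check_alloc(16 * KB, free_heap, max_alloc=12 * KB, reserve=reserve))
# The request would eat into the reserve even though a large block exists.
print(check_alloc(20 * KB, free_heap, max_alloc=64 * KB, reserve=reserve))
# Small request that satisfies both conditions.
print(check_alloc(8 * KB, free_heap, max_alloc=64 * KB, reserve=reserve))
```

The first case is the fragmentation blind spot: a free-heap-only check would have approved it, and the allocation would then have failed at runtime.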
Definition of Done
- [x] `src/core/Logger.cpp`: `LOG_RING_CAP = 32`, `LOG_RING_ENTRY = 64` (2 KB total)
- [x] `platformio.ini`: `-DCONFIG_ESP32_WIFI_DYNAMIC_RX_BUFFER_NUM=16` attempted and removed; conflicts with pre-compiled `sdkconfig.h` in Arduino framework blob
- [x] `src/pal/Pal.h` `check_alloc()`: dual guard checks both free heap reserve AND max contiguous block; `printf` on failure
- [x] `tests/test_logger.cpp`: ring overflow test updated for 32-entry cap (38 entries pushed, 32 survive from `entry6`)
- [x] esp32dev and esp32s3_n16r8 build successfully
- [x] 375/375 unit tests pass
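
The ring overflow expectation (38 entries pushed, 32 survive starting at `entry6`) and the size arithmetic can be checked with a small model. This is a Python sketch using `collections.deque` as a stand-in for the static ring in `Logger.cpp`; only the two constants and the test numbers come from the notes, and the truncation rule is an assumption.

```python
from collections import deque

LOG_RING_CAP = 32     # entries kept
LOG_RING_ENTRY = 64   # bytes per entry

ring = deque(maxlen=LOG_RING_CAP)  # a full deque drops its oldest entry on append
for i in range(38):
    # Assumption: entries are truncated to leave one byte for a terminator.
    ring.append(f"entry{i}"[:LOG_RING_ENTRY - 1])

ring_bytes = LOG_RING_CAP * LOG_RING_ENTRY
saved_bytes = 64 * 128 - ring_bytes  # previous ring was 64 entries x 128 bytes

print(len(ring), ring[0], ring[-1])
print(ring_bytes, saved_bytes)
```

Pushing 38 entries into a 32-slot ring evicts the first six, so the oldest survivor is the seventh entry pushed.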
Result
| Metric | Value |
|---|---|
| Ring size | 32 x 64 = 2 KB (was 64 x 128 = 8 KB, saves 6 KB .bss) |
| WiFi RX buffers | Not changed — symbol already defined in framework sdkconfig.h; -D override causes redefinition warning and was removed |
| `check_alloc` | Dual guard: free-heap reserve + max-alloc block; `printf` on refusal |
| Unit tests | 375/375 pass |
| esp32dev build | SUCCESS |
| esp32s3_n16r8 build | SUCCESS |
Backlogged from this sprint:
- Verify WiFi buffer flag runtime effect on hardware (depends on pioarduino compiling WiFi component from source vs precompiled blob).
- Update MemBoot/MemLive baseline table in docs with post-hardening numbers from MM-C1BC (requires live flash and measurement).
Retrospective
What went well:
- Ring size change is a 2-line edit in Logger.cpp with a single test update; zero risk.
- Dual check_alloc guard closes the fragmentation blind spot cleanly — the previous check passed when total free was enough but no single block was large enough to satisfy the allocation.
- printf for the guard failure message avoids a Logger dependency in Pal.h.
What was tricky:
- -DCONFIG_ESP32_WIFI_DYNAMIC_RX_BUFFER_NUM=16 in build_flags causes a redefinition warning: sdkconfig.h in the pre-compiled Arduino framework blob already defines the symbol. The WiFi component is not compiled from source, so Kconfig values are fixed at framework build time and cannot be overridden via compiler flags. Removed the flag; WiFi buffer tuning requires a custom framework build or a sdkconfig.defaults approach outside the standard pioarduino setup.
Seeds for future sprints:
- WiFi RX buffer tuning via sdkconfig.defaults (requires custom framework build); backlogged.
- Move ring buffer to PSRAM-backed heap allocation in setup() if 2 KB .bss still matters (requires PAL extension).
Sprint 7: GET /api/log Frontend Panel¶
Scope: Surface the existing ring buffer (R6S2) in the frontend as a live log panel, removing the need for a serial monitor during field debugging.
Identified in: R6S2 retrospective ("ring buffer exists; streaming it to the frontend is the obvious next step") and R6 backlog.
Complexity: S
Summary¶
| Part | Description | Est |
|---|---|---|
| Log colouring | _logClass() for warn/error; .log-warn/.log-error CSS classes; light-mode overrides | S |
| Scroll management | logAtBottom flag; auto-scroll pauses on manual scroll-up, resumes at bottom | S |
| History backfill | GET /api/log fetched on WS connect; ring entries prepended to panel | S |
| Total | | S |
Planned scope¶
- WS push is already in place per-line via g_logWsPushFn (format {"t":"log","m":"..."}). Sprint scope is completing the frontend panel.
- Frontend enhancements: WARN/ERROR line colouring (keyword match); auto-scroll pauses on manual scroll-up; backfill history from GET /api/log on WS connect. LOG_MAX_LINES = 100 JS constant; clear button resets scroll state.
Definition of Done¶
- [x] src/frontend/app.js: _logClass(text) colours lines containing warn (amber) or error/fail (red)
- [x] src/frontend/app.js: logAtBottom flag; auto-scroll only when panel is scrolled to bottom; pauses on manual scroll-up
- [x] src/frontend/app.js: GET /api/log fetched on wsConn.onopen; ring entries backfilled into panel
- [x] src/frontend/app.js: clear button resets logAtBottom = true
- [x] src/frontend/style.css: .log-warn (amber) and .log-error (red) classes; light-mode overrides
- [x] PC live tests pass (no regressions)
Result¶
| Metric | Value |
|---|---|
| Log panel | Collapsible below module list; LOG_MAX_LINES = 100 |
| WS push | Per-line {"t":"log","m":"..."} format (pre-existing); kept as-is |
| Colouring | warn lines amber; error/fail lines red; light-mode overrides |
| Auto-scroll | Pauses on manual scroll-up; resumes when scrolled back to bottom |
| History backfill | GET /api/log fetched on WS connect; all ring entries added to panel |
| PC live tests | 13/13 PASS |
Backlogged from this sprint:
- Timestamp and log level as separate columns (structured rows) deferred; raw message text is sufficient for current debugging needs.
- Batched WS push ({"type":"log","entries":[...]}) deferred; per-line push at current log rates does not cause measurable overhead.
Retrospective¶
What went well:
- The per-line WS push (g_logWsPushFn) and frontend handler were already in place. Sprint 7 completed the UX: colouring, scroll-pause, and history backfill.
- History backfill on WS connect (4 lines) means a browser that opens 10 s after boot sees the startup log immediately — the most common debugging scenario.
- Keyword-based colouring (no prefix parsing) works with the actual log message format, which does not use systematic level prefixes.
What was tricky:
- Scroll-pause needs a passive: true scroll listener and an explicit logAtBottom flag tracking scrollTop + clientHeight >= scrollHeight - 5. The 5 px tolerance avoids false "not at bottom" on fractional scroll positions.
Seeds for future sprints:
- If log volume grows, consider adding a "level": "warn"|"error" field to the WS frame so colouring is exact rather than keyword-matched.
Sprint 8: Heap Safety, HTTP OOM Hardening, Live-Test Correctness, and isPermanent() Removal¶
Scope: Four interrelated changes that close the remaining stability gaps on classic ESP32 and clean up dead runtime scaffolding: a per-module opt-in heap check before committing control changes, OOM recovery in the HTTP layer, a set of live-test correctness fixes that had been producing spurious duplicate modules, and full removal of the isPermanent() mechanism that turned out to be dead code.
Identified in: MM-C1BC crash (70x28 GridLayout saved to LittleFS; on browser refresh serializeJson to std::string threw std::bad_alloc → abort); live-test review (PreviewModule top-level, duplicate SystemStatusModule/FirmwareUpdateModule from scenario scripts, Windows MD file written on macOS).
Summary¶
| Part | Description | Est |
|---|---|---|
| A: controlAllocBytes heap guard | Per-module controlAllocBytes() opt-in; setControl checks heap before committing large allocs | M |
| B: HTTP OOM hardening | Heap check before JSON serialisation in AppRoutes; 503 on OOM; DynamicJsonDocument size guard | M |
| C: Live-test correctness | INFRA_TYPES/SINGLETON_TYPES, summarise.py ESP32 guard, unittest.py cwd fix, state pollution fix | M |
| D: FirmwareUpdateModule docs | User-guide page and module doc page | S |
| E: isPermanent() removal | Delete isPermanent() from all modules and base class; remove 403 route guard | S |
| Total | | L |
Part A: Per-module controlAllocBytes heap guard¶
Problem: A user could resize GridLayout to 70x28 (1960 pixels). On a fresh boot the allocation succeeded. After WiFi and the HTTP server started, free heap was ~60 KB with a largest block of ~36 KB. On browser refresh GET /api/modules called serializeJson(doc, std::string body) — the growing-string operator new chain threw std::bad_alloc and the device aborted.
Root cause in two parts:
1. setControl("width", 70) had no heap check; the value was saved to LittleFS and survived reboots.
2. serializeJson used a growing std::string that fragmented the already-tight heap.
Fix in StatefulModule.h:
- readThrough(ControlDescriptor&): static helper that reads a control's current value as float (mirrors the existing writeThrough).
- virtual size_t controlAllocBytes(const char* key) const: returns 0 by default (opt-in; modules with no significant heap impact need not override it).
- Both setControl() overloads: save old value with readThrough, write new value, call controlAllocBytes(key), call pal::check_alloc(need) if need > 0, revert with writeThrough(old) and return false if the check fails. No onUpdate() is called on failure; the control stays at its previous value.
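The save, write, check, revert sequence can be modelled on a host with a single numeric control. This is a minimal sketch under assumptions: GuardedControl and its fields are hypothetical, heapAvailable stands in for pal::check_alloc(), and bytesPerUnit is an arbitrary cost per unit of growth.

```cpp
#include <cstddef>

// Host-side model of one control protected by an allocation guard.
struct GuardedControl {
    float value = 10.0f;              // live control value
    float safeValue = 10.0f;          // last value that passed the guard
    std::size_t bytesPerUnit = 6;     // assumed heap cost of growing by one unit
    std::size_t heapAvailable = 100;  // stand-in for what check_alloc would accept
    int updates = 0;                  // counts onUpdate()-style commits

    // Opt-in hook: extra bytes the pending value would need; 0 for shrinks.
    std::size_t controlAllocBytes() const {
        return value > safeValue
                   ? static_cast<std::size_t>(value - safeValue) * bytesPerUnit
                   : 0;
    }

    bool setControl(float requested) {
        const float old = value;  // readThrough: remember the old value
        value = requested;        // writeThrough: apply the new value
        const std::size_t need = controlAllocBytes();
        if (need > 0 && need > heapAvailable) {
            value = old;          // revert; no update callback runs
            return false;
        }
        safeValue = value;        // commit (what onUpdate() would do)
        ++updates;
        return true;
    }
};
```

Writing the new value before calling the hook lets controlAllocBytes() read the module's own fields, which is why the revert step is needed when the check fails.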
Fix in GridLayout.h:
- safeWidth_, safeHeight_, safeDepth_ (uint32_t, default 10/10/1): the last dimensions successfully committed to DriverLayer.
- controlAllocBytes() override: computes (newNPix - safeNPix) * sizeof(RGB) * 2 (EffectsLayer double-buffer delta); returns 0 for shrinks.
- onUpdate(): updates safe fields and rebuilds only after the heap check passes (the framework reverts the control automatically on failure).
- buildMappings_(): changed to new (std::nothrow) with a null-check log and early return on OOM.
- setup()/teardown(): initialize/reset safe fields.
size_t controlAllocBytes(const char* /*key*/) const override {
const uint32_t newNPix = (uint32_t)width_ * height_ * depth_;
const uint32_t safeNPix = (uint32_t)safeWidth_ * safeHeight_ * safeDepth_;
return newNPix > safeNPix ? (newNPix - safeNPix) * sizeof(RGB) * 2 : 0;
}
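As a worked example of the formula above, the sketch below evaluates it for the 70x28 resize that triggered the crash, starting from the 10x10x1 defaults. It assumes a 3-byte RGB struct; the real sizeof(RGB) in the codebase is not stated here, and growBytes is a hypothetical free-function version of the override.

```cpp
#include <cstddef>
#include <cstdint>

struct RGB { uint8_t r, g, b; };  // assumption: 3 bytes per pixel

// Bytes the double-buffered effects layer would additionally need when
// growing from the last safe dimensions to the requested ones; 0 for shrinks.
constexpr std::size_t growBytes(uint32_t newW, uint32_t newH, uint32_t newD,
                                uint32_t safeW, uint32_t safeH, uint32_t safeD) {
    return (newW * newH * newD > safeW * safeH * safeD)
               ? static_cast<std::size_t>(newW * newH * newD - safeW * safeH * safeD)
                     * sizeof(RGB) * 2
               : 0;
}
```

For 10x10x1 → 70x28x1 this gives (1960 - 100) * 3 * 2 = 11160 bytes, the amount the guard would ask check_alloc to approve before the resize is committed.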
New tests (tests/test_layouts.cpp, +3):
- GridLayout - growing dimensions updates mappingCount (width 10→16, height 10→16)
- GridLayout - shrinking dimensions updates mappingCount (32x32 → 8x8)
- GridLayout - healthReport reflects current dimensions after resize
Part B: HTTP OOM hardening¶
Problem: serializeJson(doc, std::string body) uses std::string::push_back internally. On a heap-fragmented ESP32 this triggers a growing series of reallocations (16 → 32 → 64 → … → N bytes), each of which can throw std::bad_alloc. The final HttpResponse{body} string construction was also a heap allocation.
Fix in AppRoutes.cpp: All GET routes that serialized a JsonDocument to a std::string now use:
std::string body;
body.reserve(measureJson(doc) + 1); // one allocation, exact size, no growth
serializeJson(doc, body);
return HttpResponse{200, "application/json", std::move(body)};
measureJson(doc) traverses the document without allocating, returns the exact byte count. reserve() does a single heap allocation of that size. serializeJson then fills the string without reallocating. std::move steals the buffer into HttpResponse without copying. Net result: one heap allocation per response instead of O(log N).
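The effect of reserve() can be checked on any host without ArduinoJson. The sketch below is illustrative only: countRegrowths is a hypothetical helper, and n stands in for the byte count that measureJson(doc) would return.

```cpp
#include <cstddef>
#include <string>

// Appends n bytes one at a time (as a growing serializer would) and counts
// how often the string's capacity changes, i.e. how often it reallocates.
inline std::size_t countRegrowths(std::size_t n, bool reserveFirst) {
    std::string body;
    if (reserveFirst) body.reserve(n + 1);  // single up-front allocation
    std::size_t regrowths = 0;
    std::size_t cap = body.capacity();
    for (std::size_t i = 0; i < n; ++i) {
        body.push_back('x');
        if (body.capacity() != cap) {  // capacity changed: a reallocation happened
            ++regrowths;
            cap = body.capacity();
        }
    }
    return regrowths;
}
```

With reserveFirst set, the count is zero for any n; without it, the count depends on the standard library's growth policy but is always at least one once n exceeds the small-string buffer.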
Fix in HttpServer.h (ESP32 section): All handler dispatch points — onGet, onPost (request-complete callback), onDelete, onPatch (request-complete callback) — now wrap the handler call in:
try {
auto resp = handler(...);
req->send(resp.status, resp.contentType.c_str(), resp.body.c_str());
} catch (const std::bad_alloc&) {
req->send(503, "application/json", R"({"error":"low heap"})");
}
The 503 response uses a string literal (no heap). This catches any remaining allocation failure (e.g. JsonDocument internal pool) and returns a clean HTTP error instead of calling abort().
Part C: Live-test correctness¶
Problem 1: PreviewModule created as a top-level module. test0_infra added preview1 with no parent_id. PreviewModule belongs as a child of DriverLayer (per spec). Because PreviewModule was in INFRA_TYPES, it survived delete_all_modules() and accumulated as an orphan across tests. Scenario scripts tried to add preview1 with parent_id="driver1" but add_or_exists silently accepted the already-running top-level instance.
Problem 2: Duplicate singleton modules. Scenario files include steps for NetworkModule, SystemStatusModule etc. When the running instance had a different id than the scenario's id (e.g. device has systemstatus1, scenario adds sysinfo1), add_or_exists would create a second instance. Same for FirmwareUpdateModule: ensureInfraModules() recreates it on every boot, so it always survives delete_all_modules(), yet the scenario runner could add a second one under a new id.
Problem 3: live-results-pc-windows.md written on macOS. deploy/live/live-results-pc-windows.json is committed from Windows CI. On macOS, summarise.py read it and wrote docs/status/live-results-pc-windows.md. The per-env MD should only be written by the machine that actually ran those tests.
Fixes in deploy/live_suite.py:
- PreviewModule removed from INFRA_TYPES: it is not infrastructure; it requires a parent driver and must be re-created per-test.
- New SINGLETON_TYPES = INFRA_TYPES | {"FirmwareUpdateModule"}: types where only one instance should ever exist.
- _scenario_step: before add_or_exists, checks type_ in SINGLETON_TYPES and type_ in client.types_present(). If true, logs a skip and returns success; this prevents a second instance being created when the live state has one under a different id.
- test0_infra: removed the top-level preview1 add (no driver exists at that point).
- test1_ripples_pipeline: adds preview1 as child of driver1 right after driver1 is created.
- test5_artnet_loopback: adds preview_tx as child of tx_drv and preview_rx as child of rx_drv.
- test7_multi_layout: adds preview7 as child of driver7.
Fix in deploy/summarise.py:
- CURRENT_PC_PLATFORM derived from platform.system() at module load ("darwin" → "macos", etc.).
- _write_live_results_md: skips writing live-results-pc-{other}.md when env != f"pc-{CURRENT_PC_PLATFORM}". The foreign-platform JSON data is still loaded for the index.md summary table; only the per-env MD file is gated.
Part D: FirmwareUpdateModule user documentation¶
Added docs/modules/system/firmware-update-module.md covering: what the module does (surfaces OTA progress; upload handled by AppRoutes), the two controls (update_status, update_pct), three ways to flash (browser file picker, URL API call, uv run deploy/flash.py), and platform notes (URL OTA returns 501 on PC). Added to mkdocs.yml nav and to the category table and reference list in docs/user-guide/modules/index.md.
Part E: Remove isPermanent()¶
Problem: isPermanent() was a virtual method on StatefulModuleBase intended to prevent certain modules from being deleted at runtime. ModuleManager was the only class that returned true. However, ModuleManager is never placed in owned_[] — it manages the list but is not part of it. This meant the isPermanent() check in removeModule() was never triggered. The mechanism was dead code, and its presence was actively misleading: it suggested FirmwareUpdateModule should be permanent (it had the override until Sprint 8 Part D), when the correct safety net is the boot guard in ensureInfraModules().
Root cause: ModuleManager can be targeted via its kId for control updates (via the special-case path in setControl), but it is never added to owned_[] via addModule. removeModule() iterates owned_[], so it can never find and delete ModuleManager. The 403 Permanent response in AppRoutes.cpp was therefore unreachable.
Fix:
- src/core/StatefulModule.h: removed virtual bool isPermanent() const declaration.
- src/core/ModuleManager.h: removed bool isPermanent() const override { return true; }; removed RemoveResult::Permanent from the enum; updated removeModule() comment.
- src/core/ModuleManager.cpp: removed isPermanent() check in removeModule(); removed isPermanent() check in replaceModule(); removed obj["permanent"] from getModulesJson().
- src/core/AppRoutes.cpp: removed case RemoveResult::Permanent (403) from the DELETE handler; updated replace error message to remove mention of "permanent".
- src/frontend/app.js: replace button and delete button now always rendered; mod.permanent was undefined for all modules anyway (field no longer emitted by the server).
- tests/test_module_manager.cpp: removed ModuleManager - isPermanent returns true test.
- tests/test_system_info.cpp: removed FirmwareUpdateModule is permanent test.
Definition of Done¶
- [x] src/core/StatefulModule.h: readThrough(), controlAllocBytes() virtual hook, heap check in both setControl() overloads with auto-revert on failure
- [x] src/modules/layouts/GridLayout.h: safeWidth_/Height_/Depth_ safe dimension tracking; controlAllocBytes() override; buildMappings_() uses new (std::nothrow)
- [x] tests/test_layouts.cpp: 3 new GridLayout resize tests; 378/378 pass
- [x] src/core/AppRoutes.cpp: all GET JSON routes use measureJson + reserve + std::move; no growing-string allocation
- [x] src/core/HttpServer.h (ESP32): try/catch(std::bad_alloc) in all four handler dispatch points; returns HTTP 503 on OOM
- [x] deploy/live_suite.py: PreviewModule removed from INFRA_TYPES; SINGLETON_TYPES guard in _scenario_step (includes FirmwareUpdateModule); preview1/7/tx/rx wired as children of their driver
- [x] deploy/summarise.py: CURRENT_PC_PLATFORM guard; live-results-pc-windows.md not written on macOS
- [x] docs/modules/system/firmware-update-module.md created; added to mkdocs.yml nav and module index
- [x] MM-C1BC: 70x28 GridLayout removed via DELETE /api/modules/tree1; device stable
- [x] isPermanent() virtual method removed from StatefulModuleBase; RemoveResult::Permanent enum value removed; all call sites in ModuleManager.cpp, AppRoutes.cpp, and app.js cleaned up; two now-stale tests removed
- [x] 376/376 unit tests pass; mkdocs serve produces no warnings for the new doc page
Result¶
| Metric | Value |
|---|---|
| Unit tests | 376/376 pass (3 new GridLayout resize tests added, 2 stale isPermanent tests removed) |
| New virtual hook | controlAllocBytes() in StatefulModuleBase; default returns 0 (opt-in) |
| GridLayout | Rejects oversized resize when heap check fails; always allows shrink |
| HTTP OOM | try/catch(std::bad_alloc) in all ESP32 handler dispatchers; returns 503 |
| Serialization | measureJson + reserve = 1 allocation per response (was O(log N) growing chain) |
| Live test fix | PreviewModule always child of its DriverLayer; no more top-level orphans |
| Singleton guard | SINGLETON_TYPES prevents second instance of NetworkModule, SystemStatusModule, FirmwareUpdateModule etc. |
| summarise.py | live-results-pc-windows.md not written on macOS |
| FirmwareUpdateModule docs | New user-facing doc page; wired into mkdocs nav |
| isPermanent() | Removed entirely: virtual method, enum value, all call sites, frontend gate, 2 tests |
| Device recovery | MM-C1BC: bad 70x28 GridLayout deleted via REST; device stable |
Retrospective¶
What went well:
- controlAllocBytes as a virtual hook with a zero default keeps the mechanism entirely opt-in: modules with no significant heap impact add no code and pay no overhead.
- measureJson + reserve eliminates the growing-string problem with no memory overhead and no static buffer: the heap allocation is still there, but it is now exactly one call of exactly the right size.
- try/catch(std::bad_alloc) in HttpServer.h is the correct safety net: even if measureJson+reserve is not used on some future route, the device will return 503 rather than crash.
- SINGLETON_TYPES in the scenario runner is a clean, low-ceremony fix: one set, one guard, solves both the SystemStatusModule and FirmwareUpdateModule duplication problems without touching the scenario JSON files.
- Removing PreviewModule from INFRA_TYPES and wiring it as a child of its driver per-test is architecturally correct and required no changes to the scenario JSON files (they already had parent_id: "driver1").
What was tricky:
- Initial fix for AppRoutes used a static char kJsonSerBuf[12288] BSS buffer. Rejected: it added 12 KB of static RAM with no saving elsewhere. Replaced by the measureJson+reserve pattern which has the same single-allocation property with zero BSS cost.
- The CURRENT_PC_PLATFORM guard in summarise.py is for the per-env MD file only; the JSON data from other platforms is still loaded and appears in index.md. Care was needed not to break the cross-platform summary table.
What was tricky (Part E):
- isPermanent() looked load-bearing because it appeared in removeModule(), replaceModule(), the JSON output, and the frontend. Tracing the actual call graph revealed it was never reached: ModuleManager is not in owned_[], so the check at owned_[i].module->isPermanent() was never true for any module.
- The boot guard in ensureInfraModules() / ensureNetworkModules() is the correct protection for infra modules: it recreates missing modules on the next reboot rather than refusing DELETE at the API level. This is more resilient and less surprising to users.
Seeds for future sprints:
- Other modules that allocate in onUpdate() (e.g. EffectsLayer on buffer resize) should also implement controlAllocBytes().
- The try/catch in HttpServer.h only catches std::bad_alloc. A broader catch (const std::exception&) would catch any handler exception, which could be useful as the handler set grows.
- The SINGLETON_TYPES guard prevents a second instance but does not fix the id mismatch (the running module may have a different id than the scenario expects). A future improvement would be a get_or_create_by_type helper that returns the existing instance id if one is found.
Sprint 9: select Control Type¶
Scope: Add CtrlType::Select to the control system: a uint8_t-backed dropdown registered via addControl(..., "select") followed by addControlValue("label") calls, or via a single addControl() call that takes a pre-declared static options array. Option strings are C string literals and live in flash (.rodata), not DRAM; only the pointer array costs heap, and with the static-array form even that is zero. The backing field stores a uint8_t index (1 byte), not the selected string. The schema emits an "options" array; the frontend renders a native <select> element. No changes to existing control types.
Summary¶
| Part | Description | Est |
|---|---|---|
| A: Memory strategy | ControlEntry union (inline 8-char + heap overflow), addControlValue(), teardown() cleanup | M |
| B: addControl overloads | 3-arg select overload; remove defaults from uint8_t generic to avoid ambiguity | S |
| C: Schema + persistence | getSchema() emits "options" array; saveState()/loadState() by index; setControl() by label | M |
| D: Frontend dropdown | app.js renders select as <select>; sends index on change | M |
| Total | | L |
Design decision: store index (uint8_t) or selected string?¶
This is the first design choice that must be made before any code is written.
Option A: store as uint8_t index (recommended)¶
The backing field holds the zero-based index of the selected option. Saved state JSON looks like "waveform": 2.
Pros:
- 1 byte in RAM and in saved JSON — critical on heap-constrained ESP32.
- Consistent with existing Uint8 control type; setControl(), saveState(), getControlValues() all reuse the same numeric path with minimal changes.
- Fast in loop(): module code reads a uint8_t directly and uses it in a switch or array index — no string comparison.
- controlAllocBytes() returns 0 naturally (index is always 1 byte regardless of option count).
Cons:
- Saved state is not self-documenting: "waveform": 2 requires knowing the options list to interpret.
- If option order changes between firmware builds, a saved index silently maps to the wrong option (breaking change). Option order must be treated as part of the API, the same as JSON key names.
- REST and WebSocket clients must look up the schema to translate an index to a label.
Option B: store as char[] string value¶
The backing field holds the selected label as a C string. Saved state JSON looks like "waveform": "triangle".
Pros:
- Self-documenting in saved state and in logs.
- Robust to option reordering: the saved string still matches the right option after a firmware update that reorders the list (though renaming an option still breaks it).
- REST clients can post {"waveform": "triangle"} without knowing indices.
Cons:
- char[N] backing field: typically 16-32 bytes vs 1 byte for a uint8_t. On a module with four select controls that adds 60-124 bytes of RAM overhead.
- loop() must do strcmp or a linear search to map the string back to an integer branch — meaningfully slower on the hot path for effects modules.
- setControl() needs a new string-matching path: iterate options list to find the index, then write the string into the backing buffer. More code; more failure modes.
- Can still silently break if an option is renamed (different failure mode from index reordering, but equally possible).
Verdict: Option A (uint8_t index)¶
RAM and hot-path performance win on ESP32. The option-stability risk is the same class of breaking change as renaming a JSON key — already documented as requiring a version bump. The schema always includes the "options" array, so clients are never left guessing.
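The hot-path argument for the index can be made concrete. The sketch below is an assumption-laden illustration, not the SineEffectModule code: waveSample is a hypothetical helper, and the index order (0 sine, 1 triangle, 2 square, 3 sawtooth) is chosen for this example only.

```cpp
#include <cmath>
#include <cstdint>

// Maps a Select index straight to a waveform with a switch: no string
// comparison on the render path. phase is expected in [0, 1); output in [-1, 1].
inline float waveSample(uint8_t waveform, float phase) {
    switch (waveform) {
        case 0:  // sine
            return std::sin(phase * 6.2831853f);
        case 1:  // triangle: rises -1 → 1 over the first half, falls back after
            return phase < 0.5f ? 4.0f * phase - 1.0f : 3.0f - 4.0f * phase;
        case 2:  // square
            return phase < 0.5f ? 1.0f : -1.0f;
        case 3:  // sawtooth
            return 2.0f * phase - 1.0f;
        default:  // out-of-range index: stay silent rather than read garbage
            return 0.0f;
    }
}
```

Because the switch cases encode option positions, reordering the option list changes behaviour for saved states, which is the stability risk accepted in the verdict.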
Part A: memory strategy for option strings¶
This is the second design decision, and the most important for ESP32.
Where do the strings live?¶
"sine", "triangle", "square" are C string literals. On ESP32 (and on all targets) they are stored in .rodata — flash memory, not DRAM. Accessing them requires a flash read (cached), but they cost zero bytes of DRAM. This is true whether they appear as addControlValue("sine") inline arguments or as elements of a static constexpr const char*[] array.
The only DRAM cost is the pointer array that ControlDescriptor::options points to: 4 bytes per option on ESP32 (32-bit pointer). For a 4-option select that is 16 bytes of DRAM.
Two registration styles and their costs¶
Style 1: addControlValue() — ergonomic, one small heap allocation
// in setup():
addControl(waveform_, "waveform", "select");
addControlValue("sine");
addControlValue("triangle");
addControlValue("square");
addControlValue("sawtooth");
The pointer array is heap-allocated during setup(). To avoid realloc churn, the first addControl("select") call pre-allocates a fixed-size slot array (e.g. 8 pointers = 32 bytes). Each addControlValue() fills the next slot; no reallocation until the pre-allocated capacity is exceeded. clearControls() frees the array on teardown().
DRAM cost: 32 bytes pre-allocated pointer slots (fixed per select control, regardless of actual option count up to 8).
Style 2: static array — zero heap, all in flash
// file scope or class body (own code):
static constexpr const char* kWaveforms[] = {"sine", "triangle", "square", "sawtooth"};
// in setup():
addControl(waveform_, "waveform", "select", kWaveforms, 4);
kWaveforms is a constexpr pointer array — it lives in .rodata (flash) alongside the string literals. The descriptor stores the pointer to kWaveforms directly. No heap allocation at any point. clearControls() does not free it.
Library-defined arrays work identically, provided they are declared inline in the header:
// in the library header (C++17 inline variable — one definition across all TUs):
inline const char* const EMITTER_NAMES[EMITTER_COUNT] = {
"orbitaldots", "swarmingdots", "audiodots", "lissajous",
"borderrect", "noisekaleido", "cube", "fluidjet"
};
// in setup():
addControl(emitter_, "emitter", "select", EMITTER_NAMES, EMITTER_COUNT);
The inline keyword (C++17) gives the array external linkage with a single definition, so the final binary holds exactly one copy of the pointer array even when the header is included from multiple translation units. Without inline, a const array at namespace scope has internal linkage, so each translation unit that uses it can carry its own copy, wasting flash. Either way no DRAM is used; inline just prevents flash duplication.
DRAM cost: 0 bytes. The array and all string data are in flash.
Distinguishing owned vs. borrowed options¶
clearControls() must know whether to free(d.options). Add a single flag to ControlDescriptor:
bool ownsOptions; // true: heap-allocated by addControlValue(); false: static array
Set to true by addControlValue(), false by the static-array addControl() overload.
Recommended strategy¶
Use the static array form (Style 2) for any module that ships with a fixed option list — which is the common case (effects, drivers, layouts all have known-at-compile-time options). The addControlValue() form is available as convenience for prototyping or for option lists that are built dynamically from discovered resources (e.g. a list of available effects).
ControlDescriptor changes¶
struct ControlDescriptor {
const char* key;
const char* uiType;
CtrlType type;
uintptr_t ptr;
float minVal;
float maxVal;
float defVal;
const char** options; // pointer to option labels; null for non-Select
uint8_t optionCount; // number of valid entries in options
bool ownsOptions; // true: heap-allocated; free on clearControls()
};
Per-descriptor overhead for non-Select controls: 4 + 1 + 1 = 6 bytes (pointer + count + owns flag, with padding likely making it 8 bytes). For a module with 8 controls of which 1 is a 4-option static-array Select: 8 × 8 = 64 bytes extra DRAM — modest.
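The 6-bytes-padded-to-8 estimate can be checked on a desktop compiler by modelling the 32-bit layout. This is a sketch under assumptions: uint32_t stands in for every 4-byte ESP32 pointer and for uintptr_t, the struct names are hypothetical, and the result relies on 4-byte alignment for uint32_t and float (true on ESP32 and mainstream desktop ABIs).

```cpp
#include <cstddef>
#include <cstdint>

// Descriptor before the Select fields were added (32-bit model).
struct DescriptorBefore32 {
    uint32_t key;     // const char* on ESP32
    uint32_t uiType;  // const char* on ESP32
    uint8_t  type;    // CtrlType
    uint32_t ptr;     // uintptr_t on ESP32
    float minVal, maxVal, defVal;
};

// Same layout plus the three new fields.
struct DescriptorAfter32 {
    uint32_t key;
    uint32_t uiType;
    uint8_t  type;
    uint32_t ptr;
    float minVal, maxVal, defVal;
    uint32_t options;      // const char** on ESP32
    uint8_t  optionCount;
    bool     ownsOptions;
};

// Extra bytes each descriptor carries after the change, padding included.
constexpr std::size_t kExtraPerDescriptor =
    sizeof(DescriptorAfter32) - sizeof(DescriptorBefore32);

// Extra bytes for a module that registers n controls.
constexpr std::size_t extraForModule(std::size_t n) { return n * kExtraPerDescriptor; }
```

Under those assumptions the difference is 8 bytes per descriptor, so a module with 8 controls pays 64 bytes, matching the estimate above.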
Complete memory picture (ESP32dev, 4-option static-array select)¶
| Item | Location | DRAM cost |
|---|---|---|
| Option strings ("sine" etc.) | flash (.rodata) | 0 bytes |
| kWaveforms[] pointer array | flash (.rodata) | 0 bytes |
| ControlDescriptor::options pointer | heap (controls_ array) | 4 bytes |
| ControlDescriptor::optionCount | heap (controls_ array) | 1 byte |
| ControlDescriptor::ownsOptions | heap (controls_ array) | 1 byte |
| Backing uint8_t waveform_ field | module instance (heap) | 1 byte |
| Total extra DRAM per select | | ~7 bytes |
With addControlValue() style instead: add 32 bytes for the pre-allocated pointer slots on heap.
Part B: addControl overloads and addControlValue¶
Add CtrlType::Select to the enum in StatefulModule.h:
enum class CtrlType : uint8_t { Float, Uint8, Uint32, Bool, String, EditStr, FloatConst, Select };
Static-array overload (preferred — zero heap):
// Register a select control backed by a uint8_t index. options must outlive the module
// (use static constexpr). min/max are derived from optionCount.
void addControl(uint8_t& variable, const char* key,
const char* const* options, uint8_t optionCount);
Sets CtrlType::Select, stores the pointer directly, ownsOptions = false, maxVal = optionCount - 1.
Dynamic addControlValue() overload (ergonomic, small heap allocation):
// Register a select control; call addControlValue() immediately after for each label.
// uiType must be "select" — required for API consistency with all other addControl overloads.
void addControl(uint8_t& variable, const char* key, const char* uiType);
// Append a label to the most recently registered Select control.
// Pre-allocates 8 pointer slots on first call; no realloc within that capacity.
void addControlValue(const char* label);
addControlValue() finds the last CtrlType::Select descriptor, allocates 8-slot pointer array on the first call (ownsOptions = true), fills the next slot, increments optionCount, updates maxVal.
clearControls(): iterate descriptors; for any with ownsOptions == true, call free(d.options).
maxVal on the descriptor equals optionCount - 1 in both cases, so the existing range-clamp in setControl() rejects out-of-range indices without changes.
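The dynamic form and its cleanup can be sketched in isolation. This is a hedged illustration, not the StatefulModule.h implementation: SelectDescriptor is a trimmed-down stand-in for ControlDescriptor, the function names carry a Sketch suffix to mark them as hypothetical, and this version refuses a ninth label instead of growing the array.

```cpp
#include <cstdint>
#include <cstdlib>

// Only the descriptor fields that matter for the option list.
struct SelectDescriptor {
    const char** options = nullptr;
    uint8_t optionCount = 0;
    bool ownsOptions = false;
    float maxVal = 0.0f;
};

constexpr uint8_t kOptionSlots = 8;  // assumed pre-allocated capacity

// Appends a label. The slot array is allocated once, on the first call.
// Returns false on allocation failure, when the slots are full, or when the
// descriptor borrows a static array (which must never be written to).
inline bool addControlValueSketch(SelectDescriptor& d, const char* label) {
    if (d.options == nullptr) {
        d.options = static_cast<const char**>(
            std::malloc(kOptionSlots * sizeof(const char*)));
        if (d.options == nullptr) return false;
        d.ownsOptions = true;
    }
    if (!d.ownsOptions || d.optionCount >= kOptionSlots) return false;
    d.options[d.optionCount++] = label;                 // label itself stays in .rodata
    d.maxVal = static_cast<float>(d.optionCount - 1);   // keeps the range clamp valid
    return true;
}

// Frees the slot array only when this descriptor allocated it.
inline void freeOwnedOptionsSketch(SelectDescriptor& d) {
    if (d.ownsOptions) std::free(d.options);
    d.options = nullptr;
    d.optionCount = 0;
    d.ownsOptions = false;
    d.maxVal = 0.0f;
}
```

Only the pointer slots are heap memory; the labels are never copied, so freeing the array on clearControls() is the whole cleanup.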
Part C: schema emission, value reads, and persistence¶
getSchema() — add Select case:
case CtrlType::Select:
c["value"] = *reinterpret_cast<const uint8_t*>(d.ptr);
c["default"] = (uint8_t)d.defVal;
{
JsonArray opts = c["options"].to<JsonArray>();
for (uint8_t j = 0; j < d.optionCount; ++j) opts.add(d.options[j]);
}
break;
The "type" field in the schema JSON is already d.uiType ("select"), so no other changes are needed for the frontend to identify the control.
getControlValues() — add Select case: identical to Uint8 (emit the index as an integer).
setControl() — add Select case: identical to Uint8 (clamp to [0, optionCount-1], write through the uint8_t* pointer, call onUpdate()).
saveState() / loadState(): no changes needed — Select follows the Uint8 save/load path (save as integer, load as integer via applyPending_()).
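Because Select rides the Uint8 path, the only Select-specific rules are the index range and the index-to-label lookup that clients perform against the schema. A minimal sketch, assuming clamping semantics as described above; clampSelectIndex and selectLabel are hypothetical helpers, not functions from the codebase.

```cpp
#include <cstdint>

// Clamps a requested index into [0, optionCount - 1]; an empty list yields 0.
inline uint8_t clampSelectIndex(int requested, uint8_t optionCount) {
    if (optionCount == 0 || requested < 0) return 0;
    const int maxIdx = optionCount - 1;
    return static_cast<uint8_t>(requested > maxIdx ? maxIdx : requested);
}

// Translates a stored index back to its label; nullptr when out of range.
inline const char* selectLabel(const char* const* options, uint8_t optionCount,
                               uint8_t index) {
    return index < optionCount ? options[index] : nullptr;
}
```

A saved state such as "waveform": 2 is only meaningful together with the options array, so a loader that sees an index beyond the current list should treat it as stale rather than trust it.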
Part D: frontend dropdown rendering (app.js)¶
getSchema() already emits "type": "select" for the control. The frontend renderControl() function currently renders sliders for numeric types and checkboxes for bools. Add a branch for "select":
if (ctrl.type === 'select' && Array.isArray(ctrl.options)) {
const sel = document.createElement('select');
ctrl.options.forEach((label, i) => {
const opt = document.createElement('option');
opt.value = i;
opt.textContent = label;
if (i === ctrl.value) opt.selected = true;
sel.appendChild(opt);
});
sel.onchange = () => sendControlUpdate(modId, ctrl.key, parseInt(sel.value, 10));
return sel;
}
WebSocket state updates that arrive mid-session must also update the <select> element's selectedIndex, following the same pattern as slider value updates.
Definition of Done¶
- [x] CtrlType::Select added to enum in StatefulModule.h
- [x] ControlDescriptor extended with options, optionCount, ownsOptions fields; non-Select defaults to nullptr / 0 / false
- [x] addControl(uint8_t&, key, options, count): static-array form; zero heap; ownsOptions = false
- [x] addControl(uint8_t&, key, "select") + addControlValue(label): dynamic form; "select" uiType required for API consistency; ownsOptions = true; generic uint8_t overload has explicit min/max (no defaults) to eliminate 3-arg overload ambiguity
- [x] clearControls() and destructor free options via freeOwnedOptions_() only when ownsOptions == true
- [x] getSchema(): Select case emits "value", "default", "options" array
- [x] getControlValues(): Select emits index as integer (same as Uint8)
- [x] setControl(): Select reuses Uint8 path via readThrough/writeThrough; value stored as uint8_t
- [x] saveState()/loadState()/applyPending_(): Select handled same as Uint8
- [x] SineEffectModule: waveform select (sine/square/triangle/sawtooth); wave_() helper applies chosen shape; static-array form used
- [x] LinesEffectModule: axis select (all/x/y/z); loop conditionally draws each plane; static-array form used
- [x] Frontend: <select class="select-input"> rendered for type == "select", addEventListener('change') posts index, live WebSocket updates reflected via select.value
- [x] CSS: .select-input matches .text-input styling; light-theme override included
- [x] Tests: 10 new cases in test_stateful_module.cpp: registration, schema, setControl, saveState round-trip, hot-reload leak safety, addControlValue dynamic form, SineEffect waveform, LinesEffect axis
- [x] 386/386 tests pass (10 new, 1 existing test updated for new SineEffect waveform control)
- [x] deploy/summarise.py: ESP32 MD guard added; skips writing live-results-esp32-*.md when no current (non-last-good) ESP32 JSON exists, preventing stale ESP32 sections from being rewritten on all_pc.py runs
- [x] deploy/unittest.py: run_tee accepts optional cwd; test binary invoked with absolute path
- [x] tests/test_module_manager.cpp: auto-pipeline test calls disableStatePersistence() before teardown to prevent writing state files to the working directory
- [x] state/grid1.json: reset to 16x16x1 (segfault fix: stale 1013x1018x32 values from a previous live-test run with the server started from the wrong directory)
Result¶
| Metric | Value |
|---|---|
| Unit tests | 386/386 pass (10 new, 1 updated) |
| New control type | CtrlType::Select backed by uint8_t index; zero heap for static-array form (6 B of descriptor DRAM) |
| New ControlDescriptor fields | options (4 B), optionCount (1 B), ownsOptions (1 B) per descriptor |
| addControl overloads | Static-array form (zero heap) and dynamic addControlValue() form; both require explicit "select" uiType |
| Hot-reload safety | freeOwnedOptions_() called from clearControls() and destructor |
| SineEffectModule | New waveform select: sine / square / triangle / sawtooth |
| LinesEffectModule | New axis select: all / x / y / z |
| Frontend | <select> element rendered for type == "select"; live WS updates applied; CSS styled |
| Schema | "options" array emitted by getSchema(); "value" and "default" as integer index |
| PC live tests | All pass; test4 (device discovery) expected FAIL without ESP32 on network |
See test results for full pass/fail breakdown.
Retrospective¶
What went well:
- The CtrlType::Select case slots cleanly into every existing switch in StatefulModule.h because the backing type (uint8_t) is identical to Uint8. readThrough/writeThrough/applyPending_/saveState all just needed case CtrlType::Select: fall-through onto the existing Uint8 case.
- Static-array form (addControl(var, key, kArr, N)) costs exactly 6 bytes of DRAM per descriptor and zero heap — kArr and all string literals live in flash. This is the right default for any module with a compile-time-fixed option list.
- The ownsOptions flag on ControlDescriptor cleanly separates the two ownership modes. clearControls() and the destructor both call the same freeOwnedOptions_() helper, so hot-reload and final teardown are handled identically.
- Library-provided arrays (e.g. inline const char* const EMITTER_NAMES[]) work directly with the static-array overload — no adaptation needed.
- Making "select" an explicit uiType argument on the dynamic form (addControl(var, key, "select")) aligns it with every other addControl overload. The API is now fully consistent: the uiType string is always the third argument, regardless of control type.
- The summarise.py ESP32 guard (skip rewriting live-results-esp32-*.md when no current ESP32 JSON exists) prevents all_pc.py runs from silently overwriting the last good ESP32 status with a stale timestamp.
What was tricky:
- clearControls() previously just set controlCount_ = 0 without freeing anything. Adding freeOwnedOptions_() there required also calling it from the destructor; overlooking either site would cause a leak on hot-reload or module destruction respectively.
- The memmove in runSetup() that promotes enabled_ to index 0 copies ControlDescriptor structs byte-for-byte, including options pointers. This is safe — the pointers remain valid — but it means two descriptor slots briefly point at the same options array. The old slot is immediately overwritten, so there is no double-free risk. Worth understanding before reading this code path.
- addControlValue() uses realloc on each call. The sprint doc proposed 8-slot pre-allocation; the implementation went with simple realloc instead (simpler code, acceptable since it only runs during setup()). Backlogged if profiling ever shows setup-time fragmentation.
- Adding "select" as an explicit uiType argument to addControl(uint8_t&, key, uiType) required removing the default min/max from the generic uint8_t overload to avoid a 3-argument ambiguity. No existing caller relied on those defaults (all passed explicit min/max), so the change was safe.
- A stale state/grid1.json with dimensions 1013x1018x32 (written by a previous live-test session that started the server from the project root) caused a segfault on the next run. The pal::check_alloc guard correctly blocked the allocation, but the state/ file survived. Fixed by resetting to 16x16x1. The all_pc.py pipeline always starts the server from deploy/build/pc/{platform}/ so this state is isolated; running the binary from the project root directly can still contaminate the project-root state/ directory.
- The auto-pipeline unit test (ModuleManager - auto-creates default pipeline when no modules exist) did not call disableStatePersistence() after its assertions, causing it to write state/grid1.json (with default 10x10x1 dimensions) to the project root on every test run. Fixed by calling disableStatePersistence() after the assertions, before teardown.
Complexity estimate: Medium. The core StatefulModule.h changes are straightforward switch-case additions. The non-trivial parts were: ownership lifecycle (ownsOptions, freeOwnedOptions_), verifying the memmove path is safe, and the two effect-module changes (wave_() in SineEffect and the axis conditional in LinesEffect).
Seeds for future sprints:
- Proper range-clamping in setControl() for Select: clamp submitted value to [0, optionCount-1] rather than relying on uint8_t truncation. Frontend-submitted values are always valid; REST misuse is the only exposure.
- addControlValue() 8-slot pre-allocation: measure whether realloc churn during setup() causes fragmentation on ESP32 dev before adding the optimization.
- Other modules with natural discrete parameters: NoiseEffect2D blend mode, RipplesEffect shape, MirrorModifier axis. Each is a one-liner addControl + static array addition.
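The range-clamping seed could look roughly like the following. This is a sketch under assumptions: clampSelectIndex is a hypothetical helper, and it takes a wide signed input so that an out-of-range REST value is seen before any narrowing to uint8_t.

```cpp
#include <cstdint>

// Hypothetical helper: clamp a submitted Select index into [0, optionCount - 1].
// An empty option list and negative input both map to index 0.
inline uint8_t clampSelectIndex(long submitted, uint8_t optionCount) {
    if (optionCount == 0 || submitted < 0) return 0;
    if (submitted >= optionCount) return static_cast<uint8_t>(optionCount - 1);
    return static_cast<uint8_t>(submitted);
}
```

For example, a submitted value of 260 would truncate to 4 as a uint8_t, which is out of range for a four-option list; the clamp returns 3 instead.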
Sprint 10: Boot Module Creation Redesign and Dynamic Network Management¶
Scope: Replace the ad-hoc ensureNetworkModules/ensureInfraModules/instantiateDefaultPipeline_ boot logic with a single coherent rule: on first boot (no non-network top-level modules), create the full default set; otherwise leave the pipeline alone. Add EthernetModule to the initial network group. Add dynamic network management to NetworkModule so the AP is automatically disabled when STA or Ethernet is connected, and re-enabled when connectivity is lost. Investigate and fix the root causes of apparent duplicate modules (type name bug in AppSetup.h, scenario script behavior, ambiguous 409 error response).
Identified from: Sprint 9 retrospective seeds; user request after Sprint 10 scope discussion.
Summary¶
| Part | Description | Est |
|---|---|---|
| A: EthernetModule in boot | Add eth1 (child of network1) to first-boot network group in ensureNetworkModules | XS |
| B: ensureDefaultModules | Replace ensureDefaultPipeline + ensureInfraModules; "no non-network top-level modules" rule; update PC instantiateDefaultPipeline_ | M |
| C: Dynamic network management | NetworkModule 10 s ticker + 60 s grace-period debounce; onUpdate("enabled") on WifiAp/WifiSta; setInput wiring | L |
| D: ui.md boot section | Document boot module creation, dynamic WiFi, delete-to-prevent-recreation | S |
| E: Duplicate investigation + 409 | AppSetup type name bug fixed via B; 409 reason field in AppRoutes | S |
| Total | | L |
Current boot logic (before this sprint)¶
On embedded (AppSetup.cpp):
1. mm.setup() — if no DriverLayer AND no EffectsLayer: create driver1 + grid1 + effects1 + ripples1 + preview1.
2. ensureNetworkModules() — if no NetworkModule: create network1 + sta1 + ap1.
3. ensureInfraModules() — calls ensureDefaultPipeline() (patches EffectsLayer / Preview onto an existing DriverLayer if absent), then unconditionally adds SystemStatus and FirmwareUpdateModule if not present.
On PC (main.cpp):
1. mm.setup() — same pixel-pipeline creation as embedded step 1. No SystemStatus, Firmware, or network modules created.
Problems with the current logic:
- ensureDefaultPipeline patchwork adds EffectsLayer / PreviewModule even when the user deliberately built a custom pipeline without them.
- SystemStatus and FirmwareUpdateModule are added unconditionally, even on partially customised setups.
- EthernetModule is never auto-created.
- No dynamic network management: AP runs permanently even when STA is connected.
Part A: Extend ensureNetworkModules — add EthernetModule¶
ensureNetworkModules() currently creates network1 + sta1 + ap1 on first boot (guarded by hasModuleType("NetworkModule")). Add eth1 to this initial creation:
mm.addModule("EthernetModule", "eth1", {}, {}, 1, "network1"); // child of network1
"If later deleted, don't recreate" guarantee: This is already provided by the hasModuleType("NetworkModule") guard — ensureNetworkModules is a no-op on any boot where NetworkModule already exists. EthernetModule is only created once, alongside the rest of the network group, and is never checked for independently.
Part B: Replace ensureDefaultPipeline + ensureInfraModules with ensureDefaultModules¶
Replace both functions with a single ensureDefaultModules(mm) that applies the new rule:
New rule: count top-level modules (parentId == "") whose type is not "NetworkModule". If the count is zero, create the full default set. Otherwise, do nothing.
top-level non-network modules == 0 → create full default set
top-level non-network modules > 0 → do nothing
Full default set (created atomically):
| id | type | parent |
|---|---|---|
| sysinfo1 | SystemStatusModule | — |
| firmware1 | FirmwareUpdateModule | — |
| discovery1 | DeviceDiscoveryModule | — |
| driver1 | DriverLayer | — |
| grid1 | GridLayout | driver1 |
| effects1 | EffectsLayer | — |
| ripples1 | RipplesEffectModule | effects1 |
| preview1 | PreviewModule | driver1 |
Behavior changes from current logic:
| Scenario | Before | After |
|---|---|---|
| Completely blank first boot | pixel pipeline only, then SystemStatus + Firmware added separately | full default set created atomically |
| Only DriverLayer exists | EffectsLayer + Preview patched in; SystemStatus + Firmware added | do nothing |
| Only NetworkModule + children | full default pipeline created | full default set created |
| Any non-network top-level module | partial patching applied | do nothing |
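The first-boot rule reduces to a single count. The sketch below is illustrative only: ModuleInfo and shouldCreateDefaults are hypothetical, and the real check presumably walks ModuleManager's own module list rather than a std::vector.

```cpp
#include <string>
#include <vector>

// Hypothetical flat view of a registered module.
struct ModuleInfo {
    std::string id;
    std::string type;
    std::string parentId;  // empty for top-level modules
};

// Count top-level modules whose type is not "NetworkModule".
inline int countTopLevelNonNetwork(const std::vector<ModuleInfo>& mods) {
    int n = 0;
    for (const auto& m : mods) {
        if (m.parentId.empty() && m.type != "NetworkModule") ++n;
    }
    return n;
}

// Zero such modules means "first boot": create the full default set.
inline bool shouldCreateDefaults(const std::vector<ModuleInfo>& mods) {
    return countTopLevelNonNetwork(mods) == 0;
}
```

Children of network1 have a non-empty parentId, so a device that holds only the network group still counts as a first boot, matching the "Only NetworkModule + children" row.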
ModuleManager::instantiateDefaultPipeline_() (PC):
The existing function runs on PC where AppSetup.h is not compiled. It should be updated to apply the same rule: check for any top-level module (on PC there is no NetworkModule so the check becomes "any top-level module exists"). If none exist, create the full default set including SystemStatus, FirmwareUpdateModule, and DeviceDiscoveryModule. The PC build does register all three types.
ModuleManager::setup() (both platforms):
Remove the !hasDriver && !hasEffects auto-pipeline check. On embedded, ensureDefaultModules handles first-boot creation. On PC, instantiateDefaultPipeline_ (updated) handles it. The condition that triggers it changes from "no DriverLayer+EffectsLayer" to "no non-network top-level modules" (on PC, where no NetworkModule exists, this reduces to "no top-level modules at all").
Part C: Dynamic network management in NetworkModule¶
NetworkModule::loop() currently does nothing. Add a 10-second periodic check that manages the AP based on current connectivity:
Priority (highest wins):
1. Ethernet connected → disable both WiFi AP and WiFi STA.
2. STA connected → disable WiFi AP (keep STA running).
3. Neither connected, grace period expired → enable WiFi AP (recovery path for configuration access).
Grace period for STA loss: A brief disconnection (network hiccup, AP reboot) should not immediately re-enable the AP — toggling the AP is disruptive (clients connecting mid-hiccup, mDNS flapping). A configurable grace period lets STA recover before any AP change is made.
- When STA was connected and then drops: record staLostMs_ (millis timestamp).
- Each tick: if now - staLostMs_ >= STA_GRACE_MS and no other connectivity, enable AP.
- If STA reconnects or Ethernet comes up before the grace period expires: clear staLostMs_, no AP change.
- On first boot (STA never connected): no grace period; enable AP immediately.
STA_GRACE_MS is a private compile-time constant (default 60 000 ms). A future control could expose it; for Sprint 10 a compile-time default is sufficient.
Implementation sketch:
NetworkModule needs typed pointers to its children to call setControl("enabled", ...) on them. Wiring approach: NetworkModule implements setInput("sta", ...), setInput("ap", ...), setInput("eth", ...), receiving the module pointers when the wiring pass runs. ensureNetworkModules passes the child ids as inputs to network1 after creating all children (or a dedicated post-creation wiring step).
WifiApModule and WifiStaModule must override onUpdate("enabled") to actually start/stop their WiFi interface when the enabled control changes. Currently, enabled_ only gates loop execution; setting it to false does not call teardown() or stop WiFi. This change makes enabled semantically equivalent to "WiFi interface is running".
10-second ticker and grace-period state in NetworkModule:
uint32_t lastCheckMs_ = 0;
uint32_t staLostMs_ = 0; // 0 = STA connected or never-connected; non-zero = grace countdown started
bool staWasConnected_ = false;
static constexpr uint32_t STA_GRACE_MS = 60000;
void loop() override {
#ifdef ARDUINO
    uint32_t now = pal::millis();
    if (now - lastCheckMs_ < 10000) return;
    lastCheckMs_ = now;
    manageWifi_(now);
#endif
}
manageWifi_(now) logic:
ethConn = eth_ && eth_->isConnected()
staConn = sta_ && pal::wifi_sta_is_connected()
if ethConn:
    clear staLostMs_; disable AP and STA
else if staConn:
    clear staLostMs_; disable AP // STA healthy: AP not needed
else:
    if staWasConnected_ and staLostMs_ == 0:
        staLostMs_ = now // STA just dropped: start grace timer
    if staLostMs_ != 0 and (now - staLostMs_) >= STA_GRACE_MS:
        enable AP; clear staLostMs_ // grace expired: open recovery AP
    // else: within grace period; do nothing, wait for STA to recover
staWasConnected_ = staConn
The children's onUpdate("enabled") handlers propagate the change to the WiFi stack.
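The manageWifi_ logic above can be exercised without hardware by modelling it as a small state machine. The following is a sketch under assumptions: WifiPolicy, tick, apEnabled and staEnabled are hypothetical names, the connectivity flags are passed in rather than read from PAL, and the AP is assumed to start enabled at boot.

```cpp
#include <cstdint>

// Hypothetical, hardware-free model of the grace-period logic.
struct WifiPolicy {
    static constexpr uint32_t STA_GRACE_MS = 60000;
    uint32_t staLostMs = 0;        // 0 = no grace countdown running
    bool staWasConnected = false;
    bool apEnabled = true;         // assumption: AP is up at boot
    bool staEnabled = true;

    void tick(uint32_t now, bool ethConn, bool staConn) {
        if (ethConn) {
            staLostMs = 0;
            apEnabled = false;
            staEnabled = false;
        } else if (staConn) {
            staLostMs = 0;
            apEnabled = false;     // STA healthy: AP not needed
        } else {
            if (staWasConnected && staLostMs == 0) staLostMs = now;
            // Unsigned subtraction stays correct across millis() wraparound.
            if (staLostMs != 0 && (now - staLostMs) >= STA_GRACE_MS) {
                apEnabled = true;  // grace expired: open recovery AP
                staLostMs = 0;
            }
        }
        staWasConnected = staConn;
    }
};
```

Read literally, the pseudocode starts no timer when Ethernet drops and STA was never connected, and it never re-enables STA after Ethernet disabled it; this model reproduces that behaviour as written, and whether the real NetworkModule handles those cases differently is not stated here.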
EthernetModule isConnected(): currently always returns false. The interface is added now so NetworkModule can call it; the stub is replaced when real Ethernet support is implemented.
Part D: Update docs/user-guide/ui.md¶
Add (or update) a "Boot module creation" section that describes:
- First boot: what modules are created and in what order.
- Network group: NetworkModule + WifiSta + WifiAp + Ethernet.
- Default pipeline: only created when no non-network top-level modules exist.
- Dynamic WiFi: AP is automatically disabled when STA or Ethernet is connected; re-enabled when not.
- User control: delete any default module to prevent it being recreated on next boot.
Part E: Duplicate module investigation and 409 error clarity¶
ModuleManager::addModule already checks for duplicate IDs at lines 416-418 and returns false (HTTP 409) if the ID is already registered. The guard is solid. Despite this, users occasionally see duplicate modules in the UI. Three root causes were identified:
1. AppSetup.h type name bug (primary cause)
ensureInfraModules() calls mm.addModule("SystemStatus", ...) but the TypeRegistry key is "SystemStatusModule" (the class name, set by REGISTER_MODULE(SystemStatusModule)). The type lookup fails silently and the module is never created. The live-test test0_infrastructure scenario then creates it using a different id (systemstatus1) via HTTP, so the user sees two apparent SystemStatus entries after running test0 more than once: the real sysinfo1 (if it ever existed from a prior boot) and systemstatus1 from test0. Fix: change "SystemStatus" to "SystemStatusModule" throughout AppSetup.h, or eliminate the call entirely via Part B's ensureDefaultModules.
2. Scenario scripts using add_or_exists (secondary cause)
live_suite.py's add_or_exists treats HTTP 409 as success if a module with the same ID already exists (type-checked). For types listed in SINGLETON_TYPES (INFRA_TYPES | {"FirmwareUpdateModule"}), the test step will skip re-creation when the correct type is already present. For non-singleton types (e.g., EffectsLayer, DriverLayer), a POST with a different ID will create a second instance if the first was not cleaned up. The delete_all_modules step at the start of each scenario should prevent this, but it preserves INFRA_TYPES modules — so if a prior scenario left a non-infra module with the same type but a different id, a new one will be created.
3. Ambiguous HTTP 409 error message (diagnostic cause)
AppRoutes.cpp returns 409 for three distinct failures: ID already exists, unknown type, invalid parent ID. These are currently indistinguishable from the HTTP response alone, making debugging harder. Fix: return a reason field in the JSON body distinguishing the three cases.
Fixes in this sprint:
- AppSetup.h: fix "SystemStatus" → "SystemStatusModule" (addressed implicitly by Part B's ensureDefaultModules rewrite)
- AppRoutes.cpp: return distinct reason strings in the 409 response body ("id_exists", "unknown_type", "invalid_parent")
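One way to produce the three reason strings is to map a typed error to text in a single place. This is a sketch only: AddModuleError, reasonFor and conflictBody are hypothetical names, and the exact shape of the real JSON body beyond the reason field is not specified in this document.

```cpp
#include <string>

// Hypothetical result type; the three reason strings are the ones listed above.
enum class AddModuleError { IdExists, UnknownType, InvalidParent };

inline const char* reasonFor(AddModuleError e) {
    switch (e) {
        case AddModuleError::IdExists:      return "id_exists";
        case AddModuleError::UnknownType:   return "unknown_type";
        case AddModuleError::InvalidParent: return "invalid_parent";
    }
    return "unknown";  // unreachable for valid enum values
}

// Builds a minimal JSON body carrying only the reason field (assumed shape).
inline std::string conflictBody(AddModuleError e) {
    return std::string("{\"reason\":\"") + reasonFor(e) + "\"}";
}
```

A client such as a test script can then branch on the reason value instead of treating every 409 as "already exists".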
Design decisions¶
Why "else do nothing" instead of per-type checks? The previous "patch up missing pieces" approach was opaque — it was hard to predict whether a module would be added on the next reboot. The new rule is a single, testable invariant: the first-boot state is fully deterministic; any subsequent state is entirely the user's configuration.
Why is SystemStatusModule part of the conditional set?
Previously it was added unconditionally. Making it conditional brings it in line with the other defaults — if the user removes it they clearly don't want it. The boot-guard pattern (ensureNetworkModules re-creates network if deleted) is reserved for modules that are genuinely required for the device to be accessible (networking). Infra/status modules are optional from the device's perspective.
Why onUpdate("enabled") on WiFi child modules rather than direct PAL calls from NetworkModule?
Direct PAL calls from NetworkModule would bypass the module's state machine and leave its status_ control stale. Routing through setControl("enabled") → onUpdate keeps the module self-consistent and makes the WiFi state visible in the UI.
Definition of Done¶
- [x] AppSetup.h: ensureNetworkModules creates eth1 (child of network1) alongside sta1 and ap1 on first boot
- [x] AppSetup.h: ensureDefaultModules replaces ensureDefaultPipeline + ensureInfraModules; creates full default set only when no non-network top-level modules exist
- [x] AppSetup.cpp: calls ensureNetworkModules then ensureDefaultModules (replacing the old pair)
- [x] ModuleManager.cpp: instantiateDefaultPipeline_ updated for PC; checks "no top-level modules" and creates full default set (including SystemStatus, FirmwareUpdateModule, DeviceDiscoveryModule)
- [x] ModuleManager.cpp: removes the !hasDriver && !hasEffects auto-pipeline check from setup()
- [x] NetworkModule: setInput("sta", ...), setInput("ap", ...), setInput("eth", ...) added; loop() manages WiFi with 10-second ticker; lastCheckMs_ uint32_t member
- [x] WifiApModule: onUpdate("enabled") calls wifi_ap_stop() on disable and startAp() on enable
- [x] WifiStaModule: onUpdate("enabled") disconnects on disable and reconnects on enable
- [x] EthernetModule: isConnected() method added (returns false; stub for future Ethernet implementation)
- [x] ensureNetworkModules wires sta1, ap1, eth1 as inputs to network1 so NetworkModule receives the typed pointers
- [x] Tests: new unit tests for ensureDefaultModules (no modules → full set created; DriverLayer present → nothing added)
- [x] Tests: WifiApModule.onUpdate("enabled") stops/starts AP; WifiStaModule.onUpdate("enabled") disconnects/reconnects; EthernetModule.isConnected() returns false on PC
- Note: NetworkModule grace-period logic is #ifdef ARDUINO-only, so it is not testable on PC; verified by code review
- [x] AppRoutes.cpp: HTTP 409 response body includes a reason field ("id_exists", "unknown_type", "invalid_parent") so callers can distinguish the three failure cases
- [x] docs/user-guide/ui.md: boot module creation section added
- [x] PC live tests pass (7/7 scenarios); ESP32 live tests skipped (no devices connected during sprint completion)
- [x] esp32dev and esp32s3_n16r8 build successfully
Result¶
| Metric | Value |
|---|---|
| Unit tests | 390/390 pass (4 new) |
| PC live tests | 7/7 scenarios pass |
| esp32dev build | 1374 KB flash, 19.9% RAM |
| esp32s3_n16r8 build | 1362 KB flash, 19.3% RAM |
| AppSetup.h | ensureNetworkModules + ensureDefaultModules replace 3 old boot functions |
| NetworkModule | WiFi management with 10 s ticker and 60 s STA grace period |
| WifiApModule / WifiStaModule | onUpdate("enabled") reactive AP/STA control |
| HTTP 409 | Now includes reason field: id_exists / unknown_type / invalid_parent |
See test results for full pass/fail breakdown.
Retrospective¶
What went well:
- The "no non-network top-level modules" rule (countTopLevelNonNetwork()) gives a single, testable invariant for first-boot behavior — deterministic and easy to reason about compared to the patchwork of hasDriver && hasEffects checks it replaced.
- Routing WiFi enable/disable through setControl("enabled") -> onUpdate keeps each module self-consistent. NetworkModule never needs to know about STA/AP internals; the child modules keep their own status_ display up to date.
- rewireModule("network1", inputs) after creating the four network modules is a clean pattern — create children first, then wire the parent. No ordering constraint on addModule itself.
- The StatefulModuleBase* type for ap_ and sta_ in NetworkModule solved the circular-include problem cleanly: WifiAp.h and WifiSta.h both include Network.h, so Network.h cannot include them. The base pointer is sufficient for setControl() calls.
What was tricky:
- runSetup() vs setup() in tests: setup() does not register the enabled_ control — that is runSetup()'s job (the base-class wrapper). The three new behavior tests initially called ap.setup() and the setControl("enabled", false) returned false silently (control not found), so onUpdate was never called and status stayed "starting". Fixed by switching to runSetup().
- The circular-include problem between Network.h and WifiAp.h/WifiSta.h was not obvious until the first compile. Storing ap_/sta_ as StatefulModuleBase* and casting in setInput() is the right fix, but required understanding which header depends on which.
- AppSetup.h previously used "SystemStatus" (wrong) instead of "SystemStatusModule" as the type name string. The bug was latent until Sprint 10's investigation of apparent duplicate-creation. Eliminating ensureInfraModules entirely fixed it without a targeted patch.
Complexity estimate: Large. Three distinct sub-systems changed (boot logic, WiFi management, HTTP error details) plus four test files and the ui.md doc.
Seeds for future sprints:
- Sprint 11: implement EthernetModule for real (LAN8720/W5500); NetworkModule.manageWifi_() already calls eth_->isConnected() — just needs the stub replaced.
- Expose STA_GRACE_MS (currently 60 000 ms compile-time constant) as a NetworkModule control for field-adjustable debounce.
- Add a NetworkModule live test: bring STA up, verify AP disables; bring STA down, wait grace period, verify AP re-enables. Requires hardware or a WiFi simulation stub.
Sprint 11: Ethernet Implementation¶
Scope: Implement EthernetModule for real on ESP32 classic (LAN8720 RMII) and ESP32-S3 (W5500 SPI). Add Ethernet PAL functions to Pal.h covering both the Arduino ETH.h path and the bare IDF_VER path so a future Arduino-free build compiles cleanly. Add DHCP client and static IP modes; static IP mode serves as the direct-connect ("AP analog") path. Document what ESP32-P4 Ethernet will require when hardware arrives.
Depends on: Sprint 10 (wires eth_ pointer in NetworkModule; adds isConnected() stub; Sprint 11 makes it real).
Identified from: Sprint 10 retrospective seeds; user request.
Summary¶
| Part | Description | Est |
|---|---|---|
| A: Ethernet PAL functions | 6 PAL functions (eth_init, eth_is_connected, etc.); ARDUINO, IDF_VER, and PC stub branches | M |
| B: EthernetModule implementation | Full module replacing stub: setup/loop/isConnected, DHCP+static controls, healthReport() | M |
| C: Direct-connect mode | Static IP path (AP analog); recommended defaults; link-local note; doc update | S |
| D: ESP32-P4 documentation | GMAC, SDIO WiFi coprocessor, IDF 5.3+, PAL additions needed; no implementation | S |
| Total | | M |
Background: PAL structure and the IDF_VER path¶
All network PAL functions follow a three-way platform switch that must be preserved for every new function added:
#ifdef ARDUINO
// Arduino ESP32 framework — ETH.h / WiFi.h / esp_netif via Arduino wrappers
#elif defined(IDF_VER)
// Bare ESP-IDF — esp_eth / esp_netif / lwIP directly; no Arduino wrappers
#else
// PC build — no-op stubs, returns false / empty string
#endif
The IDF_VER path exists today for WiFi but has minimal/stub bodies. Every Ethernet PAL function added in this sprint must have a real IDF_VER body (not just a stub) because the long-term goal is to be able to build without Arduino.h. This means using esp_eth, esp_netif, and esp_event APIs directly in the IDF_VER branch, not delegating to ETH.h.
The ARDUINO and IDF_VER implementations can share the same PAL function signatures; the #ifdef is inside the function body, not in the declaration.
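The "one signature, #ifdef inside the body" pattern can be sketched as follows. This is not the project's Pal.h: the pal_sketch namespace and the backend_* functions are hypothetical placeholders standing in for the real ETH.h and esp_eth/esp_netif code, so that the example compiles on PC and only the stub branch is exercised.

```cpp
#include <cstddef>

namespace pal_sketch {  // hypothetical namespace for illustration

#if defined(ARDUINO) || defined(IDF_VER)
// Hypothetical per-platform backends, defined elsewhere in a real build.
bool backend_eth_connected();
void backend_eth_local_ip(char* buf, size_t len);
#endif

// Same declaration on every platform; only the body differs.
inline bool eth_is_connected() {
#ifdef ARDUINO
    return backend_eth_connected();   // would wrap the Arduino ETH.h path
#elif defined(IDF_VER)
    return backend_eth_connected();   // would wrap esp_eth / esp_netif directly
#else
    return false;                     // PC stub
#endif
}

inline void eth_local_ip(char* buf, size_t len) {
    if (buf == nullptr || len == 0) return;
#if defined(ARDUINO) || defined(IDF_VER)
    backend_eth_local_ip(buf, len);
#else
    buf[0] = '\0';                    // PC stub: empty string
#endif
}

}  // namespace pal_sketch
```

Callers such as a module's loop() see an identical API on all three targets, which is what lets the module code stay free of platform checks.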
Hardware variants and board-specific configuration¶
Two Ethernet hardware variants are supported. The selection is made at compile time via a flag defined in platformio.ini per board environment:
| Board | Hardware | Interface | Flag |
|---|---|---|---|
| esp32dev | LAN8720 | RMII (GPIO) | -DPMM_ETH_LAN8720 |
| esp32s3_n16r8 | W5500 | SPI | -DPMM_ETH_W5500 |
Note: flags use the PMM_ prefix (e.g. PMM_ETH_LAN8720 not ETH_PHY_LAN8720) to avoid colliding with the eth_phy_type_t enum values of the same name in esp-idf.
Pin assignments (MDC, MDIO, PHY address for RMII; SCK, MISO, MOSI, CS, IRQ for SPI) are defined as compile-time constants in the same board-specific platformio.ini env, e.g.:
[env:esp32dev]
build_flags =
    -DPMM_ETH_LAN8720
    -DETH_RMII_MDC=23
    -DETH_RMII_MDIO=18
    -DETH_RMII_PHY_ADDR=1
[env:esp32s3_n16r8]
build_flags =
    -DPMM_ETH_W5500
    -DETH_SPI_SCK=12
    -DETH_SPI_MISO=13
    -DETH_SPI_MOSI=11
    -DETH_SPI_CS=10
    -DETH_SPI_IRQ=14
EthernetModule itself contains no pin numbers. It calls PAL functions; the PAL reads the compile-time constants and dispatches to the right hardware init.
Part A: Ethernet PAL functions¶
Add to src/pal/Pal.h:
// Returns true if Ethernet hardware is compiled in for this board.
inline constexpr bool has_ethernet();
// Initialise the Ethernet peripheral. Called once from EthernetModule::setup().
// Returns true if the hardware was found and initialisation succeeded.
inline bool eth_init();
// True if Ethernet link is up and an IP address has been assigned (DHCP or static).
inline bool eth_is_connected();
// Write the current Ethernet IP address into buf (null-terminated). Empty string if not connected.
inline void eth_local_ip(char* buf, size_t len);
// Switch to DHCP client mode (default after eth_init).
inline void eth_set_dhcp();
// Set a static IP immediately. Disables DHCP client.
// Pass nullptr for gateway/subnet to use defaults (gw = ip with last octet 1, /24).
inline void eth_set_static_ip(const char* ip, const char* gateway, const char* subnet);
ARDUINO implementation: delegates to ETH.h (ETH.begin(...), ETH.config(...), ETH.localIP().toString()). eth_init() dispatches on PMM_ETH_LAN8720 vs PMM_ETH_W5500 at compile time to call the correct ETH.begin() overload.
IDF_VER implementation: uses esp_eth_driver_install, esp_netif_new, esp_eth_start, esp_event_handler_register(ETH_EVENT, ...). Static IP uses esp_netif_set_ip_info. DHCP client uses esp_netif_dhcpc_start.
PC stub: has_ethernet() returns false; all other functions are no-ops / return false / write empty strings.
Part B: EthernetModule implementation¶
Replace the current stub (src/modules/system/Ethernet.h) with a full implementation:
setup():
1. Calls pal::eth_init(). If it returns false, sets status_ = "init_failed" and returns.
2. If a static IP is configured (loaded from saved state), calls pal::eth_set_static_ip(...).
3. Otherwise calls pal::eth_set_dhcp().
4. Registers controls (see below).
loop():
- Polls pal::eth_is_connected() once per second (millis-based debounce).
- On state change: updates status_ and ip_address_ controls; calls pal::eth_local_ip().
- On PC: has_ethernet() is false, so loop() is a no-op beyond the guard.
isConnected() (used by NetworkModule): returns pal::eth_is_connected().
Controls:
| key | type | description |
|---|---|---|
| status | display | "disconnected" / "connecting" / "connected" / "init_failed" |
| ip_address | display | Current IP (empty when disconnected) |
| mode | select | "dhcp" or "static" (default: "dhcp") |
| static_ip | text | Only active when mode == "static" |
| static_gateway | text | Only active when mode == "static" |
| static_subnet | text | Default "255.255.255.0" |
onUpdate("mode") and onUpdate("static_ip") apply the new config immediately via PAL if Ethernet is already up.
healthReport(): "eth=connected ip=192.168.1.42" / "eth=disconnected" / "eth=unsupported" (PC).
Part C: Direct-connect mode (AP analog)¶
When Ethernet is wired directly between the ESP32 and a laptop (no router), there is no DHCP server to assign addresses. Static IP mode serves as the "AP analog" for Ethernet: set a known fixed IP on the ESP32, then manually configure a matching IP on the laptop.
Recommended defaults for direct-connect:
- ESP32 static IP: 192.168.5.1 / gateway 192.168.5.1 / subnet 255.255.255.0
- Laptop: 192.168.5.2 / subnet 255.255.255.0 (manual config in OS network settings)
- The device is then reachable at http://192.168.5.1
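Direct connect only works if both ends sit in the same subnet, which is easy to check with a mask comparison. The helpers below are a hypothetical sketch (parseIpv4 and sameSubnet are not project functions) for validating a laptop/ESP32 address pair before trying it.

```cpp
#include <cstdint>
#include <cstdio>

// Parse a dotted-quad IPv4 string into a host-order 32-bit value.
inline bool parseIpv4(const char* s, uint32_t& out) {
    unsigned a, b, c, d;
    char tail;
    if (s == nullptr) return false;
    if (std::sscanf(s, "%u.%u.%u.%u%c", &a, &b, &c, &d, &tail) != 4) return false;
    if (a > 255 || b > 255 || c > 255 || d > 255) return false;
    out = (a << 24) | (b << 16) | (c << 8) | d;
    return true;
}

// True when both addresses share the same network bits under the given mask.
inline bool sameSubnet(const char* ipA, const char* ipB, const char* mask) {
    uint32_t a = 0, b = 0, m = 0;
    if (!parseIpv4(ipA, a) || !parseIpv4(ipB, b) || !parseIpv4(mask, m)) return false;
    return (a & m) == (b & m);
}
```

The recommended pair 192.168.5.1 and 192.168.5.2 with mask 255.255.255.0 passes this check, while a laptop left on 192.168.1.x would not.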
Relationship to WiFi AP: NetworkModule::manageWifi_() (Sprint 10) disables the WiFi AP whenever Ethernet is connected. isConnected() returns the same value for a DHCP lease (router present) and a static IP (direct connect), so the ticker already treats a static-IP link as "connected" and no code change is needed. A user who still wants the WiFi AP available during direct connect is expected to manage it manually.
Link-local (169.254.x.x): lwIP supports APIPA link-local addressing. If DHCP fails and mode == "dhcp", the ESP32 may auto-assign a 169.254 address after timeout. Modern OSes do the same. This provides a zero-config direct-connect path without any manual IP setting, but the 169.254.x.x address is non-deterministic. Document this as an observed behavior, not a designed feature. Static mode is the designed direct-connect path.
Part D: ESP32-P4 — what will be needed¶
The ESP32-P4 has on-board GMAC Ethernet and uses an external WiFi/BT coprocessor (e.g., ESP32-C6) connected via SDIO. No ESP32-P4 hardware is targeted in this sprint. This section documents what a future sprint will need.
PlatformIO:
- New [env:esp32p4] and [env:esp32p4_eth] entries in platformio.ini.
- Board JSON files (esp32_p4_nano.json, esp32_p4_eth.json) in the PlatformIO boards directory or boards/ in the project.
- Framework: Arduino ESP32 core 3.x (P4 support landed in core 3.1); or bare IDF 5.3+.
- Build flag: -DETH_PHY_EMAC (P4 uses its own internal EMAC + external PHY, typically IP101).
PAL changes:
- Add ETH_PHY_EMAC branch in eth_init() inside both ARDUINO and IDF_VER sections.
- P4 EMAC uses esp_eth_mac_new_esp32, same esp_netif plumbing as classic ESP32, so the IDF_VER path is largely reusable.
- WiFi on P4 requires SDIO bootstrap before wifi_sta_connect() / wifi_ap_start() can be called. NetworkModule::setup() will need a pal::wifi_coprocessor_init() call (P4 only) before the existing WiFi init.
- Add #ifdef ESP_PLATFORM_P4 guards (or CONFIG_IDF_TARGET_ESP32P4 from sdkconfig) to any P4-specific bootstrap code.
What does NOT change: EthernetModule, NetworkModule, WiFi modules — they call PAL functions only. All hardware differences are absorbed in Pal.h.
Design decisions¶
Why one EthernetModule for both RMII and W5500?
The module only calls PAL functions. The hardware difference is entirely inside eth_init(). Adding a second module type (EthernetW5500Module) would duplicate the status/IP/mode logic for no benefit.
Why put all pin constants in platformio.ini rather than a header?
platformio.ini is the single source of truth for board configuration. Scattering pin assignments across headers creates inconsistency between boards. The PAL reads the CMake/PlatformIO defines directly.
Why require an IDF_VER body now, not later?
The migration from Arduino to bare IDF is a future goal, not an immediate task. However, writing stub-only IDF_VER bodies now means they will rot — when the migration happens, every function needs rewriting anyway. Writing real bodies now keeps the IDF path exercisable and prevents silent compile failures when someone eventually enables IDF_VER on a board.
Why static IP instead of a DHCP server for direct connect?
Running a DHCP server on the Ethernet netif requires esp_netif_dhcps_start() and a server config, which works but adds state complexity to EthernetModule. Static IP achieves the same goal with one PAL call. A DHCP server on Ethernet is noted as a future enhancement.
Definition of Done¶
- [x] Pal.h: has_ethernet(), eth_init(), eth_is_connected(), eth_local_ip(), eth_set_dhcp(), eth_set_static_ip(); all three platform branches (ARDUINO, IDF_VER, PC stub) present and compiling
- [x] platformio.ini: PMM_ETH_LAN8720 + RMII pin flags added to [env:esp32dev]; PMM_ETH_W5500 + SPI pin flags added to [env:esp32s3_n16r8] (flags use the PMM_ prefix; see hardware section)
- [x] src/modules/system/Ethernet.h: full implementation replacing stub: setup() init, loop() poll, isConnected(), status/ip/mode/static_ip/static_gateway/static_subnet controls, healthReport()
- [x] onUpdate("mode") and onUpdate("static_ip") apply config live if Ethernet is already initialised
- [x] NetworkModule: manageWifi_() treats eth_->isConnected() as real (no change to call site; Sprint 10 wired it; Sprint 11 provides the real value)
- [x] Tests: test_network.cpp updated with tests for isConnected() on PC, loadState/saveState round-trip, default static_subnet
- [x] Tests: PC build compiles and all tests pass (Ethernet PAL stubs return safe values)
- [x] docs/modules/network/ethernet-module.md: updated with real controls, DHCP/static modes, direct-connect setup instructions
- [x] Direct-connect section added to docs/modules/network/ethernet-module.md (user chose ethernet-module.md only, not getting-started.md)
- [x] esp32dev build compiles with PMM_ETH_LAN8720 (no hardware flash required in CI)
- [x] esp32s3_n16r8 build compiles with PMM_ETH_W5500 (no hardware flash required in CI)
- [x] docs/development/release-07.md: Part D (P4 requirements) written (this section)
- [ ] ESP32 live tests pass on both devices: deferred; hardware will be tested in Sprint 12 (user intent: test with real hardware next sprint)
Result¶
| Metric | Value |
|---|---|
| Unit tests | 392/392 pass (2 new: isConnected on PC, loadState/saveState round-trip, default static_subnet) |
| PC live tests | 7/7 scenarios pass (82 steps) |
| esp32dev build | 1441 KB flash (78.5%), 64 KB RAM (20.0%) |
| esp32s3_n16r8 build | 1427 KB flash (34.8%), 62 KB RAM (19.5%) |
| ESP32 live tests | deferred (devices unreachable during sprint; not a Sprint 11 regression) |
Flash footprint increased by ~67 KB on both boards versus Sprint 10 (SPI and Ethernet library sources now compiled in for W5500 path; LAN8720 path increased by similar amount as ETH.cpp was always compiled).
See test-results.md and live-pc-macos.md.
Retrospective¶
What went well:
- The three-way PAL platform switch pattern (ARDUINO / IDF_VER / PC stub) kept EthernetModule free of any #ifdef; all hardware differences are absorbed in Pal.h. The has_ethernet() constexpr guard at the top of setup() was enough to keep PC tests clean without touching test code.
- The PMM_ETH_ prefix decision was made early (prompted by a collision with eth_phy_type_t enum values of the same name in esp-idf) and avoided a subtle compile-time name conflict between the build-flag macros and those enum values.
- Static IP mode as the direct-connect path required no new module infrastructure: it is just applyIpConfig_() with modeIdx_ == 1. The "AP analog" pattern reused exactly the same control set as DHCP mode.
What was tricky:
- PlatformIO LDF and the SPI linker gap. The LDF discovered and compiled ETH.cpp (because Pal.h includes ETH.h, which the LDF follows in chain+ mode). But SPI.h was found via a CPPPATH entry added by the pre-script, which satisfied the #include without the LDF discovering the SPI library directory, so SPI.cpp was never compiled. EXTRA_CXXSRC (wrong variable) had no effect. Fix: env.BuildSources() in add_spi_eth_path.py explicitly compiles SPI.cpp without conflicting with the LDF-managed ETH.cpp. This is the correct pattern when a library source file must be compiled but would otherwise be bypassed by a CPPPATH shortcut.
- arduino-esp32 3.x ETH.begin() signature for W5500. The 10-parameter form is begin(eth_phy_type_t, phy_addr, cs, irq, rst, spi_host_device_t, sck, miso, mosi, freq_mhz), which is not the 2.x order. Getting this wrong produced a compile error after correcting the SPI linker issue; checking the framework source directly was the only reliable way.
Seeds for Sprint 12:
- Hardware live test of LAN8720 and W5500 with real boards (user intent: next sprint).
- Evaluate an [env:esp32dev_idf] bare-IDF env for exercising the IDF_VER Ethernet path in CI (needs sdkconfig, no ESPAsyncWebServer; significant setup cost, so it may be a separate sprint).
- DHCP server on Ethernet for direct-connect without static IP (noted as future enhancement).
Complexity estimate: Large (L).
Sprint 12: WiFi Reconnect and PAL Documentation¶
Scope: Automatic WiFi STA reconnect after signal loss or router reboot; re-enable STA when Ethernet drops; align all recovery timers at 30 s. Document the Arduino/IDF mixing strategy in pal.md following a design discussion about the three-way platform switch.
Identified from: Sprint 11 retrospective (hardware live test follow-up); design discussion on PAL architecture.
Summary¶
| Part | Description | Est |
|---|---|---|
| A: WifiSta auto-reconnect | Retry connection every 30 s while enabled and disconnected | S |
| B: Network Ethernet-drop recovery | Re-enable STA when Ethernet transitions down; align grace period to 30 s | S |
| C: PAL architecture documentation | "Arduino, IDF, and mixing both" section in pal.md | S |
| Total | | S |
Part A: WifiSta auto-reconnect¶
WifiStaModule::loop() previously stopped retrying after a failed connection attempt. An else if branch added after the connect-polling block retries startConnect() every RETRY_INTERVAL_MS = 30000 ms while:
- isEnabled() is true (NetworkModule disables STA when Ethernet is up; no retry while disabled)
- Not currently in a connect attempt (!connecting_)
- Has credentials (ssid_[0] != '\0')
- WiFi hardware is available (pal::has_wifi())
startConnect() resets lastRetryMs_ so each new attempt starts a fresh 30 s window regardless of how much time had elapsed.
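The retry conditions above can be sketched as a small, PC-testable state holder. This is a minimal illustration, not the actual WifiStaModule code: the struct, its plain bool fields, and the explicit nowMs parameter (standing in for the device uptime clock) are assumptions made so the logic can run without WiFi hardware.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative stand-in for the retry decision in WifiStaModule::loop().
// Field names echo the sprint notes; the struct itself is hypothetical.
struct StaRetrySketch {
    static constexpr std::uint32_t RETRY_INTERVAL_MS = 30000;

    bool enabled = true;         // base-class "enabled" control
    bool connecting = false;     // a connect attempt is in flight
    bool connected = false;
    bool hasWifi = true;         // stands in for pal::has_wifi()
    char ssid[33] = "example";   // empty string means no credentials
    std::uint32_t lastRetryMs = 0;
    int attempts = 0;

    void startConnect(std::uint32_t nowMs) {
        assert(ssid[0] != '\0');  // never start an attempt without credentials
        connecting = true;
        lastRetryMs = nowMs;      // every attempt opens a fresh 30 s window
        ++attempts;
    }

    // Marks the in-flight attempt as failed (connect polling is omitted here).
    void failAttempt() { connecting = false; }

    // Call once per loop tick with the current uptime in milliseconds.
    void tick(std::uint32_t nowMs) {
        if (connecting || connected) return;
        // Unsigned subtraction keeps the comparison valid across counter wrap.
        if (enabled && hasWifi && ssid[0] != '\0' &&
            nowMs - lastRetryMs >= RETRY_INTERVAL_MS) {
            startConnect(nowMs);
        }
    }
};
```

Passing the clock in as a parameter is a design choice for this sketch only; it makes the 30 s window checkable in a unit test, which the real ARDUINO-only retry path is not.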
Part B: Network Ethernet-drop recovery¶
Two changes to NetworkModule::manageWifi_():
- ethWasConnected_ state tracking: detects the Ethernet up-to-down transition. On that tick, sta_->setControl("enabled", true) re-enables STA immediately so Part A's retry loop kicks in. Without this, STA stayed permanently disabled after Ethernet had been up.
- STA_GRACE_MS reduced from 60 s to 30 s: the AP recovery timer now matches the STA retry interval. All three recovery events (STA retry, Ethernet-drop STA re-enable, AP open) now occur on 30 s boundaries.
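The transition tracking can be illustrated with a reduced model. The sketch below assumes a hypothetical struct in which a plain staEnabled flag replaces the sta_->setControl("enabled", ...) call and the Ethernet link state is passed in as a parameter; it shows only the edge detection, not the AP grace timer.

```cpp
#include <cassert>

// Hypothetical reduction of NetworkModule::manageWifi_(): only the
// Ethernet edge detection and the STA gating are modelled.
struct EthDropSketch {
    bool ethWasConnected = false;  // link state seen on the previous tick
    bool staEnabled = true;        // stands in for the STA "enabled" control
    int reenableCount = 0;         // how often the up-to-down edge fired

    void manageWifi(bool ethConnected) {
        if (ethConnected) {
            staEnabled = false;    // Ethernet up: STA is gated off
        } else if (ethWasConnected) {
            staEnabled = true;     // up-to-down edge: re-arm STA exactly once
            ++reenableCount;
        }
        ethWasConnected = ethConnected;
        // Invariant from the notes: STA is never enabled while Ethernet is up.
        assert(!(ethConnected && staEnabled));
    }
};
```

Comparing against the previous tick's state means the re-enable fires once per drop rather than on every tick while Ethernet is down, so a user who later disables STA manually is not overridden.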
Part C: PAL architecture documentation¶
A new "Arduino, IDF, and mixing both" section added to docs/developer-guide/pal.md documents the outcome of a design discussion on the three-way ARDUINO / IDF_VER / PC switch:
- The IDF_VER branch is dead code in all current builds (ARDUINO always matches first when framework = arduino)
- Direct esp_* calls work inside framework = arduino builds; this is common practice for features the Arduino wrappers do not expose (power management, WiFi fine-tuning, P4 hardware)
- Library compatibility: ESPAsyncWebServer and FastLED require Arduino.h; ArduinoJson v7 is framework-agnostic
- The ESP_PLATFORM path: collapsing ARDUINO and IDF_VER into one branch for new PAL functions where no Arduino wrapper exists (P4 GMAC, codecs)
- esp_netif_init / event loop double-init caveat when mixing IDF calls with Arduino WiFi init
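One way to handle the double-init caveat is sketched below for the default event loop only. ensure_default_event_loop() and network_bootstrap_sketch() are hypothetical names, not functions from Pal.h; the sketch relies on esp_event_loop_create_default() reporting ESP_ERR_INVALID_STATE when the loop already exists, and it compiles to a stub on PC because everything IDF-specific sits behind ESP_PLATFORM.

```cpp
#include <cassert>

#if defined(ESP_PLATFORM)
#include "esp_err.h"
#include "esp_event.h"
#endif

// Hypothetical PAL helper: make sure the default event loop exists without
// failing when Arduino's WiFi init has already created it.
inline bool ensure_default_event_loop() {
#if defined(ESP_PLATFORM)
    const esp_err_t err = esp_event_loop_create_default();
    // ESP_ERR_INVALID_STATE means the loop was already created elsewhere.
    return err == ESP_OK || err == ESP_ERR_INVALID_STATE;
#else
    return true;  // PC stub: there is no event loop to create
#endif
}

// Hypothetical caller: direct IDF setup code would run only after this check.
inline void network_bootstrap_sketch() {
    const bool loopReady = ensure_default_event_loop();
    assert(loopReady && "default event loop unavailable");
    (void)loopReady;  // silences unused-variable warnings when NDEBUG is set
}
```

Treating "already created" as success keeps the helper safe to call regardless of whether Arduino WiFi init ran first.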
Definition of Done¶
- [x] WifiSta.h: auto-retry every RETRY_INTERVAL_MS = 30000 ms while enabled and disconnected
- [x] Network.h: ethWasConnected_ added; STA re-enabled on Ethernet drop; STA_GRACE_MS = 30000
- [x] docs/developer-guide/pal.md: "Arduino, IDF, and mixing both" section added
- [x] All unit tests pass; PC and ESP32 builds clean
- [x] PC live tests pass
Result¶
| Metric | Value |
|---|---|
| Unit tests | 392/392 pass (no new tests — retry logic is ARDUINO-only, not exercisable on PC) |
| PC live tests | 7/7 scenarios pass |
| ArtNet two-device | PASS (esp32s3_n16r8 MM-70BC reached and received packets) |
| esp32dev build | 1441 KB flash (78.5%), 64 KB RAM (20.0%) — unchanged from Sprint 11 |
| esp32s3_n16r8 build | 1427 KB flash (34.8%), 62 KB RAM (19.5%) — unchanged from Sprint 11 |
| esp32dev live test | skipped (device unreachable — stale IP, not a sprint regression) |
See test-results.md and live-pc-macos.md.
Retrospective¶
What went well:
- The isEnabled() guard in the retry branch reuses the existing enabled base-class control, so NetworkModule's Ethernet-gating of STA (which sets enabled = false) automatically suppresses retries; no extra flag is needed.
- Reducing STA_GRACE_MS to 30 s to match RETRY_INTERVAL_MS was a one-line change that aligned the whole recovery model. All recovery events are now on the same cadence.
- The PAL design discussion surfaced a useful clarification: the IDF_VER branch is currently dead code, but the right long-term strategy is ESP_PLATFORM for new functions rather than maintaining two separate branches that converge on the same IDF API.
What was tricky:
- The ethWasConnected_ fix was the less obvious half of the reconnect story. STA retry alone would not have helped after Ethernet drops because STA had been disabled by NetworkModule while Ethernet was up; it would never retry while enabled = false. Tracking the Ethernet transition was required to re-arm STA.
Seeds for Sprint 13:
- Hardware live test: connect LAN8720 (esp32dev) and verify Ethernet and WiFi reconnect behavior on real hardware.
- Consolidation question investigated but deferred: merging WifiAp, WifiSta, Ethernet into one NetworkModule is not worth the cost (UI, testing, size). The circular include friction could be reduced by extracting deviceName() to a lightweight DeviceInfo.h.
- PAL ESP_PLATFORM refactor: apply to new functions (P4 GMAC) when that hardware arrives; leave existing Arduino-wrapper functions unchanged.
Complexity estimate: Small (S).
Sprint 13: PAL Cleanup and Deploy Pipeline Fixes¶
Scope: Remove all IDF_VER branches from Pal.h; consolidate the status docs (remove deploy-summary.md); fix summarise.py overwriting per-env live results files; fix livetest.py overwriting logs when a device is unreachable.
Identified from: Sprint 12 retrospective (PAL cleanup); organic housekeeping on the deploy pipeline.
Summary¶
| Part | Description | Est |
|---|---|---|
| A: Remove IDF_VER branches | Rewrite Pal.h to #ifdef ARDUINO / #else throughout; remove _eth IDF namespace; simplify eth_init | S |
| B: Update pal.md | Rename "Three-way" to "Two-way" table; remove "IDF migration path" section; add rule statement | XS |
| C: Consolidate status docs | Merge deploy-summary.md into index.md; remove the file; update all references | S |
| D: Deploy pipeline correctness | summarise.py stops overwriting per-env MD files; livetest.py skips unreachable devices without touching logs | S |
| Total | | M |
Part A: Pal.h rewrite¶
All #elif defined(IDF_VER) branches removed. Every function now follows:
```cpp
#ifdef ARDUINO
    // Arduino ESP32 implementation
#else
    // PC / Raspberry Pi stub
#endif
```
The _eth namespace (IDF event-driven Ethernet state helpers: _EthEvent, eth_event_handler, ethState) was removed entirely. The eth_init function shrank from ~60 lines to ~5:
```cpp
inline bool eth_init() {
#if defined(ARDUINO) && defined(PMM_ETH_LAN8720)
    return ETH.begin(ETH_PHY_LAN8720, ...);
#elif defined(ARDUINO) && defined(PMM_ETH_W5500)
    return ETH.begin(ETH_PHY_W5500, ...);
#else
    return false;
#endif
}
```
A comment added to the Ethernet section: future hardware (e.g. ESP32-P4 GMAC) that has no Arduino ETH.h wrapper should add a new PMM_ETH_* flag and use direct IDF calls inside the ARDUINO block.
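Seen from a module, the two-way switch might look like the sketch below. This is an illustration under assumptions: the pal_sketch namespace, the body of has_ethernet(), and EthernetModuleSketch are all hypothetical (the real has_ethernet() body is not shown in these notes), and the ARDUINO branch of eth_init() is a placeholder for the ETH.begin(...) call shown above.

```cpp
#include <cassert>

namespace pal_sketch {

// Assumed shape of the compile-time capability check: true only when an
// Arduino build also selects one of the PMM_ETH_* PHY flags.
constexpr bool has_ethernet() {
#if defined(ARDUINO) && (defined(PMM_ETH_LAN8720) || defined(PMM_ETH_W5500))
    return true;
#else
    return false;
#endif
}

// Two-way switch: Arduino implementation or PC stub, nothing else.
inline bool eth_init() {
#ifdef ARDUINO
    return false;  // placeholder for the real ETH.begin(...) call
#else
    return false;  // PC / Raspberry Pi stub returns a safe value
#endif
}

}  // namespace pal_sketch

// Hypothetical module: no #ifdef here, only PAL calls.
struct EthernetModuleSketch {
    bool initialised = false;

    void setup() {
        if (!pal_sketch::has_ethernet()) return;  // PC builds stop here
        initialised = pal_sketch::eth_init();
        // A build without Ethernet support must never report initialised.
        assert(pal_sketch::has_ethernet() || !initialised);
    }
};
```

Because has_ethernet() is constexpr, the early return in setup() costs nothing at runtime and the module source stays identical across all build targets.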
Part B: pal.md update¶
- "Three-way platform switch" table renamed to "Two-way platform switch"; IDF_VER row removed.
- Rule statement added: use Arduino wrappers by default; fall back to direct IDF calls only when no Arduino wrapper exists; those calls go inside the ARDUINO block.
- "IDF migration path" section removed (all items in that table were IDF_VER-specific stubs, now gone).
Part C: Status docs consolidation¶
docs/status/deploy-summary.md was a near-duplicate of docs/status/index.md. The deploy pipeline table (Build/Flash/Run/Live columns) was merged into index.md as a new ## Deploy summary section above the existing ## Test results table. deploy-summary.md was deleted and all 14 references across docs, scripts, and mkdocs.yml updated to point to status/index.md.
summarise.py was simplified accordingly: the _write_deploy_summary_md function was removed; its table-generation logic moved into _write_index_md. A ## Detail pages section is now appended to index.md on every run, listing all live-results-*.md files found on disk (not just those written in the current run), so results from previous hardware runs remain visible.
Part D: Deploy pipeline correctness¶
Two bugs where pipeline scripts silently destroyed previous good results:
summarise.py overwrote per-env MD files. live_suite.py writes docs/status/live-results-{env}.md directly after each run; it includes a ## Summary section (with per-test check counts) and a ## Scenarios section (per-step fps/heap data). summarise.py was independently re-generating these same files from the JSON, but using a simpler format without those sections. Fix: replaced _write_single_env_results / _write_live_results_md with _scan_live_files, which scans existing MD files on disk and returns their paths as links. summarise.py no longer writes per-env MD files at all.
livetest.py overwrote logs for unreachable devices. _run_esp32_test opened the log file for writing (truncating it) before attempting to connect, so an unreachable device always destroyed the previous run's log. Fix: a reachability probe (GET /api/system) runs before any file is opened; if it fails the device is skipped with a message and the log and JSON are left untouched.
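The ordering fix in livetest.py (probe first, open the log second) is language-independent. The sketch below restates it in C++ for consistency with the other examples in these notes; it is not a port of the Python script. runWithLog, its parameters, and the callback-based probe are hypothetical, with the probe standing in for the GET /api/system reachability check.

```cpp
#include <cassert>
#include <fstream>
#include <functional>
#include <string>

// Returns true if the device was reachable and the log was rewritten;
// returns false if the device was skipped and the old log left untouched.
inline bool runWithLog(const std::string& logPath,
                       const std::function<bool()>& probe,
                       const std::string& newLog) {
    assert(probe && "a reachability probe is required");

    // Probe before any file is opened: a failed probe must not truncate
    // the previous run's log.
    if (!probe()) return false;

    // Truncation happens only here, after the device answered.
    std::ofstream out(logPath, std::ios::trunc);
    out << newLog;
    return out.good();
}
```

The original bug corresponds to constructing the output stream before calling the probe; moving that single line below the early return is the whole fix.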
Definition of Done¶
- [x] Pal.h: no IDF_VER anywhere; all functions use #ifdef ARDUINO / #else
- [x] pal.md: two-way switch table; rule statement; IDF migration section removed
- [x] deploy-summary.md deleted; index.md has merged deploy + test tables and detail links
- [x] summarise.py: no longer writes per-env MD files; _scan_live_files preserves live_suite.py output
- [x] livetest.py: reachability probe before opening log; unreachable devices skip without touching files
- [x] PC build clean; 392/392 unit tests pass; PC live tests pass
- [x] esp32dev and esp32s3_n16r8 builds clean
Result¶
| Metric | Value |
|---|---|
| Unit tests | 392/392 pass |
| PC build | clean |
| PC live tests | 15/15 pass |
| esp32s3_n16r8 live tests | 15/15 pass |
| esp32dev build | 1441 KB flash (78.5%), 64 KB RAM (20.0%) — unchanged |
| esp32s3_n16r8 build | 1427 KB flash (34.8%), 62 KB RAM (19.5%) — unchanged |
| Pal.h line count | ~800 (down from ~1510) |
| deploy-summary.md | removed; content merged into index.md |
Retrospective¶
What went well:
- The PAL rewrite was clean: IDF_VER branches were either identical to the Arduino path or simple stubs, so removing them caused zero regressions.
- The _eth IDF namespace was entirely internal to PAL (no modules depended on it), making deletion safe.
- The status consolidation caught two separate bugs in the deploy pipeline during the same session; fixing them together while the code was open was efficient.
What was tricky:
- The summarise.py / live_suite.py split of responsibilities was not obvious: both were writing the same files, with summarise.py's version silently losing the Scenarios section. The fix required tracing the full data flow from JSON through both writers.
- Sprint 11 scope document still references the old three-way pattern ("PAL structure and the IDF_VER path" background section). Left as-is since it accurately records the design as it stood when Sprint 11 was written.
Seeds for Sprint 14:
- Hardware live test: connect LAN8720 (esp32dev) and verify Ethernet + WiFi reconnect behavior on real hardware.
- DeviceInfo.h extraction: reducing circular include friction between NetworkModule children (WifiAp/WifiSta both include Network.h for deviceName()).
Complexity estimate: Medium (M).
Release 7 Backlog¶
All items consolidated into the cross-release backlog.