Release 7: OTA, Ethernet, and Runtime Hardening (v1.7.0)

Theme: Release 7 completes the field deployment story: over-the-air firmware updates with a CI release pipeline, Windows support, and a full Ethernet + WiFi management stack with automatic reconnect. Runtime hardening spans heap/OOM safety, static RAM tuning, and WebSocket log streaming. It closes out with PAL simplification (IDF_VER removal) and deploy pipeline correctness fixes.


Release Overview

Foundation from previous releases

| Capability | Notes |
|---|---|
| Virtual/Physical layer split | Effects render in virtual space; layouts own the physical mapping |
| PhysMap | 1:0, 1:1, 1:N pixel mappings; PSRAM-backed on S3 |
| Modifier library | Mirror, Checkerboard, Scroll, Rotate, Tile |
| Non-rectangular layouts | RingLayout, WheelLayout, XmasTreeLayout |
| Memory observability | MemBoot balance sheet per module; MemLive fragmentation warnings at runtime |
| Time observability | Per-second CPU accounting with module hierarchy; REST + WS exposure |
| Scenario benchmarking | Declarative JSON pipelines shared by unit tests and live tests; fps + heap per step |
| Live test suite | 13 tests (smoke / format / behavioral / integration) on PC + ESP32 via REST |
| Deploy pipeline | all.py with post-flash mem capture (--reset), live tests, and status docs |
| 357 unit tests | All passing; smoke / format / behavioral / integration classification |
| Developer tooling | uv workspace, pre-commit hooks, PAL compile-time enforcement, MCP server |

What Release 7 delivers

| Problem / Goal | Sprint |
|---|---|
| No over-the-air firmware update path | Sprint 1 (FirmwareUpdateModule: file + GitHub) |
| No firmware assets on GitHub releases / no nightly build | Sprint 2 (CI release pipeline) |
| No Windows build or release binary | Sprint 3 (Windows build + CI) |
| Scenario baselines not populated from hardware | Sprint 5 (Scenario baseline + extends) |
| Classic ESP32 static RAM / fragmentation headroom thin | Sprint 6 (LOG_RING_SIZE tuning, WiFi buffer counts, dual check_alloc) |
| Ring buffer diagnostics not visible without serial monitor | Sprint 7 (GET /api/log WebSocket streaming + frontend log panel) |
| Heap safety and HTTP OOM crashes | Sprint 8 (per-module controlAllocBytes, HTTP OOM catch) |
| No dropdown control type | Sprint 9 (select control: backend index + frontend <select>) |
| Module creation UX and WiFi management gaps | Sprint 10 (boot module redesign, dynamic AP/STA management) |
| No Ethernet support (LAN8720 / W5500) | Sprint 11 (EthernetModule, PAL functions, static IP) |
| WiFi does not reconnect after signal loss or Ethernet drop | Sprint 12 (STA retry every 30 s, Ethernet-drop STA re-enable) |
| PAL IDF_VER dead code; status doc duplication; pipeline bugs | Sprint 13 (IDF_VER removal, deploy-summary consolidation, pipeline fixes) |

Sprints

| Sprint | Goal |
|---|---|
| Sprint 1 | FirmwareUpdateModule: OTA PAL + file upload + GitHub releases UI + env in SystemStatus |
| Sprint 2 | CI release pipeline: firmware assets on tagged releases + nightly pre-release build |
| Sprint 3 | Windows build: #ifdef _WIN32 in WsServer.h and Pal.h; ws2_32 link; CI job; .zip artifact |
| Sprint 5 | Scenario baseline: first hardware --update-baseline run; "extends" inheritance; wire into all.py |
| Sprint 6 | Static RAM hardening: LOG_RING_SIZE 4 KB on all devices, WiFi buffer tuning, dual check_alloc guard |
| Sprint 7 | GET /api/log frontend panel: WS push of ring buffer entries; collapsible log UI |
| Sprint 8 | Heap safety: per-module controlAllocBytes hook, HTTP OOM catch, live-test correctness, isPermanent() removal |
| Sprint 9 | select control type: addControl(..., "select") + addControlValue(); backend index storage; frontend <select> rendering |
| Sprint 10 | Boot module creation redesign; dynamic AP/STA WiFi management with Ethernet gating |
| Sprint 11 | EthernetModule: LAN8720 (RMII) + W5500 (SPI); PAL functions; DHCP + static IP |
| Sprint 12 | WiFi reconnect: STA retry every 30 s; re-enable STA on Ethernet drop; PAL architecture docs |
| Sprint 13 | PAL IDF_VER removal; deploy-summary.md consolidation; deploy pipeline correctness fixes |

Sprint 1: FirmwareUpdateModule

Scope: Full over-the-air firmware update: PAL plumbing, a POST /api/firmware endpoint, and a FirmwareUpdateModule that supports both a local file upload and one-click flashing from GitHub releases. The GitHub releases path fetches the public releases API in the browser, matches assets to the current device environment, and streams the binary to the device — no internet access required on the ESP32 itself.

Deferred from: Release 5 original scope.

Summary

| Part | Description | Est |
|---|---|---|
| PAL + HTTP endpoint | pal::ota_* functions, POST /api/firmware streaming endpoint, dual-OTA partition scheme | M |
| FirmwareUpdateModule | Module lifecycle, update_status control, OTA state integration | S |
| Frontend: file upload | File picker tab, XHR streaming to endpoint, progress bar | S |
| Frontend: GitHub releases | Browser-direct GitHub API, asset matching by env, version badge, sessionStorage cache | M |
| SystemStatus env field | "env": BUILD_TARGET in GET /api/system and healthReport() | XS |
| Tests | PAL stub tests, endpoint test (ota_write call count, ota_end once) | S |
| Total | | L |

Planned scope

PAL and endpoint:

  • pal::ota_begin(), pal::ota_write(buf, len), pal::ota_end(), pal::ota_abort() in Pal.h. On ESP32: wraps esp_ota_*. On PC: writes received bytes to a temp file and prints a log line.
  • POST /api/firmware (multipart or chunked binary body): streams bytes through pal::ota_write, calls pal::ota_end() on completion, triggers reboot. Returns {"ok":true} or {"error":"..."}.
  • Partition scheme: dual-OTA layout (partitions/esp32dev-ota.csv, partitions/esp32s3-ota.csv) so a running image can be updated without erasing LittleFS.
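
The begin / write / end / abort sequence described above can be sketched in a few lines. This is only an illustration of the control flow, written in Python rather than the project's C++: the class `PcOtaStub`, the function `handle_firmware_upload`, and the error strings are hypothetical names, loosely modelled on the PC stub behaviour (bytes written to a temp file).

```python
import os
import tempfile


class PcOtaStub:
    """Hypothetical stand-in for the PC side of pal::ota_*: bytes go to a temp file."""

    def __init__(self):
        self.file = None
        self.path = None
        self.write_calls = 0
        self.end_calls = 0

    def begin(self):
        fd, self.path = tempfile.mkstemp(suffix=".bin")
        self.file = os.fdopen(fd, "wb")
        return True

    def write(self, buf):
        self.file.write(buf)
        self.write_calls += 1
        return True

    def end(self):
        self.file.close()
        self.end_calls += 1
        return True

    def abort(self):
        # Close and discard whatever was received so far.
        if self.file and not self.file.closed:
            self.file.close()
        if self.path and os.path.exists(self.path):
            os.remove(self.path)


def handle_firmware_upload(ota, chunks):
    """Stream chunks through the OTA object; return a JSON-style reply dict."""
    if not ota.begin():
        return {"error": "ota_begin failed"}  # error text is an assumption
    for chunk in chunks:
        if not ota.write(chunk):
            ota.abort()
            return {"error": "ota_write failed"}
    if not ota.end():
        ota.abort()
        return {"error": "ota_end failed"}
    return {"ok": True}


# A 1 KB payload split into 256-byte chunks: four writes, one end.
payload = bytes(1024)
chunks = [payload[i:i + 256] for i in range(0, len(payload), 256)]
ota = PcOtaStub()
reply = handle_firmware_upload(ota, chunks)
```

The point of the sketch is that no chunk is buffered beyond the current call, and that every failure path ends in an abort, so a half-written image is never finalised.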

SystemStatus — env field:

  • Add "env": BUILD_TARGET to GET /api/system response and to SystemStatus::healthReport(). BUILD_TARGET is already a compile-time define (esp32dev, esp32s3_n16r8, PC, …). This field is what the update UI uses to match GitHub release assets to the current device.

FirmwareUpdateModule:

  • Registered in ModuleRegistrations.cpp; isPermanent() = true.
  • Exposes an "update_status" display control (idle / downloading / flashing X% / done / error). disableSelf() is NOT used here; the module stays alive through the update.
  • WebSocket push: progress events {"type":"ota","pct":42} every ~5% so the frontend can update a progress bar without polling.

Frontend — two update paths in one card:

  • File upload tab: <input type="file" accept=".bin"> → reads file as ArrayBuffer → POST /api/firmware with Content-Type: application/octet-stream. Shows a progress bar driven by XMLHttpRequest.upload.onprogress.
  • GitHub releases tab: on open, browser JS calls https://api.github.com/repos/ewowi/projectMM/releases?per_page=5 directly (public API, no auth, no device internet required). For each release shows: tag name, release title, date, pre-release badge. Downloads the asset matching projectMM-{env}.bin (where env comes from GET /api/system). Streams the downloaded ArrayBuffer to POST /api/firmware. Shows the same progress bar. If no matching asset exists for a release, that release is greyed out.
  • Maximum 5 releases shown; controlled by the per_page query parameter.
  • Error handling: network failure fetching GitHub API shows "GitHub unreachable — use file upload"; missing asset shows "No firmware for {env} in this release".
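
The asset-matching rule for the GitHub releases tab can be illustrated as follows. The real logic runs as browser JavaScript; this Python sketch only mirrors the rule. The field names `tag_name`, `prerelease`, `assets`, and `name` are those of the public GitHub releases API, while `find_asset`, `release_rows`, and the sample data are hypothetical.

```python
def asset_name(env):
    """Asset naming convention: projectMM-{env}.bin."""
    return f"projectMM-{env}.bin"


def find_asset(release, env):
    """Return the asset dict matching this device env, or None."""
    wanted = asset_name(env)
    for asset in release.get("assets", []):
        if asset.get("name") == wanted:
            return asset
    return None


def release_rows(releases, env, limit=5):
    """Build one UI row per release; releases without a matching asset are not flashable."""
    rows = []
    for release in releases[:limit]:
        asset = find_asset(release, env)
        rows.append({
            "tag": release["tag_name"],
            "prerelease": release.get("prerelease", False),
            "flashable": asset is not None,
            "message": None if asset else f"No firmware for {env} in this release",
        })
    return rows


# Made-up sample data shaped like the GitHub releases API response.
sample = [
    {"tag_name": "v1.7.0", "prerelease": False,
     "assets": [{"name": "projectMM-esp32dev.bin"},
                {"name": "projectMM-esp32s3_n16r8.bin"}]},
    {"tag_name": "nightly", "prerelease": True,
     "assets": [{"name": "projectMM-esp32dev.bin"}]},
]
rows = release_rows(sample, "esp32s3_n16r8")
```

A row with `flashable` set to False corresponds to the greyed-out release in the card, with `message` holding the text shown for the missing asset.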

Tests:

  • Unit test: pal::ota_* PC stub writes bytes correctly and returns success.
  • Unit test: POST /api/firmware with a 1 KB payload calls ota_write N times and ota_end once.
  • Live test: flash a known-good binary via POST /api/firmware; assert version field in GET /api/system matches expected value after reboot.

Definition of Done

  • [x] pal::ota_begin/write/end/abort implemented in Pal.h (ESP32 wraps esp_ota_*; PC stubs return true and print)
  • [x] OtaHandle type alias: esp_ota_handle_t on ESP32, int on PC
  • [x] POST /api/firmware using onPostBinary (no body buffering; chunks stream directly to pal::ota_write)
  • [x] onPostBinary added to HttpServer.h (ESP32: upload/body callback; PC: buffers once, calls chunk handler)
  • [x] OtaState.h inline globals (g_otaStatus, g_otaPct, g_otaHandle) shared between AppRoutes and module
  • [x] FirmwareUpdateModule registered in CoreRegistrations.cpp
  • [x] FirmwareUpdateModule auto-created by ensureInfraModules on first boot (embedded only); not permanent — boot guard is the safety net
  • [x] "env": BUILD_TARGET added to GET /api/system via SystemStatus::fillSystemJson
  • [x] Frontend: File Upload tab in FirmwareUpdateModule card (file picker + XHR progress bar)
  • [x] Frontend: GitHub Releases tab (fetches public API, 1 hr sessionStorage cache, matches projectMM-{env}.bin)
  • [x] Frontend: Version badge in status bar when newer non-prerelease GitHub release has a matching asset
  • [x] flash_chip_mode() and psram_mode() PAL functions added; wired into SystemStatus controls and fillSystemJson (psram_mode inside totalPsramKb_ > 0 guard)
  • [x] Light mode fix: #preview-section gets background: #f5f5fa override so sticky bar blends with body in day mode
  • [x] 11 new unit tests (pal::ota_* stubs, OtaState globals, FirmwareUpdateModule lifecycle); 375/375 pass
  • [x] PC live test: PASS; ESP32 live tests: MM-70BC PASS, MM-ESP32 PASS
  • [x] esp32dev and esp32s3_n16r8 build successfully

Result

| Metric | Value |
|---|---|
| Unit tests | 375/375 pass (11 new) |
| PC live tests | 13/13 PASS |
| ESP32 live tests | MM-70BC PASS, MM-ESP32 PASS |
| esp32dev build | SUCCESS (~1.16 MB) |
| esp32s3_n16r8 build | SUCCESS (~1.16 MB) |
| POST /api/firmware | Returns {"ok":true} (PC stub); streams via body callback on ESP32 |
| Version badge | Shown when GitHub latest release tag newer than firmware_version and has matching .bin |
| flash_chip_mode / psram_mode | PAL functions + SystemStatus controls; psram_mode guarded by totalPsramKb_ > 0 |
| Light mode | #preview-section background override added; day mode preview bar now white |

Backlogged from this sprint (per user Q decisions):

  • Device-side WS progress events during OTA (Q1-B); XHR upload.onprogress used instead
  • Nightly pre-release channel in version badge (Q2-B); only stable releases shown
  • Live test: flash binary + verify version (requires hardware access with known binary on GitHub releases)

Retrospective

What went well:

  • onPostBinary cleanly separated from onPost (no RAM buffering for large binaries)
  • Dual-OTA partitions already in place; no CSV changes needed
  • Browser-direct GitHub API (CORS OK on public repos) avoids device internet access
  • checkForUpdate() uses sessionStorage to rate-limit GitHub API calls (1 hr TTL)
  • OtaState.h inline globals give clean shared state between the HTTP route and module without any RTOS sync overhead
  • flash_chip_mode / psram_mode PAL functions fit the existing pattern cleanly; compile-time CONFIG_SPIRAM_MODE_OCT is the reliable OPI indicator

What was tricky:

  • ota_end re-fetches the next OTA partition via esp_ota_get_next_update_partition(nullptr) since set_boot_partition is not called before ota_end; easy to miss
  • Light mode required an explicit #preview-section background override because sticky positioning pins the dark base color through the body override

Seeds for future sprints:

  • Sprint 2 (CI Release Pipeline) is the next step: assets must be published before the GitHub tab or version badge show anything useful
  • Live OTA test (flash + verify version change) is backlogged until Sprint 2 ships firmware assets


Sprint 2: CI Release Pipeline

Scope: Attach firmware binaries as GitHub release assets on every tagged release, and add a nightly pre-release that rebuilds automatically each night. These assets are what FirmwareUpdateModule's GitHub tab fetches.

Depends on: Sprint 1 (asset naming convention must match what FirmwareUpdateModule expects).

Complexity: S (YAML only; no C++ or Python changes).

Summary

| Part | Description | Est |
|---|---|---|
| release.yml alignment | Pin Python 3.12 + PlatformIO <7, add PIO package cache, asset upload job | S |
| nightly.yml | New workflow: cron 02:00 UTC, idempotent delete+recreate nightly pre-release | S |
| Total | | S |

Asset naming convention

| Asset | Source path | Matches env |
|---|---|---|
| projectMM-esp32dev.bin | .pio/build/esp32dev/firmware.bin | esp32dev |
| projectMM-esp32s3_n16r8.bin | .pio/build/esp32s3_n16r8/firmware.bin | esp32s3_n16r8 |
| projectMM-pc-macos.tar.gz | deploy/build/pc/macos/projectMM | PC (macOS CI runner) |
| projectMM-pc-windows.zip | deploy/build/pc/windows/projectMM.exe | PC (Windows, Sprint 3) |

Scope (confirmed)

.github/workflows/release.yml — aligned and complete:

  • Triggered by push: tags: ['v*'] or workflow_dispatch with tag input.
  • PC build: astral-sh/setup-uv@v5 + uv run deploy/build.py -target pc (aligned with ci.yml).
  • ESP32 builds: Python 3.12 pinned, PlatformIO pinned to <7, PlatformIO package cache added (both gaps vs ci.yml fixed).
  • upload-assets job uses gh release upload --clobber after all three build jobs pass.

.github/workflows/nightly.yml — new:

  • Triggered on schedule: cron: '0 2 * * *' (02:00 UTC daily) and workflow_dispatch.
  • Identical build matrix to release.yml (macOS PC + esp32dev + esp32s3_n16r8).
  • publish-nightly job: deletes existing nightly release + tag, re-creates as pre-release titled Nightly (YYYY-MM-DD) with short commit SHA in notes. Idempotent: the gh release delete ... || true guard handles first run.
  • The nightly pre-release appears in FirmwareUpdateModule's GitHub releases tab with a "pre-release" badge.

Backlogged from this sprint:

  • scripts/list_pio_envs.py + deploy/build.py --all-envs: not needed while only 2 ESP32 envs exist; pick up when ESP32-P4 is added (Release 8 Sprint 1).
  • PC Linux build artifact: macOS binary covers the main use case for now.

Definition of Done

  • [x] release.yml: PC build uses uv run deploy/build.py -target pc (was raw cmake)
  • [x] release.yml: ESP32 jobs pin Python 3.12 and platformio<7 (was 3.x + unpinned)
  • [x] release.yml: PlatformIO package cache added to both ESP32 jobs
  • [x] nightly.yml created: cron 02:00 UTC + workflow_dispatch; builds macOS PC + esp32dev + esp32s3_n16r8
  • [x] nightly.yml: publish-nightly job deletes and re-creates nightly pre-release (idempotent)
  • [x] Asset names match FirmwareUpdateModule expectations (projectMM-{env}.bin)

Result

| Metric | Value |
|---|---|
| release.yml | Aligned with ci.yml; tag-triggered; 3 build jobs + upload |
| nightly.yml | New; cron 02:00 UTC; delete+recreate nightly pre-release |
| Asset naming | projectMM-esp32dev.bin, projectMM-esp32s3_n16r8.bin, projectMM-pc-macos.tar.gz |
| Python/PlatformIO | Pinned to 3.12 and <7 in both workflows (consistent with ci.yml) |
| Unit tests | 375/375 (no new tests; YAML-only sprint) |

Retrospective

What went well:

  • release.yml already existed with the core structure; this sprint was alignment + nightly, not a rebuild from scratch
  • gh release delete nightly --yes --cleanup-tag 2>/dev/null || true pattern is clean and idempotent; no third-party action needed
  • Build matrix in nightly.yml is identical to release.yml so both stay in sync by copy

What was tricky:

  • release.yml had python-version: '3.x' and unpinned PlatformIO; these would have broken on PlatformIO 7 release or a Python 3.13 runner update (same issue ci.yml already fixed months ago)

Seeds for future sprints:

  • Sprint 3 (Windows) adds projectMM-pc-windows.zip to both release.yml and nightly.yml
  • Release 8 Sprint 1 (ESP32-P4) adds a third ESP32 build job; at that point list_pio_envs.py becomes worth adding to avoid three copies of the same job
  • Once a tagged release exists, run a manual workflow_dispatch to verify the upload-assets path end-to-end


Sprint 3: Windows Build

Scope: projectMM builds and runs as a native Windows binary (CMake + Clang/Ninja via llvm-mingw). CI job produces a .zip release artifact, and both release.yml and nightly.yml gain a build-pc-windows job. macOS build is unaffected.

Deferred from: Release 5 original scope.

Summary

| Part | Description | Est |
|---|---|---|
| Winsock2 guards | WsServer.h + Pal.h UDP: POSIX socket calls behind #ifdef _WIN32 | S |
| Socket shim unification | PcSocketShims.h unified header; PcSockets.h deleted; both consumers updated | S |
| CMake + build scripts | Ninja on Windows, pc_platform() helper, per-platform binary paths in all deploy scripts | M |
| Windows memory stats | VirtualQueryEx for free_heap_bytes(); MemBoot/MemLive correct on Windows | M |
| Output file split | live-results-pc-{platform}.json, per-env MD files; live-results-all.json dropped | S |
| CI integration | build-pc-windows job in ci.yml, release.yml, nightly.yml; .zip artifact | S |
| Test + misc fixes | /tmp/ relative-path fix, UTF-8 encoding, dangling-pointer onInputRemoved fix | M |
| Total | | XL |

Planned scope

  • #ifdef _WIN32 guards in WsServer.h and Pal.h (Winsock2 instead of POSIX sockets).
  • ws2_32 link in CMakeLists.txt and tests/CMakeLists.txt.
  • GitHub Actions CI job on windows-latest (build + unit tests); adds projectMM-pc-windows.zip to release and nightly artifact lists.
  • deploy/build.py: -G Ninja on Windows (single-config generator, binary at predictable path).

Additional work discovered during implementation:

  • src/pal/MemoryStats.h: Windows branch using GetDiskFreeSpaceExA (no sys/statvfs.h).
  • tests/ws_test_client.h: full rewrite with _wstc* socket shims for Winsock2 compatibility.
  • tests/test_module_manager.cpp, tests/test_reorder.cpp: fix hardcoded /tmp/ paths to relative paths (no /tmp/ on Windows).
  • deploy/unittest.py: add encoding='utf-8' to markdown write (Windows default codec lacks emoji support); add blank-line stripping from run-tests.log output.
  • deploy/_lib.py: add pc_platform() helper ("windows" / "macos" / "linux").
  • All deploy scripts (build.py, run.py, livetest.py, summarise.py, unittest.py): paths updated from deploy/build/pc/ to deploy/build/pc/{platform}/ and logs from *-pc.log to *-pc-{platform}.log.
  • Socket code sharing: src/pal/PcSocketShims.h created as a shared header with unified _ws* shim functions (open, close, accept, connect, recv, send, wait). src/pal/PcSockets.h merged in and deleted. WsServer.h and ws_test_client.h both include PcSocketShims.h; ws_test_client.h no longer has its own _wstc* duplicates.
  • Windows MemBoot/MemLive: pal::free_heap_bytes() on Windows implemented via VirtualQueryEx (walks committed private virtual memory regions; "free" = 512 MB ceiling minus committed). pal::total_heap_kb() returns the matching ceiling. pal::s_freeHeapCache_() caches the last scan so max_alloc_bytes() avoids a second scan in the same tick. MemBoot and MemLive lines now appear in the Windows server log with correct per-module deltas.
  • Output file improvements: live-results-pc.json renamed to live-results-pc-{platform}.json; live-results-all.json dropped entirely. docs/status/live-results.md split into per-env files (live-results-pc-windows.md, live-results-esp32dev.md, live-results-esp32s3_n16r8.md). livetest_out.txt deleted. deploy/summarise.py rewritten to read per-device JSON files directly.
  • docs/developer-guide/deploy.md: fully updated for Windows (toolchain requirements, Ninja, llvm-mingw, uv run throughout, per-platform binary paths, CI table with Windows row, log file table).
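
As a rough illustration of the pc_platform() helper and the per-platform paths it drives, a minimal sketch might look like this. It is an assumption, not the code in deploy/_lib.py: the optional `platform` parameter and the helpers `pc_binary_path` and `pc_log_name` are hypothetical and exist only to make the mapping testable.

```python
import sys


def pc_platform(platform=None):
    """Map a sys.platform string to the three-way slug used in paths and logs."""
    platform = sys.platform if platform is None else platform
    if platform.startswith("win"):
        return "windows"
    if platform == "darwin":
        return "macos"
    return "linux"


def pc_binary_path(platform=None):
    """Per-platform binary location under deploy/build/pc/{platform}/."""
    slug = pc_platform(platform)
    exe = "projectMM.exe" if slug == "windows" else "projectMM"
    return f"deploy/build/pc/{slug}/{exe}"


def pc_log_name(stem, platform=None):
    """Log file name of the form {stem}-pc-{platform}.log."""
    return f"{stem}-pc-{pc_platform(platform)}.log"
```

Keeping the slug in one function means the build, run, live-test, and summary scripts cannot drift apart on how they spell the platform.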

Definition of Done

  • [x] src/core/WsServer.h: Winsock2 shims replace POSIX socket calls under #ifdef _WIN32.
  • [x] src/pal/Pal.h: UDP functions (udp_bind, udp_recv, udp_send, udp_broadcast) compile on Windows.
  • [x] src/pal/MemoryStats.h: Windows branch provides getMemoryStats() via GetDiskFreeSpaceExA.
  • [x] CMakeLists.txt + tests/CMakeLists.txt: ws2_32 linked on Windows.
  • [x] tests/ws_test_client.h: cross-platform socket shims; test helper compiles on Windows.
  • [x] deploy/build.py: Ninja generator selected on Windows; binary at deploy/build/pc/windows/projectMM.exe.
  • [x] All 375 unit tests pass on Windows (Clang 18 + llvm-mingw-ucrt + Ninja).
  • [x] ci.yml: build-pc-windows job (build + unit tests).
  • [x] release.yml + nightly.yml: build-pc-windows job; .zip artifact included.
  • [x] Deploy scripts use deploy/build/pc/{platform}/ paths; macOS logs remain *-pc-macos.log.
  • [x] All 13 live test groups pass on Windows (all_pc.py: 4 passed, 0 failed).
  • [x] src/pal/PcSocketShims.h: unified socket shim header; PcSockets.h merged in and deleted; WsServer.h and ws_test_client.h both include PcSocketShims.h.
  • [x] pal::free_heap_bytes() on Windows via VirtualQueryEx; MemBoot/MemLive lines appear in Windows server log with correct per-module deltas.
  • [x] Live result files split per platform (live-results-pc-{platform}.json); per-env docs/status/live-results-*.md files generated; live-results-all.json dropped.
  • [x] docs/developer-guide/deploy.md updated: Windows toolchain requirements, uv run throughout, per-platform paths, CI table with Windows row.

Result

| Metric | Value |
|---|---|
| Unit tests (Windows) | 375 / 375 passed |
| Test assertions (Windows) | 1807 / 1807 passed |
| Live test groups (Windows) | 13 / 13 passed (133 assertions) |
| all_pc.py result | 4 / 4 steps passed |
| Toolchain | Clang 18.1.8 + llvm-mingw-20240619-ucrt-x86_64 + Ninja |
| Build target | projectMM.exe (Windows x86-64) |
| Files changed | 50 source, deploy, docs, and CI files |
| macOS tests (unaffected) | unchanged (375 pass in CI) |
| Windows MemBoot | Correct per-module deltas via VirtualQueryEx (frag% display deferred; see backlog) |
| Socket shim files | PcSocketShims.h unified; PcSockets.h deleted |

Retrospective

What went well:

  • The socket shim pattern (_ws* unified in PcSocketShims.h) kept platform branches out of class bodies and eliminated the duplicate _wstc* block that had grown alongside the original _ws* set.
  • pc_platform() in _lib.py gives a single source of truth for the three-way platform string; all deploy scripts and CI reference it.
  • UDP broadcast loopback (Art-Net test5) works on Windows without any changes.
  • VirtualQueryEx gives realistic, per-module heap deltas in MemBoot; the approach is correct even though the frag% display has a pending fix.
  • Splitting live-results-all.json into per-platform files and live-results.md into per-env files removes the aggregation step and makes each device's results self-contained.

What was tricky:

  • sys/statvfs.h (MemoryStats.h) and arpa/inet.h (ws_test_client.h) are not available on Windows and required additional platform guards not in the original scope.
  • /tmp/ hardcoded in several test files causes silent failures on Windows (file not written, module not loaded, findById returns nullptr). Fixed by switching to relative paths.
  • Python's Path.write_text on Windows uses the system default encoding (cp1252), which cannot encode the emoji used in test-results.md. Fixed by passing encoding='utf-8'.
  • The build/pc/ flat layout conflated macOS and Windows artifacts. Restructured to build/pc/{platform}/ in the same sprint.
  • Latent dangling-pointer bug exposed by Windows: DriverLayer stores raw EffectsLayer* pointers in sources_[]. When delete_all_modules() freed an EffectsLayer, driver1 retained a stale pointer and crashed on the next loop() tick. macOS tolerated the dangling access; Windows terminated the process. Fixed by adding Module::onInputRemoved(Module*).
  • Windows heap measurement: HeapWalk (Win32 default heap) and GlobalMemoryStatusEx (system-wide RAM) were tried and rejected before settling on VirtualQueryEx. HeapWalk walks the wrong heap (Win32 vs UCRT malloc), giving ~20 KB values that triggered check_alloc() denial and crashed the server. GlobalMemoryStatusEx returns 4 GB+ with no per-allocation granularity.
  • frag% overflow: largNow * 100u overflows uint32_t at ~500 MB values. Fixed in pal::memEvent() with a (uint64_t) cast; the same overflow exists in StatefulModule.h and Scheduler.cpp and is deferred to the backlog (see index.md).
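
The frag% overflow is easy to reproduce with plain integer arithmetic. The sketch below emulates a 32-bit product by masking; the exact frag% expression is not shown in this document, so "largest block as a percentage of free heap" is an assumed stand-in, and the function names are hypothetical.

```python
UINT32_MASK = 0xFFFFFFFF
MB = 1024 * 1024


def pct_uint32(largest, free):
    """largest * 100 / free with the product wrapped to 32 bits, as uint32_t math would."""
    return ((largest * 100) & UINT32_MASK) // free


def pct_uint64(largest, free):
    """Same expression with the product widened first, as the (uint64_t) cast does."""
    return (largest * 100) // free


free = 512 * MB      # the Windows heap ceiling used above
largest = 500 * MB   # a largest-block value at the scale reported

wrapped = pct_uint32(largest, free)   # 1  (product wrapped modulo 2**32)
widened = pct_uint64(largest, free)   # 97
overflow_limit = UINT32_MASK // 100   # 42949672 bytes, roughly 41 MB
```

Any operand above `overflow_limit` wraps, so on a PC-sized heap the 32-bit form is wrong for most realistic values, while on an ESP32 with a few hundred KB of heap it never triggers. That is consistent with the bug only surfacing on the Windows port.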

Seeds for future sprints:

  • The PC build is CI-tested on macOS and Windows only; Linux has no CI leg. A ubuntu-latest CI leg would close the triangle (low effort: same uv run deploy/build.py -target pc command, linux slug already in pc_platform()).
  • The timing-sensitive test "Scheduler timing accumulator tracks SpinModule within 5%" occasionally flakes under heavy CI load on Windows (passes in isolation). Consider widening epsilon or moving to a dedicated timing fixture.
  • Windows MemBoot frag% accuracy: apply the (uint64_t) overflow fix to StatefulModule.h and Scheduler.cpp, and fix call order (max_alloc before free_heap in Scheduler). Tracked in the backlog.
  • Effects animate slowly in the WebGL preview on Windows but not on macOS. Root cause not yet identified (push-rate throttle, time-unit mismatch, or browser queue lag). Tracked in the backlog.


Sprint 5: Scenario Baseline and extends

Scope: Populate deploy/test/scenario-baseline.json from a real ESP32 run; add "extends" inheritance to scenario files; wire --compare-baseline into deploy/all.py.

Deferred from: Sprint 10 retrospective seeds.

Complexity: M

Summary

| Part | Description | Est |
|---|---|---|
| extends support | Single-level inheritance in scenario.py, live_suite.py, test_scenarios.cpp (identical logic in each) | M |
| New scenario files | base-pipeline-64x64.json and four-layers.json using extends | S |
| Baseline population | Run on MM-70BC hardware, commit scenario-baseline.json | S |
| all_pc.py integration | _run_scenario_baseline(): start server, compare baseline, non-fatal | S |
| Total | | M |

Planned scope

  • Run deploy/scenario.py --update-baseline against MM-70BC (ESP32-S3); commit result.
  • Implement "extends" key (single-level): load parent steps and prepend them; child metadata wins.
  • deploy/all_pc.py: after live tests, start the PC server and run deploy/scenario.py --compare-baseline; print warning on regressions (non-fatal).
  • Add base-pipeline-64x64.json and four-layers.json stress scenarios, both using "extends".
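
Single-level "extends" resolution, as described in the list above, could look roughly like the following. This is a sketch under assumptions rather than the code in deploy/scenario.py: it assumes the "extends" value is the parent's file name without the .json suffix, located in the same directory, and the step dictionaries in the demo are invented.

```python
import json
import os
import tempfile


def load_scenario(path):
    """Load a scenario; if it has "extends", prepend the parent's steps (one level only)."""
    with open(path, encoding="utf-8") as f:
        child = json.load(f)
    parent_name = child.get("extends")
    if not parent_name:
        return child
    parent_path = os.path.join(os.path.dirname(path), parent_name + ".json")
    with open(parent_path, encoding="utf-8") as f:
        parent = json.load(f)
    skip = ("steps", "extends")
    # Parent metadata first, then child metadata on top: the child wins.
    merged = {k: v for k, v in parent.items() if k not in skip}
    merged.update({k: v for k, v in child.items() if k not in skip})
    # Parent steps run first, child steps are appended.
    merged["steps"] = parent.get("steps", []) + child.get("steps", [])
    return merged


# Demo with two throwaway files; the step contents are made up.
with tempfile.TemporaryDirectory() as tmp:
    parent = {"name": "base-pipeline-32x32", "description": "base",
              "steps": [{"op": "add_layer"}]}
    child = {"name": "base-pipeline-64x64", "extends": "base-pipeline-32x32",
             "steps": [{"op": "resize", "w": 64, "h": 64}]}
    for doc in (parent, child):
        with open(os.path.join(tmp, doc["name"] + ".json"), "w", encoding="utf-8") as f:
            json.dump(doc, f)
    resolved = load_scenario(os.path.join(tmp, "base-pipeline-64x64.json"))
```

Because the parent's own "extends" key is never followed, a chain of three files resolves only one level, which matches the deliberately non-recursive scope of this sprint.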

Definition of Done

  • [x] deploy/scenario.py load_scenario(): single-level "extends" resolves parent file and prepends parent steps
  • [x] deploy/live_suite.py run_scenario(): same extends resolution for live tests
  • [x] tests/test_scenarios.cpp resolve_extends(): same resolution so C++ scenario replay handles the new files
  • [x] deploy/test/scenarios/base-pipeline-64x64.json: extends base-pipeline-32x32, adds resize to 64x64
  • [x] deploy/test/scenarios/four-layers.json: extends two-layers, adds GameOfLife + Noise layers
  • [x] deploy/all_pc.py: _run_scenario_baseline() starts PC server, runs scenario --compare-baseline, non-fatal
  • [x] deploy/test/scenario-baseline.json: populated from MM-70BC (ESP32-S3); 7 scenarios, all steps measured
  • [x] 375/375 unit tests pass; all scenario replay tests include extended scenarios

Result

| Metric | Value |
|---|---|
| Unit tests | 375/375 pass (20 new assertions from extended scenario replay) |
| PC live tests | 13/13 PASS (including 2 new extended scenarios) |
| Baseline | Populated from MM-70BC: 7 scenarios, ~177 KB free heap at base pipeline |
| Scenarios | 7 files (5 pre-existing + base-pipeline-64x64, four-layers) |

Backlogged from this sprint:

  • system_fps baseline threshold too tight (50%+ swings between runs on hardware); tracked in cross-release backlog.
  • Recursive extends (parent can itself extend) deferred until a chain is actually needed.

Retrospective

What went well:

  • Single-level extends is a clean pattern: parent steps first, child steps appended, child metadata wins. No ambiguity.
  • The three places that load scenario JSON (scenario.py, live_suite.py, test_scenarios.cpp) each got identical logic in ~8 lines; no shared abstraction needed at this scale.
  • _run_scenario_baseline() in all_pc.py cleanly manages its own server lifetime (start, run, terminate) as a self-contained helper.

What was tricky:

  • system_fps is too volatile for a 20% threshold on hardware (WiFi task preemption causes 30-65% swings between identical runs). The baseline pass/fail signal is unreliable for fps; heap metrics are stable and useful.
  • The live suite (live_suite.py) loads scenario JSON independently of scenario.py, so extends resolution had to be added in three places. A shared Python utility would reduce duplication if more scenario features are added.

Seeds for future sprints:

  • Scope baseline checks to heap metrics only (heap_free, max_alloc); skip fps or widen its threshold to 50%.
  • Recursive extends if scenario hierarchies deepen.


Sprint 6: Static RAM Hardening for Classic ESP32

Scope: Reduce the permanent .bss footprint on esp32dev (no PSRAM) to give module setup more headroom. The log ring buffer and WiFi buffer allocation are the two largest tunable levers.

Identified in: R6S8 live device analysis (esp32dev free-heap floor ~109 KB, only 19 KB above 90 KB reserve; fragmentation 55%+).

Complexity: S

Summary

| Part | Description | Est |
|---|---|---|
| Ring buffer resize | LOG_RING_CAP=32, LOG_RING_ENTRY=64 (2 KB, saves 6 KB .bss); test updated | XS |
| check_alloc dual guard | Adds max-alloc block check alongside free-heap reserve; printf on failure reason | S |
| WiFi buffer investigation | -DCONFIG_ESP32_WIFI_DYNAMIC_RX_BUFFER_NUM attempted then removed (pre-compiled framework conflict) | XS |
| Total | | S |

Planned scope

  • Set ring to 32 entries x 64 bytes = 2 KB on all devices. Saves 6 KB vs the original 8 KB ring on classic ESP32. Trade-off: the ring holds ~32 lines instead of ~64.
  • Tune WiFi dynamic RX buffer count: -DCONFIG_ESP32_WIFI_DYNAMIC_RX_BUFFER_NUM=16 was attempted in build_flags but the symbol is already defined in the framework's pre-compiled sdkconfig.h, causing a redefinition warning. Flag removed; WiFi buffer count cannot be overridden this way with the Arduino framework blob.
  • Upgrade pal::check_alloc to a dual guard: free_heap_bytes() >= bytes + reserve AND max_alloc_bytes() >= bytes. Surface the failure reason via printf ("check_alloc: reserve violation" vs "check_alloc: largest block too small").
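
The dual guard can be expressed compactly. The sketch below restates the two conditions and the two failure messages from the list above in Python; the real function lives in Pal.h and reads the heap itself, whereas here `free_heap` and `max_alloc` are passed in as parameters (an assumption made for testability), and the 90 KB default reserve is taken from the live-device analysis mentioned at the top of this sprint.

```python
KB = 1024
RESERVE_BYTES = 90 * KB  # assumed default, from the "90 KB reserve" noted above


def check_alloc(requested, free_heap, max_alloc, reserve=RESERVE_BYTES):
    """Return (ok, reason). Both the reserve and the largest block must allow the request."""
    if free_heap < requested + reserve:
        return False, "check_alloc: reserve violation"
    if max_alloc < requested:
        return False, "check_alloc: largest block too small"
    return True, ""


# Enough total heap and a large enough block: allowed.
ok_case = check_alloc(10 * KB, free_heap=109 * KB, max_alloc=40 * KB)
# Would dip below the reserve: refused by the first guard.
reserve_case = check_alloc(30 * KB, free_heap=109 * KB, max_alloc=40 * KB)
# Plenty of total heap, but fragmented: refused by the second guard.
frag_case = check_alloc(50 * KB, free_heap=200 * KB, max_alloc=40 * KB)
```

The third case is the blind spot the old single check missed: total free memory is sufficient, yet no single contiguous block can satisfy the request.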

Definition of Done

  • [x] src/core/Logger.cpp: LOG_RING_CAP = 32, LOG_RING_ENTRY = 64 (2 KB total)
  • [x] platformio.ini: -DCONFIG_ESP32_WIFI_DYNAMIC_RX_BUFFER_NUM=16 attempted and removed — conflicts with pre-compiled sdkconfig.h in Arduino framework blob
  • [x] src/pal/Pal.h check_alloc(): dual guard checks both free heap reserve AND max contiguous block; printf on failure
  • [x] tests/test_logger.cpp: ring overflow test updated for 32-entry cap (38 entries pushed, 32 survive from entry6)
  • [x] esp32dev and esp32s3_n16r8 build successfully
  • [x] 375/375 unit tests pass
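
The ring arithmetic and the overflow expectation in the updated test can be checked with a bounded deque. This models only the drop-oldest behaviour, not the Logger.cpp implementation; the old dimensions (64 x 128) are taken from the Result table of this sprint.

```python
from collections import deque

LOG_RING_CAP = 32     # entries
LOG_RING_ENTRY = 64   # bytes per entry

new_size = LOG_RING_CAP * LOG_RING_ENTRY   # 2048 bytes = 2 KB
old_size = 64 * 128                        # 8192 bytes = 8 KB
saved = old_size - new_size                # 6144 bytes = 6 KB

# Push 38 entries into a 32-slot ring: the 6 oldest are overwritten.
ring = deque(maxlen=LOG_RING_CAP)
for i in range(38):
    ring.append(f"entry{i}")

first_survivor = ring[0]    # "entry6"
last_survivor = ring[-1]    # "entry37"
```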

Result

| Metric | Value |
|---|---|
| Ring size | 32 x 64 = 2 KB (was 64 x 128 = 8 KB, saves 6 KB .bss) |
| WiFi RX buffers | Not changed: symbol already defined in framework sdkconfig.h; -D override causes redefinition warning and was removed |
| check_alloc | Dual guard: free-heap reserve + max-alloc block; printf on refusal |
| Unit tests | 375/375 pass |
| esp32dev build | SUCCESS |
| esp32s3_n16r8 build | SUCCESS |

Backlogged from this sprint:

  • Verify WiFi buffer flag runtime effect on hardware (depends on pioarduino compiling WiFi component from source vs precompiled blob).
  • Update MemBoot/MemLive baseline table in docs with post-hardening numbers from MM-C1BC (requires live flash and measurement).

Retrospective

What went well:

  • Ring size change is a 2-line edit in Logger.cpp with a single test update; zero risk.
  • Dual check_alloc guard closes the fragmentation blind spot cleanly: the previous check passed when total free was enough but no single block was large enough to satisfy the allocation.
  • printf for the guard failure message avoids a Logger dependency in Pal.h.

What was tricky:

  • -DCONFIG_ESP32_WIFI_DYNAMIC_RX_BUFFER_NUM=16 in build_flags causes a redefinition warning: sdkconfig.h in the pre-compiled Arduino framework blob already defines the symbol. The WiFi component is not compiled from source, so Kconfig values are fixed at framework build time and cannot be overridden via compiler flags. Removed the flag; WiFi buffer tuning requires a custom framework build or a sdkconfig.defaults approach outside the standard pioarduino setup.

Seeds for future sprints:

  • WiFi RX buffer tuning via sdkconfig.defaults (requires custom framework build); backlogged.
  • Move ring buffer to PSRAM-backed heap allocation in setup() if 2 KB .bss still matters (requires PAL extension).


Sprint 7: GET /api/log Frontend Panel

Scope: Surface the existing ring buffer (R6S2) in the frontend as a live log panel, removing the need for a serial monitor during field debugging.

Identified in: R6S2 retrospective ("ring buffer exists; streaming it to the frontend is the obvious next step") and R6 backlog.

Complexity: S

Summary

Part Description Est
Log colouring _logClass() for warn/error; .log-warn/.log-error CSS classes; light-mode overrides S
Scroll management logAtBottom flag; auto-scroll pauses on manual scroll-up, resumes at bottom S
History backfill GET /api/log fetched on WS connect; ring entries prepended to panel S
Total S

Planned scope

  • WS push is already in place per-line via g_logWsPushFn (format {"t":"log","m":"..."}). Sprint scope is completing the frontend panel.
  • Frontend enhancements: WARN/ERROR line colouring (keyword match); auto-scroll pauses on manual scroll-up; backfill history from GET /api/log on WS connect.
  • LOG_MAX_LINES = 100 JS constant; clear button resets scroll state.

Definition of Done

  • [x] src/frontend/app.js: _logClass(text) colours lines containing warn (amber) or error/fail (red)
  • [x] src/frontend/app.js: logAtBottom flag; auto-scroll only when panel is scrolled to bottom; pauses on manual scroll-up
  • [x] src/frontend/app.js: GET /api/log fetched on wsConn.onopen; ring entries backfilled into panel
  • [x] src/frontend/app.js: clear button resets logAtBottom = true
  • [x] src/frontend/style.css: .log-warn (amber) and .log-error (red) classes; light-mode overrides
  • [x] PC live tests pass (no regressions)

Result

Metric Value
Log panel Collapsible below module list; LOG_MAX_LINES = 100
WS push Per-line {"t":"log","m":"..."} format (pre-existing); kept as-is
Colouring warn lines amber; error/fail lines red; light-mode overrides
Auto-scroll Pauses on manual scroll-up; resumes when scrolled back to bottom
History backfill GET /api/log fetched on WS connect; all ring entries added to panel
PC live tests 13/13 PASS

Backlogged from this sprint:

  • Timestamp and log level as separate columns (structured rows) deferred; raw message text is sufficient for current debugging needs.
  • Batched WS push ({"type":"log","entries":[...]}) deferred; per-line push at current log rates does not cause measurable overhead.

Retrospective

What went well:

  • The per-line WS push (g_logWsPushFn) and frontend handler were already in place. Sprint 7 completed the UX: colouring, scroll-pause, and history backfill.
  • History backfill on WS connect (4 lines) means a browser that opens 10 s after boot sees the startup log immediately, which is the most common debugging scenario.
  • Keyword-based colouring (no prefix parsing) works with the actual log message format, which does not use systematic level prefixes.

What was tricky:

  • Scroll-pause needs a passive: true scroll listener and an explicit logAtBottom flag tracking scrollTop + clientHeight >= scrollHeight - 5. The 5 px tolerance avoids false "not at bottom" on fractional scroll positions.

Seeds for future sprints:

  • If log volume grows, consider adding a "level": "warn"|"error" field to the WS frame so colouring is exact rather than keyword-matched.


Sprint 8: Heap Safety, HTTP OOM Hardening, Live-Test Correctness, and isPermanent() Removal

Scope: Four interrelated changes that close the remaining stability gaps on classic ESP32 and clean up dead runtime scaffolding: a per-module opt-in heap check before committing control changes, OOM recovery in the HTTP layer, a set of live-test correctness fixes that had been producing spurious duplicate modules, and full removal of the isPermanent() mechanism that turned out to be dead code.

Identified in: MM-C1BC crash (70x28 GridLayout saved to LittleFS; on browser refresh serializeJson to std::string threw std::bad_alloc → abort); live-test review (PreviewModule top-level, duplicate SystemStatusModule/FirmwareUpdateModule from scenario scripts, Windows MD file written on macOS).

Summary

Part Description Est
A: controlAllocBytes heap guard Per-module controlAllocBytes() opt-in; setControl checks heap before committing large allocs M
B: HTTP OOM hardening Heap check before JSON serialisation in AppRoutes; 503 on OOM; DynamicJsonDocument size guard M
C: Live-test correctness INFRA_TYPES/SINGLETON_TYPES, summarise.py ESP32 guard, unittest.py cwd fix, state pollution fix M
D: FirmwareUpdateModule docs User-guide page and module doc page S
E: isPermanent() removal Delete isPermanent() from all modules and base class; remove 403 route guard S
Total L

Part A: Per-module controlAllocBytes heap guard

Problem: A user could resize GridLayout to 70x28 (1960 pixels). On a fresh boot the allocation succeeded. After WiFi and the HTTP server started, free heap was ~60 KB with a largest block of ~36 KB. On browser refresh GET /api/modules called serializeJson(doc, std::string body) — the growing-string operator new chain threw std::bad_alloc and the device aborted.

Root cause in two parts: 1. setControl("width", 70) had no heap check; the value was saved to LittleFS and survived reboots. 2. serializeJson used a growing std::string that fragmented the already-tight heap.

Fix in StatefulModule.h:

  • readThrough(ControlDescriptor&): static helper that reads a control's current value as float (mirrors the existing writeThrough).
  • virtual size_t controlAllocBytes(const char* key) const: returns 0 by default (opt-in; modules with no significant heap impact need not override it).
  • Both setControl() overloads: save old value with readThrough, write new value, call controlAllocBytes(key), call pal::check_alloc(need) if need > 0, revert with writeThrough(old) and return false if the check fails. No onUpdate() is called on failure; the control stays at its previous value.
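The write, check, revert sequence can be sketched in isolation. This is an illustrative model only: ControlSketch, GrowingSketch, and the injected AllocCheckSketch callback are hypothetical names, a single float stands in for the control's backing field, and the callback stands in for pal::check_alloc so the sketch runs on a PC.

```cpp
#include <cstddef>
#include <functional>
#include <utility>

// Injectable stand-in for pal::check_alloc (assumption for this sketch).
using AllocCheckSketch = std::function<bool(std::size_t)>;

class ControlSketch {
public:
    explicit ControlSketch(AllocCheckSketch check) : check_(std::move(check)) {}
    virtual ~ControlSketch() = default;

    // Write the new value, ask how much heap it needs, revert if the check refuses.
    bool setControl(const char* key, float value) {
        const float old = value_;                         // readThrough
        value_ = value;                                   // writeThrough(new)
        const std::size_t need = controlAllocBytes(key);  // sees the new value
        if (need > 0 && !check_(need)) {
            value_ = old;                                 // writeThrough(old)
            return false;                                 // onUpdate() is not called
        }
        onUpdate(key);
        return true;
    }

    float value() const { return value_; }
    int updates() const { return updates_; }

protected:
    // Opt-in hook: the default reports no significant heap impact.
    virtual std::size_t controlAllocBytes(const char* /*key*/) const { return 0; }
    virtual void onUpdate(const char* /*key*/) { ++updates_; }
    float value_ = 10.0f;
    int updates_ = 0;

private:
    AllocCheckSketch check_;
};

// Example opt-in module: pretends each unit above 10 costs 1 KB of heap.
class GrowingSketch : public ControlSketch {
public:
    using ControlSketch::ControlSketch;
protected:
    std::size_t controlAllocBytes(const char* /*key*/) const override {
        return value_ > 10.0f ? static_cast<std::size_t>(value_ - 10.0f) * 1024 : 0;
    }
};
```

Because the hook is called after the new value is written, the override can compute its requirement from the module's own fields, which is how the GridLayout override below is able to read width_, height_, and depth_ directly.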

Fix in GridLayout.h:

  • safeWidth_, safeHeight_, safeDepth_ (uint32_t, default 10/10/1): the last dimensions successfully committed to DriverLayer.
  • controlAllocBytes() override: computes (newNPix - safeNPix) * sizeof(RGB) * 2 (EffectsLayer double-buffer delta); returns 0 for shrinks.
  • onUpdate(): updates safe fields and rebuilds only after the heap check passes (the framework reverts the control automatically on failure).
  • buildMappings_(): changed to new (std::nothrow) with a null-check log and early return on OOM.
  • setup()/teardown(): initialize/reset safe fields.
size_t controlAllocBytes(const char* /*key*/) const override {
    const uint32_t newNPix = (uint32_t)width_ * height_ * depth_;
    const uint32_t safeNPix = (uint32_t)safeWidth_ * safeHeight_ * safeDepth_;
    return newNPix > safeNPix ? (newNPix - safeNPix) * sizeof(RGB) * 2 : 0;
}
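As a worked example of the formula, the sketch below evaluates the 70x28 resize from the crash report against the default 10x10x1 safe dimensions. It assumes RGB is a packed 3-byte struct, which this document does not confirm; RgbSketch and growBytesSketch are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>

// Assumption: a pixel is 3 bytes. The real RGB definition is not shown here.
struct RgbSketch { uint8_t r, g, b; };

// Same arithmetic as controlAllocBytes(), with dimensions passed explicitly.
constexpr std::size_t growBytesSketch(uint32_t w, uint32_t h, uint32_t d,
                                      uint32_t safeW, uint32_t safeH, uint32_t safeD) {
    return (w * h * d > safeW * safeH * safeD)
        ? static_cast<std::size_t>(w * h * d - safeW * safeH * safeD) * sizeof(RgbSketch) * 2
        : 0;
}

static_assert(sizeof(RgbSketch) == 3, "sketch assumes a 3-byte pixel");
// 70*28 = 1960 pixels, minus 100 safe pixels = 1860 extra; 1860 * 3 bytes * 2 buffers.
static_assert(growBytesSketch(70, 28, 1, 10, 10, 1) == 11160, "grow case");
// Shrinking from 10x10 to 8x8 needs no extra heap.
static_assert(growBytesSketch(8, 8, 1, 10, 10, 1) == 0, "shrink case");
```

Under that pixel-size assumption the guard would ask check_alloc for roughly 11 KB before committing the 70x28 resize.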

New tests (tests/test_layouts.cpp, +3):

  • GridLayout - growing dimensions updates mappingCount (width 10→16, height 10→16)
  • GridLayout - shrinking dimensions updates mappingCount (32x32 → 8x8)
  • GridLayout - healthReport reflects current dimensions after resize


Part B: HTTP OOM hardening

Problem: serializeJson(doc, std::string body) uses std::string::push_back internally. On a heap-fragmented ESP32 this triggers a growing series of reallocations (16 → 32 → 64 → … → N bytes), each of which can throw std::bad_alloc. The final HttpResponse{body} string construction was also a heap allocation.

Fix in AppRoutes.cpp: All GET routes that serialized a JsonDocument to a std::string now use:

std::string body;
body.reserve(measureJson(doc) + 1);   // one allocation, exact size, no growth
serializeJson(doc, body);
return HttpResponse{200, "application/json", std::move(body)};

measureJson(doc) traverses the document without allocating and returns the exact byte count. reserve() does a single heap allocation of that size. serializeJson then fills the string without reallocating. std::move steals the buffer into HttpResponse without copying. Net result: one heap allocation per response instead of O(log N) reallocations.
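The effect of reserving up front can be observed with std::string alone. The sketch below is not the AppRoutes code: countCapacityChangesSketch is a hypothetical helper that appends n characters one at a time, as a serializer writing into a string would, and counts how often the capacity changes.

```cpp
#include <cstddef>
#include <string>

// Appends n characters and counts capacity changes (each one is a reallocation).
std::size_t countCapacityChangesSketch(std::size_t n, bool reserveFirst) {
    std::string s;
    if (reserveFirst) s.reserve(n + 1);  // one allocation, sized for the whole payload
    std::size_t changes = 0;
    std::size_t cap = s.capacity();
    for (std::size_t i = 0; i < n; ++i) {
        s.push_back('x');
        if (s.capacity() != cap) {       // the string grew, so it reallocated
            ++changes;
            cap = s.capacity();
        }
    }
    return changes;
}
```

With reserveFirst set, the count is zero for any n. Without it, the count depends on the standard library's growth policy but is always greater than zero once n exceeds the small-string buffer.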

Fix in HttpServer.h (ESP32 section): All handler dispatch points — onGet, onPost (request-complete callback), onDelete, onPatch (request-complete callback) — now wrap the handler call in:

try {
    auto resp = handler(...);
    req->send(resp.status, resp.contentType.c_str(), resp.body.c_str());
} catch (const std::bad_alloc&) {
    req->send(503, "application/json", R"({"error":"low heap"})");
}

The 503 response uses a string literal (no heap). This catches any remaining allocation failure (e.g. JsonDocument internal pool) and returns a clean HTTP error instead of calling abort().


Part C: Live-test correctness

Problem 1: PreviewModule created as a top-level module. test0_infra added preview1 with no parent_id. PreviewModule belongs as a child of DriverLayer (per spec). Because PreviewModule was in INFRA_TYPES, it survived delete_all_modules() and accumulated as an orphan across tests. Scenario scripts tried to add preview1 with parent_id="driver1" but add_or_exists silently accepted the already-running top-level instance.

Problem 2: Duplicate singleton modules. Scenario files include steps for NetworkModule, SystemStatusModule etc. When the running instance had a different id than the scenario's id (e.g. device has systemstatus1, scenario adds sysinfo1), add_or_exists would create a second instance. Same for FirmwareUpdateModule: ensureInfraModules() recreates it on every boot, so it always survives delete_all_modules(), yet the scenario runner could add a second one under a new id.

Problem 3: live-results-pc-windows.md written on macOS. deploy/live/live-results-pc-windows.json is committed from Windows CI. On macOS, summarise.py read it and wrote docs/status/live-results-pc-windows.md. The per-env MD should only be written by the machine that actually ran those tests.

Fixes in deploy/live_suite.py:

  • PreviewModule removed from INFRA_TYPES: it is not infrastructure — it requires a parent driver and must be re-created per-test.
  • New SINGLETON_TYPES = INFRA_TYPES | {"FirmwareUpdateModule"}: types where only one instance should ever exist.
  • _scenario_step: before add_or_exists, checks type_ in SINGLETON_TYPES and type_ in client.types_present(). If true, logs a skip and returns success — prevents a second instance being created when the live state has one under a different id.
  • test0_infra: removed the top-level preview1 add (no driver exists at that point).
  • test1_ripples_pipeline: adds preview1 as child of driver1 right after driver1 is created.
  • test5_artnet_loopback: adds preview_tx as child of tx_drv and preview_rx as child of rx_drv.
  • test7_multi_layout: adds preview7 as child of driver7.

Fix in deploy/summarise.py:

  • CURRENT_PC_PLATFORM derived from platform.system() at module load ("darwin" → "macos", etc.).
  • _write_live_results_md: skips writing live-results-pc-{other}.md when env != f"pc-{CURRENT_PC_PLATFORM}". The foreign-platform JSON data is still loaded for the index.md summary table; only the per-env MD file is gated.

Part D: FirmwareUpdateModule user documentation

Added docs/modules/system/firmware-update-module.md covering: what the module does (surfaces OTA progress; upload handled by AppRoutes), the two controls (update_status, update_pct), three ways to flash (browser file picker, URL API call, uv run deploy/flash.py), and platform notes (URL OTA returns 501 on PC). Added to mkdocs.yml nav and to the category table and reference list in docs/user-guide/modules/index.md.


Part E: Remove isPermanent()

Problem: isPermanent() was a virtual method on StatefulModuleBase intended to prevent certain modules from being deleted at runtime. ModuleManager was the only class that returned true. However, ModuleManager is never placed in owned_[] — it manages the list but is not part of it. This meant the isPermanent() check in removeModule() was never triggered. The mechanism was dead code, and its presence was actively misleading: it suggested FirmwareUpdateModule should be permanent (it had the override until Sprint 8 Part D), when the correct safety net is the boot guard in ensureInfraModules().

Root cause: ModuleManager can be targeted via its kId for control updates (via the special-case path in setControl), but it is never added to owned_[] via addModule. removeModule() iterates owned_[], so it can never find and delete ModuleManager. The 403 Permanent response in AppRoutes.cpp was therefore unreachable.

Fix:

  • src/core/StatefulModule.h: removed virtual bool isPermanent() const declaration.
  • src/core/ModuleManager.h: removed bool isPermanent() const override { return true; }; removed RemoveResult::Permanent from the enum; updated removeModule() comment.
  • src/core/ModuleManager.cpp: removed isPermanent() check in removeModule(); removed isPermanent() check in replaceModule(); removed obj["permanent"] from getModulesJson().
  • src/core/AppRoutes.cpp: removed case RemoveResult::Permanent (403) from the DELETE handler; updated replace error message to remove mention of "permanent".
  • src/frontend/app.js: replace button and delete button now always rendered — mod.permanent was undefined for all modules anyway (field no longer emitted by the server).
  • tests/test_module_manager.cpp: removed ModuleManager - isPermanent returns true test.
  • tests/test_system_info.cpp: removed FirmwareUpdateModule is permanent test.

Definition of Done

  • [x] src/core/StatefulModule.h: readThrough(), controlAllocBytes() virtual hook, heap check in both setControl() overloads with auto-revert on failure
  • [x] src/modules/layouts/GridLayout.h: safeWidth_/Height_/Depth_ safe dimension tracking; controlAllocBytes() override; buildMappings_() uses new (std::nothrow)
  • [x] tests/test_layouts.cpp: 3 new GridLayout resize tests; 378/378 pass (before the 2 stale isPermanent tests were removed in Part E)
  • [x] src/core/AppRoutes.cpp: all GET JSON routes use measureJson + reserve + std::move; no growing-string allocation
  • [x] src/core/HttpServer.h (ESP32): try/catch(std::bad_alloc) in all four handler dispatch points; returns HTTP 503 on OOM
  • [x] deploy/live_suite.py: PreviewModule removed from INFRA_TYPES; SINGLETON_TYPES guard in _scenario_step (includes FirmwareUpdateModule); preview1/7/tx/rx wired as children of their driver
  • [x] deploy/summarise.py: CURRENT_PC_PLATFORM guard; live-results-pc-windows.md not written on macOS
  • [x] docs/modules/system/firmware-update-module.md created; added to mkdocs.yml nav and module index
  • [x] MM-C1BC: 70x28 GridLayout removed via DELETE /api/modules/tree1; device stable
  • [x] isPermanent() virtual method removed from StatefulModuleBase; RemoveResult::Permanent enum value removed; all call sites in ModuleManager.cpp, AppRoutes.cpp, and app.js cleaned up; two now-stale tests removed
  • [x] 376/376 unit tests pass; mkdocs serve produces no warnings for the new doc page

Result

Metric Value
Unit tests 376/376 pass (3 new GridLayout resize tests added, 2 stale isPermanent tests removed)
New virtual hook controlAllocBytes() in StatefulModuleBase; default returns 0 (opt-in)
GridLayout Rejects oversized resize when heap check fails; always allows shrink
HTTP OOM try/catch(std::bad_alloc) in all ESP32 handler dispatchers; returns 503
Serialization measureJson + reserve = 1 allocation per response (was O(log N) growing chain)
Live test fix PreviewModule always child of its DriverLayer; no more top-level orphans
Singleton guard SINGLETON_TYPES prevents second instance of NetworkModule, SystemStatusModule, FirmwareUpdateModule etc.
summarise.py live-results-pc-windows.md not written on macOS
FirmwareUpdateModule docs New user-facing doc page; wired into mkdocs nav
isPermanent() Removed entirely: virtual method, enum value, all call sites, frontend gate, 2 tests
Device recovery MM-C1BC: bad 70x28 GridLayout deleted via REST; device stable

Retrospective

What went well:

  • controlAllocBytes as a virtual hook with a zero default keeps the mechanism entirely opt-in: modules with no significant heap impact add no code and pay no overhead.
  • measureJson + reserve eliminates the growing-string problem with no memory overhead and no static buffer: the heap allocation is still there, but it is now exactly one call of exactly the right size.
  • try/catch(std::bad_alloc) in HttpServer.h is the correct safety net: even if measureJson+reserve is not used on some future route, the device will return 503 rather than crash.
  • SINGLETON_TYPES in the scenario runner is a clean, low-ceremony fix: one set, one guard, solves both the SystemStatusModule and FirmwareUpdateModule duplication problems without touching the scenario JSON files.
  • Removing PreviewModule from INFRA_TYPES and wiring it as a child of its driver per-test is architecturally correct and required no changes to the scenario JSON files (they already had parent_id: "driver1").

What was tricky:

  • Initial fix for AppRoutes used a static char kJsonSerBuf[12288] BSS buffer. Rejected: it added 12 KB of static RAM with no saving elsewhere. Replaced by the measureJson+reserve pattern which has the same single-allocation property with zero BSS cost.
  • The CURRENT_PC_PLATFORM guard in summarise.py is for the per-env MD file only; the JSON data from other platforms is still loaded and appears in index.md. Care was needed not to break the cross-platform summary table.

What was tricky (Part E):

  • isPermanent() looked load-bearing because it appeared in removeModule(), replaceModule(), the JSON output, and the frontend. Tracing the actual call graph revealed it was never reached: ModuleManager is not in owned_[], so the check at owned_[i].module->isPermanent() was never true for any module.
  • The boot guard in ensureInfraModules() / ensureNetworkModules() is the correct protection for infra modules: it recreates missing modules on the next reboot rather than refusing DELETE at the API level. This is more resilient and less surprising to users.

Seeds for future sprints:

  • Other modules that allocate in onUpdate() (e.g. EffectsLayer on buffer resize) should also implement controlAllocBytes().
  • The try/catch in HttpServer.h only catches std::bad_alloc. A broader catch (const std::exception&) would catch any handler exception, which could be useful as the handler set grows.
  • The SINGLETON_TYPES guard prevents a second instance but does not fix the id mismatch (the running module may have a different id than the scenario expects). A future improvement would be a get_or_create_by_type helper that returns the existing instance id if one is found.


Sprint 9: select Control Type

Scope: Add CtrlType::Select to the control system: a uint8_t-backed dropdown registered via addControl(..., "select") followed by addControlValue("label") calls, or via a single addControl() call that takes a pre-declared static options array. Option strings are C string literals and live in flash (.rodata), not DRAM; only the pointer array costs heap, and with the static-array form even that is zero. The backing field stores a uint8_t index (1 byte), not the selected string. The schema emits an "options" array; the frontend renders a native <select> element. No changes to existing control types.

Summary

Part Description Est
A: Memory strategy ControlEntry union (inline 8-char + heap overflow), addControlValue(), teardown() cleanup M
B: addControl overloads 3-arg select overload; remove defaults from uint8_t generic to avoid ambiguity S
C: Schema + persistence getSchema() emits "options" array; saveState()/loadState() by index; setControl() by label M
D: Frontend dropdown app.js renders select as <select>; sends index on change M
Total L

Design decision: store index (uint8_t) or selected string?

This is the first design choice that must be made before any code is written.

Option A: store as uint8_t index

The backing field holds the zero-based index of the selected option. Saved state JSON looks like "waveform": 2.

Pros:

  • 1 byte in RAM and in saved JSON, which is critical on heap-constrained ESP32.
  • Consistent with existing Uint8 control type; setControl(), saveState(), getControlValues() all reuse the same numeric path with minimal changes.
  • Fast in loop(): module code reads a uint8_t directly and uses it in a switch or array index, with no string comparison.
  • controlAllocBytes() returns 0 naturally (index is always 1 byte regardless of option count).

Cons:

  • Saved state is not self-documenting: "waveform": 2 requires knowing the options list to interpret.
  • If option order changes between firmware builds, a saved index silently maps to the wrong option (breaking change). Option order must be treated as part of the API, the same as JSON key names.
  • REST and WebSocket clients must look up the schema to translate an index to a label.

Option B: store as char[] string value

The backing field holds the selected label as a C string. Saved state JSON looks like "waveform": "triangle".

Pros:

  • Self-documenting in saved state and in logs.
  • Robust to option reordering: the saved string still matches the right option after a firmware update that reorders the list (though renaming an option still breaks it).
  • REST clients can post {"waveform": "triangle"} without knowing indices.

Cons:

  • char[N] backing field: typically 16-32 bytes vs 1 byte for a uint8_t. On a module with four select controls that adds 60-124 bytes of RAM overhead.
  • loop() must do strcmp or a linear search to map the string back to an integer branch, which is meaningfully slower on the hot path for effects modules.
  • setControl() needs a new string-matching path: iterate options list to find the index, then write the string into the backing buffer. More code; more failure modes.
  • Can still silently break if an option is renamed (different failure mode from index reordering, but equally possible).

Verdict: Option A (uint8_t index)

RAM and hot-path performance win on ESP32. The option-stability risk is the same class of breaking change as renaming a JSON key — already documented as requiring a version bump. The schema always includes the "options" array, so clients are never left guessing.
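To make the hot-path argument concrete, here is a minimal sketch of how module code might consume the index. It is not the SineEffectModule implementation: waveSketch, the WaveformSketch enumerators, and the sine/triangle/square/sawtooth ordering are assumptions for illustration, and the shapes are simple normalized waveforms over a phase in [0, 1).

```cpp
#include <cmath>
#include <cstdint>

// Assumed option order; with this order a saved "waveform": 2 means square.
enum WaveformSketch : uint8_t { kSine = 0, kTriangle = 1, kSquare = 2, kSawtooth = 3 };

// Maps a phase in [0, 1) to a level in [0, 1] for the selected waveform.
// The index is used directly in a switch; no string comparison is involved.
float waveSketch(uint8_t waveform, float phase) {
    switch (waveform) {
        case kSine:     return 0.5f + 0.5f * std::sin(phase * 6.2831853f);
        case kTriangle: return phase < 0.5f ? phase * 2.0f : 2.0f - phase * 2.0f;
        case kSquare:   return phase < 0.5f ? 1.0f : 0.0f;
        case kSawtooth: return phase;
        default:        return 0.0f;  // out-of-range index: render nothing
    }
}
```

The same switch also shows the reordering risk noted under the cons: if a later build swapped the triangle and square entries in the options list, a saved index of 2 would select a different shape without any error.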


Part A: memory strategy for option strings

This is the second design decision, and the most important for ESP32.

Where do the strings live?

"sine", "triangle", "square" are C string literals. On ESP32 (and on all targets) they are stored in .rodata — flash memory, not DRAM. Accessing them requires a flash read (cached), but they cost zero bytes of DRAM. This is true whether they appear as addControlValue("sine") inline arguments or as elements of a static constexpr const char*[] array.

The only DRAM cost is the pointer array that ControlDescriptor::options points to: 4 bytes per option on ESP32 (32-bit pointer). For a 4-option select that is 16 bytes of DRAM.

Two registration styles and their costs

Style 1: addControlValue() — ergonomic, one small heap allocation

// in setup():
addControl(waveform_, "waveform", "select");
addControlValue("sine");
addControlValue("triangle");
addControlValue("square");
addControlValue("sawtooth");

The pointer array is heap-allocated during setup(). To avoid realloc churn, the first addControlValue() call after addControl(..., "select") pre-allocates a fixed-size slot array (e.g. 8 pointers = 32 bytes). Each addControlValue() fills the next slot; no reallocation until the pre-allocated capacity is exceeded. clearControls() frees the array on teardown().

DRAM cost: 32 bytes pre-allocated pointer slots (fixed per select control, regardless of actual option count up to 8).

Style 2: static array — zero heap, all in flash

// file scope or class body (own code):
static constexpr const char* kWaveforms[] = {"sine", "triangle", "square", "sawtooth"};

// in setup():
addControl(waveform_, "waveform", "select", kWaveforms, 4);

kWaveforms is a constexpr pointer array — it lives in .rodata (flash) alongside the string literals. The descriptor stores the pointer to kWaveforms directly. No heap allocation at any point. clearControls() does not free it.

Library-defined arrays work identically, provided they are declared inline in the header:

// in the library header (C++17 inline variable — one definition across all TUs):
inline const char* const EMITTER_NAMES[EMITTER_COUNT] = {
    "orbitaldots", "swarmingdots", "audiodots", "lissajous",
    "borderrect",  "noisekaleido", "cube",       "fluidjet"
};

// in setup():
addControl(emitter_, "emitter", "select", EMITTER_NAMES, EMITTER_COUNT);

The inline keyword (C++17) guarantees the linker keeps exactly one copy of the pointer array in the final binary even when the header is included from multiple translation units. Without inline the linker might emit a copy per .o file, wasting flash. Either way no DRAM is used — inline just prevents flash duplication.

DRAM cost: 0 bytes. The array and all string data are in flash.

Distinguishing owned vs. borrowed options

clearControls() must know whether to free(d.options). Add a single flag to ControlDescriptor:

bool ownsOptions;  // true: heap-allocated by addControlValue(); false: static array

Set to true by addControlValue(), false by the static-array addControl() overload.

Use the static array form (Style 2) for any module that ships with a fixed option list — which is the common case (effects, drivers, layouts all have known-at-compile-time options). The addControlValue() form is available as convenience for prototyping or for option lists that are built dynamically from discovered resources (e.g. a list of available effects).

ControlDescriptor changes

struct ControlDescriptor {
  const char* key;
  const char* uiType;
  CtrlType type;
  uintptr_t ptr;
  float minVal;
  float maxVal;
  float defVal;
  const char** options;   // pointer to option labels; null for non-Select
  uint8_t optionCount;    // number of valid entries in options
  bool ownsOptions;       // true: heap-allocated; free on clearControls()
};

Per-descriptor overhead for non-Select controls: 4 + 1 + 1 = 6 bytes (pointer + count + owns flag, with padding likely making it 8 bytes). For a module with 8 controls of which 1 is a 4-option static-array Select: 8 × 8 = 64 bytes extra DRAM — modest.
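The padding estimate can be checked with a layout sketch. To reproduce a 32-bit target on any host, every pointer field is replaced by a uint32_t stand-in and CtrlType by a uint8_t; the struct names are hypothetical, and the result assumes a typical ABI where uint32_t and float are 4-byte aligned.

```cpp
#include <cstdint>

// Descriptor before the Select fields were added (pointers modelled as uint32_t).
struct DescriptorBeforeSketch {
    uint32_t key;
    uint32_t uiType;
    uint8_t  type;     // CtrlType is uint8_t-backed
    uint32_t ptr;
    float    minVal;
    float    maxVal;
    float    defVal;
};

// Same layout plus options, optionCount, ownsOptions.
struct DescriptorAfterSketch {
    uint32_t key;
    uint32_t uiType;
    uint8_t  type;
    uint32_t ptr;
    float    minVal;
    float    maxVal;
    float    defVal;
    uint32_t options;      // 4 bytes
    uint8_t  optionCount;  // 1 byte
    bool     ownsOptions;  // 1 byte, followed by 2 bytes of tail padding
};

constexpr unsigned kExtraPerDescriptorSketch =
    sizeof(DescriptorAfterSketch) - sizeof(DescriptorBeforeSketch);

// 6 bytes of fields round up to 8 because the struct is 4-byte aligned.
static_assert(kExtraPerDescriptorSketch == 8, "expected 8 bytes per descriptor");
```

Multiplying by 8 descriptors gives the 64 bytes quoted above.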

Complete memory picture (ESP32dev, 4-option static-array select)

Item Location DRAM cost
Option strings ("sine" etc.) flash (.rodata) 0 bytes
kWaveforms[] pointer array flash (.rodata) 0 bytes
ControlDescriptor::options pointer heap (controls_ array) 4 bytes
ControlDescriptor::optionCount heap (controls_ array) 1 byte
ControlDescriptor::ownsOptions heap (controls_ array) 1 byte
Backing uint8_t waveform_ field module instance (heap) 1 byte
Total extra DRAM per select ~7 bytes

With addControlValue() style instead: add 32 bytes for the pre-allocated pointer slots on heap.


Part B: addControl overloads and addControlValue

Add CtrlType::Select to the enum in StatefulModule.h:

enum class CtrlType : uint8_t { Float, Uint8, Uint32, Bool, String, EditStr, FloatConst, Select };

Static-array overload (preferred — zero heap):

// Register a select control backed by a uint8_t index. options must outlive the module
// (use static constexpr). min/max are derived from optionCount.
void addControl(uint8_t& variable, const char* key,
                const char* const* options, uint8_t optionCount);

Sets CtrlType::Select, stores the pointer directly, ownsOptions = false, maxVal = optionCount - 1.

Dynamic addControlValue() overload (ergonomic, small heap allocation):

// Register a select control; call addControlValue() immediately after for each label.
// uiType must be "select" — required for API consistency with all other addControl overloads.
void addControl(uint8_t& variable, const char* key, const char* uiType);

// Append a label to the most recently registered Select control.
// Pre-allocates 8 pointer slots on first call; no realloc within that capacity.
void addControlValue(const char* label);

addControlValue() finds the last CtrlType::Select descriptor, allocates an 8-slot pointer array on the first call (ownsOptions = true), fills the next slot, increments optionCount, and updates maxVal.

clearControls(): iterate descriptors; for any with ownsOptions == true, call free(d.options).

maxVal on the descriptor equals optionCount - 1 in both cases, so the existing range clamp in setControl() keeps indices within [0, optionCount - 1] without changes.
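The two ownership modes can be sketched on a stripped-down descriptor. This is an illustration under stated assumptions rather than the StatefulModule.h code: SelectDescSketch carries only the Select-related fields, the helper names are hypothetical, and the sketch refuses a ninth label instead of growing the array.

```cpp
#include <cstdint>
#include <cstdlib>

struct SelectDescSketch {
    const char** options = nullptr;  // label pointers; the labels themselves are literals
    uint8_t optionCount = 0;
    bool ownsOptions = false;        // true only for the heap-allocated slot array
    float maxVal = 0.0f;
};

constexpr uint8_t kSlotCapacitySketch = 8;

// Static-array form: borrow the caller's array; nothing is allocated.
void setStaticOptionsSketch(SelectDescSketch& d, const char* const* options, uint8_t count) {
    d.options = const_cast<const char**>(options);  // borrowed and never written through
    d.optionCount = count;
    d.ownsOptions = false;
    d.maxVal = static_cast<float>(count - 1);
}

// Dynamic form: allocate the 8-slot array on the first call, then fill the next slot.
bool addControlValueSketch(SelectDescSketch& d, const char* label) {
    if (d.options == nullptr) {
        d.options = static_cast<const char**>(
            std::malloc(kSlotCapacitySketch * sizeof(const char*)));
        if (d.options == nullptr) return false;     // out of memory
        d.ownsOptions = true;
    }
    // Never write into a borrowed array; this sketch also does not grow past 8 slots.
    if (!d.ownsOptions || d.optionCount >= kSlotCapacitySketch) return false;
    d.options[d.optionCount++] = label;
    d.maxVal = static_cast<float>(d.optionCount - 1);
    return true;
}

// Shared cleanup: free only what this descriptor allocated.
void freeOwnedOptionsSketch(SelectDescSketch& d) {
    if (d.ownsOptions) std::free(d.options);
    d.options = nullptr;
    d.optionCount = 0;
    d.ownsOptions = false;
}
```

Keeping the free decision in one helper means teardown and hot-reload take the same path, and a borrowed static array can never reach free().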


Part C: schema emission, value reads, and persistence

getSchema() — add Select case:

case CtrlType::Select:
  c["value"] = *reinterpret_cast<const uint8_t*>(d.ptr);
  c["default"] = (uint8_t)d.defVal;
  {
    JsonArray opts = c["options"].to<JsonArray>();
    for (uint8_t j = 0; j < d.optionCount; ++j) opts.add(d.options[j]);
  }
  break;

The "type" field in the schema JSON is already d.uiType ("select"), so no other changes are needed for the frontend to identify the control.

getControlValues() — add Select case: identical to Uint8 (emit the index as an integer).

setControl() — add Select case: identical to Uint8 (clamp to [0, optionCount-1], write through the uint8_t* pointer, call onUpdate()).

saveState() / loadState(): no changes needed — Select follows the Uint8 save/load path (save as integer, load as integer via applyPending_()).


Part D: frontend dropdown rendering (app.js)

getSchema() already emits "type": "select" for the control. The frontend renderControl() function currently renders sliders for numeric types and checkboxes for bools. Add a branch for "select":

if (ctrl.type === 'select' && Array.isArray(ctrl.options)) {
    const sel = document.createElement('select');
    ctrl.options.forEach((label, i) => {
        const opt = document.createElement('option');
        opt.value = i;
        opt.textContent = label;
        if (i === ctrl.value) opt.selected = true;
        sel.appendChild(opt);
    });
    sel.onchange = () => sendControlUpdate(modId, ctrl.key, parseInt(sel.value, 10));
    return sel;
}

WebSocket state updates that arrive mid-session must also update the <select> element's selectedIndex, following the same pattern as slider value updates.


Definition of Done

  • [x] CtrlType::Select added to enum in StatefulModule.h
  • [x] ControlDescriptor extended with options, optionCount, ownsOptions fields; non-Select defaults to nullptr / 0 / false
  • [x] addControl(uint8_t&, key, options, count) — static-array form; zero heap; ownsOptions = false
  • [x] addControl(uint8_t&, key, "select") + addControlValue(label) — dynamic form; "select" uiType required for API consistency; ownsOptions = true; generic uint8_t overload has explicit min/max (no defaults) to eliminate 3-arg overload ambiguity
  • [x] clearControls() and destructor free options via freeOwnedOptions_() only when ownsOptions == true
  • [x] getSchema(): Select case emits "value", "default", "options" array
  • [x] getControlValues(): Select emits index as integer (same as Uint8)
  • [x] setControl(): Select reuses Uint8 path via readThrough/writeThrough; value stored as uint8_t
  • [x] saveState() / loadState() / applyPending_(): Select handled same as Uint8
  • [x] SineEffectModule: waveform select (sine/square/triangle/sawtooth); wave_() helper applies chosen shape; static-array form used
  • [x] LinesEffectModule: axis select (all/x/y/z); loop conditionally draws each plane; static-array form used
  • [x] Frontend: <select class="select-input"> rendered for type == "select", addEventListener('change') posts index, live WebSocket updates reflected via select.value
  • [x] CSS: .select-input matches .text-input styling; light-theme override included
  • [x] Tests: 10 new cases in test_stateful_module.cpp — registration, schema, setControl, saveState round-trip, hot-reload leak safety, addControlValue dynamic form, SineEffect waveform, LinesEffect axis
  • [x] 386/386 tests pass (10 new, 1 existing test updated for new SineEffect waveform control)
  • [x] deploy/summarise.py: ESP32 MD guard added — skips writing live-results-esp32-*.md when no current (non-last-good) ESP32 JSON exists, preventing stale ESP32 sections from being rewritten on all_pc.py runs
  • [x] deploy/unittest.py: run_tee accepts optional cwd; test binary invoked with absolute path
  • [x] tests/test_module_manager.cpp: auto-pipeline test calls disableStatePersistence() before teardown to prevent writing state files to the working directory
  • [x] state/grid1.json: reset to 16x16x1 (segfault fix: stale 1013x1018x32 values from a previous live-test run with the server started from the wrong directory)

Result

Metric Value
Unit tests 386/386 pass (10 new, 1 updated)
New control type CtrlType::Select backed by uint8_t index; zero heap for static-array form (option strings stay in flash)
New ControlDescriptor fields options (4 B), optionCount (1 B), ownsOptions (1 B) per descriptor
addControl overloads Static-array form (zero heap) and dynamic addControlValue() form; both require explicit "select" uiType
Hot-reload safety freeOwnedOptions_() called from clearControls() and destructor
SineEffectModule New waveform select: sine / square / triangle / sawtooth
LinesEffectModule New axis select: all / x / y / z
Frontend <select> element rendered for type == "select"; live WS updates applied; CSS styled
Schema "options" array emitted by getSchema(); "value" and "default" as integer index
PC live tests All pass; test4 (device discovery) expected FAIL without ESP32 on network

See test results for full pass/fail breakdown.

Retrospective

What went well:

  • The CtrlType::Select case slots cleanly into every existing switch in StatefulModule.h because the backing type (uint8_t) is identical to Uint8. readThrough/writeThrough/applyPending_/saveState all just needed case CtrlType::Select: fall-through onto the existing Uint8 case.
  • Static-array form (addControl(var, key, kArr, N)) costs exactly 6 bytes of DRAM per descriptor and zero heap; kArr and all string literals live in flash. This is the right default for any module with a compile-time-fixed option list.
  • The ownsOptions flag on ControlDescriptor cleanly separates the two ownership modes. clearControls() and the destructor both call the same freeOwnedOptions_() helper, so hot-reload and final teardown are handled identically.
  • Library-provided arrays (e.g. inline const char* const EMITTER_NAMES[]) work directly with the static-array overload; no adaptation needed.
  • Making "select" an explicit uiType argument on the dynamic form (addControl(var, key, "select")) aligns it with every other addControl overload. The API is now fully consistent: the uiType string is always the third argument, regardless of control type.
  • The summarise.py ESP32 guard (skip rewriting live-results-esp32-*.md when no current ESP32 JSON exists) prevents all_pc.py runs from silently overwriting the last good ESP32 status with a stale timestamp.

What was tricky:

  • clearControls() previously just set controlCount_ = 0 without freeing anything. Adding freeOwnedOptions_() there required also calling it from the destructor; overlooking either site would cause a leak on hot-reload or module destruction respectively.
  • The memmove in runSetup() that promotes enabled_ to index 0 copies ControlDescriptor structs byte-for-byte, including options pointers. This is safe (the pointers remain valid), but it means two descriptor slots briefly point at the same options array. The old slot is immediately overwritten, so there is no double-free risk. Worth understanding before reading this code path.
  • addControlValue() uses realloc on each call. The sprint doc proposed 8-slot pre-allocation; the implementation went with simple realloc instead (simpler code, acceptable since it only runs during setup()). Backlogged if profiling ever shows setup-time fragmentation.
  • Adding "select" as an explicit uiType argument to addControl(uint8_t&, key, uiType) required removing the default min/max from the generic uint8_t overload to avoid a 3-argument ambiguity. No existing caller relied on those defaults (all passed explicit min/max), so the change was safe.
  • A stale state/grid1.json with dimensions 1013x1018x32 (written by a previous live-test session that started the server from the project root) caused a segfault on the next run. The pal::check_alloc guard correctly blocked the allocation, but the state/ file survived. Fixed by resetting to 16x16x1. The all_pc.py pipeline always starts the server from deploy/build/pc/{platform}/ so this state is isolated; running the binary from the project root directly can still contaminate the project-root state/ directory.
  • The auto-pipeline unit test (ModuleManager - auto-creates default pipeline when no modules exist) did not call disableStatePersistence() after its assertions, causing it to write state/grid1.json (with default 10x10x1 dimensions) to the project root on every test run. Fixed by calling disableStatePersistence() after the assertions, before teardown.
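The two ownership modes described in this retrospective can be sketched as follows. This is a minimal illustration, not the code in StatefulModule.h: the field names come from the Result table, while setStaticOptions, addOptionValue, and freeOwnedOptions are hypothetical free functions standing in for the addControl overloads, addControlValue(), and freeOwnedOptions_(). It also assumes option labels outlive the descriptor, so only the pointer array is ever owned.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

// Hypothetical, trimmed-down descriptor: only the three Select-related fields.
struct ControlDescriptor {
    const char* const* options = nullptr;  // borrowed flash array or owned heap array
    uint8_t optionCount = 0;
    bool ownsOptions = false;              // true only for the dynamic form
};

// Static-array form: borrow the caller's array, allocate nothing.
inline void setStaticOptions(ControlDescriptor& d, const char* const* arr, uint8_t n) {
    d.options = arr;
    d.optionCount = n;
    d.ownsOptions = false;
}

// Dynamic form: grow the owned pointer array by one slot per call (plain realloc).
// Labels are assumed to outlive the descriptor (e.g. string literals).
// Returns false and leaves d unchanged on failure.
inline bool addOptionValue(ControlDescriptor& d, const char* label) {
    assert(d.ownsOptions || d.options == nullptr);  // never realloc a borrowed array
    if (d.optionCount == UINT8_MAX) return false;
    void* grown = std::realloc(const_cast<const char**>(d.options),
                               (d.optionCount + 1u) * sizeof(const char*));
    if (!grown) return false;
    auto** slots = static_cast<const char**>(grown);
    slots[d.optionCount] = label;
    d.options = slots;
    ++d.optionCount;
    d.ownsOptions = true;
    return true;
}

// One helper for both teardown sites: frees only what the descriptor owns.
inline void freeOwnedOptions(ControlDescriptor& d) {
    if (d.ownsOptions) std::free(const_cast<const char**>(d.options));
    d.options = nullptr;
    d.optionCount = 0;
    d.ownsOptions = false;
}
```

Because freeOwnedOptions resets all three fields, calling it twice on the same descriptor is harmless, which is the property that makes it safe to share between the hot-reload path and the destructor.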

Complexity estimate: Medium. The core StatefulModule.h changes are straightforward switch-case additions. The non-trivial parts were: ownership lifecycle (ownsOptions, freeOwnedOptions_), verifying the memmove path is safe, and the two effect-module changes (the wave_() waveform helper in SineEffect and the axis conditional in LinesEffect).

Seeds for future sprints:

  • Proper range-clamping in setControl() for Select: clamp submitted value to [0, optionCount-1] rather than relying on uint8_t truncation. Frontend-submitted values are always valid; REST misuse is the only exposure.
  • addControlValue() 8-slot pre-allocation: measure whether realloc churn during setup() causes fragmentation on ESP32 dev before adding the optimization.
  • Other modules with natural discrete parameters: NoiseEffect2D blend mode, RipplesEffect shape, MirrorModifier axis. Each is a one-liner addControl + static array addition.
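The range-clamping seed can be illustrated with a small helper. This is a sketch under assumptions: clampSelectIndex is a hypothetical name, and the submitted value is taken as a long, which may differ from how the REST path actually parses it.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: map any submitted integer onto a valid option index
// in [0, optionCount - 1] instead of letting it wrap through uint8_t.
inline uint8_t clampSelectIndex(long submitted, uint8_t optionCount) {
    assert(optionCount > 0);  // a Select with no options has no valid index
    if (submitted < 0) return 0;
    if (submitted >= optionCount) return static_cast<uint8_t>(optionCount - 1);
    return static_cast<uint8_t>(submitted);
}
```

For a four-option Select, a submitted 260 clamps to index 3, whereas plain uint8_t truncation turns 260 into 4, which is one past the last option.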


Sprint 10: Boot Module Creation Redesign and Dynamic Network Management

Scope: Replace the ad-hoc ensureNetworkModules / ensureInfraModules / instantiateDefaultPipeline_ boot logic with a single coherent rule: on first boot (no non-network top-level modules), create the full default set; otherwise leave the pipeline alone. Add EthernetModule to the initial network group. Add dynamic network management to NetworkModule so the AP is automatically disabled when STA or Ethernet is connected, and re-enabled when connectivity is lost. Investigate and fix the root causes of apparent duplicate modules (type name bug in AppSetup.h, scenario script behavior, ambiguous 409 error response).

Identified from: Sprint 9 retrospective seeds; user request after Sprint 10 scope discussion.

Summary

Part Description Est
A: EthernetModule in boot Add eth1 (child of network1) to first-boot network group in ensureNetworkModules XS
B: ensureDefaultModules Replace ensureDefaultPipeline+ensureInfraModules; "no non-network top-level modules" rule; update PC instantiateDefaultPipeline_ M
C: Dynamic network management NetworkModule 10 s ticker + 60 s grace-period debounce; onUpdate("enabled") on WifiAp/WifiSta; setInput wiring L
D: ui.md boot section Document boot module creation, dynamic WiFi, delete-to-prevent-recreation S
E: Duplicate investigation + 409 AppSetup type name bug fixed via B; 409 reason field in AppRoutes S
Total L

Current boot logic (before this sprint)

On embedded (AppSetup.cpp):

1. mm.setup(): if no DriverLayer AND no EffectsLayer, create driver1 + grid1 + effects1 + ripples1 + preview1.
2. ensureNetworkModules(): if no NetworkModule, create network1 + sta1 + ap1.
3. ensureInfraModules(): calls ensureDefaultPipeline() (patches EffectsLayer / Preview onto an existing DriverLayer if absent), then unconditionally adds SystemStatus and FirmwareUpdateModule if not present.

On PC (main.cpp):

1. mm.setup(): same pixel-pipeline creation as embedded step 1. No SystemStatus, Firmware, or network modules created.

Problems with the current logic:

  • ensureDefaultPipeline patchwork adds EffectsLayer / PreviewModule even when the user deliberately built a custom pipeline without them.
  • SystemStatus and FirmwareUpdateModule are added unconditionally, even on partially customised setups.
  • EthernetModule is never auto-created.
  • No dynamic network management: AP runs permanently even when STA is connected.


Part A: Extend ensureNetworkModules — add EthernetModule

ensureNetworkModules() currently creates network1 + sta1 + ap1 on first boot (guarded by hasModuleType("NetworkModule")). Add eth1 to this initial creation:

mm.addModule("EthernetModule", "eth1", {}, {}, 1, "network1");  // child of network1

"If later deleted, don't recreate" guarantee: This is already provided by the hasModuleType("NetworkModule") guard — ensureNetworkModules is a no-op on any boot where NetworkModule already exists. EthernetModule is only created once, alongside the rest of the network group, and is never checked for independently.


Part B: Replace ensureDefaultPipeline + ensureInfraModules with ensureDefaultModules

Replace both functions with a single ensureDefaultModules(mm) that applies the new rule:

New rule: count top-level modules (parentId == "") whose type is not "NetworkModule". If the count is zero, create the full default set. Otherwise, do nothing.

top-level non-network modules == 0  →  create full default set
top-level non-network modules  > 0  →  do nothing
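The rule is small enough to express as a pure function. The sketch below assumes a hypothetical ModuleInfo record (id, type, parentId) rather than the real ModuleManager storage; countTopLevelNonNetwork mirrors the name mentioned in the retrospective, and shouldCreateDefaultSet is an illustrative wrapper.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical flat view of the module tree; the real ModuleManager layout may differ.
struct ModuleInfo {
    std::string id;
    std::string type;
    std::string parentId;  // empty string means top-level
};

// Count top-level modules whose type is not "NetworkModule".
inline std::size_t countTopLevelNonNetwork(const std::vector<ModuleInfo>& modules) {
    std::size_t n = 0;
    for (const auto& m : modules) {
        if (m.parentId.empty() && m.type != "NetworkModule") ++n;
    }
    return n;
}

// First-boot rule: create the full default set only when the count is zero.
inline bool shouldCreateDefaultSet(const std::vector<ModuleInfo>& modules) {
    return countTopLevelNonNetwork(modules) == 0;
}
```

Children of network1 (sta1, ap1, eth1) have a non-empty parentId, so they never affect the count; a single user-created top-level module of any other type is enough to suppress default creation.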

Full default set (created atomically):

| id | type | parent |
|---|---|---|
| sysinfo1 | SystemStatusModule | |
| firmware1 | FirmwareUpdateModule | |
| discovery1 | DeviceDiscoveryModule | |
| driver1 | DriverLayer | |
| grid1 | GridLayout | driver1 |
| effects1 | EffectsLayer | |
| ripples1 | RipplesEffectModule | effects1 |
| preview1 | PreviewModule | driver1 |

Behavior changes from current logic:

| Scenario | Before | After |
|---|---|---|
| Completely blank first boot | pixel pipeline only, then SystemStatus + Firmware added separately | full default set created atomically |
| Only DriverLayer exists | EffectsLayer + Preview patched in; SystemStatus + Firmware added | do nothing |
| Only NetworkModule + children | full default pipeline created | full default set created |
| Any non-network top-level module | partial patching applied | do nothing |

ModuleManager::instantiateDefaultPipeline_() (PC): The existing function runs on PC where AppSetup.h is not compiled. It should be updated to apply the same rule: check for any top-level module (on PC there is no NetworkModule so the check becomes "any top-level module exists"). If none exist, create the full default set including SystemStatus, FirmwareUpdateModule, and DeviceDiscoveryModule. The PC build does register all three types.

ModuleManager::setup() (both platforms): Remove the !hasDriver && !hasEffects auto-pipeline check. On embedded, ensureDefaultModules handles first-boot creation. On PC, instantiateDefaultPipeline_ (updated) handles it. The condition that triggers it changes from "no DriverLayer+EffectsLayer" to "no top-level modules at all".


Part C: Dynamic network management in NetworkModule

NetworkModule::loop() currently does nothing. Add a 10-second periodic check that manages the AP based on current connectivity:

Priority (highest wins):

1. Ethernet connected → disable both WiFi AP and WiFi STA.
2. STA connected → disable WiFi AP (keep STA running).
3. Neither connected, grace period expired → enable WiFi AP (recovery path for configuration access).

Grace period for STA loss: A brief disconnection (network hiccup, AP reboot) should not immediately re-enable the AP — toggling the AP is disruptive (clients connecting mid-hiccup, mDNS flapping). A configurable grace period lets STA recover before any AP change is made.

  • When STA was connected and then drops: record staLostMs_ (millis timestamp).
  • Each tick: if now - staLostMs_ >= STA_GRACE_MS and no other connectivity, enable AP.
  • If STA reconnects or Ethernet comes up before the grace period expires: clear staLostMs_, no AP change.
  • On first boot (STA never connected): no grace period; enable AP immediately.
  • STA_GRACE_MS is a private compile-time constant (60 000 ms). A future control could expose it; for Sprint 10 a compile-time default is sufficient.

Implementation sketch:

NetworkModule needs typed pointers to its children to call setControl("enabled", ...) on them. Wiring approach: NetworkModule implements setInput("sta", ...), setInput("ap", ...), setInput("eth", ...), receiving the module pointers when the wiring pass runs. ensureNetworkModules passes the child ids as inputs to network1 after creating all children (or a dedicated post-creation wiring step).

WifiApModule and WifiStaModule must override onUpdate("enabled") to actually start/stop their WiFi interface when the enabled control changes. Currently, enabled_ only gates loop execution; setting it to false does not call teardown() or stop WiFi. This change makes enabled semantically equivalent to "WiFi interface is running".

10-second ticker and grace-period state in NetworkModule:

uint32_t lastCheckMs_ = 0;
uint32_t staLostMs_   = 0;   // 0 = STA connected or never-connected; non-zero = grace countdown started
bool     staWasConnected_ = false;
static constexpr uint32_t STA_GRACE_MS = 60000;

void loop() override {
#ifdef ARDUINO
    uint32_t now = pal::millis();
    if (now - lastCheckMs_ < 10000) return;
    lastCheckMs_ = now;
    manageWifi_(now);
#endif
}

manageWifi_(now) logic:

ethConn = eth_ && eth_->isConnected()
staConn = sta_ && pal::wifi_sta_is_connected()

if ethConn:
    clear staLostMs_; disable AP and STA
else if staConn:
    clear staLostMs_; disable AP          // STA healthy: AP not needed
else:
    if staWasConnected_ and staLostMs_ == 0:
        staLostMs_ = now                  // STA just dropped: start grace timer
    if staLostMs_ != 0 and (now - staLostMs_) >= STA_GRACE_MS:
        enable AP; clear staLostMs_       // grace expired: open recovery AP
    // else: within grace period — do nothing, wait for STA to recover

staWasConnected_ = staConn

The children's onUpdate("enabled") handlers propagate the change to the WiFi stack.
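Since the real manageWifi_() is compiled only under ARDUINO, one way to exercise the decision logic on PC is to lift it into a pure state machine that takes connectivity and time as inputs. The sketch below is an assumption-laden illustration, not the shipped code: WifiPolicy, ApAction, and Decision are hypothetical names, and the "STA never connected" branch follows the bullet list above (enable AP immediately), a case the pseudocode leaves implicit.

```cpp
#include <cstdint>

enum class ApAction : uint8_t { None, Disable, Enable };

// Hypothetical PC-testable mirror of the manageWifi_() decision logic.
struct WifiPolicy {
    static constexpr uint32_t STA_GRACE_MS = 60000;

    struct Decision {
        ApAction ap;      // what to do with the WiFi AP this tick
        bool disableSta;  // true only when Ethernet takes over
    };

    uint32_t staLostMs = 0;        // 0 = no grace countdown running
    bool staWasConnected = false;  // STA state seen on the previous tick

    Decision tick(bool ethConn, bool staConn, uint32_t now) {
        Decision d{ApAction::None, false};
        if (ethConn) {
            staLostMs = 0;
            d = {ApAction::Disable, true};
        } else if (staConn) {
            staLostMs = 0;
            d.ap = ApAction::Disable;  // STA healthy: AP not needed
        } else {
            if (staWasConnected && staLostMs == 0) {
                staLostMs = now;  // STA just dropped: start grace timer
            }
            if (staLostMs == 0) {
                d.ap = ApAction::Enable;  // no countdown running: open AP now
            } else if (now - staLostMs >= STA_GRACE_MS) {
                d.ap = ApAction::Enable;  // grace expired: open recovery AP
                staLostMs = 0;
            }
            // otherwise: inside the grace period, leave the AP untouched
        }
        staWasConnected = staConn;
        return d;
    }
};
```

The subtraction now - staLostMs is done in unsigned 32-bit arithmetic, so the comparison stays correct across a millis() rollover. Like the original sketch, this version uses 0 as the "no countdown" sentinel, which assumes a drop is never recorded at exactly millisecond 0.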

EthernetModule isConnected(): currently always returns false. The interface is added now so NetworkModule can call it; the stub is replaced when real Ethernet support is implemented.


Part D: Update docs/user-guide/ui.md

Add (or update) a "Boot module creation" section that describes:

  • First boot: what modules are created and in what order.
  • Network group: NetworkModule + WifiSta + WifiAp + Ethernet.
  • Default pipeline: only created when no non-network top-level modules exist.
  • Dynamic WiFi: AP is automatically disabled when STA or Ethernet is connected; re-enabled when not.
  • User control: delete any default module to prevent it being recreated on next boot.


Part E: Duplicate module investigation and 409 error clarity

ModuleManager::addModule already checks for duplicate IDs at lines 416-418 and returns false (HTTP 409) if the ID is already registered. The guard is solid. Despite this, users occasionally see duplicate modules in the UI. Three root causes were identified:

1. AppSetup.h type name bug (primary cause)

ensureInfraModules() calls mm.addModule("SystemStatus", ...) but the TypeRegistry key is "SystemStatusModule" (the class name, set by REGISTER_MODULE(SystemStatusModule)). The type lookup fails silently and the module is never created. The live-test test0_infrastructure scenario then creates it using a different id (systemstatus1) via HTTP, so the user sees two apparent SystemStatus entries after running test0 more than once: the real sysinfo1 (if it ever existed from a prior boot) and systemstatus1 from test0. Fix: change "SystemStatus" to "SystemStatusModule" throughout AppSetup.h, or eliminate the call entirely via Part B's ensureDefaultModules.

2. Scenario scripts using add_or_exists (secondary cause)

live_suite.py's add_or_exists treats HTTP 409 as success if a module with the same ID already exists (type-checked). For types listed in SINGLETON_TYPES (INFRA_TYPES | {"FirmwareUpdateModule"}), the test step will skip re-creation when the correct type is already present. For non-singleton types (e.g., EffectsLayer, DriverLayer), a POST with a different ID will create a second instance if the first was not cleaned up. The delete_all_modules step at the start of each scenario should prevent this, but it preserves INFRA_TYPES modules — so if a prior scenario left a non-infra module with the same type but a different id, a new one will be created.

3. Ambiguous HTTP 409 error message (diagnostic cause)

AppRoutes.cpp returns 409 for three distinct failures: ID already exists, unknown type, invalid parent ID. These are currently indistinguishable from the HTTP response alone, making debugging harder. Fix: return a reason field in the JSON body distinguishing the three cases.

Fixes in this sprint:

  • AppSetup.h: fix "SystemStatus" → "SystemStatusModule" (addressed implicitly by Part B's ensureDefaultModules rewrite)
  • AppRoutes.cpp: return distinct reason strings in the 409 response body ("id_exists", "unknown_type", "invalid_parent")
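A minimal sketch of how the three reason strings might be produced. The reason values are the ones listed above; the AddModuleError enum, the helper names, and the exact JSON body shape ({"reason": ...} with no other fields) are assumptions for illustration and may not match AppRoutes.cpp.

```cpp
#include <string>

// Hypothetical error classification for a failed addModule request.
enum class AddModuleError { IdExists, UnknownType, InvalidParent };

// Map each failure to the reason string reported in the 409 body.
inline const char* reasonFor(AddModuleError e) {
    switch (e) {
        case AddModuleError::IdExists:      return "id_exists";
        case AddModuleError::UnknownType:   return "unknown_type";
        case AddModuleError::InvalidParent: return "invalid_parent";
    }
    return "unknown";  // unreachable for valid enum values
}

// Build an assumed JSON body; reason strings contain no characters needing escapes.
inline std::string conflictBody(AddModuleError e) {
    return std::string("{\"reason\":\"") + reasonFor(e) + "\"}";
}
```

A caller such as a live-test helper can then branch on the reason field instead of inferring the failure from the status code alone.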

Design decisions

Why "else do nothing" instead of per-type checks? The previous "patch up missing pieces" approach was opaque — it was hard to predict whether a module would be added on the next reboot. The new rule is a single, testable invariant: the first-boot state is fully deterministic; any subsequent state is entirely the user's configuration.

Why is SystemStatusModule part of the conditional set? Previously it was added unconditionally. Making it conditional brings it in line with the other defaults — if the user removes it they clearly don't want it. The boot-guard pattern (ensureNetworkModules re-creates network if deleted) is reserved for modules that are genuinely required for the device to be accessible (networking). Infra/status modules are optional from the device's perspective.

Why onUpdate("enabled") on WiFi child modules rather than direct PAL calls from NetworkModule? Direct PAL calls from NetworkModule would bypass the module's state machine and leave its status_ control stale. Routing through setControl("enabled") → onUpdate keeps the module self-consistent and makes the WiFi state visible in the UI.


Definition of Done

  • [x] AppSetup.h: ensureNetworkModules creates eth1 (child of network1) alongside sta1 and ap1 on first boot
  • [x] AppSetup.h: ensureDefaultModules replaces ensureDefaultPipeline + ensureInfraModules; creates full default set only when no non-network top-level modules exist
  • [x] AppSetup.cpp: calls ensureNetworkModules then ensureDefaultModules (replacing the old pair)
  • [x] ModuleManager.cpp: instantiateDefaultPipeline_ updated for PC — checks "no top-level modules" and creates full default set (including SystemStatus, FirmwareUpdateModule, DeviceDiscoveryModule)
  • [x] ModuleManager.cpp: removes the !hasDriver && !hasEffects auto-pipeline check from setup()
  • [x] NetworkModule: setInput("sta", ...), setInput("ap", ...), setInput("eth", ...) added; loop() manages WiFi with 10-second ticker; lastCheckMs_ uint32_t member
  • [x] WifiApModule: onUpdate("enabled") calls wifi_ap_stop() on disable and startAp() on enable
  • [x] WifiStaModule: onUpdate("enabled") disconnects on disable and reconnects on enable
  • [x] EthernetModule: isConnected() method added (returns false; stub for future Ethernet implementation)
  • [x] ensureNetworkModules wires sta1, ap1, eth1 as inputs to network1 so NetworkModule receives the typed pointers
  • [x] Tests: new unit tests for ensureDefaultModules (no modules → full set created; DriverLayer present → nothing added)
  • [x] Tests: WifiApModule.onUpdate("enabled") stops/starts AP; WifiStaModule.onUpdate("enabled") disconnects/reconnects; EthernetModule.isConnected() returns false on PC
  • Note: NetworkModule grace-period logic is #ifdef ARDUINO-only — not testable on PC; verified by code review
  • [x] AppRoutes.cpp: HTTP 409 response body includes a reason field ("id_exists", "unknown_type", "invalid_parent") so callers can distinguish the three failure cases
  • [x] docs/user-guide/ui.md: boot module creation section added
  • [x] PC live tests pass (7/7 scenarios); ESP32 live tests skipped (no devices connected during sprint completion)
  • [x] esp32dev and esp32s3_n16r8 build successfully

Result

Metric Value
Unit tests 390/390 pass (4 new)
PC live tests 7/7 scenarios pass
esp32dev build 1374 KB flash, 19.9% RAM
esp32s3_n16r8 build 1362 KB flash, 19.3% RAM
AppSetup.h ensureNetworkModules + ensureDefaultModules replace 3 old boot functions
NetworkModule WiFi management with 10 s ticker and 60 s STA grace period
WifiApModule / WifiStaModule onUpdate("enabled") reactive AP/STA control
HTTP 409 Now includes reason field: id_exists / unknown_type / invalid_parent

See test results for full pass/fail breakdown.


Retrospective

What went well:

  • The "no non-network top-level modules" rule (countTopLevelNonNetwork()) gives a single, testable invariant for first-boot behavior: deterministic and easy to reason about compared to the patchwork of hasDriver && hasEffects checks it replaced.
  • Routing WiFi enable/disable through setControl("enabled") -> onUpdate keeps each module self-consistent. NetworkModule never needs to know about STA/AP internals; the child modules keep their own status_ display up to date.
  • rewireModule("network1", inputs) after creating the four network modules is a clean pattern: create children first, then wire the parent. No ordering constraint on addModule itself.
  • The StatefulModuleBase* type for ap_ and sta_ in NetworkModule solved the circular-include problem cleanly: WifiAp.h and WifiSta.h both include Network.h, so Network.h cannot include them. The base pointer is sufficient for setControl() calls.

What was tricky:

  • runSetup() vs setup() in tests: setup() does not register the enabled_ control; that is runSetup()'s job (the base-class wrapper). The three new behavior tests initially called ap.setup() and the setControl("enabled", false) returned false silently (control not found), so onUpdate was never called and status stayed "starting". Fixed by switching to runSetup().
  • The circular-include problem between Network.h and WifiAp.h/WifiSta.h was not obvious until the first compile. Storing ap_/sta_ as StatefulModuleBase* and casting in setInput() is the right fix, but required understanding which header depends on which.
  • AppSetup.h previously used "SystemStatus" (wrong) instead of "SystemStatusModule" as the type name string. The bug was latent until Sprint 10's investigation of apparent duplicate-creation. Eliminating ensureInfraModules entirely fixed it without a targeted patch.

Complexity estimate: Large. Three distinct sub-systems changed (boot logic, WiFi management, HTTP error details) plus four test files and the ui.md doc.

Seeds for future sprints:

  • Sprint 11: implement EthernetModule for real (LAN8720/W5500); NetworkModule.manageWifi_() already calls eth_->isConnected() and just needs the stub replaced.
  • Expose STA_GRACE_MS (currently 60 000 ms compile-time constant) as a NetworkModule control for field-adjustable debounce.
  • Add a NetworkModule live test: bring STA up, verify AP disables; bring STA down, wait grace period, verify AP re-enables. Requires hardware or a WiFi simulation stub.


Sprint 11: Ethernet Implementation

Scope: Implement EthernetModule for real on ESP32 classic (LAN8720 RMII) and ESP32-S3 (W5500 SPI). Add Ethernet PAL functions to Pal.h covering both the Arduino ETH.h path and the bare IDF_VER path so a future Arduino-free build compiles cleanly. Add DHCP client and static IP modes; static IP mode serves as the direct-connect ("AP analog") path. Document what ESP32-P4 Ethernet will require when hardware arrives.

Depends on: Sprint 10 (wires eth_ pointer in NetworkModule; adds isConnected() stub; Sprint 11 makes it real).

Identified from: Sprint 10 retrospective seeds; user request.

Summary

Part Description Est
A: Ethernet PAL functions 6 PAL functions (eth_init, eth_is_connected, etc.); ARDUINO, IDF_VER, and PC stub branches M
B: EthernetModule implementation Full module replacing stub: setup/loop/isConnected, DHCP+static controls, healthReport() M
C: Direct-connect mode Static IP path (AP analog); recommended defaults; link-local note; doc update S
D: ESP32-P4 documentation GMAC, SDIO WiFi coprocessor, IDF 5.3+, PAL additions needed; no implementation S
Total M

Background: PAL structure and the IDF_VER path

All network PAL functions follow a three-way platform switch that must be preserved for every new function added:

#ifdef ARDUINO
    // Arduino ESP32 framework — ETH.h / WiFi.h / esp_netif via Arduino wrappers
#elif defined(IDF_VER)
    // Bare ESP-IDF — esp_eth / esp_netif / lwIP directly; no Arduino wrappers
#else
    // PC build — no-op stubs, returns false / empty string
#endif

The IDF_VER path exists today for WiFi but has minimal/stub bodies. Every Ethernet PAL function added in this sprint must have a real IDF_VER body (not just a stub) because the long-term goal is to be able to build without Arduino.h. This means using esp_eth, esp_netif, and esp_event APIs directly in the IDF_VER branch, not delegating to ETH.h.

The ARDUINO and IDF_VER implementations can share the same PAL function signatures; the #ifdef is inside the function body, not in the declaration.


Hardware variants and board-specific configuration

Two Ethernet hardware variants are supported. The selection is made at compile time via a flag defined in platformio.ini per board environment:

Board Hardware Interface Flag
esp32dev LAN8720 RMII (GPIO) -DPMM_ETH_LAN8720
esp32s3_n16r8 W5500 SPI -DPMM_ETH_W5500

Note: flags use the PMM_ prefix (e.g. PMM_ETH_LAN8720 not ETH_PHY_LAN8720) to avoid colliding with the eth_phy_type_t enum values of the same name in esp-idf.

Pin assignments (MDC, MDIO, PHY address for RMII; SCK, MISO, MOSI, CS, IRQ for SPI) are defined as compile-time constants in the same board-specific platformio.ini env, e.g.:

[env:esp32dev]
build_flags =
    -DPMM_ETH_LAN8720
    -DETH_RMII_MDC=23
    -DETH_RMII_MDIO=18
    -DETH_RMII_PHY_ADDR=1

[env:esp32s3_n16r8]
build_flags =
    -DPMM_ETH_W5500
    -DETH_SPI_SCK=12
    -DETH_SPI_MISO=13
    -DETH_SPI_MOSI=11
    -DETH_SPI_CS=10
    -DETH_SPI_IRQ=14

EthernetModule itself contains no pin numbers. It calls PAL functions; the PAL reads the compile-time constants and dispatches to the right hardware init.
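The compile-time dispatch can be sketched as below. The two PMM_ flags are the ones defined above; the pal_sketch namespace, the EthVariant enum, eth_variant(), and the mutual-exclusion #error are illustrative assumptions rather than the actual contents of Pal.h.

```cpp
namespace pal_sketch {

// Assumed guard: a board environment should select at most one variant.
#if defined(PMM_ETH_LAN8720) && defined(PMM_ETH_W5500)
#error "Define at most one of PMM_ETH_LAN8720 / PMM_ETH_W5500 per environment"
#endif

enum class EthVariant { None, Lan8720, W5500 };

// Resolved entirely at compile time from the platformio.ini build flags.
constexpr EthVariant eth_variant() {
#if defined(PMM_ETH_LAN8720)
    return EthVariant::Lan8720;
#elif defined(PMM_ETH_W5500)
    return EthVariant::W5500;
#else
    return EthVariant::None;  // PC build or a board without Ethernet
#endif
}

// True if Ethernet hardware is compiled in for this board.
constexpr bool has_ethernet() {
    return eth_variant() != EthVariant::None;
}

}  // namespace pal_sketch
```

Because both functions are constexpr, a caller can branch with a plain if on has_ethernet() and let the compiler drop the dead path on builds without either flag.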


Part A: Ethernet PAL functions

Add to src/pal/Pal.h:

// Returns true if Ethernet hardware is compiled in for this board.
inline constexpr bool has_ethernet();

// Initialise the Ethernet peripheral. Called once from EthernetModule::setup().
// Returns true if the hardware was found and initialisation succeeded.
inline bool eth_init();

// True if Ethernet link is up and an IP address has been assigned (DHCP or static).
inline bool eth_is_connected();

// Write the current Ethernet IP address into buf (null-terminated). Empty string if not connected.
inline void eth_local_ip(char* buf, size_t len);

// Switch to DHCP client mode (default after eth_init).
inline void eth_set_dhcp();

// Set a static IP immediately. Disables DHCP client.
// Pass nullptr for gateway/subnet to use defaults (gw = ip with last octet 1, /24).
inline void eth_set_static_ip(const char* ip, const char* gateway, const char* subnet);

ARDUINO implementation: delegates to ETH.h (ETH.begin(...), ETH.config(...), ETH.localIP().toString()). eth_init() dispatches on PMM_ETH_LAN8720 vs PMM_ETH_W5500 at compile time to call the correct ETH.begin() overload.

IDF_VER implementation: uses esp_eth_driver_install, esp_netif_new, esp_eth_start, esp_event_handler_register(ETH_EVENT, ...). Static IP uses esp_netif_set_ip_info. DHCP client uses esp_netif_dhcpc_start.

PC stub: has_ethernet() returns false; all other functions are no-ops / return false / write empty strings.
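The nullptr-gateway default described in the eth_set_static_ip comment ("gw = ip with last octet 1") could be derived with a helper along these lines. defaultGatewayFor is a hypothetical name, and parsing with sscanf is only one possible approach; the real PAL may rely on the framework's own IP address types instead.

```cpp
#include <cstddef>
#include <cstdio>

// Hypothetical helper: write "a.b.c.1" for a dotted-quad input "a.b.c.d".
// Returns false on malformed input or if the result does not fit in out.
inline bool defaultGatewayFor(const char* ip, char* out, std::size_t len) {
    if (!ip || !out || len == 0) return false;
    unsigned a = 0, b = 0, c = 0, d = 0;
    char extra = 0;
    // Exactly four fields must parse; a fifth item means trailing garbage.
    if (std::sscanf(ip, "%u.%u.%u.%u%c", &a, &b, &c, &d, &extra) != 4) return false;
    if (a > 255 || b > 255 || c > 255 || d > 255) return false;
    const int n = std::snprintf(out, len, "%u.%u.%u.1", a, b, c);
    return n > 0 && static_cast<std::size_t>(n) < len;
}
```

With the direct-connect defaults used later in this sprint, an IP of 192.168.5.1 maps to a gateway of 192.168.5.1, so the device acts as its own gateway.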


Part B: EthernetModule implementation

Replace the current stub (src/modules/system/Ethernet.h) with a full implementation:

setup():

1. Calls pal::eth_init(). If it returns false, sets status_ = "init_failed" and returns.
2. If a static IP is configured (loaded from saved state), calls pal::eth_set_static_ip(...).
3. Otherwise calls pal::eth_set_dhcp().
4. Registers controls (see below).

loop():

  • Polls pal::eth_is_connected() once per second (millis-based debounce).
  • On state change: updates status_ and ip_address_ controls; calls pal::eth_local_ip().
  • On PC: has_ethernet() is false, so loop() is a no-op beyond the guard.

isConnected() (used by NetworkModule): returns pal::eth_is_connected().

Controls:

key type description
status display "disconnected" / "connecting" / "connected" / "init_failed"
ip_address display Current IP (empty when disconnected)
mode select "dhcp" | "static" (default: "dhcp")
static_ip text Only active when mode == "static"
static_gateway text Only active when mode == "static"
static_subnet text Default "255.255.255.0"

onUpdate("mode") and onUpdate("static_ip") apply the new config immediately via PAL if Ethernet is already up.

healthReport(): "eth=connected ip=192.168.1.42" / "eth=disconnected" / "eth=unsupported" (PC).
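The three report shapes can be captured in a pure formatting function. This is an illustrative sketch: ethHealthReport is a hypothetical free function taking its inputs as parameters, and it returns std::string for brevity, whereas the real module may format into a fixed buffer.

```cpp
#include <string>

// Hypothetical formatter for the three healthReport() shapes listed above.
inline std::string ethHealthReport(bool hasEthernet, bool connected, const std::string& ip) {
    if (!hasEthernet) return "eth=unsupported";   // PC build: no Ethernet compiled in
    if (!connected) return "eth=disconnected";    // link down or no IP assigned yet
    return "eth=connected ip=" + ip;
}
```

Keeping the formatter free of PAL calls means the exact strings can be checked in the PC unit tests even though has_ethernet() is always false there.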


Part C: Direct-connect mode (AP analog)

When Ethernet is wired directly between the ESP32 and a laptop (no router), there is no DHCP server to assign addresses. Static IP mode serves as the "AP analog" for Ethernet: set a known fixed IP on the ESP32, then manually configure a matching IP on the laptop.

Recommended defaults for direct-connect:

  • ESP32 static IP: 192.168.5.1 / gateway 192.168.5.1 / subnet 255.255.255.0
  • Laptop: 192.168.5.2 / subnet 255.255.255.0 (manual config in OS network settings)
  • The device is then reachable at http://192.168.5.1

Relationship to WiFi AP: NetworkModule::manageWifi_() (Sprint 10) disables the WiFi AP whenever eth_->isConnected() returns true. Because eth_is_connected() is true for a static IP as well as a DHCP lease, a static-IP direct connection already counts as "connected" and the ticker needs no change. A user who still wants the WiFi AP alongside a direct-connect link is expected to manage it manually.

Link-local (169.254.x.x): lwIP supports link-local addressing (AutoIP, the mechanism Windows calls APIPA). If DHCP fails and mode == "dhcp", the ESP32 may auto-assign a 169.254 address after timeout. Modern OSes do the same. This provides a zero-config direct-connect path without any manual IP setting, but the 169.254.x.x address is non-deterministic. Document this as an observed behavior, not a designed feature. Static mode is the designed direct-connect path.


Part D: ESP32-P4 — what will be needed

The ESP32-P4 has on-board GMAC Ethernet and uses an external WiFi/BT coprocessor (e.g., ESP32-C6) connected via SDIO. No ESP32-P4 hardware is targeted in this sprint. This section documents what a future sprint will need.

PlatformIO:

  • New [env:esp32p4] and [env:esp32p4_eth] entries in platformio.ini.
  • Board JSON files (esp32_p4_nano.json, esp32_p4_eth.json) in the PlatformIO boards directory or boards/ in the project.
  • Framework: Arduino ESP32 core 3.1 or later (P4 support landed in core 3.1), or bare IDF 5.3+.
  • Build flag: -DETH_PHY_EMAC (P4 uses its own internal EMAC + external PHY, typically IP101).

PAL changes:

  • Add ETH_PHY_EMAC branch in eth_init() inside both ARDUINO and IDF_VER sections.
  • P4 EMAC uses esp_eth_mac_new_esp32, same esp_netif plumbing as classic ESP32, so the IDF_VER path is largely reusable.
  • WiFi on P4 requires SDIO bootstrap before wifi_sta_connect() / wifi_ap_start() can be called. NetworkModule::setup() will need a pal::wifi_coprocessor_init() call (P4 only) before the existing WiFi init.
  • Add #ifdef ESP_PLATFORM_P4 guards (or CONFIG_IDF_TARGET_ESP32P4 from sdkconfig) to any P4-specific bootstrap code.

What does NOT change: EthernetModule, NetworkModule, WiFi modules — they call PAL functions only. All hardware differences are absorbed in Pal.h.


Design decisions

Why one EthernetModule for both RMII and W5500? The module only calls PAL functions. The hardware difference is entirely inside eth_init(). Adding a second module type (EthernetW5500Module) would duplicate the status/IP/mode logic for no benefit.

Why put all pin constants in platformio.ini rather than a header? platformio.ini is the single source of truth for board configuration. Scattering pin assignments across headers creates inconsistency between boards. The PAL reads the CMake/PlatformIO defines directly.

Why require an IDF_VER body now, not later? The migration from Arduino to bare IDF is a future goal, not an immediate task. However, writing stub-only IDF_VER bodies now means they will rot — when the migration happens, every function needs rewriting anyway. Writing real bodies now keeps the IDF path exercisable and prevents silent compile failures when someone eventually enables IDF_VER on a board.

Why static IP instead of a DHCP server for direct connect? Running a DHCP server on the Ethernet netif requires esp_netif_dhcps_start() and a server config, which works but adds state complexity to EthernetModule. Static IP achieves the same goal with one PAL call. A DHCP server on Ethernet is noted as a future enhancement.


Definition of Done

  • [x] Pal.h: has_ethernet(), eth_init(), eth_is_connected(), eth_local_ip(), eth_set_dhcp(), eth_set_static_ip() — all three platform branches (ARDUINO, IDF_VER, PC stub) present and compiling
  • [x] platformio.ini: PMM_ETH_LAN8720 + RMII pin flags added to [env:esp32dev]; PMM_ETH_W5500 + SPI pin flags added to [env:esp32s3_n16r8] (flags use PMM_ prefix — see hardware section)
  • [x] src/modules/system/Ethernet.h: full implementation replacing stub — setup() init, loop() poll, isConnected(), status/ip/mode/static_ip/static_gateway/static_subnet controls, healthReport()
  • [x] onUpdate("mode") and onUpdate("static_ip") apply config live if Ethernet is already initialised
  • [x] NetworkModule: manageWifi_() treats eth_->isConnected() as real (no change to call site; Sprint 10 wired it; Sprint 11 provides the real value)
  • [x] Tests: test_network.cpp updated — add tests for isConnected() on PC, loadState/saveState round-trip, default static_subnet
  • [x] Tests: PC build compiles and all tests pass (Ethernet PAL stubs return safe values)
  • [x] docs/modules/network/ethernet-module.md: updated with real controls, DHCP/static modes, direct-connect setup instructions
  • [x] Direct-connect section added to docs/modules/network/ethernet-module.md (user chose ethernet-module.md only, not getting-started.md)
  • [x] esp32dev build compiles with PMM_ETH_LAN8720 (no hardware flash required in CI)
  • [x] esp32s3_n16r8 build compiles with PMM_ETH_W5500 (no hardware flash required in CI)
  • [x] docs/development/release-07.md: Part D (P4 requirements) written (this section)
  • [ ] ESP32 live tests pass on both devices — deferred; hardware will be tested in Sprint 12 (user intent: test with real hardware next sprint)

Result

| Metric | Value |
|---|---|
| Unit tests | 392/392 pass (2 new: isConnected on PC, loadState/saveState round-trip, default static_subnet) |
| PC live tests | 7/7 scenarios pass (82 steps) |
| esp32dev build | 1441 KB flash (78.5%), 64 KB RAM (20.0%) |
| esp32s3_n16r8 build | 1427 KB flash (34.8%), 62 KB RAM (19.5%) |
| ESP32 live tests | deferred (devices unreachable during sprint; not a Sprint 11 regression) |

Flash footprint increased by ~67 KB on both boards versus Sprint 10 (SPI and Ethernet library sources now compiled in for W5500 path; LAN8720 path increased by similar amount as ETH.cpp was always compiled).

See test-results.md and live-pc-macos.md.


Retrospective

What went well:

  • The three-way PAL platform switch pattern (ARDUINO / IDF_VER / PC stub) kept EthernetModule free of any #ifdef — all hardware differences absorbed in Pal.h. The has_ethernet() constexpr guard at the top of setup() was enough to keep PC tests clean without touching test code.
  • The PMM_ETH_ prefix decision was made early (prompted by an enum collision with eth_phy_type_t values of the same name in esp-idf) and avoided a subtle linker-time name conflict.
  • Static IP mode as the direct-connect path required no new module infrastructure — it is just applyIpConfig_() with modeIdx_ == 1. The "AP analog" pattern reused exactly the same control set as DHCP mode.

What was tricky:

  • PlatformIO LDF and the SPI linker gap. The LDF discovered and compiled ETH.cpp (Pal.h includes ETH.h, and the LDF follows the include chain in chain+ mode). SPI.h, however, was found via a CPPPATH entry added by the pre-script. That satisfied the #include without the LDF ever discovering the SPI library directory, so SPI.cpp was never compiled and the link failed. Setting EXTRA_CXXSRC (the wrong variable) had no effect. Fix: env.BuildSources() in add_spi_eth_path.py explicitly compiles SPI.cpp without conflicting with the LDF-managed ETH.cpp. This is the correct pattern when a library source file must be compiled but would otherwise be bypassed by a CPPPATH shortcut.
  • arduino-esp32 3.x ETH.begin() signature for W5500. The 10-parameter form is begin(eth_phy_type_t, phy_addr, cs, irq, rst, spi_host_device_t, sck, miso, mosi, freq_mhz) — not the 2.x order. Getting this wrong produced a compile error after correcting the SPI linker issue; checking the framework source directly was the only reliable way.

Seeds for Sprint 12:

  • Hardware live test of LAN8720 and W5500 with real boards (user intent: next sprint).
  • Evaluate an [env:esp32dev_idf] bare-IDF env for exercising the IDF_VER Ethernet path in CI (needs sdkconfig, no ESPAsyncWebServer; significant setup cost — may be a separate sprint).
  • DHCP server on Ethernet for direct-connect without static IP (noted as future enhancement).

Complexity estimate: Large (L).


Sprint 12: WiFi Reconnect and PAL Documentation

Scope: Automatic WiFi STA reconnect after signal loss or router reboot; re-enable STA when Ethernet drops; align all recovery timers at 30 s. Document the Arduino/IDF mixing strategy in pal.md following a design discussion about the three-way platform switch.

Identified from: Sprint 11 retrospective (hardware live test follow-up); design discussion on PAL architecture.

Summary

| Part | Description | Est |
|---|---|---|
| A: WifiSta auto-reconnect | Retry connection every 30 s while enabled and disconnected | S |
| B: Network Ethernet-drop recovery | Re-enable STA when Ethernet transitions down; align grace period to 30 s | S |
| C: PAL architecture documentation | "Arduino, IDF, and mixing both" section in pal.md | S |
| Total | | S |

Part A: WifiSta auto-reconnect

WifiStaModule::loop() previously stopped retrying after a failed connection attempt. A new else if branch, placed after the connect-polling block, now calls startConnect() again every RETRY_INTERVAL_MS = 30000 ms while all of the following hold:

  • isEnabled() is true (NetworkModule disables STA when Ethernet is up; no retry while disabled)
  • Not currently in a connect attempt (!connecting_)
  • Has credentials (ssid_[0] != '\0')
  • WiFi hardware is available (pal::has_wifi())

startConnect() resets lastRetryMs_ so each new attempt starts a fresh 30 s window regardless of how much time had elapsed.
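
The retry conditions above can be condensed into a single predicate. This is a minimal sketch, not the code in WifiSta.h: the StaRetryState struct and shouldRetry() are hypothetical names, the connected field reflects the "while enabled and disconnected" wording, and the wrap-safe unsigned subtraction is an assumption about how elapsed time is measured.

```cpp
#include <cstdint>

constexpr uint32_t RETRY_INTERVAL_MS = 30000;

// Hypothetical snapshot of the state WifiStaModule::loop() consults.
struct StaRetryState {
  bool enabled;          // isEnabled(): false while NetworkModule gates STA off
  bool connected;        // STA currently has a link
  bool connecting;       // connecting_: a connect attempt is in flight
  bool hasCredentials;   // ssid_[0] != '\0'
  bool hasWifi;          // pal::has_wifi()
  uint32_t lastRetryMs;  // lastRetryMs_: reset by startConnect()
};

// True when a new startConnect() should be issued at time nowMs.
// Unsigned subtraction keeps the comparison valid across a millis() wrap
// (assumption: timestamps are 32-bit millisecond counters).
constexpr bool shouldRetry(const StaRetryState& s, uint32_t nowMs) {
  return s.enabled && !s.connected && !s.connecting && s.hasCredentials &&
         s.hasWifi &&
         static_cast<uint32_t>(nowMs - s.lastRetryMs) >= RETRY_INTERVAL_MS;
}
```

Because startConnect() resets lastRetryMs_, the predicate turns false again immediately after each attempt and stays false for the next 30 s.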


Part B: Network Ethernet-drop recovery

Two changes to NetworkModule::manageWifi_():

  1. ethWasConnected_ state tracking — detects the Ethernet up-to-down transition. On that tick, sta_->setControl("enabled", true) re-enables STA immediately so Part A's retry loop kicks in. Without this, STA stayed permanently disabled after Ethernet had been up.

  2. STA_GRACE_MS reduced from 60 s to 30 s — the AP recovery timer now matches the STA retry interval. All three recovery events (STA retry, Ethernet-drop STA re-enable, AP open) now occur on 30 s boundaries.
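
The up-to-down detection in change 1 is a plain edge detector. The sketch below shows the idea under stated assumptions: EthDropTracker and ethJustDropped() are hypothetical names, and the real manageWifi_() reacts to the edge by calling sta_->setControl("enabled", true) rather than returning a flag.

```cpp
#include <cstdint>

constexpr uint32_t STA_GRACE_MS = 30000;       // reduced from 60000
constexpr uint32_t RETRY_INTERVAL_MS = 30000;  // STA retry cadence from Part A

// True only for the transition connected -> disconnected.
constexpr bool ethJustDropped(bool wasConnected, bool isConnected) {
  return wasConnected && !isConnected;
}

// Hypothetical stand-in for the ethWasConnected_ member in NetworkModule.
struct EthDropTracker {
  bool ethWasConnected = false;

  // Call once per tick with the current Ethernet link state.
  // Returns true on the single tick where Ethernet went down,
  // which is when STA should be re-enabled.
  bool update(bool ethConnected) {
    const bool dropped = ethJustDropped(ethWasConnected, ethConnected);
    ethWasConnected = ethConnected;
    return dropped;
  }
};
```

Firing only on the edge matters: re-enabling STA on every tick while Ethernet is down would fight any other code that deliberately disables it.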


Part C: PAL architecture documentation

A new "Arduino, IDF, and mixing both" section in docs/developer-guide/pal.md documents the outcome of a design discussion on the three-way ARDUINO / IDF_VER / PC switch:

  • The IDF_VER branch is dead code in all current builds (ARDUINO always matches first when framework = arduino)
  • Direct esp_* calls work inside framework = arduino builds — this is common practice for features the Arduino wrappers do not expose (power management, WiFi fine-tuning, P4 hardware)
  • Library compatibility: ESPAsyncWebServer and FastLED require Arduino.h; ArduinoJson v7 is framework-agnostic
  • The ESP_PLATFORM path: collapsing ARDUINO and IDF_VER into one branch for new PAL functions where no Arduino wrapper exists (P4 GMAC, codecs)
  • esp_netif_init / event loop double-init caveat when mixing IDF calls with Arduino WiFi init

Definition of Done

  • [x] WifiSta.h: auto-retry every RETRY_INTERVAL_MS = 30000 ms while enabled and disconnected
  • [x] Network.h: ethWasConnected_ added; STA re-enabled on Ethernet drop; STA_GRACE_MS = 30000
  • [x] docs/developer-guide/pal.md: "Arduino, IDF, and mixing both" section added
  • [x] All unit tests pass; PC and ESP32 builds clean
  • [x] PC live tests pass

Result

| Metric | Value |
|---|---|
| Unit tests | 392/392 pass (no new tests; retry logic is ARDUINO-only, not exercisable on PC) |
| PC live tests | 7/7 scenarios pass |
| ArtNet two-device | PASS (esp32s3_n16r8 MM-70BC reached and received packets) |
| esp32dev build | 1441 KB flash (78.5%), 64 KB RAM (20.0%); unchanged from Sprint 11 |
| esp32s3_n16r8 build | 1427 KB flash (34.8%), 62 KB RAM (19.5%); unchanged from Sprint 11 |
| esp32dev live test | skipped (device unreachable: stale IP, not a sprint regression) |

See test-results.md and live-pc-macos.md.


Retrospective

What went well:

  • The isEnabled() guard in the retry branch reuses the existing enabled base-class control, so NetworkModule's Ethernet-gating of STA (which sets enabled = false) automatically suppresses retries — no extra flag needed.
  • Reducing STA_GRACE_MS to 30 s to match RETRY_INTERVAL_MS was a one-line change that aligned the whole recovery model. All recovery events are now on the same cadence.
  • The PAL design discussion surfaced a useful clarification: the IDF_VER branch is currently dead code, but the right long-term strategy is ESP_PLATFORM for new functions rather than maintaining two separate branches that converge on the same IDF API.

What was tricky:

  • The ethWasConnected_ fix was the less obvious half of the reconnect story. STA retry alone would not have helped after Ethernet drops because STA had been disabled by NetworkModule while Ethernet was up — it would never retry while enabled = false. Tracking the Ethernet transition was required to re-arm STA.

Seeds for Sprint 13:

  • Hardware live test: connect LAN8720 (esp32dev) and verify Ethernet and WiFi reconnect behavior on real hardware.
  • Consolidation question investigated but deferred: merging WifiAp, WifiSta, Ethernet into one NetworkModule is not worth the cost (UI, testing, size). The circular include friction could be reduced by extracting deviceName() to a lightweight DeviceInfo.h.
  • PAL ESP_PLATFORM refactor: apply to new functions (P4 GMAC) when that hardware arrives; leave existing Arduino-wrapper functions unchanged.

Complexity estimate: Small (S).


Sprint 13: PAL Cleanup and Deploy Pipeline Fixes

Scope: Remove all IDF_VER branches from Pal.h; consolidate the status docs (remove deploy-summary.md); fix summarise.py overwriting per-env live results files; fix livetest.py overwriting logs when a device is unreachable.

Identified from: Sprint 12 retrospective (PAL cleanup); organic housekeeping on the deploy pipeline.

Summary

| Part | Description | Est |
|---|---|---|
| A: Remove IDF_VER branches | Rewrite Pal.h to #ifdef ARDUINO / #else throughout; remove _eth IDF namespace; simplify eth_init | S |
| B: Update pal.md | Rename "Three-way" to "Two-way" table; remove "IDF migration path" section; add rule statement | XS |
| C: Consolidate status docs | Merge deploy-summary.md into index.md; remove the file; update all references | S |
| D: Deploy pipeline correctness | summarise.py stops overwriting per-env MD files; livetest.py skips unreachable devices without touching logs | S |
| Total | | M |

Part A: Pal.h rewrite

All #elif defined(IDF_VER) branches removed. Every function now follows:

#ifdef ARDUINO
    // Arduino ESP32 implementation
#else
    // PC / Raspberry Pi stub
#endif

The _eth namespace (IDF event-driven Ethernet state helpers: _EthEvent, eth_event_handler, ethState) was removed entirely. The eth_init function shrank from ~60 lines to ~5:

inline bool eth_init() {
#if defined(ARDUINO) && defined(PMM_ETH_LAN8720)
  return ETH.begin(ETH_PHY_LAN8720, ...);
#elif defined(ARDUINO) && defined(PMM_ETH_W5500)
  return ETH.begin(ETH_PHY_W5500, ...);
#else
  return false;
#endif
}

A comment added to the Ethernet section: future hardware (e.g. ESP32-P4 GMAC) that has no Arduino ETH.h wrapper should add a new PMM_ETH_* flag and use direct IDF calls inside the ARDUINO block.
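
As an illustration of that comment, a future board could be wired in roughly as follows. This is a sketch under assumptions: PMM_ETH_P4_EMAC is a made-up flag name, pal_sketch is a hypothetical namespace, and the IDF bring-up itself is left as a placeholder rather than guessed at.

```cpp
namespace pal_sketch {  // hypothetical namespace, not the real pal::

// Compile-time capability check, extended with the new flag.
constexpr bool has_ethernet() {
#if defined(ARDUINO) && (defined(PMM_ETH_LAN8720) || defined(PMM_ETH_W5500) || \
                         defined(PMM_ETH_P4_EMAC))
  return true;
#else
  return false;  // PC / Raspberry Pi stub
#endif
}

// New hardware without an Arduino ETH.h wrapper: the direct IDF calls
// live inside the ARDUINO block, so the two-way switch is preserved.
inline bool eth_init_new_hw() {
#if defined(ARDUINO) && defined(PMM_ETH_P4_EMAC)
  // Direct esp_eth_* / esp_netif_* bring-up would go here (not shown).
  return false;  // placeholder until implemented
#else
  return false;  // no such hardware in this build
#endif
}

}  // namespace pal_sketch
```

On a PC build neither macro is defined, so both functions compile down to the stub branch and module code needs no #ifdef of its own.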


Part B: pal.md update

  • "Three-way platform switch" table renamed to "Two-way platform switch"; IDF_VER row removed.
  • Rule statement added: use Arduino wrappers by default; fall back to direct IDF calls only when no Arduino wrapper exists; those calls go inside the ARDUINO block.
  • "IDF migration path" section removed (all items in that table were IDF_VER-specific stubs, now gone).

Part C: Status docs consolidation

docs/status/deploy-summary.md was a near-duplicate of docs/status/index.md. The deploy pipeline table (Build/Flash/Run/Live columns) was merged into index.md as a new ## Deploy summary section above the existing ## Test results table. deploy-summary.md was deleted and all 14 references across docs, scripts, and mkdocs.yml updated to point to status/index.md.

summarise.py was simplified accordingly: the _write_deploy_summary_md function was removed; its table-generation logic moved into _write_index_md. A ## Detail pages section is now appended to index.md on every run, listing all live-results-*.md files found on disk (not just those written in the current run), so results from previous hardware runs remain visible.


Part D: Deploy pipeline correctness

Two bugs where pipeline scripts silently destroyed previous good results:

summarise.py overwrote per-env MD files. live_suite.py writes docs/status/live-results-{env}.md directly after each run; it includes a ## Summary section (with per-test check counts) and a ## Scenarios section (per-step fps/heap data). summarise.py was independently re-generating these same files from the JSON, but using a simpler format without those sections. Fix: replaced _write_single_env_results / _write_live_results_md with _scan_live_files, which scans existing MD files on disk and returns their paths as links. summarise.py no longer writes per-env MD files at all.

livetest.py overwrote logs for unreachable devices. _run_esp32_test opened the log file for writing (truncating it) before attempting to connect, so an unreachable device always destroyed the previous run's log. Fix: a reachability probe (GET /api/system) runs before any file is opened; if it fails the device is skipped with a message and the log and JSON are left untouched.


Definition of Done

  • [x] Pal.h: no IDF_VER anywhere; all functions use #ifdef ARDUINO / #else
  • [x] pal.md: two-way switch table; rule statement; IDF migration section removed
  • [x] deploy-summary.md deleted; index.md has merged deploy + test tables and detail links
  • [x] summarise.py: no longer writes per-env MD files; _scan_live_files preserves live_suite.py output
  • [x] livetest.py: reachability probe before opening log; unreachable devices skip without touching files
  • [x] PC build clean; 392/392 unit tests pass; PC live tests pass
  • [x] esp32dev and esp32s3_n16r8 builds clean

Result

| Metric | Value |
|---|---|
| Unit tests | 392/392 pass |
| PC build | clean |
| PC live tests | 15/15 pass |
| esp32s3_n16r8 live tests | 15/15 pass |
| esp32dev build | 1441 KB flash (78.5%), 64 KB RAM (20.0%); unchanged |
| esp32s3_n16r8 build | 1427 KB flash (34.8%), 62 KB RAM (19.5%); unchanged |
| Pal.h line count | ~800 (down from ~1510) |
| deploy-summary.md | removed; content merged into index.md |

Retrospective

What went well:

  • The PAL rewrite was clean: IDF_VER branches were either identical to the Arduino path or simple stubs, so removing them caused zero regressions.
  • The _eth IDF namespace was entirely internal to PAL — no modules depended on it — making deletion safe.
  • The status consolidation caught two separate bugs in the deploy pipeline during the same session; fixing them together while the code was open was efficient.

What was tricky:

  • The summarise.py / live_suite.py split of responsibilities was not obvious: both were writing the same files, with summarise.py's version silently losing the Scenarios section. The fix required tracing the full data flow from JSON through both writers.
  • The Sprint 11 scope document still references the old three-way pattern (in its PAL structure and IDF_VER path background sections). It was left as-is since it accurately records the design as it stood when Sprint 11 was written.

Seeds for Sprint 14:

  • Hardware live test: connect LAN8720 (esp32dev) and verify Ethernet + WiFi reconnect behavior on real hardware.
  • DeviceInfo.h extraction: reducing circular include friction between NetworkModule children (WifiAp/WifiSta both include Network.h for deviceName()).

Complexity estimate: Medium (M).


Release 7 Backlog

All items consolidated into the cross-release backlog.