Release 7: OTA, Ethernet, and Runtime Hardening (v1.7.0)

Theme: Release 7 completes the field deployment story: over-the-air firmware updates with a CI release pipeline, Windows support, and a full Ethernet + WiFi management stack with automatic reconnect. Runtime hardening spans heap/OOM safety, static RAM tuning, and WebSocket log streaming. It closes out with PAL simplification (IDF_VER removal) and deploy pipeline correctness fixes.

Release Overview

Foundation from previous releases

Capability	Notes
Virtual/Physical layer split	Effects render in virtual space; layouts own the physical mapping
PhysMap	1:0, 1:1, 1:N pixel mappings; PSRAM-backed on S3
Modifier library	Mirror, Checkerboard, Scroll, Rotate, Tile
Non-rectangular layouts	RingLayout, WheelLayout, XmasTreeLayout
Memory observability	MemBoot balance sheet per module; MemLive fragmentation warnings at runtime
Time observability	Per-second CPU accounting with module hierarchy; REST + WS exposure
Scenario benchmarking	Declarative JSON pipelines shared by unit tests and live tests; fps + heap per step
Live test suite	13 tests (smoke / format / behavioral / integration) on PC + ESP32 via REST
Deploy pipeline	`all.py` with post-flash mem capture (`--reset`), live tests, and status docs
357 unit tests	All passing; smoke / format / behavioral / integration classification
Developer tooling	uv workspace, pre-commit hooks, PAL compile-time enforcement, MCP server

What Release 7 delivers

Problem / Goal	Sprint
No over-the-air firmware update path	Sprint 1 (FirmwareUpdateModule: file + GitHub)
No firmware assets on GitHub releases / no nightly build	Sprint 2 (CI release pipeline)
No Windows build or release binary	Sprint 3 (Windows build + CI)
Scenario baselines not populated from hardware	Sprint 5 (Scenario baseline + `extends`)
Classic ESP32 static RAM / fragmentation headroom thin	Sprint 6 (`LOG_RING_SIZE` tuning, WiFi buffer counts, dual `check_alloc`)
Ring buffer diagnostics not visible without serial monitor	Sprint 7 (`GET /api/log` WebSocket streaming + frontend log panel)
Heap safety and HTTP OOM crashes	Sprint 8 (per-module `controlAllocBytes`, HTTP OOM catch)
No dropdown control type	Sprint 9 (`select` control: backend index + frontend `<select>`)
Module creation UX and WiFi management gaps	Sprint 10 (boot module redesign, dynamic AP/STA management)
No Ethernet support (LAN8720 / W5500)	Sprint 11 (EthernetModule, PAL functions, static IP)
WiFi does not reconnect after signal loss or Ethernet drop	Sprint 12 (STA retry every 30 s, Ethernet-drop STA re-enable)
PAL IDF_VER dead code; status doc duplication; pipeline bugs	Sprint 13 (IDF_VER removal, deploy-summary consolidation, pipeline fixes)

Sprints

Sprint	Goal
Sprint 1	FirmwareUpdateModule: OTA PAL + file upload + GitHub releases UI + env in SystemStatus
Sprint 2	CI release pipeline: firmware assets on tagged releases + nightly pre-release build
Sprint 3	Windows build: `#ifdef _WIN32` in WsServer.h and Pal.h; `ws2_32` link; CI job; `.zip` artifact
Sprint 5	Scenario baseline: first hardware `--update-baseline` run; `"extends"` inheritance; wire into `all.py`
Sprint 6	Static RAM hardening: `LOG_RING_SIZE` 2 KB on all devices, WiFi buffer tuning, dual `check_alloc` guard
Sprint 7	`GET /api/log` frontend panel: WS push of ring buffer entries; collapsible log UI
Sprint 8	Heap safety: per-module `controlAllocBytes` hook, HTTP OOM catch, live-test correctness, `isPermanent()` removal
Sprint 9	`select` control type: `addControl(..., "select")` + `addControlValue()`; backend index storage; frontend `<select>` rendering
Sprint 10	Boot module creation redesign; dynamic AP/STA WiFi management with Ethernet gating
Sprint 11	EthernetModule: LAN8720 (RMII) + W5500 (SPI); PAL functions; DHCP + static IP
Sprint 12	WiFi reconnect: STA retry every 30 s; re-enable STA on Ethernet drop; PAL architecture docs
Sprint 13	PAL IDF_VER removal; `deploy-summary.md` consolidation; deploy pipeline correctness fixes

Sprint 1: FirmwareUpdateModule

Scope: Full over-the-air firmware update: PAL plumbing, a POST /api/firmware endpoint, and a FirmwareUpdateModule that supports both a local file upload and one-click flashing from GitHub releases. The GitHub releases path fetches the public releases API in the browser, matches assets to the current device environment, and streams the binary to the device — no internet access required on the ESP32 itself.

Deferred from: Release 5 original scope.

Summary

Part	Description	Est
PAL + HTTP endpoint	`pal::ota_*` functions, `POST /api/firmware` streaming endpoint, dual-OTA partition scheme	M
FirmwareUpdateModule	Module lifecycle, `update_status` control, OTA state integration	S
Frontend: file upload	File picker tab, XHR streaming to endpoint, progress bar	S
Frontend: GitHub releases	Browser-direct GitHub API, asset matching by env, version badge, sessionStorage cache	M
SystemStatus env field	`"env": BUILD_TARGET` in `GET /api/system` and `healthReport()`	XS
Tests	PAL stub tests, endpoint test (ota_write call count, ota_end once)	S
Total		L

Planned scope

PAL and endpoint:

pal::ota_begin(), pal::ota_write(buf, len), pal::ota_end(), pal::ota_abort() in Pal.h. On ESP32: wraps esp_ota_*. On PC: writes received bytes to a temp file and prints a log line.
POST /api/firmware (multipart or chunked binary body): streams bytes through pal::ota_write, calls pal::ota_end() on completion, triggers reboot. Returns {"ok":true} or {"error":"..."}.
Partition scheme: dual-OTA layout (partitions/esp32dev-ota.csv, partitions/esp32s3-ota.csv) so a running image can be updated without erasing LittleFS.
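
The streaming contract above (begin once, write per chunk, end once, abort on failure) can be modelled in a few lines. This is a Python model for illustration only, not the firmware code: `StubPal` and `handle_firmware_upload` are hypothetical names, the stub merely records calls the way a PC test double might, and the error strings are assumptions.

```python
from typing import Iterable


class StubPal:
    """Hypothetical stand-in for the pal::ota_* functions; it only records calls."""

    def __init__(self) -> None:
        self.calls: list[str] = []
        self.received = bytearray()

    def ota_begin(self) -> bool:
        self.calls.append("begin")
        return True

    def ota_write(self, chunk: bytes) -> bool:
        self.calls.append("write")
        self.received += chunk
        return True

    def ota_end(self) -> bool:
        self.calls.append("end")
        return True

    def ota_abort(self) -> None:
        self.calls.append("abort")


def handle_firmware_upload(pal: StubPal, chunks: Iterable[bytes]) -> dict:
    """Stream body chunks into the OTA writer without buffering the whole image."""
    if not pal.ota_begin():
        return {"error": "ota_begin failed"}  # error wording is an assumption
    for chunk in chunks:
        if not pal.ota_write(chunk):
            pal.ota_abort()  # give up; the running image stays untouched
            return {"error": "ota_write failed"}
    if not pal.ota_end():
        return {"error": "ota_end failed"}
    return {"ok": True}  # the real endpoint would trigger a reboot here
```

Feeding a 1 KB payload in four 256-byte chunks yields four `write` calls and exactly one `end` call, which is the shape of the endpoint unit test listed under Tests.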

SystemStatus — env field:

Add "env": BUILD_TARGET to GET /api/system response and to SystemStatus::healthReport(). BUILD_TARGET is already a compile-time define (esp32dev, esp32s3_n16r8, PC, …). This field is what the update UI uses to match GitHub release assets to the current device.

FirmwareUpdateModule:

Registered in ModuleRegistrations.cpp; isPermanent() = true.
Exposes an "update_status" display control (idle / downloading / flashing X% / done / error). disableSelf() is NOT used here; the module stays alive through the update.
WebSocket push: progress events {"type":"ota","pct":42} every ~5% so the frontend can update a progress bar without polling.

Frontend — two update paths in one card:

File upload tab: <input type="file" accept=".bin"> → reads file as ArrayBuffer → POST /api/firmware with Content-Type: application/octet-stream. Shows a progress bar driven by XMLHttpRequest.upload.onprogress.
GitHub releases tab: on open, browser JS calls https://api.github.com/repos/ewowi/projectMM/releases?per_page=5 directly (public API, no auth, no device internet required). For each release shows: tag name, release title, date, pre-release badge. Downloads the asset matching projectMM-{env}.bin (where env comes from GET /api/system). Streams the downloaded ArrayBuffer to POST /api/firmware. Shows the same progress bar. If no matching asset exists for a release, that release is greyed out.
Maximum 5 releases shown; controlled by the per_page query parameter.
Error handling: network failure fetching GitHub API shows "GitHub unreachable — use file upload"; missing asset shows "No firmware for {env} in this release".
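
The asset-matching rule for the GitHub releases tab can be sketched as follows. The real logic runs as browser JavaScript; this Python model only illustrates the rule. `match_release_assets` and the returned row layout are hypothetical, while `tag_name`, `name`, `prerelease`, `assets[].name` and `assets[].browser_download_url` are fields of the public GitHub releases API response.

```python
def asset_name_for(env: str) -> str:
    """Asset naming convention: projectMM-{env}.bin."""
    return f"projectMM-{env}.bin"


def match_release_assets(releases: list[dict], env: str) -> list[dict]:
    """Build one display row per release; releases without a matching asset are greyed out."""
    wanted = asset_name_for(env)
    rows = []
    for release in releases[:5]:  # at most 5 releases are shown
        url = next(
            (a["browser_download_url"] for a in release.get("assets", []) if a["name"] == wanted),
            None,
        )
        rows.append({
            "tag": release["tag_name"],
            "title": release.get("name") or release["tag_name"],
            "prerelease": bool(release.get("prerelease")),
            "download_url": url,
            "greyed_out": url is None,
            "message": "" if url else f"No firmware for {env} in this release",
        })
    return rows
```

The `env` argument is the value reported by GET /api/system, so the same page works unchanged across devices.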

Tests:

Unit test: pal::ota_* PC stub writes bytes correctly and returns success.
Unit test: POST /api/firmware with a 1 KB payload calls ota_write N times and ota_end once.
Live test: flash a known-good binary via POST /api/firmware; assert version field in GET /api/system matches expected value after reboot.

Definition of Done

[x] pal::ota_begin/write/end/abort implemented in Pal.h (ESP32 wraps esp_ota_*; PC stubs return true and print)
[x] OtaHandle type alias: esp_ota_handle_t on ESP32, int on PC
[x] POST /api/firmware using onPostBinary (no body buffering; chunks stream directly to pal::ota_write)
[x] onPostBinary added to HttpServer.h (ESP32: upload/body callback; PC: buffers once, calls chunk handler)
[x] OtaState.h inline globals (g_otaStatus, g_otaPct, g_otaHandle) shared between AppRoutes and module
[x] FirmwareUpdateModule registered in CoreRegistrations.cpp
[x] FirmwareUpdateModule auto-created by ensureInfraModules on first boot (embedded only); not permanent — boot guard is the safety net
[x] "env": BUILD_TARGET added to GET /api/system via SystemStatus::fillSystemJson
[x] Frontend: File Upload tab in FirmwareUpdateModule card (file picker + XHR progress bar)
[x] Frontend: GitHub Releases tab (fetches public API, 1 hr sessionStorage cache, matches projectMM-{env}.bin)
[x] Frontend: Version badge in status bar when newer non-prerelease GitHub release has a matching asset
[x] flash_chip_mode() and psram_mode() PAL functions added; wired into SystemStatus controls and fillSystemJson (psram_mode inside totalPsramKb_ > 0 guard)
[x] Light mode fix: #preview-section gets background: #f5f5fa override so sticky bar blends with body in day mode
[x] 11 new unit tests (pal::ota_* stubs, OtaState globals, FirmwareUpdateModule lifecycle); 375/375 pass
[x] PC live test: PASS; ESP32 live tests: MM-70BC PASS, MM-ESP32 PASS
[x] esp32dev and esp32s3_n16r8 build successfully
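
The version badge rule (newer, non-prerelease, matching asset present) might look like the sketch below. It is a Python model of browser-side logic, and it assumes release tags have the form `vMAJOR.MINOR.PATCH`; `parse_version` and `update_available` are hypothetical names.

```python
from typing import Optional


def parse_version(tag: str) -> tuple[int, ...]:
    """Turn 'v1.7.0' or '1.7.0' into (1, 7, 0). Assumes purely numeric components."""
    return tuple(int(part) for part in tag.lstrip("v").split("."))


def update_available(releases: list[dict], env: str, firmware_version: str) -> Optional[str]:
    """Return the tag of the first stable release newer than the running firmware, else None."""
    wanted = f"projectMM-{env}.bin"
    current = parse_version(firmware_version)
    for release in releases:
        if release.get("prerelease"):
            continue  # nightly and other pre-releases never trigger the badge
        if not any(a["name"] == wanted for a in release.get("assets", [])):
            continue  # no binary for this device environment
        if parse_version(release["tag_name"]) > current:
            return release["tag_name"]
    return None
```

Pre-releases are skipped before their tag is parsed, so a non-numeric tag such as `nightly` never reaches `parse_version`.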

Result

Metric	Value
Unit tests	375/375 pass (11 new)
PC live tests	13/13 PASS
ESP32 live tests	MM-70BC PASS, MM-ESP32 PASS
esp32dev build	SUCCESS (~1.16 MB)
esp32s3_n16r8 build	SUCCESS (~1.16 MB)
`POST /api/firmware`	Returns `{"ok":true}` (PC stub); streams via body callback on ESP32
Version badge	Shown when the latest GitHub release tag is newer than `firmware_version` and has a matching `.bin`
`flash_chip_mode` / `psram_mode`	PAL functions + SystemStatus controls; psram_mode guarded by `totalPsramKb_ > 0`
Light mode	`#preview-section` background override added; day mode preview bar now white

Backlogged from this sprint (per user Q decisions):

- Device-side WS progress events during OTA (Q1-B); XHR upload.onprogress used instead
- Nightly pre-release channel in version badge (Q2-B); only stable releases shown
- Live test: flash binary + verify version (requires hardware access with known binary on GitHub releases)

Retrospective

What went well:

- onPostBinary cleanly separated from onPost (no RAM buffering for large binaries)
- Dual-OTA partitions already in place; no CSV changes needed
- Browser-direct GitHub API (CORS OK on public repos) avoids device internet access
- checkForUpdate() uses sessionStorage to rate-limit GitHub API calls (1 hr TTL)
- OtaState.h inline globals give clean shared state between the HTTP route and module without any RTOS sync overhead
- flash_chip_mode / psram_mode PAL functions fit the existing pattern cleanly; compile-time CONFIG_SPIRAM_MODE_OCT is the reliable OPI indicator

What was tricky:

- ota_end re-fetches the next OTA partition via esp_ota_get_next_update_partition(nullptr) since set_boot_partition is not called before ota_end; easy to miss
- Light mode required an explicit #preview-section background override because sticky positioning pins the dark base color through the body override

Seeds for future sprints:

- Sprint 2 (CI Release Pipeline) is the next step: assets must be published before the GitHub tab or version badge show anything useful
- Live OTA test (flash + verify version change) is backlogged until Sprint 2 ships firmware assets

Sprint 2: CI Release Pipeline

Scope: Attach firmware binaries as GitHub release assets on every tagged release, and add a nightly pre-release that rebuilds automatically each night. These assets are what FirmwareUpdateModule's GitHub tab fetches.

Depends on: Sprint 1 (asset naming convention must match what FirmwareUpdateModule expects).

Complexity: S (YAML only; no C++ or Python changes).

Summary

Part	Description	Est
`release.yml` alignment	Pin Python 3.12 + PlatformIO `<7`, add PIO package cache, asset upload job	S
`nightly.yml`	New workflow: cron 02:00 UTC, idempotent delete+recreate nightly pre-release	S
Total		S

Asset naming convention

Asset	Source path	Matches env
`projectMM-esp32dev.bin`	`.pio/build/esp32dev/firmware.bin`	`esp32dev`
`projectMM-esp32s3_n16r8.bin`	`.pio/build/esp32s3_n16r8/firmware.bin`	`esp32s3_n16r8`
`projectMM-pc-macos.tar.gz`	`deploy/build/pc/macos/projectMM`	`PC` (macOS CI runner)
`projectMM-pc-windows.zip`	`deploy/build/pc/windows/projectMM.exe`	`PC` (Windows, Sprint 3)

Scope (confirmed)

.github/workflows/release.yml — aligned and complete:

Triggered by push: tags: ['v*'] or workflow_dispatch with tag input.
PC build: astral-sh/setup-uv@v5 + uv run deploy/build.py -target pc (aligned with ci.yml).
ESP32 builds: Python 3.12 pinned, PlatformIO pinned to <7, PlatformIO package cache added (both gaps vs ci.yml fixed).
upload-assets job uses gh release upload --clobber after all three build jobs pass.

.github/workflows/nightly.yml — new:

Triggered on schedule: cron: '0 2 * * *' (02:00 UTC daily) and workflow_dispatch.
Identical build matrix to release.yml (macOS PC + esp32dev + esp32s3_n16r8).
publish-nightly job: deletes existing nightly release + tag, re-creates as pre-release titled Nightly (YYYY-MM-DD) with short commit SHA in notes. Idempotent: the gh release delete ... || true guard handles first run.
The nightly pre-release appears in FirmwareUpdateModule's GitHub releases tab with a "pre-release" badge.
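
The delete-then-recreate step of publish-nightly can be illustrated as below. The workflow itself runs these as shell steps in YAML; this Python sketch only spells out the command sequence. `nightly_commands` and `publish_nightly` are hypothetical helpers, and the wording of the release notes is an assumption (the source only states that the notes contain the short commit SHA).

```python
import datetime
import subprocess


def nightly_commands(day: datetime.date, short_sha: str, assets: list[str]) -> list[list[str]]:
    """Return the gh CLI invocations for an idempotent nightly pre-release."""
    title = f"Nightly ({day.isoformat()})"  # YYYY-MM-DD
    delete = ["gh", "release", "delete", "nightly", "--yes", "--cleanup-tag"]
    create = [
        "gh", "release", "create", "nightly", *assets,
        "--prerelease",
        "--title", title,
        "--notes", f"Built from commit {short_sha}",  # notes wording is an assumption
    ]
    return [delete, create]


def publish_nightly(day: datetime.date, short_sha: str, assets: list[str]) -> None:
    """Run both commands; a failing delete (first run, nothing to delete) is ignored."""
    delete, create = nightly_commands(day, short_sha, assets)
    subprocess.run(delete, check=False, stderr=subprocess.DEVNULL)  # like `2>/dev/null || true`
    subprocess.run(create, check=True)
```

Ignoring the delete failure is what makes the job safe on its first run, when no `nightly` release exists yet.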

Backlogged from this sprint:

scripts/list_pio_envs.py + deploy/build.py --all-envs: not needed while only 2 ESP32 envs exist; pick up when ESP32-P4 is added (Release 8 Sprint 1).
PC Linux build artifact: macOS binary covers the main use case for now.

Definition of Done

[x] release.yml: PC build uses uv run deploy/build.py -target pc (was raw cmake)
[x] release.yml: ESP32 jobs pin Python 3.12 and platformio<7 (was 3.x + unpinned)
[x] release.yml: PlatformIO package cache added to both ESP32 jobs
[x] nightly.yml created: cron 02:00 UTC + workflow_dispatch; builds macOS PC + esp32dev + esp32s3_n16r8
[x] nightly.yml: publish-nightly job deletes and re-creates nightly pre-release (idempotent)
[x] Asset names match FirmwareUpdateModule expectations (projectMM-{env}.bin)

Result

Metric	Value
`release.yml`	Aligned with `ci.yml`; tag-triggered; 3 build jobs + upload
`nightly.yml`	New; cron 02:00 UTC; delete+recreate `nightly` pre-release
Asset naming	`projectMM-esp32dev.bin`, `projectMM-esp32s3_n16r8.bin`, `projectMM-pc-macos.tar.gz`
Python/PlatformIO	Pinned to 3.12 and `<7` in both workflows (consistent with `ci.yml`)
Unit tests	375/375 (no new tests; YAML-only sprint)

Retrospective

What went well:

- release.yml already existed with the core structure; this sprint was alignment + nightly, not a rebuild from scratch
- gh release delete nightly --yes --cleanup-tag 2>/dev/null || true pattern is clean and idempotent; no third-party action needed
- Build matrix in nightly.yml is identical to release.yml so both stay in sync by copy

What was tricky:

- release.yml had python-version: '3.x' and unpinned PlatformIO; these would have broken on PlatformIO 7 release or a Python 3.13 runner update (same issue ci.yml already fixed months ago)

Seeds for future sprints:

- Sprint 3 (Windows) adds projectMM-pc-windows.zip to both release.yml and nightly.yml
- Release 8 Sprint 1 (ESP32-P4) adds a third ESP32 build job; at that point list_pio_envs.py becomes worth adding to avoid three copies of the same job
- Once a tagged release exists, run a manual workflow_dispatch to verify the upload-assets path end-to-end

Sprint 3: Windows Build

Scope: projectMM builds and runs as a native Windows binary (CMake + Clang/Ninja via llvm-mingw). CI job produces a .zip release artifact, and both release.yml and nightly.yml gain a build-pc-windows job. macOS build is unaffected.

Deferred from: Release 5 original scope.

Summary

Part	Description	Est
Winsock2 guards	`WsServer.h` + `Pal.h` UDP: POSIX socket calls behind `#ifdef _WIN32`	S
Socket shim unification	`PcSocketShims.h` unified header; `PcSockets.h` deleted; both consumers updated	S
CMake + build scripts	Ninja on Windows, `pc_platform()` helper, per-platform binary paths in all deploy scripts	M
Windows memory stats	`VirtualQueryEx` for `free_heap_bytes()`; `MemBoot`/`MemLive` correct on Windows	M
Output file split	`live-results-pc-{platform}.json`, per-env MD files; `live-results-all.json` dropped	S
CI integration	`build-pc-windows` job in `ci.yml`, `release.yml`, `nightly.yml`; `.zip` artifact	S
Test + misc fixes	`/tmp/` relative-path fix, UTF-8 encoding, dangling-pointer `onInputRemoved` fix	M
Total		XL

Planned scope

#ifdef _WIN32 guards in WsServer.h and Pal.h (Winsock2 instead of POSIX sockets).
ws2_32 link in CMakeLists.txt and tests/CMakeLists.txt.
GitHub Actions CI job on windows-latest (build + unit tests); adds projectMM-pc-windows.zip to release and nightly artifact lists.
deploy/build.py: -G Ninja on Windows (single-config generator, binary at predictable path).

Additional work discovered during implementation:

src/pal/MemoryStats.h: Windows branch using GetDiskFreeSpaceExA (no sys/statvfs.h).
tests/ws_test_client.h: full rewrite with _wstc* socket shims for Winsock2 compatibility.
tests/test_module_manager.cpp, tests/test_reorder.cpp: fix hardcoded /tmp/ paths to relative paths (no /tmp/ on Windows).
deploy/unittest.py: add encoding='utf-8' to markdown write (Windows default codec lacks emoji support); add blank-line stripping from run-tests.log output.
deploy/_lib.py: add pc_platform() helper ("windows" / "macos" / "linux").
All deploy scripts (build.py, run.py, livetest.py, summarise.py, unittest.py): paths updated from deploy/build/pc/ to deploy/build/pc/{platform}/ and logs from *-pc.log to *-pc-{platform}.log.
Socket code sharing: src/pal/PcSocketShims.h created as a shared header with unified _ws* shim functions (open, close, accept, connect, recv, send, wait). src/pal/PcSockets.h merged in and deleted. WsServer.h and ws_test_client.h both include PcSocketShims.h; ws_test_client.h no longer has its own _wstc* duplicates.
Windows MemBoot/MemLive: pal::free_heap_bytes() on Windows implemented via VirtualQueryEx (walks committed private virtual memory regions; "free" = 512 MB ceiling minus committed). pal::total_heap_kb() returns the matching ceiling. pal::s_freeHeapCache_() caches the last scan so max_alloc_bytes() avoids a second scan in the same tick. MemBoot and MemLive lines now appear in the Windows server log with correct per-module deltas.
Output file improvements: live-results-pc.json renamed to live-results-pc-{platform}.json; live-results-all.json dropped entirely. docs/status/live-results.md split into per-env files (live-results-pc-windows.md, live-results-esp32dev.md, live-results-esp32s3_n16r8.md). livetest_out.txt deleted. deploy/summarise.py rewritten to read per-device JSON files directly.
docs/developer-guide/deploy.md: fully updated for Windows (toolchain requirements, Ninja, llvm-mingw, uv run throughout, per-platform binary paths, CI table with Windows row, log file table).
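
A minimal sketch of what a pc_platform() helper and the per-platform paths could look like is shown below. Only the three return strings and the directory and log naming come from the notes above; the detection via `sys.platform` and the helper names `pc_binary_path` and `pc_log_name` are assumptions.

```python
import sys
from pathlib import Path


def pc_platform(sys_platform: str = sys.platform) -> str:
    """Map the interpreter's platform string to "windows", "macos" or "linux"."""
    if sys_platform.startswith("win"):
        return "windows"
    if sys_platform == "darwin":
        return "macos"
    return "linux"  # assumption: every other platform is treated as Linux


def pc_binary_path(platform: str) -> Path:
    """Binary location under deploy/build/pc/{platform}/."""
    exe = "projectMM.exe" if platform == "windows" else "projectMM"
    return Path("deploy/build/pc") / platform / exe


def pc_log_name(stem: str, platform: str) -> str:
    """Log file naming: *-pc-{platform}.log."""
    return f"{stem}-pc-{platform}.log"
```

Routing every script through one helper keeps the platform slug identical in build paths, log names and result files.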

Definition of Done

[x] src/core/WsServer.h: Winsock2 shims replace POSIX socket calls under #ifdef _WIN32.
[x] src/pal/Pal.h: UDP functions (udp_bind, udp_recv, udp_send, udp_broadcast) compile on Windows.
[x] src/pal/MemoryStats.h: Windows branch provides getMemoryStats() via GetDiskFreeSpaceExA.
[x] CMakeLists.txt + tests/CMakeLists.txt: ws2_32 linked on Windows.
[x] tests/ws_test_client.h: cross-platform socket shims; test helper compiles on Windows.
[x] deploy/build.py: Ninja generator selected on Windows; binary at deploy/build/pc/windows/projectMM.exe.
[x] All 375 unit tests pass on Windows (Clang 18 + llvm-mingw-ucrt + Ninja).
[x] ci.yml: build-pc-windows job (build + unit tests).
[x] release.yml + nightly.yml: build-pc-windows job; .zip artifact included.
[x] Deploy scripts use deploy/build/pc/{platform}/ paths; logs are named *-pc-{platform}.log (macOS: *-pc-macos.log).
[x] All 13 live test groups pass on Windows (all_pc.py: 4 passed, 0 failed).
[x] src/pal/PcSocketShims.h: unified socket shim header; PcSockets.h merged in and deleted; WsServer.h and ws_test_client.h both include PcSocketShims.h.
[x] pal::free_heap_bytes() on Windows via VirtualQueryEx; MemBoot/MemLive lines appear in Windows server log with correct per-module deltas.
[x] Live result files split per platform (live-results-pc-{platform}.json); per-env docs/status/live-results-*.md files generated; live-results-all.json dropped.
[x] docs/developer-guide/deploy.md updated: Windows toolchain requirements, uv run throughout, per-platform paths, CI table with Windows row.

Result

Metric	Value
Unit tests (Windows)	375 / 375 passed
Test assertions (Windows)	1807 / 1807 passed
Live test groups (Windows)	13 / 13 passed (133 assertions)
`all_pc.py` result	4 / 4 steps passed
Toolchain	Clang 18.1.8 + llvm-mingw-20240619-ucrt-x86_64 + Ninja
Build target	`projectMM.exe` (Windows x86-64)
Files changed	50 source, deploy, docs, and CI files
macOS tests (unaffected)	unchanged (375 pass in CI)
Windows MemBoot	Correct per-module deltas via VirtualQueryEx (frag% display deferred — see backlog)
Socket shim files	`PcSocketShims.h` unified; `PcSockets.h` deleted

Retrospective

What went well:

- The socket shim pattern (_ws* unified in PcSocketShims.h) kept platform branches out of class bodies and eliminated the duplicate _wstc* block that had grown alongside the original _ws* set.
- pc_platform() in _lib.py gives a single source of truth for the three-way platform string; all deploy scripts and CI reference it.
- UDP broadcast loopback (Art-Net test5) works on Windows without any changes.
- VirtualQueryEx gives realistic, per-module heap deltas in MemBoot; the approach is correct even though the frag% display has a pending fix.
- Splitting live-results-all.json into per-platform files and live-results.md into per-env files removes the aggregation step and makes each device's results self-contained.

What was tricky:

- sys/statvfs.h (MemoryStats.h) and arpa/inet.h (ws_test_client.h) are not available on Windows and required additional platform guards not in the original scope.
- /tmp/ hardcoded in several test files causes silent failures on Windows (file not written, module not loaded, findById returns nullptr). Fixed by switching to relative paths.
- Python's pathlib.Path.write_text (used in deploy/unittest.py) defaults to the system encoding on Windows (cp1252), which cannot encode the emoji (✅) used in test-results.md. Fixed by passing encoding='utf-8'.
- The build/pc/ flat layout conflated macOS and Windows artifacts. Restructured to build/pc/{platform}/ in the same sprint.
- Latent dangling-pointer bug exposed by Windows: DriverLayer stores raw EffectsLayer* pointers in sources_[]. When delete_all_modules() freed an EffectsLayer, driver1 retained a stale pointer and crashed on the next loop() tick. macOS tolerated the dangling access; Windows terminated the process. Fixed by adding Module::onInputRemoved(Module*).
- Windows heap measurement: HeapWalk (Win32 default heap) and GlobalMemoryStatusEx (system-wide RAM) were tried and rejected before settling on VirtualQueryEx. HeapWalk walks the wrong heap (Win32 vs UCRT malloc), giving ~20 KB values that triggered check_alloc() denial and crashed the server. GlobalMemoryStatusEx returns 4 GB+ with no per-allocation granularity.
- frag% overflow: largNow * 100u overflows uint32_t at ~500 MB values. Fixed in pal::memEvent() with a (uint64_t) cast; the same overflow exists in StatefulModule.h and Scheduler.cpp and is deferred to the backlog (see index.md).
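
The frag% overflow is easy to reproduce with plain arithmetic. The Python model below mimics a `value * 100u` percentage computed in 32-bit unsigned arithmetic by masking to 32 bits; the helper names and the percentage expression are illustrative assumptions, not the project's actual frag% formula.

```python
UINT32_MASK = 0xFFFFFFFF
MIB = 1024 * 1024


def percent_u32(part: int, whole: int) -> int:
    """`part * 100u / whole` with the product wrapped to uint32_t, as in the buggy code path."""
    return ((part * 100) & UINT32_MASK) // whole


def percent_u64(part: int, whole: int) -> int:
    """Same expression with the product widened first, as the (uint64_t) cast does."""
    return (part * 100) // whole


largest_block = 500 * MIB  # about the size seen under the 512 MB Windows ceiling
free_heap = 512 * MIB

wrapped = percent_u32(largest_block, free_heap)  # product wraps modulo 2**32
correct = percent_u64(largest_block, free_heap)
max_safe = UINT32_MASK // 100  # largest byte count whose *100 still fits in 32 bits
```

With these inputs the wrapped result is 1 % while the widened result is 97 %. The product stops fitting in 32 bits once the byte count exceeds 42,949,672 (about 41 MiB), which is why the bug never appeared on ESP32-sized heaps.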

Seeds for future sprints:

- The Linux PC build is not CI-tested; CI covers only macOS and Windows. An ubuntu-latest CI leg would close the triangle (low effort: same uv run deploy/build.py -target pc command, linux slug already in pc_platform()).
- The timing-sensitive test "Scheduler timing accumulator tracks SpinModule within 5%" occasionally flakes under heavy CI load on Windows (passes in isolation). Consider widening epsilon or moving to a dedicated timing fixture.
- Windows MemBoot frag% accuracy: apply the (uint64_t) overflow fix to StatefulModule.h and Scheduler.cpp, and fix call order (max_alloc before free_heap in Scheduler). Tracked in the backlog.
- Effects animate slowly in the WebGL preview on Windows but not on macOS. Root cause not yet identified (push-rate throttle, time-unit mismatch, or browser queue lag). Tracked in the backlog.

Sprint 5: Scenario Baseline and `extends`

Scope: Populate deploy/test/scenario-baseline.json from a real ESP32 run; add "extends" inheritance to scenario files; wire --compare-baseline into deploy/all.py.

Deferred from: Sprint 10 retrospective seeds.

Complexity: M

Summary

Part	Description	Est
`extends` support	Single-level inheritance in `scenario.py`, `live_suite.py`, `test_scenarios.cpp` (identical logic in each)	M
New scenario files	`base-pipeline-64x64.json` and `four-layers.json` using extends	S
Baseline population	Run on MM-70BC hardware, commit `scenario-baseline.json`	S
`all_pc.py` integration	`_run_scenario_baseline()`: start server, compare baseline, non-fatal	S
Total		M

Planned scope

Run deploy/scenario.py --update-baseline against MM-70BC (ESP32-S3); commit result.
Implement "extends" key (single-level): load parent steps and prepend them; child metadata wins.
deploy/all_pc.py: after live tests, start the PC server and run deploy/scenario.py --compare-baseline; print warning on regressions (non-fatal).
Add base-pipeline-64x64.json and four-layers.json stress scenarios, both using "extends".
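
Single-level "extends" resolution, as described above, fits in a handful of lines. The sketch below is an illustration under stated assumptions: the `"steps"` key name, the convention that `"extends"` holds the parent's file name without `.json`, and the function names are not confirmed by the source.

```python
import json
from pathlib import Path


def merge_extends(parent: dict, child: dict) -> dict:
    """Parent steps first, child steps appended; child metadata wins."""
    merged = {**parent, **child}  # child keys override parent keys
    merged["steps"] = parent.get("steps", []) + child.get("steps", [])
    merged.pop("extends", None)  # resolved; single level only, no recursion
    return merged


def load_scenario(path: Path) -> dict:
    """Load a scenario file and resolve one level of "extends" relative to its directory."""
    child = json.loads(path.read_text(encoding="utf-8"))
    parent_name = child.get("extends")
    if parent_name is None:
        return child
    parent_path = path.parent / f"{parent_name}.json"  # assumed naming convention
    parent = json.loads(parent_path.read_text(encoding="utf-8"))
    return merge_extends(parent, child)
```

Because the parent's own "extends" key is never followed, a chain of three files resolves only its last link, which matches the single-level scope of this sprint.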

Definition of Done

[x] deploy/scenario.py load_scenario(): single-level "extends" resolves parent file and prepends parent steps
[x] deploy/live_suite.py run_scenario(): same extends resolution for live tests
[x] tests/test_scenarios.cpp resolve_extends(): same resolution so C++ scenario replay handles the new files
[x] deploy/test/scenarios/base-pipeline-64x64.json: extends base-pipeline-32x32, adds resize to 64x64
[x] deploy/test/scenarios/four-layers.json: extends two-layers, adds GameOfLife + Noise layers
[x] deploy/all_pc.py: _run_scenario_baseline() starts PC server, runs scenario --compare-baseline, non-fatal
[x] deploy/test/scenario-baseline.json: populated from MM-70BC (ESP32-S3); 7 scenarios, all steps measured
[x] 375/375 unit tests pass; all scenario replay tests include extended scenarios

Result

Metric	Value
Unit tests	375/375 pass (20 new assertions from extended scenario replay)
PC live tests	13/13 PASS (including 2 new extended scenarios)
Baseline	Populated from MM-70BC: 7 scenarios, ~177 KB free heap at base pipeline
Scenarios	7 files (5 pre-existing + `base-pipeline-64x64`, `four-layers`)

Backlogged from this sprint:

- system_fps baseline threshold too tight (50%+ swings between runs on hardware); tracked in cross-release backlog.
- Recursive extends (parent can itself extend) deferred until a chain is actually needed.

Retrospective

What went well:

- Single-level extends is a clean pattern: parent steps first, child steps appended, child metadata wins. No ambiguity.
- The three places that load scenario JSON (scenario.py, live_suite.py, test_scenarios.cpp) each got identical logic in ~8 lines; no shared abstraction needed at this scale.
- _run_scenario_baseline() in all_pc.py cleanly manages its own server lifetime (start, run, terminate) as a self-contained helper.

What was tricky:

- system_fps is too volatile for a 20% threshold on hardware (WiFi task preemption causes 30-65% swings between identical runs). The baseline pass/fail signal is unreliable for fps; heap metrics are stable and useful.
- The live suite (live_suite.py) loads scenario JSON independently of scenario.py, so extends resolution had to be added in three places. A shared Python utility would reduce duplication if more scenario features are added.

Seeds for future sprints:

- Scope baseline checks to heap metrics only (heap_free, max_alloc); skip fps or widen its threshold to 50%.
- Recursive extends if scenario hierarchies deepen.

Sprint 6: Static RAM Hardening for Classic ESP32

Scope: Reduce the permanent .bss footprint on esp32dev (no PSRAM) to give module setup more headroom. The log ring buffer and WiFi buffer allocation are the two largest tunable levers.

Identified in: R6S8 live device analysis (esp32dev free-heap floor ~109 KB, only 19 KB above 90 KB reserve; fragmentation 55%+).

Complexity: S

Summary

Part	Description	Est
Ring buffer resize	`LOG_RING_CAP=32`, `LOG_RING_ENTRY=64` (2 KB, saves 6 KB .bss); test updated	XS
`check_alloc` dual guard	Adds max-alloc block check alongside free-heap reserve; `printf` on failure reason	S
WiFi buffer investigation	`-DCONFIG_ESP32_WIFI_DYNAMIC_RX_BUFFER_NUM` attempted then removed (pre-compiled framework conflict)	XS
Total		S

Planned scope

Set ring to 32 entries x 64 bytes = 2 KB on all devices. Saves 6 KB vs the original 8 KB ring on classic ESP32. Trade-off: the ring holds ~32 lines instead of ~64.
Tune WiFi dynamic RX buffer count: -DCONFIG_ESP32_WIFI_DYNAMIC_RX_BUFFER_NUM=16 was attempted in build_flags but the symbol is already defined in the framework's pre-compiled sdkconfig.h, causing a redefinition warning. Flag removed; WiFi buffer count cannot be overridden this way with the Arduino framework blob.
Upgrade pal::check_alloc to a dual guard: free_heap_bytes() >= bytes + reserve AND max_alloc_bytes() >= bytes. Surface the failure reason via printf ("check_alloc: reserve violation" vs "check_alloc: largest block too small").
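
The dual guard can be modelled as a pure function. The real check lives in C++ in Pal.h and reads the heap figures itself; this Python model takes them as parameters. The two messages come from the text above, while the order of the two checks and the function signature are assumptions.

```python
KB = 1024


def check_alloc(requested: int, free_heap: int, max_alloc: int, reserve: int) -> tuple[bool, str]:
    """Allow an allocation only if the reserve survives AND one block is large enough."""
    if free_heap < requested + reserve:
        return False, "check_alloc: reserve violation"
    if max_alloc < requested:
        return False, "check_alloc: largest block too small"
    return True, ""


# Fragmented heap: enough total free memory, but no single 16 KB block.
fragmented = check_alloc(16 * KB, free_heap=109 * KB, max_alloc=12 * KB, reserve=90 * KB)
# Same heap, larger request: total free would dip below the 90 KB reserve.
too_big = check_alloc(32 * KB, free_heap=109 * KB, max_alloc=48 * KB, reserve=90 * KB)
```

The first case is the blind spot the single guard had: 109 KB free comfortably covers 16 KB plus the 90 KB reserve, yet the allocation would still fail because the largest contiguous block is only 12 KB.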

Definition of Done

[x] src/core/Logger.cpp: LOG_RING_CAP = 32, LOG_RING_ENTRY = 64 (2 KB total)
[x] platformio.ini: -DCONFIG_ESP32_WIFI_DYNAMIC_RX_BUFFER_NUM=16 attempted and removed — conflicts with pre-compiled sdkconfig.h in Arduino framework blob
[x] src/pal/Pal.h check_alloc(): dual guard checks both free heap reserve AND max contiguous block; printf on failure
[x] tests/test_logger.cpp: ring overflow test updated for 32-entry cap (38 entries pushed, 32 survive from entry6)
[x] esp32dev and esp32s3_n16r8 build successfully
[x] 375/375 unit tests pass
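
The overflow behaviour exercised by the updated logger test can be modelled with a bounded deque. This is a Python illustration of the ring semantics, not the C++ implementation in Logger.cpp; the `LogRing` class is hypothetical, and truncating each line to `LOG_RING_ENTRY - 1` characters (one byte reserved for a terminator) is an assumption.

```python
from collections import deque

LOG_RING_CAP = 32    # entries
LOG_RING_ENTRY = 64  # bytes per entry, so 32 x 64 = 2 KB in total


class LogRing:
    """Fixed-capacity ring: the oldest entry is dropped when a new one arrives at capacity."""

    def __init__(self) -> None:
        self._entries: deque[str] = deque(maxlen=LOG_RING_CAP)

    def push(self, line: str) -> None:
        # Assumed truncation rule: keep room for a terminating NUL byte.
        self._entries.append(line[: LOG_RING_ENTRY - 1])

    def snapshot(self) -> list[str]:
        return list(self._entries)


ring = LogRing()
for i in range(38):
    ring.push(f"entry{i}")
survivors = ring.snapshot()
```

After 38 pushes the ring holds 32 entries, starting at `entry6` and ending at `entry37`, which is the expectation stated for the ring overflow test.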

Result

Metric	Value
Ring size	32 x 64 = 2 KB (was 64 x 128 = 8 KB, saves 6 KB .bss)
WiFi RX buffers	Not changed — symbol already defined in framework `sdkconfig.h`; `-D` override causes redefinition warning and was removed
`check_alloc`	Dual guard: free-heap reserve + max-alloc block; printf on refusal
Unit tests	375/375 pass
esp32dev build	SUCCESS
esp32s3_n16r8 build	SUCCESS

Backlogged from this sprint:

- Verify WiFi buffer flag runtime effect on hardware (depends on pioarduino compiling WiFi component from source vs precompiled blob).
- Update MemBoot/MemLive baseline table in docs with post-hardening numbers from MM-C1BC (requires live flash and measurement).

Retrospective

What went well:

- Ring size change is a 2-line edit in Logger.cpp with a single test update; zero risk.
- Dual check_alloc guard closes the fragmentation blind spot cleanly: the previous check passed when total free was enough but no single block was large enough to satisfy the allocation.
- printf for the guard failure message avoids a Logger dependency in Pal.h.

What was tricky:

- -DCONFIG_ESP32_WIFI_DYNAMIC_RX_BUFFER_NUM=16 in build_flags causes a redefinition warning: sdkconfig.h in the pre-compiled Arduino framework blob already defines the symbol. The WiFi component is not compiled from source, so Kconfig values are fixed at framework build time and cannot be overridden via compiler flags. Removed the flag; WiFi buffer tuning requires a custom framework build or a sdkconfig.defaults approach outside the standard pioarduino setup.

Seeds for future sprints:

- WiFi RX buffer tuning via sdkconfig.defaults (requires custom framework build); backlogged.
- Move ring buffer to PSRAM-backed heap allocation in setup() if 2 KB .bss still matters (requires PAL extension).

Sprint 7: `GET /api/log` Frontend Panel

Scope: Surface the existing ring buffer (R6S2) in the frontend as a live log panel, removing the need for a serial monitor during field debugging.

Identified in: R6S2 retrospective ("ring buffer exists; streaming it to the frontend is the obvious next step") and R6 backlog.

Complexity: S

Summary¶

Part	Description	Est
Log colouring	`_logClass()` for warn/error; `.log-warn`/`.log-error` CSS classes; light-mode overrides	S
Scroll management	`logAtBottom` flag; auto-scroll pauses on manual scroll-up, resumes at bottom	S
History backfill	`GET /api/log` fetched on WS connect; ring entries prepended to panel	S
Total		S

Planned scope¶

WS push is already in place per-line via g_logWsPushFn (format {"t":"log","m":"..."}). Sprint scope is completing the frontend panel.
Frontend enhancements: WARN/ERROR line colouring (keyword match); auto-scroll pauses on manual scroll-up; backfill history from GET /api/log on WS connect.
LOG_MAX_LINES = 100 JS constant; clear button resets scroll state.

Definition of Done¶

[x] src/frontend/app.js: _logClass(text) colours lines containing warn (amber) or error/fail (red)
[x] src/frontend/app.js: logAtBottom flag; auto-scroll only when panel is scrolled to bottom; pauses on manual scroll-up
[x] src/frontend/app.js: GET /api/log fetched on wsConn.onopen; ring entries backfilled into panel
[x] src/frontend/app.js: clear button resets logAtBottom = true
[x] src/frontend/style.css: .log-warn (amber) and .log-error (red) classes; light-mode overrides
[x] PC live tests pass (no regressions)

Result¶

Metric	Value
Log panel	Collapsible below module list; `LOG_MAX_LINES = 100`
WS push	Per-line `{"t":"log","m":"..."}` format (pre-existing); kept as-is
Colouring	`warn` lines amber; `error`/`fail` lines red; light-mode overrides
Auto-scroll	Pauses on manual scroll-up; resumes when scrolled back to bottom
History backfill	`GET /api/log` fetched on WS connect; all ring entries added to panel
PC live tests	13/13 PASS

Backlogged from this sprint:

- Timestamp and log level as separate columns (structured rows) deferred; raw message text is sufficient for current debugging needs.
- Batched WS push ({"type":"log","entries":[...]}) deferred; per-line push at current log rates does not cause measurable overhead.

Retrospective¶

What went well:

- The per-line WS push (g_logWsPushFn) and frontend handler were already in place. Sprint 7 completed the UX: colouring, scroll-pause, and history backfill.
- History backfill on WS connect (4 lines) means a browser that opens 10 s after boot sees the startup log immediately, which is the most common debugging scenario.
- Keyword-based colouring (no prefix parsing) works with the actual log message format, which does not use systematic level prefixes.

What was tricky:

- Scroll-pause needs a passive: true scroll listener and an explicit logAtBottom flag tracking scrollTop + clientHeight >= scrollHeight - 5. The 5 px tolerance avoids false "not at bottom" on fractional scroll positions.

Seeds for future sprints:

- If log volume grows, consider adding a "level": "warn"|"error" field to the WS frame so colouring is exact rather than keyword-matched.

Sprint 8: Heap Safety, HTTP OOM Hardening, Live-Test Correctness, and `isPermanent()` Removal¶

Scope: Four interrelated changes that close the remaining stability gaps on classic ESP32 and clean up dead runtime scaffolding: a per-module opt-in heap check before committing control changes, OOM recovery in the HTTP layer, a set of live-test correctness fixes that had been producing spurious duplicate modules, and full removal of the isPermanent() mechanism that turned out to be dead code.

Identified in: MM-C1BC crash (70x28 GridLayout saved to LittleFS; on browser refresh serializeJson to std::string threw std::bad_alloc → abort); live-test review (PreviewModule top-level, duplicate SystemStatusModule/FirmwareUpdateModule from scenario scripts, Windows MD file written on macOS).

Summary¶

Part	Description	Est
A: controlAllocBytes heap guard	Per-module `controlAllocBytes()` opt-in; `setControl` checks heap before committing large allocs	M
B: HTTP OOM hardening	Heap check before JSON serialisation in `AppRoutes`; 503 on OOM; `DynamicJsonDocument` size guard	M
C: Live-test correctness	`INFRA_TYPES`/`SINGLETON_TYPES`, `summarise.py` ESP32 guard, `unittest.py` cwd fix, state pollution fix	M
D: FirmwareUpdateModule docs	User-guide page and module doc page	S
E: `isPermanent()` removal	Delete `isPermanent()` from all modules and base class; remove 403 route guard	S
Total		L

Part A: Per-module `controlAllocBytes` heap guard¶

Problem: A user could resize GridLayout to 70x28 (1960 pixels). On a fresh boot the allocation succeeded. After WiFi and the HTTP server started, free heap was ~60 KB with a largest block of ~36 KB. On browser refresh GET /api/modules called serializeJson(doc, std::string body) — the growing-string operator new chain threw std::bad_alloc and the device aborted.

Root cause in two parts:

1. setControl("width", 70) had no heap check; the value was saved to LittleFS and survived reboots.
2. serializeJson used a growing std::string that fragmented the already-tight heap.

Fix in StatefulModule.h:

readThrough(ControlDescriptor&): static helper that reads a control's current value as float (mirrors the existing writeThrough).
virtual size_t controlAllocBytes(const char* key) const: returns 0 by default (opt-in; modules with no significant heap impact need not override it).
Both setControl() overloads: save old value with readThrough, write new value, call controlAllocBytes(key), call pal::check_alloc(need) if need > 0, revert with writeThrough(old) and return false if the check fails. No onUpdate() is called on failure; the control stays at its previous value.
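The save / write / check / revert sequence described above can be sketched in isolation. This is a simplified stand-in rather than the real `StatefulModule.h`: `ModuleSketch`, `GrowingModule`, `heapBudget`, and `checkAlloc` are hypothetical, a single float-backed `width` control replaces the descriptor table, and a fixed budget replaces `pal::check_alloc`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Minimal stand-in for a module with one float-backed control ("width").
class ModuleSketch {
public:
    virtual ~ModuleSketch() = default;

    // Opt-in hook: extra heap bytes the pending control value would need.
    virtual std::size_t controlAllocBytes(const char* /*key*/) const { return 0; }

    bool setControl(const char* key, float value) {
        if (std::strcmp(key, "width") != 0) return false;
        const float old = width_;  // readThrough: remember the old value
        width_ = value;            // writeThrough: apply the new value
        const std::size_t need = controlAllocBytes(key);
        if (need > 0 && !checkAlloc(need)) {
            width_ = old;          // revert; the update hook is not called
            return false;
        }
        ++updates_;                // stands in for onUpdate()
        return true;
    }

    float width() const { return width_; }
    int updates() const { return updates_; }
    std::size_t heapBudget = 4096;  // hypothetical heap available

protected:
    // Stand-in for pal::check_alloc(): compares against a fixed budget.
    bool checkAlloc(std::size_t need) const { return need <= heapBudget; }
    float width_ = 10.0f;
    int updates_ = 0;
};

// A module that opts in: each unit of width beyond 10 costs 6 bytes (assumed).
class GrowingModule : public ModuleSketch {
public:
    std::size_t controlAllocBytes(const char* /*key*/) const override {
        return width_ > 10.0f
                   ? static_cast<std::size_t>((width_ - 10.0f) * 6.0f)
                   : 0;
    }
};
```

Because the hook is evaluated after the new value is written, an override can compute its estimate from the module's own fields, which is why the revert step is needed when the check fails.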

Fix in GridLayout.h:

safeWidth_, safeHeight_, safeDepth_ (uint32_t, default 10/10/1): the last dimensions successfully committed to DriverLayer.
controlAllocBytes() override: computes (newNPix - safeNPix) * sizeof(RGB) * 2 (EffectsLayer double-buffer delta); returns 0 for shrinks.
onUpdate(): updates safe fields and rebuilds only after the heap check passes (the framework reverts the control automatically on failure).
buildMappings_(): changed to new (std::nothrow) with a null-check log and early return on OOM.
setup()/teardown(): initialize/reset safe fields.

size_t controlAllocBytes(const char* /*key*/) const override {
    const uint32_t newNPix = (uint32_t)width_ * height_ * depth_;
    const uint32_t safeNPix = (uint32_t)safeWidth_ * safeHeight_ * safeDepth_;
    return newNPix > safeNPix ? (newNPix - safeNPix) * sizeof(RGB) * 2 : 0;
}
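As a worked example of the formula, assume a 3-byte `RGB` pixel (an assumption; the real type may differ) and the default 10x10x1 safe dimensions. Growing to 70x28x1 then requests (1960 - 100) * 3 * 2 = 11160 bytes. The helper name `allocDeltaBytes` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption for illustration: a 3-byte RGB pixel.
struct RGB { std::uint8_t r, g, b; };

// Same formula as controlAllocBytes(), with pixel counts passed explicitly.
constexpr std::size_t allocDeltaBytes(std::uint32_t newNPix, std::uint32_t safeNPix) {
    return newNPix > safeNPix ? (newNPix - safeNPix) * sizeof(RGB) * 2 : 0;
}

static_assert(sizeof(RGB) == 3, "sketch assumes a 3-byte pixel");
static_assert(allocDeltaBytes(70 * 28 * 1, 10 * 10 * 1) == 11160, "grow 10x10 to 70x28");
static_assert(allocDeltaBytes(8 * 8, 32 * 32) == 0, "shrinks request nothing");
```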

New tests (tests/test_layouts.cpp, +3):

- GridLayout - growing dimensions updates mappingCount (width 10→16, height 10→16)
- GridLayout - shrinking dimensions updates mappingCount (32x32 → 8x8)
- GridLayout - healthReport reflects current dimensions after resize

Part B: HTTP OOM hardening¶

Problem: serializeJson(doc, std::string body) uses std::string::push_back internally. On a heap-fragmented ESP32 this triggers a growing series of reallocations (16 → 32 → 64 → … → N bytes), each of which can throw std::bad_alloc. The final HttpResponse{body} string construction was also a heap allocation.

Fix in AppRoutes.cpp: All GET routes that serialized a JsonDocument to a std::string now use:

std::string body;
body.reserve(measureJson(doc) + 1);   // one allocation, exact size, no growth
serializeJson(doc, body);
return HttpResponse{200, "application/json", std::move(body)};

measureJson(doc) traverses the document without allocating and returns the exact byte count. reserve() does a single heap allocation of that size. serializeJson then fills the string without reallocating. std::move transfers the buffer into HttpResponse without copying. Net result: one heap allocation per response instead of the O(log N) reallocations of a growing string.
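The difference between the two patterns can be observed without ArduinoJson. The sketch below is an illustration only: `countReallocs` is a hypothetical helper that appends bytes one at a time, as a character-by-character serializer would, and counts how often the string's capacity changes. The initial `reserve()` call is itself the single allocation and is not counted.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Counts capacity changes while n bytes are appended one at a time.
// Each change corresponds to one reallocation of the string buffer.
inline std::size_t countReallocs(std::size_t n, bool reserveFirst) {
    std::string body;
    if (reserveFirst) body.reserve(n + 1);  // mirrors reserve(measureJson(doc) + 1)
    std::size_t reallocs = 0;
    std::size_t cap = body.capacity();
    for (std::size_t i = 0; i < n; ++i) {
        body.push_back('x');
        if (body.capacity() != cap) {
            ++reallocs;
            cap = body.capacity();
        }
    }
    return reallocs;
}
```

With `reserveFirst` set, the loop never reallocates; without it, the count grows with the logarithm of the body size on implementations that grow capacity geometrically.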

Fix in HttpServer.h (ESP32 section): All handler dispatch points — onGet, onPost (request-complete callback), onDelete, onPatch (request-complete callback) — now wrap the handler call in:

try {
    auto resp = handler(...);
    req->send(resp.status, resp.contentType.c_str(), resp.body.c_str());
} catch (const std::bad_alloc&) {
    req->send(503, "application/json", R"({"error":"low heap"})");
}

The 503 response uses a string literal (no heap). This catches any remaining allocation failure (e.g. JsonDocument internal pool) and returns a clean HTTP error instead of calling abort().

Part C: Live-test correctness¶

Problem 1: PreviewModule created as a top-level module. test0_infra added preview1 with no parent_id. PreviewModule belongs as a child of DriverLayer (per spec). Because PreviewModule was in INFRA_TYPES, it survived delete_all_modules() and accumulated as an orphan across tests. Scenario scripts tried to add preview1 with parent_id="driver1" but add_or_exists silently accepted the already-running top-level instance.

Problem 2: Duplicate singleton modules. Scenario files include steps for NetworkModule, SystemStatusModule etc. When the running instance had a different id than the scenario's id (e.g. device has systemstatus1, scenario adds sysinfo1), add_or_exists would create a second instance. Same for FirmwareUpdateModule: ensureInfraModules() recreates it on every boot, so it always survives delete_all_modules(), yet the scenario runner could add a second one under a new id.

Problem 3: live-results-pc-windows.md written on macOS. deploy/live/live-results-pc-windows.json is committed from Windows CI. On macOS, summarise.py read it and wrote docs/status/live-results-pc-windows.md. The per-env MD should only be written by the machine that actually ran those tests.

Fixes in deploy/live_suite.py:

PreviewModule removed from INFRA_TYPES: it is not infrastructure — it requires a parent driver and must be re-created per-test.
New SINGLETON_TYPES = INFRA_TYPES | {"FirmwareUpdateModule"}: types where only one instance should ever exist.
_scenario_step: before add_or_exists, checks type_ in SINGLETON_TYPES and type_ in client.types_present(). If true, logs a skip and returns success — prevents a second instance being created when the live state has one under a different id.
test0_infra: removed the top-level preview1 add (no driver exists at that point).
test1_ripples_pipeline: adds preview1 as child of driver1 right after driver1 is created.
test5_artnet_loopback: adds preview_tx as child of tx_drv and preview_rx as child of rx_drv.
test7_multi_layout: adds preview7 as child of driver7.

Fix in deploy/summarise.py:

CURRENT_PC_PLATFORM derived from platform.system() at module load ("darwin" → "macos", etc.).
_write_live_results_md: skips writing live-results-pc-{other}.md when env != f"pc-{CURRENT_PC_PLATFORM}". The foreign-platform JSON data is still loaded for the index.md summary table; only the per-env MD file is gated.

Part D: FirmwareUpdateModule user documentation¶

Added docs/modules/system/firmware-update-module.md covering: what the module does (surfaces OTA progress; upload handled by AppRoutes), the two controls (update_status, update_pct), three ways to flash (browser file picker, URL API call, uv run deploy/flash.py), and platform notes (URL OTA returns 501 on PC). Added to mkdocs.yml nav and to the category table and reference list in docs/user-guide/modules/index.md.

Part E: Remove `isPermanent()`¶

Problem: isPermanent() was a virtual method on StatefulModuleBase intended to prevent certain modules from being deleted at runtime. ModuleManager was the only class that returned true. However, ModuleManager is never placed in owned_[] — it manages the list but is not part of it. This meant the isPermanent() check in removeModule() was never triggered. The mechanism was dead code, and its presence was actively misleading: it suggested FirmwareUpdateModule should be permanent (it had the override until Sprint 8 Part D), when the correct safety net is the boot guard in ensureInfraModules().

Root cause: ModuleManager can be targeted via its kId for control updates (via the special-case path in setControl), but it is never added to owned_[] via addModule. removeModule() iterates owned_[], so it can never find and delete ModuleManager. The 403 Permanent response in AppRoutes.cpp was therefore unreachable.

Fix:

src/core/StatefulModule.h: removed virtual bool isPermanent() const declaration.
src/core/ModuleManager.h: removed bool isPermanent() const override { return true; }; removed RemoveResult::Permanent from the enum; updated removeModule() comment.
src/core/ModuleManager.cpp: removed isPermanent() check in removeModule(); removed isPermanent() check in replaceModule(); removed obj["permanent"] from getModulesJson().
src/core/AppRoutes.cpp: removed case RemoveResult::Permanent (403) from the DELETE handler; updated replace error message to remove mention of "permanent".
src/frontend/app.js: replace button and delete button now always rendered — mod.permanent was undefined for all modules anyway (field no longer emitted by the server).
tests/test_module_manager.cpp: removed ModuleManager - isPermanent returns true test.
tests/test_system_info.cpp: removed FirmwareUpdateModule is permanent test.

Definition of Done¶

[x] src/core/StatefulModule.h: readThrough(), controlAllocBytes() virtual hook, heap check in both setControl() overloads with auto-revert on failure
[x] src/modules/layouts/GridLayout.h: safeWidth_/Height_/Depth_ safe dimension tracking; controlAllocBytes() override; buildMappings_() uses new (std::nothrow)
[x] tests/test_layouts.cpp: 3 new GridLayout resize tests; 378/378 pass
[x] src/core/AppRoutes.cpp: all GET JSON routes use measureJson + reserve + std::move; no growing-string allocation
[x] src/core/HttpServer.h (ESP32): try/catch(std::bad_alloc) in all four handler dispatch points; returns HTTP 503 on OOM
[x] deploy/live_suite.py: PreviewModule removed from INFRA_TYPES; SINGLETON_TYPES guard in _scenario_step (includes FirmwareUpdateModule); preview1/7/tx/rx wired as children of their driver
[x] deploy/summarise.py: CURRENT_PC_PLATFORM guard; live-results-pc-windows.md not written on macOS
[x] docs/modules/system/firmware-update-module.md created; added to mkdocs.yml nav and module index
[x] MM-C1BC: 70x28 GridLayout removed via DELETE /api/modules/tree1; device stable
[x] isPermanent() virtual method removed from StatefulModuleBase; RemoveResult::Permanent enum value removed; all call sites in ModuleManager.cpp, AppRoutes.cpp, and app.js cleaned up; two now-stale tests removed
[x] 376/376 unit tests pass; mkdocs serve produces no warnings for the new doc page

Result¶

Metric	Value
Unit tests	376/376 pass (3 new GridLayout resize tests added, 2 stale `isPermanent` tests removed)
New virtual hook	`controlAllocBytes()` in `StatefulModuleBase`; default returns 0 (opt-in)
GridLayout	Rejects oversized resize when heap check fails; always allows shrink
HTTP OOM	`try/catch(std::bad_alloc)` in all ESP32 handler dispatchers; returns 503
Serialization	`measureJson` + `reserve` = 1 allocation per response (was O(log N) growing chain)
Live test fix	PreviewModule always child of its DriverLayer; no more top-level orphans
Singleton guard	`SINGLETON_TYPES` prevents second instance of NetworkModule, SystemStatusModule, FirmwareUpdateModule etc.
summarise.py	`live-results-pc-windows.md` not written on macOS
FirmwareUpdateModule docs	New user-facing doc page; wired into mkdocs nav
`isPermanent()`	Removed entirely: virtual method, enum value, all call sites, frontend gate, 2 tests
Device recovery	MM-C1BC: bad 70x28 GridLayout deleted via REST; device stable

Retrospective¶

What went well:

- controlAllocBytes as a virtual hook with a zero default keeps the mechanism entirely opt-in: modules with no significant heap impact add no code and pay no overhead.
- measureJson + reserve eliminates the growing-string problem with no memory overhead and no static buffer: the heap allocation is still there, but it is now exactly one call of exactly the right size.
- try/catch(std::bad_alloc) in HttpServer.h is the correct safety net: even if measureJson + reserve is not used on some future route, the device will return 503 rather than crash.
- SINGLETON_TYPES in the scenario runner is a clean, low-ceremony fix: one set, one guard, solves both the SystemStatusModule and FirmwareUpdateModule duplication problems without touching the scenario JSON files.
- Removing PreviewModule from INFRA_TYPES and wiring it as a child of its driver per-test is architecturally correct and required no changes to the scenario JSON files (they already had parent_id: "driver1").

What was tricky:

- Initial fix for AppRoutes used a static char kJsonSerBuf[12288] BSS buffer. Rejected: it added 12 KB of static RAM with no saving elsewhere. Replaced by the measureJson + reserve pattern, which has the same single-allocation property with zero BSS cost.
- The CURRENT_PC_PLATFORM guard in summarise.py is for the per-env MD file only; the JSON data from other platforms is still loaded and appears in index.md. Care was needed not to break the cross-platform summary table.

What was tricky (Part E):

- isPermanent() looked load-bearing because it appeared in removeModule(), replaceModule(), the JSON output, and the frontend. Tracing the actual call graph revealed it was never reached: ModuleManager is not in owned_[], so the check at owned_[i].module->isPermanent() was never true for any module.
- The boot guard in ensureInfraModules() / ensureNetworkModules() is the correct protection for infra modules: it recreates missing modules on the next reboot rather than refusing DELETE at the API level. This is more resilient and less surprising to users.

Seeds for future sprints:

- Other modules that allocate in onUpdate() (e.g. EffectsLayer on buffer resize) should also implement controlAllocBytes().
- The try/catch in HttpServer.h only catches std::bad_alloc. A broader catch (const std::exception&) would catch any handler exception, which could be useful as the handler set grows.
- The SINGLETON_TYPES guard prevents a second instance but does not fix the id mismatch (the running module may have a different id than the scenario expects). A future improvement would be a get_or_create_by_type helper that returns the existing instance id if one is found.

Sprint 9: `select` Control Type¶

Scope: Add CtrlType::Select to the control system: a uint8_t-backed dropdown registered via addControl(..., "select") followed by addControlValue("label") calls, or via a single addControl() call that takes a pre-declared static options array. Option strings are C string literals and live in flash (.rodata), not DRAM; only the pointer array costs heap, and with the static-array form even that is zero. The backing field stores a uint8_t index (1 byte), not the selected string. The schema emits an "options" array; the frontend renders a native <select> element. No changes to existing control types.

Summary¶

Part	Description	Est
A: Memory strategy	`ControlEntry` union (inline 8-char + heap overflow), `addControlValue()`, `teardown()` cleanup	M
B: `addControl` overloads	3-arg select overload; remove defaults from `uint8_t` generic to avoid ambiguity	S
C: Schema + persistence	`getSchema()` emits `"options"` array; `saveState()`/`loadState()` by index; `setControl()` by label	M
D: Frontend dropdown	`app.js` renders `select` as `<select>`; sends index on change	M
Total		L

Design decision: store index (`uint8_t`) or selected string?¶

This is the first design choice that must be made before any code is written.

Option A: store as `uint8_t` index (recommended)¶

The backing field holds the zero-based index of the selected option. Saved state JSON looks like "waveform": 2.

Pros:

- 1 byte in RAM and in saved JSON, which is critical on heap-constrained ESP32.
- Consistent with existing Uint8 control type; setControl(), saveState(), getControlValues() all reuse the same numeric path with minimal changes.
- Fast in loop(): module code reads a uint8_t directly and uses it in a switch or array index, with no string comparison.
- controlAllocBytes() returns 0 naturally (index is always 1 byte regardless of option count).

Cons:

- Saved state is not self-documenting: "waveform": 2 requires knowing the options list to interpret.
- If option order changes between firmware builds, a saved index silently maps to the wrong option (breaking change). Option order must be treated as part of the API, the same as JSON key names.
- REST and WebSocket clients must look up the schema to translate an index to a label.

Option B: store as `char[]` string value¶

The backing field holds the selected label as a C string. Saved state JSON looks like "waveform": "triangle".

Pros:

- Self-documenting in saved state and in logs.
- Robust to option reordering: the saved string still matches the right option after a firmware update that reorders the list (though renaming an option still breaks it).
- REST clients can post {"waveform": "triangle"} without knowing indices.

Cons:

- char[N] backing field: typically 16-32 bytes vs 1 byte for a uint8_t. On a module with four select controls that adds 60-124 bytes of RAM overhead.
- loop() must do strcmp or a linear search to map the string back to an integer branch, which is meaningfully slower on the hot path for effects modules.
- setControl() needs a new string-matching path: iterate options list to find the index, then write the string into the backing buffer. More code; more failure modes.
- Can still silently break if an option is renamed (different failure mode from index reordering, but equally possible).

Verdict: Option A (`uint8_t` index)¶

RAM and hot-path performance win on ESP32. The option-stability risk is the same class of breaking change as renaming a JSON key — already documented as requiring a version bump. The schema always includes the "options" array, so clients are never left guessing.
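The hot-path argument can be made concrete with a small sketch. `waveSketch` is a hypothetical helper, the option order (sine, square, triangle, sawtooth) is assumed for illustration, and the wave shapes are one plausible choice rather than the module's actual code.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Maps a select index to a wave sample in [-1, 1]. phase is expected in [0, 1).
// Option order is part of the API: index 0..3 must keep these meanings.
inline float waveSketch(std::uint8_t waveform, float phase) {
    switch (waveform) {
        case 0: return std::sin(phase * 6.2831853f);           // sine
        case 1: return phase < 0.5f ? 1.0f : -1.0f;            // square
        case 2: return 1.0f - 4.0f * std::fabs(phase - 0.5f);  // triangle
        case 3: return 2.0f * phase - 1.0f;                    // sawtooth
        default: return 0.0f;                                  // out-of-range index
    }
}
```

The per-sample cost is a single integer branch; the string-backed alternative would need a comparison against each label before reaching the same branch.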

Part A: memory strategy for option strings¶

This is the second design decision, and the most important for ESP32.

Where do the strings live?¶

"sine", "triangle", "square" are C string literals. On all targets they are stored in .rodata; on ESP32 that section lives in flash, not DRAM. Accessing them requires a flash read (cached), but they cost zero bytes of DRAM. This is true whether they appear as addControlValue("sine") inline arguments or as elements of a static constexpr const char*[] array.

The only DRAM cost is the pointer array that ControlDescriptor::options points to: 4 bytes per option on ESP32 (32-bit pointer). For a 4-option select that is 16 bytes of DRAM.

Two registration styles and their costs¶

Style 1: addControlValue() — ergonomic, one small heap allocation

// in setup():
addControl(waveform_, "waveform", "select");
addControlValue("sine");
addControlValue("triangle");
addControlValue("square");
addControlValue("sawtooth");

The pointer array is heap-allocated during setup(). To avoid realloc churn, the first addControlValue() call after addControl(..., "select") pre-allocates a fixed-size slot array (e.g. 8 pointers = 32 bytes). Each addControlValue() fills the next slot; no reallocation happens until the pre-allocated capacity is exceeded. clearControls() frees the array on teardown().

DRAM cost: 32 bytes pre-allocated pointer slots (fixed per select control, regardless of actual option count up to 8).

Style 2: static array — zero heap, all in flash

// file scope or class body (own code):
static constexpr const char* kWaveforms[] = {"sine", "triangle", "square", "sawtooth"};

// in setup():
addControl(waveform_, "waveform", "select", kWaveforms, 4);

kWaveforms is a constexpr pointer array — it lives in .rodata (flash) alongside the string literals. The descriptor stores the pointer to kWaveforms directly. No heap allocation at any point. clearControls() does not free it.

Library-defined arrays work identically, provided they are declared inline in the header:

// in the library header (C++17 inline variable — one definition across all TUs):
inline const char* const EMITTER_NAMES[EMITTER_COUNT] = {
    "orbitaldots", "swarmingdots", "audiodots", "lissajous",
    "borderrect",  "noisekaleido", "cube",       "fluidjet"
};

// in setup():
addControl(emitter_, "emitter", "select", EMITTER_NAMES, EMITTER_COUNT);

The inline keyword (C++17) guarantees the linker keeps exactly one copy of the pointer array in the final binary even when the header is included from multiple translation units. Without inline the linker might emit a copy per .o file, wasting flash. Either way no DRAM is used — inline just prevents flash duplication.

DRAM cost: 0 bytes. The array and all string data are in flash.

Distinguishing owned vs. borrowed options¶

clearControls() must know whether to free(d.options). Add a single flag to ControlDescriptor:

bool ownsOptions;  // true: heap-allocated by addControlValue(); false: static array

Set to true by addControlValue(), false by the static-array addControl() overload.

Recommended strategy¶

Use the static array form (Style 2) for any module that ships with a fixed option list — which is the common case (effects, drivers, layouts all have known-at-compile-time options). The addControlValue() form is available as convenience for prototyping or for option lists that are built dynamically from discovered resources (e.g. a list of available effects).

ControlDescriptor changes¶

struct ControlDescriptor {
  const char* key;
  const char* uiType;
  CtrlType type;
  uintptr_t ptr;
  float minVal;
  float maxVal;
  float defVal;
  const char** options;   // pointer to option labels; null for non-Select
  uint8_t optionCount;    // number of valid entries in options
  bool ownsOptions;       // true: heap-allocated; free on clearControls()
};

Per-descriptor overhead for non-Select controls: 4 + 1 + 1 = 6 bytes (pointer + count + owns flag, with padding likely making it 8 bytes). For a module with 8 controls of which 1 is a 4-option static-array Select: 8 × 8 = 64 bytes extra DRAM — modest.
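The padding estimate can be checked on a host machine by substituting `uint32_t` for each pointer so the layout matches a 32-bit target. `DescBefore`, `DescAfter`, and `kDescGrowth` are hypothetical names, and the result assumes the usual 4-byte alignment for `uint32_t` and `float` and a 1-byte `bool`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Descriptor layout before the change, with 32-bit stand-ins for pointers.
struct DescBefore {
    std::uint32_t key, uiType;     // const char* on the target
    std::uint8_t type;             // CtrlType
    std::uint32_t ptr;             // uintptr_t on the target
    float minVal, maxVal, defVal;
};

// Layout after adding the three Select fields.
struct DescAfter {
    std::uint32_t key, uiType;
    std::uint8_t type;
    std::uint32_t ptr;
    float minVal, maxVal, defVal;
    std::uint32_t options;         // const char** on the target
    std::uint8_t optionCount;
    bool ownsOptions;
};

// 4 + 1 + 1 bytes of fields plus 2 bytes of tail padding.
constexpr std::size_t kDescGrowth = sizeof(DescAfter) - sizeof(DescBefore);
static_assert(kDescGrowth == 8, "6 bytes of fields round up to 8 with padding");
```

Under these assumptions a module with 8 controls pays 8 × 8 = 64 bytes, matching the estimate above.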

Complete memory picture (ESP32dev, 4-option static-array select)¶

Item	Location	DRAM cost
Option strings ("sine" etc.)	flash (.rodata)	0 bytes
`kWaveforms[]` pointer array	flash (.rodata)	0 bytes
`ControlDescriptor::options` pointer	heap (controls_ array)	4 bytes
`ControlDescriptor::optionCount`	heap (controls_ array)	1 byte
`ControlDescriptor::ownsOptions`	heap (controls_ array)	1 byte
Backing `uint8_t waveform_` field	module instance (heap)	1 byte
Total extra DRAM per select		~7 bytes

With addControlValue() style instead: add 32 bytes for the pre-allocated pointer slots on heap.

Part B: `addControl` overloads and `addControlValue`¶

Add CtrlType::Select to the enum in StatefulModule.h:

enum class CtrlType : uint8_t { Float, Uint8, Uint32, Bool, String, EditStr, FloatConst, Select };

Static-array overload (preferred — zero heap):

// Register a select control backed by a uint8_t index. options must outlive the module
// (use static constexpr). min/max are derived from optionCount.
void addControl(uint8_t& variable, const char* key,
                const char* const* options, uint8_t optionCount);

Sets CtrlType::Select, stores the pointer directly, ownsOptions = false, maxVal = optionCount - 1.

Dynamic addControlValue() overload (ergonomic, small heap allocation):

// Register a select control; call addControlValue() immediately after for each label.
// uiType must be "select" — required for API consistency with all other addControl overloads.
void addControl(uint8_t& variable, const char* key, const char* uiType);

// Append a label to the most recently registered Select control.
// Pre-allocates 8 pointer slots on first call; no realloc within that capacity.
void addControlValue(const char* label);

addControlValue() finds the last CtrlType::Select descriptor, allocates 8-slot pointer array on the first call (ownsOptions = true), fills the next slot, increments optionCount, updates maxVal.

clearControls(): iterate descriptors; for any with ownsOptions == true, call free(d.options).

maxVal on the descriptor equals optionCount - 1 in both cases, so the existing range-clamp in setControl() keeps indices within [0, optionCount - 1] without changes.
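The two ownership modes can be sketched together in a small standalone registry. Everything here is hypothetical (`DescSketch`, `SelectRegistrySketch`, `addSelectStatic`, `addSelectDynamic`, the fixed capacities); it only illustrates how `ownsOptions` decides what `clearControls()` frees, and it refuses rather than grows once the 8 pre-allocated slots are full.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

struct DescSketch {
    const char* key = nullptr;
    const char* const* options = nullptr;  // borrowed or owned pointer array
    std::uint8_t optionCount = 0;
    bool ownsOptions = false;              // true: free in clearControls()
    float maxVal = 0.0f;
};

class SelectRegistrySketch {
public:
    static constexpr std::uint8_t kMaxControls = 8;  // assumed capacity
    static constexpr std::uint8_t kOptionSlots = 8;  // pre-allocated slots

    ~SelectRegistrySketch() { clearControls(); }

    // Static-array form: borrow the caller's array, no heap allocation.
    bool addSelectStatic(const char* key, const char* const* options,
                         std::uint8_t count) {
        if (n_ >= kMaxControls || count == 0) return false;
        DescSketch& d = descs_[n_++];
        d = DescSketch{};
        d.key = key;
        d.options = options;
        d.optionCount = count;
        d.maxVal = static_cast<float>(count - 1);
        return true;
    }

    // Dynamic form: register first, then append labels one by one.
    bool addSelectDynamic(const char* key) {
        if (n_ >= kMaxControls) return false;
        DescSketch& d = descs_[n_++];
        d = DescSketch{};
        d.key = key;
        return true;
    }

    // Appends a label to the most recently registered select.
    bool addControlValue(const char* label) {
        if (n_ == 0) return false;
        DescSketch& d = descs_[n_ - 1];
        if (d.options != nullptr && !d.ownsOptions) return false;  // borrowed
        if (d.options == nullptr) {
            void* mem = std::calloc(kOptionSlots, sizeof(const char*));
            if (mem == nullptr) return false;
            d.options = static_cast<const char* const*>(mem);
            d.ownsOptions = true;
        }
        if (d.optionCount >= kOptionSlots) return false;  // sketch: no growth
        const_cast<const char**>(d.options)[d.optionCount++] = label;
        d.maxVal = static_cast<float>(d.optionCount - 1);
        return true;
    }

    // Frees only the arrays this registry allocated itself.
    void clearControls() {
        for (std::uint8_t i = 0; i < n_; ++i) {
            if (descs_[i].ownsOptions) {
                std::free(const_cast<const char**>(descs_[i].options));
            }
            descs_[i] = DescSketch{};
        }
        n_ = 0;
    }

    const DescSketch& at(std::uint8_t i) const { return descs_[i]; }
    std::uint8_t count() const { return n_; }

private:
    DescSketch descs_[kMaxControls];
    std::uint8_t n_ = 0;
};
```

Storing the array as `const char* const*` lets the static form accept a `constexpr` array directly; the dynamic form casts away the inner `const` only for memory it allocated itself.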

Part C: schema emission, value reads, and persistence¶

getSchema() — add Select case:

case CtrlType::Select:
  c["value"] = *reinterpret_cast<const uint8_t*>(d.ptr);
  c["default"] = (uint8_t)d.defVal;
  {
    JsonArray opts = c["options"].to<JsonArray>();
    for (uint8_t j = 0; j < d.optionCount; ++j) opts.add(d.options[j]);
  }
  break;

The "type" field in the schema JSON is already d.uiType ("select"), so no other changes are needed for the frontend to identify the control.

getControlValues() — add Select case: identical to Uint8 (emit the index as an integer).

setControl() — add Select case: identical to Uint8 (clamp to [0, optionCount-1], write through the uint8_t* pointer, call onUpdate()).

saveState() / loadState(): no changes needed — Select follows the Uint8 save/load path (save as integer, load as integer via applyPending_()).

Part D: frontend dropdown rendering (`app.js`)¶

getSchema() already emits "type": "select" for the control. The frontend renderControl() function currently renders sliders for numeric types and checkboxes for bools. Add a branch for "select":

if (ctrl.type === 'select' && Array.isArray(ctrl.options)) {
    const sel = document.createElement('select');
    ctrl.options.forEach((label, i) => {
        const opt = document.createElement('option');
        opt.value = i;
        opt.textContent = label;
        if (i === ctrl.value) opt.selected = true;
        sel.appendChild(opt);
    });
    // Option values are stringified indices; parse with an explicit radix.
    sel.onchange = () => sendControlUpdate(modId, ctrl.key, parseInt(sel.value, 10));
    return sel;
}

WebSocket state updates that arrive mid-session must also update the <select> element's selectedIndex, following the same pattern as slider value updates.

Definition of Done¶

[x] CtrlType::Select added to enum in StatefulModule.h
[x] ControlDescriptor extended with options, optionCount, ownsOptions fields; non-Select defaults to nullptr / 0 / false
[x] addControl(uint8_t&, key, options, count) — static-array form; zero heap; ownsOptions = false
[x] addControl(uint8_t&, key, "select") + addControlValue(label) — dynamic form; "select" uiType required for API consistency; ownsOptions = true; generic uint8_t overload has explicit min/max (no defaults) to eliminate 3-arg overload ambiguity
[x] clearControls() and destructor free options via freeOwnedOptions_() only when ownsOptions == true
[x] getSchema(): Select case emits "value", "default", "options" array
[x] getControlValues(): Select emits index as integer (same as Uint8)
[x] setControl(): Select reuses Uint8 path via readThrough/writeThrough; value stored as uint8_t
[x] saveState() / loadState() / applyPending_(): Select handled same as Uint8
[x] SineEffectModule: waveform select (sine/square/triangle/sawtooth); wave_() helper applies chosen shape; static-array form used
[x] LinesEffectModule: axis select (all/x/y/z); loop conditionally draws each plane; static-array form used
[x] Frontend: <select class="select-input"> rendered for type == "select", addEventListener('change') posts index, live WebSocket updates reflected via select.value
[x] CSS: .select-input matches .text-input styling; light-theme override included
[x] Tests: 10 new cases in test_stateful_module.cpp — registration, schema, setControl, saveState round-trip, hot-reload leak safety, addControlValue dynamic form, SineEffect waveform, LinesEffect axis
[x] 386/386 tests pass (10 new, 1 existing test updated for new SineEffect waveform control)
[x] deploy/summarise.py: ESP32 MD guard added — skips writing live-results-esp32-*.md when no current (non-last-good) ESP32 JSON exists, preventing stale ESP32 sections from being rewritten on all_pc.py runs
[x] deploy/unittest.py: run_tee accepts optional cwd; test binary invoked with absolute path
[x] tests/test_module_manager.cpp: auto-pipeline test calls disableStatePersistence() before teardown to prevent writing state files to the working directory
[x] state/grid1.json: reset to 16x16x1 (segfault fix: stale 1013x1018x32 values from a previous live-test run with the server started from the wrong directory)

Result¶

Metric	Value
Unit tests	386/386 pass (10 new, 1 updated)
New control type	`CtrlType::Select` backed by `uint8_t` index; zero heap for static-array form
New `ControlDescriptor` fields	`options` (4 B), `optionCount` (1 B), `ownsOptions` (1 B) per descriptor
`addControl` overloads	Static-array form (zero heap) and dynamic `addControlValue()` form; both require explicit `"select"` uiType
Hot-reload safety	`freeOwnedOptions_()` called from `clearControls()` and destructor
`SineEffectModule`	New `waveform` select: sine / square / triangle / sawtooth
`LinesEffectModule`	New `axis` select: all / x / y / z
Frontend	`<select>` element rendered for `type == "select"`; live WS updates applied; CSS styled
Schema	`"options"` array emitted by `getSchema()`; `"value"` and `"default"` as integer index
PC live tests	All pass; test4 (device discovery) expected FAIL without ESP32 on network

See test results for full pass/fail breakdown.

Retrospective¶

What went well:

- The CtrlType::Select case slots cleanly into every existing switch in StatefulModule.h because the backing type (uint8_t) is identical to Uint8. readThrough/writeThrough/applyPending_/saveState all just needed case CtrlType::Select: fall-through onto the existing Uint8 case.
- Static-array form (addControl(var, key, kArr, N)) costs exactly 6 bytes of DRAM per descriptor and zero heap; kArr and all string literals live in flash. This is the right default for any module with a compile-time-fixed option list.
- The ownsOptions flag on ControlDescriptor cleanly separates the two ownership modes. clearControls() and the destructor both call the same freeOwnedOptions_() helper, so hot-reload and final teardown are handled identically.
- Library-provided arrays (e.g. inline const char* const EMITTER_NAMES[]) work directly with the static-array overload, with no adaptation needed.
- Making "select" an explicit uiType argument on the dynamic form (addControl(var, key, "select")) aligns it with every other addControl overload. The API is now fully consistent: the uiType string is always the third argument, regardless of control type.
- The summarise.py ESP32 guard (skip rewriting live-results-esp32-*.md when no current ESP32 JSON exists) prevents all_pc.py runs from silently overwriting the last good ESP32 status with a stale timestamp.

What was tricky:

- clearControls() previously just set controlCount_ = 0 without freeing anything. Adding freeOwnedOptions_() there required also calling it from the destructor; overlooking either site would cause a leak on hot-reload or module destruction respectively.
- The memmove in runSetup() that promotes enabled_ to index 0 copies ControlDescriptor structs byte-for-byte, including options pointers. This is safe (the pointers remain valid), but it means two descriptor slots briefly point at the same options array. The old slot is immediately overwritten, so there is no double-free risk. Worth understanding before reading this code path.
- addControlValue() uses realloc on each call. The sprint doc proposed 8-slot pre-allocation; the implementation went with simple realloc instead (simpler code, acceptable since it only runs during setup()). Backlogged if profiling ever shows setup-time fragmentation.
- Adding "select" as an explicit uiType argument to addControl(uint8_t&, key, uiType) required removing the default min/max from the generic uint8_t overload to avoid a 3-argument ambiguity. No existing caller relied on those defaults (all passed explicit min/max), so the change was safe.
- A stale state/grid1.json with dimensions 1013x1018x32 (written by a previous live-test session that started the server from the project root) caused a segfault on the next run. The pal::check_alloc guard correctly blocked the allocation, but the state/ file survived. Fixed by resetting to 16x16x1. The all_pc.py pipeline always starts the server from deploy/build/pc/{platform}/ so this state is isolated; running the binary from the project root directly can still contaminate the project-root state/ directory.
- The auto-pipeline unit test (ModuleManager - auto-creates default pipeline when no modules exist) did not call disableStatePersistence() after its assertions, causing it to write state/grid1.json (with default 10x10x1 dimensions) to the project root on every test run. Fixed by calling disableStatePersistence() after the assertions, before teardown.
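
The two ownership modes described above can be sketched in isolation. This is a minimal illustration, not the StatefulModule.h code: the trimmed-down ControlDescriptor, setStaticOptions(), and the free-function forms of addControlValue() and freeOwnedOptions() are stand-ins assumed for the example.

```cpp
#include <cstdint>
#include <cstdlib>

// Hypothetical, trimmed-down descriptor: only the three Select-related fields.
struct ControlDescriptor {
    const char* const* options = nullptr;  // option labels (borrowed or heap-owned array)
    uint8_t optionCount = 0;
    bool ownsOptions = false;              // true only for the dynamic form
};

// Static-array form: borrow the caller's array, allocate nothing.
inline void setStaticOptions(ControlDescriptor& d, const char* const* arr, uint8_t n) {
    d.options = arr;
    d.optionCount = n;
    d.ownsOptions = false;
}

// Dynamic form: grow the pointer array by one slot per call (plain realloc).
inline bool addControlValue(ControlDescriptor& d, const char* label) {
    if (d.optionCount > 0 && !d.ownsOptions) return false;  // never realloc a borrowed array
    if (d.optionCount == UINT8_MAX) return false;           // index must fit in uint8_t
    void* grown = std::realloc(const_cast<const char**>(d.options),
                               (d.optionCount + 1u) * sizeof(const char*));
    if (!grown) return false;              // on failure the old block is still valid
    const char** arr = static_cast<const char**>(grown);
    arr[d.optionCount] = label;            // the label string itself is not copied
    d.options = arr;
    d.optionCount++;
    d.ownsOptions = true;
    return true;
}

// Single release path, intended to be shared by clearControls() and the destructor.
inline void freeOwnedOptions(ControlDescriptor& d) {
    if (d.ownsOptions) std::free(const_cast<const char**>(d.options));
    d.options = nullptr;
    d.optionCount = 0;
    d.ownsOptions = false;
}
```

Because the release path checks ownsOptions first, calling it on a descriptor that borrows a static array is harmless, which is what makes one helper usable from both teardown sites.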

Complexity estimate: Medium. The core StatefulModule.h changes are straightforward switch-case additions. The non-trivial parts were: ownership lifecycle (ownsOptions, freeOwnedOptions_), verifying the memmove path is safe, and the two effect-side implementations (wave_() in SineEffect and the axis conditional in LinesEffect).

Seeds for future sprints:

- Proper range-clamping in setControl() for Select: clamp submitted value to [0, optionCount-1] rather than relying on uint8_t truncation. Frontend-submitted values are always valid; REST misuse is the only exposure.
- addControlValue() 8-slot pre-allocation: measure whether realloc churn during setup() causes fragmentation on ESP32 dev before adding the optimization.
- Other modules with natural discrete parameters: NoiseEffect2D blend mode, RipplesEffect shape, MirrorModifier axis. Each is a one-liner addControl + static array addition.
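
The range-clamping seed could look roughly like the following. clampSelectIndex() is a hypothetical helper, not an existing function; it takes a wide signed value so an out-of-range REST submission is seen before it is narrowed to uint8_t.

```cpp
#include <cstdint>

// Clamp a submitted Select index into [0, optionCount - 1].
// Returns 0 when the control has no options at all.
inline uint8_t clampSelectIndex(long submitted, uint8_t optionCount) {
    if (optionCount == 0 || submitted < 0) return 0;
    if (submitted >= optionCount) return static_cast<uint8_t>(optionCount - 1);
    return static_cast<uint8_t>(submitted);
}
```

With four options, a submitted value of 300 clamps to index 3, whereas plain uint8_t truncation would silently turn it into 44.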

Sprint 10: Boot Module Creation Redesign and Dynamic Network Management¶

Scope: Replace the ad-hoc ensureNetworkModules / ensureInfraModules / instantiateDefaultPipeline_ boot logic with a single coherent rule: on first boot (no non-network top-level modules), create the full default set; otherwise leave the pipeline alone. Add EthernetModule to the initial network group. Add dynamic network management to NetworkModule so the AP is automatically disabled when STA or Ethernet is connected, and re-enabled when connectivity is lost. Investigate and fix the root causes of apparent duplicate modules (type name bug in AppSetup.h, scenario script behavior, ambiguous 409 error response).

Identified from: Sprint 9 retrospective seeds; user request after Sprint 10 scope discussion.

Summary¶

Part	Description	Est
A: EthernetModule in boot	Add `eth1` (child of network1) to first-boot network group in `ensureNetworkModules`	XS
B: `ensureDefaultModules`	Replace `ensureDefaultPipeline`+`ensureInfraModules`; "no non-network top-level modules" rule; update PC `instantiateDefaultPipeline_`	M
C: Dynamic network management	`NetworkModule` 10 s ticker + 60 s grace-period debounce; `onUpdate("enabled")` on WifiAp/WifiSta; `setInput` wiring	L
D: ui.md boot section	Document boot module creation, dynamic WiFi, delete-to-prevent-recreation	S
E: Duplicate investigation + 409	AppSetup type name bug fixed via B; 409 `reason` field in `AppRoutes`	S
Total		L

Current boot logic (before this sprint)¶

On embedded (AppSetup.cpp):

1. mm.setup(): if no DriverLayer AND no EffectsLayer, create driver1 + grid1 + effects1 + ripples1 + preview1.
2. ensureNetworkModules(): if no NetworkModule, create network1 + sta1 + ap1.
3. ensureInfraModules(): calls ensureDefaultPipeline() (patches EffectsLayer / Preview onto an existing DriverLayer if absent), then unconditionally adds SystemStatus and FirmwareUpdateModule if not present.

On PC (main.cpp): 1. mm.setup() — same pixel-pipeline creation as embedded step 1. No SystemStatus, Firmware, or network modules created.

Problems with the current logic:

- ensureDefaultPipeline patchwork adds EffectsLayer / PreviewModule even when the user deliberately built a custom pipeline without them.
- SystemStatus and FirmwareUpdateModule are added unconditionally, even on partially customised setups.
- EthernetModule is never auto-created.
- No dynamic network management: AP runs permanently even when STA is connected.

Part A: Extend `ensureNetworkModules` — add EthernetModule¶

ensureNetworkModules() currently creates network1 + sta1 + ap1 on first boot (guarded by hasModuleType("NetworkModule")). Add eth1 to this initial creation:

mm.addModule("EthernetModule", "eth1", {}, {}, 1, "network1");  // child of network1

"If later deleted, don't recreate" guarantee: This is already provided by the hasModuleType("NetworkModule") guard — ensureNetworkModules is a no-op on any boot where NetworkModule already exists. EthernetModule is only created once, alongside the rest of the network group, and is never checked for independently.

Part B: Replace `ensureDefaultPipeline` + `ensureInfraModules` with `ensureDefaultModules`¶

Replace both functions with a single ensureDefaultModules(mm) that applies the new rule:

New rule: count top-level modules (parentId == "") whose type is not "NetworkModule". If the count is zero, create the full default set. Otherwise, do nothing.

top-level non-network modules == 0  →  create full default set
top-level non-network modules  > 0  →  do nothing

Full default set (created atomically):

id	type	parent
sysinfo1	SystemStatusModule	—
firmware1	FirmwareUpdateModule	—
discovery1	DeviceDiscoveryModule	—
driver1	DriverLayer	—
grid1	GridLayout	driver1
effects1	EffectsLayer	—
ripples1	RipplesEffectModule	effects1
preview1	PreviewModule	driver1

Behavior changes from current logic:

Scenario	Before	After
Completely blank first boot	pixel pipeline only, then SystemStatus + Firmware added separately	full default set created atomically
Only DriverLayer exists	EffectsLayer + Preview patched in; SystemStatus + Firmware added	do nothing
Only NetworkModule + children	full default pipeline created	full default set created
Any non-network top-level module	partial patching applied	do nothing
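
The rule and the default set above can be sketched against a plain vector. This is an illustration only: ModuleInfo and the vector-based signatures are assumptions made for the example, and the real ensureDefaultModules(mm) goes through ModuleManager rather than a std::vector.

```cpp
#include <string>
#include <vector>

// Hypothetical minimal view of a registered module.
struct ModuleInfo {
    std::string id;
    std::string type;
    std::string parentId;  // empty string = top-level
};

// Count top-level modules whose type is not "NetworkModule".
inline int countTopLevelNonNetwork(const std::vector<ModuleInfo>& mods) {
    int n = 0;
    for (const auto& m : mods)
        if (m.parentId.empty() && m.type != "NetworkModule") ++n;
    return n;
}

// Create the full default set only when the count is zero.
// Returns true if the defaults were created, false if nothing was touched.
inline bool ensureDefaultModules(std::vector<ModuleInfo>& mods) {
    if (countTopLevelNonNetwork(mods) > 0) return false;
    static const ModuleInfo kDefaults[] = {
        {"sysinfo1",   "SystemStatusModule",    ""},
        {"firmware1",  "FirmwareUpdateModule",  ""},
        {"discovery1", "DeviceDiscoveryModule", ""},
        {"driver1",    "DriverLayer",           ""},
        {"grid1",      "GridLayout",            "driver1"},
        {"effects1",   "EffectsLayer",          ""},
        {"ripples1",   "RipplesEffectModule",   "effects1"},
        {"preview1",   "PreviewModule",         "driver1"},
    };
    for (const auto& m : kDefaults) mods.push_back(m);
    return true;
}
```

A second call on the populated list is a no-op, which is the property that lets a user delete a default module without seeing it recreated on the next boot.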

ModuleManager::instantiateDefaultPipeline_() (PC): The existing function runs on PC where AppSetup.h is not compiled. It should be updated to apply the same rule: check for any top-level module (on PC there is no NetworkModule so the check becomes "any top-level module exists"). If none exist, create the full default set including SystemStatus, FirmwareUpdateModule, and DeviceDiscoveryModule. The PC build does register all three types.

ModuleManager::setup() (both platforms): Remove the !hasDriver && !hasEffects auto-pipeline check. On embedded, ensureDefaultModules handles first-boot creation. On PC, instantiateDefaultPipeline_ (updated) handles it. The condition that triggers it changes from "no DriverLayer+EffectsLayer" to "no top-level modules at all".

Part C: Dynamic network management in `NetworkModule`¶

NetworkModule::loop() currently does nothing. Add a 10-second periodic check that manages the AP based on current connectivity:

Priority (highest wins):

1. Ethernet connected → disable both WiFi AP and WiFi STA.
2. STA connected → disable WiFi AP (keep STA running).
3. Neither connected, grace period expired → enable WiFi AP (recovery path for configuration access).

Grace period for STA loss: A brief disconnection (network hiccup, AP reboot) should not immediately re-enable the AP — toggling the AP is disruptive (clients connecting mid-hiccup, mDNS flapping). A configurable grace period lets STA recover before any AP change is made.

When STA was connected and then drops: record staLostMs_ (millis timestamp).
Each tick: if now - staLostMs_ >= STA_GRACE_MS and no other connectivity, enable AP.
If STA reconnects or Ethernet comes up before the grace period expires: clear staLostMs_, no AP change.
On first boot (STA never connected): no grace period; enable AP immediately.
STA_GRACE_MS is a private compile-time constant (60 000 ms). A future control could expose it; for Sprint 10 a compile-time default is sufficient.

Implementation sketch:

NetworkModule needs typed pointers to its children to call setControl("enabled", ...) on them. Wiring approach: NetworkModule implements setInput("sta", ...), setInput("ap", ...), setInput("eth", ...), receiving the module pointers when the wiring pass runs. ensureNetworkModules passes the child ids as inputs to network1 after creating all children (or a dedicated post-creation wiring step).

WifiApModule and WifiStaModule must override onUpdate("enabled") to actually start/stop their WiFi interface when the enabled control changes. Currently, enabled_ only gates loop execution; setting it to false does not call teardown() or stop WiFi. This change makes enabled semantically equivalent to "WiFi interface is running".

10-second ticker and grace-period state in NetworkModule:

uint32_t lastCheckMs_ = 0;
uint32_t staLostMs_   = 0;   // 0 = STA connected or never-connected; non-zero = grace countdown started
bool     staWasConnected_ = false;
static constexpr uint32_t STA_GRACE_MS = 60000;

void loop() override {
#ifdef ARDUINO
    uint32_t now = pal::millis();
    if (now - lastCheckMs_ < 10000) return;
    lastCheckMs_ = now;
    manageWifi_(now);
#endif
}
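
The ticker's subtraction-based check stays correct when the millisecond counter wraps around, because unsigned subtraction is modular. A small sketch, where intervalElapsed() is a hypothetical helper name introduced for illustration:

```cpp
#include <cstdint>

// True once at least intervalMs has passed since last, even if the
// 32-bit millisecond counter wrapped between the two samples.
inline bool intervalElapsed(uint32_t now, uint32_t last, uint32_t intervalMs) {
    return static_cast<uint32_t>(now - last) >= intervalMs;
}
```

Comparing absolute timestamps (now >= last + interval) would misfire around the wrap point; comparing the difference does not.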

manageWifi_(now) logic:

ethConn = eth_ && eth_->isConnected()
staConn = sta_ && pal::wifi_sta_is_connected()

if ethConn:
    clear staLostMs_; disable AP and STA
else if staConn:
    clear staLostMs_; disable AP          // STA healthy: AP not needed
else:
    if staWasConnected_ and staLostMs_ == 0:
        staLostMs_ = now                  // STA just dropped: start grace timer
    if staLostMs_ != 0 and (now - staLostMs_) >= STA_GRACE_MS:
        enable AP; clear staLostMs_       // grace expired: open recovery AP
    // else: within grace period — do nothing, wait for STA to recover

staWasConnected_ = staConn

The children's onUpdate("enabled") handlers propagate the change to the WiFi stack.
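
The decision logic can also be written as a pure function so that it can be exercised without WiFi hardware. The sketch below follows the pseudocode above as written; ApAction, WifiManagerState, and manageWifiTick() are hypothetical names, and the real manageWifi_() calls setControl("enabled", ...) on the child modules instead of returning an action.

```cpp
#include <cstdint>

enum class ApAction { None, Disable, Enable };

// Grace-period state carried between ticks.
struct WifiManagerState {
    uint32_t staLostMs = 0;       // 0 = no grace countdown running
    bool staWasConnected = false;
};

constexpr uint32_t STA_GRACE_MS = 60000;

// One tick of the AP decision. disableSta is set when Ethernet takes over.
inline ApAction manageWifiTick(WifiManagerState& s, uint32_t now,
                               bool ethConn, bool staConn, bool& disableSta) {
    disableSta = false;
    ApAction action = ApAction::None;
    if (ethConn) {
        s.staLostMs = 0;
        disableSta = true;             // Ethernet wins: AP and STA both off
        action = ApAction::Disable;
    } else if (staConn) {
        s.staLostMs = 0;
        action = ApAction::Disable;    // STA healthy: AP not needed
    } else {
        if (s.staWasConnected && s.staLostMs == 0)
            s.staLostMs = now;         // STA just dropped: start grace timer
        if (s.staLostMs != 0 && (now - s.staLostMs) >= STA_GRACE_MS) {
            s.staLostMs = 0;
            action = ApAction::Enable; // grace expired: open recovery AP
        }
        // otherwise: within the grace period, leave the AP alone
    }
    s.staWasConnected = staConn;
    return action;
}
```

Keeping the state in a small struct and the inputs as plain booleans is one way to make the grace-period behaviour unit-testable on PC, where the ARDUINO-only loop() body is compiled out.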

EthernetModule isConnected(): currently always returns false. The interface is added now so NetworkModule can call it; the stub is replaced when real Ethernet support is implemented.

Part D: Update `docs/user-guide/ui.md`¶

Add (or update) a "Boot module creation" section that describes:

- First boot: what modules are created and in what order.
- Network group: NetworkModule + WifiSta + WifiAp + Ethernet.
- Default pipeline: only created when no non-network top-level modules exist.
- Dynamic WiFi: AP is automatically disabled when STA or Ethernet is connected; re-enabled when not.
- User control: delete any default module to prevent it being recreated on next boot.

Part E: Duplicate module investigation and 409 error clarity¶

ModuleManager::addModule already checks for duplicate IDs at lines 416-418 and returns false (HTTP 409) if the ID is already registered. The guard is solid. Despite this, users occasionally see duplicate modules in the UI. Three root causes were identified:

1. AppSetup.h type name bug (primary cause)

ensureInfraModules() calls mm.addModule("SystemStatus", ...) but the TypeRegistry key is "SystemStatusModule" (the class name, set by REGISTER_MODULE(SystemStatusModule)). The type lookup fails silently and the module is never created. The live-test test0_infrastructure scenario then creates it using a different id (systemstatus1) via HTTP, so the user sees two apparent SystemStatus entries after running test0 more than once: the real sysinfo1 (if it ever existed from a prior boot) and systemstatus1 from test0. Fix: change "SystemStatus" to "SystemStatusModule" throughout AppSetup.h, or eliminate the call entirely via Part B's ensureDefaultModules.

2. Scenario scripts using add_or_exists (secondary cause)

live_suite.py's add_or_exists treats HTTP 409 as success if a module with the same ID already exists (type-checked). For types listed in SINGLETON_TYPES (INFRA_TYPES | {"FirmwareUpdateModule"}), the test step will skip re-creation when the correct type is already present. For non-singleton types (e.g., EffectsLayer, DriverLayer), a POST with a different ID will create a second instance if the first was not cleaned up. The delete_all_modules step at the start of each scenario should prevent this, but it preserves INFRA_TYPES modules — so if a prior scenario left a non-infra module with the same type but a different id, a new one will be created.

3. Ambiguous HTTP 409 error message (diagnostic cause)

AppRoutes.cpp returns 409 for three distinct failures: ID already exists, unknown type, invalid parent ID. These are currently indistinguishable from the HTTP response alone, making debugging harder. Fix: return a reason field in the JSON body distinguishing the three cases.

Fixes in this sprint:

AppSetup.h: fix "SystemStatus" → "SystemStatusModule" (addressed implicitly by Part B's ensureDefaultModules rewrite)
AppRoutes.cpp: return distinct reason strings in the 409 response body ("id_exists", "unknown_type", "invalid_parent")
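
One way to produce the three reason strings is a small enum-to-string mapping feeding the JSON body. This is a sketch under assumptions: AddModuleError, reasonString(), and conflictBody() are hypothetical names, and the actual AppRoutes.cpp response may carry additional fields.

```cpp
#include <string>

// Hypothetical failure codes for the three 409 cases.
enum class AddModuleError { IdExists, UnknownType, InvalidParent };

inline const char* reasonString(AddModuleError e) {
    switch (e) {
        case AddModuleError::IdExists:      return "id_exists";
        case AddModuleError::UnknownType:   return "unknown_type";
        case AddModuleError::InvalidParent: return "invalid_parent";
    }
    return "unknown";  // unreachable for valid enum values
}

// Minimal JSON body for the HTTP 409 response.
inline std::string conflictBody(AddModuleError e) {
    return std::string("{\"reason\":\"") + reasonString(e) + "\"}";
}
```

A caller such as a live-test helper can then tell a real duplicate ("id_exists") apart from a misspelled type name, which previously produced the same status code.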

Design decisions¶

Why "else do nothing" instead of per-type checks? The previous "patch up missing pieces" approach was opaque — it was hard to predict whether a module would be added on the next reboot. The new rule is a single, testable invariant: the first-boot state is fully deterministic; any subsequent state is entirely the user's configuration.

Why is SystemStatusModule part of the conditional set? Previously it was added unconditionally. Making it conditional brings it in line with the other defaults — if the user removes it they clearly don't want it. The boot-guard pattern (ensureNetworkModules re-creates network if deleted) is reserved for modules that are genuinely required for the device to be accessible (networking). Infra/status modules are optional from the device's perspective.

Why onUpdate("enabled") on WiFi child modules rather than direct PAL calls from NetworkModule? Direct PAL calls from NetworkModule would bypass the module's state machine and leave its status_ control stale. Routing through setControl("enabled") → onUpdate keeps the module self-consistent and makes the WiFi state visible in the UI.

Definition of Done¶

[x] AppSetup.h: ensureNetworkModules creates eth1 (child of network1) alongside sta1 and ap1 on first boot
[x] AppSetup.h: ensureDefaultModules replaces ensureDefaultPipeline + ensureInfraModules; creates full default set only when no non-network top-level modules exist
[x] AppSetup.cpp: calls ensureNetworkModules then ensureDefaultModules (replacing the old pair)
[x] ModuleManager.cpp: instantiateDefaultPipeline_ updated for PC — checks "no top-level modules" and creates full default set (including SystemStatus, FirmwareUpdateModule, DeviceDiscoveryModule)
[x] ModuleManager.cpp: removes the !hasDriver && !hasEffects auto-pipeline check from setup()
[x] NetworkModule: setInput("sta", ...), setInput("ap", ...), setInput("eth", ...) added; loop() manages WiFi with 10-second ticker; lastCheckMs_ uint32_t member
[x] WifiApModule: onUpdate("enabled") calls wifi_ap_stop() on disable and startAp() on enable
[x] WifiStaModule: onUpdate("enabled") disconnects on disable and reconnects on enable
[x] EthernetModule: isConnected() method added (returns false; stub for future Ethernet implementation)
[x] ensureNetworkModules wires sta1, ap1, eth1 as inputs to network1 so NetworkModule receives the typed pointers
[x] Tests: new unit tests for ensureDefaultModules (no modules → full set created; DriverLayer present → nothing added)
[x] Tests: WifiApModule.onUpdate("enabled") stops/starts AP; WifiStaModule.onUpdate("enabled") disconnects/reconnects; EthernetModule.isConnected() returns false on PC
Note: NetworkModule grace-period logic is #ifdef ARDUINO-only — not testable on PC; verified by code review
[x] AppRoutes.cpp: HTTP 409 response body includes a reason field ("id_exists", "unknown_type", "invalid_parent") so callers can distinguish the three failure cases
[x] docs/user-guide/ui.md: boot module creation section added
[x] PC live tests pass (7/7 scenarios); ESP32 live tests skipped (no devices connected during sprint completion)
[x] esp32dev and esp32s3_n16r8 build successfully

Result¶

Metric	Value
Unit tests	390/390 pass (4 new)
PC live tests	7/7 scenarios pass
esp32dev build	1374 KB flash, 19.9% RAM
esp32s3_n16r8 build	1362 KB flash, 19.3% RAM
`AppSetup.h`	`ensureNetworkModules` + `ensureDefaultModules` replace 3 old boot functions
`NetworkModule`	WiFi management with 10 s ticker and 60 s STA grace period
`WifiApModule` / `WifiStaModule`	`onUpdate("enabled")` reactive AP/STA control
HTTP 409	Now includes `reason` field: `id_exists` / `unknown_type` / `invalid_parent`

See test results for full pass/fail breakdown.

Retrospective¶

What went well:

- The "no non-network top-level modules" rule (countTopLevelNonNetwork()) gives a single, testable invariant for first-boot behavior: deterministic and easy to reason about compared to the patchwork of hasDriver && hasEffects checks it replaced.
- Routing WiFi enable/disable through setControl("enabled") -> onUpdate keeps each module self-consistent. NetworkModule never needs to know about STA/AP internals; the child modules keep their own status_ display up to date.
- rewireModule("network1", inputs) after creating the four network modules is a clean pattern: create children first, then wire the parent. No ordering constraint on addModule itself.
- The StatefulModuleBase* type for ap_ and sta_ in NetworkModule solved the circular-include problem cleanly: WifiAp.h and WifiSta.h both include Network.h, so Network.h cannot include them. The base pointer is sufficient for setControl() calls.

What was tricky:

- runSetup() vs setup() in tests: setup() does not register the enabled_ control; that is runSetup()'s job (the base-class wrapper). The three new behavior tests initially called ap.setup() and the setControl("enabled", false) returned false silently (control not found), so onUpdate was never called and status stayed "starting". Fixed by switching to runSetup().
- The circular-include problem between Network.h and WifiAp.h/WifiSta.h was not obvious until the first compile. Storing ap_/sta_ as StatefulModuleBase* and casting in setInput() is the right fix, but required understanding which header depends on which.
- AppSetup.h previously used "SystemStatus" (wrong) instead of "SystemStatusModule" as the type name string. The bug was latent until Sprint 10's investigation of apparent duplicate-creation. Eliminating ensureInfraModules entirely fixed it without a targeted patch.

Complexity estimate: Large. Three distinct sub-systems changed (boot logic, WiFi management, HTTP error details) plus four test files and the ui.md doc.

Seeds for future sprints:

- Sprint 11: implement EthernetModule for real (LAN8720/W5500); NetworkModule.manageWifi_() already calls eth_->isConnected() and just needs the stub replaced.
- Expose STA_GRACE_MS (currently 60 000 ms compile-time constant) as a NetworkModule control for field-adjustable debounce.
- Add a NetworkModule live test: bring STA up, verify AP disables; bring STA down, wait grace period, verify AP re-enables. Requires hardware or a WiFi simulation stub.

Sprint 11: Ethernet Implementation¶

Scope: Implement EthernetModule for real on ESP32 classic (LAN8720 RMII) and ESP32-S3 (W5500 SPI). Add Ethernet PAL functions to Pal.h covering both the Arduino ETH.h path and the bare IDF_VER path so a future Arduino-free build compiles cleanly. Add DHCP client and static IP modes; static IP mode serves as the direct-connect ("AP analog") path. Document what ESP32-P4 Ethernet will require when hardware arrives.

Depends on: Sprint 10 (wires eth_ pointer in NetworkModule; adds isConnected() stub; Sprint 11 makes it real).

Identified from: Sprint 10 retrospective seeds; user request.

Summary¶

Part	Description	Est
A: Ethernet PAL functions	6 PAL functions (`eth_init`, `eth_is_connected`, etc.); ARDUINO, IDF_VER, and PC stub branches	M
B: EthernetModule implementation	Full module replacing stub: setup/loop/isConnected, DHCP+static controls, `healthReport()`	M
C: Direct-connect mode	Static IP path (AP analog); recommended defaults; link-local note; doc update	S
D: ESP32-P4 documentation	GMAC, SDIO WiFi coprocessor, IDF 5.3+, PAL additions needed; no implementation	S
Total		M

Background: PAL structure and the IDF_VER path¶

All network PAL functions follow a three-way platform switch that must be preserved for every new function added:

#ifdef ARDUINO
    // Arduino ESP32 framework — ETH.h / WiFi.h / esp_netif via Arduino wrappers
#elif defined(IDF_VER)
    // Bare ESP-IDF — esp_eth / esp_netif / lwIP directly; no Arduino wrappers
#else
    // PC build — no-op stubs, returns false / empty string
#endif

The IDF_VER path exists today for WiFi but has minimal/stub bodies. Every Ethernet PAL function added in this sprint must have a real IDF_VER body (not just a stub) because the long-term goal is to be able to build without Arduino.h. This means using esp_eth, esp_netif, and esp_event APIs directly in the IDF_VER branch, not delegating to ETH.h.

The ARDUINO and IDF_VER implementations can share the same PAL function signatures; the #ifdef is inside the function body, not in the declaration.
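
As an illustration of keeping the platform switch inside the function body, the following sketches a single PAL function with all three branches. It is not the Pal.h implementation: g_eth_netif is a hypothetical handle assumed to be created during Ethernet init, and only the PC branch has been exercised here.

```cpp
#include <stddef.h>

#ifdef ARDUINO
#include <ETH.h>
#include <stdio.h>
#elif defined(IDF_VER)
#include "esp_netif.h"
// Hypothetical handle, assumed to be created by the Ethernet init code.
extern esp_netif_t* g_eth_netif;
#endif

namespace pal {

// One signature for every platform; the #ifdef lives in the body.
inline void eth_local_ip(char* buf, size_t len) {
    if (!buf || len == 0) return;
    buf[0] = '\0';  // default: empty string when not connected
#ifdef ARDUINO
    if (ETH.linkUp()) {
        String ip = ETH.localIP().toString();
        snprintf(buf, len, "%s", ip.c_str());
    }
#elif defined(IDF_VER)
    esp_netif_ip_info_t info;
    if (g_eth_netif && esp_netif_get_ip_info(g_eth_netif, &info) == ESP_OK &&
        info.ip.addr != 0) {
        esp_ip4addr_ntoa(&info.ip, buf, static_cast<int>(len));
    }
#else
    // PC build: nothing to query; buf stays empty.
#endif
}

}  // namespace pal
```

Callers see one declaration regardless of platform, so EthernetModule needs no conditional compilation of its own.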

Hardware variants and board-specific configuration¶

Two Ethernet hardware variants are supported. The selection is made at compile time via a flag defined in platformio.ini per board environment:

Board	Hardware	Interface	Flag
esp32dev	LAN8720	RMII (GPIO)	`-DPMM_ETH_LAN8720`
esp32s3_n16r8	W5500	SPI	`-DPMM_ETH_W5500`

Note: flags use the PMM_ prefix (e.g. PMM_ETH_LAN8720 not ETH_PHY_LAN8720) to avoid colliding with the eth_phy_type_t enum values of the same name in esp-idf.

Pin assignments (MDC, MDIO, PHY address for RMII; SCK, MISO, MOSI, CS, IRQ for SPI) are defined as compile-time constants in the same board-specific platformio.ini env, e.g.:

[env:esp32dev]
build_flags =
    -DPMM_ETH_LAN8720
    -DETH_RMII_MDC=23
    -DETH_RMII_MDIO=18
    -DETH_RMII_PHY_ADDR=1

[env:esp32s3_n16r8]
build_flags =
    -DPMM_ETH_W5500
    -DETH_SPI_SCK=12
    -DETH_SPI_MISO=13
    -DETH_SPI_MOSI=11
    -DETH_SPI_CS=10
    -DETH_SPI_IRQ=14

EthernetModule itself contains no pin numbers. It calls PAL functions; the PAL reads the compile-time constants and dispatches to the right hardware init.
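
A compile-time capability check driven by these flags might look like the sketch below. The has_ethernet() name comes from the PAL list in Part A; the mutual-exclusion guard and eth_variant_name() are assumptions added for illustration.

```cpp
// Guard against a board environment that selects both variants.
#if defined(PMM_ETH_LAN8720) && defined(PMM_ETH_W5500)
#error "Select at most one Ethernet variant per board environment"
#endif

namespace pal {

// True when an Ethernet variant was selected in platformio.ini.
constexpr bool has_ethernet() {
#if defined(PMM_ETH_LAN8720) || defined(PMM_ETH_W5500)
    return true;
#else
    return false;
#endif
}

// Hypothetical helper: human-readable variant name for logs.
constexpr const char* eth_variant_name() {
#if defined(PMM_ETH_LAN8720)
    return "LAN8720 (RMII)";
#elif defined(PMM_ETH_W5500)
    return "W5500 (SPI)";
#else
    return "none";
#endif
}

}  // namespace pal
```

On a build with neither flag defined (for example the PC build), has_ethernet() folds to false at compile time, so guarded code paths can be eliminated by the compiler.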

Part A: Ethernet PAL functions¶

Add to src/pal/Pal.h:

// Returns true if Ethernet hardware is compiled in for this board.
inline constexpr bool has_ethernet();

// Initialise the Ethernet peripheral. Called once from EthernetModule::setup().
// Returns true if the hardware was found and initialisation succeeded.
inline bool eth_init();

// True if Ethernet link is up and an IP address has been assigned (DHCP or static).
inline bool eth_is_connected();

// Write the current Ethernet IP address into buf (null-terminated). Empty string if not connected.
inline void eth_local_ip(char* buf, size_t len);

// Switch to DHCP client mode (default after eth_init).
inline void eth_set_dhcp();

// Set a static IP immediately. Disables DHCP client.
// Pass nullptr for gateway/subnet to use defaults (gw = ip with last octet 1, /24).
inline void eth_set_static_ip(const char* ip, const char* gateway, const char* subnet);

ARDUINO implementation: delegates to ETH.h (ETH.begin(...), ETH.config(...), ETH.localIP().toString()). eth_init() dispatches on PMM_ETH_LAN8720 vs PMM_ETH_W5500 at compile time to call the correct ETH.begin() overload.

IDF_VER implementation: uses esp_eth_driver_install, esp_netif_new, esp_eth_start, esp_event_handler_register(ETH_EVENT, ...). Static IP uses esp_netif_set_ip_info. DHCP client uses esp_netif_dhcpc_start.

PC stub: has_ethernet() returns false; all other functions are no-ops / return false / write empty strings.
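
The documented nullptr default for the gateway (same address with the last octet set to 1) can be derived with plain string parsing. defaultGatewayFor() is a hypothetical helper used only to illustrate that rule; how Pal.h actually computes the default is not shown in this document.

```cpp
#include <cstdio>
#include <string>

// Derive "a.b.c.1" from a dotted-quad "a.b.c.d".
// Returns an empty string when the input is not a valid IPv4 address.
inline std::string defaultGatewayFor(const std::string& ip) {
    unsigned a = 0, b = 0, c = 0, d = 0;
    char extra = 0;
    // Exactly four fields must parse; a fifth match means trailing garbage.
    if (std::sscanf(ip.c_str(), "%3u.%3u.%3u.%3u%c", &a, &b, &c, &d, &extra) != 4)
        return "";
    if (a > 255 || b > 255 || c > 255 || d > 255) return "";
    char buf[16];  // "255.255.255.1" plus terminator fits
    std::snprintf(buf, sizeof buf, "%u.%u.%u.1", a, b, c);
    return buf;
}
```

For the direct-connect defaults, an ESP32 address of 192.168.5.1 yields itself as the gateway, which matches the recommended configuration.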

Part B: EthernetModule implementation¶

Replace the current stub (src/modules/system/Ethernet.h) with a full implementation:

setup():

1. Calls pal::eth_init(). If it returns false, sets status_ = "init_failed" and returns.
2. If a static IP is configured (loaded from saved state), calls pal::eth_set_static_ip(...).
3. Otherwise calls pal::eth_set_dhcp().
4. Registers controls (see below).

loop():

- Polls pal::eth_is_connected() once per second (millis-based debounce).
- On state change: updates status_ and ip_address_ controls; calls pal::eth_local_ip().
- On PC: has_ethernet() is false, so loop() is a no-op beyond the guard.

isConnected() (used by NetworkModule): returns pal::eth_is_connected().

Controls:

key	type	description
`status`	display	`"disconnected"` / `"connecting"` / `"connected"` / `"init_failed"`
`ip_address`	display	Current IP (empty when disconnected)
`mode`	select	`"dhcp"` \| `"static"` (default: `"dhcp"`)
`static_ip`	text	Only active when `mode == "static"`
`static_gateway`	text	Only active when `mode == "static"`
`static_subnet`	text	Default `"255.255.255.0"`

onUpdate("mode") and onUpdate("static_ip") apply the new config immediately via PAL if Ethernet is already up.

healthReport(): "eth=connected ip=192.168.1.42" / "eth=disconnected" / "eth=unsupported" (PC).
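
The three healthReport() strings follow directly from two booleans and the current IP. A minimal sketch, assuming a free function ethHealthReport() with explicit inputs rather than the module's member state:

```cpp
#include <string>

// Build the health string from capability, link state, and current IP.
inline std::string ethHealthReport(bool hasEthernet, bool connected,
                                   const std::string& ip) {
    if (!hasEthernet) return "eth=unsupported";   // e.g. PC build
    if (!connected)   return "eth=disconnected";
    return "eth=connected ip=" + ip;
}
```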

Part C: Direct-connect mode (AP analog)¶

When Ethernet is wired directly between the ESP32 and a laptop (no router), there is no DHCP server to assign addresses. Static IP mode serves as the "AP analog" for Ethernet: set a known fixed IP on the ESP32, then manually configure a matching IP on the laptop.

Recommended defaults for direct-connect:

- ESP32 static IP: 192.168.5.1 / gateway 192.168.5.1 / subnet 255.255.255.0
- Laptop: 192.168.5.2 / subnet 255.255.255.0 (manual config in OS network settings)
- The device is then reachable at http://192.168.5.1

Relationship to WiFi AP: NetworkModule::manageWifi_() (Sprint 10) disables the WiFi AP whenever isConnected() reports Ethernet as connected. isConnected() returns the same value for a DHCP lease (router present) and for a static IP (direct connect), so the ticker already treats a static-IP link as "connected" and no code change is needed. Users who need the WiFi AP during direct connect are expected to manage it manually.

Link-local (169.254.x.x): lwIP supports APIPA link-local addressing. If DHCP fails and mode == "dhcp", the ESP32 may auto-assign a 169.254 address after timeout. Modern OSes do the same. This provides a zero-config direct-connect path without any manual IP setting, but the 169.254.x.x address is non-deterministic. Document this as an observed behavior, not a designed feature. Static mode is the designed direct-connect path.

Part D: ESP32-P4 — what will be needed¶

The ESP32-P4 has on-board GMAC Ethernet and uses an external WiFi/BT coprocessor (e.g., ESP32-C6) connected via SDIO. No ESP32-P4 hardware is targeted in this sprint. This section documents what a future sprint will need.

PlatformIO:

- New [env:esp32p4] and [env:esp32p4_eth] entries in platformio.ini.
- Board JSON files (esp32_p4_nano.json, esp32_p4_eth.json) in the PlatformIO boards directory or boards/ in the project.
- Framework: Arduino ESP32 core 3.x (P4 support landed in core 3.1); or bare IDF 5.3+.
- Build flag: -DETH_PHY_EMAC (P4 uses its own internal EMAC + external PHY, typically IP101).

PAL changes:

- Add ETH_PHY_EMAC branch in eth_init() inside both ARDUINO and IDF_VER sections.
- P4 EMAC uses esp_eth_mac_new_esp32, same esp_netif plumbing as classic ESP32, so the IDF_VER path is largely reusable.
- WiFi on P4 requires SDIO bootstrap before wifi_sta_connect() / wifi_ap_start() can be called. NetworkModule::setup() will need a pal::wifi_coprocessor_init() call (P4 only) before the existing WiFi init.
- Add #ifdef ESP_PLATFORM_P4 guards (or CONFIG_IDF_TARGET_ESP32P4 from sdkconfig) to any P4-specific bootstrap code.

What does NOT change: EthernetModule, NetworkModule, WiFi modules — they call PAL functions only. All hardware differences are absorbed in Pal.h.

Design decisions¶

Why one EthernetModule for both RMII and W5500? The module only calls PAL functions. The hardware difference is entirely inside eth_init(). Adding a second module type (EthernetW5500Module) would duplicate the status/IP/mode logic for no benefit.

Why put all pin constants in platformio.ini rather than a header? platformio.ini is the single source of truth for board configuration. Scattering pin assignments across headers creates inconsistency between boards. The PAL reads the CMake/PlatformIO defines directly.

Why require an IDF_VER body now, not later? The migration from Arduino to bare IDF is a future goal, not an immediate task. However, writing stub-only IDF_VER bodies now means they will rot — when the migration happens, every function needs rewriting anyway. Writing real bodies now keeps the IDF path exercisable and prevents silent compile failures when someone eventually enables IDF_VER on a board.

Why static IP instead of a DHCP server for direct connect? Running a DHCP server on the Ethernet netif requires esp_netif_dhcps_start() and a server config, which works but adds state complexity to EthernetModule. Static IP achieves the same goal with one PAL call. A DHCP server on Ethernet is noted as a future enhancement.
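A minimal sketch of the "one PAL call" idea, assuming PC-stub-style PAL functions that simply record what was requested. The parameter list of eth_set_static_ip(), the EthConfigLog recorder, and the example addresses are all assumptions made for illustration; only the function names, applyIpConfig_(), and the modeIdx_ == 1 convention come from this sprint's notes.

```cpp
#include <cstdint>
#include <string>

namespace pal {

// Hypothetical recorder so the sketch can be exercised on a PC.
struct EthConfigLog {
    bool dhcp = true;
    std::string ip, gateway, subnet;
};
inline EthConfigLog& eth_log() {
    static EthConfigLog log;
    return log;
}

// Assumed signatures; the real Pal.h parameters are not shown in this document.
inline void eth_set_dhcp() {
    eth_log().dhcp = true;
}
inline void eth_set_static_ip(const char* ip, const char* gw, const char* subnet) {
    eth_log().dhcp = false;
    eth_log().ip = ip;
    eth_log().gateway = gw;
    eth_log().subnet = subnet;
}

}  // namespace pal

// Simplified stand-in for EthernetModule's IP configuration path.
struct EthernetSketch {
    uint8_t modeIdx_ = 0;                    // 0 = DHCP, 1 = static
    std::string staticIp_ = "192.168.1.50";  // placeholder values only
    std::string staticGateway_ = "192.168.1.1";
    std::string staticSubnet_ = "255.255.255.0";

    void applyIpConfig_() {
        if (modeIdx_ == 1) {
            pal::eth_set_static_ip(staticIp_.c_str(), staticGateway_.c_str(),
                                   staticSubnet_.c_str());
        } else {
            pal::eth_set_dhcp();
        }
    }
};
```

Direct connect then needs no extra state in the module: switching modeIdx_ and calling applyIpConfig_() again is the whole transition.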

Definition of Done¶

[x] Pal.h: has_ethernet(), eth_init(), eth_is_connected(), eth_local_ip(), eth_set_dhcp(), eth_set_static_ip() — all three platform branches (ARDUINO, IDF_VER, PC stub) present and compiling
[x] platformio.ini: PMM_ETH_LAN8720 + RMII pin flags added to [env:esp32dev]; PMM_ETH_W5500 + SPI pin flags added to [env:esp32s3_n16r8] (flags use PMM_ prefix — see hardware section)
[x] src/modules/system/Ethernet.h: full implementation replacing stub — setup() init, loop() poll, isConnected(), status/ip/mode/static_ip/static_gateway/static_subnet controls, healthReport()
[x] onUpdate("mode") and onUpdate("static_ip") apply config live if Ethernet is already initialised
[x] NetworkModule: manageWifi_() treats eth_->isConnected() as real (no change to call site; Sprint 10 wired it; Sprint 11 provides the real value)
[x] Tests: test_network.cpp updated — add tests for isConnected() on PC, loadState/saveState round-trip, default static_subnet
[x] Tests: PC build compiles and all tests pass (Ethernet PAL stubs return safe values)
[x] docs/modules/network/ethernet-module.md: updated with real controls, DHCP/static modes, direct-connect setup instructions
[x] Direct-connect section added to docs/modules/network/ethernet-module.md (user chose ethernet-module.md only, not getting-started.md)
[x] esp32dev build compiles with PMM_ETH_LAN8720 (no hardware flash required in CI)
[x] esp32s3_n16r8 build compiles with PMM_ETH_W5500 (no hardware flash required in CI)
[x] docs/development/release-07.md: Part D (P4 requirements) written (this section)
[ ] ESP32 live tests pass on both devices — deferred; hardware will be tested in Sprint 12 (user intent: test with real hardware next sprint)

Result¶

Metric	Value
Unit tests	392/392 pass (new tests: `isConnected` on PC, `loadState`/`saveState` round-trip, default `static_subnet`)
PC live tests	7/7 scenarios pass (82 steps)
esp32dev build	1441 KB flash (78.5%), 64 KB RAM (20.0%)
esp32s3_n16r8 build	1427 KB flash (34.8%), 62 KB RAM (19.5%)
ESP32 live tests	deferred (devices unreachable during sprint; not a Sprint 11 regression)

Flash footprint increased by ~67 KB on both boards versus Sprint 10 (the SPI and Ethernet library sources are now compiled in for the W5500 path; the LAN8720 path increased by a similar amount, as ETH.cpp was always compiled).

See test-results.md and live-pc-macos.md.

Retrospective¶

What went well:

The three-way PAL platform switch pattern (ARDUINO / IDF_VER / PC stub) kept EthernetModule free of any #ifdef — all hardware differences absorbed in Pal.h. The has_ethernet() constexpr guard at the top of setup() was enough to keep PC tests clean without touching test code.
The PMM_ETH_ prefix decision was made early (prompted by an enum collision with eth_phy_type_t values of the same name in esp-idf) and avoided a subtle linker-time name conflict.
Static IP mode as the direct-connect path required no new module infrastructure — it is just applyIpConfig_() with modeIdx_ == 1. The "AP analog" pattern reused exactly the same control set as DHCP mode.

What was tricky:

PlatformIO LDF and the SPI linker gap. The LDF discovered and compiled ETH.cpp (because Pal.h includes ETH.h, which the chain+ LDF mode follows). SPI.h, however, was found via the CPPPATH entry added by the pre-script. That satisfied the #include without the LDF ever discovering the SPI library directory, so SPI.cpp was never compiled and the link failed. Setting EXTRA_CXXSRC (the wrong variable) had no effect. Fix: env.BuildSources() in add_spi_eth_path.py explicitly compiles SPI.cpp without conflicting with the LDF-managed ETH.cpp. This is the correct pattern when a library source file must be compiled but would otherwise be bypassed by a CPPPATH shortcut.
arduino-esp32 3.x ETH.begin() signature for W5500. The 10-parameter form is begin(eth_phy_type_t, phy_addr, cs, irq, rst, spi_host_device_t, sck, miso, mosi, freq_mhz) — not the 2.x order. Getting this wrong produced a compile error after correcting the SPI linker issue; checking the framework source directly was the only reliable way.

Seeds for Sprint 12:

Hardware live test of LAN8720 and W5500 with real boards (user intent: next sprint).
Evaluate an [env:esp32dev_idf] bare-IDF env for exercising the IDF_VER Ethernet path in CI (needs sdkconfig, no ESPAsyncWebServer; significant setup cost — may be a separate sprint).
DHCP server on Ethernet for direct-connect without static IP (noted as future enhancement).

Complexity estimate: Large (L).

Sprint 12: WiFi Reconnect and PAL Documentation¶

Scope: Automatic WiFi STA reconnect after signal loss or router reboot; re-enable STA when Ethernet drops; align all recovery timers at 30 s. Document the Arduino/IDF mixing strategy in pal.md following a design discussion about the three-way platform switch.

Identified from: Sprint 11 retrospective (hardware live test follow-up); design discussion on PAL architecture.

Summary¶

Part	Description	Est
A: WifiSta auto-reconnect	Retry connection every 30 s while enabled and disconnected	S
B: Network Ethernet-drop recovery	Re-enable STA when Ethernet transitions down; align grace period to 30 s	S
C: PAL architecture documentation	"Arduino, IDF, and mixing both" section in `pal.md`	S
Total		S

Part A: WifiSta auto-reconnect¶

WifiStaModule::loop() previously stopped retrying after a failed connection attempt. An else if branch added after the connect-polling block retries startConnect() every RETRY_INTERVAL_MS = 30000 ms while:

isEnabled() is true (NetworkModule disables STA when Ethernet is up; no retry while disabled)
Not currently in a connect attempt (!connecting_)
Has credentials (ssid_[0] != '\0')
WiFi hardware is available (pal::has_wifi())

startConnect() resets lastRetryMs_ so each new attempt starts a fresh 30 s window regardless of how much time had elapsed.
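The retry gate described above can be sketched as a small state machine. This is a simplified stand-in rather than the actual WifiStaModule: the clock is passed in as nowMs so the logic can run on a PC, plain bools replace isEnabled() and pal::has_wifi(), and the attempts counter exists only for illustration.

```cpp
#include <cstdint>

struct WifiStaRetrySketch {
    static constexpr uint32_t RETRY_INTERVAL_MS = 30000;

    bool enabled = true;      // stands in for isEnabled()
    bool hasWifi = true;      // stands in for pal::has_wifi()
    bool connected = false;
    bool connecting_ = false;
    char ssid_[33] = "";
    uint32_t lastRetryMs_ = 0;
    int attempts = 0;         // hypothetical counter, for demonstration only

    void startConnect(uint32_t nowMs) {
        connecting_ = true;
        lastRetryMs_ = nowMs;  // every attempt opens a fresh 30 s window
        ++attempts;
    }

    void loop(uint32_t nowMs) {
        if (connecting_) {
            // Connect polling elided: the real module checks the link status
            // here and clears connecting_ on success or failure.
        } else if (!connected && enabled && hasWifi && ssid_[0] != '\0' &&
                   nowMs - lastRetryMs_ >= RETRY_INTERVAL_MS) {
            // Unsigned subtraction keeps the comparison valid across
            // millisecond-counter wraparound.
            startConnect(nowMs);
        }
    }
};
```

Because the enabled check sits inside the retry condition, disabling STA from outside is enough to pause retries; no separate "retry allowed" flag is needed.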

Part B: Network Ethernet-drop recovery¶

Two changes to NetworkModule::manageWifi_():

ethWasConnected_ state tracking — detects the Ethernet up-to-down transition. On that tick, sta_->setControl("enabled", true) re-enables STA immediately so Part A's retry loop kicks in. Without this, STA stayed permanently disabled after Ethernet had been up.
STA_GRACE_MS reduced from 60 s to 30 s — the AP recovery timer now matches the STA retry interval. All three recovery events (STA retry, Ethernet-drop STA re-enable, AP open) now occur on 30 s boundaries.
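The up-to-down detection amounts to a one-bit edge detector. The sketch below illustrates the idea under simplifying assumptions: EthDropSketch is a hypothetical name, a plain staEnabled bool replaces sta_->setControl("enabled", ...), and the AP grace timer is left out.

```cpp
struct EthDropSketch {
    bool ethWasConnected_ = false;
    bool staEnabled = true;  // stands in for the STA module's "enabled" control

    // Called once per tick with the current Ethernet link state.
    void manageWifi(bool ethConnected) {
        if (ethConnected) {
            staEnabled = false;  // Ethernet is up: STA is gated off
        } else if (ethWasConnected_) {
            staEnabled = true;   // up-to-down transition: re-arm STA once
        }
        ethWasConnected_ = ethConnected;
    }
};
```

Re-enabling only on the transition tick, rather than whenever Ethernet is down, means a user who has switched STA off manually is not overridden on every loop.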

Part C: PAL architecture documentation¶

A new "Arduino, IDF, and mixing both" section added to docs/developer-guide/pal.md documents the outcome of a design discussion on the three-way ARDUINO / IDF_VER / PC switch:

The IDF_VER branch is dead code in all current builds (ARDUINO always matches first when framework = arduino)
Direct esp_* calls work inside framework = arduino builds — this is common practice for features the Arduino wrappers do not expose (power management, WiFi fine-tuning, P4 hardware)
Library compatibility: ESPAsyncWebServer and FastLED require Arduino.h; ArduinoJson v7 is framework-agnostic
The ESP_PLATFORM path: collapsing ARDUINO and IDF_VER into one branch for new PAL functions where no Arduino wrapper exists (P4 GMAC, codecs)
esp_netif_init / event loop double-init caveat when mixing IDF calls with Arduino WiFi init

Definition of Done¶

[x] WifiSta.h: auto-retry every RETRY_INTERVAL_MS = 30000 ms while enabled and disconnected
[x] Network.h: ethWasConnected_ added; STA re-enabled on Ethernet drop; STA_GRACE_MS = 30000
[x] docs/developer-guide/pal.md: "Arduino, IDF, and mixing both" section added
[x] All unit tests pass; PC and ESP32 builds clean
[x] PC live tests pass

Result¶

Metric	Value
Unit tests	392/392 pass (no new tests — retry logic is ARDUINO-only, not exercisable on PC)
PC live tests	7/7 scenarios pass
ArtNet two-device	PASS (esp32s3_n16r8 MM-70BC reached and received packets)
esp32dev build	1441 KB flash (78.5%), 64 KB RAM (20.0%) — unchanged from Sprint 11
esp32s3_n16r8 build	1427 KB flash (34.8%), 62 KB RAM (19.5%) — unchanged from Sprint 11
esp32dev live test	skipped (device unreachable — stale IP, not a sprint regression)

See test-results.md and live-pc-macos.md.

Retrospective¶

What went well:

The isEnabled() guard in the retry branch reuses the existing enabled base-class control, so NetworkModule's Ethernet-gating of STA (which sets enabled = false) automatically suppresses retries — no extra flag needed.
Reducing STA_GRACE_MS to 30 s to match RETRY_INTERVAL_MS was a one-line change that aligned the whole recovery model. All recovery events are now on the same cadence.
The PAL design discussion surfaced a useful clarification: the IDF_VER branch is currently dead code, but the right long-term strategy is ESP_PLATFORM for new functions rather than maintaining two separate branches that converge on the same IDF API.

What was tricky:

The ethWasConnected_ fix was the less obvious half of the reconnect story. STA retry alone would not have helped after Ethernet drops because STA had been disabled by NetworkModule while Ethernet was up — it would never retry while enabled = false. Tracking the Ethernet transition was required to re-arm STA.

Seeds for Sprint 13:

Hardware live test: connect LAN8720 (esp32dev) and verify Ethernet and WiFi reconnect behavior on real hardware.
Consolidation question investigated but deferred: merging WifiAp, WifiSta, Ethernet into one NetworkModule is not worth the cost (UI, testing, size). The circular include friction could be reduced by extracting deviceName() to a lightweight DeviceInfo.h.
PAL ESP_PLATFORM refactor: apply to new functions (P4 GMAC) when that hardware arrives; leave existing Arduino-wrapper functions unchanged.

Complexity estimate: Small (S).

Sprint 13: PAL Cleanup and Deploy Pipeline Fixes¶

Scope: Remove all IDF_VER branches from Pal.h; consolidate the status docs (remove deploy-summary.md); fix summarise.py overwriting per-env live results files; fix livetest.py overwriting logs when a device is unreachable.

Identified from: Sprint 12 retrospective (PAL cleanup); organic housekeeping on the deploy pipeline.

Summary¶

Part	Description	Est
A: Remove IDF_VER branches	Rewrite Pal.h to `#ifdef ARDUINO` / `#else` throughout; remove `_eth` IDF namespace; simplify `eth_init`	S
B: Update pal.md	Rename "Three-way" to "Two-way" table; remove "IDF migration path" section; add rule statement	XS
C: Consolidate status docs	Merge `deploy-summary.md` into `index.md`; remove the file; update all references	S
D: Deploy pipeline correctness	`summarise.py` stops overwriting per-env MD files; `livetest.py` skips unreachable devices without touching logs	S
Total		M

Part A: Pal.h rewrite¶

All #elif defined(IDF_VER) branches removed. Every function now follows:

#ifdef ARDUINO
    // Arduino ESP32 implementation
#else
    // PC / Raspberry Pi stub
#endif

The _eth namespace (IDF event-driven Ethernet state helpers: _EthEvent, eth_event_handler, ethState) was removed entirely. The eth_init function shrank from ~60 lines to ~5:

inline bool eth_init() {
#if defined(ARDUINO) && defined(PMM_ETH_LAN8720)
  return ETH.begin(ETH_PHY_LAN8720, ...);
#elif defined(ARDUINO) && defined(PMM_ETH_W5500)
  return ETH.begin(ETH_PHY_W5500, ...);
#else
  return false;
#endif
}

A comment added to the Ethernet section: future hardware (e.g. ESP32-P4 GMAC) that has no Arduino ETH.h wrapper should add a new PMM_ETH_* flag and use direct IDF calls inside the ARDUINO block.

Part B: pal.md update¶

"Three-way platform switch" table renamed to "Two-way platform switch"; IDF_VER row removed.
Rule statement added: use Arduino wrappers by default; fall back to direct IDF calls only when no Arduino wrapper exists; those calls go inside the ARDUINO block.
"IDF migration path" section removed (all items in that table were IDF_VER-specific stubs, now gone).

Part C: Status docs consolidation¶

docs/status/deploy-summary.md was a near-duplicate of docs/status/index.md. The deploy pipeline table (Build/Flash/Run/Live columns) was merged into index.md as a new ## Deploy summary section above the existing ## Test results table. deploy-summary.md was deleted and all 14 references across docs, scripts, and mkdocs.yml updated to point to status/index.md.

summarise.py was simplified accordingly: the _write_deploy_summary_md function was removed; its table-generation logic moved into _write_index_md. A ## Detail pages section is now appended to index.md on every run, listing all live-results-*.md files found on disk (not just those written in the current run), so results from previous hardware runs remain visible.

Part D: Deploy pipeline correctness¶

Two bugs where pipeline scripts silently destroyed previous good results:

summarise.py overwrote per-env MD files. live_suite.py writes docs/status/live-results-{env}.md directly after each run; it includes a ## Summary section (with per-test check counts) and a ## Scenarios section (per-step fps/heap data). summarise.py was independently re-generating these same files from the JSON, but using a simpler format without those sections. Fix: replaced _write_single_env_results / _write_live_results_md with _scan_live_files, which scans existing MD files on disk and returns their paths as links. summarise.py no longer writes per-env MD files at all.

livetest.py overwrote logs for unreachable devices. _run_esp32_test opened the log file for writing (truncating it) before attempting to connect, so an unreachable device always destroyed the previous run's log. Fix: a reachability probe (GET /api/system) runs before any file is opened; if it fails the device is skipped with a message and the log and JSON are left untouched.

Definition of Done¶

[x] Pal.h: no IDF_VER anywhere; all functions use #ifdef ARDUINO / #else
[x] pal.md: two-way switch table; rule statement; IDF migration section removed
[x] deploy-summary.md deleted; index.md has merged deploy + test tables and detail links
[x] summarise.py: no longer writes per-env MD files; _scan_live_files preserves live_suite.py output
[x] livetest.py: reachability probe before opening log; unreachable devices skip without touching files
[x] PC build clean; 392/392 unit tests pass; PC live tests pass
[x] esp32dev and esp32s3_n16r8 builds clean

Result¶

Metric	Value
Unit tests	392/392 pass
PC build	clean
PC live tests	15/15 pass
esp32s3_n16r8 live tests	15/15 pass
esp32dev build	1441 KB flash (78.5%), 64 KB RAM (20.0%) — unchanged
esp32s3_n16r8 build	1427 KB flash (34.8%), 62 KB RAM (19.5%) — unchanged
Pal.h line count	~800 (down from ~1510)
deploy-summary.md	removed; content merged into index.md

Retrospective¶

What went well:

The PAL rewrite was clean: IDF_VER branches were either identical to the Arduino path or simple stubs, so removing them caused zero regressions.
The _eth IDF namespace was entirely internal to PAL — no modules depended on it — making deletion safe.
The status consolidation caught two separate bugs in the deploy pipeline during the same session; fixing them together while the code was open was efficient.

What was tricky:

The summarise.py / live_suite.py split of responsibilities was not obvious: both were writing the same files, with summarise.py's version silently losing the Scenarios section. The fix required tracing the full data flow from JSON through both writers.
Sprint 11 scope document still references the old three-way pattern (PAL structure and the IDF_VER path background section). Left as-is since it accurately records the design as it stood when Sprint 11 was written.

Seeds for Sprint 14:

Hardware live test: connect LAN8720 (esp32dev) and verify Ethernet + WiFi reconnect behavior on real hardware.
DeviceInfo.h extraction: reducing circular include friction between NetworkModule children (WifiAp/WifiSta both include Network.h for deviceName()).

Complexity estimate: Medium (M).

Release 7 Backlog¶

All items consolidated into the cross-release backlog.
