Self-Hosted CI Runners for Embedded E2E Testing


Most embedded teams have CI. Very few have CI that touches hardware.

The standard setup compiles the firmware on a cloud runner, runs the unit tests in a simulator, and turns the badge green. Everything physical (flashing, boot, peripherals, power behaviour) is validated by a human, manually, sometime before release. Which means the most failure-prone layer of the product is the one layer with no automation.

The reason is not laziness. It is that a cloud runner cannot flash a board. The last mile of embedded CI is physical, so the runner has to be physical too. This post is the architecture we use for self-hosted runners wired to real hardware, and the operational practices that decide whether the setup stays trustworthy or rots into a flaky bench everyone ignores.

The Architecture

Five components, all cheap, all replaceable:

1. The runner host. A small Linux box (a used mini-PC or NUC-class machine is plenty), registered as a self-hosted runner with GitHub Actions, GitLab CI, or whatever orchestrator the team already uses. It sits physically next to the bench and runs the agent. Total cost is usually under €300.

2. The debug probe. A J-Link, ST-Link, or CMSIS-DAP probe permanently wired to the device under test (DUT), driven from the host by probe-rs, OpenOCD, or the vendor CLI. This is the flash-and-reset path. Use one probe per DUT and leave it connected: shared probes that get unplugged for desk debugging are the number one source of “the bench is broken again.”

3. Programmable power. A USB relay board or smart PDU that lets the pipeline hard power-cycle the DUT. This is non-negotiable. A wedged board that needs a human to pull the cable turns an automated pipeline back into a manual one. Power control is also what makes power-loss-during-update testing, arguably the test that matters most for OTA validation, scriptable.

4. Serial and instrument capture. Every DUT UART is permanently wired to the host (a multi-port USB-serial adapter covers a whole bench). Logs are captured for every job and uploaded as artifacts whether the job passes or fails. Add instruments (a logic analyser, a programmable supply with current logging) behind the same driver abstraction the bench already uses to talk to the probe and power control.

5. A golden unit. A known-good board, permanently wired, that runs a short self-test suite before any DUT job executes. If the golden unit fails, the pipeline reports bench failure, not test failure, and skips the run. This single practice is what separates benches teams trust from benches teams ignore: it converts “flaky tests” into a diagnosable, attributable signal.
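The golden-unit gate can be sketched in a few lines. This is a minimal illustration, not a definitive implementation: `run_gated`, `Verdict`, and the two callables are hypothetical names, and the sketch assumes the self-test and the DUT suite are each wrapped in a function that returns True on success.

```python
import enum
from typing import Callable


class Verdict(enum.Enum):
    PASS = "pass"
    TEST_FAILURE = "test failure"    # the firmware under test misbehaved
    BENCH_FAILURE = "bench failure"  # the bench is unhealthy, run skipped


def run_gated(golden_self_test: Callable[[], bool],
              dut_suite: Callable[[], bool]) -> Verdict:
    """Run the DUT suite only if the golden unit passes its self-test."""
    try:
        bench_ok = golden_self_test()
    except Exception:
        # A probe or serial error while talking to the golden unit is
        # treated as a bench problem, not a firmware problem.
        bench_ok = False
    if not bench_ok:
        return Verdict.BENCH_FAILURE  # the DUT suite is never started
    return Verdict.PASS if dut_suite() else Verdict.TEST_FAILURE
```

Keeping the verdict three-valued is the point: a red result then says whether to look at the firmware or at the wiring.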

The Pipeline Shape

Route by label, gate by stage:

  • Every commit: build + static analysis + unit tests on cloud runners as normal. Hardware never blocks the inner loop.
  • Every pull request: a hardware-labelled job queues for the self-hosted runner, which flashes the DUT, boots it, and runs the smoke suite (boot confirmation, peripheral init, one happy-path E2E flow). Ten minutes, not an hour.
  • Every release candidate: the full E2E suite, meaning OTA update and rollback under power interruption, peripheral exercise via instruments, and timing measurements against published thresholds. An hour is acceptable here.
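As a rough sketch, the three tiers above amount to a small routing table. The trigger names, stage names, and runner labels here are hypothetical, and the sketch assumes pull requests and release candidates also run the cloud stages; a real pipeline would express this in the orchestrator's own configuration.

```python
from typing import Dict, List, Tuple

CLOUD = "cloud"
HARDWARE = "hardware"  # hypothetical label carried by the self-hosted bench runner

Job = Tuple[str, str]  # (stage name, runner label)

INNER_LOOP: List[Job] = [
    ("build", CLOUD),
    ("static-analysis", CLOUD),
    ("unit-tests", CLOUD),
]

# Which stages run for which trigger (all names are illustrative).
PLAN: Dict[str, List[Job]] = {
    "commit": INNER_LOOP,
    "pull_request": INNER_LOOP + [("hw-smoke", HARDWARE)],
    "release_candidate": INNER_LOOP + [("hw-full-e2e", HARDWARE)],
}

# Time budgets for the hardware stages, in minutes, taken from the text above.
HW_BUDGET_MIN = {"hw-smoke": 10, "hw-full-e2e": 60}


def jobs_for(trigger: str) -> List[Job]:
    """Return the (stage, runner) pairs scheduled for a trigger."""
    return list(PLAN[trigger])


def needs_bench(trigger: str) -> bool:
    """True if any stage for this trigger must queue for the hardware runner."""
    return any(runner == HARDWARE for _, runner in jobs_for(trigger))
```

Nothing in the commit row carries the hardware label, which is what keeps the bench out of the inner loop.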

The discipline that makes this scale: one job per bench at a time. Hardware is a mutually exclusive resource. Use the orchestrator’s concurrency controls (runner groups, GitHub Actions concurrency groups, GitLab resource_group locks) so jobs queue rather than collide. Two jobs flashing one board simultaneously produce failures that look like firmware bugs and cost afternoons.
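The orchestrator's concurrency controls remain the primary mechanism. As a host-side backstop, the bench scripts can also hold an exclusive lock for the duration of a job. The sketch below is an assumption layered on top of the post's advice, not part of it: `bench_lock`, `BenchBusy`, and the lock path are hypothetical names, and it relies on `fcntl.flock`, so it assumes a Linux (or other Unix) runner host.

```python
import contextlib
import fcntl
import os
import tempfile

# Hypothetical lock location; one lock file per bench.
DEFAULT_LOCK = os.path.join(tempfile.gettempdir(), "hwci-bench0.lock")


class BenchBusy(RuntimeError):
    """Raised when another job already holds the bench."""


@contextlib.contextmanager
def bench_lock(path: str = DEFAULT_LOCK, blocking: bool = True):
    """Hold an exclusive advisory lock on the bench while the body runs."""
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        flags = fcntl.LOCK_EX | (0 if blocking else fcntl.LOCK_NB)
        try:
            fcntl.flock(fd, flags)
        except BlockingIOError as exc:
            raise BenchBusy(path) from exc
        try:
            yield
        finally:
            fcntl.flock(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)
```

With `blocking=True` a second job simply waits its turn; with `blocking=False` it fails fast with `BenchBusy`, which is easier to attribute than a corrupted flash. The kernel releases the lock if the holding process dies, so a crashed job does not leave the bench locked.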

The Operational Practices That Decide Everything

The hardware above is a weekend of work. These practices are the difference between a runner that holds up for years and one that gets abandoned in a quarter.

Recover by power, not by hope. Every job starts with a hard power-cycle and ends with one. Every job has a timeout. A hung DUT must never require a human walk to the bench.
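A minimal sketch of that bracket, assuming the relay board or PDU is driven by some callable (`power_cycle` here is a hypothetical stand-in, as is `run_job`) and the test suite runs as a child process:

```python
import subprocess
import sys
from typing import Callable, Sequence


def run_job(cmd: Sequence[str],
            power_cycle: Callable[[], None],
            timeout_s: float) -> str:
    """Run one hardware job bracketed by hard power-cycles, with a timeout."""
    power_cycle()  # start from a known power state
    try:
        # subprocess.run kills the child if the timeout expires.
        subprocess.run(cmd, check=True, timeout=timeout_s)
        return "pass"
    except subprocess.TimeoutExpired:
        return "timeout"
    except subprocess.CalledProcessError:
        return "fail"
    finally:
        power_cycle()  # leave the DUT recovered for the next job
```

The `finally` clause is what guarantees the closing power-cycle on a hang or a failure, so the next job never inherits a wedged board.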

Artifacts always, especially on failure. Serial logs, probe output, instrument captures, uploaded on every run. A hardware test failure without its serial log is a failure you will reproduce manually, which defeats the point.
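One way to make collection unconditional is to put it in a `finally` block and make the collector tolerant of missing files. This is a sketch under assumptions: the function names are hypothetical, and it assumes the logs are plain files the bench scripts have already written to disk, with the orchestrator's own artifact-upload step picking up the destination directory afterwards.

```python
import shutil
from pathlib import Path
from typing import Callable, Iterable, List


def collect_artifacts(sources: Iterable[Path], dest: Path) -> List[Path]:
    """Copy every log that exists into dest; skip missing ones silently."""
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in sources:
        if src.is_file():
            copied.append(Path(shutil.copy2(src, dest / src.name)))
    return copied


def run_with_artifacts(test: Callable[[], None],
                       sources: Iterable[Path],
                       dest: Path) -> bool:
    """Run a test and collect its logs whether it passes or raises."""
    try:
        test()
        return True
    except Exception:
        return False
    finally:
        collect_artifacts(sources, dest)
```

Skipping missing files matters: a job that died before the serial capture started should still deliver whatever logs do exist.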

Ephemeral workspaces. Self-hosted runners keep state between jobs by default; firmware builds that “pass” because of a stale artifact from the previous job are a recurring trap. Clean the workspace every run, or run the agent in a container that maps through only the specific USB and serial devices the job needs (probe, serial adapters, relay board) rather than running privileged.
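For the clean-every-run option, the simplest form is a workspace that exists only for the duration of the job. A minimal sketch, with `ephemeral_workspace` as a hypothetical helper name; the container-based option is configured in the orchestrator and container runtime and is not shown here:

```python
import contextlib
import tempfile
from pathlib import Path


@contextlib.contextmanager
def ephemeral_workspace(prefix: str = "hwci-"):
    """Yield a fresh, empty directory and delete it when the job ends."""
    with tempfile.TemporaryDirectory(prefix=prefix) as tmp:
        yield Path(tmp)
```

Because each job gets a new directory, a firmware image left behind by the previous job cannot be picked up by accident.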

Treat the bench as code. Wiring documented in the repo. Test scripts in the same repo as the firmware. Udev rules and runner config in version control. The standard from the HIL bench post applies doubly here: if only one engineer can revive the runner, the runner has failed.

Lock down what the runner can run. This is the security item teams skip. A self-hosted runner executes whatever the pipeline sends it, which, on a public repository, can include code from a fork PR. Never attach self-hosted runners to public-repo PR triggers; on private repos, restrict the runner to the firmware repository and require approval for first-time contributors. The runner is a machine on your network with USB access to your hardware. Treat it like one.

What It Costs and What It Returns

A complete setup (mini-PC, probe, relay board, USB-serial adapter, golden unit) lands under €500 in hardware on top of a bench that exists anyway, plus one to two weeks of engineering to wire the pipeline and write the smoke suite. The return shows up the first time a release candidate fails the power-interruption OTA test in CI instead of in the field.

There is also a compounding effect with how teams develop now. AI-assisted development is increasing the volume of firmware shipped per week, and the bug classes it introduces are disproportionately timing races, peripheral edge cases, and power-state bugs, the ones only real hardware catches. A hardware-attached runner is the cheapest way to put real silicon in the path of every pull request, which, in 2026, is no longer a luxury practice.




Frequently Asked Questions

Why use a self-hosted runner for embedded CI?

Because cloud runners cannot flash, power-cycle, or observe a physical board. Build and unit-test stages belong on cloud runners; anything that touches real hardware (flashing, boot verification, peripheral testing, OTA validation) requires a runner physically wired to the device under test.

What hardware does a hardware-attached CI runner need?

A small Linux host registered as the runner, a permanently wired debug probe (J-Link, ST-Link, or CMSIS-DAP), programmable power control (USB relay or smart PDU), permanent serial capture on every DUT UART, and a golden-unit board for bench self-tests. Typical cost is under €500 on top of an existing bench.

How do you stop hardware CI tests from being flaky?

Run a golden-unit self-test before every DUT job so bench failures are reported as bench failures rather than test failures; hard power-cycle the DUT at the start and end of every job; enforce one job per bench with concurrency controls; clean the workspace every run; and upload serial logs as artifacts on every run so failures are diagnosable without manual reproduction.

Are self-hosted CI runners a security risk?

They can be. A self-hosted runner executes whatever the pipeline sends it, so it must never be attached to public-repository pull-request triggers, should be restricted to the specific firmware repository, and should require approval for workflows from first-time contributors. Treat the runner as a networked machine with physical access to your hardware.

How long do hardware tests add to a pipeline?

With staged gating, very little where it matters: the per-PR smoke suite (flash, boot, one end-to-end flow) should run in around ten minutes, while the full E2E suite, including OTA-under-power-interruption testing, runs only on release candidates, where an hour is acceptable.

