System Design

System design is the top-level view of a robot: what the major pieces are, who owns what, how they pass information, and why they run at different rates. If you can look at a robotics stack and explain the boundary between sensing and estimation, how planning hands off to control, and what happens when data goes stale, you can reason about the system before reading a line of code.

1. A Robot Is Not One Loop

The first mental shift in robotics system design is giving up the idea that a robot is one big sense–think–act loop that reads sensors, makes a decision, and sends commands. Real robots are usually several loops running at different rates, because different parts of the problem live on different time scales: mission logic updates every second or so, planners run at a few hertz, perception runs at sensor rate, estimators track the dynamics, and controllers close the loop at high frequency. Slower layers decide what to do; faster layers decide how.

Figure: Control Hierarchy

The exact rates vary across robots, but the overall pattern does not: control runs fastest, estimation follows the dynamics, perception is bounded by sensors and compute, and planning and mission logic run more slowly because they reason over longer horizons. Those differences come from physics, latency, and system requirements, not coding style.

This is also why you can't just call the slow layer from inside the fast one. A controller running at 1 kHz cannot block on a 30 Hz perception update: one perception period is about 33 ms, so a controller that waited for it would miss roughly 32 of its own deadlines in a row. The usual pattern is a shared latest-value slot: the slow producer writes whenever it has something new, the fast consumer reads the most recent value without waiting for the producer, and a timestamp lets the consumer notice when the value is stale.

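#include <atomic>
#include <chrono>

struct Pose { double x, y, yaw; };                 // placeholder; must be trivially copyable
struct Stamped { Pose value; double t_sec; };

// Written by perception, read by the controller. Starts stale (t_sec far in the
// past) so the controller stays in safe mode until the first real pose arrives.
// Caveat: std::atomic on a struct this size may use an internal lock; check
// latest_pose.is_lock_free(), or use a seqlock or double buffer for hard real time.
std::atomic<Stamped> latest_pose{Stamped{Pose{0.0, 0.0, 0.0}, -1.0e9}};
std::atomic<bool> running{true};

double now() {  // monotonic clock, in seconds
    using clock = std::chrono::steady_clock;
    return std::chrono::duration<double>(clock::now().time_since_epoch()).count();
}

// run_perception(), compute(), send_command(), enter_safe_mode(), and
// sleep_until_next_tick(rate_hz) are assumed to be defined elsewhere.

// Perception thread @ 30 Hz
void perception_loop() {
    while (running) {
        Pose p = run_perception();
        latest_pose.store(Stamped{p, now()});
        sleep_until_next_tick(30);
    }
}

// Controller thread @ 1 kHz
void controller_loop() {
    while (running) {
        Stamped s = latest_pose.load();                // never waits for perception
        if (now() - s.t_sec > 0.1) enter_safe_mode();  // freshness check: 100 ms budget
        else send_command(compute(s.value));
        sleep_until_next_tick(1000);
    }
}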

The controller never waits on perception, but it also never silently acts on a pose older than its 100 ms freshness budget. The timestamp is what makes the decoupling safe.

Two distinctions matter early. Perception is not estimation: detecting a lane marker is different from estimating vehicle pose. And planning is not control: deciding where to go is different from generating the fast actuator commands that make the robot follow that decision.

2. Interfaces Are The Real Architecture

System design is mostly boundary design. A module stays swappable only if its inputs are explicit, its outputs are typed, and its responsibilities don't leak into neighboring layers.

Good:
camera -> perception -> planner -> controller

Bad:
camera -> planner
UI -> controller
controller -> map internals

Good boundaries let you replace a camera, planner, or estimator without rewriting the rest of the stack. Bad boundaries create hidden coupling: a planner reaches into sensor internals, a UI writes actuator commands directly, or a controller depends on map-building details it should never know about. This is why robotics diagrams matter. They are not decoration; they expose ownership boundaries.

At the code level, those same boundaries are function signatures. The planner asks for exactly what it needs and nothing else:

# Clean — depends on a typed world model.
def plan(world: WorldModel, goal: Goal) -> Trajectory: ...

# Leaky — depends on three layers the planner should never know about.
def plan(world, camera_driver, imu_serial_port, ros_node): ...

The leaky version is no longer a planner. It's a planner-plus-half-the-driver-stack, and swapping the camera or rewiring the IMU now breaks the planning logic. The clean version doesn't care where its WorldModel came from: a stereo camera, a lidar, a simulator, or a recorded log.
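As a rough sketch of what "doesn't care where its WorldModel came from" can look like in practice, the example below defines the world model as a `typing.Protocol` and feeds the same planner from two different sources. Everything here is hypothetical and chosen for illustration: the `robot_position` and `is_free` methods, the `SimWorld` and `LoggedWorld` classes, and the toy straight-line planner are assumptions, not part of any particular robotics library.

```python
from dataclasses import dataclass
from typing import Protocol

Point = tuple[float, float]


@dataclass(frozen=True)
class Goal:
    position: Point


@dataclass(frozen=True)
class Trajectory:
    waypoints: list[Point]


class WorldModel(Protocol):
    """The only view of the world the planner is allowed to depend on."""

    def robot_position(self) -> Point: ...
    def is_free(self, p: Point) -> bool: ...


def plan(world: WorldModel, goal: Goal, steps: int = 4) -> Trajectory:
    """Toy planner: walk a straight line to the goal, stop at the first blocked point."""
    x0, y0 = world.robot_position()
    x1, y1 = goal.position
    waypoints: list[Point] = []
    for i in range(1, steps + 1):
        p = (x0 + (x1 - x0) * i / steps, y0 + (y1 - y0) * i / steps)
        if not world.is_free(p):
            break
        waypoints.append(p)
    return Trajectory(waypoints)


class SimWorld:
    """World model backed by a simulator: a pose plus a set of blocked points."""

    def __init__(self, position: Point, blocked: set[Point]):
        self._position = position
        self._blocked = blocked

    def robot_position(self) -> Point:
        return self._position

    def is_free(self, p: Point) -> bool:
        return p not in self._blocked


class LoggedWorld:
    """World model backed by a recorded pose; the log holds no obstacles."""

    def __init__(self, recorded_position: Point):
        self._position = recorded_position

    def robot_position(self) -> Point:
        return self._position

    def is_free(self, p: Point) -> bool:
        return True


sim_plan = plan(SimWorld((0.0, 0.0), blocked={(3.0, 0.0)}), Goal((4.0, 0.0)))
log_plan = plan(LoggedWorld((1.0, 1.0)), Goal((1.0, 5.0)))
```

The planner code is identical for both sources. Swapping the simulator for a log replay only changes which object is passed in, which is the property the clean signature is meant to protect.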

3. Rate, Latency, And Freshness Budgets

Once the modules are separated, the next question is not just what talks to what, but how fast and how stale is still acceptable. Every important module has both a nominal rate and a latency budget.

Module	Typical Rate	If Stale...
Controller	~1 kHz	Tracking degrades immediately
Estimator	~100 Hz	Downstream state becomes wrong
Perception	~10-30 Hz	World model goes stale
Planner / mission	~1-5 Hz	Robot can often coast briefly

A 1 kHz controller should not block waiting on a 30 Hz perception output. A planner can update slowly as long as a faster lower layer can keep the robot stable in the meantime. A stale pose estimate, on the other hand, can poison everything downstream. In robotics, a perfect answer that arrives too late is often worse than a rough answer that arrives on time.
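One way to make these budgets concrete is to write them down as data and check them in a single place. The sketch below is illustrative only: the rates loosely follow the table above, but the maximum ages (roughly three missed periods each) are made-up numbers, and the names `Budget`, `BUDGETS`, and `stale_modules` are assumptions rather than an established API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Budget:
    rate_hz: float    # nominal update rate
    max_age_s: float  # how old the latest output may be before it counts as stale


# Rates loosely follow the table above. The max ages are assumed values
# (about three missed periods), not recommendations.
BUDGETS = {
    "controller": Budget(rate_hz=1000.0, max_age_s=0.003),
    "estimator": Budget(rate_hz=100.0, max_age_s=0.03),
    "perception": Budget(rate_hz=30.0, max_age_s=0.1),
    "planner": Budget(rate_hz=5.0, max_age_s=0.6),
}


def stale_modules(last_update_s: dict[str, float], now_s: float) -> list[str]:
    """Return the modules whose latest output is older than its budget."""
    stale = []
    for name, budget in BUDGETS.items():
        last = last_update_s.get(name)
        # A module that has never published is treated as stale.
        if last is None or now_s - last > budget.max_age_s:
            stale.append(name)
    return stale


# Example: perception last published 250 ms ago, everything else is on time.
last_seen = {"controller": 9.999, "estimator": 9.99, "perception": 9.75, "planner": 9.8}
report = stale_modules(last_seen, now_s=10.0)
```

Keeping the budgets in one table means a rate change is a data change, and no module has to hardcode an assumption about how fast its neighbors run.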

4. Design For Degraded Modes

The final question in any system design is not just "how does this work when everything is healthy?" but "what happens when part of it breaks?" Real systems need an answer for dropped sensors, delayed perception, crashed planners, and stale state.

Figure: Degraded Modules

This is where safety authority becomes explicit. A well-designed system is clear about which module can command a stop, which inputs are optional, and what to do when something fails hard enough to need a fallback. Some faults should trigger a controlled stop; others should degrade performance while keeping the robot stable. You can't prevent all failures. The goal is that when they happen, the system does something predictable.

One concrete example makes the interaction between layers easier to see. Picture a small indoor delivery robot moving medicine between rooms in a hospital:

Module	Input	Output	Rate	If It Fails...
Mission manager	operator request, task queue	next delivery goal	1 Hz	robot stops taking new jobs
Planner	map, current pose, goal	short-horizon path	5 Hz	robot can briefly hold last safe plan
Perception	depth camera, lidar	obstacle tracks, free space	15 Hz	world model goes stale quickly
Estimator	wheel odometry, IMU, landmarks	robot pose and velocity	100 Hz	every downstream decision gets worse
Controller	latest pose, target trajectory	wheel velocity commands	200 Hz	tracking degrades immediately

When everything is healthy, the mission manager picks the next room, the planner turns that into a path through the hall, perception marks carts and people as obstacles, the estimator keeps the robot localized, and the controller tracks the path.

Now break one thing: the depth camera freezes for 300 ms. A well-designed system does not let that failure blur across the whole stack. Perception stops publishing fresh obstacle updates. The planner can keep its last path for a moment, but only within a freshness budget. The controller keeps running from the latest valid trajectory instead of blocking. If perception stays stale past the allowed timeout, safety authority escalates: either the planner commands a stop because the world model is stale, or a supervisor module forces safe mode directly. That's system design in practice: not just the boxes, but who can keep going, for how long, and who gets to say stop.
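The escalation in this scenario can be sketched as a small pure function from "how old is the latest perception output" to "what is the robot allowed to do". The text fixes only the 300 ms freeze; the thresholds `FRESH_MS` and `STOP_TIMEOUT_MS`, the `Action` names, and the 50 ms replay step are assumptions picked to make the example readable. The sketch also does not latch the stop, which a real supervisor might well do.

```python
from enum import Enum


class Action(Enum):
    NOMINAL = "nominal"                # perception is fresh: plan and drive normally
    HOLD_LAST_PLAN = "hold_last_plan"  # stale, but still inside the freshness budget
    SAFE_STOP = "safe_stop"            # stale past the timeout: supervisor forces a stop


FRESH_MS = 100         # assumed: outputs this recent count as fresh
STOP_TIMEOUT_MS = 250  # assumed: beyond this age the last plan is no longer trusted


def supervise(perception_age_ms: int) -> Action:
    """Map the age of the latest perception output to an allowed behavior."""
    if perception_age_ms <= FRESH_MS:
        return Action.NOMINAL
    if perception_age_ms <= STOP_TIMEOUT_MS:
        return Action.HOLD_LAST_PLAN
    return Action.SAFE_STOP


def replay_freeze(freeze_start_ms: int, freeze_len_ms: int, end_ms: int, step_ms: int = 50):
    """Replay a perception freeze and record the supervisor's decision at each step.

    Perception is assumed to publish at every step unless it is frozen.
    """
    last_pub_ms = 0
    decisions = []
    for t_ms in range(0, end_ms + 1, step_ms):
        frozen = freeze_start_ms <= t_ms < freeze_start_ms + freeze_len_ms
        if not frozen:
            last_pub_ms = t_ms
        decisions.append((t_ms, supervise(t_ms - last_pub_ms)))
    return decisions


# The depth camera freezes for 300 ms, starting at t = 100 ms.
timeline = replay_freeze(freeze_start_ms=100, freeze_len_ms=300, end_ms=500)
```

In this replay the robot keeps driving on the last plan for a short window, is forced to stop once the data is older than the timeout, and returns to nominal when fresh perception arrives. Which module owns `supervise` (the planner or a separate supervisor) is exactly the safety-authority question the section raises.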

Assignment

The Air Traffic Control Tracker: you're given a radar (50 Hz, noisy), a GPS transponder (5 Hz, accurate), a Kalman filter, and a live 3D visualization. Your job is to implement one method, tick(), that fuses both sensors without blocking, tracks when readings go stale, and degrades gracefully when a sensor dies. Seven graded scenarios each target a specific failure mode (sensor death, timing jitter, hardcoded rate assumptions, and more). A second required question asks what it actually takes to add a new sensor: where it gets wired in, what changes in the config contract, and how partial failure should be surfaced.

Go to gtcloudrobotics/air-traffic-control-tracker, click "Use this template" to make your own copy, then clone and push. The autograder runs on every push, and you'll see pass/fail in the Actions tab. Enrolled GT students: send me your GitHub username at the start of the semester so I can match your repo to your grade.