Debugging & Fault Analysis
Executive Summary
When an STM32 crashes, it triggers a hardware exception (HardFault, MemManage, BusFault). Unlike a desktop OS that segfaults and keeps running, an embedded controller must either reboot immediately (Watchdog) or freeze for inspection.
ArduPilot on ChibiOS implements a robust fault capture system that saves the CPU state to persistent storage before rebooting, allowing post-mortem debugging without JTAG.
Theory & Concepts
1. The HardFault Exception
The ARM Cortex-M core jumps to the HardFault_Handler vector when an illegal operation occurs (e.g., accessing invalid memory, dividing by zero, executing a non-thumb instruction).
- The Stack Frame: The CPU automatically pushes
R0-R3,R12,LR,PC, andxPSRto the stack. - Analysis: By reading the Program Counter (
PC) from the stack, we know exactly where the crash happened.
2. The Watchdog
ArduPilot uses the STM32 Independent Watchdog (IWDG).
- Timeout: Typically 2 seconds.
- The Pat:
Scheduler::watchdog_pat()resets the timer. - Thread Lockup: If the main loop freezes (e.g., infinite loop), the pat stops, and the IWDG resets the chip.
Codebase Investigation
1. The Fault Handler: HardFault_Handler
Located in libraries/AP_HAL_ChibiOS/system.cpp.
- It copies the stack frame to
struct port_extctx. - It calls
save_fault_watchdog(), which writes the registers tohal.util->persistent_data(Backup SRAM or RTC registers). - On the next boot, this data is logged to the SD card (
INTERNAL_ERRORmessages).
2. The Monitor Thread: _monitor_thread
Located in libraries/AP_HAL_ChibiOS/Scheduler.cpp.
- A dedicated high-priority thread that wakes up every 10ms.
- Deadlock Detection: It checks if the main loop hasn't run for > 200ms.
- Mitigation: If it detects a semaphore deadlock, it attempts to "Force Release" the mutex (
try_force_mutex()) to unstick the system before the watchdog bites.
Source Code Reference
- System Faults:
libraries/AP_HAL_ChibiOS/system.cpp - Watchdog/Monitor:
libraries/AP_HAL_ChibiOS/Scheduler.cpp
Practical Guide: Analyzing a Crash
1. "Internal Error" Message
If your HUD says Internal Error 0x... or the logs show WDOG (Watchdog) events:
- Get the Log: Download the
.binDataFlash log. - Look for
MSG: It might print "HardFault at 0x0804..." - Addr2Line: Run
arm-none-eabi-addr2line -e arducopter.elf 0x0804...to find the exact C++ line number.
2. Using CrashCatcher
For deep debugging, enable AP_CRASHDUMP_ENABLED in hwdef.dat.
- Behavior: Instead of rebooting, the board freezes and blinks the LED.
- Recovery: Use
Tools/scripts/CrashCatcher.pywith a UART adapter to dump the entire RAM state for analysis in GDB.
3. Stack Overflow
If the system reboots randomly during complex missions:
- Check:
SYS_STACK_FREEin the logs. - Fix: Increase stack sizes in
Scheduler.cpp(THD_WORKING_AREA) orhwdef.dat(PROCESS_STACK).