Watchdog & System Faults
Executive Summary
When the autopilot reboots unexpectedly or disarms mid-air, the Watchdog and Fault logs are the first place to look. These logs capture the "Black Box" data at the moment of failure.
- Watchdog (WDOG): The CPU locked up, and the independent hardware timer reset it.
- Fault (ERR): A software subsystem reported a critical failure.
Theory & Concepts
1. The Watchdog Timer
The STM32 has an independent hardware timer that counts down. The main loop must "pet" (reset) this timer every loop. If the CPU freezes (infinite loop, DMA lockup), the timer expires and forces a hard reset.
- Log: WDOG.
- Data: Captures the Program Counter (PC) and Stack Pointer (SP) at the moment of death.
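The countdown-and-pet behaviour can be illustrated with a small simulation. This is only a toy model in Python, not ArduPilot or STM32 code: the class name WatchdogTimer, the run helper, and the tick-based timeout are all hypothetical names chosen for illustration.

```python
class WatchdogTimer:
    """Toy model of a hardware watchdog: a countdown that must be reloaded."""

    def __init__(self, timeout_ticks):
        self.timeout_ticks = timeout_ticks
        self.remaining = timeout_ticks
        self.reset_fired = False

    def pet(self):
        # Called by a healthy main loop to reload the countdown.
        self.remaining = self.timeout_ticks

    def tick(self):
        # Driven by the "hardware" clock, independently of the main loop.
        if self.reset_fired:
            return
        self.remaining -= 1
        if self.remaining <= 0:
            self.reset_fired = True


def run(loop_stalls_at, total_ticks, timeout_ticks=3):
    """Simulate total_ticks clock ticks.

    The main loop pets the watchdog on every tick until loop_stalls_at
    (None means it never stalls). Returns the tick at which the watchdog
    fired, or None if it never fired.
    """
    wdog = WatchdogTimer(timeout_ticks)
    for t in range(total_ticks):
        if loop_stalls_at is None or t < loop_stalls_at:
            wdog.pet()
        wdog.tick()
        if wdog.reset_fired:
            return t
    return None
```

With a healthy loop, run(None, 10) never fires. If the loop stalls at tick 4, the countdown is no longer reloaded and the simulated reset follows a couple of ticks later.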
2. The HardFault Handler
If code tries to access invalid memory (Null Pointer, Stack Overflow), the CPU triggers a HardFault. ArduPilot catches this, saves the register state to a special area of RAM (no-init), reboots, and then writes it to the log on the next boot.
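The save, reboot, and log-on-next-boot sequence can be sketched as follows. This is a Python illustration of the general pattern, not ArduPilot's implementation: the MAGIC value, the three-word record layout, and the function names are assumptions made for the example, and a bytearray stands in for the no-init RAM region.

```python
import struct

# Hypothetical marker ("WDOG" in ASCII); not the value ArduPilot uses.
MAGIC = 0x57444F47
# Hypothetical record layout: magic, pc, sp as little-endian 32-bit words.
RECORD = struct.Struct("<III")


def save_fault(noinit, pc, sp):
    # Fault-handler side: stash registers where a reboot will not clear them.
    RECORD.pack_into(noinit, 0, MAGIC, pc, sp)


def boot(noinit, log):
    # Boot side: ordinary RAM is re-initialised, the no-init region is not.
    magic, pc, sp = RECORD.unpack_from(noinit, 0)
    if magic != MAGIC:
        return False  # clean boot, nothing to report
    log.append({"type": "WDOG", "pc": pc, "sp": sp})
    # Clear the marker so the same fault is reported only once.
    RECORD.pack_into(noinit, 0, 0, 0, 0)
    return True
```

A usage example: create noinit = bytearray(RECORD.size) and an empty list as the log, call save_fault(noinit, 0x08001234, 0x20001000), then call boot(noinit, log). The first boot appends one record; a second boot finds the marker cleared and reports nothing.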
Codebase Investigation
1. Watchdog Logging
Located in libraries/AP_Logger/AP_Logger.cpp.
- On boot, AP_Logger checks if a watchdog reset occurred (hal.util->was_watchdog_armed()).
- If yes, it writes a WDOG message containing the fault registers.
2. Error Subsystems: Log_Write_Error
Located in ArduCopter/Log.cpp (and other vehicles).
- Subsystem: Where the error occurred (Compass, GPS, EKF, etc.), e.g., ERROR_SUBSYSTEM_FAILSAFE_RADIO.
- Error Code: Specific failure type within that subsystem.
Source Code Reference
- Logger Implementation: libraries/AP_Logger/AP_Logger.cpp
Practical Guide: The "Crash" Log
1. "Internal Error"
- Search the log for MSG lines containing "Internal Error".
- Example: Internal Error: 0x8000000 (map_fail).
- Action: This is almost always a firmware bug. Report it to the developers with the log.
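When filing a report it can help to state which bits of the hexadecimal mask are set. The sketch below pulls the mask out of a MSG text line and lists the set bit positions. It assumes the text looks like the example above; the function name internal_error_bits is hypothetical, and mapping bit positions to named errors is firmware-specific and deliberately not attempted here.

```python
import re

# Matches the hex mask in text such as "Internal Error: 0x8000000 (map_fail)".
_MASK_RE = re.compile(r"Internal Error\w*:?\s*(0x[0-9a-fA-F]+)", re.IGNORECASE)


def internal_error_bits(msg_text):
    """Return the sorted bit positions set in the mask, or [] if none is found."""
    match = _MASK_RE.search(msg_text)
    if match is None:
        return []
    mask = int(match.group(1), 16)
    # Test each bit position up to the highest set bit.
    return [bit for bit in range(mask.bit_length()) if mask & (1 << bit)]
```

For the example line above, the mask 0x8000000 has a single bit set, at position 27.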
2. The Watchdog Reset
- If WDOG is present, the board reset in flight.
- Task: SchR (Scheduler Overrun). If this is high just before the reset, the CPU was overloaded.
3. "Subsys" Codes
- 1: Main (Never good).
- 2: Radio (receiver problem, e.g., late RC frames).
- 3: Compass (Mag failure).
- 4: OptFlow.
- 5: FailSafe (Radio/RC failsafe).
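A small lookup table makes these codes easier to read when scanning many ERR records. The sketch below covers only the five IDs listed above and assumes the record exposes its subsystem and error code as two integers, labelled here Subsys and ECode; the function name describe_err is hypothetical.

```python
# Subsystem IDs from the list above; other IDs exist but are not covered here.
SUBSYSTEMS = {
    1: "Main",
    2: "Radio",
    3: "Compass",
    4: "OptFlow",
    5: "FailSafe",
}


def describe_err(subsys, ecode):
    """Format one ERR record as readable text."""
    name = SUBSYSTEMS.get(subsys, f"Unknown subsystem {subsys}")
    return f"ERR: {name} (Subsys={subsys}, ECode={ecode})"
```

Unknown IDs are reported rather than silently dropped, so codes added by newer firmware still show up in the output.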