MAVLINKHUD

Watchdog & System Faults

Executive Summary

When the autopilot reboots unexpectedly or disarms mid-air, the Watchdog and Fault logs are the first place to look. These logs capture the "Black Box" data at the moment of failure.

  • Watchdog (WDOG): The CPU locked up, and the independent hardware timer reset it.
  • Fault (ERR): A software subsystem reported a critical failure.

Theory & Concepts

1. The Watchdog Timer

The STM32 has an independent hardware watchdog timer (the IWDG) that counts down continuously. The main loop must "pet" (reload) this timer on every iteration. If the CPU freezes (infinite loop, DMA lockup), the timer expires and forces a hard reset.

  • Log: WDOG.
  • Data: Captures the Program Counter (PC) and Stack Pointer (SP) at the moment of death.
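
The countdown-and-pet behaviour can be illustrated with a small toy model. This is only a sketch of the concept, not ArduPilot or STM32 code: the class name `Watchdog`, the helper `run`, and the tick counts are all assumptions made for the example.

```python
class Watchdog:
    """Toy model of an independent countdown watchdog (not ArduPilot code)."""

    def __init__(self, timeout_ticks):
        self.timeout_ticks = timeout_ticks
        self.counter = timeout_ticks
        self.reset_fired = False

    def pet(self):
        # Reload the counter; a healthy main loop calls this on every pass.
        self.counter = self.timeout_ticks

    def tick(self):
        # Driven by the "hardware" clock, independent of the main loop.
        if self.reset_fired:
            return
        self.counter -= 1
        if self.counter <= 0:
            self.reset_fired = True


def run(stall_at=None, timeout_ticks=3, total_ticks=10):
    """Advance the clock total_ticks times.

    The simulated main loop pets the watchdog once per tick until it
    stalls at tick number stall_at (None means it never stalls).
    Returns True if the watchdog forced a reset.
    """
    dog = Watchdog(timeout_ticks)
    for t in range(total_ticks):
        if stall_at is None or t < stall_at:
            dog.pet()
        dog.tick()
    return dog.reset_fired
```

With these assumed values, `run()` models a healthy loop and never triggers a reset, while `run(stall_at=4)` models a loop that freezes and is reset once the counter runs out.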

2. The HardFault Handler

If code tries to access invalid memory (Null Pointer, Stack Overflow), the CPU triggers a HardFault. ArduPilot catches this, saves the register state to a special area of RAM (no-init), reboots, and then writes it to the log on the next boot.
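
The save, reboot, and log-on-next-boot sequence can be sketched as follows. This is a simplified model under stated assumptions, not the firmware's implementation: a plain dictionary stands in for the no-init RAM region, and the names `MAGIC`, `hardfault_handler`, and `boot`, as well as the register values, are hypothetical.

```python
MAGIC = 0xDEADBEEF  # hypothetical marker meaning "a fault record is present"


def hardfault_handler(noinit, registers):
    """Store the register snapshot in memory that survives a reboot."""
    noinit["magic"] = MAGIC
    noinit["registers"] = dict(registers)


def boot(noinit, log):
    """Runs on every start-up; reports a saved fault record exactly once."""
    if noinit.get("magic") == MAGIC:
        regs = noinit["registers"]
        fields = " ".join(f"{name}=0x{value:08X}" for name, value in sorted(regs.items()))
        log.append("FAULT " + fields)
        noinit.clear()  # consume the record so it is not logged again
    return log


# Usage: a clean boot logs nothing; a boot after a fault logs the snapshot.
noinit_ram = {}
log_lines = []
boot(noinit_ram, log_lines)
hardfault_handler(noinit_ram, {"pc": 0x08001234, "lr": 0x08000FFF})
boot(noinit_ram, log_lines)
```

The marker value is what lets the boot code tell a real saved record from whatever happens to be in uninitialised memory after a cold power-up.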

Codebase Investigation

1. Watchdog Logging

Located in libraries/AP_Logger/AP_Logger.cpp.

  • On boot, AP_Logger checks if a watchdog reset occurred (hal.util->was_watchdog_reset()).
  • If yes, it writes a WDOG message containing the fault registers.

2. Error Subsystems: Log_Write_Error

Located in ArduCopter/Log.cpp (and other vehicles).

  • Subsystem: Where the error occurred (Compass, GPS, EKF, etc.).
  • Error Code: Specific failure type within that subsystem (e.g., ERROR_CODE_FAILSAFE_OCCURRED, logged under the ERROR_SUBSYSTEM_FAILSAFE_RADIO subsystem).

Practical Guide: The "Crash" Log

1. "Internal Error"

  • Search the log for MSG lines containing "Internal Error".
  • Example: Internal Error: 0x8000000 (map_fail).
  • Action: This is almost always a firmware bug. Report it to the developers with the log.
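
When reporting an internal error, it can help to state which bits of the hex value are set. The sketch below assumes the value is a bitmask with one bit per error type. The mapping from bit position to error name depends on the firmware version and is not reproduced here, and the function name `set_bits` is made up for this example.

```python
def set_bits(mask):
    """Return the positions (0 = least significant) of all set bits in mask."""
    bits = []
    position = 0
    while mask:
        if mask & 1:
            bits.append(position)
        mask >>= 1
        position += 1
    return bits


print(set_bits(0x8000000))  # prints [27]
```

A value with several bits set means several distinct internal errors were recorded, so each bit should be looked up separately.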

2. The Watchdog Reset

  • If a WDOG message is present, the watchdog reset the board before this boot, possibly in flight.
  • Task: SchR (Scheduler Overrun). If this is high just before the reset, the CPU was overloaded.

3. "Subsys" Codes

  • 1: Main (Never good).
  • 2: Radio (RC receiver error, such as a late frame).
  • 3: Compass (Mag failure).
  • 4: OptFlow.
  • 5: FailSafe Radio (RC failsafe).
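
A lookup table like the one above is easy to apply when scanning many ERR entries. The following is a minimal sketch: the short labels are taken from the list above, the field names `Subsys` and `ECode` are assumed to match the ERR message in your log, and `describe_err` is a hypothetical helper. Check the codes against the definitions in your firmware version before relying on them.

```python
# Short labels for the subsystem codes listed above (verify against your firmware).
SUBSYS_NAMES = {
    1: "Main",
    2: "Radio",
    3: "Compass",
    4: "OptFlow",
    5: "FailSafe",
}


def describe_err(subsys, ecode):
    """Format one ERR entry, falling back to 'Unknown' for unlisted codes."""
    name = SUBSYS_NAMES.get(subsys, f"Unknown({subsys})")
    return f"ERR Subsys={subsys} ({name}) ECode={ecode}"


print(describe_err(3, 1))  # prints ERR Subsys=3 (Compass) ECode=1
```

Codes missing from the table are reported as unknown rather than guessed, which keeps the output honest when a newer firmware adds subsystems.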