Debugging Process
This section describes the general process of debugging the robot. This process may involve multiple disciplines. Students are highly encouraged to learn more than one discipline because, when the robot does not work, you do not know in advance whether the problem is mechanical, electrical or in the code. Knowing just one discipline leaves you poorly equipped to find the root cause. For example, one of the most common problems is a subsystem that does not respond to control. A programmer may spend hours looking through the code trying to figure out why it is not controlling the mechanism, when in reality the cause could be as simple as an unplugged motor. Therefore, when something is not working, you need to understand how the mechanism works as a whole: mechanically, electrically and programmatically.
The most useful debugging technique is divide and conquer. To apply this technique, you need the complete picture of how control flows through the mechanism. For a motor-driven mechanism, the control path is:
1. Code reading gamepad controls.
2. Code sending gamepad values to robot controller.
3. Robot controller sending control signals to motor controller.
4. Motor controller sending electricity to the motor.
5. Motor moving the mechanism.
With this complete picture, you can pick a point where it is easy to tell whether the control signal has arrived. For example, point 3 above is the robot controller sending control signals to the motor controller. Ask yourself: how can you tell whether the motor controller received the signal from the robot controller? For a TalonSRX motor controller used in FRC, you can tell by looking at the LED on the controller. If the motor controller receives a forward signal, the LED should flash green. If it receives a reverse signal, it should flash red. If this is indeed the case, you can rule out points 1 to 3, so the problem is not in the code.
If the motor controller does not have a status light (e.g. FTC motor controllers), you may pick point 4 instead. The question then becomes: how can you tell whether the motor controller is sending electricity to the motor? You can check this by plugging a known-good motor into the motor controller and seeing whether the code spins it. If it does, the problem is the original motor. If it does not, the problem is upstream of the motor: either the motor controller itself (point 4) or somewhere in points 1 to 3. You can also check the Dashboard for the subsystem status to see whether the code successfully read the gamepad control and sent the value to the subsystem (points 1 and 2) and, if you have limit switches, whether they are in a state that prevents the motor from moving.
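As a minimal sketch of the divide-and-conquer idea, the search for the first broken stage can be written as a bisection over the control path. The class name `SignalChainBisect`, the stage labels and the `signalReached` check are hypothetical; the check stands in for whatever observation you can make at each point (an LED, a known-good motor, a Dashboard value). The sketch assumes that if the signal reached one stage, it also reached every stage before it.

```java
import java.util.List;
import java.util.function.IntPredicate;

/** Hypothetical helper, not part of any library: bisects the control path. */
final class SignalChainBisect {
    // Stage indices are 0-based here, so index 2 corresponds to point 3 in the text.
    static final List<String> STAGES = List.of(
        "code reads gamepad",
        "code sends value to robot controller",
        "robot controller signals motor controller",
        "motor controller powers motor",
        "motor moves mechanism");

    /**
     * Returns the index of the first stage the signal did not reach,
     * or STAGES.size() if the signal reached every stage.
     * Assumes: if the signal reached stage i, it reached all earlier stages.
     */
    static int firstBrokenStage(IntPredicate signalReached) {
        int lo = 0;
        int hi = STAGES.size();
        while (lo < hi) {
            int mid = (lo + hi) / 2;
            if (signalReached.test(mid)) {
                lo = mid + 1;   // signal arrived here: the fault is downstream
            } else {
                hi = mid;       // signal missing here: the fault is here or upstream
            }
        }
        return lo;
    }

    public static void main(String[] args) {
        // Example observation: the motor controller LED flashes (stages 0 to 2 are fine),
        // but a known-good motor does not spin.
        int broken = firstBrokenStage(i -> i < 3);
        System.out.println(broken < STAGES.size()
            ? "First broken stage: " + STAGES.get(broken)
            : "Signal reached every stage");
    }
}
```

In practice each check is a manual observation rather than a lambda, but the reasoning is the same: every check at a well-chosen point removes about half of the remaining suspects.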
Once it has been determined that the code is the culprit, it needs to be debugged and fixed. It is often tempting for programmers to hypothesize a cause and write a hack without proving the actual cause. Sometimes the hack appears to address the symptom, but it is most likely the wrong fix. For example, when the robot goes in the opposite direction in autonomous, programmers often just find a place to negate a value to force the robot to go the correct direction, without understanding why it was going the wrong way in the first place. This video humorously describes that exact problem-solving mentality.
The following are typical bugs you will encounter:
- Code is crashing: The code is throwing an Exception. This is the most common and the easiest type of bug to fix because, when an Exception occurs, you get a stack trace that shows the reason and the exact line of code that threw the Exception. It also shows the chain of calls leading to that line. The most common Exception is NullPointerException. For example, in Java a field declared to hold an object defaults to null until an object is assigned to it. If the field is used before it has been assigned, a NullPointerException will be thrown.
- Code is hung in TeleOp: The robot stopped responding to human input. Apply the divide and conquer technique described above to diagnose the root cause.
- Code is hung in Autonomous: This is typically caused by an asynchronous operation that never completes. Check the Dashboard to see which state Autonomous is stuck in and which operation it was performing in that state. Then figure out why the operation is not completing. Typically, it is a PID operation that hangs because improper PID tuning leaves a steady-state error larger than the allowed tolerance. The solution is either to re-tune PID for a stronger response or to add a timeout to the operation as a safety measure. Refer to the PID tuning section for more information.
- Unexpected code behavior: This is typically caused by a logic error in the code. Use the Dashboard or Debug Tracing to identify where in the code the erroneous operation is performed. Once the code location is identified, trace through the logic to figure out why it performs the erroneous operation. Once the problem is understood, formulate a proper fix that considers all corner cases.
- Robot lost communication: This is generally an electrical issue caused by a power interruption to the robot radio. The root cause may be in the wiring: the power wire or connector to the radio is not secured, or the wires are routed so tautly that any impact to the robot disconnects power. In FTC, it is also commonly caused by Electrostatic Discharge (ESD). An FTC robot running on the field mat builds up static electric charge and discharges it into a metal object it hits. This causes the Control Hub to malfunction and drop the WiFi connection. Examine the wire path powering the radio and make sure it has sufficient slack. Also examine the power and network connectors to make sure they are securely plugged in and have strain relief. For the ESD problem in FTC, make sure the Resistive Ground Strap is installed.
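For the hung Autonomous case above, the following is a minimal sketch of adding a timeout to an operation that waits for PID to reach its target. The names `TimedOperation`, `onTarget`, `start` and `isDone` are hypothetical and are not the Framework Library's API; the sketch assumes the autonomous loop calls `isDone` periodically and passes in the current time in seconds.

```java
/** Hypothetical helper, not a library class: an operation that gives up after a timeout. */
final class TimedOperation {
    private final double timeoutSec;
    private double startTimeSec;

    TimedOperation(double timeoutSec) {
        this.timeoutSec = timeoutSec;
    }

    /** PID is considered on target when the absolute error is within tolerance. */
    static boolean onTarget(double error, double tolerance) {
        return Math.abs(error) <= tolerance;
    }

    /** Records when the operation started. */
    void start(double nowSec) {
        startTimeSec = nowSec;
    }

    /** True once the timeout has elapsed since start(). */
    boolean hasTimedOut(double nowSec) {
        return nowSec - startTimeSec >= timeoutSec;
    }

    /** The state machine may advance when the target is reached OR the timeout expires. */
    boolean isDone(boolean onTarget, double nowSec) {
        return onTarget || hasTimedOut(nowSec);
    }

    public static void main(String[] args) {
        // A steady-state error of 1.5 with a tolerance of 1.0 never reaches the target,
        // so without the timeout this state would never finish.
        TimedOperation drive = new TimedOperation(3.0);
        drive.start(0.0);
        boolean reached = onTarget(1.5, 1.0);
        System.out.println("done at t=1.0? " + drive.isDone(reached, 1.0));
        System.out.println("done at t=3.0? " + drive.isDone(reached, 3.0));
    }
}
```

A timeout only keeps Autonomous from hanging. If `hasTimedOut` is what ended the operation, that is still a sign the PID tuning deserves a second look.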
When the code is not behaving correctly, you need to apply the following debugging process:
- Identify the code that was performing the unexpected operation.
- Trace that code to understand why it is performing the unexpected operation.
- Once the root cause is understood, formulate a proper fix and code it.
- Test the fix to prove that the code is now behaving properly.
- Make sure the fix works in all possible scenarios by running the fixed code in all code paths.
- If some scenarios are still not behaving correctly, repeat this process until everything works as expected.
- Add detailed comments in the code to explain the issue and how the fix remedies the problem.
- Before checking in the final fix, have a mentor or peer review the code.
- Check in the fix and add check-in notes on what the fix is for.
To understand the root cause of a bug, you need to trace through the code to find out why it is behaving erroneously. There are three ways to trace through the code.
- Real Time Debugging: Set breakpoints and step through the code in real time. Generally, this is not the preferred way in robotics because, if you step through code that turns on a motor, the motor stays on for as long as you are stepping until the code turns it off. If the motor is driving an arm or elevator, the mechanism will have gone beyond its position limit by then. This approach is only suitable if the code does not involve anything that is time sensitive.
- Dashboard: When the robot is not behaving as expected, you may want to check the state of the subsystems. The Framework Library provides a Dashboard mechanism that lets you display the values of variables. For example, when the elevator does not move while you command it with a joystick, the Dashboard may show that one of the limit switches is malfunctioning and preventing the elevator from moving.
- Trace Logging: Do a postmortem analysis of the trace log. The Framework Library provides Debug Tracing, which lets you log events and variable values to the debug console as well as to a log file. Even after the erroneous event has happened, you can look through the trace log to understand exactly what happened.
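The Framework Library's own Debug Tracing calls are not shown in this section, so the following is only a generic sketch of the idea behind trace logging: record timestamped events and variable values as they happen, then filter them afterwards. The class `TraceLog` and its methods are hypothetical and keep entries in memory instead of writing to a console or file.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

/** Hypothetical in-memory trace log, not the Framework Library's API. */
final class TraceLog {
    private final List<String> entries = new ArrayList<>();

    /** Records one event with a timestamp (seconds), a subsystem tag and a formatted message. */
    void trace(double timeSec, String tag, String format, Object... args) {
        String message = String.format(Locale.ROOT, format, args);
        entries.add(String.format(Locale.ROOT, "[%.3f] %s: %s", timeSec, tag, message));
    }

    /** Returns a read-only view of all entries in the order they were logged. */
    List<String> entries() {
        return Collections.unmodifiableList(entries);
    }

    /** Postmortem filter: returns only the entries logged under the given tag. */
    List<String> withTag(String tag) {
        List<String> matches = new ArrayList<>();
        for (String entry : entries) {
            if (entry.contains("] " + tag + ": ")) {
                matches.add(entry);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        TraceLog log = new TraceLog();
        log.trace(1.25, "Elevator", "power=%.2f lowerLimit=%b", 0.5, true);
        log.trace(1.30, "Intake", "power=%.2f", 1.0);
        // After the run, look only at what the elevator was doing.
        for (String entry : log.withTag("Elevator")) {
            System.out.println(entry);
        }
    }
}
```

Logging the inputs that gate an action (here the joystick power and the limit switch state) alongside the action itself is what makes the log useful after the fact.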