Heisenbug hunting from the trenches
Andrei Terechko
Vector Fabrics
Did you ever spend a full week hunting for the root cause of a segmentation fault? Did the bug disappear when you tried to look at it in a debugger? Congratulations: you hit a veritable “Heisenbug” - one that disappears when you look at it. Your customers are weary of being used as guinea pigs and of unpredictable delivery delays, and your team just wants to move on to the next project. Yet your software engineering setup is a textbook example of continuous integration, test-driven design, and advanced static analysis tools. So why the field recalls, missed deadlines, and debugging nightmares?
Today’s software is inherently dynamic: it reacts to outside events, its processing depends on user input, and it manages memory dynamically. With event-driven and multithreaded software, the variations to test have become endless. Dynamic behavior requires dynamic program analysis for bug hunting. Established static analysis techniques look only at the static code structure and have tunnel vision of the program. In the context of dynamic behavior (e.g. pointers, virtual functions, dynamic loop bounds), static analysis often either reports bogus bugs or simply misses real ones.
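To make the "dynamic loop bounds" case concrete, here is a minimal sketch in C. All names (`CAPACITY`, `access_log`, `checked_read`, `sum_prefix`) are hypothetical and invented for illustration; they are not taken from the talk or from any particular tool. The loop bound `n` is only known at run time, so the code structure alone does not reveal whether the loop stays inside the buffer. The checked accessor is a simplified stand-in for the kind of instrumentation a run-time checker might add, not a description of how a real tool is implemented.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define CAPACITY 8  /* hypothetical fixed buffer size */

/* Record kept by the (hypothetical) instrumentation. */
typedef struct {
    bool   violated;   /* true once an out-of-bounds read was attempted */
    size_t bad_index;  /* first offending index */
} access_log;

/* Instrumented read: checks the index against the real buffer size
 * before touching memory, and logs the first violation it sees. */
static int checked_read(const int *buf, size_t capacity, size_t i,
                        access_log *log)
{
    if (i >= capacity) {
        if (!log->violated) {
            log->violated = true;
            log->bad_index = i;
        }
        return 0;  /* skip the illegal access instead of performing it */
    }
    return buf[i];
}

/* Sums the first n elements. n is only known at run time (for example,
 * a length field read from input), so nothing in the static code
 * structure says whether n <= capacity holds. */
int sum_prefix(const int *buf, size_t capacity, size_t n, access_log *log)
{
    assert(buf != NULL && log != NULL);
    int sum = 0;
    for (size_t i = 0; i < n; ++i) {
        sum += checked_read(buf, capacity, i, log);
    }
    return sum;
}
```

Called with `n = 4` on an 8-element buffer, the run is clean and the log stays empty. Called with `n = 10`, the log records index 8 as the first out-of-bounds read. In uninstrumented code that same read might silently return garbage, crash, or appear to work, depending on what happens to sit next to the buffer in memory.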
Andrei presents a case study of how dynamic analysis accurately detects critical errors in industrial software stacks. The story is littered with real-life, embarrassing bugs that went undetected by established methods, including static analysis and unit testing. Andrei demonstrates that dynamic analysis tooling can be engineered to effectively unravel Heisenbugs long before they hit the field. So, no false positives, no project delays, no field recalls.