SPIN Bug Museum

This page is a memorial to the nasty bugs we have solved in the course of constructing SPIN. To qualify as nasty, a problem must of the hard to reproduce variety -- it makes the system ``flakey''. Easily reproduced problems do not qualify since you can solve it simply by putting the circumstances under the microscope. Elusive bugs that do not sit still under the microscope make the hunt much more challenging. Other bugs that may qualify include those that occur in situations, such as boot, when most of the usual debugging tools are unavailable.

Do not understimate the difficulty and hard work that solved these bugs just because their descriptions are short and the solution clear cut and the moral is cute.

Flakey TCP

Symptoms
TCP connections mysteriously failed and device drivers handling specifically tcp packets would fail.
Tracking
Since substituting spls for the locks improved the situation, Marc figured implementing the lock as in osf solve it. The problem with spls was that tcp would sleep and spl drops during the context switch.
Moral of the story
a lock does not an spl make

NARROW failed

Symptoms
The same NARROW call would raise an exception for the same variable sometimes and not at other times.
Tracking
Stepped throught the NARROW assembly often enough to catch it fail. Turns out that a dyn link was in progress in another thread the times it failed and the problem was that not all of the new m3 runtime state was protected by a lock.
Moral of the story
messing with runtime globals considered harmful

GC clobbered memory

Symptoms
When GC was enabled, pointers would occasionally get NIL, badaddr, or unaligned faults and kernel data was occasionally corrupted.
Tracking
Follow on each occurance as closely as possible. Usually the extra debug info or gdb usage changed the symptoms, so most attempts led nowhere. Finally a chain of events was devised that led to a reproducible fault that still faulted when prints were added to find the location. Turned out that GC was not looking at the entire thread stack for pointers and was collecting objects in use higher in the stack.
Moral of the story
GC is hard

Bad Access

Symptoms
Tracking
Moral of the story
Don't mess with Texas or UX.

Template

Symptoms
Tracking
Moral of the story


becker@cs.washington.edu