SPIN Bug Museum
This page is a memorial to the nasty bugs we have solved in the course
of constructing SPIN. To qualify as nasty, a problem must of
the hard to reproduce variety -- it makes the system ``flakey''.
Easily reproduced problems do not qualify since you can solve it simply
by putting the circumstances under the microscope. Elusive bugs that do
not sit still under the microscope make the hunt much more challenging.
Other bugs that may qualify include those that occur in situations, such as
boot, when most of the usual debugging tools are unavailable.
Do not understimate the difficulty and hard work that solved these bugs
just because their descriptions are short and the solution clear cut and
the moral is cute.
Flakey TCP
- Symptoms
-
TCP connections mysteriously failed and device drivers
handling specifically tcp packets would fail.
- Tracking
-
Since substituting spls for the locks improved the
situation, Marc figured implementing the lock as in osf solve it. The
problem with spls was that tcp would sleep and spl drops during the
context switch.
- Moral of the story
-
a lock does not an spl make
NARROW failed
- Symptoms
- The same NARROW call would raise an exception for the same
variable
sometimes and not at other times.
- Tracking
- Stepped throught the NARROW assembly often enough to
catch it fail. Turns out that a dyn link was in progress in another
thread the times it failed and the problem was that not all of the new
m3 runtime state was protected by a lock.
- Moral of the story
- messing with runtime globals considered harmful
GC clobbered memory
- Symptoms
- When GC was enabled, pointers would occasionally get NIL,
badaddr, or unaligned faults and kernel data was occasionally corrupted.
- Tracking
- Follow on each occurance as closely as possible. Usually
the extra debug info or gdb usage changed the symptoms, so most attempts
led nowhere. Finally a chain of events was devised that led to a
reproducible fault that still faulted when prints were added to find the
location. Turned out that GC was not looking at the entire thread stack
for pointers and was collecting objects in use higher in the stack.
- Moral of the story
- GC is hard
Bad Access
- Symptoms
-
- Tracking
-
- Moral of the story
- Don't mess with Texas or UX.
Template
- Symptoms
-
- Tracking
-
- Moral of the story
-