[Rhythmbox-devel] first kill for the Cooperative Bug Isolation Project



Some of you may already be aware of the Cooperative Bug Isolation 
Project: <http://www.cs.berkeley.edu/~liblit/sampler/>.  This is a 
research project at UC Berkeley and Stanford University that tries to 
find bugs by identifying statistically significant differences in 
program behavior between good and bad runs.  We have a public arm, which 
offers instrumented binaries for popular GNOME packages for anyone to 
download and use.  We have also been using our approach together with 
scripted runs to generate large numbers of feedback reports quickly.

I am pleased to report that Cooperative Bug Isolation has killed its 
first GNOME bug.  As described in 
<http://bugzilla.gnome.org/show_bug.cgi?id=137460>, mining data from 
scripted Rhythmbox runs reveals a mismanaged timeout event source ID 
that can result in a fairly large number of crashes.  These crashes 
result from memory corruption, so the post-crash stack is essentially 
useless.  Our feedback instrumentation identifies the problem by 
revealing that crashes are far more likely when a specific 
g_source_remove() call on a specific line of code returns values greater 
than zero.

The same problem actually appears twice in Rhythmbox; the second 
instance may be responsible for previously reported bug 
<http://bugzilla.gnome.org/show_bug.cgi?id=130788>.

I bring this to the <gnome-bugsquad> and <rhythmbox-devel> lists' 
attention for two reasons.  First, Luis Villa told me that this is cool 
enough that I should spread the word.  :-)  Second, if this timeout 
mismanagement bug appeared twice in Rhythmbox, then it may appear in 
other code too.

The mistake is to keep around a timeout event source ID after the 
corresponding timeout callback has returned FALSE.  When the callback 
returns FALSE, the timeout event source is implicitly destroyed.  That 
means that this event source's ID number is no longer valid.  Keeping it 
around is the ID number equivalent of having a dangling pointer.

What we see in Rhythmbox is that the event source ID is being retained 
in a private field of an object even after the callback has returned 
FALSE.  Later on, that object might decide to destroy the event source 
by calling g_source_remove() on this stale ID.  If the ID number has 
been reassigned to some other event source, that other source will be 
prematurely destroyed.  g_source_remove() returns a gboolean, so a 
positive (TRUE) return value indicates that it did find and destroy 
some unlucky event source.  In the case of Rhythmbox, I see this in the 
form of increased likelihood of crashing when one particular 
g_source_remove() call returns positive values.
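
To make the failure mode concrete, here is a minimal sketch in plain C. 
It is a toy model, not GLib and not the Rhythmbox code: the toy_source_* 
functions, struct player, and the "lowest free slot" ID reuse policy are 
all assumptions invented for illustration.  Real GLib assigns IDs 
differently, but the hazard is the same whenever a destroyed source's ID 
is handed out again.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model (hypothetical): IDs are slot numbers 1..MAX_SOURCES and a
 * freed ID is reused by the next add.  ID 0 means "no source". */
enum { MAX_SOURCES = 4 };
static bool slot_in_use[MAX_SOURCES + 1];

void toy_reset(void)
{
    for (unsigned id = 0; id <= MAX_SOURCES; id++)
        slot_in_use[id] = false;
}

/* Register a new source; returns its ID, or 0 if the table is full. */
unsigned toy_source_add(void)
{
    for (unsigned id = 1; id <= MAX_SOURCES; id++) {
        if (!slot_in_use[id]) {
            slot_in_use[id] = true;
            return id;
        }
    }
    return 0;
}

/* Destroy the source with this ID; true if one was found and destroyed. */
bool toy_source_remove(unsigned id)
{
    if (id == 0 || id > MAX_SOURCES || !slot_in_use[id])
        return false;
    slot_in_use[id] = false;
    return true;
}

bool toy_source_is_alive(unsigned id)
{
    return id >= 1 && id <= MAX_SOURCES && slot_in_use[id];
}

/* Hypothetical object that keeps a timeout ID in a private field. */
struct player {
    unsigned timeout_id;
};

/* Models a callback returning FALSE: the source is destroyed, but the
 * stored ID is left behind (the bug). */
void buggy_timeout_fired(struct player *p)
{
    assert(p->timeout_id != 0);
    toy_source_remove(p->timeout_id);
}

/* Same, but the stored ID is cleared before the callback finishes. */
void fixed_timeout_fired(struct player *p)
{
    assert(p->timeout_id != 0);
    toy_source_remove(p->timeout_id);
    p->timeout_id = 0;
}

/* Later cleanup: remove the timeout if the object thinks one is pending. */
bool player_cancel_timeout(struct player *p)
{
    if (p->timeout_id == 0)
        return false;
    bool removed = toy_source_remove(p->timeout_id);
    p->timeout_id = 0;
    return removed;
}

/* Runs the whole scenario; true if an unrelated source survives cleanup. */
bool unrelated_source_survives(bool use_fix)
{
    toy_reset();
    struct player p = { toy_source_add() };
    if (use_fix)
        fixed_timeout_fired(&p);
    else
        buggy_timeout_fired(&p);
    unsigned other = toy_source_add();  /* reuses the freed ID in this model */
    player_cancel_timeout(&p);
    return toy_source_is_alive(other);
}
```

In this model, unrelated_source_survives(false) reports that the 
unrelated source was destroyed, while unrelated_source_survives(true) 
reports that it survived.  The analogous habit in real GLib code would 
be to reset the stored ID to 0 before the timeout callback returns 
FALSE, and to call g_source_remove() only when the stored ID is nonzero; 
this works because GLib never hands out 0 as a source ID.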

I'm going to repost this description to <desktop-devel-list>, both to 
document this easy-to-make error and to suggest that developers audit 
their own code to see if they are doing the same thing.


