Prognosis
Last month I was suffering from chronic laptop crashes after I decided to run a
long-overdue apt dist-upgrade
(I run sid). I knew that software was causing
this issue since before the dist-upgrade, the laptop was happily running fine
for the last couple of years. My issue, though was: what exactly was causing
such a painful thing as seemingly random but often-recurring crashes?
I wanted to diagnose. The situation got to a point where it was so aggravating that I was using a desktop at the office instead of the laptop and I would ssh into the laptop to access important information. This had to stop.
Symptoms
After that fated dist-upgrade
, my laptop started to show some visual flashes,
as if I could see the screen blackout and refresh. Those flashes would only
occur when I typed on the keyboard, and I found they would reproduce the
easiest when I would change from one terminal to another in Terminator
and
started typing.
After a certain amount of time (or typing) which to me seemed random, the screen would turn black exactly when I typed any letter on the keyboard. A one-pixel vertical line would go crazy zigzagging colours on the left side. And then after about 10s the whole machine would shutdown by itself.
Frequency (or repetitiveness)
After some time suffering from this, I could see a clear pattern: the more I would use the keyboard, the bigger the chances of crashing were growing.
I then found out that if I purposefully caused visual flashes by going back and forth between two terminal windows and typing all the time, I would repeat the crash faily easily.
Great: with repetitiveness comes easy testing.
Treatment
My first hypothesis was that it might be caused by the video driver. However
when I downgraded it to the latest version that was installed before the
dist-upgrade
that brought the them, the symptoms didn't go away.
I tried a couple more packages like gnome components. I tried running Xwayland instead of Xorg. I tried using fluxbox instead of gnome. Nothing would cut it: the crashes were still there.
I also tried installing a fresh debian jessie from a debian live image, and then upgrading that to debian sid. The issue was reproducing in this setup.
The issue was that during the dist-upgrade
mentioned in the diagnosis section
above, there were so many packages upgraded at once that I'd go crazy triaging
them all, especially since some of them would refuse to dowgrade because of
dependency issues.
Strong medicine
So I was completely fed up with this issue and I had a means of easily reproducing the crash. The strong medicine would be needed: bisecting between the last known good state (the upgrade before last) and the the first known bad state (last upgrade).
The main tool for this bisection would be the awesome service from debian that keeps 4 snapshots per day of the whole debian package repository:
http://snapshot.debian.org/
I used the fresh debian sid install that reproduced the crashes so that I wouldn't be messing around too much with my main setup.
Setting up debian sid for bisection
In order to be able to jump back and forth in time, some preparation was needed. First, since I would be upgrading and downgrading, I needed to tell dpkg to just follow along. The unstable release would be moving around with sources from snapshot.debian.org, so a preferences file would suit the purpose nicely:
Package: *
Pin-Priority: 1001
Pin: release a=unstable
Then to make things easier, I needed the bisection process to be somewhat
scripted. First, I listed all of the days in between the good and bad states
(including the good and bad state days to mark boundaries). Then I marked the
oldest date by adding " GOOD" to the right of the date, and similarly the last
date would be marked with " BAD". This was done manually in a text file called
dpkg-bisect
.
The following figure shows a shortened version of the file (assume dates continue sequentially where content is ellipsized):
06-14 GOOD
06-15
06-16
06-17
[...]
08-04
08-05
08-06
08-07 BAD
Then I wrote two functions in .bashrc
to make it easy to mark a date as
either good or bad, respectively. Those functions would accept a date as
argument and, depending on which function I called, simply add a mark of
" GOOD" or of " BAD" on the line that started with that date:
function good () {
sed -i "s/^\($1\).*\$/\1 GOOD/" dpkg-bisect
}
function bad () {
sed -i "s/^\($1\).*\$/\1 BAD/" dpkg-bisect
}
Then I created aliases in .bashrc
to make it easier to upgrade or downgrade
packages and cleanup after an upgrade/downgrade. The cleanup alias shows
packages which were not installed on current archives, so it would show us
packages that come from a different day (e.g. a different snapshot source). The
update alias gives an option to apt-get
to disregard errors about a source
being outdated: sources on snapshot.debian.org are only valid for a dozen or so
days:
alias cleanup="aptitude search '?narrow(?not(?archive(\"^[^n][^o].*$\")),?version(CURRENT))'"
alias update='sudo apt-get -o Acquire::Check-Valid-Until=false update'
Finally, to make bisection easier, I wrote an alias also in .bashrc
that
would remove all lines until the last occurence of a "GOOD" marker and all
lines after the first occurence of a "BAD" marker, and print the date that was
right in the middle of what was left:
alias bisect="sed '/BAD\$/Q' dpkg-bisect | tac | sed '/GOOD\$/Q' | tac | awk '{ lines[NR]=\$0; } END { print lines[int(NR/2)+1] }'"
Running bisection
With all this in place, I could start the process, which boils down to:
- call
bisect
and find the given date on snapshot.debian.org. For consistency, I would always choose the first snapshot of that day.- if the result is empty, then stop: the file would have one good and one bad line just next to each other. This bad date would be the first point where the issue was introduced.
- change
/etc/apt/sources.list
to list only the snapshot.debian.org URL corresponding to the date from previous point - call
upgrade
- call
cleanup
and figure out how to make state consistent (old libs can be removed, conflicts should be fixed) - reboot
- try to reproduce the bug
- mark the date as either good or bad depending on the result of previous point
- repeat process
When I finally found the fateful day that introduced the error, I actually drilled down in this day by adding times of the 4 snapshots in between the last good line and the first bad line. e.g.:
[...]
07-16
07-17
07-18 GOOD
07-19-04:18:30 GOOD
07-19-10:13:20
07-19-16:21:50
07-19-22:21:00
07-20-04:39:05 BAD
07-21
07-22 BAD
07-23
[...]
Then I continued to bisect to find the exact snapshot within that day that brought the problem. Arguably, I could have just listed all snapshots at the beginning of the process.
Diagnosis
The snapshot that introduced the crashes was upgrading only 5 packages. Among
these, there was one that would stand out from the rest as possible cause so I
tried to dowgrade this one first, and voilà! I had found the source of the bug.
It was xserver-xorg-core
.
So I downgraded this package on my main setup and the bug was gone. I put it on hold so it wouldn't re-upgrade automatically, and finally I reported a bug on the package, #837451. Apparently, a patch for intel-specific hardware was introduced in the package in the version that introduces the crashes and I suspect this patch is what was causing my woes.
This issue was really annoying, but learning how to run a binary search on debian archive snapshots was very interesting. It's a great tool for when finding a remedy to software bugs that make you loose hair.
Today I learned through discussions about bisections that there is a script called
debsnap
. This tool can list versions of a package that are available on snapshot.debian.net, and it can also download a specific .deb for you if you know the desired version.This tool would have saved me some time during the process of debugging the video driver for X as jotted down in this article: no need to add the source to your
apt/sources.list.d/
anymore with this.The
debsnap
script is part of thedevscripts
package.