Prognosis

Last month I was suffering from chronic laptop crashes after I decided to run a long-overdue apt dist-upgrade (I run sid). I knew that software was causing this issue since before the dist-upgrade, the laptop was happily running fine for the last couple of years. My issue, though was: what exactly was causing such a painful thing as seemingly random but often-recurring crashes?

I wanted to diagnose. The situation got to a point where it was so aggravating that I was using a desktop at the office instead of the laptop and I would ssh into the laptop to access important information. This had to stop.

Symptoms

After that fated dist-upgrade, my laptop started to show some visual flashes, as if I could see the screen blackout and refresh. Those flashes would only occur when I typed on the keyboard, and I found they would reproduce the easiest when I would change from one terminal to another in Terminator and started typing.

After a certain amount of time (or typing) which to me seemed random, the screen would turn black exactly when I typed any letter on the keyboard. A one-pixel vertical line would go crazy zigzagging colours on the left side. And then after about 10s the whole machine would shutdown by itself.

Frequency (or repetitiveness)

After some time suffering from this, I could see a clear pattern: the more I would use the keyboard, the bigger the chances of crashing were growing.

I then found out that if I purposefully caused visual flashes by going back and forth between two terminal windows and typing all the time, I would repeat the crash faily easily.

Great: with repetitiveness comes easy testing.

Treatment

My first hypothesis was that it might be caused by the video driver. However when I downgraded it to the latest version that was installed before the dist-upgrade that brought the them, the symptoms didn't go away.

I tried a couple more packages like gnome components. I tried running Xwayland instead of Xorg. I tried using fluxbox instead of gnome. Nothing would cut it: the crashes were still there.

I also tried installing a fresh debian jessie from a debian live image, and then upgrading that to debian sid. The issue was reproducing in this setup.

The issue was that during the dist-upgrade mentioned in the diagnosis section above, there were so many packages upgraded at once that I'd go crazy triaging them all, especially since some of them would refuse to dowgrade because of dependency issues.

Strong medicine

So I was completely fed up with this issue and I had a means of easily reproducing the crash. The strong medicine would be needed: bisecting between the last known good state (the upgrade before last) and the the first known bad state (last upgrade).

The main tool for this bisection would be the awesome service from debian that keeps 4 snapshots per day of the whole debian package repository:

http://snapshot.debian.org/

I used the fresh debian sid install that reproduced the crashes so that I wouldn't be messing around too much with my main setup.

Setting up debian sid for bisection

In order to be able to jump back and forth in time, some preparation was needed. First, since I would be upgrading and downgrading, I needed to tell dpkg to just follow along. The unstable release would be moving around with sources from snapshot.debian.org, so a preferences file would suit the purpose nicely:

Package: *
Pin-Priority: 1001
Pin: release a=unstable

Then to make things easier, I needed the bisection process to be somewhat scripted. First, I listed all of the days in between the good and bad states (including the good and bad state days to mark boundaries). Then I marked the oldest date by adding " GOOD" to the right of the date, and similarly the last date would be marked with " BAD". This was done manually in a text file called dpkg-bisect.

The following figure shows a shortened version of the file (assume dates continue sequentially where content is ellipsized):

06-14 GOOD
06-15
06-16
06-17
[...]
08-04
08-05
08-06
08-07 BAD

Then I wrote two functions in .bashrc to make it easy to mark a date as either good or bad, respectively. Those functions would accept a date as argument and, depending on which function I called, simply add a mark of " GOOD" or of " BAD" on the line that started with that date:

function good () {
  sed -i "s/^\($1\).*\$/\1 GOOD/" dpkg-bisect
}
function bad () {
  sed -i "s/^\($1\).*\$/\1 BAD/" dpkg-bisect
}

Then I created aliases in .bashrc to make it easier to upgrade or downgrade packages and cleanup after an upgrade/downgrade. The cleanup alias shows packages which were not installed on current archives, so it would show us packages that come from a different day (e.g. a different snapshot source). The update alias gives an option to apt-get to disregard errors about a source being outdated: sources on snapshot.debian.org are only valid for a dozen or so days:

alias cleanup="aptitude search '?narrow(?not(?archive(\"^[^n][^o].*$\")),?version(CURRENT))'"
alias update='sudo apt-get -o Acquire::Check-Valid-Until=false update'

Finally, to make bisection easier, I wrote an alias also in .bashrc that would remove all lines until the last occurence of a "GOOD" marker and all lines after the first occurence of a "BAD" marker, and print the date that was right in the middle of what was left:

alias bisect="sed '/BAD\$/Q' dpkg-bisect | tac | sed '/GOOD\$/Q' | tac | awk '{ lines[NR]=\$0; } END { print lines[int(NR/2)+1] }'"

Running bisection

With all this in place, I could start the process, which boils down to:

  • call bisect and find the given date on snapshot.debian.org. For consistency, I would always choose the first snapshot of that day.
    • if the result is empty, then stop: the file would have one good and one bad line just next to each other. This bad date would be the first point where the issue was introduced.
  • change /etc/apt/sources.list to list only the snapshot.debian.org URL corresponding to the date from previous point
  • call upgrade
  • call cleanup and figure out how to make state consistent (old libs can be removed, conflicts should be fixed)
  • reboot
  • try to reproduce the bug
  • mark the date as either good or bad depending on the result of previous point
  • repeat process

When I finally found the fateful day that introduced the error, I actually drilled down in this day by adding times of the 4 snapshots in between the last good line and the first bad line. e.g.:

[...]
07-16
07-17
07-18 GOOD
07-19-04:18:30 GOOD
07-19-10:13:20
07-19-16:21:50
07-19-22:21:00
07-20-04:39:05 BAD
07-21
07-22 BAD
07-23
[...]

Then I continued to bisect to find the exact snapshot within that day that brought the problem. Arguably, I could have just listed all snapshots at the beginning of the process.

Diagnosis

The snapshot that introduced the crashes was upgrading only 5 packages. Among these, there was one that would stand out from the rest as possible cause so I tried to dowgrade this one first, and voilà! I had found the source of the bug. It was xserver-xorg-core.

So I downgraded this package on my main setup and the bug was gone. I put it on hold so it wouldn't re-upgrade automatically, and finally I reported a bug on the package, #837451. Apparently, a patch for intel-specific hardware was introduced in the package in the version that introduces the crashes and I suspect this patch is what was causing my woes.

This issue was really annoying, but learning how to run a binary search on debian archive snapshots was very interesting. It's a great tool for when finding a remedy to software bugs that make you loose hair.