The bug that Stole xmas

I loved working at Mozilla for the almost two years that I spent there. But even a great job has hard days. The weeks around Christmas of 2018 were a stressful nightmare.

I’d been working as the senior engineer on the Firefox Focus browser for Android for a few months by then. Releases had been mostly uneventful. This story is about when that stopped.

Up to this point, our releases had been quiet because we used A/B experiments to roll out new features to Nightly and Beta users before anything reached our millions of Google Play users in production. We would watch dashboards to ensure that crashes, Application Not Responding (ANR) errors, page load time, reviews, and the number of active users over time were all trending in the right direction as we progressively rolled out new releases and would roll back at any sign of trouble.

Right before Christmas 2018, our long chain of pleasant releases came to an end.

As I was arriving back from a week-long vacation in Florida, we released Focus 8.0.4 to production. In fact, I believe I may have pushed the release from my economy seat on the flight using in-flight WiFi because no one else would be around to watch the rollout.

At first, everything seemed fine. This was because our two crash reporting tools, Sentry and Socorro, did not show that WebViews were crashing inside of Chrome code. These Native Development Kit (NDK) crashes were not caught by our tools, so they only appeared on the Google Play Developer Console.

To compound matters, the rest of my team went away on vacation then. I was alone and soon very bewildered.

We would’ve been fine if we’d used a crash reporting solution that reports all NDK crashes– not simply crashes in our own native code.

Lesson #1

After rolling out the release to 1% of our users, nothing was noticed. We discovered these crashes only in Google Play’s ANRs and Crashes section of the Google Play Console as our app was deployed to the first 10% of our users. This was surprising because I’d never before used any crash reporting solution that did not show data about crashes in C++ NDK code.

Having three crash reporting solutions with three separate and exclusive data sets split between them made finding our problem much harder. Google Play ended up picking up the crashes which the other two solutions did not handle. We would’ve been fine if we’d used a crash reporting solution that reports all NDK crashes– not simply crashes in our own native code.

When we became aware of the crashes, we still had very little data to go on. Google Play showed three primary crashes in C++ code. They were all inside the Chrome or WebView packages. Each of the three stack traces consisted of nothing more than the line “signal 5 (SIGTRAP), code 4 (TRAP_HWBKPT)” and another line pointing to the file name of the Chrome or WebView APK.

To be clear, there was no stacktrace. There was no native crash tombstone. There were no known steps to reproduce. I was flying blind and I was alone in my investigation. There was no one online but one Quality Assurance (QA) engineer and my own self to figure out what broke and how to produce the crash.

This wasn’t the kind of error that would produce results on StackOverflow or any of the usual sites which helps young engineers appear briefly superhuman simply by searching Google and copying the first answer. Crashing on SIGTRAP clearly pointed to native C++ issues, but it wasn’t happening in our own C++ code or the trace would’ve shown up in Socorro– our web engine’s native crash reporter. Instead, these errors happened inside the Chrome-based WebViews themselves.

To compound matters, we had just rolled out our new GeckoView engine to more users in an A/B test. This was a big deal. Firefox Focus had been built originally on Google’s WebView technology to quickly launch a prototype and to test our product hypotheses. However, we had huge plans to launch user privacy features which required a custom engine.

GeckoView was the culmination of years of work catching up to and sometimes even surpassing Google on performance as well as offering new features which could differentiate our browsers into the future. My job up to this point had been to ensure a smooth rollout of this new technology. Any potential issues had to be quickly caught and fixed.

Changing one variable at a time reduces the complexity of troubleshooting.

Lesson #2

When these errors appeared, they were affecting about 7-8% of all browser sessions. This meant we were crashing a lot and were likely losing many thousands of dedicated users. Most of the previous apps I’d worked on maintained far less than 0.5% session crash rates after a production release.

If we’d rolled out the new Firefox Focus 8.0.4 release further apart from increasing the percentage of GeckoView users, my investigation would’ve been far easier. Rolling out our release around the same time we rolled out our A/B test to more users meant we couldn’t tell which was responsible for the crashes. Changing one variable at a time reduces the complexity of troubleshooting.

Don’t panic. Software investigations go faster when you take your time and think methodically.

Lesson #3

There were reasons to worry that the GeckoView engine change caused these crashes. On first install, our app would always use WebView and would only switch engines after downloading a new A/B configuration file and then restarting the app from scratch after a reboot or upgrade. This made sense to avoid errors related to switching engines midstream that would invalidate our experiment. There was a lot of concern that our engine switching code might introduce issues.

There’s a problem in Computer Science where the initial gut reaction of good engineers is that they themselves must have made a mistake somewhere. My first hypothesis was that my engine switching code was causing the WebView to crash in the process of switching new users to the GeckoView engine. This may have been slightly neurotic. There was no evidence in the code or the error to indicate that my switching code was at fault–only the knowledge that my code was finally being exercised by hundreds of thousands of users for the first time.

I should have been able to quickly abandon my initial hypothesis, even without a stack trace, simply by examining the code. When troubleshooting, it’s really important you don’t panic. Software investigations go faster when you take your time and think methodically.

To compound matters further, there had been an urgent Google Chrome security update rushed out the door in the same week. I had to scour the Internet to ensure that update was not the cause.

Sadly, it took days to get to the heart of the issue. I had to first give up on discovering steps to reproduce the issue or to hear from a QA engineer or end user how to do so. I knew it was unlikely there would be any quick fix without a stack trace and my investigation led there.

I was able to quickly abandon Stack Overflow and Internet searches as a dead end.

I focused on how I could obtain a stack trace. It seemed likely I’d need to contact Google, but I’d seen with past Android issues how it could sometimes take many years to get an engineer response and fix to a real bug in the Android framework. If I went through official channels, it seemed highly unlikely I would get a response within a useful timeframe.

I ended up reaching out to the #webview-wtf channel in the Android Study Group Slack. This is an invite-only Slack service for professional Android developers only. It turns out that channels ending in “-wtf” are code for channels where at least some of the maintainers of the software are present in the channel. This meant I could try to get the attention of Chrome developers directly.

I reached out on December 26th and felt so lucky to get a response from a Chrome developer two days later. Better yet, it was useful. Amazingly, he was able to dig up the specific crash reports for the WebViews in our app. It seems that the callbacks we set for Chrome WebViews did not return logs or tracebacks of why they crashed. Even though the code was written in Kotlin, the error was caught by Chrome’s NDK code and it caused a SIGTRAP.

Threading is hard and concurrent access is a minefield. Functional programming isn’t just for Haskell programmers. Purity even makes a difference in imperative code.

Lesson #4

As soon as I had a traceback of the exception that caused the crash, the fix was fairly easy. Our blocker tracking code for the WebView engine had just been replaced by Mozilla’s Android Components to have a new, unified backend. That update used the Kotlin Observable property delegate, but the lambda operated directly on the field instead of local variables. This is bad because it’s not thread-safe.

The root cause of the crashes was multiple threads were both changing and reading the quantity of trackers blocked. When we tried to read the last tracker blocked, another thread had emptied the list before we could. This crash was inside a Chrome callback, so Chrome got the stack trace.

Here is the core part of the fix I wrote:

    var trackersBlocked: List<String> by Delegates.observable(emptyList()) { _, old, new ->
        notifyObservers(old, new) {
            if (new.isNotEmpty()) {
-                onTrackerBlocked(this@Session, trackersBlocked.last(), trackersBlocked)
+                onTrackerBlocked(this@Session, new.last(), new)

This was possibly the hardest bug hunt I’ve ever had to deal with. I bring it up not to shame the otherwise smart developer who made the mistake and likely learned from it, but because it makes a good Halloween horror story for Android developers everywhere. There’s also some useful lessons to learn.

Threading is hard and concurrent access is a minefield. Functional programming isn’t just for Haskell programmers. Purity even makes a difference in imperative code. In this case, it even could have prevented an epic horror story that ruined my holidays.

Here’s a summary of the issue.