The bug that Stole xmas

I loved working at Mozilla for the almost two years that I spent there. But even a great job has hard days. The weeks around Christmas of 2018 were a stressful nightmare.

I’d been working as the senior engineer on the Firefox Focus browser for Android for a few months by then. Releases had been mostly uneventful. This story is about when that stopped.

Up to this point, our releases had been quiet because we used A/B experiments to roll out new features to Nightly and Beta users before anything reached our millions of Google Play users in production. We would watch dashboards to ensure that crashes, Application Not Responding (ANR) errors, page load time, reviews, and the number of active users over time were all trending in the right direction as we progressively rolled out new releases and would roll back at any sign of trouble.

Right before Christmas 2018, our long chain of pleasant releases came to an end.

As I was arriving back from a week-long vacation in Florida, we released Focus 8.0.4 to production. In fact, I believe I may have pushed the release from my economy seat on the flight using in-flight WiFi because no one else would be around to watch the rollout.

At first, everything seemed fine. This was because our two crash reporting tools, Sentry and Socorro, did not show that WebViews were crashing inside of Chrome code. These Native Development Kit (NDK) crashes were not caught by our tools, so they only appeared on the Google Play Developer Console.

To compound matters, the rest of my team went away on vacation then. I was alone and soon very bewildered.

We would’ve been fine if we’d used a crash reporting solution that reports all NDK crashes– not simply crashes in our own native code.

Lesson #1

After rolling out the release to 1% of our users, nothing was noticed. We discovered these crashes only in Google Play’s ANRs and Crashes section of the Google Play Console as our app was deployed to the first 10% of our users. This was surprising because I’d never before used any crash reporting solution that did not show data about crashes in C++ NDK code.

Having three crash reporting solutions with three separate and exclusive data sets split between them made finding our problem much harder. Google Play ended up picking up the crashes which the other two solutions did not handle. We would’ve been fine if we’d used a crash reporting solution that reports all NDK crashes– not simply crashes in our own native code.

When we became aware of the crashes, we still had very little data to go on. Google Play showed three primary crashes in C++ code. They were all inside the Chrome or WebView packages. Each of the three stack traces consisted of nothing more than the line “signal 5 (SIGTRAP), code 4 (TRAP_HWBKPT)” and another line pointing to the file name of the Chrome or WebView APK.

To be clear, there was no stacktrace. There was no native crash tombstone. There were no known steps to reproduce. I was flying blind and I was alone in my investigation. There was no one online but one Quality Assurance (QA) engineer and my own self to figure out what broke and how to produce the crash.

This wasn’t the kind of error that would produce results on StackOverflow or any of the usual sites which helps young engineers appear briefly superhuman simply by searching Google and copying the first answer. Crashing on SIGTRAP clearly pointed to native C++ issues, but it wasn’t happening in our own C++ code or the trace would’ve shown up in Socorro– our web engine’s native crash reporter. Instead, these errors happened inside the Chrome-based WebViews themselves.

To compound matters, we had just rolled out our new GeckoView engine to more users in an A/B test. This was a big deal. Firefox Focus had been built originally on Google’s WebView technology to quickly launch a prototype and to test our product hypotheses. However, we had huge plans to launch user privacy features which required a custom engine.

GeckoView was the culmination of years of work catching up to and sometimes even surpassing Google on performance as well as offering new features which could differentiate our browsers into the future. My job up to this point had been to ensure a smooth rollout of this new technology. Any potential issues had to be quickly caught and fixed.

Changing one variable at a time reduces the complexity of troubleshooting.

Lesson #2

When these errors appeared, they were affecting about 7-8% of all browser sessions. This meant we were crashing a lot and were likely losing many thousands of dedicated users. Most of the previous apps I’d worked on maintained far less than 0.5% session crash rates after a production release.

If we’d rolled out the new Firefox Focus 8.0.4 release further apart from increasing the percentage of GeckoView users, my investigation would’ve been far easier. Rolling out our release around the same time we rolled out our A/B test to more users meant we couldn’t tell which was responsible for the crashes. Changing one variable at a time reduces the complexity of troubleshooting.

Don’t panic. Software investigations go faster when you take your time and think methodically.

Lesson #3

There were reasons to worry that the GeckoView engine change caused these crashes. On first install, our app would always use WebView and would only switch engines after downloading a new A/B configuration file and then restarting the app from scratch after a reboot or upgrade. This made sense to avoid errors related to switching engines midstream that would invalidate our experiment. There was a lot of concern that our engine switching code might introduce issues.

There’s a problem in Computer Science where the initial gut reaction of good engineers is that they themselves must have made a mistake somewhere. My first hypothesis was that my engine switching code was causing the WebView to crash in the process of switching new users to the GeckoView engine. This may have been slightly neurotic. There was no evidence in the code or the error to indicate that my switching code was at fault–only the knowledge that my code was finally being exercised by hundreds of thousands of users for the first time.

I should have been able to quickly abandon my initial hypothesis, even without a stack trace, simply by examining the code. When troubleshooting, it’s really important you don’t panic. Software investigations go faster when you take your time and think methodically.

To compound matters further, there had been an urgent Google Chrome security update rushed out the door in the same week. I had to scour the Internet to ensure that update was not the cause.

Sadly, it took days to get to the heart of the issue. I had to first give up on discovering steps to reproduce the issue or to hear from a QA engineer or end user how to do so. I knew it was unlikely there would be any quick fix without a stack trace and my investigation led there.

I was able to quickly abandon Stack Overflow and Internet searches as a dead end.

I focused on how I could obtain a stack trace. It seemed likely I’d need to contact Google, but I’d seen with past Android issues how it could sometimes take many years to get an engineer response and fix to a real bug in the Android framework. If I went through official channels, it seemed highly unlikely I would get a response within a useful timeframe.

I ended up reaching out to the #webview-wtf channel in the Android Study Group Slack. This is an invite-only Slack service for professional Android developers only. It turns out that channels ending in “-wtf” are code for channels where at least some of the maintainers of the software are present in the channel. This meant I could try to get the attention of Chrome developers directly.

I reached out on December 26th and felt so lucky to get a response from a Chrome developer two days later. Better yet, it was useful. Amazingly, he was able to dig up the specific crash reports for the WebViews in our app. It seems that the callbacks we set for Chrome WebViews did not return logs or tracebacks of why they crashed. Even though the code was written in Kotlin, the error was caught by Chrome’s NDK code and it caused a SIGTRAP.

Threading is hard and concurrent access is a minefield. Functional programming isn’t just for Haskell programmers. Purity even makes a difference in imperative code.

Lesson #4

As soon as I had a traceback of the exception that caused the crash, the fix was fairly easy. Our blocker tracking code for the WebView engine had just been replaced by Mozilla’s Android Components to have a new, unified backend. That update used the Kotlin Observable property delegate, but the lambda operated directly on the field instead of local variables. This is bad because it’s not thread-safe.

The root cause of the crashes was multiple threads were both changing and reading the quantity of trackers blocked. When we tried to read the last tracker blocked, another thread had emptied the list before we could. This crash was inside a Chrome callback, so Chrome got the stack trace.

Here is the core part of the fix I wrote:

    var trackersBlocked: List<String> by Delegates.observable(emptyList()) { _, old, new ->
        notifyObservers(old, new) {
            if (new.isNotEmpty()) {
-                onTrackerBlocked(this@Session, trackersBlocked.last(), trackersBlocked)
+                onTrackerBlocked(this@Session, new.last(), new)
            }
        }
    }

This was possibly the hardest bug hunt I’ve ever had to deal with. I bring it up not to shame the otherwise smart developer who made the mistake and likely learned from it, but because it makes a good Halloween horror story for Android developers everywhere. There’s also some useful lessons to learn.

Threading is hard and concurrent access is a minefield. Functional programming isn’t just for Haskell programmers. Purity even makes a difference in imperative code. In this case, it even could have prevented an epic horror story that ruined my holidays.

Here’s a summary of the issue.

A dispatcher of many threads

Kotlin Coroutines have been almost unanimously received with applause by the Android software development community. However, there are some significant issues I have noticed in practice that are still unknown to most teams.

This week, one major issue was finally resolved.

Dispatchers.Main

Dispatchers.Main is the Kotlin Coroutines thread dispatcher for accessing Android’s main thread. Until recent improvements, the very first time this call is used, most apps experience a delay of up to a quarter second. Blocking the main thread pauses drawing the user interface and is enough to cause a noticeable lag to users.

It takes 16ms to draw one frame on Android. Blocking the main UI thread for 250ms skips fifteen frames at 60fps or thirty frames at 120fps. This is in addition to any other slow, blocking operations that might occur during your application’s initial startup.

This blocking operation serves no purpose. The Main Dispatcher was written using a ServiceLoader call. This choice resulted in a wasteful checksum operation being performed on the entire JAR file inside the APK.

I was able to workaround this problem several months ago on Firefox Preview by using a pre-release version of Google’s R8 code stripper along with custom rules. This wasn’t ideal. We were potentially subjecting our app to early stage bugs and were unlikely to get the same level of support for an unreleased version. We made this sacrifice because a quarter second startup time reduction was worth the cost.

At long last, there’s now a solution in a stable version of the Kotlin coroutines library itself. To take advantage of this fix, you must perform the following steps.

  1. Upgrade your version of Kotlin coroutines to 1.3.3 or higher in your build.gradle dependencies.
  2. Insert the required Proguard rules into your proguard-rules.pro file or whatever you’ve named it. See below.
  3. Ensure you’re using proguard-android-optimize.txt instead of the default, non-optimizing version.
  4. Ensure R8 is enabled. The R8 code stripper was introduced in Android Studio 3.3.0 and became default in Android Studio 3.4.0 or greater. It replaces Proguard’s code stripper, but should accept the same rules files. It can be configured in the gradle.properties file at the root of your application code.
# Add or replace these lines in your build.gradle file
# dependencies section
implementation org.jetbrains.kotlinx:kotlinx-coroutines-core:1.3.3
implementation org.jetbrains.kotlinx:kotlinx-coroutines-android:1.3.3
# Add these lines to proguard-rules.pro

# Allow R8 to optimize away the FastServiceLoader.
# Together with ServiceLoader optimization in R8
# this results in direct instantiation when loading Dispatchers.Main
-assumenosideeffects class kotlinx.coroutines.internal.MainDispatcherLoader {
    boolean FAST_SERVICE_LOADER_ENABLED return false;
}

-assumenosideeffects class kotlinx.coroutines.internal.FastServiceLoader {
    boolean ANDROID_DETECTED return true;
}
# Add these lines to your application build.gradle file
# or simply change the getDefaultProguardFile line

android {
   buildTypes {
      release {
        shrinkResources true
        minifyEnabled true
        proguardFiles getDefaultProguardFile('proguard-android-optimize.txt'), 'proguard-rules.pro'
      }
   }
}
# Put this line somewhere in gradle.properties
android.enableR8=true

If you perform the above steps, gradle sync, and then perform a release build with optimization enabled, you should see a significant improvement in your cold start time versus previous release builds.

There are many options to quickly see this improvement with a flame chart. You could use Google’s Traceview or a dedicated, third-party tool like Nimbledroid.

Google’s new Jetpack Benchmark tool seems useful for UI code, but it is not recommended by its creators for measuring cold application startup time.

Update: I was planning a part two of this blog, but decided the second issue wasn’t as interesting, nor as clearcut as I expected for a blog post.

Colin, Now Sly As a FireFox

Colin Lee from ColinTheShots LLC just joined Mozilla as the newest Senior Android Engineer working on Firefox Android products. I’ll be changing my professional consulting site to function as a blog. Since my job is open source software and no longer involves secrecy, I’m excited to be able to speak about my experiences developing new and amazing Android products.

Initially, I’ll be writing primarily Kotlin code in projects mixed with existing Java source.

In my first weeks on the job, I built a switcher for changing browser rendering engines and a new nightly build process integrating a hot, new nightly icon for the Firefox Focus browser. You can see that icon above.

Look forward to more posts about Android development challenges and learning.