Coding at Google

I wrote this a few years back, but I’ve had occasion to cite it yet again when explaining why engineering at Google was awesome. To avoid it getting eaten by the bitbucket, I’m publishing it here.

Background: From January 2016 to May 2018, I was a Senior SWE on the Chrome Enamel Security team.

Google culture prioritizes developer productivity and code velocity. The internal development environment has been described (by myself and others) as “borderline magic.” Google’s developer focus carried over to the Chrome organization even though it’s a client codebase, and its open-source nature means that it cannot depend upon Google-internal tooling and infrastructure.

I recounted the following experience after starting at Google:

When an engineer first joins Google, they start with a week or two of technical training on the Google infrastructure. I’ve worked in software development for nearly two decades, and I’ve never even dreamed of the development environment Google engineers get to use. I felt like Charlie Bucket on his tour of Willy Wonka’s Chocolate Factory—astonished by the amazing and unbelievable goodies available at any turn. The computing infrastructure was something out of Star Trek, the development tools were slick and amazing, the process was jaw-dropping.

While I was doing a “hello world” coding exercise in Google’s environment, a former colleague from the IE team pinged me on Hangouts chat, probably because he’d seen my tweets about feeling like an imposter as a SWE.  He sent me a link to click, which I did. Code from Google’s core advertising engine appeared in my browser in a web app IDE. Google’s engineers have access to nearly all of the code across the whole company. This alone was astonishing—in contrast, I’d initially joined the IE team so I could get access to the networking code to figure out why the Office Online team’s website wasn’t working.

“Neat, I can see everything!” I typed back. “Push the Analyze button” he instructed. I did, and some sort of automated analyzer emitted a report identifying a few dozen performance bugs in the code. “Wow, that’s amazing!” I gushed. “Now, push the Fix button” he instructed. “Uh, this isn’t some sort of security red team exercise, right?” I asked. He assured me that it wasn’t. I pushed the button. The code changed to fix some unnecessary object copies. “Amazing!” I effused. “Click Submit” he instructed. I did, and watched as the system compiled the code in the cloud, determined which tests to run, and ran them.

Later that afternoon, an owner of the code in the affected folder typed LGTM (Googlers approve changes by typing the acronym for Looks Good To Me) on the change list I had submitted, and my change was live in production later that day. I was, in a word, gobsmacked. That night, I searched the entire codebase for misuse of an IE cache control token and proposed fixes for the instances I found.

-Me, 2017

The development tooling and build/test infrastructure at Google enable fearless commits—even a novice can contribute to the codebase without breaking anything—and if something does break, culturally it’s not the novice’s fault: everyone agrees that the fault lies with the environment, usually either an incomplete presubmit check or missing test automation for some corner case. Regressing CLs (changelists) can be quickly and easily reverted, then resubmitted with the error corrected. Relatedly, Google invests heavily in blameless post-mortems for any problem that meaningfully impacts customer experience or metrics. Beyond investing in researching and authoring the post-mortem in a timely fashion, post-mortems are broadly reviewed, and the preventative action items identified therein are fixed with priority.
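
Presubmit checks are themselves just code: in Chromium they’re Python scripts named PRESUBMIT.py that live alongside the code they guard and run automatically when a CL is uploaded. Here’s a minimal sketch of one; the hook name and the input_api/output_api objects follow the real presubmit convention, but the specific check is my own illustration:

```python
# PRESUBMIT.py -- minimal sketch of a Chromium-style presubmit check.
# The hook name and input_api/output_api objects follow the real
# presubmit convention; the FIXME check itself is illustrative only.

def CheckChangeOnUpload(input_api, output_api):
    """Runs automatically when a CL is uploaded for review."""
    results = []
    for f in input_api.AffectedSourceFiles(None):
        for line_num, line in f.ChangedContents():
            if 'FIXME' in line:
                results.append(output_api.PresubmitPromptWarning(
                    '%s:%d uses FIXME; prefer TODO(owner).' %
                    (f.LocalPath(), line_num)))
    return results
```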

Google makes it easy to get started and contribute. When ramping up into a new space, the new engineer is pointed to a Wiki or other easily-updated source of step-by-step instructions for configuring their development environment. This set of instructions is expected to be current, and if the reader encounters any problems or changes, they’re expected to improve the document for the next reader (“Leave it better than you found it”). If needed, there’s usually a script or other provisioning tool used to help get the right packages/tools/dependencies installed, and again, if the user encounters any problems, the expectation is that they’ll either file a bug or commit the fix to the script.
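
As a purely hypothetical sketch of the kind of environment-check script such instructions might link to (the tool list is invented, and real provisioning scripts do far more):

```python
#!/usr/bin/env python3
"""Hypothetical environment check; the required-tool list is invented."""
import shutil
import subprocess
import sys

REQUIRED_TOOLS = ['git', 'python3', 'ninja']  # illustrative only

def main():
    missing = [t for t in REQUIRED_TOOLS if shutil.which(t) is None]
    if missing:
        # Per the culture described above: if these instructions are
        # wrong, fix the script (or file a bug) for the next reader.
        sys.exit('Missing tools: %s -- see the setup doc.' % ', '.join(missing))
    for tool in REQUIRED_TOOLS:
        out = subprocess.run([tool, '--version'], capture_output=True, text=True)
        print(tool, out.stdout.splitlines()[0] if out.stdout else '(version unknown)')

if __name__ == '__main__':
    main()
```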

Similarly, any ongoing process is expected to have a “Playbook” that explains how to perform it – for example, Chrome’s HSTS Preload list is compiled into the Chrome codebase from snapshots of data exported from HSTSPreload.org, and there’s a Playbook document that explains the relevant scripts to run, when to run them, and how to diagnose and fix any problems. The Playbook is updated whenever any aspect of the process changes, as part of whatever check-in changes the process tooling.
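
In practice, a playbook step like that usually wraps a small script. The following is a purely hypothetical sketch: the snapshot URL and the sanity check are my inventions rather than the real tooling, though the output file name is borrowed from Chromium’s source tree:

```python
#!/usr/bin/env python3
"""Hypothetical playbook script: refresh the preload list from a snapshot.

The URL is a placeholder and the validation is invented; only the output
file name is borrowed from Chromium.
"""
import json
import urllib.request

SNAPSHOT_URL = 'https://example.invalid/hsts-preload-snapshot.json'  # placeholder
OUTPUT_PATH = 'transport_security_state_static.json'

def refresh_preload_list():
    with urllib.request.urlopen(SNAPSHOT_URL) as resp:
        entries = json.load(resp)
    # A real playbook would spell out sanity checks to perform here,
    # e.g. "the list should only grow modestly between snapshots."
    with open(OUTPUT_PATH, 'w') as f:
        json.dump(entries, f, indent=2, sort_keys=True)
    print('Wrote %d entries to %s' % (len(entries), OUTPUT_PATH))

if __name__ == '__main__':
    refresh_preload_list()
```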

As a relatively recent update, the Chromium project now offers a very lightweight contribution experience that can be run entirely in a web browser, which mimics the Google internal development environment (Cider IDE with Borg compiler backend).

Mono-repo, no team/feature branches. Google internally uses a mono-repo into which almost all code (with few exceptions, including Chrome) is checked in, and its permissions allow any engineer anywhere in the company to read it, dramatically simplifying both direct code reuse and finding expertise in a given topic. Because Chrome is an open-source project, it uses its own mono-repo, containing approximately 25 million lines of code. Chrome does not, in general, use shared branches for feature development; branches exist only for releases (e.g. Canary is forked to create the Dev branch, and there are firm rules about cherry-picking from Main into those branches).

An individual developer will locally create branches for each fix he’s working on, but those branches are almost never seen by anyone else; his PR is merged to HEAD, at which point everyone can see it. As a consequence, landing non-trivial changes, especially in areas where others are merging, often results in many commits and a sort of “chess game” where you have to anticipate where the code will be moving as your pieces are put in place. This strongly encourages developers to land code in many small CLs that coax the project toward the desired end-state, each with matching automated tests to ensure that you’re protected against anyone else landing a change that regresses your code. Those tests end up defending your code for years to come.

Because all work is done in Main, there’s little in the way of cross-team latency: you need not wait for an RI/FI (reverse or forward integration) to carry features to and from other branches.

Cloud build. Google uses cloud build infrastructure (Borg/Goma) to build its projects, so developers can work on relatively puny workstations but compile with hundreds to thousands of cores. A clean build of Chrome for Windows that took 46 minutes on a 48-thread Xeon workstation would take just 6 minutes on 960 Goma cores, and most engineers aren’t doing clean builds very often anyway.
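
As a back-of-envelope check on those numbers (the figures come from the paragraph above; the efficiency interpretation is my own):

```python
# Build-time arithmetic using the figures quoted above.
local_minutes, goma_minutes = 46, 6
local_threads, goma_cores = 48, 960

speedup = local_minutes / goma_minutes  # ~7.7x faster wall-clock
scale = goma_cores / local_threads      # 20x the parallelism
print(f'{speedup:.1f}x speedup on {scale:.0f}x the cores '
      f'(~{speedup / scale:.0%} scaling efficiency)')
# Sublinear scaling is expected: serial steps like linking don't
# benefit from additional remote cores.
```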

This Cloud build infrastructure is heavily leveraged throughout the engineering system—it means that when an engineer puts a changelist up for review, the code is compiled for five to ten different platforms in parallel in the background and then the entire automated test suite is run (“Tryjob”) such that the engineer can find any errors before another engineer even begins their code review. Similarly, artifacts from each landed CL’s compilation are archived such that there’s a complete history of the project’s binaries, which enables automated tooling to pinpoint regressions (performance via perfbots, security via ClusterFuzz, reliability via their version of Watson) and engineers to quickly bisect other types of regressions.
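
Those archived per-CL binaries are what make bisection so cheap: finding a culprit CL reduces to a binary search over build history. A minimal sketch of the idea, where fetch_build and reproduces are hypothetical stand-ins for “download archived binary” and “run the repro against it”:

```python
def bisect(revisions, fetch_build, reproduces):
    """Returns the first revision at which the regression reproduces.

    Assumes revisions is ordered oldest-to-newest, the bug is absent
    at revisions[0] and present at revisions[-1]. fetch_build() and
    reproduces() are hypothetical stand-ins for the archive/test steps.
    """
    lo, hi = 0, len(revisions) - 1
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if reproduces(fetch_build(revisions[mid])):
            hi = mid  # regression landed at or before mid
        else:
            lo = mid  # regression landed after mid
    return revisions[hi]
```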

Great code search/blame. Google’s Code Search features are extremely fast and, thanks to the View-All monorepo and lack of branches, it’s very easy to quickly find code from anywhere in the company. Cross-references work correctly, so things like “Find References” will properly find all callers of a specific function rather than just doing a string search for that name. Viewing Git history and blame is integrated, so it’s quick and easy to see how code evolved over time.

24-hour Code Review Culture. Google’s engineering team has a general SLA of 24 hours on code review. The tools help you find appropriate reviewers, and the automation helps ensure that your CL is in the best possible shape (proper linting, formatting, all tests pass, code-coverage percentages did not decline) before another human needs to look at it. The fast and simple review tools help reviewers concentrate on the task at hand, and the fact that almost all CLs are small/tiny by Microsoft standards helps keep reviews moving quickly. Similarly, Google’s worldwide engineering culture means that it’s often easy to submit a CL at the end of the day Pacific time and then respond to review feedback received overnight from engineers in Japan or Germany.

Opinionated and Enforced Coding Standards. Google has coding standards documents for each language (e.g. C++) that are opinionated and carefully revised after broad and deep discussions among practitioners interested in participating. These coding standards are, to the extent possible, enforced by automated tooling to ensure that all code is written to the standard, and these standards are shared across teams by default, with any per-project exceptions (e.g. Chrome’s C++) treated as an overlay.
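
Mechanical enforcement mostly means formatters and linters run for you, before review. As one sketch of what a formatting gate might look like, using standard clang-format flags (the file-selection logic here is illustrative, not Chromium’s actual check):

```python
# Sketch of a formatting gate: fail if any changed C++ file deviates
# from the checked-in style. --dry-run -Werror are standard clang-format
# flags that report violations without rewriting files.
import subprocess
import sys

changed_files = sys.argv[1:]  # e.g. supplied by a presubmit hook
cpp_files = [f for f in changed_files if f.endswith(('.cc', '.h'))]

if cpp_files:
    result = subprocess.run(
        ['clang-format', '--style=file', '--dry-run', '-Werror'] + cpp_files)
    sys.exit(result.returncode)
```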

Easily Discovered Area Interest/Ownership. Google has an extremely good internal “People Directory” – it allows you to search for any employee based on tags/keywords, so you can very quickly find the folks in the company who own a particular area. Think “Dr Whom/Who+” with 100ms page-load times, backed by a work culture where folks keep their own areas of ownership and interest up-to-date, both because it’s simple to do and because, if they fail to do so, they’ll keep getting questions about things they no longer own. Similarly, the OWNERS system within the codebases is kept up-to-date because it’s used to enforce OWNERS review of changes, so after you find a piece of code, it’s easy to find both who wrote it (fast git blame) and who’s responsible for it today. Company/Division/Team/Individual OKRs are all globally visible, so it’s easy to figure out what is important to a given level of the organization, no matter how remote.
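
The OWNERS mechanism itself is simple: plain-text OWNERS files in each directory list who may approve changes there, and the approvers for a given file are found by walking up the directory tree. A rough sketch of that lookup (simplified; real OWNERS files support extra directives such as per-file rules and set noparent, which this ignores):

```python
from pathlib import Path

def owners_for(path, repo_root):
    """Collect approvers for `path` by walking up to the repo root.

    Simplified sketch: treats every non-comment line of an OWNERS file
    as an approver, ignoring the richer real-world directives.
    Assumes `path` lives under `repo_root`.
    """
    owners = set()
    directory = Path(path).resolve().parent
    root = Path(repo_root).resolve()
    while True:
        owners_file = directory / 'OWNERS'
        if owners_file.exists():
            for line in owners_file.read_text().splitlines():
                line = line.strip()
                if line and not line.startswith('#'):
                    owners.add(line)  # typically an email address
        if directory == root:
            break
        directory = directory.parent
    return owners
```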

Simple/fast bug trackers. Google’s bug tracker tools are simple, load extremely quickly, and allow filing/finding bugs against anything very quickly. There’s a single internal tracker for most of Google, and a public tracker (crbug.com) for the Chromium OSS project.

Simple/fast telemetry/data science tools. Google’s equivalent of Watson is extremely fast and has code to automatically generate stack information, hit counts, recent checkins near the top-of-stack functions, etc. Google’s equivalent of SQM/OCV is extremely fast and enables viewing of histograms and answering questions like “What percentage of page loads result in this behavior” without learning a query language, getting complicated data access permissions, or suffering slow page loads. These tools enable easy creation of “notifications/subscriptions” so developers interested in an area can get a “chirp” email if a metric moves meaningfully.
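
Conceptually, such a subscription is just a threshold check over a metric’s time series. A hypothetical sketch (the baseline window and threshold are made up):

```python
from statistics import mean

def chirp_if_moved(history, threshold=0.05):
    """Hypothetical 'chirp' check: compare the latest daily value of a
    metric (e.g. fraction of page loads showing some behavior) against
    its trailing baseline; return an alert message if it moved by more
    than `threshold`, else None. Window and threshold are invented.
    """
    baseline = mean(history[:-1])
    latest = history[-1]
    if baseline and abs(latest - baseline) / baseline > threshold:
        return ('metric moved %+.1f%% vs. trailing baseline (%.4f -> %.4f)'
                % (100 * (latest - baseline) / baseline, baseline, latest))
    return None
```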

Sheriffs and Rotations. Most recurring processes (e.g. bug triage) have both a Sheriff and a Deputy and Google has tools for automatically managing “rotations” so that the load is spread throughout the team. For some expensive roles (e.g. a “Build Sheriff”) the developer’s primary responsibility while sheriff becomes the process in question and their normal development work is deferred until their rotation ends; the rotation tool shows the schedule for the next few months, so it is relatively easy to plan for this disruption in your productivity.
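
Generating such a schedule is a small round-robin problem. A minimal sketch (the roster and dates are illustrative):

```python
from datetime import date, timedelta

team = ['alice', 'bob', 'carol', 'dave']  # illustrative roster
start = date(2024, 1, 1)

# Round-robin: the deputy is the next person in the cycle, so everyone
# serves both roles equally often; publishing weeks ahead lets people
# plan around the productivity hit.
for week in range(8):
    sheriff = team[week % len(team)]
    deputy = team[(week + 1) % len(team)]
    print(f'{start + timedelta(weeks=week)}: sheriff={sheriff}, deputy={deputy}')
```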

Intranet Search that doesn’t suck. While Google tries to get many important design docs and so forth into the repo directly, there’s still a bunch of documentation and other material on assorted wikis, Google Docs, etc. As you might guess, Google has an internal search engine for this non-public content, and it works quite well, in contrast to other places I’ve worked.


