GitHub's Reliability Journey: Overcoming Rapid Growth Challenges

Introduction

In recent months, GitHub experienced two significant availability incidents that disrupted workflows for many users. These events were unacceptable, and we sincerely apologize for the impact. This article outlines the root causes, the steps we've taken to address them, and our ongoing efforts to ensure a more resilient platform for the future.

Source: github.blog

The Driving Forces Behind the Need for Scale

In October 2025, we began executing a plan to increase GitHub's capacity by 10x, aiming to substantially improve reliability and failover mechanisms. However, by February 2026, it became evident that we needed to design for a future requiring 30x today's scale. The primary catalyst? A dramatic shift in software development practices.

Since the second half of December 2025, the adoption of agentic development workflows has accelerated sharply. Key metrics (repository creation, pull request activity, API usage, automation, and large-repository workloads) are all growing rapidly. This growth does not stress one system at a time: a single pull request can touch Git storage, mergeability checks, branch protection, GitHub Actions, search, notifications, permissions, webhooks, APIs, background jobs, caches, and databases. At high scale, small inefficiencies compound: queues deepen, cache misses become database load, indexes fall behind, retries amplify traffic, and one slow dependency can affect several product experiences.
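To make the compounding effect concrete, here is a back-of-the-envelope sketch. The numbers and function names are illustrative assumptions, not GitHub measurements: it estimates how much traffic falls through a cache to the database at a given hit rate, and how many attempts can reach the deepest dependency when every layer in a call chain retries on its own.

```python
def database_qps(request_qps: float, cache_hit_rate: float) -> float:
    """Requests per second that miss the cache and fall through to the database."""
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")
    return request_qps * (1.0 - cache_hit_rate)


def worst_case_attempts(retries_per_layer: int, layers: int) -> int:
    """Upper bound on attempts that reach the deepest dependency when each
    layer in a call chain independently retries a failing call."""
    if retries_per_layer < 0 or layers < 0:
        raise ValueError("retries_per_layer and layers must be non-negative")
    # Each layer makes 1 original attempt plus its retries, and the
    # factors multiply across layers.
    return (retries_per_layer + 1) ** layers


if __name__ == "__main__":
    # Hypothetical traffic level: 100,000 requests per second.
    for hit_rate in (0.99, 0.98):
        print(f"hit rate {hit_rate:.0%}: {database_qps(100_000, hit_rate):.0f} db qps")
    # Three layers, each retrying twice on failure.
    print("worst-case attempts:", worst_case_attempts(2, 3))
```

Under these assumed numbers, a one-point drop in cache hit rate (99% to 98%) doubles database load from about 1,000 to about 2,000 queries per second, and three layers that each retry twice can turn one failing request into as many as 27 attempts against the slowest dependency.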

Our Priorities: Availability First

Our priorities are clear: availability first, then capacity, then new features. We are reducing unnecessary work, improving caching, isolating critical services, removing single points of failure, and moving performance-sensitive paths into systems designed for these workloads. This is distributed systems work: reducing hidden coupling, limiting blast radius, and making GitHub degrade gracefully when one subsystem is under pressure. We're making progress quickly, but these incidents highlight areas where more work remains.
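One common way to degrade gracefully when a dependency is under pressure is a circuit breaker. The sketch below is a minimal, generic illustration of that pattern, not GitHub's implementation; the class name, thresholds, and the injected clock are assumptions made for the example.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """After repeated failures, stop calling a dependency for a cool-down
    period and serve a fallback instead."""

    def __init__(
        self,
        failure_threshold: int = 3,
        reset_after_s: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, primary: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                # Open: skip the struggling dependency entirely.
                return fallback()
            # Cool-down elapsed: allow one trial call (half-open).
            # A single further failure reopens the breaker.
            self._opened_at = None
            self._failures = self.failure_threshold - 1
        try:
            result = primary()
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            return fallback()
        self._failures = 0  # a success closes the breaker fully
        return result
```

The design choice worth noting is that the fallback runs without touching the failing subsystem while the breaker is open, so a slow dependency stops consuming threads and connections that other product experiences need.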

Short-Term Actions and Immediate Fixes

In the short term, we resolved several bottlenecks that emerged faster than anticipated. The fixes included:

Migrating webhooks to a different backend (out of MySQL) to reduce database strain.
Redesigning the user session cache and redoing authentication and authorization flows to substantially reduce database load.
Leveraging our migration to Azure to stand up significantly more compute resources.

Isolating Critical Services and Minimizing Blast Radius

Next, we focused on isolating critical services like Git and GitHub Actions from other workloads, minimizing the blast radius by eliminating single points of failure. This work started with a careful analysis of dependencies and different tiers of traffic to understand what needs to be separated and how to minimize impact on legitimate traffic during various attacks. We addressed these in order of risk. Similarly, we accelerated the migration of performance- or scale-sensitive code from the Ruby monolith into Go, improving efficiency and reliability.
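Isolation of this kind is often described as the bulkhead pattern: each workload gets its own capacity budget, so saturation in one cannot starve another. The following is a minimal sketch of that idea under stated assumptions. The workload names ("git", "search") and the limits are hypothetical, and a production system would enforce such limits at the infrastructure level rather than with in-process semaphores.

```python
import threading
from contextlib import contextmanager
from typing import Dict, Iterator


class Bulkhead:
    """Gives each workload its own pool of concurrency slots."""

    def __init__(self, limits: Dict[str, int]) -> None:
        self._slots = {
            name: threading.BoundedSemaphore(limit) for name, limit in limits.items()
        }

    @contextmanager
    def slot(self, workload: str) -> Iterator[bool]:
        """Yield True if a slot was free, False if the request should be shed."""
        semaphore = self._slots[workload]
        # Non-blocking: a saturated workload is rejected immediately
        # instead of queueing and holding resources.
        acquired = semaphore.acquire(blocking=False)
        try:
            yield acquired
        finally:
            if acquired:
                semaphore.release()


if __name__ == "__main__":
    # Hypothetical limits: search may use one slot, git may use two.
    bulkhead = Bulkhead({"git": 2, "search": 1})
    with bulkhead.slot("search") as first_search:
        with bulkhead.slot("search") as second_search:
            with bulkhead.slot("git") as git_request:
                print(first_search, second_search, git_request)
```

In the usage example, the second search request is shed because the search budget is exhausted, while the Git request still gets a slot because its budget is separate.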

The Path to Multi-Cloud

While we were already in the process of moving out of smaller custom data centers into the public cloud, we began working on a path to multi-cloud. This approach enhances resilience, allowing us to distribute workloads across multiple providers and reduce dependency on any single infrastructure.

Looking Ahead: Building a More Resilient GitHub

Our work is ongoing. We continue to identify and address hidden couplings, improve monitoring and alerting, and invest in automated failover systems. The rapid growth in agentic development and automation means we must stay ahead of the curve. Every incident teaches us valuable lessons that inform our architecture and operations. We are committed to transparency and will share further updates as we make progress. Thank you for your patience and understanding as we build a more reliable GitHub for everyone.