GitHub's Reliability Journey: Overcoming Rapid Growth Challenges

Last updated: 2026-05-02 05:23:22 · Open Source

Introduction

In recent months, GitHub experienced two significant availability incidents that disrupted workflows for many users. These events were unacceptable, and we sincerely apologize for the impact. This article outlines the root causes, the steps we've taken to address them, and our ongoing efforts to ensure a more resilient platform for the future.

Source: github.blog

The Driving Forces Behind the Need for Scale

In October 2025, we began executing a plan to increase GitHub's capacity by 10x, aiming to substantially improve reliability and failover mechanisms. However, by February 2026, it became evident that we needed to design for a future requiring 30x today's scale. The primary catalyst? A dramatic shift in software development practices.

Since the second half of December 2025, the adoption of agentic development workflows has accelerated sharply. Key metrics (repository creation, pull request activity, API usage, automation, and large-repository workloads) are all growing rapidly. This exponential growth does not stress systems in isolation: a single pull request can touch Git storage, mergeability checks, branch protection, GitHub Actions, search, notifications, permissions, webhooks, APIs, background jobs, caches, and databases. At high scale, small inefficiencies compound: queues deepen, cache misses become database load, indexes fall behind, retries amplify traffic, and one slow dependency can affect several product experiences.
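To make the "retries amplify traffic" point concrete, here is a small worked calculation. It assumes a hypothetical call chain in which every layer independently retries a failed downstream call; the function name, layer counts, and retry counts are illustrative and are not figures from GitHub.

```python
def worst_case_attempts(layers: int, retries_per_layer: int) -> int:
    """Worst-case number of requests that reach the deepest dependency
    for a single user request.

    Assumption: each of `layers` calling layers makes one initial attempt
    plus `retries_per_layer` retries, and every attempt fails.
    """
    if layers < 0 or retries_per_layer < 0:
        raise ValueError("layers and retries_per_layer must be non-negative")
    attempts_per_layer = 1 + retries_per_layer
    # Each layer multiplies the attempts made by the layer above it.
    return attempts_per_layer ** layers


if __name__ == "__main__":
    for layers in (1, 2, 3, 4):
        print(layers, worst_case_attempts(layers, retries_per_layer=2))
```

Under these assumptions, three layers with two retries each turn one failing user request into 3 × 3 × 3 = 27 requests against the slow dependency, which is why retry limits and backoff matter more as call chains get deeper.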

Our Priorities: Availability First

Our priorities are clear: availability first, then capacity, then new features. We are reducing unnecessary work, improving caching, isolating critical services, removing single points of failure, and moving performance-sensitive paths into systems designed for these workloads. This is distributed systems work: reducing hidden coupling, limiting blast radius, and making GitHub degrade gracefully when one subsystem is under pressure. We're making progress quickly, but these incidents highlight areas where more work remains.
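One common way to degrade gracefully when a dependency is under pressure is a circuit breaker. The sketch below is a generic, minimal version of that pattern, not a description of GitHub's implementation; the class name, thresholds, and the injected `clock` parameter are all assumptions made for illustration.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Stops calling a failing dependency for a cool-down period and
    serves a fallback instead (hypothetical, simplified sketch)."""

    def __init__(self, failure_threshold: int = 3, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def is_open(self) -> bool:
        if self._opened_at is None:
            return False
        # After the cool-down, let a trial call through (half-open state).
        return self._clock() - self._opened_at < self.reset_after_s

    def call(self, primary: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self.is_open():
            # Skip the struggling dependency entirely.
            return fallback()
        try:
            result = primary()
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            return fallback()
        # Success closes the breaker and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller might wrap a non-critical feature this way, for example returning a cached or empty result from `fallback` so that the page still renders while the dependency recovers.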

Short-Term Actions and Immediate Fixes

In the short term, we resolved several bottlenecks that appeared faster than anticipated by:

  • Migrating webhooks to a different backend (out of MySQL) to reduce database strain.
  • Redesigning the user session cache and redoing authentication and authorization flows to substantially reduce database load.
  • Leveraging our migration to Azure to stand up significantly more compute resources.
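The article does not describe how the session cache was redesigned. As a generic illustration of how a cache in front of session lookups reduces database reads, here is a minimal cache-aside sketch with a time-to-live. `SessionCache`, `load_from_db`, and the TTL value are hypothetical names and numbers chosen for the example.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class SessionCache:
    """Cache-aside lookup: serve from memory while an entry is fresh,
    otherwise read from the database and remember the result."""

    def __init__(self, load_from_db: Callable[[str], Optional[dict]],
                 ttl_s: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._load_from_db = load_from_db  # hypothetical database loader
        self._ttl_s = ttl_s
        self._clock = clock
        self._entries: Dict[str, Tuple[float, dict]] = {}
        self.db_reads = 0  # counter kept only to make the effect visible

    def get(self, session_id: str) -> Optional[dict]:
        now = self._clock()
        entry = self._entries.get(session_id)
        if entry is not None and entry[0] > now:
            return entry[1]  # fresh hit: no database work
        self.db_reads += 1
        session = self._load_from_db(session_id)
        if session is not None:
            self._entries[session_id] = (now + self._ttl_s, session)
        return session

    def invalidate(self, session_id: str) -> None:
        # Call on logout or permission change so stale data is not served.
        self._entries.pop(session_id, None)
```

In this sketch, repeated requests for the same session within the TTL cost one database read instead of one per request. The trade-off is staleness, which is why the explicit `invalidate` hook is included.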

Isolating Critical Services and Minimizing Blast Radius

Next, we focused on isolating critical services like Git and GitHub Actions from other workloads, minimizing the blast radius by eliminating single points of failure. This work started with a careful analysis of dependencies and different tiers of traffic to understand what needs to be separated and how to minimize impact on legitimate traffic during various attacks. We addressed the identified dependencies in order of risk. In parallel, we accelerated the migration of performance- or scale-sensitive code from the Ruby monolith into Go, improving efficiency and reliability.
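One generic technique for limiting blast radius between workloads is the bulkhead pattern: each class of traffic gets its own capacity pool, so a flood in one class cannot exhaust capacity for another. The sketch below assumes hypothetical workload names and limits and is not a description of how GitHub partitions its services.

```python
import threading
from contextlib import contextmanager
from typing import Dict, Iterator


class BulkheadFull(Exception):
    """Raised when a workload has used all of its own slots."""


class Bulkhead:
    def __init__(self, limits: Dict[str, int]) -> None:
        # One independent semaphore per workload class.
        self._slots = {
            name: threading.BoundedSemaphore(limit)
            for name, limit in limits.items()
        }

    @contextmanager
    def acquire(self, workload: str) -> Iterator[None]:
        slot = self._slots[workload]
        # Fail fast instead of queueing, so overload stays contained.
        if not slot.acquire(blocking=False):
            raise BulkheadFull(workload)
        try:
            yield
        finally:
            slot.release()


if __name__ == "__main__":
    bulkhead = Bulkhead({"git": 2, "search": 1})  # illustrative limits
    with bulkhead.acquire("search"):
        try:
            with bulkhead.acquire("search"):
                pass
        except BulkheadFull:
            print("search is saturated")
        with bulkhead.acquire("git"):
            print("git still has capacity")
```

Rejecting excess requests immediately is a design choice in this sketch; a real system might queue briefly or shed lower-priority traffic first.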

The Path to Multi-Cloud

While we were already in the process of moving out of smaller custom data centers into the public cloud, we began working on a path to multi-cloud. This approach enhances resilience, allowing us to distribute workloads across multiple providers and reduce dependency on any single infrastructure.

Looking Ahead: Building a More Resilient GitHub

Our work is ongoing. We continue to identify and address hidden couplings, improve monitoring and alerting, and invest in automated failover systems. The rapid growth in agentic development and automation means we must stay ahead of the curve. Every incident teaches us valuable lessons that inform our architecture and operations. We are committed to transparency and will share further updates as we make progress. Thank you for your patience and understanding as we build a more reliable GitHub for everyone.