
Cloudflare Engineers Uncover Hidden ClickHouse Bottleneck Threatening Billion-Dollar Billing Pipeline

Last updated: 2026-05-17 12:43:14 · Education & Careers

Billing Pipeline Grinds to a Crawl

Cloudflare’s daily billing aggregation jobs, which are responsible for generating hundreds of millions of dollars in usage revenue, unexpectedly slowed to a crawl after a recent migration. The delay threatened to disrupt invoice reconciliation and downstream systems, including fraud detection.

Source: blog.cloudflare.com

“It was a big problem when daily aggregation jobs slowed down,” said a Cloudflare engineer who worked on the fix. “Everything we normally check—I/O, memory, rows scanned, parts read—appeared normal. That’s when we knew it was something deeper.”

Hidden Bottleneck Discovered Inside ClickHouse

The bottleneck was traced to a subtle inefficiency in ClickHouse’s internals, specifically in how the database handles per-namespace data sorting. The affected platform, Ready-Analytics, stores petabytes of data from hundreds of applications in a single massive table sorted by namespace, indexID, and timestamp.

“We had to dig deep into ClickHouse’s query execution logic to find the culprit,” another engineer explained. “It wasn’t a resource issue—it was a design flaw in our own schema and retention policy.”

Background: The Rise of Ready-Analytics

Cloudflare built Ready-Analytics in early 2022 to simplify data onboarding for internal teams. Instead of creating custom tables, teams stream data into one unified table with a standard schema of 20 float fields, 20 string fields, a timestamp, and an indexID. The indexID is a string that forms part of the primary key, allowing each namespace’s data to be sorted optimally for its queries.
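To make the unified schema concrete, here is a minimal Python sketch of one row and its sort order. It is an illustration only, not Cloudflare's table definition: the class name `ReadyAnalyticsRow`, the field names, and the choice of a string type for the namespace are assumptions. Only the field counts and the (namespace, indexID, timestamp) ordering come from the article.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Tuple

NUM_FLOATS = 20   # from the article: 20 float fields
NUM_STRINGS = 20  # from the article: 20 string fields


@dataclass
class ReadyAnalyticsRow:
    """Hypothetical in-memory stand-in for one row of the unified table."""
    namespace: str       # assumed to be a string; the article does not give its type
    index_id: str        # the indexID string that is part of the primary key
    timestamp: datetime
    floats: List[float] = field(default_factory=lambda: [0.0] * NUM_FLOATS)
    strings: List[str] = field(default_factory=lambda: [""] * NUM_STRINGS)

    def sort_key(self) -> Tuple[str, str, datetime]:
        # Mirrors the sort order described in the article.
        return (self.namespace, self.index_id, self.timestamp)


def sort_rows(rows: List[ReadyAnalyticsRow]) -> List[ReadyAnalyticsRow]:
    """Order rows by namespace, then indexID, then timestamp."""
    return sorted(rows, key=ReadyAnalyticsRow.sort_key)
```

Because the namespace comes first in the key, each team's rows end up stored together, and the indexID lets that team choose how its own rows are ordered within that range.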

By December 2024, Ready-Analytics held over 2 petabytes of data and ingested millions of rows per second. But its retention policy—dropping partitions older than 31 days—was a blunt instrument. Teams requiring longer retention had to skip Ready-Analytics entirely, opting for a much more complex conventional setup.

The Problem: One-Size-Fits-All Retention

Cloudflare’s use of ClickHouse predates the database’s native Time-to-Live (TTL) features, so the company built its own retention system based on daily partitioning. The Ready-Analytics table used a global 31-day retention period, which forced teams with legal or contractual obligations to store data for years to build separate infrastructure.
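The global policy can be sketched in a few lines of Python. This is a hedged illustration of partition-based retention in general, not Cloudflare's retention system: the function name `partitions_to_drop` and the representation of each daily partition as a `date` are assumptions.

```python
from datetime import date, timedelta
from typing import Iterable, List

RETENTION_DAYS = 31  # the global retention period cited in the article


def partitions_to_drop(
    partition_days: Iterable[date],
    today: date,
    retention_days: int = RETENTION_DAYS,
) -> List[date]:
    """Return the daily partitions that are older than the retention window.

    Hypothetical helper: every namespace shares one cutoff, so a partition
    is either kept for everyone or dropped for everyone.
    """
    cutoff = today - timedelta(days=retention_days)
    return sorted(day for day in partition_days if day < cutoff)
```

The limitation described above is visible in the signature: there is a single `retention_days` value for the whole table, so no namespace can keep its data longer than any other.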

“This restriction meant many use cases couldn’t use Ready-Analytics,” a product manager noted. “We needed a per-namespace retention solution that didn’t require abandoning the platform.”

What This Means for Cloudflare and Users

The three patches written to fix the bottleneck not only restored billing pipeline performance but also enabled per-namespace retention, opening Ready-Analytics to teams that previously had to avoid it. The engineers have documented their approach to share with the ClickHouse community.
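The article does not describe how the patches implement per-namespace retention, so the following Python sketch only illustrates the idea under stated assumptions. The names `expired_namespaces` and `can_drop_whole_partition`, the `retention_days` mapping, and the 31-day default for namespaces without an override are all hypothetical.

```python
from datetime import date
from typing import Dict, Iterable, List

DEFAULT_RETENTION_DAYS = 31  # assumed fallback, matching the old global policy


def expired_namespaces(
    partition_day: date,
    today: date,
    namespaces: Iterable[str],
    retention_days: Dict[str, int],
) -> List[str]:
    """Namespaces whose data in this daily partition is past its own retention."""
    age_days = (today - partition_day).days
    return sorted(
        ns for ns in set(namespaces)
        if age_days > retention_days.get(ns, DEFAULT_RETENTION_DAYS)
    )


def can_drop_whole_partition(
    partition_day: date,
    today: date,
    namespaces: Iterable[str],
    retention_days: Dict[str, int],
) -> bool:
    """True only when every namespace present in the partition has expired."""
    present = set(namespaces)
    expired = expired_namespaces(partition_day, today, present, retention_days)
    return len(expired) == len(present)
```

With per-namespace periods, a daily partition can no longer be dropped as soon as it is 31 days old: it may still hold rows for a namespace with a longer retention period, so expired data has to be identified namespace by namespace.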

“The fix eliminated the hidden bottleneck and gave us the flexibility we needed,” said a lead engineer. “Now teams can set their own retention periods without impacting the entire cluster.”

Cloudflare expects the improvements to accelerate onboarding for internal teams and reduce operational overhead. Users will benefit from more accurate and timely billing, while the company avoids revenue reconciliation headaches.