Firestore Collapse at 800K Writes/Min: Klarna's Scaling Fail

When Klarna surpassed 500 million monthly transactions on its platform in 2024, the engineering team didn't anticipate that Firebase Firestore, Google's serverless NoSQL database, would become its most costly obstacle. The monthly bill on Google Cloud reached $180,000. But the issue wasn't just financial: during Black Friday surges, the system took 4.7 seconds to process a checkout. While competitors managed payments in under 800ms, Klarna saw users drop off. The forced migration to a hybrid architecture with PostgreSQL and Redis took seven months and $2.3 million in engineering costs.


This real case, documented in a technical postmortem published on their engineering blog in March 2025, highlights a structural problem that no Firebase tutorial mentions: Firestore is not designed to scale with massive concurrent writes in collections with high cardinality. The limit isn't technical but architectural. What happens when you need to process more than 10,000 writes per second in the same collection? Firestore forces you to redesign your data model from scratch. What Google touts as "automatic scalability" has an invisible ceiling that you only discover when it's too late.

The 1 write/second per document limit no one reads in the docs

Firestore has a documented but rarely understood restriction: a single document cannot sustain much more than 1 write per second. Google mentions this on its limits page but buries it under the technical description of "hot spots." In my experience, this means that if your architecture relies on updating counters, aggregated metrics, or timestamps in a central document, writes to that document start failing under contention as traffic increases.

Klarna encountered precisely this issue. Their anti-fraud system updated a per-user document with each processed payment, storing the recent transaction history. With 800,000 users active simultaneously during traffic peaks, the system attempted to write to hundreds of thousands of documents at once. The problem wasn't Firestore itself but a data model that assumed every write would land on a different document. When 15% of those users initiated multiple checkouts simultaneously, certain documents received 3-5 writes per second.

Firestore's response was swift: cascading RESOURCE_EXHAUSTED errors. There was no gradual reduction or graceful degradation. It simply stopped accepting writes, causing transactions to start failing. Klarna's team discovered that the real limit wasn't the total number of writes but how those writes were distributed. Firestore can handle millions of writes per second if perfectly distributed, but it collapses with 50,000 writes if 20% focus on 1,000 documents.
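The distribution argument can be checked with a few lines of arithmetic. The sketch below is illustrative only: `hot_documents` and the `write_log` input are hypothetical names, the log is assumed to come from your own request logs, and the 50,000 writes in the example are assumed to arrive within one second.

```python
from collections import Counter

# Documented guideline: roughly 1 sustained write per second per document.
MAX_SUSTAINED_WRITES_PER_DOC = 1.0

def hot_documents(write_log, window_seconds, limit=MAX_SUSTAINED_WRITES_PER_DOC):
    """Return {doc_path: writes_per_second} for documents above the limit.

    write_log is an iterable of document paths, one entry per write observed
    during the window (a hypothetical export from your own request logs).
    """
    counts = Counter(write_log)
    return {
        path: n / window_seconds
        for path, n in counts.items()
        if n / window_seconds > limit
    }

# 20% of 50,000 writes (10,000) concentrate on 1,000 documents;
# the remaining 40,000 each hit a distinct document.
skewed = [f"users/hot{i % 1000}" for i in range(10_000)]
spread = [f"users/cold{i}" for i in range(40_000)]
hot = hot_documents(skewed + spread, window_seconds=1)
print(len(hot), max(hot.values()))  # 1000 10.0
```

Under those assumptions, the 40,000 evenly spread writes stay at the guideline, while each of the 1,000 hot documents receives 10 writes per second.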

Forced Sharding: The Solution Google Doesn't Automate

Official documentation suggests "sharding" when you detect hot spots: splitting a document into multiple subdocuments and distributing the writes. In theory, it sounds simple. In production, it's a consistency nightmare.

Klarna had to rewrite their aggregation logic. Instead of maintaining a users/{userId}/fraud_score document, they created 10 subdocuments: users/{userId}/fraud_score_shard_{0-9}. Each write was directed to one of the shards randomly, and when they needed to read the complete score, they performed 10 parallel reads and summed the results. The code went from 20 lines to 180, and read latency doubled. Worse still, the score update was no longer atomic with the rest of the payment. If a payment failed after its shard write had gone through, the system was left in an inconsistent state.
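The sharding pattern described above can be sketched in memory. This is not Klarna's code: `ShardedCounter` is a hypothetical class, and a plain dict stands in for the shard documents so the example runs without a Firestore client.

```python
import random

NUM_SHARDS = 10  # matches the 10 shards described above

class ShardedCounter:
    """In-memory stand-in for users/{userId}/fraud_score_shard_{0-9}."""

    def __init__(self, user_id, num_shards=NUM_SHARDS):
        # One dict entry per shard document; a real system would store these in Firestore.
        self.shards = {
            f"users/{user_id}/fraud_score_shard_{i}": 0 for i in range(num_shards)
        }

    def increment(self, delta, rng=random):
        # Pick one shard at random so concurrent writes rarely hit the same document.
        path = rng.choice(list(self.shards))
        self.shards[path] += delta
        return path

    def total(self):
        # Reading the score means reading every shard and summing.
        return sum(self.shards.values())

counter = ShardedCounter("u123")
for _ in range(1000):
    counter.increment(1)
print(counter.total())  # 1000
```

The trade-off is visible even in this toy version: each shard sees roughly a tenth of the write rate, but every read of the score touches all 10 documents.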

The real problem is that manual sharding pushes consistency into your application code. Firestore transactions and batched writes can span multiple documents, but wrapping every shard update in a transaction with the rest of the payment brings back the contention that sharding was meant to remove. Once shards are written outside a transaction, you lose the ACID guarantees that make a transactional database useful.

Collections with millions of documents: the invisible index that paralyzes your app


Firestore's second trap appears when your collection exceeds 10 million documents. Technically, Firestore can handle collections with hundreds of millions of records. But what they don't tell you is that queries against a very large collection can cost far more than the size of the result set suggests, even if you're filtering by indexed fields.

Klarna stored all historical transactions in a single transactions collection. When they reached 50 million documents, queries with where() clauses began taking 8-12 seconds, even with correctly configured composite indexes. Curiously, the problem wasn't the lack of indexes but that Firestore scans complete indexes before applying result limits. If you execute a .limit(10) on a 50 million collection, Firestore evaluates the entire index before returning just 10 documents.

The Architecture Stripe Uses (and Firestore Can't Replicate)

Stripe faces a similar problem: billions of transactions need to be queried in real-time. But their architecture is radically different. They use PostgreSQL with partitioning by date and a BRIN (Block Range Index) on the timestamp. When you query transactions from the last 30 days, PostgreSQL only scans the relevant partitions, completely ignoring historical tables.

Firestore doesn't have native partitioning. You can't tell it to "automatically ignore documents older than 6 months." The only solution is to manually maintain separate collections: transactions_2025_Q1, transactions_2025_Q2, etc. But this breaks the serverless abstraction that Firebase sold. Now your code has to know which collection to search in, manage cross-collection queries, and handle TTLs manually.
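To make the routing burden concrete, here is a minimal sketch of the bookkeeping that manual partitioning pushes into application code. The helper names `collection_for` and `collections_between` are hypothetical; only the transactions_YYYY_Qn naming comes from the text above.

```python
from datetime import date

def collection_for(day: date) -> str:
    """Name of the quarterly collection that holds a transaction from `day`."""
    quarter = (day.month - 1) // 3 + 1
    return f"transactions_{day.year}_Q{quarter}"

def collections_between(start: date, end: date) -> list[str]:
    """All quarterly collections a query over [start, end] has to touch."""
    if start > end:
        raise ValueError("start must not be after end")
    names = []
    year, quarter = start.year, (start.month - 1) // 3 + 1
    end_key = (end.year, (end.month - 1) // 3 + 1)
    while (year, quarter) <= end_key:
        names.append(f"transactions_{year}_Q{quarter}")
        quarter += 1
        if quarter > 4:
            year, quarter = year + 1, 1
    return names

print(collection_for(date(2025, 5, 17)))
# transactions_2025_Q2
print(collections_between(date(2024, 11, 20), date(2025, 4, 2)))
# ['transactions_2024_Q4', 'transactions_2025_Q1', 'transactions_2025_Q2']
```

A date-range query then becomes one query per returned collection, with the results merged and sorted in application code.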

Klarna tried this strategy and discovered a new issue: Firestore charges for index reads, not returned documents. If your query touches 100,000 documents before returning 10 results, you pay for 100,000 reads. In their case, this represented 60% of their monthly Firestore bill: $108,000 of $180,000 were index reads in historical collections.

Real-time listeners: the feature that destroys your budget when you scale

The third breaking point is the most insidious because it's Firestore's flagship feature: real-time listeners. The promise is elegant: your frontend subscribes to document changes, and Firestore pushes updates automatically without polling. In development, it works perfectly. In production with 500,000 simultaneous connections, it's a financial bomb.

Klarna had active listeners in each user session to update the cart state in real-time. With 800,000 concurrent users, that meant 800,000 active WebSocket connections. Firestore handled the connections without issue, but each change in the carts collection triggered 800,000 simultaneous notifications. Each write cost an additional 800,000 reads because Firestore has to evaluate if each listener needs to be notified.

The team discovered this when their Firestore bill tripled in a month without changes in write traffic. The culprit: they had added a last_updated field to the cart document, updating it every time the user scrolled. That field generated 40 million daily writes, which at 800,000 listener evaluations per write works out to 32 trillion listener evaluations a day.

Why Redis Pub/Sub Is Not a Direct Solution

The obvious solution is to move listeners to Redis Pub/Sub and use Firestore only for persistence. Klarna tried this but encountered an architectural problem: Redis Pub/Sub does not persist messages. If a client disconnects for 30 seconds, it misses all updates during that period. Firestore, on the other hand, maintains a change log and can automatically sync when the client reconnects.

To replicate this with Redis, you need to implement your own logging system: store each change in a list with TTL, identify which client received which message, and manage resynchronizations manually. It's doable, but now you're building the functionality that Firebase offered out-of-the-box. And that custom code has bugs, requires maintenance, and adds latency.
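A rough sketch of the replay log described above, kept in memory so it runs without a Redis server. `ReplayLog` and its methods are hypothetical names, and the sequence-number scheme is one possible way to track which messages a client has seen, not something the source specifies.

```python
import time
from collections import deque

class ReplayLog:
    """Per-channel change log with sequence numbers and a TTL (in-memory stand-in)."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.entries = deque()  # (seq, timestamp, payload), oldest first
        self.next_seq = 1

    def publish(self, payload):
        self._expire()
        seq = self.next_seq
        self.next_seq += 1
        self.entries.append((seq, self.clock(), payload))
        return seq

    def replay_since(self, last_seen_seq):
        """Return (complete, missed).

        complete is False when some entry the client needs has already expired,
        which means the client has to fall back to a full resync.
        """
        self._expire()
        oldest = self.entries[0][0] if self.entries else self.next_seq
        complete = last_seen_seq + 1 >= oldest
        missed = [(s, p) for s, _, p in self.entries if s > last_seen_seq]
        return complete, missed

    def _expire(self):
        # Drop entries older than the TTL.
        cutoff = self.clock() - self.ttl
        while self.entries and self.entries[0][1] < cutoff:
            self.entries.popleft()

now = [0.0]  # fake clock so the example is deterministic
log = ReplayLog(ttl_seconds=60, clock=lambda: now[0])
log.publish({"cart": "a"})   # seq 1 at t=0
now[0] = 30
log.publish({"cart": "b"})   # seq 2 at t=30
print(log.replay_since(1))   # (True, [(2, {'cart': 'b'})])
now[0] = 100                 # both entries are now past the 60 s TTL
print(log.replay_since(1))   # (False, [])
```

The second call shows the failure mode: a client that stays offline longer than the TTL cannot be caught up from the log and needs a full reload of its state.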

Klarna's final architecture was hybrid: Redis Pub/Sub for real-time notifications, Firestore for critical persistence, and PostgreSQL for analytical queries. However, by coordinating three different systems, they lost the simplicity that initially led them to choose Firebase.

The Hidden Cost: The Migration No One Budgets For

The fourth problem with Firestore isn't technical but organizational. When you decide to migrate away from Firestore, you're not just rewriting queries. You're redesigning your data model, rewriting your entire data access layer, and migrating petabytes of information with no downtime. And since Firestore is document-oriented with flexible schemas, each document can have different fields, which makes a fully automated migration impractical.

Klarna dedicated a team of 8 engineers over 7 months to the migration. The first 3 months were just planning: designing the PostgreSQL schema, deciding what data to migrate vs. archive, and building a dual-write synchronization pipeline (writing simultaneously to Firestore and PostgreSQL during the transition). The total cost: $2.3 million in salaries, staging infrastructure, and losses from production bugs.

Real-Time ETL: The Piece Firebase Doesn't Offer

The most complex problem was maintaining consistency during the migration. Klarna couldn't afford downtime, so they implemented dual-writes: each transaction was written to Firestore and PostgreSQL simultaneously. However, Firestore doesn't support two-phase commits: you can't perform a distributed rollback if PostgreSQL accepts the write but Firestore fails.
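Without a two-phase commit, a dual-write layer can only record partial failures for later repair. The sketch below illustrates that idea with hypothetical names (`DualWriter`, `repair_queue`) and plain dicts in place of the real databases; it is an assumption about how such a layer might look, not the pipeline described in the source.

```python
class DualWriter:
    """Write each record to every store; queue failed writes instead of rolling back."""

    def __init__(self, stores):
        # stores: mapping of name -> callable(key, value) that raises on failure
        self.stores = stores
        self.repair_queue = []  # (store_name, key, value) writes to retry later

    def write(self, key, value):
        failed = []
        for name, put in self.stores.items():
            try:
                put(key, value)
            except Exception:
                # No distributed rollback is available, so remember the gap.
                failed.append(name)
                self.repair_queue.append((name, key, value))
        return failed

postgres, firestore = {}, {}

def firestore_put(key, value):
    if key == "txn-2":  # simulate an outage for one write
        raise RuntimeError("simulated Firestore failure")
    firestore[key] = value

writer = DualWriter({"postgres": postgres.__setitem__, "firestore": firestore_put})
writer.write("txn-1", {"amount": 10})
print(writer.write("txn-2", {"amount": 25}))  # ['firestore']
print(writer.repair_queue)  # [('firestore', 'txn-2', {'amount': 25})]
```

After the second call the two stores disagree about txn-2, which is exactly the state a later repair pass has to detect and resolve.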

They ended up building a reconciliation system: a job that ran every 5 minutes comparing Firestore vs. PostgreSQL and resolving discrepancies. But how do you decide which is the source of truth? If a document exists in Firestore but not in PostgreSQL, was it a write error or a race condition? During migration, they detected 40,000 inconsistent records needing manual review.
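The comparison step of such a job can be sketched as a diff over two keyed snapshots. `reconcile` and the report keys are hypothetical names, and the sketch only classifies discrepancies; deciding which side is the source of truth remains a policy question that code like this cannot answer.

```python
def reconcile(firestore_docs, postgres_rows):
    """Classify discrepancies between two {id: record} snapshots."""
    report = {"missing_in_postgres": [], "missing_in_firestore": [], "mismatched": []}
    for key in sorted(firestore_docs.keys() | postgres_rows.keys()):
        if key not in postgres_rows:
            report["missing_in_postgres"].append(key)
        elif key not in firestore_docs:
            report["missing_in_firestore"].append(key)
        elif firestore_docs[key] != postgres_rows[key]:
            report["mismatched"].append(key)
    return report

fs = {"t1": {"amount": 10}, "t2": {"amount": 25}, "t3": {"amount": 7}}
pg = {"t1": {"amount": 10}, "t3": {"amount": 9}, "t4": {"amount": 3}}
print(reconcile(fs, pg))
# {'missing_in_postgres': ['t2'], 'missing_in_firestore': ['t4'], 'mismatched': ['t3']}
```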

The irony: Firebase was marketed as a solution to avoid managing databases. In the end, Klarna spent more resources migrating away from Firebase than they would have spent using PostgreSQL from the start.

When Firestore Makes Sense (and When You're Building a Future Problem)

Firestore isn't a bad technology: it's a technology with a very specific use case. It works perfectly for applications with these patterns:

Read-heavy with evenly distributed writes: content apps, blogs, product catalogs
Small documents (<1KB) with low update volume: user profiles, configurations
Rapid prototyping without strict consistency requirements: MVPs, hackathons

But if your app has any of these patterns, Firestore is technical debt in disguise:

Real-time aggregations: dashboards, metrics, counters
Financial transactions with ACID requirements: payments, inventories, billing
Complex searches with multiple filters: date range + category + price filters
Write volume >100K/min in high cardinality collections: logs, events, user activity

The trap with Firestore is that it works perfectly until it doesn't. There's no gradual degradation: one day your app handles 50,000 users without issues, the next day you have 100,000 and everything collapses. And when that happens, you're stuck, because the migration takes months and millions of dollars.

Klarna ended up with an architecture that is more complex, more expensive to operate, and less "serverless" than the one they started with. But their checkouts now process in under 600ms consistently, and their infrastructure bill dropped from $180,000 to $90,000 monthly by combining PostgreSQL ($40K on RDS), Redis ($15K on ElastiCache), and a reduced Firestore footprint for specific features ($35K).

Is your startup building on Firestore? The question isn't if you'll need to migrate, but when. And the longer you wait, the more expensive it will be.

Editorial note: This article was generated with AI assistance and reviewed by the NewsTide editorial team to ensure accuracy and relevance.
