/user/kayd @ devops :~$ cat how-to-scale-database-in-microservices.md

Database Scaling: From 100K to 5M Users in 18 Months

Karandeep Singh
• 11 minutes

Summary

Watch a database scale from a single Postgres instance to a sharded architecture supporting 5M users. Start simple, hit real bottlenecks (connection limits, replica lag, query timeouts), fix each one, measure the improvement. Based on scaling a Calgary SaaS from 100K to 5M users over 18 months.

In 2023, I joined a Calgary-based SaaS company with 100,000 users and a single Postgres database. By mid-2024, we had 5 million users and a sharded, multi-region database architecture. This is the story of every scaling decision we made, every bottleneck we hit, and every mistake we fixed.

No theory. No “you might want to consider.” Just the actual progression: single database → connection pool → read replica → caching → sharding. Each step shows the problem that forced us to act, the metrics that proved it, and the fix that bought us the next 6 months of growth.


Starting Point: Single Postgres Database

When I joined, the architecture was simple:

┌──────────────┐
│   API Server │ (10 instances, auto-scaled)
└──────┬───────┘
       │
       ▼
┌──────────────┐
│  Postgres DB │ (1 instance, db.m5.large)
│  - 2 vCPU    │
│  - 8GB RAM   │
│  - 100GB SSD │
└──────────────┘

Metrics at 100K users:

  • Average response time: 180ms
  • 95th percentile: 450ms
  • Database CPU: 35%
  • Connection count: 150 (max 200)
  • Queries per second: ~800

This worked fine. No problems. Then we ran a successful marketing campaign.

Step 1: The First Bottleneck (Connection Limits)

Two weeks after the campaign, API requests started failing:

Error: could not connect to server: remaining connection slots are reserved

Check the database:

SELECT count(*) FROM pg_stat_activity WHERE state = 'active';

Result: 198 connections (max is 200).

The API auto-scaled from 10 to 25 instances. Each instance opened 20 connections to the database on startup. 25 × 20 = 500 connections. But Postgres on db.m5.large only allows 200.

Metrics at this point:

  • Users: 150K
  • API instances: 25
  • Database connections: 198/200 (99%)
  • Query failures: 12% of requests

Why it happened: We scaled the API servers but not the database. More servers = more connections. Hit the wall.
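One guard against this class of failure is to derive each instance's pool size from the server's connection limit instead of hard-coding it. A minimal sketch, using the numbers from the incident above (`perInstancePool` is a hypothetical helper; its result is what you would feed into `db.SetMaxOpenConns` on each API instance):

```go
package main

import "fmt"

// perInstancePool returns the largest per-instance connection pool
// that keeps (instances × pool size) under the server's connection
// limit, holding back a few slots for superuser/admin sessions.
func perInstancePool(serverLimit, reserved, instances int) int {
	return (serverLimit - reserved) / instances
}

func main() {
	// 25 auto-scaled API instances against Postgres's 200 slots,
	// with 10 held back for admin access.
	fmt.Println(perInstancePool(200, 10, 25)) // 7 connections each
}
```

Deriving the cap this way means auto-scaling the API tier forces a conscious decision about the database, instead of silently exceeding its limit.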

Step 2: First Fix — Connection Pooling

The simple fix: add PgBouncer, a connection pooler that sits between the API and the database. It maintains a small pool of database connections and reuses them across many API connections.

┌──────────────┐
│   API Server │ (25 instances)
└──────┬───────┘
       │
       ▼
┌──────────────┐
│  PgBouncer   │ (connection pooler)
│  max_client  │ = 5000
│  pool_size   │ = 100
└──────┬───────┘
       │
       ▼
┌──────────────┐
│  Postgres DB │
└──────────────┘

PgBouncer configuration:

[databases]
myapp = host=postgres.internal port=5432 dbname=myapp

[pgbouncer]
pool_mode = transaction
max_client_conn = 5000
default_pool_size = 100

pool_mode = transaction means PgBouncer returns a connection to the pool after each transaction, not when the client disconnects. This is the key setting: one database connection can serve thousands of API requests per second. The trade-off is that session state (SET commands, session-level prepared statements, advisory locks) does not survive across transactions, so the application must not depend on it.

Deploy PgBouncer. Update the API connection string from postgres://postgres.internal:5432/myapp to postgres://pgbouncer.internal:6432/myapp.

Results after deploying PgBouncer:

  • Database connections: 100 (from 198)
  • API instances: 25 (unchanged)
  • Connection errors: 0%
  • Response time: 175ms (slightly better)

This fixed the connection problem. But usage kept growing.

Step 3: The Second Bottleneck (Database CPU)

Two months later, response times started degrading:

Metrics:

  • Users: 300K
  • Average response time: 850ms (was 180ms)
  • 95th percentile: 2.3s
  • Database CPU: 92%
  • Queries per second: ~3,200

Check the slow query log:

SELECT * FROM users WHERE email = $1;  -- 450ms
SELECT * FROM orders WHERE user_id = $1 ORDER BY created_at DESC LIMIT 20;  -- 680ms

These queries used to take 5ms. Now they’re 100x slower. Why?

Check CPU usage:

htop

Both CPUs pegged at 100%. The database is CPU-bound. Every query waits for CPU time.

Why it happened: Read queries scale with users. More users = more reads. The db.m5.large (2 vCPU) can’t handle 3,200 queries/sec.

Step 4: Second Fix — Read Replica

Vertical scaling (bigger instance) is the easy fix. We tried db.m5.xlarge (4 vCPU, 16GB RAM). That helped for 3 weeks. Then CPU hit 90% again. Vertical scaling has limits.

The real fix: horizontal scaling with a read replica. Most queries are reads (95% of our traffic). Reads can be served from a replica. Writes still go to the primary.

                     ┌──────────────┐
                ┌───▶│ Read Replica │◀─── Read queries (95%)
                │    └──────────────┘
┌──────────────┐│
│  PgBouncer   ├┤
└──────────────┘│
                │    ┌──────────────┐
                └───▶│  Primary DB  │◀─── Write queries (5%)
                     └──────┬───────┘
                            │
                            │ (replication)
                            ▼
                     ┌──────────────┐
                     │ Read Replica │
                     └──────────────┘

Create the read replica (AWS RDS):

aws rds create-db-instance-read-replica \
  --db-instance-identifier myapp-read-replica-1 \
  --source-db-instance-identifier myapp-primary \
  --db-instance-class db.m5.large \
  --publicly-accessible false

Update the API to split reads and writes. We used a simple wrapper:

package db

import (
	"database/sql"
	"sync/atomic"
)

type DB struct {
	primary  *sql.DB
	replicas []*sql.DB
	current  uint64 // round-robin counter, advanced atomically
}

func (d *DB) Write() *sql.DB {
	return d.primary
}

func (d *DB) Read() *sql.DB {
	// Round-robin across replicas. The atomic increment keeps the
	// counter safe when many request goroutines call Read at once.
	n := atomic.AddUint64(&d.current, 1)
	return d.replicas[n%uint64(len(d.replicas))]
}

Usage in API:

// Write
_, err := db.Write().Exec("INSERT INTO users ...")

// Read
rows, err := db.Read().Query("SELECT * FROM users WHERE ...")

Results after adding read replica:

  • Primary CPU: 18% (was 92%)
  • Replica CPU: 45%
  • Average response time: 210ms (was 850ms)
  • 95th percentile: 480ms (was 2.3s)

The primary now only handles writes (5% of traffic). The replica handles all reads. We added a second replica a month later to split read traffic 50/50.

Step 5: The Third Bottleneck (Replica Lag)

With read replicas deployed, we started getting bug reports: “I just changed my password but the app still shows the old one.”

Check replication lag:

-- On replica
SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) AS lag_seconds;

Result: 8 seconds.

The replica is 8 seconds behind the primary. A user changes their password (write goes to primary), then immediately requests their profile (read goes to replica). The replica doesn’t have the new password yet. Stale read.

Why it happened: High write volume + network latency. Postgres replicates asynchronously by default. The replica applies changes as fast as it can, but it falls behind during peak traffic.

Step 6: Third Fix — Read-After-Write Consistency

We can’t force the replica to catch up faster, and synchronous replication would make every write wait for the replica to confirm it, which we couldn’t afford. Instead, fix it at the application level: after a write, route that user’s reads to the primary for the next N seconds.

package db

import (
	"context"
	"database/sql"
	"sync"
	"sync/atomic"
	"time"
)

type DB struct {
	primary  *sql.DB
	replicas []*sql.DB
	current  uint64 // round-robin counter, advanced atomically

	// Track recent writes per user. Entries older than the
	// tolerance can be swept periodically to bound memory.
	recentWrites map[string]time.Time
	mu           sync.RWMutex
}

const stalenessTolerance = 10 * time.Second

func (d *DB) Write(ctx context.Context, userID string) *sql.DB {
	d.mu.Lock()
	d.recentWrites[userID] = time.Now()
	d.mu.Unlock()
	return d.primary
}

func (d *DB) Read(ctx context.Context, userID string) *sql.DB {
	d.mu.RLock()
	writeTime, exists := d.recentWrites[userID]
	d.mu.RUnlock()

	if exists && time.Since(writeTime) < stalenessTolerance {
		// The user wrote recently: read from the primary so they
		// are guaranteed to see their own write.
		return d.primary
	}

	// Otherwise round-robin across replicas.
	n := atomic.AddUint64(&d.current, 1)
	return d.replicas[n%uint64(len(d.replicas))]
}

After a user writes, their reads go to the primary for 10 seconds. After 10 seconds, they go back to the replica. This guarantees consistency where it matters.

Results:

  • Stale read reports: 0 (was 5-10 per day)
  • Primary CPU increased slightly: 22% (from 18%)
  • Replica CPU: 40% each

The trade-off: primary handles slightly more read traffic (only recent-write users). But consistency is guaranteed.

Step 7: The Fourth Bottleneck (Query Performance)

At 800K users, queries started slowing down again despite CPU being fine:

SELECT * FROM orders WHERE user_id = $1 ORDER BY created_at DESC LIMIT 20;

Execution time: 2.1 seconds.

Check the query plan:

EXPLAIN ANALYZE SELECT * FROM orders WHERE user_id = $1 ORDER BY created_at DESC LIMIT 20;
Limit  (cost=45231.23..45231.28 rows=20 width=180) (actual time=2087.231..2087.245 rows=20 loops=1)
  ->  Sort  (cost=45231.23..45356.89 rows=50265 width=180) (actual time=2087.229..2087.236 rows=20 loops=1)
        Sort Key: created_at DESC
        Sort Method: top-N heapsort  Memory: 29kB
        ->  Seq Scan on orders  (cost=0.00..43980.12 rows=50265 width=180) (actual time=0.034..1843.421 rows=48392 loops=1)
              Filter: (user_id = $1)
              Rows Removed by Filter: 4738261
Planning Time: 0.215 ms
Execution Time: 2087.291 ms

Seq Scan = sequential scan. Postgres read 4.7 million rows to find 48K matching rows. No index on user_id.

Add the index:

CREATE INDEX CONCURRENTLY idx_orders_user_id_created_at ON orders (user_id, created_at DESC);

CONCURRENTLY means it builds the index without locking the table. Takes longer but doesn’t block writes.

Check the query plan again:

Limit  (cost=0.43..45.67 rows=20 width=180) (actual time=0.123..0.234 rows=20 loops=1)
  ->  Index Scan using idx_orders_user_id_created_at on orders  (cost=0.43..113456.78 rows=50265 width=180) (actual time=0.121..0.226 rows=20 loops=1)
        Index Cond: (user_id = $1)
Planning Time: 0.089 ms
Execution Time: 0.267 ms

Index Scan now. Execution time: 0.27ms (was 2087ms). 7,700x faster.

Lesson: Indexes matter more than hardware. Fix your queries before throwing bigger servers at the problem.

Step 8: The Fifth Bottleneck (Data Size and Cost)

At 2M users, we hit a cost problem. Database storage grew to 800GB. AWS RDS charges $0.115/GB/month for SSD. That’s $92/month for storage alone. Plus the primary and 2 replicas = 3 × 800GB = 2.4TB = $276/month just for storage.

And the backups. RDS automated backups were 7 days × 800GB = 5.6TB. Another $200/month.

Total database cost: ~$1,800/month and growing.
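The storage math is worth keeping as a formula, because every replica multiplies it. A quick sketch in Go, using integer math to avoid float rounding (the $0.115/GB/month rate is the one quoted above):

```go
package main

import "fmt"

// storageCostDollars computes the monthly RDS SSD bill for `copies`
// full copies of a database of `sizeGB`, at $0.115/GB/month
// (115 thousandths of a dollar per GB-month, kept as integers).
func storageCostDollars(sizeGB, copies int) int {
	return sizeGB * copies * 115 / 1000
}

func main() {
	fmt.Println(storageCostDollars(800, 1)) // 92  — one copy
	fmt.Println(storageCostDollars(800, 3)) // 276 — primary + 2 replicas
}
```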

Check what’s using space:

SELECT
  schemaname,
  tablename,
  pg_size_pretty(pg_total_relation_size(schemaname || '.' || tablename)) AS size
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(schemaname || '.' || tablename) DESC;
 schemaname |    tablename    |  size
------------+-----------------+---------
 public     | events          | 520 GB
 public     | logs            | 180 GB
 public     | orders          | 65 GB
 public     | users           | 35 GB

The events and logs tables are analytics data. We’re querying them for reports, but they don’t need to be in the primary database.

Step 9: Fifth Fix — Offload Analytics to Data Warehouse

Move events and logs to a separate data warehouse. We chose AWS Redshift (columnar database optimized for analytics).

Set up a daily sync:

# Export from Postgres
pg_dump --table events --table logs --data-only myapp > /tmp/analytics.sql

# Import to Redshift
psql -h analytics.redshift.amazonaws.com -d analytics -f /tmp/analytics.sql

Later, we automated this with AWS Data Pipeline.

Delete old data from Postgres (large deletes like these leave dead tuples behind, so a VACUUM afterwards is needed before the space becomes reusable):

DELETE FROM events WHERE created_at < NOW() - INTERVAL '90 days';
DELETE FROM logs WHERE created_at < NOW() - INTERVAL '30 days';

Results:

  • Database size: 300GB (from 800GB)
  • Monthly cost: $800 (from $1,800)
  • Query performance: unchanged (analytics queries now go to Redshift)

Analytics queries got faster too because Redshift is optimized for them. Win-win.

Step 10: The Final Bottleneck (Write Throughput)

At 4M users, writes started queueing:

Metrics:

  • Users: 4M
  • Write queries/sec: 1,200
  • Primary CPU: 85%
  • Disk I/O: 8,000 IOPS (max 10,000)
  • Write latency: 950ms

The primary can’t keep up. We’ve optimized queries. We’ve offloaded reads. But writes have to go to one place: the primary.

Options:

  1. Vertical scaling: Bigger instance (more CPU, more IOPS)
  2. Sharding: Split data across multiple databases

Vertical scaling bought us 2 more months with db.m5.4xlarge (16 vCPU, 64GB RAM, 20,000 IOPS). Then we hit the same wall. Time to shard.


Step 11: Final Fix — Sharding

Sharding means splitting data across multiple databases. Each database (shard) holds a subset of users.

API determines user's shard:
  shard_id = user_id % num_shards

┌─────────────┐
│  API Server │
└──────┬──────┘
       │
       ├───▶ Shard 0 (user_id % 4 == 0)
       ├───▶ Shard 1 (user_id % 4 == 1)
       ├───▶ Shard 2 (user_id % 4 == 2)
       └───▶ Shard 3 (user_id % 4 == 3)

We started with 4 shards. Each shard is a full Postgres primary + 2 read replicas.

Shard routing logic:

package db

import (
	"database/sql"
	"fmt"
)

type ShardedDB struct {
	shards []*DB
}

func (s *ShardedDB) GetShard(userID int64) *DB {
	shardID := userID % int64(len(s.shards))
	return s.shards[shardID]
}

func (s *ShardedDB) Write(userID int64) *sql.DB {
	return s.GetShard(userID).Write()
}

func (s *ShardedDB) Read(userID int64) *sql.DB {
	return s.GetShard(userID).Read()
}

Usage:

shard := db.GetShard(userID)
_, err := shard.Write().Exec("INSERT INTO orders ...")

Migration process:

  1. Create 4 empty databases (shards)
  2. Write a migration script that reads from the old database and writes to the correct shard
  3. Run the migration over a weekend (6 hours for 4M users)
  4. Update the API to use sharded routing
  5. Monitor for 1 week, then decommission the old database
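In step 2, the migration script's routing rule must exactly match the API's, or rows land on the wrong shard. A hypothetical sketch of just that piece (`bucketByShard` is illustrative, not our actual migration script):

```go
package main

import "fmt"

// bucketByShard groups user IDs by target shard using the same
// user_id % num_shards rule the API router applies, so the bulk copy
// and the runtime routing can never disagree.
func bucketByShard(userIDs []int64, numShards int64) map[int64][]int64 {
	buckets := make(map[int64][]int64)
	for _, id := range userIDs {
		shard := id % numShards
		buckets[shard] = append(buckets[shard], id)
	}
	return buckets
}

func main() {
	buckets := bucketByShard([]int64{101, 102, 103, 104, 105}, 4)
	for shard := int64(0); shard < 4; shard++ {
		fmt.Println(shard, buckets[shard])
	}
}
```

In practice the script would stream rows per bucket into each shard; sharing one routing function between the migration and the API is the point.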

Results after sharding:

  • Primary CPU per shard: 22% (was 85% on single database)
  • Write latency: 180ms (was 950ms)
  • Write throughput: 4,800 queries/sec total (was 1,200)
  • Can scale to 16M users before needing more shards

What We Built

Over 18 months, the database architecture evolved through 6 major changes:

  1. Single database → Connection errors at 150K users
  2. Added PgBouncer → Fixed connections, hit CPU limits at 300K users
  3. Added read replicas → Fixed CPU, encountered replica lag
  4. Read-after-write consistency → Fixed stale reads, hit query performance issues
  5. Added indexes → Fixed queries, hit storage costs at 2M users
  6. Offloaded analytics → Reduced costs, hit write throughput at 4M users
  7. Sharding → Now supports 5M users, can scale to 16M+

Each change bought us 2-6 months of growth. Total cost went from $1,800/month (single database) to $4,200/month (4 shards + Redshift), but we’re serving 50x more users.


Key Metrics Summary

| Stage | Users | Architecture | Avg Response | P95 Response | DB Cost/Month |
|---|---|---|---|---|---|
| Start | 100K | Single DB | 180ms | 450ms | $400 |
| Connection Pool | 150K | Single DB + PgBouncer | 175ms | 440ms | $450 |
| Read Replica | 500K | Primary + 2 Replicas | 210ms | 480ms | $1,200 |
| Analytics Offload | 2M | Primary + 2 Replicas + Redshift | 195ms | 420ms | $1,100 |
| Sharding | 5M | 4 Shards (12 DBs total) + Redshift | 185ms | 395ms | $4,200 |

Response times actually improved as we scaled because we fixed bottlenecks at each stage.


Lessons Learned

1. Don’t shard too early. We could have sharded at 300K users. But we didn’t need to. Vertical scaling + replicas + indexes got us to 4M users before sharding was necessary.

2. Measure before fixing. Every change was driven by metrics. CPU high? Add replicas. Writes slow? Add indexes. Costs high? Offload cold data.

3. Application-level fixes are cheaper than infrastructure. Connection pooling cost $0 and bought us 6 months. Read-after-write consistency cost 2 days of dev time and fixed stale reads. Sharding cost 4 weeks of dev time and $3,000/month more infrastructure.

4. Indexes matter more than hardware. That 7,700x speedup from one index saved us from sharding for another 6 months.

5. Know your bottleneck. CPU? Add replicas. I/O? Bigger disks. Connections? Add pooling. Writes? Shard. Don’t guess. Profile.

