Database Scaling: From 100K to 5M Users in 18 Months

Summary
In 2023, I joined a Calgary-based SaaS company with 100,000 users and a single Postgres database. By mid-2024, we had 5 million users and a sharded, multi-region database architecture. This is the story of every scaling decision we made, every bottleneck we hit, and every mistake we fixed.
No theory. No “you might want to consider.” Just the actual progression: single database → connection pool → read replica → caching → sharding. Each step shows the problem that forced us to act, the metrics that proved it, and the fix that bought us the next 6 months of growth.

Starting Point: Single Postgres Database
When I joined, the architecture was simple:
┌──────────────┐
│  API Server  │ (10 instances, auto-scaled)
└──────┬───────┘
       │
       ▼
┌──────────────┐
│ Postgres DB  │ (1 instance, db.m5.large)
│  - 2 vCPU    │
│  - 8GB RAM   │
│  - 100GB SSD │
└──────────────┘
Metrics at 100K users:
- Average response time: 180ms
- 95th percentile: 450ms
- Database CPU: 35%
- Connection count: 150 (max 200)
- Queries per second: ~800
This worked fine. No problems. Then we ran a successful marketing campaign.
Step 1: The First Bottleneck (Connection Limits)
Two weeks after the campaign, API requests started failing:
Error: could not connect to server: remaining connection slots are reserved
Check the database:
SELECT count(*) FROM pg_stat_activity WHERE state = 'active';
Result: 198 connections (max is 200).
The API auto-scaled from 10 to 25 instances. Each instance opened 20 connections to the database on startup. 25 × 20 = 500 connections. But Postgres on db.m5.large only allows 200.
Metrics at this point:
- Users: 150K
- API instances: 25
- Database connections: 198/200 (99%)
- Query failures: 12% of requests
Why it happened: We scaled the API servers but not the database. More servers = more connections. Hit the wall.
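The failure mode is pure arithmetic. A minimal Go sketch, using the numbers from the incident above:

```go
package main

import "fmt"

// totalConnections returns how many database connections a fleet opens
// when every API instance holds its own fixed-size pool.
func totalConnections(instances, poolPerInstance int) int {
	return instances * poolPerInstance
}

func main() {
	const maxConnections = 200 // Postgres connection cap on db.m5.large

	demand := totalConnections(25, 20)   // 25 auto-scaled instances × 20 connections each
	fmt.Println(demand)                  // 500
	fmt.Println(demand > maxConnections) // true: connection slots are exhausted
}
```

Auto-scaling multiplies the per-instance pool by the instance count, but the database's limit stays fixed; the two curves cross and requests start failing.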
Step 2: First Fix — Connection Pooling
The simple fix: add PgBouncer, a connection pooler that sits between the API and the database. It maintains a small pool of database connections and reuses them across many API connections.
┌──────────────────┐
│    API Server    │ (25 instances)
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│    PgBouncer     │ (connection pooler)
│ max_client: 5000 │
│ pool_size:  100  │
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│   Postgres DB    │
└──────────────────┘
PgBouncer configuration:
[databases]
myapp = host=postgres.internal port=5432 dbname=myapp

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
pool_mode = transaction
max_client_conn = 5000
default_pool_size = 100
pool_mode = transaction means PgBouncer returns the connection to the pool after each transaction, not after the client disconnects. This is key. One database connection can serve thousands of API requests per second.
Deploy PgBouncer. Update the API connection string from postgres://postgres.internal:5432/myapp to postgres://pgbouncer.internal:6432/myapp.
Results after deploying PgBouncer:
- Database connections: 100 (from 198)
- API instances: 25 (unchanged)
- Connection errors: 0%
- Response time: 175ms (slightly better)
This fixed the connection problem. But usage kept growing.
Step 3: The Second Bottleneck (Database CPU)
Two months later, response times started degrading:
- Users: 300K
- Average response time: 850ms (was 180ms)
- 95th percentile: 2.3s
- Database CPU: 92%
- Queries per second: ~3,200
Check the slow query log:
SELECT * FROM users WHERE email = $1; -- 450ms
SELECT * FROM orders WHERE user_id = $1 ORDER BY created_at DESC LIMIT 20; -- 680ms
These queries used to take 5ms. Now they’re 100x slower. Why?
Check CPU usage:
htop
Both CPUs pegged at 100%. The database is CPU-bound. Every query waits for CPU time.
Why it happened: Read queries scale with users. More users = more reads. The db.m5.large (2 vCPU) can’t handle 3,200 queries/sec.
Step 4: Second Fix — Read Replica
Vertical scaling (bigger instance) is the easy fix. We tried db.m5.xlarge (4 vCPU, 16GB RAM). That helped for 3 weeks. Then CPU hit 90% again. Vertical scaling has limits.
The real fix: horizontal scaling with a read replica. Most queries are reads (95% of our traffic). Reads can be served from a replica. Writes still go to the primary.
                 ┌──────────────┐
            ┌───▶│ Read Replica │◀─── Read queries (95%)
            │    └──────────────┘
┌───────────┴──┐
│  PgBouncer   │
└───────────┬──┘
            │    ┌──────────────┐
            └───▶│  Primary DB  │◀─── Write queries (5%)
                 └──────┬───────┘
                        │ (replication)
                        ▼
                 ┌──────────────┐
                 │ Read Replica │
                 └──────────────┘
Create the read replica (AWS RDS):
aws rds create-db-instance-read-replica \
  --db-instance-identifier myapp-read-replica-1 \
  --source-db-instance-identifier myapp-primary \
  --db-instance-class db.m5.large \
  --no-publicly-accessible
Update the API to split reads and writes. We used a simple wrapper:
package db

import (
	"database/sql"
	"sync/atomic"
)

type DB struct {
	primary  *sql.DB
	replicas []*sql.DB
	current  uint64 // atomic round-robin counter
}

func (d *DB) Write() *sql.DB {
	return d.primary
}

func (d *DB) Read() *sql.DB {
	// Round-robin across replicas; the atomic increment keeps this
	// safe under concurrent requests.
	n := atomic.AddUint64(&d.current, 1)
	return d.replicas[n%uint64(len(d.replicas))]
}
Usage in API:
// Write
_, err := db.Write().Exec("INSERT INTO users ...")
// Read
rows, err := db.Read().Query("SELECT * FROM users WHERE ...")
Results after adding read replica:
- Primary CPU: 18% (was 92%)
- Replica CPU: 45%
- Average response time: 210ms (was 850ms)
- 95th percentile: 480ms (was 2.3s)
The primary now only handles writes (5% of traffic). The replica handles all reads. We added a second replica a month later to split read traffic 50/50.
Step 5: The Third Bottleneck (Replica Lag)
With read replicas deployed, we started getting bug reports: “I just changed my password but the app still shows the old one.”
Check replication lag:
-- On replica
SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) AS lag_seconds;
Result: 8 seconds.
The replica is 8 seconds behind the primary. A user changes their password (write goes to primary), then immediately requests their profile (read goes to replica). The replica doesn’t have the new password yet. Stale read.
Why it happened: High write volume + network latency. Postgres replicates asynchronously by default. The replica applies changes as fast as it can, but it falls behind during peak traffic.
Step 6: Third Fix — Read-After-Write Consistency
We can’t force the replica to catch up faster, and synchronous replication would make every write wait for the replica to confirm, wrecking write latency. Instead, fix it at the application level: after a write, route that user’s reads to the primary for the next N seconds.
package db

import (
	"context"
	"database/sql"
	"sync"
	"sync/atomic"
	"time"
)

type DB struct {
	primary  *sql.DB
	replicas []*sql.DB
	current  uint64 // atomic round-robin counter

	// Track recent writes per user
	recentWrites map[string]time.Time
	mu           sync.RWMutex
}

const stalenessTolerance = 10 * time.Second

func (d *DB) Write(ctx context.Context, userID string) *sql.DB {
	d.mu.Lock()
	d.recentWrites[userID] = time.Now()
	d.mu.Unlock()
	return d.primary
}

func (d *DB) Read(ctx context.Context, userID string) *sql.DB {
	d.mu.RLock()
	writeTime, exists := d.recentWrites[userID]
	d.mu.RUnlock()
	if exists && time.Since(writeTime) < stalenessTolerance {
		// Read from primary if the user wrote recently
		return d.primary
	}
	// Otherwise round-robin across replicas
	n := atomic.AddUint64(&d.current, 1)
	return d.replicas[n%uint64(len(d.replicas))]
}
After a user writes, their reads go to the primary for 10 seconds. After 10 seconds, they go back to the replica. This guarantees consistency where it matters.
Results:
- Stale read reports: 0 (was 5-10 per day)
- Primary CPU increased slightly: 22% (from 18%)
- Replica CPU: 40% each
The trade-off: primary handles slightly more read traffic (only recent-write users). But consistency is guaranteed.
Step 7: The Fourth Bottleneck (Query Performance)
At 800K users, queries started slowing down again despite CPU being fine:
SELECT * FROM orders WHERE user_id = $1 ORDER BY created_at DESC LIMIT 20;
Execution time: 2.1 seconds.
Check the query plan:
EXPLAIN ANALYZE SELECT * FROM orders WHERE user_id = $1 ORDER BY created_at DESC LIMIT 20;
Limit  (cost=45231.23..45231.28 rows=20 width=180) (actual time=2087.231..2087.245 rows=20 loops=1)
  ->  Sort  (cost=45231.23..45356.89 rows=50265 width=180) (actual time=2087.229..2087.236 rows=20 loops=1)
        Sort Key: created_at DESC
        Sort Method: top-N heapsort  Memory: 29kB
        ->  Seq Scan on orders  (cost=0.00..43980.12 rows=50265 width=180) (actual time=0.034..1843.421 rows=48392 loops=1)
              Filter: (user_id = $1)
              Rows Removed by Filter: 4738261
Planning Time: 0.215 ms
Execution Time: 2087.291 ms
Seq Scan = sequential scan. Postgres read 4.7 million rows to find 48K matching rows. No index on user_id.
Add the index:
CREATE INDEX CONCURRENTLY idx_orders_user_id_created_at ON orders (user_id, created_at DESC);
CONCURRENTLY means it builds the index without locking the table. Takes longer but doesn’t block writes.
Check the query plan again:
Limit  (cost=0.43..45.67 rows=20 width=180) (actual time=0.123..0.234 rows=20 loops=1)
  ->  Index Scan using idx_orders_user_id_created_at on orders  (cost=0.43..113456.78 rows=50265 width=180) (actual time=0.121..0.226 rows=20 loops=1)
        Index Cond: (user_id = $1)
Planning Time: 0.089 ms
Execution Time: 0.267 ms
Index Scan now. Execution time: 0.27ms (was 2087ms). 7,700x faster.
Lesson: Indexes matter more than hardware. Fix your queries before throwing bigger servers at the problem.
Step 8: The Fifth Bottleneck (Data Size and Cost)
At 2M users, we hit a cost problem. Database storage grew to 800GB. AWS RDS charges $0.115/GB/month for SSD. That’s $92/month for storage alone. Plus the primary and 2 replicas = 3 × 800GB = 2.4TB = $276/month just for storage.
And the backups. RDS automated backups were 7 days × 800GB = 5.6TB. Another $200/month.
Total database cost: ~$1,800/month and growing.
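The storage math above is easy to reproduce. A small Go sketch, using integer mills (tenths of a cent) so the arithmetic stays exact; $0.115/GB/month is the RDS SSD rate quoted above:

```go
package main

import "fmt"

// monthlyStorageCost reproduces the back-of-envelope above. Prices are in
// mills (tenths of a cent): $0.115/GB/month = 115 mills/GB/month.
func monthlyStorageCost(gb, copies int) int {
	const millsPerGB = 115
	return gb * copies * millsPerGB / 1000 // whole dollars
}

func main() {
	fmt.Println(monthlyStorageCost(800, 1)) // 92: the primary alone
	fmt.Println(monthlyStorageCost(800, 3)) // 276: primary + 2 replicas
}
```

The same function makes it obvious that replicas multiply storage cost linearly: every replica is a full copy of the dataset.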
Check what’s using space:
SELECT
schemaname,
tablename,
pg_size_pretty(pg_total_relation_size(schemaname || '.' || tablename)) AS size
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(schemaname || '.' || tablename) DESC;
schemaname | tablename | size
------------+-----------------+---------
public | events | 520 GB
public | logs | 180 GB
public | orders | 65 GB
public | users | 35 GB
The events and logs tables are analytics data. We’re querying them for reports, but they don’t need to be in the primary database.
Step 9: Fifth Fix — Offload Analytics to Data Warehouse
Move events and logs to a separate data warehouse. We chose AWS Redshift (columnar database optimized for analytics).
Set up a daily sync:
# Export from Postgres (plain INSERT statements so Redshift can replay them;
# the default COPY format from pg_dump is not accepted by Redshift)
pg_dump --table events --table logs --data-only --inserts myapp > /tmp/analytics.sql
# Import into Redshift
psql -h analytics.redshift.amazonaws.com -d analytics -f /tmp/analytics.sql
Later, we automated this with AWS Data Pipeline.
Delete old data from Postgres:
DELETE FROM events WHERE created_at < NOW() - INTERVAL '90 days';
DELETE FROM logs WHERE created_at < NOW() - INTERVAL '30 days';
Results:
- Database size: 300GB (from 800GB)
- Monthly cost: $800 (from $1,800)
- Query performance: unchanged (analytics queries now go to Redshift)
Analytics queries got faster too because Redshift is optimized for them. Win-win.
Step 10: The Final Bottleneck (Write Throughput)
At 4M users, writes started queueing:
- Users: 4M
- Write queries/sec: 1,200
- Primary CPU: 85%
- Disk I/O: 8,000 IOPS (max 10,000)
- Write latency: 950ms
The primary can’t keep up. We’ve optimized queries. We’ve offloaded reads. But writes have to go to one place: the primary.
Options:
- Vertical scaling: Bigger instance (more CPU, more IOPS)
- Sharding: Split data across multiple databases
Vertical scaling bought us 2 more months with db.m5.4xlarge (16 vCPU, 64GB RAM, 20,000 IOPS). Then we hit the same wall. Time to shard.
Step 11: Final Fix — Sharding
Sharding means splitting data across multiple databases. Each database (shard) holds a subset of users.
API determines user's shard:
shard_id = user_id % num_shards
┌─────────────┐
│ API Server  │
└──────┬──────┘
       │
       ├───▶ Shard 0 (user_id % 4 == 0)
       ├───▶ Shard 1 (user_id % 4 == 1)
       ├───▶ Shard 2 (user_id % 4 == 2)
       └───▶ Shard 3 (user_id % 4 == 3)
We started with 4 shards. Each shard is a full Postgres primary + 2 read replicas.
Shard routing logic:
package db

import "database/sql"

type ShardedDB struct {
	shards []*DB
}

func (s *ShardedDB) GetShard(userID int64) *DB {
	shardID := userID % int64(len(s.shards))
	return s.shards[shardID]
}

func (s *ShardedDB) Write(userID int64) *sql.DB {
	return s.GetShard(userID).Write()
}

func (s *ShardedDB) Read(userID int64) *sql.DB {
	return s.GetShard(userID).Read()
}
Usage:
shard := db.GetShard(userID)
_, err := shard.Write().Exec("INSERT INTO orders ...")
Migration process:
- Create 4 empty databases (shards)
- Write a migration script that reads from the old database and writes to the correct shard
- Run the migration over a weekend (6 hours for 4M users)
- Update the API to use sharded routing
- Monitor for 1 week, then decommission the old database
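The heart of the migration script is the same modulo routing the API uses. A hypothetical sketch of the bucketing step (the real script streamed rows in batches rather than holding IDs in memory):

```go
package main

import "fmt"

// bucketByShard groups user IDs by destination shard using the same
// user_id % num_shards routing as the API, so migrated rows land on the
// shard the API will later read them from.
func bucketByShard(userIDs []int64, numShards int64) map[int64][]int64 {
	buckets := make(map[int64][]int64, numShards)
	for _, id := range userIDs {
		shard := id % numShards
		buckets[shard] = append(buckets[shard], id)
	}
	return buckets
}

func main() {
	b := bucketByShard([]int64{101, 102, 103, 104, 105}, 4)
	fmt.Println(b[0]) // [104]
	fmt.Println(b[1]) // [101 105]
}
```

Using the identical routing function in both the migration and the API is what guarantees no row ends up on a shard the application will never look at.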
Results after sharding:
- Primary CPU per shard: 22% (was 85% on single database)
- Write latency: 180ms (was 950ms)
- Write throughput: 4,800 queries/sec total (was 1,200)
- Can scale to 16M users before needing more shards
What We Built
Over 18 months, the database architecture evolved through 6 major changes:
- Single database → Connection errors at 150K users
- Added PgBouncer → Fixed connections, hit CPU limits at 300K users
- Added read replicas → Fixed CPU, encountered replica lag
- Read-after-write consistency → Fixed stale reads, hit query performance issues
- Added indexes → Fixed queries, hit storage costs at 2M users
- Offloaded analytics → Reduced costs, hit write throughput at 4M users
- Sharding → Now supports 5M users, can scale to 16M+
Each change bought us 2-6 months of growth. Total cost went from $1,800/month (single database) to $4,200/month (4 shards + Redshift), but we’re serving 50x more users.
Key Metrics Summary
| Stage | Users | Architecture | Avg Response | P95 Response | DB Cost/Month |
|---|---|---|---|---|---|
| Start | 100K | Single DB | 180ms | 450ms | $400 |
| Connection Pool | 150K | Single DB + PgBouncer | 175ms | 440ms | $450 |
| Read Replica | 500K | Primary + 2 Replicas | 210ms | 480ms | $1,200 |
| Analytics Offload | 2M | Primary + 2 Replicas + Redshift | 195ms | 420ms | $1,100 |
| Sharding | 5M | 4 Shards (12 DBs total) + Redshift | 185ms | 395ms | $4,200 |
Response times actually improved as we scaled because we fixed bottlenecks at each stage.
Lessons Learned
1. Don’t shard too early. We could have sharded at 300K users. But we didn’t need to. Vertical scaling + replicas + indexes got us to 4M users before sharding was necessary.
2. Measure before fixing. Every change was driven by metrics. CPU high? Add replicas. Writes slow? Add indexes. Costs high? Offload cold data.
3. Application-level fixes are cheaper than infrastructure. Connection pooling cost $0 and bought us 6 months. Read-after-write consistency cost 2 days of dev time and fixed stale reads. Sharding cost 4 weeks of dev time and $3,000/month more infrastructure.
4. Indexes matter more than hardware. That 7,700x speedup from one index saved us from sharding for another 6 months.
5. Know your bottleneck. CPU? Add replicas. I/O? Bigger disks. Connections? Add pooling. Writes? Shard. Don’t guess. Profile.
Keep Reading
- Task Automation: From Cron to a Go Task Runner — automate database maintenance tasks (backups, migrations, monitoring)
- Nginx Log Analysis: From grep to a Go Log Parser — apply the same scaling patterns to log processing
- Config Templating: From envsubst to Go — manage database connection strings across environments