
Boto3 + AWS Lambda: A Production Serverless Pipeline

Karandeep Singh
• 6 minutes

Summary

Production guide to building serverless data pipelines with Boto3 and Lambda. Based on processing high-volume daily events through an analytics pipeline. Covers cold starts, concurrent execution limits, error handling, retries, and cost optimization.


The Surprisingly High Lambda Bill That Taught Me Boto3

I once built a serverless analytics pipeline. The requirements seemed straightforward:

  • Process user activity events from SQS queue
  • Enrich events with user data from DynamoDB
  • Store processed events in S3 for analysis
  • Handle a high volume of events daily, with significant peak traffic

First month’s AWS bill: much higher than expected.

The problem wasn’t Lambda itself. The problem was how I used Boto3 in Lambda. This article documents the optimization journey that reduced costs significantly while improving reliability.

The Naive First Implementation (That Cost Way Too Much)

Here’s my initial Lambda function - textbook example but terrible for production:

import boto3
import json

# DON'T DO THIS - creates client on every invocation
def lambda_handler(event, context):
    # Per-invocation penalty - these clients are rebuilt on every call, warm or cold
    s3 = boto3.client('s3')
    dynamodb = boto3.client('dynamodb')
    sqs = boto3.client('sqs')

    for record in event['Records']:
        # Parse event
        event_data = json.loads(record['body'])

        # Enrich with user data - SYNCHRONOUS call (slow!)
        user_response = dynamodb.get_item(
            TableName='users',
            Key={'user_id': {'S': event_data['user_id']}}
        )

        # Process data
        processed = {
            'event': event_data,
            'user': user_response.get('Item', {})
        }

        # Write to S3 - one file per event (expensive!)
        s3.put_object(
            Bucket='analytics-raw',
            Key=f"events/{event_data['event_id']}.json",
            Body=json.dumps(processed)
        )

    return {'statusCode': 200}

What went wrong:

  1. Client initialization inside handler: clients were rebuilt on every invocation, adding noticeable overhead each time
  2. One S3 PUT per event: every event becomes a PUT request, ballooning PUT costs
  3. Synchronous DynamoDB calls: one sequential read per event dragged out execution time
  4. No batch processing: each Lambda invocation handled a single event
  5. No error handling: failed events were lost forever
  6. Memory over-provisioned: 1024MB allocated when 256MB was sufficient

Where the costs piled up:

  • Lambda execution dominated the bill (long durations × huge invocation count)
  • S3 PUT requests added up fast (one PUT per event)
  • DynamoDB reads were costly (one read per event, with consistent reads)
  • Data transfer added a smaller but real chunk

The Optimized Implementation (Major Cost Reduction)

After several weeks of optimization, here’s the production version:

import boto3
import json
from typing import List, Dict
import os

# Initialize clients OUTSIDE handler (reused across invocations)
s3 = boto3.client('s3')
dynamodb = boto3.resource('dynamodb')
users_table = dynamodb.Table('users')

# Environment variables
BUCKET_NAME = os.environ['ANALYTICS_BUCKET']
BATCH_SIZE = 100  # Flush threshold; with an SQS batch of 10 the final flush usually handles everything

def lambda_handler(event, context):
    """
    Process SQS events in batches
    Memory: 256MB (reduced from 1024MB)
    Timeout: 60s
    Batch size: 10 messages (configured in SQS trigger)
    """
    events_buffer = []
    failed_items = []

    for record in event['Records']:
        try:
            event_data = json.loads(record['body'])

            # Cached DynamoDB lookup (in-memory cache persists across warm invocations)
            user_data = get_user_cached(event_data['user_id'])

            events_buffer.append({
                'event': event_data,
                'user': user_data,
                'request_id': context.aws_request_id
            })

            # Flush buffer when full
            if len(events_buffer) >= BATCH_SIZE:
                write_batch_to_s3(events_buffer)
                events_buffer = []

        except Exception as e:
            # Report this message as a partial batch failure so SQS retries it
            # (and eventually moves it to the DLQ); entries should contain only
            # 'itemIdentifier', so log the error separately
            print(f"Failed to process {record['messageId']}: {e}")
            failed_items.append({'itemIdentifier': record['messageId']})

    # Flush remaining events
    if events_buffer:
        write_batch_to_s3(events_buffer)

    # Return partial batch failures
    return {
        'batchItemFailures': failed_items
    }

# In-memory cache (persists across warm invocations)
user_cache = {}

def get_user_cached(user_id: str) -> Dict:
    """Get user with Lambda execution context caching"""
    if user_id in user_cache:
        return user_cache[user_id]

    # Eventually consistent read - half the read cost of a strongly consistent read
    response = users_table.get_item(
        Key={'user_id': user_id},
        ConsistentRead=False
    )

    user = response.get('Item', {})
    user_cache[user_id] = user  # Cache for warm invocations
    return user

def write_batch_to_s3(events: List[Dict]):
    """Write a batch of events as a single S3 object instead of one PUT each"""
    timestamp = events[0]['event']['timestamp']
    date = timestamp[:10]  # YYYY-MM-DD

    # Name the object after the first event in the batch - the Lambda
    # context object is not in scope inside this helper
    s3.put_object(
        Bucket=BUCKET_NAME,
        Key=f"events/date={date}/batch-{events[0]['event']['event_id']}.json",
        Body='\n'.join(json.dumps(e) for e in events),
        ContentType='application/json'
    )

Key optimizations:

  1. Client initialization outside handler: clients are created once per execution environment and reused across warm invocations
  2. Batch S3 writes: Many events per PUT (a large reduction in PUT requests)
  3. In-memory caching: Strong cache hit rate on users
  4. Eventual consistency for DynamoDB: meaningful cost reduction
  5. Reduced memory: 256MB (sufficient for workload)
  6. Partial batch failure handling: Failed events automatically retry
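Two details make partial batch failures work in practice: the response must match the exact shape Lambda expects, and the SQS event source mapping must opt in with ReportBatchItemFailures (without it, any failure retries the whole batch). A minimal sketch; the helper name is mine and the mapping update is shown as a comment since the UUID comes from your own deployment:

```python
def batch_failure_response(failed_message_ids):
    """Build the partial-batch response Lambda expects: each entry
    carries only 'itemIdentifier' (the SQS messageId to retry)."""
    return {
        'batchItemFailures': [
            {'itemIdentifier': mid} for mid in failed_message_ids
        ]
    }

# Opting the event source mapping in (sketch; requires boto3 and your mapping UUID):
#   boto3.client('lambda').update_event_source_mapping(
#       UUID='<your-mapping-uuid>',
#       FunctionResponseTypes=['ReportBatchItemFailures'],
#   )

response = batch_failure_response(['msg-1', 'msg-2'])
```

With the mapping opted in, only the listed messages return to the queue; everything else in the batch is deleted as successfully processed.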

Where costs dropped:

  • Lambda execution dropped sharply once average duration came down
  • S3 PUT requests fell dramatically thanks to batching
  • DynamoDB reads got cheaper with eventual consistency and caching
  • A small DLQ storage line item appeared, but the overall bill saw a large reduction

Production Lessons from Running This at Scale

Lesson 1: Cold Starts Matter

Initial cold starts were painfully slow with client initialization inside the handler.

Optimizations that worked:

  • Initialize Boto3 clients outside handler: noticeably faster cold starts
  • Use Lambda layers for dependencies: further cold start improvement
  • Minimize deployment package: another nudge faster
  • Provisioned concurrency for critical paths: eliminated cold starts entirely

End result: substantially faster cold starts.
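Provisioned concurrency is configured per published version or alias, not on $LATEST. A sketch of the parameters for boto3's put_provisioned_concurrency_config; the function name, alias, and count here are hypothetical placeholders:

```python
# Parameters for boto3.client('lambda').put_provisioned_concurrency_config
# (function name 'analytics-processor' and alias 'live' are hypothetical)
provisioned_config = {
    'FunctionName': 'analytics-processor',
    'Qualifier': 'live',                    # alias or version to keep warm
    'ProvisionedConcurrentExecutions': 10,  # pre-initialized environments
}

# Applying it (commented out - requires AWS credentials and a real alias):
# boto3.client('lambda').put_provisioned_concurrency_config(**provisioned_config)
```

Note that provisioned concurrency bills for the warm capacity whether or not it is used, so reserve it for latency-critical paths only.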

Lesson 2: Concurrent Execution Limits Will Hit You

We hit AWS account concurrency limits during a traffic spike. Our queue backed up significantly.

The fix:

  • Requested a concurrency limit increase
  • Implemented exponential backoff in producers
  • Added CloudWatch alarms for queue depth thresholds
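The producer-side backoff can be sketched as a full-jitter delay schedule; the SQS send itself is left as a comment since the queue URL and throttling-error matching are deployment-specific:

```python
import random

def backoff_delays(max_retries=5, base=0.5, cap=30.0):
    """Exponential backoff with full jitter: before retry N, sleep a
    random amount between 0 and min(cap, base * 2**N) seconds."""
    for attempt in range(max_retries):
        yield random.uniform(0, min(cap, base * 2 ** attempt))

# Producer sketch (hypothetical queue URL; catch botocore's ClientError
# and check for a throttling error code before sleeping):
# for delay in backoff_delays():
#     try:
#         sqs.send_message(QueueUrl=queue_url, MessageBody=body)
#         break
#     except ClientError:
#         time.sleep(delay)

delays = list(backoff_delays(max_retries=4, base=1.0, cap=4.0))
```

Full jitter (random in [0, cap]) spreads retries out better than fixed exponential delays, which otherwise re-synchronize producers into thundering herds.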

Lesson 3: DLQ Configuration is Not Optional

Early on, we lost events due to unhandled errors before implementing DLQ.

Proper error handling:

  • Configure SQS DLQ with multi-day retention
  • Set a sensible maxReceiveCount (retry failed messages a few times)
  • Monitor DLQ depth daily
  • Weekly review of DLQ messages to identify systematic issues
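The DLQ is wired up through the source queue's redrive policy. A sketch of the attribute payload for SQS SetQueueAttributes; the helper name, queue URL, and ARN are hypothetical:

```python
import json

def redrive_policy(dlq_arn, max_receive_count=3):
    """SQS RedrivePolicy attribute: after max_receive_count failed
    receives, the message moves to the DLQ instead of retrying forever."""
    return {
        'RedrivePolicy': json.dumps({
            'deadLetterTargetArn': dlq_arn,
            'maxReceiveCount': str(max_receive_count),
        })
    }

# Applying it (commented out; URL and ARN are placeholders):
# sqs.set_queue_attributes(
#     QueueUrl='https://sqs.us-east-1.amazonaws.com/111122223333/events',
#     Attributes=redrive_policy('arn:aws:sqs:us-east-1:111122223333:events-dlq'),
# )

attrs = redrive_policy('arn:aws:sqs:us-east-1:111122223333:events-dlq', 5)
```

Pair this with a multi-day MessageRetentionPeriod on the DLQ itself so failed messages survive until the weekly review.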

Lesson 4: Memory vs Duration is a Trade-off

After testing several memory configurations, doubling memory from 128MB to 256MB roughly halved duration at similar cost — so 256MB ended up being the sweet spot for this workload. Going higher cost more without proportional speedup.
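The trade-off follows from Lambda's pricing model: compute cost is billed in GB-seconds (memory × duration), so when doubling memory halves duration, the bill stays flat while wall-clock time improves. A sketch with hypothetical durations and invocation counts; the per-GB-second rate shown is the published x86 price at the time of writing, so check current pricing:

```python
def lambda_compute_cost(memory_mb, duration_s, invocations,
                        price_per_gb_s=0.0000166667):
    """Lambda compute cost = GB-seconds * unit price (per-request
    charges and free tier omitted for simplicity)."""
    gb_seconds = (memory_mb / 1024) * duration_s * invocations
    return gb_seconds * price_per_gb_s

# Hypothetical scenario: doubling memory roughly halves duration,
# so the GB-seconds (and the bill) come out identical
cost_128 = lambda_compute_cost(128, 2.0, 1_000_000)
cost_256 = lambda_compute_cost(256, 1.0, 1_000_000)
```

Past the point where duration stops scaling down with memory, extra megabytes only raise the GB-seconds term, which is why 512MB and above cost more here without a proportional speedup.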

Lesson 5: Boto3 Retries Need Configuration

Default Boto3 retry config caused timeout issues during AWS service hiccups.

Custom retry configuration:

from botocore.config import Config

retry_config = Config(
    retries={
        'max_attempts': 3,
        'mode': 'adaptive'  # Retries with backoff plus client-side rate limiting
    },
    connect_timeout=5,
    read_timeout=10
)

s3 = boto3.client('s3', config=retry_config)

This dramatically reduced the rate of timeout errors across invocations.

Cost Optimization Checklist

From expensive mistakes:

  • Initialize Boto3 clients outside handler function
  • Batch operations (S3 PUTs, DynamoDB batch operations)
  • Use eventual consistency for DynamoDB when possible
  • Right-size memory allocation (test different configs)
  • Implement caching for frequently accessed data
  • Configure proper timeouts to avoid runaway executions
  • Use reserved capacity for predictable workloads (notable discount)
  • Enable compression for S3 objects
  • Clean up old DLQ messages
  • Monitor costs daily in first month
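The compression item needs nothing beyond the standard library. A sketch of gzipping a batch of newline-delimited JSON events before the S3 PUT; the upload call is commented out and the bucket/key are hypothetical:

```python
import gzip
import json

def compress_events(events):
    """Gzip a batch of newline-delimited JSON events; upload with
    ContentEncoding='gzip' so consumers know to decompress."""
    payload = '\n'.join(json.dumps(e) for e in events).encode('utf-8')
    return gzip.compress(payload)

# Upload sketch (bucket and key are placeholders):
# s3.put_object(Bucket='analytics-raw', Key='events/batch.json.gz',
#               Body=compress_events(batch), ContentEncoding='gzip')

events = [{'event_id': str(i), 'value': 'x' * 100} for i in range(100)]
raw_size = len('\n'.join(json.dumps(e) for e in events).encode('utf-8'))
compressed = compress_events(events)
```

Repetitive JSON compresses well, which shrinks both S3 storage and data transfer; PUT request count is unchanged, so batching still matters.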

When NOT to Use Lambda + Boto3

After building many serverless pipelines, I've learned Lambda + Boto3 is NOT appropriate for:

  1. Long-running tasks (>15 minutes) - Use Fargate or EC2
  2. Large file processing (>10GB) - Lambda has 10GB storage limit
  3. Consistent sub-10ms latency requirements - Cold starts are unpredictable
  4. High-frequency, steady-state workloads - EC2 is cheaper
  5. Complex dependencies - Deployment packages >250MB don’t work well

Lambda + Boto3 excels at:

  • Event-driven architectures
  • Intermittent workloads
  • Rapid scaling requirements (idle to many concurrent invocations in seconds)
  • Variable traffic patterns

Question

What's been your biggest challenge with serverless data pipelines? Cold starts? Cost optimization? Error handling?
