AWS S3

Pre-requisite: [[Data Engineering Fundamentals]] — covers Data Types, Data Lake vs Warehouse, ETL, Data Formats


What is S3?

  • Simple Storage Service — object storage service by AWS
  • Infinitely scalable, highly durable (99.999999999% — 11 nines)
  • Used as a Data Lake, backup store, static website hosting, ML data store

S3 Buckets & Objects

  • Stores data as objects (files) inside buckets (directories) — not a file system, not block storage
  • Each bucket has a globally unique name across all regions and accounts
    • From March 2026: AWS introduced Account Regional Namespaces — allows creating buckets within your own reserved namespace
  • Buckets are defined at the region level — S3 looks global but buckets are regional

Naming convention:

  • Lowercase letters, numbers, hyphens (-) and dots (.) only: no uppercase, no underscores
  • 3–63 characters long
  • Must start and end with a lowercase letter or number
  • Must NOT start with xn-- or end with -s3alias
  • Cannot be formatted as an IP address
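
The rules above can be checked mechanically. The following is a minimal sketch that covers only the rules listed here; `is_valid_bucket_name` is a hypothetical helper, not an AWS API, and AWS reserves a few further prefixes and suffixes that this check does not know about.

```python
import ipaddress
import re

# First and last character: lowercase letter or digit.
# Middle characters: lowercase letters, digits, dots, hyphens.
# 1 + (1..61) + 1 characters gives the 3-63 length limit.
_NAME_PATTERN = re.compile(r"[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]")


def is_valid_bucket_name(name: str) -> bool:
    """Return True if `name` passes the naming rules listed in these notes."""
    if not _NAME_PATTERN.fullmatch(name):
        return False
    if name.startswith("xn--") or name.endswith("-s3alias"):
        return False
    try:
        ipaddress.IPv4Address(name)  # e.g. "192.168.5.4" is rejected
    except ValueError:
        return True
    return False


if __name__ == "__main__":
    for candidate in ["my-data-lake-2024", "My_Bucket", "192.168.5.4"]:
        print(candidate, is_valid_bucket_name(candidate))
```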

Objects

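  • Each object has a Key: the full path inside the bucket (the bucket name is not part of it), e.g. folder/file.txt in s3://my-bucket/folder/file.txt
  • Key = Prefix + Object name (here: folder/ + file.txt)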
  • Max object size: 5 TB (use multipart upload for > 5 GB)
  • Objects can have: Metadata (key-value pairs), Tags (up to 10), Version ID
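
To make the bucket / key / prefix split concrete, here is a small sketch that takes an `s3://` URI apart. `split_s3_uri` is a hypothetical helper written for these notes; it assumes the key contains no `?` or `#` characters, which `urlparse` would treat as query or fragment delimiters.

```python
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> tuple[str, str, str, str]:
    """Split an s3:// URI into (bucket, key, prefix, object_name)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an s3:// URI: {uri!r}")
    bucket = parsed.netloc
    key = parsed.path.lstrip("/")           # the key never includes the bucket
    prefix, _, name = key.rpartition("/")   # everything up to the last "/"
    prefix = prefix + "/" if prefix else ""
    return bucket, key, prefix, name


if __name__ == "__main__":
    print(split_s3_uri("s3://my-bucket/folder/file.txt"))
```

The "folder" is only a naming convention: S3 stores a flat set of keys, and the prefix is whatever precedes the last `/`.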

Storage Classes

graph LR
    Hot[Frequent Access] --> Standard[S3 Standard<br/>Highest cost<br/>Low latency]

    Warm[Infrequent Access] --> IA[S3 Standard-IA<br/>Retrieval fee<br/>Multi-AZ]
    Warm --> OneZone[S3 One Zone-IA<br/>Retrieval fee<br/>Single AZ cheaper]
    Warm --> IT[S3 Intelligent-Tiering<br/>Auto moves tiers<br/>Monitoring fee]

    Cold[Archive] --> GI[Glacier Instant<br/>Ms retrieval]
    Cold --> GF[Glacier Flexible<br/>Mins-hrs retrieval]
    Cold --> GDA[Glacier Deep Archive<br/>Cheapest<br/>12hr retrieval]

    style Hot fill:#fef9c3,stroke:#ca8a04
    style Warm fill:#fed7aa,stroke:#ea580c
    style Cold fill:#dbeafe,stroke:#3b82f6
    style Standard fill:#fef9c3,stroke:#ca8a04
    style IA fill:#fed7aa,stroke:#ea580c
    style OneZone fill:#fed7aa,stroke:#ea580c
    style IT fill:#fed7aa,stroke:#ea580c
    style GI fill:#dbeafe,stroke:#3b82f6
    style GF fill:#dbeafe,stroke:#3b82f6
    style GDA fill:#dbeafe,stroke:#3b82f6

| Class | Examples | Retrieval | Availability | Durability |
|---|---|---|---|---|
| Standard | Website images, ML training data, active app data | Immediate | 99.99% | 11 9s, Multi-AZ |
| Intelligent-Tiering | User uploads with unknown access pattern, mixed workloads | Immediate | 99.9% | 11 9s, Multi-AZ |
| Standard-IA | DR backups, older logs still needed on demand | Immediate + fee | 99.9% | 11 9s, Multi-AZ |
| One Zone-IA | Thumbnails (reproducible), secondary backups | Immediate + fee | 99.5% | 11 9s, Single AZ |
| Glacier Instant | Medical images, news archives accessed occasionally | Milliseconds | 99.9% | 11 9s, Multi-AZ |
| Glacier Flexible | Yearly audit logs, raw video footage archive | Minutes–Hours | 99.99% | 11 9s, Multi-AZ |
| Glacier Deep Archive | Financial records (7yr compliance), HIPAA healthcare data | 12 hrs | 99.99% | 11 9s, Multi-AZ |

Lifecycle Policies

Automatically transition or expire objects based on age.

graph LR
    Upload([Object Uploaded]) -->|Day 0| Standard[S3 Standard]
    Standard -->|Day 30| IA[Standard-IA]
    IA -->|Day 90| Glacier[Glacier Flexible]
    Glacier -->|Day 365| Delete([Expired / Deleted])

    style Upload fill:#dcfce7,stroke:#16a34a
    style Standard fill:#fef9c3,stroke:#ca8a04
    style IA fill:#fed7aa,stroke:#ea580c
    style Glacier fill:#dbeafe,stroke:#3b82f6
    style Delete fill:#fce7f3,stroke:#db2777
  • Transition rules — move objects to a cheaper storage class after N days
  • Expiration rules — delete objects after N days
  • Can target: current versions, non-current versions (versioning), or incomplete multipart uploads
  • Rules can be scoped by prefix or object tags
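
The Day 30 / Day 90 / Day 365 flow in the diagram could be expressed as a lifecycle configuration roughly like this. It is a sketch, not a recommended policy: the rule ID, the `logs/` prefix, the bucket name, and the 7-day multipart cleanup are assumptions made for illustration. The dictionary follows the shape that boto3's `put_bucket_lifecycle_configuration` accepts, where `GLACIER` is the storage-class name for Glacier Flexible.

```python
def lifecycle_rule(rule_id, prefix, transitions, expire_after_days):
    """Build one lifecycle rule.

    transitions: list of (days, storage_class) pairs, e.g. [(30, "STANDARD_IA")].
    """
    days = [d for d, _ in transitions]
    if days != sorted(days) or (days and expire_after_days <= days[-1]):
        raise ValueError("transitions must be ascending and come before expiration")
    return {
        "ID": rule_id,
        "Status": "Enabled",
        "Filter": {"Prefix": prefix},  # scope the rule to one prefix
        "Transitions": [{"Days": d, "StorageClass": sc} for d, sc in transitions],
        "Expiration": {"Days": expire_after_days},
        # Assumed housekeeping value: drop unfinished multipart uploads after 7 days.
        "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
    }


config = {
    "Rules": [
        lifecycle_rule(
            "logs-archive",  # hypothetical rule ID
            "logs/",         # hypothetical prefix
            [(30, "STANDARD_IA"), (90, "GLACIER")],
            365,
        )
    ]
}

# With boto3 installed and credentials configured, this could be applied with:
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="my-bucket", LifecycleConfiguration=config)
```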

Versioning

  • Keeps multiple versions of the same object in a bucket — enabled at bucket level
  • Protects against accidental delete or overwrite
  • Once enabled, cannot be fully disabled (only suspended)
  • Delete adds a delete marker — older versions still exist and can be restored
  • MFA Delete — requires MFA to permanently delete a version (extra protection)
  • Files uploaded before versioning was enabled get version null
bucket/
  photo.jpg  (v3 — current)
  photo.jpg  (v2)
  photo.jpg  (v1)
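
The delete-marker behaviour is easier to see in a toy model. The class below is an in-memory simulation written for these notes, not an S3 client: the `VersionedBucket` class and its `v1`, `v2`, ... IDs are invented for illustration (real version IDs are opaque strings).

```python
import itertools


class VersionedBucket:
    """Toy in-memory model of a versioning-enabled bucket."""

    def __init__(self):
        self._versions = {}           # key -> [(version_id, body or None)], newest last
        self._ids = itertools.count(1)

    def put(self, key, body):
        version_id = f"v{next(self._ids)}"
        self._versions.setdefault(key, []).append((version_id, body))
        return version_id

    def delete(self, key, version_id=None):
        versions = self._versions.setdefault(key, [])
        if version_id is None:
            # Plain delete: add a delete marker, keep every older version.
            marker_id = f"v{next(self._ids)}"
            versions.append((marker_id, None))
            return marker_id
        # Delete by version ID: permanently remove that one version (or marker).
        self._versions[key] = [(v, b) for v, b in versions if v != version_id]
        return version_id

    def get(self, key, version_id=None):
        versions = self._versions.get(key, [])
        if version_id is None:
            if not versions or versions[-1][1] is None:
                raise KeyError(key)   # latest entry is a delete marker
            return versions[-1][1]
        for vid, body in versions:
            if vid == version_id and body is not None:
                return body
        raise KeyError((key, version_id))
```

Removing the delete marker by its version ID makes the previous version current again, which is how an accidental delete is undone.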

Replication

  • Requires versioning enabled on both source and destination buckets
  • Copying is asynchronous — only new objects are replicated after rule is created
  • To replicate existing objects: use S3 Batch Replication
  • No chaining — replication from A→B does not auto-replicate to C

Delete behaviour:

  • Replicating delete markers is optional
  • Permanent deletes (by version ID) are never replicated

| | CRR | SRR |
|---|---|---|
| Full name | Cross-Region Replication | Same-Region Replication |
| Use case | DR, lower latency for global users | Log aggregation, compliance, test/prod sync |
| Cost | Higher (cross-region transfer) | Lower |
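
A replication rule, including the optional delete-marker setting, might be described as follows. This is a sketch of the configuration shape used by boto3's `put_bucket_replication`; the role ARN, bucket names, and rule ID are placeholders, and both buckets are assumed to already have versioning enabled.

```python
def replication_config(role_arn, destination_bucket, replicate_delete_markers=False):
    """Build a single-rule replication configuration."""
    return {
        "Role": role_arn,  # IAM role that S3 assumes to copy objects
        "Rules": [
            {
                "ID": "replicate-everything",  # hypothetical rule ID
                "Priority": 1,
                "Status": "Enabled",
                "Filter": {"Prefix": ""},      # empty prefix = whole bucket
                "DeleteMarkerReplication": {
                    "Status": "Enabled" if replicate_delete_markers else "Disabled"
                },
                "Destination": {"Bucket": f"arn:aws:s3:::{destination_bucket}"},
            }
        ],
    }


config = replication_config(
    "arn:aws:iam::123456789012:role/replication-role",  # placeholder ARN
    "my-replica-bucket",                                  # placeholder bucket
)

# With boto3 installed and credentials configured, this could be applied with:
#   boto3.client("s3").put_bucket_replication(
#       Bucket="my-source-bucket", ReplicationConfiguration=config)
```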

Encryption

| Type | Key Owner | Notes |
|---|---|---|
| SSE-S3 | AWS | Default, AES-256, no extra cost |
| SSE-KMS | You (AWS KMS) | CloudTrail audit trail, extra cost, quota limits apply |
| SSE-C | You (per request) | AWS never stores your key, HTTPS required |
| Client-Side | You | Encrypt before upload; AWS sees only ciphertext |

SSE-S3

  • Keys managed entirely by AWS — transparent to the user
  • AES-256 encryption, server-side
  • Header: "x-amz-server-side-encryption": "AES256"
  • Enabled by default for all new buckets and objects

SSE-KMS

  • Keys managed via AWS KMS — you control key rotation, access, and get CloudTrail audit logs
  • Header: "x-amz-server-side-encryption": "aws:kms"
  • Quota limits: upload calls GenerateDataKey, download calls Decrypt — both count against KMS quota (5,500–30,000 req/s per region)
  • Quota can be increased via Service Quotas Console

SSE-C

  • You supply the key with every request — AWS uses it to encrypt/decrypt but never stores it
  • HTTPS required — key is sent in HTTP headers
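
The three server-side options differ mainly in the request headers sent with an upload. The sketch below only builds header dictionaries and does not sign or send a request; the function names are hypothetical, and the KMS key ID in the usage line is a placeholder. For SSE-C, S3 expects the key base64-encoded plus a base64-encoded MD5 digest of the key as an integrity check.

```python
import base64
import hashlib
import os


def sse_s3_headers() -> dict:
    return {"x-amz-server-side-encryption": "AES256"}


def sse_kms_headers(kms_key_id: str) -> dict:
    return {
        "x-amz-server-side-encryption": "aws:kms",
        "x-amz-server-side-encryption-aws-kms-key-id": kms_key_id,
    }


def sse_c_headers(key: bytes) -> dict:
    """Headers for a customer-provided 256-bit key (send over HTTPS only)."""
    if len(key) != 32:
        raise ValueError("SSE-C needs a 32-byte (256-bit) key")
    return {
        "x-amz-server-side-encryption-customer-algorithm": "AES256",
        "x-amz-server-side-encryption-customer-key": base64.b64encode(key).decode("ascii"),
        "x-amz-server-side-encryption-customer-key-MD5": base64.b64encode(
            hashlib.md5(key).digest()
        ).decode("ascii"),
    }


if __name__ == "__main__":
    print(sse_kms_headers("alias/my-key"))       # placeholder key alias
    print(sorted(sse_c_headers(os.urandom(32))))  # header names only
```

With SSE-C the same key headers have to accompany every later GET of the object, since S3 keeps no copy of the key.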

Client-Side Encryption

  • Encrypt data client-side before uploading — AWS sees only ciphertext
  • You manage keys and encryption logic entirely

Encryption in Transit

  • S3 supports both HTTP (unencrypted) and HTTPS (TLS/SSL)
  • HTTPS is mandatory for SSE-C, recommended for all
  • To enforce HTTPS: add a bucket policy statement with Effect Deny and the condition aws:SecureTransport = false, so any plain-HTTP request is rejected
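
An HTTPS-only bucket policy can be generated like this. It is a sketch with a placeholder bucket name; the statement denies every S3 action on the bucket and its objects when the request did not arrive over TLS.

```python
import json


def https_only_policy(bucket: str) -> str:
    """Return a bucket policy (as JSON) that denies all non-HTTPS requests."""
    bucket_arn = f"arn:aws:s3:::{bucket}"
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [bucket_arn, f"{bucket_arn}/*"],
                # aws:SecureTransport is "false" for plain-HTTP requests
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            }
        ],
    }
    return json.dumps(policy, indent=2)


if __name__ == "__main__":
    print(https_only_policy("my-bucket"))  # placeholder bucket name
```

An explicit Deny overrides any Allow elsewhere, which is why the rule is written as a Deny on insecure requests rather than an Allow on secure ones.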

Access Control

graph TD
    User([User / Service]) --> BP[Bucket Policy]
    User --> IAM[IAM Policy]
    User --> ACL[ACL]
    BP --> Bucket[(S3 Bucket)]
    IAM --> Bucket
    ACL --> Bucket
    Bucket --> BPA[Block Public Access<br/>Override — always wins]

    style User fill:#dbeafe,stroke:#3b82f6
    style BP fill:#fef9c3,stroke:#ca8a04
    style IAM fill:#dcfce7,stroke:#16a34a
    style ACL fill:#fce7f3,stroke:#db2777
    style Bucket fill:#ccfbf1,stroke:#0d9488
    style BPA fill:#ffc9c9,stroke:#dc2626

Bucket Policy

  • JSON policy attached directly to the bucket — controls users, roles, other accounts, services
  • Use for: cross-account access, making bucket public, enforcing upload encryption
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": "*",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::my-bucket/*"
    }
  ]
}

IAM Policy

  • Attached to IAM users, groups, or roles — not to the bucket itself
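  • Same account: access is granted if either the IAM policy or the bucket policy allows it and nothing explicitly denies it
  • Cross-account: both the caller's IAM policy and the bucket policy must allow the action; an explicit Deny in any policy always wins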

ACLs (Access Control Lists)

  • Legacy mechanism: basic read/write grants per AWS account or predefined group
  • Disabled by default on new buckets (Object Ownership: bucket owner enforced); AWS recommends bucket policies and IAM policies instead

Block Public Access

  • Default: ON for all new buckets — overrides bucket policies and ACLs
  • Turn off only when intentionally hosting public content (e.g. static website)

| Setting | What it blocks |
|---|---|
| BlockPublicAcls | Adding new public ACLs |
| IgnorePublicAcls | Existing public ACLs are ignored |
| BlockPublicPolicy | New bucket policies granting public access |
| RestrictPublicBuckets | Public and cross-account access to buckets that have a public policy |

Pre-signed URLs

  • Temporary URL granting time-limited access to a private object
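  • Generated using your credentials: the URL carries the signer's permissions, but only for the one signed operation (e.g. GET or PUT) on that object
  • Default expiry: 1 hour | Max: 7 days (SDK-generated, SigV4); a URL signed with temporary credentials stops working when those credentials expire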
  • Use case: share a private download link, allow a one-time upload
https://my-bucket.s3.amazonaws.com/photo.jpg
  ?X-Amz-Expires=3600   ← expires in 1 hour
  &X-Amz-Signature=...
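
A SigV4 pre-signed URL also carries an `X-Amz-Date` parameter (the signing time, in UTC), so the expiry instant can be read straight from the URL. The helper below is a hypothetical sketch; the sample URL is made up and its signature is a dummy value.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import parse_qs, urlparse


def presigned_url_expiry(url: str) -> datetime:
    """Return the UTC time at which a SigV4 pre-signed URL stops working."""
    query = parse_qs(urlparse(url).query)
    signed_at = datetime.strptime(query["X-Amz-Date"][0], "%Y%m%dT%H%M%SZ")
    signed_at = signed_at.replace(tzinfo=timezone.utc)
    return signed_at + timedelta(seconds=int(query["X-Amz-Expires"][0]))


if __name__ == "__main__":
    sample = (
        "https://my-bucket.s3.amazonaws.com/photo.jpg"
        "?X-Amz-Date=20240101T000000Z&X-Amz-Expires=3600&X-Amz-Signature=dummy"
    )
    print(presigned_url_expiry(sample).isoformat())

# Generating a real URL needs credentials; with boto3 that is typically:
#   boto3.client("s3").generate_presigned_url(
#       "get_object", Params={"Bucket": "my-bucket", "Key": "photo.jpg"}, ExpiresIn=3600)
```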

Which to use?

| Scenario | Use |
|---|---|
| Grant another AWS account access | Bucket Policy |
| Give an EC2 instance access to S3 | IAM Role + IAM Policy |
| Temporary access to a private file | Pre-signed URL |
| Make entire bucket public | Bucket Policy + Disable Block Public Access |
| Lock down everything | Block Public Access |

Performance & Optimization

  • Automatically scales to high request rates, with typical first-byte latency of 100–200 ms
  • Per-prefix throughput: at least 3,500 PUT/COPY/POST/DELETE or 5,500 GET/HEAD requests/sec
  • No limit on number of prefixes — spread objects across prefixes to scale reads/writes

Multipart Upload

  • Recommended for > 100 MB, required for > 5 GB
  • Splits object into parts, uploads in parallel — faster, resumable on failure
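
How an object is cut into parts can be sketched with plain arithmetic. The helper below is hypothetical and only plans the parts; it does not upload anything. It assumes the S3 limits of a 5 MiB minimum part size (the last part may be smaller) and at most 10,000 parts per upload, and the 100 MiB default part size is an arbitrary choice.

```python
MIN_PART_SIZE = 5 * 1024**2   # 5 MiB; the last part may be smaller
MAX_PARTS = 10_000


def plan_parts(object_size: int, part_size: int = 100 * 1024**2):
    """Return a list of (part_number, offset, length) tuples."""
    if part_size < MIN_PART_SIZE:
        raise ValueError("part size must be at least 5 MiB")
    count = -(-object_size // part_size)  # ceiling division
    if count > MAX_PARTS:
        raise ValueError("too many parts; choose a larger part size")
    return [
        (n + 1, n * part_size, min(part_size, object_size - n * part_size))
        for n in range(count)
    ]


if __name__ == "__main__":
    for part in plan_parts(250 * 1024**2):
        print(part)
```

Each tuple can be uploaded independently and in parallel; if one part fails, only that part needs to be retried.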

Transfer Acceleration

  • Routes upload through nearest CloudFront edge location → forwards to S3 bucket region via AWS backbone
  • Best for: large files uploaded by globally distributed users
  • Compatible with multipart upload

Byte Range Fetches

  • Download specific byte ranges of an object in parallel
  • Use to: speed up large downloads, retrieve partial data (e.g. just the file header), improve resilience
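
A byte-range fetch is an ordinary GET with an HTTP `Range` header. The sketch below only computes the header values for a parallel download; `byte_range_headers` is a hypothetical helper. The end offset in a `Range` header is inclusive.

```python
def byte_range_headers(object_size: int, chunk_size: int) -> list[str]:
    """Return one Range header value per chunk, covering the whole object."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        f"bytes={start}-{min(start + chunk_size, object_size) - 1}"
        for start in range(0, object_size, chunk_size)
    ]


if __name__ == "__main__":
    print(byte_range_headers(250, 100))
    # Fetching just a file header is a single range, e.g. the first 1 KiB:
    print(byte_range_headers(1024, 1024)[0])
```

Each value would be sent as `Range: <value>` on a separate GET request, and the responses concatenated in order.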

Event Notifications

Trigger actions automatically when objects are created, deleted, or restored.

graph LR
    S3[(S3 Bucket)] -->|Object events| Direct[Direct Destinations]
    S3 -->|All events| EB[Amazon EventBridge]

    Direct --> Lambda[AWS Lambda]
    Direct --> SQS[SQS Queue]
    Direct --> SNS[SNS Topic]

    EB -->|Rules + filters| Step[Step Functions]
    EB -->|Rules + filters| Kinesis[Kinesis Streams]
    EB -->|Rules + filters| Lambda2[Lambda / 18+ services]

    style S3 fill:#c3fae8,stroke:#0d9488
    style EB fill:#fef9c3,stroke:#ca8a04
    style Direct fill:#dbeafe,stroke:#3b82f6
    style Lambda fill:#fed7aa,stroke:#ea580c
    style SQS fill:#fed7aa,stroke:#ea580c
    style SNS fill:#fed7aa,stroke:#ea580c
    style Step fill:#f3e8ff,stroke:#9333ea
    style Kinesis fill:#f3e8ff,stroke:#9333ea
    style Lambda2 fill:#f3e8ff,stroke:#9333ea

| | Direct (Lambda / SQS / SNS) | Via EventBridge |
|---|---|---|
| Setup | Simple | Requires EventBridge rule |
| Filtering | Basic (prefix/suffix) | Advanced JSON rules |
| Destinations | 3 options | 18+ AWS services |
| Extra features | - | Archive, replay, reliable delivery |
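
On the direct-destination path, a Lambda function receives a `Records` list describing each object event. The handler below is a minimal sketch that only extracts the event name, bucket, and key; what happens next is left out. Object keys arrive URL-encoded in the notification, hence `unquote_plus`. The sample event is trimmed to the fields the handler reads.

```python
from urllib.parse import unquote_plus


def handler(event, context=None):
    """Extract (event name, bucket, key) from an S3 event notification."""
    results = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        results.append(
            {
                "event": record["eventName"],            # e.g. "ObjectCreated:Put"
                "bucket": s3["bucket"]["name"],
                "key": unquote_plus(s3["object"]["key"]),  # decode "+" and %XX
            }
        )
    return results


if __name__ == "__main__":
    sample_event = {
        "Records": [
            {
                "eventName": "ObjectCreated:Put",
                "s3": {
                    "bucket": {"name": "my-bucket"},
                    "object": {"key": "reports/2024/my+file%281%29.csv"},
                },
            }
        ]
    }
    print(handler(sample_event))
```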

S3 Select

  • Run SQL queries directly on S3 objects — no need to download and process the full file
  • Supported formats: CSV and JSON (optionally GZIP- or BZIP2-compressed) and Parquet (GZIP or Snappy columnar compression)
  • Returns only the filtered subset → reduces data transferred and processing cost
  • Example: SELECT * FROM S3Object WHERE age > 30
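
A query like the one above would be sent as a `SelectObjectContent` request. The sketch below only assembles the request arguments in the shape boto3's `select_object_content` takes; the bucket, key, and `age` column are assumptions, and the object is assumed to be a GZIP-compressed CSV with a header row. CSV fields are read as strings, so the comparison casts the column to an integer.

```python
def select_request(bucket: str, key: str, min_age: int) -> dict:
    """Build keyword arguments for an S3 Select call on a gzipped CSV."""
    if not isinstance(min_age, int):
        raise TypeError("min_age must be an int")  # keeps the SQL text safe
    return {
        "Bucket": bucket,
        "Key": key,
        "ExpressionType": "SQL",
        "Expression": f"SELECT * FROM S3Object s WHERE CAST(s.age AS INT) > {min_age}",
        "InputSerialization": {
            "CSV": {"FileHeaderInfo": "USE"},  # first row holds column names
            "CompressionType": "GZIP",
        },
        "OutputSerialization": {"JSON": {}},
    }


if __name__ == "__main__":
    print(select_request("my-bucket", "people.csv.gz", 30))  # placeholder names

# With boto3 installed and credentials configured, the call would look like:
#   boto3.client("s3").select_object_content(**select_request(...))
```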

S3 Access Points

  • Named entry points for a bucket — each with its own DNS name and access policy
  • Simplify security when multiple teams or apps need different access to the same bucket
graph LR
    TeamA[Team A] -->|Access Point A<br/>finance policy| Bucket[(S3 Bucket)]
    TeamB[Team B] -->|Access Point B<br/>analytics policy| Bucket
    TeamC[Team C] -->|Access Point C<br/>read-only policy| Bucket

    style Bucket fill:#c3fae8,stroke:#0d9488
    style TeamA fill:#dbeafe,stroke:#3b82f6
    style TeamB fill:#dcfce7,stroke:#16a34a
    style TeamC fill:#fef9c3,stroke:#ca8a04
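
Each access point has its own ARN and its own policy, which is what lets Team C be read-only while other teams get broader access. The sketch below builds a read-only access point policy; the region, account ID, access point name, and role ARN are placeholders invented for illustration.

```python
import json


def access_point_arn(region: str, account_id: str, name: str) -> str:
    return f"arn:aws:s3:{region}:{account_id}:accesspoint/{name}"


def read_only_policy(ap_arn: str, principal_arn: str, prefix: str = "") -> str:
    """Allow one principal to read objects under `prefix` via the access point."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": principal_arn},
                "Action": "s3:GetObject",
                # Objects are addressed as <access-point-arn>/object/<key>
                "Resource": f"{ap_arn}/object/{prefix}*",
            }
        ],
    }
    return json.dumps(policy, indent=2)


if __name__ == "__main__":
    arn = access_point_arn("us-east-1", "123456789012", "team-c-readonly")
    print(read_only_policy(arn, "arn:aws:iam::123456789012:role/team-c", "reports/"))
```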

VPC Origin

  • Restrict access point to VPC only — no public internet access
  • Requires a VPC Endpoint (Gateway or Interface type) pointing to the access point
  • VPC Endpoint Policy must allow access to both the access point and the underlying bucket
graph LR
    App[VPC Application] --> VPCe[VPC Endpoint<br/>Gateway or Interface]
    VPCe --> AP[S3 Access Point<br/>VPC Origin]
    AP --> Bucket[(S3 Bucket)]

    style App fill:#dbeafe,stroke:#3b82f6
    style VPCe fill:#fef9c3,stroke:#ca8a04
    style AP fill:#fed7aa,stroke:#ea580c
    style Bucket fill:#c3fae8,stroke:#0d9488

S3 Object Lambda

  • Transform an object on-the-fly before returning it to the caller — no need to store multiple versions
  • Only one S3 bucket needed — add an S3 Access Point + an S3 Object Lambda Access Point on top
graph LR
    App([Caller App]) --> OLAP[S3 Object Lambda<br/>Access Point]
    OLAP --> Lambda[Lambda Function<br/>transform on-the-fly]
    Lambda --> AP[S3 Access Point]
    AP --> Bucket[(S3 Bucket<br/>raw object)]

    style App fill:#dbeafe,stroke:#3b82f6
    style OLAP fill:#f3e8ff,stroke:#9333ea
    style Lambda fill:#fed7aa,stroke:#ea580c
    style AP fill:#fef9c3,stroke:#ca8a04
    style Bucket fill:#c3fae8,stroke:#0d9488

Use Cases:

  • Redact PII before serving data to analytics or non-prod environments
  • Convert formats on-the-fly (e.g. XML → JSON)
  • Resize or watermark images per requesting user
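
The PII-redaction use case could look roughly like the following. This is a sketch under stated assumptions: the e-mail regex is deliberately simple and stands in for real PII detection, the object is assumed to be UTF-8 text, and the handler needs boto3 plus a configured Object Lambda Access Point to run. The event fields (`inputS3Url`, `outputRoute`, `outputToken`) and `write_get_object_response` are the parts S3 Object Lambda provides.

```python
import re
from urllib.request import urlopen

# Simplified pattern for illustration; not a complete e-mail matcher.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_emails(text: str) -> str:
    """Replace anything that looks like an e-mail address."""
    return EMAIL.sub("[REDACTED]", text)


def handler(event, context):
    """Object Lambda entry point: fetch the raw object, redact, return it."""
    import boto3  # imported lazily so redact_emails works without boto3

    ctx = event["getObjectContext"]
    with urlopen(ctx["inputS3Url"]) as response:  # pre-signed URL to the raw object
        original = response.read().decode("utf-8")

    boto3.client("s3").write_get_object_response(
        Body=redact_emails(original).encode("utf-8"),
        RequestRoute=ctx["outputRoute"],
        RequestToken=ctx["outputToken"],
    )
    return {"statusCode": 200}


if __name__ == "__main__":
    print(redact_emails("Contact jane.doe@example.com now"))
```

The raw object stays unchanged in the bucket; only callers that go through the Object Lambda Access Point see the redacted copy.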