AWS S3
Pre-requisite: [[Data Engineering Fundamentals]] — covers Data Types, Data Lake vs Warehouse, ETL, Data Formats
What is S3?
- Simple Storage Service — object storage service by AWS
- Virtually unlimited capacity, designed for 99.999999999% (11 nines) durability
- Used as a Data Lake, backup store, static website hosting, ML data store
S3 Buckets & Objects
- Stores data as objects (files) inside buckets (directories) — not a file system, not block storage
- Each bucket has a globally unique name across all regions and accounts
- From March 2026: AWS introduced Account Regional Namespaces — allows creating buckets within your own reserved namespace
- Buckets are defined at the region level — S3 looks global but buckets are regional
Naming convention:
- No uppercase, no underscore
- 3–63 characters long
- Must start with a lowercase letter or number
- Must NOT start with `xn--` or end with `-s3alias`
- Cannot be formatted as an IP address
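The naming rules above can be checked before calling AWS. This is a minimal sketch, not an official validator: `is_valid_bucket_name` is a hypothetical helper, and it assumes two rules not listed above (dots and hyphens are allowed, and the name must also end with a letter or number).

```python
import ipaddress
import re

# 3-63 characters: lowercase letters, digits, dots, hyphens;
# first and last character must be a letter or digit (assumption, see lead-in).
_NAME_RE = re.compile(r"^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$")


def is_valid_bucket_name(name: str) -> bool:
    """Return True if `name` passes the naming rules sketched in these notes."""
    if not _NAME_RE.match(name):
        return False
    if name.startswith("xn--") or name.endswith("-s3alias"):
        return False
    try:
        # Names formatted like an IPv4 address (e.g. 192.168.5.4) are rejected.
        ipaddress.IPv4Address(name)
        return False
    except ValueError:
        return True


for candidate in ["my-data-lake-2024", "My_Bucket", "192.168.5.4", "xn--bucket"]:
    print(candidate, is_valid_bucket_name(candidate))
```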
Objects
- Each object has a Key: the full path inside the bucket, e.g. `folder/file.txt` in `s3://my-bucket/folder/file.txt`
- Key = Prefix + Object name (`folder/` + `file.txt`)
- Max object size: 5 TB (use multipart upload for > 5 GB)
- Objects can have: Metadata (key-value pairs), Tags (up to 10), Version ID
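To make the bucket / prefix / object name split concrete, here is a small sketch. `split_s3_uri` is a hypothetical helper written for these notes, not part of any AWS SDK.

```python
def split_s3_uri(uri: str) -> tuple[str, str, str]:
    """Split an s3:// URI into (bucket, prefix, object name)."""
    scheme = "s3://"
    if not uri.startswith(scheme):
        raise ValueError(f"not an S3 URI: {uri!r}")
    # Everything after the bucket name is the object key.
    bucket, _, key = uri[len(scheme):].partition("/")
    # The prefix is the key up to and including the last "/".
    prefix, slash, name = key.rpartition("/")
    return bucket, prefix + slash, name


print(split_s3_uri("s3://my-bucket/folder/file.txt"))
```

The "folders" are only a naming convention: S3 stores a flat set of keys, and the prefix is simply the leading part of the key string.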
Storage Classes
graph LR
Hot[Frequent Access] --> Standard[S3 Standard<br/>Highest cost<br/>Low latency]
Warm[Infrequent Access] --> IA[S3 Standard-IA<br/>Retrieval fee<br/>Multi-AZ]
Warm --> OneZone[S3 One Zone-IA<br/>Retrieval fee<br/>Single AZ cheaper]
Warm --> IT[S3 Intelligent-Tiering<br/>Auto moves tiers<br/>Monitoring fee]
Cold[Archive] --> GI[Glacier Instant<br/>Ms retrieval]
Cold --> GF[Glacier Flexible<br/>Mins-hrs retrieval]
Cold --> GDA[Glacier Deep Archive<br/>Cheapest<br/>12hr retrieval]
style Hot fill:#fef9c3,stroke:#ca8a04
style Warm fill:#fed7aa,stroke:#ea580c
style Cold fill:#dbeafe,stroke:#3b82f6
style Standard fill:#fef9c3,stroke:#ca8a04
style IA fill:#fed7aa,stroke:#ea580c
style OneZone fill:#fed7aa,stroke:#ea580c
style IT fill:#fed7aa,stroke:#ea580c
style GI fill:#dbeafe,stroke:#3b82f6
style GF fill:#dbeafe,stroke:#3b82f6
style GDA fill:#dbeafe,stroke:#3b82f6
| Class | Examples | Retrieval | Availability | Durability |
|---|---|---|---|---|
| Standard | Website images, ML training data, active app data | Immediate | 99.99% | 11 9s, Multi-AZ |
| Intelligent-Tiering | User uploads with unknown access pattern, mixed workloads | Immediate | 99.9% | 11 9s, Multi-AZ |
| Standard-IA | DR backups, older logs still needed on demand | Immediate + fee | 99.9% | 11 9s, Multi-AZ |
| One Zone-IA | Thumbnails (reproducible), secondary backups | Immediate + fee | 99.5% | 11 9s, Single AZ |
| Glacier Instant | Medical images, news archives accessed occasionally | Milliseconds | 99.9% | 11 9s, Multi-AZ |
| Glacier Flexible | Yearly audit logs, raw video footage archive | Minutes–Hours | 99.99% | 11 9s, Multi-AZ |
| Glacier Deep Archive | Financial records (7yr compliance), HIPAA healthcare data | Within 12 hrs (standard), 48 hrs (bulk) | 99.99% | 11 9s, Multi-AZ |
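The percentages in the table are easier to compare as numbers. The sketch below is plain arithmetic on the design targets listed above (not SLA terms), and the helper names are made up for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def max_downtime_minutes(availability_pct: float) -> float:
    """Minutes per year a class may be unavailable at the given availability."""
    return (100.0 - availability_pct) / 100.0 * MINUTES_PER_YEAR


def years_per_lost_object(object_count: int, durability_nines: int = 11) -> float:
    """Average years between single-object losses for a given object count."""
    annual_loss_probability = 10.0 ** -durability_nines
    return 1.0 / (object_count * annual_loss_probability)


for name, pct in [("Standard", 99.99), ("Standard-IA", 99.9), ("One Zone-IA", 99.5)]:
    print(f"{name}: {max_downtime_minutes(pct) / 60:.1f} hours/year")

print(f"{years_per_lost_object(10_000_000):,.0f} years per lost object")
```

So 99.99% availability allows roughly 53 minutes of downtime per year, and 11 nines of durability means that with 10 million objects you would expect to lose one about every 10,000 years.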
Lifecycle Policies
Automatically transition or expire objects based on age.
graph LR
Upload([Object Uploaded]) -->|Day 0| Standard[S3 Standard]
Standard -->|Day 30| IA[Standard-IA]
IA -->|Day 90| Glacier[Glacier Flexible]
Glacier -->|Day 365| Delete([Expired / Deleted])
style Upload fill:#dcfce7,stroke:#16a34a
style Standard fill:#fef9c3,stroke:#ca8a04
style IA fill:#fed7aa,stroke:#ea580c
style Glacier fill:#dbeafe,stroke:#3b82f6
style Delete fill:#fce7f3,stroke:#db2777
- Transition rules — move objects to a cheaper storage class after N days
- Expiration rules — delete objects after N days
- Can target: current versions, non-current versions (versioning), or incomplete multipart uploads
- Rules can be scoped by prefix or object tags
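The flow in the diagram (Standard, then Standard-IA at day 30, Glacier Flexible at day 90, expire at day 365) could be expressed as a lifecycle configuration roughly like this. The rule ID, the `logs/` prefix, and the 30-day and 7-day cleanup values are assumptions for illustration.

```python
import json


def build_lifecycle_configuration(prefix: str) -> dict:
    """Build a lifecycle configuration dict in the shape the S3 API expects."""
    return {
        "Rules": [
            {
                "ID": "tier-then-expire",          # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},       # scope the rule by prefix
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},  # Glacier Flexible
                ],
                "Expiration": {"Days": 365},
                # Assumed cleanup values for old versions and stalled uploads.
                "NoncurrentVersionExpiration": {"NoncurrentDays": 30},
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
            }
        ]
    }


config = build_lifecycle_configuration("logs/")
print(json.dumps(config, indent=2))
# With boto3 this dict would typically be passed as:
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="my-bucket", LifecycleConfiguration=config)
```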
Versioning
- Keeps multiple versions of the same object in a bucket — enabled at bucket level
- Protects against accidental delete or overwrite
- Once enabled, cannot be fully disabled (only suspended)
- Delete adds a delete marker — older versions still exist and can be restored
- MFA Delete — requires MFA to permanently delete a version (extra protection)
- Files uploaded before versioning was enabled get version ID `null`
bucket/
photo.jpg (v3 — current)
photo.jpg (v2)
photo.jpg (v1)
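The delete-marker behaviour is easiest to see in a toy model. This is an in-memory sketch for intuition only: `VersionedBucket` is hypothetical, real version IDs are opaque strings, and the real service has more cases than shown here.

```python
class VersionedBucket:
    """Toy in-memory model of a versioning-enabled bucket."""

    _DELETE_MARKER = object()

    def __init__(self):
        self._versions = {}  # key -> list of (version_id, body), oldest first
        self._counter = 0

    def _append(self, key, body):
        self._counter += 1
        version_id = f"v{self._counter}"  # simplified; real IDs are opaque
        self._versions.setdefault(key, []).append((version_id, body))
        return version_id

    def put(self, key, body):
        """Every upload adds a new version instead of overwriting."""
        return self._append(key, body)

    def delete(self, key):
        """Delete without a version ID: adds a delete marker, keeps history."""
        return self._append(key, self._DELETE_MARKER)

    def delete_version(self, key, version_id):
        """Permanently remove one specific version (or delete marker)."""
        self._versions[key] = [v for v in self._versions[key] if v[0] != version_id]

    def get(self, key, version_id=None):
        versions = self._versions.get(key, [])
        if version_id is not None:
            versions = [v for v in versions if v[0] == version_id]
        if not versions or versions[-1][1] is self._DELETE_MARKER:
            raise KeyError(key)
        return versions[-1][1]


bucket = VersionedBucket()
v1 = bucket.put("photo.jpg", b"first")
v2 = bucket.put("photo.jpg", b"second")
marker = bucket.delete("photo.jpg")            # object now looks deleted
old = bucket.get("photo.jpg", version_id=v1)   # older versions are still readable
bucket.delete_version("photo.jpg", marker)     # removing the marker restores v2
```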
Replication
- Requires versioning enabled on both source and destination buckets
- Copying is asynchronous — only new objects are replicated after rule is created
- To replicate existing objects: use S3 Batch Replication
- No chaining — replication from A→B does not auto-replicate to C
Delete behaviour:
- Replicating delete markers is optional
- Permanent deletes (by version ID) are never replicated
| | CRR | SRR |
|---|---|---|
| Full name | Cross-Region Replication | Same-Region Replication |
| Use case | DR, lower latency for global users | Log aggregation, compliance, test/prod sync |
| Cost | Higher (cross-region transfer) | Lower |
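A replication rule is just configuration on the source bucket. The sketch below builds one in the shape the S3 API expects; the role ARN, bucket name, and rule ID are placeholders, and delete-marker replication is left off to show that it is opt-in.

```python
def build_replication_configuration(
    role_arn: str,
    destination_bucket: str,
    replicate_delete_markers: bool = False,
) -> dict:
    """Build a replication configuration dict for a source bucket."""
    return {
        "Role": role_arn,  # IAM role S3 assumes to copy objects
        "Rules": [
            {
                "ID": "replicate-all",   # hypothetical rule name
                "Priority": 1,
                "Status": "Enabled",
                "Filter": {},            # empty filter = whole bucket
                "DeleteMarkerReplication": {
                    "Status": "Enabled" if replicate_delete_markers else "Disabled"
                },
                "Destination": {"Bucket": f"arn:aws:s3:::{destination_bucket}"},
            }
        ],
    }


replication = build_replication_configuration(
    role_arn="arn:aws:iam::123456789012:role/replication-role",  # placeholder
    destination_bucket="my-bucket-replica",                      # placeholder
)
# With boto3 this would typically be applied with:
#   boto3.client("s3").put_bucket_replication(
#       Bucket="my-bucket", ReplicationConfiguration=replication)
```

The same structure covers CRR and SRR; the only difference is whether the destination bucket lives in another region.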
Encryption
| Type | Key Owner | Notes |
|---|---|---|
| SSE-S3 | AWS | Default, AES-256, no extra cost |
| SSE-KMS | You (AWS KMS) | CloudTrail audit trail, extra cost, quota limits apply |
| SSE-C | You (per request) | AWS never stores your key, HTTPS required |
| Client-Side | You | Encrypt before upload; AWS sees only ciphertext |
SSE-S3
- Keys managed entirely by AWS — transparent to the user
- AES-256 encryption, server-side
- Header: `"x-amz-server-side-encryption": "AES256"`
- Enabled by default for all new buckets and objects
SSE-KMS
- Keys managed via AWS KMS — you control key rotation, access, and get CloudTrail audit logs
- Header: `"x-amz-server-side-encryption": "aws:kms"`
- Quota limits: upload calls `GenerateDataKey`, download calls `Decrypt`; both count against KMS quota (5,500–30,000 req/s per region)
- Quota can be increased via Service Quotas Console
SSE-C
- You supply the key with every request — AWS uses it to encrypt/decrypt but never stores it
- HTTPS required — key is sent in HTTP headers
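For SSE-C, "supplying the key with every request" means sending three headers. The sketch below builds them with the standard library only; `sse_c_headers` is a hypothetical helper, and in practice an SDK usually sets these headers for you.

```python
import base64
import hashlib
import os


def sse_c_headers(key: bytes) -> dict:
    """Build the request headers that carry a customer-provided key."""
    if len(key) != 32:
        raise ValueError("SSE-C expects a 256-bit (32-byte) key")
    return {
        "x-amz-server-side-encryption-customer-algorithm": "AES256",
        # The key itself, base64-encoded. This is why HTTPS is required.
        "x-amz-server-side-encryption-customer-key":
            base64.b64encode(key).decode("ascii"),
        # MD5 digest of the key, used by S3 as an integrity check.
        "x-amz-server-side-encryption-customer-key-MD5":
            base64.b64encode(hashlib.md5(key).digest()).decode("ascii"),
    }


headers = sse_c_headers(os.urandom(32))
print(sorted(headers))
```

The same key must be sent again on every GET; if it is lost, the object cannot be decrypted.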
Client-Side Encryption
- Encrypt data client-side before uploading — AWS sees only ciphertext
- You manage keys and encryption logic entirely
Encryption in Transit
- S3 supports both HTTP (unencrypted) and HTTPS (TLS/SSL)
- HTTPS is mandatory for SSE-C, recommended for all
- To enforce HTTPS: add a bucket policy statement with Effect `Deny` and the condition `aws:SecureTransport = false`
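The HTTPS-only rule above can be written as a bucket policy. A minimal sketch follows; the `Sid` and the bucket name `my-bucket` are placeholders.

```python
import json


def deny_insecure_transport_policy(bucket: str) -> dict:
    """Bucket policy that rejects any request not made over HTTPS."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyInsecureTransport",  # hypothetical statement ID
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [
                    f"arn:aws:s3:::{bucket}",     # the bucket itself
                    f"arn:aws:s3:::{bucket}/*",   # every object in it
                ],
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            }
        ],
    }


policy = deny_insecure_transport_policy("my-bucket")
print(json.dumps(policy, indent=2))
```

Because it is an explicit Deny, it takes precedence over any Allow granted elsewhere.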
Access Control
graph TD
User([User / Service]) --> BP[Bucket Policy]
User --> IAM[IAM Policy]
User --> ACL[ACL]
BP --> Bucket[(S3 Bucket)]
IAM --> Bucket
ACL --> Bucket
Bucket --> BPA[Block Public Access<br/>Override — always wins]
style User fill:#dbeafe,stroke:#3b82f6
style BP fill:#fef9c3,stroke:#ca8a04
style IAM fill:#dcfce7,stroke:#16a34a
style ACL fill:#fce7f3,stroke:#db2777
style Bucket fill:#ccfbf1,stroke:#0d9488
style BPA fill:#ffc9c9,stroke:#dc2626
Bucket Policy
- JSON policy attached directly to the bucket — controls users, roles, other accounts, services
- Use for: cross-account access, making bucket public, enforcing upload encryption
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": "*",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::my-bucket/*"
    }
  ]
}
IAM Policy
- Attached to IAM users, groups, or roles — not to the bucket itself
- Same-account requests are allowed if either the IAM policy or the bucket policy allows them and neither explicitly denies; cross-account requests need an Allow in both
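How IAM and bucket policies combine can be sketched as a small decision function. In general, an explicit Deny always wins, a same-account request needs an Allow from either policy, and a cross-account request needs an Allow from both. This is a deliberate simplification: it ignores Block Public Access, SCPs, permission boundaries, and ACLs, and `is_request_allowed` is a hypothetical name.

```python
from typing import Optional


def is_request_allowed(
    iam_effect: Optional[str],
    bucket_effect: Optional[str],
    same_account: bool,
) -> bool:
    """Each effect is "Allow", "Deny", or None (policy is silent)."""
    effects = (iam_effect, bucket_effect)
    if "Deny" in effects:
        return False              # explicit deny always wins
    if same_account:
        return "Allow" in effects  # either policy may grant access
    # Cross-account: the caller's IAM policy AND the bucket policy must allow.
    return iam_effect == "Allow" and bucket_effect == "Allow"


print(is_request_allowed("Allow", None, same_account=True))
print(is_request_allowed("Allow", None, same_account=False))
```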
ACLs (Access Control Lists)
- Legacy mechanism — basic read/write per AWS account or predefined group
- Disabled by default on new buckets (Object Ownership set to bucket owner enforced); AWS recommends bucket policies instead
Block Public Access
- Default: ON for all new buckets — overrides bucket policies and ACLs
- Turn off only when intentionally hosting public content (e.g. static website)
| Setting | What it blocks |
|---|---|
| BlockPublicAcls | Adding new public ACLs |
| IgnorePublicAcls | Existing public ACLs are ignored |
| BlockPublicPolicy | New bucket policies granting public access |
| RestrictPublicBuckets | Public and cross-account access to buckets that have a public policy |
Pre-signed URLs
- Temporary URL granting time-limited access to a private object
- Generated using your credentials — recipient gets your permissions for that object
- Default expiry: 1 hour | Max: 7 days (SDK-generated)
- Use case: share a private download link, allow a one-time upload
https://my-bucket.s3.amazonaws.com/photo.jpg
?X-Amz-Expires=3600 ← expires in 1 hour
&X-Amz-Signature=...
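To see when a link stops working, the expiry can be read back from the URL itself. This sketch assumes a SigV4-style URL that also carries an `X-Amz-Date` parameter (not shown in the example above); the URL and signature value here are placeholders, and `presigned_url_expiry` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import parse_qs, urlsplit

MAX_EXPIRES_SECONDS = 7 * 24 * 3600  # 7 days


def presigned_url_expiry(url: str) -> datetime:
    """Return the UTC time at which a pre-signed URL stops being valid."""
    params = parse_qs(urlsplit(url).query)
    signed_at = datetime.strptime(
        params["X-Amz-Date"][0], "%Y%m%dT%H%M%SZ"
    ).replace(tzinfo=timezone.utc)
    expires_in = int(params["X-Amz-Expires"][0])
    if not 1 <= expires_in <= MAX_EXPIRES_SECONDS:
        raise ValueError(f"X-Amz-Expires out of range: {expires_in}")
    return signed_at + timedelta(seconds=expires_in)


url = (
    "https://my-bucket.s3.amazonaws.com/photo.jpg"
    "?X-Amz-Date=20240101T120000Z"
    "&X-Amz-Expires=3600"
    "&X-Amz-Signature=placeholder"
)
print(presigned_url_expiry(url).isoformat())
```

With boto3, the URL itself would typically come from `generate_presigned_url("get_object", Params={...}, ExpiresIn=3600)`.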
Which to use?
| Scenario | Use |
|---|---|
| Grant another AWS account access | Bucket Policy |
| Give an EC2 instance access to S3 | IAM Role + IAM Policy |
| Temporary access to a private file | Pre-signed URL |
| Make entire bucket public | Bucket Policy + Disable Block Public Access |
| Lock down everything | Block Public Access |
Performance & Optimization
- Automatically scales to high request rates; typical first-byte latency is 100–200 ms
- Per-prefix throughput: at least 3,500 PUT/COPY/POST/DELETE and 5,500 GET/HEAD requests/sec
- No limit on number of prefixes — spread objects across prefixes to scale reads/writes
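Since the limits apply per prefix, the number of prefixes needed for a target request rate is simple division. The sketch below also shows one possible way to spread keys across prefixes with a hash; the two-digit shard naming is an assumption, not an AWS convention.

```python
import hashlib
import math

GET_PER_PREFIX = 5_500   # GET/HEAD requests per second per prefix
PUT_PER_PREFIX = 3_500   # PUT/COPY/POST/DELETE requests per second per prefix


def prefixes_needed(reads_per_sec: int, writes_per_sec: int) -> int:
    """Smallest number of prefixes that covers both target rates."""
    return max(
        1,
        math.ceil(reads_per_sec / GET_PER_PREFIX),
        math.ceil(writes_per_sec / PUT_PER_PREFIX),
    )


def sharded_key(key: str, shard_count: int) -> str:
    """Prepend a stable hash-derived shard prefix, e.g. '07/<key>'."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    shard = int(digest, 16) % shard_count
    return f"{shard:02d}/{key}"


print(prefixes_needed(reads_per_sec=55_000, writes_per_sec=0))
print(sharded_key("events/2024/01/01/part-0001.json", shard_count=10))
```

For example, 55,000 reads per second needs 10 prefixes (10 × 5,500).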
Multipart Upload
- Recommended for > 100 MB, required for > 5 GB
- Splits object into parts, uploads in parallel — faster, resumable on failure
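Splitting an object into parts is mostly offset arithmetic. The sketch below plans the parts under the usual multipart limits (parts of at least 5 MiB except the last, at most 10,000 parts); `plan_parts` and the 100 MiB default are illustrative choices, and SDK transfer managers normally do this for you.

```python
import math

MIB = 1024 ** 2
MIN_PART_SIZE = 5 * MIB
MAX_PARTS = 10_000


def plan_parts(object_size: int, part_size: int = 100 * MIB) -> list[tuple[int, int, int]]:
    """Return (part_number, start_offset, length) for each part."""
    if object_size <= 0:
        raise ValueError("object_size must be positive")
    if part_size < MIN_PART_SIZE:
        raise ValueError("part_size must be at least 5 MiB")
    # Grow the part size if needed so the upload fits in 10,000 parts.
    part_size = max(part_size, math.ceil(object_size / MAX_PARTS))
    parts = []
    offset = 0
    number = 1
    while offset < object_size:
        length = min(part_size, object_size - offset)  # last part may be smaller
        parts.append((number, offset, length))
        offset += length
        number += 1
    return parts


for number, offset, length in plan_parts(250 * MIB):
    print(number, offset, length)
```

Each part can then be uploaded in parallel and retried on its own if it fails.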
Transfer Acceleration
- Routes upload through nearest CloudFront edge location → forwards to S3 bucket region via AWS backbone
- Best for: large files uploaded by globally distributed users
- Compatible with multipart upload
Byte Range Fetches
- Download specific byte ranges of an object in parallel
- Use to: speed up large downloads, retrieve partial data (e.g. just the file header), improve resilience
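A byte-range fetch is an ordinary GET with an HTTP `Range` header. This sketch only computes the header values for parallel chunks; `range_headers` is a hypothetical helper and the sizes are arbitrary.

```python
def range_headers(object_size: int, chunk_size: int) -> list[str]:
    """Return one HTTP Range header value per chunk (end offsets are inclusive)."""
    if object_size <= 0 or chunk_size <= 0:
        raise ValueError("sizes must be positive")
    return [
        f"bytes={start}-{min(start + chunk_size, object_size) - 1}"
        for start in range(0, object_size, chunk_size)
    ]


print(range_headers(object_size=10, chunk_size=4))
# Only the first 100 bytes (e.g. a file header) would be "bytes=0-99".
```

With boto3, each value would typically be passed as the `Range` argument of `get_object`.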
Event Notifications
Trigger actions automatically when objects are created, deleted, or restored.
graph LR
S3[(S3 Bucket)] -->|Object events| Direct[Direct Destinations]
S3 -->|All events| EB[Amazon EventBridge]
Direct --> Lambda[AWS Lambda]
Direct --> SQS[SQS Queue]
Direct --> SNS[SNS Topic]
EB -->|Rules + filters| Step[Step Functions]
EB -->|Rules + filters| Kinesis[Kinesis Streams]
EB -->|Rules + filters| Lambda2[Lambda / 18+ services]
style S3 fill:#c3fae8,stroke:#0d9488
style EB fill:#fef9c3,stroke:#ca8a04
style Direct fill:#dbeafe,stroke:#3b82f6
style Lambda fill:#fed7aa,stroke:#ea580c
style SQS fill:#fed7aa,stroke:#ea580c
style SNS fill:#fed7aa,stroke:#ea580c
style Step fill:#f3e8ff,stroke:#9333ea
style Kinesis fill:#f3e8ff,stroke:#9333ea
style Lambda2 fill:#f3e8ff,stroke:#9333ea
| | Direct (Lambda / SQS / SNS) | Via EventBridge |
|---|---|---|
| Setup | Simple | Requires EventBridge rule |
| Filtering | Basic (prefix/suffix) | Advanced JSON rules |
| Destinations | 3 options | 18+ AWS services |
| Extra features | — | Archive, replay, reliable delivery |
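On the receiving side, a direct Lambda destination gets a JSON event with one record per object. A minimal handler sketch, assuming the standard S3 notification record layout; the sample event below is hand-written for illustration.

```python
from urllib.parse import unquote_plus


def handler(event, context=None):
    """Extract (event name, bucket, key) from each S3 notification record."""
    results = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Keys arrive URL-encoded (spaces become "+"), so decode them first.
        key = unquote_plus(record["s3"]["object"]["key"])
        results.append((record["eventName"], bucket, key))
    return results


sample_event = {
    "Records": [
        {
            "eventName": "ObjectCreated:Put",
            "s3": {
                "bucket": {"name": "my-bucket"},
                "object": {"key": "uploads/holiday+photo.jpg"},
            },
        }
    ]
}
print(handler(sample_event))
```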
S3 Select
- Run SQL queries directly on S3 objects — no need to download and process the full file
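- Supported formats: CSV, JSON, Parquet (GZIP/BZIP2 compression is supported for CSV and JSON objects)
- AWS closed S3 Select to new customers in July 2024; existing customers can keep using it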
- Returns only the filtered subset → reduces data transferred and processing cost
- Example: `SELECT * FROM S3Object WHERE age > 30`
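A query like the one above is sent together with a description of the input and output formats. The sketch below builds the request arguments for a gzipped CSV with a header row; the bucket, key, and `age` column are placeholders, and the `CAST` is there on the assumption that CSV fields are read as strings.

```python
def build_select_request(bucket: str, key: str, min_age: int) -> dict:
    """Build keyword arguments for an S3 Select call on a gzipped CSV."""
    return {
        "Bucket": bucket,
        "Key": key,
        "ExpressionType": "SQL",
        "Expression": (
            "SELECT * FROM S3Object s "
            f"WHERE CAST(s.age AS INT) > {int(min_age)}"
        ),
        "InputSerialization": {
            "CSV": {"FileHeaderInfo": "USE"},  # first row holds column names
            "CompressionType": "GZIP",
        },
        "OutputSerialization": {"JSON": {}},    # one JSON record per match
    }


request = build_select_request("my-bucket", "people.csv.gz", min_age=30)
print(request["Expression"])
# With boto3 this would typically be passed as:
#   boto3.client("s3").select_object_content(**request)
```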
S3 Access Points
- Named entry points for a bucket — each with its own DNS name and access policy
- Simplify security when multiple teams or apps need different access to the same bucket
graph LR
TeamA[Team A] -->|Access Point A<br/>finance policy| Bucket[(S3 Bucket)]
TeamB[Team B] -->|Access Point B<br/>analytics policy| Bucket
TeamC[Team C] -->|Access Point C<br/>read-only policy| Bucket
style Bucket fill:#c3fae8,stroke:#0d9488
style TeamA fill:#dbeafe,stroke:#3b82f6
style TeamB fill:#dcfce7,stroke:#16a34a
style TeamC fill:#fef9c3,stroke:#ca8a04
VPC Origin
- Restrict access point to VPC only — no public internet access
- Requires a VPC Endpoint (Gateway or Interface type) pointing to the access point
- VPC Endpoint Policy must allow access to both the access point and the underlying bucket
graph LR
App[VPC Application] --> VPCe[VPC Endpoint<br/>Gateway or Interface]
VPCe --> AP[S3 Access Point<br/>VPC Origin]
AP --> Bucket[(S3 Bucket)]
style App fill:#dbeafe,stroke:#3b82f6
style VPCe fill:#fef9c3,stroke:#ca8a04
style AP fill:#fed7aa,stroke:#ea580c
style Bucket fill:#c3fae8,stroke:#0d9488
S3 Object Lambda
- Transform an object on-the-fly before returning it to the caller — no need to store multiple versions
- Only one S3 bucket needed — add an S3 Access Point + an S3 Object Lambda Access Point on top
graph LR
App([Caller App]) --> OLAP[S3 Object Lambda<br/>Access Point]
OLAP --> Lambda[Lambda Function<br/>transform on-the-fly]
Lambda --> AP[S3 Access Point]
AP --> Bucket[(S3 Bucket<br/>raw object)]
style App fill:#dbeafe,stroke:#3b82f6
style OLAP fill:#f3e8ff,stroke:#9333ea
style Lambda fill:#fed7aa,stroke:#ea580c
style AP fill:#fef9c3,stroke:#ca8a04
style Bucket fill:#c3fae8,stroke:#0d9488
Use Cases:
- Redact PII before serving data to analytics or non-prod environments
- Convert formats on-the-fly (e.g. XML → JSON)
- Resize or watermark images per requesting user
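The PII-redaction use case could look roughly like this. The transform is plain Python; the handler takes two injected callables so it can be read and tested without AWS: `fetch` (hypothetical, downloads the original object from the pre-signed `inputS3Url`) and `respond` (in a real function this would typically be the S3 client's `write_get_object_response`). The email regex is a simple illustration, not a complete PII detector.

```python
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_emails(text: str) -> str:
    """Replace anything that looks like an email address."""
    return EMAIL_RE.sub("[REDACTED]", text)


def handle_get_object(event, fetch, respond):
    """Transform the object on the fly and return it to the caller."""
    ctx = event["getObjectContext"]
    original = fetch(ctx["inputS3Url"])   # raw object from the bucket
    respond(
        Body=redact_emails(original),
        RequestRoute=ctx["outputRoute"],  # tells S3 which request to answer
        RequestToken=ctx["outputToken"],
    )


print(redact_emails("contact alice@example.com now"))
```

The bucket keeps a single raw copy; each caller sees only the transformed view.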