
ML Feature Store

Problem

Design a Feature Store — a centralised repository for storing, sharing, and serving ML features for training and inference.


Why It Matters for You

Direct relevance: Crest Data ML thresholding engine — KPI features needed consistent computation across training and real-time inference. A feature store solves exactly this.


Functional Requirements

  • Store + version features computed from raw data
  • Serve features for real-time inference (low latency)
  • Batch features for model training
  • Feature discovery — catalog with metadata
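The four requirements above can be sketched as one minimal in-memory interface (a hypothetical API for illustration — real systems like Feast expose similar calls, but all names here are assumptions):

```python
import time

class FeatureStore:
    """Minimal sketch: register, write, and read versioned features."""

    def __init__(self):
        self.registry = {}  # feature name -> metadata (discovery / catalog)
        self.online = {}    # (feature, entity_id) -> latest value (inference)
        self.offline = []   # append-only history of rows (training)

    def register(self, name, version, description):
        # Feature discovery: catalog entry with version + metadata
        self.registry[name] = {"version": version, "description": description}

    def write(self, name, entity_id, value):
        # One write feeds both stores, keeping them consistent
        ts = time.time()
        self.online[(name, entity_id)] = value             # latest value, point read
        self.offline.append((name, entity_id, value, ts))  # full history

    def get_online(self, name, entity_id):
        # Real-time inference path: single low-latency key lookup
        return self.online[(name, entity_id)]

    def get_offline(self, name):
        # Training path: bulk read of all historical values for a feature
        return [(e, v, ts) for (n, e, v, ts) in self.offline if n == name]
```

Usage: `register("kpi_avg_7d", "v1", ...)`, then `write(...)` from the pipeline; inference calls `get_online`, training calls `get_offline`.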

Non-Functional Requirements

  • Low latency online serving (< 10ms p99)
  • High throughput batch reads
  • Consistency between training and serving (training-serving skew problem)

High-Level Design

Data Sources → Feature Pipeline (Spark/Flink)
    ├→ Offline Store (S3/Data Warehouse) → Model Training
    └→ Online Store (Redis/DynamoDB) → Real-time Inference
              ↑
    Feature Registry (metadata, versioning)

Key Concepts

  • Online Store — low-latency KV store (Redis, DynamoDB) for inference
  • Offline Store — historical data lake (S3 + Parquet) for training
  • Feature Pipeline — batch (Spark) or streaming (Flink/Kafka) computation
  • Feature Registry — catalog of all features, versions, ownership
  • Training-serving skew — the biggest problem: the same feature must be computed identically in the batch (training) and streaming (serving) paths
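The standard defense against the skew problem above is to define each transformation exactly once and call that same function from both pipelines. A minimal sketch (all function names are illustrative, not from any specific library):

```python
def kpi_zscore(value, mean, std):
    """Single source of truth for the feature computation.
    Both pipelines call this, so batch and streaming cannot diverge."""
    return (value - mean) / std if std else 0.0

def batch_pipeline(values, mean, std):
    # Offline path: e.g. a Spark job mapping over historical data
    return [kpi_zscore(v, mean, std) for v in values]

def streaming_pipeline(event_value, mean, std):
    # Online path: e.g. a Flink operator applied per event
    return kpi_zscore(event_value, mean, std)

hist = batch_pipeline([1.0, 2.0, 3.0], mean=2.0, std=1.0)
live = streaming_pipeline(3.0, mean=2.0, std=1.0)
assert live == hist[2]  # identical computation → no skew
```

If the batch job and the stream processor each reimplement the formula independently, even a small difference (rounding, null handling, window boundaries) silently degrades the model in production.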

Interview Angle (Your War Story)

At Crest, the ML thresholding engine needed KPI features at training time and at real-time inference time. Without a feature store, we risked skew. Use this to explain WHY feature stores exist.


Notes

<!-- Add as you study Gaurav Sen W7 — Sat May 2 -->