Section D · Production

Data Pipelines for AI

ETL/ELT, batch vs streaming, Airflow/Spark/Kafka, RAG indexing — the vocabulary and patterns to talk about pipelines fluently.

What "data pipeline" means in this role

In an AI engineering role, "data pipeline" usually means one of three things:

  1. The pipeline that gets data INTO the model at inference time (RAG indexing, retrieval, prompt assembly).
  2. The pipeline that gets training/fine-tuning data ready (less common in agentic-LLM roles, but possible).
  3. The pipeline that takes model outputs and routes them downstream (post-processing, validation, persistence, audit logging).

For this kind of role, you're mostly in worlds 1 and 3. World 2 matters mainly for evals (building ground-truth datasets).

Vocabulary that lets you talk about pipelines

ETL vs ELT

  • ETL (Extract, Transform, Load): transform happens before loading into the warehouse. The older pattern, designed to minimize storage in a then-expensive warehouse.
  • ELT (Extract, Load, Transform): land raw data, transform inside the warehouse. Modern default — cloud warehouses make compute cheap relative to storage.

Batch vs streaming

  • Batch: scheduled, processes a set of records at once. Use for: training data prep, periodic re-indexing, regulatory reporting.
  • Streaming: continuous, processes events as they arrive. Use for: real-time alerts, transaction monitoring, anything where latency matters.
  • Micro-batch: streaming-ish (small batches every few seconds). The default mode of Spark Structured Streaming.
  • Lambda architecture: combines batch + streaming for the same data.
  • Kappa architecture: streaming-only, treating batch as a special case.

In compliance: transaction monitoring is often streaming (alerts within minutes); regulatory change ingestion is batch (daily/weekly).

Idempotency / exactly-once / at-least-once

  • At-most-once: messages may be lost, never duplicated.
  • At-least-once: messages always delivered, may be duplicated. Most common.
  • Exactly-once: messages delivered exactly once. Hard in practice; usually approximated via at-least-once delivery plus idempotency at the consumer.

For AI: matters when an agent action triggers a downstream pipeline. Without idempotency, a retried agent step can cause duplicate filings or duplicate notifications.
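
A minimal sketch of that consumer-side idempotency, assuming a key-value dedupe store (the in-memory set, the key scheme, and the notification helper are all illustrative, not from any particular library):

```python
import hashlib
import json

# Illustrative in-memory dedupe store; in production this would be
# Redis, DynamoDB, or a database table with a unique constraint.
_processed: set = set()

def file_regulatory_notification(event: dict) -> None:
    """Hypothetical downstream side effect an agent step triggers."""
    print(f"filing notification for case {event['case_id']}")

def idempotency_key(event: dict) -> str:
    """Stable key over the fields that define 'the same action'."""
    payload = json.dumps(
        {"case_id": event["case_id"], "action": event["action"]},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()

def handle(event: dict) -> None:
    """At-least-once delivery means this can run twice for one event;
    the key check makes the side effect happen at most once."""
    key = idempotency_key(event)
    if key in _processed:
        return  # duplicate delivery or retried agent step: safe no-op
    file_regulatory_notification(event)
    _processed.add(key)
```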

Schema evolution

What happens when the shape of data changes. Strategies:

  • Backward compatibility: new code can read data written with the old schema.
  • Forward compatibility: old code can read data written with the new schema.
  • Schema registry (Confluent for Kafka, Glue Schema Registry on AWS): centrally manages evolution.
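
A small sketch of what the two directions mean in code (field names are invented). The tolerant-reader pattern, defaulting fields that old records lack, is what makes the "new code reads old data" direction work:

```python
from dataclasses import dataclass

# v2 adds an optional field. Defaulting it lets new code read v1
# records (backward compatibility); because v2 only *adds* an optional
# field, old code that ignores unknown fields still reads v2 records
# (forward compatibility).
@dataclass
class TransactionV2:
    tx_id: str
    amount: float
    risk_tier: str = "unknown"  # new in v2, defaulted for old records

def parse(record: dict) -> TransactionV2:
    return TransactionV2(
        tx_id=record["tx_id"],
        amount=record["amount"],
        risk_tier=record.get("risk_tier", "unknown"),
    )

print(parse({"tx_id": "t1", "amount": 250.0}))  # v1 record, v2 reader
print(parse({"tx_id": "t2", "amount": 90.0, "risk_tier": "high"}))
```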

Lineage

The ability to trace any data point from source to consumption. Tools: OpenLineage, DataHub, Atlan. In compliance, lineage is non-optional — you must be able to show a regulator "this AI decision was based on these specific data sources, retrieved at this time."

Data quality

Tests that catch bad data before it reaches consumers. Tools: Great Expectations, dbt tests, Soda. Categories: completeness, uniqueness, referential integrity, freshness, distribution drift.
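
A hedged sketch of those categories as concrete checks, in plain pandas rather than any specific framework (column names and the freshness/drift thresholds are invented; Great Expectations and dbt tests express the same ideas declaratively):

```python
import pandas as pd

EXPECTED_MEAN_CHUNKS = 40  # illustrative baseline from historical runs

def quality_checks(df: pd.DataFrame) -> list:
    """Return failure descriptions; an empty list means the batch passes.
    Assumes ingested_at is a tz-aware UTC timestamp column."""
    failures = []
    # Completeness: required columns must be non-null
    if df["document_id"].isna().any():
        failures.append("completeness: null document_id")
    # Uniqueness: one row per document version
    if df.duplicated(subset=["document_id", "version"]).any():
        failures.append("uniqueness: duplicate (document_id, version)")
    # Freshness: newest record must be recent enough
    age = pd.Timestamp.now(tz="UTC") - df["ingested_at"].max()
    if age > pd.Timedelta(days=2):
        failures.append(f"freshness: newest record is {age} old")
    # Distribution drift: crude guard on a numeric column
    if df["chunk_count"].mean() > 2 * EXPECTED_MEAN_CHUNKS:
        failures.append("distribution: chunk_count mean doubled vs baseline")
    return failures
```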

The ecosystem — names you should recognize

| Tool | Purpose | When you'd use it |
|---|---|---|
| Airflow | Workflow orchestrator (DAGs of tasks) | Batch pipelines, scheduled jobs |
| Prefect, Dagster | Modern Airflow alternatives | Same, with better UX/dev ergonomics |
| dbt | SQL transformations in the warehouse | ELT transformations, modeling |
| Spark | Distributed compute | Big-data batch / streaming transforms |
| Kafka | Distributed event streaming | High-throughput message bus |
| Kinesis (AWS), Pub/Sub (GCP) | Cloud-native streaming | Same role as Kafka, managed |
| Flink | Stream processing engine | Stateful streaming computations |
| Beam (Apache) | Unified batch+streaming model (Dataflow on GCP) | Cross-cloud streaming pipelines |
| Snowflake, BigQuery, Redshift | Cloud data warehouses | Storage + analytics compute |
| Iceberg, Delta Lake, Hudi | Table formats on object storage (lakehouse) | Modern data lake with ACID |
| Glue (AWS), Dataproc (GCP) | Managed Spark | Cloud-native Spark workloads |

You don't need depth in all of them; you need to match the right names to the right roles.

Pipelines for RAG / retrieval (most relevant)

This is the data pipeline that matters most for agentic AI in compliance:

[Source documents]
    ↓
[Ingestion] — fetch, dedupe, store raw
    ↓
[Cleanup] — strip headers/footers, OCR if scanned, normalize
    ↓
[Chunking] — split into retrievable units
    ↓
[Enrichment] — add metadata (jurisdiction, document_type, effective_date)
    ↓
[Contextualization] (Anthropic technique) — prepend doc-context to each chunk
    ↓
[Embedding] — call embedding model, get vectors
    ↓
[Indexing] — write to vector DB + metadata store
    ↓
[Refresh trigger] — schedule re-runs / change detection
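
A sketch of those stages wired up as an Airflow DAG, assuming Airflow 2.4+ (the task callables are placeholders for the real stage logic):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Placeholder callables; each would invoke the real stage logic.
def ingest(**_): ...
def clean(**_): ...
def chunk(**_): ...
def enrich(**_): ...
def contextualize(**_): ...
def embed(**_): ...
def index(**_): ...

with DAG(
    dag_id="rag_reindex",
    start_date=datetime(2024, 1, 1),
    schedule="0 2 * * *",  # nightly refresh trigger
    catchup=False,
) as dag:
    tasks = [
        PythonOperator(task_id=name, python_callable=fn)
        for name, fn in [
            ("ingest", ingest), ("clean", clean), ("chunk", chunk),
            ("enrich", enrich), ("contextualize", contextualize),
            ("embed", embed), ("index", index),
        ]
    ]
    for upstream, downstream in zip(tasks, tasks[1:]):
        upstream >> downstream  # strictly linear, matching the diagram
```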

Key compliance considerations

  • Source freshness: regulations change. The index must reflect the current version. Pipeline includes detection of "the source document changed since we last indexed it."
  • Versioning: when a regulation is superseded, you keep the old version (audit may need it) but de-prioritize it in retrieval. Index both with effective_date metadata.
  • PII: customer-related documents need access controls. Maintain separate indexes by data classification.
  • Lineage: every retrieved chunk must be traceable to its source document, version, and retrieval time. Goes in the audit log (example record below).
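
What one such lineage record might look like per retrieved chunk (a sketch; every field name and value is illustrative):

```python
# One audit-log entry per retrieved chunk, written at answer time.
retrieval_audit_record = {
    "chunk_id": "fatf-r10-s3-c17",
    "source_document": "FATF Recommendation 10",
    "document_version": "2023-02",      # superseded versions stay indexed
    "effective_date": "2023-02-24",
    "data_classification": "public",    # decides which index it lives in
    "retrieved_at": "2025-01-15T09:42:11Z",
    "embedding_model": "embed-v3",      # illustrative model name
    "query_id": "q-8812",               # joins back to the agent trace
}
```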

Chunking strategies

  • Fixed-size: split every N tokens (with overlap). Simple, often inadequate.
  • Sentence/paragraph: respects natural boundaries. Better for prose.
  • Recursive structural: split by chapter → section → paragraph. Best for structured documents (regulations, policies).
  • Semantic chunking: use embeddings to find topic boundaries. More complex, sometimes better.

For a regulatory document, recursive structural is usually the right default: preserves heading hierarchy, ensures each chunk is self-contained.
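
A minimal sketch of recursive structural chunking, assuming headings are detectable with a regex (real regulatory documents usually need a sturdier parser than this):

```python
import re

def recursive_chunks(text: str, separators=None, max_chars=2000) -> list:
    """Split on the coarsest separator first; recurse into any piece
    still too large; hard-split at max_chars as a last resort."""
    if separators is None:
        # chapter → section → paragraph, detected by heading patterns
        separators = [r"\n(?=Chapter )", r"\n(?=Section )", r"\n\n"]
    text = text.strip()
    if len(text) <= max_chars:
        return [text] if text else []
    if not separators:
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    chunks = []
    for piece in re.split(separators[0], text):
        chunks.extend(recursive_chunks(piece, separators[1:], max_chars))
    return chunks
```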

Contextual retrieval (Anthropic technique)

Before embedding, prepend a brief context paragraph generated by an LLM: "This chunk is from FATF Recommendation 10, section on Customer Due Diligence, requiring ongoing monitoring..."

That context dramatically improves retrieval quality for chunks that would otherwise be ambiguous out of context. Cite the technique by name; it's a known, current best practice.
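
A sketch of the contextualization step (the prompt wording paraphrases Anthropic's published example; the `llm` helper is a placeholder, not a real client):

```python
CONTEXT_PROMPT = """<document>
{document}
</document>

Here is a chunk from that document:
<chunk>
{chunk}
</chunk>

Write a short paragraph situating this chunk within the overall document,
to improve search retrieval of the chunk. Answer with only the context."""

def llm(prompt: str) -> str:
    """Placeholder for a real LLM call (e.g., a messages-API client)."""
    raise NotImplementedError

def contextualize_chunk(document: str, chunk: str) -> str:
    context = llm(CONTEXT_PROMPT.format(document=document, chunk=chunk))
    return f"{context}\n\n{chunk}"

# The contextualized text, not the raw chunk, is what gets embedded:
#   vector = embed(contextualize_chunk(doc_text, chunk_text))
```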

Pipelines for evals and ground-truth

The other big pipeline: building the evaluation dataset.

[Production trace store] ← every agent run logs here
    ↓
[Sample for review] — random + stratified by case type / risk tier
    ↓
[SME annotation tool] — human grades the AI output, captures rationale
    ↓
[Eval dataset] — versioned, schema-validated
    ↓
[Eval harness] — runs system under test, scores, reports

Tools: a simple SME annotation tool can be a Streamlit app pointing at the trace store. More mature options: Argilla, Humanloop, or Braintrust for annotation workflows.
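
A sketch of the sampling stage in the standard library (the stratification keys are illustrative; the point is that rare-but-important case types still land in the eval set):

```python
import random
from collections import defaultdict

def stratified_sample(traces, per_stratum=20, seed=7):
    """Sample up to per_stratum traces from each (case_type, risk_tier)
    bucket; a fixed seed keeps the sample reproducible."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for trace in traces:
        buckets[(trace["case_type"], trace["risk_tier"])].append(trace)
    sample = []
    for bucket in buckets.values():
        sample.extend(rng.sample(bucket, min(per_stratum, len(bucket))))
    return sample
```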

Streaming for compliance: transaction monitoring

A typical streaming flow:

[Trade execution events] → Kafka topic
    ↓
[Enrichment service] — joins customer KYC, prior history, sanctions list cache
    ↓
[Rule engine] — deterministic rules
    ↓
[ML model scoring] — anomaly score, fraud score
    ↓
[Threshold] → if alert-worthy, emit to alerts topic
    ↓
[Alert pre-screening agent] (Claude API + tools)
    ↓
[Pre-screened alerts queue] → human investigators

If asked to design this end-to-end, walk through it. Each stage has its own scaling and reliability concerns; the AI agent is one piece, not the whole thing.
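
A sketch of the middle stages as a single consumer loop using confluent-kafka (topic names, the threshold, and the enrich/rule/score helpers are all invented; in a real system each stage is its own service):

```python
import json

from confluent_kafka import Consumer, Producer

ALERT_THRESHOLD = 0.8  # illustrative

def enrich(event: dict) -> dict:
    """Placeholder: join customer KYC, prior history, sanctions cache."""
    return event

def rule_hits(event: dict) -> bool:
    """Placeholder deterministic rule, e.g. amount over a hard limit."""
    return event.get("amount", 0) > 10_000

def score(event: dict) -> float:
    """Placeholder for the ML anomaly/fraud score."""
    return 0.0

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "txn-monitor",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["trade-executions"])
producer = Producer({"bootstrap.servers": "localhost:9092"})

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        event = enrich(json.loads(msg.value()))
        if rule_hits(event) or score(event) > ALERT_THRESHOLD:
            # at-least-once delivery: downstream must be idempotent
            producer.produce("alerts", json.dumps(event).encode())
finally:
    consumer.close()
    producer.flush()
```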

AWS pipeline shape

  • Ingestion: API Gateway → Lambda, or direct producer → Kinesis Data Streams / MSK (managed Kafka).
  • Processing: Lambda (small), Glue (Spark), or Managed Service for Apache Flink (formerly Kinesis Data Analytics).
  • Storage: S3 (raw / staging / curated zones), Iceberg/Delta tables on top.
  • Warehouse: Redshift, or query S3 directly via Athena.
  • Orchestration: Step Functions, MWAA (managed Airflow).
  • Vector DB: OpenSearch (with vector engine), or Aurora pgvector, or external (Pinecone).

GCP pipeline shape

  • Ingestion: Pub/Sub.
  • Processing: Dataflow (Apache Beam), Dataproc (Spark), Cloud Functions.
  • Storage: GCS (raw / staging / curated), Iceberg/Delta on GCS.
  • Warehouse: BigQuery — central piece.
  • Orchestration: Cloud Composer (managed Airflow), Workflows.
  • Vector DB: Vertex AI Vector Search, AlloyDB pgvector, or external.

For full deployment patterns see 13-model-deployment.

Pipeline design questions to expect

  • "How would you ingest and re-index a regulatory document store nightly?"
  • "How would you detect drift in the retrieval corpus?"
  • "How do you handle a downstream system being down when an agent wants to write to it?"
  • "How do you back-fill historical data into your eval pipeline?"
  • "How do you ensure agent outputs are persisted reliably even when there are downstream failures?"

For each, walk through:

  1. What you'd ingest from where.
  2. What you'd transform / validate.
  3. Where you'd land it.
  4. What runs on a schedule vs trigger.
  5. What the failure mode looks like and how you recover.
  6. How lineage / audit is preserved.

Common pitfalls — flag unprompted

A senior signal: you mention these without being asked.

  • Schema drift: source documents change shape; pipeline silently fails or produces garbage.
  • Backfilling: a bug means re-processing historical data. Idempotent steps + replay-from-source design make this tractable.
  • Late / out-of-order data: streaming systems get events out of order or hours late. Watermarking, windowing, late-arrival policies.
  • Backpressure: when downstream slows, what happens upstream? Buffering, dropping, blocking.
  • Cost: full re-embeds on every refresh are expensive. Detect document-level changes; re-embed only what changed (see the sketch after this list).
  • PII at every layer: tokenize/encrypt at ingest, not at query time.
  • Test with real data shape: synthetic test data hides real-world weirdness.
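
On the cost point, a minimal sketch of document-level change detection via content hashing (the in-memory store is illustrative; in practice the hashes live in the metadata catalog):

```python
import hashlib

# document_id -> content hash from the last successful index run
last_indexed: dict = {}

def needs_reembedding(doc_id: str, content: bytes) -> bool:
    """True only when the document changed since the last run, so
    unchanged documents skip the expensive embedding step entirely."""
    digest = hashlib.sha256(content).hexdigest()
    if last_indexed.get(doc_id) == digest:
        return False
    last_indexed[doc_id] = digest
    return True
```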

Talking-point: "describe a data pipeline you'd build for X"

Template — customize for the scenario

"I'd structure it in four layers. Ingestion: source [Y] flows in via [Kafka/Kinesis/scheduled job]; raw lands in [object storage] with retention. Validation: schema check, data-quality checks, dead-letter queue for bad records. Transformation: [whatever the task needs] — for compliance I'd preserve provenance: original record + version + ingestion time on every transformed record. Serving: writes to [vector index / warehouse / retrieval store], with a metadata catalog and lineage tracking. The whole thing's orchestrated by [Airflow / Step Functions / Cloud Composer]. Failure modes I'd plan for: source format drift, downstream outage, late-arriving data, backfill needs, and a controlled way to re-run a window without duplicating downstream side effects."

That answer signals: you've thought past the happy path, you know the layers, you flag the operational realities.