LLM & AI-SaaS Domain Context
The vocabulary you need to sound credible at an AI-native company — tokens, fine-tuning, evals, latency budgets, the economics of inference — and how product analytics for AI products differs from classical SaaS.
Why this chapter
Both analytics-leadership and founding-DS ask for "genuine excitement about AI" or place the role on an "Agentic Platform." You don't need to be an LLM researcher. You need to sound credible discussing AI products — the failure modes, the economics, the metrics that actually matter. This chapter is the vocabulary stack.
The economics of inference
The single most underappreciated fact about AI products is that they have a real marginal cost per request, unlike traditional SaaS, where serving the millionth user costs essentially nothing. This changes how you think about:
- Gross margins. If each call to GPT-4 costs $0.03 and you charge $0.10, your margin per call is $0.07, roughly 70% before infrastructure and people. Fold those in and at scale you land at 30–60% gross margin, closer to a hosting company than a SaaS company.
- Free tiers. Generous free tiers can torch margins. Every "free user runs a workflow" is a marginal cost.
- Pricing tiers. Usage-based pricing is more honest for AI products than seat-based. The strong DS argument: "seat pricing is fictional for our cost structure."
- Whale effects. A small number of users drive most of the cost. Cost-per-user distributions are heavy-tailed, and the top 1% can be 50%+ of cost.
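A quick way to see the whale effect in your own data: rank users by inference cost and check what share the top 1% carry. A minimal sketch, using a simulated heavy-tailed distribution as a stand-in for a real cost-per-user table:

```python
import numpy as np

# Illustrative only: a heavy-tailed stand-in for per-user inference cost.
# In practice, replace this with your real cost-per-user table.
rng = np.random.default_rng(0)
cost_per_user = rng.lognormal(mean=0.0, sigma=2.5, size=100_000)

sorted_costs = np.sort(cost_per_user)[::-1]           # most expensive users first
top_1_pct = sorted_costs[: len(sorted_costs) // 100]   # top 1% of users by cost

share = top_1_pct.sum() / sorted_costs.sum()
print(f"Top 1% of users account for {share:.0%} of total inference cost")
```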
The token math
Most LLM providers price by token (input + output, often at different rates). One token ≈ 0.75 words. For Claude / GPT-4-class models, expect ~$3–15 per million input tokens and 3–5× that for output tokens. A 1000-token prompt with a 500-token response costs $0.01–$0.05 depending on model.
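A back-of-envelope helper for the same math. The per-million prices passed in are illustrative; swap in your provider's current price sheet.

```python
def call_cost(input_tokens: int, output_tokens: int,
              input_price_per_m: float, output_price_per_m: float) -> float:
    """Cost of a single LLM call in dollars, given per-million-token prices."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000

# Illustrative prices only (roughly the $3-15 input / 3-5x output range above).
print(call_cost(1_000, 500, input_price_per_m=3.0,  output_price_per_m=15.0))   # 0.0105
print(call_cost(1_000, 500, input_price_per_m=15.0, output_price_per_m=75.0))   # 0.0525
```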
Model vocabulary
Terms you'll hear thrown around and should track:
| Term | What it means |
|---|---|
| Foundation model | A large pre-trained model (GPT-4, Claude, Command, Llama) you build on top of, rather than train from scratch. |
| Token | The unit of text the model processes. ~0.75 words on average. |
| Context window | Max tokens (input + output) a model can handle in one call. 4k → 8k → 32k → 200k+ is the recent progression. |
| Embedding | A dense vector representation of text used for similarity search (e.g., Cohere Embed, OpenAI's text-embedding models). |
| Reranker | A model that re-orders search results by relevance (e.g., Cohere Rerank). |
| RAG (Retrieval-Augmented Generation) | Pull relevant context from a knowledge store, inject it into the prompt, generate the answer. |
| Fine-tuning | Continue training the model on your data for a specialized task. |
| Prompt engineering | Structuring the prompt to get better outputs — system prompts, n-shot examples, chain-of-thought, tool descriptions. |
| Hallucination | The model produces a fluent but factually wrong output. |
| Eval | A test suite for model behavior. Replaces "we ran it and looked at the output" with reproducible scoring. |
| Agent | An LLM in a loop with tools — it can call APIs, search, modify state, then continue reasoning. |
| Inference | The act of generating from a model (vs training). |
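To make the Embedding and Reranker rows concrete, here's a toy similarity search. The hand-written vectors stand in for real embedding-model output; in production, the ranking is what your vector store does, optionally followed by a reranker.

```python
import numpy as np

# Toy stand-in: in practice these vectors come from an embedding model;
# here they are hand-written so the example runs on its own.
docs = {
    "refund policy":   np.array([0.9, 0.1, 0.0]),
    "api rate limits": np.array([0.1, 0.8, 0.3]),
    "sso setup":       np.array([0.0, 0.2, 0.9]),
}
query = np.array([0.8, 0.2, 0.1])  # pretend this is embed("how do refunds work?")

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# "Retrieval" = rank documents by cosine similarity to the query embedding.
ranked = sorted(docs.items(), key=lambda kv: cosine(query, kv[1]), reverse=True)
for name, vec in ranked:
    print(f"{name}: {cosine(query, vec):.3f}")
```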
Latency budgets
LLM responses are slow. A typical chat-style completion is 1–5 seconds; agentic workflows with multiple tool calls can be 10–60 seconds. This shapes product design and metric thinking:
- Streaming tokens makes long outputs feel fast — the perceived latency is "time to first token," not "time to last token." Almost every chat UI does this.
- p95 vs p50 matters more than usual. Most outputs are fast; a long tail spoils the experience. Track p95 explicitly as a guardrail (see the percentile sketch after this list).
- Prompt caching (Anthropic, OpenAI) can cut latency on repeated prompt prefixes by 80%+. A senior DS should be aware that "did we hit the cache?" is a metric, not just an engineering detail.
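A minimal percentile sketch for the p95 point above, assuming you can pull per-request latencies from your request logs (the simulated latencies here are just a stand-in):

```python
import numpy as np

# Illustrative: per-request latencies in seconds, as you'd pull them from request logs.
rng = np.random.default_rng(1)
latencies = rng.lognormal(mean=0.7, sigma=0.8, size=10_000)  # right-skewed, like real LLM calls

p50, p95, p99 = np.percentile(latencies, [50, 95, 99])
print(f"p50={p50:.1f}s  p95={p95:.1f}s  p99={p99:.1f}s")
# A healthy-looking p50 can hide a painful p95; guardrail on the tail, not the median.
```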
Metrics for AI products
AI product metrics differ from classical SaaS in three ways:
1. Quality is now a per-request metric
Every model output has a quality signal (implicit: did the user accept it? explicit: did they thumbs-up it?). Aggregate quality metrics such as task-completion rate and "good response" rate are first-class.
2. Cost per active user is a real number
Track cost per monthly active user ($/MAU). For a healthy AI SaaS, cost per MAU should be a small fraction of revenue per MAU. When it isn't, you're losing money on each active user.
3. Adoption metrics shift toward use-frequency
Classical SaaS: "did the user log in this week?" AI SaaS: "did the user invoke the AI feature this week, and how often?" Frequency of invocation is the leading indicator of stickiness.
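Both cost per MAU and invocation frequency fall out of a per-invocation event log. A pandas sketch with a hypothetical schema (user_id, month, cost_usd):

```python
import pandas as pd

# Hypothetical event log: one row per AI-feature invocation.
events = pd.DataFrame({
    "user_id":  ["a", "a", "b", "b", "b", "c"],
    "month":    ["2024-05"] * 6,
    "cost_usd": [0.02, 0.04, 0.50, 0.45, 0.60, 0.01],
})

monthly = events.groupby("month").agg(
    mau=("user_id", "nunique"),
    total_cost=("cost_usd", "sum"),
    invocations=("user_id", "size"),
)
monthly["cost_per_mau"] = monthly["total_cost"] / monthly["mau"]
monthly["invocations_per_mau"] = monthly["invocations"] / monthly["mau"]
print(monthly)
```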
Specific to AI video products
- Render success rate (did the video actually produce?)
- Render time p95 (latency budget for "video done in < 5 minutes")
- Edits-per-published-video (low = the AI nailed it on first try; high = users are correcting it)
- Publish rate among rendered videos
Specific to enterprise LLM API products
- API calls per active account per week
- Tokens per call distribution (heavy users are 100× light users)
- Latency p95 per endpoint, per model
- Retention defined as "calls in week N" — not seat-based but usage-based
- Net revenue retention per cohort (the SaaS classic, still relevant)
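Usage-based retention is just the share of a cohort with at least one call in week N. A sketch against a hypothetical call log keyed by weeks since each account's first call:

```python
import pandas as pd

# Hypothetical API call log: one row per (account, week with >= 1 call).
calls = pd.DataFrame({
    "account_id": ["x", "x", "x", "y", "y", "z"],
    "week":       [0, 1, 3, 0, 2, 0],   # weeks since the account's first call
})

cohort_size = calls.loc[calls.week == 0, "account_id"].nunique()
retention = (
    calls.groupby("week")["account_id"].nunique()   # accounts with >= 1 call in week N
    .div(cohort_size)
    .rename("usage_retention")
)
print(retention)   # usage-based: "called in week N", not "still has seats"
```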
Fine-tuning vs prompting
When to fine-tune:
- You have ≥1k high-quality examples of the task.
- Prompting alone has hit a quality ceiling.
- You need a specialized format, style, or domain that the base model doesn't reliably produce.
- The cost of always-on prompt overhead exceeds the cost of fine-tuning (a break-even sketch follows the lists below).
When not to:
- You're still iterating on the task definition. Fine-tuning calcifies behavior; you'll regret it if requirements shift.
- You have < 1k examples — start with prompting and gather more.
- You need behavior that emerges from chain-of-thought reasoning in the base model. Fine-tuning can degrade emergent capabilities.
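The prompt-overhead point is a break-even calculation. All numbers below are hypothetical, and the sketch ignores that serving a fine-tuned model often carries a higher per-token price:

```python
# All numbers are hypothetical; plug in your own volumes and price sheet.
overhead_tokens_per_call = 1_500   # system prompt + n-shot examples sent on every call
input_price_per_m = 3.0            # $ per million input tokens
calls_per_month = 2_000_000

monthly_overhead_cost = calls_per_month * overhead_tokens_per_call * input_price_per_m / 1e6
fine_tune_cost = 5_000             # one-off training run + eval effort (assumed)

print(f"monthly prompt-overhead cost: ${monthly_overhead_cost:,.0f}")
print(f"break-even vs fine-tune: {fine_tune_cost / monthly_overhead_cost:.1f} months")
```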
Evals in product context
Evals are the test suite for LLM behavior — they let you change a prompt or model and know whether quality went up or down. The product-DS framing:
- Eval is the new A/B for some changes. If swapping a prompt clearly improves the eval set, you can ship without running a costly user A/B test.
- Evals require labels. Either human-labeled examples or LLM-as-judge with calibration to humans on a sample.
- Slice evals matter. Average quality is fine; degraded quality on important slices (enterprise customers, regulated workflows) is a halt signal even if average improved.
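A minimal eval-harness sketch: score outputs against labeled examples and report quality per slice as well as overall. run_model is a placeholder for whatever prompt/model combination you're testing, and the eval set is invented:

```python
from collections import defaultdict

# Labeled eval set: (input, expected label, slice). Slices let you catch regressions
# on important segments even when the average improves.
eval_set = [
    {"input": "reset my password",   "expected": "password_reset", "slice": "consumer"},
    {"input": "SAML assertion fails", "expected": "sso_issue",     "slice": "enterprise"},
    {"input": "export audit logs",    "expected": "audit_export",  "slice": "enterprise"},
]

def run_model(text: str) -> str:
    """Placeholder: call your prompt/model here and map the output to a label."""
    return "password_reset" if "password" in text else "sso_issue"

overall, by_slice = [], defaultdict(list)
for ex in eval_set:
    correct = run_model(ex["input"]) == ex["expected"]
    overall.append(correct)
    by_slice[ex["slice"]].append(correct)

print(f"overall: {sum(overall) / len(overall):.0%}")
for name, scores in by_slice.items():
    print(f"{name}: {sum(scores) / len(scores):.0%}")   # enterprise slice lags despite a decent average
```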
Two contexts compared
Enterprise-AI company quick facts
- Founded 2019, Toronto. Co-founders include Aidan Gomez (one of the "Attention Is All You Need" authors).
- Products: Command (chat/generation LLMs), Embed (embeddings), Rerank, Aya (multilingual research models).
- Positioning: enterprise-first, data-private, deployable on the customer's cloud. Differentiates from OpenAI/Anthropic on the data-sovereignty axis.
- Customers: enterprises building RAG, agents, search, support automation. Banks, telcos, governments.
AI-video SaaS quick facts
- Founded 2020 (originally Surreal). AI video creation — avatars, voice cloning, language dubbing.
- Positioning: enterprise + creator-friendly. Marketing teams, training, sales videos.
- Competitors: Synthesia, D-ID, Veed, Captions.
- Key product economics: render cost per video. Talking-head avatar videos run a few cents per minute at scale.
Interview probes
Probe 1: "How is product analytics different for AI products?"
Three differences. (1) Quality is a per-request metric, not just an aggregate — every output has an implicit accept/reject signal. (2) Marginal cost per request is real, so cost-per-user becomes a first-class metric and pricing leans usage-based. (3) Adoption shifts from "logged in" to "invoked the AI feature how often" — frequency of invocation predicts retention better than session count. The senior signal is naming the cost economics — most candidates skip that.
Probe 2: "What's RAG, in one paragraph?"
Retrieval-Augmented Generation: when the user asks a question, retrieve relevant documents from a knowledge store, inject them into the prompt, then have the LLM generate an answer grounded in those documents. Used when the model needs information it wasn't trained on (private docs, current events) and you want to cite sources. The hard parts: chunking the documents well, retrieving the right ones (often hybrid search + rerank), and grounding the model's output so it doesn't hallucinate outside the provided context.
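A skeletal version of that loop, with a toy keyword-overlap retriever and an echo "LLM" standing in for a real vector store and provider client:

```python
# Toy corpus and stand-in functions; in production, retrieve() hits a vector store
# (often hybrid search + rerank) and generate() calls an LLM provider.
CORPUS = [
    "Refunds are issued within 14 days of purchase.",
    "API keys can be rotated from the admin console.",
    "Enterprise plans include SSO via SAML and OIDC.",
]

def retrieve(query: str, k: int = 2) -> list[str]:
    """Rank corpus chunks by naive word overlap with the query (stand-in for embeddings)."""
    q = set(query.lower().split())
    scored = sorted(CORPUS, key=lambda c: len(q & set(c.lower().split())), reverse=True)
    return scored[:k]

def generate(prompt: str) -> str:
    """Stand-in for an LLM call; here it just echoes the prompt it would send."""
    return f"[LLM would answer grounded in:]\n{prompt}"

def answer(question: str) -> str:
    context = "\n".join(retrieve(question))
    prompt = (
        "Answer using ONLY the context below; if the answer is not there, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return generate(prompt)

print(answer("How long do refunds take?"))
```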
Probe 3: "Why is p95 latency more important than p50 for AI products?"
Because user experience is dominated by the worst calls they remember, not the average. p50 of 2 seconds and p95 of 30 seconds means one in twenty calls takes half a minute — and those are the ones users complain about. Many companies report median latency and miss this; the senior call is to track p95 (and sometimes p99) explicitly as a guardrail.
Probe 4: "When would you fine-tune over prompting?"
When prompting has hit a quality ceiling for a stable, well-defined task, and I have ≥1k high-quality examples. Or when the always-on prompt overhead (instructions, n-shot examples) is eating context budget I need for actual input. I wouldn't fine-tune while the task definition is still moving — fine-tuning calcifies behavior, and a requirements shift after fine-tuning costs much more than after prompting.
Probe 5: "What metric would you track to know an AI feature is actually valuable?"
Layered. Adoption: weekly users who invoked the feature. Engagement: invocations per active user (frequency). Quality: implicit acceptance rate (the user kept the output) and explicit thumbs-up/down. Business impact: lift on the downstream metric the feature was supposed to improve — fewer support tickets, more successful publishes, more API revenue. The guardrail layer is latency p95 and cost per invocation — both can quietly trash unit economics if ignored.