Prompt Engineering, Applied
The role centers on prompt engineering for Newton's lenses. This chapter covers the practical craft: components, n-shot patterns, templates, and the iteration loops that turn customer questions into shipped POCs.
Why companies in this space hire for this
The product is a multimodal-LLM platform for physical-world data, with "lenses" (configurable analytical operations) over sensor + video data. The role is essentially: take a customer's raw assets, design prompts and configure lenses to produce a working analysis, and hand off a customer-ready result. Skill at prompt engineering is the rate-limiter on POC velocity.
JD-relevant phrases: "Design, test, and refine prompts including n-shot examples for Newton's lenses," "Configure lens parameters for proof-of-concept runs," "Maintain reusable prompt templates and configuration presets."
The components of a prompt
A production-grade prompt has up to six distinct layers. Build them in order:
- System / role. "You are a vision analyst examining warehouse video. Your job is X."
- Task definition. Precise, with examples of edge cases the task does or doesn't include.
- Input description. What kind of asset is being passed, what format, what the user wants the model to attend to.
- Few-shot examples (n-shot). Several worked examples. More on this below.
- Output format. JSON schema, structured fields, or natural-language template. Be unambiguous about delimiters and required fields.
- Reasoning scaffolding. Chain-of-thought prompts, "think step by step," or explicit decomposition that the model fills in.
The biggest prompt-quality win is usually #5 — output format. Forcing structured output cuts hallucination, makes results parseable, and exposes errors immediately. Worth investing time in even when other layers feel more interesting.
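As a concrete illustration, here is a minimal sketch of the six layers assembled in order. The helper name, the layer strings, and the chat-message format are illustrative assumptions, not Newton's actual API.

```python
# Minimal sketch: assembling the six prompt layers in order.
# All strings and the helper below are illustrative placeholders.

def build_prompt(task: str, input_desc: str, examples: list[dict]) -> list[dict]:
    system = "You are a vision analyst examining warehouse video."      # 1. system / role
    task_def = task                                                      # 2. task definition
    input_block = f"Input: {input_desc}"                                 # 3. input description
    shots = "\n\n".join(                                                 # 4. n-shot examples
        f"Example input:\n{ex['input']}\nExample output:\n{ex['output']}"
        for ex in examples
    )
    output_format = (                                                    # 5. output format
        'Return ONLY JSON: {"event": str, "timestamp": str, "confidence": 1-5}'
    )
    reasoning = "Think step by step before producing the JSON."          # 6. reasoning scaffolding

    user = "\n\n".join([task_def, input_block, shots, output_format, reasoning])
    return [{"role": "system", "content": system},
            {"role": "user", "content": user}]
```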
N-shot examples
Few-shot examples teach the model the task by demonstration. The craft is in picking them.
How many?
For most tasks, 3–8 examples is the sweet spot. Fewer and the model under-specifies the task; more and you waste context and risk over-anchoring on idiosyncratic examples.
How to pick
- Coverage: show the model the range of valid inputs and outputs, not just one shape.
- Edge cases: include the ones that confused early prompts. "Show what to do when the sensor is offline" — explicit handling beats hoping for it.
- Negative examples: where it makes sense, show "this is what NOT to do, and why."
- Order matters: the last example often carries the most weight, so put the most representative example last.
Dynamic vs static n-shot
Static: same examples every call. Dynamic: retrieve the most-similar examples from a library for each input (RAG-style). Dynamic is better for diverse inputs but adds latency and complexity.
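A minimal sketch of the dynamic variant, assuming you already have an embedding function and a small example library. `embed` and the library format are placeholders, not a specific provider's API.

```python
import numpy as np

def retrieve_examples(query_text: str, library: list[dict], embed, k: int = 5) -> list[dict]:
    """Pick the k most similar worked examples for this input (RAG-style n-shot).

    Each library entry is assumed to carry a precomputed "embedding" vector.
    """
    q = np.asarray(embed(query_text), dtype=float)

    def cosine(vec) -> float:
        v = np.asarray(vec, dtype=float)
        return float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v) + 1e-9))

    ranked = sorted(library, key=lambda ex: cosine(ex["embedding"]), reverse=True)
    return ranked[:k]
```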
Reusable templates & lenses
The JD calls out "maintain reusable prompt templates and configuration presets." This is the operational discipline that turns one-off POCs into a library:
What lives in the template
- The static system message.
- The task definition (parameterized by customer scenario).
- The output schema.
- Slots for n-shot examples (filled per scenario from a curated library).
- Lens parameters (frame rate, attention regions, confidence thresholds).
What gets customized per POC
- The specific n-shot examples drawn from the customer's domain.
- The input asset description (warehouse vs retail vs factory).
- Lens hyperparameters tuned to the customer's data quality and use case.
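One way to capture the static/per-POC split above is a small template object. The field names, defaults, and `render` helper are a sketch under assumed names, not the actual preset format.

```python
from dataclasses import dataclass, field

@dataclass
class LensPreset:
    # Assumed parameter names, for illustration only.
    frame_rate_hz: float = 1.0
    attention_region: tuple | None = None      # (x, y, w, h) crop, if any
    confidence_threshold: float = 0.7

@dataclass
class PromptTemplate:
    system: str                                            # static across POCs
    task_def: str                                          # parameterized by scenario
    output_schema: str                                     # static
    examples: list[dict] = field(default_factory=list)     # filled per scenario from the library
    lens: LensPreset = field(default_factory=LensPreset)   # tuned per customer

    def render(self, input_desc: str) -> str:
        shots = "\n\n".join(f"{e['input']}\n-> {e['output']}" for e in self.examples)
        return "\n\n".join([self.task_def, input_desc, shots, self.output_schema])
```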
Versioning
Treat prompts like code. Version them. Tag the version with each result so you can reproduce any run. The "which prompt did I ship to that customer last quarter?" problem is real and solvable only with discipline.
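A lightweight way to get that reproducibility is to hash the rendered prompt and store the hash with every result. This is a sketch of the idea, not a prescribed tooling choice; the file name and record fields are assumptions.

```python
import datetime
import hashlib
import json

def prompt_version(rendered_prompt: str) -> str:
    """Content-addressed version tag for a prompt."""
    return hashlib.sha256(rendered_prompt.encode("utf-8")).hexdigest()[:12]

def record_result(rendered_prompt: str, output: dict, path: str = "runs.jsonl") -> None:
    """Append the result tagged with its prompt version, so any run can be traced back."""
    row = {
        "prompt_version": prompt_version(rendered_prompt),
        "timestamp": datetime.datetime.utcnow().isoformat(),
        "output": output,
    }
    with open(path, "a") as f:
        f.write(json.dumps(row) + "\n")
```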
Iterating against an eval
The shift from hobbyist prompting to applied prompt engineering is iterating against a held-out eval set, not vibes.
Build the eval first
Before tuning the prompt, label 20–50 representative examples with expected outputs. This is your scoreboard. Yes, it slows the first hour. Yes, you save 10× that time over the project.
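The eval set can be as simple as a JSONL file of inputs and expected outputs. The file name and field names below are placeholders.

```python
import json

# eval.jsonl, one labeled example per line, e.g.:
# {"id": "ex-001", "input": "<asset description>", "expected": {"event": "forklift_idle", "confidence": 4}}

def load_eval(path: str = "eval.jsonl") -> list[dict]:
    with open(path) as f:
        return [json.loads(line) for line in f]
```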
Score the eval
Methods, in order of preference:
- Programmatic check when the output is structured. Did the model return the right JSON keys? Are the numeric outputs within tolerance?
- LLM-as-judge for open-ended output, with a prompt that grades against criteria. Calibrate the judge against human scoring on a sample.
- Human review for the gold standard, especially on the first dozen examples.
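A sketch of the programmatic check, assuming structured JSON output with required keys and numeric fields scored within a tolerance. Key names and the tolerance are assumptions.

```python
import json

def score_structured(raw_output: str, expected: dict, tol: float = 0.05) -> float:
    """1.0 if the output parses, has the required keys, and values match within tolerance."""
    try:
        got = json.loads(raw_output)
    except json.JSONDecodeError:
        return 0.0                                  # unparseable output scores zero
    if not all(k in got for k in expected):         # required keys present?
        return 0.0
    for key, want in expected.items():
        have = got[key]
        if isinstance(want, (int, float)) and isinstance(have, (int, float)):
            if abs(have - want) > tol * max(abs(want), 1.0):
                return 0.0                          # numeric field out of tolerance
        elif have != want:
            return 0.0                              # non-numeric field mismatch
    return 1.0
```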
The loop
- Run current prompt against eval. Record score per example.
- Look at the failing examples — patterns?
- Change one thing in the prompt. (Just one.)
- Re-run. Compare deltas per example.
- Keep the change if aggregate improved AND no example regressed.
Regressions are common. Decide the policy in advance: are regressions acceptable for aggregate lift, or do you require strict monotone improvement? Usually you accept small regressions, but only if you've documented them and the consumer accepts the tradeoff.
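The keep/revert decision is easy to automate once you have per-example scores (for example, from the structured scorer above) for the old and new prompt. A sketch under the strict policy:

```python
def compare_runs(old_scores: dict[str, float], new_scores: dict[str, float]) -> dict:
    """Per-example deltas plus the keep/revert decision described in the loop above."""
    deltas = {ex_id: new_scores[ex_id] - old_scores[ex_id] for ex_id in old_scores}
    regressions = {ex_id: d for ex_id, d in deltas.items() if d < 0}
    aggregate_delta = sum(deltas.values()) / len(deltas)
    return {
        "aggregate_delta": aggregate_delta,
        "regressions": regressions,
        # Strict policy: keep only if the aggregate improved AND no example regressed.
        "keep": aggregate_delta > 0 and not regressions,
    }
```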
Failure modes
- Hallucination. Model invents facts not in the input. Mitigate with grounding ("answer only from the provided data; if absent, say 'unknown'"), structured output, retrieval augmentation.
- Format drift. Model occasionally returns prose when JSON was requested. Mitigate with explicit format examples, strict-JSON modes (if the provider supports one), and a parse-or-retry wrapper at the application layer (see the sketch after this list).
- Refusal. Model refuses tasks that should be benign. Tune the system message; reframe the task; for sensitive domains, use a model variant tuned for that domain.
- Confidence calibration. Model says it's certain when it isn't. Mitigate with chain-of-thought, multi-sample agreement (self-consistency), or an explicit "rate your confidence on a 1–5 scale" field.
- Order sensitivity. Same examples in different order produce different outputs. Test with shuffled n-shot before shipping.
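The parse-or-retry wrapper mentioned under format drift can be very small. `call_model` below is a placeholder for whatever function actually calls the model; the retry message is an assumption.

```python
import json

def parse_or_retry(call_model, prompt: str, max_retries: int = 2) -> dict:
    """Call the model; if the output isn't valid JSON, retry with an explicit reminder."""
    attempt_prompt = prompt
    for _ in range(max_retries + 1):
        raw = call_model(attempt_prompt)
        try:
            return json.loads(raw)
        except json.JSONDecodeError:
            attempt_prompt = prompt + "\n\nReturn ONLY valid JSON. No prose, no code fences."
    raise ValueError("Model never returned parseable JSON")
```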
Multimodal prompts (Solutions-Engineering-specific)
Newton is multimodal — sensor streams, video, time-series. Prompting these adds dimensions:
- Temporal scope: which time window of the asset is the model meant to attend to? Frame the prompt around the window.
- Spatial scope: for video, attention regions. Crop or annotate to direct attention.
- Modality fusion: when video + sensor are both present, the prompt has to specify which is primary. Otherwise the model picks (often badly).
- Sample rate & preprocessing: prompts often interact with lens preprocessing. If the lens downsamples to 1 Hz, prompts assuming finer granularity will produce hallucinated detail.
Interview probes
Probe 1: "Walk me through how you'd build a prompt for [customer scenario]."
Sequence: (1) understand the decision the output supports, (2) define the task and edge cases in plain language, (3) build the structured output schema first, (4) curate 3–8 n-shot examples covering valid range and edge cases, (5) score against a 20–50 example eval, (6) iterate one change at a time. The senior signal is building the eval before tuning the prompt.
Probe 2: "What's the most common failure mode you see in prompts?"
Format drift (the model returns prose when JSON was specified) and hallucination on edge cases the prompt didn't anticipate. Format drift is fixed with explicit examples and strict-JSON modes; hallucination is harder: it requires grounding in the source data and explicit "say 'unknown' if absent" instructions, plus calibrated retrieval for RAG-style flows.
Probe 3: "How do you decide how many few-shot examples to include?"
3–8 is the sweet spot for most tasks. More wastes context and over-anchors on idiosyncratic examples; fewer under-specifies the task. Pick by coverage (range of valid inputs), include edge cases that confused earlier prompts, and put the most representative example last because recency tends to weight heaviest.
Probe 4: "When would you fine-tune over prompting?"
When prompting has hit a quality ceiling for a stable, well-defined task, you have ≥1k high-quality examples, and the always-on prompt overhead is eating context budget. Avoid fine-tuning while the task definition is still moving — it calcifies behavior, and a requirements shift costs more after fine-tuning than after prompting.
Probe 5: "How do you evaluate a prompt's output rigorously?"
Build an eval set (20–50 labeled examples) before iterating. Score with programmatic checks when the output is structured; use LLM-as-judge for open-ended outputs, calibrated against human scoring on a sample. Track per-example deltas across prompt changes, not just aggregate scores, to catch the "helped overall but broke two cases" pattern that is easy to miss in aggregate metrics.