Pricing & Commercial
Together prices inference per token and rents dedicated capacity for higher-throughput and training workloads. Its per-token rates significantly undercut closed-source frontier APIs and are competitive with other platforms that host open-source models.
API per-token pricing
Together publishes per-token prices for each model in its catalog. Indicative pricing for representative models (per million tokens, varies by model size and version):
| Model | Input ($/M tokens) | Output ($/M tokens) |
|---|---|---|
| Llama 70B-class | $0.50-0.90 | $0.80-1.20 |
| Llama 405B-class | $3-5 | $3-5 |
| Mixtral 8x7B | $0.60-0.90 | $0.60-0.90 |
| DeepSeek-V3-class | $1-3 | $1-3 |
| Smaller / 7B-class | $0.10-0.30 | $0.10-0.30 |
Prices change as new models are released and as Together's serving efficiency improves. Together generally tracks the broader market rather than leading on price.
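As a rough illustration of how the table translates into a bill, the sketch below multiplies token counts by the low and high ends of each indicative range. The model keys, the `cost_range` helper, and the workload size are hypothetical names and numbers chosen for this example; the rates are the indicative ranges from the table, not quoted prices.

```python
# Indicative price ranges in USD per million tokens, copied from the table above.
# Each entry is (low, high); real prices vary by model version and change over time.
# The dictionary keys are hypothetical labels, not catalog identifiers.
PRICES_PER_MILLION = {
    "llama-70b-class": {"input": (0.50, 0.90), "output": (0.80, 1.20)},
    "llama-405b-class": {"input": (3.00, 5.00), "output": (3.00, 5.00)},
    "mixtral-8x7b": {"input": (0.60, 0.90), "output": (0.60, 0.90)},
    "deepseek-v3-class": {"input": (1.00, 3.00), "output": (1.00, 3.00)},
    "7b-class": {"input": (0.10, 0.30), "output": (0.10, 0.30)},
}


def cost_range(model, input_tokens, output_tokens):
    """Return the (low, high) USD cost for the given token counts."""
    rates = PRICES_PER_MILLION[model]
    low = (input_tokens * rates["input"][0] + output_tokens * rates["output"][0]) / 1_000_000
    high = (input_tokens * rates["input"][1] + output_tokens * rates["output"][1]) / 1_000_000
    return low, high


# Hypothetical workload: 1M requests, each with 1,000 input and 250 output tokens.
low, high = cost_range("llama-70b-class", 1_000_000 * 1_000, 1_000_000 * 250)
print(f"Llama 70B-class: ${low:,.0f} to ${high:,.0f}")
```

Under these assumed volumes the 70B-class estimate lands at roughly $700 to $1,200; swapping in another key from the dictionary shows how quickly the bill moves with model size.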
Dedicated endpoint pricing
For customers whose volume makes per-token pricing uneconomical, dedicated endpoints offer flat hourly pricing on reserved capacity. The break-even point between the two depends on traffic patterns: customers who keep reserved capacity highly utilized benefit from dedicated endpoints.
Dedicated endpoint pricing isn't always public; it is negotiated based on the customer's specific needs.
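The break-even logic can be sketched as a comparison of hourly spend under each model. Because dedicated rates are not public, the $10/hour rate and the $0.80 blended per-million-token price below are purely hypothetical, as are the function names; the sketch also assumes the endpoint has enough capacity to serve the load.

```python
def breakeven_tokens_per_hour(hourly_rate_usd, blended_price_per_million_usd):
    """Tokens per hour at which per-token spend equals the flat hourly rate."""
    return hourly_rate_usd / blended_price_per_million_usd * 1_000_000


def cheaper_option(tokens_per_hour, hourly_rate_usd, blended_price_per_million_usd):
    """Compare hourly spend under each pricing model for a sustained load."""
    per_token_spend = tokens_per_hour / 1_000_000 * blended_price_per_million_usd
    return "dedicated" if per_token_spend > hourly_rate_usd else "per-token"


# Hypothetical inputs: a $10/hour endpoint and a blended rate of $0.80 per million tokens.
HOURLY_RATE = 10.0
BLENDED_PRICE = 0.80

threshold = breakeven_tokens_per_hour(HOURLY_RATE, BLENDED_PRICE)
print(f"Break-even: {threshold:,.0f} tokens/hour")
print(cheaper_option(5_000_000, HOURLY_RATE, BLENDED_PRICE))   # below the threshold
print(cheaper_option(20_000_000, HOURLY_RATE, BLENDED_PRICE))  # above the threshold
```

The comparison uses sustained tokens per hour, which is why utilization matters: a bursty workload that averages well below the threshold pays for idle reserved capacity.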
Fine-tuning pricing
Fine-tuning is priced by the compute consumed and the model size. Indicative ranges:
- Small fine-tunes (LoRA on a 7B model): tens to hundreds of dollars.
- Medium fine-tunes (LoRA on a 70B model): hundreds to low thousands of dollars.
- Full fine-tunes of larger models: thousands to tens of thousands of dollars depending on duration and scale.
The resulting fine-tuned model can be served via the platform at standard per-token rates, or as a dedicated endpoint.
Training cluster pricing
For dedicated training clusters, pricing approximates the broader enterprise neocloud market: per-GPU-hour rates are competitive with CoreWeave and similar providers, with reserved discounts for longer-term commitments.
Together positions the offering as a bundle, with training sitting alongside fine-tuning and inference in one lifecycle. Customers who will deploy their trained model on Together's inference API can train, fine-tune, and serve on a single platform, a continuity that pure GPU clouds don't offer.
vs OpenAI / Anthropic
For comparable-quality output on many use cases, Together's pricing on open-source models is 50-80% lower than closed-source frontier APIs:
- GPT-4 / Claude-class models: $5-15+ per million input tokens, $15-50+ per million output tokens.
- Together hosting a Llama 70B-class equivalent: $0.50-0.90 per million input tokens, $0.80-1.20 per million output tokens.
The savings become meaningful at scale: customers running millions of API calls per day save substantially.
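A quick way to sanity-check the size of the gap is to compute the percentage saved at the edges of each range. The sketch below uses the closed-source ranges from the bullets above and the Llama 70B-class ranges from the pricing table earlier in this chapter; the `savings_pct` helper is a hypothetical name, and the "+" on the closed-source figures is ignored.

```python
def savings_pct(open_price, closed_price):
    """Percentage saved by paying open_price instead of closed_price."""
    return (1 - open_price / closed_price) * 100


# Ranges in USD per million tokens.
# The "+" on the closed-source ranges is ignored, so their high ends are floors.
CLOSED = {"input": (5.0, 15.0), "output": (15.0, 50.0)}
OPEN_70B = {"input": (0.50, 0.90), "output": (0.80, 1.20)}

for kind in ("input", "output"):
    closed_low, closed_high = CLOSED[kind]
    open_low, open_high = OPEN_70B[kind]
    # Smallest saving: most expensive open price against cheapest closed price.
    worst = savings_pct(open_high, closed_low)
    # Largest saving: cheapest open price against most expensive closed price.
    best = savings_pct(open_low, closed_high)
    print(f"{kind}: {worst:.0f}% to {best:.0f}% lower")
```

For this particular pairing the computed gap is wider than 80%. Running the same function on the Llama 405B-class range ($3-5) against the same closed-source input prices narrows it to between 0% and 80%, so the headline figure depends heavily on which open model is being compared.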
The quality gap matters case by case. For reasoning-heavy tasks, closed-source frontier models still lead. For simpler chat, RAG, structured-output, and many enterprise tasks, open-source quality is sufficient.
vs Fireworks / Anyscale
Among the direct managed-inference competitors, prices for the same model are usually within 10-20% of each other. Competition has compressed margins across the category, so customers choose based on:
- Specific model availability and quality.
- Serving latency.
- Reliability.
- Customer support and platform polish.
- Lifecycle product breadth (fine-tuning, training, etc.).
Unit economics
Together's per-token revenue breaks down into:
- Cost of inference per token (GPU time, electricity, platform overhead).
- Gross margin, which is what remains of inference revenue after that cost.
- Platform operating costs, which the gross margin has to cover.
Together's research-driven optimization (FlashAttention, speculative decoding, etc.) lowers the cost per token, giving Together more margin headroom at competitive prices. This is a real unit-economic advantage.
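The margin effect can be illustrated with a small calculation. Every figure below is hypothetical, since the text does not disclose Together's serving costs, and the sketch makes the simplifying assumption that cost per token is dominated by GPU time, so a throughput gain reduces cost per token by the same factor.

```python
def gross_margin(price_per_million, cost_per_million):
    """Gross margin as a fraction of revenue."""
    return (price_per_million - cost_per_million) / price_per_million


# All figures are hypothetical and chosen only to show the mechanism.
PRICE = 0.90          # USD per million tokens charged to the customer
BASELINE_COST = 0.60  # USD per million tokens to serve without optimization
SPEEDUP = 1.5         # assumed throughput gain from serving optimizations

# If the same GPU serves 1.5x the tokens per hour, cost per token falls by that factor.
optimized_cost = BASELINE_COST / SPEEDUP

print(f"baseline margin:  {gross_margin(PRICE, BASELINE_COST):.0%}")
print(f"optimized margin: {gross_margin(PRICE, optimized_cost):.0%}")
```

With these assumed numbers the margin moves from about a third to over half at an unchanged price, which is the headroom the paragraph above describes: the provider can hold the price and keep the margin, or cut the price and stay profitable.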
Takeaway
Together's pricing follows the "managed open-source inference" model: significant savings versus closed-source APIs for workloads where open-source quality is sufficient, and competitive pricing relative to peers in the managed-OSS-inference category. The next chapter looks at the infrastructure that supports the platform.