## Model-by-model fit check
| Model | Context (tokens) | Fits? | % used | Tokens left |
|---|---|---|---|---|
| GPT-4o | 128,000 | ✅ Yes | 26.0% | 94,750 |
| GPT-4 Turbo | 128,000 | ✅ Yes | 26.0% | 94,750 |
| Claude Sonnet 4 (1M) | 1,000,000 | ✅ Yes | 3.3% | 966,750 |
| Claude Opus 4 (200k) | 200,000 | ✅ Yes | 16.6% | 166,750 |
| Gemini 2.5 Pro (2M) | 2,000,000 | ✅ Yes | 1.7% | 1,966,750 |
| Gemini 2.5 Flash (1M) | 1,000,000 | ✅ Yes | 3.3% | 966,750 |
| GPT-3.5 Turbo | 16,000 | ❌ No | 207.8% | 0 |
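
The numbers above fall out of simple division. Here is a minimal Python sketch that reproduces the table, assuming an input of 33,250 tokens (the figure implied by the "Tokens left" column; the exact prompt size is an assumption, not stated in the table itself):

```python
# Assumed prompt size, back-calculated from the "Tokens left" column above.
INPUT_TOKENS = 33_250

MODELS = {  # context windows as listed in the table
    "GPT-4o": 128_000,
    "GPT-4 Turbo": 128_000,
    "Claude Sonnet 4 (1M)": 1_000_000,
    "Claude Opus 4 (200k)": 200_000,
    "Gemini 2.5 Pro (2M)": 2_000_000,
    "Gemini 2.5 Flash (1M)": 1_000_000,
    "GPT-3.5 Turbo": 16_000,
}

for name, window in MODELS.items():
    pct_used = INPUT_TOKENS / window * 100          # share of window consumed by input
    tokens_left = max(window - INPUT_TOKENS, 0)     # 0 when the input overflows
    fits = "✅ Yes" if INPUT_TOKENS <= window else "❌ No"
    print(f"{name:<24} {fits:<6} {pct_used:>6.1f}%  {tokens_left:>9,} left")
```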
## What the context window actually limits
The context window is the total number of tokens the model can attend to in a single request: your system prompt, user message, uploaded files, *and* the model's response all share this budget. A "200k context" doesn't mean you can dump 200k tokens of input and still get a long reply.
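
Because input and output share one budget, the practical question is how many tokens remain for the reply after your prompt is counted. A sketch using tiktoken, OpenAI's tokenizer; note that Anthropic and Google count tokens differently, so this is only an approximation for non-OpenAI models:

```python
import tiktoken  # pip install tiktoken

def output_budget(prompt: str, context_window: int) -> int:
    """Tokens left for the model's response after the prompt is counted.

    Uses the o200k_base encoding (GPT-4o family); for Claude or Gemini
    this only approximates the provider's own count.
    """
    enc = tiktoken.get_encoding("o200k_base")
    input_tokens = len(enc.encode(prompt))
    return max(context_window - input_tokens, 0)

# e.g. against a 200k window (Claude Opus 4 in the table above):
print(output_budget("Summarize this report: ...", 200_000))
```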
### Practical budgeting
- Reserve 10–25% of the window for output tokens (see the sketch after this list).
- Long outputs (2,000+ tokens) can exceed some models' separate per-response cap even when the context window itself has room.
- Quality degrades past ~50% of the window on most models (the "lost in the middle" problem).
- You pay per input token. A 2M-context call isn't free just because the model accepts it.
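
Putting those rules together, here is a hedged sketch of a pre-flight check that enforces an output reserve and respects a per-response cap. The function name, the 20% default, and the 8,192-token cap in the example are illustrative assumptions, not any particular SDK's API:

```python
def plan_request(input_tokens: int, context_window: int,
                 output_reserve: float = 0.2,        # 20% reserve, inside the 10-25% rule
                 per_response_cap: int | None = None) -> int:
    """Return a safe max-output-token value, or raise if the input
    leaves less than the reserved share of the window."""
    reserved = int(context_window * output_reserve)
    if input_tokens > context_window - reserved:
        raise ValueError(
            f"Input of {input_tokens:,} tokens leaves less than the "
            f"{reserved:,}-token output reserve in a {context_window:,} window."
        )
    budget = context_window - input_tokens
    # A model's separate per-response cap can bind before the window does.
    return min(budget, per_response_cap) if per_response_cap else budget

# 33,250-token input into a 200k window with a hypothetical 8,192-token cap:
print(plan_request(33_250, 200_000, per_response_cap=8_192))  # -> 8192
```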