## Model-by-model fit check
| Model | Context (tokens) | Fits? | % used | Tokens left |
|---|---|---|---|---|
| GPT-4o | 128,000 | ✅ Yes | 26.0% | 94,750 |
| GPT-4 Turbo | 128,000 | ✅ Yes | 26.0% | 94,750 |
| Claude Sonnet 4 (1M) | 1,000,000 | ✅ Yes | 3.3% | 966,750 |
| Claude Opus 4 (200k) | 200,000 | ✅ Yes | 16.6% | 166,750 |
| Gemini 2.5 Pro (2M) | 2,000,000 | ✅ Yes | 1.7% | 1,966,750 |
| Gemini 2.5 Flash (1M) | 1,000,000 | ✅ Yes | 3.3% | 966,750 |
| GPT-3.5 Turbo | 16,000 | ❌ No | 207.8% | 0 |
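
The numbers above fall out of simple division. Here is a minimal Python sketch that reproduces the table, assuming an input of 33,250 tokens (the figure implied by the "Tokens left" column; the exact prompt size is an assumption, not stated in the table itself):

```python
# Assumed prompt size, back-calculated from the "Tokens left" column above.
INPUT_TOKENS = 33_250

MODELS = {  # context windows as listed in the table
    "GPT-4o": 128_000,
    "GPT-4 Turbo": 128_000,
    "Claude Sonnet 4 (1M)": 1_000_000,
    "Claude Opus 4 (200k)": 200_000,
    "Gemini 2.5 Pro (2M)": 2_000_000,
    "Gemini 2.5 Flash (1M)": 1_000_000,
    "GPT-3.5 Turbo": 16_000,
}

for name, window in MODELS.items():
    pct_used = INPUT_TOKENS / window * 100          # share of window consumed by input
    tokens_left = max(window - INPUT_TOKENS, 0)     # 0 when the input overflows
    fits = "✅ Yes" if INPUT_TOKENS <= window else "❌ No"
    print(f"{name:<24} {fits:<6} {pct_used:>6.1f}%  {tokens_left:>9,} left")
```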
## What the context window actually limits
The context window is the total number of tokens the model can attend to in a single request: your system prompt, user message, uploaded files, *and* the model's response all share this budget. A "200k context" doesn't mean you can dump 200k tokens of input and still get a long reply.
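
Because input and output share one budget, the practical question is how many tokens remain for the reply after your prompt is counted. A sketch using tiktoken, OpenAI's tokenizer; note that Anthropic and Google count tokens differently, so this is only an approximation for non-OpenAI models:

```python
import tiktoken  # pip install tiktoken

def output_budget(prompt: str, context_window: int) -> int:
    """Tokens left for the model's response after the prompt is counted.

    Uses the o200k_base encoding (GPT-4o family); for Claude or Gemini
    this only approximates the provider's own count.
    """
    enc = tiktoken.get_encoding("o200k_base")
    input_tokens = len(enc.encode(prompt))
    return max(context_window - input_tokens, 0)

# e.g. against a 200k window (Claude Opus 4 in the table above):
print(output_budget("Summarize this report: ...", 200_000))
```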
### Practical budgeting
- Reserve 10–25% of the window for output tokens (see the sketch after this list).
- Long outputs (2,000+ tokens) can exceed some models' separate per-response cap even when the context window itself has room.
- Quality degrades past ~50% of the window on most models (the "lost in the middle" problem).
- You pay per input token. A 2M-context call isn't free just because the model accepts it.
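
Putting those rules together, here is a hedged sketch of a pre-flight check that enforces an output reserve and respects a per-response cap. The function name, the 20% default, and the 8,192-token cap in the example are illustrative assumptions, not any particular SDK's API:

```python
def plan_request(input_tokens: int, context_window: int,
                 output_reserve: float = 0.2,        # 20% reserve, inside the 10-25% rule
                 per_response_cap: int | None = None) -> int:
    """Return a safe max-output-token value, or raise if the input
    leaves less than the reserved share of the window."""
    reserved = int(context_window * output_reserve)
    if input_tokens > context_window - reserved:
        raise ValueError(
            f"Input of {input_tokens:,} tokens leaves less than the "
            f"{reserved:,}-token output reserve in a {context_window:,} window."
        )
    budget = context_window - input_tokens
    # A model's separate per-response cap can bind before the window does.
    return min(budget, per_response_cap) if per_response_cap else budget

# 33,250-token input into a 200k window with a hypothetical 8,192-token cap:
print(plan_request(33_250, 200_000, per_response_cap=8_192))  # -> 8192
```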