Open Benchmarks

Helium Benchmarks

Two public datasets that test frontier LLMs on real market math and real character. Same twelve models. Frozen prompts. Full data on Hugging Face.

  • 12 frontier models
  • 604 total prompts
  • 2 benchmarks
  • MIT open license

Two benchmarks, one question each

Market Resolution asks whether models can do options math. Model Worldview asks whether they mean what they say. Each links to its own Hugging Face dataset with charts, methodology, and raw prompts.

Finance

Market Resolution

Can a frontier LLM read a real option chain? We paste live chains from Helium into twelve models and grade the math. No “NVDA goes up” trivia.

Current leader: grok-4.20-reasoning at 48%. No model has cracked 50%.
Top 5 · overall score
grok-4.20: 48%
gemini-3.1: 47%
grok-4.3: 47%
gpt-5: 43%
opus-4.7: 42%
Scores run from 0% to 100%. No model has passed the 50% mark yet.
  • 300 questions across NVDA, SPY, TSLA, AAPL, QQQ, AMZN
  • Tasks: estimate implied vol and delta, run arbitrage checks, and judge whether a strike is rich or average
  • Ground truth from the market chain itself, not model opinions
  • Honesty checks: does showing the option price help, and is there a gap between NVDA and SPY?
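
For readers new to these quantities, here is a minimal Python sketch of what "implied vol" and "delta" mean under the textbook Black-Scholes model for a European call with no dividends. It is an illustration only, not the benchmark's grading code. The function names (`bs_call`, `bs_call_delta`, `implied_vol`) and the sample inputs are assumptions made for this example. Listed options on names like NVDA or SPY are American-style and may pay dividends, so real chain math needs more care than this.

```python
import math


def norm_cdf(x: float) -> float:
    """Standard normal CDF, written with the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def _d1(spot: float, strike: float, t: float, rate: float, vol: float) -> float:
    """Black-Scholes d1 term (t in years, rate and vol as decimals)."""
    return (math.log(spot / strike) + (rate + 0.5 * vol ** 2) * t) / (vol * math.sqrt(t))


def bs_call(spot: float, strike: float, t: float, rate: float, vol: float) -> float:
    """Black-Scholes price of a European call with no dividends."""
    d1 = _d1(spot, strike, t, rate, vol)
    d2 = d1 - vol * math.sqrt(t)
    return spot * norm_cdf(d1) - strike * math.exp(-rate * t) * norm_cdf(d2)


def bs_call_delta(spot: float, strike: float, t: float, rate: float, vol: float) -> float:
    """Call delta: sensitivity of the call price to the spot price."""
    return norm_cdf(_d1(spot, strike, t, rate, vol))


def implied_vol(price: float, spot: float, strike: float, t: float, rate: float,
                lo: float = 1e-4, hi: float = 5.0, tol: float = 1e-8) -> float:
    """Find the vol that reproduces `price`, by bisection.

    Works because the call price increases monotonically with vol.
    """
    if not bs_call(spot, strike, t, rate, lo) <= price <= bs_call(spot, strike, t, rate, hi):
        raise ValueError("price is outside the range reachable with vol in [lo, hi]")
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if bs_call(spot, strike, t, rate, mid) < price:
            lo = mid  # model price too low, so the vol must be higher
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)


# Round trip with made-up inputs: price a call at 30% vol, then recover the vol.
quote = bs_call(spot=100.0, strike=105.0, t=0.25, rate=0.04, vol=0.30)
recovered = implied_vol(quote, spot=100.0, strike=105.0, t=0.25, rate=0.04)
```

Bisection is chosen here for robustness rather than speed; a production solver might use Newton's method with vega instead.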
Character

Model Worldview

Do models walk the talk? Change one detail (a name, a label, a topic) and see if the answer flips. A coherent model stays consistent. A model on vibes does not.

Standout finding: gpt-5 writes stereotype essays 100% of the time; qwen3 refuses every one.
Top 5 · consistency under cue swaps
sonnet-4.5: 90%
grok-4.3: 90%
opus-4.6: 85%
opus-4.7: 80%
gemini-3.1: 80%
Same question, different label. A score of 100% means the verdict never flips.
  • 304 prompts across safety, values, bias, and politics
  • Safety: edgy essays vs honest evidence questions
  • Values: stated priorities vs forced tradeoffs
  • Politics: 50 balanced Likert items plus classic compass axes
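
One way to read "consistency under cue swaps" is as the share of question groups whose verdict stays the same across every variant. The Python sketch below assumes that reading. The data layout (pairs of `group_id` and `verdict`) and the function name `consistency_score` are hypothetical, and the published scoring may differ.

```python
from collections import defaultdict
from typing import Iterable, Tuple


def consistency_score(responses: Iterable[Tuple[str, str]]) -> float:
    """Fraction of question groups whose verdict never flips.

    `responses` holds (group_id, verdict) pairs, where every variant of the
    same underlying question (for example, with a swapped name or label)
    shares one group_id. This layout is an assumption for illustration.
    """
    verdicts_by_group = defaultdict(set)
    for group_id, verdict in responses:
        verdicts_by_group[group_id].add(verdict)
    if not verdicts_by_group:
        raise ValueError("no responses to score")
    # A group is stable when all of its variants produced the same verdict.
    stable = sum(1 for verdicts in verdicts_by_group.values() if len(verdicts) == 1)
    return stable / len(verdicts_by_group)


# Made-up example: q1 is stable across the swap, q2 flips.
sample = [
    ("q1", "agree"), ("q1", "agree"),
    ("q2", "agree"), ("q2", "refuse"),
]
score = consistency_score(sample)
```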
How we test

Both benchmarks share the same rigor. We designed them so you can trust the numbers, not just the headlines.

1. Frozen prompts

Every question is fixed at snapshot time. Models cannot game a moving target.

2. Public data

Full datasets on Hugging Face. Download prompts, rerun models, verify our scores.

3. Private holdout

Labels for a blind holdout stay private so models cannot train to the test set.

Frequently asked questions
What is the difference between the two benchmarks?
Market Resolution tests whether models can read option chains and do the math (implied vol, delta, arbitrage). Model Worldview tests character: safety boundaries, value consistency, name-swap fairness, and political lean. They are independent datasets on Hugging Face.
Which models were tested?
Twelve frontier models: GPT-5, GPT-5.4, Claude Opus 4.6/4.7, Claude Sonnet 4.5/4.6, Gemini 3.1 Pro, Grok 4.20/4.3, Llama 3.3 70B, Qwen3 235B, and DeepSeek V4. The same roster runs on both benchmarks so you can compare chain math vs worldview in one place.
Can I rerun the benchmarks myself?
Yes. Both datasets are MIT-licensed on Hugging Face with all prompts and scoring code. Load the dataset, run your model, and compare to our published leaderboards.
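
A rerun could be shaped roughly like the Python sketch below. Everything specific in it is an assumption: the field name `prompt`, the helper `rerun`, and the stand-in `ask_model` and `grade` callables are hypothetical, and the published scoring code should replace the toy grader used here.

```python
from typing import Callable, Iterable, Mapping


def rerun(rows: Iterable[Mapping[str, str]],
          ask_model: Callable[[str], str],
          grade: Callable[[str, Mapping[str, str]], float]) -> float:
    """Ask the model every prompt and return the mean grade in [0, 1]."""
    scores = []
    for row in rows:
        answer = ask_model(row["prompt"])  # "prompt" is an assumed field name
        scores.append(grade(answer, row))
    if not scores:
        raise ValueError("no rows to evaluate")
    return sum(scores) / len(scores)


# Loading from Hugging Face might look like this (needs `pip install datasets`).
# The repository ID and split are placeholders, not the real ones:
#   from datasets import load_dataset
#   rows = load_dataset("<org>/<dataset-name>", split="train")

# Toy stand-ins so the sketch runs offline.
toy_rows = [
    {"prompt": "2 + 2", "answer": "4"},
    {"prompt": "3 + 3", "answer": "6"},
]
toy_score = rerun(
    toy_rows,
    ask_model=lambda prompt: "4",  # a "model" that always answers 4
    grade=lambda answer, row: 1.0 if answer == row["answer"] else 0.0,
)
```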
Why is the top Market Resolution score only 48%?
Because we grade against real market data with partial credit on vol guesses, not multiple-choice stock trivia. A score under 50% means even the best model earns less than half the possible points. That is the takeaway: chain math is genuinely hard.
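
To make "partial credit" concrete, here is a minimal Python sketch of one possible rule: full marks for an exact vol guess, falling linearly to zero as the error approaches a tolerance. The linear shape, the 0.10 tolerance, and the function name `vol_partial_credit` are assumptions for illustration, not the benchmark's published rubric.

```python
def vol_partial_credit(guess: float, truth: float, tolerance: float = 0.10) -> float:
    """Linear partial credit for a vol guess (vols as decimals, e.g. 0.30).

    Returns 1.0 for an exact guess and 0.0 once the absolute error reaches
    `tolerance`. Both the shape and the tolerance are hypothetical.
    """
    if tolerance <= 0:
        raise ValueError("tolerance must be positive")
    error = abs(guess - truth)
    return max(0.0, 1.0 - error / tolerance)


# Made-up guesses against a true implied vol of 30%.
exact = vol_partial_credit(0.30, 0.30)    # exact guess
near = vol_partial_credit(0.35, 0.30)     # 5 vol points off
far = vol_partial_credit(0.50, 0.30)      # 20 vol points off
```

Under a rule like this, a model that is often in the right neighborhood but rarely exact collects only a fraction of a point per question, which is one way an overall score can land below 50%.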
How does this relate to Helium Trades?
Helium Trades builds market intelligence tools (forecasts, options pricing, news bias). These benchmarks come from the same live chain data and research mindset. They are our public contribution to open LLM evaluation.

Explore the full results

Charts, scoring guides, and downloadable data live on Hugging Face.

Market Resolution → Model Worldview →