Helium Benchmarks

Two benchmarks, one question each

Market Resolution asks whether models can do options math. Model Worldview asks whether they mean what they say. Each links to its own Hugging Face dataset with charts, methodology, and raw prompts.

Finance

Market Resolution

Can a frontier LLM read a real option chain? We paste live chains from Helium into twelve models and grade the math. No “NVDA goes up” trivia.

Current leader

grok-4.20-reasoning at 48% — no model has cracked 50%

Top 5 · overall score

grok-4.20: 48%
gemini-3.1: 47%
grok-4.3: 47%
gpt-5: 43%
opus-4.7: 42%

Full bar = 100%. Dashed line = 50%. Nobody is past it yet.

300 questions across NVDA, SPY, TSLA, AAPL, QQQ, AMZN
Question types: implied vol guesses, delta, arbitrage checks, rich-vs-average strike
Ground truth from the market chain itself, not model opinions
Honesty checks: does showing the option price help? Is there an NVDA vs SPY gap?
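For readers who want to see what the implied vol and delta questions involve, here is a minimal textbook sketch. It is not Helium's grading code. It assumes a European call, no dividends, and a flat interest rate under Black-Scholes, and every function name in it is illustrative.

```python
import math


def norm_cdf(x: float) -> float:
    """Standard normal CDF, computed from the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def _d1(spot: float, strike: float, t: float, rate: float, vol: float) -> float:
    return (math.log(spot / strike) + (rate + 0.5 * vol ** 2) * t) / (vol * math.sqrt(t))


def bs_call(spot: float, strike: float, t: float, rate: float, vol: float) -> float:
    """Black-Scholes price of a European call (assumption: no dividends)."""
    d1 = _d1(spot, strike, t, rate, vol)
    d2 = d1 - vol * math.sqrt(t)
    return spot * norm_cdf(d1) - strike * math.exp(-rate * t) * norm_cdf(d2)


def call_delta(spot: float, strike: float, t: float, rate: float, vol: float) -> float:
    """Black-Scholes delta of a European call."""
    return norm_cdf(_d1(spot, strike, t, rate, vol))


def implied_vol(price: float, spot: float, strike: float, t: float, rate: float,
                lo: float = 1e-4, hi: float = 5.0, tol: float = 1e-8) -> float:
    """Invert bs_call for vol by bisection (the call price rises with vol)."""
    if not bs_call(spot, strike, t, rate, lo) <= price <= bs_call(spot, strike, t, rate, hi):
        raise ValueError("price is outside the range reachable with lo <= vol <= hi")
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if bs_call(spot, strike, t, rate, mid) < price:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)


# Round trip: price a call at 30% vol, then recover the vol from the price.
quoted = bs_call(spot=100.0, strike=105.0, t=0.25, rate=0.04, vol=0.30)
recovered = implied_vol(quoted, spot=100.0, strike=105.0, t=0.25, rate=0.04)
delta = call_delta(spot=100.0, strike=105.0, t=0.25, rate=0.04, vol=recovered)
```

Real chains add dividends, early exercise, and bid-ask spreads, so a production pricer would need more than this.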

View on Hugging Face →

Character

Model Worldview

Do models walk the talk? Change one detail (a name, a label, a topic) and see if the answer flips. A coherent model stays consistent. A model on vibes does not.

Standout finding

gpt-5 writes stereotype essays 100% of the time; qwen3 refuses every one

Top 5 · consistency under cue swaps

sonnet-4.5: 90%
grok-4.3: 90%
opus-4.6: 85%
opus-4.7: 80%
gemini-3.1: 80%

Same question, different label. 100% = the verdict never flips.
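One way to read the consistency number is as the share of question groups whose verdict is identical across every cue variant. The sketch below computes that. The record fields `group_id` and `verdict` are hypothetical, and the published scoring code may aggregate differently.

```python
from collections import defaultdict
from typing import Iterable, Mapping


def consistency_score(records: Iterable[Mapping[str, str]]) -> float:
    """Fraction of groups where all cue-swapped variants got the same verdict."""
    verdicts_by_group = defaultdict(set)
    for rec in records:
        # "group_id" ties together variants of one question (hypothetical field).
        verdicts_by_group[rec["group_id"]].add(rec["verdict"])
    if not verdicts_by_group:
        raise ValueError("no records to score")
    stable = sum(1 for seen in verdicts_by_group.values() if len(seen) == 1)
    return stable / len(verdicts_by_group)


# Two groups: the first keeps its verdict under a name swap, the second flips.
example = [
    {"group_id": "q1", "verdict": "agree"},
    {"group_id": "q1", "verdict": "agree"},
    {"group_id": "q2", "verdict": "agree"},
    {"group_id": "q2", "verdict": "refuse"},
]
score = consistency_score(example)  # one stable group out of two
```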

304 prompts across safety, values, bias, and politics
Safety: edgy essays vs honest evidence questions
Values: stated priorities vs forced tradeoffs
Politics: 50 balanced Likert items plus classic compass axes

View on Hugging Face →

How we test

Both benchmarks share the same rigor. We designed them so you can trust the numbers, not just the headlines.

Frozen prompts

Every question is fixed at snapshot time. Models cannot game a moving target.
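If you rerun the prompts later, a content hash is a simple way to confirm your copy still matches an earlier snapshot. This is a sketch of that idea only. The page does not say how snapshots are verified, and the function name and record layout here are assumptions.

```python
import hashlib
import json
from typing import Any, Sequence


def snapshot_digest(prompts: Sequence[Any]) -> str:
    """SHA-256 over a canonical JSON encoding of the prompt list."""
    # sort_keys makes the digest independent of dict key order;
    # the order of prompts in the list still matters.
    canonical = json.dumps(prompts, sort_keys=True, ensure_ascii=False,
                           separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


frozen = [{"id": 1, "text": "What is the implied vol of this call?"}]
same_content = [{"text": "What is the implied vol of this call?", "id": 1}]
edited = [{"id": 1, "text": "What is the implied vol of this put?"}]

unchanged = snapshot_digest(frozen) == snapshot_digest(same_content)
changed = snapshot_digest(frozen) != snapshot_digest(edited)
```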

Public data

Full datasets on Hugging Face. Download prompts, rerun models, verify our scores.

Private holdout

Labels for a blind holdout stay private so models cannot train to the test set.

Frequently asked questions

What is the difference between the two benchmarks?

Market Resolution tests whether models can read option chains and do the math (implied vol, delta, arbitrage). Model Worldview tests character: safety boundaries, value consistency, name-swap fairness, and political lean. They are independent datasets on Hugging Face.
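To make "arbitrage" concrete, here is a sketch of the standard static checks on call prices at one expiry: prices should not rise with strike, should not fall faster than the strike gap, and should be convex in strike. It assumes distinct strikes and mid prices with no bid-ask spread. The function name is illustrative, and the benchmark's own checks may differ.

```python
from typing import List, Sequence, Tuple


def call_chain_violations(strikes: Sequence[float], prices: Sequence[float],
                          tol: float = 1e-9) -> List[Tuple]:
    """Return static no-arbitrage violations for calls sharing one expiry."""
    chain = sorted(zip(strikes, prices))  # assumption: strikes are distinct
    found: List[Tuple] = []
    for (k1, c1), (k2, c2) in zip(chain, chain[1:]):
        if c2 > c1 + tol:
            # A higher-strike call should never cost more.
            found.append(("monotonicity", k1, k2))
        if c1 - c2 > (k2 - k1) + tol:
            # A call spread cannot be worth more than its maximum payoff.
            found.append(("slope", k1, k2))
    for (k1, c1), (k2, c2), (k3, c3) in zip(chain, chain[1:], chain[2:]):
        w = (k3 - k2) / (k3 - k1)
        if c2 > w * c1 + (1.0 - w) * c3 + tol:
            # Convexity failure: the butterfly would have a negative cost.
            found.append(("butterfly", k1, k2, k3))
    return found


clean = call_chain_violations([95.0, 100.0, 105.0], [8.0, 5.0, 3.0])
rich_middle = call_chain_violations([95.0, 100.0, 105.0], [8.0, 6.0, 3.0])
```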

Which models were tested?

Twelve frontier models: GPT-5, GPT-5.4, Claude Opus 4.6/4.7, Claude Sonnet 4.5/4.6, Gemini 3.1 Pro, Grok 4.20/4.3, Llama 3.3 70B, Qwen3 235B, and DeepSeek V4. The same roster runs on both benchmarks so you can compare chain math vs worldview in one place.

Can I rerun the benchmarks myself?

Yes. Both datasets are MIT-licensed on Hugging Face with all prompts and scoring code. Load the dataset, run your model, and compare to our published leaderboards.
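A rerun loop can be as small as the sketch below. The record fields `prompt` and `answer`, the stub model, and exact-match grading are all assumptions for illustration, and the published scoring code is the authority on how answers are actually graded. The records could come from the Hugging Face `datasets` library's `load_dataset`, using the repository id shown on each dataset page. Inline records are used here so the example runs offline.

```python
from typing import Callable, Iterable, Mapping


def exact_match(prediction: str, answer: str) -> float:
    """1.0 if the trimmed strings match, else 0.0 (illustrative grader)."""
    return 1.0 if prediction.strip() == answer.strip() else 0.0


def run_benchmark(records: Iterable[Mapping[str, str]],
                  model_fn: Callable[[str], str],
                  grade_fn: Callable[[str, str], float]) -> float:
    """Mean grade of model_fn over all records."""
    scores = [grade_fn(model_fn(rec["prompt"]), rec["answer"]) for rec in records]
    if not scores:
        raise ValueError("no records to score")
    return sum(scores) / len(scores)


def stub_model(prompt: str) -> str:
    """Stand-in for a real model call; always answers 'A'."""
    return "A"


sample = [
    {"prompt": "Question one", "answer": "A"},
    {"prompt": "Question two", "answer": "B"},
]
overall = run_benchmark(sample, stub_model, exact_match)
```

Swap `stub_model` for a function that calls your model, then compare `overall` with the published leaderboard.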

Why is the top Market Resolution score only 48%?

Because we grade against real market data with partial credit on vol guesses, not multiple-choice stock trivia. Under 50% means even the best model earns less than half the possible points. That is the point: chain math is genuinely hard.
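Partial credit on a vol guess could work like the sketch below: full marks inside a tight relative-error band, fading linearly to zero at a wide one. The rule and both band widths are assumptions chosen for illustration. This page does not state the actual formula, so treat the published scoring code as the reference.

```python
def vol_partial_credit(guess: float, truth: float,
                       full_band: float = 0.05, zero_band: float = 0.50) -> float:
    """Score in [0, 1] from the relative error of an implied vol guess.

    full_band and zero_band are hypothetical thresholds, not Helium's.
    """
    if truth <= 0:
        raise ValueError("truth must be positive")
    rel_err = abs(guess - truth) / truth
    if rel_err <= full_band:
        return 1.0
    if rel_err >= zero_band:
        return 0.0
    # Linear fade between the two bands.
    return (zero_band - rel_err) / (zero_band - full_band)


exact = vol_partial_credit(0.30, 0.30)     # inside the full-credit band
far_off = vol_partial_credit(0.60, 0.30)   # 100% relative error
partial = vol_partial_credit(0.375, 0.30)  # 25% relative error
```

Under a rule like this, a model that is roughly right on most questions still lands well below 100%, consistent with the answer above that the best model earns less than half the possible points.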

How does this relate to Helium Trades?

Helium Trades builds market intelligence tools (forecasts, options pricing, news bias). These benchmarks come from the same live chain data and research mindset. They are our public contribution to open LLM evaluation.

Market Resolution

Model Worldview

Frozen prompts

Public data

Private holdout

Explore the full results

Market Forecasts

Trading Strategies

Balanced News

About

Contact Helium Trades
