Two public datasets that test frontier LLMs on real market math and real character. Same twelve models. Frozen prompts. Full data on Hugging Face.
Market Resolution asks whether models can do options math. Model Worldview asks whether they mean what they say. Each links to its own Hugging Face dataset with charts, methodology, and raw prompts.
Can a frontier LLM read a real option chain? We paste live chains from Helium into twelve models and grade the math. No “NVDA goes up” trivia.
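As a rough illustration of what grading the math on a chain can look like, here is a minimal sketch. It assumes one simple question type (the expiry breakeven of a long call bought at the bid/ask midpoint), a made-up quote, and a last-number-in-the-reply parsing rule. The function names, the tolerance, and the numbers are all hypothetical and are not taken from the published scoring guide.

```python
import math
import re
from typing import Optional


def call_breakeven(strike: float, bid: float, ask: float) -> float:
    """Expiry breakeven of a long call bought at the bid/ask midpoint."""
    return strike + (bid + ask) / 2.0


def extract_number(text: str) -> Optional[float]:
    """Return the last number in a model reply, or None if there is none.

    Taking the last number is an assumption made for this sketch only.
    """
    matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return float(matches[-1]) if matches else None


def grade(reply: str, expected: float, rel_tol: float = 0.001) -> bool:
    """Mark a reply correct if its final number is within rel_tol of expected."""
    value = extract_number(reply)
    return value is not None and math.isclose(value, expected, rel_tol=rel_tol)


# Hypothetical quote: a 100-strike call quoted 4.90 bid / 5.10 ask.
expected = call_breakeven(strike=100.0, bid=4.90, ask=5.10)

print(grade("The breakeven is $105.00.", expected))
print(grade("Roughly 110.", expected))
```

A tolerance-based check like this rewards the arithmetic itself rather than the wording of the answer.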
Do models walk the talk? Change one detail (a name, a label, a topic) and see if the answer flips. A coherent model stays consistent. A model on vibes does not.
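The change-one-detail idea can be sketched as a flip rate: ask the same templated question with one field swapped and count how often the answer differs from the baseline. The template, the names, and the two stand-in "models" below are hypothetical placeholders for a real model call. This is an illustration of the idea under those assumptions, not the benchmark's actual scoring code.

```python
from typing import Callable, Dict, List


def flip_rate(
    ask: Callable[[str], str],
    template: str,
    variants: List[Dict[str, str]],
) -> float:
    """Fraction of variants whose answer differs from the first (baseline) variant."""
    answers = [ask(template.format(**v)).strip().lower() for v in variants]
    baseline = answers[0]
    flips = sum(1 for a in answers[1:] if a != baseline)
    return flips / max(len(answers) - 1, 1)


# Hypothetical stand-ins for a real model call.
def consistent_model(prompt: str) -> str:
    return "yes"


def name_sensitive_model(prompt: str) -> str:
    return "no" if "Bob" in prompt else "yes"


TEMPLATE = (
    "{name} returned a lost wallet with the cash intact. "
    "Was that the right thing to do? Answer yes or no."
)
VARIANTS = [{"name": "Alice"}, {"name": "Bob"}, {"name": "Carol"}]

print(flip_rate(consistent_model, TEMPLATE, VARIANTS))
print(flip_rate(name_sensitive_model, TEMPLATE, VARIANTS))
```

A flip rate of zero means the answer survived every swap. Anything above zero means at least one irrelevant detail changed the answer.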
Both benchmarks share the same rigor. We designed them so you can trust the numbers, not just the headlines.
Every question is fixed at snapshot time, so every model answers the same frozen prompts and no score depends on a moving target.
Full datasets on Hugging Face. Download prompts, rerun models, verify our scores.
Labels for a blind holdout set stay private, so models cannot be trained on the test answers.
Charts, scoring guides, and downloadable data live on Hugging Face.