Live Evaluation Results

♟ LLM Chess Leaderboard

How well do large language models play chess against Stockfish at various Elo ratings?

Total Games
Models Tested
LLM Wins
Stockfish Wins
Draws

📊 Performance Matrix

Win rates by Stockfish Elo rating (rows) and LLM model with reasoning effort (columns). Click any active cell to inspect details.

Win Rate Legend:
≥60%
30-59%
>0% and <30%
0%
No games
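
The legend bands can be read as a simple mapping from a cell's record to a label. The sketch below is only an illustration, not the leaderboard's own code: the function name `win_rate_bucket` is hypothetical, the win rate is assumed to be LLM wins divided by all games in the cell (draws included in the denominator), and rates between 59% and 60% are assumed to fall in the 30-59% band.

```python
def win_rate_bucket(wins: int, games: int) -> str:
    """Map one matrix cell's record to a legend label (hypothetical helper)."""
    if games == 0:
        # Empty cell: no games were played at this Elo / model pairing.
        return "No games"
    # Assumption: win rate = LLM wins / total games, as a percentage.
    rate = 100.0 * wins / games
    if rate >= 60:
        return "≥60%"
    if rate >= 30:
        # Assumption: anything from 30% up to (but not including) 60%.
        return "30-59%"
    if rate > 0:
        return ">0% and <30%"
    return "0%"


# Example records: (LLM wins, games played) for a few hypothetical cells.
for wins, games in [(3, 5), (1, 2), (1, 10), (0, 4), (0, 0)]:
    print(wins, games, win_rate_bucket(wins, games))
```

Checking for an empty cell first keeps "No games" distinct from a true 0% win rate, which is the distinction the legend draws.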

🏆 Model Performance

Click a row to see that model's individual games and replay them.