
benchmark

Here are 4,434 public repositories matching this topic...

MixEval is a ground-truth-based dynamic benchmark derived from off-the-shelf benchmark mixtures. It evaluates LLMs with a highly capable model ranking (0.96 correlation with Chatbot Arena) while running locally and quickly (6% of the time and cost of running MMLU), and its queries are stably updated every month to avoid contamination.

  • Updated Jun 1, 2024
  • Python
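The headline figure above is a rank correlation between a benchmark's model ranking and Chatbot Arena's. A minimal sketch of how such a number can be computed, using Spearman's rho over hypothetical data (the model names and scores below are illustrative assumptions, not MixEval's actual data or method):

```python
# Sketch only: hypothetical scores, not MixEval's code or data.
from scipy.stats import spearmanr

# Hypothetical benchmark scores and Chatbot Arena Elo ratings for the
# same set of models (illustrative numbers, not real measurements).
models = ["model-a", "model-b", "model-c", "model-d", "model-e"]
benchmark_scores = [71.2, 65.8, 80.4, 58.1, 75.0]
arena_elo = [1180, 1105, 1250, 1050, 1210]

# Spearman's rho compares the two rankings rather than the raw values,
# so it measures how well the benchmark orders models the way Arena does.
rho, p_value = spearmanr(benchmark_scores, arena_elo)
print(f"rank correlation: {rho:.2f} (p={p_value:.3f})")
```

A rho near 1.0 means the benchmark orders models almost identically to Arena; rank correlation is the natural choice here because the two evaluations use incomparable score scales.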
