EvalScope
EvalScope is the official model evaluation and benchmarking framework for ModelScope, designed to meet diverse evaluation needs. It supports various model types including large language models (LLMs), multimodal models, embedding models, reranker models, and CLIP models.
The framework accommodates multiple evaluation scenarios such as end-to-end RAG evaluation, arena mode, and inference performance testing. It comes pre-loaded with benchmarks and metrics including MMLU, CMMLU, C-Eval, and GSM8K. Seamlessly integrated with the ms-swift training framework, EvalScope enables one-click evaluation, providing comprehensive support for model training and evaluation 🚀.
Now you can use EvalScope to evaluate LLM performance while leveraging SwanLab for convenient tracking, comparison, and visualization.
1. Preparation
Install the required environment:
pip install evalscope
pip install swanlab
For extended EvalScope functionality, install optional dependencies as needed (the -e '.[...]' commands below assume you have cloned the EvalScope source repository and are running them from its root directory):
pip install -e '.[opencompass]' # Install OpenCompass backend
pip install -e '.[vlmeval]' # Install VLMEvalKit backend
pip install -e '.[rag]' # Install RAGEval backend
pip install -e '.[perf]' # Install performance dependencies
pip install -e '.[app]' # Install visualization dependencies
pip install -e '.[all]' # Install all backends (Native, OpenCompass, VLMEvalKit, RAGEval)
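To confirm that both packages are available in your environment, you can check their installed versions with the standard library. This quick sanity check is optional and not part of EvalScope itself:

from importlib.metadata import version, PackageNotFoundError

# Verify that evalscope and swanlab are installed and print their versions.
for pkg in ("evalscope", "swanlab"):
    try:
        print(f"{pkg}: {version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg} is not installed")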
2. Evaluating Qwen Model Performance
To evaluate the performance of Qwen2.5-0.5B-Instruct on the gsm8k and arc datasets while tracking the results in SwanLab, run the following command:
evalscope eval \
--model Qwen/Qwen2.5-0.5B-Instruct \
--datasets gsm8k arc \
--limit 5 \
--swanlab-api-key 'your_api_key' \
--name 'qwen2.5-gsm8k-arc'
Where:
• swanlab-api-key is your SwanLab API key
• name specifies the experiment name
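If you prefer to drive the evaluation from Python instead of the CLI, EvalScope also provides a task-based API. The sketch below mirrors the command above; it assumes the TaskConfig/run_task entry points described in EvalScope's documentation, and the SwanLab-specific options are left to the CLI flags shown above because a TaskConfig equivalent may not exist in every version:

from evalscope.run import run_task
from evalscope.config import TaskConfig

# Mirrors the CLI call above: evaluate Qwen2.5-0.5B-Instruct on gsm8k and arc,
# limiting each dataset to 5 samples for a quick smoke test.
task_cfg = TaskConfig(
    model='Qwen/Qwen2.5-0.5B-Instruct',
    datasets=['gsm8k', 'arc'],
    limit=5,
)

run_task(task_cfg=task_cfg)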
To customize the project name, open the statistic_benchmark_metric_worker function in evalscope/perf/benchmark.py and modify the project parameter in the SwanLab configuration section.
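For orientation, the SwanLab setup in that worker ultimately comes down to a swanlab.init(...) call. The snippet below is only an illustrative sketch (the actual variable names and extra arguments in evalscope/perf/benchmark.py will differ); project is the argument to change:

import swanlab

# Illustrative sketch of the SwanLab configuration inside the perf worker;
# the real code in evalscope/perf/benchmark.py passes additional arguments.
swanlab.init(
    project="your_project_name",          # customize the project name here
    experiment_name="qwen2.5-gsm8k-arc",  # corresponds to the --name CLI argument
)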
Example of the resulting visualization in SwanLab:
Upload to Self-Hosted Version
If you wish to upload the evaluation results to a self-hosted SwanLab instance, first log in to it from the command line. For example, if your deployment address is http://localhost:8000, run:
swanlab login --host http://localhost:8000
After logging in, run the evalscope command as before, and the evaluation results will be uploaded to your self-hosted instance.
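If you would rather authenticate from Python than from the CLI, swanlab also exposes a login helper. The sketch below assumes your installed swanlab version accepts a host argument on swanlab.login; check your version's documentation if it does not:

import swanlab

# Log in to a self-hosted SwanLab deployment from Python instead of the CLI.
# The host argument is assumed to be supported by your swanlab version.
swanlab.login(api_key="your_api_key", host="http://localhost:8000")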