Name	Name	Last commit message	Last commit date
parent directory ..
Qwen2.5-Math @ a45202b	Qwen2.5-Math @ a45202b
manual_eval	manual_eval
README.md	README.md

Name

Last commit message

Last commit date

📊 Model Evaluation Setup

This guide walks you through the setup process for evaluating math reasoning performance using the Qwen2.5-Math repository.

🔁 Step 1: Clone the Repository

git clone https://github.com/QwenLM/Qwen2.5-Math.git

📦 Step 2: Install Dependencies

Standard Installation

cd Qwen2.5-Math/evaluation/

# Install latex2sympy (symbolic math parser)
cd latex2sympy
pip install -e .
cd ..

# Install core dependencies
pip install -r requirements.txt
pip install vllm==0.5.1 --no-build-isolation
pip install pebble word2number multiprocess timeout_decorator datasets
pip install pyarrow==10.0.1

⚠️ PyTorch 2.8.0 Special Case

If you're using PyTorch 2.8.0, follow these additional steps:

Delete the line containing flash_attn in requirements.txt.
Re-run installation:

cd latex2sympy
pip install -e .
cd ..

pip install -r requirements.txt
pip install vllm==0.5.1 --no-build-isolation
pip install pebble word2number multiprocess timeout_decorator datasets
pip install pyarrow==10.0.1

# Reinstall triton to avoid compatibility issues
pip install -U --force-reinstall --no-cache-dir triton

⚙️ Step 3: Configure Evaluation Script

set -ex

PROMPT_TYPE=$1                   # e.g., "qwen25-math-cot"
MODEL_NAME_OR_PATH=$2            # Path to your model directory
OUTPUT_DIR=${MODEL_NAME_OR_PATH}/math_eval

SPLIT="test"
NUM_TEST_SAMPLE=-1               # -1 means evaluate all samples

DATA_NAME="gsm8k"

TOKENIZERS_PARALLELISM=false \
python3 -u math_eval.py \
  --model_name_or_path ${MODEL_NAME_OR_PATH} \
  --data_name ${DATA_NAME} \
  --output_dir ${OUTPUT_DIR} \
  --split ${SPLIT} \
  --prompt_type ${PROMPT_TYPE} \
  --num_test_sample ${NUM_TEST_SAMPLE} \
  --seed 0 \
  --temperature 0 \
  --n_sampling 1 \
  --top_p 1 \
  --start 0 \
  --end -1 \
  --use_vllm \
  --save_outputs \
  --overwrite

🚀 Step 4: Run the Evaluation

export CUDA_VISIBLE_DEVICES="0"

PROMPT_TYPE="qwen25-math-cot"
MODEL_NAME_OR_PATH="/workspace/LLM_MCTS/ft_1.5_gsm8k"
bash sh/eval.sh $PROMPT_TYPE $MODEL_NAME_OR_PATH

✅ Notes

Evaluation output will be saved in ${MODEL_NAME_OR_PATH}/math_eval.
Set NUM_TEST_SAMPLE to any positive integer to evaluate a subset.
Ensure vllm is properly installed and GPU memory is sufficient.
For non-GSM8K datasets, change DATA_NAME accordingly (e.g., math, minerva, etc.).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

📊 Model Evaluation Setup

🔁 Step 1: Clone the Repository

📦 Step 2: Install Dependencies

Standard Installation

⚠️ PyTorch 2.8.0 Special Case

⚙️ Step 3: Configure Evaluation Script

🚀 Step 4: Run the Evaluation

✅ Notes

FilesExpand file tree

evaluation

Directory actions

More options

Directory actions

More options

Latest commit

History

evaluation

Folders and files

parent directory

README.md

📊 Model Evaluation Setup

🔁 Step 1: Clone the Repository

📦 Step 2: Install Dependencies

Standard Installation

⚠️ PyTorch 2.8.0 Special Case

⚙️ Step 3: Configure Evaluation Script

🚀 Step 4: Run the Evaluation

✅ Notes