This guide walks you through the setup process for evaluating math reasoning performance using the Qwen2.5-Math repository.
git clone https://github.com/QwenLM/Qwen2.5-Math.gitcd Qwen2.5-Math/evaluation/
# Install latex2sympy (symbolic math parser)
cd latex2sympy
pip install -e .
cd ..
# Install core dependencies
pip install -r requirements.txt
pip install vllm==0.5.1 --no-build-isolation
pip install pebble word2number multiprocess timeout_decorator datasets
pip install pyarrow==10.0.1If you're using PyTorch 2.8.0, follow these additional steps:
-
Delete the line containing
flash_attninrequirements.txt. -
Re-run installation:
cd latex2sympy
pip install -e .
cd ..
pip install -r requirements.txt
pip install vllm==0.5.1 --no-build-isolation
pip install pebble word2number multiprocess timeout_decorator datasets
pip install pyarrow==10.0.1
# Reinstall triton to avoid compatibility issues
pip install -U --force-reinstall --no-cache-dir tritonset -ex
PROMPT_TYPE=$1 # e.g., "qwen25-math-cot"
MODEL_NAME_OR_PATH=$2 # Path to your model directory
OUTPUT_DIR=${MODEL_NAME_OR_PATH}/math_eval
SPLIT="test"
NUM_TEST_SAMPLE=-1 # -1 means evaluate all samples
DATA_NAME="gsm8k"
TOKENIZERS_PARALLELISM=false \
python3 -u math_eval.py \
--model_name_or_path ${MODEL_NAME_OR_PATH} \
--data_name ${DATA_NAME} \
--output_dir ${OUTPUT_DIR} \
--split ${SPLIT} \
--prompt_type ${PROMPT_TYPE} \
--num_test_sample ${NUM_TEST_SAMPLE} \
--seed 0 \
--temperature 0 \
--n_sampling 1 \
--top_p 1 \
--start 0 \
--end -1 \
--use_vllm \
--save_outputs \
--overwriteexport CUDA_VISIBLE_DEVICES="0"
PROMPT_TYPE="qwen25-math-cot"
MODEL_NAME_OR_PATH="/workspace/LLM_MCTS/ft_1.5_gsm8k"
bash sh/eval.sh $PROMPT_TYPE $MODEL_NAME_OR_PATH- Evaluation output will be saved in
${MODEL_NAME_OR_PATH}/math_eval. - Set
NUM_TEST_SAMPLEto any positive integer to evaluate a subset. - Ensure
vllmis properly installed and GPU memory is sufficient. - For non-GSM8K datasets, change
DATA_NAMEaccordingly (e.g.,math,minerva, etc.).