ALOPE-RL

ALOPE-RL is a policy-based reinforcement learning framework for Machine Translation Quality Estimation (QE). It leverages the GRPO (Group Relative Policy Optimization) algorithm to train efficient adapters for Large Language Models (LLMs), enabling them to generate precise quality scores, error categorizations, and detailed Translation Quality Remarks (TQR).

The framework is specifically designed to address gaps in low-resource language evaluation (e.g., English -> Malayalam, English -> Hindi) by utilizing rewards derived from Direct Assessment (DA) scores and contextual annotator comments.

🚀 Key Features

Policy-Based Reinforcement Learning: Implements the GRPO algorithm via the trl library, enabling high-performance policy optimization without the overhead of a separate critic model.
TQR-Augmented Training: Leverages Translation Quality Remarks (TQR), which are contextual annotator comments, to drive better judgment and explainability in QE outputs.
Multi-Component Reward System: Models are optimized using a weighted reward aggregation system:
- DA Score Accuracy: Proximity to ground-truth Direct Assessment (DA) scores (Exact Score & Score Bin).
- Error Categorization: Accuracy in identifying specific error types (e.g., Mistranslation, Addition, Untranslated).
- Description Quality: Semantic similarity of generated TQR via BERTScore.
- Formatting & Length: Ensures output adheres to a strict XML structure and maintains optimal verbosity.
Efficient Fine-Tuning: Built on Unsloth, utilizing 4-bit quantization and LoRA adapters to achieve state-of-the-art results with compact LLMs (≤4B parameters).

📂 Repository Structure

| File | Description |
| --- | --- |
| `rlqe_word_tags.py` | Training script utilizing word-level quality tags (OK/BAD) for English-Hindi evaluation. |
| `rlqe_weak_annotations.py` | Training script utilizing Translation Quality Remarks (TQR) for English-Malayalam evaluation. |
| `evaluate_word_tags.py` | Evaluation script for word-tag models, calculating MSE, MAE, and correlation metrics. |
| `evaluate_weak_annotations.py` | Evaluation script for TQR-based models. |
| `rlqe_yaml.yml` | Full Conda environment specification. |
| `requirements.txt` | Standard pip dependency list. |
| `LICENSE` | MIT License. |

🛠️ Installation

1. Clone the Repository

git clone https://github.com/surrey-nlp/ALOPE-RL.git
cd ALOPE-RL

2. Set Up Environment

We recommend using Conda:

conda env create -f rlqe_yaml.yml
conda activate rlqe

Alternatively:

pip install -r requirements.txt

🚄 Usage

Training with Translation Quality Remarks (TQR)

To start training a model using GRPO and contextual annotations:

python rlqe_weak_annotations.py --data your_data.xlsx --max_steps 100 --batch_size 64

Training with Word-Level Tags

To train using word-level OK/BAD tags as additional context:

python rlqe_word_tags.py --data your_data.xlsx --max_steps 100 --batch_size 64

Evaluation

Evaluate a trained model against a test set to obtain MSE, MAE, and correlation coefficients.

For TQR-based models:

python evaluate_weak_annotations.py

For Word-Tag models:

python evaluate_word_tags.py
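
The metrics reported by the evaluation scripts can be illustrated with a small standalone sketch. This is not the code from `evaluate_weak_annotations.py` or `evaluate_word_tags.py`; the function names are hypothetical, and the choice of Pearson and Spearman as the correlation coefficients is an assumption, since the scripts' exact metrics are not listed here.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def rank(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def qe_metrics(pred, gold):
    """MSE, MAE, and correlations between predicted and gold DA scores."""
    errors = [p - g for p, g in zip(pred, gold)]
    return {
        "mse": mean(e * e for e in errors),
        "mae": mean(abs(e) for e in errors),
        "pearson": pearson(pred, gold),
        # Spearman is Pearson computed on the ranks.
        "spearman": pearson(rank(pred), rank(gold)),
    }


# Hypothetical predicted and gold DA scores for four segments.
metrics = qe_metrics([70, 55, 90, 30], [75, 50, 85, 40])
print(metrics)
```

Lower MSE and MAE indicate scores closer to the gold DA values, while the correlations measure how well the model preserves the ranking of translations by quality.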

🧠 Model Architecture

The ALOPE-RL architecture consists of:

QE Model: A frozen Large Language Model (e.g., Gemma-3-4b-it) augmented with trainable LoRA adapters.
GRPO Trainer: Manages terminal rewards and weight updates across a group of K completions per prompt.
Reward Aggregation: A weighted sum of policy rewards derived from both scalar scores and linguistic analysis.
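
The group-based update can be sketched in a few lines. GRPO scores each of the K completions for a prompt relative to the other completions in its group, which is why no separate critic is needed. The snippet below is a minimal illustration of that normalization only, not the `trl` implementation; the function name, the `eps` value, and the use of the population standard deviation are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize the rewards of the K completions sampled for one prompt.

    Each advantage is the reward's distance from the group mean, scaled by
    the group's standard deviation (eps avoids division by zero).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical aggregated rewards for K = 4 completions of a single prompt.
rewards = [0.82, 0.40, 0.61, 0.17]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Completions that score above the group mean receive a positive advantage and are reinforced; those below the mean are discouraged. When every completion in a group earns the same reward, all advantages are zero and that prompt contributes no learning signal.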

📊 Reward Function Details

The GRPO trainer optimizes the model using the weighted reward components listed below. Both the weights and the individual reward functions can be customized to suit your requirements.

Exact Score Reward (30%): Numerical proximity to the gold DA score.
Error Type Reward (25%): Jaccard similarity between predicted and gold error categories.
Score Bin Reward (15%): Whether the predicted score falls in the same bucket of the 0-100 range as the gold DA score.
Description Reward (10%): BERTScore F1 between generated TQR and gold remarks.
Length Reward (10%): Ensuring the description length is within the expected range.
Format Reward (10%): Strict adherence to the required XML schema (<reasoning>, <answer>, etc.).
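
As a rough illustration of how these components might combine, the sketch below uses the weights listed above and computes a weighted sum. The function names, the linear form of the exact-score reward, the 20-point bin width, and the two-tag format check are all assumptions made for the example rather than the definitions used in the training scripts. The description and length rewards are passed in as precomputed placeholder values.

```python
import re

# Weights copied from the list above.
WEIGHTS = {
    "exact_score": 0.30,
    "error_type": 0.25,
    "score_bin": 0.15,
    "description": 0.10,
    "length": 0.10,
    "format": 0.10,
}


def exact_score_reward(pred, gold):
    """Linear proximity on the 0-100 DA scale (assumed form)."""
    return max(0.0, 1.0 - abs(pred - gold) / 100.0)


def jaccard(pred_labels, gold_labels):
    """Jaccard similarity between predicted and gold error categories."""
    pred, gold = set(pred_labels), set(gold_labels)
    if not pred and not gold:
        return 1.0  # both empty: treat as perfect agreement
    return len(pred & gold) / len(pred | gold)


def score_bin_reward(pred, gold, bin_width=20):
    """1.0 when both scores fall in the same bucket (bin width is assumed)."""
    top = 100 // bin_width - 1  # a score of 100 belongs to the last bucket
    pred_bin = min(int(pred // bin_width), top)
    gold_bin = min(int(gold // bin_width), top)
    return float(pred_bin == gold_bin)


def format_reward(text):
    """1.0 when the output opens with <reasoning> and closes with <answer>."""
    pattern = r"\s*<reasoning>.*?</reasoning>\s*<answer>.*?</answer>\s*"
    return float(re.fullmatch(pattern, text, flags=re.DOTALL) is not None)


def aggregate(components):
    """Weighted sum of per-component rewards, each expected in [0, 1]."""
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)


# Hypothetical completion scored against a gold DA score of 80.
components = {
    "exact_score": exact_score_reward(72, 80),
    "error_type": jaccard(["Mistranslation", "Addition"], ["Mistranslation"]),
    "score_bin": score_bin_reward(72, 80),
    "description": 0.85,  # placeholder standing in for a BERTScore F1 value
    "length": 1.0,        # placeholder: description length within range
    "format": format_reward("<reasoning>...</reasoning>\n<answer>72</answer>"),
}
print(aggregate(components))
```

Because the weights sum to 1 and each component lies in [0, 1], the aggregated reward also stays in [0, 1], which keeps rewards comparable across the completions in a group.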

📦 Dataset

The dataset used for training and evaluation in ALOPE-RL is publicly available on Hugging Face:

👉 https://huggingface.co/datasets/surrey-nlp/ALOPE-RL-Dataset

Models

The trained models are publicly available on Hugging Face:

👉 Models

📜 License

This project is licensed under the MIT License - see the LICENSE file for details.
