A Python-based automated instruction dataset generator for fine-tuning Large Language Models. This bot generates high-quality instruction-answer pairs using an LLM API and outputs them to a CSV file.
- Automated Dataset Generation: Generates instruction-answer pairs for LLM fine-tuning
- Configurable: Customize language, style, context, and generation parameters
- Memory System: Tracks generated topics, intents, and patterns to avoid duplicates
- CSV Output: Outputs data in a standard instruction tuning format
- API Integration: Works with any OpenAI-compatible LLM API
instruction-bot/
├── main.py # Main entry point
├── config.json # Configuration file
├── requirements.txt # Python dependencies
├── .env # Environment variables (API keys, etc.)
├── config/
│ ├── __init__.py
│ ├── config.py # Configuration loading
│ └── memory.py # Memory system and prompts
├── handler/
│ ├── __init__.py
│ ├── api_handler.py # API communication
│ └── csv_handler.py # CSV file operations
├── util/
│ └── json_cleaner.py # JSON response cleaning
└── output/
  └── output.csv # Generated dataset
- Clone the repository
- Install dependencies:
pip install -r requirements.txt
- Configure environment variables in .env:
BASE_URL=https://your-api-endpoint.com/v1
API_KEY=your-api-key
MODEL_NAME=your-model-name
Edit config.json to customize the generation:
| Parameter | Description | Default |
|---|---|---|
| language | Language for generated content | "turkish" |
| style | Writing style guidelines | "" |
| context | Context for instructions | "" |
| customInstruction | Custom instructions | "" |
| loop | Number of generation loops | 3 |
| dataCountPerRequest | Instructions per request | 5 |

Example config.json:
{
"language": "english",
"style": "Formal and technical",
"context": "Software development documentation",
"customInstruction": "Focus on practical coding examples",
"loop": 5,
"dataCountPerRequest": 10
}

The .env file defines the following variables:
| Variable | Description |
|---|---|
| BASE_URL | API endpoint URL |
| API_KEY | API authentication key |
| MODEL_NAME | Name of the LLM model to use |
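
A startup check for these variables could look like the minimal sketch below. It uses only the standard library and assumes the variables are already in the process environment (for example, loaded from .env by python-dotenv). The names missing_env_vars and require_env are hypothetical and are not the project's check_config().

```python
import os

# Variable names taken from the environment variable table in this README.
REQUIRED_VARS = ("BASE_URL", "API_KEY", "MODEL_NAME")


def missing_env_vars(env=None):
    """Return the required variable names that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not (env.get(name) or "").strip()]


def require_env(env=None):
    """Raise a clear error if any required variable is missing."""
    missing = missing_env_vars(env)
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
```

Calling require_env() before the first API request makes a misconfigured .env fail early with a readable message rather than with an HTTP error later.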
Run the main script:
python main.py
The bot will:
- Load configuration from config.json
- Connect to the configured LLM API
- Generate instruction-answer pairs in loops
- Track generated content in memory to avoid duplicates
- Save results to output/output.csv
Generated data is saved to output/output.csv with the following columns:
| Column | Description |
|---|---|
| instruction | The instruction/prompt |
| input | Additional input context (usually empty) |
| output | The expected response/answer |
- Initialization: Loads configuration and initializes memory
- Prompt Generation: Creates system and user prompts based on config
- API Call: Sends prompts to the LLM API
- Response Parsing: Cleans and parses JSON response
- Memory Update: Updates memory with new topics, intents, patterns
- CSV Export: Appends generated data to CSV file
- Loop: Repeats for configured number of iterations
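
The steps above can be sketched as a single loop. This is an illustration under stated assumptions, not the code in main.py: run_generation, generate_batch, and save_rows are hypothetical names, and a plain list of seen instruction texts stands in for the real memory system, which tracks topics, intents, and patterns.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]


def run_generation(
    loops: int,
    generate_batch: Callable[[List[str]], List[Row]],
    save_rows: Callable[[List[Row]], None],
) -> int:
    """Run `loops` iterations and return the number of rows saved."""
    seen: List[str] = []  # hypothetical stand-in for the memory system
    total = 0
    for _ in range(loops):
        # Prompt generation, API call, and response parsing happen in here.
        rows = generate_batch(seen)
        fresh = []
        for row in rows:
            if row["instruction"] not in seen:
                seen.append(row["instruction"])  # memory update
                fresh.append(row)
        save_rows(fresh)  # CSV export
        total += len(fresh)
    return total
```

Passing the generator and the writer as callables keeps the loop testable without a live API or a real file.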
main.py: Main entry point that orchestrates the entire generation process.
config/config.py:
- Config class: Configuration data structure
- load_config(): Loads configuration from JSON file
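
As a rough illustration of configuration loading, the sketch below maps the documented config.json keys onto a dataclass. The key names and defaults come from the parameter table in this README; the snake_case field names, the config_from_dict helper, and the load_config signature are assumptions and may differ from the project's actual code.

```python
import json
from dataclasses import dataclass


@dataclass
class Config:
    # Defaults follow the parameter table in this README.
    language: str = "turkish"
    style: str = ""
    context: str = ""
    custom_instruction: str = ""
    loop: int = 3
    data_count_per_request: int = 5


# JSON key -> dataclass field (the snake_case names are an assumption).
_KEY_MAP = {
    "language": "language",
    "style": "style",
    "context": "context",
    "customInstruction": "custom_instruction",
    "loop": "loop",
    "dataCountPerRequest": "data_count_per_request",
}


def config_from_dict(raw):
    """Build a Config, falling back to defaults for missing keys."""
    kwargs = {field: raw[key] for key, field in _KEY_MAP.items() if key in raw}
    return Config(**kwargs)


def load_config(path="config.json"):
    """Read a JSON file and convert it to a Config."""
    with open(path, encoding="utf-8") as fh:
        return config_from_dict(json.load(fh))
```

Unknown keys in the file are ignored in this sketch, so a typo in a key name silently falls back to the default.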
config/memory.py:
- Memory class: Tracks used topics, intents, and patterns
- configurated_system_prompt_as_message(): Creates the system prompt
- configurated_user_prompt_as_message(): Creates the user prompt
handler/api_handler.py:
- Message class: Represents a chat message
- Messages class: Container for multiple messages
- get_ai_result(): Sends messages to the API and returns the response
- check_config(): Validates environment variables
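
For orientation, here is a minimal sketch of how a chat request to an OpenAI-compatible endpoint is typically assembled. It builds the request without sending it and uses urllib from the standard library to stay dependency-free, whereas the project lists requests as a dependency. The Message fields, build_chat_request, and extract_reply are illustrative assumptions, not the project's get_ai_result().

```python
import json
import urllib.request
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class Message:
    role: str     # "system", "user", or "assistant"
    content: str


def build_chat_request(
    messages: List[Message], base_url: str, api_key: str, model: str
) -> urllib.request.Request:
    """Build (but do not send) a POST to the chat completions endpoint."""
    payload = {"model": model, "messages": [asdict(m) for m in messages]}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def extract_reply(response_body: dict) -> str:
    """Pull the assistant text out of a parsed chat completions response."""
    return response_body["choices"][0]["message"]["content"]
```

The request can then be sent with urllib.request.urlopen(request) and the body decoded with json.load before it is passed to extract_reply.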
handler/csv_handler.py:
- ensure_csv_header(): Creates the CSV with a header if it does not exist
- append_to_csv(): Appends generated data to the CSV
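
The two CSV helpers could be implemented roughly as follows with the standard csv module. The function names match the ones listed here, but the signatures (a path plus a list of dicts) are assumptions, and the column names are taken from the output format described in this README.

```python
import csv
import os

FIELDNAMES = ["instruction", "input", "output"]


def ensure_csv_header(path):
    """Create the file with a header row unless it already has content."""
    if os.path.exists(path) and os.path.getsize(path) > 0:
        return
    directory = os.path.dirname(path)
    if directory:
        os.makedirs(directory, exist_ok=True)
    with open(path, "w", newline="", encoding="utf-8") as fh:
        csv.DictWriter(fh, fieldnames=FIELDNAMES).writeheader()


def append_to_csv(path, rows):
    """Append rows (dicts) to the CSV, filling missing columns with ""."""
    ensure_csv_header(path)
    with open(path, "a", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDNAMES)
        for row in rows:
            writer.writerow({name: row.get(name, "") for name in FIELDNAMES})
```

Opening the file with newline="" lets the csv module control line endings, which avoids blank rows on Windows.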
util/json_cleaner.py:
- clean_json_response(): Cleans the LLM JSON response and removes markdown formatting
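
LLMs often wrap JSON in a markdown code fence, which is presumably what this cleaning step handles. The sketch below shows one simple way to do it: drop a leading and trailing fence line, then parse. The helper strip_code_fence is hypothetical, and the project's clean_json_response() may apply other or additional rules.

```python
import json

FENCE = "`" * 3  # a markdown code fence marker


def strip_code_fence(text):
    """Remove a surrounding markdown code fence, if there is one."""
    lines = text.strip().splitlines()
    if lines and lines[0].startswith(FENCE):
        lines = lines[1:]  # drop the opening fence and its language tag
        if lines and lines[-1].strip() == FENCE:
            lines = lines[:-1]  # drop the closing fence
    return "\n".join(lines).strip()


def clean_json_response(text):
    """Strip markdown fencing and parse the remaining text as JSON."""
    return json.loads(strip_code_fence(text))
```

Text that is still not valid JSON after cleaning raises json.JSONDecodeError, which the caller can catch in order to skip or retry that request.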
- certifi: SSL certificates
- charset-normalizer: Character encoding
- dotenv: Environment variable loading
- idna: Internationalized domain names
- load-dotenv: Alternative env loading
- python-dotenv: Python .env support
- requests: HTTP library
- urllib3: HTTP client
The bot maintains a memory system to ensure dataset diversity:
- used_topics: Track generated topics to avoid repetition
- used_intents: Track question intents/types
- used_patterns: Track structural patterns
- notes: Additional notes for future generations
Feeding these lists back into later prompts helps keep generated instructions varied and reduces duplicates.
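
A memory structure with these four fields might look like the sketch below. The field names come from the list above; the remember and as_prompt_hint methods, and the case-insensitive duplicate check, are assumptions made for illustration rather than the project's Memory class.

```python
from dataclasses import dataclass, field
from typing import Iterable, List


@dataclass
class Memory:
    used_topics: List[str] = field(default_factory=list)
    used_intents: List[str] = field(default_factory=list)
    used_patterns: List[str] = field(default_factory=list)
    notes: List[str] = field(default_factory=list)

    def remember(self, kind: str, values: Iterable[str]) -> List[str]:
        """Add unseen values to the list named `kind`; return those added."""
        bucket = getattr(self, kind)
        seen = {item.casefold() for item in bucket}
        added = []
        for value in values:
            cleaned = value.strip()
            key = cleaned.casefold()
            if key and key not in seen:
                seen.add(key)
                bucket.append(cleaned)
                added.append(cleaned)
        return added

    def as_prompt_hint(self) -> str:
        """Summarize used items as text that could be appended to a prompt."""
        groups = (
            ("topics", self.used_topics),
            ("intents", self.used_intents),
            ("patterns", self.used_patterns),
        )
        return "\n".join(
            f"Avoid these {label}: " + ", ".join(items)
            for label, items in groups
            if items
        )
```

For example, memory.remember("used_topics", ["Sorting", "sorting "]) stores "Sorting" once, and the hint text can be added to the next system prompt to steer the model away from repeats.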
The bot works with any OpenAI-compatible API. Ensure your .env file points to a valid endpoint.
MIT License