tahsinkoc/instruction-bot
Instruction Dataset Generator Bot

A Python-based automated instruction dataset generator for fine-tuning Large Language Models. This bot generates high-quality instruction-answer pairs using an LLM API and outputs them to a CSV file.

Features

  • Automated Dataset Generation: Generates instruction-answer pairs for LLM fine-tuning
  • Configurable: Customize language, style, context, and generation parameters
  • Memory System: Tracks generated topics, intents, and patterns to avoid duplicates
  • CSV Output: Outputs data in a standard instruction tuning format
  • API Integration: Works with any OpenAI-compatible LLM API

Project Structure

instruction-bot/
├── main.py              # Main entry point
├── config.json          # Configuration file
├── requirements.txt     # Python dependencies
├── .env                 # Environment variables (API keys, etc.)
├── config/
│   ├── __init__.py
│   ├── config.py        # Configuration loading
│   └── memory.py        # Memory system and prompts
├── handler/
│   ├── __init__.py
│   ├── api_handler.py   # API communication
│   └── csv_handler.py   # CSV file operations
├── util/
│   └── json_cleaner.py  # JSON response cleaning
└── output/
    └── output.csv       # Generated dataset

Installation

  1. Clone the repository
  2. Install dependencies:
    pip install -r requirements.txt
  3. Configure environment variables in .env:
    BASE_URL=https://your-api-endpoint.com/v1
    API_KEY=your-api-key
    MODEL_NAME=your-model-name
    

Configuration

Edit config.json to customize the generation:

Parameter            Description                       Default
language             Language for generated content    "turkish"
style                Writing style guidelines          ""
context              Context for instructions          ""
customInstruction    Custom instructions               ""
loop                 Number of generation loops        3
dataCountPerRequest  Instructions per request          5

Example Configuration

{
    "language": "english",
    "style": "Formal and technical",
    "context": "Software development documentation",
    "customInstruction": "Focus on practical coding examples",
    "loop": 5,
    "dataCountPerRequest": 10
}
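
The parameter table maps naturally onto a small data structure. A hypothetical sketch of what `config/config.py`'s `Config` class and `load_config()` might look like (field names and defaults are taken from the table above; everything else is an illustrative assumption, not the project's actual code):

```python
import json
from dataclasses import dataclass

@dataclass
class Config:
    # Defaults mirror the parameter table in this README.
    language: str = "turkish"
    style: str = ""
    context: str = ""
    customInstruction: str = ""
    loop: int = 3
    dataCountPerRequest: int = 5

def load_config(path: str = "config.json") -> Config:
    """Load configuration from a JSON file, falling back to the defaults
    for any key the file omits."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    # Ignore unknown keys so extra fields in config.json don't crash loading.
    known = {k: v for k, v in data.items() if k in Config.__dataclass_fields__}
    return Config(**known)
```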

Environment Variables

Variable     Description
BASE_URL     API endpoint URL
API_KEY      API authentication key
MODEL_NAME   Name of the LLM model to use

Usage

Run the main script:

python main.py

The bot will:

  1. Load configuration from config.json
  2. Connect to the configured LLM API
  3. Generate instruction-answer pairs in loops
  4. Track generated content in memory to avoid duplicates
  5. Save results to output/output.csv

Output Format

Generated data is saved to output/output.csv with the following columns:

Column       Description
instruction  The instruction/prompt
input        Additional input context (usually empty)
output       The expected response/answer
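
The `ensure_csv_header()` and `append_to_csv()` helpers listed under Key Components could be sketched as follows (the function names come from this README; the bodies are illustrative assumptions built on the three-column format above):

```python
import csv
import os

FIELDS = ["instruction", "input", "output"]

def ensure_csv_header(path: str) -> None:
    """Create the CSV with a header row if the file does not exist yet."""
    if not os.path.exists(path):
        with open(path, "w", newline="", encoding="utf-8") as f:
            csv.DictWriter(f, fieldnames=FIELDS).writeheader()

def append_to_csv(path: str, rows: list[dict]) -> None:
    """Append generated instruction/input/output rows to the dataset,
    filling missing columns with empty strings."""
    ensure_csv_header(path)
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        for row in rows:
            writer.writerow({k: row.get(k, "") for k in FIELDS})
```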

How It Works

  1. Initialization: Loads configuration and initializes memory
  2. Prompt Generation: Creates system and user prompts based on config
  3. API Call: Sends prompts to the LLM API
  4. Response Parsing: Cleans and parses JSON response
  5. Memory Update: Updates memory with new topics, intents, patterns
  6. CSV Export: Appends generated data to CSV file
  7. Loop: Repeats for configured number of iterations
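
The seven steps above can be condensed into a sketch of the main loop. This is a simplified stand-in, not the project's actual `main.py`: the API call is injected as a callable so the flow can be followed without a live endpoint, and the memory is a plain dict:

```python
import json

def run(get_ai_result, loops: int = 3) -> list[dict]:
    """Repeat the generate -> parse -> remember cycle `loops` times."""
    memory = {"used_topics": [], "used_intents": [], "used_patterns": []}
    rows: list[dict] = []
    for _ in range(loops):
        raw = get_ai_result(memory)            # steps 2-3: prompts + API call
        pairs = json.loads(raw)                # step 4: parse the JSON response
        for p in pairs:                        # step 5: remember what was made
            memory["used_topics"].append(p.get("instruction", ""))
        rows.extend(pairs)                     # step 6: would append to CSV
    return rows                                # step 7: loop until done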

Key Components

main.py
  Main entry point orchestrating the entire generation process.

config/config.py
  • Config class: Configuration data structure
  • load_config(): Loads configuration from JSON file

config/memory.py
  • Memory class: Tracks used topics, intents, patterns
  • configurated_system_prompt_as_message(): Creates system prompt
  • configurated_user_prompt_as_message(): Creates user prompt

handler/api_handler.py
  • Message class: Represents a chat message
  • Messages class: Container for multiple messages
  • get_ai_result(): Sends messages to API and returns response
  • check_config(): Validates environment variables

handler/csv_handler.py
  • ensure_csv_header(): Creates CSV with header if not exists
  • append_to_csv(): Appends generated data to CSV

util/json_cleaner.py
  • clean_json_response(): Cleans LLM JSON response, removes markdown formatting
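
A minimal sketch of what `clean_json_response()` might do, assuming the common case of an LLM wrapping its JSON output in markdown code fences (the function name is from this README; the body is an assumption):

```python
def clean_json_response(raw: str) -> str:
    """Strip the ```json fences LLMs often wrap around structured output
    so the remaining body can be passed to json.loads()."""
    text = raw.strip()
    if text.startswith("```"):
        text = text.split("\n", 1)[1]    # drop the opening ```json line
    if text.endswith("```"):
        text = text.rsplit("```", 1)[0]  # drop the closing fence
    return text.strip()
```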

Dependencies

  • certifi - SSL certificates
  • charset-normalizer - Character encoding
  • dotenv - Environment variable loading
  • idna - Internationalized domain names
  • load-dotenv - Alternative env loading
  • python-dotenv - Python .env support
  • requests - HTTP library
  • urllib3 - HTTP client

Memory System

The bot maintains a memory system to ensure dataset diversity:

  • used_topics: Track generated topics to avoid repetition
  • used_intents: Track question intents/types
  • used_patterns: Track structural patterns
  • notes: Additional notes for future generations

This helps keep the generated instructions unique and varied across loops.
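
A hypothetical sketch of the Memory structure, using the four fields listed above (the `remember()` method and deduplication logic are illustrative assumptions):

```python
from dataclasses import dataclass, field

@dataclass
class Memory:
    """Tracks what has already been generated so prompts can exclude it."""
    used_topics: list[str] = field(default_factory=list)
    used_intents: list[str] = field(default_factory=list)
    used_patterns: list[str] = field(default_factory=list)
    notes: list[str] = field(default_factory=list)

    def remember(self, topic: str, intent: str, pattern: str) -> None:
        """Record one generated item, skipping values already seen."""
        if topic not in self.used_topics:
            self.used_topics.append(topic)
        if intent not in self.used_intents:
            self.used_intents.append(intent)
        if pattern not in self.used_patterns:
            self.used_patterns.append(pattern)
```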

API Compatibility

The bot works with any OpenAI-compatible API. Ensure your .env file points to a valid endpoint.
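
For reference, a minimal OpenAI-compatible chat-completion call with `requests` might look like this. The `/chat/completions` route and response shape follow the standard OpenAI convention; the project's actual `get_ai_result()` may differ in details:

```python
import os
import requests

def get_ai_result(messages: list[dict]) -> str:
    """Send a chat-completion request to the endpoint configured in .env
    and return the model's reply text."""
    resp = requests.post(
        f"{os.environ['BASE_URL']}/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['API_KEY']}"},
        json={"model": os.environ["MODEL_NAME"], "messages": messages},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```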

License

MIT License
