107 changes: 107 additions & 0 deletions MCP_README.md
@@ -0,0 +1,107 @@
# Synthetic Data Kit MCP Server

This directory contains an MCP (Model Context Protocol) server for the Synthetic Data Kit. The server lets MCP-compatible clients drive the Synthetic Data Kit through a standardized protocol.

## Features

The MCP server provides access to all major functionalities of the Synthetic Data Kit:

1. **Document Ingestion** - Parse documents (PDF, HTML, YouTube, DOCX, PPT, TXT) into clean text
2. **Content Creation** - Generate QA pairs, summaries, and Chain of Thought examples
3. **Content Curation** - Clean and filter content based on quality
4. **Format Conversion** - Convert to different formats for fine-tuning
5. **System Check** - Verify LLM provider connectivity

## Installation

The MCP server is automatically installed when you install the Synthetic Data Kit in development mode:

```bash
cd /path/to/synthetic-data-kit
pip install -e .
```

## Usage

### Starting the MCP Server

To start the MCP server directly:

```bash
synthetic-data-kit-mcp
```

This starts the server, which then listens for MCP connections over stdio.

### Using with an MCP Client

The server works with any MCP-compatible client, such as Claude Desktop. Configure the client to launch `synthetic-data-kit-mcp` and communicate with it over stdio.
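
As a rough sketch of what such a client configuration might look like: Claude Desktop reads an `mcpServers` map from its `claude_desktop_config.json` file. The snippet below builds an entry of that shape and prints it as JSON. The helper `build_client_config` and the entry label `synthetic-data-kit` are illustrative names chosen here, not part of this PR. Other clients may expect a different layout.

```python
import json


def build_client_config(server_name="synthetic-data-kit",
                        command="synthetic-data-kit-mcp"):
    """Return a Claude Desktop style config that launches the server over stdio.

    Hypothetical helper: server_name is an arbitrary label, and command is
    the console script described in "Starting the MCP Server".
    """
    return {"mcpServers": {server_name: {"command": command, "args": []}}}


if __name__ == "__main__":
    # Print the entry so it can be merged into the client's config file by hand.
    print(json.dumps(build_client_config(), indent=2))
```

If the console script is not on the client's `PATH`, passing an absolute path as `command` is one way around it.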

### Tools Available

The server exposes the following tools:

1. `sdk_ingest` - Parse documents into clean text
2. `sdk_create` - Generate content from text
3. `sdk_curate` - Clean and filter content based on quality
4. `sdk_save_as` - Convert to different formats for fine-tuning
5. `sdk_system_check` - Check if the selected LLM provider's server is running

Each tool maps directly to the corresponding CLI command in the Synthetic Data Kit.

### Prompts Available

The server also provides one prompt:

1. `sdk-workflow` - A complete workflow for generating synthetic data

## Example Usage

Here's an example of how an MCP client might interact with the server:

1. Client requests tool list → Server responds with available tools
2. Client calls `sdk_ingest` with a document path → Server runs `synthetic-data-kit ingest`
3. Client calls `sdk_create` with parameters → Server runs `synthetic-data-kit create`
4. Client calls `sdk_curate` to filter results → Server runs `synthetic-data-kit curate`
5. Client calls `sdk_save_as` to convert formats → Server runs `synthetic-data-kit save-as`
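
The tool-to-command pairing in the steps above can be sketched as a lookup table. This is only an illustration of the mapping, not the server's actual code: `TOOL_TO_SUBCOMMAND` and `build_cli_argv` are hypothetical names, and `sdk_system_check` is left out because this README does not name its CLI subcommand.

```python
# Pairs taken from the example flow above; names of the table and helper
# are hypothetical and not taken from the server's source.
TOOL_TO_SUBCOMMAND = {
    "sdk_ingest": "ingest",
    "sdk_create": "create",
    "sdk_curate": "curate",
    "sdk_save_as": "save-as",
}


def build_cli_argv(tool_name, extra_args=()):
    """Build the argv list for the CLI call behind an MCP tool."""
    if tool_name not in TOOL_TO_SUBCOMMAND:
        raise ValueError(f"Unknown tool: {tool_name}")
    # extra_args are passed through untouched; the caller supplies them.
    return ["synthetic-data-kit", TOOL_TO_SUBCOMMAND[tool_name], *extra_args]


if __name__ == "__main__":
    print(build_cli_argv("sdk_ingest", ["sample_document.txt"]))
```

Returning a list rather than a single string keeps the arguments separate, so they can be handed to a subprocess without shell quoting.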

## Development

To test the MCP server:

```bash
cd /path/to/synthetic-data-kit
python test_mcp_server.py
```

To run a complete example:

```bash
cd /path/to/synthetic-data-kit
python example_mcp_usage.py
```

## Architecture

The MCP server acts as a bridge between MCP clients and the Synthetic Data Kit CLI:

```
MCP Client ↔ MCP Server ↔ Synthetic Data Kit CLI
```

The server executes every command as a subprocess call to the CLI, so each tool behaves the same way as the CLI command it wraps.
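
A minimal sketch of the subprocess bridge, assuming the server captures the CLI's output and returns it to the client as text. The helper name `run_cli` is hypothetical, and the real implementation in `synthetic_data_kit.mcp_server` may differ.

```python
import subprocess
import sys


def run_cli(argv, timeout=None):
    """Run a command and return (exit_code, stdout, stderr) as text."""
    completed = subprocess.run(
        argv,
        capture_output=True,  # collect stdout and stderr instead of inheriting them
        text=True,            # decode bytes to str
        timeout=timeout,
        check=False,          # report failures through the exit code
    )
    return completed.returncode, completed.stdout, completed.stderr


if __name__ == "__main__":
    # Stand-in command so the sketch runs without the kit installed;
    # a real call would pass something like ["synthetic-data-kit", "--help"].
    code, out, err = run_cli([sys.executable, "-c", "print('ok')"])
    print(code, out.strip())
```

Capturing output matters for a stdio server: anything the child process wrote directly to the server's stdout would mix with the MCP message stream.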

## Configuration

The MCP server uses the same configuration as the Synthetic Data Kit CLI. Make sure your `config.yaml` is properly set up before using the server.

## Troubleshooting

If you encounter issues:

1. Ensure the Synthetic Data Kit is properly installed: `pip install -e .`
2. Verify the CLI works: `synthetic-data-kit --help`
3. Check that required dependencies are installed
4. Ensure your LLM provider (vLLM or API endpoint) is properly configured and running

Note: For API endpoints, you'll need to set the appropriate API keys in your environment or configuration file.
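
The first two checks can be partly automated. The sketch below only tests whether the two console scripts are discoverable on `PATH`; it does not verify dependencies or the LLM provider. The function name `find_missing_commands` is a hypothetical helper, not something shipped with the kit.

```python
import shutil


def find_missing_commands(commands=("synthetic-data-kit", "synthetic-data-kit-mcp")):
    """Return the commands that cannot be found on PATH."""
    return [name for name in commands if shutil.which(name) is None]


if __name__ == "__main__":
    missing = find_missing_commands()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("Both entry points are on PATH.")
```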
125 changes: 125 additions & 0 deletions example_mcp_usage.py
@@ -0,0 +1,125 @@
#!/usr/bin/env python3
"""
Example script demonstrating how to use the Synthetic Data Kit MCP server.
This script shows how to process a document using the MCP server.
"""

import asyncio
import json
import sys
import os
from typing import Any, Dict, List

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client
from mcp.types import TextContent


async def process_document_example():
    """Example of processing a document using the MCP server."""
    # Get the current working directory
    cwd = os.getcwd()

    # Start the MCP server as a subprocess
    server_params = StdioServerParameters(
        command=sys.executable,
        args=["-m", "synthetic_data_kit.mcp_server"],
        cwd=cwd
    )

    async with stdio_client(server_params) as (read, write):
        async with ClientSession(read, write) as session:
            # Initialize the session
            await session.initialize()

            # Create a sample text file to process
            sample_text = """
            Artificial Intelligence (AI) is a branch of computer science that aims to create software or machines that exhibit human-like intelligence.
            This can include learning from experience, understanding natural language, solving problems, and recognizing patterns.

            Machine Learning (ML) is a subset of AI that focuses on algorithms and statistical models that enable computers to improve at tasks
            with experience. Deep Learning is a further subset of ML that uses neural networks with multiple layers.

            Natural Language Processing (NLP) is another important area of AI that deals with the interaction between computers and humans
            using natural language. It involves tasks like language translation, sentiment analysis, and text summarization.

            Computer Vision is yet another field that enables computers to interpret and understand visual information from the world,
            including image and video recognition.
            """

            # Write the sample text to a file
            with open("sample_document.txt", "w") as f:
                f.write(sample_text)

            print("=== Processing Sample Document ===")
            print("Sample document created: sample_document.txt")

            # Each step catches its own errors so the later steps still run.

            # 1. Use the ingest tool to process the document
            print("\n1. Ingesting document...")
            try:
                result = await session.call_tool(
                    "sdk_ingest",
                    {
                        "input": "sample_document.txt",
                        "output_dir": "data/parsed"
                    }
                )
                print(f"Ingest result: {result}")
            except Exception as e:
                print(f"Error during ingestion: {e}")

            # 2. Use the create tool to generate QA pairs
            print("\n2. Creating QA pairs...")
            try:
                result = await session.call_tool(
                    "sdk_create",
                    {
                        "input": "data/parsed/sample_document.lance",
                        "content_type": "qa",
                        "num_pairs": 5
                    }
                )
                print(f"Create result: {result}")
            except Exception as e:
                print(f"Error during creation: {e}")

            # 3. Use the curate tool to filter content
            print("\n3. Curating content...")
            try:
                result = await session.call_tool(
                    "sdk_curate",
                    {
                        "input": "data/generated/sample_document_qa_pairs.json",
                        "threshold": 7.0
                    }
                )
                print(f"Curate result: {result}")
            except Exception as e:
                print(f"Error during curation: {e}")

            # 4. Use the save-as tool to convert format
            print("\n4. Saving in final format...")
            try:
                result = await session.call_tool(
                    "sdk_save_as",
                    {
                        "input": "data/curated/sample_document_cleaned.json",
                        "format": "alpaca"
                    }
                )
                print(f"Save-as result: {result}")
            except Exception as e:
                print(f"Error during format conversion: {e}")

            print("\n=== Document Processing Complete ===")

            # Clean up sample files
            try:
                os.remove("sample_document.txt")
                print("Cleaned up sample_document.txt")
            except OSError:
                # The file may already be gone; nothing to clean up.
                pass


if __name__ == "__main__":
    asyncio.run(process_document_example())
1 change: 1 addition & 0 deletions pyproject.toml
@@ -67,6 +67,7 @@ classifiers = [

[project.scripts]
synthetic-data-kit = "synthetic_data_kit.cli:app"
synthetic-data-kit-mcp = "synthetic_data_kit.mcp_server:main"

[tool.hatch.build.targets.wheel]
packages = ["synthetic_data_kit"]