meta-llama · groxaxo · Sep 4, 2025
diff --git a/MCP_README.md b/MCP_README.md
@@ -0,0 +1,107 @@
+# Synthetic Data Kit MCP Server
+
+This directory contains an MCP (Model Context Protocol) server implementation for the Synthetic Data Kit. The server lets MCP-compatible clients drive the Synthetic Data Kit through a standardized protocol.
+
+## Features
+
+The MCP server provides access to all major functionalities of the Synthetic Data Kit:
+
+1. **Document Ingestion** - Parse documents (PDF, HTML, YouTube, DOCX, PPT, TXT) into clean text
+2. **Content Creation** - Generate QA pairs, summaries, and Chain of Thought examples
+3. **Content Curation** - Clean and filter content based on quality
+4. **Format Conversion** - Convert to different formats for fine-tuning
+5. **System Check** - Verify LLM provider connectivity
+
+## Installation
+
+The MCP server is automatically installed when you install the Synthetic Data Kit in development mode:
+
+```bash
+cd /path/to/synthetic-data-kit
+pip install -e .
+```
+
+## Usage
+
+### Starting the MCP Server
+
+To start the MCP server directly:
+
+```bash
+synthetic-data-kit-mcp
+```
+
+This starts the server, which then listens for MCP connections over stdio.
+
+### Using with an MCP Client
+
+The server works with any MCP-compatible client, such as Claude Desktop. Configure the client to launch `synthetic-data-kit-mcp` and communicate with it over stdio.
+
+### Tools Available
+
+The server exposes the following tools:
+
+1. `sdk_ingest` - Parse documents into clean text
+2. `sdk_create` - Generate content from text
+3. `sdk_curate` - Clean and filter content based on quality
+4. `sdk_save_as` - Convert to different formats for fine-tuning
+5. `sdk_system_check` - Check if the selected LLM provider's server is running
+
+Each tool maps directly to the corresponding CLI command in the Synthetic Data Kit.
+
+### Prompts Available
+
+The server also provides prompts:
+
+1. `sdk-workflow` - A complete workflow for generating synthetic data
+
+## Example Usage
+
+Here's an example of how an MCP client might interact with the server:
+
+1. Client requests tool list → Server responds with available tools
+2. Client calls `sdk_ingest` with a document path → Server runs `synthetic-data-kit ingest`
+3. Client calls `sdk_create` with parameters → Server runs `synthetic-data-kit create`
+4. Client calls `sdk_curate` to filter results → Server runs `synthetic-data-kit curate`
+5. Client calls `sdk_save_as` to convert formats → Server runs `synthetic-data-kit save-as`
+
+## Development
+
+To test the MCP server:
+
+```bash
+cd /path/to/synthetic-data-kit
+python test_mcp_server.py
+```
+
+To run a complete example:
+
+```bash
+cd /path/to/synthetic-data-kit
+python example_mcp_usage.py
+```
+
+## Architecture
+
+The MCP server acts as a bridge between MCP clients and the Synthetic Data Kit CLI:
+
+```
+MCP Client ↔ MCP Server ↔ Synthetic Data Kit CLI
+```
+
+All commands are executed as subprocess calls to the CLI, ensuring full compatibility with existing functionality.
+
+## Configuration
+
+The MCP server uses the same configuration as the Synthetic Data Kit CLI. Make sure your `config.yaml` is properly set up before using the server.
+
+## Troubleshooting
+
+If you encounter issues:
+
+1. Ensure the Synthetic Data Kit is properly installed: `pip install -e .`
+2. Verify the CLI works: `synthetic-data-kit --help`
+3. Check that required dependencies are installed
+4. Ensure your LLM provider (vLLM or API endpoint) is properly configured and running
+
+Note: For API endpoints, you'll need to set the appropriate API keys in your environment or configuration file.
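
Note on the Architecture section: the README says each tool maps to a CLI command and is executed as a subprocess call. The server source is not part of this diff, so the following is only a minimal sketch of what such a bridge might look like. `build_argv` and `run_tool` are hypothetical helper names, the tool-to-subcommand pairs are taken from the README's walkthrough, and passing the input path positionally is an assumption. Translating the remaining tool arguments into CLI options is left out, since the option names are not shown here.

```python
import subprocess
from typing import Any, Dict, List

# Tool-to-subcommand pairs as listed in the README's example walkthrough.
TOOL_TO_SUBCOMMAND = {
    "sdk_ingest": "ingest",
    "sdk_create": "create",
    "sdk_curate": "curate",
    "sdk_save_as": "save-as",
}


def build_argv(tool: str, arguments: Dict[str, Any]) -> List[str]:
    """Translate a tool call into a CLI argument vector (hypothetical helper)."""
    if tool not in TOOL_TO_SUBCOMMAND:
        raise ValueError(f"unknown tool: {tool}")
    # Assumption: the input path is positional. Other arguments are not
    # forwarded in this sketch because their CLI option names are not shown.
    return ["synthetic-data-kit", TOOL_TO_SUBCOMMAND[tool], str(arguments["input"])]


def run_tool(tool: str, arguments: Dict[str, Any]) -> str:
    """Run the CLI as a subprocess and return stdout followed by stderr."""
    completed = subprocess.run(
        build_argv(tool, arguments),
        capture_output=True,
        text=True,
        check=False,  # a non-zero exit is reported in the text, not raised
    )
    return completed.stdout + completed.stderr


if __name__ == "__main__":
    print(build_argv("sdk_ingest", {"input": "sample_document.txt"}))
```

Building the argument vector as a list, rather than a shell string, avoids quoting problems when a document path contains spaces.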
diff --git a/example_mcp_usage.py b/example_mcp_usage.py
@@ -0,0 +1,125 @@
+#!/usr/bin/env python3
+"""
+Example script demonstrating how to use the Synthetic Data Kit MCP server.
+This script shows how to process a document using the MCP server.
+"""
+
+import asyncio
+import json
+import sys
+import os
+from typing import Any, Dict, List
+
+from mcp import ClientSession, StdioServerParameters
+from mcp.client.stdio import stdio_client
+from mcp.types import TextContent
+
+
+async def process_document_example():
+    """Example of processing a document using the MCP server."""
+    # Get the current working directory
+    cwd = os.getcwd()
+
+    # Start the MCP server as a subprocess
+    server_params = StdioServerParameters(
+        command=sys.executable,
+        args=["-m", "synthetic_data_kit.mcp_server"],
+        cwd=cwd
+    )
+
+    async with stdio_client(server_params) as (read, write):
+        async with ClientSession(read, write) as session:
+            # Initialize the session
+            await session.initialize()
+
+            # Create a sample text file to process
+            sample_text = """
+            Artificial Intelligence (AI) is a branch of computer science that aims to create software or machines that exhibit human-like intelligence. 
+            This can include learning from experience, understanding natural language, solving problems, and recognizing patterns.
+
+            Machine Learning (ML) is a subset of AI that focuses on algorithms and statistical models that enable computers to improve at tasks 
+            with experience. Deep Learning is a further subset of ML that uses neural networks with multiple layers.
+
+            Natural Language Processing (NLP) is another important area of AI that deals with the interaction between computers and humans 
+            using natural language. It involves tasks like language translation, sentiment analysis, and text summarization.
+
+            Computer Vision is yet another field that enables computers to interpret and understand visual information from the world, 
+            including image and video recognition.
+            """
+
+            # Write the sample text to a file
+            with open("sample_document.txt", "w", encoding="utf-8") as f:
+                f.write(sample_text)
+
+            print("=== Processing Sample Document ===")
+            print("Sample document created: sample_document.txt")
+
+            # 1. Use the ingest tool to process the document
+            print("\n1. Ingesting document...")
+            try:
+                result = await session.call_tool(
+                    "sdk_ingest",
+                    {
+                        "input": "sample_document.txt",
+                        "output_dir": "data/parsed"
+                    }
+                )
+                print(f"Ingest result: {result}")
+            except Exception as e:
+                print(f"Error during ingestion: {e}")
+
+            # 2. Use the create tool to generate QA pairs
+            print("\n2. Creating QA pairs...")
+            try:
+                result = await session.call_tool(
+                    "sdk_create",
+                    {
+                        "input": "data/parsed/sample_document.lance",
+                        "content_type": "qa",
+                        "num_pairs": 5
+                    }
+                )
+                print(f"Create result: {result}")
+            except Exception as e:
+                print(f"Error during creation: {e}")
+
+            # 3. Use the curate tool to filter content
+            print("\n3. Curating content...")
+            try:
+                result = await session.call_tool(
+                    "sdk_curate",
+                    {
+                        "input": "data/generated/sample_document_qa_pairs.json",
+                        "threshold": 7.0
+                    }
+                )
+                print(f"Curate result: {result}")
+            except Exception as e:
+                print(f"Error during curation: {e}")
+
+            # 4. Use the save-as tool to convert format
+            print("\n4. Saving in final format...")
+            try:
+                result = await session.call_tool(
+                    "sdk_save_as",
+                    {
+                        "input": "data/curated/sample_document_cleaned.json",
+                        "format": "alpaca"
+                    }
+                )
+                print(f"Save-as result: {result}")
+            except Exception as e:
+                print(f"Error during format conversion: {e}")
+
+            print("\n=== Document Processing Complete ===")
+
+            # Clean up sample files
+            try:
+                os.remove("sample_document.txt")
+                print("Cleaned up sample_document.txt")
+            except OSError:
+                pass
+
+
+if __name__ == "__main__":
+    asyncio.run(process_document_example())
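
Note on the example script: each step prints the raw result object returned by `session.call_tool`. A small helper could pull out the readable text and surface tool-level failures, which may arrive as a result flagged as an error rather than as a raised exception. The sketch below assumes the result exposes a `content` list whose text items carry a `.text` attribute, plus an `isError` flag, as in the MCP Python SDK's `CallToolResult`. `tool_text` is a hypothetical name, and the stand-in objects exist only so the sketch runs without a server.

```python
from types import SimpleNamespace
from typing import Any


def tool_text(result: Any) -> str:
    """Join the text parts of a tool result; raise if the tool flagged an error."""
    parts = [
        item.text
        for item in result.content
        if getattr(item, "text", None) is not None  # skip non-text content items
    ]
    text = "\n".join(parts)
    if getattr(result, "isError", False):
        raise RuntimeError(text or "tool call failed")
    return text


if __name__ == "__main__":
    # Stand-in for a real result, used only to exercise the helper.
    fake = SimpleNamespace(
        content=[SimpleNamespace(type="text", text="Saved output")],
        isError=False,
    )
    print(tool_text(fake))
```

With a helper like this, a step could stop the pipeline on the first failed tool call instead of continuing with input files that were never produced.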
diff --git a/pyproject.toml b/pyproject.toml
@@ -67,6 +67,7 @@ classifiers = [
 
 [project.scripts]
 synthetic-data-kit = "synthetic_data_kit.cli:app"
+synthetic-data-kit-mcp = "synthetic_data_kit.mcp_server:main"
 
 [tool.hatch.build.targets.wheel]
 packages = ["synthetic_data_kit"]
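
Note on the new console script: with `synthetic-data-kit-mcp` registered as an entry point, a client can launch the server by command name. The sketch below generates a client configuration entry, assuming the `mcpServers` layout that Claude Desktop reads from its JSON config file. The server label `synthetic-data-kit` and the function name are illustrative choices, not part of this PR.

```python
import json
from typing import Any, Dict


def mcp_client_config(server_name: str = "synthetic-data-kit") -> Dict[str, Any]:
    """Build a client config entry that launches the server over stdio."""
    return {
        "mcpServers": {
            server_name: {
                # Console script registered in pyproject.toml by this change.
                "command": "synthetic-data-kit-mcp",
                "args": [],
            }
        }
    }


if __name__ == "__main__":
    # Print the JSON so it can be merged into the client's config file by hand.
    print(json.dumps(mcp_client_config(), indent=2))
```

If the package is installed in a virtual environment, the client may need the absolute path to the script in `command`, since it might not inherit the shell's `PATH`.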