For generating tool2vec embeddings, you need to provide the path to the dataset, output path, output filename and model. The available models are: text-embedding-3-small, e5-small, e5-base, e5-large, and mxbai-large.
To use OpenAI model, you need to set the AZURE_ENDPOINT, AZURE_OPENAI_API_KEY, and AZURE_OPENAI_API_VERSION environment variables.
For other models, you need to provide the path to the model checkpoint.
For example, to generate embeddings using the text-embedding-3-small model with the dataset located at ...json and save the embeddings to ...pkl, you can run the following command:
AZURE_ENDPOINT="..."
AZURE_OPENAI_API_KEY="..."
AZURE_OPENAI_API_VERSION="..."
DATA_PATH="...json"
OUTPUT_PATH="..."
OUTPUT_FILE_NAME="....pkl"
python embedding_generator.py \
--data_path ${DATA_PATH} \
--output_path "..." \
--output_file_name "..." \
--model "azure"
To generate embeddings using the e5-small model with the dataset located at ...json and save the embeddings to ...pkl, you can run the following command:
DATA_PATH="...json"
OUTPUT_PATH="..."
OUTPUT_FILE_NAME="....pkl"
python embedding_generator.py \
--data_path ${DATA_PATH} \
--output_path "..." \
--output_file_name "..." \
--model "e5-small" \
--checkpoint_path "..." \
--use_checkpoint
VAL_DATA_PATH="...json"
T2V_EMBEDDING_DATA_PATH="...pkl"
OUTPUT_FILE_NAME="....txt"
python evaluate_t2v_embedding.py \
--valid_data_path ${VAL_DATA_PATH} \
--tool_embedding_dir ${T2V_EMBEDDING_DATA_PATH} \
--output_file_name ${OUTPUT_FILE_NAME}