This Fine Tuning Dataset Creation Toolkit helps you create the JSONL dataset files needed for fine-tuning Completions models like `babbage-002` or `davinci-002` and Chat Completions models like `gpt-35-turbo-0613` from XLSX and CSV files. These JSONL dataset files can be used to fine-tune models in OpenAI or Azure OpenAI.
- Python 3.10+ installed
- `virtualenv` package installed in Python
- Create and activate a virtual environment
- Install dependencies
- Create a dataset in an XLSX or CSV file
- To create a virtual environment, open a terminal in your working directory and execute this command: `python -m venv .venv`
- To activate the virtual environment, execute this command in the terminal: `./.venv/Scripts/activate` (on Windows; on Linux/macOS use `source .venv/bin/activate`)
- To install the dependencies needed to run this kit, execute this command in the terminal: `pip install -r requirements.txt`
- To create a dataset, create an XLSX or CSV file. You can use the XLSX and CSV files inside the `Sample` folder as a reference.
- This XLSX or CSV file needs to be in your working directory.
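For illustration, a minimal Completions-style dataset file might look like the CSV below. The exact column names the formatter scripts expect should be taken from the files in the `Sample` folder; the `prompt`/`completion` headers here are an assumption:

```csv
prompt,completion
What is the capital of France?,Paris
Who wrote Hamlet?,William Shakespeare
```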
[IMPORTANT]
- Convert an XLSX/CSV dataset file to a JSONL dataset file
- Validate a JSONL dataset file
- Analyze a JSONL dataset file [FOR COMPLETIONS JSONL DATASETS ONLY]

[EXTRA]
- Convert JSONL dataset files back to XLSX and CSV files
Based on the model your JSONL dataset file will be targeting for fine-tuning, there are different scripts that you can use:

- If you are creating a dataset for Completions models like `babbage-002` or `davinci-002`, the script you should use is `Completions Dataset Formatter.py`. This is how you should execute it: `python 'Completions Dataset Formatter.py' [XLSX/CSV Filename]`
- If you are creating a dataset for Chat Completions models like `gpt-35-turbo-0613`, the script you should use is `Chat Completions Dataset Formatter.py`. This is how you should execute it: `python 'Chat Completions Dataset Formatter.py' [XLSX/CSV Filename]`
- Both of them will create a JSONL dataset file in your working directory with the same name as the input XLSX/CSV file.
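The conversion these formatter scripts perform can be sketched in a few lines: read each row of the tabular file and emit one JSON object per line. The snippet below is only an illustration of the Completions case, assuming `prompt`/`completion` columns, not the toolkit's actual implementation:

```python
import csv
import json

def csv_to_completions_jsonl(csv_path: str, jsonl_path: str) -> int:
    """Convert a prompt/completion CSV into a JSONL file; returns the row count."""
    count = 0
    with open(csv_path, newline="", encoding="utf-8") as src, \
         open(jsonl_path, "w", encoding="utf-8") as dst:
        for row in csv.DictReader(src):
            # One JSON object per line is what makes the file JSONL.
            dst.write(json.dumps({"prompt": row["prompt"],
                                  "completion": row["completion"]}) + "\n")
            count += 1
    return count
```

For XLSX input, the real scripts would additionally need a spreadsheet reader such as `openpyxl`; the JSONL output shape is the same either way.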
- To validate JSONL files, you can make use of the `JSONL Validator.py` script.
- This script will output `Valid` or `Invalid` based on the input file.
- This is how you should execute it: `python 'JSONL Validator.py' [JSONL Filename]`
- This script can validate JSONL dataset files created for both Completions and Chat Completions models.
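At its core, JSONL validation just checks that every non-empty line parses as a JSON object. A minimal sketch of that idea (the actual script may apply stricter, model-specific checks):

```python
import json

def validate_jsonl(path: str) -> str:
    """Return 'Valid' if every non-empty line is a JSON object, else 'Invalid'."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            try:
                if not isinstance(json.loads(line), dict):
                    return "Invalid"  # parses as JSON, but is not an object
            except json.JSONDecodeError:
                return "Invalid"
    return "Valid"
```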
- This is only applicable to datasets created for Completions models like `babbage-002` or `davinci-002`.
- To analyze the JSONL dataset files, execute this command in your terminal: `openai tools fine_tunes.prepare_data -f [JSONL Filename]`
- More details can be found [here]
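For reference, the Completions format that `prepare_data` analyzes is one JSON object per line with `prompt` and `completion` keys (the example values below are purely illustrative):

```json
{"prompt": "What is the capital of France?", "completion": " Paris"}
```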
Based on the model your JSONL dataset file was targeting for fine-tuning, there are different scripts that you can use:

- To create XLSX and CSV files from a JSONL file that was created for fine-tuning Completions models like `babbage-002` or `davinci-002`, the script you should use is `Completions - JSONL to CSV and XLSX.py`. This is how you should execute it: `python 'Completions - JSONL to CSV and XLSX.py' [JSONL Filename]`
- To create XLSX and CSV files from a JSONL file that was created for fine-tuning Chat Completions models like `gpt-35-turbo-0613`, the script you should use is `Chat Completions - JSONL to CSV and XLSX.py`. This is how you should execute it: `python 'Chat Completions - JSONL to CSV and XLSX.py' [JSONL Filename]`
- Both of them will create an XLSX and a CSV file in your working directory with the same name as the input JSONL file.
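The reverse conversion can likewise be sketched with the standard library: parse each JSONL line and write the records out as rows. This illustrative snippet covers only the CSV half (writing XLSX would additionally need a library such as `openpyxl`) and takes its column names from the first record, which is an assumption about how the real scripts behave:

```python
import csv
import json

def jsonl_to_csv(jsonl_path: str, csv_path: str) -> int:
    """Flatten a JSONL file into a CSV, one column per key; returns the row count."""
    with open(jsonl_path, encoding="utf-8") as f:
        rows = [json.loads(line) for line in f if line.strip()]
    # Column order comes from the first record's keys.
    fieldnames = list(rows[0].keys()) if rows else []
    with open(csv_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.DictWriter(out, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```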
This toolkit saved me a lot of time when creating dataset files for fine-tuning jobs. If it helps you save time too, please share it with your friends and colleagues. Don't forget to give it a 🌟. Feel free to raise an issue or send a PR for improvements.