This repository preprocesses Solidity contract data for training a Solidity code generation and completion model.
The main dataset is about 43 GB, which provides a solid base for training an LLM. To keep vulnerable code and additional bias out of the training data, however, the processing strategy first compiles each project or Solidity file.
Run
$ python build_dataset.py
to run Slither on all the .sol files. Slither is applied to every individual Solidity file in each contract directory. Project-wide Slither analysis is not supported yet.
- First, extract the contract directories
- Filter the directories and keep only those that contain Solidity source files
- Hash processed files so that no file is processed twice and duplicates are skipped
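The filtering and hashing steps above can be sketched as follows. This is a minimal illustration, not the code in build_dataset.py: the function names, the use of SHA-256, and the assumption that each contract directory sits directly under one root folder are all hypothetical.

```python
import hashlib
from pathlib import Path
from typing import List, Optional, Set


def has_solidity_sources(directory: Path) -> bool:
    """Return True if the directory contains at least one .sol file."""
    return any(directory.rglob("*.sol"))


def file_digest(path: Path) -> str:
    """SHA-256 of the file contents, used as a deduplication key (assumed choice)."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def collect_unique_sources(root: Path, seen: Optional[Set[str]] = None) -> List[Path]:
    """Walk the contract directories under root and return each distinct .sol file once."""
    seen = set() if seen is None else seen
    unique = []
    for directory in sorted(p for p in root.iterdir() if p.is_dir()):
        if not has_solidity_sources(directory):
            continue  # skip directories without Solidity sources
        for sol_file in sorted(directory.rglob("*.sol")):
            digest = file_digest(sol_file)
            if digest in seen:
                continue  # already processed, or same content as an earlier file
            seen.add(digest)
            unique.append(sol_file)
    return unique
```

Passing the same `seen` set across runs (for example after loading it from disk) would let a later run skip files that were already processed.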
You can follow the instructions in the docs and contact Kaan Uzdogan for the credentials.
Slither is used to detect vulnerabilities in the Solidity source code. See Slither. Keep the list of detectors in the detectors.json file in line with the latest Slither version.
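As a rough sketch of how a detector list could drive a Slither run on a single file, the snippet below builds the command line from a JSON file. The layout of detectors.json is not documented here, so the `{"detectors": [...]}` shape and both function names are assumptions; only the Slither `--detect` and `--json` options are taken from the Slither CLI itself.

```python
import json
from pathlib import Path
from typing import List


def load_detectors(config_path: Path) -> List[str]:
    """Read detector names from a JSON file (assumed layout: {"detectors": [...]})."""
    data = json.loads(config_path.read_text())
    return list(data["detectors"])


def build_slither_command(sol_file: str, detectors: List[str]) -> List[str]:
    """Build an argv list that runs Slither on one file and prints JSON to stdout."""
    command = ["slither", sol_file, "--json", "-"]
    if detectors:
        # Restrict the run to the listed detectors (comma-separated).
        command += ["--detect", ",".join(detectors)]
    return command
```

The resulting list can be passed to `subprocess.run(command, capture_output=True, text=True)` once Slither and a matching solc version are installed.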
Consider installing all solc versions, since the source files may need different versions for compilation. The Slither process can use any installed Solidity version. The latest version at the time of writing is 0.8.20.
$ pip install solc_select
$ solc-select install all
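Because each source file may require a different compiler, one way to pick a version is to read it from the file's `pragma solidity` line. The helper below is a heuristic sketch with a hypothetical name, not the logic used by build_dataset.py: it returns the first explicit `x.y.z` version in the pragma, which is the lower bound for constraints such as `^0.8.0` or `>=0.6.0 <0.8.0` but can be wrong for strict constraints such as `>0.6.0`.

```python
import re
from typing import Optional

# Matches the version expression of a "pragma solidity ...;" directive.
PRAGMA_RE = re.compile(r"pragma\s+solidity\s+([^;]+);")
# Matches a full three-part version number such as 0.8.20.
VERSION_RE = re.compile(r"\d+\.\d+\.\d+")


def pinned_solc_version(source: str) -> Optional[str]:
    """Return the first explicit x.y.z version in the pragma, or None if absent."""
    pragma = PRAGMA_RE.search(source)
    if pragma is None:
        return None
    version = VERSION_RE.search(pragma.group(1))
    return version.group(0) if version else None
```

The returned version could then be activated with `solc-select use <version>` before compiling the file.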
The source code provided here supports fine-tuning a causal LLM.
$ accelerate launch --num_cpu_threads_per_process 8 fine_tune.py
Important: the Hugging Face token must have write access.
export HF_TOKEN="your write access token"