Sourcify Data Processing

The Solidity contract data is preprocessed here for training a Solidity code generation and completion model.

Data Processing Approach

The main dataset is ~43 GB, which provides a solid base for training an LLM. However, to avoid vulnerabilities and further bias, the processing strategy first compiles each project or Solidity file.

Run

$ python build_dataset.py

to run Slither on all the .sol files. Slither is applied to each individual Solidity file in every contract directory. Project-wide Slither analysis is not supported yet.

First, extract the contract directories.
Filter the directories and keep only those with Solidity source files.
Hash processed files to avoid repeated processing and duplicates.
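
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the code in build_dataset.py: the function names, the choice of SHA-256, and the assumption that each contract lives in an immediate subdirectory of a root folder are all hypothetical.

```python
import hashlib
from pathlib import Path


def find_contract_dirs(root: Path) -> list[Path]:
    """Return immediate subdirectories of root that contain at least one .sol file."""
    return sorted(
        d for d in root.iterdir()
        if d.is_dir() and any(d.rglob("*.sol"))
    )


def file_digest(path: Path) -> str:
    """SHA-256 of the file contents, used as the duplicate key (assumed hash choice)."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def unique_sol_files(root: Path) -> list[Path]:
    """Collect .sol files from the kept directories, skipping repeated contents."""
    seen: set[str] = set()
    unique: list[Path] = []
    for contract_dir in find_contract_dirs(root):
        for sol_file in sorted(contract_dir.rglob("*.sol")):
            digest = file_digest(sol_file)
            if digest in seen:
                continue  # identical content was already processed
            seen.add(digest)
            unique.append(sol_file)
    return unique
```

Hashing file contents rather than file names means the same contract deployed under several addresses is processed only once.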

Download Raw Data

You can follow the instructions in the docs and contact Kaan Uzdogan for the credentials.

Slither

Slither is used to detect vulnerabilities in the Solidity source code; see the Slither documentation. Keep the detector list in the detectors.json file up to date with the current Slither version.
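
As an illustration of how a detector list could drive a single-file run, the sketch below builds a Slither command line using the `--detect` and `--json` options of the Slither CLI. The layout of detectors.json is not described here, so the sketch assumes a flat JSON list of detector names; `load_detectors` and `slither_command` are hypothetical helpers, not functions from this repository.

```python
import json
from pathlib import Path


def load_detectors(path: Path) -> list[str]:
    """Read detector names; assumes the file holds a flat JSON list of strings."""
    names = json.loads(path.read_text())
    if not isinstance(names, list) or not all(isinstance(n, str) for n in names):
        raise ValueError("expected a JSON list of detector names")
    return names


def slither_command(sol_file: Path, detectors: list[str]) -> list[str]:
    """Build a Slither CLI call for one file, with JSON results written to stdout."""
    cmd = ["slither", str(sol_file)]
    if detectors:
        # --detect takes a comma-separated list of detector names
        cmd += ["--detect", ",".join(detectors)]
    cmd += ["--json", "-"]  # "-" sends the JSON report to stdout
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, capture_output=True, text=True)`. Depending on version and flags, Slither's exit code may be non-zero when detectors report findings, so a non-zero code does not necessarily mean the run failed.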

Environment

Consider installing all solc versions, as the source files might need different versions for compilation. The Slither run may use any available Solidity version. The latest version at the time of writing is 0.8.20.

$ pip install solc_select
$ solc-select install all
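
Because each file may need its own compiler, one possible way to choose a version is to read the file's pragma and hand the result to `solc-select use`. The sketch below is a simple heuristic under stated assumptions, not the logic used in this repository: it takes the first three-part version whose operator includes that version itself, and it does not check the choice against every constraint in a compound pragma. Both function names are hypothetical.

```python
import re
from typing import Optional

PRAGMA_RE = re.compile(r"pragma\s+solidity\s+([^;]+);")
CONSTRAINT_RE = re.compile(r"(\^|~|>=|<=|>|<|=)?\s*(\d+\.\d+\.\d+)")
# Operators for which the stated version itself satisfies the constraint.
INCLUSIVE_OPS = {"", "^", "~", ">=", "="}


def pick_solc_version(source: str) -> Optional[str]:
    """Return a candidate solc version from the first pragma, or None if unclear."""
    match = PRAGMA_RE.search(source)
    if match is None:
        return None
    for op, version in CONSTRAINT_RE.findall(match.group(1)):
        if op in INCLUSIVE_OPS:
            return version
    return None  # only exclusive bounds such as "<0.6.0" were found


def solc_select_command(version: str) -> list[str]:
    """Command that switches the active compiler via solc-select."""
    return ["solc-select", "use", version]
```

For example, `pick_solc_version("pragma solidity >=0.4.22 <0.9.0;")` returns `"0.4.22"`, the lower bound of the range. Files for which the function returns None would need a fallback, such as a default version.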

Fine Tuning

The source code provided here supports fine-tuning a causal LLM.

$ accelerate launch --num_cpu_threads_per_process 8 fine_tune.py

Set up the Hugging Face token

Important: The token must have write access

$ export HF_TOKEN="your write access token"
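
A script can check for the token before a long training run starts, so that a missing variable fails early rather than at upload time. The helper below is a hypothetical sketch using only the standard library; whether fine_tune.py performs such a check is not stated here.

```python
import os


def require_hf_token(env=None) -> str:
    """Return HF_TOKEN from the environment, or raise with a clear message."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError("HF_TOKEN is not set; export a token with write access")
    return token
```

The returned value could then be passed on, for instance to `huggingface_hub.login(token=...)`. Note that this check only confirms the variable is present; it cannot tell whether the token actually has write access.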
