MiniCheck-Fork/synthetic_data_gen at main · HotLoverGirl69/MiniCheck-Fork

Synthetic Data Generation

This directory contains code for generating synthetic training data with the C2D and D2C methods. The goal is to create data that requires multi-sentence, multi-fact reasoning to check claims against documents.

To use the OpenAI API for generating synthetic data, add your OPENAI_API_KEY to the synthetic_data_gen/prompt_utils.py file.

C2D Data Generation

We start with any human-written claim. The goal is to generate synthetic documents that require a model to check multiple facts in the claim, each against multiple sentences in the document.

We provide the C2D code for generating (doc, claim, label) triples for a given claim. We use the claim mentioned in our work "By this date, over 5,000 members of the caravan were staying at the Tijuana Stadium — a structure with a capacity of 3,000." as an example:

cd synthetic_data_gen
python C2D_gen.py --claim "By this date, over 5,000 members of the caravan were staying at the Tijuana Stadium — a structure with a capacity of 3,000."
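To generate data for several claims, the command above can be wrapped in a small loop. The following is a minimal sketch and not part of the repository: `build_c2d_command` and `run_c2d` are hypothetical helper names, and the only interface assumed is the `--claim` flag shown above. It also assumes the working directory is `synthetic_data_gen`.

```python
import subprocess
import sys
from typing import List


def build_c2d_command(claim: str) -> List[str]:
    """Build the argument list for one C2D_gen.py run (hypothetical helper)."""
    # Passing a list instead of a shell string avoids quoting problems
    # with claims that contain quotes, commas, or dashes.
    return [sys.executable, "C2D_gen.py", "--claim", claim]


def run_c2d(claims: List[str]) -> None:
    """Run C2D_gen.py once per claim; assumes cwd is synthetic_data_gen."""
    for claim in claims:
        # check=True stops the loop if one run exits with an error.
        subprocess.run(build_c2d_command(claim), check=True)
```

Calling `run_c2d(["first claim", "second claim"])` would launch one generation run per claim. Each run uses the OpenAI API, so the cost grows with the number of claims.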

D2C Data Generation

We start with any human-written document. The goal is to generate claims and pair them with portions of the human-written document so that, once again, checking the claims requires multi-sentence, multi-fact reasoning. We assume the document is around 500 words long. Longer documents generally make data generation more costly because of the "leave-one-sent-out" document augmentation method.
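The cost remark follows from how a leave-one-out scheme scales: a document with n sentences yields n augmented variants. The sketch below only illustrates that idea. It is an assumption about what "leave-one-sent-out" means, not the code in `D2C_gen.py`, and the regex sentence splitter and both function names are hypothetical.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def leave_one_sentence_out(sentences: List[str]) -> List[str]:
    """Return one document variant per sentence, each omitting that sentence."""
    return [
        " ".join(sentences[:i] + sentences[i + 1:])
        for i in range(len(sentences))
    ]


# Toy document with three sentences, so three variants are produced.
toy_doc = "A is first. B is second. C is third."
variants = leave_one_sentence_out(split_sentences(toy_doc))
```

Here `variants` holds three strings, one for each removed sentence. A real document of around 500 words would produce one variant per sentence, which is why longer inputs become expensive.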

We provide the D2C code for generating (doc, claim, label) triples for a given document. We use the document mentioned in our work as an example. The original website for the document can be found here.

cd synthetic_data_gen
python D2C_gen.py --doc_path D2C-doc-example.txt
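Because the method assumes documents of roughly 500 words, a quick length check before running can flag inputs that would be costly. This is a minimal sketch: the helper names and the 600-word threshold are illustrative assumptions, not values taken from the repository.

```python
from pathlib import Path


def word_count(text: str) -> int:
    """Count whitespace-separated tokens."""
    return len(text.split())


def check_doc_length(doc_path: str, max_words: int = 600) -> bool:
    """Return True if the document is within the (assumed) word budget."""
    n_words = word_count(Path(doc_path).read_text(encoding="utf-8"))
    if n_words > max_words:
        print(f"{doc_path}: {n_words} words; generation may be costly.")
        return False
    return True
```

For example, `check_doc_length("D2C-doc-example.txt")` could be called before the command above to decide whether a document should be shortened first.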

Citation

If you find our work useful, please consider citing it.

@misc{tang2024minicheck,
      title={MiniCheck: Efficient Fact-Checking of LLMs on Grounding Documents}, 
      author={Liyan Tang and Philippe Laban and Greg Durrett},
      year={2024},
      eprint={2404.10774},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
