Contributing a Dataset to EuroEval

This guide will walk you through the process of contributing a new dataset to EuroEval.

For general contribution guidelines, please refer to our Contributing Guide.

If you have any questions during this process, please open an issue on the EuroEval GitHub repository.

Step 0: Prerequisites

Before beginning:

Check if your dataset matches one of the supported tasks. If your dataset doesn't match any supported task, you have two options:
1. Try to adapt it to fit an existing task (e.g., by reformatting it or adding multiple choice options)
2. Open an issue on the EuroEval repository requesting to add a new task type
If it does, fork the EuroEval repository and create a new branch to work on your dataset contribution

Step 1: Create the Dataset Processing Script

Create a script in the src/scripts directory that processes your dataset into the EuroEval format.

The dataset creation script roughly follows this pattern:

# Load your dataset.
raw_dataset = load_dataset("path_to_your_dataset")

# Process the dataset to fit the EuroEval format.
dataset = process_raw_dataset(raw_dataset=raw_dataset)

# Push the dataset to the Hugging Face Hub.
dataset.push_to_hub("EuroEval/your_dataset_name", private=True)

Tips for Dataset Processing

Examine existing scripts for datasets with the same task for a reference on how to process your dataset.
Take a look at existing datasets in your language to see how these are usually set up. Study these examples to understand the expected format and structure for your own dataset's entries.
Split your dataset into train / val / test sets, ideally with 1,024 / 256 / 2,048 samples, respectively
If your dataset already has splits, maintain consistency (e.g., the EuroEval train split should be a subset of the original train split)

Step 2: Add Dataset Configuration

Dataset configurations in EuroEval are organised by language, with each language having its own file at src/euroeval/dataset_configs/{language}.py. A configuration is made with the DatasetConfig class. Here is an example for the fictive English Knowledge dataset Rizzler.

RIZZLER_KNOWLEDGE_CONFIG = DatasetConfig(
    name="rizzler_knowledge", # The name of the dataset
    pretty_name="the truncated version of the English knowledge dataset Rizzler", # The pretty name of the dataset used in logs.
    huggingface_id="EuroEval/rizzler_knowledge", # The same id as used in the dataset creation script
    task=KNOW, # The task of the dataset
    languages=[EN], # The language of the dataset
    unofficial=True, # Whether the dataset is unofficial
)

Every src/euroeval/dataset_configs/{language}.py file has two sections:

### Official datasets ###
### Unofficial datasets ###

An unofficial dataset means that the resulting evaluation will not be included in the official leaderboard.

As a starting point, make your dataset unofficial. This can always be changed later.

Step 3: Document Your Dataset

Dataset documentation in EuroEval is organised by language, with each language having its own file at docs/datasets/{language}.md. Within each language file, documentation is further organised by task.

Navigate to the documentation file for your dataset's language and add your dataset's documentation in the appropriate task section.

The documentation should include the following information:

General description: Explain the dataset's origin and purpose
Split details: Describe how splits were created and their sizes
Example samples: Provide 3 representative examples from the training split
Evaluation setup: Explain how models are evaluated on this dataset
Evaluation command: Show how to evaluate a model on your dataset

To do this, you can follow these steps:

Find an existing dataset of the same task in docs/datasets/{language}.md
Copy the entire documentation section for that dataset
Use this as a template and modify all details to match your new dataset
Ensure you update all dataset-specific information (description, split sizes, example samples, etc.)

Step 4: Modify the Change Log

After completing the previous steps, add an entry to the project's changelog to document your contribution. The entry should be added under the [Unreleased] section with a short description of the dataset you have added. Here is an example of a new entry.

## [Unreleased]
### Added
- Added the English knowledge dataset [rizzler_knowledge](https://huggingface.co/datasets/Example-User/rizzler_knowledge). The split is given by 1,024 / 256 / 2,048 samples for train / val / test, respectively. It is marked as `unofficial` for now. This was contributed by [@your_name](https://github.com/your_name) ✨

Step 5: Make a Pull Request

When you have completed all the previous steps, create a pull request to the EuroEval repository.

Thank you

This concludes the process of contributing a dataset to EuroEval. Your contribution helps expand the multilingual evaluation capabilities of the benchmark and is greatly appreciated by the research community!

Thank you for your valuable contribution! 🎉

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Contributing a Dataset to EuroEval

Step 0: Prerequisites

Step 1: Create the Dataset Processing Script

Tips for Dataset Processing

Step 2: Add Dataset Configuration

Step 3: Document Your Dataset

Step 4: Modify the Change Log

Step 5: Make a Pull Request

Thank you

FilesExpand file tree

NEW_DATASET_GUIDE.md

Latest commit

History

NEW_DATASET_GUIDE.md

File metadata and controls

Contributing a Dataset to EuroEval

Step 0: Prerequisites

Step 1: Create the Dataset Processing Script

Tips for Dataset Processing

Step 2: Add Dataset Configuration

Step 3: Document Your Dataset

Step 4: Modify the Change Log

Step 5: Make a Pull Request

Thank you