A curated collection of 361 example annotation task configurations for Potato, the lightweight annotation tool for NLP research. Covers all 22 Potato annotation types, 90 SemEval shared tasks (2013-2025), and benchmarks from ACL, EMNLP, NeurIPS, ICML, ICLR, CVPR, and more.

Tasks are organized into ten top-level categories:

| Category | Description | Tasks |
|---|---|---|
| text/ | Text-based NLP tasks (emotion, NER, IE, QA, parsing, etc.) | 111 |
| image/ | Image annotation (classification, VQA, grounding, medical) | 33 |
| video/ | Video annotation (action recognition, QA, summarization) | 37 |
| audio/ | Audio annotation (transcription, commands, captioning) | 17 |
| evaluation/ | AI output evaluation (LLM judging, code, benchmarks) | 23 |
| preference-learning/ | RLHF, DPO, and preference annotation tasks | 18 |
| multimodal/ | Cross-modal tasks (robotics, chart analysis, science QA) | 9 |
| agentic/ | Agent evaluation tasks (web agents, code agents) | 3 |
| semeval/ | SemEval shared tasks (2013-2025, 90 tasks) | 90 |
| templates/ | Generic reusable annotation templates | 20 |
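
The per-category counts above can be checked against a local clone. The following is a minimal sketch, assuming each task folder contains exactly one `config.yaml`; the helper name `count_tasks` is hypothetical and not part of Potato or this repository.

```bash
#!/usr/bin/env bash
# Print "<category>/<TAB><count>" for each top-level folder under a root.
# Assumption: one config.yaml per task folder, so counting those files
# approximates the number of tasks in the category.
count_tasks() {
  local root="${1:-.}" dir
  for dir in "$root"/*/; do
    [ -d "$dir" ] || continue
    printf '%s\t%s\n' "$(basename "$dir")/" \
      "$(find "$dir" -type f -name config.yaml | wc -l | tr -d ' ')"
  done
}
```

From the directory that holds the clone, `count_tasks potato-showcase` would list one line per category.
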

Text tasks (text/) by subcategory:

| Subcategory | Tasks | Examples |
|---|---|---|
| Emotion & Sentiment | 8 | GoEmotions, SemEval Sentiment, Multirate Sentiment |
| Hate Speech & Moderation | 6 | HateXplain, Implicit Hate, Social Bias Frames, Toxic Spans |
| Named Entity Recognition | 5 | CoNLL-2003, WNUT-2017, Biomedical NER, Complex NER |
| Information Extraction | 7 | KG-BERT, Event Arguments, Dialogue Relations |
| Argumentation & Stance | 5 | Argument Quality, Stance Detection, Rumor Stance |
| Fact Verification | 8 | FActScore, FAVA, Scientific Claims, Propaganda |
| Commonsense & Ethics | 5 | Social Chemistry, Moral Stories, Commonsense Inference |
| Explainability | 2 | Rationale Annotation, NLI Explanation |
| Dialogue | 2 | SWBD-DAMSL Dialogue Acts, Conversation Quality |
| Political & Media | 1 | Political Discourse |
| Discourse | 3 | PDTB Discourse Trees, DISRPT, Timeline Relations |
| Coreference | 4 | OntoNotes, CorefUD, MAVEN-ERE, Legal Coreference |
| Cross-lingual | 5 | XNLI, Belebele, FLORES MT Quality, IndicNLP |
| Domain-specific | 8 | BioNLP, ChemProt, Clinical NER, Legal, Medical |
| Computational Social Science | 7 | OffensEval, Moral Foundations, Politeness, Media Frames |
| Relation Extraction | 6 | MultiTACRED, CrossRE, RadGraph, SciER |
| Entity Linking | 2 | AIDA-CoNLL, MedMentions |
| Code Annotation | 1 | CodeXGLUE Defect Detection |
| Tabular | 1 | Tabular Data Annotation |
| Reading Comprehension | 1 | SQuAD Extractive QA |
| Natural Language Inference | 2 | SNLI, MultiNLI |
| Question Answering | 2 | Natural Questions, TriviaQA |
| Information Retrieval | 2 | MS MARCO, TREC-DL |
| Semantic Similarity | 1 | STS Benchmark |
| Word Sense | 1 | SemEval-2007 WSD |
| Parsing | 1 | Universal Dependencies |
| Education | 3 | Essay Scoring, MathDial Tutoring, Student Essay Discourse |
| Financial | 3 | FinBERT, FLARE NER, Financial PhraseBank |
| Bias & Toxicity | — | See subcategories |

Image tasks (image/) by subcategory:

| Subcategory | Tasks | Examples |
|---|---|---|
| Classification | 6 | MS-COCO, ImageNet, Places365, CUB-200 |
| Segmentation | 3 | Cityscapes, ADE20K, LIP Human Parsing |
| Visual QA | 2 | VQAv2, TextVQA |
| Visual Grounding | 1 | RefCOCO |
| Medical Imaging | 3 | CheXpert, MIMIC-CXR, Camelyon Pathology |
| Human Pose | 1 | ViTPose Keypoint Annotation |
| Generation Evaluation | 1 | T2I-CompBench |
| Autonomous Driving | 2 | KITTI, BDD100K |
| Aerial & Remote Sensing | 3 | BigEarthNet, xView, DOTA |
| Specialized Domains | 6 | MVTec-AD, DeepFashion, CelebA, iWildCam |
| Document Analysis | 3+ | DocLayNet, OmniDocBench, SA-1B |

Video tasks (video/) by subcategory:

| Subcategory | Tasks | Examples |
|---|---|---|
| Action Recognition | 8 | AVA, Charades, THUMOS14, EPIC-KITCHENS |
| Temporal Grounding | 3 | ActivityNet Captions, DiDeMo, Charades-STA |
| Video Summarization | 4 | TVSum, SumMe, YouTube Highlights, LSMDC |
| Boundary Detection | 3 | Scene/Shot Boundary, MovieScenes |
| Video QA | 2 | NExT-QA, MVBench |
| Scene Understanding | 1 | MovieNet Scene Classification |
| Instructional Video | 2 | HowTo100M, YouCook2 |
| Other Video Tasks | 14 | Video-ChatGPT, Sign Language, Child Language, etc. |

Audio tasks (audio/):

| Task | Description |
|---|---|
| librispeech-transcription | Audio quality + transcription (slider, audio_annotation) |
| speech-commands-recognition | Speech command labeling (audio_annotation) |
| covost-speech-translation | Speech translation evaluation |
| clotho-audio-captioning | Audio event captioning |
| audio-transcription | Speech transcription review |
| speaker-diarization | Speaker identification |
| emotion-recognition | Speech emotion classification |
| music-genre-classification | Music genre tagging |
| + 9 more | DiSPLACE, DoReCo, EmoBox, VoiceMOS, etc. |

AI evaluation tasks (evaluation/):

| Task | Paper | Types |
|---|---|---|
| wildbench-llm-eval | WildBench (COLM 2024) | pairwise, likert, text |
| mt-bench-judge-consistency | MT-Bench (NeurIPS 2023) | pairwise, likert, radio |
| arena-hard-auto | Arena Hard (2024) | pairwise (scale), likert |
| rewardbench-reward-eval | RewardBench (ICML 2024) | pairwise, radio, multirate |
| mmlu-knowledge-eval | MMLU (ICLR 2021) | radio, text |
| humaneval-code-correctness | HumanEval (2021) | radio, text, number |
| gpqa-expert-qa | GPQA (ICLR 2024) | number, radio, text |
| big-bench-task-eval | BIG-Bench (TMLR 2023) | radio, text, number |
| helm-model-card-display | HELM (TMLR 2023) | pure_display, likert |
| chatbot-arena-pairwise-bws | Chatbot Arena (ICML 2024) | bws, pairwise |
| + 13 more | AlpacaEval, DoNotAnswer, ESA-MT, IFEval, etc. |

Preference-learning tasks (preference-learning/):

| Task | Paper | Types |
|---|---|---|
| dpo-preference-data | DPO (NeurIPS 2023) | pairwise, radio, text |
| ultrafeedback-multiaspect | UltraFeedback (ICML 2024) | multirate, likert, text |
| spin-self-play | SPIN (ICML 2024) | pairwise, radio |
| constitutional-ai-harmlessness | Constitutional AI (2022) | radio, likert, text |
| mmlu-pro-tiered-eval | MMLU-Pro (NeurIPS 2024) | tiered_annotation, radio |
| + 13 more | HH-RLHF, SafeRLHF, BeaverTails, WebGPT, etc. |

The semeval/ category covers 90 SemEval shared tasks from 2013-2025. See SEMEVAL.md for details.

| Year | Tasks | Highlights |
|---|---|---|
| 2025 | 10 | Multimodal idiomaticity, entity-aware MT, emotion detection |
| 2024 | 9 | Semantic relatedness, persuasion in memes, BRAINTEASER |
| 2023 | 10 | Visual WSD, clickbait spoiling, AfriSenti |
| 2022 | 10 | Patronizing language, idiomaticity, news similarity |
| 2021 | 9 | Lexical complexity, humor detection, MeasEval |
| 2020 | 9 | Commonsense validation, counterfactuals, code-mixed |
| 2019 | 7 | HatEval, hyperpartisan news, suggestion mining |
| 2018 | 10 | Emoji prediction, irony, cybersecurity NER |
| 2017 | 5 | Financial sentiment, humor, pun detection |
| 2016 | 7 | Stance detection, aspect sentiment, clinical TempEval |
| 2013-2015 | 4 | Drug interactions, ABSA, timeline ordering, clinical |

All 22 Potato annotation types are represented:

| Type | Count | Example Tasks |
|---|---|---|
| radio | 483 | GoEmotions, SNLI, MMLU, most classification tasks |
| text | 160 | SQuAD, Natural Questions, code review, translations |
| likert | 128 | STS-B, essay scoring, MT quality, humor ratings |
| multiselect | 126 | GoEmotions, moral foundations, persuasion techniques |
| span | 110 | NER tasks, PICO extraction, SQuAD answer spans |
| video_annotation | 46 | Action recognition, temporal grounding, MVBench |
| pairwise | 16 | DPO, Arena Hard, WildBench, MT-Bench |
| span_link | 9 | Chemical-disease relations, structured sentiment |
| slider | 8 | STS-B similarity, essay scoring, word similarity |
| image_annotation | 6 | ViTPose, RefCOCO, Camelyon pathology |
| select | 6 | MS MARCO, WSD, Financial PhraseBank |
| number | 5 | GPQA confidence, HumanEval, NumEval, event counting |
| multirate | 3 | UltraFeedback, RewardBench, SemEval sentiment |
| audio_annotation | 3 | LibriSpeech, Speech Commands, CoVoST |
| tree_annotation | 3 | PDTB, UD parsing, RumourEval thread structure |
| video | 2 | Video-ChatGPT display |
| triage | 2 | CoNLL-2003 triage, triage template |
| tiered_annotation | 1 | MMLU-Pro tiered evaluation |
| bws | 1 | Chatbot Arena best-worst scaling |
| pure_display | 1 | HELM model card display |
| event_annotation | 1 | BioNLP gene regulation events |
| coreference | 1 | OntoNotes coreference resolution |
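
A tally like the one above could be produced by scanning the configuration files. This is a rough sketch, assuming each scheme declares its type on a line of the form `annotation_type: radio` (optionally quoted); the helper name `tally_types` is hypothetical. Some counts exceed the number of tasks, presumably because a single config can define several schemes.

```bash
#!/usr/bin/env bash
# Tally annotation types across every config.yaml under a directory.
# Assumption: types appear as "annotation_type: <name>", quoted or not.
tally_types() {
  local root="${1:-.}"
  find "$root" -type f -name config.yaml \
    -exec grep -hoE "annotation_type:[[:space:]]*[\"']?[a-z_]+" {} + |
    sed -E "s/annotation_type:[[:space:]]*[\"']?//" |
    sort | uniq -c | sort -rn
}
```

Running `tally_types potato-showcase` from the directory that holds the clone would print one count per type, most frequent first.
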

Each task folder contains:

- `metadata.json` - Task metadata (title, description, tags, paper reference, citation)
- `config.yaml` - Potato configuration file
- `sample-data.json` - Example data for testing (8-12 items)
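
The folder layout described above can be checked mechanically before running a task. This is a minimal sketch that only tests for the three file names listed; the function name `check_task_dir` is hypothetical, and the script does not validate file contents.

```bash
#!/usr/bin/env bash
# Report which of the three expected files are missing from a task folder.
# Returns 0 when all are present, 1 otherwise.
check_task_dir() {
  local dir="$1" missing=0 f
  for f in metadata.json config.yaml sample-data.json; do
    if [ ! -f "$dir/$f" ]; then
      echo "missing: $dir/$f" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

For example, `check_task_dir potato-showcase/text/emotion-sentiment/goemotions` should exit silently with status 0 if the folder is complete.
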

To try a task locally:

    # Clone this repository
    git clone https://github.com/davidjurgens/potato-showcase.git
    # Navigate to a task
    cd potato-showcase/text/emotion-sentiment/goemotions
    # Run with Potato
    potato start config.yaml

To use a task in your own project:

- Clone this repository
- Browse categories to find a relevant task
- Copy the task folder to your project
- Customize the `config.yaml` for your needs
- Run with: `potato start config.yaml`
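
The copy step can be scripted so that an existing folder in your project is never overwritten. The sketch below is one possible approach; `copy_task` and the destination path `my-project/annotation` are hypothetical names chosen for illustration.

```bash
#!/usr/bin/env bash
# Copy one task folder into a project directory, refusing to overwrite
# a folder of the same name that already exists there.
copy_task() {
  local src="${1%/}" dest_root="$2" dest
  dest="$dest_root/$(basename "$src")"
  if [ -e "$dest" ]; then
    echo "refusing to overwrite $dest" >&2
    return 1
  fi
  mkdir -p "$dest_root" && cp -R "$src" "$dest"
}
```

A call such as `copy_task potato-showcase/text/emotion-sentiment/goemotions my-project/annotation` would create `my-project/annotation/goemotions`, where the config can then be edited and started with `potato start config.yaml`.
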

We welcome contributions! To add a new task:

- Create a folder in the appropriate category
- Add required files (`metadata.json`, `config.yaml`, `sample-data.json`)
- Include paper reference and BibTeX citation if based on published work
- Submit a pull request

MIT License - feel free to use these configurations in your projects.