diff --git a/docs/3recipes/models.md b/docs/3recipes/models.md
index 2c034baf9..905890785 100644
--- a/docs/3recipes/models.md
+++ b/docs/3recipes/models.md
@@ -50,26 +50,26 @@ All evaluations shown are on the full validation or test set up to 1000 examples
 - AG News (`lstm-ag-news`)
   - `datasets` dataset `ag_news`, split `test`
-  - Successes: 914/1000
+  - True Positive/Positive: 914/1000
   - Accuracy: 91.4%
 - IMDB (`lstm-imdb`)
   - `datasets` dataset `imdb`, split `test`
-  - Successes: 883/1000
+  - True Positive/Positive: 883/1000
   - Accuracy: 88.30%
 - Movie Reviews [Rotten Tomatoes] (`lstm-mr`)
   - `datasets` dataset `rotten_tomatoes`, split `validation`
-  - Successes: 807/1000
+  - True Positive/Positive: 807/1000
   - Accuracy: 80.70%
   - `datasets` dataset `rotten_tomatoes`, split `test`
-  - Successes: 781/1000
+  - True Positive/Positive: 781/1000
   - Accuracy: 78.10%
 - SST-2 (`lstm-sst2`)
   - `datasets` dataset `glue`, subset `sst2`, split `validation`
-  - Successes: 737/872
+  - True Positive/Positive: 737/872
   - Accuracy: 84.52%
 - Yelp Polarity (`lstm-yelp`)
   - `datasets` dataset `yelp_polarity`, split `test`
-  - Successes: 922/1000
+  - True Positive/Positive: 922/1000
   - Accuracy: 92.20%
@@ -81,26 +81,26 @@ All evaluations shown are on the full validation or test set up to 1000 examples
 - AG News (`cnn-ag-news`)
   - `datasets` dataset `ag_news`, split `test`
-  - Successes: 910/1000
+  - True Positive/Positive: 910/1000
   - Accuracy: 91.00%
 - IMDB (`cnn-imdb`)
   - `datasets` dataset `imdb`, split `test`
-  - Successes: 863/1000
+  - True Positive/Positive: 863/1000
   - Accuracy: 86.30%
 - Movie Reviews [Rotten Tomatoes] (`cnn-mr`)
   - `datasets` dataset `rotten_tomatoes`, split `validation`
-  - Successes: 794/1000
+  - True Positive/Positive: 794/1000
   - Accuracy: 79.40%
   - `datasets` dataset `rotten_tomatoes`, split `test`
-  - Successes: 768/1000
+  - True Positive/Positive: 768/1000
   - Accuracy: 76.80%
 - SST-2 (`cnn-sst2`)
   - `datasets` dataset `glue`, subset `sst2`, split `validation`
-  - Successes: 721/872
+  - True Positive/Positive: 721/872
   - Accuracy: 82.68%
 - Yelp Polarity (`cnn-yelp`)
   - `datasets` dataset `yelp_polarity`, split `test`
-  - Successes: 913/1000
+  - True Positive/Positive: 913/1000
   - Accuracy: 91.30%
@@ -112,38 +112,38 @@ All evaluations shown are on the full validation or test set up to 1000 examples
 - AG News (`albert-base-v2-ag-news`)
   - `datasets` dataset `ag_news`, split `test`
-  - Successes: 943/1000
+  - True Positive/Positive: 943/1000
   - Accuracy: 94.30%
 - CoLA (`albert-base-v2-cola`)
   - `datasets` dataset `glue`, subset `cola`, split `validation`
-  - Successes: 829/1000
+  - True Positive/Positive: 829/1000
   - Accuracy: 82.90%
 - IMDB (`albert-base-v2-imdb`)
   - `datasets` dataset `imdb`, split `test`
-  - Successes: 913/1000
+  - True Positive/Positive: 913/1000
   - Accuracy: 91.30%
 - Movie Reviews [Rotten Tomatoes] (`albert-base-v2-mr`)
   - `datasets` dataset `rotten_tomatoes`, split `validation`
-  - Successes: 882/1000
+  - True Positive/Positive: 882/1000
   - Accuracy: 88.20%
   - `datasets` dataset `rotten_tomatoes`, split `test`
-  - Successes: 851/1000
+  - True Positive/Positive: 851/1000
   - Accuracy: 85.10%
 - Quora Question Pairs (`albert-base-v2-qqp`)
   - `datasets` dataset `glue`, subset `qqp`, split `validation`
-  - Successes: 914/1000
+  - True Positive/Positive: 914/1000
   - Accuracy: 91.40%
 - Recognizing Textual Entailment (`albert-base-v2-rte`)
   - `datasets` dataset `glue`, subset `rte`, split `validation`
-  - Successes: 211/277
+  - True Positive/Positive: 211/277
   - Accuracy: 76.17%
 - SNLI (`albert-base-v2-snli`)
   - `datasets` dataset `snli`, split `test`
-  - Successes: 883/1000
+  - True Positive/Positive: 883/1000
   - Accuracy: 88.30%
 - SST-2 (`albert-base-v2-sst2`)
   - `datasets` dataset `glue`, subset `sst2`, split `validation`
-  - Successes: 807/872
+  - True Positive/Positive: 807/872
   - Accuracy: 92.55%)
 - STS-b (`albert-base-v2-stsb`)
   - `datasets` dataset `glue`, subset `stsb`, split `validation`
@@ -151,11 +151,11 @@ All evaluations shown are on the full validation or test set up to 1000 examples
   - Spearman correlation: 0.8995912861209745
 - WNLI (`albert-base-v2-wnli`)
   - `datasets` dataset `glue`, subset `wnli`, split `validation`
-  - Successes: 42/71
+  - True Positive/Positive: 42/71
   - Accuracy: 59.15%
 - Yelp Polarity (`albert-base-v2-yelp`)
   - `datasets` dataset `yelp_polarity`, split `test`
-  - Successes: 963/1000
+  - True Positive/Positive: 963/1000
   - Accuracy: 96.30%
@@ -166,50 +166,50 @@ All evaluations shown are on the full validation or test set up to 1000 examples
 - AG News (`bert-base-uncased-ag-news`)
   - `datasets` dataset `ag_news`, split `test`
-  - Successes: 942/1000
+  - True Positive/Positive: 942/1000
   - Accuracy: 94.20%
 - CoLA (`bert-base-uncased-cola`)
   - `datasets` dataset `glue`, subset `cola`, split `validation`
-  - Successes: 812/1000
+  - True Positive/Positive: 812/1000
   - Accuracy: 81.20%
 - IMDB (`bert-base-uncased-imdb`)
   - `datasets` dataset `imdb`, split `test`
-  - Successes: 919/1000
+  - True Positive/Positive: 919/1000
   - Accuracy: 91.90%
 - MNLI matched (`bert-base-uncased-mnli`)
   - `datasets` dataset `glue`, subset `mnli`, split `validation_matched`
-  - Successes: 840/1000
+  - True Positive/Positive: 840/1000
   - Accuracy: 84.00%
 - Movie Reviews [Rotten Tomatoes] (`bert-base-uncased-mr`)
   - `datasets` dataset `rotten_tomatoes`, split `validation`
-  - Successes: 876/1000
+  - True Positive/Positive: 876/1000
   - Accuracy: 87.60%
   - `datasets` dataset `rotten_tomatoes`, split `test`
-  - Successes: 838/1000
+  - True Positive/Positive: 838/1000
   - Accuracy: 83.80%
 - MRPC (`bert-base-uncased-mrpc`)
   - `datasets` dataset `glue`, subset `mrpc`, split `validation`
-  - Successes: 358/408
+  - True Positive/Positive: 358/408
   - Accuracy: 87.75%
 - QNLI (`bert-base-uncased-qnli`)
   - `datasets` dataset `glue`, subset `qnli`, split `validation`
-  - Successes: 904/1000
+  - True Positive/Positive: 904/1000
   - Accuracy: 90.40%
 - Quora Question Pairs (`bert-base-uncased-qqp`)
   - `datasets` dataset `glue`, subset `qqp`, split `validation`
-  - Successes: 924/1000
+  - True Positive/Positive: 924/1000
   - Accuracy: 92.40%
 - Recognizing Textual Entailment (`bert-base-uncased-rte`)
   - `datasets` dataset `glue`, subset `rte`, split `validation`
-  - Successes: 201/277
+  - True Positive/Positive: 201/277
   - Accuracy: 72.56%
 - SNLI (`bert-base-uncased-snli`)
   - `datasets` dataset `snli`, split `test`
-  - Successes: 894/1000
+  - True Positive/Positive: 894/1000
   - Accuracy: 89.40%
 - SST-2 (`bert-base-uncased-sst2`)
   - `datasets` dataset `glue`, subset `sst2`, split `validation`
-  - Successes: 806/872
+  - True Positive/Positive: 806/872
   - Accuracy: 92.43%)
 - STS-b (`bert-base-uncased-stsb`)
   - `datasets` dataset `glue`, subset `stsb`, split `validation`
@@ -217,11 +217,11 @@ All evaluations shown are on the full validation or test set up to 1000 examples
   - Spearman correlation: 0.8773251339980935
 - WNLI (`bert-base-uncased-wnli`)
   - `datasets` dataset `glue`, subset `wnli`, split `validation`
-  - Successes: 40/71
+  - True Positive/Positive: 40/71
   - Accuracy: 56.34%
 - Yelp Polarity (`bert-base-uncased-yelp`)
   - `datasets` dataset `yelp_polarity`, split `test`
-  - Successes: 963/1000
+  - True Positive/Positive: 963/1000
   - Accuracy: 96.30%
@@ -233,23 +233,23 @@ All evaluations shown are on the full validation or test set up to 1000 examples
 - CoLA (`distilbert-base-cased-cola`)
   - `datasets` dataset `glue`, subset `cola`, split `validation`
-  - Successes: 786/1000
+  - True Positive/Positive: 786/1000
   - Accuracy: 78.60%
 - MRPC (`distilbert-base-cased-mrpc`)
   - `datasets` dataset `glue`, subset `mrpc`, split `validation`
-  - Successes: 320/408
+  - True Positive/Positive: 320/408
   - Accuracy: 78.43%
 - Quora Question Pairs (`distilbert-base-cased-qqp`)
   - `datasets` dataset `glue`, subset `qqp`, split `validation`
-  - Successes: 908/1000
+  - True Positive/Positive: 908/1000
   - Accuracy: 90.80%
 - SNLI (`distilbert-base-cased-snli`)
   - `datasets` dataset `snli`, split `test`
-  - Successes: 861/1000
+  - True Positive/Positive: 861/1000
   - Accuracy: 86.10%
 - SST-2 (`distilbert-base-cased-sst2`)
   - `datasets` dataset `glue`, subset `sst2`, split `validation`
-  - Successes: 785/872
+  - True Positive/Positive: 785/872
   - Accuracy: 90.02%)
 - STS-b (`distilbert-base-cased-stsb`)
   - `datasets` dataset `glue`, subset `stsb`, split `validation`
@@ -264,31 +264,31 @@ All evaluations shown are on the full validation or test set up to 1000 examples
 - AG News (`distilbert-base-uncased-ag-news`)
   - `datasets` dataset `ag_news`, split `test`
-  - Successes: 944/1000
+  - True Positive/Positive: 944/1000
   - Accuracy: 94.40%
 - CoLA (`distilbert-base-uncased-cola`)
   - `datasets` dataset `glue`, subset `cola`, split `validation`
-  - Successes: 786/1000
+  - True Positive/Positive: 786/1000
   - Accuracy: 78.60%
 - IMDB (`distilbert-base-uncased-imdb`)
   - `datasets` dataset `imdb`, split `test`
-  - Successes: 903/1000
+  - True Positive/Positive: 903/1000
   - Accuracy: 90.30%
 - MNLI matched (`distilbert-base-uncased-mnli`)
   - `datasets` dataset `glue`, subset `mnli`, split `validation_matched`
-  - Successes: 817/1000
+  - True Positive/Positive: 817/1000
   - Accuracy: 81.70%
 - MRPC (`distilbert-base-uncased-mrpc`)
   - `datasets` dataset `glue`, subset `mrpc`, split `validation`
-  - Successes: 350/408
+  - True Positive/Positive: 350/408
   - Accuracy: 85.78%
 - QNLI (`distilbert-base-uncased-qnli`)
   - `datasets` dataset `glue`, subset `qnli`, split `validation`
-  - Successes: 860/1000
+  - True Positive/Positive: 860/1000
   - Accuracy: 86.00%
 - Recognizing Textual Entailment (`distilbert-base-uncased-rte`)
   - `datasets` dataset `glue`, subset `rte`, split `validation`
-  - Successes: 180/277
+  - True Positive/Positive: 180/277
   - Accuracy: 64.98%
 - STS-b (`distilbert-base-uncased-stsb`)
   - `datasets` dataset `glue`, subset `stsb`, split `validation`
@@ -296,7 +296,7 @@ All evaluations shown are on the full validation or test set up to 1000 examples
   - Spearman correlation: 0.8407155030382939
 - WNLI (`distilbert-base-uncased-wnli`)
   - `datasets` dataset `glue`, subset `wnli`, split `validation`
-  - Successes: 40/71
+  - True Positive/Positive: 40/71
   - Accuracy: 56.34%
@@ -307,38 +307,38 @@ All evaluations shown are on the full validation or test set up to 1000 examples
 - AG News (`roberta-base-ag-news`)
   - `datasets` dataset `ag_news`, split `test`
-  - Successes: 947/1000
+  - True Positive/Positive: 947/1000
   - Accuracy: 94.70%
 - CoLA (`roberta-base-cola`)
   - `datasets` dataset `glue`, subset `cola`, split `validation`
-  - Successes: 857/1000
+  - True Positive/Positive: 857/1000
   - Accuracy: 85.70%
 - IMDB (`roberta-base-imdb`)
   - `datasets` dataset `imdb`, split `test`
-  - Successes: 941/1000
+  - True Positive/Positive: 941/1000
   - Accuracy: 94.10%
 - Movie Reviews [Rotten Tomatoes] (`roberta-base-mr`)
   - `datasets` dataset `rotten_tomatoes`, split `validation`
-  - Successes: 899/1000
+  - True Positive/Positive: 899/1000
   - Accuracy: 89.90%
   - `datasets` dataset `rotten_tomatoes`, split `test`
-  - Successes: 883/1000
+  - True Positive/Positive: 883/1000
   - Accuracy: 88.30%
 - MRPC (`roberta-base-mrpc`)
   - `datasets` dataset `glue`, subset `mrpc`, split `validation`
-  - Successes: 371/408
+  - True Positive/Positive: 371/408
   - Accuracy: 91.18%
 - QNLI (`roberta-base-qnli`)
   - `datasets` dataset `glue`, subset `qnli`, split `validation`
-  - Successes: 917/1000
+  - True Positive/Positive: 917/1000
   - Accuracy: 91.70%
 - Recognizing Textual Entailment (`roberta-base-rte`)
   - `datasets` dataset `glue`, subset `rte`, split `validation`
-  - Successes: 217/277
+  - True Positive/Positive: 217/277
   - Accuracy: 78.34%
 - SST-2 (`roberta-base-sst2`)
   - `datasets` dataset `glue`, subset `sst2`, split `validation`
-  - Successes: 820/872
+  - True Positive/Positive: 820/872
   - Accuracy: 94.04%)
 - STS-b (`roberta-base-stsb`)
   - `datasets` dataset `glue`, subset `stsb`, split `validation`
@@ -346,7 +346,7 @@ All evaluations shown are on the full validation or test set up to 1000 examples
   - Spearman correlation: 0.9025045272903051
 - WNLI (`roberta-base-wnli`)
   - `datasets` dataset `glue`, subset `wnli`, split `validation`
-  - Successes: 40/71
+  - True Positive/Positive: 40/71
   - Accuracy: 56.34%
@@ -357,26 +357,26 @@ All evaluations shown are on the full validation or test set up to 1000 examples
 - CoLA (`xlnet-base-cased-cola`)
   - `datasets` dataset `glue`, subset `cola`, split `validation`
-  - Successes: 800/1000
+  - True Positive/Positive: 800/1000
   - Accuracy: 80.00%
 - IMDB (`xlnet-base-cased-imdb`)
   - `datasets` dataset `imdb`, split `test`
-  - Successes: 957/1000
+  - True Positive/Positive: 957/1000
   - Accuracy: 95.70%
 - Movie Reviews [Rotten Tomatoes] (`xlnet-base-cased-mr`)
   - `datasets` dataset `rotten_tomatoes`, split `validation`
-  - Successes: 908/1000
+  - True Positive/Positive: 908/1000
   - Accuracy: 90.80%
   - `datasets` dataset `rotten_tomatoes`, split `test`
-  - Successes: 876/1000
+  - True Positive/Positive: 876/1000
   - Accuracy: 87.60%
 - MRPC (`xlnet-base-cased-mrpc`)
   - `datasets` dataset `glue`, subset `mrpc`, split `validation`
-  - Successes: 363/408
+  - True Positive/Positive: 363/408
   - Accuracy: 88.97%
 - Recognizing Textual Entailment (`xlnet-base-cased-rte`)
   - `datasets` dataset `glue`, subset `rte`, split `validation`
-  - Successes: 196/277
+  - True Positive/Positive: 196/277
   - Accuracy: 70.76%
 - STS-b (`xlnet-base-cased-stsb`)
   - `datasets` dataset `glue`, subset `stsb`, split `validation`
@@ -384,7 +384,7 @@ All evaluations shown are on the full validation or test set up to 1000 examples
   - Spearman correlation: 0.8773439961182335
 - WNLI (`xlnet-base-cased-wnli`)
   - `datasets` dataset `glue`, subset `wnli`, split `validation`
-  - Successes: 41/71
+  - True Positive/Positive: 41/71
   - Accuracy: 57.75%
diff --git a/sample.txt b/sample.txt
new file mode 100644
index 000000000..21f7c708c
--- /dev/null
+++ b/sample.txt
@@ -0,0 +1,3 @@
+0 Hi there
+1 Nope
+2 Whatever
diff --git a/textattack/commands/eval_model/eval_model_command.py b/textattack/commands/eval_model/eval_model_command.py
index 727d83ecc..883f92760 100644
--- a/textattack/commands/eval_model/eval_model_command.py
+++ b/textattack/commands/eval_model/eval_model_command.py
@@ -52,6 +52,7 @@ def test_model_on_dataset(self, args):
         i = 0
         while i < args.num_examples:
+
             dataset_batch = dataset[
                 i : min(args.num_examples, i + args.model_batch_size)
             ]
@@ -86,7 +87,9 @@ def test_model_on_dataset(self, args):
         successes = (guess_labels == ground_truth_outputs).sum().item()
         perc_accuracy = successes / len(preds) * 100.0
         perc_accuracy = "{:.2f}%".format(perc_accuracy)
-        logger.info(f"Successes {successes}/{len(preds)} ({_cb(perc_accuracy)})")
+        logger.info(
+            f"True Positive/Positive {successes}/{len(preds)} ({_cb(perc_accuracy)})"
+        )
 
     def run(self, args):
         textattack.shared.utils.set_seed(args.random_seed)
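For context on the renamed log line: `successes` in `test_model_on_dataset` counts every prediction that matches its gold label, regardless of class, and the percentage is that count over the number of predictions. A minimal, self-contained sketch of the same computation (the tensors here are hypothetical stand-ins for model output):

```python
import torch

# Hypothetical stand-ins for the values computed in test_model_on_dataset:
# argmax'd model predictions and gold labels for a batch of four examples.
guess_labels = torch.tensor([1, 0, 2, 1])
ground_truth_outputs = torch.tensor([1, 0, 1, 1])

# Element-wise comparison yields a boolean tensor; summing it counts matches.
successes = (guess_labels == ground_truth_outputs).sum().item()  # 3
perc_accuracy = "{:.2f}%".format(successes / len(guess_labels) * 100.0)
print(f"True Positive/Positive {successes}/{len(guess_labels)} ({perc_accuracy})")
# -> True Positive/Positive 3/4 (75.00%)
```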
diff --git a/textattack/datasets/__init__.py b/textattack/datasets/__init__.py
index f27746985..dcb3c2e26 100644
--- a/textattack/datasets/__init__.py
+++ b/textattack/datasets/__init__.py
@@ -10,5 +10,4 @@
 from .dataset import TextAttackDataset
 from .huggingface_dataset import HuggingFaceDataset
-
 from . import translation
diff --git a/textattack/datasets/dataset.py b/textattack/datasets/dataset.py
index 1e1425d5d..cd3089276 100644
--- a/textattack/datasets/dataset.py
+++ b/textattack/datasets/dataset.py
@@ -1,13 +1,15 @@
-"""
+"""
+
+dataset: TextAttack dataset
+=============================
+"""
-dataset: TextAttack dataset
-=============================
-"""
-from abc import ABC
 import pickle
 import random
 
+import pandas as pd
+
 from textattack.shared import utils
@@ -80,3 +82,40 @@ def _clean_example(self, ex):
         Only necessary for some datasets.
         """
         return ex
+
+    def _load_from_df(self, df, offset=0, shuffle=False, xcol=None, ycol=None):
+        """Loads examples from a pandas DataFrame.
+
+        df : DataFrame to load from
+        xcol : column to use as the input text
+        ycol : column to use as the label
+        """
+
+        # If xcol/ycol are given, keep only the requested columns.
+        new_df = df
+        if xcol is not None:
+            new_df = pd.DataFrame({xcol: df[xcol]})
+        if ycol is not None:
+            new_df[ycol] = df[ycol]
+        df = new_df
+        self.examples = list(df.to_records(index=False))
+        self.examples = self.examples[offset:]
+        self._i = 0
+        if shuffle:
+            random.shuffle(self.examples)
+
+    def _load_from_csv(self, path, header=None, sep=",", **kwargs):
+        """Loads examples from a CSV file."""
+        df = pd.read_csv(path, header=header, sep=sep)
+        self._load_from_df(df, **kwargs)
+
+    def _load_from_lists(self, list_x_y, offset=0, shuffle=False):
+        """Loads examples from a list of (text, label) pairs.
+        list_x_y : iterable of (text, label) pairs
+        offset : index to start from
+        """
+        self.examples = list_x_y
+        self.examples = self.examples[offset:]
+        self._i = 0
+        if shuffle:
+            random.shuffle(self.examples)
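A usage sketch for the new loaders (not part of the patch): it assumes a `TextAttackDataset` subclass calls these helpers from its own `__init__`; the subclass names, column names, and data below are hypothetical, with the `(text, label)` pairs mirroring the new `sample.txt`.

```python
import pandas as pd

from textattack.datasets import TextAttackDataset


class ListDataset(TextAttackDataset):
    """Hypothetical subclass fed by a list of (text, label) pairs."""

    def __init__(self, pairs, shuffle=False):
        self._load_from_lists(pairs, offset=0, shuffle=shuffle)


class DataFrameDataset(TextAttackDataset):
    """Hypothetical subclass that keeps one text and one label column."""

    def __init__(self, df):
        # xcol/ycol select which DataFrame columns become (text, label).
        self._load_from_df(df, xcol="text", ycol="label")


dataset = ListDataset([("Hi there", 0), ("Nope", 1), ("Whatever", 2)], shuffle=True)

df = pd.DataFrame({"text": ["Hi there", "Nope"], "label": [0, 1], "score": [0.9, 0.1]})
dataset = DataFrameDataset(df)  # keeps only the "text" and "label" columns
```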
diff --git a/textattack/models/README.md b/textattack/models/README.md
index 297a8b2f5..83d64ae27 100644
--- a/textattack/models/README.md
+++ b/textattack/models/README.md
@@ -22,26 +22,26 @@ All evaluations shown are on the full validation or test set up to 1000 examples
 - AG News (`lstm-ag-news`)
   - `datasets` dataset `ag_news`, split `test`
-  - Successes: 914/1000
+  - True Positive/Positive: 914/1000
   - Accuracy: 91.4%
 - IMDB (`lstm-imdb`)
   - `datasets` dataset `imdb`, split `test`
-  - Successes: 883/1000
+  - True Positive/Positive: 883/1000
   - Accuracy: 88.30%
 - Movie Reviews [Rotten Tomatoes] (`lstm-mr`)
   - `datasets` dataset `rotten_tomatoes`, split `validation`
-  - Successes: 807/1000
+  - True Positive/Positive: 807/1000
   - Accuracy: 80.70%
   - `datasets` dataset `rotten_tomatoes`, split `test`
-  - Successes: 781/1000
+  - True Positive/Positive: 781/1000
   - Accuracy: 78.10%
 - SST-2 (`lstm-sst2`)
   - `datasets` dataset `glue`, subset `sst2`, split `validation`
-  - Successes: 737/872
+  - True Positive/Positive: 737/872
   - Accuracy: 84.52%
 - Yelp Polarity (`lstm-yelp`)
   - `datasets` dataset `yelp_polarity`, split `test`
-  - Successes: 922/1000
+  - True Positive/Positive: 922/1000
   - Accuracy: 92.20%
@@ -53,26 +53,26 @@ All evaluations shown are on the full validation or test set up to 1000 examples
 - AG News (`cnn-ag-news`)
   - `datasets` dataset `ag_news`, split `test`
-  - Successes: 910/1000
+  - True Positive/Positive: 910/1000
   - Accuracy: 91.00%
 - IMDB (`cnn-imdb`)
   - `datasets` dataset `imdb`, split `test`
-  - Successes: 863/1000
+  - True Positive/Positive: 863/1000
   - Accuracy: 86.30%
 - Movie Reviews [Rotten Tomatoes] (`cnn-mr`)
   - `datasets` dataset `rotten_tomatoes`, split `validation`
-  - Successes: 794/1000
+  - True Positive/Positive: 794/1000
   - Accuracy: 79.40%
   - `datasets` dataset `rotten_tomatoes`, split `test`
-  - Successes: 768/1000
+  - True Positive/Positive: 768/1000
   - Accuracy: 76.80%
 - SST-2 (`cnn-sst2`)
   - `datasets` dataset `glue`, subset `sst2`, split `validation`
-  - Successes: 721/872
+  - True Positive/Positive: 721/872
   - Accuracy: 82.68%
 - Yelp Polarity (`cnn-yelp`)
   - `datasets` dataset `yelp_polarity`, split `test`
-  - Successes: 913/1000
+  - True Positive/Positive: 913/1000
   - Accuracy: 91.30%
@@ -84,38 +84,38 @@ All evaluations shown are on the full validation or test set up to 1000 examples
 - AG News (`albert-base-v2-ag-news`)
   - `datasets` dataset `ag_news`, split `test`
-  - Successes: 943/1000
+  - True Positive/Positive: 943/1000
   - Accuracy: 94.30%
 - CoLA (`albert-base-v2-cola`)
   - `datasets` dataset `glue`, subset `cola`, split `validation`
-  - Successes: 829/1000
+  - True Positive/Positive: 829/1000
   - Accuracy: 82.90%
 - IMDB (`albert-base-v2-imdb`)
   - `datasets` dataset `imdb`, split `test`
-  - Successes: 913/1000
+  - True Positive/Positive: 913/1000
   - Accuracy: 91.30%
 - Movie Reviews [Rotten Tomatoes] (`albert-base-v2-mr`)
   - `datasets` dataset `rotten_tomatoes`, split `validation`
-  - Successes: 882/1000
+  - True Positive/Positive: 882/1000
   - Accuracy: 88.20%
   - `datasets` dataset `rotten_tomatoes`, split `test`
-  - Successes: 851/1000
+  - True Positive/Positive: 851/1000
   - Accuracy: 85.10%
 - Quora Question Pairs (`albert-base-v2-qqp`)
   - `datasets` dataset `glue`, subset `qqp`, split `validation`
-  - Successes: 914/1000
+  - True Positive/Positive: 914/1000
   - Accuracy: 91.40%
 - Recognizing Textual Entailment (`albert-base-v2-rte`)
   - `datasets` dataset `glue`, subset `rte`, split `validation`
-  - Successes: 211/277
+  - True Positive/Positive: 211/277
   - Accuracy: 76.17%
 - SNLI (`albert-base-v2-snli`)
   - `datasets` dataset `snli`, split `test`
-  - Successes: 883/1000
+  - True Positive/Positive: 883/1000
   - Accuracy: 88.30%
 - SST-2 (`albert-base-v2-sst2`)
   - `datasets` dataset `glue`, subset `sst2`, split `validation`
-  - Successes: 807/872
+  - True Positive/Positive: 807/872
   - Accuracy: 92.55%)
 - STS-b (`albert-base-v2-stsb`)
   - `datasets` dataset `glue`, subset `stsb`, split `validation`
@@ -123,11 +123,11 @@ All evaluations shown are on the full validation or test set up to 1000 examples
   - Spearman correlation: 0.8995912861209745
 - WNLI (`albert-base-v2-wnli`)
   - `datasets` dataset `glue`, subset `wnli`, split `validation`
-  - Successes: 42/71
+  - True Positive/Positive: 42/71
   - Accuracy: 59.15%
 - Yelp Polarity (`albert-base-v2-yelp`)
   - `datasets` dataset `yelp_polarity`, split `test`
-  - Successes: 963/1000
+  - True Positive/Positive: 963/1000
   - Accuracy: 96.30%
@@ -138,50 +138,50 @@ All evaluations shown are on the full validation or test set up to 1000 examples
 - AG News (`bert-base-uncased-ag-news`)
   - `datasets` dataset `ag_news`, split `test`
-  - Successes: 942/1000
+  - True Positive/Positive: 942/1000
   - Accuracy: 94.20%
 - CoLA (`bert-base-uncased-cola`)
   - `datasets` dataset `glue`, subset `cola`, split `validation`
-  - Successes: 812/1000
+  - True Positive/Positive: 812/1000
   - Accuracy: 81.20%
 - IMDB (`bert-base-uncased-imdb`)
   - `datasets` dataset `imdb`, split `test`
-  - Successes: 919/1000
+  - True Positive/Positive: 919/1000
   - Accuracy: 91.90%
 - MNLI matched (`bert-base-uncased-mnli`)
   - `datasets` dataset `glue`, subset `mnli`, split `validation_matched`
-  - Successes: 840/1000
+  - True Positive/Positive: 840/1000
   - Accuracy: 84.00%
 - Movie Reviews [Rotten Tomatoes] (`bert-base-uncased-mr`)
   - `datasets` dataset `rotten_tomatoes`, split `validation`
-  - Successes: 876/1000
+  - True Positive/Positive: 876/1000
   - Accuracy: 87.60%
   - `datasets` dataset `rotten_tomatoes`, split `test`
-  - Successes: 838/1000
+  - True Positive/Positive: 838/1000
   - Accuracy: 83.80%
 - MRPC (`bert-base-uncased-mrpc`)
   - `datasets` dataset `glue`, subset `mrpc`, split `validation`
-  - Successes: 358/408
+  - True Positive/Positive: 358/408
   - Accuracy: 87.75%
 - QNLI (`bert-base-uncased-qnli`)
   - `datasets` dataset `glue`, subset `qnli`, split `validation`
-  - Successes: 904/1000
+  - True Positive/Positive: 904/1000
   - Accuracy: 90.40%
 - Quora Question Pairs (`bert-base-uncased-qqp`)
   - `datasets` dataset `glue`, subset `qqp`, split `validation`
-  - Successes: 924/1000
+  - True Positive/Positive: 924/1000
   - Accuracy: 92.40%
 - Recognizing Textual Entailment (`bert-base-uncased-rte`)
   - `datasets` dataset `glue`, subset `rte`, split `validation`
-  - Successes: 201/277
+  - True Positive/Positive: 201/277
   - Accuracy: 72.56%
 - SNLI (`bert-base-uncased-snli`)
   - `datasets` dataset `snli`, split `test`
-  - Successes: 894/1000
+  - True Positive/Positive: 894/1000
   - Accuracy: 89.40%
 - SST-2 (`bert-base-uncased-sst2`)
   - `datasets` dataset `glue`, subset `sst2`, split `validation`
-  - Successes: 806/872
+  - True Positive/Positive: 806/872
   - Accuracy: 92.43%)
 - STS-b (`bert-base-uncased-stsb`)
   - `datasets` dataset `glue`, subset `stsb`, split `validation`
@@ -189,11 +189,11 @@ All evaluations shown are on the full validation or test set up to 1000 examples
   - Spearman correlation: 0.8773251339980935
 - WNLI (`bert-base-uncased-wnli`)
   - `datasets` dataset `glue`, subset `wnli`, split `validation`
-  - Successes: 40/71
+  - True Positive/Positive: 40/71
   - Accuracy: 56.34%
 - Yelp Polarity (`bert-base-uncased-yelp`)
   - `datasets` dataset `yelp_polarity`, split `test`
-  - Successes: 963/1000
+  - True Positive/Positive: 963/1000
   - Accuracy: 96.30%
@@ -205,23 +205,23 @@ All evaluations shown are on the full validation or test set up to 1000 examples
 - CoLA (`distilbert-base-cased-cola`)
   - `datasets` dataset `glue`, subset `cola`, split `validation`
-  - Successes: 786/1000
+  - True Positive/Positive: 786/1000
   - Accuracy: 78.60%
 - MRPC (`distilbert-base-cased-mrpc`)
   - `datasets` dataset `glue`, subset `mrpc`, split `validation`
-  - Successes: 320/408
+  - True Positive/Positive: 320/408
   - Accuracy: 78.43%
 - Quora Question Pairs (`distilbert-base-cased-qqp`)
   - `datasets` dataset `glue`, subset `qqp`, split `validation`
-  - Successes: 908/1000
+  - True Positive/Positive: 908/1000
   - Accuracy: 90.80%
 - SNLI (`distilbert-base-cased-snli`)
   - `datasets` dataset `snli`, split `test`
-  - Successes: 861/1000
+  - True Positive/Positive: 861/1000
   - Accuracy: 86.10%
 - SST-2 (`distilbert-base-cased-sst2`)
   - `datasets` dataset `glue`, subset `sst2`, split `validation`
-  - Successes: 785/872
+  - True Positive/Positive: 785/872
   - Accuracy: 90.02%)
 - STS-b (`distilbert-base-cased-stsb`)
   - `datasets` dataset `glue`, subset `stsb`, split `validation`
@@ -236,31 +236,31 @@ All evaluations shown are on the full validation or test set up to 1000 examples
 - AG News (`distilbert-base-uncased-ag-news`)
   - `datasets` dataset `ag_news`, split `test`
-  - Successes: 944/1000
+  - True Positive/Positive: 944/1000
   - Accuracy: 94.40%
 - CoLA (`distilbert-base-uncased-cola`)
   - `datasets` dataset `glue`, subset `cola`, split `validation`
-  - Successes: 786/1000
+  - True Positive/Positive: 786/1000
   - Accuracy: 78.60%
 - IMDB (`distilbert-base-uncased-imdb`)
   - `datasets` dataset `imdb`, split `test`
-  - Successes: 903/1000
+  - True Positive/Positive: 903/1000
   - Accuracy: 90.30%
 - MNLI matched (`distilbert-base-uncased-mnli`)
   - `datasets` dataset `glue`, subset `mnli`, split `validation_matched`
-  - Successes: 817/1000
+  - True Positive/Positive: 817/1000
   - Accuracy: 81.70%
 - MRPC (`distilbert-base-uncased-mrpc`)
   - `datasets` dataset `glue`, subset `mrpc`, split `validation`
-  - Successes: 350/408
+  - True Positive/Positive: 350/408
   - Accuracy: 85.78%
 - QNLI (`distilbert-base-uncased-qnli`)
   - `datasets` dataset `glue`, subset `qnli`, split `validation`
-  - Successes: 860/1000
+  - True Positive/Positive: 860/1000
   - Accuracy: 86.00%
 - Recognizing Textual Entailment (`distilbert-base-uncased-rte`)
   - `datasets` dataset `glue`, subset `rte`, split `validation`
-  - Successes: 180/277
+  - True Positive/Positive: 180/277
   - Accuracy: 64.98%
 - STS-b (`distilbert-base-uncased-stsb`)
   - `datasets` dataset `glue`, subset `stsb`, split `validation`
@@ -268,7 +268,7 @@ All evaluations shown are on the full validation or test set up to 1000 examples
   - Spearman correlation: 0.8407155030382939
 - WNLI (`distilbert-base-uncased-wnli`)
   - `datasets` dataset `glue`, subset `wnli`, split `validation`
-  - Successes: 40/71
+  - True Positive/Positive: 40/71
   - Accuracy: 56.34%
@@ -279,38 +279,38 @@ All evaluations shown are on the full validation or test set up to 1000 examples
 - AG News (`roberta-base-ag-news`)
   - `datasets` dataset `ag_news`, split `test`
-  - Successes: 947/1000
+  - True Positive/Positive: 947/1000
   - Accuracy: 94.70%
 - CoLA (`roberta-base-cola`)
   - `datasets` dataset `glue`, subset `cola`, split `validation`
-  - Successes: 857/1000
+  - True Positive/Positive: 857/1000
   - Accuracy: 85.70%
 - IMDB (`roberta-base-imdb`)
   - `datasets` dataset `imdb`, split `test`
-  - Successes: 941/1000
+  - True Positive/Positive: 941/1000
   - Accuracy: 94.10%
 - Movie Reviews [Rotten Tomatoes] (`roberta-base-mr`)
   - `datasets` dataset `rotten_tomatoes`, split `validation`
-  - Successes: 899/1000
+  - True Positive/Positive: 899/1000
   - Accuracy: 89.90%
   - `datasets` dataset `rotten_tomatoes`, split `test`
-  - Successes: 883/1000
+  - True Positive/Positive: 883/1000
   - Accuracy: 88.30%
 - MRPC (`roberta-base-mrpc`)
   - `datasets` dataset `glue`, subset `mrpc`, split `validation`
-  - Successes: 371/408
+  - True Positive/Positive: 371/408
   - Accuracy: 91.18%
 - QNLI (`roberta-base-qnli`)
   - `datasets` dataset `glue`, subset `qnli`, split `validation`
-  - Successes: 917/1000
+  - True Positive/Positive: 917/1000
   - Accuracy: 91.70%
 - Recognizing Textual Entailment (`roberta-base-rte`)
   - `datasets` dataset `glue`, subset `rte`, split `validation`
-  - Successes: 217/277
+  - True Positive/Positive: 217/277
   - Accuracy: 78.34%
 - SST-2 (`roberta-base-sst2`)
   - `datasets` dataset `glue`, subset `sst2`, split `validation`
-  - Successes: 820/872
+  - True Positive/Positive: 820/872
   - Accuracy: 94.04%)
 - STS-b (`roberta-base-stsb`)
   - `datasets` dataset `glue`, subset `stsb`, split `validation`
@@ -318,7 +318,7 @@ All evaluations shown are on the full validation or test set up to 1000 examples
   - Spearman correlation: 0.9025045272903051
 - WNLI (`roberta-base-wnli`)
   - `datasets` dataset `glue`, subset `wnli`, split `validation`
-  - Successes: 40/71
+  - True Positive/Positive: 40/71
   - Accuracy: 56.34%
@@ -329,26 +329,26 @@ All evaluations shown are on the full validation or test set up to 1000 examples
 - CoLA (`xlnet-base-cased-cola`)
   - `datasets` dataset `glue`, subset `cola`, split `validation`
-  - Successes: 800/1000
+  - True Positive/Positive: 800/1000
   - Accuracy: 80.00%
 - IMDB (`xlnet-base-cased-imdb`)
   - `datasets` dataset `imdb`, split `test`
-  - Successes: 957/1000
+  - True Positive/Positive: 957/1000
   - Accuracy: 95.70%
 - Movie Reviews [Rotten Tomatoes] (`xlnet-base-cased-mr`)
   - `datasets` dataset `rotten_tomatoes`, split `validation`
-  - Successes: 908/1000
+  - True Positive/Positive: 908/1000
   - Accuracy: 90.80%
   - `datasets` dataset `rotten_tomatoes`, split `test`
-  - Successes: 876/1000
+  - True Positive/Positive: 876/1000
   - Accuracy: 87.60%
 - MRPC (`xlnet-base-cased-mrpc`)
   - `datasets` dataset `glue`, subset `mrpc`, split `validation`
-  - Successes: 363/408
+  - True Positive/Positive: 363/408
   - Accuracy: 88.97%
 - Recognizing Textual Entailment (`xlnet-base-cased-rte`)
   - `datasets` dataset `glue`, subset `rte`, split `validation`
-  - Successes: 196/277
+  - True Positive/Positive: 196/277
   - Accuracy: 70.76%
 - STS-b (`xlnet-base-cased-stsb`)
   - `datasets` dataset `glue`, subset `stsb`, split `validation`
@@ -356,7 +356,7 @@ All evaluations shown are on the full validation or test set up to 1000 examples
   - Spearman correlation: 0.8773439961182335
 - WNLI (`xlnet-base-cased-wnli`)
   - `datasets` dataset `glue`, subset `wnli`, split `validation`
-  - Successes: 41/71
+  - True Positive/Positive: 41/71
   - Accuracy: 57.75%
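As a quick sanity check, each Accuracy figure in the tables above is just the reported ratio formatted with the same `"{:.2f}%"` pattern used in `eval_model_command.py`; for example:

```python
# Spot-check a few entries from the tables above.
for name, successes, total in [
    ("lstm-sst2", 737, 872),
    ("bert-base-uncased-mrpc", 358, 408),
    ("albert-base-v2-wnli", 42, 71),
]:
    print(name, "{:.2f}%".format(successes / total * 100.0))
# lstm-sst2 84.52%
# bert-base-uncased-mrpc 87.75%
# albert-base-v2-wnli 59.15%
```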