custom dataset loader #324
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…face data loading
…using hugging face data loaders
…using huggingface dataloaders
@ArshdeepSekhon is this a continued PR from #310? If yes, I will close that one. @ArshdeepSekhon there are some issues with the flair library dependency; we fixed that in #322.
I didn't know datasets offered an easy way to load datasets from file, so this CustomDataset class was cool to see! I made some minor comments about handling some edge cases.
The biggest comment that I have is that after reading the code for this class, I'm still not completely sure how to use it. I think part of it can be resolved with more detailed documentation. But also, I think part of it is that we're abstracting too much away from the users. We are coupling CustomDataset too tightly with Huggingface's datasets package. Part of me feels like it would be easier for users if we just require them to create the appropriate datasets.Dataset object and pass it in directly, instead of requiring them to provide the arguments that are eventually used to build the datasets.Dataset object. After all, Huggingface has better documentation and examples for guiding users in building the dataset. Then, we can simply skip most of the complicated logic in __init__ and just focus on packing the examples as OrderedDict for TextAttack.
I think we should aim for two goals: (1) a custom class for Huggingface's datasets.Dataset, and (2) a custom dataset class that is more flexible and designed to work with Python lists, dicts, and the PyTorch Dataset class without being coupled to Huggingface's datasets. Supporting (2) would allow users who are familiar with PyTorch data loading to skip the process of learning Huggingface's API. Also, for (2), I don't think we should worry about how to process CSV files, JSON files, etc. Opening and reading files should be done by the user who owns the files.
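To make goal (2) concrete, here is a minimal sketch of a dataset class that wraps plain Python lists as (OrderedDict, label) pairs with no HuggingFace dependency. The names SimpleDataset and input_columns are hypothetical, not the actual TextAttack API:

```python
from collections import OrderedDict

class SimpleDataset:
    """Hypothetical sketch: wrap plain Python lists as (OrderedDict, label)
    pairs, with no coupling to HuggingFace's datasets package."""

    def __init__(self, examples, labels, input_columns=("text",)):
        self.examples = []
        for ex, label in zip(examples, labels):
            if isinstance(ex, str):
                ex = (ex,)  # single-input tasks: promote to a 1-tuple
            self.examples.append((OrderedDict(zip(input_columns, ex)), label))

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, i):
        return self.examples[i]
```

A user familiar with PyTorch could then build this from any source they like (CSV, JSON, a DataFrame) without learning a new loading API.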
class CustomDataset(TextAttackDataset):
    """Loads a Custom Dataset from a file/list of files and prepares it as a
Seems like users can pass in a list of files (or a dictionary of files where keys are the split names, e.g. "train", "test"). But the documentation for the name argument suggests that it's just a string.
- label_map: Mapping if output labels should be re-mapped. Useful
    if model was trained with a different label arrangement than
    provided in the ``datasets`` version of the dataset.
- output_scale_factor (float): Factor to divide ground-truth outputs by.
I'm confused about what output_scale_factor is for. Can you maybe give an example of when this would be used?
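One guess at its purpose, not confirmed by the PR: dividing ground-truth regression targets into a normalized range, e.g. STS-B similarity scores in [0, 5] scaled to [0, 1]. A sketch under that assumption:

```python
def scale_output(label, output_scale_factor=None):
    # Hypothetical illustration: divide a ground-truth regression target
    # by output_scale_factor, e.g. an STS-B score in [0, 5] with
    # output_scale_factor=5.0 yields a target in [0, 1].
    if output_scale_factor:
        return label / output_scale_factor
    return label
```

If that is the intended use, the docstring should say so explicitly.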
    infile_format=None,
    split=None,
    label_map=None,
    subset=None,
I don't see the subset argument being used anywhere in the code when loading the examples (other than in the logging part). I think subset is used in HuggingFaceDataset because datasets like glue have subsets (e.g. sst2), but I don't see the point here. Do you think it is redundant?
# if no split in custom data, default split is None
# if user hasn't specified a split to use, raise error if the dataset has splits
if split is None:
    if set(self._dataset.keys()) <= set(["train", "validation", "test"]):
Two cases pop into my mind:
- What happens if `set(self._dataset.keys())` is empty? Then, the condition is true and will raise the `ValueError`. But what if the CSV/JSON file we loaded is for the train split, and thus `self._dataset` doesn't really have any concept of "splits"?
- What happens if `set(self._dataset.keys())` has split names that aren't exactly `train`, `validation`, and `test`? The user could have passed in `train`, `val`, `test`, which would cause the condition to be `False`.
One solution could be that if there are splits (i.e. `set(self._dataset.keys())` is non-empty), we require `split` to not be `None`.
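A sketch of that suggested rule, where available_splits stands in for `set(self._dataset.keys())` (the helper name is illustrative):

```python
def resolve_split(available_splits, split=None):
    """Hypothetical sketch: if the loaded dataset exposes named splits,
    require the caller to pick one explicitly."""
    splits = set(available_splits)
    if not splits:
        # e.g. a single CSV file with no notion of splits
        return None
    if split is None:
        raise ValueError(
            f"Dataset has splits {sorted(splits)}; please pass split= explicitly."
        )
    if split not in splits:
        raise ValueError(f"Unknown split {split!r}; available: {sorted(splits)}")
    return split
```

This handles both edge cases above: a split-less file passes through, and non-standard split names like `val` still force an explicit choice.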
def __init__(
    self,
    name,
    infile_format=None,
I think file_format is an easier term to remember.
dataset_columns = []
dataset_columns.append(self._dataset.column_names)

if not set(dataset_columns[0]) <= set(self._dataset.column_names):
Maybe I'm just not familiar with how datasets work, but does this mean the first column of the data should always be the input and the second column the output?
My main concern is that this would be too restrictive for users. For example, NLI datasets have two input columns "premise" and "hypothesis".
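One less restrictive alternative, sketched here with hypothetical names, is to let the caller name the input and output columns explicitly, which covers multi-input tasks like NLI:

```python
from collections import OrderedDict

def format_example(row, input_columns, output_column):
    # Hypothetical sketch: the caller names the columns explicitly
    # instead of the code assuming "first column is input, second is
    # output", so NLI-style multi-input rows work naturally.
    inputs = OrderedDict((col, row[col]) for col in input_columns)
    return inputs, row[output_column]
```

For example, an NLI row would use `input_columns=("premise", "hypothesis")` and `output_column="label"`.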
- name(Union[str, dict, pd.DataFrame]): the user specified dataset file names, dicts or pandas dataframe
- infile_format(str): Specifies type of file for loading HuggingFaceDataset : csv, json, pandas, text
    from local_files will be loaded as ``datasets.load_dataset(filetype, data_files=name)``.
- label_map: Mapping if output labels should be re-mapped. Useful
Could you include an example of how a user should define this label_map (also, is this a dict)? For example, should it look like {"Positive": 1, "Negative": 0}?
if shuffle:
    random.shuffle(self.examples)

def _format_raw_example(self, raw_example):
Since the last three methods are just copied from HuggingFaceDataset, would it be more appropriate to just inherit from HuggingFaceDataset?
if dataset_columns is None:
    # automatically infer from dataset
    dataset_columns = []
    dataset_columns.append(self._dataset.column_names)
It seems that self._dataset.column_names is a list. Why are we appending it to another list instead of setting dataset_columns = self._dataset.column_names?
| """Loads a Custom Dataset from a file/list of files and prepares it as a | ||
| TextAttack dataset. | ||
|
|
||
| - name(Union[str, dict, pd.DataFrame]): the user specified dataset file names, dicts or pandas dataframe |
Don't you think name argument would be too confusing? Maybe file_name_or_data?
@ArshdeepSekhon It is a good idea to sync your copy of the code with the master repository regularly. This way you can quickly account for changes: $ git remote add upstream https://github.com/QData/TextAttack
@ArshdeepSekhon please run
@ArshdeepSekhon please fix simple formatting issues at least.
@ArshdeepSekhon let me fix the build doc
@ArshdeepSekhon @jinyongyoo I thought datasets/dataset.py should be an abstract class with some required abstract methods.
@ArshdeepSekhon @jinyongyoo I recommend we triage this PR for now. Agreed?
Loads a user-specified dataset from a local file as a TextAttack dataset, using HuggingFace data loading utilities.