New metric module to improve flexibility and intuitiveness - moved from #475 #514
Conversation
jinyongyoo
left a comment
@sanchit97 Thank you for adding the perplexity module! I think PPL (and USE encoder similarity if you have time 😉) are two important metrics to add!
I'm still having a hard time understanding why we need to instantiate a new Metric every time we have a new list of results. One downside is clearly evident in the Perplexity class: whenever we instantiate a new Perplexity, we have to load a fresh copy of the GPT-2 model. Could you perhaps explain the design to me?
Also, I left some comments about calculating perplexity. Since what we add here may be used for reporting performance, I think it's important that we not only get it correct but also explain clearly what we're calculating and how.
sanchit97
left a comment
- `init` and `calculate` are now changed to favor one-time initialization (see the sketch below)
- New USE metric has been added and checked
- [WIP] Tests and docs
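A minimal sketch of what the one-time-initialization design could look like (class and method names here are illustrative, not necessarily the PR's exact code):

```python
from abc import ABC, abstractmethod


class Metric(ABC):
    """Base class: heavy resources are loaded once, in the constructor."""

    @abstractmethod
    def calculate(self, results):
        """Compute the metric over a list of attack results."""
        raise NotImplementedError


class PerplexityMetric(Metric):
    def __init__(self):
        # GPT-2 is loaded a single time when the metric object is created...
        from transformers import GPT2LMHeadModel, GPT2TokenizerFast

        self.tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
        self.model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

    def calculate(self, results):
        # ...and reused every time a new list of results is scored
        # (the actual perplexity computation is omitted in this sketch).
        pass
```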
@sanchit97 can you fix the errors?
@jinyongyoo I think this is ready. Can you help review it?
@sanchit97 This looks awesome! Thank you for the hard work! This will be a great addition to TextAttack.
One (just one I promise!) concern I have is the perplexity numbers from the test output. They look a little too high (200-300, 1000-2000), though I cannot see any issues with the code. One reason might be that we're just using poor samples for testing, but maybe we can do a sanity check on a larger number of samples (e.g. 1000 from IMDB, Yelp, etc.) just to make sure.
Otherwise, most of my comments are minor ones about the docstrings. I'm going to approve the PR since the code and design look good! 🤗
Lastly, here are some possible suggestions for improving the Metrics module in the future (not necessary for this PR).
- Batched passes with GPT-2: I know doing batches with GPT-2 is tricky because there are no padding tokens, but we can manually set them to the EOS token and then use attention masking to make sure we don't use the padding tokens. We also need to make sure the labels at those EOS-padding positions are set to -100 when calculating the loss. This will speed up the perplexity metric a lot (see the sketch after this list).
- Global object sharing for GPT-2 and USE: note that we also use GPT-2 and USE as part of our constraints, so it's possible we have already loaded the models onto our GPUs when running the attacks. It would be nice if we could reuse them for metrics without instantiating new models (it's very easy to hit GPU OOM). One solution is to make a `GLOBAL_SHARED_OBJECTS` dictionary within `shared/utils`, give it global scope, insert the GPT-2 and USE models into the dictionary the first time we instantiate them, and reuse them later in other modules.
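A rough sketch of the batched-perplexity idea described above, assuming the HuggingFace `transformers` library; the function name is hypothetical and this is not the PR's implementation:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()


def batched_perplexities(texts, batch_size=16):
    """Per-sentence perplexities computed in batches (illustrative sketch)."""
    ppls = []
    for i in range(0, len(texts), batch_size):
        enc = tokenizer(
            texts[i : i + batch_size],
            return_tensors="pt",
            padding=True,
            truncation=True,
        )
        input_ids, attn = enc["input_ids"], enc["attention_mask"]
        labels = input_ids.clone()
        labels[attn == 0] = -100  # ignore EOS-padding positions in the loss
        with torch.no_grad():
            logits = model(input_ids, attention_mask=attn).logits
        # Shift so each position predicts the next token.
        shift_logits = logits[:, :-1, :]
        shift_labels = labels[:, 1:]
        loss = torch.nn.functional.cross_entropy(
            shift_logits.transpose(1, 2),
            shift_labels,
            ignore_index=-100,
            reduction="none",
        )
        token_counts = (shift_labels != -100).sum(dim=1).clamp(min=1)
        sentence_loss = loss.sum(dim=1) / token_counts
        ppls.extend(torch.exp(sentence_loss).tolist())
    return ppls
```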
| Average perturbed word %:     | 5.56%  |
| Average num. words per input: | 15.5   |
| Avg num queries:              | 1.33   |
| Average Original Perplexity:  | 291.47 |
The perplexity numbers here seem abnormally high (should be less than 100).
I have doubts about this too.
@sanchit97 I also realized I never responded to your replies. I don't get any notifications from GitHub and I don't check the repo often, so I didn't see the response until now. In the future, if you need my response, please try tagging me with @ or let me know via Slack. As for how we calculate perplexity (the exponentiated average loss, or the average of individual perplexities), I'm actually not sure of this myself either. I also wondered the same thing recently when I had to calculate perplexity. Personally, I think the exponentiated average loss makes sense if we're calculating the model's perplexity at the corpus level (e.g. WikiText), but since we're calculating the perplexity of individual sentences, I tend to think the average of individual perplexities gives a clearer meaning. But it's possible others might have different interpretations.
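For concreteness, the two interpretations could be computed roughly like this (toy numbers only, not from the PR):

```python
import math

# Mean cross-entropy (nats per token) for each sentence; toy example values.
sentence_losses = [2.1, 3.4, 2.8]

# Corpus-level view: exponentiate the average loss.
# (A strict corpus perplexity would weight each sentence by its token count.)
corpus_ppl = math.exp(sum(sentence_losses) / len(sentence_losses))

# Sentence-level view: average the per-sentence perplexities.
avg_sentence_ppl = sum(math.exp(l) for l in sentence_losses) / len(sentence_losses)

print(corpus_ppl, avg_sentence_ppl)  # the two numbers generally differ
```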
@jinyongyoo thanks for the excellent review comments. I will fix those docstrings.
@ArshdeepSekhon can you review and try the code?
What does this PR do?
This PR is a continuation of #475, split out due to extensive structural changes.
Summary
This PR adds a new metric module to TextAttack, which provides an easier way to add new metrics or use existing metrics on adversarial examples. Currently, metrics are calculated in the `attack_log_manager.py` file, which is unintuitive and inflexible. Slides detailing this change: https://docs.google.com/presentation/d/1O77447GGWQF-xDnIdTr8d2MMPsQ9o9CG3pZZoo2bPrs/edit?usp=sharing
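As a rough illustration of the intended usage (the class name and `calculate` signature are assumed from the module layout listed under Additions below, so treat this as a sketch rather than the final API):

```python
from textattack.metrics.quality_metrics import Perplexity  # class name assumed

# attack_results: a list of AttackResult objects produced by a finished attack.
attack_results = ...

ppl_metric = Perplexity()                         # heavy models are loaded once, here
ppl_stats = ppl_metric.calculate(attack_results)  # reuse the same object for new result lists
```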
Additions
- `metrics` module as `textattack.metrics`
- `attack_metrics` module as `textattack.metrics.attack_metrics`
- `quality_metrics` module as `textattack.metrics.quality_metrics`

Changes
- `metrics` is a new module, removing metrics from `attack_log_manager.py`

Deletions
Checklist
- `.rst` file in `TextAttack/docs/apidoc`.