[ViT] Vision Transformer (ViT) backbone, layers, and image classifier #1989
Merged
Commits (37, all by sineeli)
741b889 vit base
13dae08 Add vit backbone, classifier and preprocessor layers
b64b137 update args
429d635 add default args
6d69abc correct build method
2e87884 fix build issues
bd3cce0 fix bugs
4232a06 Update backbone args and configs
32b08c5 correct position ids dtype
cc938c6 build token layer
78812de token layer build
8a20465 assign correct dtype to TokenLayer
de754cc fix build shape of token layer
84ba896 correct mlp dense var names
7a70e16 use default norm mean and std as per Hugging Face config
81e3021 correct position_ids
d3061d6 remove separate token layer
618e163 correct position ids
2338637 Checkpoint conversion script and minor changes
95e5868 correct flag type
9d2e5bd correct key name
ac7d1d3 use flat list; later we can extract in-between layers if needed
8065c01 Add test cases and correct dtype policy for model
a8be824 add proper docstrings
3f027a0 correct test cases
05acb70 use numpy for test data
521df6f nit
ae2b800 nit
26c2224 Merge branch 'master' into sineeli/ViT
92149d5 add presets
5374c70 load ViT preset from Hugging Face directly
ebee9ef nit
93064bd handle num classes case for ViT
e206e7b replace "token" with "first"
7a39d5b convert all ViT checkpoints using tools
0827954 Add custom ImageClassifier for ViT
ae9319a remove token pooling and rename representation_size to intermediate_dim
Commit bd3cce0a1e4d4d69d1f42b64b7f482a474144151: fix bugs
"token" feels like a bit a weird name here, especially when compared to
"avg"or"max". Maybe"first"?There was a problem hiding this comment.
Actually, wouldn't this also break for other classifier types? I think this "token" pooling would fail to actually pool over a 2D output from most backbones, and similarly global average 2D pooling would fail to pool correctly for a ViT backbone, right (since it's a 1D sequence after patching)? Instead we should subclass here, and not let pooling be configurable for ViT. See https://github.com/keras-team/keras-hub/blob/master/keras_hub/src/models/vgg/vgg_image_classifier.py as an example of this.
Oh yes, I was thinking earlier to subclass and write a new one entirely. Thanks for pointing it out; I will make the changes required.
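For reference, a minimal sketch of what such a ViT-specific classifier subclass could look like, with pooling fixed to the class token rather than exposed as a `pooling` argument. The class name, constructor arguments, and backbone output shape below are assumptions for illustration, not the exact API merged in this PR.

```python
import keras


class ViTClassifierSketch(keras.Model):
    """Illustrative ViT classifier head with fixed class-token pooling.

    Assumes `backbone(images)` returns a token sequence of shape
    (batch, num_patches + 1, hidden_dim) with the class token first.
    """

    def __init__(self, backbone, num_classes, activation=None, **kwargs):
        super().__init__(**kwargs)
        self.backbone = backbone
        # Pooling is not configurable: we always take the first (class) token.
        self.output_dense = keras.layers.Dense(
            num_classes, activation=activation, name="predictions"
        )

    def call(self, images):
        tokens = self.backbone(images)  # (batch, seq_len, hidden_dim)
        cls_token = tokens[:, 0, :]     # "first" pooling: class token only
        return self.output_dense(cls_token)
```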
@mattdangerw
Also, from the Hugging Face implementation I observed that there is one more dense layer when the model is not used for image classification, which they call the pooling layer. It is just a dense layer (which projects to the same hidden dimension) followed by a tanh activation. Should we include this? If we only consider image classification, this layer wouldn't be present.
ViTModel: https://github.com/huggingface/transformers/blob/91b8ab18b778ae9e2f8191866e018cd1dc7097be/src/transformers/models/vit/modeling_vit.py#L576
Image classification: https://github.com/huggingface/transformers/blob/91b8ab18b778ae9e2f8191866e018cd1dc7097be/src/transformers/models/vit/modeling_vit.py#L823C37-L823C54
Any thoughts?
The original JAX code calls it representation size: https://github.com/google-research/vision_transformer/blob/c6de1e5378c9831a8477feb30994971bdc409e46/vit_jax/models_vit.py#L296C13-L296C32
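To make the layer under discussion concrete, here is a rough sketch of that pooler head, assuming it is a single dense layer that keeps the hidden dimension (as in the Hugging Face ViTModel) or maps to `representation_size` (as in the original JAX code), with a tanh activation applied to the class token. The dimensions are illustrative only.

```python
import keras

hidden_dim = 768  # illustrative; matches ViT-Base hidden size

# Pooler as described above: one dense layer followed by tanh,
# applied to the class token of the encoder output.
pooler = keras.layers.Dense(hidden_dim, activation="tanh", name="pooler")

# Encoder output: (batch, num_patches + 1, hidden_dim), class token at index 0.
tokens = keras.random.normal((2, 197, hidden_dim))
pooled = pooler(tokens[:, 0, :])  # (batch, hidden_dim)
```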