Is your feature request related to a problem? Please describe.
After reading the original paper on vision transformers (link below), they seem to excel when trained on large datasets.
That makes sense, because they have to learn the spatial structure of the image from scratch (which patches are neighbors of which other patches, etc.).
Describe the solution you'd like
I would like to find out whether there is any pre-trained ViT for 3D images, and if so, how it can be reused in MONAI.
Describe alternatives you've considered
I have searched the web with this same question, but without much luck.
The issue lucidrains/vit-pytorch#125 suggests that a pretrained 2D ViT could be adapted to 3D. But I guess that implementation would differ from MONAI's? Any hint on how to do this for reuse in MONAI?
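For what it's worth, one common way to adapt pretrained 2D weights to 3D is "I3D-style" weight inflation: replicate the 2D patch-embedding kernel along a new depth axis and rescale so activations keep roughly the same magnitude. This is only a sketch under the assumption that the patch embedding is a strided convolution whose kernel equals the patch size (as in most ViT implementations); it is not MONAI-specific, and positional embeddings would still need separate handling (e.g. interpolation to the 3D grid):

```python
import torch

def inflate_patch_embedding(weight_2d: torch.Tensor, depth: int) -> torch.Tensor:
    """Inflate a 2D ViT patch-embedding weight to 3D (I3D-style inflation).

    weight_2d: [embed_dim, in_channels, P, P] from a pretrained 2D ViT.
    Returns:   [embed_dim, in_channels, depth, P, P], replicated along the
               new depth axis and divided by `depth` so that a constant
               input produces the same activation scale as in 2D.
    """
    weight_3d = weight_2d.unsqueeze(2).repeat(1, 1, depth, 1, 1)
    return weight_3d / depth

# Hypothetical example: a ViT-B/16-sized patch embedding, 3 input channels.
# (Random weights stand in for a real pretrained checkpoint.)
w2d = torch.randn(768, 3, 16, 16)
w3d = inflate_patch_embedding(w2d, depth=16)
print(w3d.shape)  # torch.Size([768, 3, 16, 16, 16])
```

The inflated tensor could then be copied into the `Conv3d` patch embedding of a 3D ViT via `load_state_dict` (with `strict=False` for the layers that have no 2D counterpart), assuming the target model uses a convolutional patch embedding.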
Additional context
Original paper on ViT, for reference: https://arxiv.org/abs/2010.11929
EDIT: pinging @ahatamiz as the implementer of swin-unetr (thanks!)