Generalized Visual Language Models (Lilian Weng)
CLIP: Learning Transferable Visual Models From Natural Language Supervision
ALBEF: Align before Fuse: Vision and Language Representation Learning with Momentum Distillation