Abstract: Image captioning is a multimodal task combining computer vision (CV) and natural language processing (NLP). Contrastive language-image pre-training has made significant progress by providing ...