Video captioning is essential for content accessibility and searchability because it provides precise textual descriptions of video content. However, generating accurate, descriptive, and detailed video captions remains challenging due to several factors: the limited availability of high-quality labeled data and the additional complexity inherent to video, such as temporal correlations