Executive Summary

This project aims to develop methods for identity-aware video description that accurately attribute descriptions to the people appearing in a video. The focus is on movie clips, where distinguishing multiple characters by name or a unique identifier is crucial both for understanding a clip and for linking characters across a longer story. The project also aims to improve performance by training the Transformer decoder both to generate captions and to fill in the blanks in partial descriptions, strengthening the link between visual content and description.

Another goal is to perform multilingual video description using recent advances in multilingual language modeling, focusing on three Indian languages: Hindi, Marathi, and Telugu. To achieve this, audio descriptions will be collected and annotated in these Indian languages for both English movies (dubbed) and Indian movies. This will help disentangle the impact of visual cues from language-specific challenges, and allow analyzing how well models trained on English movies with English descriptions transfer to Indian movies.

The dataset will be shared and established as a benchmark so that researchers worldwide can work on this task, improving support for video description in Indian languages. Finally, the project will initiate discussions with not-for-profit organizations that enable visually impaired people to watch movies; the aim is to generate reasonable descriptions with the correct character names that need only light editing in a human-in-the-loop deployment, fostering inclusivity for visually impaired audiences.