Executive Summary: Digitization has led to a rapid growth in visual data, particularly from high-definition video cameras. This data underpins a range of applications, such as language-guided human-robot interaction, educational technology, interactive image editing, vision-and-dialog navigation, video captioning, and instance-based person and object identification in computer vision. However, automatic analysis of this data is challenging because of the semantic gap between image/video content and language. Referring expressions can contain both descriptive attributes and positional relations, and how these are interpreted affects the overall performance of content-based information selection and segmentation in a given video.
This project aims to develop a dynamic and scalable cross-modal deep learning framework for progressive, instance-based content selection and segmentation in video. The proposed framework would ground language simultaneously at both the bounding-box and segmentation-mask levels without requiring dense anchor definitions, making it suitable for surveillance and security applications. By addressing these challenges, the project seeks to improve the performance of referring, content-based information selection and segmentation in video.