The goal of this work is to segment, in a video sequence, the objects mentioned in a linguistic description of the scene. We adapt an existing deep neural network, which achieves state-of-the-art performance in semi-supervised video object segmentation, by adding a linguistic branch that generates an attention map over the video frames, making the segmentation of the objects temporally consistent along the sequence.
Herrera-Palacio, A.; Ventura, C.; Giro, X. Video object linguistic grounding. In: International Workshop on Multimodal Understanding and Learning for Embodied Applications. "MULEA '19: 1st International Workshop on Multimodal Understanding and Learning for Embodied Applications, Nice, France, October 25, 2019". New York: Association for Computing Machinery (ACM), 2019, p. 49-51.