Multi-Modal Procedure Learning from Instructional Videos

The goal of this work is to summarize and localize the key-steps of instructional videos from both visual and language data via unsupervised procedure learning. To address this problem, we make the following contributions:

  1. Addressed the problem of unsupervised key-step localization and feature learning in instructional videos using both visual and language instructions.

  2. Proposed an ordered prototype learning module that extracts visual and linguistic prototypes representing key-steps in an unsupervised manner (see the first sketch after this list).

  3. Proposed a differentiable weak sequence alignment loss that finds an ordered one-to-one matching across modalities for feature learning (see the second sketch after this list).

  4. Outperformed state-of-the-art methods in unsupervised action segmentation on two benchmark datasets.
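
As a concrete illustration of contribution 2, below is a minimal, hypothetical sketch of prototype-based key-step representation in PyTorch: a bank of learnable prototypes to which frame (or sentence) embeddings are softly assigned by cosine similarity. All names, shapes, and the temperature-scaled softmax are assumptions for illustration, not the paper's implementation, and the ordering constraint of the actual module is not modeled here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrototypeBank(nn.Module):
    """A bank of K learnable prototypes in a shared d-dimensional space.

    Hypothetical sketch: the module name, temperature, and shapes are
    assumptions, not the authors' implementation.
    """

    def __init__(self, num_prototypes: int, dim: int):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(num_prototypes, dim))

    def forward(self, feats: torch.Tensor, temperature: float = 0.1):
        # feats: (T, d) frame (or sentence) embeddings.
        feats = F.normalize(feats, dim=-1)
        protos = F.normalize(self.prototypes, dim=-1)
        sim = feats @ protos.t()                       # (T, K) cosine similarities
        assign = F.softmax(sim / temperature, dim=-1)  # soft key-step assignment
        return assign, protos

# Example: soft-assign 300 frame embeddings to 12 visual key-step prototypes.
visual_bank = PrototypeBank(num_prototypes=12, dim=128)
assignments, visual_protos = visual_bank(torch.randn(300, 128))
```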
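In the spirit of contribution 3, the second sketch implements a soft, gap-penalized sequence alignment (a log-sum-exp relaxation of Needleman-Wunsch-style dynamic programming) that finds an ordered one-to-one matching between visual and linguistic prototype sequences while allowing prototypes on either side to stay unmatched. This is an assumed stand-in for the paper's differentiable weak sequence alignment loss, not its exact formulation; `weak_alignment_loss`, `gap`, and `gamma` are hypothetical names.

```python
import torch

def soft_min(values: torch.Tensor, gamma: float = 0.1) -> torch.Tensor:
    # Smooth, differentiable relaxation of min via log-sum-exp.
    return -gamma * torch.logsumexp(-values / gamma, dim=-1)

def weak_alignment_loss(vis: torch.Tensor, lang: torch.Tensor,
                        gap: float = 1.0, gamma: float = 0.1) -> torch.Tensor:
    """Soft ordered one-to-one matching between two prototype sequences.

    vis: (K, d) visual prototypes; lang: (M, d) linguistic prototypes.
    Unmatched prototypes pay a gap penalty, hence a 'weak' alignment.
    """
    K, M = vis.size(0), lang.size(0)
    cost = torch.cdist(vis, lang)                  # (K, M) pairwise distances
    zero = cost.new_zeros(())
    prev = [zero + gap * j for j in range(M + 1)]  # row 0: skip all language steps
    for i in range(1, K + 1):
        curr = [zero + gap * i]                    # column 0: skip all visual steps
        for j in range(1, M + 1):
            candidates = torch.stack([
                prev[j - 1] + cost[i - 1, j - 1],  # match visual i with language j
                prev[j] + gap,                     # leave visual prototype i unmatched
                curr[j - 1] + gap,                 # leave language prototype j unmatched
            ])
            curr.append(soft_min(candidates, gamma))
        prev = curr
    return prev[M]
```

Because every step of the recursion is differentiable, the returned scalar can be minimized directly, pulling embeddings of matched key-steps across modalities closer together.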

This work was published at CVPR 2021 as an oral presentation.
