Yuhan Shen

沈宇寒

PhD student

Northeastern University

Biography

Yuhan Shen is a PhD candidate in Computer Science at the Khoury College of Computer Sciences, Northeastern University, Boston, MA. He has broad interests in weakly-supervised and unsupervised machine learning, computer vision, and multi-modal learning. He is currently working on egocentric video understanding, procedural learning, and action segmentation under the supervision of Professor Ehsan Elhamifar. He was also co-advised by Professor Lu Wang.

During his PhD journey, he has enriched his expertise and broadened his perspective in video understanding through internships at Facebook AI Research (FAIR) with Dr. Effrosyni Mavroudi and Dr. Lorenzo Torresani, and at TikTok with Dr. Heng Wang.

Before joining Northeastern University, he received his bachelor’s degree from the Department of Electronic Engineering at Tsinghua University in China in 2018. He also worked as a research assistant in the Speech and Audio Technology Lab at Tsinghua University under the guidance of Professor Wei-Qiang Zhang.

Seeking full-time research scientist or applied scientist positions in 2025. Reach out to me if you have an opportunity!

Interests

  • Video Understanding
  • Multimodal Learning
  • Computer Vision
  • Audio and Speech

Education

  • PhD in Computer Science, 2019 - present

    Northeastern University

  • BEng in Electronic Engineering, 2014 - 2018

    Tsinghua University

Experience

Research Scientist Intern

Meta AI

May 2023 – Aug 2023 New York
Worked on narration-based video object segmentation with Dr. Effrosyni Mavroudi and Dr. Lorenzo Torresani at FAIR.

Research Intern

ByteDance/TikTok

May 2022 – Aug 2022 Remote
Worked on multi-modal video captioning with Dr. Heng Wang, Dr. Linjie Yang, Dr. Longyin Wen, and Dr. Haichao Yu on the Intelligent Creation - Vision and Graphics team.

Graduate Research Assistant

Northeastern University

Sep 2019 – Present Boston, MA
Research projects include:

  • Unsupervised procedure learning via visual and language instructions
  • Semi-weakly supervised learning from instructional videos
  • Streaming video action segmentation
  • AI/AR task assistant for procedural guidance

Research Assistant

Tsinghua University

Jul 2018 – Jul 2019 Beijing, China
Research projects include:

  • Sound event detection and audio tagging
  • Keyword search from speech
  • Query-by-example spoken term detection

Publications

(* indicates equal contribution)

Learning to Segment Referred Objects from Narrated Egocentric Videos. CVPR (oral), 2024.

Exploring the Role of Audio in Video Captioning. CVPR MULA Workshop, 2024.

Learning How to Listen: A Temporal-Frequential Attention Model for Sound Event Detection. Interspeech (oral), 2019.

Hierarchical Pooling Structure for Weakly Labeled Sound Event Detection. Interspeech, 2019.

Projects

Semi-Weakly Supervised Learning of Complex Actions

Performed action segmentation using a small number of weakly-labeled videos and a large number of unlabeled videos.

Multi-modal Procedure Learning from Instructional Videos

Summarized and localized the key steps of instructional videos from both visual and language data.

Audio Tagging with Noisy Labels and Minimal Supervision

Classified multi-label audio clips using a small amount of manually-labeled data and a large quantity of noisy-labeled data.

Research on Sound Event Detection

Achieved state-of-the-art performance on rare sound event detection and weakly-labeled sound event detection.

Contact

  • 440 Huntington Ave, Boston, MA, 02115, United States
  • Enter West Village H and take the elevator to Office 472 on Floor 4