
Yuhan Shen

沈宇寒

Applied Scientist

Amazon

Biography

Yuhan Shen is an Applied Scientist at Amazon AGI in Boston, MA. He received his PhD in Computer Science from the Khoury College of Computer Sciences at Northeastern University, where his research spanned weakly-supervised and unsupervised machine learning, computer vision, and multimodal learning. His doctoral work, supervised by Professor Ehsan Elhamifar and co-advised by Professor Lu Wang, focused on egocentric video understanding, procedural learning, and action segmentation.

During his PhD, he deepened his expertise in video understanding through research internships at Facebook AI Research (FAIR) with Dr. Effrosyni Mavroudi and Dr. Lorenzo Torresani, and at TikTok with Dr. Heng Wang.

Before Northeastern, he earned his bachelor’s degree from the Department of Electronic Engineering at Tsinghua University in China in 2018 and worked as a research assistant in the Speech and Audio Technology Lab under the guidance of Professor Wei-Qiang Zhang.

Interests

  • Video Understanding
  • Multimodal Learning
  • Computer Vision
  • Audio and Speech

Education

  • PhD in Computer Science, 2019 - 2025

    Northeastern University

  • BEng in Electronic Engineering, 2014 - 2018

    Tsinghua University

Experience


Applied Scientist

Amazon

Jul 2025 – Present
Boston, MA

Research Scientist Intern

Meta AI

May 2023 – Aug 2023
New York
Worked on narration-based video object segmentation with Dr. Effrosyni Mavroudi and Dr. Lorenzo Torresani at FAIR.

Research Intern

ByteDance/TikTok

May 2022 – Aug 2022
Remote
Worked on multi-modal video captioning with Dr. Heng Wang, Dr. Linjie Yang, Dr. Longyin Wen, and Dr. Haichao Yu on the Intelligent Creation – Vision and Graphics team.

Graduate Research Assistant

Northeastern University

Sep 2019 – Jul 2025
Boston, MA
Research projects include:

  • Unsupervised procedure learning via visual and language instructions
  • Semi-weakly supervised learning from instructional videos
  • Streaming video action segmentation
  • AI/AR task assistant for procedural guidance

Research Assistant

Tsinghua University

Jul 2018 – Jul 2019
Beijing, China
Research projects include:

  • Sound event detection and audio tagging
  • Keyword search from speech
  • Query-by-example spoken term detection

Publications


(* indicates equal contribution)

Learning to Segment Referred Objects from Narrated Egocentric Videos. CVPR (oral), 2024.


Exploring the Role of Audio in Video Captioning. CVPR MULA Workshop, 2024.


Learning How to Listen: A Temporal-Frequential Attention Model for Sound Event Detection. Interspeech (oral), 2019.


Hierarchical Pooling Structure for Weakly Labeled Sound Event Detection. Interspeech, 2019.


Projects

Semi-Weakly Supervised Learning of Complex Actions

Performed action segmentation using a small number of weakly-labeled videos and a large number of unlabeled videos.

Multi-modal Procedure Learning from Instructional Videos

Summarized and localized the key steps of instructional videos from both visual and language data.

Audio Tagging with Noisy Labels and Minimal Supervision

Classified multi-label audio clips using a small amount of manually-labeled data and a large quantity of noisy-labeled data.

Research on Sound Event Detection

Achieved state-of-the-art performance on rare sound event detection and weakly-labeled sound event detection.

Contact

  • 55 Pier 4 Blvd., Boston, MA, 02110, United States