This talk is a featured presentation of the National Science Foundation Research Experience for Undergraduates (NSF REU) on Computational Methods for Understanding Music, Media, and Minds and is open to all faculty, staff, students, and community members.
Lunch sponsored by the Goergen Institute for Data Science.
Title: "When Computer Vision Meets Audition: From Cross-Modal Generation to Audio-Visual Scene Understanding"
Abstract: Understanding the scenes around us is a fundamental capability of human intelligence. Similarly, designing computer algorithms that can understand scenes is a fundamental problem in artificial intelligence. Humans consciously or unconsciously use all five senses (vision, audition, taste, smell, and touch) to understand a scene, as different senses provide complementary information. Existing machine scene understanding algorithms, however, are designed to rely on a single modality. Taking the two most commonly used senses, vision and audition, as an example, there are scene understanding algorithms designed for each modality individually, but no systematic investigation has been conducted to integrate the two toward more comprehensive audio-visual scene understanding. Here, I will talk about two recent works from my group. The first addresses the cross-modal audio-visual generation problem by leveraging the power of deep generative adversarial training; we show state-of-the-art performance in constrained domains such as music performance and lip reading. The second addresses audio-visual event localization in general scenes; we show that using both modalities cohesively in a deep time-series model outperforms using vision or audio alone.
Bio: Chenliang Xu is an assistant professor in the Department of Computer Science at the University of Rochester. He received his Ph.D. from the University of Michigan in 2016 and his M.S. from SUNY Buffalo in 2012, both in computer science, and his B.S. in information and computing science from Nanjing University of Aeronautics and Astronautics in 2010. His research interests include computer vision and its relations to natural language, robotics, and data science. His work primarily focuses on problems in video understanding such as video segmentation, activity recognition, and multimodal vision-and-x modeling. He is the recipient of a 2017 NSF BIGDATA award, a University of Rochester AR/VR Pilot Award, and the best paper award at SMC 2017. He has authored refereed journal and conference papers in venues including IJCV, TMM, CVPR, ICCV, ECCV, ICSC, SMC, and SPIE. He co-organized the CVPR 2017 Workshop on video understanding, has served as a program committee member for CVPR, ICCV, AAAI, NIPS, ICSC, BMVC, and ACCV, and is a regular reviewer for international journals such as IEEE TPAMI and IJCV. He is a member of the IEEE and the ACM.
Wednesday, June 13 at 12:00pm to 1:30pm
Wegmans Hall, Auditorium 1400
250 Hutchison Rd, Rochester, NY 14620