Keynote Speakers

Dr. Dima Damen is a Reader (Associate Professor) in Computer Vision at the University of Bristol, United Kingdom. She received her PhD from the University of Leeds, UK (2009). Dima is currently an EPSRC Fellow (2020-2025), focusing on her research interests in the automatic understanding of object interactions, actions and activities using static and wearable visual (and depth) sensors. Dima is a program chair for ICCV 2021 in Montreal, and an associate editor of IJCV (2020-), IEEE TPAMI (2019-) and Pattern Recognition (2017-). She was selected as a Nokia Research collaborator in 2016, and as an Outstanding Reviewer at CVPR 2020, ICCV 2017, CVPR 2013 and CVPR 2012. She currently supervises 5 PhD students and 4 postdoctoral researchers.

Title: Human-Centric Object Interactions - A Fine-Grained Perspective from Egocentric Videos

Keynote Slides: The download link is [keynote-slides]

Abstract: This talk argues for a fine(r)-grained perspective on human-object interactions, from video sequences captured from an egocentric (i.e. first-person) perspective. Using multi-modal footage (appearance, motion, audio, language), I will present approaches for determining skill or expertise from video sequences [CVPR 2019], assessing action ‘completion’ (i.e. when an interaction is attempted but not completed) [BMVC 2018], dual-domain and dual-time learning [CVPR 2020, CVPR 2019, ICCVW 2019], as well as multi-modal fusion using vision, audio and language [CVPR 2020, ICCV 2019, BMVC 2019]. All project details at: Link. I will also introduce EPIC-KITCHENS-100, the largest egocentric dataset recorded in people’s homes. The dataset now includes 20M frames across 90K action segments and 100 hours of fully annotated recording, based on unique annotations from the participants narrating their own videos, thus reflecting true intention.


Prof. Cees Snoek is a full professor in computer science at the University of Amsterdam, where he heads the Video & Image Sense lab. He is also a director of three public-private AI research labs: the QUVA lab with Qualcomm, the Atlas lab with TomTom and the AIM lab with the Inception Institute of Artificial Intelligence. At the University spin-off Kepler Vision Technologies he acts as Chief Scientific Officer. He is also the director of the master's program in Artificial Intelligence and co-founder of the Netherlands Innovation Center for Artificial Intelligence. He was previously a visiting scientist at Carnegie Mellon University and UC Berkeley, head of R&D at the University spin-off Euvision Technologies, and managing principal engineer at Qualcomm Research Europe. His research interests focus on making sense of video and images. He has published over 200 refereed journal and conference papers, has served on the editorial boards of several international journals, and frequently serves as an area chair of the major conferences in computer vision and multimedia. Professor Snoek is the lead researcher of the award-winning MediaMill Semantic Video Search Engine, which was the most consistent top performer in the yearly NIST TRECVID evaluations for over a decade. He was general chair of ACM Multimedia 2016 in Amsterdam and initiator of the VideOlympics. Cees is the recipient of an NWO Veni career award, a Fulbright Junior Scholarship, an NWO Vidi career award, and the Netherlands Prize for ICT Research. Together with his Ph.D. students and post-docs he has won several best paper awards.

Title: Unseen Activity Recognition in Space and Time

Keynote Slides: The download link is [keynote-slides]

Abstract: Progress in video understanding has been astonishing in the past decade. Classifying, localizing, tracking and even segmenting actor instances at the pixel level are now commonplace, thanks to label-supervised machine learning. Yet, it is becoming increasingly clear that label-supervised knowledge transfer is expensive to obtain and scale, especially as the need for spatiotemporal detail and compositional semantic specification in long video sequences increases. In this talk we will discuss alternatives to label-supervision, using semantics, language, ontologies, similarity and time as the primary knowledge sources for various video understanding challenges. Despite being less example-dependent, the proposed algorithmic solutions are naturally embedded in modern (self-)learned representations and lead to state-of-the-art unseen activity recognition in space and time.


Dr. Ziwei Liu is currently a Nanyang Assistant Professor at Nanyang Technological University (NTU). Previously, he was a senior research fellow at the Chinese University of Hong Kong. Before that, Ziwei was a postdoctoral researcher at the University of California, Berkeley, working with Prof. Stella Yu. Ziwei received his PhD from the Chinese University of Hong Kong in 2017, under the supervision of Prof. Xiaoou Tang and Prof. Xiaogang Wang. During his PhD, Ziwei had the privilege of interning at Microsoft Research and Google Research, where he developed Microsoft Pix and Google Clips. His research revolves around computer vision/graphics, machine learning, and robotics. He has published over 50 papers (with more than 6,500 citations) in top-tier conferences and journals in relevant fields, including CVPR, ICCV, ECCV, AAAI, IROS, SIGGRAPH, T-PAMI, and TOG. He is the recipient of the Microsoft Young Fellowship, the Hong Kong PhD Fellowship, the ICCV Young Researcher Award, and the HKSTP Best Paper Award. He has won championships in major computer vision competitions, including the DAVIS video segmentation challenge 2017, the MSCOCO instance segmentation challenge 2018, and the FAIR self-supervision challenge 2019. He is also the lead contributor of several renowned computer vision benchmarks and software packages, including CelebA, DeepFashion, mmdetection and mmfashion.

Title: Sensing, Understanding and Synthesizing Humans in an Open World

Abstract: Sensing, understanding and synthesizing humans in images and videos has been a long-pursued goal of computer vision and graphics, with extensive real-life applications. It is at the core of embodied intelligence. In this talk, I will discuss our work in human-centric visual analysis (of faces, human bodies, scenes, videos and 3D scans), with an emphasis on learning structural deep representations under complex scenarios. I will also discuss the challenges related to naturally-distributed data (e.g. long-tailed and open-ended) arising from real-world sensors, and how we can overcome these challenges by incorporating new neural computing mechanisms such as dynamic memory and routing. Our approach has shown its effectiveness on both discriminative and generative tasks.