🎥 Stanford and University of Washington Workshop on Video Analytics

📖 Description

The Workshop on Video Analytics (WoVA) brings together experts from industry and academia working in video data management and analytics. The goal is to share cutting-edge research and novel challenges in video analytics; our long-term goal is to build a community of researchers and industry practitioners in this space. Broad topics of interest include storage, streaming, edge analytics, machine learning over videos, and applications that use video. The workshop consists of talks and a panel featuring both academic and industry participants.

🚀 Organizers

📍 Where

Format: The workshop will be hybrid, with the option to participate virtually via Zoom or attend in person on the Stanford campus.
Location: Fujitsu Room (4th floor), Gates Computer Science Building, Stanford University
As you exit the elevators, go left, then immediately right past the kitchen at the end of the hall. Signs at the building entrance and on the 4th floor will direct you.
Zoom Info: Included in the calendar invite you receive upon registration.

🗓️ When

Monday, May 23, 2022, 8:30 AM - 5:00 PM

🎟️ Registration

Registration, as well as lightning talk and poster submission, is done via the Registration Form. Lightning talk and poster submissions will be evaluated by the workshop organizers, and decision notifications will be sent no later than April 15th. If accepted, we look forward to your presenting both a lightning talk and a poster!

🧵 Agenda

The agenda is subject to change pending the final speaker topics.

| Time | Topic | Speaker |
| --- | --- | --- |
| 8:30 am | Breakfast | |
| 8:55 am | Introduction | Organizers |
| 9:00 am | First-Person Video Understanding | Kristen Grauman, Professor, University of Texas at Austin |
| 9:45 am | EdgeServe: Supporting Sensor Fusion in Distributed Edge Networks | Sanjay Krishnan, Assistant Professor, University of Chicago |
| 10:05 am | Translating video intelligence technology into practical use cases | Gil Becker, CEO, AnyClip |
| 10:25 am | Lightning talks (Titles and Abstracts) | Alexander Dietmüller, ETH Zürich (Slides)<br>Oscar Moll, MIT (Slides)<br>Jenya Pergament, Stanford University (Slides)<br>Wenjia He, University of Michigan (Slides)<br>Gaurav Tarlok Kakkar, Georgia Tech (Slides) |
| 10:50 am | Break | |
| 11:20 am | The Video Coding Unit: Warehouse Scale Video Transcoding at Google | Danner Stodolsky, VP of Engineering, Compute Infra and Performance, Google |
| 11:40 am | Data Centric AI with Video | Russell Kaplan, Head of Nucleus, Scale AI |
| 12:00 pm | Visual Data Systems for Metadata to Metaverse | Nilesh Jain, Principal Engineer, Intel |
| 12:20 pm | Lunch | |
| 1:30 pm | EVA: An End-to-End Data System for Querying Videos At Scale | Joy Arulraj, Assistant Professor, Georgia Tech |
| 1:50 pm | Edge Video Services on 5G Infrastructure | Ganesh Ananthanarayanan, Principal Researcher, Microsoft Research |
| 2:10 pm | Adversarial Attacks on Production CV Systems | Reza Zadeh, CEO, Matroid |
| 2:30 pm | Lightning talks (Titles and Abstracts) | Vishakha Gupta, ApertureData (Slides)<br>Pulkit Tandon, Stanford University (Slides)<br>Favyen Bastani, AllenAI (Slides)<br>Yao Lu, Microsoft Research (Slides)<br>Francisco Romero, Stanford University (Slides) |
| 2:55 pm | Poster session (Titles and Abstracts) | |
| 4:00 pm | Video scrambling: Fully discarding video contents and generating them on-the-fly | Amrita Mazumdar, Research Scientist, NVIDIA Research |
| 4:20 pm | Video Analytics at AV Scale | Volkmar Uhlig, CTO, Ghost Locomotion |
| 4:40 pm | Closing Remarks and Reception | Organizers |

📣 Abstracts

Keynote: First-Person Video Understanding

Kristen Grauman, Professor, University of Texas at Austin

First-person or “egocentric” perception requires understanding the video that streams to a person’s or robot’s wearable camera. The egocentric view offers a special window into the camera wearer’s attention, goals, and interactions with people and objects in the environment, making it an exciting avenue for perception in augmented reality and robot learning.
I will present our work on first-person video understanding and show our progress using passive observations of human activity to inform active robot behaviors. First, we explore learning visual affordances to anticipate how objects and spaces can be used. We show how to transform egocentric video into a human-centric topological map of a physical space (such as a kitchen) that captures its primary zones of interaction and the activities they support. Moving down to the object level, we develop video anticipation models that localize interaction “hotspots” indicating how and where an object can be manipulated (e.g., pressable, toggleable). Towards translating these affordances into robot action, we prime reinforcement learning agents to prefer human-like interactions, thereby accelerating their task learning. Finally, I will overview Ego4D, a massive new egocentric video dataset and benchmark built by a multi-institution collaboration that aims to push the frontier of first-person perception.

Biography

Kristen Grauman is a Professor in the Department of Computer Science at the University of Texas at Austin and a Research Director at Facebook AI Research (FAIR). Her research in computer vision and machine learning focuses on visual recognition, video, and embodied perception. Before joining UT-Austin in 2007, she received her Ph.D. at MIT. She is an IEEE Fellow, AAAI Fellow, Sloan Fellow, and recipient of the 2013 Computers and Thought Award. She and her collaborators have been recognized with several Best Paper awards in computer vision, including a 2011 Marr Prize and a 2017 Helmholtz Prize (test-of-time award). She has served as Associate Editor-in-Chief of PAMI and Program Chair of CVPR 2015 and NeurIPS 2018. http://www.cs.utexas.edu/~grauman/

Edge Video Services on 5G Infrastructure

Ganesh Ananthanarayanan, Principal Researcher, Microsoft Research

Creating a programmable software infrastructure for telecommunication operations promises to reduce both the capital expenditure and the operational expenses of 5G telecommunications operators. The convergence of telecommunications, the cloud, and edge infrastructures will open up opportunities for new innovations and revenue for both the telecommunications industry and the cloud ecosystem. This talk will focus on video, the dominant traffic type on the Internet since the introduction of 4G networks. With 5G, not only will the volume of video traffic increase, but there will also be many new solutions for industries, from retail to manufacturing to healthcare and forest monitoring, infusing deep learning and AI for video analytics scenarios. The talk will touch upon various advances in edge video analytics systems, including real-time inference over edge hierarchies, continuous learning of models, and privacy-preserving video analytics.

Translating video intelligence technology into practical use cases

Gil Becker, CEO, AnyClip

How we leverage Visual Intelligence to address essential market needs:
- The ultimate search engine - in video. Test case: AnyClip vs. YouTube
- Brand safety - Protecting companies and consumers from offensive content at scale
- AdTech - Contextual targeting - With the severe implications of privacy breaches and the extinction of cookie matching, there is a massive need for new and complementary methods to target audiences
- MediaTech - Content recommendation engine - Matching videos to articles
- WorkTech - Companies produce endless amounts of video, but much of it goes to waste because it carries no intelligence. There is a tremendous need to make videos organized, searchable, and collaborative

The Video Coding Unit: Warehouse Scale Video Transcoding at Google

Danner Stodolsky, VP of Engineering, Compute Infra and Performance, Google

Google handles, organizes, and serves an enormous quantity of both user-generated and professionally created video in support of products such as YouTube, Google Photos, Drive, and Meet. This talk will cover the role the Video Coding Unit plays in supporting this massive scale and its design as part of a global, warehouse-oriented video processing, storage, analysis, and delivery system.

Video scrambling: Fully discarding video contents and generating them on-the-fly

Amrita Mazumdar, Research Scientist, NVIDIA Research

Modern video analytics systems use images and videos to understand the visual world, leveraging AI algorithms to extract features and semantic information about the captured video. For instance, an autonomous driving video system produces segment maps partitioning pedestrians from roadside features and the environment. Simultaneously, graphics and vision research has put forth new algorithms for full reconstruction of an image or video using semantic information as input; given some details about the objects in a driving video, an AI model could reconstruct the video contents with high fidelity. In this talk, I’ll present video scrambling, a new video processing technique for analytics platforms. Video scrambling combines advances in object recognition and generative models for improved performance and stronger privacy guarantees. The video scrambling system we are building highlights some open challenges, including analytics accuracy, perceptual fidelity, and temporal consistency.

Visual Data Systems for Metadata to Metaverse

Nilesh Jain, Principal Engineer, Intel

The tremendous growth in visual computing is fueled by the rapid increase in the deployment of visual sensing (e.g., cameras) across many use cases, a trend further accelerated by the global pandemic. While most of the focus has been on computer vision and AI acceleration, a key challenge in visual processing is the efficient management and querying of large amounts of visual information. To address this challenge, we have developed an open-source Visual Data Management System (VDMS) that supports raw visual data, associated metadata, and key primitives to enable efficient visual/AI queries. In this presentation, we will describe our ongoing research, our open-source solution (VDMS), and its application to real-world use cases. We will also highlight the open challenges in creating distributed edge-to-cloud multimodal data systems to support emerging analytics and immersive applications.

EVA: An End-to-End Data System for Querying Videos At Scale

Joy Arulraj, Assistant Professor, Georgia Tech

Over the last decade, advances in deep learning have led to a resurgence of interest in automated analysis of videos at scale. This approach poses many challenges, ranging from the high computational overhead associated with deep learning models to the types of queries that the user may ask. In this talk, I will present EVA, an end-to-end data system that we are developing at Georgia Tech, for tackling these challenges using novel query optimization and machine learning techniques. The driving goal of our research is to help domain experts, regardless of their programming ability, analyze and draw insights from video datasets.
To facilitate efficient exploratory analysis of videos, EVA automatically materializes and reuses the results of expensive user-defined functions that wrap around deep learning models. Unlike reuse algorithms in traditional data systems, EVA takes a symbolic approach to analyze predicates for guiding critical optimization decisions like predicate reordering and model selection. EVA supports a range of throughput-accuracy tradeoffs for users with different accuracy constraints. It adopts a fine-grained approach to optimization by processing various video chunks using different models in the given ensemble to meet the user's accuracy target. We broaden the range of queries that a user may ask by supporting action queries that often involve a complex interaction between objects and are spread across a sequence of frames. The optimizer trains a deep reinforcement learning agent to adaptively construct video chunks that are then sent to an action classifier.
EVA is open-sourced [https://georgia-tech-db.github.io/eva/] and is being used by domain experts in fields such as endocrinology and astrophysics. Our goal is to allow everyone to efficiently and effortlessly uncover insights hidden in their video datasets.
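To make the reuse idea concrete, here is a minimal illustrative sketch in Python. It is not EVA's actual API; the names `MaterializedUDF` and the toy detector are hypothetical. It simply shows the general pattern of memoizing an expensive model UDF per frame so that repeated exploratory queries pay the inference cost only once.

```python
# Illustrative sketch (not EVA's API): cache the output of an expensive
# model UDF so repeated exploratory queries over the same frames can
# reuse prior results instead of re-running inference.
from typing import Any, Callable, Dict, Tuple


class MaterializedUDF:
    """Wraps an expensive frame-level function and memoizes its results."""

    def __init__(self, name: str, fn: Callable[[Any], Any]):
        self.name = name
        self.fn = fn
        self._cache: Dict[Tuple[str, int], Any] = {}

    def __call__(self, video_id: str, frame_id: int, frame: Any) -> Any:
        key = (video_id, frame_id)
        if key not in self._cache:      # first query pays the inference cost
            self._cache[key] = self.fn(frame)
        return self._cache[key]         # later queries reuse the materialized result


# Hypothetical usage: the lambda stands in for a deep learning detector.
detector = MaterializedUDF("detector", lambda frame: ["car", "person"])
labels = detector("traffic.mp4", frame_id=0, frame=None)
```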

Data Centric AI with Video

Russell Kaplan, Head of Nucleus, Scale AI

Adversarial Attacks on Production CV Systems

Reza Zadeh, CEO, Matroid

EdgeServe: Supporting Sensor Fusion in Distributed Edge Networks

Sanjay Krishnan, Assistant Professor, University of Chicago

Due to latency and privacy concerns, we are witnessing the rise of edge computing, where computation is placed close to the point of data collection to facilitate low-latency decisions. Many applications of edge computing involve sensor fusion, in which data are generated on different nodes in the network and have to be combined to make a decision. We propose an edge-based model serving system, called EdgeServe, that not only manages a distributed machine learning inference service but also orchestrates data movement between nodes on an edge network. We find that sensor fusion tasks have very different properties from traditional distributed machine learning; in particular, the strictness of temporal synchronization across the disparate data streams is important in this context but largely unaddressed in prior work. We present the architecture, a set of use cases (primarily multi-camera video analytics), and initial experiments showing that EdgeServe delivers lower-latency and/or more resource-efficient predictions than generic distributed machine learning frameworks and specialized streaming systems for robotics.
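As a toy illustration of the temporal-synchronization point, the Python sketch below (a hedged example only; the stream names and the `fuse_streams` helper are hypothetical and not EdgeServe's interface) pairs readings from two timestamped streams only when they fall within a small tolerance window, so the fused prediction never mixes stale and fresh data.

```python
# Toy sketch: align two sorted, timestamped sensor streams before fusion.
from collections import deque


def fuse_streams(camera, lidar, tolerance_s=0.05):
    """Yield (camera_reading, lidar_reading) pairs aligned in time.

    Each stream is an iterable of (timestamp, payload) tuples sorted by time.
    """
    lidar_buf = deque(lidar)
    for cam_ts, cam_payload in camera:
        # Drop lidar readings that are too old to pair with this frame.
        while lidar_buf and lidar_buf[0][0] < cam_ts - tolerance_s:
            lidar_buf.popleft()
        if lidar_buf and abs(lidar_buf[0][0] - cam_ts) <= tolerance_s:
            yield (cam_ts, cam_payload), lidar_buf[0]


camera = [(0.00, "frame0"), (0.10, "frame1"), (0.20, "frame2")]
lidar = [(0.01, "scan0"), (0.22, "scan1")]
print(list(fuse_streams(camera, lidar)))  # pairs frame0+scan0 and frame2+scan1
```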

Video Analytics at AV Scale

Volkmar Uhlig, CTO, Ghost Locomotion

🔮 Logistics

🏨 Accommodations

Lodging: A list of lodging around Stanford can be found here.
Parking: There are multiple parking options within walking distance of the workshop location: Via Ortega Garage, Roble Field Garage, Parking Circle, and Roth Way Garage. Pay at the corresponding kiosk or via the ParkMobile app.