Vinay Bettadapura
I am a Senior Software Engineer at Google, working with the AI Perception group (under Research and Machine Intelligence) on activity and event understanding from video, audio, and other sensor data.

Ph.D., Computer Science
Advisor: Prof. Irfan Essa
Computational Perception Lab (CPL)
College of Computing (CoC), Georgia Tech

Research Interests
My research interests are in the areas of Computer Vision, Machine Learning and Ubiquitous Computing.


Research Projects
Video and Accelerometer-Based Motion Analysis for Automated Surgical Skills Assessment
Basic surgical skills of suturing and knot tying are an essential part of medical training. Having an automated system for surgical skills assessment could help save experts' time and improve training efficiency. There have been some recent attempts at automated surgical skills assessment using either video analysis or acceleration data. In this paper, we present a novel approach for automated assessment of OSATS-like surgical skills and provide an analysis of different features on multi-modal data (video and accelerometer data). We conduct a large study for basic surgical skill assessment on a dataset that contains video and accelerometer data for suturing and knot-tying tasks. We introduce entropy-based features, Approximate Entropy (ApEn) and Cross-Approximate Entropy (XApEn), which quantify the amount of predictability and regularity of fluctuations in time-series data. The proposed features are compared to the existing methods of Sequential Motion Texture (SMT), Discrete Cosine Transform (DCT) and Discrete Fourier Transform (DFT) for surgical skills assessment. We report the average performance of different features across all applicable OSATS-like criteria for suturing and knot-tying tasks. Our analysis shows that the proposed entropy-based features outperform previous state-of-the-art methods using video data, achieving average classification accuracies of 95.1% and 92.2% for suturing and knot tying, respectively. For accelerometer data, our method performs better for suturing, achieving 86.8% average accuracy. We also show that fusion of video and acceleration features can improve overall performance for skill assessment.
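As a rough illustration of what an entropy-based feature measures, here is a minimal sketch of the textbook Approximate Entropy definition in plain Python. The function names, the defaults `m=2` and `r=0.2`, and the use of an absolute tolerance are assumptions made for this example and are not taken from the paper.

```python
import math


def _phi(series, m, r):
    """Average log-frequency of length-m windows that repeat within tolerance r."""
    n = len(series) - m + 1
    windows = [series[i:i + m] for i in range(n)]
    total = 0.0
    for w in windows:
        # Count windows whose Chebyshev distance to w is at most r.
        # The window always matches itself, so the count is never zero.
        matches = sum(
            1 for v in windows
            if max(abs(a - b) for a, b in zip(w, v)) <= r
        )
        total += math.log(matches / n)
    return total / n


def approximate_entropy(series, m=2, r=0.2):
    """ApEn(m, r) = phi(m) - phi(m + 1); lower values mean a more regular signal."""
    if len(series) < m + 2:
        raise ValueError("series is too short for the chosen window length m")
    return _phi(series, m, r) - _phi(series, m + 1, r)
```

A strictly periodic signal scores close to zero, while an irregular one scores noticeably higher. In practice the tolerance `r` is often set relative to the standard deviation of the signal.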

Here is the IJCARS 2018 journal paper [IJCARS 18] and here is the IPCAI 2017 paper [IPCAI 17]

Leveraging Contextual Cues for Generating Basketball Highlights
The massive growth of sports videos has resulted in a need for automatic generation of sports highlights that are comparable in quality to the hand-edited highlights produced by broadcasters such as ESPN. Unlike previous works that mostly use audio-visual cues derived from the video, we propose an approach that additionally leverages contextual cues derived from the environment that the game is being played in. The contextual cues provide information about the excitement levels in the game, which can be ranked and selected to automatically produce high-quality basketball highlights. We introduce a new dataset of 25 NCAA games along with their play-by-play stats and the ground-truth excitement data for each basket. We explore the informativeness of five different cues derived from the video and from the environment through user studies. Our experiments show that for our study participants, the highlights produced by our system are comparable to the ones produced by ESPN for the same games.

Here is the ACM MM 2016 Project Webpage (PDF and Video Demo)

Accepted for oral presentation

Automated Video-Based Assessment of Surgical Skills for Training and Evaluation in Medical Schools
Routine evaluation of basic surgical skills in medical schools requires considerable time and effort from supervising faculty. A supervisor has to observe each surgical trainee in person. Alternatively, supervisors may use training videos, which reduces some of the logistical overhead. All these approaches, however, are still incredibly time-consuming and involve human bias. In this paper, we present an automated system for surgical skills assessment by analyzing video data of surgical activities. We compare different techniques for video-based surgical skill evaluation. We use techniques that capture the motion information at a coarser granularity using symbols or words, extract motion dynamics using textural patterns in a frame kernel matrix, and analyze fine-grained motion information using frequency analysis. We were able to classify surgeons into different skill levels with high accuracy. Our results indicate that fine-grained analysis of motion dynamics via frequency analysis is most effective in capturing the skill-relevant information in surgical videos. Our evaluations show that frequency features perform better than motion texture features, which in turn perform better than symbol/word-based features. Put succinctly, skill classification accuracy is positively correlated with motion granularity, as demonstrated by our results on two challenging video datasets.

Here is the IJCARS 2016 journal paper [IJCARS 16]

Discovering Picturesque Highlights from Egocentric Vacation Videos
We present an approach for identifying picturesque highlights from large amounts of egocentric video data. Given a set of egocentric videos captured over the course of a vacation, our method analyzes the videos and looks for images that have good picturesque and artistic properties. We introduce novel techniques to automatically determine aesthetic features such as composition, symmetry and color vibrancy in egocentric videos and rank the video frames based on their photographic qualities to generate highlights. Our approach also uses contextual information such as GPS, when available, to assess the relative importance of each geographic location where the vacation videos were shot. Furthermore, we specifically leverage the properties of egocentric videos to improve our highlight detection. We demonstrate results on a new egocentric vacation dataset, which includes 26.5 hours of videos taken over a 14-day vacation that spans many famous tourist destinations, and also provide results from a user study to assess our results.

Here is the WACV 2016 paper [WACV 16]

Automated Assessment of Surgical Skills Using Frequency Analysis
We present an automated framework for visual assessment of the expertise level of surgeons using the OSATS (Objective Structured Assessment of Technical Skills) criteria. Video analysis techniques for extracting motion quality via frequency coefficients are introduced. The framework is tested on videos of medical students with different expertise levels performing basic surgical tasks in a surgical training lab setting. We demonstrate that transforming the sequential time data into frequency components effectively extracts the useful information differentiating between different skill levels of the surgeons. The results show significant performance improvements using DFT and DCT coefficients over known state-of-the-art techniques.
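To make the idea of frequency coefficients concrete, the sketch below computes the first few DFT magnitudes and DCT-II coefficients of a one-dimensional motion signal in plain Python. It is only an illustration: the function names, the normalization, and the choice of how many coefficients to keep are assumptions, and the paper's actual feature pipeline is not reproduced here.

```python
import cmath
import math


def dft_magnitudes(signal, k):
    """Magnitudes of the first k DFT coefficients, normalized by signal length."""
    n = len(signal)
    feats = []
    for f in range(k):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * f * t / n)
            for t, x in enumerate(signal)
        )
        feats.append(abs(coeff) / n)
    return feats


def dct_coefficients(signal, k):
    """First k unnormalized DCT-II coefficients of the signal."""
    n = len(signal)
    return [
        sum(x * math.cos(math.pi / n * (t + 0.5) * f) for t, x in enumerate(signal))
        for f in range(k)
    ]


# Example: a signal that oscillates three times over 64 samples.
wave = [math.sin(2 * math.pi * 3 * t / 64) for t in range(64)]
features = dft_magnitudes(wave, 8)  # energy concentrates at index 3
```

Keeping only the low-order coefficients gives a fixed-length feature vector regardless of how long the original sequence is, which is convenient for feeding a standard classifier.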

Here is the MICCAI 2015 paper [MICCAI 15]

Predicting Daily Activities From Egocentric Images Using Deep Learning
We present a method to analyze images taken from a passive egocentric wearable camera along with contextual information, such as time and day of week, to learn and predict the everyday activities of an individual. We collected a dataset of 40,103 egocentric images over a 6-month period with 19 activity classes and demonstrate the benefit of state-of-the-art deep learning techniques for learning and predicting daily activities. Classification is conducted using a Convolutional Neural Network (CNN) with a classification method we introduce called a late fusion ensemble. This late fusion ensemble incorporates relevant contextual information and increases our classification accuracy. Our technique achieves an overall accuracy of 83.07% in predicting a person's activity across the 19 activity classes. We also demonstrate some promising results from two additional users by fine-tuning the classifier with one day of training data.
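The general idea of late fusion, combining the output of an image classifier with a separate context-based estimate, can be sketched as follows. This is a deliberately simple weighted average and should not be read as the ensemble used in the paper; the function names, the `cnn_weight` parameter, and the example numbers are all hypothetical.

```python
def late_fusion(cnn_probs, context_probs, cnn_weight=0.7):
    """Blend two per-class probability lists and renormalize the result."""
    if len(cnn_probs) != len(context_probs):
        raise ValueError("both inputs must cover the same classes")
    fused = [
        cnn_weight * p + (1.0 - cnn_weight) * q
        for p, q in zip(cnn_probs, context_probs)
    ]
    total = sum(fused)
    return [f / total for f in fused]


def predict(cnn_probs, context_probs, labels, cnn_weight=0.7):
    """Return the label with the highest fused probability."""
    fused = late_fusion(cnn_probs, context_probs, cnn_weight)
    best = max(range(len(fused)), key=fused.__getitem__)
    return labels[best]


# Hypothetical example: the image alone slightly favors "cooking", but the
# context estimate (say, a weekday morning) strongly favors "working".
labels = ["cooking", "working", "driving"]
image_only = [0.45, 0.40, 0.15]
context_only = [0.10, 0.80, 0.10]
decision = predict(image_only, context_only, labels)
```

The point of fusing late is that the image model and the context model can be trained and tuned independently, and only their outputs need to agree on the set of classes.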

Here is the ISWC 2015 Project Webpage (with PDF)

Egocentric Field-of-View Localization Using First-Person Point-of-View Devices
We present a technique that uses images, videos and sensor data taken from first-person point-of-view devices to perform egocentric field-of-view (FOV) localization. We define egocentric FOV localization as capturing the visual information from a person’s field-of-view in a given environment and transferring this information onto a reference corpus of images and videos of the same space, hence determining what a person is attending to. Our method matches images and video taken from the first-person perspective with the reference corpus and refines the results using the first-person’s head orientation information obtained using the device sensors. We demonstrate single and multi-user egocentric FOV localization in different indoor and outdoor environments with applications in augmented reality, event understanding and studying social interactions.

Here is the WACV 2015 Project Webpage (PDF, Poster and Video Demo)

We won the best paper award at WACV 2015

Leveraging Context to Support Automated Food Recognition in Restaurants
The pervasiveness of mobile cameras has resulted in a dramatic increase in food photos, which are pictures reflecting what people eat. In this paper, we study how taking pictures of what we eat in restaurants can be used for the purpose of automating food journaling. We propose to leverage the context of where the picture was taken, with additional information about the restaurant available online, coupled with state-of-the-art computer vision techniques to recognize the food being consumed. To this end, we demonstrate image-based recognition of foods eaten in restaurants by training a classifier with images from restaurants' online menu databases. We evaluate the performance of our system in unconstrained, real-world settings with food images taken in 10 restaurants across 5 different types of food (American, Indian, Italian, Mexican and Thai).

Here is the WACV 2015 Project Webpage (PDF and Poster).

Video Based Assessment of OSATS Using Sequential Motion Textures
We present a fully automated framework for video-based surgical skill assessment that incorporates the sequential and qualitative aspects of surgical motion in a data-driven manner. We replicate Objective Structured Assessment of Technical Skills (OSATS) assessments, which provide both an overall and an in-detail evaluation of the basic suturing skills required of surgeons. Video analysis techniques are introduced that incorporate sequential motion aspects into motion textures. We also demonstrate significant performance improvements over standard bag-of-words and motion analysis approaches. We evaluate our framework in a case study that involved medical students with varying levels of expertise performing basic surgical tasks in a surgical training lab setting.

Here is the M2CAI 2014 paper [M2CAI 14]

We received an honorable mention (2nd place) at M2CAI 2014

Activity Recognition From Videos Using Augmented Bag-of-Words
We present data-driven techniques to augment Bag of Words (BoW) models, which allow for more robust modeling and recognition of complex long-term activities, especially when the structure and topology of the activities are not known a priori. Our approach specifically addresses the limitations of standard BoW approaches, which fail to represent the underlying temporal and causal information that is inherent in activity streams. We also propose the use of randomly sampled regular expressions to discover and encode patterns in activities. We demonstrate the effectiveness of our approach in experimental evaluations where we successfully recognize activities and detect anomalies in four complex datasets.
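The limitation of a plain BoW histogram, that it discards ordering, can be shown with a small sketch. Here the histogram is augmented with n-gram counts over a sequence of event labels. This is only one simple way of adding temporal information, chosen for illustration; the event names and function names are hypothetical, and the regular-expression sampling mentioned above is not covered.

```python
from collections import Counter


def ngram_histogram(events, n):
    """Count every run of n consecutive events in the sequence."""
    return Counter(
        tuple(events[i:i + n]) for i in range(len(events) - n + 1)
    )


def augmented_bow(events, max_n=3):
    """Histogram over all n-grams for n = 1..max_n (n = 1 is plain BoW)."""
    features = Counter()
    for n in range(1, max_n + 1):
        features.update(ngram_histogram(events, n))
    return features


# Two activities built from the same events in a different order.
forward = ["open", "pour", "close"]
backward = ["close", "pour", "open"]

same_plain_bow = Counter(forward) == Counter(backward)
same_augmented = augmented_bow(forward) == augmented_bow(backward)
```

With plain counts the two sequences are indistinguishable, whereas the augmented histogram separates them because it records which event follows which.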

Here is the CVPR 2013 Project Webpage (PDF and Code).

Detecting Insider Threats in a Real Corporate Database of Computer Usage Activity
This paper reports on methods and results of an applied research project by a team consisting of SAIC and four universities to develop, integrate, and evaluate new approaches to detect the weak signals characteristic of insider threats on organizations’ information systems. Our system combines structural and semantic information from a real corporate database of monitored activity on their users’ computers to detect independently developed red team inserts of malicious insider activities. We have developed and applied multiple algorithms for anomaly detection based on suspected scenarios of malicious insider behavior, indicators of unusual activities, high-dimensional statistical patterns, temporal sequences, and normal graph evolution. Algorithms and representations for dynamic graph processing provide the ability to scale as needed for enterprise-level deployments on real-time data streams. We have also developed a visual language for specifying combinations of features, baselines, peer groups, time periods, and algorithms to detect anomalies suggestive of instances of insider threat behavior. We defined over 100 data features in seven categories based on approximately 5.5 million actions per day from approximately 5,500 users. We have achieved area under the ROC curve values of up to 0.979 and lift values of 65 on the top 50 user-days identified on two months of real data.
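For readers unfamiliar with the two evaluation measures quoted above, the sketch below shows the standard definitions of lift over the top k ranked items and of area under the ROC curve. The function names and the toy scores are made up for the example; nothing here reproduces the project's data or evaluation code.

```python
def lift_at_k(scores, labels, k):
    """Precision among the k highest-scored items divided by the base rate."""
    if not 0 < k <= len(scores):
        raise ValueError("k must be between 1 and the number of items")
    base_rate = sum(labels) / len(labels)
    if base_rate == 0:
        raise ValueError("lift is undefined without positive examples")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    precision = sum(label for _, label in ranked[:k]) / k
    return precision / base_rate


def roc_auc(scores, labels):
    """Probability that a random positive outscores a random negative."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy anomaly scores for eight user-days, two of which are true positives.
scores = [0.9, 0.8, 0.3, 0.2, 0.1, 0.05, 0.04, 0.03]
labels = [1, 0, 1, 0, 0, 0, 0, 0]
top2_lift = lift_at_k(scores, labels, 2)
auc = roc_auc(scores, labels)
```

A lift of 65 on the top 50 user-days therefore means that the flagged set contained 65 times as many true positives as a random sample of the same size would be expected to contain.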

Here is the KDD 2013 paper [KDD 13]

Recognizing Water-Based Activities in the Home Through Infrastructure-Mediated Sensing
Activity recognition in the home has long been recognized as the foundation for many desirable applications in fields such as home automation, sustainability, and healthcare. However, building a practical home activity monitoring system remains a challenge. Striking a balance between cost, privacy, ease of installation and scalability continues to be an elusive goal. In this paper, we explore infrastructure-mediated sensing combined with a vector space model learning approach as the basis of an activity recognition system for the home. We examine the performance of our single-sensor water-based system in recognizing eleven high-level activities in the kitchen and bathroom, such as cooking and shaving. Results from two studies show that our system can estimate activities with an overall accuracy of 82.69% for one individual and 70.11% for a group of 23 participants. As far as we know, our work is the first to employ infrastructure-mediated sensing for inferring high-level human activities in a home setting.

Here is the UbiComp 2012 paper [UbiComp 12]

Activity Recognition from Wide Area Motion Imagery
This project aims at recognizing anomalous activities from aerial videos. My work is a part of the Persistent Stare Exploitation and Analysis System (PerSEAS) research program which aims to develop software systems that can automatically and interactively discover actionable intelligence from airborne, wide area motion imagery (WAMI) in complex urban environments.

A glimpse of this project can be seen here.

Leafsnap: An Electronic Field Guide

This project aims to simplify the process of plant species identification using visual recognition software on mobile devices such as the iPhone. This work is part of an ongoing collaboration with researchers at Columbia University, University of Maryland and the Smithsonian Institution. My major contribution to this project was the server's database integration and management. I also worked on stress-testing the backend server to improve its performance and scalability.

The free iPhone app can be downloaded from the app-store. Here is the project webpage and here is a video explaining the app's usage. Finally, Leafsnap in the news!

Visual Attributes for Face Verification

The project involves face verification in uncontrolled settings with non-cooperative subjects. The method is based on attribute (binary) classifiers that are trained to recognize the degrees of various visual attributes like gender, race, age, etc. Here is the project page.

I was a part of this research at Columbia University from December 2009 to May 2010. I mainly worked on Boosting to improve the classifiers' performance.

Face Recognition Using Gabor Wavelets

The choice of object representation is crucial for the effective performance of cognitive tasks such as object recognition and fixation. Face recognition is an example of advanced object recognition, and it is made more difficult by factors such as shape, reflectance, pose, occlusion and illumination. Many well-known face recognition techniques exist today. In our project, we demonstrate the use of Gabor wavelets for efficient face representation, with the aim of building a face recognition system that simulates human perception of objects and faces. A face recognition system could greatly aid in searching and classifying a face database and, at a higher level, help identify possible threats to security. The purpose of this study is to demonstrate that it is technically feasible to scan pictures of human faces and compare them with ID photos hosted in a centralized database using Gabor wavelets.
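For context, a Gabor wavelet is a sinusoidal carrier multiplied by a Gaussian envelope, and a face representation is typically built from the responses of several such filters at different orientations. The sketch below generates the real part of a 2D Gabor kernel using a common parameterization. The parameter names and default values are assumptions for this example and are not taken from the project report.

```python
import math


def gabor_kernel(size, wavelength, theta, sigma, gamma=0.5, psi=0.0):
    """Real part of a 2D Gabor filter as a size x size list of rows.

    wavelength: period of the cosine carrier in pixels
    theta:      orientation of the carrier in radians
    sigma:      standard deviation of the Gaussian envelope
    gamma:      aspect ratio of the envelope
    psi:        phase offset of the carrier
    """
    if size % 2 == 0:
        raise ValueError("size must be odd so the kernel has a center pixel")
    half = size // 2
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate the coordinates into the filter's orientation.
            xr = x * cos_t + y * sin_t
            yr = -x * sin_t + y * cos_t
            envelope = math.exp(-(xr ** 2 + (gamma * yr) ** 2) / (2 * sigma ** 2))
            carrier = math.cos(2 * math.pi * xr / wavelength + psi)
            row.append(envelope * carrier)
        kernel.append(row)
    return kernel


def gabor_bank(size, wavelength, sigma, orientations=4):
    """Kernels at evenly spaced orientations between 0 and pi."""
    return [
        gabor_kernel(size, wavelength, i * math.pi / orientations, sigma)
        for i in range(orientations)
    ]
```

Convolving a face image with each kernel in the bank and sampling the responses at chosen points gives a feature vector that can be compared against the vectors stored for known faces.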

This was my undergraduate thesis supervised by Dr. C. N. S. Ganesh Murthy, Principal Scientist at Mercedes-Benz Research and Development, Bangalore, India. Here is the project report [FACE REC]


Ph.D. Dissertation
Leveraging Contextual Cues for Dynamic Scene Understanding
Environments with people are complex, with many activities and events that need to be represented and explained. The goal of scene understanding is to either determine what objects and people are doing in such complex and dynamic environments, or to know the overall happenings, such as the highlights of the scene. The context within which the activities and events unfold provides key insights that cannot be derived by studying the activities and events alone. In this thesis, we show that this rich contextual information can be successfully leveraged, along with the video data, to support dynamic scene understanding.

We categorize and study four different types of contextual cues: (1) spatiotemporal context, (2) egocentric context, (3) geographic context, and (4) environmental context, and show that they improve dynamic scene understanding tasks across several different application domains.

We start by presenting data-driven techniques to enrich spatiotemporal context by augmenting Bag-of-Words models with temporal, local and global causality information and show that this improves activity recognition, anomaly detection and scene assessment from videos. Next, we leverage the egocentric context derived from sensor data captured from first-person point-of-view devices to perform field-of-view localization in order to understand the user’s focus of attention. We demonstrate single and multi-user field-of-view localization in both indoor and outdoor environments with applications in augmented reality, event understanding and studying social interactions. Next, we look at how geographic context can be leveraged to make challenging “in-the-wild” object recognition tasks more tractable, using the problem of food recognition in restaurants as a case study. Finally, we study the environmental context obtained from dynamic scenes such as sporting events, which take place in responsive environments such as stadiums and gymnasiums, and show that it can be successfully used to address the challenging task of automatically generating basketball highlights. We perform comprehensive user studies on 25 full-length NCAA games and demonstrate the effectiveness of environmental context in producing highlights that are comparable to the highlights produced by ESPN.

Here is a PDF of my dissertation [DISSERTATION]


Publications
Link to my Google Scholar page.
  1. A. Zia, Y. Sharma, V. Bettadapura, E. Sarin, I. Essa, "Video and Accelerometer-Based Motion Analysis for Automated Surgical Skills Assessment", International Journal of Computer Assisted Radiology and Surgery (IJCARS), January, 2018 [IJCARS 18]
  2. A. Zia, Y. Sharma, V. Bettadapura, E. Sarin, I. Essa, "Video and Accelerometer-Based Motion Analysis for Automated Surgical Skills Assessment", Proc. Information Processing in Computer-Assisted Interventions (IPCAI 2017), Barcelona, Spain, 2017 [IPCAI 17] [arXiv]
  3. V. Bettadapura, C. Pantofaru, I. Essa, "Leveraging Contextual Cues for Generating Basketball Highlights", ACM Multimedia Conference (ACM-MM 2016), Amsterdam, Netherlands, October 2016. [Oral] [Acceptance Rate: 20% (52/650)] [ACM-MM 16] [Project Webpage] [arXiv]
  4. A. Zia, Y. Sharma, V. Bettadapura, E. Sarin, T. Ploetz, M. Clements, I. Essa, "Automated Video-Based Assessment of Surgical Skills for Training and Evaluation in Medical Schools", International Journal of Computer Assisted Radiology and Surgery (IJCARS), 11(9), pp. 1623-1636, 2016 [IJCARS 16]
  5. V. Bettadapura, D. Castro, I. Essa, "Discovering Picturesque Highlights From Egocentric Vacation Videos", IEEE Winter Conference on Applications of Computer Vision (WACV 2016), Lake Placid, USA, March 2016. [Acceptance Rate: 34% (71/207)] [WACV 16] [Project Webpage] [arXiv]
  6. A. Zia, Y. Sharma, V. Bettadapura, E. Sarin, I. Essa, "Automated Assessment of Surgical Skills Using Frequency Analysis", 18th International Conference on Medical Image Computing and Computer Assisted Interventions (MICCAI 2015), Munich, Germany, October 2015. [Acceptance Rate < 30.0%] [MICCAI 15]
  7. D. Castro, S. Hickson, V. Bettadapura, E. Thomaz, G. Abowd, H. Christensen, I. Essa, "Predicting Daily Activities From Egocentric Images Using Deep Learning", 19th International Symposium on Wearable Computing (ISWC 2015), Osaka, Japan, September 2015. [Acceptance Rate (for full papers): 10.7% (13/121)] [ISWC 15] [Project Webpage] [arXiv]
  8. V. Bettadapura, I. Essa, C. Pantofaru, "Egocentric Field-of-View Localization Using First-Person Point-of-View Devices", IEEE Winter Conference on Applications of Computer Vision (WACV 2015), Hawaii, USA, January 2015. [Acceptance Rate: 36.7% (156/425)] [WACV 15] [Project Webpage] [arXiv] (Won the best paper award)
  9. V. Bettadapura, E. Thomaz, A. Parnami, G. Abowd, I. Essa, "Leveraging Context to Support Automated Food Recognition in Restaurants", IEEE Winter Conference on Applications of Computer Vision (WACV 2015), Hawaii, USA, January 2015. [Acceptance Rate: 36.7% (156/425)] [WACV 15] [Project Webpage] [arXiv]
  10. Y. Sharma, V. Bettadapura, et al., "Video Based Assessment of OSATS Using Sequential Motion Textures", 5th MICCAI Workshop on Modeling and Monitoring of Computer Assisted Interventions (M2CAI 2014), Boston, USA, September 2014 [M2CAI 14] (Received an honorable mention - 2nd place)
  11. T. E. Senator, et al., "Detecting Insider Threats in a Real Corporate Database of Computer Usage Activity", 19th ACM SIGKDD Conf. on Knowledge Discovery and Data Mining (KDD 2013), Chicago, USA, August 2013. [Acceptance Rate: 17.4% (126/726)] [KDD 13]
  12. V. Bettadapura, G. Schindler, T. Ploetz, I. Essa, "Augmenting Bag-of-Words: Data-Driven Discovery of Temporal and Structural Information for Activity Recognition", 26th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2013), Portland, USA, June 2013. [Acceptance Rate: 25.2% (472/1870)] [CVPR 13] [Project Webpage] [arXiv]
  13. E. Thomaz, V. Bettadapura, G. Reyes, M. Sandesh, G. Schindler, T. Ploetz, G. Abowd, I. Essa, "Recognizing Water-Based Activities in the Home Through Infrastructure-Mediated Sensing", 14th ACM Conference on Ubiquitous Computing (UbiComp 2012), pp. 85-94, Pittsburgh, USA, September 2012. [Acceptance Rate: 19% (58/301)] [UbiComp 12]
  14. V. Bettadapura, "Face Expression Recognition and Analysis: The State of the Art", Tech Report, arXiv:1203.6722, April 2012 [FACE EXP REC] [arXiv]
  15. V. Bettadapura, D. R. Sai Sharan, "Pattern Recognition with Localized Gabor Wavelet Grids", IEEE Conference on Computational Intelligence and Multimedia Applications, vol. 2, pp. 517-521, Sivakasi, India, December 2007 [ICCIMA 07]
  16. V. Bettadapura, B. S. Shreyas, C. N. S. Ganesh Murthy, "A Back Propagation Based Face Recognition Model Using 2D Symmetric Gabor Features", IEEE Conference on Signal Processing, Communications and Networking, pp. 433-437, Chennai, India, February 2007 [ICSCN 07]
  17. V. Bettadapura, B. S. Shreyas, "Face Recognition Using Gabor Wavelets", 40th IEEE Asilomar Conference on Signals, Systems and Computers, pp. 593-597, Pacific Grove (Monterey Bay), California, USA, October 2006 [ASILOMAR 06]


Patents
  1. C. Pantofaru, V. Bettadapura, K. Bharat, I. Essa, "Systems and methods for attention localization using a first-person point-of-view device", United States Patent 9600723


Work Experience
  1. Google: Senior Software Engineer (January 2016 - Present): Working on event and video understanding, and other related Computer Vision and Machine Learning technologies.
  2. Google: Software Engineering Intern (August 2013 - December 2015): Worked on event and video understanding using multi-modal data (videos, images and sensor data).
  3. Google Geo: Software Engineering Intern (May 2013 - August 2013): Worked with the Google Earth and Maps team on improving the quality of the satellite imagery.
  4. Google Research: Software Engineering Intern (May 2012 - August 2012): Worked with the Video Content Analysis team in developing algorithms and building systems for object detection and categorization in YouTube videos.
  5. Subex: Software Engineer (June 2006 - December 2008): Designed and developed telecommunication fraud protection and anomaly detection systems. Worked on the mathematical modeling of user behaviors, data mining to detect anomalies in the signals, and the design and development of the back-end server, database and web interfaces.




Contact
vinay [at]
Also on:
Instagram LinkedIn Facebook


