PhD Thesis (2014) by Yachna Sharma “Surgical Skill Assessment Using Motion Texture analysis”

May 2nd, 2014 Irfan Essa Posted in Medical, PhD, Yachna Sharma No Comments »

Thesis title: Surgical Skill Assessment Using Motion Texture analysis

Yachna Sharma, Ph. D. Candidate, ECE


Prof. Irfan Essa (advisor), College of Computing
Prof. Mark A. Clements (co-advisor), School of Electrical and Computer Engineering
Prof. David Anderson, School of Electrical and Computer Engineering
Prof. Anthony Yezzi, School of Electrical and Computer Engineering
Prof. Christopher F. Barnes, School of Electrical and Computer Engineering
Dr. Thomas Ploetz, Culture lab, School of Computing Science, Newcastle University, United Kingdom
Dr. Eric L. Sarin, Division of Cardiothoracic Surgery, Department of Surgery, Emory University School of Medicine


The objective of this Ph.D. research is to design and develop a framework for automated assessment of surgical skills.Automated assessment can help expedite the manual assessment process and provide unbiased evaluations with possible dexterity feedback.

Evaluation of surgical skills is an important aspect in training of medical students. Current practices rely on manual evaluations from faculty and residents and are time consuming. Proposed solutions in literature involve retrospective evaluations such as watching the offline videos. It requires precious time and attention of expert surgeons and may vary from one surgeon to another. With recent advancements in computer vision and machine learning techniques, the retrospective video evaluation can be best delegated to the computer algorithms.

Skill assessment is a challenging task requiring expert domain knowledge that may be difficult to translate into algorithms. To emulate this human observation process, an appropriate data collection mechanism is required to track motion of the surgeon’s hand in an unrestricted manner. In addition, it is essential to identify skill defining motion dynamics and skill relevant hand locations.

This Ph.D. research aims to address the limitations of manual skill assessment by developing an automated motion analysis framework. Specifically, we propose (1) to design and implement quantitative features to capture fine motion details from surgical video data, (2) to identify and test the efficacy of a core subset of features in classifying the surgical students into different expertise levels, (3) to derive absolute skill scores using regression methods and (4) to perform dexterity analysis using motion data from different hand locations.

AddThis Social Bookmark Button

PhD Thesis (2014) by S. Hussain Raza “Temporally Consistent Semantic Segmentation in Videos

May 2nd, 2014 Irfan Essa Posted in Computational Photography and Video, PhD, S. Hussain Raza No Comments »

Title : Temporally Consistent Semantic Segmentation in Videos

S. Hussain Raza, Ph. D. Candidate in ECE (


Prof. Irfan Essa (advisor), School of Interactive Computing
Prof. David Anderson (co-advisor), School of Electrical and Computer Engineering
Prof. Frank Dellaert, School of Interactive Computing
Prof. Anthony Yezzi, School of Electrical and Computer Engineering
Prof. Chris Barnes, School of Electrical and Computer Enginnering
Prof. Rahul Sukthanker, Department of Computer Science and Robotics, Carnegie Mellon University.

Abstract :

The objective of this Thesis research is to develop algorithms for temporally consistent semantic segmentation in videos. Though many different forms of semantic segmentations exist, this research is focused on the problem of temporally-consistent holistic scene understanding in outdoor videos. Holistic scene understanding requires an understanding of many individual aspects of the scene including 3D layout, objects present, occlusion boundaries, and depth. Such a description of a dynamic scene would be useful for many robotic applications including object reasoning, 3D perception, video analysis, video coding, segmentation, navigation and activity recognition.

Scene understanding has been studied with great success for still images. However, scene understanding in videos requires additional approaches to account for the temporal variation, dynamic information, and exploiting causality. As a first step, image-based scene understanding methods can be directly applied to individual video frames to generate a description of the scene. However, these methods do not exploit temporal information across neighboring frames. Further, lacking temporal consistency, image-based methods can result in temporally-inconsistent labels across frames. This inconsistency can impact performance, as scene labels suddenly change between frames.

The objective of our this study is to develop temporally consistent scene descriptive algorithms by processing videos efficiently, exploiting causality and data-redundancy, and cater for scene dynamics. Specifically, we achieve our research objects by (1) extracting geometric context from videos to give broad 3D structure of the scene with all objects present, (2) detecting occlusion boundaries in videos due to depth discontinuity, and (3) estimating depth in videos by combining monocular and motion features with semantic features and occlusion boundaries.

AddThis Social Bookmark Button

PhD Thesis by Zahoor Zafrulla “Automatic recognition of American Sign Language Classifiers

May 2nd, 2014 Irfan Essa Posted in Affective Computing, Behavioral Imaging, Face and Gesture, PhD, Thad Starner, Zahoor Zafrulla No Comments »

Title: Automatic recognition of American Sign Language Classifiers

Zahoor Zafrulla
School of Interactive Computing
College of Computing
Georgia Institute of Technology


Dr. Thad Starner (Advisor, School of Interactive Computing, Georgia Tech)
Dr. Irfan Essa (Co-Advisor, School of Interactive Computing, Georgia Tech)
Dr. Jim Rehg (School of Interactive Computing, Georgia Tech)
Dr. Harley Hamilton (School of Interactive Computing, Georgia Tech)
Dr. Vassilis Athitsos (Computer Science and Engineering Department, University of Texas at Arlington)


Automatically recognizing classifier-based grammatical structures of American Sign Language (ASL) is a challenging problem. Classifiers in ASL utilize surrogate hand shapes for people or “classes” of objects and provide information about their location, movement and appearance. In the past researchers have focused on recognition of finger spelling, isolated signs, facial expressions and interrogative words like WH-questions (e.g. Who, What, Where, and When). Challenging problems such as recognition of ASL sentences and classifier-based grammatical structures remain relatively unexplored in the field of ASL recognition.

One application of recognition of classifiers is toward creating educational games to help young deaf children acquire language skills. Previous work developed CopyCat, an educational ASL game that requires children to engage in a progressively more difficult expressive signing task as they advance through the game.

We have shown that by leveraging context we can use verification, in place of recognition, to boost machine performance for determining if the signed responses in an expressive signing task, like in the CopyCat game, are correct or incorrect. We have demonstrated that the quality of a machine verifier’s ability to identify the boundary of the signs can be improved by using a novel two-pass technique that combines signed input in both forward and reverse directions. Additionally, we have shown that we can reduce CopyCat’s dependency on custom manufactured hardware by using an off-the-shelf Microsoft Kinect depth camera to achieve similar verification performance. Finally, we show how we can extend our ability to recognize sign language by leveraging depth maps to develop a method using improved hand detection and hand shape classification to recognize selected classifier-based grammatical structures of ASL.

AddThis Social Bookmark Button

Two Ph. D. Defenses the same day. A first for me!

April 2nd, 2014 Irfan Essa Posted in Activity Recognition, Computational Photography and Video, Health Systems, PhD, S. Hussain Raza, Students, Yachna Sharma No Comments »

Today, two of my Ph. D. Students defended their Dissertations.  Back to back.  Congrats to both as they are both done.

Thesis title: Surgical Skill Assessment Using Motion Texture analysis
Student: Yachna Sharma, Ph. D. Candidate in ECE
Date/Time : 2nd April, 1:00 pm

Title : Temporally Consistent Semantic Segmentation in Videos
S. Hussain Raza, Ph. D. Candidate in ECE
Date/Time : 2nd April, 1:00 pm

Location : CSIP Library, Room 5186, CenterGy One Building


AddThis Social Bookmark Button

Matthias Grundmann’s PhD Thesis Defense (2013): “Title: Computational Video: Post-processing Methods for Stabilization, Retargeting and Segmentation”

February 4th, 2013 Irfan Essa Posted in Computational Photography and Video, Matthias Grundmann, PhD No Comments »

Title: Computational Video: Post-processing Methods for Stabilization, Retargeting and Segmentation

Matthias Grundmann
School of Interactive Computing
College of Computing
Georgia Institute of Technology

Date: February 04, 2013 (Monday)
Time: 3:00p – 6:00p EST
Location: Nano building, 116-118



In this thesis, we address a variety of challenges for analysis and enhancement of Computational Video. We present novel post-processing methods to bridge the difference between professional and casually shot videos mostly seen on online sites. Our research presents solutions to three well-defined problems: (1) Video stabilization and rolling shutter removal in casually-shot, uncalibrated videos; (2) Content-aware video retargeting; and (3) spatio-temporal video segmentation to enable efficient video annotation. We showcase several real-world applications building on these techniques.

We start by proposing a novel algorithm for video stabilization that generates stabilized videos by employing L1-optimal camera paths to remove undesirable motions. We compute camera paths that are optimally partitioned into constant, linear and parabolic segments mimicking the camera motions employed by professional cinematographers. To achieve this, we propose a linear programming framework to minimize the first, second, and third derivatives of the resulting camera path. Our method allows for video stabilization beyond conventional filtering, that only suppresses high frequency jitter. An additional challenge in videos shot from mobile phones are rolling shutter distortions. Modern CMOS cameras capture the frame one scanline at a time, which results in non-rigid image distortions such as shear and wobble. We propose a solution based on a novel mixture model of homographies parametrized by scanline blocks to correct these rolling shutter distortions. Our method does not rely on a-priori knowledge of the readout time nor requires prior camera calibration. Our novel video stabilization and calibration free rolling shutter removal have been deployed on YouTube where they have successfully stabilized millions of videos. We also discuss several extensions to the stabilization algorithm and present technical details behind the widely used YouTube Video Stabilizer.

We address the challenge of changing the aspect ratio of videos, by proposing algorithms that retarget videos to fit the form factor of a given device without stretching or letter-boxing. Our approaches use all of the screen’s pixels, while striving to deliver as much video-content of the original as possible. First, we introduce a new algorithm that uses discontinuous seam-carving in both space and time for resizing videos. Our algorithm relies on a novel appearance-based temporal coherence formulation that allows for frame-by-frame processing and results in temporally discontinuous seams, as opposed to geometrically smooth and continuous seams. Second, we present a technique, that builds on the above mentioned video stabilization approach. We effectively automate classical pan and scan techniques by smoothly guiding a virtual crop window via saliency constraints.

Finally, we introduce an efficient and scalable technique for spatio-temporal segmentation of long video sequences using a hierarchical graph-based algorithm. We begin by over-segmenting a volumetric video graph into space-time regions grouped by appearance. We then construct a “region graph” over the obtained  segmentation and iteratively repeat this process over multiple levels to create a tree of spatio-temporal segmentations. This hierarchical approach generates high quality segmentations, and allows subsequent applications to choose from varying levels of granularity. We demonstrate the use of spatio-temporal segmentation as users interact with the video, enabling efficient annotation of objects within the video.


  • Dr. Irfan Essa (Advisor, School of Interactive Computing, Georgia Tech)
  • Dr. Jim Rehg (School of Interactive Computing, Georgia Tech)
  • Dr. Frank Dellaert (School of Interactive Computing, Georgia Tech)
  • Dr. Michael Black (Perceiving Systems Department, Max Planck Institute for Intelligent Systems)
  • Dr. Sing Bing Kang (Adjunct Faculty, Georgia Tech; Microsoft Research, Microsoft Corp.)
  • Dr. Vivek Kwatra (Google Research, Google Inc.)
AddThis Social Bookmark Button

Kihwan Kim’s Thesis Defense (2011): “Spatio-temporal Data Interpolation for Dynamic Scene Analysis”

December 6th, 2011 Irfan Essa Posted in Computational Photography and Video, Kihwan Kim, Modeling and Animation, Multimedia, PhD, Security, Visual Surviellance, WWW No Comments »

Spatio-temporal Data Interpolation for Dynamic Scene Analysis

Kihwan Kim, PhD Candidate

School of Interactive Computing, College of Computing, Georgia Institute of Technology

Date: Tuesday, December 6, 2011

Time: 1:00 pm – 3:00 pm EST

Location: Technology Square Research Building (TSRB) Room 223


Analysis and visualization of dynamic scenes is often constrained by the amount of spatio-temporal information available from the environment. In most scenarios, we have to account for incomplete information and sparse motion data, requiring us to employ interpolation and approximation methods to fill for the missing information. Scattered data interpolation and approximation techniques have been widely used for solving the problem of completing surfaces and images with incomplete input data. We introduce approaches for such data interpolation and approximation from limited sensors, into the domain of analyzing and visualizing dynamic scenes. Data from dynamic scenes is subject to constraints due to the spatial layout of the scene and/or the configurations of video cameras in use. Such constraints include: (1) sparsely available cameras observing the scene, (2) limited field of view provided by the cameras in use, (3) incomplete motion at a specific moment, and (4) varying frame rates due to different exposures and resolutions.

In this thesis, we establish these forms of incompleteness in the scene, as spatio- temporal uncertainties, and propose solutions for resolving the uncertainties by applying scattered data approximation into a spatio-temporal domain.

The main contributions of this research are as follows: First, we provide an effi- cient framework to visualize large-scale dynamic scenes from distributed static videos. Second, we adopt Radial Basis Function (RBF) interpolation to the spatio-temporal domain to generate global motion tendency. The tendency, represented by a dense flow field, is used to optimally pan and tilt a video camera. Third, we propose a method to represent motion trajectories using stochastic vector fields. Gaussian Pro- cess Regression (GPR) is used to generate a dense vector field and the certainty of each vector in the field. The generated stochastic fields are used for recognizing motion patterns under varying frame-rate and incompleteness of the input videos. Fourth, we also show that the stochastic representation of vector field can also be used for modeling global tendency to detect the region of interests in dynamic scenes with camera motion. We evaluate and demonstrate our approaches in several applications for visualizing virtual cities, automating sports broadcasting, and recognizing traffic patterns in surveillance videos.


  • Prof. Irfan Essa (Advisor, School of Interactive Computing, Georgia Institute of Technology)
  • Prof. James M. Rehg (School of Interactive Computing, Georgia Institute of Technology)
  • Prof. Thad Starner (School of Interactive Computing, Georgia Institute of Technology)
  • Prof. Greg Turk (School of Interactive Computing, Georgia Institute of Technology)
  • Prof. Jessica K. Hodgins (Robotics Institute, Carnegie Mellon University, and Disney Research Pittsburgh)
AddThis Social Bookmark Button

N. Diakopoulos PhD Thesis (2009): Collaborative annotation, analysis, and presentation interfaces for digital video”

July 6th, 2009 Irfan Essa Posted in Computational Journalism, Computational Photography and Video, Multimedia, Nick Diakopoulos, PhD, Students No Comments »

Title: Collaborative annotation, analysis, and presentation interfaces for digital video

Author: Diakopoulos, Nicholas A.


Information quality corresponds to the degree of excellence in communicating knowledge or intelligence and encompasses aspects of validity, accuracy, reliability, bias, transparency, and comprehensiveness among others. Professional news, public relations, and user generated content alike all have their own subtly different information quality concerns. With so much recent growth in online video, it is also apparent that more and more consumers will be getting their information from online videos and that understanding the information quality of video becomes paramount for a consumer wanting to make decisions based on it.

This dissertation explores the design and evaluation of collaborative video annotation and presentation interfaces as motivated by the desire for better information quality in online video. We designed, built, and evaluated three systems: (1) Audio Puzzler, a puzzle game which as a by-product of play produces highly accurate time-stamped transcripts of video, (2) Videolyzer, a video annotation system designed to aid bloggers and journalists collect, aggregate, and share analyses of information quality of video, and (3) Videolyzer CE, a simplified video annotation presentation which syndicates the knowledge collected using Videolyzer to a wider range of users in order to modulate their perceptions of video information. We contribute to knowledge of different interface methods for collaborative video annotation and to mechanisms for enhancing accuracy of objective metadata such as transcripts as well as subjective notions of information quality of the video itself.

via Collaborative annotation, analysis, and presentation interfaces for digital video.

AddThis Social Bookmark Button

Thesis Raffay Hamid PhD (2008): “A Computational Framework For Unsupervised Analysis of Everyday Human Activities”

June 18th, 2008 Irfan Essa Posted in Aaron Bobick, Activity Recognition, Machine Learning, PhD, Raffay Hamid No Comments »

M. Raffay Hamid PhD (2008), “A Computational Framework For Unsupervised Analysis of Everyday Human Activities“, PhD Thesis, Georgia Institute of Techniology, College of Computing, Atlanta, GA. (Advisor: Aaron Bobick & Irfan Essa)


In order to make computers proactive and assistive, we must enable them to perceive, learn, and predict what is happening in their surroundings. This presents us with the challenge of formalizing computational models of everyday human activities. For a majority of environments, the structure of the in situ activities is generally not known a priori. This thesis therefore investigates knowledge representations and manipulation techniques that can facilitate learning of such everyday human activities in a minimally supervised manner. 

A key step towards this end is finding appropriate representations for human activities. We posit that if we chose to describe activities as finite sequences of an appropriate set of events, then the global structure of these activities can be uniquely encoded using their local event sub-sequences. With this perspective at hand, we particularly investigate representations that characterize activities in terms of their fixed and variable length event subsequences. We comparatively analyze these representations in terms of their representational scope, feature cardinality and noise sensitivity.

Exploiting such representations, we propose a computational framework to discover the various activity-classes taking place in an environment. We model these activity-classes as maximally similar activity-cliques in a completely connected graph of activities, and describe how to discover them efficiently. Moreover, we propose methods for finding concise characterizations of these discovered activity-classes, both from a holistic as well as a by-parts perspective. Using such characterizations, we present an incremental method to classify

a new activity instance to one of the discovered activity-classes, and to automatically detect if it is anomalous with respect to the general characteristics of its membership class. Our results show the efficacy of our framework in a variety of everyday environments

AddThis Social Bookmark Button

Thesis David Minnen PhD (2008): “Unsupervised Discovery of Activity Primitives from Multivariate Sensor Data”

June 18th, 2008 Irfan Essa Posted in Activity Recognition, David Minnen, PhD, Thad Starner No Comments »

Unsupervised Discovery of Activity Primitives from Multivariate Sensor Data



This research addresses the problem of temporal pattern discovery in real-valued, multivariate sensor data. Several algorithms were developed, and subsequent evaluation demonstrates that they can efficiently and accurately discover unknown recurring patterns in time series data taken from many different domains. Different data representations and motif models were investigated in order to design an algorithm with an improved balance between run-time and detection accuracy. The different data representations are used to quickly filter large data sets in order to detect potential patterns that form the basis of a more detailed analysis. The representations include global discretization, which can be efficiently analyzed using a suffix tree, local discretization with a corresponding random projection algorithm for locating similar pairs of subsequences, and a density-based detection method that operates on the original, real-valued data. In addition, a new variation of the multivariate motif discovery problem is proposed in which each pattern may span only a subset of the input features. An algorithm that can efficiently discover such “subdimensional” patterns was developed and evaluated. The discovery algorithms are evaluated by measuring the detection accuracy of discovered patterns relative to a set of expected patterns for each data set. The data sets used for evaluation are drawn from a variety of domains including speech, on-body inertial sensors, music, American Sign Language video, and GPS tracks.

AddThis Social Bookmark Button

Thesis: Mitch Parry PhD (2007), “Separation and Analysis of Multichannel Signals”

October 9th, 2007 Irfan Essa Posted in 0205507, Audio Analysis, Funding, Mitch Parry, PhD, Thesis No Comments »

Mitch Parry (2007), Separation and Analysis of Multichannel Signals PhD Thesis [PDF], Georgia Institute of Techniology, College of Computing, Atlanta, GA. (Advisor: Irfan Essa)


This thesis examines a large and growing class of digital signals that capture the combined effect of multiple underlying factors. In order to better understand these signals, we would like to separate and analyze the underlying factors independently. Although source separation applies to a wide variety of signals, this thesis focuses on separating individual instruments from a musical recording. In particular, we propose novel algorithms for separating instrument recordings given only their mixture. When the number of source signals does not exceed the number of mixture signals, we focus on a subclass of source separation algorithms based on joint diagonalization. Each approach leverages a different form of source structure. We introduce repetitive structure as an alternative that leverages unique repetition patterns in music and compare its performance against the other techniques.

When the number of source signals exceeds the number of mixtures (i.e., the underdetermined problem), we focus on spectrogram factorization techniques for source separation. We extend single-channel techniques to utilize the additional spatial information in multichannel recordings, and use phase information to improve the estimation of the underlying components.

via Separation and Analysis of Multichannel Signals.

AddThis Social Bookmark Button