
Two Ph.D. defenses on the same day. A first for me!

April 2nd, 2014 Irfan Essa Posted in Activity Recognition, Computational Photography and Video, Health Systems, PhD, S. Hussain Raza, Students, Yachna Sharma

Today, two of my Ph.D. students defended their dissertations, back to back. Congrats to both, as they are both done.

Thesis title: Surgical Skill Assessment Using Motion Texture Analysis
Student: Yachna Sharma, Ph.D. Candidate in ECE
Date/Time: 2nd April, 1:00 pm

Thesis title: Temporally Consistent Semantic Segmentation in Videos
Student: S. Hussain Raza, Ph.D. Candidate in ECE
Date/Time: 2nd April, 1:00 pm

Location: CSIP Library, Room 5186, CenterGy One Building



Matthias Grundmann’s PhD Thesis Defense (2013): “Computational Video: Post-processing Methods for Stabilization, Retargeting and Segmentation”

February 4th, 2013 Irfan Essa Posted in Computational Photography and Video, Matthias Grundmann, PhD

Title: Computational Video: Post-processing Methods for Stabilization, Retargeting and Segmentation

Matthias Grundmann
School of Interactive Computing
College of Computing
Georgia Institute of Technology

Date: February 04, 2013 (Monday)
Time: 3:00p – 6:00p EST
Location: Nano building, 116-118



In this thesis, we address a variety of challenges for analysis and enhancement of Computational Video. We present novel post-processing methods to bridge the difference between professional and casually shot videos mostly seen on online sites. Our research presents solutions to three well-defined problems: (1) Video stabilization and rolling shutter removal in casually-shot, uncalibrated videos; (2) Content-aware video retargeting; and (3) spatio-temporal video segmentation to enable efficient video annotation. We showcase several real-world applications building on these techniques.

We start by proposing a novel algorithm for video stabilization that generates stabilized videos by employing L1-optimal camera paths to remove undesirable motions. We compute camera paths that are optimally partitioned into constant, linear and parabolic segments mimicking the camera motions employed by professional cinematographers. To achieve this, we propose a linear programming framework to minimize the first, second, and third derivatives of the resulting camera path. Our method allows for video stabilization beyond conventional filtering, which only suppresses high-frequency jitter. An additional challenge in videos shot from mobile phones is rolling shutter distortion. Modern CMOS cameras capture the frame one scanline at a time, which results in non-rigid image distortions such as shear and wobble. We propose a solution based on a novel mixture model of homographies parametrized by scanline blocks to correct these rolling shutter distortions. Our method neither relies on a priori knowledge of the readout time nor requires prior camera calibration. Our novel video stabilization and calibration-free rolling shutter removal have been deployed on YouTube where they have successfully stabilized millions of videos. We also discuss several extensions to the stabilization algorithm and present technical details behind the widely used YouTube Video Stabilizer.
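
As a rough illustration of how an L1 objective over path derivatives turns into a linear program, consider a one-dimensional camera path. This is a sketch under assumptions, not necessarily the exact formulation used in the thesis: P_t is the smoothed path, C_t the original path, w_1, w_2, w_3 are hypothetical weights, d is a hypothetical bound standing in for the crop-window constraint, and e^1_t, e^2_t, e^3_t are slack variables that bound the absolute value of each finite-difference derivative.

```latex
\begin{aligned}
\min_{P,\,e}\quad & \sum_t \left( w_1 e^1_t + w_2 e^2_t + w_3 e^3_t \right) \\
\text{s.t.}\quad
& -e^1_t \le P_{t+1} - P_t \le e^1_t, \\
& -e^2_t \le P_{t+2} - 2P_{t+1} + P_t \le e^2_t, \\
& -e^3_t \le P_{t+3} - 3P_{t+2} + 3P_{t+1} - P_t \le e^3_t, \\
& e^1_t,\ e^2_t,\ e^3_t \ge 0, \qquad -d \le P_t - C_t \le d .
\end{aligned}
```

Because L1 penalties favor sparse solutions, the optimal path tends to have stretches where the first, second, or third difference is exactly zero, which correspond to constant, linear, and parabolic segments respectively.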

We address the challenge of changing the aspect ratio of videos by proposing algorithms that retarget videos to fit the form factor of a given device without stretching or letter-boxing. Our approaches use all of the screen’s pixels, while striving to deliver as much of the original video content as possible. First, we introduce a new algorithm that uses discontinuous seam-carving in both space and time for resizing videos. Our algorithm relies on a novel appearance-based temporal coherence formulation that allows for frame-by-frame processing and results in temporally discontinuous seams, as opposed to geometrically smooth and continuous seams. Second, we present a technique that builds on the above-mentioned video stabilization approach. We effectively automate classical pan and scan techniques by smoothly guiding a virtual crop window via saliency constraints.
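
For readers unfamiliar with seam carving, the following minimal sketch shows the classic single-image version: dynamic programming finds the connected top-to-bottom path of lowest total energy, and that path is removed to narrow the image by one column. It is only the standard building block, not the discontinuous spatio-temporal variant described above; the function names and the toy energy grid are hypothetical.

```python
def min_vertical_seam(energy):
    """Return (cost, seam), where seam[r] is the column chosen in row r.

    Classic static seam carving (an assumption for illustration; the thesis
    uses temporally discontinuous seams, which this sketch does not model).
    """
    rows, cols = len(energy), len(energy[0])
    cost = [list(energy[0])]   # cost[r][c]: cheapest seam ending at (r, c)
    back = []                  # back[r - 1][c]: column chosen in row r - 1
    for r in range(1, rows):
        row_cost, row_back = [], []
        for c in range(cols):
            candidates = [(cost[r - 1][k], k)
                          for k in (c - 1, c, c + 1) if 0 <= k < cols]
            best, k = min(candidates)
            row_cost.append(energy[r][c] + best)
            row_back.append(k)
        cost.append(row_cost)
        back.append(row_back)
    end = min(range(cols), key=lambda c: cost[-1][c])
    seam = [end]
    for r in range(rows - 2, -1, -1):
        seam.append(back[r][seam[-1]])
    seam.reverse()
    return cost[-1][end], seam


def remove_seam(image, seam):
    """Drop one pixel per row, narrowing the image by one column."""
    return [row[:c] + row[c + 1:] for row, c in zip(image, seam)]


# Toy energy grid (hypothetical values).
energy = [[3, 1, 4],
          [1, 5, 9],
          [2, 6, 5]]
seam_cost, seam = min_vertical_seam(energy)
narrowed = remove_seam(energy, seam)
```

Applying the same search frame by frame, with an extra term that ties each frame's seam to the previous frame's appearance, is one way to think about the temporal coherence formulation mentioned above.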

Finally, we introduce an efficient and scalable technique for spatio-temporal segmentation of long video sequences using a hierarchical graph-based algorithm. We begin by over-segmenting a volumetric video graph into space-time regions grouped by appearance. We then construct a “region graph” over the obtained segmentation and iteratively repeat this process over multiple levels to create a tree of spatio-temporal segmentations. This hierarchical approach generates high quality segmentations, and allows subsequent applications to choose from varying levels of granularity. We demonstrate the use of spatio-temporal segmentation as users interact with the video, enabling efficient annotation of objects within the video.
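
The over-segment, build a region graph, repeat loop can be sketched on a toy one-dimensional signal. The sketch below assumes a Felzenszwalb-Huttenlocher style merge criterion and uses a single scalar per node as the appearance descriptor; both are simplifications chosen for illustration, and the names `segment`, `region_graph`, and the scale parameter `k` are hypothetical.

```python
def segment(values, edges, k):
    """Greedy graph-based merging; returns one region label per node.

    Assumption: two regions merge when the edge between them is no heavier
    than either region's internal variation plus k / size.
    """
    parent = list(range(len(values)))
    size = [1] * len(values)
    internal = [0.0] * len(values)   # heaviest edge inside each region

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    weighted = sorted((abs(values[i] - values[j]), i, j) for i, j in edges)
    for w, i, j in weighted:
        a, b = find(i), find(j)
        if a == b:
            continue
        if w <= min(internal[a] + k / size[a], internal[b] + k / size[b]):
            parent[b] = a
            size[a] += size[b]
            internal[a] = w          # edges are sorted, so w is the max so far
    roots = [find(i) for i in range(len(values))]
    relabel = {r: n for n, r in enumerate(dict.fromkeys(roots))}
    return [relabel[r] for r in roots]


def region_graph(values, edges, labels):
    """Collapse nodes into regions: mean value per region, edges between neighbors."""
    n = max(labels) + 1
    sums, counts = [0.0] * n, [0] * n
    for v, l in zip(values, labels):
        sums[l] += v
        counts[l] += 1
    means = [s / c for s, c in zip(sums, counts)]
    region_edges = sorted({(min(labels[i], labels[j]), max(labels[i], labels[j]))
                           for i, j in edges if labels[i] != labels[j]})
    return means, region_edges


# Toy "video": eight nodes in a chain with hypothetical intensities.
values = [10, 11, 12, 20, 21, 22, 90, 91]
edges = [(i, i + 1) for i in range(len(values) - 1)]
level1 = segment(values, edges, k=5)                 # fine over-segmentation
means, region_edges = region_graph(values, edges, level1)
level2 = segment(means, region_edges, k=15)          # coarser level over regions
```

Here `level1` groups the nodes into three regions and `level2` merges the two similar ones, giving two levels of the segmentation tree. A real implementation would use richer region descriptors and a 3D space-time neighborhood.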


  • Dr. Irfan Essa (Advisor, School of Interactive Computing, Georgia Tech)
  • Dr. Jim Rehg (School of Interactive Computing, Georgia Tech)
  • Dr. Frank Dellaert (School of Interactive Computing, Georgia Tech)
  • Dr. Michael Black (Perceiving Systems Department, Max Planck Institute for Intelligent Systems)
  • Dr. Sing Bing Kang (Adjunct Faculty, Georgia Tech; Microsoft Research, Microsoft Corp.)
  • Dr. Vivek Kwatra (Google Research, Google Inc.)

Kihwan Kim’s Thesis Defense (2011): “Spatio-temporal Data Interpolation for Dynamic Scene Analysis”

December 6th, 2011 Irfan Essa Posted in Computational Photography and Video, Kihwan Kim, Modeling and Animation, Multimedia, PhD, Security, Visual Surveillance, WWW

Spatio-temporal Data Interpolation for Dynamic Scene Analysis

Kihwan Kim, PhD Candidate

School of Interactive Computing, College of Computing, Georgia Institute of Technology

Date: Tuesday, December 6, 2011

Time: 1:00 pm – 3:00 pm EST

Location: Technology Square Research Building (TSRB) Room 223


Analysis and visualization of dynamic scenes is often constrained by the amount of spatio-temporal information available from the environment. In most scenarios, we have to account for incomplete information and sparse motion data, requiring us to employ interpolation and approximation methods to fill in the missing information. Scattered data interpolation and approximation techniques have been widely used for solving the problem of completing surfaces and images with incomplete input data. We bring such approaches for data interpolation and approximation from limited sensors into the domain of analyzing and visualizing dynamic scenes. Data from dynamic scenes is subject to constraints due to the spatial layout of the scene and/or the configurations of video cameras in use. Such constraints include: (1) sparsely available cameras observing the scene, (2) limited field of view provided by the cameras in use, (3) incomplete motion at a specific moment, and (4) varying frame rates due to different exposures and resolutions.

In this thesis, we establish these forms of incompleteness in the scene as spatio-temporal uncertainties, and propose solutions for resolving the uncertainties by applying scattered data approximation to the spatio-temporal domain.

The main contributions of this research are as follows: First, we provide an efficient framework to visualize large-scale dynamic scenes from distributed static videos. Second, we adapt Radial Basis Function (RBF) interpolation to the spatio-temporal domain to generate global motion tendency. The tendency, represented by a dense flow field, is used to optimally pan and tilt a video camera. Third, we propose a method to represent motion trajectories using stochastic vector fields. Gaussian Process Regression (GPR) is used to generate a dense vector field and the certainty of each vector in the field. The generated stochastic fields are used for recognizing motion patterns under varying frame rates and incompleteness of the input videos. Fourth, we show that the stochastic representation of a vector field can also be used for modeling global tendency to detect the regions of interest in dynamic scenes with camera motion. We evaluate and demonstrate our approaches in several applications for visualizing virtual cities, automating sports broadcasting, and recognizing traffic patterns in surveillance videos.
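
To make the RBF step concrete, here is a minimal sketch of scattered-data interpolation in a spatio-temporal domain: sparse samples at (x, y, t) positions carry one velocity component, and the fitted function can be evaluated anywhere to produce a dense field. The Gaussian kernel, the shape parameter `eps`, the sample data, and all function names are assumptions made for illustration, not details taken from the thesis.

```python
import math


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def gaussian(r, eps):
    """Gaussian radial basis function (one common choice, assumed here)."""
    return math.exp(-(eps * r) ** 2)


def fit_rbf(points, values, eps=1.0):
    """Return a function that interpolates the scattered samples exactly."""
    kernel = [[gaussian(math.dist(p, q), eps) for q in points] for p in points]
    weights = solve(kernel, values)

    def f(x):
        return sum(w * gaussian(math.dist(x, p), eps)
                   for w, p in zip(weights, points))

    return f


# Hypothetical sparse samples: (x, y, t) positions and horizontal velocity u.
points = [(0, 0, 0), (1, 0, 0), (0, 1, 1), (1, 1, 1)]
u = [1.0, 0.5, 0.0, -0.5]
flow_u = fit_rbf(points, u)
dense_value = flow_u((0.5, 0.5, 0.5))   # query between the samples
```

Fitting a second interpolant for the vertical component gives a full flow vector at every query position. Gaussian Process Regression, used for the stochastic fields above, additionally returns a variance at each query point.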


  • Prof. Irfan Essa (Advisor, School of Interactive Computing, Georgia Institute of Technology)
  • Prof. James M. Rehg (School of Interactive Computing, Georgia Institute of Technology)
  • Prof. Thad Starner (School of Interactive Computing, Georgia Institute of Technology)
  • Prof. Greg Turk (School of Interactive Computing, Georgia Institute of Technology)
  • Prof. Jessica K. Hodgins (Robotics Institute, Carnegie Mellon University, and Disney Research Pittsburgh)

N. Diakopoulos PhD Thesis (2009): “Collaborative annotation, analysis, and presentation interfaces for digital video”

July 6th, 2009 Irfan Essa Posted in Computational Journalism, Computational Photography and Video, Multimedia, Nick Diakopoulos, PhD, Students

Title: Collaborative annotation, analysis, and presentation interfaces for digital video

Author: Diakopoulos, Nicholas A.


Information quality corresponds to the degree of excellence in communicating knowledge or intelligence and encompasses aspects of validity, accuracy, reliability, bias, transparency, and comprehensiveness among others. Professional news, public relations, and user generated content alike all have their own subtly different information quality concerns. With so much recent growth in online video, it is also apparent that more and more consumers will be getting their information from online videos and that understanding the information quality of video becomes paramount for a consumer wanting to make decisions based on it.

This dissertation explores the design and evaluation of collaborative video annotation and presentation interfaces as motivated by the desire for better information quality in online video. We designed, built, and evaluated three systems: (1) Audio Puzzler, a puzzle game that, as a by-product of play, produces highly accurate time-stamped transcripts of video, (2) Videolyzer, a video annotation system designed to help bloggers and journalists collect, aggregate, and share analyses of information quality of video, and (3) Videolyzer CE, a simplified video annotation presentation which syndicates the knowledge collected using Videolyzer to a wider range of users in order to modulate their perceptions of video information. We contribute to knowledge of different interface methods for collaborative video annotation and to mechanisms for enhancing accuracy of objective metadata such as transcripts as well as subjective notions of information quality of the video itself.

via Collaborative annotation, analysis, and presentation interfaces for digital video.


Thesis Raffay Hamid PhD (2008): “A Computational Framework For Unsupervised Analysis of Everyday Human Activities”

June 18th, 2008 Irfan Essa Posted in Aaron Bobick, Activity Recognition, Numerical Machine Learning, PhD, Raffay Hamid

M. Raffay Hamid PhD (2008), “A Computational Framework For Unsupervised Analysis of Everyday Human Activities”, PhD Thesis, Georgia Institute of Technology, College of Computing, Atlanta, GA. (Advisors: Aaron Bobick & Irfan Essa)


In order to make computers proactive and assistive, we must enable them to perceive, learn, and predict what is happening in their surroundings. This presents us with the challenge of formalizing computational models of everyday human activities. For a majority of environments, the structure of the in situ activities is generally not known a priori. This thesis therefore investigates knowledge representations and manipulation techniques that can facilitate learning of such everyday human activities in a minimally supervised manner. 

A key step towards this end is finding appropriate representations for human activities. We posit that if we choose to describe activities as finite sequences of an appropriate set of events, then the global structure of these activities can be uniquely encoded using their local event sub-sequences. With this perspective at hand, we particularly investigate representations that characterize activities in terms of their fixed and variable length event subsequences. We comparatively analyze these representations in terms of their representational scope, feature cardinality and noise sensitivity.
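
A fixed-length version of this idea can be sketched as a histogram of event n-grams: each activity is a sequence of discrete events, and its local structure is summarized by counting every contiguous subsequence of length n. The event names and the histogram-intersection similarity below are hypothetical stand-ins chosen for illustration, not the measures defined in the thesis.

```python
from collections import Counter


def ngram_histogram(events, n):
    """Count every contiguous length-n event subsequence in an activity."""
    return Counter(tuple(events[i:i + n]) for i in range(len(events) - n + 1))


def similarity(a, b, n=3):
    """Histogram intersection over n-grams, scaled to [0, 1].

    Assumption: this simple measure is for illustration only.
    """
    ha, hb = ngram_histogram(a, n), ngram_histogram(b, n)
    shared = sum((ha & hb).values())
    total = max(sum(ha.values()), sum(hb.values()))
    return shared / total if total else 0.0


# Two hypothetical kitchen activities that differ in one event.
cook_a = ["enter", "fridge", "stove", "sink", "exit"]
cook_b = ["enter", "fridge", "stove", "table", "exit"]
score = similarity(cook_a, cook_b)
```

The two sequences share one of their three trigrams, so a single changed event already lowers the score noticeably, which hints at the noise sensitivity trade-off mentioned above. Smaller n is more tolerant but captures less structure.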

Exploiting such representations, we propose a computational framework to discover the various activity-classes taking place in an environment. We model these activity-classes as maximally similar activity-cliques in a completely connected graph of activities, and describe how to discover them efficiently. Moreover, we propose methods for finding concise characterizations of these discovered activity-classes, both from a holistic as well as a by-parts perspective. Using such characterizations, we present an incremental method to classify a new activity instance to one of the discovered activity-classes, and to automatically detect if it is anomalous with respect to the general characteristics of its membership class. Our results show the efficacy of our framework in a variety of everyday environments.


Thesis David Minnen PhD (2008): “Unsupervised Discovery of Activity Primitives from Multivariate Sensor Data”

June 18th, 2008 Irfan Essa Posted in Activity Recognition, David Minnen, PhD, Thad Starner

Unsupervised Discovery of Activity Primitives from Multivariate Sensor Data



This research addresses the problem of temporal pattern discovery in real-valued, multivariate sensor data. Several algorithms were developed, and subsequent evaluation demonstrates that they can efficiently and accurately discover unknown recurring patterns in time series data taken from many different domains. Different data representations and motif models were investigated in order to design an algorithm with an improved balance between run-time and detection accuracy. The different data representations are used to quickly filter large data sets in order to detect potential patterns that form the basis of a more detailed analysis. The representations include global discretization, which can be efficiently analyzed using a suffix tree, local discretization with a corresponding random projection algorithm for locating similar pairs of subsequences, and a density-based detection method that operates on the original, real-valued data. In addition, a new variation of the multivariate motif discovery problem is proposed in which each pattern may span only a subset of the input features. An algorithm that can efficiently discover such “subdimensional” patterns was developed and evaluated. The discovery algorithms are evaluated by measuring the detection accuracy of discovered patterns relative to a set of expected patterns for each data set. The data sets used for evaluation are drawn from a variety of domains including speech, on-body inertial sensors, music, American Sign Language video, and GPS tracks.
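
As a point of reference for what "discovering a recurring pattern" means, the sketch below finds the closest pair of non-overlapping subsequences in a univariate series by exhaustive search. This quadratic baseline is the kind of computation that the discretization, random projection, and density-based methods above are designed to avoid or approximate; it is not one of the thesis algorithms, it skips the normalization a practical system would apply, and the function name and data are hypothetical.

```python
import math


def best_motif_pair(series, length):
    """Return (distance, i, j) for the closest non-overlapping window pair.

    Brute-force baseline using raw Euclidean distance (an assumption for
    illustration; real systems usually normalize each window first).
    """
    windows = [series[i:i + length] for i in range(len(series) - length + 1)]
    best = (math.inf, -1, -1)
    for i in range(len(windows)):
        for j in range(i + length, len(windows)):   # skip overlapping windows
            d = math.dist(windows[i], windows[j])
            if d < best[0]:
                best = (d, i, j)
    return best


# Hypothetical series with a pattern planted at index 4 and, slightly
# perturbed, again at index 11.
series = [5, 1, 9, 2, 0, 2, 4, 2, 7, 3, 8, 0, 2, 4, 1, 6]
distance, first, second = best_motif_pair(series, length=4)
```

The search recovers the two planted occurrences. Its cost grows with the square of the series length, which is why fast filtering of candidate patterns matters on large data sets.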


Thesis: Mitch Parry PhD (2007), “Separation and Analysis of Multichannel Signals”

October 9th, 2007 Irfan Essa Posted in 0205507, Audio Analysis, Funding, Mitch Parry, PhD, Thesis

Mitch Parry (2007), “Separation and Analysis of Multichannel Signals”, PhD Thesis [PDF], Georgia Institute of Technology, College of Computing, Atlanta, GA. (Advisor: Irfan Essa)


This thesis examines a large and growing class of digital signals that capture the combined effect of multiple underlying factors. In order to better understand these signals, we would like to separate and analyze the underlying factors independently. Although source separation applies to a wide variety of signals, this thesis focuses on separating individual instruments from a musical recording. In particular, we propose novel algorithms for separating instrument recordings given only their mixture. When the number of source signals does not exceed the number of mixture signals, we focus on a subclass of source separation algorithms based on joint diagonalization. Each approach leverages a different form of source structure. We introduce repetitive structure as an alternative that leverages unique repetition patterns in music and compare its performance against the other techniques.

When the number of source signals exceeds the number of mixtures (i.e., the underdetermined problem), we focus on spectrogram factorization techniques for source separation. We extend single-channel techniques to utilize the additional spatial information in multichannel recordings, and use phase information to improve the estimation of the underlying components.
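
One widely used single-channel spectrogram factorization is non-negative matrix factorization (NMF), which approximates a magnitude spectrogram V as W H, where the columns of W act as per-source spectra and the rows of H as their activations over time. The abstract does not name the specific factorization used, so the sketch below, which uses the standard multiplicative updates for squared error, should be read as an assumed baseline; it does not include the spatial or phase extensions described above, and the toy data and function names are hypothetical.

```python
import random


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def frobenius_error(v, w, h):
    """Squared reconstruction error between V and W H."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, rank, iterations=500, seed=0, tiny=1e-12):
    """Factor V into non-negative W and H with multiplicative updates."""
    rng = random.Random(seed)
    rows, cols = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(rows)]
    h = [[rng.random() + 0.1 for _ in range(cols)] for _ in range(rank)]
    for _ in range(iterations):
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + tiny) for j in range(cols)]
             for k in range(rank)]
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(matmul(w, h), ht)
        w = [[w[i][k] * num[i][k] / (den[i][k] + tiny) for k in range(rank)]
             for i in range(rows)]
    return w, h


# Toy spectrogram: two hypothetical sources with disjoint frequency content.
spectra = [[3, 0], [1, 0], [0, 2], [0, 4]]                 # 4 bins x 2 sources
activations = [[1, 1, 0, 0, 1, 1], [0, 1, 1, 1, 0, 0]]     # 2 sources x 6 frames
v = matmul(spectra, activations)
w, h = nmf(v, rank=2)
error = frobenius_error(v, w, h)
```

Each learned column of W, multiplied by its row of H, gives an estimate of one source's spectrogram. With several channels, the same components can additionally be tied to a spatial position, which is the extra information the multichannel setting provides.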

via Separation and Analysis of Multichannel Signals.


Thesis: Vivek Kwatra’s PhD Thesis (2005) “Example-based Rendering of Textural Phenomena”

July 19th, 2005 Irfan Essa Posted in Computational Photography and Video, PhD, Thesis, Vivek Kwatra

Vivek Kwatra (2005), “Example-based Rendering of Textural Phenomena”, PhD Thesis, Georgia Institute of Technology, College of Computing (Advisors: Aaron Bobick, Irfan Essa) [URI], 19-Jul-2005


This thesis explores synthesis by example as a paradigm for rendering real-world phenomena. In particular, phenomena that can be visually described as texture are considered. We exploit, for synthesis, the self-repeating nature of the visual elements constituting these texture exemplars. Techniques for unconstrained as well as constrained/controllable synthesis of both image and video textures are presented. For unconstrained synthesis, we present two robust techniques that can perform spatio-temporal extension, editing, and merging of image as well as video textures. In one of these techniques, large patches of input texture are automatically aligned and seamlessly stitched with each other to generate realistic looking images and videos. The second technique is based on iterative optimization of a global energy function that measures the quality of the synthesized texture with respect to the given input exemplar. We also present a technique for controllable texture synthesis. In particular, it allows for generation of motion-controlled texture animations that follow a specified flow field. Animations synthesized in this fashion maintain the structural properties like local shape, size, and orientation of the input texture even as they move according to the specified flow. We cast this problem into an optimization framework that tries to simultaneously satisfy the two (potentially competing) objectives of similarity to the input texture and consistency with the flow field. This optimization is a simple extension of the approach used for unconstrained texture synthesis. A general framework for example-based synthesis and rendering is also presented. This framework provides a design space for constructing example-based rendering algorithms. The goal of such algorithms would be to use texture exemplars to render animations for which certain behavioral characteristics need to be controlled. Our motion-controlled texture synthesis technique is an instantiation of this framework where the characteristic being controlled is motion represented as a flow field.


Thesis: Drew Steedly PhD (2004): “Rigid Partitioning Techniques for Efficiently Generating 3D Reconstructions from Images”

December 9th, 2004 Irfan Essa Posted in Computational Photography and Video, Drew Steedly, PhD, Thesis

Drew Steedly (2004), “Rigid Partitioning Techniques for Efficiently Generating 3D Reconstructions from Images”, PhD Thesis, Georgia Institute of Technology, College of Computing. (Advisor: Irfan Essa) [PDF] [URI]


This thesis explores efficient techniques for generating 3D reconstructions from imagery. Non-linear optimization is one of the core techniques used when computing a reconstruction and is a computational bottleneck for large sets of images. Since non-linear optimization requires a good initialization to avoid getting stuck in local minima, robust systems for generating reconstructions from images build up the reconstruction incrementally. A hierarchical approach is to split up the images into small subsets, reconstruct each subset independently and then hierarchically merge the subsets. Rigidly locking together portions of the reconstructions reduces the number of parameters needed to represent them when merging, thereby lowering the computational cost of the optimization. We present two techniques that involve optimizing with parts of the reconstruction rigidly locked together. In the first, we start by rigidly grouping the cameras and scene features from each of the reconstructions being merged into separate groups. Cameras and scene features are then incrementally unlocked and optimized until the reconstruction is close to the minimum energy. This technique is most effective when the influence of the new measurements is restricted to a small set of parameters. Measurements that stitch together weakly coupled portions of the reconstruction, though, tend to cause deformations in the low error modes of the reconstruction and cannot be efficiently incorporated with the previous technique. To address this, we present a spectral technique for clustering the tightly coupled portions of a reconstruction into rigid groups. Reconstructions partitioned in this manner can closely mimic the poorly conditioned, low error modes, and therefore efficiently incorporate measurements that stitch together weakly coupled portions of the reconstruction. We explain how this technique can be used to scalably and efficiently generate reconstructions from large sets of images.
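
A back-of-the-envelope count shows why rigid locking lowers the cost of a merge. The sketch assumes 6 parameters per camera pose, 3 per scene point, and 6 per rigid group; these degrees of freedom, the group sizes, and the function name are hypothetical and chosen only to illustrate the scale of the reduction (a real system may use a different parametrization, for example one that includes scale).

```python
def parameter_count(groups, locked, pose_dof=6, point_dof=3, group_dof=6):
    """Count optimization variables for a list of (cameras, points) groups.

    Assumption: a locked group moves as one rigid body, so it contributes
    only group_dof parameters regardless of how much it contains.
    """
    total = 0
    for cameras, points in groups:
        if locked:
            total += group_dof
        else:
            total += cameras * pose_dof + points * point_dof
    return total


# Merging two hypothetical sub-reconstructions of 50 cameras and 2000 points.
groups = [(50, 2000), (50, 2000)]
unlocked = parameter_count(groups, locked=False)
rigid = parameter_count(groups, locked=True)
```

Under these assumptions the merge starts with 12 variables instead of 12600, and variables are added back only as cameras and scene features are incrementally unlocked.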


Thesis: Gabriel Brostow’s PhD (2004): “Novel Skeletal Representation for Articulated Creatures”

April 9th, 2004 Irfan Essa Posted in Activity Recognition, Gabriel Brostow, Modeling and Animation, Research, Thesis

Gabriel Brostow (2004), “Novel Skeletal Representation for Articulated Creatures” PhD Thesis, Georgia Institute of Technology, College of Computing. (Advisor: Irfan Essa) [PDF] [URI]


This research examines an approach for capturing 3D surface and structural data of moving articulated creatures. Given the task of non-invasively and automatically capturing such data, a methodology and the associated experiments are presented that apply to multi-view videos of the subject's motion. Our thesis states: A functional structure and the time-varying surface of an articulated creature subject are contained in a sequence of its 3D data. A functional structure is one example of the possible arrangements of internal mechanisms (kinematic joints, springs, etc.) that is capable of performing the motions observed in the input data. Volumetric structures are frequently used as shape descriptors for 3D data. The capture of such data is being facilitated by developments in multi-view video and range scanning, extending to subjects that are alive and moving. In this research, we examine vision-based modeling and the related representation of moving articulated creatures using Spines. We define a Spine as a branching axial structure representing the shape and topology of a 3D object's limbs, and capturing the limbs' correspondence and motion over time. The Spine concept builds on skeletal representations often used to describe the internal structure of an articulated object and the significant protrusions. Our representation of a Spine provides for enhancements over a 3D skeleton. These enhancements form temporally consistent limb hierarchies that contain correspondence information about real motion data. We present a practical implementation that approximates a Spine's joint probability function to reconstruct Spines for synthetic and real subjects that move. In general, our approach combines the objectives of generalized cylinders, 3D scanning, and markerless motion capture to generate baseline models from real puppets, animals, and human subjects.
