
Paper in ECCV Workshop 2012: “Weakly Supervised Learning of Object Segmentations from Web-Scale Videos”

October 7th, 2012 Irfan Essa Posted in Activity Recognition, Awards, Google, Matthias Grundmann, Multimedia, PAMI/ICCV/CVPR/ECCV, Papers, Vivek Kwatra, WWW

Weakly Supervised Learning of Object Segmentations from Web-Scale Videos

  • G. Hartmann, M. Grundmann, J. Hoffman, D. Tsai, V. Kwatra, O. Madani, S. Vijayanarasimhan, I. Essa, J. Rehg, and R. Sukthankar (2012), “Weakly Supervised Learning of Object Segmentations from Web-Scale Videos,” in Proceedings of ECCV 2012 Workshop on Web-scale Vision and Social Media, 2012. [PDF] [DOI] [BIBTEX]
    @inproceedings{2012-Hartmann-WSLOSFWV,
      Author = {Glenn Hartmann and Matthias Grundmann and Judy Hoffman and David Tsai and Vivek Kwatra and Omid Madani and Sudheendra Vijayanarasimhan and Irfan Essa and James Rehg and Rahul Sukthankar},
      Booktitle = {Proceedings of ECCV 2012 Workshop on Web-scale Vision and Social Media},
      Date-Added = {2012-10-23 15:03:18 +0000},
      Date-Modified = {2013-10-22 18:57:10 +0000},
      Doi = {10.1007/978-3-642-33863-2_20},
      Pdf = {http://www.cs.cmu.edu/~rahuls/pub/eccv2012wk-cp-rahuls.pdf},
      Title = {Weakly Supervised Learning of Object Segmentations from Web-Scale Videos},
      Year = {2012},
      Bdsk-Url-1 = {http://dx.doi.org/10.1007/978-3-642-33863-2_20}}

Abstract

We propose to learn pixel-level segmentations of objects from weakly labeled (tagged) internet videos. Specifically, given a large collection of raw YouTube content, along with potentially noisy tags, our goal is to automatically generate spatiotemporal masks for each object, such as “dog”, without employing any pre-trained object detectors. We formulate this problem as learning weakly supervised classifiers for a set of independent spatio-temporal segments. The object seeds obtained using segment-level classifiers are further refined using graph cuts to generate high-precision object masks. Our results, obtained by training on a dataset of 20,000 YouTube videos weakly tagged into 15 classes, demonstrate automatic extraction of pixel-level object masks. Evaluated against a ground-truthed subset of 50,000 frames with pixel-level annotations, we confirm that our proposed methods can learn good object masks just by watching YouTube.
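The pipeline described in the abstract (score each spatio-temporal segment with a weakly supervised classifier, keep confident segments as object seeds, then refine with graph cuts) can be pictured with a toy example. The snippet below is only a rough sketch on assumed data (random features and a binary “dog” tag), not the authors’ implementation, and it stops at the seed-selection step:

    # Weak supervision sketch: every spatio-temporal segment inherits its
    # video's (noisy) tag as a training label; a per-class classifier then
    # scores segments, and confident ones become object seeds that a
    # graph-cut refinement step would later clean up.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)

    # Hypothetical data: 1,000 segments with 64-D appearance/motion features,
    # each labeled with the tag of the video it came from (1 = "dog", 0 = other).
    segment_features = rng.normal(size=(1000, 64))
    video_tag_of_segment = rng.integers(0, 2, size=1000)

    clf = LogisticRegression(max_iter=1000)
    clf.fit(segment_features, video_tag_of_segment)

    # Segments scored confidently as the tagged class serve as object seeds.
    scores = clf.predict_proba(segment_features)[:, 1]
    seed_segments = np.flatnonzero(scores > 0.8)
    print(len(seed_segments), "candidate seed segments")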

Presented at: ECCV 2012 Workshop on Web-scale Vision and Social Media, October 7-12, 2012, in Florence, Italy.

Awarded the BEST PAPER AWARD!

 


Kihwan Kim’s Thesis Defense (2011): “Spatio-temporal Data Interpolation for Dynamic Scene Analysis”

December 6th, 2011 Irfan Essa Posted in Computational Photography and Video, Kihwan Kim, Modeling and Animation, Multimedia, PhD, Security, Visual Surveillance, WWW

Spatio-temporal Data Interpolation for Dynamic Scene Analysis

Kihwan Kim, PhD Candidate

School of Interactive Computing, College of Computing, Georgia Institute of Technology

Date: Tuesday, December 6, 2011

Time: 1:00 pm – 3:00 pm EST

Location: Technology Square Research Building (TSRB) Room 223

Abstract

Analysis and visualization of dynamic scenes is often constrained by the amount of spatio-temporal information available from the environment. In most scenarios, we have to account for incomplete information and sparse motion data, requiring us to employ interpolation and approximation methods to fill in the missing information. Scattered data interpolation and approximation techniques have been widely used for solving the problem of completing surfaces and images with incomplete input data. We bring such interpolation and approximation approaches, applied to data from limited sensors, into the domain of analyzing and visualizing dynamic scenes. Data from dynamic scenes is subject to constraints due to the spatial layout of the scene and/or the configurations of video cameras in use. Such constraints include: (1) sparsely available cameras observing the scene, (2) limited field of view provided by the cameras in use, (3) incomplete motion at a specific moment, and (4) varying frame rates due to different exposures and resolutions.

In this thesis, we characterize these forms of incompleteness in the scene as spatio-temporal uncertainties, and propose solutions for resolving them by applying scattered data approximation in the spatio-temporal domain.

The main contributions of this research are as follows: First, we provide an efficient framework to visualize large-scale dynamic scenes from distributed static videos. Second, we adopt Radial Basis Function (RBF) interpolation in the spatio-temporal domain to generate global motion tendency. The tendency, represented by a dense flow field, is used to optimally pan and tilt a video camera. Third, we propose a method to represent motion trajectories using stochastic vector fields. Gaussian Process Regression (GPR) is used to generate a dense vector field and the certainty of each vector in the field. The generated stochastic fields are used for recognizing motion patterns under varying frame rates and incompleteness of the input videos. Fourth, we show that the stochastic representation of the vector field can also be used to model global tendency and detect regions of interest in dynamic scenes with camera motion. We evaluate and demonstrate our approaches in several applications for visualizing virtual cities, automating sports broadcasting, and recognizing traffic patterns in surveillance videos.
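As a purely illustrative example of the stochastic vector field idea, the sketch below fits Gaussian Process Regression to hypothetical sparse motion samples and queries a dense grid for both the mean field and a per-vector uncertainty. The positions, kernel choice, and length scales are assumptions for the sketch, not values from the thesis:

    # Toy illustration: sparse 2-D motion samples are interpolated into a dense
    # field with Gaussian Process Regression, which also returns a per-location
    # uncertainty that downstream recognition could exploit.
    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF, WhiteKernel

    rng = np.random.default_rng(1)

    # Hypothetical sparse observations: positions on the ground plane and the
    # motion vector (vx, vy) measured at each position.
    positions = rng.uniform(0, 100, size=(40, 2))
    vectors = np.stack([np.sin(positions[:, 0] / 20.0),
                        np.cos(positions[:, 1] / 20.0)], axis=1)

    kernel = RBF(length_scale=15.0) + WhiteKernel(noise_level=0.01)
    gpr_vx = GaussianProcessRegressor(kernel=kernel).fit(positions, vectors[:, 0])
    gpr_vy = GaussianProcessRegressor(kernel=kernel).fit(positions, vectors[:, 1])

    # Dense grid query: mean field plus a certainty (standard deviation) per vector.
    xs, ys = np.meshgrid(np.linspace(0, 100, 50), np.linspace(0, 100, 50))
    grid = np.column_stack([xs.ravel(), ys.ravel()])
    vx_mean, vx_std = gpr_vx.predict(grid, return_std=True)
    vy_mean, vy_std = gpr_vy.predict(grid, return_std=True)
    print("dense field:", vx_mean.shape, "uncertainty range:", vx_std.min(), vx_std.max())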

Committee:

  • Prof. Irfan Essa (Advisor, School of Interactive Computing, Georgia Institute of Technology)
  • Prof. James M. Rehg (School of Interactive Computing, Georgia Institute of Technology)
  • Prof. Thad Starner (School of Interactive Computing, Georgia Institute of Technology)
  • Prof. Greg Turk (School of Interactive Computing, Georgia Institute of Technology)
  • Prof. Jessica K. Hodgins (Robotics Institute, Carnegie Mellon University, and Disney Research Pittsburgh)

Presentation (2011) at IBPRIA 2011: “Spatio-Temporal Video Analysis and Visual Activity Recognition”

June 8th, 2011 Irfan Essa Posted in Activity Recognition, Computational Photography and Video, Kihwan Kim, Matthias Grundmann, Multimedia, PAMI/ICCV/CVPR/ECCV, Presentations

“Spatio-Temporal Video Analysis and Visual Activity Recognition” at the Iberian Conference on Pattern Recognition and Image Analysis (IbPRIA) 2011 in Las Palmas de Gran Canaria, Spain, June 8-10, 2011.

Abstract

My research group is focused on a variety of approaches for (a) low-level video analysis and synthesis and (b) recognizing activities in videos. In this talk, I will concentrate on two of our recent efforts: one aimed at robust spatio-temporal segmentation of video, and another at using motion and flow to recognize and predict actions from video.

In the first part of the talk, I will present an efficient and scalable technique for spatio-temporal segmentation of long video sequences using a hierarchical graph-based algorithm. In this work, we begin by over-segmenting a volumetric video graph into space-time regions grouped by appearance. We then construct a “region graph” over the obtained segmentation and iteratively repeat this process over multiple levels to create a tree of spatio-temporal segmentations. This hierarchical approach generates high-quality segmentations, which are temporally coherent with stable region boundaries, and allows subsequent applications to choose from varying levels of granularity. We further improve segmentation quality by using dense optical flow to guide temporal connections in the initial graph. I will demonstrate a variety of examples of how this robust segmentation works, and will show additional examples of video retargeting that use spatio-temporal saliency derived from this segmentation approach. (Matthias Grundmann, Vivek Kwatra, Mei Han, Irfan Essa, CVPR 2010, in collaboration with Google Research).
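To make the hierarchical idea concrete, here is a deliberately simplified sketch of region-graph merging: it starts from assumed over-segmented space-time regions (random mean colors and random adjacency), repeatedly merges the most similar adjacent regions, and records one labeling per hierarchy level. It illustrates the general approach only, not the CVPR 2010 implementation:

    # Simplified hierarchical region merging over a "region graph".
    import numpy as np

    rng = np.random.default_rng(2)
    n_regions = 30
    mean_color = rng.uniform(0, 255, size=(n_regions, 3))   # per-region appearance
    # Hypothetical adjacency edges between space-time regions.
    edges = [(i, j) for i in range(n_regions) for j in range(i + 1, n_regions)
             if rng.random() < 0.15]

    parent = list(range(n_regions))
    def find(x):
        # Union-find with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    hierarchy = []                        # one labeling per level of the tree
    for level in range(4):
        # Merge the 25% of edges with the smallest appearance difference.
        edges.sort(key=lambda e: np.linalg.norm(mean_color[find(e[0])] -
                                                mean_color[find(e[1])]))
        for a, b in edges[:max(1, len(edges) // 4)]:
            ra, rb = find(a), find(b)
            if ra != rb:
                parent[rb] = ra
                mean_color[ra] = (mean_color[ra] + mean_color[rb]) / 2.0
        hierarchy.append([find(i) for i in range(n_regions)])
        print(f"level {level}: {len(set(hierarchy[-1]))} regions")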

In the second part of this talk, I will show that constrained multi-agent events can be analyzed and even predicted from video. Such analysis requires estimating the global movements of all players in the scene at any time, and is needed for modeling and predicting how the multi-agent play evolves over time on the playing field. To this end, we propose a novel approach to detect the locations where the play evolution will proceed, e.g., where interesting events will occur, by tracking player positions and movements over time. To achieve this, we extract the ground-level sparse movement of players in each time step, and then generate a dense motion field. Using this field, we detect locations where the motion converges, implying positions towards which the play is evolving. I will show examples of how we have tested this approach for soccer, basketball, and hockey. (Kihwan Kim, Matthias Grundmann, Ariel Shamir, Iain Matthews, Jessica Hodgins, Irfan Essa, CVPR 2010, in collaboration with Disney Research).
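The play-evolution step can also be pictured with a small sketch: sparse per-player displacements (synthetic here, drifting toward an assumed target point) are densified into a motion field, and strongly negative divergence marks where the field converges, i.e., where the play is heading. This is an illustrative reconstruction under those assumptions, not the Disney Research code:

    # Sparse player motion -> dense field -> convergence (negative divergence).
    import numpy as np
    from scipy.interpolate import Rbf

    rng = np.random.default_rng(3)
    px, py = rng.uniform(0, 100, 20), rng.uniform(0, 60, 20)   # player positions
    # Hypothetical displacements: players drifting toward the point (80, 30),
    # with the drift fading out far from that point (a localized "sink").
    dx, dy = 80 - px, 30 - py
    mag = np.exp(-(dx**2 + dy**2) / (2 * 25.0**2))
    vx, vy = dx * mag, dy * mag

    # Densify the sparse movement into a field over the playing surface.
    fx = Rbf(px, py, vx, function='thin_plate')
    fy = Rbf(px, py, vy, function='thin_plate')
    gx, gy = np.meshgrid(np.linspace(0, 100, 100), np.linspace(0, 60, 60))
    U, V = fx(gx, gy), fy(gx, gy)

    # Divergence of the field; strongly negative values mark convergence points,
    # i.e. locations the play is evolving toward.
    div = np.gradient(U, axis=1) + np.gradient(V, axis=0)
    iy, ix = np.unravel_index(np.argmin(div), div.shape)
    print("predicted convergence near:", gx[iy, ix], gy[iy, ix])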

Time permitting, I will show some more videos of our recent work on video analysis and synthesis. For more information, papers, and videos, see my website.


Going Live on YouTube (2011): Lights, Camera… EDIT! New Features for the YouTube Video Editor

March 21st, 2011 Irfan Essa Posted in Computational Photography and Video, Google, In The News, Matthias Grundmann, Multimedia, Vivek Kwatra, WWW

via YouTube Blog: Lights, Camera… EDIT! New Features for the YouTube Video Editor.

  • M. Grundmann, V. Kwatra, and I. Essa (2011), “Auto-Directed Video Stabilization with Robust L1 Optimal Camera Paths,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011. [PDF] [WEBSITE] [VIDEO] [DEMO] [DOI] [BLOG] [BIBTEX]
    @inproceedings{2011-Grundmann-AVSWROCP,
      Author = {M. Grundmann and V. Kwatra and I. Essa},
      Blog = {http://prof.irfanessa.com/2011/06/19/videostabilization/},
      Booktitle = {Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
      Date-Modified = {2013-10-22 13:55:15 +0000},
      Demo = {http://www.youtube.com/watch?v=0MiY-PNy-GU},
      Doi = {10.1109/CVPR.2011.5995525},
      Month = {June},
      Pdf = {http://www.cc.gatech.edu/~irfan/p/2011-Grundmann-AVSWROCP.pdf},
      Publisher = {IEEE Computer Society},
      Title = {Auto-Directed Video Stabilization with Robust L1 Optimal Camera Paths},
      Url = {http://www.cc.gatech.edu/cpl/projects/videostabilization/},
      Video = {http://www.youtube.com/watch?v=i5keG1Y810U},
      Year = {2011},
      Bdsk-Url-1 = {http://www.cc.gatech.edu/cpl/projects/videostabilization/},
      Bdsk-Url-2 = {http://dx.doi.org/10.1109/CVPR.2011.5995525}}

Lights, Camera… EDIT! New Features for the YouTube Video Editor

Nine months ago we launched our cloud-based video editor. It was a simple product built to provide our users with simple editing tools. Although it didn’t have all the features available on paid desktop editing software, the idea was that the vast majority of people’s video editing needs are pretty basic and straightforward, and we could provide these features with a free editor available on the Web. Since launch, hundreds of thousands of videos have been published using the YouTube Video Editor and we’ve regularly pushed out new feature enhancements to the product, including:

  • Video transitions (crossfade, wipe, slide)
  • The ability to save projects across sessions
  • Increased clips allowed in the editor from 6 to 17
  • Video rotation (from portrait to landscape and vice versa – great for videos shot on mobile)
  • Shape transitions (heart, star, diamond, and Jack-O-Lantern for Halloween)
  • Audio mixing (AudioSwap track mixed with original audio)
  • Effects (brightness/contrast, black & white)

  • A new user interface and project menu for multiple saved projects

While many of these are familiar features also available on desktop software, today, we’re excited to unveil two new features that the team has been working on over the last couple of months that take unique advantage of the cloud:

Stabilizer

Ever shoot a shaky video that’s so jittery, it’s actually hard to watch? Professional cinematographers use stabilization equipment such as tripods or camera dollies to keep their shots smooth and steady. Our team mimicked these cinematographic principles by automatically determining the best camera path for you through a unified optimization technique. In plain English, you can smooth some of those unsteady videos with the click of a button. We also wanted you to be able to preview these results in real-time, before publishing the finished product to the Web. We do this by harnessing the power of the cloud: the computation required for stabilizing the video is split into chunks and distributed across different servers. This allows us to use the power of many machines in parallel, computing and streaming the stabilized results quickly into the preview. You can check out the paper we’re publishing, entitled “Auto-Directed Video Stabilization with Robust L1 Optimal Camera Paths.” Want to see the stabilizer in action? You can test it out for yourself, or check out these two videos. The first is without the stabilizer.

And now, with the stabilizer:
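For readers curious how an “L1 optimal camera path” can be written down, the sketch below poses a small convex program with cvxpy: find a new one-dimensional camera path close to the shaky one whose first, second, and third differences have small L1 norms, which favors piecewise static, constant-velocity, and constant-acceleration motion. The weights and crop bound are illustrative assumptions, not the formulation or values used in the product:

    # Illustrative L1 path optimization (not Google's implementation).
    import numpy as np
    import cvxpy as cp

    rng = np.random.default_rng(4)
    n = 200
    # Synthetic 1-D shaky camera motion: random walk plus slow pan.
    shaky_path = np.cumsum(rng.normal(0, 1.0, n)) + 0.2 * np.arange(n)

    p = cp.Variable(n)
    objective = cp.Minimize(
        10 * cp.norm1(cp.diff(p, 1)) +     # prefer static segments
        1 * cp.norm1(cp.diff(p, 2)) +      # prefer constant velocity
        100 * cp.norm1(cp.diff(p, 3)))     # prefer constant acceleration
    # Keep the stabilized path within a (hypothetical) crop window of the original.
    constraints = [cp.abs(p - shaky_path) <= 20]
    cp.Problem(objective, constraints).solve()

    print("residual shake (L1 of first differences):",
          np.abs(np.diff(p.value)).sum())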


N. Diakopoulos PhD Thesis (2009): “Collaborative annotation, analysis, and presentation interfaces for digital video”

July 6th, 2009 Irfan Essa Posted in Computational Journalism, Computational Photography and Video, Multimedia, Nick Diakopoulos, PhD, Students

Title: Collaborative annotation, analysis, and presentation interfaces for digital video

Author: Diakopoulos, Nicholas A.

Abstract

Information quality corresponds to the degree of excellence in communicating knowledge or intelligence and encompasses aspects of validity, accuracy, reliability, bias, transparency, and comprehensiveness among others. Professional news, public relations, and user generated content alike all have their own subtly different information quality concerns. With so much recent growth in online video, it is also apparent that more and more consumers will be getting their information from online videos and that understanding the information quality of video becomes paramount for a consumer wanting to make decisions based on it.

This dissertation explores the design and evaluation of collaborative video annotation and presentation interfaces as motivated by the desire for better information quality in online video. We designed, built, and evaluated three systems: (1) Audio Puzzler, a puzzle game which as a by-product of play produces highly accurate time-stamped transcripts of video, (2) Videolyzer, a video annotation system designed to aid bloggers and journalists collect, aggregate, and share analyses of information quality of video, and (3) Videolyzer CE, a simplified video annotation presentation which syndicates the knowledge collected using Videolyzer to a wider range of users in order to modulate their perceptions of video information. We contribute to knowledge of different interface methods for collaborative video annotation and to mechanisms for enhancing accuracy of objective metadata such as transcripts as well as subjective notions of information quality of the video itself.

via Collaborative annotation, analysis, and presentation interfaces for digital video.


Paper: ACM Multimedia (2008) “Audio Puzzler: Piecing Together Time-Stamped Speech Transcripts with a Puzzle Game”

October 18th, 2008 Irfan Essa Posted in ACM MM, Computational Journalism, Multimedia, Nick Diakopoulos, Papers

N. Diakopoulos, K. Luther, and I. Essa (2008), “Audio Puzzler: Piecing Together Time-Stamped Speech Transcripts with a Puzzle Game.” In Proceedings of ACM International Conference on Multimedia 2008, Vancouver, BC, Canada. [Project Link]

ABSTRACT

We have developed an audio-based casual puzzle game which produces a time-stamped transcription of spoken audio as a by-product of play. Our evaluation of the game indicates that it is both fun and challenging. The transcripts generated using the game are more accurate than those produced using a standard automatic transcription system, and the time-stamps of words are within several hundred milliseconds of ground truth.


Paper: Pragmatic Web (2008) “An Annotation Model for Making Sense of Information Quality in Online Videos”

September 28th, 2008 Irfan Essa Posted in Computational Journalism, Multimedia, Nick Diakopoulos, Papers

N. Diakopoulos and I. Essa (2008), “An Annotation Model for Making Sense of Information Quality in Online Videos.” In Proceedings of the International Conference on the Pragmatic Web, 28–30 Sept. 2008, Uppsala, Sweden. (To Appear)

ABSTRACT

Making sense of the information quality of online media including things such as the accuracy and validity of claims and the reliability of sources is essential for people to be well-informed. We are developing Videolyzer to address the challenge of information quality sense-making by allowing motivated individuals to analyze, collect, share, and respond to criticisms of the information quality of online political videos and their transcripts. In this paper specifically we present a model of how the annotation ontology and collaborative dynamics embedded in Videolyzer can enhance information quality.


Paper in ACM Multimedia (2006): “Interactive mosaic generation for video navigation”

October 22nd, 2006 Irfan Essa Posted in ACM MM, Computational Photography and Video, Gregory Abowd, Kihwan Kim, Multimedia, Papers

K. Kim, I. Essa, and G. Abowd (2006) “Interactive mosaic generation for video navigation.” in Proceedings of the 14th annual ACM international conference on Multimedia, pages 655-658, 2006. [Project Page | DOI | PDF]

Abstract

Navigation through large multimedia collections that include videos and images still remains cumbersome. In this paper, we introduce a novel method to visualize and navigate through the collection by creating a mosaic image that visually represents the compilation. This image is generated by a labeling-based layout algorithm using various sizes of sample tile images from the collection. Each tile represents photographs and video files whose scenes are selected by matching algorithms. The generated mosaic image provides a new way for thematic video browsing and visually summarizes the videos. Users can generate these mosaics with some predefined themes and layouts, or base them on the results of their queries. Our approach supports automatic generation of these layouts by using meta-information such as color, time-line, and the existence of faces, or manually generated annotations from existing systems (e.g., the Family Video Archive).
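As a toy illustration of the matching step (assumed data, not the paper’s layout algorithm), the sketch below assigns to each cell of a mosaic grid the candidate tile whose average color is closest to that cell’s target color; in a real system the tiles would be thumbnails drawn from the photo and video collection and could also be matched on time-line or face metadata:

    # Nearest-tile assignment by average colour for a small mosaic grid.
    import numpy as np

    rng = np.random.default_rng(5)
    target = rng.uniform(0, 255, size=(8, 8, 3))            # per-cell target colours
    tile_avg_color = rng.uniform(0, 255, size=(200, 3))     # 200 candidate tiles

    # For every grid cell pick the tile nearest in colour space.
    diff = target[:, :, None, :] - tile_avg_color[None, None, :, :]
    assignment = np.argmin(np.linalg.norm(diff, axis=-1), axis=-1)
    print("tile index chosen for cell (0, 0):", assignment[0, 0])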

Interactive Video Mosaic

