
Paper in IEEE CVPR 2013 “Decoding Children’s Social Behavior”

June 27th, 2013 Irfan Essa Posted in Affective Computing, Behavioral Imaging, Denis Lantsman, Gregory Abowd, James Rehg, PAMI/ICCV/CVPR/ECCV, Papers, Thomas Ploetz

  • J. M. Rehg, G. D. Abowd, A. Rozga, M. Romero, M. A. Clements, S. Sclaroff, I. Essa, O. Y. Ousley, Y. Li, C. Kim, H. Rao, J. C. Kim, L. L. Presti, J. Zhang, D. Lantsman, J. Bidwell, and Z. Ye (2013), “Decoding Children’s Social Behavior,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013. [PDF] [WEBSITE] [DOI] [BIBTEX]
    @inproceedings{2013-Rehg-DCSB,
      Author = {James M. Rehg and Gregory D. Abowd and Agata Rozga and Mario Romero and Mark A. Clements and Stan Sclaroff and Irfan Essa and Opal Y. Ousley and Yin Li and Chanho Kim and Hrishikesh Rao and Jonathan C. Kim and Liliana Lo Presti and Jianming Zhang and Denis Lantsman and Jonathan Bidwell and Zhefan Ye},
      Booktitle = {{Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR)}},
      Date-Added = {2013-06-25 11:47:42 +0000},
      Date-Modified = {2014-04-28 17:08:51 +0000},
      Doi = {10.1109/CVPR.2013.438},
      Month = {June},
      Organization = {IEEE Computer Society},
      Pdf = {http://www.cc.gatech.edu/~rehg/Papers/Rehg_CVPR13.pdf},
      Title = {Decoding Children's Social Behavior},
      Url = {http://www.cbi.gatech.edu/mmdb/},
      Year = {2013},
      Bdsk-Url-1 = {http://www.cbi.gatech.edu/mmdb/},
      Bdsk-Url-2 = {http://dx.doi.org/10.1109/CVPR.2013.438}}

Abstract

We introduce a new problem domain for activity recognition: the analysis of children’s social and communicative behaviors based on video and audio data. We specifically target interactions between children aged 1-2 years and an adult. Such interactions arise naturally in the diagnosis and treatment of developmental disorders such as autism. We introduce a new publicly-available dataset containing over 160 sessions of a 3-5 minute child-adult interaction. In each session, the adult examiner followed a semi-structured play interaction protocol which was designed to elicit a broad range of social behaviors. We identify the key technical challenges in analyzing these behaviors, and describe methods for decoding the interactions. We present experimental results that demonstrate the potential of the dataset to drive interesting research questions, and show preliminary results for multi-modal activity recognition.
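The preliminary multi-modal recognition results hint at a simple pipeline: pool audio and video features per session and feed a standard classifier. Below is a minimal, hypothetical sketch of that kind of early (feature-level) fusion; the descriptor dimensions, the random stand-in data, and the random-forest classifier are illustrative assumptions, not the authors' actual method.

    # Hedged sketch of early (feature-level) fusion for multi-modal activity
    # recognition. All features here are random stand-ins; the descriptor sizes
    # and the classifier are assumptions for illustration only.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    n_sessions = 160

    # Hypothetical per-session descriptors: pooled audio statistics (e.g. MFCC
    # means) and pooled video statistics (e.g. motion/gaze summaries).
    audio_feats = rng.normal(size=(n_sessions, 39))
    video_feats = rng.normal(size=(n_sessions, 64))
    labels = rng.integers(0, 2, size=n_sessions)   # e.g. a behavior occurred or not

    # Early fusion: concatenate the modalities, then classify each session.
    fused = np.concatenate([audio_feats, video_feats], axis=1)
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    print(cross_val_score(clf, fused, labels, cv=5).mean())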

Full database available from http://www.cbi.gatech.edu/mmdb/

via IEEE Xplore – Decoding Children’s Social Behavior.


Paper (2009) In ACM Symposium on Interactive 3D Graphics “Human Video Textures”

March 1st, 2009 Irfan Essa Posted in ACM SIGGRAPH, Computational Photography and Video, James Rehg, Matt Flagg, Modeling and Animation, Papers, Sing Bing Kang

 

Matthew Flagg, Atsushi Nakazawa, Qiushuang Zhang, Sing Bing Kang, Young Kee Ryu, Irfan Essa, James M. Rehg (2009), "Human Video Textures," In Proceedings of the ACM Symposium on Interactive 3D Graphics and Games 2009 (I3D '09), Boston, MA, February 27-March 1 (Fri-Sun), 2009. [PDF (see Copyright) | Video in DivX | Website]

Abstract

This paper describes a data-driven approach for generating photorealistic animations of human motion. Each animation sequence follows a user-choreographed path and plays continuously by seamlessly transitioning between different segments of the captured data. To produce these animations, we capitalize on the complementary characteristics of motion capture data and video. We customize our capture system to record motion capture data that are synchronized with our video source. Candidate transition points in video clips are identified using a new similarity metric based on 3-D marker trajectories and their 2-D projections into video. Once the transitions have been identified, a video-based motion graph is constructed. We further exploit hybrid motion and video data to ensure that the transitions are seamless when generating animations. Motion capture marker projections serve as control points for segmentation of layers and nonrigid transformation of regions. This allows warping and blending to generate seamless in-between frames for animation. We show a series of choreographed animations of walks and martial arts scenes as validation of our approach.
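To make the transition-finding step concrete, here is a small sketch: score a pair of frames by a weighted combination of the distances between their 3-D marker positions, marker velocities, and 2-D marker projections, and keep low-distance pairs as candidate edges of the video-based motion graph. The weights and the velocity term are illustrative assumptions, not the paper's exact metric.

    # Hedged sketch of a marker-based frame similarity for finding candidate
    # transitions. The weighting scheme is an assumption, not the paper's metric.
    import numpy as np

    def frame_distance(m3d_i, m3d_j, v3d_i, v3d_j, m2d_i, m2d_j,
                       w3d=1.0, wvel=0.5, w2d=1.0):
        """Distance between frames i and j given (K, 3) marker positions,
        (K, 3) marker velocities, and (K, 2) image projections."""
        d3d = np.linalg.norm(m3d_i - m3d_j, axis=1).mean()
        dvel = np.linalg.norm(v3d_i - v3d_j, axis=1).mean()
        d2d = np.linalg.norm(m2d_i - m2d_j, axis=1).mean()
        return w3d * d3d + wvel * dvel + w2d * d2d

    def candidate_transitions(m3d, v3d, m2d, threshold):
        """Frame pairs whose distance falls below a threshold become candidate
        edges of a video-based motion graph."""
        n = len(m3d)
        pairs = []
        for i in range(n):
            for j in range(n):
                if abs(i - j) > 1:  # ignore trivially adjacent frames
                    d = frame_distance(m3d[i], m3d[j], v3d[i], v3d[j],
                                       m2d[i], m2d[j])
                    if d < threshold:
                        pairs.append((i, j))
        return pairs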

Example Image from Project

Human Video Textures (Output Rendered as a Collage!)


Paper (2009): ICASSP “Learning Basic Units in American Sign Language using Discriminative Segmental Feature Selection”

February 4th, 2009 Irfan Essa Posted in 0205507, Face and Gesture, ICASSP, James Rehg, Numerical Machine Learning, Pei Yin, Thad Starner

Pei Yin, Thad Starner, Harley Hamilton, Irfan Essa, James M. Rehg (2009), "Learning Basic Units in American Sign Language using Discriminative Segmental Feature Selection" in IEEE International Conference on Acoustics, Speech, and Signal Processing 2009 (ICASSP 2009). Session: Spoken Language Understanding I, Tuesday, April 21, 11:00 – 13:00, Taipei, Taiwan.

ABSTRACT

The natural language for most deaf signers in the United States is American Sign Language (ASL). ASL has internal structure like spoken languages, and ASL linguists have introduced several phonemic models. The study of ASL phonemes is not only interesting to linguists, but also useful for scalable machine recognition. Since machine perception differs from human perception, this paper learns the basic units for ASL directly from data. Compared with previous studies, our approach computes a set of data-driven units (fenemes) discriminatively from the results of segmental feature selection. The learning iterates two steps: first apply discriminative feature selection segmentally to the signs, and then tie the most similar temporal segments and re-train. Intuitively, the sign parts that are indistinguishable to machines are merged to form basic units, which we call ASL fenemes. Experiments on publicly available ASL recognition data show that the extracted data-driven fenemes are meaningful, and recognition using those fenemes achieves improved accuracy at reduced model complexity.
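The "tie the most similar temporal segments" step can be pictured as clustering segment-level descriptors so that parts a classifier cannot distinguish share one model in the next re-training round. The sketch below uses hierarchical clustering as an illustrative stand-in; the descriptor, distance, and threshold are assumptions, not the paper's procedure.

    # Hedged sketch of segment tying: merge temporal segments whose descriptors
    # are closer than a threshold, yielding data-driven units ("fenemes").
    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    def tie_segments(segment_feats, merge_threshold=1.0):
        """segment_feats: (S, D) one descriptor per temporal segment across signs.
        Returns an integer unit id per segment; segments sharing an id are tied
        (trained as one unit) in the next iteration."""
        Z = linkage(segment_feats, method="average", metric="euclidean")
        return fcluster(Z, t=merge_threshold, criterion="distance")

    # Toy usage: 12 segments described by 8-D discriminative scores.
    rng = np.random.default_rng(0)
    print(tie_segments(rng.normal(size=(12, 8))))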


Paper: ICASSP (2008) “Discriminative Feature Selection for Hidden Markov Models using Segmental Boosting”

April 3rd, 2008 Irfan Essa Posted in 0205507, Face and Gesture, Funding, James Rehg, Numerical Machine Learning, PAMI/ICCV/CVPR/ECCV, Papers, Pei Yin, Thad Starner

Pei Yin, Irfan Essa, James Rehg, Thad Starner (2008) “Discriminative Feature Selection for Hidden Markov Models using Segmental Boosting”, ICASSP 2008 – March 30 – April 4, 2008 – Las Vegas, Nevada, U.S.A. (Paper: MLSP-P3.D8, Session: Pattern Recognition and Classification II, Time: Thursday, April 3, 15:30 – 17:30, Topic: Machine Learning for Signal Processing: Learning Theory and Modeling) (PDF|Project Site)

ABSTRACT

We address the feature selection problem for hidden Markov models (HMMs) in sequence classification. Temporal correlation in sequences often causes difficulty in applying feature selection techniques. Inspired by segmental k-means segmentation (SKS), we propose Segmentally Boosted HMMs (SBHMMs), where the state-optimized features are constructed in a segmental and discriminative manner. The contributions are twofold. First, we introduce a novel feature selection algorithm, where the temporal dynamics are decoupled from the static learning procedure by assuming that the sequential data are piecewise independent and identically distributed. Second, we show that the SBHMM consistently improves traditional HMM recognition in various domains. The reduction of error compared to traditional HMMs ranges from 17% to 70% in American Sign Language recognition, human gait identification, lip reading, and speech recognition.
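A rough way to picture the segmental boosting step: after an initial segmentation labels each frame with an HMM state, treat the frames as i.i.d. samples, train a boosted classifier on the state labels, and use its per-state scores as the new observation features for HMM re-training. The sketch below uses a stock AdaBoost learner and its decision scores as assumptions; it is not the paper's exact SBHMM construction.

    # Hedged sketch of constructing state-optimized features segmentally and
    # discriminatively. Classifier choice and feature transform are assumptions.
    import numpy as np
    from sklearn.ensemble import AdaBoostClassifier

    def segmentally_boosted_features(frames, state_labels, n_estimators=50):
        """frames: (T, D) observations; state_labels: (T,) state index per frame
        from an initial segmentation. Returns (T, n_states) discriminative
        scores to use as observations when re-training the HMM."""
        clf = AdaBoostClassifier(n_estimators=n_estimators, random_state=0)
        clf.fit(frames, state_labels)          # frames treated as piecewise i.i.d.
        return clf.decision_function(frames)   # per-state scores as new features

    # Toy usage: 200 frames, 10-D features, 4 HMM states.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 10))
    states = rng.integers(0, 4, size=200)
    print(segmentally_boosted_features(X, states).shape)  # (200, 4)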


Paper: J. Parallel Distrib. Computing (2005): “Experiences with optimizing two stream-based applications for cluster execution”

September 30th, 2006 Irfan Essa Posted in Computational Photography and Video, James Rehg, Kishore Ramachandran, Papers, Research

Angelov, Y., Ramachandran, U., Mackenzie, K., Rehg, J. M., and Essa, I. (2005). "Experiences with optimizing two stream-based applications for cluster execution." J. Parallel Distrib. Comput. 65, 6 (Jun. 2005), 678-691. [DOI]

Abstract

We explore optimization strategies and resulting performance of two stream-based video applications, video texture and color tracker, on a cluster of SMPs. The two applications are representative of a class of emerging applications, which we call "stream-based applications", that are sensitive to both latency of individual results and overall throughput. Such applications require non-trivial parallelization techniques in order to improve both latency and throughput, given that the stream data emanates from a limited set of sources (exactly one in the two applications studied) and that the distribution of the data cannot be done a priori. We suggest techniques that address in a coordinated fashion the problems of data distribution and work partitioning. We believe the two problems are related and need to be addressed together. We have parallelized the two applications using the Stampede cluster programming system, which provides abstractions for implementing time- and throughput-sensitive applications elegantly and efficiently. For the Video Textures application we show that we can achieve a speedup of 24.26 on a 112-processor cluster. For the Color Tracker application, where latency is more crucial, we identify the extent of data parallelism that ensures that the slowest member of the pipeline is no longer the bottleneck for achieving a decent frame rate.
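The bottleneck argument for the latency-sensitive tracker can be illustrated with back-of-the-envelope arithmetic: the pipeline's frame rate is set by its slowest effective stage, so the question is how many data-parallel workers that stage needs before it stops being the bottleneck. The stage times below are made up for illustration, not measurements from the paper.

    # Hedged sketch of the pipeline-bottleneck reasoning; stage times are
    # illustrative, not the paper's measurements.
    import math

    def pipeline_fps(stage_seconds, workers_per_stage):
        """Steady-state frames/sec when each stage may be data-parallelized;
        the slowest effective stage limits throughput."""
        return 1.0 / max(t / w for t, w in zip(stage_seconds, workers_per_stage))

    def workers_needed(stage_time, target_fps):
        """Workers required so a single stage can sustain the target rate."""
        return math.ceil(stage_time * target_fps)

    stages = [0.02, 0.15, 0.03]                 # capture, tracking, display (s/frame)
    print(pipeline_fps(stages, [1, 1, 1]))      # ~6.7 fps, tracking-bound
    print(workers_needed(0.15, target_fps=15))  # 3 workers for the tracking stage
    print(pipeline_fps(stages, [1, 3, 1]))      # 20 fps once tracking is parallelized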


Paper: IEEE CVPR (2004) “Asymmetrically boosted HMM for speech reading”

June 2nd, 2004 Irfan Essa Posted in 0205507, Funding, James Rehg, Papers, Pei Yin

Yin, P., Essa, I., Rehg, J.M. (2004), "Asymmetrically boosted HMM for speech reading," In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004 (CVPR 2004). Publication Date: 27 June-2 July 2004, Volume: 2, On page(s): II-755 – II-761 Vol. 2, ISSN: 1063-6919, ISBN: 0-7695-2158-, INSPEC Accession Number: 8161546, Digital Object Identifier: 10.1109/CVPR.2004.1315240

Abstract

Speech reading, also known as lip reading, is aimed at extracting visual cues of lip and facial movements to aid in recognition of speech. The main hurdle for speech reading is that visual measurements of lip and facial motion lack information-rich features like the Mel frequency cepstral coefficients (MFCC), widely used in acoustic speech recognition. These MFCC are used with hidden Markov models (HMM) in most speech recognition systems at present. Speech reading could greatly benefit from automatic selection and formation of informative features from measurements in the visual domain. These new features can then be used with HMM to capture the dynamics of lip movement and eventual recognition of lip shapes. Towards this end, we use AdaBoost methods for automatic visual feature formation. Specifically, we design an asymmetric variant of AdaBoost M2 algorithm to deal with the ill-posed multi-class sample distribution inherent in our problem. Our experiments show that the boosted HMM approach outperforms conventional AdaBoost and HMM classifiers. Our primary contributions are in the design of (a) boosted HMM and (b) asymmetric multi-class boosting.
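One way to picture the asymmetry is through class-dependent sample weights: rare classes in a skewed viseme distribution receive more boosting weight so the ensemble does not ignore them. The sketch below uses inverse-frequency weights with a stock AdaBoost learner as an illustrative stand-in; it is not the paper's asymmetric AdaBoost M2 derivation.

    # Hedged sketch: counter a skewed multi-class distribution by weighting
    # samples inversely to class frequency. Data and learner are stand-ins.
    import numpy as np
    from sklearn.ensemble import AdaBoostClassifier

    def inverse_frequency_weights(y):
        """Per-sample weights so every class carries equal total weight."""
        classes, counts = np.unique(y, return_counts=True)
        per_class = dict(zip(classes, 1.0 / counts))
        w = np.array([per_class[c] for c in y])
        return w / w.sum()

    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 20))                          # e.g. visual lip features
    y = rng.choice([0, 1, 2], size=300, p=[0.7, 0.2, 0.1])  # skewed class distribution
    clf = AdaBoostClassifier(n_estimators=100, random_state=0)
    clf.fit(X, y, sample_weight=inverse_frequency_weights(y))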


Paper: Asilomar Conference (2003) “Boosted audio-visual HMM for speech reading”

November 9th, 2003 Irfan Essa Posted in 0205507, Face and Gesture, Funding, James Rehg, Numerical Machine Learning, Papers, Pei Yin

Yin, P., Essa, I., Rehg, J.M. (2003) "Boosted audio-visual HMM for speech reading." In Proceedings of the Thirty-Seventh Asilomar Conference on Signals, Systems and Computers, 2003. Date: 9-12 Nov. 2003, Volume: 2, On page(s): 2013 – 2018 Vol. 2, ISBN: 0-7803-8104-1, INSPEC Accession Number: 8555396, Digital Object Identifier: 10.1109/ACSSC.2003.1292334

Abstract

We propose a new approach for combining acoustic and visual measurements to aid in recognizing lip shapes of a person speaking. Our method relies on computing the maximum likelihoods of (a) an HMM used to model phonemes from the acoustic signal, and (b) an HMM used to model visual feature motions from video. One significant addition in this work is the dynamic analysis with features selected by AdaBoost, on the basis of their discriminant ability. This form of integration, leading to boosted HMM, permits AdaBoost to find the best features first, and then uses HMM to exploit dynamic information inherent in the signal.
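The decision-level combination can be sketched simply: each class has an acoustic HMM and a visual HMM, a test sequence is scored by both, and a weighted sum of log-likelihoods selects the class. The weighting rule and the toy scores below are assumptions, not the paper's exact integration.

    # Hedged sketch of combining acoustic and visual HMM scores per class.
    import numpy as np

    def classify(audio_loglik, video_loglik, audio_weight=0.5):
        """audio_loglik, video_loglik: (n_classes,) log P(sequence | class HMM).
        Returns the index of the winning class."""
        combined = audio_weight * audio_loglik + (1.0 - audio_weight) * video_loglik
        return int(np.argmax(combined))

    # Toy log-likelihoods for 5 lip-shape classes.
    audio = np.array([-120.0, -118.5, -130.2, -125.0, -119.0])
    video = np.array([-80.0, -85.0, -78.5, -90.0, -88.0])
    print(classify(audio, video, audio_weight=0.6))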


Funding: NSF/ITR (2002) “Analysis of Complex Audio-Visual Events Using Spatially Distributed Sensors”

October 1st, 2002 Irfan Essa Posted in 0205507, Funding, James Rehg

Award#0205507 – ITR: Analysis of Complex Audio-Visual Events Using Spatially Distributed Sensors

ABSTRACT

We propose to develop a comprehensive framework for the joint analysis of audio-visual signals obtained from spatially distributed microphones and cameras. We desire solutions to the audio-visual sensing problem that will scale to an arbitrary number of cameras and microphones and can address challenging environments in which there are multiple speech and nonspeech sound sources and multiple moving people and objects. Recently it has become relatively inexpensive to deploy tens or even hundreds of cameras and microphones in an environment. Many applications could benefit from the ability to sense in both modalities. There are two levels at which joint audio-visual analysis can take place. At the signal level, the challenge is to develop representations that capture the rich dependency structure in the joint signal and deal successfully with issues such as variable sampling rates and varying temporal delays between cues. At the spatial level, the challenge is to compensate for the distortions introduced by the sensor location and pool information across sensors to recover 3-D information about the spatial environment. For many applications, it is highly desirable if the solution method is self-calibrating and does not require an extensive manual calibration process every time a new sensor is added or an old sensor is moved or replaced. Removing the burden of manual calibration also makes it possible to exploit ad hoc sensor networks which could arise, for example, from wearable microphones and cameras. We propose to address the following four research topics: 1. Representations and learning methods for signal-level fusion. 2. Volumetric techniques for fusing spatially distributed audio-visual data. 3. Self-calibration of distributed microphone-camera systems. 4. Applications of audio-visual sensing. For example, this proposal includes considerable work on lip and facial analysis to improve voice communications.
