Paper: Asilomar Conference (2003) “Boosted audio-visual HMM for speech reading”

November 9th, 2003 Irfan Essa Posted in 0205507, Face and Gesture, Funding, James Rehg, Machine Learning, Papers, Pei Yin

Yin, P., Essa, I., and Rehg, J.M. (2003). "Boosted audio-visual HMM for speech reading." In Proceedings of the Thirty-Seventh Asilomar Conference on Signals, Systems and Computers, Nov. 9-12, 2003, Vol. 2, pp. 2013-2018. ISBN: 0-7803-8104-1. INSPEC Accession Number: 8555396. DOI: 10.1109/ACSSC.2003.1292334


We propose a new approach for combining acoustic and visual measurements to aid in recognizing lip shapes of a person speaking. Our method relies on computing the maximum likelihoods of (a) an HMM used to model phonemes from the acoustic signal, and (b) an HMM used to model visual feature motions from video. One significant addition in this work is the dynamic analysis with features selected by AdaBoost on the basis of their discriminative ability. This form of integration, leading to a boosted HMM, permits AdaBoost to find the best features first, and then uses the HMM to exploit the dynamic information inherent in the signal.
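As a rough illustration of the HMM likelihood stage of such a pipeline (the AdaBoost feature-selection step is omitted, and all names, parameters, and models below are illustrative rather than taken from the paper), a discrete-observation forward log-likelihood can be computed and compared across per-class models:

```python
import numpy as np

def forward_loglik(obs, pi, A, B):
    """Scaled forward algorithm: log P(obs | HMM).

    obs: sequence of discrete symbol indices
    pi:  (S,) initial state distribution
    A:   (S, S) transitions, A[i, j] = P(next state j | state i)
    B:   (S, V) emissions, B[i, k] = P(symbol k | state i)
    """
    alpha = pi * B[:, obs[0]]
    loglik = 0.0
    for t in range(1, len(obs)):
        c = alpha.sum()               # rescale to avoid underflow
        loglik += np.log(c)
        alpha = (alpha / c) @ A * B[:, obs[t]]
    return loglik + np.log(alpha.sum())

# A maximum-likelihood classifier in this style scores the sequence
# under one HMM per class (e.g., per phoneme or lip shape) and picks
# the model with the highest log-likelihood.
def classify(obs, models):
    return max(models, key=lambda name: forward_loglik(obs, *models[name]))
```

In the paper's boosted setting, the observation stream would be built from the features AdaBoost ranked as most discriminative; here it is simply a generic discrete symbol sequence.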


Funding: NSF/ITR (2002) “Analysis of Complex Audio-Visual Events Using Spatially Distributed Sensors”

October 1st, 2002 Irfan Essa Posted in 0205507, Funding, James Rehg

Award#0205507 – ITR: Analysis of Complex Audio-Visual Events Using Spatially Distributed Sensors


We propose to develop a comprehensive framework for the joint analysis of audio-visual signals obtained from spatially distributed microphones and cameras. We seek solutions to the audio-visual sensing problem that scale to an arbitrary number of cameras and microphones and can address challenging environments with multiple speech and non-speech sound sources and multiple moving people and objects. Recently it has become relatively inexpensive to deploy tens or even hundreds of cameras and microphones in an environment, and many applications could benefit from the ability to sense in both modalities. There are two levels at which joint audio-visual analysis can take place. At the signal level, the challenge is to develop representations that capture the rich dependency structure in the joint signal and deal successfully with issues such as variable sampling rates and varying temporal delays between cues. At the spatial level, the challenge is to compensate for the distortions introduced by sensor location and to pool information across sensors to recover 3-D information about the spatial environment. For many applications it is highly desirable that the solution be self-calibrating, so that no extensive manual calibration process is required every time a new sensor is added, moved, or replaced. Removing the burden of manual calibration also makes it possible to exploit ad hoc sensor networks, which could arise, for example, from wearable microphones and cameras. We propose to address the following four research topics:

1. Representations and learning methods for signal-level fusion.
2. Volumetric techniques for fusing spatially distributed audio-visual data.
3. Self-calibration of distributed microphone-camera systems.
4. Applications of audio-visual sensing; for example, this proposal includes considerable work on lip and facial analysis to improve voice communications.
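The proposal text does not commit to particular algorithms, but one classic building block behind both signal-level alignment (handling temporal delays between cues) and microphone self-calibration is estimating the relative time delay between two sensors from the peak of their cross-correlation. A minimal sketch (the function name and parameters are illustrative, not from the proposal):

```python
import numpy as np

def estimate_delay(x, y):
    """Estimate the integer-sample delay of signal y relative to signal x
    by locating the peak of their full cross-correlation.
    A positive result means y lags x (the event arrives later in y)."""
    corr = np.correlate(y, x, mode="full")
    return int(np.argmax(corr)) - (len(x) - 1)
```

With several microphone pairs, such time-difference-of-arrival estimates are the raw material for triangulating a sound source or recovering sensor geometry without manual calibration.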


Funding: NSF (2001) ITR/SY “The Aware Home: Sustaining the Quality of Life for an Aging Population”

October 1st, 2001 Irfan Essa Posted in Aaron Bobick, Aware Home, Beth Mynatt, Funding, Gregory Abowd, Wendy Rogers

Award# 0121661 – ITR/SY: The Aware Home: Sustaining the Quality of Life for an Aging Population


The focus of this project is on the development of a domestic environment that is cognizant of the whereabouts and activities of its occupants and can support them in their everyday life. While the technology is applicable to a range of domestic situations, the emphasis in this work will be on support for aging in place; through collaboration with experts in assistive care and cognitive aging, the PI and his team will design, demonstrate, and evaluate a series of domestic services that aim to maintain the quality of life for an aging population, with the goal of increasing the likelihood of a "stay at home" alternative to assisted living that satisfies the needs of an aging individual and his/her distributed family. In particular, the PI will explore two areas that are key to sustaining quality of life for an independent senior adult: maintaining familial vigilance, and supporting daily routines. The intention is to serve as an active partner, aiding the senior occupant without taking control.

This research will lead to advances in three research areas: human-computer interaction, computational perception, and software engineering. To achieve the desired goals, the PI will conduct the research and experimentation in an authentic domestic setting, a novel research facility called the Residential Laboratory recently completed next to the Georgia Tech campus. Together with experts in theoretical and practical aspects of aging, the PI will establish a pattern of research in which informed design of ubiquitous computing technology can be rapidly deployed, evaluated, and evolved in an authentic setting. Special attention will be paid throughout to issues relating to privacy and trust. The PI will transition the products of this project to researchers and practitioners interested in performing more large-scale observations of the social and economic impact of Aware Home technologies.


Funding: NSF (1998) Experimental Software Systems “Automated Understanding of Captured Experience”

September 1st, 1998 Irfan Essa Posted in Activity Recognition, Audio Analysis, Aware Home, Funding, Gregory Abowd, Intelligent Environments

Award#9806822 – Experimental Software Systems: Automated Understanding of Captured Experience

The objective of this research is to substantially reduce the human input necessary for creating and accessing large collections of multimedia, particularly multimedia created by capturing what is happening in an environment. The existing software system being used as the starting point for this investigation is Classroom 2000, a system designed to capture what happens in classrooms, meetings, and offices. Classroom 2000 integrates and synchronizes multiple streams of captured text, images, handwritten annotations, audio, and video; in a sense, it automates note-taking for a lecture or meeting. The research challenge is to make sense of this flood of captured data. The project explores how the output of Classroom 2000 can be automatically structured, segmented, indexed, and linked. Machine learning and statistical approaches to language are used to attempt to understand the captured data, and techniques from computational perception are used to find structure in it. An important component of this research is the experimental analysis of the software system being built. The expectation is that this research will have a dramatic impact on how humans work and learn, as technology aids humans by capturing and making accessible what happens in an environment.
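To make the segmentation idea concrete: one simple, purely illustrative strategy (not necessarily what Classroom 2000 used) is to split a stream of timestamped capture events, such as pen strokes or slide changes, wherever the gap between consecutive events exceeds a threshold:

```python
def segment_by_gaps(timestamps, max_gap=30.0):
    """Split a sorted list of event times (in seconds) into segments,
    starting a new segment whenever consecutive events are more than
    max_gap seconds apart."""
    segments = [[timestamps[0]]]
    for t in timestamps[1:]:
        if t - segments[-1][-1] > max_gap:
            segments.append([t])      # long pause: start a new segment
        else:
            segments[-1].append(t)
    return segments
```

Segments found this way could then serve as the units for indexing and linking across the synchronized text, audio, and video streams.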
