Paper in IROS 2012: “Linguistic Transfer of Human Assembly Tasks to Robots”

October 7th, 2012 Irfan Essa Posted in 0205507, Activity Recognition, IROS/ICRA, Mike Stilman, Robotics

Linguistic Transfer of Human Assembly Tasks to Robots

  • N. Dantam, I. Essa, and M. Stilman (2012), “Linguistic Transfer of Human Assembly Tasks to Robots,” in Proceedings of Intelligent Robots and Systems (IROS), 2012. [PDF] [DOI] [BIBTEX]
    @InProceedings{    2012-Dantam-LTHATR,
      author  = {N. Dantam and I. Essa and M. Stilman},
      booktitle  = {Proceedings of Intelligent Robots and Systems (IROS)},
      doi    = {10.1109/IROS.2012.6385749},
      pdf    = {},
      title    = {Linguistic Transfer of Human Assembly Tasks to Robots},
      year    = {2012}
    }


We demonstrate the automatic transfer of an assembly task from human to robot. This work extends efforts showing the utility of linguistic models in verifiable robot control policies by now performing real visual analysis of human demonstrations to automatically extract a policy for the task. This method tokenizes each human demonstration into a sequence of object connection symbols, then transforms the set of sequences from all demonstrations into an automaton, which represents the task-language for assembling a desired object. Finally, we combine this assembly automaton with a kinematic model of a robot arm to reproduce the demonstrated task.
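The step from demonstration sequences to an automaton can be illustrated with a minimal sketch. This is not the paper's construction, which is not given in this post; it assumes a simple prefix-tree acceptor built over tokenized demonstrations, and the connection symbols such as `connect(A,B)` are hypothetical placeholders.

```python
from collections import defaultdict


def build_prefix_automaton(sequences):
    """Build a prefix-tree acceptor from token sequences.

    States are integers and state 0 is the start state.
    Returns (transitions, accepting), where transitions[state][token]
    gives the next state.
    """
    transitions = defaultdict(dict)
    accepting = set()
    next_state = 1
    for seq in sequences:
        state = 0
        for token in seq:
            # Add a new state only when this token has not been seen here.
            if token not in transitions[state]:
                transitions[state][token] = next_state
                next_state += 1
            state = transitions[state][token]
        # The state reached at the end of a demonstration is accepting.
        accepting.add(state)
    return dict(transitions), accepting


def accepts(transitions, accepting, seq):
    """Return True if the automaton accepts the token sequence."""
    state = 0
    for token in seq:
        nxt = transitions.get(state, {}).get(token)
        if nxt is None:
            return False
        state = nxt
    return state in accepting


# Hypothetical tokenized demonstrations (assumed symbols, not from the paper).
demos = [
    ["connect(A,B)", "connect(B,C)"],
    ["connect(A,B)", "connect(B,D)", "connect(D,C)"],
]
transitions, accepting = build_prefix_automaton(demos)
```

Such an acceptor recognizes exactly the demonstrated sequences; generalizing beyond them would require merging states, which this sketch does not attempt.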

Presented at: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2012), October 7-12, 2012 Vilamoura, Algarve, Portugal.



Funding (2011) NSF (1146352) “EAGER: Linguistic Task Transfer for Humans and Cyber Systems”

September 1st, 2011 Irfan Essa Posted in Activity Recognition, Mike Stilman, NSF, Robotics

EAGER: Linguistic Task Transfer for Humans and Cyber Systems (Mike Stilman, Irfan Essa) NSF/RI

This project, investigating formal languages as a general methodology for task transfer between distinct cyber-physical systems such as humans and robots, aims to expand the science of cyber physical systems by developing Motion Grammars that will enable task transfer between distinct systems.

Formal languages are tools for encoding, describing and transferring structured knowledge. In natural language, the latter process is called communication. Similarly, we will develop a formal language through which arbitrary cyber-physical systems communicate tasks via structured actions. This investigation of Motion Grammars will contribute to the science of human cognition and the engineering of cyber-physical algorithms. By observing human activities during manipulation we will develop a novel class of hybrid control algorithms based on linguistic representations of task execution. These algorithms will broaden the capabilities of man-made systems and provide the infrastructure for motion transfer between humans, robots and broader systems in a generic context. Furthermore, the representation in a rigorous grammatical context will enable formal verification and validation in future work.
Broader Impacts: The proposed research has direct applications to new solutions for manufacturing, medical treatments such as surgery, logistics and food processing. In turn, each of these areas has a significant impact on the efficiency and convenience of our daily lives. The PIs serve as coordinators of graduate/undergraduate programs and mentors to community schools. In order to guarantee that women and minorities have a significant role in the research, the PIs will annually invite K-12 students from Atlanta schools with primarily African American populations to the laboratories. One-day robot classes will be conducted that engage students in the excitement of hands-on science by interactively using lab equipment to transfer their manipulation skills to a robot arm.

Via Award#1146352 – EAGER: Linguistic Task Transfer for Humans and Cyber Systems.


Funding (2011): NSF (1059362): “II-New: Motion Grammar Laboratory”

March 1st, 2011 Irfan Essa Posted in Henrik Christensen, Mike Stilman, NSF

II-New: Motion Grammar Laboratory (Stilman, Essa, Egerstedt, Christensen, Ueda) Division of Computer and Network Systems Instrumentation Grant.

An anthropomorphic robot arm and a human capture system enable the autonomous performance of assembly tasks with significant uncertainty in problem specifications and environments. This line of work is investigated through sequences of manipulation actions where the guarantee of the completion of task-level objectives is rooted in the discovery of the semantic structure of human manipulation. New research directions in anthropomorphic robotics are explored including programming by demonstration, activity recognition, control and estimation and planning.

The Motion Grammar Laboratory infrastructure offers a great opportunity for research and education. New classroom experiences for undergraduates and graduates provide practical experience in human-robot interaction and activity process sharing. This opens possibilities for human training and rehabilitation, as well as assistive personal robotics, and opens the door to a host of technological innovations.

via Award#1059362 – II-New: Motion Grammar Laboratory.


Paper (2009): ICASSP “Learning Basic Units in American Sign Language using Discriminative Segmental Feature Selection”

February 4th, 2009 Irfan Essa Posted in 0205507, Face and Gesture, ICASSP, James Rehg, Machine Learning, Pei Yin, Thad Starner

Pei Yin, Thad Starner, Harley Hamilton, Irfan Essa, James M. Rehg (2009), “Learning Basic Units in American Sign Language using Discriminative Segmental Feature Selection,” in IEEE International Conference on Acoustics, Speech, and Signal Processing 2009 (ICASSP 2009). Session: Spoken Language Understanding I, Tuesday, April 21, 11:00 – 13:00, Taipei, Taiwan.


The natural language for most deaf signers in the United States is American Sign Language (ASL). ASL has internal structure like spoken languages, and ASL linguists have introduced several phonemic models. The study of ASL phonemes is not only interesting to linguists, but also useful for scalability in recognition by machines. Since machine perception is different from human perception, this paper learns the basic units for ASL directly from data. Compared with previous studies, our approach computes a set of data-driven units (fenemes) discriminatively from the results of segmental feature selection. The learning iterates the following two steps: first apply discriminative feature selection segmentally to the signs, and then tie the most similar temporal segments to re-train. Intuitively, the sign parts indistinguishable to machines are merged to form basic units, which we call ASL fenemes. Experiments on publicly available ASL recognition data show that the extracted data-driven fenemes are meaningful, and recognition using those fenemes achieves improved accuracy at reduced model complexity.


Paper: ICASSP (2008) “Discriminative Feature Selection for Hidden Markov Models using Segmental Boosting”

April 3rd, 2008 Irfan Essa Posted in 0205507, Face and Gesture, Funding, James Rehg, Machine Learning, PAMI/ICCV/CVPR/ECCV, Papers, Pei Yin, Thad Starner

Pei Yin, Irfan Essa, James Rehg, Thad Starner (2008) “Discriminative Feature Selection for Hidden Markov Models using Segmental Boosting”, ICASSP 2008 – March 30 – April 4, 2008 – Las Vegas, Nevada, U.S.A. (Paper: MLSP-P3.D8, Session: Pattern Recognition and Classification II, Time: Thursday, April 3, 15:30 – 17:30, Topic: Machine Learning for Signal Processing: Learning Theory and Modeling) (PDF|Project Site)


We address the feature selection problem for hidden Markov models (HMMs) in sequence classification. Temporal correlation in sequences often causes difficulty in applying feature selection techniques. Inspired by segmental k-means segmentation (SKS), we propose Segmentally Boosted HMMs (SBHMMs), where the state-optimized features are constructed in a segmental and discriminative manner. The contributions are twofold. First, we introduce a novel feature selection algorithm, where the temporal dynamics are decoupled from the static learning procedure by assuming that the sequential data are piecewise independent and identically distributed. Second, we show that the SBHMM consistently improves traditional HMM recognition in various domains. The reduction of error compared to traditional HMMs ranges from 17% to 70% in American Sign Language recognition, human gait identification, lip reading, and speech recognition.


Thesis: Mitch Parry PhD (2007), “Separation and Analysis of Multichannel Signals”

October 9th, 2007 Irfan Essa Posted in 0205507, Audio Analysis, Funding, Mitch Parry, PhD, Thesis

Mitch Parry (2007), “Separation and Analysis of Multichannel Signals,” PhD Thesis [PDF], Georgia Institute of Technology, College of Computing, Atlanta, GA. (Advisor: Irfan Essa)


This thesis examines a large and growing class of digital signals that capture the combined effect of multiple underlying factors. In order to better understand these signals, we would like to separate and analyze the underlying factors independently. Although source separation applies to a wide variety of signals, this thesis focuses on separating individual instruments from a musical recording. In particular, we propose novel algorithms for separating instrument recordings given only their mixture. When the number of source signals does not exceed the number of mixture signals, we focus on a subclass of source separation algorithms based on joint diagonalization. Each approach leverages a different form of source structure. We introduce repetitive structure as an alternative that leverages unique repetition patterns in music and compare its performance against the other techniques.

When the number of source signals exceeds the number of mixtures (i.e., the underdetermined problem), we focus on spectrogram factorization techniques for source separation. We extend single-channel techniques to utilize the additional spatial information in multichannel recordings, and use phase information to improve the estimation of the underlying components.

via Separation and Analysis of Multichannel Signals.


Paper: IEEE CVPR (2007) “Tree-based Classifiers for Bilayer Video Segmentation”

June 17th, 2007 Irfan Essa Posted in 0205507, Antonio Criminisi, Computational Photography and Video, Funding, John Winn, Machine Learning, Papers, Pei Yin, Research

Yin, Pei; Criminisi, Antonio; Winn, John; Essa, Irfan (2007), “Tree-based Classifiers for Bilayer Video Segmentation,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2007. CVPR ’07, 17-22 June 2007, page(s): 1 – 8, Location: Minneapolis, MN, USA, ISBN: 1-4244-1180-7, Digital Object Identifier: 10.1109/CVPR.2007.383008


This paper presents an algorithm for the automatic segmentation of monocular videos into foreground and background layers. Correct segmentations are produced even in the presence of large background motion with nearly stationary foreground. There are three key contributions. The first is the introduction of a novel motion representation, “motons”, inspired by research in object recognition. Second, we propose learning the segmentation likelihood from the spatial context of motion. The learning is efficiently performed by Random Forests. The third contribution is a general taxonomy of tree-based classifiers, which facilitates theoretical and experimental comparisons of several known classification algorithms, as well as spawning new ones. Diverse visual cues such as motion, motion context, colour, contrast and spatial priors are fused together by means of a Conditional Random Field (CRF) model. Segmentation is then achieved by binary min-cut. Our algorithm requires no initialization. Experiments on many video-chat type sequences demonstrate the effectiveness of our algorithm in a variety of scenes. The segmentation results are comparable to those obtained by stereo systems.


Paper: IEEE ICASSP (2007) “Incorporating Phase Information for Source Separation via Spectrogram Factorization”

April 15th, 2007 Irfan Essa Posted in 0205507, Audio Analysis, Funding, Mitch Parry, Papers, Research

Parry, R.M.; Essa, I. (2007) “Incorporating Phase Information for Source Separation via Spectrogram Factorization.” In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2007. ICASSP 2007. 15-20 April 2007, Volume: 2, page(s): II-661 – II-66, Honolulu, HI, ISSN: 1520-6149, ISBN: 1-4244-0728-1, INSPEC Accession Number: 9497202, Digital Object Identifier: 10.1109/ICASSP.2007.366322


Spectrogram factorization methods have been proposed for single channel source separation and audio analysis. Typically, the mixture signal is first converted into a time-frequency representation such as the short-time Fourier transform (STFT). The phase information is thrown away and this spectrogram matrix is then factored into the sum of rank-one source spectrograms. This approach incorrectly assumes the mixture spectrogram is the sum of the source spectrograms. In fact, the mixture spectrogram depends on the phase of the source STFTs. We investigate the consequences of this common assumption and introduce an approach that leverages a probabilistic representation of phase to improve the separation results.
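The dependence on phase can be seen in a single time-frequency bin. The sketch below is only an illustration, not the paper's method: it assumes two unit-magnitude source coefficients (hypothetical values) and compares the power of their sum against the additive assumption as the phase difference varies.

```python
import cmath
import math


def mixture_power(mag1, mag2, phase_diff):
    """Power of the sum of two complex STFT coefficients in one bin.

    The first source has phase 0 and the second has phase `phase_diff`.
    """
    s1 = cmath.rect(mag1, 0.0)
    s2 = cmath.rect(mag2, phase_diff)
    return abs(s1 + s2) ** 2


# Additive assumption: mixture power equals the sum of source powers.
additive = 1.0 ** 2 + 1.0 ** 2

# Actual mixture power for three phase differences.
in_phase = mixture_power(1.0, 1.0, 0.0)          # sources reinforce
quadrature = mixture_power(1.0, 1.0, math.pi / 2)  # cross term vanishes
out_of_phase = mixture_power(1.0, 1.0, math.pi)    # sources cancel
```

Only the quadrature case matches the additive value; in general the mixture power carries a cross term proportional to the cosine of the phase difference.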


Paper: IEEE ICASSP (2006) “Source Detection Using Repetitive Structure”

May 14th, 2006 Irfan Essa Posted in 0205507, Audio Analysis, Funding, Mitch Parry, Papers, Research

Parry, R.M.; Essa, I. (2006) “Source Detection Using Repetitive Structure (IEEEXplore).” Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2006. ICASSP 2006, Publication Date: 14-19 May 2006, Volume: 4, page(s): IV – IV, Location: Toulouse, ISSN: 1520-6149, ISBN: 1-4244-0469-X, INSPEC Accession Number: 9154520, Digital Object Identifier: 10.1109/ICASSP.2006.1661163


Blind source separation algorithms typically require that the number of sources is known in advance. However, it is often the case that the number of sources changes over time and that the total number is not known. Existing source separation techniques require source number estimation methods to determine how many sources are active within the mixture signals. These methods typically operate on the covariance matrix of mixture recordings and require fewer active sources than mixtures. When sources do not overlap in the time-frequency domain, more sources than mixtures may be detected and then separated. However, separating more sources than mixtures when sources overlap in time and frequency poses a particularly difficult problem. This paper addresses the issue of source detection when more sources than sensors overlap in time and frequency. We show that repetitive structure in the form of time-time correlation matrices can reveal when each source is active.
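As a rough illustration of a time-time matrix, the sketch below compares every frame of a signal with every other frame. The paper's exact definition is not given in this post, so this assumes cosine similarity (an uncentered, normalized correlation) between per-frame feature vectors, and the toy frames are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Normalized (uncentered) correlation between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat silent frames as uncorrelated with everything
    return dot / (norm_u * norm_v)


def time_time_matrix(frames):
    """Return the matrix whose (i, j) entry compares frame i with frame j."""
    return [[cosine_similarity(fi, fj) for fj in frames] for fi in frames]


# Hypothetical frames: two spectral patterns alternating over time.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
matrix = time_time_matrix(frames)
```

In this toy case the repeated pattern shows up as high values at entries (0, 2) and (1, 3), which is the kind of repetition structure such a matrix exposes.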


Paper: IEEE CVPR (2004) “Asymmetrically boosted HMM for speech reading”

June 2nd, 2004 Irfan Essa Posted in 0205507, Funding, James Rehg, Papers, Pei Yin

Pei Yin; Essa, I.; Rehg, J.M. (2004) “Asymmetrically boosted HMM for speech reading,” In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004 (CVPR 2004). Publication Date: 27 June-2 July 2004, Volume: 2, On page(s): II-755 – II-761 Vol.2, ISSN: 1063-6919, ISBN: 0-7695-2158-, INSPEC Accession Number: 8161546, Digital Object Identifier: 10.1109/CVPR.2004.1315240


Speech reading, also known as lip reading, is aimed at extracting visual cues of lip and facial movements to aid in recognition of speech. The main hurdle for speech reading is that visual measurements of lip and facial motion lack information-rich features like the Mel frequency cepstral coefficients (MFCC), widely used in acoustic speech recognition. These MFCC are used with hidden Markov models (HMM) in most speech recognition systems at present. Speech reading could greatly benefit from automatic selection and formation of informative features from measurements in the visual domain. These new features can then be used with HMM to capture the dynamics of lip movement and eventual recognition of lip shapes. Towards this end, we use AdaBoost methods for automatic visual feature formation. Specifically, we design an asymmetric variant of the AdaBoost.M2 algorithm to deal with the ill-posed multi-class sample distribution inherent in our problem. Our experiments show that the boosted HMM approach outperforms conventional AdaBoost and HMM classifiers. Our primary contributions are in the design of (a) boosted HMM and (b) asymmetric multi-class boosting.
