PhD Thesis by Zahoor Zafrulla “Automatic recognition of American Sign Language Classifiers”

May 2nd, 2014 Irfan Essa Posted in Affective Computing, Behavioral Imaging, Face and Gesture, PhD, Thad Starner, Zahoor Zafrulla

Title: Automatic recognition of American Sign Language Classifiers

Zahoor Zafrulla
School of Interactive Computing
College of Computing
Georgia Institute of Technology


Dr. Thad Starner (Advisor, School of Interactive Computing, Georgia Tech)
Dr. Irfan Essa (Co-Advisor, School of Interactive Computing, Georgia Tech)
Dr. Jim Rehg (School of Interactive Computing, Georgia Tech)
Dr. Harley Hamilton (School of Interactive Computing, Georgia Tech)
Dr. Vassilis Athitsos (Computer Science and Engineering Department, University of Texas at Arlington)


Automatically recognizing classifier-based grammatical structures of American Sign Language (ASL) is a challenging problem. Classifiers in ASL use surrogate hand shapes to stand for people or “classes” of objects and convey information about their location, movement, and appearance. Past research has focused on recognizing fingerspelling, isolated signs, facial expressions, and interrogative words such as WH-questions (e.g. Who, What, Where, and When). Harder problems, such as recognizing ASL sentences and classifier-based grammatical structures, remain relatively unexplored in the field of ASL recognition.

One application of recognition of classifiers is toward creating educational games to help young deaf children acquire language skills. Previous work developed CopyCat, an educational ASL game that requires children to engage in a progressively more difficult expressive signing task as they advance through the game.

We have shown that, by leveraging context, we can use verification in place of recognition to boost machine performance when determining whether the signed responses in an expressive signing task, as in the CopyCat game, are correct or incorrect. We have demonstrated that a machine verifier’s ability to identify sign boundaries can be improved with a novel two-pass technique that combines signed input in both the forward and reverse directions. Additionally, we have shown that we can reduce CopyCat’s dependency on custom-manufactured hardware by using an off-the-shelf Microsoft Kinect depth camera to achieve similar verification performance. Finally, we show how we can extend our ability to recognize sign language by leveraging depth maps to develop a method that uses improved hand detection and hand shape classification to recognize selected classifier-based grammatical structures of ASL.
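To make the distinction between recognition and verification concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the thesis's implementation: the function names, the example phrases, the log-likelihood scores, and the threshold are all hypothetical. Recognition picks the best-scoring phrase from the whole vocabulary, while verification only asks whether the one phrase the game expects scores well enough.

```python
def recognize(log_likelihoods):
    """Recognition: return the best-scoring phrase out of the whole vocabulary."""
    return max(log_likelihoods, key=log_likelihoods.get)


def verify(log_likelihoods, expected_phrase, threshold):
    """Verification: accept or reject the single phrase that context says to expect."""
    return log_likelihoods[expected_phrase] >= threshold


# Hypothetical scores for one signed response (higher is a better match).
# Both the phrases and the numbers are made up for illustration.
scores = {
    "ALLIGATOR BEHIND WALL": -41.0,
    "ALLIGATOR ON WALL": -40.5,
    "CAT BEHIND WALL": -55.0,
}

# The game knows which phrase it asked for, so it only needs a yes/no answer.
print(recognize(scores))
print(verify(scores, "ALLIGATOR BEHIND WALL", threshold=-45.0))
```

In this made-up case a recognizer would narrowly prefer a near-miss phrase, while a verifier that knows the expected phrase can still accept the response. How the scores are computed and how the threshold is chosen are left open here.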


Paper in IEEE CVPR 2013 “Decoding Children’s Social Behavior”

June 27th, 2013 Irfan Essa Posted in Affective Computing, Behavioral Imaging, Denis Lantsman, Gregory Abowd, James Rehg, PAMI/ICCV/CVPR/ECCV, Papers, Thomas Ploetz

  • J. M. Rehg, G. D. Abowd, A. Rozga, M. Romero, M. A. Clements, S. Sclaroff, I. Essa, O. Y. Ousley, Y. Li, C. Kim, H. Rao, J. C. Kim, L. L. Presti, J. Zhang, D. Lantsman, J. Bidwell, and Z. Ye (2013), “Decoding Children’s Social Behavior,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.
    @InProceedings{2013-Rehg-DCSB,
      author       = {James M. Rehg and Gregory D. Abowd and Agata Rozga
                      and Mario Romero and Mark A. Clements and Stan
                      Sclaroff and Irfan Essa and Opal Y. Ousley and Yin
                      Li and Chanho Kim and Hrishikesh Rao and Jonathan C.
                      Kim and Liliana Lo Presti and Jianming Zhang and
                      Denis Lantsman and Jonathan Bidwell and Zhefan Ye},
      booktitle    = {{Proceedings of IEEE Conference on Computer Vision
                      and Pattern Recognition (CVPR)}},
      doi          = {10.1109/CVPR.2013.438},
      month        = {June},
      organization = {IEEE Computer Society},
      pdf          = {},
      title        = {Decoding Children's Social Behavior},
      url          = {},
      year         = {2013}
    }


We introduce a new problem domain for activity recognition: the analysis of children’s social and communicative behaviors based on video and audio data. We specifically target interactions between children aged 1–2 years and an adult. Such interactions arise naturally in the diagnosis and treatment of developmental disorders such as autism. We introduce a new publicly available dataset containing over 160 sessions of a 3–5 minute child-adult interaction. In each session, the adult examiner followed a semi-structured play interaction protocol designed to elicit a broad range of social behaviors. We identify the key technical challenges in analyzing these behaviors and describe methods for decoding the interactions. We present experimental results that demonstrate the potential of the dataset to drive interesting research questions, and show preliminary results for multi-modal activity recognition.

Full database available from

via IEEE Xplore – Decoding Children’s Social Behavior.


Paper: PUI (1997) “Prosody Analysis for Speaker Affect Determination”

October 12th, 1997 Irfan Essa Posted in Affective Computing, Papers

Andrew Gardner and Irfan Essa (1997), “Prosody Analysis for Speaker Affect Determination,” in Proceedings of Perceptual User Interfaces Workshop (PUI 1997), Banff, Alberta, Canada, Oct 1997.


Speech is a complex waveform containing verbal (e.g. phoneme, syllable, and word) and nonverbal (e.g. speaker identity, emotional state, and tone) information. Both the verbal and nonverbal aspects of speech are extremely important in interpersonal communication and human-machine interaction. However, work in machine perception of speech has focused primarily on the verbal, or content-oriented, goals of speech recognition, speech compression, and speech labeling. Usage of nonverbal information has been limited to speaker identification applications. While the success of research in these areas is well documented, this success is fundamentally limited by the effect of nonverbal information on the speech waveform. The extra-linguistic aspect of speech is considered a source of variability that theoretically can be minimized with an appropriate preprocessing technique; determining such robust techniques is, however, far from trivial.

It is widely believed in the speech processing community that the nonverbal component of speech contains higher-level information that provides cues for auditory scene analysis, speech understanding, and the determination of a speaker’s psychological state or conversational tone. We believe that the identification of such nonverbal cues can improve the performance of classic speech processing tasks and will be necessary for the realization of natural, robust human-computer speech interfaces. In this paper we seek to address the problem of how to systematically analyze the nonverbal aspect of the speech waveform to determine speaker affect, specifically by analyzing the pitch contour.
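For readers unfamiliar with the term, a pitch contour is the fundamental frequency of the voice tracked over time. The sketch below shows one common, generic way to estimate it, frame-by-frame autocorrelation, in plain Python. It is an assumption made for illustration and not the method used in the paper, which the abstract does not specify; the function names, the 80 to 400 Hz search range, and the frame sizes are all hypothetical choices.

```python
import math


def estimate_pitch(frame, sample_rate, f_min=80.0, f_max=400.0):
    """Estimate the pitch (Hz) of one frame from its autocorrelation peak.

    Returns None for silent frames or when no positive peak is found.
    """
    if not any(frame):
        return None  # all-zero frame: nothing to estimate
    lag_min = max(1, int(sample_rate / f_max))
    lag_max = min(int(sample_rate / f_min), len(frame) - 1)
    best_lag, best_corr = None, 0.0
    for lag in range(lag_min, lag_max + 1):
        # Unnormalized autocorrelation at this lag.
        corr = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    return sample_rate / best_lag if best_lag else None


def pitch_contour(signal, sample_rate, frame_len=400, hop=200):
    """Slide a window over the signal and estimate one pitch value per frame."""
    return [
        estimate_pitch(signal[start:start + frame_len], sample_rate)
        for start in range(0, len(signal) - frame_len + 1, hop)
    ]


# Synthetic test tone: 0.1 s of a 200 Hz sine sampled at 8 kHz.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 200 * n / sample_rate) for n in range(800)]
contour = pitch_contour(tone, sample_rate)
print(contour)
```

Real speech needs more care than this, for example voiced/unvoiced detection and smoothing of octave errors. Features such as the range, slope, and variability of the resulting contour are the kind of cues an affect classifier could then use.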


Paper: IEEE PAMI (1997) “Coding, analysis, interpretation, and recognition of facial expressions”

July 14th, 1997 Irfan Essa Posted in Affective Computing, Face and Gesture, PAMI/ICCV/CVPR/ECCV, Papers, Research, Sandy Pentland

Coding, analysis, interpretation, and recognition of facial expressions

Essa, I.A. and Pentland, A.P., in IEEE Transactions on Pattern Analysis and Machine Intelligence, July 1997, Volume 19, Issue 7, pp. 757–763, ISSN: 0162-8828, CODEN: ITPIDJ, INSPEC Accession Number: 5661539
Digital Object Identifier: 10.1109/34.598232


We describe a computer vision system for observing facial motion by using an optimal estimation optical flow method coupled with geometric, physical and motion-based dynamic models describing the facial structure. Our method produces a reliable parametric representation of the face’s independent muscle action groups, as well as an accurate estimate of facial motion. Previous efforts at analysis of facial expression have been based on the facial action coding system (FACS), a representation developed in order to allow human psychologists to code expression from static pictures. To avoid use of this heuristic coding scheme, we have used our computer vision system to probabilistically characterize facial motion and muscle activation in an experimental population, thus deriving a new, more accurate, representation of human facial expressions that we call FACS+. Finally, we show how this method can be used for coding, analysis, interpretation, and recognition of facial expressions.


Scientific American Article (1996): “Smart Rooms” by Alex Pentland

April 9th, 1996 Irfan Essa Posted in Affective Computing, Face and Gesture, In The News, Intelligent Environments, Research

Alex Pentland (1996), “Smart Rooms,” Scientific American, April 1996

Quote from the Article: “Facial expression is almost as important as identity. A teaching program, for example, should know if its students look bored. So once our smart room has found and identified someone’s face, it analyzes the expression. Yet another computer compares the facial motion the camera records with maps depicting the facial motions involved in making various expressions. Each expression, in fact, involves a unique collection of muscle movements. When you smile, you curl the corners of your mouth and lift certain parts of your forehead; when you fake a smile, though, you move only your mouth. In experiments conducted by scientist Irfan A. Essa and me, our system has correctly judged expressions – among a small group of subjects – 98 percent of the time.”
