Simply put, multimodal interfaces allow users to interact with computers using multiple modes, or channels, of communication (e.g. speaking, clicking a button, writing). Different modes are best suited to different kinds of input and output. For example, it is easier for a user to select among millions of names in a directory by saying the name they are interested in than by searching a huge menu, but if the user only has to choose among, say, three or four options, it is easier to click a button in a graphical user interface than to use speech. The most effective multimodal interfaces enable more natural and effective interaction by allowing users to interact using whichever mode or combination of modes is most appropriate given the situation and their preferences and abilities.
A critical motivator for the development of multimodal interfaces is the ongoing migration of computing and information access away from the ergonomics of the desktop to more challenging settings: mobile devices such as PDAs and next-generation phones, in-vehicle systems (e.g. navigation), and in-home entertainment systems (e.g. set-top boxes). These devices generally have no real keyboard or mouse and offer limited screen real estate, making traditional graphical user interfaces cumbersome and difficult to use. Furthermore, since mobile devices are used across different physical and social environments, tasks, and users, they need to allow users to adapt their mode of interaction to the surrounding environment.
Going beyond today's primarily telephony-based spoken language systems, researchers at AT&T Labs are addressing this challenge by developing technologies to support truly multimodal interaction. These systems combine spoken and graphical interaction: users can issue requests using speech, gesture (e.g. pointing, drawing, touch), or dynamic combinations of the two. System responses are tailored to the user's needs and preferences, combining synthetic speech with dynamic animated graphical presentations. Building these systems involves significant advances in multimodal integration and understanding, multimodal dialog management, and multimodal generation. These technologies have been applied to a broad range of application areas, including local search, corporate directory access and messaging, medical informatics, accessing and controlling presentations, and searching and browsing for IPTV content such as movies-on-demand. Several of these applications, and the underlying finite-state multimodal understanding mechanism, are described in more detail below.
1. Multimodal Interfaces for Local Search: MULTIMODAL ACCESS TO CITY HELP
In urban environments, tourists and residents alike need access to a complex and constantly changing body of information about restaurants, cinema and theatre schedules, and transportation topology and timetables. This information is most valuable if it can be delivered effectively while the user is mobile, since places close and plans change. One of our pilot multimodal interface projects, Multimodal Access To City Help, is a working city guide and navigation system that gives users information about places of interest, restaurants, and the subway and metro for New York City and Washington, D.C. Users can search for restaurants by type of food, price, and location, and access information such as reviews, phone numbers, and addresses. They can also ask for subway directions from one location to another. The project has produced two prototypes: the first is mobile and runs on a handheld PC; the second is a multimodal customer-service kiosk with a life-like talking head. Both prototypes integrate AT&T's WATSON speech recognition technology, Natural Voices text-to-speech, handwriting recognition, and gesture recognition with a browser-based graphical user interface incorporating a dynamic map display.
Figures: 1. Mobile Multimodal Access To City Help; 2. City Information Kiosk; 3. Kiosk user interface with life-like talking head
These prototypes use advanced techniques for multimodal integration based on AT&T's patented finite-state technology. These techniques let users interact freely using speech alone, pen alone, or dynamic synchronized combinations of speech and pen. For example, a user can request restaurants with the spoken command 'show cheap italian restaurants in chelsea'; the system zooms to the appropriate map location and shows the restaurants on the map. Alternatively, the same command can be given multimodally by circling an area on the map and saying 'show cheap italian restaurants in this neighborhood'. If the immediate environment is too noisy or too public, the command can be given entirely in pen, by circling an area and writing 'cheap' and 'italian'. The prototype also uses novel multimodal generation and synchronization techniques to produce dynamic presentations of subway directions and restaurant information that combine graphics with synthetic speech.
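To make the fusion of speech and pen concrete, here is a minimal sketch in Python of how a deictic spoken command and a gesture might be combined into a single query. It is illustrative only: the Gesture and Query classes, the resolve_command function, and the coordinates are invented for this example and are not the system's actual finite-state mechanism (described in section 4 below).

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Gesture:
    kind: str                                    # e.g. "area" for a circled region
    region: Tuple[float, float, float, float]    # lat/lon bounding box

@dataclass
class Query:
    category: str
    price: str
    location: Tuple[float, float, float, float]

# Illustrative coordinates standing in for a gazetteer entry for "chelsea"
CHELSEA_BBOX = (40.74, -74.01, 40.75, -73.99)

def resolve_command(speech: str, gesture: Optional[Gesture]) -> Query:
    """Bind a deictic phrase like 'this neighborhood' to the pen gesture."""
    tokens = speech.lower().split()
    price = "cheap" if "cheap" in tokens else "any"
    category = "italian" if "italian" in tokens else "any"
    if "this" in tokens and gesture is not None:
        location = gesture.region    # deictic reference resolved by the gesture
    else:
        location = CHELSEA_BBOX      # named location looked up in a gazetteer
    return Query(category, price, location)

# Speech only: "show cheap italian restaurants in chelsea"
print(resolve_command("show cheap italian restaurants in chelsea", None))
# Speech + pen: circle an area, then say "... in this neighborhood"
circle = Gesture("area", (40.72, -74.00, 40.73, -73.98))
print(resolve_command("show cheap italian restaurants in this neighborhood", circle))
```

The key point is that the word 'this' cannot be resolved from speech alone; the gesture supplies the missing location, which is the kind of cross-modal binding the finite-state integration mechanism performs over recognition lattices.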
View a Video Demonstration:
Mobile Multimodal Access To City Help Demonstration Video (MPEG, 49 MB)
Read a paper about the system:
Johnston, M., Bangalore, S., Vasireddy, G., Stent, A., Ehlen, P., Walker, M., Whittaker, S., and Maloor, P. 2002. MATCH: An Architecture for Multimodal Dialogue Systems. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL 2002). pp. 376-383.
2. Controlling your IPTV: MULTIMODAL ACCESS TO CONTENT IN THE HOME
As traditional entertainment channels and the internet converge through technologies such as broadband access, movies-on-demand, and streaming video, an increasingly large range of content is available to consumers in the home. To benefit from this new wealth of content, however, users need to be able to find what they are actually interested in rapidly and easily, and to do so while relaxing on the couch in the living room, where they typically do not have the keyboard, mouse, and close-up screen of desktop web browsing.
Current interfaces to cable and satellite television services typically rely on direct manipulation of a graphical user interface with a remote control. To find content, users generally must either navigate a complex, pre-defined, and often deeply nested menu structure, or enter titles and other key phrases using an onscreen keyboard or triple-tap input on the remote control keypad. These interfaces are cumbersome and do not scale as the range of available content grows. This project explores applying multimodal interface technologies to build more effective systems for searching and browsing entertainment content in the home.
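As a rough illustration of why open-ended search scales better than menu navigation as catalogs grow, the sketch below ranks a small program catalog against a spoken query by simple keyword overlap. The catalog entries and scoring are invented stand-ins, not the project's actual retrieval model.

```python
# Illustrative sketch: matching a spoken query against a program catalog
# by keyword overlap. Catalog contents and scoring are invented examples.
CATALOG = [
    {"title": "Roman Holiday", "genre": "romantic comedy", "cast": "Audrey Hepburn Gregory Peck"},
    {"title": "The Third Man", "genre": "film noir", "cast": "Orson Welles Joseph Cotten"},
    {"title": "His Girl Friday", "genre": "comedy", "cast": "Cary Grant Rosalind Russell"},
]

def search(spoken_query: str):
    """Rank programs by how many query words appear in any catalog field."""
    words = set(spoken_query.lower().split())
    def score(program):
        text = " ".join(program.values()).lower()
        return sum(1 for w in words if w in text)
    return sorted(CATALOG, key=score, reverse=True)

# One utterance replaces many remote-control keypresses:
for program in search("comedy with cary grant"):
    print(program["title"])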
Figures: 1. Multimodal IPTV user interface; 2. Multimodal IPTV with remote control; 3. Handwritten input modality on tablet PC
Read a paper about the system:
Johnston, M., D'Haro, L.-F., Levine, M., and Renger, B. 2007. A Multimodal Interface for Access to Content in the Home. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics (ACL 2007). pp. 376-383.
3. Accessing and Controlling Presentation Content: MULTIMODAL PRESENTATION DASHBOARD
Anthropologists have long told us that the way we work, whether reading, writing, or giving a presentation, is tightly bound to the tools we use. Web browsers and word processors changed reading and writing from linear into nonlinear activities, yet giving a presentation to a roomful of people has evolved little since the days of Mylar sheets and notecards, thanks to presentation software that reinforces, and even further entrenches, a linear bias in our notion of what “giving a presentation” means. While today’s presentations may be prettier and flashier, the spontaneity once afforded by holding a stack of easily re-arrangeable sheets has been lost.
The multimodal presentation dashboard allows users to control and browse presentation content such as slides and diagrams through a multimodal interface that supports speech and pen input. In addition to control commands (e.g. “take me to slide 10”), the system supports multimodal search over content collections. For example, if the user says “get me a slide about internet telephony,” the system presents a ranked series of candidate slides from which they can select using voice, pen, or a wireless remote. As presentations are loaded, their content is analyzed, and language and understanding models are built dynamically. This approach frees the user from the constraints of linear order, allowing a more dynamic and responsive presentation style.
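The "index slides as they are loaded, then rank them against a spoken query" idea can be sketched with a simple tf-idf scorer, shown below. This is a hedged illustration: the slide text, the rank function, and the scoring are invented here, and the dashboard's actual dynamically built language and understanding models are more sophisticated.

```python
import math
from collections import Counter

# slide number -> extracted slide text (invented content for illustration)
slides = {
    1: "agenda introductions goals",
    10: "internet telephony voip architecture",
    11: "voip signaling protocols sip",
}

# Document frequency of each term, built as slides are loaded
doc_freq = Counter()
for text in slides.values():
    doc_freq.update(set(text.split()))

def rank(query: str):
    """Score each slide by summed tf-idf of the query terms it contains."""
    terms = query.lower().split()
    n = len(slides)
    scores = {}
    for num, text in slides.items():
        tf = Counter(text.split())
        scores[num] = sum(
            tf[t] * math.log(n / doc_freq[t]) for t in terms if doc_freq[t]
        )
    return sorted(scores, key=scores.get, reverse=True)

# "get me a slide about internet telephony" -> ranked candidate slides
print(rank("internet telephony"))   # slide 10 ranks first
```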
Figures: 1. Multimodal presentation dashboard on Tablet PC; 2. Audience view and presenter view; 3. Presenter user interface
Read a paper about the system:
Johnston, M., Ehlen, P., Gibbon, D., and Liu, Z. 2007. The Multimodal Presentation Dashboard. In Proceedings of the NAACL-HLT 2007 Workshop on Bridging the Gap: Academic and Industrial Research in Dialog Technologies. pp. 17-24.
4. Finite-state Multimodal Integration and Understanding
All three of the prototype applications described above use a common multimodal grammar mechanism for integrating and understanding input distributed over multiple modes, combining finite-state language processing with machine learning techniques. Unlike previous approaches, which separate integration from understanding, this approach captures speech parsing and understanding, gesture interpretation, and multimodal parsing, integration, and understanding within a single model. The multimodal grammar can be compiled into an efficient finite-state mechanism enabling tightly integrated processing of lattice inputs from speech and gesture recognition. This mechanism can be thought of as a three-tape finite-state device that consumes input symbols from lattices representing the speech and gesture inputs and writes out a lattice representing their combined meaning. The approach enables mutual disambiguation: errors in the individual modality recognition components can be corrected through the multimodal integration process. Additional robustness is achieved by applying finite-state edit machines and machine-translation techniques to the multimodal language understanding process.
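The following toy simulation illustrates the three-tape idea. Where the real system composes weighted finite-state machines over recognition lattices, this sketch walks a hand-written transition table over single speech and gesture strings; the states, symbols, and transitions are invented for illustration and are not the actual multimodal grammar.

```python
# Toy simulation (not AT&T's FSM library) of a three-tape finite-state
# device: each transition reads a speech symbol and a gesture symbol
# ("eps" = consume nothing on that tape) and writes a meaning symbol.
EPS = "eps"

# (state, speech_in, gesture_in) -> (next_state, meaning_out)
TRANSITIONS = {
    (0, "show",         EPS):    (1, "SHOW"),
    (1, "restaurants",  EPS):    (2, "restaurant"),
    (2, "in",           EPS):    (3, EPS),
    (3, "chelsea",      EPS):    (4, "loc(chelsea)"),
    (3, "this",         EPS):    (5, EPS),
    (5, "neighborhood", "AREA"): (4, "loc(gesture)"),  # deictic word binds to gesture
}

def interpret(speech, gestures):
    """Walk the machine over the two input tapes, emitting meaning symbols."""
    state, s, g, meaning = 0, list(speech), list(gestures), []
    while s:
        word = s.pop(0)
        if (state, word, EPS) in TRANSITIONS:
            state, out = TRANSITIONS[(state, word, EPS)]
        elif g and (state, word, g[0]) in TRANSITIONS:
            state, out = TRANSITIONS[(state, word, g.pop(0))]
        else:
            raise ValueError(f"no transition for {word!r} in state {state}")
        if out != EPS:
            meaning.append(out)
    return meaning

# Speech-only and multimodal readings of the same request:
print(interpret(["show", "restaurants", "in", "chelsea"], []))
print(interpret(["show", "restaurants", "in", "this", "neighborhood"], ["AREA"]))
```

Note how the deictic reading only succeeds when a gesture symbol is available on the second tape: the string-level analogue of how the device aligns the speech and gesture tapes while writing the meaning tape.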
Figures: 1. Multimodal integration architecture; 2. Three-tape machine for multimodal integration
Read papers about finite-state multimodal integration and robust multimodal understanding:
Johnston, M. and Bangalore, S. 2005. Finite-state Multimodal Integration and Understanding. Natural Language Engineering 11(2), pp. 159-187. Cambridge University Press.
Johnston, M. and Bangalore, S. 2006. Learning Edit Machines for Robust Multimodal Understanding. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2006).
Bangalore, S. and Johnston, M. 2004. Balancing Data-driven and Rule-based Approaches in the Context of a Multimodal Conversational System. In Proceedings of the Human Language Technology Conference / North American Chapter of the Association for Computational Linguistics Annual Meeting (HLT-NAACL 2004).
For further information on multimodal interface technologies and their applications, contact:
johnston NO SPAM AT research DOT att DOT com