Human interaction research has always been inventive in its use of the latest technology. Even 50 years ago, Bales (1951) adopted one-way mirrors to observe his groups and then had to design motorized paper scrolls so that his observers could keep up during live scoring. Since then, signal recording technologies have advanced significantly; video cameras are portable, microphones can be arranged to pick up individual subjects even without the use of wires, and multiple signals can be synchronized using a mixing desk. Not only that, but now that every garage band makes music videos, these technologies are so cheap that researchers can focus less on cost and more on what would make for their ideal data capture. With these advances in signal recording come new ideas about what sort of data to collect and how to use them.

One research area that can benefit greatly from better signal recording is the study of how people use language. When people communicate, gestures, postural shifts, facial expressions, backchannel continuers such as "mm-hmm," and spoken turns from the subjects all work in concert to bring about mutual understanding (Goodwin, 1981). Apart from the scientific good of understanding how this process works, information about it is in demand for applications ranging from the documentation of endangered languages to animation for computer games. Observational analysis packages can help us determine some things about the timing, frequency, and sequencing of communicative behaviors, but that is not enough. In language data, behaviors are related less by their timing than by their structure: Pronouns have discourse referents, answers relate to questions, and deictic instances of the word "that" are resolved by pointing gestures that themselves relate to real-world objects, but with no guarantees about when the related behavior will occur.
Linguistic analysis reveals this structure, but current tools only support specific codes and structures and only allow them to be imposed over the top of a textual transcription. This approach discards temporal information and makes it difficult to describe behaviors from different subjects that happen at the same time.

We gratefully acknowledge support of the NITE project by the European Commission's Human Language Technologies Programme. The samples described in the paper use data kindly provided either to us personally or to the community at large by the Smartkom project (http://smartkom.dfki.de/), by ISIP's Switchboard project (http://www.isip.msstate.edu/projects/switchboard/), and by the University of Edinburgh's Human Communication Research Centre (http://www.hcrc.ed.ac.uk/). The software described in this paper is available for download from http://www.ltg.ed.ac.uk/NITE. Correspondence concerning this article should be addressed to J. Carletta, Human Communication Research Centre and Language Technology Group, University of Edinburgh, 2 Buccleuch Place, Edinburgh EH8 9LW, Scotland (e-mail: j.carletta@edinburgh.ac.uk).

Multimodal corpora that show humans interacting ...