Link to this article: http://journals.cambridge.org/abstract_S0305000904006786
How to cite this article: LOUANN GERKEN, RACHEL WILSON and WILLIAM LEWIS (2005). Infants can use distributional cues to form syntactic categories.
As the arm of NLP technologies extends beyond a small core of languages, techniques for working with language data across hundreds to thousands of languages may require revisiting and recalibrating the tried-and-true methods in use. One NLP technique that has been treated as "solved" is language identification (language ID) of written text. However, we argue that language ID is far from solved when one considers input spanning not dozens of languages but hundreds to thousands, a number that one approaches when harvesting language data found on the Web. We formulate language ID as a coreference resolution problem, apply it to a Web harvesting task for a specific linguistic data type, and achieve much higher accuracy than long-accepted language ID approaches.
The GOLD Community of Practice is proposed as a model for managing on-line linguistic data. The key components of the model are the linguistic data resources themselves and the resources focused on the knowledge derived from those data. Data resources include the ever-increasing amount of linguistic field data and other descriptive language resources being migrated to the Web. The knowledge resources capture generalizations about the data and are anchored in the General Ontology for Linguistic Description, or 'GOLD'. It is argued that such a model is in the spirit of the vision for a Semantic Web and, thus, provides a concrete methodology for rendering highly divergent resources interoperable. Furthermore, a methodology is given for creating specific communities of practice within the overall scientific domain of linguistics. A number of services around the model are proposed, including knowledge acquisition and search facilities. Finally, as an example of the model's utility, an instantiation of a community of practice centered around interlinear glossed text is described.