In a 12-month project we have developed a new, register-diverse, 55-million-word bilingual corpus, the New Corpus for Ireland (NCI), to support the creation of a new English-to-Irish dictionary. The paper describes the strategies we employed and the solutions to the problems we encountered. We believe we have a good model for corpus creation for lexicography, and others may find it useful as a blueprint. The corpus has two parts, one Irish, the other Hiberno-English (English as spoken in Ireland). We describe its design, collection, and encoding.
The NCI was developed as part of the set-up phase of a project for a new English-to-Irish Dictionary (NEID). 1 The NEID is intended to be used by scholars, school and university students, translators, people working in the media, and the general public. It will replace the current main reference work, Tomás de Bhaldraithe's English-Irish Dictionary (1959), a highly regarded dictionary that is now almost 50 years old.
The island of Ireland includes both the Republic of Ireland and, in the North, six counties of the province of Ulster, which form part of the United Kingdom. The border was not critical to the project; collaborators and texts alike were sought both North and South of the border, and the language and dialects of Ulster were treated on a par with those of other regions. In this paper, "Ireland" means the whole island.
About 62,000 speakers use Irish as their main everyday language, and almost 340,000 speakers use Irish on a daily basis. 2 Irish was the main language of Ireland until English displaced it (substantially as a result of language policies under the British Empire). It remains the chief language in a few parts of the island, collectively known as the Gaeltacht, which are mainly located along the western seaboard. There are three main dialects of Irish (Connacht, Munster, and Ulster), corresponding respectively to the most westerly, southerly, and northerly areas.
The language has an important place in Irish culture and identity and is very widely taught in schools. 3 Irish is one of the two official languages of Ireland, the other being English. The Irish language belongs to the Celtic branch of the Indo-European family of languages. Within this branch, it forms part of the Goidelic group along with Manx and Scots Gaelic; the other group is Brythonic, which comprises Welsh, Cornish, and Breton.
The remainder of the paper describes the design, collection, and encoding of the NCI in Sects. 2, 3, and 4. A particular area of innovation was the use of the web as a source of some of the constituent texts; the issues arising there are covered in some detail, as are the practical issues of data 'cleaning'. The morphological analyzer and part-of-speech tagger for Irish are described in Sect. 5. Section 6 describes the project team and resources, to help others planning comparable projects assess the resources they will require. Section 7 outlines possible further developments, and Sect. 8 concludes.
Design
In the first instance, a detailed cor...