Special Interest Tracks and Posters of the 14th International Conference on World Wide Web - WWW '05 2005
DOI: 10.1145/1062745.1062833

The language observatory project (LOP)

Abstract: The first part of the paper briefly describes the Language Observatory Project (LOP) and highlights the major technical difficulties it faces. The second part describes how we responded to these difficulties by adopting UbiCrawler as the data collection engine for the project. Interactive collaboration between the two groups is producing quite satisfactory results.

Cited by 13 publications (12 citation statements)
References 2 publications
“…This experiment was based on two technical processes: (1) crawling cyberspace initiated by seed URLs and thus fetching web pages (Boldi et al 2004) and (2) running a language identification engine to determine three aspects of the web pages, its LSE triplet (namely Language, Script and Encoding system). The huge data are stored in clusters of 40 servers at Nagaoka University of Technology under the management of the Language Observatory (LO) project (Mikami et al 2005). The system based on N-gram technology (Suzuki et al 2002) has been extensively used in the past four years in a project for LSE identification of huge domains.…”
Section: Snapshots Of Evolutionized Malay (EM) Occurrences In Web For
mentioning
confidence: 99%
“…This is followed by information from Gani's short article on the analysis of the SM Internet evolution, and our snapshots on EM occurrences for Malaysian cyberspace based on Language Observatory (LO) project data (Mikami et al 2005). Our main objective is to answer two research questions; 'How are Evolutionized Malay (EM) words able to represent the pronunciation of spoken Malay in a better way?'…”
mentioning
confidence: 99%
“…In response to this, the Language Observatory (LO) project was launched in 2003 under the sponsorship of the Japan Science and Technology Agency (JST) and has been implemented in collaboration with several international partners who have common interests with us [6]. After a few years of development work, the LO team has trained our own language identification engine to cover more than three hundred languages of the world, and has acquired the capability to collect terabyte size Web documents from the Internet.…”
Section: Introduction
mentioning
confidence: 99%
“…3. This classifier is introduced for use in the Language Observatory Project (or LOP) [2], a census of Web-page population by language and character set. In this project, a crawler robot collects Web pages, and the category of each Web page is checked and characterized by the combination of attributes, i.e., the language, script, and character set.…”
Section: Introduction
mentioning
confidence: 99%