Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions - ACL '07 2007
DOI: 10.3115/1557769.1557783

An efficient algorithm for building a distributional thesaurus (and other Sketch Engine developments)

Abstract: Gorman and Curran (2006) argue that thesaurus generation for billion+-word corpora is problematic as the full computation takes many days. We present an algorithm with which the computation takes under two hours. We have created, and made publicly available, thesauruses based on large corpora for (at time of writing) seven major world languages. The development is implemented in the Sketch Engine (Kilgarriff et al., 2004). Another innovative development in the same tool is the presentation of the grammatical be…

Cited by 24 publications (19 citation statements)
References 12 publications (8 reference statements)

“…Finally, we also can find some work measuring both the complexity and computational efficiency of the algorithm implemented to make pairwise comparisons [9,20]. As the accuracy of any extraction system does not depend on the chosen algorithm, we will not compare systems with regard to this specific parameter.…”
Section: Related Work (mentioning)
confidence: 99%
“…So, there is no reason to check them. Following [9,20], we implemented an algorithm that only compares word pairs sharing at least one context. As the list of words sharing a context is small (in general, less than 1000), the quadratic complexity of the entire algorithm turns out to be manageable.…”
Section: Algorithm (mentioning)
confidence: 99%
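
The idea in the statement above, comparing only word pairs that share at least one context, can be illustrated with a minimal sketch. This is not the implementation from either the citing or the cited paper: the function name `shared_context_counts`, the toy data, the `max_words_per_context` cutoff, and the use of a plain shared-context count as the similarity score are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def shared_context_counts(word_contexts, max_words_per_context=1000):
    """Count shared contexts for every word pair that shares at least one.

    word_contexts maps each word to the set of contexts it occurs in.
    Pairs with no context in common never appear in the result, so they
    are never compared at all.
    """
    # Invert the word -> contexts mapping into context -> words.
    context_words = defaultdict(set)
    for word, contexts in word_contexts.items():
        for context in contexts:
            context_words[context].add(word)

    pair_counts = defaultdict(int)
    for words in context_words.values():
        # Hypothetical cutoff (an assumption, not from the source): skip very
        # frequent contexts so the quadratic step per context stays small.
        if len(words) > max_words_per_context:
            continue
        # Quadratic only in the number of words sharing this one context.
        for a, b in combinations(sorted(words), 2):
            pair_counts[(a, b)] += 1
    return dict(pair_counts)


# Toy data, invented for illustration only.
toy = {
    "beer": {"drink_obj", "cold_mod", "bottle_of"},
    "wine": {"drink_obj", "bottle_of", "red_mod"},
    "car": {"drive_obj", "red_mod"},
}
counts = shared_context_counts(toy)
```

In this toy run, "beer" and "car" share no context, so that pair is never generated. A real system would replace the raw count with a weighted similarity measure, which the statement above does not specify.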
“…The Google Book syntactic n-grams dataset provides dependency fragment counts by the years. However, instead of using the plain syntactic n-grams, we use a far richer representation of the data in the form of a distributional thesaurus (Lin, 1997; Rychlý and Kilgarriff, 2007). In specific, we prepare a distributional thesaurus (DT) for each of the time periods separately and subsequently construct the required networks.…”
Section: Datasets and Graph Construction (mentioning)
confidence: 99%
“…In this case, the search terms which were used were potential candidates for discussing mock politeness, which had been identified by using terms discussed in the relevant literature and potential synonyms (as retrieved through the Sketch Engine distributional thesaurus (Rychly and Kilgarriff, 2007)). Using this method of compilation a corpus of approximately 61 million tokens was created.…”
Section: Compilation and Annotation of the Corpora (mentioning)
confidence: 99%