Improving Portuguese Term Extraction

Lopes, Lucelene; Vieira, Renata

doi:10.1007/978-3-642-28885-2_9

Cited by 4 publications

(13 citation statements)

References 7 publications


“…For this purpose, the corpus undergoes a pre-processing step, which usually involves the identification of tokens, removal of stopwords, and the representation of the texts in tables. In these tables, each row represents a document (d_i) and each column represents an n-gram of the document (n_j), where cell d_i n_j may be filled with some measure, for instance, the absolute frequency of n-gram n_j in document d_i.…”

Section: The Statistical Approach (mentioning)

confidence: 99%
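
The table described in the statement above can be sketched as follows. This is a minimal illustration, not the cited authors' implementation: the stopword list, the whitespace tokenizer, and the names `tokenize`, `ngrams`, and `build_table` are all assumptions made for the example. Note that stopwords are removed before n-grams are formed, so an n-gram may join words that were not adjacent in the original text.

```python
from collections import Counter

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"a", "o", "de", "da", "do", "e"}


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    tokens = (tok.strip(".,;:!?()\"'") for tok in text.lower().split())
    return [tok for tok in tokens if tok]


def ngrams(tokens, n):
    """Return the contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_table(documents, n=1):
    """Return (columns, rows), where rows[i][j] is the absolute
    frequency of n-gram columns[j] in document i."""
    counts = []
    for doc in documents:
        tokens = [t for t in tokenize(doc) if t not in STOPWORDS]
        counts.append(Counter(ngrams(tokens, n)))
    columns = sorted(set().union(*counts))
    rows = [[c[col] for col in columns] for c in counts]
    return columns, rows


docs = ["A extração de termos", "Extração de termos e termos do domínio"]
columns, rows = build_table(docs, n=1)
for doc_row in rows:
    print(dict(zip(columns, doc_row)))
```

Each printed dictionary corresponds to one row d_i of the table, keyed by the n-gram columns n_j.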

“…Among the definitions available in the literature, we highlight the definition of Witten et al. [35] since it avoids that the tf-idf value drops to 0 if a candidate occurs in all documents of a corpus, as observed in Equation 10.…”

Section: (A) Log Likelihood Ratio (LL) (mentioning)

confidence: 99%

“…idf part (10), where tf_{d_x,t_j} is the frequency of t_j (the jth candidate) in d_x (the xth document) and df_{t_j} is the document frequency of the jth candidate. (g) Term contribution (tc)…”

Section: (A) Log Likelihood Ratio (LL) (mentioning)

confidence: 99%
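
The two statements above concern an idf term that does not drop to 0 when a candidate occurs in every document. Equation 10 itself is not reproduced in the excerpts, so the sketch below only illustrates the effect: the smoothed form `log(1 + N / df)` is an assumption chosen for the example and is not necessarily the formulation of Witten et al. The function names are likewise hypothetical.

```python
import math


def idf_plain(n_docs, df):
    """Textbook idf, log(N / df): equals 0 when the candidate is in every document."""
    return math.log(n_docs / df)


def idf_smoothed(n_docs, df):
    """Assumed smoothed variant, log(1 + N / df): stays positive even when df == N."""
    return math.log(1 + n_docs / df)


def tf_idf(tf, n_docs, df, idf=idf_smoothed):
    """Weight of a candidate with frequency tf in one document."""
    return tf * idf(n_docs, df)


# A candidate that appears 3 times in a document and occurs in all 4 documents.
print(tf_idf(3, n_docs=4, df=4, idf=idf_plain))     # plain idf removes the candidate's weight
print(tf_idf(3, n_docs=4, df=4, idf=idf_smoothed))  # smoothed idf keeps a positive weight
```

With the plain form, a candidate present in all documents receives weight 0 regardless of how frequent it is; with the smoothed form its weight still grows with tf.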

“…Another proposal analysed the use of linguistic knowledge only (morphological, in this case) [27]. Some contributions compared the term extraction according to the statistical and linguistic approaches [67,68]. These contributions were classified according to their goals. The first group of contributions (subsection 'Research developed for Brazilian Portuguese term extraction') corresponds to investigations that primarily compared, adapted, or developed investigations for term extraction.…”

Section: State of the Art of Term Extraction in Brazilian Portuguese (mentioning)

confidence: 99%

“…There are many ATE investigations available in the literature [5][6][7][8][9][10][11][12][13][14][15][16][17]. However, they perform ATE using different scenarios (e.g., variation of the test corpora and measures and evaluation conditions), which makes it difficult to choose the best ATE system.…”

Section: Introduction (mentioning)

confidence: 99%


A survey of automatic term extraction for Brazilian Portuguese

Conrado, Felippo, Pardo et al., 2014

J Braz Comput Soc


Background: Term extraction is highly relevant as it is the basis for several tasks, such as the building of dictionaries, taxonomies, and ontologies, as well as the translation and organization of text data. Methods and Results: In this paper, we present a survey of the state of the art in automatic term extraction (ATE) for the Brazilian Portuguese language. The main contributions and projects related to this task are classified according to the knowledge they use: statistical, linguistic, or hybrid (statistical and linguistic). We also present a review of the corpora used in term extraction for Brazilian Portuguese, as well as a geographic mapping of Brazil regarding the origins of these contributions, projects, and corpora. Conclusions: In spite of the importance of ATE, there are still several gaps to be filled, for instance, the lack of consensus regarding the formal definition of the meaning of 'term'. Such gaps are larger for Brazilian Portuguese than for other languages, such as English, Spanish, and French. Examples of gaps for Brazilian Portuguese include the lack of a baseline ATE system and of the use of more sophisticated linguistic information, such as the WordNet and Wikipedia knowledge bases. Nevertheless, there is an increase in the number of contributions related to ATE and an interesting tendency to use contrasting corpora and domain stoplists, even though most contributions only use frequency, noun phrases, and morphosyntactic patterns.



Automatic Extraction of Domain Specific Non-taxonomic Relations from Portuguese Corpora

Ferreira, Lopes, Vieira et al., 2013

2013 IEEE/WIC/ACM International Joint Conferences on Web Intelligence (WI) and Intelligent Agent Technologies (IAT)


Evaluation of cutoff policies for term extraction

Lopes, Vieira, 2015

J Braz Comput Soc

Self Cite


Background: This paper presents a policy to choose cutoff points to identify potentially relevant terms in a given domain. Term extraction methods usually generate term lists ordered according to a relevance criterion, and the literature offers many different relevance indices. However, very few studies turn their attention to how many terms should be kept, i.e., to a cutoff policy. Methods: Our proposed policy provides an estimation of the portion of this list which preserves a good balance between recall and precision, adopting a refined term extraction and the tf-dcf relevance index. Results: A practical study was conducted based on terms extracted from a Brazilian Portuguese corpus, and the results were quantitatively analyzed according to a previously defined reference list. Conclusions: Even though different extraction procedures and different relevance indices could bring a different outcome, our policy seems to deliver a good balance for the method adopted in our experiments and is likely to generalize to other methods.

Background

Automatic identification of relevant terms for a given domain is an extremely important task for a myriad of natural language processing applications. For instance, any ontology learning effort is doomed to fail if the concept identification step performs poorly. In fact, all other steps to automatically build an ontology rely on the concept identification [1-4]. Also, text categorization applications can be much more effective if good relevant term identification is available. An important part of the process of identifying relevant terms is the extraction of terms and the computation of their frequencies of use as a term relevance index. For the term extraction itself, many software tools are available [5, 6]. These tools usually offer high-quality extraction, regardless of whether they are implemented on the basis of linguistic or statistical approaches. As for relevance indices, many theoretical formulations are available [7-11] and it is safe to assume that a reliable relevance-based rank of extracted terms is not difficult to obtain. Unfortunately, even assuming a nearly perfect ranked list of terms, it is still difficult to decide how many relevant
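
The trade-off that a cutoff policy has to balance can be illustrated with a small sketch. It does not reproduce the policy proposed in the paper, which the abstract does not specify: it simply scores every cutoff k of a ranked term list against a reference list and picks the k with the highest F-measure. The function names and the toy data are assumptions made for the example.

```python
def precision_recall_at(ranked_terms, reference, k):
    """Precision and recall of the top-k ranked terms against a reference set."""
    kept = ranked_terms[:k]
    hits = sum(1 for term in kept if term in reference)
    precision = hits / len(kept) if kept else 0.0
    recall = hits / len(reference) if reference else 0.0
    return precision, recall


def f_measure(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def best_cutoff(ranked_terms, reference):
    """Cutoff k with the highest F-measure; ties go to the smallest k."""
    if not ranked_terms:
        return 0
    return max(
        range(1, len(ranked_terms) + 1),
        key=lambda k: f_measure(*precision_recall_at(ranked_terms, reference, k)),
    )


# Toy data: a ranked candidate list and a hypothetical reference list.
ranked = ["a", "b", "x", "c", "y", "z"]
reference = {"a", "b", "c", "d"}
k = best_cutoff(ranked, reference)
print(k, precision_recall_at(ranked, reference, k))
```

Keeping more terms can only raise recall, while precision tends to fall as lower-ranked candidates are admitted, which is why a single cutoff has to be chosen as a compromise.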

