Avaliação de métodos não-supervisionados de seleção de atributos para mineração de textos (Evaluation of unsupervised feature selection methods for text mining)

Nogueira, Bruno

doi:10.11606/d.55.2009.tde-06052009-154832

Cited by 9 publications

(11 citation statements)

References 34 publications

(47 reference statements)


“…Next, terms are extracted, and therefore, they are used to describe the text base, as detailed in Conrado [52]. To reduce the amount of terms to be worked with, a term selection is performed by using, e.g., the Luhn, Salton, and term variance, methods, which are detailed in the work of Nogueira [38].…”

Section: The Toptax Methodology (mentioning)

confidence: 99%
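
To make the term selection step in the statement above concrete, here is a minimal sketch of one of the named methods, term variance. It assumes the common definition (for each term, the sum over documents of the squared deviation of its frequency from its mean frequency) and whitespace tokenization. The function names `term_variance` and `select_top` are hypothetical and are not taken from the cited works.

```python
from collections import Counter


def term_variance(docs):
    """Return {term: sum over documents of (tf - mean tf)^2}.

    Assumed definition of term variance; tokenization is a plain
    lowercase whitespace split, chosen only for illustration.
    """
    counts = [Counter(doc.lower().split()) for doc in docs]
    vocab = set().union(*counts)
    n = len(counts)
    scores = {}
    for term in vocab:
        tfs = [c[term] for c in counts]  # Counter returns 0 for absent terms
        mean = sum(tfs) / n
        scores[term] = sum((tf - mean) ** 2 for tf in tfs)
    return scores


def select_top(scores, k):
    """Keep the k highest-scoring terms (ties broken alphabetically)."""
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


docs = ["text mining text", "mining of data", "text of text text"]
scores = term_variance(docs)
top = select_top(scores, 1)
```

Terms whose frequency varies strongly across documents score high, while terms that appear evenly (or rarely) everywhere score low, which is why the measure can be used without class labels.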

“…The zstf [38] measure, formally described in Equation 14, assumes that some parts of the document (such as the abstract and the conclusion) bring higher relevant information about the contents of the document than other parts. Based on this consideration, it attributes higher weights to the words that occur in parts of the document with higher impact or in which higher information related to the content of the document is concentrated.…”

Section: (A) Log Likelihood Ratio (LL) (mentioning)

confidence: 99%
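
The statement above refers to Equation 14 of the citing paper, which is not reproduced on this page. The sketch below therefore does not implement zstf itself. It only illustrates the underlying idea of giving a word more weight when it occurs in a higher-impact part of the document. The zone names, the weights in `ZONE_WEIGHTS`, and the function `zone_weighted_tf` are all hypothetical.

```python
from collections import Counter

# Hypothetical weights: not taken from the cited work.
ZONE_WEIGHTS = {"abstract": 3.0, "conclusion": 2.0, "body": 1.0}


def zone_weighted_tf(zones, weights=ZONE_WEIGHTS, default=1.0):
    """Sum, for each term, the weight of every zone occurrence.

    `zones` maps a zone name (e.g. "abstract") to its text.
    Zones without a configured weight fall back to `default`.
    """
    scores = Counter()
    for zone, text in zones.items():
        w = weights.get(zone, default)
        for term in text.lower().split():
            scores[term] += w
    return scores


doc = {
    "abstract": "term selection",
    "body": "term term weighting",
    "conclusion": "selection",
}
scores = zone_weighted_tf(doc)
```

Here "selection" occurs only twice but scores as high as "term", which occurs three times, because both of its occurrences fall in zones assumed to carry more information.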

“…From 2009 on, it was established that the responsibility on it would be given only to the Oslo operational pole. The people in charge of this pole are four researchers (Diana Santos, Cristina Mota, Rosário Silva, and Fernando Ribeiro) of Instituto Superior Técnico, Universidade Técnica de Lisboa (IST-UTL) 38 and Universidade de Coimbra (UC) 39 …”

Section: The Linguateca Repository (mentioning)

confidence: 99%


A survey of automatic term extraction for Brazilian Portuguese

Conrado, Felippo, Pardo, et al. 2014. J Braz Comput Soc

Background: Term extraction is highly relevant as it is the basis for several tasks, such as the building of dictionaries, taxonomies, and ontologies, as well as the translation and organization of text data. Methods and Results: In this paper, we present a survey of the state of the art in automatic term extraction (ATE) for the Brazilian Portuguese language. In this sense, the main contributions and projects related to such task have been classified according to the knowledge they use: statistical, linguistic, and hybrid (statistical and linguistic). We also present a study/review of the corpora used in the term extraction in Brazilian Portuguese, as well as a geographic mapping of Brazil regarding such contributions, projects, and corpora, considering their origins. Conclusions: In spite of the importance of the ATE, there are still several gaps to be filled, for instance, the lack of consensus regarding the formal definition of meaning of 'term'. Such gaps are larger for the Brazilian Portuguese when compared to other languages, such as English, Spanish, and French. Examples of gaps for Brazilian Portuguese include the lack of a baseline ATE system, as well as the use of more sophisticated linguistic information, such as the WordNet and Wikipedia knowledge bases. Nevertheless, there is an increase in the number of contributions related to ATE and an interesting tendency to use contrasting corpora and domain stoplists, even though most contributions only use frequency, noun phrases, and morphosyntactic patterns.


“…Luhn [8] and LuhnDF [9] are semi-automatic methods that plot histograms from candidate terms based on, respectively, candidate frequencies (tf ) and document frequencies (df ). These histograms facilitate the visualization of any possible pattern that candidates may follow and, then, the histograms help to determine a threshold.…”

Section: Related Work (mentioning)

confidence: 99%
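
The histogram-and-threshold procedure described in the statement above can be sketched as follows. This is an illustrative reading of it, not the cited authors' implementation: it computes document frequencies (df), builds the histogram a person would inspect, and then keeps the candidates between two manually chosen cut-offs. The names `document_frequency`, `frequency_histogram`, and `luhn_select`, and the cut-off values used, are hypothetical.

```python
from collections import Counter


def document_frequency(docs):
    """Count, for each term, the number of documents it occurs in."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.lower().split()))
    return df


def frequency_histogram(freqs):
    """Map each frequency value to the number of terms having it."""
    return Counter(freqs.values())


def luhn_select(freqs, lower, upper):
    """Keep terms whose frequency lies within the chosen cut-offs."""
    return sorted(t for t, f in freqs.items() if lower <= f <= upper)


docs = [
    "the term extraction",
    "the term selection",
    "the text mining",
    "the text",
]
df = document_frequency(docs)
hist = frequency_histogram(df)          # inspected by a person to pick cut-offs
selected = luhn_select(df, lower=2, upper=3)
```

The same `frequency_histogram` and `luhn_select` functions work on candidate frequencies (tf) if a counter of total occurrences is passed instead of `df`. The cut-offs remain a manual choice, which is what makes the method semi-automatic.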

The Main Challenge of Semi-Automatic Term Extraction Methods

Conrado, Pardo, Rezende 2015. Natural Language Processing and Cognitive Science

Term extraction is the basis for many tasks such as building of taxonomies, ontologies and dictionaries, for translation, organization and retrieval of textual data. This paper studies the main challenge of semi-automatic term extraction methods, which is the difficulty to analyze the rank of candidates created by these methods. With the experimental evaluation performed in this work, it is possible to fairly compare a wide set of semi-automatic term extraction methods, which allows other future investigations. Additionally, we discovered which level of knowledge and threshold should be adopted for these methods in order to obtain good precision or F-measure. The results show there is not a unique method that is the best one for the three used corpora.


“…On the other hand, unsupervised feature selection algorithms may be employed in unlabeled datasets. Nogueira (2009) presents a comparison of some unsupervised feature selection algorithms for Text Mining. The most commonly used method is the Luhn's method (Luhn, 1958).…”

Section: Pre-processing (mentioning)

confidence: 99%

Hierarchical semi-supervised confidence-based active clustering and its application to the extraction of topic hierarchies from document collections

Nogueira (self-citation)
