A Refined TF-IDF Algorithm Based on Channel Distribution Information for Web News Feature Extraction

Xu, Mingmin; He, Liang; Lin, Xin

doi:10.1109/etcs.2010.130

Cited by 10 publications

(8 citation statements)

References 7 publications


“…Yuan et al [15] utilized a variety of features including statistical features, location features, and part of speech features to evaluate the weight of candidate keywords. Some scholars have also improved the algorithm from the perspective of news text category [16], [17]. Xu et al [17] were convinced that the news from each of the categories have some proper nouns that appear frequently in the document, but are not meaningful.…”

Section: B. Text Feature Selection (mentioning)

confidence: 99%

Hot Topic Detection Based on a Refined TF-IDF Algorithm

Zhu, Liang, Li et al. 2019

IEEE Access


In this paper, we propose a refined term frequency inversed document frequency (TF-IDF) algorithm called TA TF-IDF to find hot terms, based on time distribution information and user attention. We also put forward a method to generate new terms and combined terms, which are split by the Chinese word segmentation algorithm. Then, we extract hot news according to the hot terms, grouping them into K-means clusters so as to realize the detection of hot topics in news. The experimental results indicated that our method based on the refined TF-IDF algorithm can find hot topics effectively.


“…The figure below describes our step to convert gathered information to be document vector. There is two factors that are used in common information processing system [9]: TF, the frequency of the term in a text segment, and IDF, which is used to indicate the distinction of the term.…”

Section: B. The Weighting Scheme (mentioning)

confidence: 99%
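
The two factors named in the statement above can be illustrated with a minimal sketch of plain TF-IDF. This is the textbook weighting, not the refined channel-distribution variant of Xu et al. and not the citing paper's own code; the function name `tf_idf`, the length-normalised TF, and the natural-log IDF are assumptions made for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document.

    Assumed definitions (illustrative, not taken from the cited papers):
      TF  = occurrences of the term in the document / document length
      IDF = ln(number of documents / number of documents containing the term)
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

docs = [["news", "sports", "news"], ["news", "finance"]]
for w in tf_idf(docs):
    print(w)
```

A term that occurs in every document (here "news") receives weight 0 under this definition, which is the "distinction" role the statement attributes to IDF.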

“…The Vector Space Model is commonly used structured form for text data in which individual text documents are represented as a set of vectors [9]. Later the matrix M would be converted into single vector V (word) so that those collections of words could be clustered using k-means algorithm.…”

Section: Vector Space Model (mentioning)

confidence: 99%
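
As a rough illustration of the vector space model mentioned above, documents can be held as sparse term-weight vectors and compared by cosine similarity. The statement itself describes k-means clustering over such vectors; the sketch below shows only the similarity measure that such a pipeline commonly relies on, and the helper name `cosine_similarity` and the dict-based representation are assumptions.

```python
import math

def cosine_similarity(u, v):
    """u, v: sparse document vectors as {term: weight} dicts (assumed layout)."""
    # Dot product over the terms the two vectors share.
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for empty or all-zero vectors
    return dot / (norm_u * norm_v)

query = {"proxy": 1.0, "search": 1.0}
page = {"proxy": 2.0, "log": 1.0}
print(cosine_similarity(query, page))
```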

Generating customized web search result through community driven search engine

Hantono, Putra 2013

2013 International Conference on Information Technology and Electrical Engineering (ICITEE)


These days, the growth of web has led it to a big source of information. Web search engine plays an important role of searching desired information from this enormous web. However, search engine provides the same result independently to the user while actually each user has different preference. In this paper, we present a novel method of customized web search result generation to provide a better result according to community's preference. We benefit from proxy servers, which are widely used in a community network to reduce bandwidth needs. Proxy servers are, actually, providing the user preference within its access log that contains accessed URLs. Instead of web crawler, we will use this logs, which is always updated as users browse the web through this proxy. This would be the base of our customized web search. As the proxy log only covers URL list, we still need to crawl the information contained in an URL. When the crawling method has completed, document vector is created to make those data to be more machine friendly. Eventually, searching process is carried out by utilizing the vector space model.


“…run a Hidden Markov Model [11] speech recognition algorithm on media files for transcribing the speech to text; -compute the TF-IDF measure for all articles and transcriptions, as specified in the CDI IDF algorithm [12]; -compute the page rank of articles, as specified in the weighted pagerank algorithm [13]; -perform a topic extraction routine on articles and transcriptions, as specified in the Latent Dirichlet Allocation algorithm [14].…”

Section: Metadata Processes (mentioning)

confidence: 99%

PRIMEBALL: A Parallel Processing Framework Benchmark for Big Data Applications in the Cloud

Ferrarons, Adhana, Colmenares et al. 2014

Performance Characterization and Benchmarking


Abstract. In this position paper, we draw the specifications for a novel benchmark for comparing parallel processing frameworks in the context of big data applications hosted in the cloud. We aim at filling several gaps in already existing cloud data processing benchmarks, which lack a real-life context for their processes, thus losing relevance when trying to assess performance for real applications. Hence, we propose a fictitious news site hosted in the cloud that is to be managed by the framework under analysis, together with several objective use case scenarios and measures for evaluating system performance. The main strengths of our benchmark definition are parallelization capabilities supporting cloud features and big data properties.
