2017
DOI: 10.1109/TASLP.2016.2626965
Modeling Latent Topics and Temporal Distance for Story Segmentation of Broadcast News

Abstract: This paper studies a strategy to model latent topics and temporal distance of text blocks for story segmentation, which we call Graph Regularization in Topic Modeling (GRTM). We propose two novel approaches that consider both temporal distance and lexical similarity of text blocks, collectively referred to as data proximity, in learning latent topic representation, where a graph regularizer is involved to derive the latent topic representation while preserving data proximity. In the first approach, we extend t…
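
The "graph regularizer" mentioned in the abstract can be made concrete with a standard graph-regularized factorization objective. The sketch below is generic (in the style of graph-regularized NMF), not the paper's exact formulation; the symbols X, U, V, W, L, and λ are illustrative:

```latex
\min_{U, V}\; \lVert X - U V^{\top} \rVert_F^2
  \;+\; \lambda \cdot \tfrac{1}{2} \sum_{i,j} w_{ij} \lVert v_i - v_j \rVert^2,
\qquad
\tfrac{1}{2} \sum_{i,j} w_{ij} \lVert v_i - v_j \rVert^2
  = \operatorname{Tr}\!\left(V^{\top} L V\right), \quad L = D - W
```

Here X is the term/text-block matrix, row v_i of V is the latent topic representation of text block i, and w_ij encodes data proximity (large when blocks i and j are lexically similar and temporally close). Minimizing the regularizer keeps proximal blocks close in topic space, which is the property the abstract describes as "preserving data proximity."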

Cited by 15 publications (4 citation statements)
References 43 publications (60 reference statements)
“…LSI has been widely discussed in studies [8,11,29]. pLSI is an enhancement of LSI that is able to model every word as a representation of several topics, which can overcome the problems of synonymy and polysemy [4,14,18,31]. However, pLSI performs this modeling only at the document level.…”
Section: Introduction
confidence: 99%
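
The LSI baseline discussed in this quote can be sketched with off-the-shelf tools. Below is a minimal, illustrative example of LSI as truncated SVD over a TF-IDF matrix using scikit-learn; the toy corpus and the choice of n_components=2 are assumptions, not values from the cited work:

```python
# LSI sketch: low-rank SVD of a TF-IDF term-document matrix.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "stocks fell on wall street today",
    "the market rallied as shares rose",
    "heavy rain flooded the coastal town",
]

tfidf = TfidfVectorizer().fit_transform(docs)       # sparse doc-term matrix
lsi = TruncatedSVD(n_components=2, random_state=0)  # 2 latent dimensions
topic_vectors = lsi.fit_transform(tfidf)            # one dense vector per doc
print(topic_vectors.shape)                          # (3, 2)
```

Unlike LSI's purely algebraic factors, pLSI (as the quote notes) gives each word a probabilistic mixture over topics, which is what lets it address synonymy and polysemy.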
“…This feature vector is then classified into its respective class label via a machine learning algorithm [10,15–18]. Since BoW features are highly sparse and lack diversity [19], topic modeling approaches such as latent Dirichlet allocation (LDA) [20] have been developed. These admixture approaches were originally employed for document classification as they offer linguistic insights into language patterns by grouping associated words into topics and, thereafter, computing the probabilities of topics occurring in each document [21,22].…”
Section: Motivation
confidence: 99%
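
The LDA-based features this quote refers to can be illustrated in a few lines. This is a minimal sketch using scikit-learn; the toy corpus, n_components=2, and treating the document-topic distribution as a classification feature vector are assumptions for illustration:

```python
# LDA sketch: document-topic probabilities as dense features.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the senate passed the budget bill",
    "the team won the championship game",
    "parliament debated the new tax law",
    "the striker scored twice in the final",
]

counts = CountVectorizer().fit_transform(docs)  # bag-of-words counts
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(counts)  # row i: P(topic | document i)
print(doc_topic.round(2))              # dense 4x2 feature matrix
```

Each row is a low-dimensional, non-sparse summary of a document, addressing the sparsity of raw BoW vectors that the quote mentions.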
“…The scatter plot in Figure 6.3(c), on the other hand, exhibits a less skewed set of points based on the scaling in (6.19), which is reflected in the histogram by a lighter-tailed distribution. The above implies that moderate emphasis is given to less significant topic probabilities, which is subsequently shown to result in better feature representation for classification.…”
Section: Chapter Summary
confidence: 99%
“…SPIGA [83] represents documents by performing EDL and constructing a weighted bag-of-concepts from the linked entities. Also, graph-based models such as and-or graphs (AOG) [120] and graph regularization methods [32] have recently achieved high-accuracy results and are useful for multimodal topic modeling. For example, in [120] a novel representation using a Multimodal Topic And-Or Graph (MT-AOG) is presented.…”
Section: Topic Modeling
confidence: 99%
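
The weighted bag-of-concepts that this quote attributes to SPIGA can be pictured with a small, hypothetical sketch; the entity names, the confidence scores, and the additive weighting below are invented for illustration and may differ from SPIGA's actual construction:

```python
# Hypothetical weighted bag-of-concepts from entity-linking output.
from collections import defaultdict

# (concept, linker confidence) pairs for one document, e.g. from an
# entity disambiguation/linking (EDL) step. Values are invented.
linked_entities = [
    ("Barack_Obama", 0.95),
    ("White_House", 0.80),
    ("Barack_Obama", 0.90),
]

bag_of_concepts = defaultdict(float)
for concept, confidence in linked_entities:
    # Accumulate confidence so repeated mentions weigh more.
    bag_of_concepts[concept] += confidence

print(dict(bag_of_concepts))  # {'Barack_Obama': 1.85, 'White_House': 0.8}
```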