Segmenting text into semantically coherent segments is an important task with applications in information retrieval and text summarization. Developing accurate topical segmentation requires the availability of training data with ground truth information at the segment level. However, generating such labeled datasets, especially for applications in which the meaning of the labels is user-defined, is expensive and timeconsuming. In this paper, we develop an approach that instead of using segment-level ground truth information, it instead uses the set of labels that are associated with a document and are easier to obtain as the training data essentially corresponds to a multilabel dataset. Our method, which can be thought of as an instance of distant supervision, improves upon the previous approaches by exploiting the fact that consecutive sentences in a document tend to talk about the same topic, and hence, probably belong to the same class. Experiments on the text segmentation task on a variety of datasets show that the segmentation produced by our method beats the competing approaches on four out of five datasets and performs at par on the fifth dataset. On the multilabel text classification task, our method performs at par with the competing approaches, while requiring significantly less time to estimate than the competing approaches.
In e-commerce, a user tends to search for the desired product by issuing a query to the search engine and examining the retrieved results. If the search engine was successful in correctly understanding the user's query, it will return results that correspond to the products whose attributes match the terms in the query that are representative of the query's product intent. However, the search engine may fail to retrieve results that satisfy the query's product intent and thus degrading user experience due to different issues in query processing: (i) when multiple terms are present in a query it may fail to determine the relevant terms that are representative of the query's product intent, and (ii) it may suffer from vocabulary gap between the terms in the query and the product's description, i.e., terms used in the query are semantically similar but different from the terms in the product description. Hence, identifying the terms that describe the query's product intent and predicting additional terms that describe the query's product intent better than the existing query terms to the search engine is an essential task in e-commerce search. In this paper, we leverage the historical query reformulation logs of a major e-commerce retailer to develop distant-supervised approaches to solve both these problems. Our approaches exploit the fact that the significance of a term is dependent upon the context (other terms in the neighborhood) in which it is used in order to learn the importance of the term towards the query's product intent. We show that identifying and emphasizing the terms that define the query's product intent leads to a 3% improvement in ranking. Moreover, for the tasks of identifying the important terms in a query and for predicting the additional terms that represent product intent, experiments illustrate that our approaches outperform the non-contextual baselines.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.