In this paper, we present two methodologies for extracting particular information from the full text returned by a search engine, to assist users. The approaches are based on three tasks: named entity recognition (NER), text classification, and text summarization. The first step is building the training data and cleansing the data. We consider the tourism domain, with restaurant, hotel, shopping, and tourism data sets crawled from websites. First, the tourism data are gathered and the vocabularies are built. Several minor steps follow, including sentence extraction and relation and named entity extraction for tagging purposes. These steps are needed to create proper training data. Then, the recognition model for a given entity type can be built. In the experiments, given review texts, we demonstrate how to build models that extract the desired entities (i.e., name, location, facility) as well as relation types, classify the reviews, or summarize the reviews. Two tools, SpaCy and BERT, are used to compare performance on these tasks.
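As a rough illustration of the training-data step described above, the following sketch splits a review into sentences and tags each sentence against a small vocabulary, emitting character-offset annotations in the `(text, {"entities": [(start, end, label)]})` shape commonly used for spaCy NER training data. The vocabulary entries, the label names, and the helper functions `split_sentences` and `annotate` are hypothetical and are not taken from the paper. Exact string matching stands in for whatever tagging procedure the authors actually used.

```python
import re

# Hypothetical vocabulary (surface form -> entity label); not from the paper.
VOCAB = {
    "Grand Palace Hotel": "NAME",
    "Bangkok": "LOCATION",
    "swimming pool": "FACILITY",
    "free Wi-Fi": "FACILITY",
}


def split_sentences(text):
    """Naive sentence extraction: split after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def annotate(text, vocab=VOCAB):
    """Tag exact vocabulary matches and return (text, {"entities": [...]}).

    Each entity is a (start, end, label) tuple of character offsets into text.
    """
    spans = []
    # Try longer phrases first so they win over overlapping shorter ones.
    for phrase in sorted(vocab, key=len, reverse=True):
        pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
        for match in pattern.finditer(text):
            start, end = match.span()
            # Keep the span only if it does not overlap an accepted span.
            if all(end <= s or start >= e for s, e, _ in spans):
                spans.append((start, end, vocab[phrase]))
    spans.sort()
    return text, {"entities": spans}


review = "Grand Palace Hotel is in Bangkok. It has a swimming pool and free Wi-Fi."
training_data = [annotate(sentence) for sentence in split_sentences(review)]
for sentence, annotation in training_data:
    print(sentence, annotation)
```

Annotations produced this way are only as good as the vocabulary, so in practice they would likely need manual review before being used to train a recognition model.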
Tourism information is scattered across many sources. Searching for it is usually time consuming: the user must browse through the results from a search engine, then select and view the details of each accommodation. In this paper, we present a methodology for extracting particular information from the full text returned by the search engine, to assist users. Users can then look specifically at the relevant information they want. The approach can be used for the same task in other domains. The main steps are 1) building the training data and 2) building the recognition model. First, the tourism data is gathered and the vocabularies are built. The raw corpus is used to train the vocabulary embedding and to create the annotated data. The process of creating named entity annotations is presented. Then, the recognition model for a given entity type can be built. In the experiments, given a hotel description, the model can extract the desired entities, i.e., name, location, and facility. The extracted data can further be stored as structured information, e.g., in an ontology format, for future querying and inference. The model for automatic named entity identification, based on machine learning, yields an error ranging from 8% to 25%.

I. INTRODUCTION

Typical information search on the web relies on text or string matching. When the user searches for information, the search engine returns the relevant documents that contain the matched string. The user needs to browse through the associated links to find out whether each web site is within the scope of interest, which is very time consuming.

To facilitate the user's search, an ontology representation can enable the search to return precise results. The specified keyword may carry a meaning particular to a specific domain. For example, consider the word "clouds". A typical search matching such a keyword returns documents referring to similar words such as "sky". However, when used in "cloud computing", the meaning is totally different. With the capability of an ontology, the search can also infer other relevant information. For example, "cloud computing" is a sub-field of "computer architecture". The relevant documents may include papers in areas such as "operating system", "distributed system", etc. Proper ontology construction and imported data can lead to enhanced search features.

It is known that, for a given document, extracting data into an ontology usually requires a lot of human work. Several previous works have attempted to propose methods for building an ontology based on data extraction [1]. Most of this work relied on web structure documents [2], [3], [4]. The ontology is extracted based on the HTML web structure, and the corpus is based on WordNet. For these approaches, the time-consuming process is annotation, that is, annotating the type of each named entity. In this paper, we target the tourism domain and aim to extract particular information that helps ontology data acquisition.

We present the framework for the given named entity extracti...