978-1-4244-6899-7/10/$26.00 ©2010 IEEE

In marine literature categorization, supervised machine learning requires a great deal of time for labelling samples by hand. We therefore use the Co-training method to reduce the number of labelled samples needed to train the classifier. In this paper, we select features only from the text details and add attribute labels to them, which greatly improves the efficiency of text processing. To build two views, we split the features into two parts, each of which forms an independent view: one view consists of the feature set of the abstract, and the other consists of the feature sets of the title, keywords, creator and department. In experiments, the F1 value and error rate of the categorization system reach about 0.863 and 14.26%. These are close to the performance of a supervised classifier (0.902 and 9.13%) trained on more than 1500 labelled samples, whereas the Co-training categorization method uses only one positive sample and one negative sample to train its initial classifier. In addition, we consider incorporating the idea of active learning into the Co-training method.

The 21st century is an important period for the rapid development of China's marine economy, and people now pay more attention to research in marine disciplines. However, the quantity of marine literature available cannot satisfy people's needs. With the development of the Internet and information technology, vast information resources can be obtained on the web, so selecting marine literature from the millions of publications that appear each year is an extremely arduous task. In addition, because categorization efficiency is low, the utilization of existing marine literature in domestic institutions is generally not high. According to statistics, more than 70% of the Chinese marine literature and 90% of the foreign marine literature collected by libraries and various types of intelligence agencies has not been used for a long time. Marine scientific researchers spend at least 40% to 60% of their research time filtering and obtaining information [1]. Therefore, we need an efficient marine literature categorization method. The reality of the marine literature categorization problem is that, on the one hand, labelled samples are very scarce and difficult to obtain; on the other hand, unlabelled samples are abundant and easy to obtain, but they are set aside unused. The data in unlabelled samples cannot be used directly to train a traditional classifier, but we can analyze the structure and distribution of the data from them. If we can make full use of this information with machine learning methods, the performance of the categorization algorithm will improve effectively [2]. Traditional machine learning methods are divided into two kinds: supervised and unsupervised. Supervised machine learning requires all samples in the training set to be labelled, whereas unsupervised machine lear...
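The two-view Co-training idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `TinyNB` classifier, the `co_train` function, the rule of adding one most confident positive and one most confident negative per view per round, and the toy documents are all assumptions made for illustration. The sketch starts from one positive and one negative seed, as the abstract describes, and lets a classifier trained on each view label documents for the shared pool.

```python
import math
from collections import Counter


class TinyNB:
    """Two-class multinomial naive Bayes with add-one smoothing (illustrative)."""

    def fit(self, docs, labels):
        self.vocab = {tok for doc in docs for tok in doc}
        self.n_docs = Counter(labels)
        self.counts = {0: Counter(), 1: Counter()}
        for doc, y in zip(docs, labels):
            self.counts[y].update(doc)
        return self

    def log_odds(self, doc):
        """Return log P(1 | doc) - log P(0 | doc); positive favours class 1."""
        size = len(self.vocab)
        total = {y: sum(self.counts[y].values()) for y in (0, 1)}
        score = math.log(self.n_docs[1] / self.n_docs[0])
        for tok in doc:
            if tok in self.vocab:  # tokens unseen in training carry no evidence
                p1 = (self.counts[1][tok] + 1) / (total[1] + size)
                p0 = (self.counts[0][tok] + 1) / (total[0] + size)
                score += math.log(p1 / p0)
        return score


def co_train(view_a, view_b, seed_labels, rounds=5):
    """Grow a shared label pool using one classifier per view.

    view_a, view_b: per-document token lists for the two feature views.
    seed_labels: {document index: 0 or 1} for the hand-labelled seeds.
    """
    labels = dict(seed_labels)
    for _ in range(rounds):
        for view in (view_a, view_b):
            unlabelled = [i for i in range(len(view)) if i not in labels]
            if not unlabelled:
                return labels
            known = sorted(labels)
            clf = TinyNB().fit([view[i] for i in known],
                               [labels[i] for i in known])
            scores = {i: clf.log_odds(view[i]) for i in unlabelled}
            best_pos = max(scores, key=scores.get)
            best_neg = min(scores, key=scores.get)
            # Only add predictions the current view is actually confident about.
            if scores[best_pos] > 0:
                labels[best_pos] = 1
            if scores[best_neg] < 0:
                labels[best_neg] = 0
    return labels


# Toy data (hypothetical): view A holds abstract tokens, view B holds
# title/keyword/department tokens. Document 0 is the positive (marine)
# seed and document 1 is the negative seed.
abstract_view = [
    ["ocean", "salinity", "current"],
    ["stock", "market", "price"],
    ["ocean", "tide", "wave"],
    ["market", "bank", "loan"],
    ["tide", "wave", "coral"],
    ["bank", "loan", "credit"],
]
metadata_view = [
    ["marine", "oceanography"],
    ["finance", "economics"],
    ["marine", "hydrology"],
    ["finance", "banking"],
    ["hydrology", "reef"],
    ["banking", "credit"],
]

labels = co_train(abstract_view, metadata_view, {0: 1, 1: 0})
print(labels)
```

In this toy run, documents 4 and 5 share no abstract tokens with the seeds, so the abstract view alone cannot label them at first; the second view can, once the first view has labelled documents 2 and 3. That is the intuition behind splitting the features into two views.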
The past decade has seen a dramatic increase in the amount of data captured and made available to scientists for research. This increase amplifies the difficulty scientists face in finding the data most relevant to their information needs. In prior work, we hypothesized that Information Retrieval-style ranked search can be applied to data sets to help a scientist discover the most relevant data amongst the thousands of data sets in many formats, much like text-based ranked search helps users make sense of the vast number of Internet documents. To test this hypothesis, we explored the use of ranked search for scientific data using an existing multi-terabyte observational archive as our test-bed. In this paper, we investigate whether the concept of varying relevance, and therefore ranked search, applies to numeric data; that is, are data sets enough like documents for Information Retrieval techniques and evaluation measures to apply? We present a user study that demonstrates that data set similarity resonates with users as a basis for relevance and, therefore, for ranked search. We evaluate a prototype implementation of ranked search over data sets with a second user study and demonstrate that ranked search improves a scientist's ability to find needed data.
In the past decade, the amount of scientific data collected and generated by scientists has grown dramatically. This growth has intensified an existing problem: in large archives consisting of datasets stored in many files, formats and locations, how can scientists find data relevant to their research interests? We approach this problem in a new way: by adapting Information Retrieval techniques, developed for searching text documents, into the world of (primarily numeric) scientific data. We propose an approach that uses a blend of automated and curated methods to extract metadata from large repositories of scientific data. We then perform searches over this metadata, returning results ranked by similarity to the search criteria. We present a model of this approach, and describe a specific implementation thereof performed at an ocean-observatory data archive and now running in production. Our prototype implements scanners that extract metadata from datasets that contain different kinds of environmental observations, and a search engine with a candidate similarity measure for comparing a set of search terms to the extracted metadata. We evaluate the utility of the prototype by performing two user studies; these studies show that the approach resonates with users, and that our proposed similarity measure performs well when analyzed using standard Information Retrieval evaluation methods. We performed performance tests to explore how continued archive growth will affect our goal of interactive response, developed and applied techniques that mitigate the effects of that growth, and show that the techniques are effective. Lastly, we describe some of the research needed to extend this initial work into a true "Google for data".
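The idea of searching extracted metadata and returning results ranked by similarity can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `DatasetMeta` and `Query` records, the year-level time ranges, the equal weighting of the two scores, and the catalog entries are all hypothetical, and the similarity function is not the measure proposed in the abstract above.

```python
from dataclasses import dataclass


@dataclass
class DatasetMeta:
    """Metadata a scanner might extract from one dataset (hypothetical fields)."""
    name: str
    variables: set
    start_year: int
    end_year: int


@dataclass
class Query:
    """Search criteria expressed over the same fields (hypothetical)."""
    variables: set
    start_year: int
    end_year: int


def time_coverage(query, meta):
    """Fraction of the query's year range that the dataset covers."""
    overlap = min(query.end_year, meta.end_year) - max(query.start_year, meta.start_year) + 1
    return max(0, overlap) / (query.end_year - query.start_year + 1)


def similarity(query, meta):
    """Assumed measure: equal-weight mix of variable match and time coverage."""
    variable_match = len(query.variables & meta.variables) / len(query.variables)
    return 0.5 * variable_match + 0.5 * time_coverage(query, meta)


def ranked_search(query, catalog):
    """Return catalog entries ordered from most to least similar to the query."""
    return sorted(catalog, key=lambda meta: similarity(query, meta), reverse=True)


catalog = [
    DatasetMeta("station_c", {"turbidity"}, 2000, 2004),
    DatasetMeta("cruise_b", {"salinity"}, 2009, 2009),
    DatasetMeta("buoy_a", {"salinity", "temperature"}, 2005, 2012),
]
query = Query({"salinity", "temperature"}, 2008, 2010)

results = ranked_search(query, catalog)
for meta in results:
    print(meta.name, round(similarity(query, meta), 3))
```

Unlike a Boolean filter, this ranking still returns partial matches such as `cruise_b`, which measures only one requested variable for part of the requested period, and simply places them lower in the list.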