Maxent

Markl, Volker; Kutsch, M.; Tran, T. M.; Haas, Peter J.; Megiddo, Nimrod

doi:10.1145/1142473.1142586

Cited by 3 publications

References 1 publication

Marine literature categorization based on minimizing the labelled data

Zhang

Wang

Deng

et al. 2010

Proceedings of the 6th International Conference on Natural Language Processing and Knowledge Engineering (NLPKE-2010)

In marine literature categorization, supervised machine learning requires a great deal of time for labelling samples by hand, so we use the Co-training method to reduce the number of labelled samples needed to train the classifier. In this paper, we select features only from the text details and add attribute labels to them, which greatly improves the efficiency of text processing. To build two views, we split the features into two parts, each of which can form an independent view. One view is made up of the feature set of the abstract, and the other is made up of the feature sets of the title, keywords, creator and department. In experiments, the F1 value and error rate of the categorization system reach about 0.863 and 14.26%. These are close to the performance of a supervised classifier (0.902 and 9.13%) trained on more than 1500 labelled samples, whereas the Co-training categorization method uses only one positive sample and one negative sample to train the original classifier. In addition, we consider incorporating the idea of active learning into the Co-training method.

The 21st century is an important period for the rapid development of China's marine economy, and people now pay more attention to research in marine disciplines. However, the quantity of marine literature can't satisfy people's needs. With the development of the Internet and information technology, vast information resources can be obtained on the web, so selecting marine literature from the millions of documents published each year is an extremely arduous task. In addition, because of low categorization efficiency, the utilization of existing marine literature is generally not high in domestic institutions. According to statistics, more than 70% of the Chinese-language marine literature and 90% of the foreign marine literature collected by libraries and various types of intelligence agencies has not been used for a long time. Marine scientific researchers spend at least 40% to 60% of their research time filtering and obtaining information [1]. Therefore, we need an efficient marine literature categorization method. The reality of the marine literature categorization problem is that, on the one hand, labelled samples are very scarce and difficult to obtain; on the other hand, unlabelled samples are plentiful and easy to obtain, but they are set aside unused. The data in unlabelled samples can't be used directly to train a traditional classifier, but we can analyze the structure and distribution of the data from them. If we can make full use of this information through machine learning, the performance of the categorization algorithm will improve effectively [2]. Traditional machine learning methods are divided into two kinds: supervised machine learning and unsupervised machine learning. Supervised machine learning requires all samples in the training set to be labelled, however unsupervision machine lear...

Are Data Sets Like Documents?: Evaluating Similarity-Based Ranked Search over Scientific Data

Megler

Maier

2015

IEEE Trans. Knowl. Data Eng.

The past decade has seen a dramatic increase in the amount of data captured and made available to scientists for research. This increase amplifies the difficulty scientists face in finding the data most relevant to their information needs. In prior work, we hypothesized that Information Retrieval-style ranked search can be applied to data sets to help a scientist discover the most relevant data amongst the thousands of data sets in many formats, much like text-based ranked search helps users make sense of the vast number of Internet documents. To test this hypothesis, we explored the use of ranked search for scientific data using an existing multi-terabyte observational archive as our test-bed. In this paper, we investigate whether the concept of varying relevance, and therefore ranked search, applies to numeric data: that is, are data sets enough like documents for Information Retrieval techniques and evaluation measures to apply? We present a user study that demonstrates that data set similarity resonates with users as a basis for relevance and, therefore, for ranked search. We evaluate a prototype implementation of ranked search over data sets with a second user study and demonstrate that ranked search improves a scientist's ability to find needed data.

Ranked Similarity Search of Scientific Datasets: An Information Retrieval Approach

Megler

2000

In the past decade, the amount of scientific data collected and generated by scientists has grown dramatically. This growth has intensified an existing problem: in large archives consisting of datasets stored in many files, formats and locations, how can scientists find data relevant to their research interests? We approach this problem in a new way: by adapting Information Retrieval techniques, developed for searching text documents, to the world of (primarily numeric) scientific data. We propose an approach that uses a blend of automated and curated methods to extract metadata from large repositories of scientific data. We then perform searches over this metadata, returning results ranked by similarity to the search criteria. We present a model of this approach and describe a specific implementation of it, deployed at an ocean-observatory data archive and now running in production. Our prototype implements scanners that extract metadata from datasets that contain different kinds of environmental observations, and a search engine with a candidate similarity measure for comparing a set of search terms to the extracted metadata. We evaluate the utility of the prototype by performing two user studies; these studies show that the approach resonates with users, and that our proposed similarity measure performs well when analyzed using standard Information Retrieval evaluation methods. We ran performance tests to explore how continued archive growth will affect our goal of interactive response, developed and applied techniques that mitigate the effects of that growth, and show that the techniques are effective. Lastly, we describe some of the research needed to extend this initial work into a true "Google for data".
