An experimental computer program has been developed to classify documents according to the 80 sections and five major section groupings of Chemical Abstracts (CA). The program uses pattern recognition techniques supplemented by heuristics. During the "training" phase, words from preclassified documents are selected, and the probability of occurrence of each word in each section of CA is computed and stored in a reference dictionary. The "classification" phase matches each word of a document title against the dictionary and assigns a section number to the document using weights derived from the probabilities in the dictionary. Heuristic techniques are used to normalize word variants such as plurals, past tenses, and gerunds in both the training phase and the classification phase. The dictionary lookup technique is supplemented by the analysis of chemical nomenclature terms into their component word roots, which influences the section to which a document is assigned. Program performance and human consistency have been evaluated by comparing the program results against the published sections of CA and by conducting an experiment with people experienced in the assignment of documents to CA sections. The program assigned approximately 78% of the documents to the correct major section groupings of CA and 67% to the correct sections or cross-references, at a rate of 100 documents per second.
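The training and classification phases described above can be illustrated with a minimal sketch. It is not the original program. The suffix rules in `normalize`, the use of each word's share of occurrences per section as its weight, the summing of weights over title words, and the section labels in the usage note are all assumptions made for illustration. The analysis of chemical nomenclature into word roots is left out.

```python
import re
from collections import Counter, defaultdict


def normalize(word):
    """Crude variant normalization (assumed rules): strip gerund,
    past-tense, and plural endings, leaving at least three letters."""
    w = word.lower()
    for suffix in ("ing", "ed", "s"):
        if (w.endswith(suffix) and not w.endswith("ss")
                and len(w) - len(suffix) >= 3):
            return w[: -len(suffix)]
    return w


def tokenize(title):
    """Split a title into normalized alphabetic words."""
    return [normalize(t) for t in re.findall(r"[a-z]+", title.lower())]


def train(documents):
    """Build the reference dictionary from (title, section) pairs.

    For each word, store the fraction of its occurrences that fall in
    each section (one possible reading of "probability of occurrence").
    """
    counts = defaultdict(Counter)
    for title, section in documents:
        for word in tokenize(title):
            counts[word][section] += 1
    dictionary = {}
    for word, per_section in counts.items():
        total = sum(per_section.values())
        dictionary[word] = {s: n / total for s, n in per_section.items()}
    return dictionary


def classify(title, dictionary):
    """Sum the per-section weights of the known title words and return
    the highest-scoring section, or None if no word is known."""
    scores = Counter()
    for word in tokenize(title):
        for section, weight in dictionary.get(word, {}).items():
            scores[section] += weight
    if not scores:
        return None
    return max(scores, key=scores.get)
```

A short usage example with made-up training titles and hypothetical section labels:

```python
training = [
    ("Polymerization of styrene", "polymers"),
    ("Styrene polymers and copolymers", "polymers"),
    ("Enzymes catalyzing protein hydrolysis", "biochemistry"),
    ("Protein folding in enzymes", "biochemistry"),
]
dictionary = train(training)
print(classify("Styrene polymer films", dictionary))
print(classify("Enzyme folded proteins", dictionary))
```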
The Chemical Abstracts Service Chemical Registry System, operating since 1965, uniquely identifies chemical substances on the basis of molecular structure. Chemical Abstracts Service is now registering chemical substances cited in indexes to Chemical Abstracts prior to 1965. This effort will make several hundred thousand additional chemical structures, along with their names, available for online searching in the Registry File. Both the newly registered substances and those already on file are being linked to their pre-1965 citations in Chemical Abstracts in a new file called CAOLD. In this effort, the printed Formula Index entries are converted to computer-readable form by optical character recognition, and the data are subsequently processed with existing computer programs.