The distribution of bibliographic records in on-line bibliographic databases is examined using 14 different search topics. These topics were searched on the DIALOG database host, using as many suitable databases as possible. The presence of duplicate records in the searches was taken into consideration in the analysis, and the problem of lexical ambiguity in at least one search topic is discussed. The study answers questions such as how many databases are needed in a multifile search for particular topics, and what coverage will be achieved using a certain number of databases. The distribution of the percentages of records retrieved over a number of databases for 13 of the 14 search topics roughly fell into three groups: (1) high concentration of records in one database with about 80% coverage in five to eight databases; (2) moderate concentration in one database with about 80% coverage in seven to 10 databases; and (3) low concentration in one database with about 80% coverage in 16 to 19 databases. The study conforms with earlier results but shows that the number of databases needed for searches with search strategies of varying complexity is much more topic dependent than previous studies would indicate.
Introduction

Despite the proliferation of electronic databases, and the facilities to search multiple databases, not very much is known about how information on a particular topic is distributed throughout different databases. When searching bibliographic databases to obtain records pertaining to a particular topic, questions need to be asked such as "How many databases do I need to search to cover my topic to varying degrees of comprehensiveness?" and "Which databases should I search?" Additionally, despite the tools provided by some on-line hosts for searching over multiple databases or for discovering which databases are most productive for particular search statements (e.g., OneSearch or DIALINDEX on the DIALOG information system), more needs to be known about the distribution of records for a particular topic across many databases. The study discussed in this article is an attempt to better understand how such information is distributed in databases.
The primary aim of this study is to explore the distribution of bibliographic records on particular search topics across various bibliographic databases. In doing this, the robustness of the distributions over different search topics will be examined; earlier studies in this area typically analyzed only one topic. This study will give a better understanding of how many databases are needed to achieve different levels of coverage over a range of different search topics and over varying complexities of search statements. The problem of lexical ambiguity in searching certain topics with seemingly precise keywords or key phrases is also discussed.
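The coverage question raised above ("how many databases are needed to reach a given level of coverage?") can be illustrated with a minimal Python sketch. It is not the study's method: it assumes the per-database record counts have already been deduplicated, and the function names, database names, and counts are hypothetical and chosen only for illustration.

```python
from itertools import accumulate


def cumulative_coverage(counts):
    """Return (database, cumulative fraction) pairs, largest database first.

    `counts` maps a database name to its number of (assumed deduplicated)
    records for one search topic.
    """
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    total = sum(counts.values())
    running = accumulate(n for _, n in ranked)
    return [(name, cum / total) for (name, _), cum in zip(ranked, running)]


def databases_needed(counts, target=0.80):
    """Smallest number of top-ranked databases whose combined share of
    records reaches `target` (a fraction between 0 and 1)."""
    for rank, (_, fraction) in enumerate(cumulative_coverage(counts), start=1):
        if fraction >= target:
            return rank
    return len(counts)


# Hypothetical counts, not data from the study.
example_counts = {"DB_A": 500, "DB_B": 200, "DB_C": 150, "DB_D": 100, "DB_E": 50}

if __name__ == "__main__":
    for name, fraction in cumulative_coverage(example_counts):
        print(f"{name}: {fraction:.0%}")
    print("Databases needed for 80% coverage:", databases_needed(example_counts))
```

Ranking databases by descending yield mirrors the Bradford-style presentation commonly used for such data; a real analysis would first have to decide how duplicate records are counted, which this sketch leaves out.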
Bibliographic databases contain surrogates to a particular subset of the complete set of literature; some databases are very narrow in their scope, while others are multidisciplinary. These databases overlap in their coverage of the literature to a greater or lesser extent. The topic of Fuzzy Set Theory is examined to determine the overlap of coverage in the databases that index this topic. It was found that about 63% of records in the data set are unique to only one database, and the remaining 37% are duplicated in two to 12 different databases. The overlap distribution is found to conform to a Lotka-type plot. The records with maximum overlap are identified; however, further work is needed to determine the significance of the high level of overlap in these records. The unique records are plotted using a Bradford-type form of data presentation and are found to conform (visually) to a hyperbolic distribution. The extent and causes of intra-database duplication (records duplicated within a single database) are also examined. Finally, the overlap among the top databases in the data set was examined, and a high correlation was found between overlapping records and overlapping DIALOG OneSearch categories.
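The overlap figures described above (the share of records unique to one database versus those duplicated in several) can be tabulated as in the following minimal sketch. It assumes each record has already been matched across databases and is represented by a hypothetical record identifier mapped to the databases that index it; the names and data are illustrative only and do not come from the study.

```python
from collections import Counter


def overlap_distribution(record_databases):
    """Map k -> number of records indexed in exactly k distinct databases.

    Converting each list to a set collapses intra-database duplicates,
    so a record listed twice in one database still counts once for it.
    """
    return Counter(len(set(dbs)) for dbs in record_databases.values())


def unique_share(record_databases):
    """Fraction of records that appear in only one database."""
    dist = overlap_distribution(record_databases)
    total = sum(dist.values())
    return dist[1] / total if total else 0.0


# Hypothetical matched records, not data from the study.
example_records = {
    "rec1": ["DB_A"],
    "rec2": ["DB_A", "DB_B"],
    "rec3": ["DB_C"],
    "rec4": ["DB_A", "DB_B", "DB_C"],
    "rec5": ["DB_B", "DB_B"],  # intra-database duplicate
}

if __name__ == "__main__":
    for k, n in sorted(overlap_distribution(example_records).items()):
        print(f"records in exactly {k} database(s): {n}")
    print(f"unique share: {unique_share(example_records):.0%}")
```

Plotting the number of records against k on log-log axes is one way to inspect whether such a distribution looks Lotka-like; the sketch stops at the tabulation step.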
Knowing how records on a particular topic are distributed over databases is useful for both practical and theoretical reasons; however, little work in this area appears to have been done. This paper examines the distribution of records on the topic of "Fuzzy Set Theory" in over 100 bibliographic databases and determines whether the distribution of records over databases is similar to the traditional Bradford hyperbolic distribution of records over journals. Different methods for counting duplicate records between and within databases have been developed. A comparison of the various distributions based on these counting methods is presented, and the distributions are compared to the results of earlier studies. The results also give an indication of the number of databases that must be searched to cover a literature to specified percentages under the different counting techniques developed in this study.
Papers in journals are indexed in bibliographic databases in varying degrees of overlap. The question has been raised as to whether papers that appear in multiple databases (highly overlapping) are in any way more significant (such as being more highly cited) than papers that are indexed in few databases. This paper uses a dataset from fuzzy set theory to compare low overlap papers with high overlap ones, and finds that more highly overlapping papers are in fact more highly cited.