Anthony Hartshorn scite author profile

Information overload is a major obstacle to scientific progress. The explosive growth in scientific literature and data has made it ever harder to discover useful insights in a large mass of information. Today scientific knowledge is accessed through search engines, but they are unable to organize scientific knowledge alone. In this paper we introduce Galactica: a large language model that can store, combine and reason about scientific knowledge. We train on a large scientific corpus of papers, reference material, knowledge bases and many other sources. We outperform existing models on a range of scientific tasks. On technical knowledge probes such as LaTeX equations, Galactica outperforms the latest GPT-3 by 68.2% versus 49.0%. Galactica also performs well on reasoning, outperforming Chinchilla on mathematical MMLU by 41.3% to 35.7%, and PaLM 540B on MATH with a score of 20.4% versus 8.8%. It also sets a new state-of-the-art on downstream tasks such as PubMedQA and MedMCQA dev of 77.6% and 52.9%. And despite not being trained on a general corpus, Galactica outperforms BLOOM and OPT-175B on BIG-bench. We believe these results demonstrate the potential for language models as a new interface for science. We open source the model for the benefit of the scientific community.

Computing has indeed revolutionized how research is conducted, but information overload remains an overwhelming problem (Bornmann and Mutz, 2014). In May 2022, an average of 516 papers per day were submitted to arXiv (arXiv, 2022). Beyond papers, scientific data is also growing much more quickly than our ability to process it (Marx, 2013). As of August 2022, the NCBI GenBank contained 1.49 × 10^12 nucleotide bases (GenBank, 2022). Given the volume of information, it is impossible for a single person to read all the papers in a given field; and it is likewise challenging to organize data on the underlying scientific phenomena.

Search engines are the current interface for accessing scientific knowledge following the Licklider paradigm. But they do not organize knowledge directly, and instead point to secondary layers such as Wikipedia,

Structure and Function of Peatland‐Forest Ecotones in Southeastern Alaska

Hartshorn, Southard, Bledsoe (2003), Soil Science Soc of Amer J

High‐latitude warming could cause northern peatlands to become C sources. Where peatlands border boreal forests, strong differences in ecosystem C balances reflect drainage differences. Because local drainage conditions could be influenced by alterations in temperature and precipitation regimes, peatland‐forest ecotones represent useful locations for monitoring potential impacts of global warming. We characterized the soils, hydrology, and forest structure along transects bracketing a peatland‐forest ecotone in southeastern Alaska. We expected to find soil properties and processes at the peatland‐forest edge that were intermediate between those from peatland and forest. Instead, we found that above‐ and belowground features of the ecotone did not coincide. Conifers grew on mineral soils, but also grew on Cryofibrists and Cryohemists, soils with high soil organic C (SOC) contents to 100 cm (57 kg m−2) that are significantly greater than the SOC contents of adjacent forested, non‐Histosol pedons. Soil respiration rates (SRR) at peatland‐forest edges (0.08 g CO2–C m−2 h−1), by contrast, were threefold lower than forest rates and did not differ significantly from peatland rates. Respiration rates were strongly influenced by water table height. Peatland and edge water tables were both significantly shallower than forest water tables. Our conceptual model suggests that if additional forest expansion and warmer summers enhance drainage of these edge soils and stimulate SRR to forest‐like levels, 23 kg C m−2 could ultimately be mineralized from these extensive peatland‐forest boundaries. Afforestation of peatland margins under this scenario could represent a transient positive feedback to rising atmospheric CO2 levels.

Soils, geomorphology, landscape evolution, and land use in the Virginia Piedmont and Blue Ridge

Sherwood, Hartshorn, Eaton (2010)

The object of this field trip is to examine the geology, landforms, soils, and land use in the eastern Blue Ridge and western Piedmont geologic provinces in Orange County in central Virginia. A complex mix of igneous, sedimentary, and metamorphic bedrocks, ranging in age from Mesoproterozoic to Triassic (possibly some Jurassic) underlie the area. Soils are equally varied with a total of 62 series mapped in Orange County alone. The area being relatively stable tectonically, landforms generally reflect the resistance to weathering of the bedrock. Area landforms range from a low ridge over Catoctin greenstone to a gently rolling Triassic basin. Soils examined on the trip represent three orders: Ultisols, Alfisols, and Inceptisols. Residual soils clearly reflect the compositions of the parent rocks and saprolites are common. Map patterns of forested versus nonforested lands bear a striking resemblance to the distribution patterns of the different soil and bedrock types. Our work has shown that the vast majority of the land in central Virginia, even that forested today, shows evidence of past clearing and cultivation. However, the harsh demands of growing tobacco wore out the less fertile and more erodible soils by the mid-nineteenth century resulting in their abandonment and the subsequent regeneration of the vast tracts of hardwood forests we see today. Only the most productive soils remain in agriculture.

Assessing Robustness of Text Classification through Maximal Safe Radius Computation

Malfa, Wu, Laurenti et al. (2020)

Neural network NLP models are vulnerable to small modifications of the input that maintain the original meaning but result in a different prediction. In this paper, we focus on robustness of text classification against word substitutions, aiming to provide guarantees that the model prediction does not change if a word is replaced with a plausible alternative, such as a synonym. As a measure of robustness, we adopt the notion of the maximal safe radius for a given input text, which is the minimum distance in the embedding space to the decision boundary. Since computing the exact maximal safe radius is not feasible in practice, we instead approximate it by computing a lower and upper bound. For the upper bound computation, we employ Monte Carlo Tree Search in conjunction with syntactic filtering to analyse the effect of single and multiple word substitutions. The lower bound computation is achieved through an adaptation of the linear bounding techniques implemented in tools CNN-Cert and POPQORN, respectively for convolutional and recurrent network models. We evaluate the methods on sentiment analysis and news classification models for four datasets (IMDB, SST, AG News and NEWS) and a range of embeddings, and provide an analysis of robustness trends. We also apply our framework to interpretability analysis and compare it with LIME.

Wind Tunnel Test of A R.A.F.28 Aerofoil with Thurston Rotors

Hartshorn, Callen (1934), J. R. Aeronaut. Soc.

The suggestion has been put forward by Dr. A. P. Thurston that the characteristics of a wing can be considerably improved by adding self-starting rotors at or near the wing tips. It is claimed that by placing these on each wing near the leading edge improvements analogous to those of a slotted wing tip can be obtained together with slow landing characteristics. Previous experiments (1) have shown that rotors can give about the same degree of lateral stability as tip slots. The primary object of these experiments was to test the second assertion, and the criterion by which this should be judged was that the gliding angle should be as steep as possible for a given rate of descent.
