Characteristics of scite Citation Statements

Find out how long citation statements are.

Tue Jan 25 2022

This post presents some simple summary statistics on the length of scite citation statements, along with an overview of what we mean by citation statements. There are three notable findings. First, the average citation statement is 472 characters long, but lengths vary widely (a standard deviation of 191 characters). Second, supporting and contrasting citation statements vary a bit less (standard deviations of 176 and 168 characters respectively), suggesting they are a bit more formulaic. Third, citation statements in introduction sections tend to be longer (500 characters) and those in methods sections tend to be shorter (420 characters), which matches what we would expect of those sections. To help you picture what 472 characters looks like, see Figure 1, and keep in mind that the maximum tweet length is 280 characters.

Much of what scite is built on is what we call citation statements. Scholars of scientific literature have called these citances, in-text citations, citation contexts, and other names; the terms typically refer to some span of text surrounding a citation made inside a scholarly document. For example, the sentence "This supports the results presented in Gorman et al. (2020)." is what those terms typically indicate. Extracting these citation statements allows us to classify them and to present text spans that users can read in order to understand what is being said about cited papers. However, simply presenting the one sentence that contains the citation isn't enough, as the example above makes clear. Which results are "the results in Gorman"? What is "this" in "this supports"? Because single-sentence citation statements are not clear enough on their own, we typically capture the sentence before and the sentence after the one in which the citation appears, to give users a fuller understanding of why the work is being cited and a better reading experience. For more information on how we do this, see our manuscript, Nicholson et al. (2021), which outlines our methods.
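The windowing described above can be sketched in a few lines of Python. This is a minimal illustration rather than scite's actual pipeline (see Nicholson et al. (2021) for that). It assumes the paragraph has already been split into sentences, the function name `citation_statement` is hypothetical, and the first and third example sentences are made up for the demonstration.

```python
def citation_statement(sentences, citing_index):
    """Join the citing sentence with its immediate neighbours, where they exist.

    `sentences` is one paragraph, already split into sentences (an assumption;
    sentence splitting is a separate problem not handled here).
    """
    if not 0 <= citing_index < len(sentences):
        raise IndexError("citing_index is outside the paragraph")
    start = max(0, citing_index - 1)                # no preceding sentence at paragraph start
    end = min(len(sentences), citing_index + 2)     # no following sentence at paragraph end
    return " ".join(sentences[start:end])


# Only the middle sentence comes from the post; the other two are invented.
paragraph = [
    "We measured the effect in a second cohort.",
    "This supports the results presented in Gorman et al. (2020).",
    "The effect size was also similar.",
]
print(citation_statement(paragraph, 1))
```

Note that a citing sentence at the start or end of a paragraph yields a two-sentence window instead of three, which is one source of variation in statement length.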

Given those citation statements, some have been interested in how long they are. Are they long enough to give sufficient context about an author's motivation for citing? How do they differ across citation types (are supporting citations longer than contrasting ones?) or citation sections (are introduction citation statements longer than methods citation statements?). In the analysis below we look at these questions to help you get a better understanding of scite citation statements. If you are interested in the characteristics of in-text citations in general, Boyack et al. (2017) is a good resource.


At the time of writing (January 25th 2022), scite has over 968m citation statements, extracted from over 28m full-text articles. To characterize those citation statements in a simple way, we count the number of characters in each one. Over a corpus of almost 1bn citation statements this is much easier than tokenizing the statements and counting words. See Figure 1 below for what a typical citation statement looks like. After counting the characters in each citation statement, we look at the average, standard deviation, minimum, maximum, and number of citations for the following groups: all citations, citations by type (supporting, mentioning, contrasting), and citations by section (introduction, methods, results, discussion). See Table 1 below for those summary statistics.
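As a rough sketch of this kind of analysis, the per-group statistics can be computed with Python's standard library. This is not the code we ran over the full corpus. The function name `summarize_lengths`, the `(group, text)` record shape, and the choice of population standard deviation (`pstdev`) are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev


def summarize_lengths(records):
    """Summarize character counts per group.

    `records` is an iterable of (group_label, statement_text) pairs, a
    hypothetical shape chosen for this example. Every statement is also
    counted under the extra group "all".
    """
    lengths = defaultdict(list)
    for group, text in records:
        n_chars = len(text)              # character count, no tokenization
        lengths[group].append(n_chars)
        lengths["all"].append(n_chars)
    return {
        group: {
            "mean": mean(vals),
            "std": pstdev(vals),         # population standard deviation (an assumption)
            "min": min(vals),
            "max": max(vals),
            "n": len(vals),
        }
        for group, vals in lengths.items()
    }


# Toy records standing in for real citation statements.
toy_records = [
    ("supporting", "abcd"),
    ("supporting", "ab"),
    ("contrasting", "abcdef"),
]
toy_stats = summarize_lengths(toy_records)
print(toy_stats["supporting"])
```

The same function works for grouping by section if the group label is the section name instead of the citation type.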

The incidence of hypotension in the ropivacaine only group in our study (53.3%) was higher than that reported by McNamee et al 17 with a similar dose of isobaric ropivacaine 0.75% (18.75 mg). This may be because the volume of drug that was administered in our study was 3.0 ml compared to 2.5 ml in the study by McNamee et al. However, the addition of dexemdetomidine did not alter the hemodynamic profile of ropivacaine.

Figure 1: Typical citation statement (472 characters, a contrasting citation from a discussion section)


Table 1: Summary statistics of character counts of citation statements by various groups. For context the maximum tweet length is 280 characters.

Given Table 1 above, a few things are worth noting. First, the average citation statement length across groups is 472 characters, the same length as the statement in Figure 1. In Figure 1, a contrasting citation from a discussion section, you can see three sentences: the first presents the difference in results [incidence of hypotension in the ropivacaine only group… higher than… McNamee et al], the second qualifies the difference in method [3.0 ml compared to 2.5 ml in the study by McNamee et al], and the third further justifies the conclusions. A typical 472-character citation statement therefore suggests that the citation statements we show in scite provide a clearer and more contextualized reading experience than the citing sentence alone (in Figure 1, the citing sentence is only 197 characters). However, Table 1 also shows that the variation in character length is quite high. The standard deviation is 191 characters, meaning that, if lengths are roughly normally distributed, about 68% of citation statements fall within 472 +/- 191 characters. This seems intuitive because the natural variation in sentence length is quite high. We should qualify this variation, though: citing sentences at the beginning of a paragraph have no preceding sentence, and citing sentences at the end have no following one, so part of the variation simply reflects the uninformative difference in where the citing sentence occurs in a paragraph.
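To make the one-standard-deviation range concrete, here is a small worked calculation. It assumes lengths are approximately normally distributed, which is an idealization we have not verified: character counts cannot go below zero and may well be skewed. The variable names are hypothetical.

```python
from statistics import NormalDist

mean_length = 472   # average characters per citation statement (Table 1)
std_length = 191    # standard deviation in characters (Table 1)

lower = mean_length - std_length
upper = mean_length + std_length

# Share of a normal distribution lying within one standard deviation of the mean.
# The normality assumption is ours, for illustration only.
dist = NormalDist(mu=mean_length, sigma=std_length)
share_within_one_std = dist.cdf(upper) - dist.cdf(lower)

print(f"{lower} to {upper} characters")
print(f"{share_within_one_std:.1%} of statements under the normal assumption")
```

Under that assumption, the bulk of citation statements run from roughly one tweet length (281 characters) to a bit over two tweet lengths (663 characters).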

It is also interesting to look at how citation statement lengths differ by group. Some groups are not notably different in average length but are slightly less variable. By citation type, contrasting citations have a standard deviation of only 168 characters and supporting citations of only 176 characters. While this is a small difference compared to the overall standard deviation of 191 characters, it might indicate that supporting and contrasting citations tend to be a bit more formulaic in how they are written. By section, we see something quite intuitive: citations in introduction sections tend to be longer, at an average of 500 characters per citation, and citations in methods sections tend to be shorter, at an average of 420 characters per citation. This is intuitive because we would expect introduction sections to contain long discussions of background work, and methods citations to be concise descriptions of method, with elaboration and longer discussion saved for later sections.

This analysis has just been an initial peek into what scite citation statements tend to look like in aggregate. Further work could use scite citation statements as a vehicle for studying the language of citation, including lexical diversity (how diverse the vocabulary of citation statements is, and how it differs from the vocabulary of papers in general), disciplinary differences across subjects and fields, the degree of unresolved and ambiguous pronouns and references in citation statements, and general differences between the language used in citation statements and other language sets, such as the paper as a whole. If you are interested in doing research on citation statements themselves, please let us know, as we would love to help you use the scite citation statement corpus for your studies.


