MINT and IntAct contribute to the Second BioCreative challenge: serving the text-mining community with high quality molecular interaction data

Chatr‐aryamontri, Andrew; Kerrien, Samuel; Khadake, Jyoti; Orchard, Sandra; Céol, Arnaud; Licata, Luana; Castagnoli, Luisa; Costa, Stefano; Derow, C.; Huntley, Rachael P.; Aranda, Bruno; Leroy, Catherine; Thorneycroft, Dave; Apweiler, Rolf; Cesareni, Gianni; Hermjakob, Henning

doi:10.1186/gb-2008-9-s2-s5

Cited by 28 publications

(32 citation statements)

References 11 publications


“…In papers in Trends in Biochemical Sciences 44,45,51 the authors argued over a distressing lack of reproducibility of curated interactions and contended that "protein interactions reported in the literature and curated in interaction databases might not occur as presented." Other reports have questioned the presumed perfection of curated PPIs 23,29,43,74 , even one report by several authors of Salwinski et al 71 : "a comparison of publications curated by both MINT and IntAct between 2003 and 2005 revealed that the two databases annotated exactly the same interaction pairs in only 6 out of 52 publications" 75 . BioGRID now grants that provisions are not made for quality assessment in curation: "We make no judgement calls on the methods or even, within reason, the quality of the data themselves" 76 .…”

Section: Addenda, Corrigenda and Errata (mentioning)

confidence: 99%

Addendum: Literature-curated protein interaction datasets

Cusick, Yu, Smolyar et al. 2009, Nat Methods


934 | VOL.6 NO.12 | DECEMBER | nature methods addenda, corrigenda and errata

We assessed literature-curated protein-protein interaction (PPI) datasets for the parameters of completeness, coverage and quality by several means, concluding that such datasets might be "possibly of lower quality than commonly assumed." A Correspondence 71 by members of the International Molecular Exchange Consortium (IMEx), while accepting many of our points, objected to our recuration exercise to assess quality, finding our criteria "subjective." We argue that the criteria were commonsensical and essentially capture how these databases are often described.

A wide swath of the scientific community, from computer scientists and engineers to physicists, systems biologists and molecular biologists, use literature-curated datasets as 'gold-standard' positive controls with the tacit understanding that this information is nearly perfect. Whether user impressions were formed from statements made by database authors [18][19][20][21] or not, belief that database entries accurately correspond to high-quality, direct physical interactions is widespread 6,72. The standards we used to assess quality are generally accepted by the IMEx members, but one that remains problematic is the definition of binary interactions. A meaningful fraction of database users is under the impression that 'binary interaction' means direct pairwise PPIs, and that is the definition we tried to apply. The definition that the IMEx databases apply is that of 'binary representation', meaning any pairwise association between two entities, direct or indirect. Although technically correct from an informatics viewpoint, binary representation likely does not accurately reflect biophysical reality. To better match user expectations, one IMEx database has adjusted their website presentation to allow users to filter 'spoke expanded co-complexes' from binary interactions, although all reported interactions are initially classified as 'binary'.

Another widespread perception is that curated databases contain predominantly low-throughput interactions, whereas the reality is that curated databases have a substantial portion of interactions derived from high-throughput experiments (Fig. 2 in our Perspective). The point is not whether high-throughput interaction experiments are of worse or better quality than low-throughput experiments, but that greater transparency should be provided so that users can filter the data according to their needs.

As a result of applying the criteria that we did, based on the observations above, the error rates we reported reflected not only errors in curation but also how well the underlying data meet the standards set forth. The details for the yeast, human and plant recurations are available in the Supplementary Note.

Our efforts are aimed at alerting the scientific community that literature-curated interactions may need further scrutiny or classification to qualify as a 'gold standard' for users who are specifically interested in direct pairwise PPIs. Clos...
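The difference between a directly observed binary interaction and a 'spoke expanded' co-complex, as discussed in the abstract above, can be illustrated with a minimal sketch. It assumes the usual spoke model, in which the bait of a co-purification experiment is paired with each co-purified prey. The `Pair` record, its `spoke_expanded` flag and the protein names are hypothetical and are not taken from any IMEx database schema.

```python
from typing import List, NamedTuple


class Pair(NamedTuple):
    """One pairwise record; the spoke_expanded flag is a hypothetical field."""
    a: str
    b: str
    spoke_expanded: bool


def spoke_expand(bait: str, preys: List[str]) -> List[Pair]:
    """Represent a co-complex as bait-prey pairs (spoke model)."""
    return [Pair(bait, prey, True) for prey in preys]


def direct_only(pairs: List[Pair]) -> List[Pair]:
    """Keep only pairs that were not produced by spoke expansion."""
    return [p for p in pairs if not p.spoke_expanded]


# One directly observed pair plus one co-complex with a bait and three preys.
pairs = [Pair("P1", "P2", False)] + spoke_expand("BAIT", ["X", "Y", "Z"])
print(len(pairs), len(direct_only(pairs)))
```

All four records are 'binary' in the representational sense, but only one of them would survive a filter for direct pairwise interactions.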


“…Traditionally biological database curators have contributed to the various BioCreative challenges (Hirschman, Yeh et al 2005, Chatr-aryamontri, Kerrien et al 2008, Krallinger, Morgan et al 2008, Lu and Hirschman 2012) supporting the identification of stages in the curation workflow suitable for text mining applications and manually annotating the training and test corpora. Because the manual curation of the current exponentially growing body of biomedical literature is an impossible task, the insertion of robust text mining tools in the curation pipeline represent a feasible and sustainable solution to this problem (Hirschman, Burns et al 2012).…”

Section: Introduction (mentioning)

confidence: 99%

BioNLP 2017

Cohen, Demner‐Fushman, Ananiadou et al. 2017


According to the Association for Computational Linguistics guidelines on special interest groups (SIGs), The function of a SIG is to encourage interest and activity in specific areas within the ACL's field [1]. Is the SIGBioMed special interest group "within the ACL's field"? The titles of this year's papers suggest that it is, in that the current interest in deep learning in its many and varied manifestations is mirrored in those titles. Do those papers cover a specific area? They do, and in doing so, they demonstrate one of the great satisfactions of working in biomedical natural language processing.

One of the joys of involvement in the biomedical natural language processing community is seeing the development of research with clinical applications. As examples of such work being presented at BioNLP 2017, we would like to point out the two papers that discuss the application of natural language processing to the diagnosis of neurological disorders. Bhatia et al. [2] describe an approach to using speech processing in the assessment of patients with amyotrophic lateral sclerosis (also known as Lou Gehrig's disease), one of the more horrific motor neuron diseases. Good assessment of amyotrophic lateral sclerosis patients is important for a number of reasons, including the fact that accurate tracking of the inevitable deterioration that is a hallmark of this disease gives patients and their families the possibility of purposeful planning for the attendant disability and death. However, current methodologies for evaluating the status of amyotrophic lateral sclerosis patients necessarily involve expensive equipment and highly trained personnel; when further developed, this methodology could make such evaluation much more, and more frequently, available to ALS patients. The fact that the work reported here involves a speech modality is especially exciting, as speech-related indicators of future ALS can be present long before diagnosis. The paper uses measurements of phonological features of speech and their divergence from a baseline, and demonstrates correlation with physiological measures.

Adams et al. [3] describe work on detecting and categorizing word production errors associated with anomia, a particular kind of inability to find words. Screening for anomia is important because anomia is a symptom of stroke, but it is difficult and time-consuming to do, and therefore is not done as often as it should be. Automatic detection of anomia could be a nice enabler of improved care for stroke victims, but it is made difficult due to the subtlety of the phonological and semantic judgments that have to be made when assessing the phenomenon. The paper uses a combination of language modeling and phonologically-based edit distance calculation to approach the task, applying these techniques to data from the AphasiaBank collection of transcribed aphasic and healthy speech.

Although we have summarized only these two examples that address neurological disorders, there are several other papers on the use of natural language proc...


“…Several databases are engaged in manual annotation of protein-protein interactions (PPIs) from the literature, including MINT [10], IntAct [11], and BioGRID [12]. The automatic detection of PPIs from the literature has been the focus of multiple text mining systems.…”

Section: Introduction (mentioning)

confidence: 99%

“…This task covered 1. the detection and ranking of abstracts according to the relevance for deriving PPI annotations, 2. the extraction of the normalized protein interaction pairs, 3. the retrieval of suitable protein interaction evidence passages from full-text articles as well as 4. the automatic detection of the interaction detection experimental methods mentioned in the papers. To ensure that the PPI annotations followed commonly used standards adopted by the biocuration community, the evaluation data was prepared by experienced curators from two different databases, MINT and IntAct [11].…”

Section: Introduction (mentioning)

confidence: 99%

An Overview of BioCreative II.5

Leitner, Mardis, Krallinger et al. 2010, IEEE/ACM Trans. Comput. Biol. and Bioinf.


Abstract: We present the results of the BioCreative II.5 evaluation in association with the FEBS Letters experiment, where authors created Structured Digital Abstracts to capture information about protein-protein interactions. The BioCreative II.5 challenge evaluated automatic annotations from 15 text mining teams based on a gold standard created by reconciling annotations from curators, authors, and automated systems. The tasks were to rank articles for curation based on curatable protein-protein interactions; to identify the interacting proteins (using UniProt identifiers) in the positive articles (61); and to identify interacting protein pairs. There were 595 full-text articles in the evaluation test set, including those both with and without curatable protein interactions. The principal evaluation metrics were the interpolated area under the precision/recall curve (AUC iP/R), and (balanced) F-measure. For article classification, the best AUC iP/R was 0.70; for interacting proteins, the best system achieved good macroaveraged recall (0.73) and interpolated area under the precision/recall curve (0.58), after filtering incorrect species and mapping homonymous orthologs; for interacting protein pairs, the top (filtered, mapped) recall was 0.42 and AUC iP/R was 0.29. Ensemble systems improved performance for the interacting protein task.
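The two metrics named in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the official BioCreative II.5 evaluation script: it assumes the common definition in which interpolated precision at a recall level is the highest precision reached at that recall or any higher recall, and the area is summed over the recall steps at each correct hit. The function names and the toy ranking are hypothetical.

```python
from typing import List


def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def interpolated_auc_pr(ranked_labels: List[bool], total_positives: int) -> float:
    """Area under the interpolated precision/recall curve for a ranked list.

    ranked_labels holds True for a correct result, best-ranked first.
    """
    # Record (recall, precision) at every rank where a correct result appears.
    points = []
    hits = 0
    for rank, is_positive in enumerate(ranked_labels, start=1):
        if is_positive:
            hits += 1
            points.append((hits / total_positives, hits / rank))

    # Interpolate: precision at recall r is the maximum precision
    # observed at any recall greater than or equal to r.
    best = 0.0
    interpolated = []
    for recall, precision in reversed(points):
        best = max(best, precision)
        interpolated.append((recall, best))
    interpolated.reverse()

    # Sum interpolated precision times the recall step it covers.
    area = 0.0
    previous_recall = 0.0
    for recall, precision in interpolated:
        area += precision * (recall - previous_recall)
        previous_recall = recall
    return area


# Toy ranking: three relevant items, one irrelevant item ranked second.
toy_auc = interpolated_auc_pr([True, False, True, True], total_positives=3)
print(round(toy_auc, 4), f_measure(0.5, 0.5))
```

Because the area depends on the order of the ranked list, it rewards systems that place correct results early, whereas the F-measure scores an unranked result set.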
