Enzymes are indispensable in many biological processes, and with
biomedical literature growing exponentially, effective literature
review becomes increasingly challenging. Natural language processing
methods offer solutions to streamline this process. This study aims
to develop an annotated enzyme corpus for training and evaluating
enzyme named entity recognition (NER) models. A novel pipeline, combining
dictionary matching and rule-based keyword searching, automatically
annotated enzyme entities in >4800 full-text publications. Four deep
learning NER models were created with different vocabularies (BioBERT/SciBERT)
and architectures (BiLSTM/transformer) and evaluated on 526 manually
annotated full-text publications. The annotation pipeline achieved
an F1-score of 0.86 (precision = 1.00, recall = 0.76). Fine-tuned
transformers surpassed it in F1-score (BioBERT: 0.89, SciBERT: 0.88)
and recall (0.86), while BiLSTM models had higher precision (0.94)
than transformers (0.92). The annotation
pipeline runs in seconds on standard laptops with almost perfect precision,
but was outperformed by fine-tuned transformers in terms of F1-score and
recall, which demonstrates that the transformers generalize beyond
the training data. In comparison, SciBERT-based models exhibited higher
precision, and BioBERT-based models exhibited higher recall, highlighting
the importance of vocabulary and architecture. These models, representing
the first enzyme NER algorithms, enable more effective enzyme text
mining and information extraction. Codes for automated annotation
and model generation are available from and .
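
To make the idea of combining dictionary matching with rule-based keyword searching concrete, the following is a minimal Python sketch. It is not the pipeline described above. The tiny dictionary, the "-ase" suffix rule, the stoplist, and the function name `annotate_enzymes` are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical mini-dictionary; a real pipeline would load a curated enzyme list.
ENZYME_DICTIONARY = {"trypsin", "pepsin", "lysozyme", "dna polymerase"}

# Assumed keyword rule (not taken from the paper): words ending in "ase" or "ases".
SUFFIX_RULE = re.compile(r"\b[A-Za-z][A-Za-z0-9-]*ases?\b")

# Common non-enzyme words that would otherwise satisfy the suffix rule.
STOPLIST = {"base", "bases", "case", "cases", "phase", "phases", "disease",
            "diseases", "database", "databases", "increase", "decrease",
            "release", "purchase"}


def annotate_enzymes(text):
    """Return (start, end, surface_form) tuples for candidate enzyme mentions."""
    spans = set()

    # 1) Dictionary matching: case-insensitive, whole-word lookup of known names.
    for term in ENZYME_DICTIONARY:
        pattern = r"\b" + re.escape(term) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            spans.add((match.start(), match.end()))

    # 2) Rule-based keyword search: suffix heuristic filtered by the stoplist.
    for match in SUFFIX_RULE.finditer(text):
        if match.group(0).lower() not in STOPLIST:
            spans.add((match.start(), match.end()))

    # 3) Drop spans nested inside a longer span that was already kept.
    ordered = sorted(spans, key=lambda s: (s[0], -(s[1] - s[0])))
    kept = []
    for start, end in ordered:
        if kept and start >= kept[-1][0] and end <= kept[-1][1]:
            continue
        kept.append((start, end))

    return [(start, end, text[start:end]) for start, end in kept]


example = "In this case, trypsin and DNA polymerase were compared with a novel lipase."
mentions = [surface for _, _, surface in annotate_enzymes(example)]
```

In this toy example the dictionary catches "trypsin" and "DNA polymerase", the suffix rule catches "lipase", and the stoplist rejects "case". A hand-built rule set like this tends to favor precision over recall, which is the trade-off the reported pipeline scores reflect.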