We present EDA: easy data augmentation techniques for boosting performance on text classification tasks. EDA consists of four simple but powerful operations: synonym replacement, random insertion, random swap, and random deletion. On five text classification tasks, we show that EDA improves performance for both convolutional and recurrent neural networks. EDA demonstrates particularly strong results for smaller datasets; on average, across five datasets, training with EDA while using only 50% of the available training set achieved the same accuracy as normal training with all available data. We also performed extensive ablation studies and suggest parameters for practical use.
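The four EDA operations are simple enough to sketch in a few lines of Python. This is an illustrative sketch, not the paper's exact implementation: the `synonyms` dictionary stands in for a WordNet lookup, and the parameter names and defaults are assumptions.

```python
import random

def synonym_replacement(words, synonyms, n=1):
    """Replace up to n words that have known synonyms with a random synonym."""
    words = words[:]
    candidates = [i for i, w in enumerate(words) if w in synonyms]
    random.shuffle(candidates)
    for i in candidates[:n]:
        words[i] = random.choice(synonyms[words[i]])
    return words

def random_insertion(words, synonyms, n=1):
    """Insert a synonym of a randomly chosen word at a random position, n times."""
    words = words[:]
    for _ in range(n):
        w = random.choice(words)
        if w in synonyms:
            words.insert(random.randrange(len(words) + 1), random.choice(synonyms[w]))
    return words

def random_swap(words, n_swaps=1):
    """Swap two randomly chosen words, n_swaps times."""
    words = words[:]
    for _ in range(n_swaps):
        i, j = random.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words

def random_deletion(words, p=0.1):
    """Delete each word with probability p, keeping at least one word."""
    kept = [w for w in words if random.random() > p]
    return kept if kept else [random.choice(words)]
```

Each operation returns a new sentence-as-word-list, so several augmented copies can be generated per training example.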
Classification of histologic patterns in lung adenocarcinoma is critical for determining tumor grade and treatment for patients. However, this task is often challenging due to the heterogeneous nature of lung adenocarcinoma and the subjective criteria for evaluation. In this study, we propose a deep learning model that automatically classifies the histologic patterns of lung adenocarcinoma on surgical resection slides. Our model uses a convolutional neural network to identify regions of neoplastic cells, then aggregates those classifications to infer predominant and minor histologic patterns for any given whole-slide image. We evaluated our model on an independent set of 143 whole-slide images. It achieved a kappa score of 0.525 and an agreement of 66.6% with three pathologists for classifying the predominant patterns, slightly higher than the inter-pathologist kappa score of 0.485 and agreement of 62.7% on this test set. All evaluation metrics for our model and the three pathologists were within 95% confidence intervals of agreement. If confirmed in clinical practice, our model can assist pathologists in improving classification of lung adenocarcinoma patterns by automatically pre-screening and highlighting cancerous regions prior to review. Our approach can be generalized to any whole-slide image classification task, and code is made publicly available at https://github.com/BMIRDS/deepslide .
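The aggregation step described above, inferring predominant and minor histologic patterns from per-patch predictions, might look roughly like the following sketch. The `minor_threshold` cutoff is an assumed parameter for illustration, not the paper's published heuristic.

```python
from collections import Counter

def aggregate_patterns(patch_labels, minor_threshold=0.05):
    """Aggregate per-patch pattern predictions into slide-level patterns.

    patch_labels: list of pattern names predicted for each neoplastic patch.
    Returns (predominant, minors), where minors are the patterns covering at
    least `minor_threshold` of the patches, excluding the predominant one.
    """
    counts = Counter(patch_labels)
    total = sum(counts.values())
    predominant, _ = counts.most_common(1)[0]
    minors = [p for p, c in counts.items()
              if p != predominant and c / total >= minor_threshold]
    return predominant, minors
```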
Background The Paris System for Urine Cytopathology (the Paris System) has succeeded in making the analysis of liquid‐based urine preparations more reproducible. Any algorithm seeking to automate this system must accurately estimate the nuclear‐to‐cytoplasmic (N:C) ratio and produce a qualitative “atypia score.” The authors propose a hybrid deep‐learning and morphometric model that reliably automates the Paris System. Methods Whole‐slide images (WSI) of liquid‐based urine cytology specimens were extracted from 51 negative, 60 atypical, 52 suspicious, and 54 positive cases. Morphometric algorithms were applied to decompose images to their component parts; and statistics, including the N:C ratio, were tabulated using segmentation algorithms to create organized data structures, dubbed rich information matrices (RIMs). These RIM objects were enhanced using deep‐learning algorithms to include qualitative measures. The augmented RIM objects were then used to reconstruct WSIs with filtering criteria and to generate pancellular statistical information. Results The described system was used to calculate the N:C ratio for all cells, generate object classifications (atypical urothelial cell, squamous cell, crystal, etc.), filter the original WSI to remove unwanted objects, rearrange the WSI to an efficient, condensed‐grid format, and generate pancellular statistics containing quantitative/qualitative data for every cell in a WSI. In addition to developing novel techniques for managing WSIs, a system capable of automatically tabulating the Paris System criteria also was generated. Conclusions A hybrid deep‐learning and morphometric algorithm was developed for the analysis of urine cytology specimens that could reliably automate the Paris System and provide many avenues for increasing the efficiency of digital screening for urine WSIs and other cytology preparations.
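As an illustration of the morphometric side, the N:C ratio can be tabulated directly from per-cell segmentation masks. This sketch assumes binary masks and computes nuclear area over total cell area; some definitions instead divide by cytoplasmic area (cell minus nucleus), so the convention here is an assumption, not the paper's stated formula.

```python
def nc_ratio(nucleus_mask, cell_mask):
    """Nuclear-to-cytoplasmic (N:C) ratio from binary segmentation masks.

    Masks are 2D lists of 0/1 covering one cell. Computed here as
    nuclear area divided by total cell area (an assumed convention).
    """
    nuc_area = sum(sum(row) for row in nucleus_mask)
    cell_area = sum(sum(row) for row in cell_mask)
    if cell_area == 0:
        raise ValueError("empty cell mask")
    return nuc_area / cell_area
```

Per-cell values like this one would populate one field of the rich information matrix described above.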
IMPORTANCE Deep learning–based methods, such as the sliding window approach for cropped-image classification and heuristic aggregation for whole-slide inference, for analyzing histological patterns in high-resolution microscopy images have shown promising results. These approaches, however, require a laborious annotation process and are fragmented. OBJECTIVE To evaluate a novel deep learning method that uses tissue-level annotations for high-resolution histological image analysis for Barrett esophagus (BE) and esophageal adenocarcinoma detection. DESIGN, SETTING, AND PARTICIPANTS This diagnostic study collected deidentified high-resolution histological images (N = 379) for training a new model composed of a convolutional neural network and a grid-based attention network. Histological images of patients who underwent endoscopic esophagus and gastroesophageal junction mucosal biopsy between January 1, 2016, and December 31, 2018, at Dartmouth-Hitchcock Medical Center (Lebanon, New Hampshire) were collected. MAIN OUTCOMES AND MEASURES The model was evaluated on an independent testing set of 123 histological images with 4 classes: normal, BE-no-dysplasia, BE-with-dysplasia, and adenocarcinoma. Performance of this model was measured and compared with that of the current state-of-the-art sliding window approach using the following standard machine learning metrics: accuracy, recall, precision, and F1 score. RESULTS Of the independent testing set of 123 histological images, 30 (24.4%) were in the BE-no-dysplasia class, 14 (11.4%) in the BE-with-dysplasia class, 21 (17.1%) in the adenocarcinoma class, and 58 (47.2%) in the normal class. Classification accuracies of the proposed model were 0.85 (95% CI, 0.81–0.90) for the BE-no-dysplasia class, 0.89 (95% CI, 0.84–0.92) for the BE-with-dysplasia class, and 0.88 (95% CI, 0.84–0.92) for the adenocarcinoma class.
The proposed model achieved a mean accuracy of 0.83 (95% CI, 0.80–0.86) and marginally outperformed the sliding window approach on the same testing set. The F1 scores of the attention-based model were at least 8% higher for each class compared with the sliding window approach: 0.68 (95% CI, 0.61–0.75) vs 0.61 (95% CI, 0.53–0.68) for the normal class, 0.72 (95% CI, 0.63–0.80) vs 0.58 (95% CI, 0.45–0.69) for the BE-no-dysplasia class, 0.30 (95% CI, 0.11–0.48) vs 0.22 (95% CI, 0.11–0.33) for the BE-with-dysplasia class, and 0.67 (95% CI, 0.54–0.77) vs 0.58 (95% CI, 0.44–0.70) for the adenocarcinoma class. However, this outperformance was not statistically significant. CONCLUSIONS AND RELEVANCE Results of this study suggest that the proposed attention-based deep neural network framework for BE and esophageal adenocarcinoma detection is important because it is based solely on tissue-level annotations, unlike existing methods that are based on regions of interest. This new model is expected to open avenues for applying deep learning to digital pathology.
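A grid-based attention network aggregates tile features into a single slide-level prediction by learning per-tile weights, which is what lets it train from tissue-level labels alone. The dependency-free sketch below shows only the attention-pooling idea; the weight vectors are stand-ins for learned parameters, and the actual architecture is not specified by the abstract.

```python
import math

def _softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exp = [math.exp(x - m) for x in xs]
    total = sum(exp)
    return [e / total for e in exp]

def attention_pool(features, w_attn, w_cls):
    """Attention-weighted pooling of tile features into slide-level class probabilities.

    features: list of tile feature vectors (lists of floats).
    w_attn:   attention scoring vector (one score per tile via dot product).
    w_cls:    one weight vector per class, applied to the pooled feature.
    """
    # Score each tile, then normalize scores into attention weights.
    scores = [sum(fi * ai for fi, ai in zip(f, w_attn)) for f in features]
    alphas = _softmax(scores)
    # Attention-weighted sum of tile features -> slide representation.
    dim = len(features[0])
    pooled = [sum(a * f[d] for a, f in zip(alphas, features)) for d in range(dim)]
    # Linear classifier + softmax over classes.
    logits = [sum(p * wd for p, wd in zip(pooled, w)) for w in w_cls]
    return _softmax(logits)
```

In a trained model, `w_attn` and `w_cls` would be learned end-to-end from slide-level labels, with no tile-level annotation required.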
Question: Are deep neural networks trained on data from a single institution for classification of colorectal polyps on digitized histopathology slides generalizable across multiple external institutions? Findings: A new deep neural network was developed based on 326 slide images from our institution to classify the four most common polyp types on digitized histopathology slides. In addition to evaluation on an internal test set of 157 slide images, we evaluated the model on an external test set of 238 slide images from 24 institutions across 13 states in the United States. This model achieved mean accuracies of 93.5% and 87.0% on the internal and external test sets, respectively, which were comparable with the performance of local pathologists on these test sets. Meaning: Deep neural networks could provide a generalizable approach for the classification of colorectal polyps on digitized histopathology slides and, if confirmed in clinical trials, could potentially improve the efficiency, reproducibility, and accuracy of one of the most common cancer screening procedures.
Data augmentation has recently seen increased interest in NLP due to more work in low-resource domains, new tasks, and the popularity of large-scale neural networks that require large amounts of training data. Despite this recent upsurge, this area is still relatively underexplored, perhaps due to the challenges posed by the discrete nature of language data. In this paper, we present a comprehensive and unifying survey of data augmentation for NLP by summarizing the literature in a structured manner. We first introduce and motivate data augmentation for NLP, and then discuss major methodologically representative approaches. Next, we highlight techniques that are used for popular NLP applications and tasks. We conclude by outlining current challenges and directions for future research. Overall, our paper aims to clarify the landscape of existing literature in data augmentation for NLP and motivate additional work in this area. We also present a GitHub repository with a paper list that will be continuously updated at https://github.com/styfeng/DataAug4NLP.
With the rise of deep learning, there has been increased interest in using neural networks for histopathology image analysis, a field that investigates the properties of biopsy or resected specimens that are traditionally manually examined under a microscope by pathologists. In histopathology image analysis, however, challenges such as limited data, costly annotation, and processing high-resolution and variable-size images create a high barrier of entry and make it difficult to quickly iterate over model designs. Throughout scientific history, many significant research directions have leveraged small-scale experimental setups as petri dishes to efficiently evaluate exploratory ideas, which are then validated in large-scale applications. For instance, the Drosophila fruit fly in genetics and MNIST in computer vision are well-known petri dishes. In this paper, we introduce a minimalist histopathology image analysis dataset (MHIST), an analogous petri dish for histopathology image analysis. MHIST is a binary classification dataset of 3,152 fixed-size images of colorectal polyps, each with a gold-standard label determined by the majority vote of seven board-certified gastrointestinal pathologists. MHIST also includes each image's annotator agreement level. As a minimalist dataset, MHIST occupies less than 400 MB of disk space, and a ResNet-18 baseline can be trained to convergence on MHIST in just 6 minutes using approximately 3.5 GB of memory on an NVIDIA RTX 3090. As example use cases, we use MHIST to study natural questions that arise in histopathology image classification such as how dataset size, network depth, transfer learning, and high-disagreement examples affect model performance. By introducing MHIST, we hope to not only help facilitate the work of current histopathology imaging researchers, but also make histopathology image analysis more accessible to the general computer vision community. Our dataset is available at https://bmirds.github.io/MHIST/.
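MHIST's gold-standard labels come from a majority vote of seven annotators, with each image's agreement level recorded alongside the label. A minimal sketch of that labeling scheme follows; the label strings are illustrative placeholders, not necessarily the dataset's class names.

```python
from collections import Counter

def gold_label_and_agreement(annotations):
    """Majority-vote gold label and annotator agreement for one image.

    annotations: list of labels from the annotators, e.g. seven
    pathologist votes. Agreement is the fraction of annotators who
    voted for the majority label.
    """
    counts = Counter(annotations)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(annotations)
```

For a binary task with seven annotators, agreement can only take the values 4/7, 5/7, 6/7, or 7/7, which is what makes filtering by high-disagreement examples straightforward.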
Figure 1: Key features of our minimalist histopathology image analysis dataset (MHIST).