Pei-Yau Lung scite author profile

Elucidating the transcriptional regulatory networks that underlie growth and development requires robust ways to define the complete set of transcription factor (TF) binding sites. Although TF-binding sites are known to be generally located within accessible chromatin regions (ACRs), pinpointing these DNA regulatory elements globally remains challenging. Current approaches primarily identify binding sites for a single TF (e.g. ChIP-seq), or globally detect ACRs but lack the resolution to consistently define TF-binding sites (e.g. DNAse-seq, ATAC-seq). To address this challenge, we developed MNase-defined cistrome-Occupancy Analysis (MOA-seq), a high-resolution (< 30 bp), high-throughput, and genome-wide strategy to globally identify putative TF-binding sites within ACRs. We used MOA-seq on developing maize ears as a proof of concept, able to define a cistrome of 145,000 MOA footprints (MFs). While a substantial majority (76%) of the known ATAC-seq ACRs intersected with the MFs, only a minority of MFs overlapped with the ATAC peaks, indicating that the majority of MFs were novel and not detected by ATAC-seq. MFs were associated with promoters and significantly enriched for TF-binding and long-range chromatin interaction sites, including for the well-characterized FASCIATED EAR4, KNOTTED1, and TEOSINTE BRANCHED1. Importantly, the MOA-seq strategy improved the spatial resolution of TF-binding prediction and allowed us to identify 215 motif families collectively distributed over more than 100,000 non-overlapping, putatively-occupied binding sites across the genome. Our study presents a simple, efficient, and high-resolution approach to identify putative TF footprints and binding motifs genome-wide, to ultimately define a native cistrome atlas.

show abstract

Dysregulated gene expression predicts tumor aggressiveness in African-American prostate cancer patients

Ali

Lung

Sholl

et al. 2018

Sci Rep

View full text Add to dashboard Cite

Molecular mechanisms underlying the health disparity of prostate cancer (PCa) have not been fully determined. In this study, we applied bioinformatic approach to identify and validate dysregulated genes associated with tumor aggressiveness in African American (AA) compared to Caucasian American (CA) men with PCa. We retrieved and analyzed microarray data from 619 PCa patients, 412 AA and 207 CA, and we validated these genes in tumor tissues and cell lines by Real-Time PCR, Western blot, immunocytochemistry (ICC) and immunohistochemistry (IHC) analyses. We identified 362 differentially expressed genes in AA men and involved in regulating signaling pathways associated with tumor aggressiveness. In PCa tissues and cells, NKX3.1, APPL2, TPD52, LTC4S, ALDH1A3 and AMD1 transcripts were significantly upregulated (p < 0.05) compared to normal cells. IHC confirmed the overexpression of TPD52 (p = 0.0098) and LTC4S (p < 0.0005) in AA compared to CA men. ICC and Western blot analyses additionally corroborated this observation in PCa cells. These findings suggest that dysregulation of transcripts in PCa may drive the disparity of PCa outcomes and provide new insights into development of new therapeutic agents against aggressive tumors. More studies are warranted to investigate the clinical significance of these dysregulated genes in promoting the oncogenic pathways in AA men.

show abstract

Chromatin structure profile data from DNS-seq: Differential nuclease sensitivity mapping of four reference tissues of B73 maize (Zea mays L)

et al. 2018

View full text Add to dashboard Cite

Presented here are data from Next-Generation Sequencing of differential micrococcal nuclease digestions of formaldehyde-crosslinked chromatin in selected tissues of maize (Zea mays) inbred line B73. Supplemental materials include a wet-bench protocol for making DNS-seq libraries, the DNS-seq data processing pipeline for producing genome browser tracks. This report also includes the peak-calling pipeline using the iSeg algorithm to segment positive and negative peaks from the DNS-seq difference profiles. The data repository for the sequence data is the NCBI SRA, BioProject Accession PRJNA445708.

show abstract

Extracting chemical–protein interactions from literature using sentence structure analysis and feature engineering

Lung

Zhao

et al. 2019

View full text Add to dashboard Cite

Information about the interactions between chemical compounds and proteins is indispensable for understanding the regulation of biological processes and the development of therapeutic drugs. Manually extracting such information from biomedical literature is very time and resource consuming. In this study, we propose a computational method to automatically extract chemical–protein interactions (CPIs) from a given text. Our method extracts CPI pairs and CPI triplets from sentences, where a CPI pair consists of a chemical compound and a protein name, and a CPI triplet consists of a CPI pair along with an interaction word describing their relationship. We extracted a diverse set of features from sentences that were used to build multiple machine learning models. Our models contain both simple features, which can be directly computed from sentences, and more sophisticated features derived using sentence structure analysis techniques. For example, one set of features was extracted based on the shortest paths between the CPI pairs or among the CPI triplets in the dependency graphs obtained from sentence parsing. We designed a three-stage approach to predict the multiple categories of CPIs. Our method performed the best among systems that use non-deep learning methods and outperformed several deep-learning-based systems in the track 5 of the BioCreative VI challenge. The features we designed in this study are informative and can be applied to other machine learning methods including deep learning.

show abstract

Automatic extraction of protein-protein interactions using grammatical relationship graph

Lung

Zhao

et al. 2018

BMC Med Inform Decis Mak

View full text Add to dashboard Cite

BackgroundRelationships between bio-entities (genes, proteins, diseases, etc.) constitute a significant part of our knowledge. Most of this information is documented as unstructured text in different forms, such as books, articles and on-line pages. Automatic extraction of such information and storing it in structured form could help researchers more easily access such information and also make it possible to incorporate it in advanced integrative analysis. In this study, we developed a novel approach to extract bio-entity relationships information using Nature Language Processing (NLP) and a graph-theoretic algorithm.MethodsOur method, called GRGT (Grammatical Relationship Graph for Triplets), not only extracts the pairs of terms that have certain relationships, but also extracts the type of relationship (the word describing the relationships). In addition, the directionality of the relationship can also be extracted. Our method is based on the assumption that a triplet exists for a pair of interactions. A triplet is defined as two terms (entities) and an interaction word describing the relationship of the two terms in a sentence. We first use a sentence parsing tool to obtain the sentence structure represented as a dependency graph where words are nodes and edges are typed dependencies. The shortest paths among the pairs of words in the triplet are then extracted, which form the basis for our information extraction method. Flexible pattern matching scheme was then used to match a triplet graph with unknown relationship to those triplet graphs with labels (True or False) in the database.ResultsWe applied the method on three benchmark datasets to extract the protein-protein-interactions (PPIs), and obtained better precision than the top performing methods in literature.ConclusionsWe have developed a method to extract the protein-protein interactions from biomedical literature. PPIs extracted by our method have higher precision among other methods, suggesting that our method can be used to effectively extract PPIs and deposit them into databases. Beyond extracting PPIs, our method could be easily extended to extracting relationship information between other bio-entities.

show abstract

iSeg: an efficient algorithm for segmentation of genomic and epigenomic data

et al. 2018

View full text Add to dashboard Cite

BackgroundIdentification of functional elements of a genome often requires dividing a sequence of measurements along a genome into segments where adjacent segments have different properties, such as different mean values. Despite dozens of algorithms developed to address this problem in genomics research, methods with improved accuracy and speed are still needed to effectively tackle both existing and emerging genomic and epigenomic segmentation problems.ResultsWe designed an efficient algorithm, called iSeg, for segmentation of genomic and epigenomic profiles. iSeg first utilizes dynamic programming to identify candidate segments and test for significance. It then uses a novel data structure based on two coupled balanced binary trees to detect overlapping significant segments and update them simultaneously during searching and refinement stages. Refinement and merging of significant segments are performed at the end to generate the final set of segments. By using an objective function based on the p-values of the segments, the algorithm can serve as a general computational framework to be combined with different assumptions on the distributions of the data. As a general segmentation method, it can segment different types of genomic and epigenomic data, such as DNA copy number variation, nucleosome occupancy, nuclease sensitivity, and differential nuclease sensitivity data. Using simple t-tests to compute p-values across multiple datasets of different types, we evaluate iSeg using both simulated and experimental datasets and show that it performs satisfactorily when compared with some other popular methods, which often employ more sophisticated statistical models. Implemented in C++, iSeg is also very computationally efficient, well suited for large numbers of input profiles and data with very long sequences.ConclusionsWe have developed an efficient general-purpose segmentation tool and showed that it had comparable or more accurate results than many of the most popular segment-calling algorithms used in contemporary genomic data analysis. iSeg is capable of analyzing datasets that have both positive and negative values. Tunable parameters allow users to readily adjust the statistical stringency to best match the biological nature of individual datasets, including widely or sparsely mapped genomic datasets or those with non-normal distributions.Electronic supplementary materialThe online version of this article (10.1186/s12859-018-2140-3) contains supplementary material, which is available to authorized users.

show abstract

The regulatory landscape of early maize inflorescence development

Parvathaneni

Bertolini

Shamimuzzaman

et al. 2019

Preprint

View full text Add to dashboard Cite

The functional genome of agronomically important plant species remains largely unexplored, yet presents a virtually untapped resource for targeted crop improvement. Functional elements of regulatory DNA revealed through profiles of chromatin accessibility can be harnessed for fine-tuning gene expression to optimal phenotypes in specific environments. Here, we investigate the non-coding regulatory space in the maize (Zea mays) genome during early reproductive development of pollen-and grain-bearing inflorescences. Using an assay for differential sensitivity of chromatin to micrococcal nuclease (MNase) digestion, we profiled accessible chromatin and nucleosome occupancy in these largely undifferentiated tissues and classified approximately 1.6 percent of the genome as accessible, with the majority of MNase hypersensitive sites marking proximal promoters, but also 3' ends of maize genes. This approach mapped regulatory elements to footprint-level resolution. Integration of complementary transcriptome profiles and transcription factor (TF) occupancy data were used to annotate regulatory factors, such as combinatorial TF binding motifs and long non-coding RNAs, that potentially contribute to organogenesis, including tissue-specific regulation between male and female inflorescence structures. Finally, genome-wide association studies for inflorescence architecture traits based only on functional regions delineated by MNase hypersensitivity revealed new SNP-trait associations in known regulators of inflorescence development as well as new candidates. These analyses provide a comprehensive look into the cisregulatory landscape during inflorescence differentiation in a major cereal crop, which ultimately shapes architecture and influences yield potential.

show abstract

iSeg: an efficient algorithm for segmentation of genomic and epigenomic data

Girimurugan

Liu²,

Lung³

et al. 2017

Preprint

View full text Add to dashboard Cite

Background: Identification of functional elements of a genome often requires dividing a sequence of measurements along a genome into segments where adjacent segments have different properties, such as different mean values. This problem is often called the segmentation problem in the field of genomics, and the change-point problem in other scientific disciplines. Despite dozens of algorithms developed to address this problem in genomics research, methods with improved accuracy and speed are still needed to effectively tackle both existing and emerging genomic and epigenomic segmentation problems. Results: We designed an efficient algorithm, called iSeg, for segmentation of genomic and epigenomic profiles. iSeg first utilizes dynamic programming to identify candidate segments and test for significance. It then uses a novel data structure based on two coupled balanced binary trees to detect overlapping significant segments and update them simultaneously during searching and refinement stages. Refinement and merging of significant segments are performed at the end to generate the final set of segments. By using an objective function based on the p-values of the segments, the algorithm can serve as a general computational framework to be combined with different assumptions on the distributions of the data. As a general segmentation method, it can segment different types of genomic and epigenomic data, such as DNA copy number variation, nucleosome occupancy, nuclease sensitivity, and differential nuclease sensitivity data. Using

show abstract

scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.

Contact Info

hi@scite.ai

10624 S. Eastern Ave., Ste. A-614

Henderson, NV 89052, USA

Blog Terms and Conditions API Terms Privacy Policy Contact Cookie Preferences Do Not Sell or Share My Personal Information

Made with 💙 for researchers

Part of the Research Solutions Family.

Pei-Yau Lung

The native cistrome and sequence motif families of the maize ear

Dysregulated gene expression predicts tumor aggressiveness in African-American prostate cancer patients

Chromatin structure profile data from DNS-seq: Differential nuclease sensitivity mapping of four reference tissues of B73 maize (Zea mays L)

Extracting chemical–protein interactions from literature using sentence structure analysis and feature engineering

Automatic extraction of protein-protein interactions using grammatical relationship graph

iSeg: an efficient algorithm for segmentation of genomic and epigenomic data

The regulatory landscape of early maize inflorescence development

iSeg: an efficient algorithm for segmentation of genomic and epigenomic data

Contact Info

Product

Resources

About