Far more natural language processing work has been done for Western languages than for Eastern ones, particularly South Asian languages. Western languages are termed resource-rich: core linguistic resources such as corpora, WordNet, dictionaries, gazetteers, and associated tools are readily available for them. Most South Asian languages are low-resource. Urdu, for example, is among the most widely spoken languages of the subcontinent, yet resource scarcity has limited the work conducted on it. The core objective of this paper is to survey the linguistic resources that exist for Urdu language processing (ULP), to highlight the different tasks in ULP, and to discuss the available state-of-the-art techniques. The paper thereby attempts to describe in detail the recent increase in interest and the progress made in ULP research. First, the available datasets for Urdu are discussed. The characteristics, orthography, and morphology of Urdu, and resource sharing between Hindi and Urdu, are then described. Pre-processing activities such as stop-word removal, diacritics removal, normalization, and stemming are illustrated. State-of-the-art research is reviewed for tasks such as tokenization, sentence boundary detection, part-of-speech tagging, named entity recognition, parsing, and WordNet development. In addition, the impact of ULP on application areas such as information retrieval, classification, and plagiarism detection is investigated. Finally, open issues and future directions for this new and dynamic area of research are provided. The goal of this paper is to organize ULP work so that it can serve as a platform for future ULP research.
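Two of the pre-processing steps named in the abstract above, diacritics removal and stop-word removal, can be illustrated with a minimal Python sketch. It is not taken from the surveyed paper. The function names, the code-point range treated as diacritics (U+064B to U+0652 plus U+0670), and the tiny stop-word list are all assumptions made for illustration; a real system would use a curated Urdu stop-word resource.

```python
# Assumption: these Arabic-script combining marks are treated as removable
# diacritics (fathatan .. sukun, plus superscript alef). Letters such as
# alef-with-madda are deliberately left untouched.
DIACRITICS = {chr(cp) for cp in range(0x064B, 0x0653)} | {"\u0670"}

# Hypothetical, tiny stop-word list for illustration only.
STOP_WORDS = {"کا", "کی", "کے", "ہے", "میں", "اور"}


def remove_diacritics(text: str) -> str:
    """Return text with the assumed diacritic marks stripped."""
    return "".join(ch for ch in text if ch not in DIACRITICS)


def remove_stop_words(tokens: list[str]) -> list[str]:
    """Drop tokens that appear in the illustrative stop-word list."""
    return [tok for tok in tokens if tok not in STOP_WORDS]


def preprocess(text: str) -> list[str]:
    """Strip diacritics, split on whitespace, then filter stop words."""
    return remove_stop_words(remove_diacritics(text).split())
```

Whitespace splitting stands in for tokenization here only to keep the sketch short; the abstract lists Urdu tokenization as a separate research task in its own right.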
Bardet–Biedl syndrome (BBS) is a recessive disorder characterized by heterogeneous clinical manifestations, including truncal obesity, rod-cone dystrophy, renal anomalies, postaxial polydactyly, and variable developmental delays. At least 20 genes have been implicated in BBS, and all are involved in primary cilia function. We report a 1-year-old male child from Guyana with obesity, postaxial polydactyly on his right foot, hypotonia, ophthalmologic abnormalities, and developmental delay, which together indicated a clinical diagnosis of BBS. Clinical chromosomal microarray (CMA) testing and high-throughput BBS gene panel sequencing detected a homozygous 7p14.3 deletion of exons 1–4 of BBS9 that was encompassed by a 17.5 Mb region of homozygosity at chromosome 7p14.2–p21.1. The precise breakpoints of the deletion were delineated to a 72.8 kb region in the proband and carrier parents by third-generation long-read single molecule real-time (SMRT) sequencing (Pacific Biosciences), which suggested non-homologous end joining as a likely mechanism of formation. Long-read SMRT sequencing of the deletion breakpoints also determined that the aberration included the neighboring RP9 gene implicated in retinitis pigmentosa; however, the clinical significance of this was considered uncertain given the paucity of reported cases with unambiguous RP9 mutations. Taken together, our study characterized a BBS9 deletion, and the identification of this shared haplotype in the parents suggests that this pathogenic aberration may be a BBS founder mutation in the Guyanese population. Importantly, this informative case also highlights the utility of long-read SMRT sequencing to map nucleotide breakpoints of clinically relevant structural variants.
We present a prototype software system with sufficient capacity and speed to estimate radiation exposures in a mass casualty event by counting dicentric chromosomes (DCs) in metaphase cells from many individuals. Top-ranked metaphase cell images are segmented by classifying and defining chromosomes with an active contour gradient vector field (GVF) and by determining centromere locations along the centreline. The centreline is extracted by discrete curve evolution (DCE) skeleton branch pruning and curve interpolation. Centromere detection minimises the global width and DAPI-staining intensity profiles along the centreline. A second centromere is identified by reapplying this procedure after masking the first. Dicentrics can be identified from features that capture width and intensity profile characteristics as well as local shape features of the object contour at candidate pixel locations. The correct location of the centromere is also refined in chromosomes with sister chromatid separation. The overall algorithm has both high sensitivity (85 %) and specificity (94 %). Results are independent of the shape and structure of chromosomes in different cells, or the laboratory preparation protocol followed. The prototype software was recoded in C++/OpenCV; image processing was accelerated by data and task parallelisation with Message Passing Interface and Intel Threading Building Blocks and an asynchronous non-blocking I/O strategy. Relative to a serial process, metaphase ranking, GVF and DCE are, respectively, 100 and 300-fold faster on an 8-core desktop and 64-core cluster computers. The software was then ported to a 1024-core supercomputer, which processed 200 metaphase images each from 1025 specimens in 1.4 h.
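The centromere search described in this abstract (minimise width and intensity along the centreline, then mask the first hit and repeat) can be sketched as follows. This is only an illustration under stated assumptions, not the authors' C++/OpenCV implementation. It assumes the two profiles have already been sampled along the centreline, that they are combined by summing their min-max normalised values, and that masking means excluding a fixed window around the first minimum. The function names and `mask_radius` are hypothetical.

```python
from typing import List, Optional, Set, Tuple


def _normalise(profile: List[float]) -> List[float]:
    """Min-max scale a profile to [0, 1]; a flat profile maps to zeros."""
    lo, hi = min(profile), max(profile)
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in profile]


def find_centromere(width: List[float], intensity: List[float],
                    masked: Optional[Set[int]] = None) -> Optional[int]:
    """Index along the centreline with the lowest combined width and intensity."""
    if not width or len(width) != len(intensity):
        return None
    masked = masked or set()
    # Assumption: the two normalised profiles are simply added together.
    score = [w + i for w, i in zip(_normalise(width), _normalise(intensity))]
    candidates = [k for k in range(len(score)) if k not in masked]
    return min(candidates, key=lambda k: score[k]) if candidates else None


def find_two_centromeres(width: List[float], intensity: List[float],
                         mask_radius: int = 2) -> Tuple[Optional[int], Optional[int]]:
    """Find the first candidate, mask a window around it, then search again."""
    first = find_centromere(width, intensity)
    if first is None:
        return None, None
    window = set(range(max(0, first - mask_radius),
                       min(len(width), first + mask_radius + 1)))
    return first, find_centromere(width, intensity, window)
```

The second index is only a candidate location. Per the abstract, deciding whether a chromosome is actually dicentric relies on additional width, intensity, and contour shape features, which this sketch does not model.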
Background: Condensation differences along the lengths of homologous, mitotic metaphase chromosomes are well known. This study reports molecular cytogenetic data showing quantifiable localized differences in condensation between homologs that are related to differences in accessibility (DA) of associated DNA probe targets. Reproducible DA was observed for ~10% of locus-specific, short (1.5-5 kb) single copy DNA probes used in fluorescence in situ hybridization. Results: Fourteen probes (from chromosomes 1, 5, 9, 11, 15, 17, 22) targeting genic and intergenic regions were developed and hybridized to cells from 10 individuals with cytogenetically-distinguishable homologs. Differences in hybridization between homologs were non-random for 8 genomic regions (RGS7, CACNA1B, GABRA5, SNRPN, HERC2, PMP22:IVS3, ADORA2B:IVS1, ACR) and were not unique to known imprinted domains or specific chromosomes. DNA probes within CCNB1, C9orf66, ADORA2B:Promoter-Ex1, PMP22:IVS4-Ex 5, and intergenic region 1p36.3 showed no DA (equivalent accessibility), while OPCML showed unbiased DA. To pinpoint probe locations, we performed 3D-structured illumination microscopy (3D-SIM). This showed that genomic regions with DA had 3.3-fold greater volumetric, integrated probe intensities and broad distributions of probe depths along axial and lateral axes of the 2 homologs, compared to a low copy probe target (NOMO1) with equivalent accessibility. Genomic regions with equivalent accessibility were also enriched for epigenetic marks of open interphase chromatin (DNase I HS, H3K27Ac, H3K4me1) to a greater extent than regions with DA. Conclusions: This study provides evidence that DA is non-random and reproducible; it is locus specific, but not unique to known imprinted regions or specific chromosomes. Non-random DA was also shown to be heritable within a 2 generation family. DNA probe volume and depth measurements of hybridized metaphase chromosomes further show locus-specific chromatin accessibility differences by super-resolution 3D-SIM. Based on these data and the analysis of interphase epigenetic marks of genomic intervals with DA, we conclude that there are localized differences in compaction of homologs during mitotic metaphase and that these differences may arise during or preceding metaphase chromosome compaction. Our results suggest new directions for locus-specific structural analysis of metaphase chromosomes, motivated by the potential relationship of these findings to underlying epigenetic changes established during interphase.
Electronic supplementary material: The online version of this article (doi:10.1186/s13039-014-0070-y) contains supplementary material, which is available to authorized users.
Named entity recognition (NER) continues to be an important task in natural language processing because it is featured as a subtask and/or subproblem in information extraction and machine translation. In Urdu language processing, it is a very difficult task. This paper proposes various deep recurrent neural network (DRNN) learning models with word embedding. Experimental results demonstrate that they improve upon current state‐of‐the‐art NER approaches for Urdu. The DRNN models evaluated include forward and bidirectional extensions of the long short‐term memory and back propagation through time approaches. The proposed models consider both language‐dependent features, such as part‐of‐speech tags, and language‐independent features, such as the “context windows” of words. The effectiveness of the DRNN models with word embedding for NER in Urdu is demonstrated using three datasets. The results reveal that the proposed approach significantly outperforms previous conditional random field and artificial neural network approaches. The best f‐measure values achieved on the three benchmark datasets using the proposed deep learning approaches are 81.1%, 79.94%, and 63.21%, respectively.
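The "context windows" feature mentioned in this abstract is commonly built by pairing each token with a fixed number of neighbours on either side. The sketch below shows one way to construct such windows in Python. It is an assumption about the general technique, not the paper's code; the function name, the window size, and the `<PAD>` symbol are hypothetical.

```python
from typing import List, Sequence


def context_windows(tokens: Sequence[str], size: int = 2,
                    pad: str = "<PAD>") -> List[List[str]]:
    """Return one window of 2 * size + 1 tokens centred on each input token.

    Positions that fall outside the sentence are filled with the pad symbol,
    so every token gets a window of the same length.
    """
    padded = [pad] * size + list(tokens) + [pad] * size
    return [padded[i:i + 2 * size + 1] for i in range(len(tokens))]
```

In a neural tagger, each window would typically be mapped to word-embedding vectors before being fed to the network; that step is omitted here.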
In Urdu, part-of-speech (POS) tagging is a challenging task because the language is morphologically rich, both inflectionally and derivationally. Verbs are generally considered more highly inflected than nouns in Urdu. POS tagging is used as a preliminary linguistic text analysis in diverse natural language processing domains such as speech processing, information extraction, machine translation, and others. It is a task that first identifies appropriate syntactic categories for each word in running text and then assigns the predicted syntactic tag to each word concerned. The current work extends our previous work, in which we presented a conditional random field (CRF)-based POS tagger with both language-dependent and language-independent feature sets. In the current study, we offer: 1) an implementation of both machine learning and deep learning models for the Urdu POS tagging task with a well-balanced language-independent feature set and 2) a discussion of the diverse challenges that make Urdu POS tagging difficult. We demonstrate the effectiveness of machine learning and deep learning models for the Urdu POS task and empirically evaluate the performance of all models on two benchmark datasets. The core models evaluated in this study are CRF, support vector machine (SVM), two variants of the deep recurrent neural network (DRNN), and a variant of the n-gram Markov model, the bigram hidden Markov model (HMM). The two DRNN variants evaluated are the forward long short-term memory (LSTM)-RNN and the LSTM-RNN with CRF output. INDEX TERMS: Urdu, part of speech (POS), conditional random field (CRF), support vector machine (SVM), recurrent neural network (RNN), hidden Markov model (HMM).
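Of the models listed in this abstract, the bigram HMM is the simplest to illustrate: tagging amounts to finding the most probable tag sequence with the Viterbi algorithm. The Python sketch below is a generic textbook formulation, not the authors' tagger. The function name, the probability floor used for unseen events, and the dictionary layout of the parameters are assumptions; real parameters would be estimated from a tagged Urdu corpus.

```python
import math
from typing import Dict, List, Sequence


def viterbi(words: Sequence[str], tags: Sequence[str],
            start_p: Dict[str, float],
            trans_p: Dict[str, Dict[str, float]],
            emit_p: Dict[str, Dict[str, float]],
            floor: float = 1e-12) -> List[str]:
    """Most probable tag sequence under a bigram HMM, computed in log space."""
    def lp(p: float) -> float:
        # Assumption: unseen events get a small floor probability.
        return math.log(max(p, floor))

    if not words:
        return []
    # Each column maps tag -> (best log score so far, previous tag).
    columns = [{t: (lp(start_p.get(t, 0.0))
                    + lp(emit_p.get(t, {}).get(words[0], 0.0)), None)
                for t in tags}]
    for word in words[1:]:
        prev_col, col = columns[-1], {}
        for t in tags:
            # Pick the previous tag that maximises score plus transition.
            best_prev = max(tags, key=lambda p: prev_col[p][0]
                            + lp(trans_p.get(p, {}).get(t, 0.0)))
            score = (prev_col[best_prev][0]
                     + lp(trans_p.get(best_prev, {}).get(t, 0.0))
                     + lp(emit_p.get(t, {}).get(word, 0.0)))
            col[t] = (score, best_prev)
        columns.append(col)
    # Backtrace from the best final tag.
    path = [max(tags, key=lambda t: columns[-1][t][0])]
    for col in reversed(columns[1:]):
        path.append(col[path[-1]][1])
    return path[::-1]
```

The table of scores has one column per word and one entry per tag, so the cost grows with sentence length times the square of the tag-set size.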
Research efforts in the field of sentiment analysis have increased exponentially in the last few years due to its applicability in areas such as online product purchasing, marketing, and reputation management. Social media and online shopping sites have become a rich source of user-generated data. Manufacturing, sales, and marketing organizations are progressively turning their eyes to this source to get worldwide feedback on their activities and products. Millions of sentences in Urdu and Roman Urdu are posted daily on social sites, such as Facebook, Instagram, Snapchat, and Twitter. Disregarding people’s opinions in Urdu and Roman Urdu and considering only the resource-rich English language leads to the loss of this vast amount of data. Our research focused on collecting research papers related to the Urdu and Roman Urdu languages and analyzing them in terms of preprocessing, feature extraction, and classification techniques. This paper contains a comprehensive study of research conducted on Roman Urdu and Urdu text for product reviews. The study is divided into categories, such as collection of relevant corpora, data preprocessing, feature extraction, classification platforms and approaches, limitations, and future work. The comparison was made by evaluating different research factors, such as corpus, lexicon, and opinions. Each reviewed paper was evaluated against a set of benchmarks and categorized accordingly. Based on the results obtained and the comparisons made, we suggest some helpful steps for future studies.