The success of a text simplification system heavily depends on the quality and quantity of complex-simple sentence pairs in the training corpus, which are extracted by aligning sentences between parallel articles. To evaluate and improve sentence alignment quality, we create two manually annotated sentence-aligned datasets from two commonly used text simplification corpora, Newsela and Wikipedia. We propose a novel neural CRF alignment model which not only leverages the sequential nature of sentences in parallel documents but also utilizes a neural sentence pair model to capture semantic similarity. Experiments demonstrate that our proposed approach outperforms all the previous work on monolingual sentence alignment task by more than 5 points in F1. We apply our CRF aligner to construct two new text simplification datasets, NEWSELA-AUTO and WIKI-AUTO, which are much larger and of better quality compared to the existing datasets. A Transformer-based seq2seq model trained on our datasets establishes a new state-of-the-art for text simplification in both automatic and human evaluation. 1
We design a novel speaker diarization system for the first DI-HARD challenge by integrating several important modules of speech denoising, speech activity detection (SAD), i-vector design, and scoring strategy. One main contribution is the proposed long short-term memory (LSTM) based speech denoising model. By fully utilizing the diversified simulated training data and advanced network architecture using progressive multitask learning with dense structure, the denoising model demonstrates the strong generalization capability to realistic noisy environments. The enhanced speech can boost the performance for the subsequent SAD, segmentation and clustering. To the best of our knowledge, this is the first time we show significant improvements of deep learning based single-channel speech enhancement over state-of-the-art diarization systems in highly mismatch conditions. For the design of i-vector extraction, we adopt a residual convolutional neural network trained on large dataset including more than 30,000 people. Finally, by score fusion of different i-vectors based on all these techniques, our systems yield diarization error rates (DERs) of 24.56% and 36.05% on the evaluation sets of Track1 and Track2, which are both in the second place among 14 and 11 participating teams, respectively.
In order to enhance the communication between sensor networks in the Internet of things (IoT), it is indispensable to establish the semantic connections between sensor ontologies in this field. For this purpose, this paper proposes an up-and-coming sensor ontology integrating technique, which uses debate mechanism (DM) to extract the sensor ontology alignment from various alignments determined by different matchers. In particular, we use the correctness factor of each matcher to determine a correspondence’s global factor, and utilize the support strength and disprove strength in the debating process to calculate its local factor. Through comprehensively considering these two factors, the judgment factor of an entity mapping can be obtained, which is further applied in extracting the final sensor ontology alignment. This work makes use of the bibliographic track provided by the Ontology Alignment Evaluation Initiative (OAEI) and five real sensor ontologies in the experiment to assess the performance of our method. The comparing results with the most advanced ontology matching techniques show the robustness and effectiveness of our approach.
BackgroundIn biomedical research, data sharing and information exchange are very important for improving quality of care, accelerating discovery, and promoting the meaningful secondary use of clinical data. A big concern in biomedical data sharing is the protection of patient privacy because inappropriate information leakage can put patient privacy at risk.MethodsIn this study, we deployed a grid logistic regression framework based on Secure Multi-party Computation (SMAC-GLORE). Unlike our previous work in GLORE, SMAC-GLORE protects not only patient-level data, but also all the intermediary information exchanged during the model-learning phase.ResultsThe experimental results demonstrate the feasibility of secure distributed logistic regression across multiple institutions without sharing patient-level data.ConclusionsIn this study, we developed a circuit-based SMAC-GLORE framework. The proposed framework provides a practical solution for secure distributed logistic regression model learning.
Motivation The generalized linear mixed model (GLMM) is an extension of the generalized linear model (GLM) in which the linear predictor takes random effects into account. Given its power of precisely modeling the mixed effects from multiple sources of random variations, the method has been widely used in biomedical computation, for instance in the genome-wide association studies (GWASs) that aim to detect genetic variance significantly associated with phenotypes such as human diseases. Collaborative GWAS on large cohorts of patients across multiple institutions is often impeded by the privacy concerns of sharing personal genomic and other health data. To address such concerns, we present in this paper a privacy-preserving Expectation–Maximization (EM) algorithm to build GLMM collaboratively when input data are distributed to multiple participating parties and cannot be transferred to a central server. We assume that the data are horizontally partitioned among participating parties: i.e. each party holds a subset of records (including observational values of fixed effect variables and their corresponding outcome), and for all records, the outcome is regulated by the same set of known fixed effects and random effects. Results Our collaborative EM algorithm is mathematically equivalent to the original EM algorithm commonly used in GLMM construction. The algorithm also runs efficiently when tested on simulated and real human genomic data, and thus can be practically used for privacy-preserving GLMM construction. We implemented the algorithm for collaborative GLMM (cGLMM) construction in R. The data communication was implemented using the rsocket package. Availability and implementation The software is released in open source at https://github.com/huthvincent/cGLMM. Supplementary information Supplementary data are available at Bioinformatics online.
The human genome can reveal sensitive information and is potentially re-identifiable, which raises privacy and security concerns about sharing such data on wide scales. In 2016, we organized the third Critical Assessment of Data Privacy and Protection competition as a community effort to bring together biomedical informaticists, computer privacy and security researchers, and scholars in ethical, legal, and social implications (ELSI) to assess the latest advances on privacy-preserving techniques for protecting human genomic data. Teams were asked to develop novel protection methods for emerging genome privacy challenges in three scenarios: Track (1) data sharing through the Beacon service of the Global Alliance for Genomics and Health. Track (2) collaborative discovery of similar genomes between two institutions; and Track (3) data outsourcing to public cloud services. The latter two tracks represent continuing themes from our 2015 competition, while the former was new and a response to a recently established vulnerability. The winning strategy for Track 1 mitigated the privacy risk by hiding approximately 11% of the variation in the database while permitting around 160,000 queries, a significant improvement over the baseline. The winning strategies in Tracks 2 and 3 showed significant progress over the previous competition by achieving multiple orders of magnitude performance improvement in terms of computational runtime and memory requirements. The outcomes suggest that applying highly optimized privacy-preserving and secure computation techniques to safeguard genomic data sharing and analysis is useful. However, the results also indicate that further efforts are needed to refine these techniques into practical solutions.
Data heterogeneity is the obstacle for the resource sharing on Semantic Web (SW), and ontology is regarded as a solution to this problem. However, since different ontologies are constructed and maintained independently, there also exists the heterogeneity problem between ontologies. Ontology matching is able to identify the semantic correspondences of entities in different ontologies, which is an effective method to address the ontology heterogeneity problem. Due to huge memory consumption and long runtime, the performance of the existing ontology matching techniques requires further improvement. In this work, an extended compact genetic algorithm-based ontology entity matching technique (ECGA-OEM) is proposed, which uses both the compact encoding mechanism and linkage learning approach to match the ontologies efficiently. Compact encoding mechanism does not need to store and maintain the whole population in the memory during the evolving process, and the utilization of linkage learning protects the chromosome’s building blocks, which is able to reduce the algorithm’s running time and ensure the alignment’s quality. In the experiment, ECGA-OEM is compared with the participants of ontology alignment evaluation initiative (OAEI) and the state-of-the-art ontology matching techniques, and the experimental results show that ECGA-OEM is both effective and efficient.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.