No abstract
Recent systems for converting natural language descriptions into regular expressions (regexes) have achieved some success, but typically deal with short, formulaic text and can only produce simple regexes. Real-world regexes are complex, hard to describe with brief sentences, and sometimes require examples to fully convey the user’s intent. We present a framework for regex synthesis in this setting where both natural language (NL) and examples are available. First, a semantic parser (either grammar-based or neural) maps the natural language description into an intermediate sketch, which is an incomplete regex containing holes to denote missing components. Then a program synthesizer searches over the regex space defined by the sketch and finds a regex that is consistent with the given string examples. Our semantic parser can be trained purely from weak supervision based on correctness of the synthesized regex, or it can leverage heuristically derived sketches. We evaluate on two prior datasets (Kushman and Barzilay 2013 ; Locascio et al. 2016 ) and a real-world dataset from Stack Overflow. Our system achieves state-of-the-art performance on the prior datasets and solves 57% of the real-world dataset, which existing neural systems completely fail on. 1
The implementation of quality control for multiomic data requires the widespread use of well-characterized reference materials, reference datasets, and related resources. The Quartet Data Portal was built to facilitate community access to such rich resources established in the Quartet Project. A convenient platform is provided for users to request the DNA, RNA, protein, and metabolite reference materials, as well as multi-level datasets generated across omics, platforms, labs, protocols, and batches. Interactive visualization tools are offered to assist users to gain a quick understanding of the reference datasets. Crucially, the Quartet Data Portal continuously collects, evaluates, and integrates the community-generated data of the distributed Quartet multiomic reference materials. In addition, the portal provides analysis pipelines to assess the quality of user-submitted multiomic data. Furthermore, the reference datasets, performance metrics, and analysis pipelines will be improved through periodic review and integration of multiomic data submitted by the community. Effective integration of the evolving technologies via active interactions with the community will help ensure the reliability of multiomics-based biological discoveries. The Quartet Data Portal is accessible at https://chinese-quartet.org.
Recent systems for converting natural language descriptions into regular expressions have achieved some success, but typically deal with short, formulaic text and can only produce simple regular expressions, limiting their applicability. Real-world regular expressions are complex, hard to describe with brief sentences, and sometimes require examples to fully convey the user's intent. We present a framework for regular expression synthesis in this setting where both natural language and examples are available. First, a semantic parser (either grammar-based or neural) maps the natural language description into an intermediate sketch, which is an incomplete regular expression containing holes to denote missing components. Then a program synthesizer enumerates the regular expression space defined by the sketch and finds a regular expression that is consistent with the given string examples. Our semantic parser can be trained from supervised or heuristically-derived sketches and additionally fine-tuned with weak supervision based on correctness of the synthesized regex. We conduct experiments on two public large-scale datasets (Kushman and Barzilay, 2013;Locascio et al., 2016) and a real-world dataset we collected from Stack-Overflow. Our system achieves state-of-theart performance on the public datasets and successfully solves 57% of the real-world dataset, which existing neural systems completely fail on.
Existing datasets for regular expression (regex) generation from natural language are limited in complexity; compared to regex tasks that users post on StackOverflow, the regexes in these datasets are simple, and the language used to describe them is not diverse. We introduce STRUCTUREDREGEX, a new regex synthesis dataset differing from prior ones in three aspects. First, to obtain structurally complex and realistic regexes, we generate the regexes using a probabilistic grammar with pre-defined macros observed from real-world StackOverflow posts. Second, to obtain linguistically diverse natural language descriptions, we show crowdworkers abstract depictions of the underlying regex and ask them to describe the pattern they see, rather than having them paraphrase synthetic language. Third, we augment each regex example with a collection of strings that are and are not matched by the ground truth regex, similar to how real users give examples. Our quantitative and qualitative analysis demonstrates the advantages of STRUCTUREDREGEX over prior datasets. Further experimental results using various multimodal synthesis techniques highlight the challenge presented by our dataset, including non-local constraints and multi-modal inputs. 1
Multiomics profiling is a powerful tool to characterize the same samples with complementary features orchestrating the genome, epigenome, transcriptome, proteome, and metabolome. However, the lack of ground truth hampers the objective assessment of and subsequent choice from a plethora of measurement and computational methods aiming to integrate diverse and often enigmatically incomparable omics datasets. Here we establish and characterize the first suites of publicly available multiomics reference materials of matched DNA, RNA, proteins, and metabolites derived from immortalized cell lines from a family quartet of parents and monozygotic twin daughters, providing built-in truth defined by family relationship and the central dogma. We demonstrate that the "ratio"-based omics profiling data, i.e., by scaling the absolute feature values of a study sample relative to those of a concurrently measured universal reference sample, were inherently much more reproducible and comparable across batches, labs, platforms, and omics types, thus empower the horizontal (within-omics) and vertical (cross-omics) data integration in multiomics studies. Our study identifies "absolute" feature quantitation as the root cause of irreproducibility in multiomics measurement and data integration, and urges a paradigm shift from "absolute" to "ratio"-based multiomics profiling with universal reference materials.
Batch effects are notorious technical variations that are common in multiomic data and may result in misleading outcomes. With the era of big data, tackling batch effects in multiomic integration is urgently needed. As part of the Quartet Project for quality control and data integration of multiomic profiling, we comprehensively assess the performances of seven batch-effect correction algorithms (BECAs) for mitigating the negative impact of batch effects in multiomic datasets, including transcriptomics, proteomics, and metabolomics. Performances are evaluated based on accuracy of identifying differentially expressed features, robustness of predictive models, and the ability of accurately clustering cross-batch samples into their biological sample groups. Ratio-based method is more effective and widely applicable than others, especially in cases when batch effects are highly confounded with biological factors of interests. We further provide practical guidelines for the implementation of ratio-based method using universal reference materials profiled with study samples. Our findings show the promise for eliminating batch effects and enhancing data integration in increasingly large-scale, cross-batch multiomic studies.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.