Searching for extraterrestrial, transient signals in astronomical data sets is an active area of current research. However, the literature on single-pulse detection makes little use of machine learning techniques. This paper presents a new, two-stage approach for identifying and classifying dispersed pulse groups (DPGs) in single-pulse search output. The first stage identified DPGs and extracted features to characterize them using a new peak identification algorithm that tracks sloping tendencies around local maxima in plots of signal-to-noise ratio versus dispersion measure. The second stage used supervised machine learning to classify DPGs. We created four benchmark data sets: one unbalanced and three balanced versions using three different imbalance treatments. We empirically evaluated 48 classifiers by training and testing binary and multiclass versions of six machine learning algorithms on each of the four benchmark versions. While each classifier had advantages and disadvantages, all classifiers with imbalance treatments had higher recall values than those trained on unbalanced data, regardless of the machine learning algorithm used. Based on the benchmarking results, we selected a subset of classifiers to classify the full, unlabelled data set of over 1.5 million DPGs identified in 42,405 observations made by the Green Bank Telescope. Overall, the classifiers using a multiclass ensemble tree learner in combination with two oversampling imbalance treatments were the most efficient; they identified additional known pulsars not in the benchmark data set and provided six potential discoveries, with significantly fewer false positives than the other classifiers.
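The slope-tracking idea behind the peak identification stage can be illustrated with a minimal sketch. The function name, window size, and prominence threshold below are illustrative assumptions, not the paper's actual algorithm or parameters:

```python
import numpy as np

def find_peaks_by_slope(snr, min_ratio=1.5, window=3):
    """Flag local maxima in an S/N-vs-DM curve whose neighbourhoods
    slope consistently upward into, and downward away from, the peak.

    `min_ratio` and `window` are illustrative parameters, not the
    paper's actual thresholds.
    """
    peaks = []
    for i in range(window, len(snr) - window):
        if snr[i] != max(snr[i - window:i + window + 1]):
            continue  # not a local maximum in this window
        left = snr[i - window:i]
        right = snr[i + 1:i + window + 1]
        rises = np.all(np.diff(left) >= 0)    # rising into the peak
        falls = np.all(np.diff(right) <= 0)   # falling away from it
        prominent = snr[i] >= min_ratio * np.median(snr)
        if rises and falls and prominent:
            peaks.append(i)
    return peaks
```

For a curve with a single clean peak, such as `[1, 1, 2, 3, 5, 3, 2, 1, 1]`, the sketch returns the index of the maximum; a flat curve yields no peaks because nothing clears the prominence threshold.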
We present a novel two-stage approach that combines unsupervised and supervised machine learning to automatically identify and classify single pulses in radio pulsar search data. The data, derived from the Pulsar Arecibo L-Band Feed Array (PALFA) survey, comprise 47,042 independent beams. In the first stage, we identify astrophysical pulse candidates as trial single-pulse event groups (SPEGs) by clustering single-pulse events and merging clusters that fall within the expected DM and time span of astrophysical pulses. We also present a new peak scoring algorithm to identify astrophysical peaks in S/N versus DM curves. Furthermore, we group SPEGs detected at a consistent DM, as they were likely emitted by the same source. In the second stage, we create a fully labelled benchmark data set by selecting a subset of the data, identifying SPEGs and extracting their features using the stage-1 procedures, and manually labelling individual SPEGs; we then train classifiers using supervised machine learning. Next, using the best trained classifier, we automatically classify the unlabelled SPEGs identified in the full data set. To aid the examination of dim SPEGs, we develop an algorithm that searches for an underlying periodicity among grouped SPEGs. The results showed that RandomForest with SMOTE treatment was the best learner, with a recall of 95.6% and a false positive rate of 2.0%. Besides all 60 known pulsars in the benchmark data set, the model found 32 additional known pulsars (i.e., not included in the benchmark data set) and several potential discoveries.
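The SMOTE imbalance treatment mentioned above synthesizes new minority-class examples by interpolating between a minority sample and one of its nearest minority neighbours. The following is a simplified sketch of that idea, not the full SMOTE algorithm of Chawla et al. or the paper's implementation; in practice one would use an established library such as imbalanced-learn and feed the balanced set to a RandomForest classifier:

```python
import numpy as np

def smote_oversample(X_min, n_new, k=3, rng=None):
    """Minimal SMOTE-style oversampling: create `n_new` synthetic
    minority-class points, each placed at a random position on the
    line segment between a minority sample and one of its k nearest
    minority neighbours (a simplified sketch)."""
    rng = np.random.default_rng(rng)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        dists = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbours = np.argsort(dists)[1:k + 1]  # skip the point itself
        j = rng.choice(neighbours)
        gap = rng.random()                        # interpolation fraction
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)
```

Because each synthetic point is a convex combination of two real minority samples, the augmented set stays inside the minority class's region of feature space rather than duplicating existing points.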
There is a lack of published studies providing empirical support for the assumption at the heart of product line development, namely, that through structured reuse later products will be less fault-prone. This paper presents results from an empirical study of pre-release fault and change proneness from four products in an industrial software product line. The objectives of the study are (1) to determine the association between various software metrics, as well as their correlation with the number of faults at the component level; (2) to characterize the fault and change proneness at various degrees of reuse; and (3) to determine how existing products in the software product line affect the quality of subsequently developed products and our ability to make predictions. The research results confirm, in a software product line setting, the findings of others that faults are more highly correlated with change metrics than with static code metrics. Further, the results show that variation components unique to individual products have the highest fault density and are the most prone to change. The longitudinal aspect of our research indicates that new products in this software product line benefit from the development and testing of previous products. For this case study, the number of faults in variation components of new products is predicted accurately using a linear model built on data from the previous products.
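A fault-count prediction of the kind described (a linear model trained on data from previous products) can be sketched as an ordinary least-squares fit. The single-metric setup below is a placeholder; the study's actual predictors are change metrics measured at the component level:

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares linear model y ~ X @ w + b, fit on components of
    previously developed products (illustrative, not the study's model)."""
    A = np.column_stack([X, np.ones(len(X))])  # append intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef[:-1], coef[-1]                 # weights, intercept

def predict_faults(X, w, b):
    """Predicted fault counts for components of a new product."""
    return X @ w + b
```

For example, training on components with change counts `[2, 4, 6]` and fault counts `[1, 2, 3]` recovers a slope of 0.5, so a new component with 8 changes is predicted to have 4 faults.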
The goals of cross-product reuse in a software product line (SPL) are to reduce production costs and improve quality. In addition to reuse across products, due to its evolutionary development process, an SPL also exhibits reuse across releases. In this paper, we empirically explore how these two types of reuse (across products and across releases) affect the quality of an SPL and our ability to accurately predict fault proneness. We measure quality in terms of post-release faults and consider different levels of reuse across products (i.e., common, high-reuse variation, low-reuse variation, and single-use packages) over multiple releases. Assessment results showed that quality improved for common, low-reuse variation, and single-use packages as they evolved across releases. Surprisingly, within each release, cross-product reuse did not affect the change and fault proneness of preexisting ('old') packages. Cross-product predictions based on pre-release data accurately ranked the packages according to their post-release faults and predicted the 20% most faulty packages. The predictions benefited from data available for other products in the product line, with models producing better results (1) when making predictions on smaller products (consisting mostly of common packages) rather than on larger products and (2) when trained on larger products rather than on smaller products.
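The claim that predictions "accurately ranked the packages" and "predicted the 20% most faulty packages" is typically evaluated by ordering packages by predicted fault count and measuring how many actual post-release faults fall in the top-ranked fraction. A hypothetical helper for that evaluation (the function name and interface are mine, not the paper's):

```python
def top_k_fault_coverage(predicted, actual, frac=0.2):
    """Fraction of actual post-release faults found in the packages
    ranked in the top `frac` by predicted fault count (illustrative
    evaluation helper, not the study's exact metric)."""
    order = sorted(range(len(predicted)), key=lambda i: -predicted[i])
    k = max(1, round(frac * len(predicted)))
    top = order[:k]
    return sum(actual[i] for i in top) / sum(actual)
```

For instance, with five packages where the top-ranked one holds 8 of the 10 actual faults, the top-20% coverage is 0.8.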
Recycled pulsars are old (≳ 10⁸ yr) neutron stars that are descendants of close, interacting stellar systems. In order to understand their evolution and population, we must find and study the largest possible number of recycled pulsars in a way that is as unbiased as possible. In this work, we present the discovery and timing solutions of five recycled pulsars in binary systems (PSRs J0509+0856, J0709+0458, J0732+2314, J0824+0028, J2204+2700) and one isolated millisecond pulsar (PSR J0154+1833). These were found in data from the Arecibo 327-MHz Drift-Scan Pulsar Survey (AO327). All these pulsars have a low dispersion measure (DM ≲ 45 pc cm⁻³) and a DM-derived distance of ≲ 3 kpc. Their timing solutions, with data spans ranging from 1 to ∼7 years, include precise estimates of their spin and astrometric parameters and, for the binaries, precise estimates of their Keplerian binary parameters. Their orbital periods range from about 4 to 815 days, and the minimum companion masses (assuming a pulsar mass of 1.4 M⊙) range from ∼0.06 to 1.11 M⊙. For two of the binaries we detect post-Keplerian parameters; in the case of PSR J0709+0458 we measure the component masses, albeit with low precision. In the not too distant future, measurement of the rate of advance of periastron and the Shapiro delay will allow very precise mass determinations for this system. Like several other systems found in the AO327 data, PSRs J0509+0856, J0709+0458, and J0732+2314 are now part of the NANOGrav timing array for gravitational wave detection.
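The minimum companion masses quoted above follow from the standard binary mass function, which timing measurements of the orbital period $P_b$ and projected semi-major axis $a_p \sin i$ determine directly:

```latex
f(M_p, M_c) \;=\; \frac{(M_c \sin i)^3}{(M_p + M_c)^2}
\;=\; \frac{4\pi^2}{G}\,\frac{(a_p \sin i)^3}{P_b^2}
```

Here $M_p$ and $M_c$ are the pulsar and companion masses and $i$ is the orbital inclination. The minimum companion mass is obtained by setting $i = 90^\circ$ (edge-on orbit) and assuming $M_p = 1.4\,\mathrm{M}_\odot$, then solving for $M_c$.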
Data collection for scientific applications is increasing exponentially and is forecast to soon reach peta- and exabyte scales. Applications that process and analyze scientific data must be scalable and focus on execution performance to keep pace. In the field of radio astronomy, in addition to increasingly large data sets, tasks such as the identification of transient radio signals from extrasolar sources are computationally expensive. We present a scalable approach to radio pulsar detection, written in Scala, that parallelizes candidate identification to take advantage of in-memory task processing using Apache Spark on a YARN distributed system. Furthermore, we introduce a novel automated multiclass supervised machine learning technique that we combine with feature selection to reduce the time required for candidate classification. Experimental testing on a Beowulf cluster with 15 data nodes shows that the parallel implementation of the identification algorithm offers a speedup of up to 5× over a similar multithreaded implementation. Further, we show that the combination of automated multiclass classification and feature selection speeds up the execution of the RandomForest machine learning algorithm by an average of 54%, with less than a 2% average reduction in the algorithm's ability to correctly classify pulsars. The generalizability of these results is demonstrated using two real-world radio astronomy data sets.
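Feature selection speeds up classification because the forest is trained and evaluated on fewer columns. One common family of approaches ranks features by how well they separate the classes and keeps only the top-ranked ones. The ANOVA-style score below is an illustrative stand-in, not the paper's actual selection method (and is in Python rather than the paper's Scala):

```python
import numpy as np

def rank_features(X, y):
    """Rank features by a between-class / within-class variance ratio
    (an ANOVA-style score); features that separate the classes well
    score high. Illustrative sketch, not the paper's method."""
    classes = np.unique(y)
    grand = X.mean(axis=0)
    between = sum(np.sum(y == c) * (X[y == c].mean(axis=0) - grand) ** 2
                  for c in classes)
    within = sum(((X[y == c] - X[y == c].mean(axis=0)) ** 2).sum(axis=0)
                 for c in classes)
    score = between / (within + 1e-12)   # avoid division by zero
    return np.argsort(score)[::-1]       # most discriminative first
```

A classifier would then be trained on `X[:, rank_features(X, y)[:k]]` for some small `k`, trading a little accuracy for a large reduction in training and prediction time, which matches the 54% speedup versus under 2% accuracy loss reported above.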