Few Japanese–Chinese bilingual corpora are currently large enough to serve as training data for neural machine translation (NMT), and fewer still include spoken language such as daily conversation. In this research, we construct a Japanese–Chinese bilingual corpus of substantial scale by crawling subtitle data for movies and TV series from websites. We calculated BLEU scores for the constructed WCC-JC (Web Crawled Corpus—Japanese and Chinese) and for the other corpora used for comparison. We also manually evaluated the output of the translation model trained on WCC-JC to confirm the quality and effectiveness of the corpus.
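For readers unfamiliar with the BLEU metric mentioned above, the following is a minimal sketch of corpus-level BLEU (modified n-gram precision up to 4-grams, combined with a brevity penalty). It is not the evaluation script used for WCC-JC. The function name `corpus_bleu` is hypothetical, a single reference per hypothesis is assumed, and the inputs are assumed to be already tokenized, since Japanese and Chinese text has no whitespace word boundaries.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU with one reference per hypothesis (illustrative sketch)."""
    matches = [0] * max_n   # clipped n-gram matches, summed over the corpus
    totals = [0] * max_n    # hypothesis n-gram counts, summed over the corpus
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp, n)
            ref_counts = ngrams(ref, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())
    if min(matches) == 0:
        return 0.0  # some n-gram order has no match, so the geometric mean is zero
    # Geometric mean of the modified n-gram precisions.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: penalize hypotheses shorter than the references.
    brevity = 1.0 if hyp_len > ref_len else math.exp(1.0 - ref_len / hyp_len)
    return brevity * math.exp(log_precision)
```

This sketch applies no smoothing, so any corpus with zero matches at some n-gram order scores 0; established evaluation tools handle tokenization and such edge cases more carefully.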
The mean-shift method is a convenient mode-seeking method. It relies on the fact that the sample mean over an analysis window, or kernel, is biased from the kernel center toward the direction in which samples are densest; by repeatedly moving the kernel to that mean, the method iteratively seeks the densest point of the samples, or the sample mode. A smaller kernel leads to convergence to a local mode that appears because of statistical fluctuation. A larger kernel leads to a biased mode estimate affected by other clusters, abnormal values, or outliers, if any exist outside the major cluster. Optimal selection of the kernel size, called the bandwidth in much of the literature, is therefore an important problem. In this work, assuming that the major cluster follows a Gaussian probability density distribution and that outliers do not affect the sample mode of the major cluster, and adopting a Gaussian kernel, we propose a new mean-shift method in which both the mean vector and the covariance matrix of the major cluster are estimated in each iteration. The kernel size and shape are then updated adaptively. Numerical experiments indicate that the mean vector, the covariance matrix, and the number of samples of the major cluster can be estimated stably. Because the kernel can be adjusted not only to an isotropic shape but also to an anisotropic shape according to the sample distribution, the proposed method has higher estimation precision than the general mean-shift method.
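The idea of re-estimating the kernel in each iteration can be sketched in a few lines. This is a minimal 2-D illustration under stated assumptions, not the authors' algorithm. The function name `adaptive_mean_shift` and the update rule are hypothetical: the kernel covariance is set to twice the kernel-weighted sample covariance. For Gaussian samples with covariance Σ weighted by a Gaussian kernel with covariance H, the weighted covariance tends to (Σ⁻¹ + H⁻¹)⁻¹, so this rule has H = Σ as a fixed point.

```python
import math


def adaptive_mean_shift(points, start, cov, iters=100):
    """Gaussian-kernel mean-shift in 2-D with an adaptive, anisotropic kernel.

    points: list of (x, y) samples
    start:  initial kernel center (x, y)
    cov:    initial kernel covariance as (a, b, d), meaning [[a, b], [b, d]]
    Assumed update rule (not from the source): kernel covariance is set to
    twice the kernel-weighted sample covariance after every shift.
    """
    mx, my = start
    a, b, d = cov
    for _ in range(iters):
        # Inverse of the symmetric 2x2 kernel covariance.
        det = a * d - b * b
        ia, ib, id_ = d / det, -b / det, a / det
        # Gaussian kernel weight of every sample around the current center.
        w = [
            math.exp(-0.5 * (ia * (x - mx) ** 2
                             + 2.0 * ib * (x - mx) * (y - my)
                             + id_ * (y - my) ** 2))
            for x, y in points
        ]
        sw = sum(w)
        # Mean-shift step: move the center to the weighted sample mean.
        mx = sum(wi * x for wi, (x, _) in zip(w, points)) / sw
        my = sum(wi * y for wi, (_, y) in zip(w, points)) / sw
        # Weighted covariance around the new center.
        sxx = sum(wi * (x - mx) ** 2 for wi, (x, _) in zip(w, points)) / sw
        syy = sum(wi * (y - my) ** 2 for wi, (_, y) in zip(w, points)) / sw
        sxy = sum(wi * (x - mx) * (y - my) for wi, (x, y) in zip(w, points)) / sw
        # Adapt kernel size and shape (assumed rule, see docstring).
        a, b, d = 2.0 * sxx, 2.0 * sxy, 2.0 * syy
    return (mx, my), (a, b, d)
```

The sketch runs a fixed number of iterations and assumes the initial kernel overlaps the major cluster; if the starting center is far from all samples, the weights underflow to zero and the update fails.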
Cytosine methylation plays an important role in many biological regulation processes. The current gold-standard method for analyzing cytosine methylation is based on sodium bisulfite treatment and high-throughput sequencing technologies. In this paper, we introduce a new tool called TAMeBS for cytosine methylation analysis using bisulfite sequencing data. It aims to align long bisulfite-treated DNA reads to a reference genome sequence with high mapping efficiency and to estimate the methylation status of each cytosine accurately. Our approach builds on recent advances in alignment techniques, including the bi-directional FM-index, approximate seeds, and a likelihood-ratio scoring matrix designed specifically for aligning bisulfite-treated DNA reads. We compared TAMeBS with several popular bisulfite-treated read mapping tools on both simulated and real data. Experimental results showed that TAMeBS could detect many more uniquely best-mapped reads than the other tested tools while achieving a good balance between sensitivity and precision. The source code of TAMeBS is freely available at https://sourceforge.net/projects/tamebs/.
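As background for the methylation-status estimate mentioned above: bisulfite treatment converts unmethylated cytosines so that they are read as T, while methylated cytosines are still read as C. A common simple estimate of the methylation level at a reference cytosine is therefore C / (C + T) over the reads covering it. The sketch below illustrates only that counting step. It is not TAMeBS's estimator, the function name `methylation_levels` is hypothetical, and it assumes ungapped forward-strand alignments given as (start position, read sequence) pairs.

```python
from collections import defaultdict


def methylation_levels(reference, aligned_reads):
    """Estimate per-cytosine methylation as C / (C + T) from aligned reads.

    reference:     reference sequence (forward strand), e.g. "ACGTCA"
    aligned_reads: iterable of (start, read) pairs, where start is the 0-based
                   reference position of the read's first base (ungapped).
    """
    counts = defaultdict(lambda: [0, 0])  # position -> [C count, T count]
    for start, read in aligned_reads:
        for offset, base in enumerate(read):
            pos = start + offset
            # Only reference cytosines carry methylation information here.
            if pos >= len(reference) or reference[pos] != "C":
                continue
            if base == "C":      # read as C: cytosine was protected (methylated)
                counts[pos][0] += 1
            elif base == "T":    # read as T: cytosine was converted (unmethylated)
                counts[pos][1] += 1
            # Any other base is treated as a mismatch and ignored.
    return {pos: c / (c + t) for pos, (c, t) in sorted(counts.items())}
```

A production tool would also handle the reverse strand, gapped alignments, base qualities, and incomplete conversion, all of which this sketch leaves out.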
To address the long path-planning time and the large number of redundant points in the rapidly-exploring random trees algorithm, this paper proposes an improved algorithm based on a parent point priority determination strategy and a real-time optimization strategy. First, to shorten the path-planning time, the parent point is determined before a new point is generated, which eliminates the costly traversal of the random tree to search for the parent point each time a new point is generated. Second, a real-time optimization strategy is added, whose core idea is to compare the distances from a new point, its parent point, and two ancestor points to the target point when the new point is generated, keeping only new points that help the random tree grow and thereby reducing the number of redundant points. Simulation results for 3-dimensional path planning showed that the success rate of the proposed algorithm, which combines the two strategies, was close to 100%. Compared with the rapidly-exploring random trees algorithm, the number of points was reduced by more than 93.25%, the path-planning time by more than 91.49%, and the path length by more than 7.88%. An IRB1410 manipulator was used to build a test platform in a laboratory environment. The path obtained by the proposed algorithm enabled the manipulator to avoid obstacles safely and reach the target point. We conclude that the proposed strategy performs better in terms of success rate, number of points, planning time, and path length.
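To show where the tree traversal mentioned above occurs, here is a minimal sketch of the baseline rapidly-exploring random trees algorithm, not the improved algorithm proposed in the paper. It assumes an obstacle-free 2-D square workspace for brevity, and the function name `basic_rrt` and all parameter values are hypothetical. The linear scan for the nearest node is the per-iteration parent search that the parent point priority determination strategy is designed to avoid.

```python
import math
import random


def basic_rrt(start, goal, step=0.5, goal_tol=0.5, max_iters=5000,
              bounds=(0.0, 10.0), seed=0):
    """Baseline RRT in an obstacle-free 2-D square (illustrative sketch).

    Returns (path, node_count); path is None if the goal was not reached.
    """
    rng = random.Random(seed)
    nodes = [start]      # tree vertices
    parent = [None]      # parent[i] is the index of node i's parent
    for _ in range(max_iters):
        sample = (rng.uniform(*bounds), rng.uniform(*bounds))
        # Parent search: scan the whole tree for the node nearest the sample.
        near = min(range(len(nodes)), key=lambda i: math.dist(nodes[i], sample))
        dist = math.dist(nodes[near], sample)
        if dist == 0.0:
            continue
        # Steer from the nearest node toward the sample by at most one step.
        t = min(1.0, step / dist)
        new = (nodes[near][0] + t * (sample[0] - nodes[near][0]),
               nodes[near][1] + t * (sample[1] - nodes[near][1]))
        nodes.append(new)
        parent.append(near)
        if math.dist(new, goal) <= goal_tol:
            # Walk back through the parents to recover the path.
            path, i = [], len(nodes) - 1
            while i is not None:
                path.append(nodes[i])
                i = parent[i]
            return path[::-1], len(nodes)
    return None, len(nodes)
```

Because every iteration scans all existing nodes, the cost of the parent search grows with the size of the tree, and many of the nodes added never end up on the returned path.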