Hive is the most mature and prevalent data warehouse tool providing SQL-like interface in the Hadoop ecosystem. It is successfully used in many Internet companies and shows its value for big data processing in traditional industries. However, enterprise big data processing systems as in Smart Grid applications usually require complicated business logics and involve many data manipulation operations like updates and deletes. Hive cannot offer sufficient support for these while preserving high query performance. Hive using the Hadoop Distributed File System (HDFS) for storage cannot implement data manipulation efficiently and Hive on HBase suffers from poor query performance even though it can support faster data manipulation. There is a project based on Hive issue Hive-5317 to support update operations, but it has not been finished in Hive's latest version. Since this ACID compliant extension adopts same data storage format on HDFS, the update performance problem is not solved.In this paper, we propose a hybrid storage model called DualTable, which combines the efficient streaming reads of HDFS and the random write capability of HBase. Hive on DualTable provides better data manipulation support and preserves query performance at the same time. Experiments on a TPC-H data set and on a real smart grid data set show that Hive on DualTable is up to 10 times faster than Hive when executing update and delete operations.978-1-4799-7964-6/15/$31.00
Abstract. This paper proposes a distributed data clustering technique based on deep neural network. First, each record in the distributed database is taken as an input vector, and its characteristics are extracted and input to the input layer of the depth neural network. The weight of the connection is trained by BP algorithm, and the training of depth neural network output is realized by adjusting the weight. Finally, the data clustering results are judged according to the similarity of the current vector corresponding to the output data. Experimental results based on small-scale distributed systems show that this method has better test set accuracy than traditional k-means clustering method, and is more suitable for large-scale data clustering in the distributed environments.
Abstract-the existing distributed database management system can realize the data storage with high access bandwidth through the cluster, which has the characteristics of reliable data replication and fault detection and fast automatic system recovery. However, with the existing network platform software and computing model, there is still a need to establish a virtual organization that can implement specific requirements based on the data manipulation mechanism and data access patterns in the network. This paper proposes a large-scale data storage and management scheme based on distributed database, which is more suitable for the use and control of large-scale data access and other types of application and data access modes as needed, compared with other technical solutions. Mode can cover multiple orders of magnitude of data exchange and data processing and input / output systems. The experimental results show that this scheme has the advantages of good reliability, easy operation and support of large amount of data processing.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.