Summary
Sensors are widely used in manufacturing, railways, aerospace, cars, medicine, robotics, and many other aspects of everyday life. There is an increasing need to capture, store, and analyse the dynamic semi-structured data from those sensors. A similar growth of semi-structured data in the modern web has led to the creation of NoSQL data stores for scalability, availability, and performance, and of large-scale data processing frameworks for parallel analysis. NoSQL data stores such as MongoDB and data processing frameworks such as Apache Hadoop have been studied for scientific data analysis. However, there has been no study on MongoDB with Apache Spark, and there is limited understanding of how sensor data management can benefit from these technologies, specifically for ingesting high-velocity sensor data and retrieving high-volume data in parallel. In this paper, we evaluate the performance of sharded and non-sharded MongoDB databases with Apache Spark to identify the right software environment for sensor data management.
KEYWORDS
Apache Spark, data ingestion, data retrieval, data storage, MongoDB, sensor data management
INTRODUCTION
Sensors are becoming an integral part of many modern applications. A sensor's ability to sense environmental conditions such as temperature and humidity can provide valuable information to applications for real-time or batch analytics. In some applications, a single sensor device with multiple sensing capabilities is used to capture environmental data. For example, many fitness trackers combine an accelerometer, a humidity sensor, and a heart rate monitor in a single device. For applications that require complex analytics, a single device with multiple sensing capabilities may not be sufficient. For example, in train safety monitoring systems, multiple groups of sensors with different capabilities may need to be deployed in different carriages to capture the movement of the train as a whole. While the data produced by the sensors are beneficial, managing the collected data is difficult when a large number of sensors are deployed.
The volume of data collected, the velocity of data ingestion, and the variety of data formats generated by different sensor types are the three main challenges in managing data captured by large sensor systems. As small sensors become more affordable, some sensing applications replace specially built, high-sampling-rate sensor systems with networks of off-the-shelf sensors with lower sampling capabilities.1,2 Comparable accuracy can be achieved at the lower sampling rate by deploying more sensors. The volume of data captured by a specially built sensor system and by a network of sensors may be similar. However, a network of sensors usually does not have large temporary data storage, whereas a specially built sensor system usually includes a better processor and large storage capacity. In a network of sensors, data needs to be transferred to the main processing centre more often but in smaller packets. This almost instantane...