In object tracking systems, clients often capture video, encode it, and transmit it to a server that performs the actual machine task. In this paper we propose an alternative architecture in which the client transmits features to the server instead. Specifically, we partition the Joint Detection and Embedding (JDE) person tracking network into client-side and server-side sub-networks and code the intermediate tensors, i.e., the features. The features are compressed for transmission using a Deep Neural Network (DNN) that we design and train specifically for the tracking task. The DNN uses trainable non-uniform quantizers, conditional probability estimators, and hierarchical coding, concepts that have previously been used in neural-network-based image and video compression. Additionally, the DNN includes a novel parameterized dual-path layer that comprises an autoencoder in one path and a convolution layer in the other. The tensors output by the two paths are added before being consumed by subsequent layers. The parameter value of this dual-path layer controls the output channel count and, correspondingly, the bitrate of the transmitted bitstream. We demonstrate that our model improves coding efficiency by 43.67% over the state-of-the-art Versatile Video Coding standard, which codes the source video in the pixel domain.
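The dual-path idea can be illustrated with a minimal sketch in plain Python. The abstract does not give the layer's internals, so everything specific here is an assumption made for illustration: the class name `DualPathLayer`, the use of 1x1 convolutions (per-position channel mixing), the ReLU in the autoencoder bottleneck, the `bottleneck` width, and the random, untrained weights. Only the overall shape follows the description: an autoencoder path and a convolution path whose outputs are added, with a parameter that sets the output channel count.

```python
import random


def conv1x1(weights, vec):
    """Apply a 1x1 convolution at one spatial position (a matrix-vector product)."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def relu(vec):
    return [max(0.0, v) for v in vec]


def init_weights(rows, cols, rng):
    """Random placeholder weights; a real layer would learn these."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class DualPathLayer:
    """Hypothetical dual-path layer: autoencoder path + convolution path, summed."""

    def __init__(self, in_channels, out_channels, bottleneck, seed=0):
        rng = random.Random(seed)
        self.out_channels = out_channels  # the parameter that sets the channel count
        # Path 1 (assumed form): encoder down to a bottleneck, decoder back up.
        self.enc = init_weights(bottleneck, in_channels, rng)
        self.dec = init_weights(out_channels, bottleneck, rng)
        # Path 2 (assumed form): a single 1x1 convolution.
        self.conv = init_weights(out_channels, in_channels, rng)

    def __call__(self, features):
        """features: list of per-position channel vectors, each of length in_channels."""
        out = []
        for vec in features:
            ae_path = conv1x1(self.dec, relu(conv1x1(self.enc, vec)))
            conv_path = conv1x1(self.conv, vec)
            # The two path outputs are added element-wise.
            out.append([a + c for a, c in zip(ae_path, conv_path)])
        return out


# Six spatial positions with four input channels each (made-up values).
features = [[1.0, 0.5, -0.2, 0.3] for _ in range(6)]
narrow = DualPathLayer(in_channels=4, out_channels=2, bottleneck=2)(features)
wide = DualPathLayer(in_channels=4, out_channels=8, bottleneck=2)(features)
print(len(narrow[0]), len(wide[0]))  # prints: 2 8
```

In this sketch a smaller `out_channels` yields fewer values per position to quantize and entropy-code, which is one plausible reading of how the parameter relates to bitrate; the actual mechanism in the proposed DNN may differ.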