In this paper, a simple yet efficient activity recognition method for first-person video is introduced. The proposed method is well suited to representing high-dimensional features such as those extracted from convolutional neural networks (CNNs). The per-frame (per-segment) extracted features are treated as a set of time series, and inter- and intra-time-series relations are employed to build the video descriptors. To find the inter-series relations, the series are grouped and the linear correlation between each pair of groups is calculated; these relations capture scene dynamics and local motions. The introduced grouping strategy considerably reduces the computational cost. Furthermore, we split the series along the temporal direction to preserve long-term motions and to better focus on each local time window. To extract cyclic motion patterns, which can be regarded as primary components of various activities, intra-time-series correlations are exploited. The representation method yields highly discriminative features that can be classified linearly. The experiments confirm that our method outperforms the state-of-the-art methods on recognizing first-person activities on two challenging first-person datasets.

Index Terms-Human activity recognition; first-person activity recognition; feature encoding; feature representation; convolutional neural network.

I. INTRODUCTION

Human action recognition has become an active research field in the recent decade [1-6], owing to its numerous applications, such as visual surveillance, entertainment devices, elderly people assistance, human-computer interaction, and video indexing/retrieval. Despite the many efforts on recognizing human activities, it remains a difficult problem in real-world applications. Intrinsic similarities between different actions yield small inter-class variations.
On the other hand, there are large intra-class variations caused by camera motion, illumination changes, background clutter, viewpoint changes, irrelevant motions, and varying styles/speeds.

Videos taken from an actor's own viewpoint are called first-person videos. Although much research has been conducted on third-person activity recognition, those methods cannot be directly applied to first-person videos because of the major differences between the two kinds of video. The main difference is that the person wearing the camera is involved in the activity; as a consequence, strong ego-motion frequently occurs in this kind of video. It should also be noted that most first-person video analysis requires a real-time response, so the computational complexity should be considered more carefully [7].

In recent years, the number of videos captured from the first-person viewpoint has grown rapidly due to the increasing use of wearable cameras [8]. Many applications have emerged, such as life logging, elderly (or blind) people assistance, military applications, and ro...
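The correlation-based representation summarized in the abstract can be sketched in code. The following is a minimal illustration, not the authors' implementation: it assumes a per-frame feature matrix of shape (T, D), averages each group of feature dimensions into a representative series, computes inter-group Pearson correlations per temporal split, and adds a lag-1 intra-series correlation per group as a simple proxy for cyclic-motion cues. The function name and all parameters are hypothetical.

```python
import numpy as np

def correlation_descriptor(features, n_groups=4, n_splits=2):
    """Sketch of a correlation-based video descriptor.

    features : (T, D) array of per-frame (or per-segment) CNN features,
               each of the D dimensions viewed as a time series.
    Returns a 1-D descriptor concatenating, for each temporal split:
      - inter-group linear correlations (upper triangle), and
      - per-group lag-1 autocorrelations (intra-series, cyclic cue).
    """
    T, D = features.shape
    groups = np.array_split(np.arange(D), n_groups)   # group the series
    splits = np.array_split(np.arange(T), n_splits)   # split along time
    parts = []
    for idx in splits:
        seg = features[idx]                           # (t, D) local window
        # One representative series per group: mean over its dimensions.
        g = np.stack([seg[:, gi].mean(axis=1) for gi in groups], axis=1)
        # Inter-group relations: pairwise Pearson correlation matrix.
        C = np.corrcoef(g.T)
        iu = np.triu_indices(n_groups, k=1)
        parts.append(C[iu])
        # Intra-series relation: lag-1 correlation of each group series.
        ac = [np.corrcoef(g[:-1, j], g[1:, j])[0, 1] for j in range(n_groups)]
        parts.append(np.array(ac))
    return np.nan_to_num(np.concatenate(parts))       # guard constant series
```

Grouping before correlating is what keeps the cost manageable: with D raw series there are O(D^2) pairs, while with n_groups representative series there are only n_groups*(n_groups-1)/2.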