
Human action recognition using support vector machines and 3D convolutional neural networks

Recently, deep learning approaches have been widely used to improve recognition accuracy in a variety of application areas. In this paper, both deep convolutional neural networks (CNNs) and support vector machines (SVMs) are employed for the human action recognition task.



Course Duration
Approx 8

Course Price
₹ 15000

Course Level
Advanced

Course Content

Recently, deep learning approaches have been widely used to improve recognition accuracy in a variety of application areas. In this paper, both deep convolutional neural networks (CNNs) and support vector machines (SVMs) are employed for the human action recognition task. First, a 3D CNN is used to extract spatial and temporal features from adjacent video frames. A support vector machine then classifies each instance based on the extracted features. Both the number of CNN layers and the resolution of the input frames were reduced to meet limited memory constraints. The proposed architecture was trained and evaluated on the KTH action recognition dataset and achieved good performance.
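The pipeline described above — 3D convolutions over stacks of adjacent frames, followed by an SVM on the pooled responses — can be sketched in a few lines. This is a minimal illustration, not the paper's actual architecture: the single convolution layer, the global max-pooling, the kernel shapes, and the sub-gradient SVM trainer are all simplifying assumptions made for brevity.

```python
import numpy as np

def conv3d(volume, kernel):
    """Valid-mode 3D convolution over a (frames, height, width) stack."""
    t, h, w = kernel.shape
    T, H, W = volume.shape
    out = np.empty((T - t + 1, H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                out[i, j, k] = np.sum(volume[i:i + t, j:j + h, k:k + w] * kernel)
    return out

def extract_features(volume, kernels):
    """One feature per 3D kernel: ReLU then global max-pool of the response."""
    return np.array([np.maximum(conv3d(volume, k), 0.0).max() for k in kernels])

class LinearSVM:
    """One-vs-rest linear SVM trained with hinge-loss sub-gradient descent
    (a stand-in for the SVM stage; any off-the-shelf SVM would do)."""
    def __init__(self, n_classes, n_features, lr=0.05, reg=1e-3, epochs=300):
        self.W = np.zeros((n_classes, n_features))
        self.b = np.zeros(n_classes)
        self.lr, self.reg, self.epochs = lr, reg, epochs

    def fit(self, X, y):
        for _ in range(self.epochs):
            for c in range(len(self.b)):
                t = np.where(y == c, 1.0, -1.0)          # +1/-1 targets per class
                margins = t * (X @ self.W[c] + self.b[c])
                active = margins < 1                      # margin-violating samples
                grad_w = self.reg * self.W[c] - (t[active, None] * X[active]).sum(0) / len(X)
                grad_b = -t[active].sum() / len(X)
                self.W[c] -= self.lr * grad_w
                self.b[c] -= self.lr * grad_b

    def predict(self, X):
        return np.argmax(X @ self.W.T + self.b, axis=1)
```

In this sketch each 3D kernel would be learned by the CNN; here they are simply given, since only the data flow (frame stack → spatio-temporal features → SVM decision) is being illustrated.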


Introduction

Human action recognition has been one of the important research areas of both computer vision and machine learning for more than ten years, because it has many potential applications such as surveillance systems, human-computer interaction, and sports video annotation [1-5]. Early human action recognition approaches take a number of frames from a video and extract a set of hand-crafted features such as 3D-SIFT [6], extended SURF [7], HOG3D [8], Space-Time Interest Points (STIPs) [9], and dense optical-flow trajectories [10]. Recently, deep learning architectures have been used to replace this feature engineering step with an automated process. In this paper, we use 3D Convolutional Neural Networks (CNNs) to extract features along the spatial and temporal dimensions; the extracted features are then classified by a support vector machine. Our proposed system is trained and evaluated on the KTH dataset (Fig. 1), which consists of 6 action classes (boxing, hand-waving, hand-clapping, jogging, running, and walking) performed by 25 actors, for a total of 599 videos.
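Descriptors such as HOG3D generalize the 2D histogram-of-oriented-gradients idea to video volumes. The 2D building block can be sketched as follows — a deliberately simplified, single-cell version: the bin count and the unsigned-orientation convention are assumptions, and the block normalization of the full HOG descriptor is omitted.

```python
import numpy as np

def hog_histogram(image, n_bins=9):
    """Histogram of oriented gradients over one cell (the whole image here).

    Orientations are taken unsigned (modulo pi) and each pixel votes with
    its gradient magnitude, as in the usual HOG formulation.
    """
    gy, gx = np.gradient(image.astype(float))         # per-pixel gradients
    magnitude = np.hypot(gx, gy)
    orientation = np.arctan2(gy, gx) % np.pi          # unsigned, in [0, pi)
    bins = np.floor(orientation / np.pi * n_bins).astype(int)
    bins = np.clip(bins, 0, n_bins - 1)
    hist = np.zeros(n_bins)
    np.add.at(hist, bins.ravel(), magnitude.ravel())  # magnitude-weighted votes
    return hist / (hist.sum() + 1e-9)                 # L1-normalize
```

A vertical edge, for example, produces purely horizontal gradients, so all of the histogram mass lands in the bin around orientation zero.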

A. Single-layered action recognition

The authors in [14] combined motion history images (MHI) and appearance information for the human action recognition task. The first feature is the foreground image, obtained by background subtraction. The second is the histogram of oriented gradients (HOG) feature, which characterizes the directions and magnitudes of edges and corners. SMILE-SVM (simulated annealing multiple-instance learning support vector machine) was used as the classifier. In [15], global and local features were collected to classify and recognize human activities: the global feature was based on the binary motion energy image (MEI) and a contour coding of it, whereas the local feature was an object's bounding box, and the feature points were classified using a multi-class SVM. In [16], a trajectory-based approach was used, tracking joint positions on the human body to recognize actions. Wang et al. [17] used dense optical-flow trajectories, computing HOG, HOF, and MBH (motion boundary histogram) descriptors around the interest points. The Harris3D detector [18] and the Dollar detector [19] are further examples of optical-flow-based approaches. In [20], space-time interest points are detected using the Harris3D detector and assigned labels of a related class by a Bayesian classifier; the collected features and labels are then used by a PCA-SVM classifier to recognize the action class. The authors in [21] employed optical flow and foreground flow to extract shape-based motion features for persons, objects, and scenes; these feature channels were the inputs to a multiple-instance learning (MIL) framework that locates the region of interest in a video. In [22], a 3D optical flow constructed from eight weighted 2D flow fields was used to implement view-independent action recognition, with the 3D Motion Context (3D-MC) and Harmonic Motion Context (HMC) descriptors representing the 3D optical-flow fields.
Taking into account the different speed of each actor, the 3D-MC and HMC descriptors were classified into a set of human actions using normalized correlation. The authors in [23] represented actions by a sequence of prototypes, where each prototype is based on a shape-motion feature. K-means was used to build a hierarchical tree of prototypes, which is used to generate the sequence, and each prototype is matched efficiently against the tree using the FastDTW algorithm. Standard hidden Markov models are also widely used in state-model-based approaches [24-26]. In [27], an HMM is used to recognize human actions. In [28], a discriminative semi-Markov model is combined with a Viterbi-like dynamic programming algorithm to solve the inference problem.

B. Hierarchical action recognition

In [29], a propagation network (P-net) based hierarchical approach was used for concurrent and sequential sub-activities. In [30], a four-layered hierarchical probabilistic latent model is proposed: spatial-temporal features are extracted and clustered using a hierarchical Bayesian model to form basic actions, and an LDA-based hierarchical probabilistic latent model with local features is then used to recognize the action. In [31], a four-level hierarchy is proposed in which actions are represented by a set of grammar rules over spatial and temporal information.
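The prototype-sequence matching discussed above relies on dynamic time warping. FastDTW approximates the classic quadratic recurrence with a multi-resolution scheme; only the exact version is sketched here, and the 1-D absolute-difference cost is an assumption made for simplicity.

```python
import numpy as np

def dtw_distance(a, b):
    """Exact DTW distance between two 1-D sequences with |x - y| local cost."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # extend the cheapest of the match / insertion / deletion moves
            D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return D[n, m]
```

Because the warp path may repeat elements, a sequence and a time-stretched copy of it have distance zero — which is exactly why DTW suits actors performing the same action at different speeds.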
