Improved Behavior Monitoring and Classification Using Cues Parameters Extraction from Camera Array Images

I. Introduction
IDENTIFICATION, monitoring, classification and recognition of humans from behavior images are important tasks, as they convey a subject's situation, identity, emotion, gait and gestures [1][2][3][4]. However, human identification and monitoring remain imperfect under conditions such as position changes, illumination variation, orientation changes, noise and poorly lit areas [5][6][7][8]. Despite the research efforts and significant results of the past decade, the recognition accuracy of human behavior remains a challenge because of self-occlusion of body parts, variation in body size and appearance, body parts hidden behind objects, and fast human movements in indoor scenes [9][10]. In addition, many researchers have focused on recognizing activities from videos captured by conventional cameras, which are less effective due to complex backgrounds, light sensitivity and motion ambiguities (i.e., color and texture variability) [11][12][13]. To obtain high-quality imaging and 3D motion, the development of low-cost, easy-to-process depth cameras such as the Microsoft Kinect or Bumblebee has opened a new era for a variety of image recognition tasks, including human behavior recognition (BR) [14][15][16]. Depth images provide several opportunities to enhance BR, such as additional body joint information, spatial continuity, insensitivity to lighting conditions and better handling of overlapping body parts.
A large number of methods have been designed for efficient BR, and many comparative studies over depth videos [16][17][18] have examined which algorithms recognize best. These methods mainly interact with depth data through two different approaches: skeleton joint features and depth silhouette features. For example, Oreifej and Liu [19] proposed a new descriptor for behavior recognition using a histogram capturing the distribution of surface normal orientation in the 4D space of time, depth and spatial coordinates. To build the histogram, they created 4D projectors that quantize the 4D space and represent the possible directions of the 4D normal. In [17], Yang et al. described an effective method that projects depth maps onto three orthogonal planes and accumulates global activity through entire video sequences to generate Depth Motion Maps (DMM); Histograms of Oriented Gradients (HOG) are then computed to enhance the activity recognition results. In [20], the authors proposed a behavior recognition system that derives motion features, such as magnitude and directional angular features, from body joint information between consecutive frames to recognize daily routine human activities. In [21], the authors designed mid-level features from Kinect skeletons by considering the orientations of human body limbs connected by two skeleton joints, encoding each orientation into different states; they employed frequent pattern mining to pick the most frequent feature values and the relevant states of body parts across consecutive frames, and thereby recognize different activities/actions. Although such methods perform well and make real contributions, each suffers from limiting factors. Methods that rely only on skeleton data become unreliable for postures with self-occlusion, while methods that depend on depth silhouette information yield low recognition accuracy when body parts are hidden or missing, when silhouettes move quickly, or when the subject is far from the source (i.e., the depth camera). We therefore elaborate novel features along with an advanced HMM to overcome these problems and improve recognition accuracy.
In this paper, we propose a novel behavior recognition framework based on cues-parameters, with improved accuracy over existing algorithms. At the start of the BR framework, we handle noisy input postures and unclear background data by designing a set of reliability measurements to extract true silhouettes and tracked joint values. These data are examined to extract human silhouettes by considering spatial/temporal continuity, constraints of human motion information and frame differencing. The silhouettes and joints are further processed into a feature representation using cues-parameters, including angular direction, spatiotemporal velocity and invariant features, which provide compact and sufficient feature values for better BR performance. All feature values are then mapped into codewords and each behavior is recognized via an advanced Hidden Markov Model (HMM). We evaluate our method according to standard experimental protocols on three challenging depth behavior datasets: IM-DailyDepthActivity, MSRDailyActivity3D and MSRAction3D. Our experimental results show that the proposed method achieves better recognition accuracy than state-of-the-art methods. Since our system is well organized, affordable and easy to install, it is a preferable solution for reliable body tracking, orientation, smart environments and behavior recognition systems.
The organization of this paper is as follows. Section II elaborates the system architecture of the proposed method, starting from depth image preprocessing, followed by feature extraction via cues-parameters and behavior training/recognition via advanced HMMs. Experimental results and comparisons between the proposed and state-of-the-art methods are described in Section III. Finally, we conclude this work in Section IV.

II. System Architecture and Description
The proposed behavior recognition system consists of the following steps: 1) raw depth data capture by an RGB-D video sensor; 2) noisy background removal; 3) human detection, tracking and identification from the time-sequential behavior video images; 4) feature extraction based on cues-parameters techniques, clustering using the Linde, Buzo, and Gray (LBG) algorithm, and training/recognition using an advanced HMM. Fig. 1 shows the overall flow of our proposed BR system.

A. Depth Image Processing
For human identification in sequential depth maps, a background subtraction routine is applied that uses the least squares method to estimate the angle and center point of the floor in a real-world coordinate system [22]. The least depth value y in a spaced grid is used to remove the floor from the noisy background. We then localize the different objects in the scene and segment them by computing a modified connected component labeling (CCL) [23][24], which labels all candidate pixels separately, and by examining degrees of freedom. Finally, to extract the moving human silhouettes, temporal depth intensity differentiation [22] is applied, as expressed in (1), to obtain the depth human silhouettes from consecutive frames. To properly track the entire human body, we perform disparity segmentation: ignoring zeros, we take the average of the disparity values in the detected parts and compare neighboring pixels around the detected moving parts, adding pixels with closer disparity values to form a separate region (i.e., the human silhouette). The disparity segmentation thus finds the target human silhouette candidates while subjects remain free to move naturally.
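As a rough illustration of this step (a sketch under our assumptions, not the authors' implementation), the following Python fragment combines temporal depth differencing with connected component labeling. The floor removal is simplified to a row cutoff below an estimated floor height, and the function name and threshold are placeholders.

```python
import numpy as np
from scipy import ndimage

def extract_silhouette(depth_prev, depth_curr, floor_row, motion_thresh=30):
    """Return a binary mask of the largest moving region above the floor."""
    # Temporal depth intensity differentiation between consecutive frames
    motion = np.abs(depth_curr.astype(np.int32) - depth_prev.astype(np.int32))
    mask = motion > motion_thresh

    # Ignore floor pixels (simplified: rows below the estimated floor height)
    mask[floor_row:, :] = False

    # Connected component labeling to separate candidate objects
    labels, n = ndimage.label(mask)
    if n == 0:
        return np.zeros_like(mask)

    # Keep the largest component as the human silhouette candidate
    sizes = ndimage.sum(mask, labels, range(1, n + 1))
    return labels == (np.argmax(sizes) + 1)
```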
The overall framework allows a smart environment to analyze what the user is doing from the noisy data obtained from the depth camera. We built a depth motion database targeting different behaviors. The database includes correctly and incorrectly performed skeleton models for different activity/action purposes, annotated posture information, and the depth and color images obtained from the RGB-D camera. Some examples from the database are shown in Fig. 2.

B. Evaluation of Cues-parameters Techniques
In this section, we derive human motion and joint tracking information by considering unique cue evaluation techniques in the OpenNI environment, and explain why these methods are applied efficiently and produce strong results.
For initialization, we utilize several cues from body silhouettes and joint values: local body part motions, temporal frame information, spatial/temporal depth silhouette characteristics, and speed measurements of the human body shape over the depth datasets. The following sub-sections describe the features under evaluation.

1) Angular Direction Feature
Each posture P in the database is represented by a vector of 3D joint points as

P = (C(1), C(2), ..., C(J)),  C(j) = (x_j, y_j, z_j)   (2)

The angular direction feature cos ϕ is defined as the difference in angular movement of the same joint between two consecutive frames at times t_1 and t_2. It captures the temporal movements of the different body parts of the human silhouette and is defined as

cos ϕ = Σ_k C_{t_1}(k) C_{t_2}(k) / ( √(Σ_k C_{t_1}(k)²) · √(Σ_k C_{t_2}(k)²) )   (3)

where C_{t_1} and C_{t_2} are the joint positions in consecutive frames and k ranges over the three coordinate axes (i.e., x, y, z) of the respective joints [22]. Fig. 3 shows examples of the directional angular features for different activities. The joint angular values are quite effective for improving accuracy during feature discrimination and recognition.
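A minimal sketch of how Eq. (3) can be computed per joint; the function name and the handling of zero-length vectors are our assumptions.

```python
import numpy as np

def angular_direction(c_t1, c_t2):
    """cos(phi) between the positions of the same joint in two consecutive
    frames, following Eq. (3); c_t1 and c_t2 are (x, y, z) coordinates."""
    c_t1 = np.asarray(c_t1, dtype=float)
    c_t2 = np.asarray(c_t2, dtype=float)
    denom = np.linalg.norm(c_t1) * np.linalg.norm(c_t2)
    return float(np.dot(c_t1, c_t2) / denom) if denom > 0 else 1.0
```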

2) Velocity Feature
The spatiotemporal velocity feature f_v captures the velocity of each joint in the direction of the normal vector of the plane from the starting frame to the ending frame. It is defined as

f_v(j_k) = ( C_{t_e}(j_k) − C_{t_s}(j_k) ) / ( t_e − t_s )   (4)

where t_s and t_e are the starting and ending frames of the overall data sequence and j_k indexes the coordinate axes of the human body joints. These features capture the intensity differentiation and spatiotemporal motion values of body parts. Fig. 4 gives a detailed description of the spatiotemporal velocity features over the depth dataset.
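A corresponding sketch for Eq. (4), assuming the joint trajectory is stored as an array of 3D positions per frame; the names are placeholders.

```python
import numpy as np

def velocity_feature(joint_track, t_s, t_e):
    """Average per-frame displacement of one joint between starting frame
    t_s and ending frame t_e, following Eq. (4); joint_track is a
    (num_frames, 3) array of (x, y, z) positions."""
    joint_track = np.asarray(joint_track, dtype=float)
    return (joint_track[t_e] - joint_track[t_s]) / float(t_e - t_s)
```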

3) Invariant Feature
To observe the invariant characteristics of the human body silhouette, we measure the integral of a function over a line, i.e., the Radon transform, which extracts regions of interest in the Euclidean plane:

R_z(ρ, θ) = ∫∫ z(x, y) δ(ρ − x cos θ − y sin θ) dx dy   (5)

where R_z(ρ, θ) is the 2D Radon map obtained as line integrals of the depth data. During extraction of human behavior silhouettes, the Radon transform represents the distance and local directional movements of the body parts in motion. The line integrals of human silhouettes amplify low-frequency components, which are useful in behavior recognition. The transform is well suited to BR because the angles of different body parts such as the feet, hands, head, hips and shoulders vary; thus it concentrates the maximum energy of the human behavior in specific coefficients that vary considerably over time. Fig. 5 shows some human behaviors and their corresponding Radon transforms.
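For illustration, the 2D Radon map of a binary silhouette can be computed with scikit-image; the angular sampling and function name below are assumptions.

```python
import numpy as np
from skimage.transform import radon

def radon_map(silhouette, n_angles=180):
    """2D Radon map R_z(rho, theta) of a binary depth silhouette, Eq. (5)."""
    theta = np.linspace(0.0, 180.0, n_angles, endpoint=False)
    return radon(silhouette.astype(float), theta=theta, circle=False)
```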

C. Features Symbolization
These cues-parameters vectors are then symbolized with a vector quantization technique, the Linde, Buzo, and Gray (LBG) clustering algorithm [15]. LBG initializes with a codebook of size 1 and recursively splits the centroids of the feature vectors (i.e., datasets) to reach an optimally sized codebook; after experimenting on different depth datasets, we use an optimal codebook size of 64. The codebook size and codevectors are directly related to the precision of the locally global feature values and to the size of the source image, and the centroids are optimized to reduce distortion. The resulting code-values are generated for each behavior sequence and stored using a buffering strategy. Fig. 6 shows the procedure of code-value generation and symbolization of the proposed features.
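A compact sketch of the LBG splitting procedure described above (our reconstruction, not the authors' code); the perturbation factor and iteration count are assumptions.

```python
import numpy as np

def lbg_codebook(features, size=64, eps=1e-3, iters=20):
    """features: (N, D) array of cue-feature vectors; returns (size, D)."""
    codebook = features.mean(axis=0, keepdims=True)   # codebook of size 1
    while len(codebook) < size:
        # Split every centroid into two slightly perturbed copies
        codebook = np.vstack([codebook * (1 + eps), codebook * (1 - eps)])
        for _ in range(iters):                        # Lloyd-style refinement
            d = np.linalg.norm(features[:, None] - codebook[None], axis=2)
            nearest = d.argmin(axis=1)
            for j in range(len(codebook)):
                pts = features[nearest == j]
                if len(pts):
                    codebook[j] = pts.mean(axis=0)    # reduce distortion
    return codebook

def symbolize(features, codebook):
    """Map each feature vector to the index of its nearest codevector."""
    d = np.linalg.norm(features[:, None] - codebook[None], axis=2)
    return d.argmin(axis=1)
```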

D. Advanced HMMs
To train on and recognize the different depth datasets, we modified the conventional HMM into an advanced HMM technique. In the conventional HMM approach, each behavior has its own HMM with a finite number of states and transition and symbol observation probabilities [21]. Such a model carries redundant information in the form of the whole body silhouette, including less active body parts [25]. HMMs also depend on the feedback of possible transitions and need a priori knowledge to manage the parameters, which can cause over-fitting in each class [26][27]. This unnecessary information reduces the overall accuracy. The advanced HMM therefore focuses on the active areas of the human body, such as the hands, feet, head, hips and shoulders. For the training phase, we train N+1 distinct HMMs for N different human activities. During testing, the maximum likelihood [28][29][30] over the behavior models is chosen to recognize the behavior, as expressed in (6).
Behavior* = argmax_h P(O | λ_h)   (6)

where λ_h denotes the HMM of the h-th behavior among the trained activities, O is the observed symbol sequence, and P(O | λ_h) is its likelihood. Fig. 7 shows the active feature regions of the overall human silhouette used to calculate the likelihood of each behavior.
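A hedged sketch of Eq. (6): score the quantized observation sequence against each behavior's HMM with the scaled forward algorithm and pick the maximum-likelihood model. The (pi, A, B) parameter layout is our assumption about the trained models.

```python
import numpy as np

def log_likelihood(obs, pi, A, B):
    """Scaled forward algorithm. obs: sequence of codeword indices;
    pi: (S,) initial, A: (S, S) transition, B: (S, M) emission probs."""
    alpha = pi * B[:, obs[0]]
    c = alpha.sum()
    log_p = np.log(c)
    alpha = alpha / c
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
        c = alpha.sum()            # rescale to avoid numerical underflow
        log_p += np.log(c)
        alpha = alpha / c
    return log_p

def recognize(obs, models):
    """models: {behavior: (pi, A, B)}; implements argmax_h P(O | lambda_h)."""
    return max(models, key=lambda h: log_likelihood(obs, *models[h]))
```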

III. Experimental Results
In this section, we describe how our skeleton models and human postures are represented in the database, and which behaviors are included. We evaluate the proposed cues-parameters approach on a newly collected depth-based behavior recognition dataset (IM-DailyDepthActivity) and two public datasets (MSRDailyActivity3D and MSRAction3D).
In the experiments, we used the leave-one-subject-out (LOSO) cross-validation method [26].
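A possible LOSO setup using scikit-learn's LeaveOneGroupOut; features, labels, subject_ids and train_and_evaluate are hypothetical placeholders for this paper's pipeline.

```python
from sklearn.model_selection import LeaveOneGroupOut

def loso_evaluation(features, labels, subject_ids, train_and_evaluate):
    """Run one train/test round per held-out subject, collect accuracies."""
    scores = []
    for tr, te in LeaveOneGroupOut().split(features, labels, subject_ids):
        scores.append(train_and_evaluate(features[tr], labels[tr],
                                         features[te], labels[te]))
    return scores
```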

A. Datasets Description
The following sub-sections describe each dataset, its experimental setting and its size. We used the Microsoft Kinect to capture the skeleton data, joint locations and posture data for the database, as it is one of the best-known and most mature depth-camera motion sensors. In addition, we manually handled annotations such as the natural movements of the human behavior, the risk of injury during captured sequences and the slotting of each video sequence.

1) IM-DailyDepthActivity
We captured the RGB images, depth images, labeled data and skeleton joint information of 15 different activities: sit down, both hands waving, phone conversation, kicking, reading an article, throwing, bending, clapping, right hand waving, take an object, exercise, eating, boxing, cleaning and stand up. The dataset was captured in indoor environments (i.e., labs, halls and classrooms) with multiple background scenes. It includes 45 segmented videos of each activity for training and 30 unsegmented continuous videos for testing, performed by 15 different subjects.
Both the RGB videos and the depth maps have a resolution of 640 x 480 pixels, and the skeleton contains 15 joints per person. Fig. 8 gives some examples from the IM-DailyDepthActivity dataset.

2) MSRDailyActivity3D
The dataset has sixteen activities: drink, eat, read book, call cellphone, write on a paper, use laptop, use vacuum cleaner, cheer up, sit still, toss paper, play game, lay down on sofa, walk, play guitar, stand up and sit down. In total it includes 320 video sequences performed by 10 subjects; each subject performs each activity twice, once standing and once sitting. Most activities involve human-object interactions, which makes this dataset challenging. Fig. 9 shows some depth images from the MSRDailyActivity3D dataset.

3) MSRAction3D
The MSRAction3D dataset [27] is an action dataset of depth map sequences captured by a depth camera. It contains 20 different actions performed by 10 subjects: high arm wave, horizontal arm wave, hammer, hand catch, forward punch, high throw, draw x, draw tick, draw circle, hand clap, two hand wave, side boxing, bend, forward kick, side kick, jogging, tennis swing, tennis serve, golf swing and pick up & throw, for a total of 567 depth video sequences. Although the background of the dataset is clean, most actions involve movements of the same specific body parts (i.e., head, arm and leg), which makes the dataset quite challenging. Fig. 10 shows different sequences of depth map actions from the MSRAction3D dataset.

B. Recognition Results and Comparison

1) IM-DailyDepthActivity

On the newly collected dataset, we evaluated the proposed cues-parameters method and compared the results with state-of-the-art methods; Table IV shows the comparison of recognition results. Among the existing methods, motion templates [31] were proposed for automatic classification and retrieval of motion capture data, facilitating the identification of logically related motions in a specific dataset. In [17], Yang et al. describe depth motion maps as the feature representation, with HOG characterizing the local appearance for recognition. In [32], the positions of joints are used to define a local reference system, and a multi-part bag-of-poses approach then permits the separate alignment of body parts through a nearest-neighbor classifier. In [33], a multi-modality fusion scheme is developed based on spatio-temporal interest points and motion history image features to recognize different activities. The results in Table IV clearly show that the proposed method improves on the recognition results of the state-of-the-art methods.

Table IV. Recognition accuracy on the IM-DailyDepthActivity dataset.
  Method                                 Accuracy (%)
  Motion templates [31]                  38.7
  Depth motion maps [17]                 42.3
  Naive-Bayes-Nearest-Neighbor [32]      47.6
  Color-depth fusion features [33]       51.6
  Proposed cues-parameters               68.4

2) MSRDailyActivity3D
We evaluated our proposed cues-parameters method on the MSRDailyActivity3D dataset [34][35], using the depth silhouettes and the joint information together in the cues-parameters. We compare against existing methods [19][31][36], where [31] and [36] mainly deal with joint point information to monitor human movements and recognize human activities/actions. In [19], a novel approach for human action recognition is proposed based on the histogram of oriented 4D normals (HON4D) as a compact representation for recognizing different activities. In [26], Wang et al. developed the actionlet ensemble model, which uses 3D point clouds to model body shape and reports a recognition accuracy of 85.7%. In [38], Xia and Aggarwal proposed a novel depth cuboid similarity feature (DCSF) describing the local 3D depth cuboid around the depth images to recognize different actions/activities. We compare the experimental results of our proposed method against these state-of-the-art algorithms in Table V. As the table shows, the proposed method achieves a significantly better recognition accuracy of 91.2%, versus 54.0%, 58.1%, 72.1%, 80.0%, 85.7% and 88.2% for the state-of-the-art methods, respectively.

3) MSRAction3D
The confusion matrix of the 20 different human actions obtained with the proposed cues-parameters method on the MSRAction3D dataset is shown in Table III.
We compare the performance of our proposed method with state-of-the-art methods on the MSRAction3D dataset. In the first approach [39], Lv and Nevatia present a learning-based algorithm for the automatic recognition and segmentation of 3D human actions. In [27] and [36], a bag of 3D points and position differences of joints (eigenjoints) are analyzed for action recognition, while the actionlet ensemble [26], pose set [42] and moving pose [43] methods use body joint information for activity classification and recognition.
For a fair comparison, we performed the cross-subject test between the proposed method and the state-of-the-art methods, since the cross-subject setting is more challenging due to the dynamic intra-class differences in actions among different subjects. The proposed cues-parameters method achieves a recognition accuracy of 92.9%, which significantly outperforms the existing methods, as listed in Table VI.

IV. Conclusion
In this paper, we proposed a novel methodology combining multi-joint features with an advanced HMM for a BR system using depth sequences. The proposed BR system combines human silhouette identification and tracking using a modified connected component labeling method with angular direction, spatiotemporal velocity and invariant features, which extract local body part information, intensity differentiation and temporal variation properties to reinforce feature classification and accuracy. These features are then modeled, trained and recognized by the recognizer engine. The proposed method was implemented in Matlab (R2009b) on an Intel Core i3 processor with 4 GB RAM under Windows XP. We evaluated the method on three different datasets: IM-DailyDepthActivity, MSRDailyActivity3D and MSRAction3D. Experimental results showed promising performance of the proposed BR method compared with state-of-the-art methods.
In future work, we will improve the effectiveness of our system by adding biometric identification techniques [44] such as face recognition, iris detection and gesture recognition, with the help of multi-camera views or the RGB content of Kinect sensors. We will also combine biometric identification cues [44] with the proposed cues-parameters techniques to cover a wider range of real-world scenarios and act as a multi-modal biometric system.