A Recent Trend in Individual Counting Approach Using Deep Network

I. Introduction
Individual counting has extensive applications in domains such as public stations, shopping malls, and universities [1]. Crowd analysis has contributed considerably to the understanding of crowded scenes and comprises crowd behavior analysis [2] [3], individual tracking [4] [5], and crowd segmentation [6] [7].
Counting individuals in video falls into two broad categories: region of interest (ROI) and line of interest (LOI). ROI counting estimates the total number of individuals in a given area at a specific time, whereas LOI counting obtains the total number of individuals who cross a detecting line within a certain period of time [8]. To deal with these two categories, researchers have employed human detection, feature regression, or clustering techniques [8]. Numerous schemes have been proposed to count individuals on the basis of the ROI and LOI categories; the general techniques are clustering [9], detection [10], and regression [11] [12].
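The difference between the two categories can be illustrated with a minimal sketch. It assumes that per-person positions and tracks are already available (in practice they would come from a detector or tracker); the function names `roi_count` and `loi_count`, the axis-aligned ROI, and the horizontal counting line are illustrative assumptions, not part of [8].

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]  # (x, y) image coordinates


def roi_count(positions: List[Point], roi: Tuple[float, float, float, float]) -> int:
    """Count positions inside an axis-aligned ROI (x_min, y_min, x_max, y_max) at one instant."""
    x0, y0, x1, y1 = roi
    return sum(1 for x, y in positions if x0 <= x <= x1 and y0 <= y <= y1)


def loi_count(tracks: Dict[int, List[Point]], line_y: float) -> int:
    """Count tracks that cross the horizontal line y = line_y at least once during the period."""
    crossed = 0
    for path in tracks.values():
        for (_, ya), (_, yb) in zip(path, path[1:]):
            # A sign change of (y - line_y) between consecutive samples means a crossing.
            if (ya - line_y) * (yb - line_y) < 0:
                crossed += 1
                break
    return crossed


# Hypothetical tracks: person id -> positions over three frames.
tracks = {
    1: [(1.0, 0.0), (1.0, 2.0), (1.0, 4.0)],  # crosses y = 3 upward
    2: [(5.0, 0.5), (5.0, 1.0), (5.0, 1.5)],  # never reaches the line
    3: [(8.0, 5.0), (8.0, 3.5), (8.0, 2.0)],  # crosses y = 3 downward
}
frame0 = [path[0] for path in tracks.values()]
roi_total = roi_count(frame0, (0.0, 0.0, 6.0, 3.0))  # people in the region at frame 0
loi_total = loi_count(tracks, line_y=3.0)            # people crossing the line over all frames
```

The ROI count is a snapshot of one frame, while the LOI count accumulates over time, which is why the two can differ for the same footage.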
Crowd analysis has contributed substantially to the understanding of overpopulated scenes by linking behavior analysis [2] [3], tracking [4] [5], and crowd segmentation [13] [6]. According to the literature [14] [15], different crowd schemes share similar principles, which enables them to be described precisely by attributes. Currently, many studies struggle to profile crowd attributes, owing to the limited values of the attributes and their indifference to dissimilarities between scenes [16].
The research of [17] suggested a deep multitask method to build a better understanding of crowds by attaining information through a mixing procedure and motion features. The results of that letter demonstrate significant advances in cross-scene attribute recognition with the suggested deep scheme.
The letter of [18] compared the most well-known counting approaches: detection, clustering, and the latest technique, regression, which is gaining traction for overpopulated areas. The results of [18] showed that the regression scheme performed better than the other approaches. Among the well-known counting schemes, the performance of detection-based and clustering-based approaches is significantly degraded in overpopulated scenes. In contrast to these approaches, the regression technique [19] is applicable to overpopulated areas and to overlap among objects in crowded scenes. Among the regression techniques, partial least squares regression (PLSR) can resolve the collinearity difficulty with fewer factors, requires fewer computations, and converges faster than other methods [20]. The major restrictions of PLSR are its higher risk of overlooking 'real' correlations and its sensitivity to the relative scaling of the descriptor variables [21]. Still, these approaches fall short in localization tasks because they cannot specify the location of each individual in the scene.
The density map technique has shown great success in individual counting in comparison to regression-based methods, specifically where the scene contains complications such as heavy inter-occlusion between individuals and low-resolution video [22]-[25] [7]. Additionally, these methods can provide spatial information about the crowd.
Density estimation is commonly applied in public places such as malls, squares, and stations. It is highly effective for understanding the behavior of individuals, which can improve their safety in public areas. Because crowds vary in distribution and shape, a pattern recognition problem arises [26]. Crowd density estimation has an advantage over plain crowd counting because it also provides location information about the crowd. The number of individuals in a specific region can be estimated easily from a density map: the total number of individuals in a video frame or image is obtained simply by integrating the density function over that frame or image [27].
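The relation between a density map and the count can be sketched in a few lines. The example assumes a common construction in which each annotated head position contributes one Gaussian kernel normalized to unit mass; the kernel width `sigma`, the image size, and the head coordinates are illustrative assumptions. On a pixel grid the integral becomes a sum.

```python
import math
from typing import List, Tuple


def gaussian_density_map(heads: List[Tuple[int, int]], height: int, width: int,
                         sigma: float = 1.5) -> List[List[float]]:
    """Build a density map in which every annotated head contributes total mass 1."""
    density = [[0.0] * width for _ in range(height)]
    for hx, hy in heads:
        kernel = [[math.exp(-((x - hx) ** 2 + (y - hy) ** 2) / (2 * sigma ** 2))
                   for x in range(width)] for y in range(height)]
        mass = sum(map(sum, kernel))
        for y in range(height):
            for x in range(width):
                # Normalizing inside the image keeps each person's mass at exactly 1.
                density[y][x] += kernel[y][x] / mass
    return density


def count_in_region(density: List[List[float]], x0: int, y0: int, x1: int, y1: int) -> float:
    """Discrete integral of the density over the half-open region [x0, x1) x [y0, y1)."""
    return sum(density[y][x] for y in range(y0, y1) for x in range(x0, x1))


heads = [(5, 5), (6, 5), (15, 12)]               # hypothetical (x, y) head annotations
dmap = gaussian_density_map(heads, height=16, width=20)
total = count_in_region(dmap, 0, 0, 20, 16)      # whole frame
left_half = count_in_region(dmap, 0, 0, 10, 16)  # sub-region holding two of the heads
```

Summing over the whole frame recovers the number of annotated heads, and summing over a sub-region gives an approximate local count, which is the spatial information that a single regressed number cannot provide.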
Deep learning has shown strong performance in areas such as speech and pattern recognition, and among deep learning techniques the CNN fares best in image classification. The study of [28] proposed a novel deep-learning approach to face classification. That work boosts network performance through three techniques, Deep Belief Net (DBN), Stacked Auto-Encoder (SAE), and Back Propagation Neural Network (BPNN), while utilizing a sigmoid objective function. High-level features extracted by a deep algorithm were used as inputs to the classifier to distinguish faces from non-faces. The outcome of [28] attested to the robustness and proficiency of the proposed technique for face classification on two datasets, BOSS and MIT.
The study of [29] evaluated density estimation methods on different crowd analysis tasks. Since the proficiency of density methods depends strongly on the form of the features [22], that paper evaluated a classical CNN that produces density maps at the original resolution, in contrast to current CNN approaches, which reduce the resolution of the density maps through downsampling strides in the convolution operations. The results showed that counting performance was unaffected by reduced-resolution density maps; such maps, however, did not work for localization tasks such as detection and tracking.
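Why reduced resolution leaves the count intact but harms localization can be seen in a toy example. The sketch below uses non-overlapping sum pooling as a stand-in for the downsampling that strided convolutions cause; this substitution, the helper name `sum_pool`, and the 8x8 map are assumptions made for illustration and do not reproduce the network in [29].

```python
from typing import List


def sum_pool(density: List[List[float]], stride: int) -> List[List[float]]:
    """Downsample a density map by summing non-overlapping stride x stride blocks.

    Assumes the height and width are divisible by the stride.
    """
    height, width = len(density), len(density[0])
    return [[sum(density[y + dy][x + dx] for dy in range(stride) for dx in range(stride))
             for x in range(0, width, stride)]
            for y in range(0, height, stride)]


# Hypothetical 8x8 density map with two people represented as unit impulses.
full = [[0.0] * 8 for _ in range(8)]
full[1][1] = 1.0
full[2][2] = 1.0

coarse = sum_pool(full, stride=4)          # 2x2 map after 4x downsampling
full_count = sum(map(sum, full))
coarse_count = sum(map(sum, coarse))
```

Both maps sum to the same count, but in the coarse map the two people fall into a single cell and can no longer be separated, which mirrors the finding that counting survives the loss of resolution while detection and tracking do not.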
Most research has concentrated on estimating a density map, which carries rich density information, to obtain the total number of individuals [30]. Earlier approaches depended entirely on the form of the features, whereas a CNN extracts useful information from the input automatically and deals efficiently with overlap among objects, non-uniform lighting, and varying scales [30]. However, current CNN approaches may fail to resolve two issues: unstable density distributions and variation in scale. The letter of [30] suggested the multi-column multi-task convolutional neural network (MMCNN) to solve these problems. It made three contributions: a new density map that focuses on both location and full data, a multi-column CNN that gathers useful information at different scales, and the joint estimation of the density map, the level of crowdedness, and a background/foreground mask. The additional tied objectives improved the accuracy of the density map. The suggested scheme proved more proficient than the conventional Gaussian density map and the recently used individual-shaped density map. The work showed that CNN-based individual counting methods can deal with non-uniform distributions and scale variations.
The study of [13] proposed a framework that addresses cross-scene individual counting by using a CNN with no additional annotations for a new target scene. The CNN model of that letter was trained on two objectives, the density map and the individual count, which assisted each other in reaching better local optima. The model could also learn crowd-specific features more efficiently than handcrafted features. To bridge the gap between scenes, the pre-trained CNN was adapted to each scene.
The CNN captures the texture of the crowd and yields an accurate count without relying on foreground segmentation results, as the method of that study was based solely on appearance information. The experimental results showed the efficiency and consistency of the proposed method. Consequently, in comparison to other counting approaches, density map methods are better suited to overpopulated areas and scenes with overlap. Moreover, they show strong potential for providing spatial information to locate each individual in an overcrowded scene. The technique describes crowds accurately through their density distribution, as it provides location information.
Nevertheless, it may fail where only the heads of individuals are visible. As the performance of this method depends strictly on the type of features, the CNN can provide useful information automatically, in contrast to shallow techniques. Moreover, CNN techniques are highly beneficial for difficulties that involve overlap among objects, non-uniform lighting, and varying scales. Although the MMCNN is able to resolve these difficulties, it is restricted to the scales used during training, which limits its ability to learn well-generalized models. In this letter, we aim to provide a review of recent individual counting techniques that use deep networks in various crowded-scene applications, and to assess the effectiveness of deep CNNs for attaining higher precision with popular individual counting approaches. This paper is organized into five parts: introduction, analysis of individual counting, experimental results, discussion, and conclusion, which are presented in Sections I, II, III, IV, and V, respectively.

II. Individual Counting Analysis
Individual counting approaches are classified into four techniques, namely, "Detection techniques", "Clustering techniques", "Regression techniques", and "Density Map techniques".

A. Detection Technique
The detection-based technique uses a detector to recognize the individuals in the video scene in order to obtain the total number of individuals together with their locations [31].
The well-known detection methods commonly detect the body [32], the shoulders [33], or the head [34]. Under heavy occlusion only the heads of individuals may be visible, so head detection methods mostly perform better than shoulder and body detection. The difficulty in the detection technique is to identify each individual separately, especially in the presence of crowds and overlap among objects [35]. Although this kind of method shows promising outcomes, carefully hand-engineered detectors are not precise in blurred scenes, where the quality of most videos is relatively low [1][35] [36]. Furthermore, this scheme cannot deal with overpopulated regions or unstable weather, nor can it count individuals moving in different flow directions, which limits its operation in real scenes [15]. These schemes produce appropriate detections in sparse scenes, providing the total number of individuals, their locations, and the posture of every individual in the scene. However, they suffer from visual clutter. A multi-camera setup with overlapping views is not available in many circumstances, and such schemes cannot guarantee satisfactory cross-dataset generalization [10], while training a scene-specific detector for counting is difficult [18]. Tracking and detecting individuals becomes more complex as the crowd grows denser and larger [37]. Moreover, these approaches require large computing resources and are often limited by overlap and complex backgrounds in realistic situations, which results in low accuracy [38]. This is because these methods have to scan each frame with a trained detector, which normally takes more time to obtain the total number of individuals [39].
Counting individuals with detection-based approaches takes more time when applied to scenes with occlusion and lighting changes [40]. These approaches provide higher accuracy in low-density crowd scenes than in high-density ones [41]. Additionally, a high-resolution scene is essential for high precision [42]. The pros and cons of prevalent detection techniques are shown in Table I.
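The counting step of a detection-based pipeline can be sketched as follows. The example assumes that a trained detector has already produced scored bounding boxes for one frame and applies greedy non-maximum suppression (NMS), a standard post-processing step that the surveyed papers may or may not use; the thresholds and box values are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float, float]  # (x0, y0, x1, y1, score)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def count_by_detection(boxes: List[Box], score_thr: float = 0.5,
                       iou_thr: float = 0.5) -> Tuple[int, List[Box]]:
    """Drop low-score boxes, suppress duplicates with greedy NMS, and count what remains."""
    kept: List[Box] = []
    candidates = sorted((b for b in boxes if b[4] >= score_thr),
                        key=lambda b: b[4], reverse=True)
    for box in candidates:
        if all(iou(box, k) < iou_thr for k in kept):
            kept.append(box)
    return len(kept), kept


# Hypothetical detector output for one frame.
detections = [
    (0.0, 0.0, 10.0, 10.0, 0.9),
    (1.0, 1.0, 11.0, 11.0, 0.8),    # overlaps the first box strongly, so it is suppressed
    (20.0, 20.0, 30.0, 30.0, 0.7),
    (40.0, 40.0, 50.0, 50.0, 0.3),  # below the score threshold
]
count, kept_boxes = count_by_detection(detections)
```

The same suppression step hints at why detection degrades in dense crowds: two real people who overlap heavily produce boxes with a high IoU, and one of them is discarded as a duplicate.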

B. Clustering Technique
In clustering-based techniques, the total number of individuals is obtained through tracking algorithms [54]. In these approaches, useful features are tracked frame by frame; spatial and temporal consistency heuristics are then used to cluster the trajectories, with additional cues used to complete the spatial path of each person [9][54] [55]. The total number of individuals is given by the number of clusters [56]. As these methods depend heavily on tracking, they are time-consuming and need substantial computing resources. Their performance degrades severely under variations in illumination, low resolution, and moving imaging platforms, which reduce the stability of the tracking algorithm [30][54] [57]. These techniques extract and count the blobs of individuals in a temporal slice of the video, and blobs that contain many individuals cannot provide a precise count because of severe occlusion [58] [59]. Furthermore, the accuracy of these techniques is affected by incoherent features that are not matched to the correct object [60]. In real scenes in particular, handcrafted features can hardly adapt to different weather and lighting conditions [58] [59]. Moreover, because the model assumes motion coherency, the probability of an incorrect estimate increases when objects remain static in the scene or when different objects share similar motion over a certain period [61]. These methods can only handle consecutive image frames, whereas other counting approaches do not face this restriction [18]. They can nevertheless provide accurate results when dependable trajectories can be extracted [39].
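The idea that the count equals the number of trajectory clusters can be sketched with a simple grouping rule. The example assumes that feature tracks are already available and merges two tracks when their mean positions are close and their mean velocities agree; the thresholds `max_dist` and `max_dv` and the union-find grouping are illustrative assumptions rather than the method of any cited paper.

```python
import math
from typing import List, Tuple

Track = List[Tuple[float, float]]  # positions of one tracked feature, frame by frame


def summarize(track: Track) -> Tuple[float, float, float, float]:
    """Mean position and mean per-frame velocity of a track (needs at least two points)."""
    n = len(track)
    mx = sum(p[0] for p in track) / n
    my = sum(p[1] for p in track) / n
    vx = (track[-1][0] - track[0][0]) / (n - 1)
    vy = (track[-1][1] - track[0][1]) / (n - 1)
    return mx, my, vx, vy


def count_by_clustering(tracks: List[Track], max_dist: float = 3.0,
                        max_dv: float = 0.5) -> int:
    """Group tracks that are spatially close and move coherently; return the cluster count."""
    feats = [summarize(t) for t in tracks]
    parent = list(range(len(feats)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(feats)):
        for j in range(i + 1, len(feats)):
            a, b = feats[i], feats[j]
            close = math.hypot(a[0] - b[0], a[1] - b[1]) <= max_dist
            coherent = math.hypot(a[2] - b[2], a[3] - b[3]) <= max_dv
            if close and coherent:
                parent[find(i)] = find(j)
    return len({find(i) for i in range(len(feats))})


# Hypothetical feature tracks over three frames.
feature_tracks = [
    [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)],    # person A, moving right
    [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)],    # person A, second feature
    [(10.0, 0.0), (9.0, 0.0), (8.0, 0.0)],   # person B, moving left
    [(10.0, 1.0), (9.0, 1.0), (8.0, 1.0)],   # person B, second feature
    [(2.0, 0.5), (1.0, 0.5), (0.0, 0.5)],    # person C, near A but moving the opposite way
]
people = count_by_clustering(feature_tracks)
```

Because the grouping relies on motion coherency, two people walking side by side at the same speed would be merged into one cluster, which is the failure mode noted above for objects that share similar motion.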

C. Regression Technique
Regression-based techniques count individuals from a map of low-level features. Typically, the regression algorithms use features such as texture features [11][62] [63] and key points [11][62]-[64], which are extracted from the foreground through background subtraction. These techniques learn the correlation between features and the count by training on the extracted features, without identifying the individuals [54]. They perform better in overcrowded scenes [65] because they avoid the hard detection problem. However, they cannot provide per-individual information and, in particular, cannot specify the position of each individual, which restricts their use for localization tasks [7] [66]. For video surveillance systems, it is crucial to know the position of each person in the scene in order to acquire the spatial distribution of individuals [11][24] [67]. The computational cost of these approaches is very low because they do not need to detect and track individuals [37]. Table II summarizes the advantages and disadvantages of six popular regression-based techniques.
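The mapping from low-level features to a count can be sketched with the simplest possible regressor. The example assumes a single feature, the foreground pixel area after background subtraction, and fits ordinary least squares in closed form; real systems combine several features and use richer regressors, and the training values here are invented for illustration.

```python
from typing import List, Tuple


def fit_linear(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Closed-form ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict_count(area: float, slope: float, intercept: float) -> float:
    """Estimate the number of people from the foreground area of a frame."""
    return slope * area + intercept


# Hypothetical training frames: foreground area in pixels and annotated count.
areas = [120.0, 250.0, 380.0, 510.0]
counts = [2.0, 4.0, 6.0, 8.0]

slope, intercept = fit_linear(areas, counts)
estimate = predict_count(315.0, slope, intercept)
```

The model returns a single number per frame and never identifies individuals, which is why such methods are cheap and robust in dense scenes but cannot be used for localization.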

D. Density Based Technique
In density-based techniques, the total number of individuals in a sub-region equals the integral of the density map over that sub-region. These approaches are employed for both counting and localization because they retain spatial information, which makes them very effective for describing the density distribution of crowds. Nevertheless, they may fail where only the heads of individuals can be observed. These methods can tackle individual counting under heavy occlusion and low resolution in the video scene [22]-[25] [7]. The performance of this technique depends strongly on the types of features [22]-[25] [7], for which the convolutional neural network (CNN) is known as the best extraction technique [30]. However, the CNN method cannot efficiently tackle three difficulties, namely non-uniform density distributions, scale changes, and drastic scale changes [30]; it also reduces the resolution of the density maps, which has no discernible effect on the precision of individual counting but prevents satisfactory localization of individuals [36]. Because these difficulties prevent higher precision, particular CNN schemes handle them through multi-column or multi-resolution networks [36]. Although these approaches are robust to non-uniform density distributions, scale changes, and drastic scale variation, they are still limited to the scales used during training, which limits their ability to learn well-generalized models [39]. In comparison to regression techniques, density maps have been shown to be very efficient for individual counting, especially where the scene contains heavy inter-occlusion and low-resolution surveillance video [22]-[25] [7], and these methods also provide spatial information about the crowd.

Table I (fragment)
• Unable to provide precise detection solely based on head region.
Multi-Sensor
• With access to multiple cameras, the uncertainties caused by overlap among individuals in one camera can be resolved.
• Not applicable if multiple cameras are not available. [45][52] [53]

III. Experimental Results
This research reviews recent counting approaches and assesses the effectiveness of CNNs applied to popular individual counting approaches for attaining higher precision. In the field of computer vision, individual counting is considered a fundamental task. Several techniques, such as counting by clustering, detection, and regression, exist to address it. However, these techniques fall short of overcoming difficulties such as overlapping objects, differences in illumination, unstable weather, and overpopulated scenes.
These techniques have nevertheless shown remarkable performance when combined with a CNN to extract useful information from images or video frames, which plays an important role in both ROI and LOI counting [8]. By taking advantage of CNN image representations, individual counting has achieved much better results than shallow approaches in populated scenes. Table III lists selected individual counting techniques together with the quantitative outcomes reported in the corresponding references.

IV. Discussion
There are several techniques for individual counting, namely detection, clustering, and regression. These methods were employed to overcome the difficulties of counting, which arise for different reasons under different conditions in an overpopulated area. Detection-based approaches obtain the total number of individuals in the crowded scene by using a detector to recognize each individual [73]. They are practical only for specific scenes and fail in certain real-life applications [73] [74]. They have difficulty obtaining precise results under conditions such as unstable weather or individuals moving along opposite flow paths [15]. Besides, they are time-consuming because the detector scans every frame of the scene [39]. Clustering methods obtain the total number of persons in a populated scene by employing tracking systems [73]. These approaches also take a long time to process and are computationally demanding because they depend closely on the tracking algorithms. Moreover, their performance degrades significantly in scenes with variation in illumination, low resolution, and moving imaging platforms, which make the tracking algorithm unstable. Regression-based methods obtain the total number of individuals in the scene through a mapping from low-level features [73]. Such techniques generally achieve higher performance in overcrowded scenes. However, they do not provide the locations of individuals and hence are not used for object localization [66] [7]. As the goal of this study is to identify the most suitable counting methods, density map estimation has shown encouraging results in comparison to the other techniques, since it can provide the locations of individuals in a very crowded scene. These approaches show strong potential for individual counting where heavy inter-occlusion and low-resolution scenes are present. However, they cannot provide a precise count where only the heads of individuals are visible. The effectiveness of individual counting methods is indicated in Table IV.

Table II (fragment)
• Some objects are not beneficial for counting estimation.
6. Random Forest regression (RFR)
• Attains nonlinear, scalable regression.
• Less susceptible to the parameters.
• Can be scaled to big datasets.
• Unable to work with points that are out of the range of the target values. [71]

V. Conclusion
This study provides a review of existing individual counting techniques, namely clustering, detection, regression, and density map-based methods. Among these techniques, the regression-based ones perform well in overcrowded areas. Regression-based approaches can count the individuals in a crowded environment, although they are not applicable to localization tasks. Density map estimation approaches are very promising among the counting methods since they preserve the spatial information needed for both counting and localization. The density map technique resolves individual counting efficiently under circumstances such as massive overlap among objects and low-resolution scenes in consecutive frames. This scheme is suited to describing the density distribution of individuals since it captures both location and spatial information, but it may fail when only the heads of individuals are visible. While the functioning of this method relies on the type of features, convolutional neural networks (CNN) are capable of extracting the useful information. CNN-based counting approaches are not able to tackle three problems, i.e., non-uniform density distributions, scale changes, and drastic scale changes. Furthermore, applying CNN-based approaches to density estimation can reduce the resolution of the density maps, rendering them ineffective for localization tasks. Because these problems restrict the achievable precision, particular CNN techniques resolve the three difficulties via multi-column or multi-resolution networks. Although these approaches showed robustness to non-uniform density distributions and size variations, they are still restricted to the scales used for training, which limits their ability to generalize.
Based on the experimental results of previous work, we found that the crowd density approach is more beneficial than other counting methods because it provides information about the location of the crowd. However, the performance of this technique depends strongly on the types of features, so a strong deep learning technique (CNN) can be very valuable for extracting the important features from each video frame. Using a CNN with a density map approach will help future research design more precise counting techniques that estimate high-quality density maps for both counting and localization.