Image Classification Methods Applied in Immersive Environments for Fine Motor Skills Training in Early Education

Fine motor skills make it possible to carry out crucial everyday tasks, increasing people's independence and self-esteem. Among the alternatives for training these skills are immersive environments, which provide a set of elements arranged to offer a haptic experience through gesture control devices. In general, however, these environments lack a mechanism for evaluating the exercise performed and giving feedback on it, which makes it difficult to determine whether the objective was fulfilled. For this reason, this study compares image recognition methods, namely Convolutional Neural Network (CNN), K-Nearest Neighbor (K-NN), Support Vector Machine (SVM) and Decision Tree (DT), for the purpose of evaluating exercises and providing feedback. The techniques are assessed using images captured from an immersive environment, by means of a confusion matrix, cross validation and a classification report. The CNN model showed the best performance, with 82.5% accuracy, an increase of 23.5% compared to SVM, 30% compared to K-NN and 25% compared to DT. Finally, it is concluded that, in order to implement an evaluation and feedback method in an immersive environment for academic training in the first school years, the image recognition technique implemented must have a low margin of error in its percentage of successes, to ensure the proper development of these skills, given their great importance in childhood.

KOS. The authors of [8] propose a virtual environment based on 3D navigation with two-handed gestures that offers control over navigation speed and is easy and intuitive; an evaluation with a set of participants found the technique feasible, easy to learn and use, and capable of reaching high rates of accuracy, performance and prediction speed. In [9], the planning, assignment and use of digital learning objects with augmented reality is proposed to support the teaching-learning process of higher education students, taking into account the students' learning styles; this resulted in better academic performance and made the learning process more dynamic and attractive to students. The authors of [10] developed an augmented reality platform that simulates a 3D book, as an alternative to a traditional book, for teaching chemistry to high school students, given the low popularity of the subject; students who used the platform obtained better results in their semiannual exams than those who used the traditional book. In [11], a learning method based on augmented reality on mobile devices was implemented using multidimensional concept maps and tested with a control group and an experimental group; the experimental group performed significantly better than the control group and showed greater motivation during the activity. In [12], a mobile augmented reality platform for learning descriptive geometry is presented; the results show a large increase in positive feelings among the students with the proposed method, which is characterized by the use of virtual models of the figures.
Likewise, there are studies on AR in early education and on the influence of technology in increasing children's motivation and interest during the learning process. In [13], the authors develop an interactive educational game based on augmented reality for teaching concepts such as colors, geometric shapes and mathematics to children between 4 and 7 years old; the result was a pleasant and fun teaching environment, as well as increased interest and commitment on the part of the students, who were also evaluated by their parents. In [14], augmented reality content about animals is implemented for children aged 4 to 5 years; the authors report an increase in learning effectiveness and in student activity, since augmented reality promotes active behavior and the development of communication skills in children.
Image classification methods in artificial intelligence aim to improve the learning of feature and pattern extraction, and they are applied in a number of contexts, including education [15]. Support Vector Machine (SVM), which is based on the principle of margin maximization [16]; K-Nearest Neighbor (K-NN), whose purpose is to find the nearest neighbors among nodes [17]; Convolutional Neural Network (CNN), a neural network characterized by its use of the convolution operation [18]; and Decision Tree (DT), defined as a simple representation of a finite set of classes [19], are considered the most commonly used and best-known traditional approaches [20] [21]. Studies such as [22] apply a convolutional neural network model to plant recognition from an image dataset, reporting an effectiveness of 86.2% with the CNN model implemented. In the field of education, [15] combined a classification method with augmented reality by presenting a CNN model for classifying geometric figures captured from an AR-Sandbox, with the aim of supporting early education and fine-motor therapy in children. The authors found that the proposed method reduced the loss by 39.45% and increased the number of correct answers by 14.83%, concluding that the selected CNN model improves performance.
Taking the above into account, this study applies the classification methods SVM, CNN, K-NN and DT to images captured from the immersive environment presented in [23], an augmented reality environment that supports the training of children from 4 to 6 years old through 3D scenarios in which they reinforce the learning and practice of vowels. The methods are applied in order to provide feedback on the activities carried out by the children, and they are compared using a confusion matrix, a classification report and cross validation as performance metrics. Traditionally, machine learning-based approaches are assessed in a cross validation scenario, which validates the classification model by assessing how the result will generalize to an independent dataset [24]. A confusion matrix, in turn, gives a clear view of the classification accuracy of each class by showing its correct classifications and misclassifications against each target class [25], and a classification report is commonly used to check the quality of the predictions of a classification algorithm. The rest of this article is organized as follows: Section II presents the materials and methods, describing the general scenario, which is divided into the immersive environment and the activities feedback developed with image classification methods; Section III presents the results and discussion; finally, conclusions and future work are presented.

II. Materials and Methods
Fig. 1 shows the general scenario of this work, structured in two main components: Immersive environment and Activities feedback. The Immersive environment component is based on an augmented reality prototype that presents a series of 3D scenarios with which the user interacts through a gesture control device, and whose objective is the practice of vowels in support of fine-motor therapy in children from 4 to 6 years old. The Activities feedback component complements the first by determining whether the child's exercise corresponds to the expected result; this is achieved by applying the classification methods CNN, SVM, K-NN and DT to the images obtained from the immersive environment. Finally, in order to measure the performance of the methods, they are compared by means of a confusion matrix, a classification report and cross validation. These techniques have been used in studies such as [20] [21] [26], which compare the methods on image recognition tasks such as the identification of plant diseases, the recognition of digits and everyday objects using datasets such as MNIST and CIFAR-10, and the identification of types of rice grains.

A. Immersive Environment
The immersive environment used is presented in [23], where the authors developed a set of scenarios based on Unity 3D. There are five modules, one for each vowel; each consists of scenarios where multimedia elements such as videos, related words and images are presented for learning the vowel, plus a training scenario for practicing it. The user interacts with the scenarios through the Leap Motion gesture control device, which detects and tracks the position of the participant's hands and fingers in space by means of two monochromatic stereo cameras and three infrared LEDs [27], as shown in Fig. 2. In this way, the user can play the videos, change the scenario and move through the different modules. The training scenario presents a blank board and indicates the vowel to be written; the child uses their hands to write on the board through the Leap Motion device (Fig. 3). This allows the child to practice writing the vowels while exercising fine motor skills, since the task involves hand movement and hand-eye coordination. The images of the vowels are obtained from this platform by taking a screenshot of the training board.

B. Activities Feedback
Fig. 4 presents the approach followed to develop the module that corresponds to Activities Feedback. Its starting point is the training of the machine learning models Support Vector Machine (SVM), K-Nearest Neighbor (K-NN) and Decision Tree (DT) using the Grid Search method, in order to optimize their hyperparameters and obtain the best model of each technique for the vowel training dataset, in accordance with the nature and operation of the immersive environment presented previously.
Additionally, a previously defined convolutional neural network (CNN) model is trained. It is taken from [15], which produced a base CNN model for recognizing images acquired from an immersive environment called AR-Sandbox, with an accuracy of 0.87 in the prediction phase. Once the models have been trained and selected, they are evaluated and compared using confusion matrices, a classification report made up of the precision, recall and F1 scores and, finally, cross-validation. The SVM, K-NN, DT and CNN models were trained with a dataset of 5 classes corresponding to the vowels (A, E, I, O, U). The dataset contains a total of 1600 images, 320 per class, taken from the internet or drawn with a basic drawing tool. Fig. 5 presents the training set used.

CNN Model
As mentioned above, the convolutional neural network model was selected from the study carried out in [15], which defined a base CNN model and optimized its hyperparameters by Random Search over a hyperparameter dictionary. Values were drawn at random from this dictionary to train a total of 50 models, and the final model was selected using the accuracy, the mean squared error and the loss function as decision criteria, the metrics reported in that study. The selected model was designed to recognize and classify geometric figures acquired directly from an immersive environment such as an AR-Sandbox. The model is presented in Fig. 6. Once the base model was defined, it was trained with the training dataset described above, obtaining an average accuracy of 0.9758 and a loss of 0.0874. Table I presents the compilation and training hyperparameters defined.
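The Random Search procedure described above can be sketched as follows. This is only an illustration: the hyperparameter names and values in `SEARCH_SPACE` are assumptions and do not reproduce the dictionary used in [15], and `evaluate` is a hypothetical callback that would train a CNN with the given configuration and return its validation accuracy.

```python
import random

# Hypothetical search space: names and values are illustrative assumptions,
# not the hyperparameter dictionary used in [15].
SEARCH_SPACE = {
    "learning_rate": [1e-2, 1e-3, 1e-4],
    "batch_size": [16, 32, 64],
    "dropout": [0.2, 0.3, 0.5],
    "conv_filters": [16, 32, 64],
}


def sample_configurations(space, n_models, seed=0):
    """Draw n_models random hyperparameter combinations from the space."""
    rng = random.Random(seed)
    return [
        {name: rng.choice(values) for name, values in space.items()}
        for _ in range(n_models)
    ]


def select_best(configs, evaluate):
    """Return the configuration with the highest score given by evaluate(config)."""
    return max(configs, key=evaluate)


# Fifty candidate configurations, matching the number of models in the text.
configs = sample_configurations(SEARCH_SPACE, n_models=50)
```

In practice `select_best(configs, evaluate)` would be called with a function that trains and scores one model per configuration; unlike Grid Search, only the sampled combinations are tried.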

SVM Model
To obtain the SVM model through hyperparameter optimization with Grid Search, a hyperparameter dictionary is defined in which the kernel used by the algorithm and the penalty value for the error are declared as variable parameters. Table II presents the hyperparameter dictionary defined. As a result, a total of 20 models were trained, taking the accuracy value as the decision criterion. The model with the best performance had a 'poly' kernel and an error penalty value of 1.0, obtaining an average accuracy of 0.595 (Fig. 7).
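A minimal sketch of this kind of search with scikit-learn's `GridSearchCV` is shown below. The grid values are an assumption chosen only so that 4 kernels and 5 penalty values give 20 candidates; they are not the contents of Table II. The bundled digits dataset stands in for the vowel images, and the 3-fold setting is likewise illustrative.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Stand-in data: scikit-learn's bundled digits set replaces the vowel images.
X, y = load_digits(return_X_y=True)
X, y = X[:300], y[:300]

# Assumed grid (not Table II): 4 kernels x 5 penalty values = 20 candidates.
param_grid = {
    "kernel": ["linear", "poly", "rbf", "sigmoid"],
    "C": [0.1, 0.5, 1.0, 5.0, 10.0],
}

# Every combination is trained and scored by cross-validated accuracy.
search = GridSearchCV(SVC(), param_grid, scoring="accuracy", cv=3)
search.fit(X, y)

n_candidates = len(search.cv_results_["params"])
print(n_candidates, search.best_params_, round(search.best_score_, 3))
```

The same pattern would apply to the K-NN and DT models by swapping the estimator and the grid.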

K-NN Model
Similarly, to select the K-NN model using Grid Search, Table III presents the hyperparameter dictionary defined, in which the number of neighbors and the weight function used in the prediction phase are declared as variable parameters. With this dictionary, a total of 14 models were trained, and their average accuracy score was used as the decision criterion. The model with the highest accuracy used 3 neighbors and a distance-type weight function, obtaining a score of 0.625 (Fig. 8).
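To illustrate what a distance-type weight function does with 3 neighbors, the following sketch implements a small distance-weighted vote in plain Python. It is a simplified stand-in, not the library implementation used in the study, and the 2-D points are hypothetical feature vectors rather than image data.

```python
import math
from collections import defaultdict


def knn_predict(train_points, train_labels, query, k=3):
    """Classify query by a distance-weighted vote among its k nearest points."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = defaultdict(float)
    for distance, label in distances[:k]:
        if distance == 0.0:
            return label  # an exact match decides the class outright
        votes[label] += 1.0 / distance  # closer neighbors weigh more
    return max(votes, key=votes.get)


# Hypothetical 2-D feature vectors standing in for image features.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
labels = ["A", "O", "O", "U"]

# Two of the three nearest neighbors are "O", but the single "A" is much
# closer, so the inverse-distance vote favors "A".
print(knn_predict(points, labels, (0.1, 0.1)))
```

With a uniform weight function the same query would be assigned to the majority class among the three neighbors, which is why the choice of weight function is worth including in the grid.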

DT Model
To complete the training of the machine learning models, the decision tree model is selected using Grid Search. A hyperparameter dictionary is defined in which the criterion for measuring the quality of a split and the strategy for choosing the split at each node are declared as variable parameters. With this dictionary, only 4 models were trained; the best combination uses the Gini function as the criterion for measuring the quality of a split and the 'best' strategy for choosing the split at each node. This model had an average accuracy of 0.61. Fig. 9 shows the model scheme with its hyperparameters.
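As a brief illustration of the Gini criterion mentioned above, the sketch below computes the Gini impurity of a set of labels and the size-weighted impurity of a candidate split. The function names are hypothetical helpers written for this example; a 'best' split strategy would choose, at each node, the split with the lowest weighted impurity.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((n / total) ** 2 for n in counts.values())


def split_impurity(left, right):
    """Size-weighted Gini impurity of a candidate split into two groups."""
    total = len(left) + len(right)
    return (
        len(left) * gini_impurity(left) + len(right) * gini_impurity(right)
    ) / total


# A node holding the five vowels in equal proportion is highly impure,
# while a split that separates the classes perfectly has zero impurity.
balanced = ["A", "E", "I", "O", "U"] * 4
print(round(gini_impurity(balanced), 3))
print(split_impurity(["A", "A"], ["E", "E"]))
```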


III. Results and Discussions
In order to evaluate and compare the models defined above, a total of 14 children were invited to interact with the immersive environment, producing 40 images for each class, that is, 40 images for each of the vowels, for a test dataset of 200 images in total. Fig. 10 shows the test dataset.
The models were evaluated by means of confusion matrices, a classification report composed of the F1, recall and precision scores and, finally, cross-validation. The purpose was to estimate the accuracy of each model both per vowel and overall, and then to compare the techniques implemented.

A. Confusion Matrix
The confusion matrices of each model on the test dataset were obtained with the scikit-learn library in Python. Table V presents the number of successes in the prediction phase for the CNN model, with a total of 40 samples per class; a success rate of 82.5% was obtained.
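A minimal sketch of how such a matrix and its overall success rate can be computed with scikit-learn follows. The ten labels and predictions are hypothetical and are not taken from the paper's tables.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

VOWELS = ["A", "E", "I", "O", "U"]

# Hypothetical ground truth and predictions for ten test images.
y_true = ["A", "A", "E", "E", "I", "I", "O", "O", "U", "U"]
y_pred = ["A", "E", "E", "E", "I", "I", "O", "U", "U", "U"]

# Rows are true classes and columns are predicted classes, in VOWELS order.
matrix = confusion_matrix(y_true, y_pred, labels=VOWELS)

hits = int(matrix.trace())          # correct predictions lie on the diagonal
success_rate = hits / matrix.sum()  # equals accuracy_score(y_true, y_pred)

print(matrix)
print(hits, success_rate)
```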

DT Model
Table VIII presents the number of hits in the prediction phase for the DT model, with a total of 40 samples per class; a hit rate of 57.5% was obtained. Based on the confusion matrices presented above, the model with the best performance is the CNN, with an 82.5% success rate in its predictions, an increase of 23.5 percentage points over SVM, 30 over K-NN and 25 over DT. A specific analysis of the confusion matrices is presented in Table IX, where H is the percentage of correct answers and F the percentage of errors.
As shown in Table IX, a vowel-by-vowel analysis shows that the CNN model significantly outperforms the rest of the models except on the vowel O, where SVM surpasses it by 0.75% and K-NN by 1.5%.
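The per-vowel percentages H and F can be derived from a confusion matrix as sketched below. The matrix is a hypothetical example with 40 samples per class and does not reproduce any table of this study; `hit_and_error_rates` is an illustrative helper name.

```python
def hit_and_error_rates(matrix, class_names):
    """Return {class: (H, F)}: percentage of hits and of errors per true class."""
    rates = {}
    for index, name in enumerate(class_names):
        row_total = sum(matrix[index])
        hit_pct = 100.0 * matrix[index][index] / row_total
        rates[name] = (hit_pct, 100.0 - hit_pct)
    return rates


# Hypothetical confusion matrix, 40 samples per row (true class).
example_matrix = [
    [36, 2, 0, 1, 1],
    [3, 32, 2, 1, 2],
    [0, 1, 38, 0, 1],
    [2, 2, 0, 30, 6],
    [1, 3, 1, 5, 30],
]

rates = hit_and_error_rates(example_matrix, ["A", "E", "I", "O", "U"])
for vowel, (h, f) in rates.items():
    print(f"{vowel}: H={h:.1f}% F={f:.1f}%")
```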

B. Classification Report
The precision, recall and F1 scores were obtained with the classification report function of the scikit-learn library, which includes these metrics. Table X presents the average of these metrics for each of the trained models, and Fig. 11 represents the consolidated data of Table X graphically. Fig. 11 shows that the CNN model performs better than the rest of the techniques. It also indicates that the SVM, K-NN and DT models produce a greater number of false positives, so their values for these metrics are on average 24.13% lower than those of the CNN model.
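The sketch below shows one way to obtain these averaged metrics with scikit-learn. The labels and predictions are hypothetical, and macro averaging is an assumption, since the text does not state which average was reported in Table X.

```python
from sklearn.metrics import classification_report, precision_recall_fscore_support

VOWELS = ["A", "E", "I", "O", "U"]

# Hypothetical ground truth and predictions for ten test images.
y_true = ["A", "A", "E", "E", "I", "I", "O", "O", "U", "U"]
y_pred = ["A", "E", "E", "E", "I", "I", "O", "U", "U", "U"]

# Unweighted mean of the per-class precision, recall and F1 scores.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, labels=VOWELS, average="macro", zero_division=0
)

# Human-readable table with per-class and averaged scores.
report = classification_report(y_true, y_pred, labels=VOWELS, zero_division=0)
print(report)
print(round(precision, 4), round(recall, 4), round(f1, 4))
```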

C. Cross Validation
The cross-validation process used both the training dataset and the test dataset, that is, 1800 images. A total of 10 random splits were established using KFold and cross_val_score from the model_selection module of the scikit-learn library. Table XI presents the mean cross-validation score and its standard deviation for each of the models over the number of splits defined. Fig. 12 shows that, when cross-validation is applied, the score of the CNN model is 0.3345 higher than that of SVM, 0.3514 higher than K-NN and 0.3013 higher than DT, an average increase of 32.9 percentage points; in addition, this model reaches a high score, above 0.85. Fig. 12 also presents the consolidated data of Table XI as a box plot, from which it can be stated, first, that the median of the CNN model lies approximately 0.25 points above the median of the other models. In addition, its maximum value is close to 0.9 and its minimum value is close to 0.8, so its values do not vary significantly; that is, its cross-validation score remains stable.
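A minimal sketch of this procedure with `KFold` and `cross_val_score` is given below. The digits dataset and the K-NN estimator are stand-ins chosen so the example runs on its own; the shuffling and the `random_state` value are assumptions, not settings reported in the study.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data: scikit-learn's bundled digits set replaces the 1800 images.
X, y = load_digits(return_X_y=True)

# Ten random splits; each fold serves once as the held-out validation set.
folds = KFold(n_splits=10, shuffle=True, random_state=0)
model = KNeighborsClassifier(n_neighbors=3, weights="distance")

scores = cross_val_score(model, X, y, cv=folds, scoring="accuracy")

# Mean and standard deviation, the two values reported per model.
print(f"{scores.mean():.4f} +/- {scores.std():.4f}")
```

A small standard deviation across the ten folds is what indicates a stable score.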

IV. Conclusions
Based on the results obtained with the confusion matrix, precision, recall, F1 score and cross validation, it is possible to conclude that the CNN model implemented performs better than traditional methods such as K-NN, SVM and DT, even when the best models of these techniques are obtained through hyperparameter optimization. According to the confusion matrices, K-NN and SVM achieved a greater number of successes than the CNN in predicting the vowel O; nevertheless, this may indicate that these methods developed a bias toward that vowel during feature extraction and training.
Given the importance of the educational and motor training of children in the first school years, the model used to evaluate and give feedback on the exercises performed in an immersive environment that supports the development of these skills must achieve a high percentage of successes. Although the recognition of images of vowels, geometric figures and similar objects has been addressed in a large number of previous studies, it is important to consider that the immersive environment used here involves a variety of multimedia elements, which can generate noise. Despite this, the CNN model reached a high level of precision in its prediction stage; however, this value must be refined and increased, perhaps by using an image cleaning technique, as the authors of [15] do.
The immersive environment used in this study was taken from the work carried out in [23], whose main objective is the reinforcement of fine motor skills through the writing of vowels; however, since that environment provides no way to evaluate the exercise, it cannot be guaranteed that the children have fulfilled the objective. For this reason, once a method with a high success rate is available for giving feedback on the exercises performed in immersive environments such as this one, the design and implementation of a monitoring and recommendation system for students is considered as future work, to be applied from the early stages of education, which can be decisive for their development. Additionally, new scenarios with different exercises could be included by broadening the range of images the model can recognize.