Learning Models for Semantic Classification of Insufficient Plantar Pressure Images

Establishing a reliable and stable model that predicts a target from insufficient labeled samples is feasible and effective, particularly for sensor-generated data-sets. Inspired by learning algorithms for insufficient data-sets, such as metric-based methods, prototype networks and meta-learning, this paper proposes a transfer learning method for insufficient data-sets. Firstly, two basic models for transfer learning are introduced, followed by a classification system and calculation criteria. Secondly, a plantar pressure data-set for comfort shoe design is acquired and preprocessed through a foot scan system; using a pre-trained convolutional neural network (CNN) based on AlexNet and CNN-based transfer modeling, the classification accuracy of the plantar pressure images is over 93.5%. Finally, the proposed method is compared with the current classifiers VGG, ResNet, AlexNet and pre-trained CNN. Our work is also compared with the known scaling and shifting (SS) and unknown plain slot (PS) partition methods on the public test databases SUN, CUB, AWA1, AWA2 and aPY, using indices of precision (tr, ts, H) and time (training and evaluation). On the plantar pressure classification task, the proposed method shows high performance on most indices compared with the other methods. The transfer learning-based method can be applied to other insufficient data-sets in sensor imaging fields.

I. Introduction
Data in society, scientific research and other fields are being generated at an unprecedented rate and are widely collected and stored. To further explore the information and knowledge hidden in these data-sets, data processing based on intelligent technologies has been developed and has become the consensus of current academic and industrial circles. However, the lack of labeled data has become increasingly prominent as new data-sets accumulate. Supervised learning can solve many important problems, but not every field can afford the manual labeling effort needed to produce a data-set like ImageNet [1]. Transfer Learning (TL) therefore plays a very important role in processing unlabeled data-sets. Traditional classification learning relies on two basic assumptions to guarantee the accuracy and reliability of the classification model: first, the training samples and the new test samples are independent and identically distributed; second, enough training samples are available to learn a good classification model [2]-[3].
In practice, however, these two conditions are often not satisfied. Over time, previously available labeled sample data may become unavailable, resulting in semantic and distribution gaps with new test samples. Reliable labeled sample data are often scarce and hard to obtain. This raises a very important problem in Machine Learning (ML): how to use insufficient labeled samples, or source domain data, to build reliable models that predict target domains with a different data distribution. Recently, TL has attracted extensive attention and research [4]. Transfer learning uses existing knowledge to solve problems in different but related fields; it aims to solve the learning problem in the target domain by transferring existing knowledge, thus relaxing the two basic assumptions of traditional machine learning when only a small amount of labeled sample data is available in the target domain. The more factors two different fields share, the easier the transfer will be. Otherwise it becomes more difficult, and even "negative transfer" can occur, which has side effects in practice [5].
Scholars have recently carried out extensive research on TL, and many have focused on different technologies to study TL algorithms. TL arises from mining techniques for incomplete or insufficient data-sets and, as one of the Deep Learning (DL) technologies, it has been used to solve problems in various fields. Previous studies readily achieved target accuracies of more than 90% [6]-[7]. However, DL is a data-hungry technology that requires many annotated samples. In practice, many problems arise from the lack of labeled data, and the cost of obtaining labeled data is high, for example in medical treatment and information security. This inevitably leads to the well-known small sample problem, in which new classes never seen during training must be handled using only a few labeled samples per class and without changing the trained model [8]. Taking image classification as an example, the traditional method is to obtain a model from the training set and then annotate the test set automatically. The small sample problem asks how data from many classes can be processed using only a few labeled samples. When the amount of labeled data is relatively small, the model needs to generalize to these rare categories without additional training. Few-shot, one-shot [9]-[11] and Zero-Shot Learning (ZSL) models are widely used to address this issue. TL is closely related to the few-shot and one-shot learning models [12]. Abderrahmane et al. [13] introduced a zero-shot haptic recognition algorithm for robots interacting with their environment. Ji et al. [14]-[15] introduced a Manifold-regularized Cross-Modal Embedding (MCME) approach for ZSL. Approaches combining ontology and reinforcement learning for Zero-Shot Classification (ZSC) [16]-[17], as well as Kernelized Linear Discriminant Analysis (KLDA), Central-Loss based Network (CLN) and Kernelized Ridge Regression (KRR) for ZSL [18], were developed subsequently. Yu et al. [19] introduced regularized cross-modality ranking (ReCMR) to capture semantic information from heterogeneous sources by exploiting intra-modal and inter-modal relations.
In ZSL, intuitive decision-making based on costs and benefits is more reliable [20]. Remarkable classification results on many new classes have been achieved with fast prototype-based learning algorithms [21]-[24]. From a TL perspective, the distribution of labeled source domain data generally differs from that of unlabeled target domain data, so the labeled sample data are not necessarily useful [25]. Selecting the training samples that are most beneficial to target domain classification is therefore of utmost importance. TL algorithms can be divided into those applied to a single source domain and those applied to multiple source domains. TL has been widely applied in text classification and clustering, emotion classification, image classification and collaborative filtering. Soudani and Barhoumi [26] extracted features from the convolutional part of two pre-trained architectures (VGG16 and ResNet50) to improve segmentation performance.
Plantar pressure testing and gait analysis is an internationally advanced technology, based on the principles of biomechanics, used to examine the structure of the lower limbs of the human body, to assess and predict future foot diseases, and to provide scientific rehabilitation methods. The ground reaction force can be divided into static and dynamic foot pressure, which respectively represent the ground reaction force that the human body receives when standing still and when walking, running or jumping. Most foot diseases are related to changes in the normal pressure on the soles of the feet, and the two usually affect each other. Therefore, understanding the distribution of normal foot pressure is not only an effective tool for diagnosing and evaluating foot diseases but also an important guide for restoring a normal or near-normal distribution of foot pressure and restoring the function of the foot.
The pressure distribution of the human foot directly reflects the pressure on the various parts of the foot when standing or running, and indirectly reflects the structure and function of the foot and the control of the knee, hip, spine and even the posture of the entire body. Testing and analyzing plantar pressure provides pressure information on normal and abnormal soles in various postures. Through dynamic collection of sole pressure data, the biomechanical properties of plantar pressure can be determined for the early prediction, diagnosis and treatment of deformed feet, various foot diseases and diabetic foot ulcers; for the quantitative assessment of the degree of joint disease and of postoperative efficacy in orthopedics and orthopedic surgery; and for walking training, the design of intelligent prosthetics and other fields. A comprehensive understanding of changes in plantar pressure is therefore of great significance for understanding the biomechanics and function of normal feet, clinical diagnosis, determination of disease severity, evaluation of postoperative efficacy, rehabilitation research and the design of various types of orthopedic and sports shoes.
Deschamps et al. [27] proposed a high-resolution pixel-level analysis and 12 standardized foot-pressure barometers on an original experimental data-set (including 97 diabetic patients and 33 non-diabetic patients). A new cohort of diabetic patients was used to determine the classification recognition rate, together with its sensitivity and specificity, in order to assess the classification effect. Ramirez-Bautista et al. [28] used plantar pressure data to detect disease progression, analyzed the use of different electronic measurement systems, and suggested a hybrid algorithm for classifying plantar images. Sommerset et al. [29] combined other physiological signals to study the effects of plantar pressure on compressible atherosclerosis. Xia et al. [30] analyzed the background of plantar pressure images (PPI) with high temporal and spatial resolution, and their experimental results showed higher effectiveness in terms of mean square error (MSE), exclusive OR (XOR) and mutual information (MI). A plantar pressure image differs from a general image: it is formed from discrete pressure values, and colors are used to represent the different pressure values. These pressure values are used by the intelligent image classification method in this paper. We therefore screened currently popular deep learning methods to process the images, thus transforming a data-set of discrete pressure values into an image processing problem. The motivation of our work is to find an effective transfer learning method for such insufficient data-sets, including data-sets that lack labels. Our main finding is that the image form of the sensor-generated data-set can be used for training and classification with an improved convolutional neural network, and that the resulting network models perform efficiently on the classification indices when compared with VGG, ResNet, AlexNet and pre-trained CNN.
The rest of the paper is structured as follows: Section 2 introduces the basic models for plantar pressure image classification, Section 3 describes the improved models, Section 4 presents the results and discussion, and Section 5 gives concluding remarks and future work.

A. Finetune-Based Modeling
DL requires a large amount of data and computing resources and takes a long time to train a model, requirements that are difficult to meet in practice. TL can effectively reduce the amount of data, computation and training time required, and the model can be customized to the new requirements of a scenario. Transfer learning is not an algorithm but a machine learning idea, and its application to deep learning is fine-tuning. The basic steps of fine-tuning are to modify the structure of the pre-trained network model (for example, the number of output classes), to selectively load the weights of the pre-trained network model (usually all the preceding layers except the last fully connected layer, also called the bottleneck layer), and then to retrain the model on the new data-set. Fine-tuning can quickly train a model that achieves good results with a relatively small amount of data.
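The selective weight loading step can be illustrated with a minimal sketch. It assumes, purely for illustration, that a network is represented as a dictionary from layer names to flat weight lists; the layer names, sizes and the helper `load_pretrained_except_head` are hypothetical and do not correspond to any particular framework.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]


def load_pretrained_except_head(pretrained: Weights, model: Weights, head: str) -> List[str]:
    """Copy every pre-trained layer into `model` except the classification head.

    Returns the names of the layers that were copied (the transferred layers).
    """
    copied = []
    for name, values in pretrained.items():
        if name == head or name not in model:
            continue  # the head is re-initialized for the new number of classes
        if len(model[name]) != len(values):
            continue  # size mismatch: keep the new model's own initialization
        model[name] = list(values)
        copied.append(name)
    return copied


# Hypothetical toy networks: the pre-trained head has 1000 outputs, the new one has 5.
pretrained = {"conv1": [0.1, 0.2], "conv2": [0.3, 0.4], "fc": [0.0] * 1000}
new_model = {"conv1": [0.0, 0.0], "conv2": [0.0, 0.0], "fc": [0.0] * 5}

transferred = load_pretrained_except_head(pretrained, new_model, head="fc")
print(transferred)           # ['conv1', 'conv2']
print(len(new_model["fc"]))  # 5
```

After this step only the new head starts from scratch, which is why retraining on a small data-set can converge quickly.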

Special Issue on Soft Computing
Finetune-based models have been widely used in the field of DL. Once a certain amount of labeled data has been obtained, a basic network can be fine-tuned. This basic network is trained on large-scale data-sets with rich labels, such as ImageNet or e-commerce data, known as universal data domains. Training is then performed on the specific data domain. During this process the parameters of the basic network are fixed while the domain-specific network parameters are trained, which includes choosing the fixed layers, the learning rate and so on. This modeling method is relatively fast and does not depend heavily on the amount of data. Fig. 1 shows the framework of a finetune-based model for the plantar pressure image data-set.

B. Siamese Neural Networks
Siamese Neural Networks (SNN) measure how similar two inputs are. The twin neural network has two inputs (Input1 and Input2), which are fed into two neural networks (network1 and network2). These two networks map the inputs into a new space, forming representations of the inputs in that space. Through the calculation of the loss, the similarity of the two inputs is evaluated. This method restricts the input structure and automatically discovers features that generalize to new samples. A twin network is first trained by supervised metric learning, and one/few-shot learning is then performed by reusing the features extracted by that network [31]-[34]. It is a two-branch neural network that combines samples of different classes into pairs for training. At the top level, it calculates the loss as a cross-entropy over a distance. For prediction, taking 5-way 5-shot as an example, five samples are randomly selected from each of five classes, and the resulting mini-batch of 25 pairs is input into the network, yielding 25 values. The category with the highest score is taken as the prediction, as shown in Fig. 2.
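The prediction rule for the N-way K-shot case can be sketched as follows. This is only an illustration under stated assumptions: the similarity function is passed in as a parameter, the negative L1 distance used here is a stand-in for the trained twin network, and the two-class support set is hypothetical toy data.

```python
from typing import Callable, Dict, List, Optional, Sequence

Vector = Sequence[float]


def predict_class(query: Vector,
                  support: Dict[str, List[Vector]],
                  similarity: Callable[[Vector, Vector], float]) -> Optional[str]:
    """Score the query against every support sample and return the label
    of the sample with the highest similarity score."""
    best_label, best_score = None, float("-inf")
    for label, samples in support.items():
        for sample in samples:
            score = similarity(query, sample)
            if score > best_score:
                best_label, best_score = label, score
    return best_label


def negative_l1(a: Vector, b: Vector) -> float:
    """Stand-in similarity (assumption): larger means more similar."""
    return -sum(abs(x - y) for x, y in zip(a, b))


# Hypothetical 2-way 2-shot support set.
support = {
    "class_a": [[0.0, 0.0], [0.1, 0.0]],
    "class_b": [[1.0, 1.0], [0.9, 1.0]],
}
print(predict_class([0.8, 0.9], support, negative_l1))  # class_b
```

In the 5-way 5-shot setting the same loop would produce the 25 pair scores mentioned above.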
The network structure, shown in Fig. 3, is an eight-layer deep convolutional twin network. The figure shows only one of the two branches. After the 4096-dimensional fully connected layers of the network, the component-wise L1 distance is computed to generate a 4096-dimensional feature vector, and a sigmoid activation maps it to a probability between 0 and 1 that indicates whether the two input samples are similar.
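A minimal sketch of this similarity head is given below. It assumes that the two branch embeddings are already computed and that the L1 distance vector is combined by a learned weight vector and bias before the sigmoid; the three-dimensional embeddings and the weight values are hypothetical placeholders for the 4096-dimensional case.

```python
import math
from typing import Sequence


def siamese_similarity(e1: Sequence[float], e2: Sequence[float],
                       weights: Sequence[float], bias: float = 0.0) -> float:
    """Component-wise L1 distance followed by a weighted sum and a sigmoid."""
    l1 = [abs(a - b) for a, b in zip(e1, e2)]               # distance vector
    logit = sum(w * d for w, d in zip(weights, l1)) + bias  # learned combination
    return 1.0 / (1.0 + math.exp(-logit))                   # probability in (0, 1)


# Hypothetical learned parameters: negative weights penalize large distances.
weights, bias = [-2.0, -2.0, -2.0], 3.0
same = siamese_similarity([0.2, 0.5, 0.1], [0.2, 0.5, 0.1], weights, bias)
different = siamese_similarity([0.0, 0.0, 0.0], [1.0, 1.0, 1.0], weights, bias)
print(same > different)  # True
```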

C. Matching Networks
Fig. 1. Pre-trained convolutional neural network with shared weights, fine-tuned on the specified data-set using the general image database ImageNet.

Without changing the network model, matching networks can generate labels for unknown categories. Their main innovations lie in the modeling and training processes. For modeling, a matching network based on memory and attention is proposed, which makes fast learning possible. For training, the work builds on a principle of traditional ML, namely that training and testing should be carried out under the same conditions [35]-[36]: during training the network is repeatedly shown only a few samples of each class, which is consistent with the testing process. Specifically, the network attempts to obtain a mapping from a support set $S = \{(x_i, y_i)\}_{i=1}^{K}$ (composed of K samples and labels) to a classifier $c_S(\hat{x})$, which is a network modeling $P(\hat{y} \mid \hat{x}, S)$. It gives the label $\hat{y}$ for each unknown test sample $\hat{x}$ based on the current S, and the label maximizes P. The model is described as

$$\hat{y} = \sum_{i=1}^{K} a(\hat{x}, x_i)\, y_i$$

where a is an attention (i.e. the distribution of classes on S), so that $\hat{y}$ is a linear combination of the labels on S weighted by the attention. For an $x_i$ that is farthest from $\hat{x}$, the attention under a certain measure is 0, and the output is the weighted fusion of the labels of the $x_i$ that are similar to $\hat{x}$. The attention is obtained by embedding the training sample $x_i$ and the test sample $\hat{x}$ separately and then applying a softmax transform:

$$a(\hat{x}, x_i) = \frac{\exp\big(c(f(\hat{x}), g(x_i))\big)}{\sum_{j=1}^{K} \exp\big(c(f(\hat{x}), g(x_j))\big)}$$

where c is the cosine distance, and f and g are the embedding models of the test sample and of the support set samples respectively. The two embedding models can be shared, for example as a CNN. The attention a is related to metric learning: for a sample $\hat{x}$ to be classified with label y, it needs to be aligned with the support samples labeled y and misaligned with the others. Furthermore, the support set embedding model g can be optimized further, and the support set samples should be used to modify the embedding model f of the test sample. This can be addressed in two ways: i) embedding the training set with a bidirectional LSTM, so that the embedding of each training sample is a function of the other training samples; ii) embedding the test samples with an attention LSTM, so that the embedding of each test sample is a function of the training set embedding. This work calls the result Full-Conditional Embedding (FCE).

Although the above classification is done on the whole support set, the embeddings used for the cosine distance calculation are independent of each other. Therefore, this work changes the embedding of the support set samples to $g(x_i, S)$, which is useful when $x_j$ is very close to $x_i$. The model uses a Bidirectional LSTM (BiLSTM), in which a forward LSTM and a backward LSTM are combined. In this way, past features (through the forward state) and future features (through the backward state) can be used effectively within a specified time frame. Back-propagation through time (BPTT) is used to train the bidirectional LSTM network. The forward and backward passes over the unrolled network are similar to those in conventional networks, except that the hidden state must be unrolled over all time steps, and special processing is needed at the beginning and end of the data points. Supposing that S is a random sequence, each $x_i$ is encoded as

$$g(x_i, S) = \overrightarrow{h}_i + \overleftarrow{h}_i + g'(x_i)$$

where $\overrightarrow{h}_i$ and $\overleftarrow{h}_i$ are the forward and backward hidden states of the BiLSTM, $g'(x_i)$ is the original embedding that depends only on $x_i$ itself, and the $x_i$ exchange information through the BiLSTM. For the optimization of f, the support set samples can be used to modify the embedding model of the test samples. This can be solved by a fixed-step LSTM with an attention model over the support set:

$$f(\hat{x}, S) = \mathrm{attLSTM}\big(f'(\hat{x}), g(S), K\big)$$

where $f'(\hat{x})$ depends only on the features of the test sample itself and serves as the input of the LSTM (unchanged at each step), K is the number of LSTM steps, and $g(S)$ is the embedding of the support set. As a result, the model can ignore some samples in the support set S. The embedding functions f and g optimize the feature space to improve the accuracy.
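The attention-based labeling over the support set can be sketched in a few lines. The sketch assumes that the embeddings of the test sample and of the support samples are already available (the embedding models themselves are not reproduced), uses cosine similarity followed by a softmax, and returns the attention-weighted class distribution; the two-dimensional embeddings and the labels are hypothetical toy values.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def attention(query: Sequence[float], support: List[Sequence[float]]) -> List[float]:
    """Softmax over the cosine similarities between the query and each support sample."""
    sims = [cosine(query, s) for s in support]
    m = max(sims)                              # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]


def predict_distribution(query: Sequence[float], support: List[Sequence[float]],
                         labels: List[int], num_classes: int) -> List[float]:
    """Attention-weighted sum of the one-hot support labels."""
    probs = [0.0] * num_classes
    for weight, label in zip(attention(query, support), labels):
        probs[label] += weight
    return probs


# Hypothetical pre-computed embeddings for a support set with two classes.
support = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
labels = [0, 0, 1]
probs = predict_distribution([1.0, 0.1], support, labels, num_classes=2)
print(probs.index(max(probs)))  # 0
```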

D. Insufficient Data-set Transfer Learning Modeling
To introduce our zero-sample problem, we first briefly describe the recognition problem as it is usually posed in computer vision. It follows a simple pattern: we choose a suitable learning machine, here a convolutional neural network (CNN), prepare enough training data with corresponding labels, and use the training data to train the learning machine. The main feature of the CNN is its use of convolutional layers, which in effect simulate human visual neurons. A single neuron responds only to certain specific image features, such as horizontal or vertical edges. Each neuron is very simple in itself, but these simple neurons form a layer, and with enough layers the system can extract sufficiently rich features. The CNN is the most popular neural network model for image classification problems. An important idea behind the CNN is the local understanding of the image: its weights only need to see small patches of the image, rather than forming a fully connected network with weights from every pixel. Having fewer parameters greatly reduces the time required for learning and the amount of data required to train the model.
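The saving in parameters can be made concrete with a small calculation. The sizes below (a 224×224×3 input, a 3×3 kernel and 64 output units or filters) are hypothetical values chosen only for illustration, not settings taken from the experiments.

```python
def conv_params(kernel: int, in_channels: int, out_channels: int) -> int:
    """Weights of a square convolution kernel plus one bias per output channel."""
    return kernel * kernel * in_channels * out_channels + out_channels


def dense_params(height: int, width: int, in_channels: int, units: int) -> int:
    """Weights of a fully connected layer over every input pixel plus one bias per unit."""
    return height * width * in_channels * units + units


conv = conv_params(kernel=3, in_channels=3, out_channels=64)
dense = dense_params(height=224, width=224, in_channels=3, units=64)
print(conv)   # 1792
print(dense)  # 9633856
```

Under these assumed sizes the convolutional layer needs several thousand times fewer parameters than a fully connected layer over the same input.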
In practical use, the test data are input into the trained learner, which outputs the corresponding labels. It is worth mentioning that, in this general pattern, the learner can only predict categories that exist in the training set.

Basic Modeling
This subsection presents the concept of zero-shot learning and its initial solution, called attribute migration between classes. We create a table that records the attribute values of each category, and pictures can be mapped to categories by looking up this class attribute table. These attribute descriptions can be summarized in advance: although it is difficult to collect many pictures of unknown categories, it is feasible to summarize the attributes of each category. The core idea of attribute migration between classes is that, although objects belong to different categories, they share the same attributes; the attributes corresponding to each category are extracted and learned by several learners. During testing, the attributes of the test data are predicted, and the predicted attributes are then matched with the corresponding categories to predict the category of the test data. A simple model of zero-sample learning is shown in Fig. 4. The image space and label space are the initial image space and label space respectively. In zero-sample learning, images are usually mapped into a feature space by some method, which is called feature embedding. Likewise, labels are mapped into a label embedding, and the linear or non-linear relationship between the feature embedding and the label embedding is learned. During testing, prediction through this transformation replaces the previous direct learning from image space to label space.
Fig. 4. Zero-shot learning using word vector and attributes.
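The table lookup at the heart of attribute-based prediction can be sketched as follows. This is an illustrative example only: the attribute table, the attribute names and the predicted attribute vector are hypothetical, and the attribute learners that would produce the prediction are assumed to exist and are not shown.

```python
from typing import Dict, Sequence


def nearest_class(predicted: Sequence[float],
                  class_attributes: Dict[str, Sequence[float]]) -> str:
    """Return the class whose attribute vector is closest to the predicted attributes."""
    def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(class_attributes,
               key=lambda name: squared_distance(predicted, class_attributes[name]))


# Hypothetical class attribute table: [striped, four_legged, aquatic].
class_attributes = {
    "zebra": [1, 1, 0],    # assumed to have no training images
    "horse": [0, 1, 0],
    "dolphin": [0, 0, 1],
}

# Hypothetical output of the attribute learners for one test image.
predicted = [0.9, 0.8, 0.1]
print(nearest_class(predicted, class_attributes))  # zebra
```

Because the class is recovered from attributes alone, a category without any training images can still be predicted as long as its row exists in the table.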

Discriminative Learning of Latent Feature
Compared with the inter-class attribute migration method, this approach mainly adds two ideas: it designs a special network to find the main part of the target in the picture, and, in addition to the manually annotated attributes, it adds a label embedding to search for latent attributes in the original picture. The model diagram is shown in Fig. 5.

Improved Framework for the Proposed Model
Universal zero-sample learning requires accurate prediction of both the categories that have been seen (WS) and those that have not been seen (WU), whereas ordinary ZSL only requires prediction of the unseen categories. The main idea of this method is first to train WS using the training data and training categories [37]-[39]. WS is then used to train a graph convolution network that takes word vectors as input, and the word vectors of the test categories are fed into the graph convolution network to obtain WU. Together, WS and WU form a learner W that predicts all categories, as shown in Fig. 6.
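How the two sets of classifiers are combined at prediction time can be sketched as follows. The sketch assumes that each category is represented by one weight vector and that the score of a category is the dot product with the image feature; the graph convolution network is not reproduced, so the unseen-class weights are hypothetical placeholders for its output, as are all names and values.

```python
from typing import Dict, Sequence


def classify(feature: Sequence[float], classifiers: Dict[str, Sequence[float]]) -> str:
    """Return the category whose weight vector gives the highest dot-product score."""
    scores = {
        label: sum(w * x for w, x in zip(weights, feature))
        for label, weights in classifiers.items()
    }
    return max(scores, key=scores.get)


# Hypothetical classifiers trained on the seen categories.
w_seen = {"seen_a": [1.0, 0.0, 0.0], "seen_b": [0.0, 1.0, 0.0]}
# Placeholder for classifiers predicted from word vectors of the unseen categories.
w_unseen = {"unseen_c": [0.0, 0.0, 1.0]}

# The combined learner covers all categories.
w_all = {**w_seen, **w_unseen}
print(classify([0.1, 0.2, 0.9], w_all))  # unseen_c
print(classify([0.8, 0.1, 0.0], w_all))  # seen_a
```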

A. Semantic Classification of Plantar Pressure Image Results
Training a good CNN model for image classification requires not only computational resources but also a long time, especially when the model is complex and the amount of data is large. On a standard desktop computer, training typically takes a few days of uninterrupted running. To train our plantar pressure image classifier quickly, we take model parameters that others have already trained and train our model on that basis, which is a form of transfer learning. Images from the basic data-set are shown in Fig. 7.

B. Pre-trained CNN using AlexNet for Classifiers
The data are divided into a training data-set and a validation data-set. 70% of the images were used for training, 15% for testing and 15% for validation. In the MATLAB implementation of AlexNet, splitEachLabel divides the image datastore into two new datastores, i.e. [imdsTrain, imdsValidation] = splitEachLabel(imds, 0.7, 'randomized'); this small data-set contains 55 training images and 20 validation images. AlexNet was trained on more than a million images and can classify images into 1,000 object categories. The model has therefore learned rich feature representations for a wide range of images. Setting net = alexnet and calling analyzeNetwork(net) presents the network architecture and detailed information about the network layers visually and interactively. The steps are as follows:
Step 1: Check the input size. The first layer (image input layer) requires an input image of size 227×227×3, where 3 is the number of color channels.
Step 2: Replace the final layers. The last three layers of the pre-trained network net are configured for 1000 classes. These three layers must be fine-tuned for the new classification problem. All layers except the last three are extracted from the pre-trained network. By replacing the last three layers with a fully connected layer, a softmax layer and a classification output layer, the layers are transferred to the new classification task. The options of the new fully connected layer are specified according to the new data. To make learning faster in the new layers than in the transferred layers, the Weight Learn Rate Factor (WLRF) and Bias Learn Rate Factor (BLRF) values of the fully connected layer are increased.
Step 3: Prepare the training. The network requires input images of size 227×227×3, but the images in the image datastore have different sizes. The training images can be resized automatically by using an augmented image datastore, and they are additionally shifted by up to 30 pixels in both the horizontal and vertical directions. Data augmentation helps to prevent the network from over-fitting and memorizing the exact details of the training images. The training options are then specified. For transfer learning, the features from the early layers of the pre-trained network (the transferred layer weights) need to be retained, so the overall learning rate is kept small. Combined with the increased learning rate factors of the new fully connected layer, this setting speeds up learning only in the new layers and keeps it slow in the other layers. When performing transfer learning, the number of training epochs required is relatively small; an epoch is a complete training cycle over the whole training data-set. The mini-batch size and the validation data are also specified. The software validates the network every ValidationFrequency iterations during training.
Step 4: Train the network consisting of the transferred layers and the new layers.
Step 5: Calculate the classification accuracy on the validation set. Fig. 8 and Fig. 9 present the accuracy and loss of the 25-layer AlexNet CNN.
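Two of the ideas in the steps above, the per-label random split and the layer-wise learning rate factors, can be sketched in plain Python. This is a hedged illustration and not the toolbox implementation: `split_each_label` only mimics the purpose of the splitEachLabel call by rounding the proportion within each label (the rounding rule is an assumption), `effective_learning_rates` simply multiplies a base rate by a factor, and the file names, base rate and factors are hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple

Item = Tuple[str, str]  # (image path, label)


def split_each_label(items: List[Item], proportion: float,
                     seed: int = 0) -> Tuple[List[Item], List[Item]]:
    """Randomly split the items of every label into training and validation parts."""
    by_label: Dict[str, List[Item]] = defaultdict(list)
    for path, label in items:
        by_label[label].append((path, label))
    rng = random.Random(seed)
    train, validation = [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        cut = round(proportion * len(group))  # assumed rounding rule
        train.extend(group[:cut])
        validation.extend(group[cut:])
    return train, validation


def effective_learning_rates(base_lr: float, factors: Dict[str, float]) -> Dict[str, float]:
    """Per-layer learning rate = base learning rate times the layer's factor."""
    return {layer: base_lr * factor for layer, factor in factors.items()}


# Hypothetical data-set: 3 labels with 10 images each.
items = [(f"img_{label}_{i}.png", label) for label in ("a", "b", "c") for i in range(10)]
train, validation = split_each_label(items, 0.7)
print(len(train), len(validation))  # 21 9

# Hypothetical setting: small base rate, larger factor for the new layer.
rates = effective_learning_rates(1e-4, {"transferred_layers": 1.0, "new_fc": 20.0})
print(rates["new_fc"] > rates["transferred_layers"])  # True
```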

C. CNN based Transfer Modeling
The CNN model has two parts: the convolution layers at the front and the fully connected layers at the back. The convolution layers extract image features, while the fully connected layers classify them. Our method modifies the fully connected layer and retains the convolution layers of the Inception-v3 network model. The parameters of the convolution layers have already been trained by others, whereas the parameters of the fully connected layer need to be initialized and trained on our own data, as shown in Fig. 10. In the Inception-v3 model in Fig. 10, the part before the red arrow is the convolution part and the part after it is the fully connected layer. We modify the fully connected layer and change the final output of the model to 5. Because the TensorFlow framework is used here, we need to obtain the tensor BOTTLENECK_TENSOR_NAME (the output of the last convolution activation function, with 2048 values) and the tensor JPEG_DATA_TENSOR_NAME (the initial input data of the model). With these two tensors, image data are fed into the model through JPEG_DATA_TENSOR_NAME and the features produced by the convolution layers are read from BOTTLENECK_TENSOR_NAME. Both tensors are obtained by loading the Inception-v3 model together with its pre-trained parameters. For the classification task, the fully connected part is adjusted by adding only one layer, whose input is BOTTLENECK_TENSOR_NAME; finally, the cross-entropy loss function is defined. Because the training data-set is relatively small, all pictures are first fed into the model through JPEG_DATA_TENSOR_NAME, and the resulting values of BOTTLENECK_TENSOR_NAME are saved to the hard disk. During model training, the saved BOTTLENECK_TENSOR_NAME values are read from the hard disk as the input data of the fully connected layer. The final test accuracy was 93.5% after 4180 training steps.
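Training a single new layer on cached bottleneck features can be sketched without any framework. The example below is a simplified stand-in, not the TensorFlow code used in the experiments: the cache is an in-memory dictionary instead of files on disk, the feature vectors have 4 values instead of 2048, and the images, labels, learning rate and number of steps are hypothetical. Only the output size of 5 follows the text.

```python
import math
from typing import Dict, List


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)                          # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def forward(weights: List[List[float]], bias: List[float], x: List[float]) -> List[float]:
    """One fully connected layer followed by softmax."""
    return softmax([sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)])


def sgd_step(weights: List[List[float]], bias: List[float],
             x: List[float], label: int, lr: float) -> float:
    """One gradient step on the cross-entropy loss; returns the loss before the update."""
    probs = forward(weights, bias, x)
    for k in range(len(weights)):
        grad = probs[k] - (1.0 if k == label else 0.0)  # d(loss)/d(logit_k)
        for j in range(len(x)):
            weights[k][j] -= lr * grad * x[j]
        bias[k] -= lr * grad
    return -math.log(probs[label])


def predict(weights: List[List[float]], bias: List[float], x: List[float]) -> int:
    probs = forward(weights, bias, x)
    return probs.index(max(probs))


# Hypothetical cache of bottleneck features, computed once per image.
bottleneck_cache: Dict[str, List[float]] = {
    "img_0": [1.0, 0.0, 0.5, 0.0],
    "img_1": [0.0, 1.0, 0.0, 0.5],
}
labels = {"img_0": 0, "img_1": 1}

num_classes, dim = 5, 4
weights = [[0.0] * dim for _ in range(num_classes)]
bias = [0.0] * num_classes

losses = []
for step in range(50):
    losses.append(sum(sgd_step(weights, bias, bottleneck_cache[name], labels[name], 0.5)
                      for name in bottleneck_cache))

print(losses[-1] < losses[0])                          # True
print(predict(weights, bias, bottleneck_cache["img_0"]))  # 0
```

Because the convolution part is frozen, its output for each image never changes, which is why computing the features once and reusing them at every step is sufficient.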

D. Discussion
As a result, the pre-trained CNN achieves the required level of performance on the plantar pressure image data-set, and the classification was further improved by incremental fine-tuning. The detection and characterization of the plantar pressure image data-set by deep learning are compared and summarized in Table I [37], [40]-[42], using the indices Training-Testing Data Splitting (TTDS), Class Type (CT), Area Under the Curve (AUC), Average Precision Score (APS), recall, precision, f1 score (an index used in statistics to measure the accuracy of a binary classification model, which considers both the precision and the recall of the model), N, UN and Ave. Table II shows the details of the public test databases SUN, CUB, AWA1, AWA2 and aPY [37]. Scaling and shifting (SS) is the known partitioning method, while plain slot (PS) is the unknown partitioning method; tr is the precision on the known classification and ts is the precision on the unknown classification. The training and evaluation times for SS and PS are shown in Table III. The performance of the proposed method on the public data-sets SUN, CUB, AWA1, AWA2 and aPY [25] is shown in Table IV. Using Inception-v3, the proposed model was also compared on the public test databases, where H is a comprehensive index of "ts" and "tr", as shown in Table V. Therefore, the proposed CNN-based transfer model (CNNTM) has better performance on the primary databases on the known classification.
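The indices used in the tables can be computed as in the following sketch. It assumes the standard definitions of precision, recall and f1 score from true positives (tp), false positives (fp) and false negatives (fn), and it assumes that H is the harmonic mean of tr and ts, as is customary in zero-shot benchmarks; the text itself only calls H a comprehensive index. All counts and precision values below are hypothetical.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Precision, recall and f1 score from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def harmonic_mean(tr: float, ts: float) -> float:
    """Assumed definition of H: harmonic mean of the known and unknown precisions."""
    return 2 * tr * ts / (tr + ts) if tr + ts else 0.0


# Hypothetical counts and precisions.
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
print(round(precision, 3), round(recall, 3), round(f1, 3))
print(round(harmonic_mean(0.8, 0.4), 3))
```

The harmonic mean is low whenever either tr or ts is low, so under this assumption a model cannot reach a high H by performing well on known classes only.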

IV. Conclusion
Large-scale methods are currently used to process most labeled data, but they suffer from a lack of prior knowledge and a need for many training samples. In practice, it is difficult to accumulate data for most categories, so large-scale methods are not fully applicable, and learning from many samples of a given category is not feasible. Instead, a new category should be learned quickly from only a small number of samples. Two main solutions are currently under consideration: the first handles objects that have never been seen before, which is called zero-sample learning; the second learns knowledge from existing tasks and applies it to future model training, which can be considered a Transfer Learning (TL) problem. Transfer-based learning uses auxiliary data-sets when other data sources are available. The idea is to learn from these data a generalized representation that can be used directly on the target data and in the small sample category learning process. Sample-based transfer learning attempts to assign a new weight to each sample in the source data so that it better serves the new learning task: samples that resemble the target data are selected from the source data to take part in training, while samples that do not are excluded. In this paper, insufficient data-set learning methods are introduced, and an improved transfer learning framework is proposed and compared with other typical methods. The semantic classification of the plantar pressure image data-set achieves high effectiveness on the indices defined by current researchers. Current TL algorithms mainly aim to improve the performance of the same probability y in the feature space at different times.
For the plantar pressure image data-set, morphological changes of the anterolateral prefrontal cortex and the medial prefrontal cortex, which belong to the visceral motor network system, were found to be effective by plantar pressure structural imaging studies. The improved insufficient data-set learning method proposed in this work can solve the problem of insufficient training data and achieve excellent automatic recognition. TL is currently applied mainly to small and volatile data-sets (such as sensor data, text categorization and image categorization); future research needs to consider how to use TL in a wider range of data scenarios.