Classification-based Deep Neural Network Architecture for Collaborative Filtering Recommender Systems



I. Introduction
Recommender Systems (RS) [1] play an important role in addressing information overload on the Internet. RS can use diverse sources of information: users' votes on items, purchased products or consumed services, social information, geographical coordinates, demographic information, etc. The most accurate RS are the Collaborative Filtering-based (CF) ones; they learn from all the existing implicit or explicit information about how users vote on or consume items. It is usual to merge RS types (hybrid RS) [2] to improve accuracy results. CF RS can be reinforced by means of demographic [3], context-aware [4], content-based [5] and social [6] information. The scope of RS has expanded and currently covers an endless number of targets: tourism [7], films [8], networks [6], restaurants [9], e-learning, fashion [10], news [11], healthcare [12], etc. CF RS kernels have been implemented using machine learning methods: memory-based and model-based ones. KNN was the main memory-based method, but currently model-based algorithms are preferred due to their superior accuracy. Matrix Factorization (MF) is the most implemented approach, since it provides accurate recommendations, it is easy to understand, and it achieves good performance. There are several MF variations, such as PMF [13] [14], BNMF [15], BPR [16] and eALS [17]. MF extracts the complex relations between items and users and codes them into a reduced number of hidden factors.
Predictions are obtained by computing the dot product of the user and item hidden factor vectors. An MF drawback is the linearity of the dot product: it does not allow hidden factors to be combined in non-linear ways that would provide the most suitable predictions.
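As an illustration of this regression schema, the following minimal sketch (ours; all names and sizes are illustrative) shows how an MF prediction reduces to a linear dot product of hidden factors:

```python
# Minimal MF prediction sketch: a rating prediction is the dot product of
# the user and item hidden factor vectors (names and sizes are ours).
import numpy as np

F = 20                            # number of hidden factors (typically 12-40)
num_users, num_items = 943, 1682  # illustrative RS dimensions
P = np.random.rand(num_users, F)  # user factors, as learned by MF
Q = np.random.rand(num_items, F)  # item factors, as learned by MF

def predict(u, i):
    # Purely linear combination of hidden factors: the drawback noted above.
    return P[u] @ Q[i]
```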
To process different types of data, a variety of deep learning architectures are used: CNN [27] [28], RNN [29] [30], MLP [31] [4], autoencoders [8] [32], etc. The previous examples implement specific content-based approaches or hybrid RS solutions. The CF RS core is based on the ratings cast by users on items. The deep learning solutions to this challenge can be classified as: a) Neural Collaborative Filtering (NCF) [33], and b) Deep Factorization Machines (DeepFM) [34]. NCF approaches are usually based on dual neural networks that simultaneously process user and item information. The two main NCF references are: 1) Neural Network Matrix Factorization (NNMF) [35], and 2) Neural Collaborative Filtering (NCF) [33]. NCF reports better performance than NNMF, so we put the focus on it. Fig. 1 shows the NCF architecture; as explained, it takes the sparse user and item raw data vectors. Above the sparse layer there is a fully connected embedding layer that projects the sparse representation onto a dense vector. Then a Multilayer Perceptron (MLP) combines the features of the two pathways by concatenating them.
A deep learning type of architecture used in CF combines wide & deep learning [36] [37]: the wide component is a single-layer perceptron, whereas the deep component is an MLP; combining the two learning techniques enables the RS to capture both memorization and generalization. DeepFM [34] [38] also learns feature interactions, and achieves a 0.48% (Company dataset) and a 0.33% (Criteo dataset) accuracy gain compared to NCF. This is a small performance improvement, so we will use the better-known NCF as baseline. Classical methods make use of MF to make predictions. Some neural collaborative filtering methods replace the linear MF dot product used to predict with a non-linear MLP stage. Neural network methods that do not use MF are often based on deep hybrid models (CNN, RNN, etc.) that exploit data additional to the rating matrix, usually not available for most existing RS. Wide and multi-view deep learning architectures have been used as pure neural network RS approaches; their main drawback is the size of the wide layers, which jeopardizes scalability. The proposed method architecture is scalable, because it is based only on the RS item dimensionality, relegating the huge RS user dimensionality to the number of samples used to train the model.
As we have seen, existing RS deep neural approaches make use of an MF level or an embedding layer to catch the item and user latent vectors. They are regression-based architectures: for each user, recommendations are chosen from his or her N best predictions. Then, to make recommendations to a user it is necessary to run I predictions (I being the number of items in the RS). Our proposed deep neural architecture does not need an external machine learning level, such as the MF one. Neither does it use explicit embedding layers. Its original architecture comes from the fact that it makes recommendations by means of a classification process instead of a regression one. The classification approach requires a different design than the existing ones: it learns by using categorical labels instead of numerical values. Additionally, we are committed to creating a simple and scalable architecture. To meet the stated aims we have decided to renounce the numerical rating values and predictions: we will only make use of the binary relevant/non-relevant rating and the binary voted/non-voted information. As the reader can imagine, this decision opens a new range of architectural choices. Of course, a key question must be answered: will the accuracy of recommendations decrease significantly? State-of-the-art papers indicate that accuracy is not significantly affected, although it could vary depending on the dataset; the graph in Fig. 2 is provided from [39].

We will call our proposed method Neural Classification-based Collaborative Filtering (NCCF), and we are going to motivate this approach by explaining, with the help of representative figures: a) its innovative design, b) its scalability basis, and c) its straightforward method to obtain predictions for each user. Fig. 3 shows the evolution from MF methods to the proposed NCCF. First of all, both the MF and the NCF models (Figs. 3a & 3b) are regression-based: they provide prediction values, and the RS recommends the N highest predictions. Instead, NCCF (Fig. 3c) is based on a classification neural network. Furthermore, NCCF is not based on a previous machine learning MF process or an embedding layer; instead, raw rating data is used, avoiding the necessity of feature engineering stages [18]. It is important to highlight that NCCF raw data is made up of user vectors, where each vector contains item information. NCCF does not combine user vectors with item vectors, as the NCF baseline does. Neither does it combine user/item vectors with dense embeddings coming from a factorization machine. This makes the proposed approach simpler and more scalable than the current baselines.

To improve understanding of the proposed method's advantages, Fig. 4 shows an outline involving the learning parameters of both the proposed and the baseline models. The MF machine learning algorithm needs to learn a set of F factors for each user (U) and for each item (I) in the dataset. F usually ranges from 12 to 40 and, as shown in Fig. 4a, in commercial RS the number of users is much larger than the number of items. The number of necessary MF parameters is (U+I)·F. NCF approaches are built on MF, so Fig. 4b reproduces the MF parameters shown in Fig. 4a. Additionally, the MLP needs learning parameters; if the first inner layer contains Ln neurons, we need (U+I+Ln)·F parameters to run the MF model and to process the first MLP layer. The following layers will need more learning parameters, but usually Lm (the number of neurons in the following layer) is not a high number.
The output layer just needs Lm parameters, since we only need one neuron to obtain the regression value (the prediction). Overall, the NCF baseline, when the MLP has two inner layers, needs (U+I+Ln)·F+(Ln+1)·Lm parameters. The proposed NCCF method is designed to be scalable: it makes use of a tiny portion of the parameters the NCF baseline needs. Even though it uses an output layer of dimension I, it just needs I·(Lm+Ln)+Lm·Ln learning parameters. Another relevant improvement of the proposed method is its fully neural network processing: it allows modeling the non-linear relations from the raw data to the result (the predictions). It does not start from previous machine learning results that could have lost some of the rich data dependencies. Furthermore, NCCF processes RS predictions in a different way than the baselines do, as shown in Fig. 5: in order to provide N recommendations to a user, the MF approach (Fig. 5a) needs to run I dot products (if we dismiss the items already voted by the user). The NCF baseline (Fig. 5b) must run its MLP forward I times, each time providing a <user, item> pair to the neural network. Instead, to provide recommendations to a user, the proposed NCCF (Fig. 5c) just has to run its MLP forward once. Since we are working with a categorical target, the output layer provides I classification probabilities. To make N recommendations we just have to select the N highest probabilities; this is a simple and straightforward operation.
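To make the parameter-count comparison concrete, here is a small worked calculation using the formulas above (the U, I, F, Ln and Lm values are illustrative, not taken from the experiments):

```python
# Learning-parameter counts, computed from the formulas stated above.
U, I = 1_000_000, 1682   # users and items: in commercial RS, U >> I
F = 20                   # hidden factors (usually 12 to 40)
Ln, Lm = 200, 100        # neurons of the first and second MLP inner layers

ncf_params = (U + I + Ln) * F + (Ln + 1) * Lm   # NCF baseline: grows with U
nccf_params = I * (Lm + Ln) + Lm * Ln           # proposed NCCF: independent of U

print(f"{ncf_params:,}")    # 20,057,740
print(f"{nccf_params:,}")   # 524,600
```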

II. Proposed Method
This section explains the proposed methods both in a conceptual way and in their formalized version. Each type of explanation is given in a separate subsection. The first subsection centers on the concepts and the motivation of each proposed approach; a data-toy running example is provided, as well as an explanation of the proposed architecture through a representative figure. The second subsection provides the formalization of the proposed methods to facilitate their implementation; the formalization is explained by means of a data-toy running example.

A. Concepts
We propose two related methods: a) the short one, and b) the long one. The short method is based on the relevant ratings of the dataset: ratings that reach the value of an established threshold (e.g. 4 stars in a RS where users can vote from 1 to 5 stars). We lose the numerical fine grain of the votes and we renounce making numerical predictions (e.g. "we recommend you this item with a 4.2 stars value"). We will not work with the usual rating data; instead, we make an abstraction by selecting the binary relevant/non-relevant information. Table I shows a data-toy example of a CF RS dataset ("Original dataset") and the abstracted relevant vs. non-relevant information ("Proposed short and long methods preliminary information"). Our aim is to accurately recommend RS items without making use of the usual numerical prediction, which in fact is a regression process. We will make RS recommendations by means of a classification process, and the proposed classification process will be mainly based on the rating relevancy abstraction.
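As a minimal sketch of this abstraction (our code, assuming ratings are held in a dense user × item matrix where 0 encodes the non-voted "•" value):

```python
# Relevance abstraction sketch: numerical ratings become binary
# relevant/non-relevant values given the established threshold.
import numpy as np

THRESHOLD = 4                        # e.g. 4 stars in a 1-5 stars RS
ratings = np.array([[5, 0, 3, 4],    # toy matrix; 0 encodes "not voted"
                    [1, 4, 0, 5]])

relevant = (ratings >= THRESHOLD).astype(np.int8)  # D_u vectors, one row per user
```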
Table I. Preliminary information for both the short and long proposed methods. The value "•" means non-voted item. Threshold = 4.
By converting ratings into relevant/non-relevant data we lose one type of information: whether an item was voted or not. This is due to the inherent sparse nature of CF RS: users can only vote on a reduced number of the available items, because we cannot see all the films, buy all the products or consume all the services that a modern RS offers. Our proposed long method incorporates the voted/non-voted information and adds it to the short method information; Table I ("Proposed long method preliminary information") shows the concept. Neural networks are able to relate both types of information to do their job (in this case, to accurately classify); this means that using the long model it is possible to discern between: a) relevant vote, b) non-relevant vote, and c) not voted. The short method approach works without the "not voted" information.
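Continuing the previous sketch, the long method information can be derived in the same way (again, the 0 encoding of "•" is our assumption):

```python
# Long method sketch: a voted/non-voted mask is kept alongside the
# relevance data, doubling the input size from I to 2*I.
voted = (ratings > 0).astype(np.int8)                # 1 = voted, 0 = non-voted
long_input = np.concatenate([relevant, voted], axis=1)
```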
The long model approach has the advantage of working with more information than the short model, and it is expected to provide better results. Its main drawback is the size of its necessary input and output neural network layers: it needs twice as many input and output neurons as the short method approach (since it provides two types of information: voted/non-voted and relevant/non-relevant). The key issue is the obtained balance between performance and accuracy. Accuracy will mainly come from the relevant/non-relevant information, whereas the voted/non-voted data can improve results. While the meaning of the "voted" items information is clear, the "non-voted" items information is a controversial subject. A non-voted item can mean: a) the user does not know the item, b) the user knows the item but he/she does not know if he/she will like it, c) the user knows that he/she does not like the item, but did not vote it, and d) the user knows that he/she likes the item, but did not vote it. The RS aim centers on options a) and b). Our short method loses the semantic difference between not voting an item and voting it as non-relevant. The lost information will be more or less important depending on the relative significance of cases c) and d). The proposed short method assigns the meaning "voted as non-relevant" to the non-voted items, so case c) does not affect the proposed short model accuracy. The proposed long method does not assign a "negative" semantic to the non-voted items, as the short model does; this means that the long method correctly manages case d), whereas the short method does not. The accuracy difference between the proposed methods relies on the significance of case d) in each RS: the more active the users, the less important this case becomes, and vice versa. Published studies show that using binary relevancy information instead of numerical votes does not significantly change accuracy results [39].
On the basis of the preceding information (Table I), we create one sample (data and target) per relevant rating; thus the number of samples of the model will be the number of relevant ratings existing in the dataset. Each dataset user will generate a number of samples equal to her number of relevant ratings. The key concept is to train our model with the relevant ratings of each user except one of them (this is the X data of the sample), and to use the removed rating as the target (this is the y label of the sample). The process starts with the first relevant rating of the user and is repeated until the last relevant rating of the user is reached. As an example, the samples obtained from the short method information in Table I are shown in Table II. Our proposed methods provide valuable information to make a RS classification: each relevant rating of a user can be predicted from the set of the remaining relevant ratings of this user. Additionally, each relevant rating of a user can be predicted from the rest of the users' data. In Table II we can observe how the model can learn that target 3 (y_{0,3} and y_{1,3}) can be predicted when i1 is relevant or when i2 and i5 are relevant. In the same way, target 5 will be particularly probable when i2 is relevant, and this information comes from users 1 and 2. Current datasets contain hundreds of thousands or millions of relevant ratings, so the proposed methods will provide rich models to be used in artificial intelligence classification algorithms.
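A possible implementation of this sample-generation process, continuing the sketches above (our code, not the paper's released implementation):

```python
# One sample per relevant rating: X is the user's relevance vector with
# the target rating hidden, and y is the removed item's class label.
def build_samples(relevant):
    X, y = [], []
    for user_row in relevant:
        for item in np.flatnonzero(user_row):   # indices of relevant ratings
            x = user_row.copy()
            x[item] = 0                          # hide the target rating
            X.append(x)
            y.append(item)                       # categorical label = item index
    return np.array(X), np.array(y)

X, y = build_samples(relevant)   # number of samples = number of relevant ratings
```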
A fundamental design decision in the proposed methods is to keep the dimensionality of the samples at a reasonable size; we want: a) a large number of samples to train and test the model, and b) a reasonable size of the samples in order to maintain the performance (time consumption) of the classification algorithms. Since commercial RS have many more users than items, to make these requirements compatible we must use the RS user dimension to generate samples, and the RS item dimension to establish the sample size. Some RS have a large number of items; in such cases, the proposed short model can be particularly appropriate, since it halves the sample size. Fig. 6 shows a generic neural network model implementing this paper's proposed methods. Each input data column (on the left) represents a sample. Our proposed methods provide the samples; e.g. each row in Table II feeds the corresponding column sample in the proposed short method neural network architecture. The proposed long method adds the voted/non-voted information to the neural network. We also provide the categorical labels, necessary for the classification task: there are as many classes as items in the RS (I is the number of items in Fig. 6). Once the neural network has been trained, for each new input sample (new user) the neural network predicts a set of probabilities: the probability of each class (item) to be classified from the sample data (user). We just have to take the N highest probabilities to obtain the N recommendations; this process is drawn on the right of both methods in Fig. 6.

B. Formalization and Data-Toy Example
Let I be the set of the RS items.
Let U be the set of the RS users.
Let V be the set of the possible rating values, extended with the special value "•" that denotes a non-voted item.

Let R be the set of the dataset ratings:

  R = { R_u,i | u ∈ U, i ∈ I }   (1)

Let R_u be the set of the dataset ratings cast by user u:

  R_u = { R_u,i | i ∈ I }   (2)

Let R_u,i ∈ V be the rating cast by user u to the item i.   (3)

Let I' be an ordered list of the I elements.

Let θ be the rating relevance threshold.   (4)

Let D_u be the intermediate data for the proposed method short version. D_u differentiates the relevant and non-relevant ratings cast by user u:

  D_u = ⟨ D_u,i | i ∈ I' ⟩   (5)

where D_u,i is the relevance of the R_u,i vote:

  D_u,i = 1 if R_u,i ≥ θ; D_u,i = 0 otherwise (including R_u,i = •)   (6)

Let X_u,s be the sample data s from D_u, and Y_u,s the target data s from D_u. Denoting by i_s the item of the s-th relevant rating of user u:

  X_u,s = D_u with its component D_u,i_s set to 0   (7)
  Y_u,s = i_s   (8)

Finally, the X and Y data to train the proposed short method are:

  X = { X_u,s | u ∈ U, s = 1..|{ i | D_u,i = 1 }| }   (9)
  Y = { Y_u,s | u ∈ U, s = 1..|{ i | D_u,i = 1 }| }   (10)

Let D'_u be the intermediate data for the proposed method long version. D'_u differentiates the user u voted and not voted items:

  D'_u = ⟨ D'_u,i | i ∈ I' ⟩   (11)
  D'_u,i = 1 if R_u,i ≠ •; D'_u,i = 0 otherwise   (12)

Let X'_u,s be the sample data s built from D_u and D'_u, and Y'_u,s the corresponding target data s:

  X'_u,s = X_u,s ∥ D'_u (concatenation)   (13)
  Y'_u,s = Y_u,s   (14)

Finally, the X' and Y' data to train the proposed long method are:

  X' = { X'_u,s | u ∈ U, s = 1..|{ i | D_u,i = 1 }| }   (15)
  Y' = { Y'_u,s | u ∈ U, s = 1..|{ i | D_u,i = 1 }| }   (16)

Once the model has been trained by using the X and Y sets or the X' and Y' sets, we obtain a result set W with cardinality |I|. W contains the probability of each item to be recommended. These results are predicted by using a neural network containing |I| "softmax" output neurons. The N expected recommendations Z will be the N highest results from W:

  Z = { i ∈ I | W_i is among the N highest values of W }   (17)
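A sketch of the recommendation step in equation (17), assuming W is the softmax output for one sample (our helper, including the optional dismissal of already-voted items discussed in the experiments section):

```python
# Equation (17) sketch: select the N items with the highest probabilities.
import numpy as np

def recommend(W, voted_mask, N=10):
    W = W.copy()
    W[voted_mask.astype(bool)] = -1.0   # optionally dismiss already-voted items
    return np.argsort(W)[::-1][:N]      # indices of the N highest probabilities
```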
We will explain the previous equations by means of the data-toy example in Table I. It shows the ratings cast by users 0, 1 and 2 on items 0 to 5; the "•" symbol means not voted, following the V set definition from the above equations. The number that accompanies each expression refers to the applied equation of the previous formalization.
Equation (2) provides the sets of ratings cast by users 0, 1 and 2, and equation (3) identifies individual R_u,i values; as an example, we can see each rating in Table I. We establish the θ = 4 threshold (equation (4)). Applying equations (5) and (6), we obtain the D_u information for training and testing the proposed short method. The X data and the y targets for the classification neural network are extracted from D_u: each item column feeds a neuron of the neural network input layer, and each user row is a sample for the training or the testing process.
From equations (7) to (10) we obtain the input X and the output Y values to feed and train the neural network. This is the proposed short method to arrange data in order to perform RS classification and then RS recommendation. Table II shows the obtained results; as can be seen, the method provides the categorical values for the labels.
The additional set of information for training and testing the proposed long method tells us about the voted and not voted items (equations (11) and (12)). The proposed long method concatenates both D_u and D'_u (Table IV).
From equations (13) to (16) we obtain the input X' and output Y' values to feed the neural network training. This is the proposed long method to arrange data in order to perform RS classification and then RS recommendation; we can see it in Table V.

III. Experiments and Results
The experiments designed to test our proposed approaches make use of the MovieLens 1M (https://movielens.org/) [45] dataset and the FilmTrust [44] dataset.
The tested classification quality measures will mainly be Precision and Recall [1]. The chosen relevancy threshold has been 4 for MovieLens and 3 for FilmTrust. The tested number of recommendations N has ranged from 1 to 96, in increments of 5. For both datasets, the training set has been randomly obtained by taking 80% of the samples, whereas the test set has used the remaining 20%. Several neural network architectures have been tested, and we have chosen the ones that achieved the best results. The Keras (https://keras.io/) deep learning library has been used to run the experiments. For reproducibility purposes, we openly provide the necessary files to feed the neural networks (http://rs.etsisi.upm.es/); these files have been obtained by implementing the proposed methods.
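For reference, a minimal sketch of the precision and recall measures at N recommendations for a single test user (the classical information retrieval definitions; the helper name is ours):

```python
# Precision and recall at N for one user: hits are the recommended items
# that appear among the user's relevant items in the test set.
def precision_recall_at_n(recommended, test_relevant):
    hits = len(set(recommended) & set(test_relevant))
    precision = hits / len(recommended)
    recall = hits / len(test_relevant) if test_relevant else 0.0
    return precision, recall
```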
The experiment explanations are organized into two subsections. The first subsection is devoted to comparing the proposed long approach versus the proposed short approach. These experiments have been run using the MovieLens dataset, and they are explained by means of illustrative architectural figures. Results are provided both for classification accuracy and for recommendation accuracy. The second subsection compares the proposed short method with a representative state-of-the-art deep learning baseline: NCF. Recommendation and classification results are discussed for both the MovieLens and FilmTrust datasets.

A. Long Versus Short Approaches Comparison
Our first set of experiments compares the results of the proposed short method versus the proposed long method, making use of the MovieLens dataset. The number I of items is 1682 in this dataset. We have trained our proposed long method by using a dense deep neural network involving an input layer of size 2 × 1682 = 3364, an output layer of size 1682, 400 neurons in the first internal layer followed by a 0.2 dropout layer, and finally 200 neurons in the second internal layer followed by another 0.2 dropout layer. All the layers use 'relu' activation, except the output one, which uses 'softmax' in order to provide probabilistic categorical results. The chosen loss function has been 'categorical cross-entropy', since we are using categorical labels. The selected optimizer is 'adam'. Fig. 7 shows the explained neural network architecture. The resulting number of trainable parameters is 1,308,632.

Analogously to the experiment performed using the proposed long model, an additional experiment has been made using the proposed short model. In this case, the size of the neural network input layer is the number of items of the dataset: 1682. Since the input layer size is half that of the previous experiment, we have reduced the neural network internal layer sizes to 200 and 100, and we have maintained the rest of the parameters: two 0.2 dropouts, 'relu' and 'softmax' activations, etc. The resulting number of trainable parameters is 526,582. Fig. 9 shows the designed neural network architecture, and in Fig. 10 we can see that the obtained loss and accuracy values and evolutions are similar to the proposed long method ones. This case also feeds the expectation of accurate precision and recall results from recommendations.
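A minimal Keras sketch of the short-method network just described (our reconstruction from the stated layer sizes; the exact Keras version and layer names are assumptions). Its summary reports the same 526,582 trainable parameters:

```python
# Short-method architecture sketch: 1682 inputs, 200 and 100 'relu'
# internal layers with 0.2 dropouts, and a 1682-neuron 'softmax' output.
from tensorflow import keras
from tensorflow.keras import layers

I = 1682  # number of items in the dataset

model = keras.Sequential([
    keras.Input(shape=(I,)),               # one binary relevance value per item
    layers.Dense(200, activation="relu"),
    layers.Dropout(0.2),
    layers.Dense(100, activation="relu"),
    layers.Dropout(0.2),
    layers.Dense(I, activation="softmax"), # probability of each item (class)
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",  # labels are one-hot encoded
              metrics=["accuracy"])
model.summary()   # 526,582 trainable parameters, matching the text
```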
It is important to distinguish between the loss and accuracy results, on one side, and the precision and recall, on the other side. Loss and accuracy are related to the training and testing results of the deep neural network (the biggest square in Figs. 7 and 9). Precision and recall are related to the recommended items (the rightmost square in Figs. 7 and 9). Low loss values usually lead to high accuracy ones. High accuracy values, in our context, mean that we are correctly classifying items. Even if we correctly classify an item, it does not directly mean that we have obtained a correct recommendation: this item may have already been voted by the user, so it cannot be recommended, or we may have reached our limit of N recommendations and cannot recommend the item. From each sample to predict (the set of votes from some user) we will get up to N recommended items: the ones with the highest output probabilities (right boxes in Figs. 7 and 9; equation (17) in the formalization section). From these recommended items we get the combined precision, recall and F1 quality values, using the classical information retrieval approach.

Comparative results of both proposed methods are shown in Fig. 11; we can see a better performance of the short method when N is low, whereas for a high number of recommendations the long method quality is somewhat better. This makes sense, since the higher the N, the more information is needed. The key issue here is that the quality difference between both approaches is not proportionally significant, whereas the deep neural network size needed to implement the long method is substantially bigger than the one needed to implement the short method (Figs. 7 and 9). Accordingly, we select the proposed short method as the preferred one, and we will use it in the rest of this paper's experiments.

B. Comparison with the NCF Baseline

Experiments make use of the state-of-the-art baseline NCF [33]. NCF (Neural Collaborative Filtering) is a neural architecture that can express and generalize Matrix Factorization. Its design is shown in Fig. 12. NCF has been tested against several current baselines: BPR [16] and eALS [17], which are optimizations of the Matrix Factorization model, and ItemKNN [40], the standard item-based KNN. The chosen NCF baseline can be classified as neural collaborative filtering [18], and it provides accuracy similar to DeepFM [34]; NCF integrates matrix factorization and Multilayer Perceptrons (MLP). Its main advantages are: a) its performance, b) it avoids tedious feature engineering, and c) it does not need information additional to the CF dataset ratings. NCF is considered a robust and accurate deep learning design for CF RS. The main current alternatives to NCF are: a) the autoencoder-based ones, such as CDL [41], and b) the deep hybrid models for recommendation [42] [43] [11]. Autoencoder-based approaches avoid the non-neural network stage (MF) used to obtain features from data; they increase the neural network complexity without reaching a remarkable accuracy rise. Deep hybrid models are based on data additional to the dataset ratings: demographic, social, context-aware, content-based information, etc. They do not provide a universal method for CF RS. Once the baseline choice has been motivated, we test our NCCF short method on both the MovieLens and FilmTrust datasets. Fig. 13 shows the MovieLens comparative results for classification quality measures in both the baseline (NCF) and the proposed deep neural model (NCCF).
Overall, Fig. 13 shows better precision results and worse recall results than the baseline. The capability of the proposed NCCF to maintain the precision accuracy level when a high number of recommendations is chosen is remarkable. This trend is consolidated in the recall results. The FilmTrust results (Fig. 14) point in the same direction: NCCF improves the baseline recall when the number of recommendations exceeds a threshold, although in this case precision is similar in both the baseline and the proposed method. The proposed method works particularly well when the number of recommendations is not low. This is due to its original prediction schema: instead of making a prediction for each RS item, the proposed NCCF model generates a set of probabilities in the categorical output layer of the deep architecture. The result is a more balanced distribution of predictions along the RS set of items.

Fig. 13. MovieLens: recommendation quality results, proposed NCCF versus the NCF baseline. NCF [33] outperforms the state-of-the-art baselines BPR [16], eALS [17] and ItemKNN [40], and it provides accuracy similar to DeepFM [34]. y-axis: precision & recall; x-axis: number of recommendations N.

Fig. 14. FilmTrust: recommendation quality results, proposed NCCF versus the NCF baseline. NCF [33] outperforms the state-of-the-art baselines BPR [16], eALS [17] and ItemKNN [40], and it provides accuracy similar to DeepFM [34]. y-axis: precision & recall; x-axis: number of recommendations N.
We can conclude that the proposed NCCF method reaches the current state-of-the-art neural collaborative filtering approaches, and it also improves recommendation results, particularly when the number of recommendations is not low. The NCCF method strengths are: 1) it is a scalable, fully deep neural network approach; 2) the obtained recommendation accuracy is comparable to state-of-the-art approaches, and it improves results when the number of recommendations is not low; and 3) the size and complexity of the architecture are smaller than those of the autoencoder-based ones.

IV. Conclusions
A scalable and original classification-based deep neural architecture has been proposed in this paper. It is based on the concept that reducing the rating dataset information will not significantly change accuracy. We renounce the numerical rating values and transform them into binary information: relevant/non-relevant vote and voted/non-voted item. This information reduction leads to a new abstraction level, and it opens a new range of classification-based architectures. Two related architectures are proposed in the paper: the short one and the long one. The long architecture makes use of both the relevant/non-relevant and the voted/non-voted binary information, whereas the short architecture only uses the relevant/non-relevant subset of information. Experiments show that the short architecture achieves performance similar to the long one. Even though we lose a portion of accuracy, the short architecture is chosen due to its better scalability. The proposed NCCF model has been tested on the MovieLens and FilmTrust datasets, and the results have been compared with the state-of-the-art baseline NCF. NCF is a powerful baseline, since it outperforms KNN, MF, BPR and eALS, and it provides accuracy similar to DeepFM. Results show that our scalable approach provides a competitive recommendation quality. It particularly outperforms the baseline when the number of recommendations is not low. This result is consistent with the way predictions are made: to make recommendations to a user, instead of making a prediction for each RS item, the proposed NCCF model generates a set of probabilities in the categorical output layer of the deep architecture. This is an original approach, and it obtains a more balanced distribution of predictions along the RS set of items.
The novelty of the proposed method opens a wide range of future works, among which are: a) to design an NCCF ensemble, inspired by the random forest method, where the input data that feeds each NCCF is a random subset of the RS items; b) to make use of data augmentation from the binary relevant/non-relevant information in order to enrich the set of input samples; c) to test the quality impact of establishing a multilabel target on the samples, instead of the proposed one-label approach; and d) to incorporate a wide (memorization) layer to create a wide & deep neural network architecture.