Sales Prediction through Neural Networks for a Small Dataset

Sales forecasting allows firms to plan their production outputs, which helps optimize inventory management and reduce costs. However, not all firms have the same capacity to store all the necessary information through time. Thus, short time series are common within industries, and problems arise because a short time series does not fully capture sales behavior. In this paper, we show the applicability of neural networks in a case where a company reports a short time series given the changes in its warehouse structure. Given the neural networks' independence from statistical assumptions


I. Introduction
Sales forecasting has positive implications within enterprises. It can improve the planning of production processes as well as inventory management practices [1]. Specifically, the fast growth of enterprises and their incursion into new markets make inventory optimization necessary. Moreover, because sales are closely related to clients' demands, which depend on factors that enterprises cannot control, knowing the customers' behavior may reduce the enterprises' operational costs [2].
It is important to recall that predicting future demand is a complex problem since it relies on determining a concrete definition of what demand is [3]. Note that order times, preferences and previous sales affect the current demand that a company must satisfy or supply. Thus, forecasting accuracy depends on the information gathered by the enterprise [4]. In this sense, the literature recognizes that previous sales are information that companies register to be used as starting points. Moreover, sales do not only summarize the quantity sold, but they also include information that reflects changes in consumers' behavior concerning the selling prices. Hence, sales represent a good proxy to understand the demand trends that a company faces [5].
Regarding the quantity of data required to estimate demand trends, there is an active discussion in the literature focused on determining the optimal size of a dataset for demand forecasting [6]. In general, research studies on this topic agree that dataset size depends on the prediction model [7]. For example, ARIMA models need $p + q + P + Q + d + mD$ observations, where $p$ is the number of time lags, $q$ represents the model's order, $m$ is the total number of periods in each season, and $d$ is the differencing degree. Also, this model aims to capture seasonal effects, which is why it includes the seasonal autoregressive ($P$), moving average ($Q$), and differencing ($D$) terms. On the other hand, the literature has suggested the use of artificial neural networks (ANNs) as robust techniques to analyze small datasets because their performance is independent of the total number of input data [8]. Recently, it has become common to find applications of ANNs to forecasting problems in medicine, where small datasets prevail since patients and diseases' evolution change regularly [9]. This last factor has a direct impact on patients' medical visits, since patients tend not to show up because they "feel better". A prediction of such behavior is useful to reduce costs and improve the efficiency of medical services [10].
Given the favorable applications of ANNs to analyze non-linear behavior in datasets [1,9], our work focuses on applying this technique to predict the sales of a chemical company located in Mexico. In this case, the dataset consists of 12 monthly sales observations, as the company reports its sales on a monthly basis. 70.0% of this data was used for training and 30.0% for initial validation of the ANN designed for this work.
This ANN consisted of a multi-layer perceptron with a sigmoid activation function. The ANN was implemented in R v.3.2.1. We find that a small dataset does not significantly increase the number of iterations, even when we consider the maximum possible learning rate. However, as expected from the classical time series literature, the prediction presents a higher error, which we demonstrate in the validation phase.
with the ANN, highlighting interesting insights about the training, validation and prediction processes. To motivate the use of ANNs, even with small databases, we compare the ANN's performance with that of the moving average technique. Finally, Section V presents our conclusions.

II. Literature Review
Given that time series are central to making predictions, there is a considerable literature that analyzes their characteristics and the methodologies to obtain the desired results [11]. This paper does not intend to be an exhaustive exposition of the time-series forecasting literature. However, we present some of the most significant contributions in the application of time series for demand forecasting.

A. Sales Forecasting and Industry
Industries, in general, are concerned with demand forecasting since their profits depend on consumers' behavior. Hence, forecasting studies contribute to diminishing uncertainty about future sales, which implies better production planning and inventory cost reduction [12].
Although future sales have a high degree of uncertainty, there exist different approaches to approximate consumers' behavior in future periods. The literature on this problem considers different approaches and methodologies, which rely on the analysis of time series that summarize how the demand evolves through time together with its related variables [3].
It is important to recall that demand and sales have a close relationship since the latter summarize consumers' behavior. Thus, sales work as a good indicator of consumers' preferences even though they are subjective [13]. For example, demand forecasting uses variables such as sales, expenses, and profits [14]. However, there exist other studies where demand is forecasted through different variables that depend on the market where companies and products stand. For example, Rodriguez et al. [18] built an index for water demand that summarized water consumption, land use, average incomes, and population. In the case of electricity demand, it is common to consider income, prices of related fuels, community, and technology [15]. All the previous applications used time series to determine future events.

B. Classical Time-Series Forecasting Literature
Typically, time-series forecasting relies on the application of two methods: Holt-Winters and Box-Jenkins [16]. For time series with linear behavior, these techniques generate a near-exact approximation of future demands. Thus, we can find applications of these methods in forecasting exchange rates [17], demand flowmeters [14], aircraft failure rates [18], general prices [19], and transport demand [20], among others.
It is important to mention that not all time series present a linear behavior. In a non-linear situation, the Holt-Winters and Box-Jenkins methods fail to provide a good forecast of future events. Specifically, linear models cannot capture and analyze amplitude dependence, volatility clustering, and asymmetry, which prevail in financial time series [21]. In these cases, it is appealing to assume different states [22], or to consider situations with the minimum number of statistical assumptions [23].
The analysis of non-linear time series requires the construction of models suitable for each phenomenon, which aim to capture stochastic processes. Hence, Campbell, Lo and MacKinlay [24] showed that it is possible to classify most non-linear time series according to their deviations from their mean, the non-linear moving average model [25], or their variance (for example, the ARCH model and its variations like GARCH and EGARCH [26]).

C. Neural Networks and Sales Forecasting
Classical non-linear time series models have a significant dependence on statistical assumptions, which diminishes their capacity to predict a behavior that was not previously registered. To deal with this problem, Artificial Neural Networks (ANNs) appear as a suitable alternative. These techniques try to mimic the behavior of neural cells, which makes them non-linear models by nature [27]. Because ANNs are data-driven, they have a self-adaptive feature. This adaptation capacity makes them suitable to analyze problems where data is not complete, or when its documentation was not appropriately gathered (i.e., data with noise [28]).
Given that ANNs allow the approximation of almost all non-linear continuous functions, they are suitable for sales forecasting [29]. However, in the classical literature, there is not a unique approach to sales forecasting because ANNs may have different structures according to the problem. For example, electricity demand forecasting considers a hybrid structure since data may present seasonal effects. An et al. [15] propose a two-step architecture, where seasonal effects are reduced in the first step through an empirical mode decomposition; and later, a multi-output feed-forward ANN is used to get the demand forecast for the following H periods given the demand of the previous T periods. Unlike electricity demand, water consumption follows an ANN with stochastic connections since noisy data prevail in such situations [14]. In both cases, the ANNs' architecture requires a long time series.
Graphically, an ANN is a graph with neurons (nodes) and their synapses (edges). These models may learn and identify consumers' behavior patterns according to the input data and the architecture of the ANN. The structure of an ANN summarizes the interconnection between neurons, which are organized in layers [30]. The simplest architecture considers the input layer, the output layer, and the hidden layer, whose objective is the identification and communication of patterns to obtain the demand forecast. Fig. 1 shows an example of an ANN with three layers (input, hidden and output layer). The hidden layer serves to adjust the importance of the input data in the determination of future behavior. Therefore, it is possible to add as many hidden layers as required [27]. Although no theoretical tool indicates the optimal number of hidden layers, experiments demonstrate that more than one hidden layer does not significantly improve the prediction's quality. Also, there is a positive relationship between the number of hidden layers and the computing time. Thus, one hidden layer is enough to get a proper prediction, mainly when the database is small [30].

III. Designing a Neural Network for Forecasting
We analyze the case of a chemical company located in Mexico. This firm owns nine central warehouses across the Mexican territory, which are used to fulfill the requisitions of its clients. Because these distribution centers may generate inventory costs to the detriment of the company's profits, the company is interested in whether sales justify the construction of additional warehouses. This motivates the prediction of future sales through ANNs.

A. Data Selection
The chemical company builds central warehouses at different locations, which implies the absence of a long time series that includes the sales of all central warehouses. Thus, we considered a time series from 2015 to 2016 that consists of the sales of the current distribution centers.

B. Data Collection
The chemical enterprise provided a database with the sales that each of its central warehouses must fulfill. This dataset included information related to products, clients and transportation costs. However, due to confidentiality restrictions, we considered only the sales information of twelve months from 2015 to 2016, so the time series shows the company's sales for one year.

C. Choosing the Neural Network
The forecasting literature that applies ANNs agrees on the use of the multi-layer perceptron (MLP). This ANN is characterized by a feedforward architecture where unidirectional weighted edges connect neurons. Since we have a total of 12 data points, we considered an MLP with three neurons in its hidden layer. The MLP's mathematical expression combines the connection weights between consecutive layers with an activation function that defines the output of a neuron.
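The forward computation of such a perceptron can be illustrated with a short Python sketch. This is not the authors' R implementation: the function names, the absence of bias terms, and the weight and input values are all assumptions made for illustration, and the sigmoid is also applied at the output neuron.

```python
import math


def sigmoid(x):
    """Logistic sigmoid, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def mlp_forward(inputs, w_hidden, w_output):
    """Forward pass of a perceptron with one hidden layer and a single output.

    Assumption: no bias terms. w_hidden[j][i] weights input i into hidden
    neuron j; w_output[j] weights hidden neuron j into the output neuron.
    """
    hidden = [sigmoid(sum(w * x for w, x in zip(row, inputs))) for row in w_hidden]
    return sigmoid(sum(w * h for w, h in zip(w_output, hidden)))


# Hypothetical weights and normalized sales, for illustration only.
w_hidden = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.4], [0.5, 0.5, -0.2]]
w_output = [0.3, -0.2, 0.6]
prediction = mlp_forward([0.6, 0.7, 0.8], w_hidden, w_output)
```

With three inputs and three hidden neurons, this layout has nine input-to-hidden weights and three hidden-to-output weights; training consists of adjusting them so that `prediction` approaches the observed value.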
There are different activation functions, and their use depends on the problem's features. Such functions may produce an activation within a discrete range, like the set {0, 1} that represents "yes" or "no," or a continuous range like ℝ. In the case of forecasting, Thomonopoulus [2] suggested that continuous ranges are more desirable because they do not restrict neurons' behavior to a finite number of results. Also, the sigmoid function was suggested as beneficial for forecasting because it allows any real number as an input, and does not present convergence problems since it is bounded between 0 and 1. Thus, in this paper, we consider the sigmoid function as the activation function.
For our sales prediction problem, the ANN computes the company's sales during a given period. To generate the prediction, the perceptron that we consider uses the sales from the three previous periods. In other words, we "learn" from months one, two and three to "predict" the fourth month. Then, the ANN computes the weights that minimize the error, and such values are saved. In the next iteration, the ANN "learns" from months two, three and four to "predict" the fifth month's sales. So, we proceed in the following way:
ANN_0: Input (1,2,3) → output (4)
ANN_1: Input (2,3,4) → output (5)
ANN_2: Input (3,4,5) → output (6)
ANN_3: Input (4,5,6) → output (7)
ANN_4: Input (5,6,7) → output (8)
ANN_5: Input (6,7,8) → output (9)
We recall that the previous architecture produces a single output, and the weights are updated every time a fourth value is predicted using the three previous weights. That is to say, we choose a recurrent architecture for the neural network [30].
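The windowing scheme above can be reproduced with a few lines of Python. This is only a sketch: `sliding_windows` is a hypothetical helper, and month indices stand in for the sales figures, which are not reported in the text.

```python
def sliding_windows(series, width=3):
    """Return (inputs, target) pairs: `width` consecutive values and the next one."""
    return [
        (tuple(series[i:i + width]), series[i + width])
        for i in range(len(series) - width)
    ]


# Month indices stand in for the (confidential) sales figures.
training_months = list(range(1, 10))  # months 1 to 9
pairs = sliding_windows(training_months)
for k, (inputs, target) in enumerate(pairs):
    print(f"ANN_{k}: Input {inputs} -> output ({target})")
```

Nine training months with a window of three yield six input-output pairs, which matches the steps ANN_0 to ANN_5.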

IV. Neural Network Implementation
Many software packages are capable of programming and implementing ANNs. We chose R to implement the ANN given the possibility of changing the ANN parameters. The construction of an ANN requires computing the weights and validating them. To do this, we divide the database into two sets: one for training (or learning) and one for validation. Following the guidelines of [31,32], we determined the size of both sets: the training set contains nine data points, and the validation set contains three. These correspond to 75.0% and 25.0% of the entire database, respectively.

A. Training
The self-adaptation property allows ANNs to adjust future behavior given previous behavior. Thus, in the training phase, the ANN learned the time series behavior by computing how strong the connections between neurons must be. Below, we present the code in R language.

## DEFINING THE NEURAL NETWORK AS

About the previous code, it is important to remark the following:
• First, we know that the range of the sigmoid function is the interval [0,1]. Thus, this function outputs values between zero and one. Since the time series registers the total sales, we need to normalize the values of our database. So, we map the sales of each period to normalized sales within this interval.
• Second, the training part is a supervised learning process. This means that the ANN adjusts the weights between the neurons to learn the behavior or pattern of the time series. Then, the ANN computes the weights that minimize the error between its output and the observed value. In this exercise, we tolerate an error of less than 0.001.
• Third, programming the ANN in R allows setting the learning rate, which indicates how fast the ANN adapts to new behavior presented in the time series. Given that we have a small dataset, we compared different learning rates to assess the ANN's performance.
Fig. 2. Comparison between learning rates and iterations.
Fig. 2 shows the relationship between the error and the number of iterations when we consider the following learning rates: 0.2, 0.5 and 1.0. In this situation, the ANN tries to learn the real values 0.774 and 0.856. Note that the highest learning rate implies a complete understanding of the time series behavior, but it requires the largest number of iterations. However, despite the database's size, the number of iterations is not unusual in the literature [27].
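The normalization step mentioned in the first remark can be sketched in Python as follows. The exact transformation is not recoverable from the text, so min-max scaling is an assumption here, as are the helper name `fit_min_max` and the sales figures.

```python
def fit_min_max(values):
    """Return (normalize, denormalize) functions for min-max scaling to [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("min-max scaling needs at least two distinct values")
    span = hi - lo

    def normalize(x):
        return (x - lo) / span

    def denormalize(z):
        return lo + z * span

    return normalize, denormalize


# Hypothetical monthly sales; the company's figures are confidential.
sales = [120.0, 150.0, 135.0, 180.0]
normalize, denormalize = fit_min_max(sales)
scaled = [normalize(x) for x in sales]
```

Under this assumption the largest sale maps to 1.0, and `denormalize` converts a normalized prediction back to sales units.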
In general, the ANN replicates the previous behavior when we change the real values. For example, Fig. 3 shows the relationship between the three learning rates when the ANN learns the largest possible sale, which is equal to 1.0. Although the number of iterations increases while the ANN tries to reduce the error, this number does not increase excessively. Figs. 2 and 3 illustrate how fast the ANN reduces the approximation error when the learning rate changes. In all cases, the small dataset does not induce a convergence-time problem, since the number of iterations is relatively small. Also, as the literature suggests, when we impose a learning rate equal to 1.0, the number of iterations increases. It is interesting to note that a learning rate equal to 0.5 is faster than a learning rate equal to 0.2. However, computation time is not an additional restriction for our work: R takes 27.82 seconds to compute all weights.
Given the limited information, we consider it appropriate to choose a learning rate equal to 0.2. This choice comes from the literature, because an ANN based on a small dataset requires slackness in its learning process [8].
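To make the roles of the learning rate and the 0.001 tolerance concrete, the following Python sketch trains a small perceptron on a single window by plain gradient descent. It is not the R code used in the paper: the function `train_window`, the bias-free 3-3-1 layout, the squared-error update, the initial weight range, the seed, and the iteration cap are assumptions. The inputs 0.60 and 0.70 are hypothetical, and using 0.774 as the last input and 0.856 as the target is likewise an assumption about the values quoted for Fig. 2.

```python
import math
import random


def sigmoid(x):
    """Logistic sigmoid, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def train_window(inputs, target, rate, tol=0.001, max_iter=100_000, seed=0):
    """Fit a bias-free 3-3-1 sigmoid perceptron to one (inputs, target) pair.

    Uses gradient descent on the squared error and returns the weights plus
    the number of iterations needed to bring |target - output| below `tol`.
    """
    rng = random.Random(seed)
    w_hid = [[rng.uniform(-0.5, 0.5) for _ in inputs] for _ in range(3)]
    w_out = [rng.uniform(-0.5, 0.5) for _ in range(3)]
    for iteration in range(1, max_iter + 1):
        hidden = [sigmoid(sum(w * x for w, x in zip(row, inputs))) for row in w_hid]
        output = sigmoid(sum(w * h for w, h in zip(w_out, hidden)))
        error = target - output
        if abs(error) < tol:
            return w_hid, w_out, iteration
        # Backpropagation: delta terms for the output and the hidden neurons.
        delta_out = error * output * (1.0 - output)
        delta_hid = [delta_out * w_out[j] * hidden[j] * (1.0 - hidden[j]) for j in range(3)]
        for j in range(3):
            w_out[j] += rate * delta_out * hidden[j]
            for i, x in enumerate(inputs):
                w_hid[j][i] += rate * delta_hid[j] * x
    raise RuntimeError("tolerance not reached within max_iter iterations")


window, target = [0.60, 0.70, 0.774], 0.856
for rate in (0.2, 0.5, 1.0):
    _, _, iterations = train_window(window, target, rate)
    print(f"learning rate {rate}: {iterations} iterations")
```

The iteration counts printed by this sketch depend on the random initial weights and on the update rule, so they are not expected to reproduce the curves in Figs. 2 and 3.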

B. Validation
For the validation process, we use the nine weights produced by the algorithm during the previous step. Recall that the fifth iteration summarizes the knowledge required by the ANN to forecast. Then, we have that
Input (7,8,9) → ANN_5 → Output (10)
Input (8,9,10) → ANN_5 → Output (11)
Input (9,10,11) → ANN_5 → Output (12)
Table I presents the weights for each value in the training set and the number of iterations that R took to compute them. To proceed with the validation phase, we use the sales for months 10, 11 and 12. By the ANN architecture, we follow an iterative process where the weights of the three previous periods were used. In this case, the process was unsupervised, i.e., we let the ANN adjust the weights freely to obtain the last three known values. Table II illustrates the weights that we use in the validation process. During the validation process, we find that the ANN does not predict the time series with 100% accuracy. However, the error is not higher than 10%, as Table III shows with the comparison between the real values and the ones produced by the ANN.

C. Predictions
The ANN does not entirely replicate the time series behavior, which is due to the database size. We then use the weights in Table II to get the predictions for the next three months. The ANN architecture assumes that the first projection, the sales of month 13, is real in order to compute the prediction of month 14. We continue iteratively to get the prediction for month 15. Thus, the algorithm performs the following computations:
Input (10,11,12) → ANN_5 → Output (13)
Input (11,12,Output (13)) → ANN_5 → Output (14)
Input (12,Output (13),Output (14)) → ANN_5 → Output (15)
Following the previous process, Table IV shows the weights that the ANN requires to compute the sales prediction for the next three months. Using the information in Table IV, we proceed to compute the sales that the chemical company may face in the first three months of the next year. In Table V we present these results in normalized form together with their corresponding transformation. Finally, Fig. 4 shows the relation between the three phases: training, validation, and prediction. In this graph, we compare the predicted values with the real values, and we can see the trend for future sales. As we mentioned before, during the supervised phase, the ANN fully captures the previous behavior of the time series, while in the validation phase it has problems predicting the other values with 100% accuracy. However, the distance between both lines is not significant (less than 5%).
Also, it is interesting to note that the next three months present a decreasing behavior, which does not match the behavior of the corresponding months in the database. This last point may be related to the fact that we treat the 13th month as real information to obtain months 14 and 15.
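The iterative scheme for months 13 to 15 can be sketched in Python as follows. The function `recursive_forecast` and the stand-in predictor `window_mean` are hypothetical; in the paper the predictor is the trained ANN, and the input values below are illustrative only.

```python
def recursive_forecast(history, predict, steps=3, width=3):
    """Forecast `steps` periods ahead, feeding each prediction back as an input."""
    values = list(history)
    forecasts = []
    for _ in range(steps):
        next_value = predict(values[-width:])
        forecasts.append(next_value)
        values.append(next_value)  # the forecast is treated as if it were observed
    return forecasts


def window_mean(window):
    """Stand-in predictor (mean of the window); the paper uses the trained ANN."""
    return sum(window) / len(window)


# Hypothetical values for months 10, 11 and 12.
forecasts = recursive_forecast([3.0, 6.0, 9.0], window_mean)
```

Because each forecast re-enters the input window, any error in the first prediction propagates to the following ones, which is the dependence noted above for months 14 and 15.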

D. Comparison with Moving Average
As we mentioned before, moving average models (ARMA, ARIMA, …) are commonly used to analyze non-linear series [10]. In this section, we compare the sales forecast that results from applying a simple moving average with the ANN's results. Since we have a small database, we forecast a month by using the following simple moving average (SMA) model

$$\hat{y}_t = \frac{y_{t-1} + y_{t-2} + y_{t-3}}{3},$$

where $y_t$ denotes the sales of month $t$ and $\hat{y}_t$ its forecast. In other words, this model approximates the sales of month $t$ using the mean of the sales of the three previous months. Considering the sales of the validation set (months 7, 8 and 9), we note that the SMA prediction reports an error greater than 5% for months 7 and 9 (see Table VI). In comparison, for the same set of months, the ANN provides predictions with an error of less than 5% (see Table III). Since the sales of the validation set are known, we can conclude that the ANN's predictions are more accurate than the ones from the SMA.
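A minimal Python sketch of the SMA forecast and of a percentage error is given below. The sales figures are hypothetical, and the absolute relative error is an assumed error measure, since the text does not state which one underlies Tables III and VI.

```python
def sma_forecast(series, t, width=3):
    """Forecast the value at 0-based index t as the mean of the `width` previous values."""
    if t < width:
        raise ValueError("not enough history for the requested index")
    return sum(series[t - width:t]) / width


def percent_error(actual, predicted):
    """Absolute relative error, in percent (an assumed error measure)."""
    return abs(actual - predicted) / abs(actual) * 100.0


# Hypothetical sales; the company's figures are confidential.
sales = [100.0, 110.0, 120.0, 125.0]
forecast = sma_forecast(sales, t=3)        # mean of the first three values
error = percent_error(sales[3], forecast)  # relative error for the fourth value
```

Applying `percent_error` to both the SMA forecasts and the ANN outputs for the same months gives a like-for-like comparison of the two methods.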

V. Conclusion
The present paper shows the flexibility of neural networks to deal with sales forecasting in the presence of a small database. As the literature suggests, the ANN does not fully capture the time series behavior, as we show during the validation phase. However, it is worth noting that the prediction error is not greater than 5%, which is accurate in sales forecasting instances [31].
Moreover, in our case study, the small dataset does not increase the convergence time even when we impose a high learning rate. Also, the ANN performs better than the SMA in terms of error.
Specifically, for months 7 and 9, whose sales are known, the SMA reports an error greater than 5% while the ANN reports an error of less than 5%. The previous facts support the use of this technique in situations where time series are short, or the database does not contain all the possible information.
Therefore, ANNs may be useful in situations where time series present a non-linear behavior and do not show an explicit statistical behavior, as is the case for short time series. However, in the present experiment, we observe the significant dependence that new predictions have when the first prediction is treated as real information. In future work, we will address this problem from theoretical and empirical perspectives.

Rosa María Cantón Croda
Ph.D. in Computer Science from the Tecnológico de Monterrey, Ciudad de México; Master in Information Technologies from the Tecnológico de Monterrey, Veracruz; graduated as Computer Science Administrator from the Tecnológico de Monterrey, Monterrey. She was Manager of Systems in transport and construction companies. She has more than twenty years of experience in the academic administration of Tecnológico de Monterrey and is currently Dean of postgraduate studies in Engineering and Business at UPAEP. She has written some articles for national and international congresses. She was certified in PMBook, in Positive Psychology by Tecmilenio, and in Project-Based Learning by Aalborg University in Denmark. Currently, her research interests are Big Data and Business Intelligence.

Damián Emilio Gibaja Romero
Ph.D. in Economics from El Colegio de México, Ciudad de México; Master in Economics from El Colegio de México, México. He has written some articles for national and international congresses. He made research visits to the Paris School of Economics and the University of Glasgow. Currently, his research interests are Game Theory and Mathematical Economics.

Santiago Omar Caballero Morales
Santiago Omar Caballero Morales is a professor-researcher in the Department of Logistics and Supply Chain Management at Universidad Popular Autónoma del Estado de Puebla (UPAEP) in Mexico. In 2009, he received a Ph.D. in Computer Science from the University of East Anglia in the United Kingdom. Since 2011 he has been a member of the National System of Researchers (SNI) in Mexico. His research interests are quality control, operations research, combinatorial optimization, pattern recognition, analysis and simulation of manufacturing processes, and human-robot interaction.