Final Paper: Sales Prediction Model for Big Mart
Abstract: Machine learning is a category of algorithms that allows software applications to become
more accurate at predicting outcomes without being explicitly programmed. The basic premise of
machine learning is to build models and employ algorithms that receive input data and use statistical
analysis to predict an output, updating outputs as new data becomes available. These models can be
applied in different areas and trained to match the expectations of management so that accurate steps can
be taken to achieve the organization's targets. In this paper, the case of Big Mart, a one-stop shopping
center, is discussed: the sales of different types of items are predicted and the effects of different factors
on those sales are analyzed. By examining various aspects of a dataset collected for Big Mart and
following a systematic methodology for building a predictive model, results with a high level of accuracy
are generated, and these observations can be employed to make decisions aimed at improving sales.
Keywords: Machine Learning, Sales Prediction, Big Mart, Random Forest, Linear Regression
I. Introduction
In today's modern world, large shopping centers such as big malls and marts record data
related to the sales of items or products, together with their various dependent and independent
factors, as an important step toward predicting future demand and managing inventory. The
resulting dataset, built from various dependent and independent variables, is a composite of item
attributes, data gathered from customers, and inventory-management data stored in a data
warehouse. The data is then refined in order to obtain accurate predictions and to uncover new
and interesting findings that shed new light on our knowledge of the task at hand. It can then be
used to forecast future sales by employing machine learning algorithms such as random forests
and simple or multiple linear regression models.
The volume of available data increases day by day, and such huge amounts of unprocessed data
need to be analyzed precisely, as careful analysis can yield highly informative and fine-grained
results that meet current standards. With the evolution of Artificial Intelligence (AI) over the past
two decades, Machine Learning (ML) has likewise evolved at a fast pace. ML is an important
mainstay of the IT sector and, with that, a rather central, albeit usually hidden, part of our lives [1].
As technology progresses, the ability to analyze and understand data to produce good results will
also improve, since data is highly useful in current applications. In machine learning, one deals
with both supervised and unsupervised types of tasks, and a classification-type problem generally
serves as a resource for knowledge discovery. Such a system generates resources and employs
regression to make precise predictions about the future, the main emphasis being on making the
system self-sufficient, able to perform the computations and analysis needed to generate accurate
and precise results [2]. By using statistical and probabilistic tools, data can be converted into
knowledge; statistical inference uses sampling distributions as a conceptual key [11].
ML can appear in many guises [3]. In this paper, firstly, various applications of ML and the types of
data they deal with are discussed. Next, the problem statement addressed through this work is
stated in a formalized way: "To find out what role certain properties of an item play, and how they
affect their sales, by understanding Big Mart sales." This is followed by an explanation of the
methodology used and the prediction results observed upon implementation. In order to help Big
Mart achieve this goal, a predictive model can be built to find, for every store, the key factors that
can increase sales and the changes that could be made to the product's or store's characteristics.
II. Methodology
The steps followed in this work, from dataset preparation to obtaining results, are
represented in Fig.1.
Big Mart's data scientists collected 2013 sales data for their 10 stores situated at different
locations, with each store carrying 1559 different products. Using all the observations, it is
inferred what role certain properties of an item play and how they affect its sales. The first rows
of the dataset, obtained by using the head() function on the dataset variable, are shown in Fig.2.
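As a minimal sketch of this first inspection step (the file name Train.csv and the use of pandas are assumptions, not taken from the paper):

```python
# Minimal sketch: load the Big Mart training data and inspect it.
# The file name "Train.csv" is a hypothetical placeholder.
import pandas as pd

train = pd.read_csv("Train.csv")
print(train.shape)   # the 2013 collection described above has 8523 rows
print(train.head())  # first five rows, as shown in Fig.2
```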
The raw data can contain various types of underlying patterns which give in-depth
knowledge about the subject of interest and provide insights about the problem. But caution
should be exercised, as the data may contain null values, redundant values, or various types of
ambiguity, which demands pre-processing. The dataset should therefore be explored as much as
possible.
Statistically important measures such as the mean, standard deviation, median, count of
values, and maximum value are shown in Fig.4 for the numerical variables of our dataset.
Pre-processing of this dataset includes analyzing the independent variables, checking each
column for null values, and then replacing or filling them with appropriate supported values,
so that analysis and model fitting are not hindered on their way to accuracy. The representations
above, obtained using Pandas tools, report the value counts for numerical columns and the
modal values for categorical columns. The maximum and minimum values in the numerical
columns, along with their percentile and median values, play an important role in deciding
which values to prioritize for further exploration and analysis. The data types of the different
columns are used later for label processing and the one-hot encoding scheme during model
building.
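A hedged sketch of this pre-processing, assuming the standard Big Mart column names (Item_Weight, Outlet_Size, Item_Fat_Content, and so on) and continuing from the loading sketch above:

```python
# Summary statistics (count, mean, std, min, percentiles, max) for the
# numerical columns, as reported in Fig.4.
print(train.describe())

# Fill null values: column mean for a numerical column, modal value for
# a categorical column. The choice of imputation values is an assumption.
train["Item_Weight"] = train["Item_Weight"].fillna(train["Item_Weight"].mean())
train["Outlet_Size"] = train["Outlet_Size"].fillna(train["Outlet_Size"].mode()[0])

# One-hot encode categorical columns for model building.
train = pd.get_dummies(
    train,
    columns=["Item_Fat_Content", "Outlet_Size",
             "Outlet_Location_Type", "Outlet_Type"],
)
```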
Scikit-Learn can be used to manage a machine-learning system end to end [12]. The algorithms
employed for predicting sales on this dataset are discussed as follows:
The random forest algorithm is a very accurate algorithm for predicting sales, and it is easy to
use and understand for machine learning tasks. It is used in sales prediction because it exposes
decision-tree-like hyperparameters, and its tree model is essentially the same as a decision tree.
Fig.5 shows the relation between decision trees and a random forest. To solve regression
prediction tasks with a random forest, the random forest regressor class of the sklearn.ensemble
library is used. A key role is played by the parameter termed n_estimators, which belongs to the
random forest regressor and sets the number of trees. A random forest can be described as a
meta-estimator that fits numerous decision trees on different sub-samples of the dataset and
averages their predictions. min_samples_split, when given as an integer, is the minimum number
of samples required to split an internal node. A split's quality is measured using mse (mean
squared error), a feature-selection criterion equivalent to reduction in variance; mae (mean
absolute error) is another available criterion. The maximum tree depth, max_depth, is given as an
integer; if it is left unset, nodes are expanded until all leaves are pure or until all leaves contain
fewer than min_samples_split samples.
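A minimal sketch of such a random forest regressor; the feature selection, the train/test split, and the hyperparameter values are illustrative assumptions, not the authors' exact configuration:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Keep only numeric columns as features for simplicity; the target is
# Item_Outlet_Sales (an assumed column name).
X = train.drop(columns=["Item_Outlet_Sales"]).select_dtypes(include="number")
y = train["Item_Outlet_Sales"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = RandomForestRegressor(
    n_estimators=100,      # number of trees in the ensemble
    min_samples_split=10,  # minimum samples required to split an internal node
    max_depth=8,           # maximum tree depth
    criterion="squared_error",  # MSE criterion ("mse" in older scikit-learn)
    random_state=42,
)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # R-squared on the held-out sample
```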
Y = β₀ + β₁X + ε  (1)
The equation shown in eq. (1) is used for simple linear regression. Its parameters are:
Y - the variable to be predicted
X - the variable(s) used for making the prediction
β₀ - the predicted value when X = 0, also referred to as the intercept term
β₁ - the change in Y for a one-unit change in X, also referred to as the slope term
ε - the difference between the predicted and actual values, i.e., the residual. However
efficiently a model is trained, tested, and validated, there is always a difference between the
actual and predicted values; this irreducible error means we cannot rely completely on the
results predicted by the learning algorithm. Alternative methods given by Dietterich can be used
for comparing learning algorithms [10].
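A corresponding sketch for fitting the linear regression of eq. (1) with scikit-learn, reusing the X_train and y_train splits from the random forest sketch above:

```python
from sklearn.linear_model import LinearRegression

lr = LinearRegression()
lr.fit(X_train, y_train)
print(lr.intercept_)  # estimate of the intercept term β₀
print(lr.coef_)       # estimated slope terms (β₁, ... for each predictor)
```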
The error measurement is an important metric in the estimation period. Root mean squared
error (RMSE) and mean absolute error (MAE) are generally used to measure accuracy for
continuous variables. Both MAE and RMSE express the average model prediction error in the
units of the variable of interest. MAE is the average, over the test sample, of the absolute
differences between prediction and actual observation, where all individual differences have
equal weight. RMSE is the square root of the average of the squared differences between
prediction and actual observation. RMSE is an absolute measure of fit, whereas R² is a relative
measure of fit. RMSE measures the variable's average error and is a quadratic scoring rule; low
RMSE values for linear or multiple regression correspond to a better model fit.
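In the standard formulation (stated here for completeness, not quoted from the paper), for n test observations with actual values yᵢ and predictions ŷᵢ:

MAE = (1/n) Σᵢ₌₁ⁿ |yᵢ − ŷᵢ|  (2)

RMSE = √( (1/n) Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)² )  (3)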
With respect to the results obtained in this work, it can be said that there is no big difference
between our train and test samples, since the RMSE values calculated on the two samples are
nearly equal. How accurately the model predicts responses can be inferred from RMSE, as it is
a good measure of precision along with other required capabilities. A considerable improvement
could be made by further data exploration incorporating outlier detection and high-leverage
points. Another approach, which is conceptually easier, is to combine several low-dimensional
sub-models that are easily verifiable by domain experts, i.e., ensemble learning can be
exploited [9].
Python is a general-purpose, interpreted, high-level language used extensively nowadays for
solving domain problems instead of dealing with the complexities of a system. It is often called
the 'batteries included' language because of its rich standard library, and its many scientific and
third-party libraries make problem solving efficient.
In this work, the Python libraries NumPy, for scientific computation, and Matplotlib, for 2D
plotting, have been used, along with the Pandas tool for data analysis. The random forest
regressor is used to solve the regression task by ensembling decision trees with the random
forest method. As a development platform, Jupyter Notebook has been used; it works well
thanks to its support for 'literate programming', in which human-friendly prose is interleaved
with code blocks.
Correlation is used to understand the relation between the target variable and the predictors. In
this work, Item_Outlet_Sales is the target variable, and its correlation with the other variables is
observed. Considering the case of item weight, the feature Item_Weight is shown in Fig.6 to
have a low correlation with the target variable Item_Outlet_Sales.
Fig. 6: Correlation between the target variable and the Item_Weight variable
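A brief sketch of the check behind Fig.6, assuming the column names Item_Weight and Item_Outlet_Sales and continuing from the sketches above:

```python
import matplotlib.pyplot as plt

# Pearson correlation between item weight and the target variable.
print(train["Item_Weight"].corr(train["Item_Outlet_Sales"]))

# Scatter plot corresponding to Fig.6.
plt.scatter(train["Item_Weight"], train["Item_Outlet_Sales"], alpha=0.3)
plt.xlabel("Item_Weight")
plt.ylabel("Item_Outlet_Sales")
plt.show()
```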
As can be seen from Fig.7, no significant relation is found between the year of store
establishment and item sales. The establishment years could also be combined into variables
that classify them into periods, which may give more meaningful results.
The placement of an item in a store, captured by the Item_Visibility variable, certainly affects
its sales. However, the plot and the correlation table generated previously show that the
correlation runs in the opposite direction; one reason might be that daily-use products do not
need high visibility. There is also the issue that some products have zero visibility, which is
quite impossible. Fig.8 shows the correlation between the Item_Visibility variable and the
target variable.
Fig. 8: Correlation between the target variable and the Item_Visibility variable
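One common way to handle the impossible zero-visibility values is to treat them as missing and impute the mean visibility of the same item; this imputation strategy is an assumption, not necessarily the authors' exact approach:

```python
import numpy as np

# Treat zero visibility as missing, then fill with the per-item mean
# computed over the remaining (nonzero) entries.
vis = train["Item_Visibility"].replace(0, np.nan)
item_mean = vis.groupby(train["Item_Identifier"]).transform("mean")
train["Item_Visibility"] = vis.fillna(item_mean)
```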
The frequency of each categorical or nominal variable plays a significant role in further analysis
of the dataset, supporting the data exploration to be performed. Fig.9 shows the various variables
in our dataset with their data types and categories. Here, the ID column and the source column,
which records whether a sample belongs to the test or train set, are excluded and not used.
Fig. 10: Flowchart for the division of the dataset on various factors (with proper leaves after pruning)
Fig.10 presents a flowchart in which the dataset has been divided on the basis of various
factors. In the last stage of the flowchart, the nodes labeled 'a' and 'b' represent string values
used to distinguish dataset items, and 'num' can be any arbitrary number. The dataset has been
divided, and pruning performed, on the basis of the different factors. Ensembling many such
decision trees yields a random forest model.
Fig. 11: Diagram showing the correlation among different factors
In Fig.11, the correlation among the various dependent and independent variables is explored in
order to decide on the further steps to be taken. The variables used are obtained after data
pre-processing, and the following are some important observations about them:
The R-squared value is observed to be 0.563 for our dependent variable over the 8523
observations taken into consideration. This signifies how accurately the built regression model
fits.
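A sketch of how such fit statistics can be computed for the fitted model, reusing lr, X_test, and y_test from the sketches above (the reported 0.563 comes from the paper, not from this code):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_pred = lr.predict(X_test)
print("R-squared:", r2_score(y_test, y_pred))
print("RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))
```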
III. Prediction Results and Conclusion
• The largest location did not produce the highest sales. The location that produced the highest
sales was OUT027, a Supermarket Type3 outlet whose size is recorded as medium in our
dataset. This outlet performed much better than any other outlet location of any size in the
considered dataset.
• The median of the target variable Item_Outlet_Sales was calculated to be 3364.95 for the
OUT027 location. The location with the second-highest median (OUT035) had a median value
of 2109.25 (see the sketch after this list).
• The adjusted R-squared and R-squared values are higher than average for the linear regression
model. Therefore, the used model fits better and exhibits accuracy.
• Moreover, the accuracy and score of the regression model can reach nearly 61% if it is built
with more hypothesis consideration and analysis, as shown by the code snippet in Fig.13.
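A sketch of the per-outlet median computation behind the first two observations, assuming the column names Outlet_Identifier and Item_Outlet_Sales:

```python
# Median sales per outlet, highest first; per the results above, OUT027
# is expected to lead with a median of about 3364.95.
medians = train.groupby("Outlet_Identifier")["Item_Outlet_Sales"].median()
print(medians.sort_values(ascending=False).head())
```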
It can be concluded that more locations should be switched or shifted to Supermarket Type3 to
increase the sales of products at Big Mart. Any one-stop shopping center like Big Mart can
benefit from this model by being able to predict its items' future sales at different locations.
Multiple instances, parameters, and various factors can be used to make this sales prediction
more innovative and successful. Accuracy, which plays a key role in prediction-based systems,
can be significantly increased as the number of parameters used increases. A closer look at how
the sub-models work could also increase the productivity of the system. The project can further
be integrated into a web-based application, or into any device with built-in intelligence by virtue
of the Internet of Things (IoT), to make it more convenient to use. The various stakeholders
concerned with sales information can also provide more inputs to help in hypothesis generation,
and more instances can be taken into consideration so that more precise results, closer to
real-world situations, are generated. When combined with effective data-mining methods and
properties, traditional approaches could have a greater positive effect on the overall development
of a corporation's tasks as a whole. One of the main highlights is more expressive regression
output, which is more understandable within some bound of accuracy. Moreover, the flexibility
of the proposed approach can be increased with variants at an appropriate stage of regression
model building. Further experiments with proper measurements of both accuracy and resource
efficiency are needed to assess and optimize the model correctly.
References
[1] Smola, A., & Vishwanathan, S. V. N. (2008). Introduction to Machine Learning. Cambridge
University Press.
[2] Saltz, J. S., & Stanton, J. M. (2017). An Introduction to Data Science. Sage Publications.
[3] Shashua, A. (2009). Introduction to machine learning: Class notes 67577. arXiv preprint
arXiv:0904.3664.
[4] MacKay, D. J. C. (2003). Information Theory, Inference and Learning Algorithms. Cambridge
University Press.
[5] Daumé III, H. (2012). A Course in Machine Learning. ciml.info.
[7] Cerrada, M., & Aguilar, J. (2008). Reinforcement learning in system identification. In Reinforcement
Learning. IntechOpen.
[8] Welling, M. (2011). A First Encounter with Machine Learning. Irvine, CA: University of
California.
[9] Michie, D., Spiegelhalter, D. J., & Taylor, C. C. (Eds.). (1994). Machine Learning, Neural and
Statistical Classification. Ellis Horwood.
[10] Mitchell, T. M. (1999). Machine learning and data mining. Communications of the ACM, 42(11),
30-36.
[12] Géron, A. (2019). Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow:
Concepts, Tools, and Techniques to Build Intelligent Systems. O'Reilly Media.