ICAICT 2016 Paper 26
Abstract
Human capital is a major concern for company management, whose main interest is in hiring highly qualified personnel who are
expected to perform well. Predictive analytics is an advanced branch of data engineering. Generally, these analytics
predict some occurrence or probability based on data. Predicting future occurrences or events involves an analysis of
historical data. The main objective is to produce the performance appraisal report of an employee using predictive analytics. Performance is
found by testing the attributes of an employee against the rules generated by a decision tree classifier. This paper concentrates on collecting
data about employees through a user interface, generating a decision tree from the historical data, and testing the decision tree with the attributes of
an employee. In this paper, predictive analytics techniques are used to predict the performance of employees. With a recent prediction
algorithm, we aim to predict employees’ performance more efficiently than the existing system. We define the performance of a frontline
employee as his/her productivity compared with his/her past performance. In working on performance, many attributes were tested,
and some of them were found to be effective for performance prediction. In general, performance is measured by the units produced
by the employee in his/her job within a given period of time. This paper considers building two or three
prediction algorithms for predicting employees’ performance and picking the one best suited to the specific organization.
I. Introduction
Business organizations are keen to establish plans for correctly selecting suitable employees. After hiring employees,
management becomes concerned about their performance and builds evaluation systems in an
attempt to retain the good performers. Data mining techniques are analytical tools that can be used to extract
meaningful knowledge from large data sets. Data mining is a powerful technology with great potential in information systems.
It can be best defined as the automated process of extracting useful, previously unknown knowledge and information, including patterns, associations,
changes, trends, anomalies and significant structures, from large or complex data sets.
Data mining consists of a set of techniques that can be used to extract relevant and interesting knowledge from data. Data
mining has several tasks, such as association rule mining, classification and prediction, and clustering. Predictive analytics is an
advanced branch of data engineering. It predicts some occurrence or probability based on data. Predictive analytics
encompasses a variety of statistical techniques from predictive modeling, machine learning and data mining. It analyses
current and historical facts to make predictions about the future. Classification techniques are supervised learning techniques that
assign each data item to a predefined class label. Classification is one of the most useful techniques in data mining for building models
from an input data set. Commonly used classification techniques build models that are used to predict future data trends.
A. Classification
Classification involves finding rules that partition the data into separate groups. The input for classification is the
training data set, whose class labels are already known. Classification explores the training data set, constructs a model based on the
class label, and aims to assign a class label to future unlabeled records. Since the class field is known in advance, this type of
classification is known as supervised learning. There are several classification models, such as decision trees, genetic algorithms,
statistical models and so on.
B. Regression
Regression is a statistical process for estimating the relationships among variables. It includes many techniques for modelling
and analyzing several variables, when the focus is on the relationship between a dependent variable and one or more independent
variables (or ‘predictors’).
[Table residue: percentage figures recovered for C4.5 (84.79%, 41.47%) and Bagging (75.57%, 45.62%); the column headings were lost.]
• Make the rule assign that class to this value of the predictor
• Calculate the total error of the rules of each predictor
• Choose the predictor with the smallest error
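The rule-building steps listed above can be sketched in a few lines of Python. This is a minimal illustration of the One R idea, not the WEKA implementation: the function name `one_r`, the attribute names and the six example rows are all hypothetical, and the "most frequent class per predictor value" step is the standard One R step that precedes the bullets above.

```python
from collections import Counter, defaultdict


def one_r(rows, target):
    """Return (best_predictor, rules, total_errors) for a list of dict rows."""
    best = None
    predictors = [name for name in rows[0] if name != target]
    for predictor in predictors:
        # Count how often each class occurs for each value of this predictor.
        counts = defaultdict(Counter)
        for row in rows:
            counts[row[predictor]][row[target]] += 1
        # Make the rule assign the most frequent class to each predictor value.
        rules = {value: classes.most_common(1)[0][0]
                 for value, classes in counts.items()}
        # Total error: instances whose class differs from the rule's class.
        errors = sum(sum(classes.values()) - classes[rules[value]]
                     for value, classes in counts.items())
        # Keep the predictor with the smallest total error.
        if best is None or errors < best[2]:
            best = (predictor, rules, errors)
    return best


# Hypothetical toy data, not taken from the study's data set.
rows = [
    {"shift": "day", "gender": "male", "performance": "good"},
    {"shift": "day", "gender": "female", "performance": "good"},
    {"shift": "night", "gender": "male", "performance": "poor"},
    {"shift": "night", "gender": "female", "performance": "poor"},
    {"shift": "day", "gender": "male", "performance": "good"},
    {"shift": "night", "gender": "female", "performance": "good"},
]
predictor, rules, errors = one_r(rows, "performance")
print(predictor, rules, errors)
```

On this toy data the rules built on `shift` misclassify one instance while those built on `gender` misclassify two, so `shift` is chosen.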
Elastic Net
In statistics and, in particular, in the fitting of linear or logistic regression models, the elastic net is a regularized
regression method that linearly combines the L1 and L2 penalties of the lasso and ridge methods. The elastic net method
overcomes the limitations of the LASSO (least absolute shrinkage and selection operator) method, which uses a penalty function
based on
$$\|\beta\|_1 = \sum_{j=1}^{p} |\beta_j|.$$
Use of this penalty function has several limitations. For example, in the "large p, small n" case (high-dimensional data
with few examples), the LASSO selects at most n variables before it saturates.
Also, if there is a group of highly correlated variables, then the LASSO tends to select one variable from the group and
ignore the others. To overcome these limitations, the elastic net adds a quadratic part to the penalty ($\|\beta\|^2$), which when used
alone is ridge regression. The estimates from the elastic net method are defined by
$$\hat{\beta} = \underset{\beta}{\operatorname{argmin}} \left( \|y - X\beta\|^2 + \lambda_2 \|\beta\|^2 + \lambda_1 \|\beta\|_1 \right).$$
The quadratic penalty term makes the loss function strictly convex, and it therefore has a unique minimum. The elastic net
method includes the LASSO and ridge regression: in other words, each of them is a special case where $\lambda_1 = \lambda, \lambda_2 = 0$ or $\lambda_1 = 0, \lambda_2 = \lambda$.
Meanwhile, the naive version of the elastic net method finds an estimator in a two-stage procedure: first, for each fixed $\lambda_2$, it finds the
ridge regression coefficients, and then it does a LASSO-type shrinkage. This kind of estimation incurs a double amount of
shrinkage, which leads to increased bias and poor predictions. To improve the prediction performance, the authors of the method rescale the
coefficients of the naive version of elastic net by multiplying the estimated coefficients by $(1 + \lambda_2)$.
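To make the penalty concrete, the following Python sketch evaluates the elastic net objective, that is, the residual sum of squares plus an L2 penalty weighted by `lam2` and an L1 penalty weighted by `lam1`, for a given coefficient vector. It only evaluates the objective and does not minimise it; the function name `elastic_net_objective` and the small data set are illustrative assumptions.

```python
def elastic_net_objective(X, y, beta, lam1, lam2):
    """Residual sum of squares + lam2 * ||beta||^2 + lam1 * ||beta||_1."""
    residual_ss = sum(
        (yi - sum(xij * bj for xij, bj in zip(xi, beta))) ** 2
        for xi, yi in zip(X, y)
    )
    l1_penalty = sum(abs(b) for b in beta)   # LASSO part
    l2_penalty = sum(b * b for b in beta)    # ridge part
    return residual_ss + lam2 * l2_penalty + lam1 * l1_penalty


# Hypothetical data: three instances, two numeric attributes.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [1.0, 2.0, 3.0]
beta = [1.0, 1.0]

full = elastic_net_objective(X, y, beta, lam1=0.5, lam2=0.1)
lasso_only = elastic_net_objective(X, y, beta, lam1=0.5, lam2=0.0)  # lam2 = 0
ridge_only = elastic_net_objective(X, y, beta, lam1=0.0, lam2=0.1)  # lam1 = 0
print(full, lasso_only, ridge_only)
```

Setting `lam2 = 0` leaves the LASSO objective and setting `lam1 = 0` leaves the ridge objective, which is the sense in which both are special cases of the elastic net.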
The Process
Data Collection and Refinement.
The organization under study is a manufacturing organisation, and the personnel under study are blue-collar employees
who are governed by their employee union. The entire implementation is divided into phases. To apply any of
the algorithms we first need a data set, so the first step in the implementation is data collection. In this project the data
was provided by the organisation's HR department in raw format; some of the data had to be inspected and
collected in person. The data set consists of 11 attributes for each of the 3 years 2012, 2013 and 2014, giving 3 individual data sets. This was
done to analyse performance over the 3 years. Finally, we combined the data sets of all 3 years into a single data set called
master, which was used to compare the efficiency of the algorithms.
The refined data was converted to CSV format, comprising the 3 individual yearly data sets and one master data set that is a
combination of all 3.
Inputting to WEKA.
The master data set is input to WEKA and, if necessary, filtered to suit the type of algorithm. It is then
classified using the algorithms of our choice; classification is evaluated both on the training data set and with 10-fold
cross-validation. We applied each of the algorithms available in WEKA to the master data set, but for comparison
purposes we have taken only the few that gave the maximum efficiency on both training and 10-fold cross-validation, with the default
66% split.
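The idea behind 10-fold cross-validation can be pictured with the short Python sketch below. It is not WEKA's own procedure, which may differ (for example by stratifying the folds); the function name `k_fold_indices`, the seed and the instance count of 25 are illustrative assumptions.

```python
import random


def k_fold_indices(n_instances, k=10, seed=1):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(25, k=10))
for train, test in splits:
    # Each instance is held out for testing in exactly one of the 10 rounds.
    print(len(train), len(test))
```

Each model is built on nine folds and tested on the remaining one, and the reported figure is aggregated over the ten rounds.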
Of all the algorithms, One R was found to give the maximum efficiency, with figures of 70.0361% for
training and 69.6751% for 10-fold cross-validation for the performance attribute. The efficiency figures for the rest are given in the
conclusion section. Thus we compare our proposed method, Elastic Net, to the One R algorithm.
The Elastic Net algorithm is already available, so here we focus on implementing WEKA-compatible Java code for Elastic
Net. The coded Java file is placed in the functions folder of the WEKA classifiers. To make it accessible via the WEKA GUI, we attach a
GUI property manager.
Elastic Net, being a regression technique, involves statistical models and hence can be applied only to numeric data.
The entire master data set was therefore converted to numbers. For example, the gender attribute has 2 values, female and male;
female was coded as 0 and male as 1. All the attributes were converted in this way.
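The conversion described above amounts to a lookup table per nominal attribute. A minimal Python sketch follows; only the gender coding (female as 0, male as 1) comes from the text, while the function name `encode_nominal`, the `units_produced` attribute and the two example rows are hypothetical.

```python
def encode_nominal(rows, mappings):
    """Replace nominal values by numeric codes; other values pass through."""
    return [
        {
            name: (mappings[name][value] if name in mappings else value)
            for name, value in row.items()
        }
        for row in rows
    ]


# Coding taken from the text: female -> 0, male -> 1.
mappings = {"gender": {"female": 0, "male": 1}}

# Hypothetical rows; 'units_produced' is an illustrative attribute name.
rows = [
    {"gender": "female", "units_produced": 40},
    {"gender": "male", "units_produced": 35},
]
encoded = encode_nominal(rows, mappings)
print(encoded)
```

Note that integer codes impose an ordering on the values. That is harmless for a two-valued attribute such as gender, but for attributes with more than two unordered values it is a modelling choice worth stating.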
The modified master file was input to WEKA, and we chose Elastic Net, listed under functions in the WEKA classifiers.
We ran it on the data set for both training and 10-fold cross-validation. The results of the Elastic Net algorithm are not expressed as an
efficiency percentage but as a correlation coefficient: the higher the correlation, the better the efficiency. The figures we obtained
for training and 10-fold cross-validation are 0.3297 and 0.2502 respectively for the performance attribute. Elastic Net also
provides an impact value between -1 and +1 for each of the other attributes, where -1 means an inverse relation, +1 means a direct
relation and 0 means no relation. This impact factor is compared with that generated by the GainRatioAttributeEval method
with One R as the base algorithm, and the results are presented in the form of a table.
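For readers unfamiliar with the measure, the sketch below computes a correlation coefficient between actual and predicted values in Python. It assumes the reported figure is the usual Pearson correlation; the function name `correlation_coefficient` and the sample values are illustrative, not taken from the study.

```python
import math


def correlation_coefficient(actual, predicted):
    """Pearson correlation between actual and predicted numeric values."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    covariance = sum((a - mean_a) * (p - mean_p)
                     for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return covariance / math.sqrt(var_a * var_p)


# Hypothetical values: predictions that rise and fall with the actual values
# give a coefficient near +1; predictions that move the opposite way, near -1.
actual = [1.0, 2.0, 3.0, 4.0]
print(correlation_coefficient(actual, [2.0, 4.0, 6.0, 8.0]))
print(correlation_coefficient(actual, [8.0, 6.0, 4.0, 2.0]))
```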
The employee data set can be studied and analyzed with various classifiers in WEKA. We studied
various papers and chose J48 and Rotation Forest as the best among the algorithms they report. These same algorithms also
gave us better results on our data set.
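[Figure: time taken to build the model, in seconds, for J48, Rotation Forest and One R. Recovered values: 0.27, 0.24, 0.03, 0, 0, 0.01.]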
Considering J48 and Rotation Forest, mentioned in our base papers, we carried out a time study against One R. The results
depicted above reveal that One R takes the least time to build: 0 seconds. Thus One R is the best solution that WEKA offers for
the master data set. Hence we decided to compare our implementation of Elastic Net with the One R algorithm.
Comparison Study.
The prime focus is on the two algorithms, One R and Elastic Net. While the efficiency of One R can be
evaluated by the percentage of correctly classified instances, that of Elastic Net is expressed as a correlation coefficient. This is
because Elastic Net, being a regression technique, is a statistical model and hence is applied to data of numeric type.
[Figure: comparison on the training set and 10-fold cross-validation. Recovered values: 0.03, 0.03 and 0, 0.]
Elastic Net takes all the attributes other than the one being evaluated into consideration. It overcomes the limitation of
LASSO, where only highly related attributes are considered for impact calculation. Elastic Net uses the correlation coefficient to
express its efficiency: the higher the correlation, the better the efficiency. Since the efficiency of the two algorithms cannot be measured
in the same unit, we consider the root relative squared error. The comparison is as follows:
[Figure: root relative squared error (%), One R versus Elastic Net. One R: 112.14 %; Elastic Net: 96.11 %.]
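As a reminder of what the measure expresses, root relative squared error compares a model's squared error with that of a baseline that always predicts the mean of the actual values. The Python sketch below is a minimal illustration; the function name `root_relative_squared_error` and the sample values are assumptions, and WEKA's own computation may differ in detail (for instance in which mean it uses as the baseline).

```python
import math


def root_relative_squared_error(actual, predicted):
    """RRSE in percent: 100 * sqrt(sum (p - a)^2 / sum (a - mean(a))^2)."""
    mean_a = sum(actual) / len(actual)
    model_error = sum((p - a) ** 2 for a, p in zip(actual, predicted))
    baseline_error = sum((a - mean_a) ** 2 for a in actual)
    return 100.0 * math.sqrt(model_error / baseline_error)


# Hypothetical values, not taken from the study's data set.
actual = [1.0, 2.0, 3.0, 4.0]
print(root_relative_squared_error(actual, [2.5, 2.5, 2.5, 2.5]))  # mean baseline
print(root_relative_squared_error(actual, [1.0, 2.0, 3.0, 4.0]))  # perfect fit
```

A value of 100% therefore matches the mean baseline, a value below 100% improves on it, and a value above 100% is worse than always predicting the mean.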
Thus the results reveal that Elastic Net has a lower root relative squared error than One R on both the training set
and 10-fold cross-validation. Although One R has proved better for attributes that contain a range of values, its high root relative
squared error exposes its limitations. If the goal is to analyze data with nominal or string attributes, and to support a range of
values for attributes, One R is the choice. To analyze data sets with minimal attributes, Elastic Net has the upper hand,
since it gives an impact value for each of the attributes. Since manual intervention is essential for data preprocessing in
Elastic Net, and since its efficiency is not expressed directly as a percentage, we find it hard to explain in layman's terms. Thus we
recommend One R for traditional blue-collar organizations with data sets similar to ours, owing to its efficiency
and its ease of use. We conclude that if the data pattern remains the same for the given data set, then the performance is
predicted to be “GOOD”. If there are variations in the data set, then the performance can vary accordingly.
Future Enhancement
The idea for this project was drawn from the latest developments in regression techniques. With future
advancements and better algorithms, applying them to the data set may change the result. A larger number of data instances and a
better refined data set may also change the result.
References
[1] Emin Kahya, “The effects of job characteristics and working conditions on job performance”.
[2] V. Kalaivani and M. Elamparithi, “An Efficient Classification Algorithms for Employee Performance
Prediction”.
[3] K. Mohammed Hussain and P. Sheik Abdul Kadher, “A Review of Factors and Data Mining Techniques for
Employee Attrition and Retention in Industries”.
[4] M. S. Mythili and A. R. Mohamed Shanavas, “An Analysis of students’ performance using classification
algorithms”.
[5] Osman M. Karatepe, “The effects of family support and work engagement on organizationally valued job
outcomes”.
[6] Qasem A. Al-Radaideh and Eman Al Nagi, “Using Data Mining Techniques to Build a Classification Model for
Predicting Employees Performance”.
[7] Thushel Jayaweera, “Impact of Work Environmental Factors on Job Performance, Mediating Role of Work
Motivation: A Study of Hotel Sector in England”.