DATA MINING TECHNIQUES OF TELECOMMUNICATION COMPANIES IN NIGERIA (CASE STUDY OF MTN NIGERIA)
BY
RAMLAT IBRAHIM
IMT/18D/2200
MAY, 2023
APPROVAL PAGE
This project report entitled “Data Mining Techniques of Telecommunication Companies in Nigeria (Case Study of MTN Nigeria)” meets the regulations governing the award of the B.Tech degree of Modibbo Adama University, Yola, and is approved for its contribution to knowledge and literary presentation.
--------------------------------------                    ----------------------
Name:                                                      Date
(Project Supervisor)

--------------------------------------                    ----------------------
Name:                                                      Date
(External Examiner)
Table of Contents
Chapter One: Introduction
Chapter Two: Literature Review
2.1 Introduction
2.6 Summary
Chapter Three: Methodology
3.1 Introduction
CHAPTER ONE
INTRODUCTION
Telecommunication networks are extremely complex configurations of equipment, composed of thousands of interconnected components. Each network element can generate error and status messages, which leads to a tremendous amount of network data. This data must be stored and analyzed to support network management functions such as fault isolation (Nora, 2017). At a minimum, this data will include a timestamp, a string that uniquely identifies the hardware or software component generating the message, and a code that explains why the message is being generated. For example, such a message might indicate that “controller 7 experienced a loss of power for 30 seconds starting at 10:03 pm on Monday, May 12.”
Galambos (2014) observed that the actual data mining task is the automatic or semi-automatic analysis of large quantities of data to extract previously unknown, interesting patterns such as groups of data records (cluster analysis), unusual records (anomaly detection) and dependencies (association rule mining). This usually involves using database techniques such as spatial indices. Owing to the huge rate at which network messages are generated, skilled workers cannot handle all the incoming and outgoing messages. For this reason, expert systems have been developed to automatically analyze all the messages and take the necessary action, involving skilled workers only when a problem cannot be automatically resolved (Eze et al., 2016).
These patterns can then be seen as a kind of summary of the input data, and may be used in further analysis or, for example, in machine learning and predictive analytics. For example, the data mining step might identify multiple groups in the data, which can then be used to obtain more accurate prediction results by a decision support system. Neither the data collection, data preparation, nor result interpretation and reporting are part of the data mining step, but they do belong to the overall KDD (Knowledge Discovery in Databases) process as additional steps (Nicholson, 2014).
Since its launch in August 2001, MTN has steadily deployed its services across Nigeria. At its annual general meeting on Monday, 7 June 2021, the company disclosed that it had achieved 89.9% nationwide coverage. It now provides services in 223 cities and towns, more than 10,000 villages and communities, and a growing number of highways across the country, spanning the 36 states of Nigeria and the Federal Capital Territory, Abuja. Many of these villages and communities are being connected to the world of telecommunications for the first time ever.
1.2 Statement of the Problem
Fraud is a serious problem for telecommunication companies, leading to billions of dollars in lost revenue each year. Fraud can be divided into two categories: subscription fraud and superimposition fraud. Subscription fraud occurs when a customer opens an account with the intention of never paying the account charges. Superimposition fraud involves a legitimate account with some legitimate activity, but also includes some “superimposed” illegitimate activity by a person other than the account holder (Kolajo and Adeyemo, 2012).
It is not feasible for people to analyze large amounts of data without the assistance of appropriate computational tools. Therefore, the development of automatic and intelligent tools becomes essential for analyzing, interpreting, and correlating data in order to develop and select strategies in the context of each application. To serve this new context, the area of Knowledge Discovery in Databases (KDD) came into existence, attracting great interest within the scientific, industrial, and commercial communities. The popular expression "data mining" actually refers to one of the stages of knowledge discovery in databases. The term "KDD" was formally recognized in 1989 in reference to the broad concept of procuring knowledge from databases.
Superimposition fraud poses the bigger problem for the telecommunications industry, and for this reason data mining techniques are used to identify this type of fraud (Bharati, 2017). These applications should ideally operate in real time using the call detail records and, once fraud is detected or suspected, should trigger some action. This action may be to immediately block the call and/or deactivate the account, or may involve opening an investigation, which will result in a call to the customer to verify the legitimacy of the account activity. It is against this background that the current study seeks to examine the various data mining techniques of Mobile Telecommunication Network in Nigeria (MTNN). This research work therefore addresses the data mining techniques being used by MTN Nigeria, which will facilitate better performance of telecommunication companies in data security and mining.
ii. To examine the various data mining techniques of MTN Nigeria.
iii. To identify the challenges of data mining faced by MTN Nigeria.
iv. To recommend ways of improving the data mining techniques being used in MTN Nigeria.
i. The outcome of this study will educate readers on the data mining techniques of telecommunication companies in Nigeria, the data mining applications, and how they can be used in fraud detection.
ii. This research will contribute to the body of literature on the data mining techniques of telecommunication companies, thereby constituting empirical literature for future research in the subject area.
Data mining is a process of extracting and discovering patterns in large data sets involving methods
at the intersection of machine learning, statistics, and database systems.
A network service provider (NSP) is a business or organization that sells bandwidth or network access by providing direct Internet backbone access to Internet service providers and, usually, access to its network access points (NAPs).
Networking is the exchange of information and ideas among people with a common profession or
special interest, usually in an informal social setting. Networking often begins with a single point of
common ground.
Internet fraud is the use of Internet services or software with Internet access to defraud victims or
to otherwise take advantage of them.
CHAPTER TWO
LITERATURE REVIEW
2.1 Introduction
The telecommunications companies in Nigeria generate and store a massive amount of data. These data include call detail data, which describes the calls that traverse the telecommunication networks; network data, which describes the state of the hardware and software components in the network; and customer data, which describes the telecommunication customers (Weiss et al., 2017).
The actual data mining task is the automatic or semi-automatic analysis of large quantities of data to
extract previously unknown interesting patterns such as groups of data records (cluster analysis),
unusual records (anomaly detection) and dependencies (association rule mining). This usually
involves using database techniques such as spatial indices. These patterns can then be seen as a kind
of summary of the input data, and may be used in further analysis or, for example, in machine
learning and predictive analytics. For example, the data mining step might identify multiple groups
in the data, which can then be used to obtain more accurate prediction results by a decision support
system. Neither the data collection, data preparation, nor result interpretation and reporting are part
of the data mining step but do belong to the overall KDD process as additional steps (Eze et al., 2016).
A data mining algorithm is a set of heuristics and calculations that creates a data mining model from
data. To create a model, the algorithm first analyzes the data you provide, looking for specific types
of patterns or trends. The algorithm uses the results of this analysis to define the optimal parameters
for creating the mining model. These parameters are then applied across the entire data set to extract
actionable patterns and detailed statistics. The mining model that an algorithm creates from your data
can take various forms, including:
A set of clusters that describe how the cases in a dataset are related.
A decision tree that predicts an outcome and describes how different criteria affect that
outcome.
A mathematical model that forecasts sales.
A set of rules that describe how products are grouped together in a transaction, and the probabilities that products are purchased together (Bharati, 2017).
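To make these model forms concrete, the sketch below builds two of them, a set of clusters and a decision tree, from a small synthetic customer table using the scikit-learn library. It is a hypothetical illustration only; the data and feature names are made up and are not MTN records.

```python
# Hypothetical illustration of two mining-model forms: clusters and a decision tree.
# Assumes numpy and scikit-learn are installed; the data are synthetic, not MTN records.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(42)

# Synthetic customer usage features: [monthly_minutes, monthly_sms]
usage = np.vstack([
    rng.normal(loc=(200, 50), scale=20, size=(50, 2)),   # light users
    rng.normal(loc=(900, 300), scale=50, size=(50, 2)),  # heavy users
])

# A set of clusters describing how the cases in the dataset are related.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit(usage)
print("Cluster centres:", clusters.cluster_centers_)

# A decision tree predicting an outcome (here, a synthetic churn label).
churn = (usage[:, 0] > 600).astype(int)  # demonstration label only
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(usage, churn)
print(export_text(tree, feature_names=["monthly_minutes", "monthly_sms"]))
```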
information repositories and of discovering implicit, but potentially useful information (Han,
Kamber, & Pei, 2011). Data mining has the capability to uncover hidden relationships and to reveal
unknown patterns and trends by digging into large amounts of data (Sumathi & Sivanandam, 2006).
The functions, or models, of data mining can be categorized according to the task performed:
association, classification, clustering, and regression (Hui & Jha, 2000; Kao, Chang, & Lin, 2003;
Nicholson, 2006b). Data mining analysis is normally based on three techniques: classical statistics,
artificial intelligence, and machine learning (Girija & Srivatsa, 2006).
Classical statistics is mainly used for studying data and data relationships, as well as for dealing with numeric data in large databases (Hand, 1998). Examples of classical statistics include regression analysis, cluster analysis, and discriminant analysis. Artificial intelligence (AI) applies “human-thought-like” processing to statistical problems (Girija & Srivatsa, 2006). AI uses several techniques such as genetic algorithms, fuzzy logic, and neural computing. Finally, machine learning is the combination of advanced statistical methods and AI heuristics, used for data analysis and knowledge discovery (Kononenko & Kukar, 2007). Machine learning uses several classes of techniques: neural networks, symbolic learning, genetic algorithms, and swarm optimization. Data mining benefits from these technologies but differs in the objective pursued: extracting patterns, describing trends, and predicting behavior. A typical data mining process, as shown in Figure 2.1, is an interactive sequence of steps that normally starts by integrating raw data from different data sources and formats. These raw data are cleansed in order to remove noise, duplicated records, and inconsistent data (Han et al., 2011). The cleansed data are then transformed into appropriate formats that can be understood by other data mining tools, and filtration and aggregation techniques are applied to the data in order to extract summarized data. Interesting knowledge is then extracted from the transformed data, and this information is analyzed in order to identify the truly interesting patterns. Finally, the knowledge is presented visually to the user. More detailed information regarding a data mining process can be found in (Han et al., 2011).
Figure 2.1: Data mining process (Han et al., 2011)
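As a rough sketch of the cleansing and transformation steps just described, the following code applies noise removal, de-duplication and aggregation with pandas to a made-up call table. The column names and the toy values are assumptions for illustration, not MTN's schema.

```python
# Minimal sketch of the KDD-style preprocessing steps described above,
# using pandas on a made-up call table (column names are illustrative only).
import pandas as pd

raw = pd.DataFrame({
    "customer_id": ["A1", "A1", "A2", "A2", "A2", None],
    "call_minutes": [3.5, 3.5, 12.0, -1.0, 7.2, 4.0],   # -1.0 is noise
})

# 1) Cleansing: drop records with missing ids, duplicates and impossible values.
clean = (raw.dropna(subset=["customer_id"])
            .drop_duplicates()
            .query("call_minutes >= 0"))

# 2) Transformation/aggregation: summarise to one row per customer.
summary = clean.groupby("customer_id")["call_minutes"].agg(["count", "mean", "sum"])
print(summary)
```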
Data mining techniques are applied in a wide range of domains where large amounts of data
are available for the identification of unknown or hidden information. In this sense, Girija and
Srivatsa (2006) indicate that data mining techniques used on the WWW are called web mining, those used on text are called text mining, and those used in libraries are called bibliomining. The term bibliomining, or
data mining for libraries, was first used by Nicholson and Stanton (2003) to describe the combination
of data warehousing, data mining and bibliometrics. This term is used to track patterns, behavior
changes, and trends of library systems transactions. Although the concept is not new, the term
bibliomining was created to facilitate the search of the terms “library” and “data mining” in the
context of libraries rather than in software libraries. Bibliomining is an important tool to discover
useful library information in historical data to support decision-making (Kao et al., 2003). However, to provide a complete report of the library system, bibliomining needs to be applied iteratively in combination with other measurement and evaluation methods; as strategic information is discovered, more questions may be raised, thus starting the process again (Nicholson, 2003b).
Bibliomining, as any knowledge extraction method, needs to follow a systematic procedure in
order to allow an appropriate knowledge discovery. The bibliomining process starts by determining
areas of focus and collecting data from internal and external sources (Nicholson, 2003b). Then, these
data are collected, cleansed, and anonymized into a data warehouse. To discover meaningful patterns
in the collected data, the bibliomining process includes the selection of appropriate analysis tools and
techniques from statistics, data mining, and bibliometrics (Nicholson, 2006a). Interesting patterns are
analyzed and visualized through reports. The mining process is iterated until the resulting information is verified and approved by key users such as librarians and library managers (Shieh, 2010). The application of bibliomining tools is an emerging trend that can be used to understand
patterns of behavior among library users and staff, and patterns of information resource use
throughout the library (Nicholson & Stanton, 2006). Bibliomining is highly recommended to provide
useful and necessary information for library management requirements, focusing on professional librarianship issues, although it is highly dependent on database technology (Shieh, 2010). Bibliomining can also be
used to provide a comprehensive overview of the library workflow in order to monitor staff
performance, determine areas of deficiency, and predict future user requirements (Prakash, Chand, &
Gohel, 2004).
The resulting information makes it possible to perform scenario analysis of the library system, in which different situations that need to be taken into account during a decision-making process are evaluated (Nicholson, 2006a). An additional application is to standardize structures and
reports in order to share data warehouses among groups of libraries, allowing libraries to benchmark
their information (Nicholson, 2006a). Therefore, in order to improve the interaction quality between
a library and its users, the application of data mining tools in libraries is worth pursuing (Chang &
Chen, 2006). Studies in this area have investigated how far academic libraries are pragmatically using data mining tools, and in which library aspects librarians are implementing them, using content and statistical analyses to examine articles that include case studies of academic libraries implementing data mining tools.
Every time a call is placed on a telecommunications network, descriptive information about the
call is saved as a call detail record. The number of call detail records that are generated and stored is
huge. For example, AT&T long distance customers alone generate over 300 million call detail
records per day (Pregibon, 2013). Given that several months of call detail data is typically kept
online, this means that tens of billions of call detail records will need to be stored at any time. Call
detail records include sufficient information to describe the important characteristics of each call. At
a minimum, each call detail record will include the originating and terminating phone numbers, the
date and time of the call and the duration of the call. Call detail records are generated in real-time
and therefore will be available almost immediately for data mining (Cortes, 2014). This can be
contrasted with billing data, which is typically made available only once per month. Call detail
records are not used directly for data mining, since the goal of data mining applications is to extract
knowledge at the customer level, not at the level of individual phone calls. Thus, the call detail
records associated with a customer must be summarized into a single record that describes the
customer’s calling behavior. The choice of summary variables (i.e., features) is critical in order to
obtain a useful description of the customer. Below is a list of features that one might use when
generating a summary description of a customer based on the calls they originate and receive over
some time period P:
These eight features can be used to build a customer profile. Such a profile has many potential
applications. For example, it could be used to distinguish between business and residential customers
based on the percentage of weekday and daytime calls. Most of the eight features listed above were
generated in a straightforward manner from the underlying data, but some features, such as the
eighth feature, required a little more thought and creativity (Cortes, 2016). Because most people call
only a few area codes over a reasonably short period of time (e.g., a month), this feature can help
identify telemarketers, or telemarketing behavior, since telemarketers will call many different area
codes. The above example demonstrates that generating useful features, including summary features,
is a critical step within the data mining process. Should poor features be generated, data mining will
not be successful. Although the construction of these features may be guided by common sense and
expert knowledge, it should include exploratory data analysis (Kukar, 2007). For example, the use of
the time period 9am-5pm in the fifth feature is based on the commonsense knowledge that the typical
workday is 9 to 5 (and hence this feature may be useful in distinguishing between business and
residential calling patterns).
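As an illustration of this summarisation step, the sketch below derives a few of the kinds of features mentioned above (percentage of daytime 9am-5pm calls, percentage of weekday calls, average call duration, and the number of unique area codes called) from a toy call-detail table. The table layout and the exact feature set are assumptions for demonstration, not MTN's actual schema.

```python
# Hedged sketch: summarising raw call detail records (CDRs) into customer-level
# features. Field names, values and feature choices are illustrative assumptions.
import pandas as pd

cdrs = pd.DataFrame({
    "customer": ["C1", "C1", "C1", "C2", "C2"],
    "start":    pd.to_datetime(["2023-05-01 10:15", "2023-05-01 21:40",
                                "2023-05-06 11:00", "2023-05-02 09:30",
                                "2023-05-02 14:05"]),
    "duration_s": [120, 35, 300, 8, 600],
    "called_area_code": ["0805", "0803", "0805", "0701", "0902"],
})

def summarise(group: pd.DataFrame) -> pd.Series:
    hours = group["start"].dt.hour
    weekday = group["start"].dt.dayofweek < 5
    return pd.Series({
        "pct_daytime_calls": ((hours >= 9) & (hours < 17)).mean(),
        "pct_weekday_calls": weekday.mean(),
        "avg_duration_s": group["duration_s"].mean(),
        "unique_area_codes": group["called_area_code"].nunique(),
    })

# One summary row per customer, ready for profiling or classification.
profiles = cdrs.groupby("customer").apply(summarise)
print(profiles)
```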
This data will minimally include a timestamp, a string that uniquely identifies the hardware
or software component generating the message and a code that explains why the message is being
generated. For example, such a message might indicate that “controller 7 experienced a loss of power
for 30 seconds starting at 10:03 pm on Monday, May 12.” Due to the enormous number of network
messages generated, technicians cannot possibly handle every message. For this reason, expert
systems have been developed to automatically analyze these messages and take appropriate action,
only involving a technician when a problem cannot be automatically resolved (Weiss, 2018). As was
the case with the call detail data, network data is also generated in real-time as a data stream and
must often be summarized in order to be useful for data mining. This is sometimes accomplished by
applying a time window to the data. For example, such a summary might indicate that a hardware
component experienced twelve instances of a power fluctuation in a 10-minute period.
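A minimal sketch of such time-window summarisation is shown below, assuming a simple log of timestamped alarm messages; the field names, alarm codes and window length are illustrative assumptions, not an MTN data format.

```python
# Sketch of summarising network status messages over a fixed time window,
# e.g. "component X saw N power fluctuations in a 10-minute period".
# Field names and alarm codes are illustrative assumptions.
import pandas as pd

messages = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2023-05-12 22:03:00", "2023-05-12 22:04:10", "2023-05-12 22:09:55",
        "2023-05-12 22:15:30",
    ]),
    "component": ["controller-7", "controller-7", "controller-7", "controller-9"],
    "code": ["POWER_FLUCTUATION"] * 4,
})

# Count events per component per 10-minute window.
windowed = (messages
            .set_index("timestamp")
            .groupby("component")
            .resample("10min")["code"]
            .count()
            .rename("events_per_10min"))
print(windowed)
```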
Telecommunication companies, like other large businesses, may have millions of customers.
By necessity this means maintaining a database of information on these customers. This information
will include name and address information and may include other information such as service plan
and contract information, credit score, family income and payment history (Chang, 2003). This
information may be supplemented with data from external sources, such as from credit reporting
agencies. The customer data maintained by telecommunication companies does not substantially
differ from that maintained in most other industries (Pei, 2011). However, customer data is often
used in conjunction with other data in order to improve results. For example, customer data is
typically used to supplement call detail data when trying to identify phone fraud.
telecommunication industry has moved from identifying new customers to measuring customer value and then taking steps to retain the profitable customers. This shift has happened because it is more expensive to acquire new customers than to retain existing ones (Hand, 2014). Numerous data mining methods can be used to generate the customer lifetime value (the total net income a company can expect from a customer over time) for telecommunication customers. Different data mining techniques are used to model customer lifetime value for telecommunication customers. The key element of modeling the lifetime value of a telecommunication customer is to estimate how long he/she will remain with the current network. This helps the company predict when a customer is likely to leave and to take proactive steps to retain the customer. One of the serious issues that the telecommunications industry faces is customer churn (David, 1998).
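One simple way to turn an estimated tenure into a lifetime value figure is sketched below: it combines a predicted monthly churn probability with an average monthly margin. This is an illustrative formula only, under the assumption that expected tenure is roughly the reciprocal of the monthly churn probability; it is not the model MTN or any particular operator uses.

```python
# Hedged sketch: a very simple customer lifetime value (CLV) estimate built from
# a predicted monthly churn probability. Illustrative formula only.

def expected_lifetime_value(monthly_margin: float,
                            monthly_churn_prob: float,
                            discount_rate: float = 0.0) -> float:
    """Expected tenure is roughly 1 / churn probability (in months);
    CLV is the margin earned over that expected tenure, optionally discounted."""
    if not 0 < monthly_churn_prob <= 1:
        raise ValueError("churn probability must be in (0, 1]")
    expected_tenure_months = 1.0 / monthly_churn_prob
    if discount_rate <= 0:
        return monthly_margin * expected_tenure_months
    # Geometric-series discounting of a constant monthly margin.
    return monthly_margin * (1 - (1 + discount_rate) ** -expected_tenure_months) / discount_rate

# Example: a customer yielding 1,500 per month with a 4% monthly churn risk.
print(round(expected_lifetime_value(1500.0, 0.04), 2))        # undiscounted
print(round(expected_lifetime_value(1500.0, 0.04, 0.01), 2))  # 1% monthly discount
```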
The process of a customer leaving a company is referred to as churn, and churn analysis can be done through numerous systems and methods. With regard to network fault isolation and prediction, telecommunication networks are comprised of highly complex configurations of hardware and software. Since the industry requires optimum network efficiency and reliability, most of the network elements have the capability of self-diagnosis and of generating status and alarm messages, and expert systems were developed to handle these alarms. Network fault isolation in the telecommunication industry is quite a tedious task for the following reasons (Han et al., 2014): the volume of data is huge, and a single fault can generate many different, unrelated alarms. Hence alarm correlation has an important role in predicting network faults, and a proactive, rapid response is essential for maintaining the reliability of the network. Data mining techniques such as classification, neural networks and sequence analysis can be used for identifying network faults. The Telecommunication Alarm Sequence Analyzer (TASA) is a data mining tool which supports fault identification by searching for recurrent patterns of alarms. This information can be used to generate a rule-based alarm correlation system, which can be used for identifying faults in real time. Genetic algorithms are another method for predicting telecommunication switch failures; Timeweaver is a genetic-algorithm-based system which can operate directly on the raw network-level time series data and will identify patterns that successfully predict the target event. Bayesian belief networks can also be used to identify network faults. Standard classification tools can be used to generate rules to predict future failures, but this approach has several drawbacks; the most important is that some information will be lost in the reformulation process (Jha, 2014).
2.4.2 Marketing/Customer Profiling
The most likely reason for this is that MCI did not want to anger its customers by using
highly personal information (calling history). This demonstrates that privacy concerns are an issue
for data mining in the telecommunications industry, especially when call detail data is involved. The
MCI Friends and Family promotion relied on data mining to identify associations within data.
Another marketing application that relies on this technique is a data mining application for finding
the set of non-U.S. countries most often called together by U.S. telecommunication customers
(Cortes & Pregibon, 2001). One set of countries identified by this data mining application is Jamaica, Antigua, Grenada, and Dominica.
This information is useful for establishing and marketing international calling plans. A
serious issue with telecommunication companies is customer churn. Customer churn involves a
customer leaving one telecommunication company for another. Customer churn is a significant
problem because of the associated loss of revenue and the high cost of attracting new customers.
Some of the worst cases of customer churn occurred several years ago when competing long distance
companies offered special incentives, typically $50 or $100, for signing up with their company, a
practice which led to customers repeatedly switching carriers in order to earn the incentives.
Data mining techniques now give companies the ability to mine historical data in order to predict
when a customer is likely to leave. These techniques typically utilize billing data, call detail data,
subscription information (calling plan, features, contract expiration data) and customer information
(e.g., age).
Based on the induced model, the company can then take action, if desired. For example, a
wireless company might offer a customer a free phone for extending their contract. One such effort
utilized a neural network to estimate the probability h(t) of cancellation at a given time t in the future
(Datta, 2014). In the telecommunications industry, it is often useful to profile customers based on
their patterns of phone usage, which can be extracted from the call detail data. These customer
profiles can then be used for marketing purposes, or to better understand the customer, which in turn
may lead to better forecasting models. In order to effectively mine the call detail data, it must be
summarized to the customer level as described earlier in this chapter. Then, a classifier induction
program can be applied to a set of labeled training examples in order to build a classifier. This
approach has been used to identify fax lines (Orji, 2014) and to classify a phone line as belonging to
a business or residence (Cortes, 2018). Other applications have used this approach to identify phone
lines belonging to telemarketers and to classify a phone line as being used for voice, data, or fax.
Two sample rules for classifying a customer as being a business or residential customer are shown
below (using pseudo-code). These rules were generated using SAS Enterprise Miner, a sophisticated
data mining package that supports multiple data mining techniques. The rules shown below were
generated using a decision tree learner. However, a neural network was also used to predict the
probability of a customer being a business or residential customer, based solely on the distribution of
calls by time of day (i.e., the neural network had 24 inputs, one per hour of the day). The probability
estimate generated by the neural network was then used as an input (i.e. feature) to the decision tree
learner. Evaluation on a separate test set indicates that rule 1 is 88% accurate and rule 2 is 70%
accurate.
Rule 1: if < 43% of calls last 0-10 seconds, and < 13.5% of calls occur during the weekend, and the neural network says that P(business) > 0.58 based on the time-of-day call distribution, then business customer.

Rule 2: if calls are received over a two-month period from at most 3 unique area codes, and < 56.6% of calls last 0-10 seconds, then residential customer.
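Expressed in code, the two rules above could look like the following sketch. The feature names and the neural-network score are stand-ins assumed to be computed elsewhere (for example by the summarisation step described earlier); only the thresholds come from the rules as quoted.

```python
# The two decision-tree rules above, transcribed as plain Python functions.
# pct_short_calls, pct_weekend_calls, p_business_from_nn and unique_area_codes
# are assumed pre-computed customer-level features (illustrative names).

def is_business(pct_short_calls: float,
                pct_weekend_calls: float,
                p_business_from_nn: float) -> bool:
    """Rule 1: classify as a business customer."""
    return (pct_short_calls < 0.43
            and pct_weekend_calls < 0.135
            and p_business_from_nn > 0.58)

def is_residential(unique_area_codes: int, pct_short_calls: float) -> bool:
    """Rule 2: classify as a residential customer."""
    return unique_area_codes <= 3 and pct_short_calls < 0.566

print(is_business(0.30, 0.10, 0.70))   # True
print(is_residential(2, 0.40))         # True
```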
It is worth noting that because a telecommunications company generates a call detail record if the
calling (paying) party is its customer, the company will also have a sample of (received) calls for
non-customers. If a company has high overall market penetration, this sample may be large enough
for data mining. Thus, telecommunication companies have the technical ability to profile non-
customers as well as customers.
2.4.3 Fraud Detection
Fraud is the crime of using dishonest methods to take something valuable from another person. It is a very serious issue for the telecommunication industry, since it leads to the loss of billions of dollars in revenue. As defined by Gosset and Hyland (2014), telecommunication fraud is any activity by which a telecommunication service is obtained without the intention of paying. Telecommunication fraud can be classified into two categories, namely:
i. Subscription fraud
ii. Superimposition fraud
Subscription fraud occurs when a customer opens an account with the intention of never paying. Telecommunication companies consider superimposition fraud, which occurs when a perpetrator gains illegal access to the account of a legitimate customer, to be the more significant problem. Both subscription fraud and superimposition fraud should be detected immediately, and the customer account should be deactivated. Cellular cloning was a very serious issue in the 1990s; it was eliminated with authentication methods. Deviation detection and anomaly detection are the most common techniques used for detecting superimposed fraud. The combined use of customer signatures, dynamic clustering and pattern recognition are some other methods that have recently been applied in this area.
Absolute analysis and differential analysis are considered the two main sub-categories of approaches for fraud detection. According to Saurkar et al. (2014), the techniques most often used for fraud detection in telecommunications include statistical modeling, Bayesian rules, visualization methods, clustering, rule discovery, neural networks and Markov models, as well as combinations of more than one method. Customer data can also be used for detecting fraud; for example, price plan and credit rating information can be incorporated into the fraud analysis. Another common method for fraud detection is to create a profile of a customer’s calling behavior and compare current activity against this profile. This calling behavior can be generated by summarizing the call detail records for a particular customer. Fraud can be identified immediately after it happens only if the call detail records are updated in real time. Fraud detection systems work at the customer level, not at the individual call level. Fraud detection applications involve predicting a relatively rare event, where the class distribution involved is highly skewed.
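A minimal sketch of the profile-and-compare approach just described is shown below. It assumes each customer's historical daily call counts are available and flags days whose activity deviates sharply from the customer's own baseline; the data, threshold and deviation measure are illustrative assumptions, not a production fraud rule.

```python
# Hedged sketch of superimposition-fraud screening by deviation from a
# customer's own calling profile (illustrative data and threshold only).
from statistics import mean, stdev

def flag_anomalous_day(daily_call_counts: list[int],
                       todays_calls: int,
                       z_threshold: float = 3.0) -> bool:
    """Return True if today's call volume deviates from the customer's
    historical baseline by more than z_threshold standard deviations."""
    mu = mean(daily_call_counts)
    sigma = stdev(daily_call_counts)
    if sigma == 0:
        return todays_calls != mu
    return abs(todays_calls - mu) / sigma > z_threshold

history = [12, 9, 14, 11, 10, 13, 12, 8, 11, 10]   # typical daily call counts
print(flag_anomalous_day(history, 11))   # False: consistent with the profile
print(flag_anomalous_day(history, 85))   # True: candidate for investigation
```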
2.4.4 Network Fault Isolation
Telecommunication networks are extremely complex configurations of hardware and software. Most
of the network elements are capable of at least limited self-diagnosis, and these elements may
collectively generate millions of status and alarm messages each month. In order to effectively
manage the network, alarms must be analyzed automatically in order to identify network faults in a
timely manner or before they occur and degrade network performance. A proactive response is
essential to maintaining the reliability of the network. Because of the volume of the data, and
because a single fault may cause many different, seemingly unrelated, alarms to be generated, the
task of network fault isolation is quite difficult. Data mining has a role to play in generating rules for
identifying faults. The Telecommunication Alarm Sequence Analyzer (TASA) is one tool that helps
with the knowledge acquisition task for alarm correlation (Klemettinen, Mannila & Toivonen, 2011).
This tool automatically discovers recurrent patterns of alarms within the network data along with
their statistical properties, using a specialized data mining algorithm. Network specialists then use
this information to construct a rule-based alarm correlation system, which can then be used in real-
time to identify faults.
TASA is capable of finding episodic rules that depend on temporal relationships between the
alarms. For example, it may discover the following rule: “if alarms of type link alarm and link failure
occur within 5 seconds, then an alarm of type high fault rate occurs within 60 seconds with
probability 0.7.” Before standard classification tasks can be applied to the problem of network fault
isolation, the underlying time-series data must be represented as a set of classified examples. This
summarization, or aggregation, process typically involves using a fixed time window and
characterizing the behavior over this window. For example, if n unique alarms are possible, one
could describe the behavior of a device over this time window using a vector of length n. In this case each field in the vector would contain a count of the number of times a specific alarm occurs. One
may then label the constructed example based on whether a fault occurs within some other time
frame, for example, within the following 5 minutes. Thus, two time windows are required. Once this
encoding is complete, standard classification tools can be used to generate “rules” to predict future
failures. Such an encoding scheme was used to identify chronic circuit problems (Sasisekharan,
Seshadri & Weiss, 1996). The problem of reformulating time-series network events so that
conventional classification-based data mining tools can be used to identify network faults has been
studied. Weiss & Hirsh (2018) view this task as an event prediction problem while Fawcett &
Provost (2009) view it as an activity monitoring problem. Transforming the time-series data so that
standard classification tools can be used has several drawbacks. The most significant one is that
some information will be lost in the reformulation process. For example, using the vector-based representation just mentioned, all sequence information is lost. Timeweaver (Weiss & Hirsh, 1998)
is a genetic-algorithm based data mining system that is capable of operating directly on the raw
network-level time series data (as well as other time-series data), thereby making it unnecessary to
re-represent the network level data. Given a sequence of timestamped events and a target event T,
Timeweaver will identify patterns that successfully predict T. Timeweaver essentially searches
through the space of possible patterns, which includes sequence and temporal relationships, to find
predictive patterns. The system is especially designed to perform well when the target event is rare,
which is critical since most network failures are rare. In the case studied, the target event is the
failure of components in the 4ESS switching system.
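A rough sketch of the encoding just described is given below: it turns a stream of timestamped alarms into a fixed-length count vector per observation window, labelled by whether the target fault occurs in the following prediction window. The alarm type names and window sizes are illustrative assumptions, not the 4ESS case study's actual configuration.

```python
# Sketch of re-representing an alarm time series as classified examples:
# one alarm-count vector per observation window, labelled by whether the
# target fault occurs in the following prediction window. Illustrative only.
from datetime import datetime, timedelta

ALARM_TYPES = ["link_alarm", "link_failure", "high_fault_rate"]

def encode_window(events, window_start, window_len, predict_len,
                  fault_type="high_fault_rate"):
    """events: list of (timestamp, alarm_type). Returns (count_vector, label)."""
    window_end = window_start + window_len
    predict_end = window_end + predict_len

    counts = [0] * len(ALARM_TYPES)
    label = 0
    for ts, alarm in events:
        if window_start <= ts < window_end:
            counts[ALARM_TYPES.index(alarm)] += 1
        elif window_end <= ts < predict_end and alarm == fault_type:
            label = 1            # a fault follows this observation window
    return counts, label

events = [
    (datetime(2023, 5, 12, 22, 0, 1), "link_alarm"),
    (datetime(2023, 5, 12, 22, 0, 4), "link_failure"),
    (datetime(2023, 5, 12, 22, 0, 50), "high_fault_rate"),
]

vector, label = encode_window(events,
                              window_start=datetime(2023, 5, 12, 22, 0, 0),
                              window_len=timedelta(seconds=30),
                              predict_len=timedelta(minutes=5))
print(vector, label)   # [1, 1, 0] 1  -> ready for a standard classifier
```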
predict churn. The generated models are evaluated using ROC curves and AUC values. They also adopted cost-sensitive learning strategies to address the issues of imbalanced class labels and unequal misclassification costs. Another study discussed commercial bank customer churn prediction based on an SVM model and used a random sampling method to improve the SVM model, considering the imbalanced characteristics of customer data sets. A further study investigated determinants of customer churn in the Korean mobile telecommunications service market based on customer transaction and billing data. That study defines changes in a customer’s status from active use to non-use or suspended as partial defection, and from active use to churn as total defection. Results indicate that a customer’s status change explains the relationship between churn determinants and the probability of churn. A neural network (NN) based approach has also been used to predict customer churn in subscription-based cellular wireless services; the experimental results indicate that the neural network-based approach can predict customer churn with an accuracy of more than 92%.
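As an illustration of the neural-network approach mentioned above, the sketch below trains a small classifier with scikit-learn on synthetic subscriber data. The features, labels and resulting accuracy are made up for demonstration and are not a reproduction of the cited study or its 92% result.

```python
# Toy sketch of neural-network churn prediction on synthetic subscriber data.
# Features, labels and accuracy here are illustrative, not the cited study's.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 1000
# Synthetic features: [monthly_minutes, complaints, months_on_network]
X = np.column_stack([
    rng.normal(400, 150, n),
    rng.poisson(1.0, n),
    rng.integers(1, 48, n),
])
# Synthetic churn label: more complaints and shorter tenure raise churn risk.
logit = 0.8 * X[:, 1] - 0.05 * X[:, 2] - 0.5
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
model = make_pipeline(StandardScaler(),
                      MLPClassifier(hidden_layer_sizes=(16,),
                                    max_iter=1000, random_state=0))
model.fit(X_tr, y_tr)
print("Held-out accuracy:", round(model.score(X_te, y_te), 3))
```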
One review examined an academic database of literature from 2000 to 2006 covering 24 journals and proposed a classification scheme to classify the articles. Nine hundred articles were identified and reviewed for their direct relevance to applying data mining techniques to CRM. The authors found that the research area of customer retention received the most research attention, and that classification and association models are the two most commonly used models for data mining in CRM. A critique of the concept of data mining and customer relationship management in the organized banking and retail industries was also discussed. Most of these papers used existing customer data from a single database, and some of them used only demographic data. In one such system, however, data from different branches of a bank were merged into a single database, borrowers’ transactional data were analyzed, and the focus was on predicting prospective business sectors for loan disbursement in retail commercial banking.
2.6 Summary
This literature review discussed the most prevalent data mining techniques, machine learning and cluster analysis. Machine-learning algorithms can realize different functions such as classification, prediction and association. These systems and cluster analysis can outperform traditional methods in text mining and sentiment analysis, obtaining better accuracy and larger capacity tolerance. Data mining provides a variety of systems for identifying cooperative learning from vast datasets and an extensive range of methods for detecting useful knowledge from massive datasets, such as patterns, trends, and rules. Different data mining methods have been used in social network and telecommunication analysis, as focused on in this review. The current evaluation and update of data mining analysis were discussed and reviewed from different aspects. Data mining techniques still face many challenges in this area, which remain to be resolved through aggressive improvement.
CHAPTER THREE
METHODOLOGY
3.1 Introduction
This chapter states the various methods that will be used in the research, as well as the population of the study and the sampling techniques used in determining the sample size for the research. How data will be collected and analyzed is also discussed in this chapter.
The main objectives of this research will be achieved through quantitative methods, as inferential statistics will be used to measure the level of accuracy and validate responses from the respondents in accordance with the objectives of the research.
Adamawa is a state in northeastern Nigeria, whose capital and largest city is Yola. In 1991,
when Taraba State was carved out from Gongola State, the geographical entity Gongola State was
renamed Adamawa State, with four administrative divisions:
Adamawa, Ganye, Mubi and Numan. It is one of the thirty-six states that constitute the
Federal Republic of Nigeria. Adamawa is one of the largest states of Nigeria and occupies about
36,917 square kilometres. It is bordered by the states of Borno to the northwest, Gombe to the west
and Taraba to the southwest. Its eastern border forms the national eastern border with Cameroon.
Topographically, it is a mountainous land crossed by the large river valleys – Benue, Gongola and
Yedsarem. The valleys of Mount Cameroon, the Mandara Mountains and the Adamawa Plateau form
part of the landscape.
3.5 Sample and Sampling Techniques
A total sample size of 400 respondents will be randomly selected, using a confidence interval of 5 and a confidence level of 95% (e = 0.05), from the total population of 632 MTN Nigeria workers in Yola Metropolis. Based on this population, the sample size was determined at a 5% error of tolerance and a 95% degree of confidence, using Taro Yamane's formula:

n = N / (1 + N(e)²)

Where:
n = sample size
N = total population size
e = error tolerance (5% = 0.05)
1 = theoretical constant
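For reference, a small sketch of the Yamane computation in code is shown below. It only illustrates the formula itself; the value it returns depends entirely on the population and error-tolerance figures plugged in, and it is not a restatement of the study's chosen sample.

```python
# Taro Yamane's sample-size formula: n = N / (1 + N * e**2).
# Illustrative computation only, using the population figure stated above.
import math

def yamane_sample_size(population: int, error_tolerance: float = 0.05) -> int:
    """Return the Yamane sample size for a finite population N and error tolerance e."""
    return math.ceil(population / (1 + population * error_tolerance ** 2))

print(yamane_sample_size(632))  # population of MTN Nigeria workers stated above
```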
Data collected will be analyzed using frequencies and percentages. These frequencies and percentages will enable the researcher to clearly represent true data characteristics and findings with a great deal of accuracy. Interpretation and analysis of the data will also be used to describe items in the tables and charts used for this study.
REFERENCES
Aggarwal, C. (Ed.). (2017). Data Streams: Models and Algorithms. New York: Springer.

Aregbeyen (2011). The determinants of bank selection choices by customers: Recent and extensive evidence from Nigeria. International Journal of Business and Social Science, 2(22), 276-288.
Cortes, C., & Pregibon, D. (2018). Giga-mining. Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, August 27-31, 1998, New York, NY: AAAI Press, 174-178.

Ezawa, K., & Norton, S. (2015). Knowledge discovery in telecommunication services data using Bayesian network models. Proceedings of the First International Conference on Knowledge Discovery and Data Mining, August 20-21, 1995, Montreal, Canada: AAAI Press.

Ngai, E. W. T., Xiu, L., & Chau, D. C. K. (2009). Application of data mining techniques in customer relationship management: A literature review and classification. Expert Systems with Applications, 36, 2592-2602.
Ma, H., Qin, M., & Wang, J. (2009). Analysis of the business customer churn based on decision tree method. The Ninth International Conference on Control and Automation, Guangzhou, China.

Han, J., Altman, R. B., Kumar, V., Mannila, H., & Pregibon, D. (2002). Emerging scientific applications in data mining. Communications of the ACM, 45(8), 54-58.

Rehman, H. U., & Ahmed, S. (2008). An empirical analysis of the determinants of bank selection in Pakistan: A customer view. Pakistan Economic and Social Review, 46(2), 147-160.

Kaplan, H., Strauss, M., & Szegedy, M. (1999). Just the fax: Differentiating voice and fax phone lines using call billing data. Proceedings of the Tenth Annual ACM-SIAM Symposium.

Siddiqi, K. O. (2011). Interrelations between service quality attributes, customer satisfaction and customer loyalty in the retail banking sector in Bangladesh. International Journal of Business and Management, 6(3), 12-36.

Coussement, K., & Van den Poel, D. (2008). Churn prediction in subscription services: An application of support vector machines while comparing two parameter-selection techniques. Expert Systems with Applications, 34, 313-327.

Liebowitz, J. (1988). Expert System Applications to Telecommunications. New York, NY: John Wiley.

Berry, M., & Linoff, G. (2000). Mastering Data Mining. New York, NY: John Wiley and Sons.

Fawcett, T., & Provost, F. (1997). Adaptive fraud detection. Data Mining and Knowledge Discovery, 1(3), 291-316.

Mozer, M., Wolniewicz, R., Grimes, D., Johnson, E., & Kaushansky, H. (2000). Predicting subscriber dissatisfaction and improving retention in the wireless telecommunication industry.

Mo, Z., Zhao, S., Li, L., & Liu, A. (2007). A predictive model of churn in telecommunications based on data mining. IEEE International Conference on Control and Automation, Guangzhou, China.

PAKDD 2006 Data Mining Competition. http://www3.ntu.edu.sg/SCE/pakdd2006/competition/overview.htm

Pareek, D. Business Intelligence for Telecommunications. Auerbach Publications, Taylor & Francis Group.
Weiss, G. (2004). Mining with rarity: A unifying framework. SIGKDD Explorations.

Richter, Y., Yom-Tov, E., & Slonim, N. (2008). Predicting customer churn in mobile networks through analysis of social groups. SIAM.

Madhavi, S. (2012). The prediction of churn behaviour among Indian bank customers: An application of data mining techniques. International Journal of Marketing, Financial Services & Management Research, 1(2), 11-19.

Goonetilleke, O. T. L., & Caldera, H. A. (2013). Mining life insurance data for customer attrition analysis. Journal of Industrial and Intelligent Information, 1(1), 52-58.

Hu, X. (2005). A data mining approach for retailing bank customer attrition analysis. Applied Intelligence, 22, 47-60.