Ghost Imputation Project
Of
RAHUL BARIHA
Examination Roll no: UBS18CSC-004
CERTIFICATE
It is certified that the work contained in this report titled “GHOST IMPUTATION PROJECT” is the
original work of SNEHA KAUSHAL (UBS18CSC-001) and RAHUL BARIHA (UBS18CSC-004)
and has been carried out under my supervision.
________________________
Madhusmita Panda
Head of the Department
(Project Supervisor)
GANGADHAR MEHER UNIVERSITY, AMRUTA VIHAR SAMBALPUR
DECLARATION
We hereby declare that the work which is being presented in this project entitled “GHOST
IMPUTATION PROJECT”, submitted to GANGADHAR MEHER UNIVERSITY,
AMRUTA VIHAR SAMBALPUR in partial fulfilment of the requirements for the degree
of Bachelor of Science in Computer Science, is an authentic record of our own work carried out
under the supervision of Dr. Madhusmita Panda. The matter embodied in this project report has
not been submitted by us for the award of any other degree.
SNEHA KAUSHAL(UBS18CSC-001)
RAHUL BARIHA(UBS18CSC-004)
Date:
Place:
ACKNOWLEDGEMENT
The success and final outcome of this project required a great deal of guidance and
assistance from many people, and we are extremely fortunate to have received it all
along the completion of our project work. Whatever we have done is only due
to such guidance and assistance, and we will not forget to thank them. First
of all, we would like to thank the university for giving us such a great
opportunity to develop ourselves technically. The Vice Chancellor, Prof. Atanu
Kumar Pati, has supported us in every way possible. Secondly, we express our
gratitude to the Head of the Department, Dr. Madhusmita Panda, for
motivating and encouraging us every time. We are extremely grateful and
remain indebted to our guide, Dr. Madhusmita Panda, for being a source of
inspiration and for their constant support in the design, implementation and
evaluation of the project. We are thankful to them for their constant
constructive criticism and invaluable suggestions, which benefited us a lot
while developing the project “GHOST IMPUTATION PROJECT”. They have
been a constant source of inspiration and motivation for hard work, and they
have been very co-operative throughout this project work.
ABSTRACT
Noise and missing data are intrinsic characteristics of real-world data, leading to uncertainty
that negatively affects the quality of knowledge extracted from the data. The burden imposed
by missing data is often severe in sensors that collect data from the physical world, where
large gaps of missing data may occur when the system is temporarily off or disconnected.
How can we reconstruct missing data for these periods? We introduce an accurate and
efficient algorithm for missing data reconstruction (imputation) that is specifically designed
to recover off-period segments of missing data. This algorithm, Ghost, searches the
sequential dataset to find data segments whose prior and posterior segments match
those of the missing data. If there is a similar segment that also satisfies the constraint, such
as location or time of day, then it is substituted for the missing data. A baseline approach
results in quadratic computational complexity; therefore, we introduce a caching approach that
reduces the search space and improves the computational complexity to linear in the common
case.
Introduction
With recent technological advances and increases in computing capabilities, data-intensive
scientific discovery is being widely used. This has led to the introduction of
methods for analyzing data collected from multiple sources of information, i.e., “multivariate
data”. One of the inevitable challenges of real-world data analysis is uncertainty arising from
noise and missing data. This uncertainty negatively affects the quality of knowledge extracted
from the data. Indeed, the burden imposed by missing data is often severe in applications
collecting data from the physical world, e.g., mobile sensing [33] or genome sequencing
[6]. For example, consider battery-powered devices, such as smart watches, equipped with
inexpensive sensors such as ambient light and accelerometer. Due to sensor quality, battery
limits, and user preferences, context-sensing applications cannot continuously and
effectively collect data [33], and there are often segments of missing data, e.g., when the device
is turned off. These missing segments affect the quality of knowledge-extraction methods [31].
Although missing data reconstruction is an important requirement of these systems, it has not
received much attention. There are longstanding efforts in statistics [10], [40], [43] to
reconstruct missing data. These imputation methods assume the missing data points occur at
random, i.e., missing at random (MAR) or missing completely at random (MCAR).
If the data is missing not at random (MNAR) [40], the imputation process is more
challenging. In this report we propose a novel algorithm for imputation of multivariate sensor
data. This algorithm only uses (i) a constraint, such as time of day or location, (ii) the data
values immediately prior to the missing event, and (iii) the data values immediately following
the missing event. Since our method does not rely on statistical methods, it might be able to
handle some MNAR data, but only if a similar segment exists in the dataset. In particular, our
algorithm operates on multivariate and sequential data streams. It reads two adjacent data
segments, one before and one after the missing data (the missing segment), and searches the
dataset to find two segments similar to the adjacent segments of the missing segment. If the
segment between these two similar segments is of the same length as the missing segment, it is a
candidate recovery segment. Next, if the constraint values of the segment of interest match
the constraint values of the missing segment, the algorithm substitutes the missing segment
with the content of this candidate recovery segment. A naïve approach imposes a quadratic
computational complexity, so we add a pre-processing step that reads all data segments and
their indexes into a cache, achieving a linear computational complexity in the common case.
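The following is a minimal sketch of this baseline search, written in Java since the project is built on a Java platform. The class and method names (GhostImputer, findRecovery, constraintMatches) and the data layout are illustrative assumptions made for this report, not the authors' actual implementation; the stream is assumed to have already been converted to categorical symbols, with a placeholder symbol marking missing positions.

import java.util.Arrays;

// Illustrative sketch only: names and structure are assumptions, not the
// project's actual code. The stream is an array of categorical symbols and
// missing positions hold a placeholder (e.g. "?") so that candidate windows
// overlapping the gap fail the equality test.
public class GhostImputer {

    private final String[] data;        // categorical symbols, one per time step
    private final String[] constraint;  // constraint value per step, e.g. hour of day or location
    private final int w;                // window size: length of the prior and posterior segments

    public GhostImputer(String[] data, String[] constraint, int windowSize) {
        this.data = data;
        this.constraint = constraint;
        this.w = windowSize;
    }

    // Tries to recover the missing segment data[gapStart .. gapEnd] (inclusive),
    // assuming at least w observed values exist on each side of the gap.
    // It scans the stream for a position whose surrounding windows equal the
    // gap's prior and posterior windows and whose constraint values match,
    // and returns the first such candidate segment, or null if none exists.
    public String[] findRecovery(int gapStart, int gapEnd) {
        int gapLen = gapEnd - gapStart + 1;
        String[] prior = Arrays.copyOfRange(data, gapStart - w, gapStart);
        String[] posterior = Arrays.copyOfRange(data, gapEnd + 1, gapEnd + 1 + w);

        // Baseline: a full scan, which is what makes the naive approach quadratic.
        for (int i = w; i + gapLen + w <= data.length; i++) {
            if (i == gapStart) continue;                           // skip the gap itself
            String[] candPrior = Arrays.copyOfRange(data, i - w, i);
            String[] candPost  = Arrays.copyOfRange(data, i + gapLen, i + gapLen + w);
            if (Arrays.equals(prior, candPrior)
                    && Arrays.equals(posterior, candPost)
                    && constraintMatches(gapStart, i, gapLen)) {
                return Arrays.copyOfRange(data, i, i + gapLen);    // first match wins
            }
        }
        return null;  // no similar segment satisfying the constraint was found
    }

    // The constraint acts as a controlling variable: every step of the candidate
    // must carry the same constraint value as the corresponding step of the gap.
    private boolean constraintMatches(int gapStart, int candStart, int len) {
        for (int k = 0; k < len; k++) {
            if (!constraint[gapStart + k].equals(constraint[candStart + k])) {
                return false;
            }
        }
        return true;
    }
}

Under the same assumptions, one way to realize the cached variant described above would be to pre-index every window of length w in a hash map from its symbol sequence to its start positions, so that only positions whose prior window already matches the gap's prior window need to be examined; this is what brings the common case down to linear time.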
The characteristics and contributions of our algorithm are as follows. To allow full
reproducibility of our claims, we share the algorithm and its implementation with the
community; all of our code and links to acquire access to the datasets are available
online: https://sites.google.com/view/ghostimputation.
Heterogeneous Multivariate Real-World Data: Statistical imputation approaches are
optimized to handle numerical data [40], [43]. Real-world systems, however, produce data in
numerical, categorical, or binary forms. Our algorithm relies on a categorical abstraction of
the original data, converting data values to categorical symbols, e.g., by bucketing numeric
data into categories. Therefore, in contrast to statistical imputation, any data type,
regardless of its distribution, can be fed into this algorithm, i.e., nonparametric imputation.
All datasets we employed in this work are real-world datasets, and most of them (wearable,
mobile, IoT, and news media datasets) have not previously been used for imputation studies.
We recommend using this algorithm mainly for multivariate sensor data used in consumer
IoT and mobile devices, but to demonstrate its versatility, we also evaluate it on two other
real-world datasets (clinical data and real estate data).
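As a concrete illustration of this categorical abstraction, the short sketch below buckets a numeric reading into equal-width bins and returns a symbol; the class name, bin count, value range, and symbol labels are assumptions made only for the example, not the project's actual encoding.

// Illustrative equal-width bucketing of a numeric reading into a categorical
// symbol; the range and number of bins are assumptions for this example.
public final class Bucketizer {

    private final double min;
    private final double max;
    private final int bins;

    public Bucketizer(double min, double max, int bins) {
        this.min = min;
        this.max = max;
        this.bins = bins;
    }

    // Maps a raw value, e.g. an ambient-light reading, to a symbol such as "B3".
    public String toSymbol(double value) {
        double clipped = Math.max(min, Math.min(max, value));
        int bin = (int) ((clipped - min) / (max - min) * bins);
        if (bin == bins) {
            bin = bins - 1;   // keep the maximum value inside the last bin
        }
        return "B" + bin;
    }
}

// Example: new Bucketizer(0, 1000, 5).toSymbol(742.0) returns "B3"; categorical
// or binary attributes can be passed through as symbols unchanged.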
Instance-based Learning: Inspired by algorithms that learn from a single labelled instance,
our algorithm tries to estimate the missing data from the first similar instance that can be
found in a sequential search of the dataset. Clearly, relying on a single label (similar instance)
is prone to false-positive errors. Therefore, we also rely on a constraint as a controlling variable
that significantly reduces false positives. Our definition of a constraint is inspired by binary
constraints in constraint satisfaction problems (CSPs), but, unlike in traditional CSPs, it is not
used to reduce the search space.
Search Space: Continuously collecting and storing data can be expensive in terms of resource
usage, especially in battery-powered wireless devices. Data is typically not stored long-term
on these devices, and most data processing is conducted in cloud servers. Our algorithm can
reconstruct the missing data merely by finding the first match for the missing segment,
without a need to search the entire dataset. For instance, we used only three days of data for a
smartphone dataset and only seven days for a smartwatch dataset. These datasets are fairly
small, but in both of these examples our algorithm outperforms state-of-the-art algorithms
and reconstructs the missing data with higher accuracy. Note that all versions of our
algorithm, i.e., the baseline and cache-based ones, have only one effective parameter: the
window size. In the evaluation section we identify an optimal window size value for each
dataset. Therefore, based on the target dataset (or application), this parameter can be
assigned automatically, and a user with domain knowledge does not need to
decide its optimal value. There is another parameter used for tolerating slight
dissimilarity; we will demonstrate why users should not tolerate dissimilarity.
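For illustration, the toy example below runs the hypothetical GhostImputer sketched above with a window size of 2 on a tiny symbol stream; the data values, the constraint labels, and the chosen window size are all invented for this example.

// Toy usage of the illustrative GhostImputer; all values here are invented.
public class GhostDemo {
    public static void main(String[] args) {
        // A short stream of bucketed symbols with a missing segment at indices 6..7 ("?").
        String[] stream = {"A","B","C","D","A","B","?","?","A","B","C","D","A","B","C","D","A","B"};
        // Constraint value (e.g. a coarse time-of-day label) for every time step.
        String[] hour   = {"h1","h1","h1","h1","h2","h2","h2","h2","h3","h1","h1","h1","h1","h2","h2","h2","h2","h3"};

        GhostImputer imputer = new GhostImputer(stream, hour, 2);   // window size is the only tuned parameter
        String[] recovered = imputer.findRecovery(6, 7);
        if (recovered != null) {
            System.arraycopy(recovered, 0, stream, 6, recovered.length);
        }
        System.out.println(String.join(" ", stream));   // the gap is filled with "C D"
    }
}

In a real deployment the window size would be fixed once per dataset, as described above, rather than tuned by the end user.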
SCOPE
Heterogeneous Multivariate Real-World Data: Statistical imputation approaches are
optimized to handle numerical data. Real-world systems, however, produce data in
numerical, categorical, or binary forms. Our algorithm relies on a categorical abstraction of
the original data, converting data values to categorical symbols, e.g., by bucketing numeric
data into categories. Therefore, in contrast to statistical imputation, any data type,
regardless of its distribution, can be fed into this algorithm, i.e., nonparametric imputation. All
datasets we employed in this work are real-world datasets, and most of them (wearable,
mobile, IoT, and news media datasets) have not previously been used for imputation studies.
We recommend using this algorithm mainly for multivariate sensor data used in consumer
IoT and mobile devices, but to demonstrate its versatility, we also evaluate it on two other
real-world datasets (clinical data and real estate data).
Our algorithm can reconstruct the missing data merely by finding the first match for the
missing segment, without a need to search the entire dataset. For instance, we used only
three days of data for a smartphone dataset and only seven days for a smartwatch dataset.
These datasets are fairly small, but in both of these examples our algorithm outperforms
state-of-the-art algorithms and reconstructs the missing data with higher accuracy.
System Requirements Specification
Introduction
A Software Requirements Specification (SRS) – a requirements specification
for a software system – is a complete description of the behaviour of a system to be
developed. It includes a set of use cases that describe all the interactions the users will have
with the software. In addition to use cases, the SRS also contains non-functional
requirements. Non-functional requirements are requirements which impose constraints on the
design or implementation (such as performance engineering requirements, quality standards,
or design constraints).
In software engineering, the same meaning of requirements applies, except that the focus of
interest is the software itself.
FEASIBILITY STUDY
Technical Feasibility
Operational Feasibility
Economic Feasibility
ECONOMIC FEASIBILITY
A system that can be developed technically and that will be used if installed must still be a
good investment for the organization. In economic feasibility, the development cost of
creating the system is evaluated against the ultimate benefit derived from the new system.
Financial benefits must equal or exceed the costs.
The system is economically feasible. It does not require any additional hardware or
software. Since the interface for this system is developed using the existing resources and
technologies available at NIC, there is only nominal expenditure, and economic feasibility is
certain.
OPERATIONAL FEASIBILITY
Proposed projects are beneficial only if they can be turned into information
systems that meet the organization’s operating requirements. Operational feasibility
aspects of the project are to be taken as an important part of the project implementation.
Some of the important issues raised to test the operational feasibility of a project include
the following:
The well-planned design would ensure the optimal utilization of the computer resources and
would help in the improvement of performance status.
TECHNICAL FEASIBILITY
The technical issues usually raised during the feasibility stage of the investigation
include the following:
Functional Requirements:
Admin
1. Home
2. View All Data Users
3. View Inactive Data Users
4. Import Dataset
5. User Dataset Requests (Accept/Reject)
6. Logout
User
1. Home
2. Search Dataset Records, View, and Recover the Data
3. Send request to Admin
4. Logout
Non-Functional Requirements:
The major non-functional requirements of the system are as follows:
Usability:
The system is designed as a completely automated process; hence, there is little or no user
intervention.
Reliability:
The system is more reliable because of the qualities that are inherited
from the chosen platform, Java. Code built using Java is more reliable.
Performance:
This system is developed in high-level languages using advanced front-end and
back-end technologies. It gives a response to the end user on the client system within a very
short time.
Chapter 3
3.1 Introduction
ER diagrams are related to data structure diagrams (DSDs), which focus on the relationships of
elements within entities instead of relationships between entities themselves. ER diagrams are also
often used in conjunction with data flow diagrams (DFDs), which map out the flow of information for
processes or systems.
E-R Diagram:
3.3 DATA FLOW DIAGRAMS
Use case diagrams model the functionality of a system using actors and use cases. Use cases
are services or functions provided by the system to its users.
System
Draw your system's boundaries using a rectangle that contains use cases. Place actors outside
the system's boundaries.
Use Case
Draw use cases using ovals. Label the ovals with verbs that represent the system's functions.
Actors
Actors are the users of a system. When one system is the actor of another system, label the
actor system with the actor stereotype.
Relationships
Illustrate relationships between an actor and a use case with a simple line. For relationships
among use cases, use arrows labeled either "uses" or "extends." A "uses" relationship
indicates that one use case is needed by another in order to perform a task. An "extends"
relationship indicates alternative options under a certain use case.
Use Case Diagram
4.1.5 Home
4.1.17 Logout
4.2 Database Table Design (Screenshots)
4.2.1 Tables
4.2.2 Tables With Data
users
4.2.3 dataset
4.2.4 reqdata
CHAPTER 5 TESTING & TEST CASES
5.1 TESTING
TESTING METHODOLOGIES
The following are the Testing Methodologies:
o Unit Testing.
o Integration Testing.
o User Acceptance Testing.
o Output Testing.
o Validation Testing.
During unit testing, each module is tested individually and the module interfaces are
verified for consistency with the design specification. All important processing paths are
tested for the expected results. All error-handling paths are also tested.
2. Bottom-up Integration
This method begins the construction and testing with the modules at the lowest level
in the program structure. Since the modules are integrated from the bottom up, processing
required for modules subordinate to a given level is always available and the need for stubs is
eliminated. The bottom up integration strategy may be implemented with the following steps:
The low-level modules are combined into clusters that perform a specific
software sub-function.
A driver (i.e., a control program for testing) is written to coordinate test case input
and output.
The cluster is tested.
Drivers are removed and clusters are combined, moving upward in the program
structure.
The bottom-up approach tests each module individually, and then each module is
integrated with a main module and tested for functionality.
Text Field:
The text field can contain only a number of characters less than or equal to its
size. The text fields are alphanumeric in some tables and alphabetic in other tables. An incorrect
entry always flashes an error message.
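A minimal sketch of the kind of field checks described in this and the following subsection is given below; the class name, method names, and regular expressions are assumptions made for illustration and are not taken from the project code.

// Illustrative field checks; names and patterns are assumptions for this sketch.
public final class FieldValidator {

    // Alphanumeric text field limited to the column size.
    public static boolean isValidAlphanumeric(String value, int maxSize) {
        return value != null && value.length() <= maxSize && value.matches("[A-Za-z0-9]*");
    }

    // Alphabetic-only text field limited to the column size.
    public static boolean isValidAlphabetic(String value, int maxSize) {
        return value != null && value.length() <= maxSize && value.matches("[A-Za-z]*");
    }

    // Numeric field: digits 0 to 9 only; any other character is rejected.
    public static boolean isValidNumeric(String value) {
        return value != null && value.matches("[0-9]+");
    }
}

// On an invalid entry the caller would flash an error message instead of saving the value.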
Numeric Field:
The numeric field can contain only numbers from 0 to 9. An entry of any other character
flashes an error message.
The individual modules are checked for accuracy and for what they have to perform. Each
module is subjected to a test run along with sample data. The individually tested modules are
integrated into a single system. Testing involves executing the program with real data; the
existence of any program defect is inferred from the output. The testing should be planned so
that all the requirements are individually tested.
A successful test is one that brings out the defects for inappropriate data and
produces an output revealing the errors in the system.
The above testing is done by taking various kinds of test data. Preparation of test data plays
a vital role in system testing. After preparing the test data, the system under study is tested
using that test data. While testing the system using test data, errors are again uncovered
and corrected using the above testing steps, and the corrections are also noted for future use.
Using Live Test Data:
Live test data are those that are actually extracted from organization files. After a system is
partially constructed, programmers or analysts often ask users to key in a set of data from their normal
activities. Then, the systems person uses this data as a way to partially test the system. In other
instances, programmers or analysts extract a set of live data from the files and have them entered
themselves.
It is difficult to obtain live data in sufficient amounts to conduct extensive testing.
And, although it is realistic data that will show how the system will perform for the typical
processing requirement, assuming that the live data entered are in fact typical, such data
generally will not test all combinations or formats that can enter the system. This bias toward
typical values then does not provide a true systems test and in fact ignores the cases most
likely to cause system failure.
Using Artificial Test Data:
Artificial test data are created solely for test purposes, since they can be generated to test all
combinations of formats and values. In other words, the artificial data, which can quickly be
prepared by a data-generating utility program in the information systems department, make
possible the testing of all logic and control paths through the program.
The most effective test programs use artificial test data generated by persons other than those
who wrote the programs. Often, an independent team of testers formulates a testing plan,
using the systems specifications.
The package “GHOST IMPUTATION PROJECT” has satisfied all the requirements specified in the
software requirements specification and was accepted.
USER TRAINING
Whenever a new system is developed, user training is required to educate users
about the working of the system so that it can be put to efficient use by those for whom the
system has been primarily designed. For this purpose, the normal working of the project was
demonstrated to the prospective users. Its working is easily understandable, and since the
expected users are people who have a good knowledge of computers, the use of this system is
very easy.
MAINTENANCE
This covers a wide range of activities, including correcting code and design errors. To reduce
the need for maintenance in the long run, we have more accurately defined the user’s
requirements during the process of system development. Depending on the requirements, this
system has been developed to satisfy the needs to the largest possible extent. With
developments in technology, it may be possible to add many more features based on future
requirements. The coding and design are simple and easy to understand, which will
make maintenance easier.
TESTING STRATEGY :
A strategy for system testing integrates system test cases and design techniques into a
well-planned series of steps that results in the successful construction of software. The testing
strategy must incorporate test planning, test case design, test execution, and the resultant data
collection and evaluation. A strategy for software testing must accommodate low-level
tests that are necessary to verify that a small source code segment has been correctly
implemented, as well as high-level tests that validate major system functions against user
requirements.
Software testing is a critical element of software quality assurance and represents the
ultimate review of specification, design and coding. Testing represents an interesting anomaly
for software. Thus, a series of tests is performed on the proposed system before
the system is ready for user acceptance testing.
SYSTEM TESTING:
Software, once validated, must be combined with other system elements (e.g., hardware,
people, databases). System testing verifies that all the elements mesh properly and that overall
system function and performance are achieved. It also tests to find discrepancies between the
system and its original objectives, current specifications, and system documentation.
UNIT TESTING:
In unit testing, different modules are tested against the specifications produced during the
design of the modules. Unit testing is essential for verification of the code produced during
the coding phase, and hence the goal is to test the internal logic of the modules. Using the
detailed design description as a guide, important control paths are tested to uncover errors
within the boundary of the modules. This testing is carried out during the programming stage
itself. In this testing step, each module was found to be working satisfactorily with
regard to the expected output from the module.
Test cases can be divided into two types: positive test cases and negative test cases.
Positive test cases are conducted by the developer with the intention of obtaining the expected
output. Negative test cases are conducted by the developer with the intention of verifying that
no output is produced for invalid input.
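As an illustration, the pair of test cases below follows this positive/negative split for the hypothetical FieldValidator sketched earlier in this chapter; JUnit 5 is assumed as the test framework.

import static org.junit.jupiter.api.Assertions.assertFalse;
import static org.junit.jupiter.api.Assertions.assertTrue;

import org.junit.jupiter.api.Test;

// Illustrative positive and negative test cases (JUnit 5 assumed).
class FieldValidatorTest {

    @Test
    void positiveCase_numericFieldAcceptsDigits() {
        // Positive case: valid input, the expected output (acceptance) is produced.
        assertTrue(FieldValidator.isValidNumeric("12345"));
    }

    @Test
    void negativeCase_numericFieldRejectsLetters() {
        // Negative case: invalid input, the output must not be produced (entry rejected).
        assertFalse(FieldValidator.isValidNumeric("12a45"));
    }
}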
FUTURE WORK
As future work, we will try to develop a distance function that can identify prior and
posterior segments that are in the proximity of (not adjacent to) the missing segments. Finding
prior and posterior patterns and their distances to the missing segments could increase the
number of recovery segments, and thus the accuracy of the algorithm.