Anush J Internship Report
A report on
“Introduction to Python and Data Science”
Submitted in partial fulfillment for Internship
of
BACHELOR OF ENGINEERING
in
COMPUTER SCIENCE AND ENGINEERING
Submitted by
DEEPAK V CHANNAPPAGOUDAR
(1RF21CS032)
2022-23
RV Educational Institutions
RV INSTITUTE OF TECHNOLOGY AND MANAGEMENT®
(Affiliated to Visvesvaraya Technological University, Belagavi & Approved by AICTE, New Delhi)
Chaitanya Layout, JP Nagar 8th Phase, Kothanur, Bengaluru-560076
CERTIFICATE
Certified that the Inter / Intra institutional internship-I work titled “Introduction to
Python and Data Science” has been carried out by
DEEPAK V CHANNAPPAGOUDAR
(1RF21CS032), a bonafide student of RV Institute of Technology and Management,
Bengaluru in partial fulfillment for the award of summer internship-I in Computer Science
and Engineering of the Visvesvaraya Technological University, Belagavi during the
academic year 2022-2023. It is certified that all corrections/suggestions indicated for the
internal assessment have been incorporated in the report. The internship report has been
approved as it satisfies the academic requirements prescribed by the university.
DECLARATION
I, DEEPAK V CHANNAPPAGOUDAR
(1RF21CS032) the student of third semester B.E, Computer Science and Engineering, RV
Institute of Technology and Management, Bengaluru hereby declare that the Inter / Intra
institutional internship-I titled “Introduction to Python and Data Science” has been
carried out by me and submitted in partial fulfillment for the award of summer internship-I of
Visvesvaraya Technological University, Belagavi during the academic year 2022-2023. I
declare that the matter embodied in this report has not been submitted to any other university.
DEEPAK V CHANNAPPAGOUDAR
(1RF21CS032)
ACKNOWLEDGEMENT
The successful presentation of the summer internship-I would be incomplete without the
mention of the people who made it possible and whose constant guidance crowned my effort
with success.
I thank Dr. Anitha J, Professor and Head, Department of Computer Science and
Engineering, RV Institute of Technology and Management, Bengaluru, for her initiative and
encouragement.
I would like to thank my internship resource person, Dr. Roopashree S, Assistant Professor,
Department of Computer Science and Engineering, RV Institute of Technology and
Management, Bengaluru, for her constant guidance and inputs.
I would like to thank all the Teaching Staff and Non-Teaching Staff of the college for their
co-operation.
Finally, I extend my heartfelt gratitude to my family for their encouragement and support
without which I would not have come so far. Moreover, I thank all my friends for their
invaluable support and cooperation.
DEEPAK V CHANNAPPAGOUDAR
(1RF21CS032)
ABSTRACT
Python is a high-level object-oriented programming language that is used in a wide
variety of application domains. It has the right combination of performance and features that
demystify program writing. Python follows a modular programming approach, which is a
software design technique that emphasizes separating the functionality of a program into
independent, inter-changeable modules, such that each contains everything necessary to
execute only one aspect of the desired functionality. Conceptually, modules represent a
separation of concerns, and improve maintainability by enforcing logical boundaries
between components.
Data science encompasses a set of principles, problem definitions, algorithms, and
processes for extracting nonobvious and useful patterns from large data sets. Many of the
elements of data science have been developed in related fields such as machine learning and
data mining. The commonality across these disciplines is a focus on improving decision
making through the analysis of data. Machine learning (ML) focuses on the design and
evaluation of algorithms for extracting patterns from data. Data science takes these
considerations into account but also takes up other challenges, such as the capturing,
cleaning, and transforming of unstructured social media and web data; the use of big-data
technologies to store and process big, unstructured data sets; and questions related to data
ethics and regulation.
This internship report reflects the three-week training received. The details of the
practical experience and the academic knowledge that have been gained from the internship
during its tenure are incorporated.
Table of Contents
Contents
Acknowledgement
Abstract
Table of Contents
List of Figures
Chapter-1
INTRODUCTION TO PYTHON
Python Libraries
Python has libraries with large collections of mathematical functions and analytical tools.
The following are the libraries that give users the necessary functionality when crunching
data:
1. NumPy:
NumPy stands for Numerical Python. It is a general-purpose array-processing
package. The most powerful feature of NumPy is the n-dimensional array. It is the
fundamental package for scientific computing with Python. This library also contains
basic linear algebra functions, Fourier transforms, advanced random number
capabilities and tools for integration with other low-level languages like Fortran, C
and C++. NumPy can also be used as an efficient multi-dimensional container of
generic data.
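The features listed above can be sketched in a few lines (the array values here are illustrative):

```python
import numpy as np

# An n-dimensional array with vectorised mathematical functions.
a = np.array([[1.0, 2.0], [3.0, 4.0]])

print(a.shape)     # the array's dimensions
print(np.sqrt(a))  # element-wise square root
print(a @ a)       # matrix multiplication (basic linear algebra)

# Fourier transform of a small signal.
print(np.fft.fft(np.array([1.0, 0.0, 1.0, 0.0])))

# Advanced random number generation with a seeded generator.
rng = np.random.default_rng(seed=42)
print(rng.normal(size=3))
```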
2. Pandas Data Frame:
Pandas is an open-source library that provides high-performance data
manipulation in Python. It is built on the NumPy package, so NumPy is
necessary for operating Pandas. It can perform five significant steps required for
processing and analysis of data irrespective of the origin of the data, i.e., load,
manipulate, prepare, model, and analyse.
Pandas has a fast and efficient DataFrame object with default and customised
indexing. It can be used for reshaping and pivoting of data sets. A variety of
datasets can be processed in different formats, such as matrix data, heterogeneous
tabular data and time series. It can also integrate with other libraries such as SciPy
and Scikit-learn.
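A small sketch of custom indexing, pivoting and aggregation (the sales figures are made up for illustration):

```python
import pandas as pd

# A DataFrame with a customised index (the "load" step of the workflow).
df = pd.DataFrame(
    {"city": ["A", "B", "A", "B"],
     "year": [2021, 2021, 2022, 2022],
     "sales": [10, 20, 30, 40]},
    index=["r1", "r2", "r3", "r4"])

# Reshape/pivot: long rows become a city-by-year table.
table = df.pivot(index="city", columns="year", values="sales")
print(table)

# Prepare/analyse: a simple group-wise aggregate.
print(df.groupby("city")["sales"].mean())
```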
3. MatPlotLib:
Matplotlib is a quintessential Python library for plotting and visualisation. A vast
variety of graphs, from histograms to line plots to heat maps, can be drawn. One
of the greatest benefits of visualisation is that it gives us visual access to huge
amounts of data in an easily digestible form. This library provides an object-
oriented API for embedding plots into applications, and it closely resembles
MATLAB embedded in the Python programming language. Matplotlib also supports
labels, grids, legends, and other formatting entities.
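A minimal sketch of the object-oriented API with the labels, grids and legends mentioned above (the output filename is an assumption):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)
y = np.sin(x)

# Object-oriented API: one figure with two axes, a line plot and a histogram.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.plot(x, y, label="sin(x)")
ax1.set_xlabel("x")   # labels ...
ax1.grid(True)        # ... grids ...
ax1.legend()          # ... and legends
ax2.hist(y, bins=10)
ax2.set_title("histogram of sin(x)")
fig.savefig("sine_demo.png")  # hypothetical output filename
```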
4. Scikit Learn:
Scikit Learn is a robust machine learning library for Python. It features ML
algorithms like SVMs, random forests, k-means clustering, spectral clustering, mean
shift, cross-validation and more. NumPy, SciPy and related scientific operations
are also supported by Scikit-learn, as it is part of the SciPy stack. It implements a
range of machine learning, pre-processing, cross-validation, and visualization
algorithms using a unified interface.
The scikit-learn library provides many different algorithms which can be imported
into the code and then used to build models just like we would import any other
Python library. This makes it easier to quickly build different models and compare
these models to select the highest scoring one. But to really appreciate its true power,
we need to start using it on different open data sets and build predictive models using
them.
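The build-and-compare workflow described above can be sketched as follows (the iris dataset and the two model choices are illustrative assumptions):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Build several models through the same unified interface ...
models = {
    "svm": SVC(),
    "random_forest": RandomForestClassifier(random_state=0),
}

# ... compare their cross-validated scores and keep the best one.
scores = {name: cross_val_score(m, X, y, cv=5).mean()
          for name, m in models.items()}
best = max(scores, key=scores.get)
print(scores, "best:", best)
```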
Training Contents
This internship was carried out from 10th of October to 31st of October 2022. In the first
week, I was able to acquire basic Python programming skills. The following are the
topics covered that helped me to do so:
1. Introduction to Python: The procedure to install Python, distinguishing between
important datatypes and using basic features of the Python interpreter and IDLE were
the starting steps. The difference between a module and a script was also
covered.
2. Using variables in Python: I learnt about numeric, string, sequence and dictionary data
types along with relevant operations while practising Python syntax.
3. Basic concepts in Python: The basic ideas of conditional statements, loops and
iterators were studied. This helped me develop Python programs using the
above statements.
4. Python Datatypes: After understanding the basics, I moved on to exploring the
various datatypes available in Python such as lists, dictionaries, tuples and sets. I also
learned the operations that can be performed on them.
5. Functions and Packages: Optimisation of Python code was studied further by dividing
the program into functions.
By the end of this week, I was successfully able to develop programs in Python. I got to
put the known concepts into practice in my assignments.
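The week-1 topics above (datatypes, conditionals, loops and functions) can be combined in a small sketch; the function name and sample data are hypothetical:

```python
def word_lengths(words):
    """Return a dictionary mapping each non-empty word to its length."""
    lengths = {}            # dictionary datatype
    for w in words:         # loop over a sequence (list of strings)
        if w:               # conditional: skip empty strings
            lengths[w] = len(w)
    return lengths

sample = ["python", "data", ""]
print(word_lengths(sample))  # {'python': 6, 'data': 4}
```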
Chapter-2
INTRODUCTION TO DATA SCIENCE
Basic Concepts
Data science is the domain of study that deals with vast volumes of data using modern
tools and techniques to find unseen patterns, derive meaningful information, and make
business decisions. It uses complex machine learning algorithms to build predictive models.
The data used for analysis can come from many different sources and be presented in
various formats.
A data science life cycle is an iterative set of steps taken to deliver a project or
analysis. Since every data science project and team is different, every specific data science
life cycle is different. However, most data science projects tend to flow through the same
general life cycle of data science steps which are as follows:
1. Capture: In this stage, the data science team researches the issue to
create context and gain understanding. Raw structured and unstructured data are
gathered. The team comes up with an initial hypothesis, which can be later confirmed
with evidence.
2. Maintain: This stage covers methods to investigate the possibilities of pre-processing,
analysing, and preparing data before analysis and modelling. The raw data is taken
and put into a form that can be used.
3. Process: After pre-processing, data scientists take the data and examine its patterns,
ranges, and biases to determine how useful it will be in predictive analysis.
4. Analyze: This stage involves performing the various analyses on the data. It involves
exploratory and predictive analysis.
5. Communicate: In the final stage, analysts prepare the analyses in easily readable
forms such as charts, graphs, and reports. Thus, it is a data reporting or data
visualisation stage.
Training Contents
In the second week of the internship, I was able to gain knowledge about data analysis,
data visualisation, machine learning, and applying the same on real life Data Science projects.
The following topics covered during this week helped me to gain this knowledge:
1. Python libraries: The available libraries in Python such as NumPy, Pandas, SciPy and
Matplotlib were studied, which would be useful for analysing data.
2. Model building: I learnt the basic steps involved in model building using Scikit-learn,
which are: loading a dataset, splitting the dataset and training the model. Data
visualisation can be done with Matplotlib.
3. Machine learning: Understanding data processing, operations on NumPy arrays such as
the reciprocal, power and modulus functions, and execution of programs.
4. Data cleaning: After learning the basic tools available for analysing data sets, an
overview of data cleaning was covered, which involves removing unwanted observations,
fixing structural errors, managing unwanted outliers, handling missing data and
imputing the missing values from past observations.
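The cleaning steps above can be sketched with pandas on a made-up messy dataset (a duplicate row, a missing value and an obvious outlier):

```python
import numpy as np
import pandas as pd

# Hypothetical raw data: one duplicate row, one missing age, one outlier.
df = pd.DataFrame({"age": [25, 25, np.nan, 31, 400],
                   "city": ["a", "a", "b", "b", "b"]})

df = df.drop_duplicates()                        # remove unwanted observations
df = df[df["age"].isna() | (df["age"] < 120)]    # manage unwanted outliers
df["age"] = df["age"].fillna(df["age"].mean())   # impute missing values
print(df)
```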
Digits Dataset
For the second assignment of the internship, I worked on the digits dataset provided by
scikit-learn. K-Nearest Neighbours (KNN) is a non-parametric, lazy, supervised machine
learning algorithm. It uses the phenomenon that similar things are near to each other: it
predicts the class of a new data point by a majority vote of its k nearest neighbours. It is
commonly used for its ease of interpretation and low calculation time.
The digit database was created by collecting 250 samples from 44 writers. The samples
written by 30 writers are used for training, cross-validation and writer-dependent testing, and
the digits written by the other 14 are used for writer-independent testing. In this dataset, all
classes have equal frequencies, so the number of objects in each class is the same. The Digits
dataset is part of the sklearn library, which comes loaded with datasets to practice on. Digits
has 64 numerical features (8×8 pixels) and a 10-class target variable (0-9).
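The assignment workflow can be sketched as follows (the train/test split ratio and the choice of k=5 are illustrative assumptions, not the exact settings used in the internship):

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# 64 features per sample (8x8 pixel images), 10 target classes (0-9).
X, y = load_digits(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Classify each test image by a majority vote of its 5 nearest neighbours.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print("test accuracy:", knn.score(X_test, y_test))
```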