Python 3 and Machine Learning Using ChatGPT / GPT-4: Harness the Power of Python, Machine Learning, and Generative AI
About this ebook
This book bridges the gap between theoretical knowledge and practical application in Python programming, machine learning, and using ChatGPT-4 in data science. It starts with an introduction to Pandas for data manipulation and analysis. The book then explores various machine learning classifiers, from kNN to SVMs. Later chapters cover GPT-4's capabilities, enhancing linear regression analysis, and using ChatGPT in data visualization, including AI apps, GANs, and DALL-E.
The journey begins with mastering Pandas and machine learning fundamentals. It progresses to applying GPT-4 in linear regression and machine learning classifiers. The final chapters focus on using ChatGPT for data visualization, making complex results accessible and understandable.
Understanding these concepts is crucial for modern data scientists. This book transitions readers from basic Python programming to advanced applications of ChatGPT-4 in data science. Companion files with source code, datasets, and figures enhance learning, making this an essential resource for mastering Python, machine learning, and AI-driven data visualization.
Book preview
Python 3 and Machine Learning Using ChatGPT / GPT-4 - Mercury Learning and Information
PREFACE
This book is designed to bridge the gap between theoretical knowledge and practical application in the fields of Python programming, machine learning, and the innovative use of ChatGPT in data science. It aims to provide a comprehensive guide for those who aspire to deepen their understanding and enhance their skills in these rapidly evolving areas.
The motivation stems from a growing demand for practical, in-depth resources that cater to the needs of students, data scientists, and AI researchers looking to leverage advanced techniques and tools. As these fields continue to grow in importance and impact, the ability to adeptly manipulate data, understand machine learning algorithms, and apply the latest advancements in AI becomes critical.
This book is structured to facilitate a deep understanding of several core topics:
■ Introduction to Pandas: We begin with a detailed introduction to Pandas, a cornerstone Python library for data manipulation and analysis. This section is tailored to help you master data frames and perform complex data cleaning and preparation tasks efficiently.
■ Machine Learning Classifiers: Next, we explore a variety of machine learning classifiers, providing you with the knowledge to choose and implement the right algorithm for your projects. From kNN to SVMs, you will learn the intricacies of each method through practical examples.
■ GPT-4 and Linear Regression: As we explore the capabilities of GPT-4, we discuss its application in enhancing traditional linear regression analysis. This section demonstrates how GPT-4 can be used to perform and interpret regression in ways that push the boundaries of conventional data analysis.
■ Data Visualization with ChatGPT: Finally, the book covers the innovative use of ChatGPT in data visualization. This segment focuses on how AI can transform data into compelling visual stories, making complex results accessible and understandable. It includes material on AI apps, GANs, and DALL-E.
Each chapter is crafted to build on the knowledge from the previous sections, ensuring a cohesive and comprehensive learning experience. To cater to a wide range of learning styles, the book includes step-by-step tutorials, real-world applications, and sections dedicated to theoretical concepts backed by practical examples. This approach not only solidifies understanding but also enhances your ability to apply these techniques in real-world scenarios.
Features of This Book
■ Coverage of Latest Python Libraries: You will gain proficiency in using state-of-the-art libraries essential for modern data scientists.
■ Real-World Problem Solving: The book challenges you to apply your skills on real data, preparing you for professional success.
■ Companion files with source code, datasets, and figures are available for downloading by writing to the publisher (with proof of purchase) to [email protected].
This book is more than just a learning tool; it is a reference that you will return to repeatedly as you progress in your career. Whether you are a beginner aiming to get a solid start in programming and data science or an experienced professional looking to explore new advancements in AI, Python 3 and Machine Learning Using ChatGPT/GPT-4 is an invaluable asset.
We hope that you will find this book to be a valuable resource, one that inspires you to explore further and apply your knowledge to solve complex problems. The future of Generative AI is exciting and full of possibilities.
O. Campesato
April 2024
CHAPTER 1
INTRODUCTION TO PANDAS
This chapter introduces you to Pandas and provides code samples that illustrate some of its useful features. If you are familiar with these topics, skim through the material and peruse the code samples, just in case they contain information that is new to you.
The first part contains a brief introduction to Pandas. This section contains code samples that illustrate some features of Pandas data frames, along with a brief discussion of Pandas series; data frames and series are two of the main data structures in Pandas.
The second part of this chapter discusses various types of data frames that you can create, such as numeric and Boolean data frames. In addition, we discuss examples of creating data frames with NumPy functions and random numbers.
Note: Several code samples in this chapter use the NumPy library for working with arrays and generating random numbers; if NumPy is new to you, online articles can quickly bring you up to speed.
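For reference, here is a minimal sketch (not a listing from this book) of the NumPy operations that appear in the code samples later in this chapter:

import numpy as np

vec = np.array([1, 2, 3, 4, 5])                  # a one-dimensional array
arr = np.array([[10, 30, 20], [50, 40, 60]])     # a two-dimensional array
rand_ints = np.random.randint(1, 5, size=5)      # five random integers from 1 to 4

print(vec)
print(arr.shape)      # (2, 3)
print(rand_ints)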
WHAT IS PANDAS?
Pandas is a Python library that is compatible with other Python libraries, such as NumPy and Matplotlib. Install Pandas by opening a command shell and invoking this command for Python 3.x:
pip3 install pandas
In many ways, the semantics of the APIs in the Pandas library are similar to those of a spreadsheet, along with support for file formats such as XLS, XML, HTML, and CSV. Pandas provides a data type called a data frame (similar to a Python dictionary) with extremely powerful functionality.
Pandas data frames support a variety of input types, such as ndarray, list, dict, or series.
The data type series is another mechanism for managing data. In addition to performing an online search for more details regarding series, the following article contains a good introduction:
https://towardsdatascience.com/20-examples-to-master-pandas-series-bc4c68200324
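As a quick illustration, here is a minimal sketch of creating and inspecting a Pandas Series (the values here are arbitrary):

import pandas as pd

# create a Series from a list; Pandas assigns a default integer index
prices = pd.Series([10.5, 20.0, 15.25], name='price')

print(prices)
print(prices.index)    # RangeIndex(start=0, stop=3, step=1)
print(prices.mean())   # 15.25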
Pandas Options and Settings
You can change the default values of various Pandas options, an example of which is shown below:
import pandas as pd

display_settings = {
    'max_columns': 8,
    'expand_frame_repr': True,  # wrap to multiple pages
    'max_rows': 20,
    'precision': 3,
    'show_dimensions': True
}

for op, value in display_settings.items():
    pd.set_option("display.{}".format(op), value)
Include the preceding code block in your own code if you want Pandas to display a maximum of 20 rows and 8 columns, with floating point numbers displayed to 3 decimal places. Set expand_frame_repr to True if you want the output to wrap around to multiple pages. The preceding for loop iterates through display_settings and sets each option to its corresponding value.
In addition, the following code snippet displays all Pandas options and their current values:
pd.describe_option()
There are various other operations that you can perform with options and their values (such as the pd.reset_option() method for restoring default values), as described in the Pandas user guide:
https://pandas.pydata.org/pandas-docs/stable/user_guide/options.html
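For example, here is a minimal sketch of setting, reading, and resetting an option (these are standard Pandas display options):

import pandas as pd

pd.set_option("display.max_rows", 20)
print(pd.get_option("display.max_rows"))   # 20

pd.reset_option("display.max_rows")        # restore the default value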
Pandas Data Frames
In simplified terms, a Pandas data frame is a two-dimensional data structure, and it is convenient to think of the data structure in terms of rows and columns. Data frames can be labeled (rows as well as columns), and the columns can contain different data types. The source of the dataset for a Pandas data frame can be a data file, a database table, or a Web service. The data frame features include:
Data frame methods
Data frame statistics
Grouping, pivoting, and reshaping
Handling missing data
Joining data frames
The code samples in this chapter show you almost all the features in the preceding list.
Data Frames and Data Cleaning Tasks
The specific tasks that you need to perform depend on the structure and contents of a dataset. In general, you will perform a workflow with the following steps, not necessarily always in this order (and some might be optional). All of the following steps can be performed with a Pandas data frame:
Read data into a data frame
Display top of data frame
Display column data types
Display missing values
Replace NA with a value
Iterate through the columns
Statistics for each column
Find missing values
Total missing values
Percentage of missing values
Sort table values
Print summary information
Columns with > 50% missing
Rename columns
This chapter contains sections that illustrate how to perform many of the steps in the preceding list.
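As a concrete illustration, the following is a minimal sketch of several of these steps; the file name mydata.csv and the column names are hypothetical and are not taken from a listing in this book:

import pandas as pd

df = pd.read_csv('mydata.csv')                      # read data into a data frame
print(df.head())                                    # display top of data frame
print(df.dtypes)                                    # display column data types
print(df.isnull().sum())                            # total missing values per column
print(100*df.isnull().sum()/len(df))                # percentage of missing values
df = df.fillna(0)                                   # replace NA with a value
df = df.rename(columns={'old_name': 'new_name'})    # rename columns (hypothetical names)
print(df.describe())                                # statistics for each column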
Alternatives to Pandas
Before delving into the code samples, note that there are alternatives to Pandas that offer very useful features, some of which are listed below:
PySpark (for large datasets)
Dask (for distributed processing)
Modin (faster performance)
Datatable (R data.table for Python)
The inclusion of these alternatives is not intended to diminish Pandas. Indeed, you might not need any of the functionality in the preceding list. However, in the event that you need such functionality in the future, it is worthwhile to know about these alternatives now (and there may be even more powerful alternatives at some point in the future).
A PANDAS DATA FRAME WITH A NUMPY EXAMPLE
Listing 1.1 shows the content of pandas_df.py that illustrates how to define several data frames and display their contents.
LISTING 1.1: pandas_df.py
import pandas as pd
import numpy as np

myvector1 = np.array([1,2,3,4,5])
print("myvector1:")
print(myvector1)
print()

mydf1 = pd.DataFrame(myvector1)
print("mydf1:")
print(mydf1)
print()

myvector2 = np.array([i for i in range(1,6)])
print("myvector2:")
print(myvector2)
print()

mydf2 = pd.DataFrame(myvector2)
print("mydf2:")
print(mydf2)
print()

myarray = np.array([[10,30,20], [50,40,60],[1000,2000,3000]])
print("myarray:")
print(myarray)
print()

mydf3 = pd.DataFrame(myarray)
print("mydf3:")
print(mydf3)
print()
Listing 1.1 starts with standard import statements for Pandas and NumPy, followed by the definition of two one-dimensional NumPy arrays and a two-dimensional NumPy array. Each NumPy variable is followed by a corresponding Pandas data frame (mydf1, mydf2, and mydf3). Now launch the code in Listing 1.1 to see the following output, and you can compare the NumPy arrays with the Pandas data frames:
myvector1:
[1 2 3 4 5]
mydf1:
0
0 1
1 2
2 3
3 4
4 5
myvector2:
[1 2 3 4 5]
mydf2:
0
0 1
1 2
2 3
3 4
4 5
myarray:
mydf3:
By contrast, the following code block illustrates how to define two Pandas Series that are part of the definition of a Pandas data frame:
names = pd.Series(['SF', 'San Jose', 'Sacramento'])
sizes = pd.Series([852469, 1015785, 485199])
df = pd.DataFrame({ 'Cities': names, 'Size': sizes })
print(df)
Create a Python file with the preceding code (along with the required import statement), and when you launch that code, you will see the following output:
DESCRIBING A PANDAS DATA FRAME
Listing 1.2 shows the content of pandas_df_describe.py, which illustrates how to define a Pandas data frame that contains a 3x3 NumPy array of integer values, where the rows and columns of the data frame are labeled. Other aspects of the data frame are also displayed.
LISTING 1.2: pandas_df_describe.py
import numpy as np
import pandas as pd

myarray = np.array([[10,30,20], [50,40,60],[1000,2000,3000]])
rownames = ['apples', 'oranges', 'beer']
colnames = ['January', 'February', 'March']
mydf = pd.DataFrame(myarray, index=rownames, columns=colnames)

print("contents of df:")
print(mydf)
print()
print("contents of January:")
print(mydf['January'])
print()
print("Number of Rows:")
print(mydf.shape[0])
print()
print("Number of Columns:")
print(mydf.shape[1])
print()
print("Number of Rows and Columns:")
print(mydf.shape)
print()
print("Column Names:")
print(mydf.columns)
print()
print("Column types:")
print(mydf.dtypes)
print()
print("Description:")
print(mydf.describe())
print()
Listing 1.2 starts with two standard import statements followed by the variable myarray, which is a 3x3 NumPy array of numbers. The variables rownames and colnames provide names for the rows and columns, respectively, of the Pandas data frame mydf, which is initialized as a Pandas data frame with the specified data source (i.e., myarray).
The first portion of the output below requires a single print() statement (which simply displays the contents of mydf). The second portion of the output is generated by invoking the describe() method, which is available for any Pandas data frame. The describe() method is useful: you will see various statistical quantities, such as the mean, standard deviation, minimum, and maximum, computed by column (not by row), along with values for the 25th, 50th, and 75th percentiles. The output of Listing 1.2 is here:
contents of df:
contents of January:
Name: January, dtype: int64
Number of Rows:
3
Number of Columns:
3
Number of Rows and Columns:
(3, 3)
Column Names:
Index(['January', 'February', 'March'], dtype='object')
Column types:
dtype: object
Description:
PANDAS BOOLEAN DATA FRAMES
Pandas supports Boolean operations on data frames, such as the logical AND, the logical OR, and the logical XOR of a pair of data frames. Listing 1.3 shows the content of pandas_boolean_df.py that illustrates how to define Pandas data frames whose elements are Boolean values.
LISTING 1.3: pandas_boolean_df.py
import pandas as pd

df1 = pd.DataFrame({'a': [1, 0, 1], 'b': [0, 1, 1] }, dtype=bool)
df2 = pd.DataFrame({'a': [0, 1, 1], 'b': [1, 1, 0] }, dtype=bool)

print("df1 & df2:")
print(df1 & df2)
print("df1 | df2:")
print(df1 | df2)
print("df1 ^ df2:")
print(df1 ^ df2)
Listing 1.3 initializes the data frames df1 and df2, and then computes df1 & df2, df1 | df2, and df1 ^ df2, which represent the logical AND, the logical OR, and the logical XOR (exclusive OR), respectively, of df1 and df2. The output from launching the code in Listing 1.3 is as follows:
df1 & df2:
Transposing a Pandas Data Frame
The T attribute (as well as the transpose() method) enables you to generate the transpose of a Pandas data frame, just as with a NumPy ndarray. The transpose operation switches rows to columns and columns to rows. For example, the following code snippet defines a Pandas data frame df1 and then displays the transpose of df1:
df1 = pd.DataFrame({'a': [1, 0, 1], 'b': [0, 1, 1] }, dtype=int)
print("df1.T:")
print(df1.T)
The output of the preceding code snippet is here:
df1.T:
The following code snippet defines Pandas data frames df1 and df2 and then displays their sum:
df1 = pd.DataFrame({'a' : [1, 0, 1], 'b' : [0, 1, 1] }, dtype=int)
df2 = pd.DataFrame({'a' : [3, 3, 3], 'b' : [5, 5, 5] }, dtype=int)
print("df1 + df2:")
print(df1 + df2)
The output is here:
df1 + df2:
PANDAS DATA FRAMES AND RANDOM NUMBERS
Listing 1.4 shows the content of pandas_random_df.py that illustrates how to create a Pandas data frame with random integers.
LISTING 1.4: pandas_random_df.py
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(1, 5, size=(5, 2)), columns=['a','b'])
# append rows containing the sum and the mean of each column
# (DataFrame.append() was removed in Pandas 2.0, so use pd.concat() instead):
df = pd.concat([df, df.agg(['sum', 'mean'])])
print("Contents of data frame:")
print(df)
Listing 1.4 defines the Pandas data frame df that consists of 5 rows and 2 columns of random integers between 1 and 4 (the upper bound of np.random.randint() is exclusive). Notice that the columns of df are labeled a and b. In addition, the next code snippet appends two rows consisting of the sum and the mean of the numbers in both columns. The output of Listing 1.4 is here:
Listing 1.5 shows the content of pandas_combine_df.py that illustrates how to combine Pandas data frames.
LISTING 1.5: pandas_combine_df.py
import pandas as pd
import numpy as np

# the definition of df (5 rows and 2 columns of random real numbers
# between 0 and 5, as described in the text):
df = pd.DataFrame(5*np.random.random(size=(5, 2)), columns=['foo1','foo2'])

print("contents of df:")
print(df)
print("contents of foo1:")
print(df.foo1)
print("contents of foo2:")
print(df.foo2)
Listing 1.5 defines the Pandas data frame df that consists of 5 rows and 2 columns (labeled foo1 and foo2) of random real numbers between 0 and 5. The next portion of Listing 1.5 displays the contents of df, foo1, and foo2. The output of Listing 1.5 is as follows:
contents of df:
READING CSV FILES IN PANDAS
Pandas provides the read_csv() method for reading the contents of CSV files. For example, Listing 1.6 shows the contents of sometext.csv, which contains labeled data (spam or ham), and Listing 1.7 shows the contents of read-csv-file.py, which illustrates how to read the contents of a CSV file.
LISTING 1.6: sometext.csv
LISTING 1.7: read-csv-file.py
import pandas as pd
import numpy as np

df = pd.read_csv('sometext.csv', delimiter='\t')
print("=> First five rows:")
print(df.head(5))
Listing 1.7 reads the content of sometext.csv, whose columns are separated by a tab ("\t") delimiter. Launch the code in Listing 1.7 to see the following output:
=> First five rows:
The head() method displays the first five rows by default, but you can display the first n rows of a data frame df with the code snippet df.head(n).
Specifying a Separator and Column Sets in Text Files
The previous section showed you how to use the delimiter attribute to specify the delimiter in a text file. You can also use the sep parameter to specify a different separator. In addition, you can assign the names parameter a list of column names for the data that you want to read. An example of using sep and names is shown below.
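The following is a minimal sketch; the file name people.txt and the column names are illustrative, not taken from this book:

import pandas as pd

# read a pipe-delimited text file and supply the column names explicitly:
df = pd.read_csv('people.txt', sep='|', names=['fname', 'lname', 'age'])
print(df.head())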
Pandas also provides the read_table() method for reading the contents of text files; its syntax is essentially the same as that of read_csv(), except that its default separator is a tab rather than a comma.
Specifying an Index in Text Files
Suppose that you know that a particular column in a text file contains the index value for the rows in the text file. For example, a text file that contains the data in a relational table would typically contain an index column.
Fortunately, Pandas allows you to specify the kth column as the index in a text file, as shown here:
df = pd.read_csv('myfile.csv', index_col=k)
THE LOC() AND ILOC() METHODS IN PANDAS
If you want to display the contents of a record in a Pandas data frame, specify the label of the row in the loc[] indexer (note the square brackets). For example, the following code snippet displays the row labeled feature_name in a data frame df:
df.loc[feature_name]
Select the first row of the height column in the data frame as follows:
df.loc[[0], ['height']]
The following code snippet uses the iloc[] indexer to display the first 8 records of the name column:
df.iloc[0:8]['name']
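Here is a self-contained sketch of both indexers in action; the data frame and its columns (name and height) are hypothetical examples:

import pandas as pd

df = pd.DataFrame({'name':   ['Ann', 'Bob', 'Cara', 'Dave'],
                   'height': [165, 180, 172, 175]})

print(df.loc[[0], ['height']])   # the first row of the height column (by label)
print(df.iloc[0:3]['name'])      # the first three records of the name column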
CONVERTING CATEGORICAL DATA TO NUMERIC DATA
One common task in machine learning involves converting a feature containing character data into a feature that contains numeric data. Listing 1.8 shows the contents of cat2numeric.py that illustrate how to replace a text field with a corresponding numeric field.
LISTING 1.8: cat2numeric.py
import pandas as pd
import numpy as np

df = pd.read_csv('sometext.csv', delimiter='\t')
print("=> First five rows (before):")
print(df.head(5))
print("-------------------------")
print()

# map ham/spam to 0/1 values:
df['type'] = df['type'].map( {'ham':0 , 'spam':1} )
print("=> First five rows (after):")
print(df.head(5))
print("-------------------------")
Listing 1.8 initializes the data frame df with the contents of the CSV file sometext.csv, and then displays the contents of the first five rows by invoking df.head(5), which is also the default number of rows to display.
The next code snippet in Listing 1.8 invokes the map() method to replace occurrences of ham with 0 and replace occurrences of spam with 1 in the column labeled type, as shown here:
df['type'] = df['type'].map( {'ham':0 , 'spam':1} )
The last portion of Listing 1.8 invokes the head() method again to display the first five rows of the dataset after the values in the type column have been replaced with numeric values. Launch the code in Listing 1.8 to see the following output:
-------------------------
As another example, Listing 1.9 shows the contents of shirts.csv and Listing 1.10 shows the contents of shirts.py; these examples illustrate four techniques for converting categorical data into numeric data.
LISTING 1.9: shirts.csv
type,ssize
shirt,xxlarge
shirt,xxlarge
shirt,xlarge
shirt,xlarge
shirt,xlarge
shirt,large
shirt,medium
shirt,small
shirt,small
shirt,xsmall
shirt,xsmall
shirt,xsmall
LISTING 1.10: shirts.py
import pandas as pd

shirts = pd.read_csv("shirts.csv")
print("shirts before:")
print(shirts)
print()

# TECHNIQUE #1:
#shirts.loc[shirts['ssize']=='xxlarge','size'] = 4
#shirts.loc[shirts['ssize']=='xlarge', 'size'] = 4
#shirts.loc[shirts['ssize']=='large', 'size'] = 3
#shirts.loc[shirts['ssize']=='medium', 'size'] = 2
#shirts.loc[shirts['ssize']=='small', 'size'] = 1
#shirts.loc[shirts['ssize']=='xsmall', 'size'] = 1

# TECHNIQUE #2:
#shirts['ssize'].replace('xxlarge', 4, inplace=True)
#shirts['ssize'].replace('xlarge', 4, inplace=True)
#shirts['ssize'].replace('large', 3, inplace=True)
#shirts['ssize'].replace('medium', 2, inplace=True)
#shirts['ssize'].replace('small', 1, inplace=True)
#shirts['ssize'].replace('xsmall', 1, inplace=True)

# TECHNIQUE #3:
#shirts['ssize'] = shirts['ssize'].apply({'xxlarge':4, 'xlarge':4, 'large':3, 'medium':2, 'small':1, 'xsmall':1}.get)

# TECHNIQUE #4:
shirts['ssize'] = shirts['ssize'].replace(regex='xlarge', value=4)
shirts['ssize'] = shirts['ssize'].replace(regex='large', value=3)
shirts['ssize'] = shirts['ssize'].replace(regex='medium', value=2)
shirts['ssize'] = shirts['ssize'].replace(regex='small', value=1)

print("shirts after:")
print(shirts)
Listing 1.10 starts with a code