A first look at ChatGPT

I wrote a brief LinkedIn post earlier today about ChatGPT, saying how clever I thought it was. Having played around with it some more since then, I have to say I'm really impressed with its capabilities from a coding perspective. For this article, I've written coding-related prompts for a number of use cases and shown how GPT responded to each.

If you don't know what ChatGPT is, it's one of the latest releases from the OpenAI organisation. OpenAI is an AI research and development company based in San Francisco, founded in 2015 by, among others, Elon Musk.

ChatGPT is a prototype artificial intelligence chatbot developed by OpenAI that focuses on usability and dialogue. It uses a large language model trained with reinforcement learning and is based on the GPT-3.5 architecture. In practice, what all this means is that GPT knows, and can do, a lot of things, including writing software. To save space, I've trimmed the explanation that GPT normally appends after each piece of code it produces; it's usually a good account of what the code does. Let's get started.

1) GPT writes some pyspark code (the first part is from my original LinkedIn post)

Me: create a spark program which reads a series of CSV files in an S3 folder and creates a single parquet file in a different folder containing the CSV data.

GPT: Here is an example of a Spark program that reads a series of CSV files in an S3 folder and creates a single Parquet file in a different folder containing the CSV data:

from pyspark.sql import SparkSession
# Create a SparkSession
spark = SparkSession.builder.appName("CSV to Parquet").getOrCreate()

# Set the input and output paths
input_path = "s3://my-bucket/csv-data/"
output_path = "s3://my-bucket/parquet-data/"

# Read the CSV files from the input path
df = spark.read.format("csv").option("header", "true").load(input_path)

# Write the data to a single Parquet file in the output path
df.write.parquet(output_path)

# Stop the SparkSession
spark.stop()

Those of you who know Spark will realise that, because Spark writes one output file per partition, for any decent-sized amount of input CSV data the above code will output multiple Parquet files, which is not what was specified, so I added this prompt.

Me: that creates multiple parquet files, I just want one parquet file

GPT: To create a single Parquet file from multiple CSV files in Spark, you can use the coalesce() method to reduce the number of partitions in the DataFrame to one before writing the data to the Parquet file. This will ensure that the data is written to a single Parquet file instead of multiple files. Here is an example of a Spark program that reads a series of CSV files in an S3 folder and creates a single Parquet file in a different folder containing the CSV data:

from pyspark.sql import SparkSession
# Create a SparkSession
spark = SparkSession.builder.appName("CSV to Parquet").getOrCreate()

# Set the input and output paths
input_path = "s3://my-bucket/csv-data/"
output_path = "s3://my-bucket/parquet-data/"

# Read the CSV files from the input path
df = spark.read.format("csv").option("header", "true").load(input_path)

# Reduce the number of partitions to 1
df = df.coalesce(1)

# Write the data to a single Parquet file in the output path
df.write.parquet(output_path)

# Stop the SparkSession
spark.stop()

Me: rewrite this code in scala spark

GPT: Here is an equivalent code in Scala Spark:

import org.apache.spark.sql.SparkSession

// Create a SparkSession
val spark = SparkSession.builder.appName("CSV to Parquet").getOrCreate()

// Set the input and output paths
val inputPath = "s3://my-bucket/csv-data/"
val outputPath = "s3://my-bucket/parquet-data/"

// Read the CSV files from the input path
val df = spark.read.format("csv").option("header", "true").load(inputPath)

// Reduce the number of partitions to 1
val singleDf = df.coalesce(1)

// Write the data to a single Parquet file in the output path
singleDf.write.parquet(outputPath)
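
A quick editorial note on both versions: coalesce(1) funnels the entire dataset through a single task, which can be slow or memory-hungry for large inputs. And even with one partition, Spark writes a directory at the output path containing a single part-*.parquet file (plus a _SUCCESS marker), rather than a literal lone file. If you need one tidily named object on S3, you can rename the part file afterwards - here's a minimal boto3 sketch, reusing the bucket and prefix from the example (the final key name is my own invention):

import boto3

# Placeholders matching the example above
bucket = "my-bucket"
prefix = "parquet-data/"

s3 = boto3.client("s3")

# Spark writes something like parquet-data/part-00000-<uuid>.snappy.parquet
resp = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
part_key = next(obj["Key"] for obj in resp["Contents"]
                if obj["Key"].endswith(".parquet"))

# Copy the part file to a stable, single-file key, then remove the original
s3.copy_object(Bucket=bucket,
               CopySource={"Bucket": bucket, "Key": part_key},
               Key=f"{prefix}data.parquet")
s3.delete_object(Bucket=bucket, Key=part_key)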


2) Some AWS-related python code

Me: Write a python program to download a zip file from an sftp server. Unzip the zip and copy any files it contains to an S3 bucket.

GPT: Here is a Python program that downloads a zip file from an SFTP server, unzips the file, and copies the files it contains to an S3 bucket:

import io
import zipfile
import pysftp
import boto3
import os

# SFTP server credentials
host = "sftp.example.com"
username = "username"
password = "password"

# S3 bucket name
bucket_name = "my-bucket"

# Local directory for temporary file storage
temp_dir = "tmp"

# Connect to SFTP server
with pysftp.Connection(host, username=username, password=password) as sftp:
 
    # Download zip file from SFTP server to local directory
    if not os.path.exists(temp_dir):
        os.makedirs(temp_dir)
    sftp.get("/path/to/file.zip", f"{temp_dir}/file.zip")

    # Unzip file
    with zipfile.ZipFile(f"{temp_dir}/file.zip", "r") as zf:
        zf.extractall(temp_dir)

        # Copy files to S3 bucket
        s3 = boto3.resource("s3")

        for file in zf.filelist:
            s3.meta.client.upload_file(f"{temp_dir}/{file.filename}", bucket_name, file.filename)

    # Clean up local directory
    os.remove(f"{temp_dir}/file.zip")

    for file in zf.filelist:
        os.remove(f"{temp_dir}/{file.filename}")
    os.rmdir(temp_dir)
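
One practical note if you want to run this: pysftp checks the server's host key against ~/.ssh/known_hosts by default, and Connection() will raise an exception if the key isn't there. For a quick test (not production!) you can relax that check via the cnopts parameter - a minimal sketch, reusing the variables from GPT's code:

import pysftp

cnopts = pysftp.CnOpts()
cnopts.hostkeys = None  # disables host key verification - insecure, test use only

with pysftp.Connection(host, username=username, password=password,
                       cnopts=cnopts) as sftp:
    ...  # same download/unzip/upload logic as above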

I don't like the fact that it uses intermediate file storage for the zip and its contents, so I modified the prompt a little, hoping GPT would do something about that.

Me: Write a python program to Download a zip file from an sftp server. Unzip the zip and copy any files it contains to an S3 bucket. Do not use intermediate file storage

GPT: Here is a Python program that downloads a zip file from an SFTP server, unzips the file, and copies the files it contains to an S3 bucket without using intermediate storage:

import io
import zipfile
import pysftp
import boto3

# SFTP server credentials
host = "sftp.example.com"
username = "username"
password = "password"

# S3 bucket name
bucket_name = "my-bucket"

# Connect to SFTP server
with pysftp.Connection(host, username=username, password=password) as sftp:

    # Download zip file from SFTP server
    zip_data = io.BytesIO()

    sftp.getfo("/path/to/file.zip", zip_data)

    # Unzip file in memory
    with zipfile.ZipFile(zip_data) as zf:

        zf.extractall()

        # Copy files to S3 bucket
        s3 = boto3.resource("s3")

        for file in zf.filelist:
            s3.meta.client.upload_file(file.filename, bucket_name, file.filename)

Pretty crazy, right? In a good way.
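
One caveat, though: despite the "without using intermediate storage" claim, only the zip itself is held in memory - zf.extractall() still writes the archive's contents to the local working directory before they're uploaded. A fully in-memory version would stream each member straight to S3 instead; something like this, reusing the zf and s3 objects from GPT's code, should do it:

# Copy files to S3 bucket straight from memory
for file in zf.filelist:
    with zf.open(file.filename) as f:
        s3.meta.client.upload_fileobj(f, bucket_name, file.filename)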

3) Some left-field stuff

Recently, I've been dealing with downloading weather-related data sets from the European Centre for Medium-Range Weather Forecasts (ECMWF). Let's see if ChatGPT knows about this organisation and how to download data from it.

Me: write python code to download a GRIB file from the ECMWF. Use the cdsapi. The grib file should contain era5 single-level reanalysis u10 and v10 component windspeed data for the whole globe for each of the 24 hours of every day in January 2018

GPT: Unfortunately, I am unable to browse the internet or access any external resources, so I am unable to provide specific code for downloading a GRIB file from the ECMWF using the cdsapi. However, here is a general example of how you might go about doing this using Python:

# Import necessary modules
import cdsapi

# Create a new client
c = cdsapi.Client()

# Request the GRIB file, specifying the desired data and date range
c.retrieve(
    'reanalysis-era5-single-levels',
    {
        'product_type': 'reanalysis',
        'format': 'grib',
        'variable': ['10m_u_component_of_wind', '10m_v_component_of_wind'],
        'year': '2018',
        'month': '01',
        'day': [
            '01', '02', '03', '04',
            '05', '06', '07', '08',
            '09', '10', '11', '12',
            '13', '14', '15', '16',
            '17', '18', '19', '20',
            '21', '22', '23', '24',
            '25', '26', '27', '28',
            '29', '30', '31'
        ],
        'time': [
            '00:00', '01:00', '02:00',
            '03:00', '04:00', '05:00',
            '06:00', '07:00', '08:00',
            '09:00', '10:00', '11:00',
            '12:00', '13:00', '14:00',
            '15:00', '16:00', '17:00',
            '18:00', '19:00', '20:00',
            '21:00', '22:00', '23:00'
        ]
    },
    'download.grib')

Wow, that’s hugely impressive. Not only did it know about the ECMWF but it also knew about its CDS (Climate Data Store) API interface and how to use it.
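
If you want to run it yourself, one thing GPT didn't mention is authentication: the cdsapi client reads the CDS URL and API key from a ~/.cdsapirc file by default, though you can also pass them in explicitly. A minimal sketch - the key value is a placeholder for the one on your Climate Data Store account page:

import cdsapi

# Credentials normally live in ~/.cdsapirc; passing them explicitly also works
c = cdsapi.Client(
    url="https://cds.climate.copernicus.eu/api/v2",
    key="<your-UID>:<your-API-key>",
)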


4) Data generation

There are so many use cases for ChatGPT, not just pure coding. As another example, it can create fake data sets for you to test code on.

Me: create some fake data in JSON format. The names and addresses (including city and postcode) of five people, along with their mobile number and credit card number and social security number

GPT: Here is an example of some fake data in JSON format, containing the names and addresses (including city and postcode) of five people, along with their mobile numbers, credit card numbers, and social security numbers:

{
    "people": [
        {
            "name": "John Doe",
            "address": {
                "city": "New York",
                "post_code": "10001"
            },
            "mobile": "212-555-1212",
            "credit_card": "1234-5678-9012-3456",
            "ssn": "123-45-6789"
        },
        {
            "name": "Jane Smith",
            "address": {
                "city": "Los Angeles",
                "post_code": "90001"
            },
            "mobile": "213-555-1212",
            "credit_card": "2345-6789-0123-4567",
            "ssn": "987-65-4321"
        },
        {
            "name": "Robert Johnson",
            "address": {
                "city": "Chicago",
                "post_code": "60606"
            },
            "mobile": "312-555-1212",
            "credit_card": "3456-7890-1234-5678",
            "ssn": "111-11-1111"
        },
        {
            "name": "Samantha Williams",
            "address": {
                "city": "Houston",
                "post_code": "77002"
            },
            "mobile": "832-555-1212",
            "credit_card": "4567-8901-2345-6789",
            "ssn": "222-22-2222"
        },
        {
            "name": "David Brown",
            "address": {
                "city": "Philadelphia",
                "post_code": "19019"
            },
            "mobile": "215-555-1212",
            "credit_card": "5678-9012-3456-7890",
            "ssn": "333-33-3333"
        }
    ]
}
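
And here's a trivial sketch of how you might consume it in a test - assuming you've saved the JSON above as people.json (the filename is mine, not GPT's):

import json

# Load the generated sample data
with open("people.json") as f:
    people = json.load(f)["people"]

# Feed each record into whatever code you're testing
for person in people:
    print(f'{person["name"]}, {person["address"]["city"]} {person["address"]["post_code"]}')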


5) Data validation

Or, how about creating regular expressions that meet your input validation requirements?

Me: create a regular expression that will screen out invalid email addresses

GPT: Here is a regular expression that can be used to screen out invalid email addresses:

^[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$

Ok, that looks a bit bonkers. Does it work? Broadly, yes - though with one wrinkle we'll come back to below.

Me: use it to write a python program that validates ten inputs containing a mix of valid and invalid emails. Show the program output too

GPT: Here is a Python program that uses the regular expression above to validate a list of 10 email addresses:

import re

# Compile the regular expression
pattern = re.compile(r"^[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$")

# List of email addresses to validate
emails = [
    "user@example.com",
    "user.name@example.com",
    "user_name@example.com",
    "user-name@example.com",
    "username@subdomain.example.com",
    "user@invalid",
    "user@-invalid.com",
    "user@invalid-.com",
    "user@invalid-.com",
    "user@inv-.alid-.com",
]

# Validate the email addresses
for email in emails:
    if pattern.match(email):
        print(f"{email} is a valid email address")
    else:
        print(f"{email} is not a valid email address")


Here is the output of the program:

 
user@example.com is a valid email address
user.name@example.com is a valid email address
user_name@example.com is a valid email address
user-name@example.com is a valid email address
username@subdomain.example.com is a valid email address
user@invalid is not a valid email address
user@-invalid.com is not a valid email address
user@invalid-.com is not a valid email address
user@invalid-.com is not a valid email address
user@inv-.alid-.com is not a valid email address


One thing to be aware of is that ChatGPT is not actually creating the code, running it and looking at the real output. Rather, it's showing what it expects the output would be if it did all those things - and its expectations aren't always right. In this case, the regex actually matches "user@invalid": nothing in the pattern requires the domain to contain a dot, so a real run would report that address as valid, not invalid as shown above.

In other words, all that ChatGPT "knows" is roughly what constitutes a valid email, what a python program that validates email addresses should look like, and what the output of that program ought to be if run with valid and invalid email addresses as its input - it has no way of actually checking any of it.
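
You can verify the wrinkle with a couple of lines of Python, using the same pattern GPT proposed:

import re

pattern = re.compile(
    r"^[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}"
    r"[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$"
)

# The trailing (?:\. ...)* group can match zero times, so a domain with no
# dot at all still satisfies the pattern
print(bool(pattern.match("user@invalid")))  # prints: True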


6) Sage advice?

Me: How do I enable cross-account read-only access to an S3 bucket I own

GPT: To give another AWS account read-only access to an S3 bucket you own, you can create an Amazon S3 bucket policy that allows read-only access to the bucket, and then provide the other AWS account with the access policy. Here are the steps to do this:

  1. Open the Amazon S3 console and navigate to the bucket that you want to give read-only access to.
  2. Click on the "Permissions" tab and then click on the "Bucket Policy" button.
  3. In the bucket policy editor, enter the following policy, replacing [BUCKET_NAME] with the name of your bucket and [AWS_ACCOUNT_ID] with the AWS account ID of the account you want to give access to:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowReadAccess",
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::[AWS_ACCOUNT_ID]:root"
            },
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::[BUCKET_NAME]/*"
        }
    ]
}

  4. Click on the "Save" button to save the bucket policy.

After you have created the bucket policy, the other AWS account will have read-only access to the S3 bucket.
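
One thing GPT's answer glosses over: that policy only grants s3:GetObject on the objects themselves. If the other account also needs to list the bucket's contents (most read-only tooling does), you'd add a second statement along these lines, again with the placeholders swapped in:

{
    "Sid": "AllowListBucket",
    "Effect": "Allow",
    "Principal": {
        "AWS": "arn:aws:iam::[AWS_ACCOUNT_ID]:root"
    },
    "Action": "s3:ListBucket",
    "Resource": "arn:aws:s3:::[BUCKET_NAME]"
}

Bear in mind, too, that cross-account access needs both sides to agree: users in the other account still need IAM permissions of their own that allow these actions on your bucket.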


Ok, so is AI going to replace software developers? For now, the answer is - I hope not - as that's how I earn my living. But in 10 or 15 years' time? Probably yes, if AI keeps progressing the way it has over the last few years. It'll be gradual but inevitable. Currently, we're on ChatGPT version 3.5, and I can't even imagine what version 5 will be capable of, never mind versions 8, 9 or 10. I'm a bit long in the tooth now, but if I were a young developer near the beginning of my career, I might want to start adding a few non-coding strings to my bow.

That's all I have for now. If you found this useful, please like, comment and share to help spread the knowledge around.
