Bulba Code ICE - RLHF Synthetic & Organic Loss




| IN PROGRESS

Executive Summary
In this document you will find instructions and examples for the branches and key points of
vanilla V3 (Advanced E2E ICE RLHF):

● Organic Branch:
○ Error Recovery.
○ Lossy Final Response.
● Synthetic Branch:
○ Loss Buckets for the final response.
● Step 3:
○ Loss Buckets for Step 3.

Table of Contents
Latest changes:
Tasks
Branches
Error Recovery
Recovery Strategy
Frequent causes of reject:
Detailed Example
Successful error recovery:
Unnecessary/Incorrect sampling:
Unnecessary sampling Scenario 1 / 2:
Unnecessary sampling Scenario 3:
Wrong recovery strategy and not sampling after error
Sampling Variations
How to complete a task (video)
Lossy Final Response
Loss buckets
How to complete a task
Scenario type
Variations examples
Branch: Synthetic Loss Bucket Generation
Objective
Task Instructions
Remember
Seeking loss buckets
Synthetic Loss Bucket Resolution Example
How to complete a task

List of Figures and Tables


Image 1 - Organic Scenarios and error categorization
Image 2 - Organic Scenarios and error categorization
Table 1 - List of Organic Loss Buckets
Table 2 - List of Scenarios
Table 3 - List of Examples
Image 3 - Possible Synthetic Scenarios
Table 4 - List of Synthetic Loss Buckets

Abbreviations and Definitions


EDA: Exploratory Data Analysis
EQ: Empty Queue
ICE: Implicit Code Execution
LLM: Large Language Model
RLHF: Reinforcement Learning from Human Feedback
SBQ: Send Back to Queue
SDA: Specific Data Analysis
Change Log

| Date | Version | Changes |
|:------------|:--------|:--------|
| 18-Mar-2024 | v1 | Original version |
| 20-Mar-2024 | v1.5 | Changes with respect to the old vanilla; Causes of reject |
| 23-Mar-2024 | v1.7 | errors=`coerce` |
| 28-Mar-2024 | v1.8 | Simple prompts related to modifying tables |
| 31-Mar-2024 | v2 | Step 3; Error Recovery: filtering strategy (diagram) |
| 01-Apr-2024 | v2.1 | Exploratory step new mandatory guidelines: Coder Step 2 |
| 03-Apr-2024 | v2.2 | Update of error recovery guidelines - new errors/hints: Types of Errors, Quick Hints |
| 04-Apr-2024 | v2.3 | Update of the Step 3 instructions/examples: Step 3 |

IMPORTANT UPDATE TODAY REGARDING THE EXPLORATORY STEP (v2.1).

Some minor changes compared to the classic vanilla:

- The use of pd.set_option must appear in the exploratory step.

- df.head() must be printed with .to_markdown().

- Using the print statement with .info() is not an issue.

- Mentioning "Pandas" in the first turn is not an issue.

- The use of .to_markdown() is not an issue. Moreover, if the only difference between the two
answers is the use of Markdown, you have to choose the answer that uses Markdown and rate
them as equally good.

- Assuming `df` as the name of the dataframe before defining it is not an issue.

- The use of plural pronouns is still an issue.

- The use of headers and subheaders in the coder instructions, explaining the thought process
behind them, is necessary from now on. Refer to Subheaders for the details. A minimal
exploratory step that satisfies these guidelines is sketched below.
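As a quick reference, this is a minimal sketch of an exploratory step that satisfies the
guidelines above (the file name `data.csv` is a placeholder):

Python

import pandas as pd

# Show all rows and columns in the printed outputs.
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

# Read the dataset ('data.csv' is a placeholder file name).
df = pd.read_csv('data.csv')

# Print the first 5 rows as a Markdown table.
print(df.head().to_markdown(index=False, numalign='left', stralign='left'))

# Print the column names and their data types.
print(df.info())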

General Understanding

Vanilla Project
The Vanilla Project is part of an RLHF process to improve the LLM performance. It is achieved
by solving prompts that require ICE. The language chosen for this purpose is Python.

ICE stands for Implicit Code Execution: the tasks you complete will execute several steps of
Python code.

Vanilla encompasses two distinct types of tasks, each situated in a separate branch. The two
branches, Organic and Synthetic, coexist but take different approaches to the same objective.

The Organic Branch presents a multi-turn chat task where a primary prompt must be answered
by exploring the data, while also identifying Loss Buckets (specific unexpected model
behaviours) and assisting the model in recovering from errors when they occur.

In each turn, you are in charge of guiding the model through a logical process. At each step, you
will need to evaluate the multiple responses presented to you, aiming to select the better one
or, in some cases, to rewrite and refine the responses provided.
Conversely, the Synthetic Branch employs a different task structure, where the prompt has
already been addressed. Here, the aim is to craft (synthetically) 'lossy' final responses in
alignment with the specified 'Loss Bucket' categories.

Tasks Taxonomy
The tasks for the Organic Branch consist of several turns, each with 2 responses, in which a
prompt is answered. The LLM takes on 2 roles during the task, acting either as an
Instructor or as a Coder.

● Instructor: Writes a set of instructions in natural language.


● Coder: Interprets the instructions and creates a block of code using Python.

The interaction between the two roles continues throughout the task: the output from each
model feeds into the other for the next turn. When the Coder model provides enough
information from the dataset to answer the prompt, the Instructor will display the Final
Responses in the last turn.

Your duty is to orchestrate the models' interaction by choosing the best response in each turn
and grading the responses across different topics.

Branch: Organic Loss Bucket Generation


In this new version of the classic Vanilla task, you will need to focus your efforts mainly on the
following steps:
● Error Recovery: focus on the intermediate steps (after the EDA).
● Lossy Final Response: focus on the final turn.

This new approach to the good old vanilla task comes with a few changes in the methods and
rubrics used to solve the task. Let’s dive into them:

1. If any data type conversion is necessary (e.g. an object column that holds numeric data or
an object column that holds datetime data), we will use the first five rows of the dataset,
and use what we've learned from them to perform the conversion.
2. If you followed all the recommended steps for column conversion, and you run into a
code error that is not a consequence of poor/incomplete coder instructions but rather
the result of an error that was impossible to prevent with the information available at that
point of the task, do not change the previous instructions using information about the
dataset gathered from the error. The idea is to teach the model how to recover from this
initial error.
3. Once you've helped the model reach a final response, and perhaps also recover from a
code error, it's time to detect 'Loss Buckets'. You can find everything about them here:
Seeking loss buckets.

I know this all seems daunting right now, and you’re probably a little lost, but we’ll take you step
by step to explain everything you need to know in the best way possible.

- If the prompt only requests the creation of new column(s) or DataFrames, the Final Response
should include a truncated view of the first 3 rows of the final DataFrame to display the resulting
column(s) or DataFrame. Likewise, when the prompt asks for merging/cleaning/modifying the
table structure/header, etc., the perfect final response should include a glimpse of the new
table, as sketched below.
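A minimal sketch of producing such a glimpse, assuming the final DataFrame is stored in a
hypothetical variable named `df_result`:

Python

import pandas as pd

# 'df_result' stands in for the final DataFrame produced by the task.
df_result = pd.DataFrame({'Stage': [1, 2, 15], 'StageTotal': [2300.0, 1725.0, 115.0]})

# Show a truncated view of the first 3 rows for the final response.
print(df_result.head(3).to_markdown(index=False))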

Error Recovery
There will be cases where the code runs into an error that WAS IMPOSSIBLE TO PREVENT with
the information about the dataset available at that point. Do not redo the step if this happens!
Instead, use the next few turns to investigate and fix the error. This section will go over some
frequently occurring errors and the recommended approaches to address them.

To determine whether an error is avoidable or unavoidable, use the following diagram as a
decision guide:

Image 1 - Organic Scenarios and error categorization

For more information about the scenario, you can check Scenario type, and for more examples
Variations examples.

What should I do when the prompt requires filtering a column by a specific value?
In the previous good old vanilla, the approach was to check the unique values looking for value
variations ("APRIL", "april", "April", " April "). From now on, if the value appears in the first five
rows, you can filter right away. This way, the model has a higher chance of reaching the solution
faster. The caveat is that the odds of getting an error are also higher; use the Recovery Strategy
appropriate to each case. A sketch of this direct-filtering approach follows the image below.

Image 2 - Organic Scenarios and error categorization
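As a sketch of this direct-filtering approach (the `Month` column and the value 'April' are
hypothetical):

Python

import pandas as pd

# Hypothetical data where the target value already appears in the first five rows.
df = pd.DataFrame({'Month': ['April', 'May', 'April', 'June', 'April'],
                   'Sales': [100, 200, 150, 300, 120]})

# Filter right away instead of printing the unique values first.
df_april = df[df['Month'] == 'April']
print(df_april.to_markdown(index=False))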

IMPORTANT: If the errors stem from poor instructions (using incorrect methods, poor data
handling, etc.) or model hallucinations (column names that don't exist, data that can't be seen
in the first 5 rows, unnecessary column conversions, etc.), these errors can and must be
avoided. For all the details read the Recovery Strategy section.

Note: a common avoidable error is that the code does not follow the instructions.

Recommended action → Modify the code

Recovery Strategy
In an Error Recovery scenario, the model might fail during the code execution phase, and
then, in the following step of coder instructions, it should formulate a recovery plan considering
these new insights. As a contributor, you would only intervene with this recovery plan if the
model fails to utilize this new information to handle a proper recovery.
On the other hand, if the error is Not Suitable for Error Recovery scenario, your job as a
contributor is to intervene with the previous coder's instructions and guide the model along the
correct path to avoid this error. Do not let the model fail because of hallucinations.

If the model doesn't follow the previous coder's instructions, your job is to modify the given
instructions as many times as necessary until the model produces the correct code.

Code modification should ONLY be performed, as a last resort, when the model doesn't follow
the previous coder's instructions.

Types of Errors:

A ValueError is raised when a function receives an argument of the correct type but an
inappropriate value. Common cases:

Suitable for Error Recovery:

- When the model tries to convert to numeric or date type after recognizing the pattern
seen in the EDA step, but it fails because there were unseen patterns, a ValueError is
raised as a consequence.

In the next turn of coder instructions, you should:

● Sample 20 unique rows from the column that cause an error during the conversion.
● Based on these 20 rows:
○ If there are any patterns, update the conversion strategy to take care of them.
● Retry the conversion with the new strategy, but also set errors='coerce' so that any rows
that cannot be converted become NaNs and are then dropped. A compact sketch of this
recovery flow follows.
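This is a minimal sketch of that flow, assuming a hypothetical `Amount` column whose first
conversion attempt raised a ValueError (the values and cleaning steps are illustrative only):

Python

import pandas as pd
import numpy as np

# Hypothetical column with patterns that were not visible in the first five rows.
df = pd.DataFrame({'Amount': ['100', '2,50 EUR', '300', 'N/A', '4 0 0']})

# Sample up to 20 unique values that fail the numeric conversion.
failing = pd.to_numeric(df['Amount'], errors='coerce').isnull()
invalid_values = df.loc[failing, 'Amount'].unique()
if len(invalid_values) > 20:
    print(np.random.choice(invalid_values, 20, replace=False))
else:
    print(invalid_values)

# Update the cleaning strategy based on the observed patterns, then retry with
# errors='coerce' and drop the rows that still cannot be converted.
cleaned = (df['Amount'].str.replace(' ', '', regex=False)
                       .str.replace('EUR', '', regex=False)
                       .str.replace(',', '.', regex=False))
df['Amount'] = pd.to_numeric(cleaned, errors='coerce')
df = df.dropna(subset=['Amount'])
print(df)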

Not Suitable for Error Recovery:

- When the model tries to convert to numeric or date type without recognizing the pattern
seen in the EDA step.
- Recommended action →: When you encounter a Not Suitable ValueError, your job as a
contributor is to intervene and guide the model along the correct path; do not let the
model fail because of hallucinations.

A KeyError is raised when a mapping key is accessed and isn't found in the mapping. This
could happen because the model references a column name incorrectly due to various causes:

Suitable for Error Recovery:


- The model might not acknowledge the correct column names (mostly the correct
capitalization of the column name) or the presence of spaces before and after the column
names, thus using the incorrect name in the code and causing the model to fail (only in
cases where this is hard to spot in df.head() and df.info()).
- Recommended action →: When you encounter a suitable KeyError, you should instruct the
model, in the following coder instructions, to print the column names to confirm how they
are spelled (see the sketch below).
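For instance, a minimal sketch of a step that only prints the exact column labels (the file name
is a placeholder):

Python

import pandas as pd

df = pd.read_csv('data.csv')  # placeholder file name

# Print the exact column labels (including capitalization and any
# leading/trailing spaces) so the next step can reference them correctly.
print(df.columns.tolist())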

Not Suitable for Error Recovery:

- The model hallucinates the column name.


- Recommended action →: When you encounter a Not Suitable KeyError, your job as a
contributor is to intervene and guide the model along the correct path, do not let the
model fail because of hallucinations.

An AttributeError is raised when we try to access or use an attribute of an object that does
not exist or is not accessible in the context where it's being used.

Common cases:

Suitable for Error Recovery:

- When the model tries to convert an object type column: The error is raised when the
condition for using .str accessor is not met, i.e., when the value being accessed is not a
string. In pandas, when a column contains elements of different data types (e.g., strings,
integers, floats), pandas may assign the object data type to that column. This allows
pandas to handle columns with mixed data types, but it can also indicate that the
column may not be as efficiently handled as columns with homogeneous data types.
Therefore, the error will occur because the column contains mixed string and numeric
data, and using the str method on a numeric value will cause the error.
- Recommended action →: When you encounter an AttributeError (like the one described above),
your job as a contributor is to, in the next step, repeat the instruction but add an
instruction making sure the model casts or treats the column as the correct type (in the
example, casting it to string), thereby avoiding the error. A minimal sketch follows.
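A minimal sketch of that fix, using a hypothetical `Total` column that mixes strings and
numbers:

Python

import pandas as pd

# Hypothetical object column containing both strings and numbers.
df = pd.DataFrame({'Total': ['$100', 200, '$1,300', 400]})

# Cast the column to string first so the .str accessor works on every element,
# then clean and convert it to numeric.
df['Total'] = (df['Total'].astype(str)
                          .str.replace('$', '', regex=False)
                          .str.replace(',', '', regex=False))
df['Total'] = pd.to_numeric(df['Total'])
print(df)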

Other Types of Errors:

TypeError / NameError / ModuleNotFoundError

Suitable for Error Recovery:

- Recommended action →: When you encounter any of these errors your job as a
contributor is to, in the next step, repeat the instruction but add an instruction to make
sure that the error doesn’t repeat itself in the new step. You’ll have to use your coding
knowledge to figure out what is the cause of the error and what’s the best way to avoid
the error at hand.

General Errors
- When you encounter any other type of error, you should analyze the possibility for this
code error to be avoidable or unavoidable. If you conclude that the error you faced was
avoidable because you could have provided better instructions, then go back to the
previous coder instructions and modify them to prevent the model from failing. If you
encounter an error that you couldn’t have avoided, then let the model recover in the next
coder instructions using this new information to create proper coder instructions.
Quick Hints
Sampling
Whenever sampling is needed, whether in scenario 3 or in scenario 2 after an error occurs, it
must be a sample of 20 unique elements (preferably a random sample) that are incompatible
with the required conversion (e.g. 20 unique, randomly sampled elements for which the
conversion fails, or that have non-numeric values; it all depends on the case you're working
with). Refer to Planner Step 5 and Coder Step 6 for a very clear example.

In the next example, the same principle of filtering is applied to dates (sketched in code below,
since the original example is a screenshot).
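A sketch of that date-oriented sampling, assuming a hypothetical `Date` column whose first
five rows all looked like YYYY-MM-DD:

Python

import pandas as pd
import numpy as np

# Hypothetical column with an unseen, incompatible date format.
df = pd.DataFrame({'Date': ['2024-01-05', '2024-02-10', 'n.d.', '10/03/2024', '2024-04-01']})

# Flag the rows where the datetime conversion fails.
invalid = pd.to_datetime(df['Date'], format='%Y-%m-%d', errors='coerce').isnull()

# Sample up to 20 unique incompatible values for inspection.
invalid_values = df.loc[invalid, 'Date'].unique()
if len(invalid_values) > 20:
    print(np.random.choice(invalid_values, 20, replace=False))
else:
    print(invalid_values)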


Data Cleaning + Conversion
If any data cleaning is necessary before converting to another data type, separate the cleaning
from the conversion: put them in two different steps of coder instructions, and in two different
steps of code. That way, if the conversion fails, at least the initial cleaning has already been
performed (see the sketch below).
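Schematically, with a hypothetical `Cost` column, the two separate code steps could look like
this:

Python

import pandas as pd

df = pd.DataFrame({'Cost': ['$100.00', '$1,300.00', '$2,500.00']})

# --- Coder step N: cleaning only ---
df['Cost'] = (df['Cost'].str.replace('$', '', regex=False)
                        .str.replace(',', '', regex=False))

# --- Coder step N+1: conversion only ---
# If this step fails, the cleaning above has already been performed.
df['Cost'] = pd.to_numeric(df['Cost'])
print(df)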

Conversion After Sampling

If you've performed a sample (in scenario 2 or 3), the idea is to get as much useful
information as possible (within a 20-element sample) to understand what is preventing, or
could prevent, the successful conversion of a column to another data type, and to use this
information to clean the column before performing the conversion in the next step.

The following image shows a good set of instructions. After failing, sampling, and seeing that
many rows in the columns of interest had non-numeric characters interfering with the
conversion of the columns to a numeric data type, the contributor instructs the model to
perform a clean-up before performing the conversion using the parameter errors='coerce'.

If you sample a column and obtain valuable information to clean it before performing the data
type conversion, but then just directly use errors='coerce' as a parameter without cleaning
the data, the sample becomes pointless: you are simply eliminating all the rows of the column
that were causing the original error, which you could have done without any sample. The idea is
to use the information obtained while sampling, clean the data, convert the column, and try to
end up with as much useful data as possible (limited by how good the sampling actually was).
Frequent causes of reject:

● Unnecessary / incorrect sampling. In these cases there is an incorrect or unnecessary
sampling of column(s); you can refer to the following links for specific examples:
○ Printing unique values (sampling) in scenarios 1 and 2.
○ Printing all the unique values in scenario 3.
○ Printing unique values of unnecessary columns (columns that are neither related to the
prompt nor object columns that need conversion).
● Not modifying the code / coder instructions when necessary, thereby injecting errors
(check Not Suitable for Error Recovery).
● Assuming information based on the actual output.
● Using coerce in the conversion without previously having gotten an error and without
sampling the data.
● Not sampling after getting an error.
● Getting the same ValueError over and over again.
● Getting KeyErrors.
● Solving the task using only the sphere. The sphere is an environment provided to check
the code, since you don't have access to the dataset; the code that matters, however, is
the one that appears in the Code sections along the task.
● Doing more steps in the EDA than necessary.
● Submitting bad tasks. Avoid submitting incomplete tasks, either because none of the
responses reached the final response section or because both of them have a
`missing flow state` message.

Detailed Example:
See Steps 1-4 below to understand the context and reason for the Code Error.

Note: the `Total` column is used for calculation and therefore needs cleaning before converting
to numeric.

Planner Step 1

Python

CODER_INSTRUCTION:

1. Please use pandas to read the csv file "R_Plan_Budgetcsv_-_BT_Budget.csv".

2. Show in markdown the first 5 rows and all columns by head().

3. Show columns and their types by info().

Coder Step 2

Python

import pandas as pd

pd.set_option('display.max_rows', None)

pd.set_option('display.max_columns', None)
# Read the CSV file into a DataFrame

df = pd.read_csv('R_Plan_Budgetcsv_-_BT_Budget.csv')

# Display the first 5 rows

print(df.head().to_markdown(index=False, numalign='left', stralign="left"))

# Print the column names and their data types

print(df.info())

Code Output
As we see below, df.info() shows that the `Total` column is not in the right format for the
analysis/calculation to happen, and therefore needs conversion (it is of type object instead of
numeric).

df.head() shows that the column that needs conversion has values that must be cleaned up for
the conversion to succeed. For instance, a value such as "$1,770.00" in the `Total` column
requires removing '$' and ',' before converting the column to numeric so it can be used for the
calculations later.

Unset

| Cost Code | Concept                             | Group | Stage | Quantity | Unit | Unit Cost | Total     | Markup % | Price w Markup |
|:----------|:------------------------------------|:------|:------|:---------|:-----|:----------|:----------|:---------|:---------------|
| 1110      | LP - Concrete paving and sidewalk   | 1     | 15    | Lot      | 1    | 100       | $100.00   | 15       | 115            |
| 1120      | LP - Head wall                      | 1     | 2     | Lot      | 1    | 200       | $200.00   | 15       | 230            |
| 1150      | LP - Culvert Installation           | 1     | 2     | Lot      | 1    | 300       | $1,770.00 | 15       | 1495           |
| 1200      | LP - Lot clearing - Cut trees       | 1     | 1     | Lot      | 1    | 400       | $1,400.00 | 15       | 1610           |
| 1210      | LP - Lot clearing - Debris Haul off | 1     | 1     | Lot      | 1    | 500       | $2,500.00 | 15       | 2875           |

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 181 entries, 0 to 180
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   Cost Code       181 non-null    int64
 1   Concept         181 non-null    object
 2   Group           181 non-null    int64
 3   Stage           181 non-null    int64
 4   Quantity        181 non-null    object
 5   Unit            181 non-null    int64
 6   Unit Cost       181 non-null    object
 7   Total           181 non-null    object
 8   Markup %        181 non-null    int64
 9   Price w Markup  181 non-null    int64
dtypes: int64(6), object(4)
memory usage: 14.3+ KB

Planner Step 3
The model will try to clean up this column based on the information in df.head()

Python

OUTPUT:
To examine the costs of the project at each stage, I'll calculate the Total
cost per stage. To do this, I'll calculate the sum of `Total` for every
`Stage`. Then I'll look for the highest and lowest stage and the overall
distribution.

CODER_INSTRUCTION:

Tidy the data:

1. Because the `Total` column contains non-numeric values (as seen in the
previous `df.info()` output, it is of type `Object`), I need to clean it.

2. Based on the `df.head()` output, the `Total` column has values such as '$1,770.00' that
contain '$' and ','. I need to remove these characters.

3. Remove the '$' and ',' from `Total` column.

Convert the data:

1. Because the `Total` column is of type `Object`, I need to convert it to numeric for the
calculation later.

2. Convert the `Total` column to numeric.

Transform and aggregate key metrics:

1. Calculate the Sum of `Total` column per `Stage` by using the groupby method.

2. Store the results in a new dataframe labelled `df_results`.

Display outputs:

1. Identify the stage with the highest `StageTotal` and print the stage
alongside the `StageTotal` for that stage.

2. Identify the stage with the lowest `StageTotal` and print the stage
alongside the `StageTotal` for that stage.

3. The dataframe is short so we can show the full distribution. Sort the
results by `StageTotal`.
Coder Step 4
The coder writes code based on the instructions above:

Python

# Removing '$' and ',' from the 'Total' column
df['Total'] = df['Total'].replace(r'[\$,]', '', regex=True)

# Converting the 'Total' column to numeric
df['Total'] = pd.to_numeric(df['Total'])

# Calculating the sum of the 'Total' column per 'Stage'
df_results = df.groupby('Stage')['Total'].sum().reset_index()

# Display the results
print(df_results)

Coder Step 4 Error

There's an error, since there is a string literal at position 227 (which was obviously not
visible in the first 5 rows, so there was no way to preempt this error).

Python

CODE_EXECUTION_ERROR:

Traceback (most recent call last):

File "<string>", line 21, in <module>

File "numeric.py", line 217, in to_numeric


values, new_mask = lib.maybe_convert_numeric( # type:
ignore[call-overload] # noqa

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

File "lib.pyx", line 2322, in pandas._libs.lib.maybe_convert_numeric

ValueError: Unable to parse string "#REF!"

During handling of the above exception, another exception occurred:

ValueError: Unable to parse string "#REF!" at position 227

Given the context above, we now want examples where the following steps are critical:

1. Write a plan for inspecting 20 unique rows that will raise an error during conversion
2. Code to sample and output 20 such unique rows
3. Plan that formulates a strategy for cleaning this column based on the Code Output in the
above step
4. Code to clean the column based on the plan above and then proceed with the rest of the
analysis

Note: When trying to convert the column a second time, we will use errors='coerce' so
that all problematic rows are converted to NaNs and do not cause another error. This
should be followed by dropping the rows that now contain NaN values before
proceeding with the rest of the analysis. See Planner Step 6 and Coder Step 7 for more
details.

Planner Step 5

Unset

OUTPUT:

It seems there's a ValueError while converting the column `Total` to numeric,
since there is a string "#REF!" at position 227. Since the `Total` column is not
numeric, I will check the values that are not numeric.

CODER_INSTRUCTION:

1. Filter all rows of `Total` where the conversion to numeric fails.

2. If there are more than 20 unique non-numeric values, print 20 of them.

3. Otherwise, print all unique non-numeric values from the `Total` column.
Coder Step 6

Python

import numpy as np

# Flag the values of the column that fail the numeric conversion
has_invalid_value = pd.to_numeric(df['Total'], errors='coerce').isnull()

# Get all the unique invalid values
invalid_values = df.loc[has_invalid_value, 'Total'].unique()

# Print a random sample of 20 invalid values, or all of them if there are 20 or fewer
if len(invalid_values) > 20:
    print(np.random.choice(invalid_values, 20, replace=False))
else:
    print(invalid_values)

IMPORTANT: If the instructions don’t result in this type of code, feel free to divide the
instructions into more steps and specify the methods to use.

Code Output:
Unset

['None', 'NaN', '#REF!']

In this case the `Total` column has values that can never be converted to numeric, so these
values need to be dropped from the column.

It's very important that you use the information gathered in the sample and clean the data
before converting to numeric using 'coerce'.

See the following Plan and Code for doing this:

Planner Step 6

Unset

OUTPUT:

It seems there are string values such as `#REF!` in the `Total` column that
can't be converted to numeric. I'll update the conversion of the `Total` column
to drop the non-numeric values before proceeding with the rest of the analysis.

I will then calculate the Total cost per stage. To do this, I'll calculate the
sum of `Total` for every `Stage`. Then I'll look for the highest and lowest
stage and the overall distribution.

CODER_INSTRUCTION:

First clean and convert columns to correct types for the analysis later:

1. Convert the `Total` column to numeric for the calculations later, using
errors='coerce' within pd.to_numeric() to convert non-numeric values to NaN

2. Remove all rows with NaN values in the `Total` column before proceeding with
the rest of the analysis

Transform and aggregate key metrics:

1. Calculate the Sum of `Total` column per `Stage` by using the groupby method.

2. Store the results in a new dataframe labelled `df_results`.


Display outputs:

1. Identify the stage with the highest `StageTotal` and print the stage
alongside the `StageTotal` for that stage.

2. Identify the stage with the lowest `StageTotal` and print the stage
alongside the `StageTotal` for that stage.

3. The dataframe is short so we can show the full distribution. Sort the
results by `StageTotal`.

Coder Step 7

Python

# Removing '$' and ',' from the 'Total' column and converting to numeric, using
# errors='coerce' to convert non-numeric values to NaN
df['Total'] = pd.to_numeric(df['Total'].astype(str).str.replace(r'[\$,]', '', regex=True), errors='coerce')

# Drop all rows with NaN in the 'Total' column
df = df.dropna(subset=['Total'])

# Calculating the sum of the 'Total' column per 'Stage'
df_results = df.groupby('Stage')['Total'].sum().reset_index()

# Display the results in markdown format
print(df_results.to_markdown(index=False))

Successful error recovery:


The following snippets exemplify a situation where the tasker guides the model to recover from
an unavoidable error.

Note: These are suggestions only, please use as diverse and meaningful alternatives as
possible.

Main Prompt

This is what the first turn looks like:

This is what the second turn looks like:

This is the Code Output from the second turn. As the prompt asks, you need to work with the
`Cost` column, so it is the one you should analyze to establish the scenario.
Based on the previous ER diagram, scenario 2 is the most suitable. Therefore, it is correct to
clean and make the conversion. If some of the signs shown by df.head() are not instructed to
be cleaned, the tasker must modify the instructions to address them. In this case it is not
necessary, because the model takes care of the '$' currency symbol. In the third turn (just in
scenario 2) it is essential to separate the cleaning from the conversion by using different
subheaders, in this case 'Tidy the data' and 'Convert the data'.

This is what the fourth turn looks like:

This is the Code Output. It appears that the model falls into a ValueError because there is a
non-numeric value, not shown in df.head(), that makes the conversion fail.
When you face an error like this, where there is an issue with the dataset that was impossible
for df.head() to reveal, you should classify it as an unavoidable error.

This is what the fifth turn looks like:

This is what the sixth turn looks like:

This is the Code Output. By looking at the unique values you can identify that a more complex
cleaning is required.

This is what the seventh turn looks like:

This is what the eighth turn looks like:

This is the Code Output. The median was calculated successfully.


Finally, you reach the final response before the loss bucket section:

Unnecessary/Incorrect sampling:
Unnecessary sampling Scenario 1 / 2
The following task snippets exemplify a situation where the model prints unique values
unnecessarily for scenario 1 or 2. These scenarios don't require sampling because either the
relevant columns for the analysis are already in the correct data type and don't require
conversion, or the column(s) need a data type conversion and the visible rows in the data
exploration have a consistent format. For more details about the scenarios go to Scenario type.

Note: These are suggestions only, please use as diverse and meaningful alternatives as
possible.

Main Prompt

Turn 2 - Code Output

This is the turn 2 Code Output. As the prompt asks, you need to work with the `DATE` column,
so it is the one you should analyze to establish the scenario.
It's visible that the `DATE` column is an object column that needs to be converted, and that the
format in the first 5 rows is consistent.

Turn 3

The model/contributor incorrectly instructs to print the unique values (even worse, it prints
all of them).
Based on the previous ER diagram, scenario 2 is the most suitable. Therefore, printing the
unique values is not the correct approach. The model should have converted the `DATE` column
to datetime, assuming that the datetime format in the entire column is consistent with the
format shown in the first five rows of the dataset.

In these scenarios, there is NO NEED FOR SAMPLING; the contributor must re-write the
instructions to make the conversion directly. A good rewrite looks like the following image.
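Since the rewrite is shown as a screenshot, here is a hedged sketch of what the resulting
conversion could look like, assuming the first five rows all showed an MM/DD/YYYY format (the
values are hypothetical):

Python

import pandas as pd

# Hypothetical 'DATE' column whose visible rows share one consistent format.
df = pd.DataFrame({'DATE': ['01/15/2023', '02/20/2023', '03/25/2023']})

# Convert directly, assuming the whole column matches the format
# seen in the first five rows; no sampling step is needed.
df['DATE'] = pd.to_datetime(df['DATE'], format='%m/%d/%Y')
print(df.dtypes)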

Unnecessary sampling Scenario 3


By definition, scenario 3 needs sampling, but it should never be a sample of the entirety of the
unique values of the column.

Main Prompt
This is the main prompt of the task.

Turn 2 - Code Output

This is the turn 2 Code Output. As the prompt asks, you need to work with the 'Total charges
February', 'Total charges January', 'Total paid February', 'Total paid January', 'Final balance
February', and 'Final balance January' columns. It's necessary to analyze these columns to
establish the scenario.

Turn 3
As evidenced by the first five rows, one value contains both commas and periods, while the rest
of the visible values contain only periods. Since there is more than one format present in the
first five rows, this is a scenario 3 case, and it's necessary to sample the column values: 20
unique, incompatible values from the column for further analysis (identical to the sampling
that has to be performed after an error occurs).

In this step the contributor/model decided to print the entirety of the unique values of the
relevant columns, which is not the way to handle this situation, causing an immediate rejection
of the task.

It's also noticeable that the contributor decided to remove white spaces from the column names
"just in case"; that is also an incorrect instruction (column-name spaces can't be stripped just
in case).

Wrong recovery strategy and not sampling after error


The following task snippets exemplify a situation where the model does not print the unique
values and the contributor did not modify the instructions. Moreover, the way the model
recovers from the error is incorrect. These scenarios require sampling because the relevant
columns for the analysis need a data type conversion, and the visible rows in the data
exploration have an inconsistent format. For more details about the scenarios go to Scenario
type.

Note: These are suggestions only, please use as diverse and meaningful alternatives as possible.

Main Prompt
Above is a picture of the dataset; the relevant columns necessary to solve the prompt are
highlighted:

This is the third turn, where the model provided incorrect coder instructions. The main problem
here is that the task falls into scenario 3, yet, as you'll see, the model makes the conversion
right away. In this case the contributor should have modified the instructions accordingly, as
sampling was required.
This is what the fourth turn looks like:

This is the Code Output. The model falls into a ValueError because there are mixed formats that
were shown in df.head(), and the conversion is being made with a single format.
In cases like this, where errors occur due to incorrect coder instructions, the error is classified
as avoidable. The appropriate action would have been to modify the previous instructions
accordingly.
This is what the fifth turn looks like. The contributor incorrectly proceeded with the analysis
without modifying the previous coder instructions as required.
This is what the sixth turn looks like.

This is the Code Output. Although it seems correct, as it does not show any issue, the way the
error was addressed is incorrect.

Sampling Variations
The plan and code for cleaning up should vary based on the output of the Sampling step (where
20 rows are sampled).

Please look at the following suggestions to synthetically generate variations for the output of the
sampling step and then write the follow-up Code Recovery accordingly.

Note: these are suggestions only, please use as diverse and meaningful alternatives as possible.
Note 1: You can also include columns that have multiple related formatting problems. For
example, a mix of commas and periods. Or a mix of currency symbols.

Note 2: Often you want to make it so that the issue is consistent within the column. For example,
if you add white space, it can be added to all column values.

Modifying the instructions in any turn after the EDA is allowed if the changes are based on the
information seen in the first five rows, or any code output.

How to complete a task (video)

https://frrautneduar-my.sharepoint.com/:v:/g/personal/pablo_caneva_frra_utn_edu_ar/ESClOr
liA5tImWjV71aqBuoB9wovIXu4kcGPDGSxevo-AQ?e=ylqUki

Lossy Final Response

The primary objective of this section is to identify tasks where at least one response is perfect
and the other contains a loss bucket.
Initially, you'll get the traditional two final responses the model produced. You have to rate the
responses as you did in vanilla, but a brand new feature is that you'll need to identify and label
each loss bucket present in each final response.
To label the "Loss Buckets", you can use the following table: Loss buckets.
Note: if a response has no loss bucket, just select "None of the above".

After these steps, you'll find the traditional side-by-side rating and the chance to refine the
response, to get the best final response possible.
After the response has been refined, you'll get a questionnaire asking if this was a final
response. Once you've confirmed that it was, the model will regenerate 3 new responses for the
non-chosen response.
Your job consists of choosing the most lossy one of the 3 new responses, completing the
ratings and labelling the corresponding loss buckets present in each of the 3 new responses (if
a response has no loss bucket, just select "None of the above"), and comparing them against
the original (the non-chosen response from the previous selection) in a small side-by-side bar.
Finally, you'll be requested to identify whether any of the 5 responses (the refined response, the
non-chosen response, and the three responses generated by the model) present any loss
bucket.

And now you're done; it's time to "End Session" and submit your completed task.
Loss buckets

Loss buckets are specific characteristics that make a model response bad. You will need to
review the Final Response and identify if any of the loss buckets listed in Table 1 are present.
The desired situation is when the Final Response chosen by the tasker is perfect (whether it
needs to be rewritten or not), and the non-chosen response contains one loss bucket. If this
situation occurs, the task is considered a good one.

Weird/Wrong Final Response Pattern: The final response refers to the previous analysis. It
could include, but is not limited to, "...data analysis objective...", "...based on the
analysis..." or similar expressions.

Incorrect/Hallucinated Final Response: The final response offers either completely incorrect
information or information that hasn't been directly extracted from a preceding step's output.
It's very important to mention that, for the final response to be incorrect or hallucinated, the
previous code output has to be a valid and correct output that provides the necessary
information for the successful resolution of the user's query. If the preceding code output or
the logical process of the task is incorrect, naturally the Final Response will be incorrect;
this does not constitute a loss bucket, since it's the "correct response for the wrong logical
process".

Table in response is not formatted well: When the main prompt requires a table, the final
response should present the requested table. If there are too many rows (more than 20), the
table should display only three rows; otherwise, it should display the complete table. If it's
truncated, this should be explicitly specified in the response, e.g. "This table is a truncated
view of the records...". This loss bucket occurs when the model starts showing the table but
stops abruptly, e.g. it prints 14 complete rows and does not complete the 15th row. Tables that
are simply not in Markdown should not be included in this category.

Final Response is Not Complete by Itself: In order to answer the prompt, the final response
refers to an output produced in an earlier step, yet it does not display it within the final
response. E.g. "The max can be found in the previous step", "I've calculated the highest
average in the last turn...", "Refer to the table above for...".

Incorrect / Hallucinated Insights: The final response is fundamentally accurate but includes
insights that are either incorrect or not derived from the analysis conducted in previous
steps. NOTE: insight refers to a valuable or meaningful observation, interpretation, or
understanding gained from analyzing data or text. It often involves identifying patterns,
trends, relationships, or significant information within the given context.

Unnecessary Insights for Plots: In cases where the main prompt explicitly requests a plot, the
expected final response is a confirmation message stating that the plot has been generated.
There should be no additional insights regarding the plot or the data. Be aware that if the
prompt does not ask for a plot but the resolution involves displaying one, then providing
insight might be necessary, and this loss bucket does not apply to those situations. NOTE:
insight refers to a valuable or meaningful observation, interpretation, or understanding
gained from analyzing data or text. It often involves identifying patterns, trends,
relationships, or significant information within the given context.

Final Response Does Not Fully Satisfy User Query: The final response addresses a portion of
the main prompt but does not fulfill all the requested information.

Missing Flow State: The model fails and displays the following message instead of a final
response: <MISSING FLOW STATE>

Not a Final Response: Instead of providing a Final Response to conclude the task, the model
provides a new set of Output/Instructions to continue with the process, even though there is
enough information to conclude the task.

Unnecessary Explanation in the Final Response: The model generates unnecessary explanation in
the final response, e.g. it offers explanations detailing the steps taken to solve the prompt.

The Model Failed in this Step: There's a model failure in this step; the most common example
is <MALFORMED PLANNER RESPONSE>.

Use of Plural Pronouns: The model should not use plural pronouns to refer to itself or its
work.

Follow-up Suggestions: The model suggests conducting further analysis. It does not acknowledge
that all necessary information has already been found to address the main prompt. It may offer
new coder instructions. Note: "Let me know if you have any other request" or similar phrases
are NOT a loss bucket.

Uses the File's Name: The final response contains the filename of the CSV file. E.g.: "...based
on the data in 'AMSA INDUSTRIAL - INVENTORY.csv'."

Final Response contains an image tag: When working with plots, the final response may feature
an image tag instead of presenting the plot directly or referring to a plot generated in a
previous step. E.g.: [Image of bar chart showing the frequency of vehicles].

Table 1 - List of Organic Loss Buckets

How to complete a task (video)


https://frrautneduar-my.sharepoint.com/:v:/g/personal/pablo_caneva_frra_utn_edu_ar/ET65gt
ubBA5Gr84-o5lJchEB9fiiaC98IEFTUGKgL9G7qA?e=dPHk7D

Scenario type
The scenario analysis should be conducted based on the necessary columns to solve the
prompt. There are three scenarios to consider:

Scenario 1: no cleaning/conversion needed.
● df.info() indicates that a column does not need to be converted before being used.

Scenario 2: needs cleaning/conversion.
● df.info() shows that a column is not in the right type for the analysis/calculation to
happen and therefore needs conversion (e.g. numeric or datetime data stored in object format).
● df.head() shows that the column that needs conversion has a consistent format, making it
clear how to clean it.

Scenario 3: needs sampling before cleaning/conversion.
● df.info() shows that a column is not in the right type for the analysis/calculation to
happen and therefore needs conversion (e.g. numeric or datetime data stored in object format).
● df.head() shows that either:
○ there is no hint on how to clean it, or
○ the format is not consistent across all visible examples.
Table 2 - List of Scenarios

Variations examples
In the following table you can find different cases where samples are needed:

Case 1: Non-numeric characters in a column that needs to be converted to numeric.
Sample problem: presence of non-numeric values such as "$43,452"; '$' and ',' need to be
removed from the column before it can be converted to numeric.
Examples of possible variations:
● "$43,452"
● "¥43,452" (different currency)
● "¥43452" (different currency, no commas)
● "$43452" (no commas)
● "43,452" (no currency)
● "$43, 452" (extra space)
● "43,452 dollars" (spelled-out currency)
● "$43,452.00¢" (cents symbol)
● "43 452 . 00" (many extra spaces)
● "$43.452,00-" (extra dash)
● 43452 (no cleaning up needed)

Case 2: Different formatting of numeric values.
Sample problem: the column has values such as 1234,5 that use a comma instead of a decimal
point; the comma must be replaced with a period before conversion.
Examples of possible variations:
● "(12345)" means "-12345" (replace the parentheses with a negative sign)
● "1234,5" means "1234.5" (replace the comma with a decimal point)
● "3.4 - 3.7" (the column has ranges; the model should use the average value for calculations)
● "70%" means 0.7 in some contexts
● 1.23E+02 / 1.23e+02 (scientific notation)
● "## $1234.5 ##" (extraneous symbols within the column)

Case 3: Problems with datetime.
Sample problem: parse dates in different formats.
Examples of possible variations:
● 05/21/1994
● 21/05/1994
● 1994-05-21
● 21-May-1994
● May 21, 1994
● 1994/05/21
● 21.05.1994
● 21-May-94
● 05-21-94
● 21/05/94
● 19940521
● 21st May 1994
● 1994-05-21 12:00:00
● 1994-05-21T12:00:00Z
● 1994-05-21 12:00:00+00:00
● 1994-05-21 12:00:00-07:00
● 05/21/1994 12:00:00 PDT
● 21/05/1994 12:00:00 CEST
● 1994/05/21 12:00:00 UTC
● 21-May-1994 12:00:00 GMT
● May 21, 1994 12:00:00 EST
● 19940521 12:00:00 MSK
● 21st May 1994 12:00:00 BST
● 05-21-94 12:00:00 CDT
Table 3 - List of Examples

Step 3

Context
Step 3 (Turn #3) refers to the step after data exploration, where the model makes the plan and
chooses the columns to calculate with. In this step, the model needs to know 1) when and why a
column needs to be cleaned and converted to the right format for follow-on calculations, and
2) how to correctly do the cleaning.

Note: Please input bad tasks here → If the task is not solvable please input it here

Workflow
1. Read and review Turn #3's response. Turn #3 is considered the "Good Example", but it
will still need to be modified to fit the instructions. Please expand this section, scroll
to the last section, and edit/improve this response based on the instructions.
2. Identify the Scenario type for the task and review the relevant sections. Use the following
chart for that.

3. Write the 5 prompts for the "Bad examples". Please follow the detailed instructions
below and the style guidelines.
[IMPORTANT] Style Guidelines
EXPLANATION_AND_PLAN is more concise and general, while CODER_INSTRUCTION is more detailed /
specific and follows the numbered subheadings format.

● New guideline: in Explanation and Plan, state the columns that don't need
conversion and explain why.
● In all cases where no column clean-up is needed, please add some rationale
in EXPLANATION_AND_PLAN similar to:
○ "Since the column `Cost` is already of type int64, it can be used as is for
the numerical calculations."
○ "Given that the column `Cost` is already in the float64 data type, it is
suitable for our calculations."
Consistency in Coder Instruction:
a. Please ensure there are no inconsistent styles (such as **Tidy Data**
instead of Tidy Data).
b. Use "Convert to numeric" for consistency instead of "Convert to float".
c. All column names, methods and variables, including df.info() and df.head(),
should be referred to in backticks `` everywhere (including the first Coder
Instruction block).
Coercing errors: Please ensure we set errors='coerce' in the following cases:

a. df.info reveals issues -> Sample -> Fix and convert the column -> a ValueError occurs.
The Plan and Code should then convert again with errors='coerce' (so the
ValueError does not occur again for this column) (scenario 3).
b. A ValueError occurs for the first time -> Sample -> the Plan and Code should then
convert with errors='coerce' (so the ValueError does not occur again for this
column).

Bucketized Checklist [IMPORTANT]
Along with the checklist above, please also pay attention to the common loss patterns in the
sections below, to ensure these problems are not present in Supervised Fine-Tuning (SFT)
going forward.

Explanation and Plan:


● Incorrect Plan
○ Incorrect assumptions about data content or structure.
○ Incorrect interpretation of data or results.
○ Making unfounded assumptions about the data.
○ Incorrect handling of null values or missing data.
● Using wrong / irrelevant columns for analysis or conversions.
● Does not approach the solution / diverges from the query.
● Hallucinated response.
● Should not ask user for next steps or provide recommendations for next steps.
● Overfitting or unnecessary complexity in solutions.
● Should clearly state the columns expected to be utilized for the analysis in the first plan
block after seeing df.info or df.head.

Style
● Columns not properly referenced (missing backticks, using single / double quotes
instead).
● Literal values incorrectly marked (backticks instead of quotes, and vice versa).
● Ambiguous or unclear print / display instructions.
● Unnecessary blank lines (this applies to all EXPLANATION_AND_PLAN,
CODER_INSTRUCTION and CODE blocks).

Coder Instructions
● Unclear / Ambiguous Coder Instructions.
● Missing or incomplete data transformation steps.
● Missing column names.
● Coder Instruction not numbered / incorrectly numbered.
● Wrong rationale in Coder Instructions.
○ For cleaning.
○ For conversion.
● Code present in Coder Instructions.
● Missing subheaders / Incorrect formatting of subheaders.
● Referring to conversion to "float" instead of numeric.

Categories

Scenario 1: Good data, no cleaning needed


df.info() indicates that a column does not need to be changed before being used. For example,
a column called `Cost` has type int64 according to df.info(), so it can be used as-is for any
downstream calculations.

df.info & df.head output Example

Note: the `Total` and `Price w Markup` columns are used for calculation and already hold
numeric values.

Unset

| Cost Code | Concept                             | Group | Stage | Quantity | Unit | Unit Cost | Total | Markup % | Price w Markup |
|:----------|:------------------------------------|:------|:------|:---------|:-----|:----------|:------|:---------|:---------------|
| 1110      | LP - Concrete paving and sidewalk   | 1     | 15    | Lot      | 1    | 100       | 100   | 15       | 115            |
| 1120      | LP - Head wall                      | 1     | 2     | Lot      | 1    | 200       | 200   | 15       | 230            |
| 1150      | LP - Culvert Installation           | 1     | 2     | Lot      | 1    | 300       | 300   | 15       | 345            |
| 1200      | LP - Lot clearing - Cut trees       | 1     | 1     | Lot      | 1    | 400       | 400   | 15       | 460            |
| 1210      | LP - Lot clearing - Debris Haul off | 1     | 1     | Lot      | 1    | 500       | 500   | 15       | 575            |

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 181 entries, 0 to 180
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   Cost Code       181 non-null    int64
 1   Concept         181 non-null    object
 2   Group           181 non-null    int64
 3   Stage           181 non-null    int64
 4   Quantity        181 non-null    object
 5   Unit            181 non-null    int64
 6   Unit Cost       181 non-null    object
 7   Total           181 non-null    int64
 8   Markup %        181 non-null    int64
 9   Price w Markup  181 non-null    int64
dtypes: int64(7), object(3)
memory usage: 14.3+ KB

None

RL Instruction
Good Example
- Coder instruction should explicitly acknowledge the correct type of the column(s)
- Coder instruction should explicitly explain that no conversion is needed
- In Explanation and Plan, state the columns that don’t need conversion & explain
why

Bad Examples:
1. Coder instruction identifies the wrong column type and says conversion needed
2. Coder instruction identifies the correct column type but still says conversion needed
(when it’s actually not needed)
a. E.g. convert int to str. But the calculation is only doing group by
b. E.g. convert int to float. But nothing requires float calculation
3. Coder instruction didn’t mention the column type and says conversion needed
4. Coder instruction mentions another column that’s not used in the calculation at all.
5. [Less penalty] Coder instruction didn’t mention the column type and didn’t mention
conversion

Good Example
Unset

EXPLANATION_AND_PLAN:

It seems that the total income can be calculated by using the `Total` and `Price w Markup`
columns. I'll look at the appropriate groups and calculate net income by subtracting the
`Total` column from the `Price w Markup` column and summing up.

Because `Total` and `Price w Markup` are both type int64 (seen from the df.info output),
they don't need conversion for the calculation.

CODER_INSTRUCTION:

Transform and aggregate metrics:

1. Filter the data to include the rows where `Group` is equal to 11 or 12.

2. Subtract the `Total` from the `Price w Markup` column.

3. Sum all the values and print the result. Round to two decimals.

Bad Example 1) identifies the wrong column type

Unset

EXPLANATION_AND_PLAN:

It seems that the total income can be calculated by using the `Total` and `Price w Markup`
columns. I'll look at the appropriate groups and calculate net income by subtracting the
`Total` column from the `Price w Markup` column and summing up.

CODER_INSTRUCTION:

Tidy the data:

1. `Total` column is type 'Object' (seen in the df.info output) which means it contains
non-numeric values. I need to convert it to numeric before calculation.

2. Convert the `Total` column to numeric and remove '$' and ','.

Transform and aggregate metrics:

1. Filter the data to include the rows where `Group` is equal to 11 or 12.

2. Subtract the `Total` from the `Price w Markup` column.

3. Sum all the values and print the result. Round to two decimals.

Bad Example 2) Requests unnecessary conversion


Unset

EXPLANATION_AND_PLAN:

It seems that the total income can be calculated by using the `Total` and `Price w Markup`
columns. I'll look at the appropriate groups and calculate net income by subtracting the
`Total` column from the `Price w Markup` column and summing up.

CODER_INSTRUCTION:

Tidy the data:

1. `Total` column is type 'int64' (seen in the df.info output). I need to convert it to
numeric before calculation.

2. Convert the `Total` column to numeric and remove '$' and ','.

Transform and aggregate metrics:

1. Filter the data to include the rows where `Group` is equal to 11 or 12.

2. Subtract the `Total` from the `Price w Markup` column.

3. Sum all the values and print the result. Round to two decimals.

Bad Example 3) didn’t mention column type

Unset

EXPLANATION_AND_PLAN:

It seems that the total income can be calculated by using the `Total` and `Price w Markup`
columns. I'll look at the appropriate groups and calculate net income by subtracting the
`Total` column from the `Price w Markup` column and summing up.

CODER_INSTRUCTION:

Tidy the data:

1. Convert the `Total` column to numeric.

Transform and aggregate metrics:

1. Filter the data to include the rows where `Group` is equal to 11 or 12.

2. Subtract the `Total` from the `Price w Markup` column.

3. Sum all the values and print the result. Round to two decimals.

Bad Example 4) mentions wrong column


Unset

EXPLANATION_AND_PLAN:

It seems that the total income can be calculated by using the `Total` and `Price w Markup`
columns. I'll look at the appropriate groups and calculate net income by subtracting the
`Total` column from the `Price w Markup` column and summing up.

CODER_INSTRUCTION:

Tidy the data:

1. Convert the `Markup %` column to numeric.

Transform and aggregate metrics:

1. Filter the data to include the rows where `Group` is equal to 11 or 12.

2. Subtract the `Total` from the `Price w Markup` column.

3. Sum all the values and print the result. Round to two decimals.

Bad Example 5) didn’t mention anything

Unset

EXPLANATION_AND_PLAN:

It seems that the total income can be calculated by using the `Total` and `Price w Markup`
columns. I'll look at the appropriate groups and calculate net income by subtracting the
`Total` column from the `Price w Markup` column and summing up.

CODER_INSTRUCTION:

Transform and aggregate metrics:

1. Filter the data to include the rows where `Group` is equal to 11 or 12.

2. Subtract the `Total` from the `Price w Markup` column.

3. Sum all the values and print the result. Round to two decimals.

Scenario 2: Bad Data, Need Cleaning


df.info() shows that a column is not in the right format for the analysis/calculation to happen
and therefore needs conversion.
df.head() shows that the column that needs conversion has values that need to be cleaned up
for the conversion to happen successfully.
For instance, a value such as "$43,452" in the `Cost` column requires removing '$' and ','
before converting the column to numeric so it can be used for the calculations later.

df.info & df.head output Example

Note: the `Total` column is used for calculation and needs cleaning before converting to numeric.

Unset

| Cost Code | Concept                             | Group | Stage | Quantity | Unit | Unit Cost | Total     | Markup % | Price w Markup |
|:----------|:------------------------------------|:------|:------|:---------|:-----|:----------|:----------|:---------|:---------------|
| 1110      | LP - Concrete paving and sidewalk   | 1     | 15    | Lot      | 1    | 100       | $100.00   | 15       | 115            |
| 1120      | LP - Head wall                      | 1     | 2     | Lot      | 1    | 200       | $200.00   | 15       | 230            |
| 1150      | LP - Culvert Installation           | 1     | 2     | Lot      | 1    | 300       | $1,300.00 | 15       | 1495           |
| 1200      | LP - Lot clearing - Cut trees       | 1     | 1     | Lot      | 1    | 400       | $1,400.00 | 15       | 1610           |
| 1210      | LP - Lot clearing - Debris Haul off | 1     | 1     | Lot      | 1    | 500       | $2,500.00 | 15       | 2875           |

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 181 entries, 0 to 180
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   Cost Code       181 non-null    int64
 1   Concept         181 non-null    object
 2   Group           181 non-null    int64
 3   Stage           181 non-null    int64
 4   Quantity        181 non-null    object
 5   Unit            181 non-null    int64
 6   Unit Cost       181 non-null    object
 7   Total           181 non-null    object
 8   Markup %        181 non-null    int64
 9   Price w Markup  181 non-null    int64
dtypes: int64(6), object(4)
memory usage: 14.3+ KB

None

RL Instruction
Good Example
- Coder instruction should explicitly acknowledge the correct type of the column(s)
- Coder instruction should explicitly mention that the column is not in the desired format
based on the df.info output and therefore needs conversion
- Coder instruction should explicitly explain what’s wrong with the values, based on
the df.head() output, and what exactly needs to be done for the clean up
- The clean up instruction should be specific, not generic (e.g. removing specific chars
vs removing all non-numeric chars), and based only on the df.head() output

Bad Examples:
1. Coder instruction identifies the wrong column type and says conversion is not needed
2. Coder instruction doesn’t mention anything about column type or conversion
3. [Less penalty] Coder instruction doesn’t explain why conversion or clean up is
needed
4. Coder instruction gives unnecessary clean up
a. E.g. gives instructions to remove “$” when df.head() is not showing any values
that have “$”
b. E.g. gives instructions to remove any non-numeric character from the column
5. Coder instruction gives incorrect clean up
a. E.g. gives instructions to replace “,” with “.” when df.head shows a value like
$1,999; this will lead to an incorrect result

Good Example

Unset

EXPLANATION_AND_PLAN:

It seems that the total income can be calculated by using the `Total` and `Price w Markup`
columns. I'll look at the appropriate groups and calculate net income by subtracting the
`Total` column from the `Price w Markup` column and summing up.

CODER_INSTRUCTION:

Tidy the data:

1. `Price w Markup` is type 'int64' which is good for the calculation. No conversion
needed.

2. Because the `Total` column is type 'object' (seen in the df.info output), which means it
contains non-numeric values, I need to transform it to numeric for the calculation later.

3. Based on the `df.head()` output, the `Total` column has values such as '$1,300' that
contain '$' and ',', so I need to remove these chars before converting to numeric.

4. Remove the '$' and ',' from the `Total` column.

5. Convert the `Total` column to numeric.

Transform and aggregate metrics:

1. Filter the data to include the rows where `Group` is equal to 11 or 12.

2. Subtract the `Total` from the `Price w Markup` column.

3. Sum all the values and print the result. Round to two decimals.
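As an illustration only, a coder step following this Good Example instruction might look roughly like the sketch below; the toy dataframe stands in for the loaded CSV, and its values are invented:

```
import pandas as pd

# Toy dataframe standing in for the loaded CSV (values are illustrative).
df = pd.DataFrame({
    'Group': [11, 12, 1],
    'Total': ['$100.00', '$1,300.00', '$200.00'],
    'Price w Markup': [115, 1495, 230],
})

# Tidy the data: remove the '$' and ',' seen in df.head(), then convert.
df['Total'] = (df['Total']
               .str.replace('$', '', regex=False)
               .str.replace(',', '', regex=False))
df['Total'] = pd.to_numeric(df['Total'])

# Transform and aggregate: keep Group 11 or 12, subtract, sum, round.
mask = df['Group'].isin([11, 12])
net_income = (df.loc[mask, 'Price w Markup'] - df.loc[mask, 'Total']).sum()
print(round(net_income, 2))  # 15.0 + 195.0 -> 210.0
```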

Bad Example 1) identify wrong column type

Unset

EXPLANATION_AND_PLAN:

It seems that the total income can be calculated by using the `Total` and `Price w Markup`
columns. I'll look at the appropriate groups and calculate net income by subtracting the
`Total` column from the `Price w Markup` column and summing up.

CODER_INSTRUCTION:

Tidy the data:

1. Both `Total` and `Price w Markup` columns are type 'int64', which is good for the
calculation, so no conversion is needed.

Transform and aggregate metrics:

1. Filter the data to include the rows where `Group` is equal to 11 or 12.

2. Subtract the `Total` from the `Price w Markup` column.

3. Sum all the values and print the result. Round to two decimals.
Bad Example 2) didn’t mention column type or conversion at all

Unset

EXPLANATION_AND_PLAN:

It seems that the total income can be calculated by using the `Total` and `Price w Markup`
columns. I'll look at the appropriate groups and calculate net income by subtracting the
`Total` column from the `Price w Markup` column and summing up.

CODER_INSTRUCTION:

Transform and aggregate metrics:

1. Filter the data to include the rows where `Group` is equal to 11 or 12.

2. Subtract the `Total` from the `Price w Markup` column.

3. Sum all the values and print the result. Round to two decimals.

Bad Example 3) didn’t mention why conversion and clean up is needed

Unset

EXPLANATION_AND_PLAN:

It seems that the total income can be calculated by using the `Total` and `Price w Markup`
columns. I'll look at the appropriate groups and calculate net income by subtracting the
`Total` column from the `Price w Markup` column and summing up.

CODER_INSTRUCTION:

Tidy the data:

1. Remove the '$' and ',' from `Total` column.

2. Convert the `Total` column to numeric.

Transform and aggregate metrics:

1. Filter the data to include the rows where `Group` is equal to 11 or 12.

2. Subtract the `Total` from the `Price w Markup` column.

3. Sum all the values and print the result. Round to two decimals.

Bad Example 4) Unnecessary clean up

Unset

EXPLANATION_AND_PLAN:

It seems that the total income can be calculated by using the `Total` and `Price w Markup`
columns. I'll look at the appropriate groups and calculate net income by subtracting the
`Total` column from the `Price w Markup` column and summing up.

CODER_INSTRUCTION:

Tidy the data:

1. `Price w Markup` is type 'int64' which is good for the calculation. No conversion
needed.

2. Because the `Total` column is type 'object' (seen in the df.info output), which means it
contains non-numeric values, I need to transform it to numeric for the calculation later.

3. Based on the `df.head()` output, the `Total` column has values such as '$1,300' that
contain '$' and ',', so I need to remove these chars before converting to numeric.

4. Remove non-numeric chars from `Total`

5. Convert the `Total` column to numeric.

Transform and aggregate metrics:

1. Filter the data to include the rows where `Group` is equal to 11 or 12.

2. Subtract the `Total` from the `Price w Markup` column.

3. Sum all the values and print the result. Round to two decimals.

Bad Example 5) Incorrect clean up

Unset

EXPLANATION_AND_PLAN:

It seems that the total income can be calculated by using the `Total` and `Price w Markup`
columns. I'll look at the appropriate groups and calculate net income by subtracting the
`Total` column from the `Price w Markup` column and summing up.

CODER_INSTRUCTION:

Tidy the data:

1. `Price w Markup` is type 'int64' which is good for the calculation. No conversion
needed.

2. Because the `Total` column is type 'object' (seen in the df.info output), which means it
contains non-numeric values, I need to transform it to numeric for the calculation later.

3. Based on the `df.head()` output, the `Total` column has values such as '$1,300' that
contain '$' and ',', so I need to remove these chars before converting to numeric.

4. Remove '$' from `Total`

5. Convert the `Total` column to numeric.


Transform and aggregate metrics:

1. Filter the data to include the rows where `Group` is equal to 11 or 12.

2. Subtract the `Total` from the `Price w Markup` column.

3. Sum all the values and print the result. Round to two decimals.

Scenario 3: Bad Data, Need Sampling


df.info() shows that a column is not in the right format for the analysis/calculation to happen
and therefore needs conversion (e.g. to numeric or datetime format).
However:
- Either the values shown by df.head() seem valid and give no good clue about what cleaning
is needed
- Or df.head() shows completely invalid values, suggesting the column has messy data
<df.head() shows valid value but df.info suggests otherwise>

<df.head() shows invalid value>
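A minimal sketch (with invented values) of the first case, where df.head() looks valid but df.info() suggests otherwise: the visible rows look numeric, yet the dtype is object because a messy value hides further down the column.

```
import pandas as pd

# Toy column: the first rows look numeric, but one entry is not.
df = pd.DataFrame({'Cost': ['100', '200', '300', 'N/A']})

print(df.head(3))        # the visible values look perfectly valid
print(df['Cost'].dtype)  # object -> something in the column is not numeric
```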


RL Instructions
In this case:
- A Good Example should:
- Explain its findings and why it wants to take the additional step of sampling
before proceeding with the analysis
- Include, as the next step, sampling 20 unique incompatible values from this
column for further analysis

- A Bad Example should cover the following cases (separately or jointly):

1. Doesn’t mention that conversion or clean up is needed
2. Doesn’t explain the reason (why conversion is needed, why clean up is needed)
3. Gives instructions for an overly generic clean up
4. The sampling step does not explicitly mention 20 unique values
--------------from Scenario 2-------------
5. Coder instruction identifies the wrong column type and says conversion is not
needed
6. Coder instruction doesn’t mention anything about column type or conversion
7. [Less penalty] Coder instruction doesn’t explain why conversion or clean up
is needed
8. Coder instruction gives unnecessary clean up
a. E.g. gives instructions to remove “$” when df.head() is not showing any
values that have “$”
b. E.g. gives instructions to remove any non-numeric character from the
column
9. Coder instruction gives incorrect clean up
a. E.g. gives instructions to replace “,” with “.” when df.head shows a value
like $1,999; this will lead to an incorrect result

Good Example

Unset

EXPLANATION_AND_PLAN:

I will calculate net profit and profit margin of the products from each supplier to find
out which suppliers are providing the most profitable products.

I can calculate sales and net profit by summing up `Price * Stock` and `(Price -
Cost) * Stock` over all rows, respectively.

I can then calculate profit margin using the formula:

```

Profit Margin (%) = (Net Profit / Sales) * 100

```

However, since the `Cost` column is not numeric, I will check the values that are not
numeric to decide how to convert the `Cost` column for further analysis.

CODER_INSTRUCTION:

From the df.head() output, the first five rows of `Cost` are all integer values. However,
df.info() output shows that the `Cost` column is of type object, which means it has
non-numeric values.

1. Filter all rows of `Cost` that have non-numeric values

2. If there are more than 20 unique non-numeric values, sample and print 20 of them

3. Otherwise, print all unique non-numeric values from `Cost` column
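A sketch of how such a sampling step might be implemented (toy data; `pd.to_numeric` with `errors='coerce'` turns incompatible entries into NaN, which flags exactly the values that block conversion):

```
import pandas as pd

# Toy column mixing numeric-looking strings with messy entries.
df = pd.DataFrame({'Cost': ['100', '200', 'N/A', '1,300', 'TBD', '400']})

# Rows where conversion fails become NaN under errors='coerce'.
non_numeric = df.loc[pd.to_numeric(df['Cost'], errors='coerce').isna(), 'Cost']

unique_bad = pd.Series(non_numeric.unique())
if len(unique_bad) > 20:
    # More than 20 unique incompatible values: sample and print 20 of them.
    print(unique_bad.sample(20, random_state=0).tolist())
else:
    # Otherwise, print all unique non-numeric values.
    print(unique_bad.tolist())
```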

Bad Example

Bad examples here are any responses that do not perform sampling. The bad example
patterns from Scenarios 1 and 2 above can be reused.

Branch: Synthetic Loss Bucket Generation


Objective

In this task, your goal is to generate alternative versions of the provided model responses.
Your challenge is to craft a different version of a response for each potential category of loss
buckets. It is important to focus on creating responses with errors that the model might generate.

Before creating the loss bucket responses you should ensure that the selected/rewritten
response is perfect. To achieve this, remove any content that the model would obviously never
generate. This includes eliminating single non-English words, poor grammar, improperly
formatted text, or any content that is clearly not within the model's capabilities.

When creating the lossy responses, you will encounter different scenarios (or cases) that
require you to generate a varying number of loss buckets, depending on whether or not the
prompt requires the creation of a plot or table.

The following diagram explains the different possible scenarios:

Image 3 - Possible Synthetic Scenarios

Task Instructions
1. Read and Understand the task
○ It is necessary to have a certain level of comprehension about the resolution of the
task before being able to identify and craft loss buckets.
2. Read and Understand the Chosen/Rewritten Final Response
○ This response must be a perfect response.
3. Generate Alternative Responses
○ Follow the category guidelines to create the required number of alternative
responses.
○ Each alternative response should introduce a single type of loss bucket.
○ It’s not mandatory to use the “perfect response” the tasker chose or wrote as a
basis for the new responses, but it is recommended.
○ If the “perfect response” happens to have any previous Loss Bucket, remove
that Loss Bucket to avoid adding extra Loss Buckets to each alternative
response.
○ Your response should be written in the following markdown format:

**Output:**

--ENTER YOUR LOSS BUCKET HERE --

### This is the final selection.

### End the task session after selecting this response. Click the **Submit**
button to end.

Remember

● Your rewrites should be clearly distinct from one another, each showcasing a unique
type of loss bucket.
● The final response in each case is the only aspect you need to modify. Avoid changing
the task itself.
Seeking loss buckets

Loss buckets are specific characteristics that make a model response bad. The table below
shows all potential loss buckets along with their corresponding explanations:

| Loss bucket | Explanation |
|:------------|:------------|
| Weird/Wrong Final Response Pattern | The final response refers to the previous analysis. It could include, but is not limited to, “...data analysis objective...”, “...based on the analysis...” or similar expressions. |
| Incorrect/Hallucinated Final Response | The final response offers either completely incorrect information or information that has not been directly extracted from a preceding step’s output. Importantly, for the final response to count as incorrect or hallucinated, the preceding code output has to be a valid and correct output that provides the necessary information for the successful resolution of the user’s query. If the preceding code output or the logical process of the task is incorrect, the final response will naturally be incorrect; this does not constitute a Loss Bucket, since it is the “correct response for the wrong logical process”. |
| Final Response is Not Complete by Itself | In order to answer the prompt, the final response refers to an output produced in an earlier step, yet it does not display it within the final response. E.g. “The max can be found in the previous step”, “I’ve calculated the highest average in the last turn...”, “Refer to the table above for...”. |
| Incorrect/Hallucinated Insights | The final response is fundamentally accurate but includes insights that are either incorrect or not derived from the analysis conducted in previous steps. NOTE: an insight is a valuable or meaningful observation, interpretation, or understanding gained from analyzing data or text. It often involves identifying patterns, trends, relationships, or significant information within the given context. |
| Final Response Does Not Fully Satisfy User Query | The final response addresses a portion of the main prompt but does not fulfill all the requested information. |
| Table in response is not formatted well (OPTIONAL) | When the main prompt requires a table, the final response should present the requested table. If there are too many rows (more than 20), the table should display only three rows; otherwise, it should display the complete table. If it is truncated, this should be explicitly specified in the response, e.g. “This table is a truncated view of the records...”. This loss bucket occurs when the model starts showing the table but stops abruptly, e.g. prints 14 complete rows and does not complete the 15th row. Non-markdown tables should not be included in this category. |
| Unnecessary Insights for Plots (OPTIONAL) | In cases where the main prompt explicitly requests a plot, the expected final response is a confirmation message stating that the plot has been generated, with no additional insights regarding the plot or the data. This loss bucket occurs when the final response generates a correct but unnecessary (not requested) insight. Be aware that if the prompt does not ask for a plot but the resolution involves displaying one, then providing insight might be necessary, and this loss bucket does not apply. NOTE: an insight is a valuable or meaningful observation, interpretation, or understanding gained from analyzing data or text. |
| Image Tags in Final Response (OPTIONAL) | When working with plots, the final response may feature an image tag instead of presenting the plot directly or referring to a plot generated in a previous step, e.g. [Image of bar chart showing the frequency of vehicles]. This loss bucket is only possible when plots/graphs have taken part in the resolution of the task. |
| Not a Final Response | Instead of providing a Final Response to conclude the task, the model provides a new set of Output/Instructions to continue the process, even though there is enough information to conclude the task. |
| Unnecessary Explanation in the Final Response | The model generates unnecessary explanation in the final response, e.g. offers explanations detailing the steps taken to solve the prompt. |
| Use of plural Pronouns | The final response uses plural pronouns (e.g. “we”, “our”) instead of the first-person singular. |
| Follow up Suggestions | The model suggests conducting further analysis and does not acknowledge that all the necessary information has already been found to address the main prompt. It may offer new coder instructions. Note: “Let me know if you have any other request” or similar phrases are NOT a loss bucket. |
| Uses the File's Name | The final response contains the filename of the CSV file, e.g. “...based on the data in ‘AMSA INDUSTRIAL - INVENTORY.csv’.” |
Table 4 - List of Synthetic Loss Bucket
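As a sketch of the truncation behaviour described in the “Table in response is not formatted well” bucket above (toy dataframe; the note wording follows the rubric, and `to_markdown` requires the `tabulate` package):

```
import pandas as pd

df = pd.DataFrame({'Item': range(25), 'Value': range(25)})

# More than 20 rows: display only three and state that the view is truncated.
if len(df) > 20:
    print(df.head(3).to_markdown(index=False))
    print('This table is a truncated view of the records...')
else:
    print(df.to_markdown(index=False))
```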

Synthetic Loss Bucket Resolution Example


Note: these are suggestions only; please make the alternatives as diverse and meaningful as
possible.

Main Prompt
This is the main prompt to be resolved in the task.
Original Task Resolution
This is the resolution of the original task through code execution and code output.

Final Responses
After reaching a conclusion, the model presents a set of 2 final responses. The tasker who solved
the task selected one of these two options (the one considered best in the tasker's judgment)
and refined this response if necessary.

As previously explained, in the case of this task, at least 1 response would be a “perfect”
response, and the other/s will present one or many loss buckets.

These are the set of Final Responses for the task at hand:

The contributor chose answer 1 and rewrote it to display only the first 3 records, so as not to
display the entire table, in accordance with the rubrics.

Response 2 contains an unnecessary explanation of the displayed table.

Creation of New Loss Buckets


Now, it’s time to create new Loss Buckets according to each category.

Each Loss Bucket will have a window to complete with a Final Answer that contains only the Loss
Bucket requested in that step.
If you happen to have any doubt regarding what each Loss Bucket refers to, you can hover over
the symbol, and get a quick explanation of each Loss Bucket.
For example, for the Weird / Wrong Final Response Pattern, you can find the following quick
explanation

[Mandatory] Weird / Wrong Final Response Pattern.


Resolution example for Weird / Wrong Final Response Pattern.
In this case, the phrase “The data analysis objective has been achieved…” was added to
generate the desired Loss Bucket.
[Mandatory] Incorrect / Hallucinated Final Response.
Resolution example for Incorrect / Hallucinated Final Response.
To achieve this Loss Bucket, an incorrect/hallucinated result was added to the response along
with an insight about the result (a new requirement).

[Mandatory] Final Response is Not Complete by Itself.


Resolution example for Final Response is Not Complete by Itself.
To create this Loss Bucket, a reference is made to a previously created variable that stores the
desired result, but the actual result is never printed in the response.
[Mandatory] Incorrect / Hallucinated Insights.
Resolution example for Incorrect / Hallucinated Insights.
The addition of a hallucinated insight (not based on any calculation or analysis conducted on
the dataset) generates this Loss Bucket. In this case, the following sentence is the Loss Bucket
“This could result in significant savings on larger purchases, especially for bulk items”.
[Mandatory] Final Response Does Not Fully Satisfy User Query.
Resolution example for Final Response Does Not Fully Satisfy User Query.
This Loss Bucket answers the main prompt, but only partially.
It can be difficult to create this bucket when the prompt is too simple and asks for a single
value or calculation; a simple way to solve these cases is just to state that the calculation was
performed. An example of this technique with this prompt would be “The average discount for items
purchased from Nursery B was calculated”. The prompt gets responded to, but not completely.
[Optional] Loss buckets.
There are optional Loss Buckets, these Loss Buckets will need to be created depending on the task
at hand.
Optional loss buckets: “Table in response is not formatted well”, “Unnecessary Insights for
Plots” and “Image Tags in Final Response”.
If the resulting DataFrame is to be printed, the correct response would print only a truncated view
of the first 3 rows. To create a Loss Bucket, we just need to print any number of rows other than 3,
which is the case in this example, where the first 2 rows of the DataFrame get printed.
[Optional] Unnecessary Insights for Plots.
The following case is an example of the optional Loss Bucket, “Unnecessary Insights for Plots”.
The main prompt requires the creation of a simple plot, just a bar chart.

To create a Loss Bucket, it is only necessary to add an insight referring to the requested bar chart,
since no insights were requested about it. This insight should be correct (not incorrect or
hallucinated).
[Optional] Image Tags in Final Response.
The following case is an example of the optional Loss Bucket, “Image Tags in Final Response”.
The main prompt requires the creation of a simple plot, just a bar chart.

To create a Loss Bucket, it is only necessary to add an image tag to the final response, referencing
the created bar chart.
[Mandatory] Not a Final Response.
Many times we have found that the model, once the response to the main prompt has been reached,
wants to continue with the analysis.
The idea of this loss bucket is to recreate this situation.
[Mandatory] Unnecessary Explanation in the Final Response.
Sometimes we find that the model continues with some unnecessary suggestions as to how we
might deepen the analysis.
The idea of this loss bucket is to recreate this situation.
[Mandatory] Use of plural Pronouns.
A requirement for the project has been to correct responses in which the model speaks in the plural (e.g. “we”) instead of the first-person singular.
How to complete a task

https://frrautneduar-my.sharepoint.com/:v:/g/personal/pablo_caneva_frra_utn_edu_ar/EYv0sq
b3hspOsz75YKQ83yABzVVkJXshzDmBSobY0I72tQ?e=vj180R

A problem we may encounter.


Problem: the image is not displayed.
Complete the task as if the image were there. You can run the code in the sphere to see the plot if necessary.
Note: the requirements for the project are constantly changing; this guide may be modified, so we
recommend consulting it regularly and even keeping it open while working.
Happy tasking!
