Bulba Code ICE - RLHF Synthetic & Organic Loss
Executive Summary
In this document you will find instructions and examples for the branches and key points of
vanilla V3 (Advanced E 2 E ICE RLHF):
● Organic Branch:
○ Error Recovery.
○ Lossy Final Response.
● Synthetic Branch:
○ Loss Buckets for final response.
● Step 3
○ Loss Buckets for step 3.
Table of Contents
Latest changes:
Tasks
Branches
Error Recovery
Recovery Strategy
Frequent causes of reject:
Detailed Example
Successful error recovery:
Unnecessary/Incorrect sampling:
Unnecessary sampling Scenario 1 / 2:
Unnecessary sampling Scenario 3:
Wrong recovery strategy and not sampling after error
Sampling Variations
How to complete a task (video)
Lossy Final Response
Loss buckets
How to complete a task
Scenario type
Variations examples
Branch: Synthetic Loss Bucket Generation
Objective
Task Instructions
Remember
Seeking loss buckets
Synthetic Loss Bucket Resolution Example
How to complete a task
31-Mar-2024 v2: Step 3; Error Recovery: Filtering strategy (diagram)
03-Apr-2024 v2.2: Update of error recovery guidelines - New errors / Hints (Types of Errors, Quick Hints)
- Assuming `df` as the name of the DataFrame before defining it is not an issue.
- The use of headers and subheaders in the coder instructions to explain the thought process behind them is mandatory from now on. Refer to Subheaders for the details.
General Understanding
Vanilla Project
The Vanilla Project is part of an RLHF process to improve LLM performance. It is achieved
by solving prompts that require ICE. The language chosen for this purpose is Python.
ICE stands for Implicit Code Execution: each task you complete will execute several steps of
Python code.
Vanilla encompasses two distinct types of tasks, each situated in a separate branch. The two
branches, Organic and Synthetic, coexist and pursue the same overall objective, but they take
different approaches to achieving it.
The Organic Branch presents a multi-turn chat task where a primary prompt must be answered
by exploring the data, while also identifying Loss Buckets (specific unexpected model
behaviours) and assisting the model in recovering from errors when they occur.
In each turn, you are in charge of guiding the model through a logical process. At each step, you
will need to evaluate multiple responses presented to you, aiming to select a better one, or in
some cases, rewrite and refine the responses provided.
Conversely, the Synthetic Branch employs a different task structure, where the prompt has
already been addressed. Here, the aim is to craft (synthetically) 'lossy' final responses in
alignment with the specified 'Loss Bucket' categories.
Tasks Taxonomy
The tasks for the Organic Branch consist of several turns, each with 2 Responses, in which a
prompt is answered. The LLM takes on 2 roles during the task, acting either as an
Instructor or as a Coder.
The interaction between the two roles will continue throughout the task. The output from each
model feeds into the other for the next turn. When the Coder model provides enough
information for the dataset to answer the prompt, the Instructor will display the Final
Responses in the last turn.
Your duty is to orchestrate the models' interaction by choosing the best response in each turn
and grading the Responses across different topics.
This new approach to the good old vanilla task comes with a few changes in the methods and
rubrics used to solve the task. Let’s dive into them:
1. If any data type conversion is necessary (i.e. an object column that has numeric data or
an object column that has datetime data), we will use the first five rows of the dataset,
and use what we’ve learned from them to perform the conversion.
2. If you followed all the recommended steps for column conversion, and you run into a
code error that is not a consequence of poor/bad/incomplete coder instructions but a
result of an error that was impossible to prevent with the information available at that
point of the task, do not change the previous instructions using information about the
dataset gathered from the error. The idea is to teach the model how to recover from this
initial error.
3. Once you’ve helped the model reach a final response, and possibly recover from a code
error along the way, it’s time to detect ‘Loss Buckets’. You can find everything about
them here: Seeking loss buckets.
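The conversion rule in point 1 can be sketched as follows. This is a minimal illustration, not project code: the `Price` column and its values are invented for the example.

```python
import pandas as pd

# Hypothetical object column whose first five rows show a '$' prefix and
# ',' thousands separators
df = pd.DataFrame({'Price': ['$1,200.50', '$850.00', '$2,300.75',
                             '$99.99', '$1,050.00']})

# Use what the first five rows taught us: strip '$' and ',' before converting
df['Price'] = pd.to_numeric(
    df['Price'].str.replace('$', '', regex=False).str.replace(',', '', regex=False)
)
print(df['Price'].dtype)  # float64
```

If a value later in the column breaks this strategy, that is exactly the unavoidable-error situation described in point 2.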
I know this all seems daunting right now, and you’re probably a little lost, but we’ll take you step
by step to explain everything you need to know in the best way possible.
- If the prompt only requests the creation of new column(s) or DataFrames, the Final Response
should show a truncated view of the first 3 rows of the final DataFrame to display the resulting
column(s) or DataFrame.
Moreover, when the prompt asks for merging, cleaning, or modifying the table
structure/headers, the perfect final response should include a glimpse of the
new table.
Error Recovery
There will be cases where the code runs into an error that WAS IMPOSSIBLE TO PREVENT with
the information about the dataset available at that point. Do not redo the step if this happens!
Instead, use the next few turns to investigate and fix the error. This section will go over some
frequently occurring errors and the recommended approaches to address them.
For more information about the scenario, you can check Scenario type, and for more examples
Variations examples.
What should I do when the prompt requires filtering a column by a specific value?
In the previous good old vanilla, the approach was to check the unique values looking for value
variations (“APRIL”, “april”, “April”, “ April ”). From now on, if the value appears in the first five
rows, you can filter right away. This way, the model has a higher chance of reaching the solution
faster. Caveat: the odds of hitting an error are also higher, so use the Recovery Strategy
appropriate for each case.
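A minimal sketch of this direct-filter approach (the `Month` and `Sales` columns and their values are hypothetical):

```python
import pandas as pd

# 'April' is visible in the first five rows, so filter right away instead of
# listing the unique values first
df = pd.DataFrame({'Month': ['April', 'May', 'April', 'June', 'April'],
                   'Sales': [100, 200, 150, 300, 120]})
april_sales = df.loc[df['Month'] == 'April', 'Sales'].sum()
print(april_sales)  # 370
```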
Image 2 - Organic Scenarios and error categorization
IMPORTANT: If the errors stem from poor instructions (using incorrect methods, poor data
handling, etc.) or model hallucinations (column names that don’t exist, data that can’t be seen
in the first 5 rows, unnecessary column conversions, etc.), these errors can and must be
avoided. For all the details, read the Recovery Strategy section.
Note: a common avoidable error is that the code does not follow the instructions.
Recovery Strategy
In an Error Recovery scenario, the model might fail during the code execution phase, and
then, in the following step of coder instructions, it should formulate a recovery plan considering
these new insights. As a contributor, you would only intervene with this recovery plan if the
model fails to utilize this new information to handle a proper recovery.
On the other hand, if the error is Not Suitable for Error Recovery scenario, your job as a
contributor is to intervene with the previous coder's instructions and guide the model along the
correct path to avoid this error. Do not let the model fail because of hallucinations.
If the model doesn’t follow the previous coder's instructions, your job is to modify the given
instructions as many times as necessary until the model produces the correct code.
Code modification should ONLY be performed, as a last resort, when the model doesn’t follow
the previous coder's instructions.
Types of Errors:
A ValueError is raised when a function receives an argument of the correct type but an
inappropriate value. Common cases:
- The model tries to convert a column to numeric or datetime after recognizing the pattern
seen in the EDA step, but the conversion fails because the column contains unseen
patterns.
- Recommended action →:
● Sample 20 unique rows from that column that will cause an error during conversion.
● Based on these 20 rows: if there are any patterns, update the conversion strategy to
handle them.
● Retry the conversion with the new strategy, but also set errors='coerce' so that any rows
that cannot be converted become NaNs and are then dropped.
- The model tries to convert to numeric or datetime without recognizing the pattern seen
in the EDA step. This error is avoidable: intervene in the previous coder instructions.
A KeyError is raised when a mapping key is accessed and isn't found in the mapping. This
typically happens because the model references a column name incorrectly (for example, a
hallucinated or misspelled column name).
- Recommended action →: When you encounter a Not Suitable KeyError, your job as a
contributor is to intervene and guide the model along the correct path; do not let the
model fail because of hallucinations.
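The ValueError recovery described above (sample the unique failing values, then retry with errors='coerce') could be sketched like this; the `Total` column and its values are hypothetical:

```python
import pandas as pd
import numpy as np

# Hypothetical `Total` column where numeric conversion fails
df = pd.DataFrame({'Total': ['$1,770.00', '#REF!', '$250.00', 'N/A', '$3,100.50']})

# Step 1: find the unique values that fail numeric conversion
failing = df.loc[pd.to_numeric(df['Total'], errors='coerce').isna(), 'Total'].unique()

# Step 2: sample up to 20 of them (a random sample is preferred)
rng = np.random.default_rng(0)
sample = rng.choice(failing, size=min(20, len(failing)), replace=False)
print(sample)
```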
An AttributeError is raised when we try to access or use an attribute of an object that does
not exist or is not accessible in the context where it's being used.
Common cases:
- When the model tries to convert an object type column: The error is raised when the
condition for using .str accessor is not met, i.e., when the value being accessed is not a
string. In pandas, when a column contains elements of different data types (e.g., strings,
integers, floats), pandas may assign the object data type to that column. This allows
pandas to handle columns with mixed data types, but it can also indicate that the
column may not be as efficiently handled as columns with homogeneous data types.
Therefore, the error will occur because the column contains mixed string and numeric
data, and using the str method on a numeric value will cause the error.
- Recommended action →: When you encounter an AttributeError (like the one seen above),
your job as a contributor is to repeat the instruction in the next step, but add an
instruction making sure the model casts or treats the column as the correct type (in the
example, casting it to string), avoiding the error.
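A minimal reproduction of this AttributeError and its recovery; the `Code` column and its values are invented for illustration:

```python
import pandas as pd

# Mixed string and numeric data makes the column dtype `object`
df = pd.DataFrame({'Code': ['A-1', 'B-2', 3, 'C-4']})

# Applying a string method element-wise fails on the numeric value
try:
    df['Code'].apply(lambda x: x.replace('-', ''))
except AttributeError as exc:
    print(exc)  # 'int' object has no attribute 'replace'

# Recovery: cast the column to string first so every value supports `.str`
cleaned = df['Code'].astype(str).str.replace('-', '', regex=False)
print(cleaned.tolist())  # ['A1', 'B2', '3', 'C4']
```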
- Recommended action →: When you encounter any of these errors, your job as a
contributor is to repeat the instruction in the next step, adding an instruction to make
sure the error doesn’t repeat itself. You’ll have to use your coding knowledge to figure
out the cause of the error and the best way to avoid it.
General Errors
- When you encounter any other type of error, you should analyze whether the code error
was avoidable or unavoidable. If you conclude that the error was avoidable because you
could have provided better instructions, go back to the previous coder instructions and
modify them to prevent the model from failing. If you encounter an error that you
couldn’t have avoided, let the model recover in the next coder instructions, using this
new information to create proper coder instructions.
Quick Hints
Sampling
Whenever sampling is needed, whether in scenario 3 or in scenario 2 after an error occurs, it
must be a sample of 20 unique elements (preferably a random sample) that are incompatible
with the required conversion (i.e. 20 unique random elements where the conversion fails, or
that have non-numeric values; it’s all relative to the case you’re working with). Refer to
Planner Step 5 and Coder Step 6 for a very clear example.
The following image has a good set of instructions. After failing, sampling, and seeing that
many rows in the columns of interest had non-numeric characters interfering with the
conversion of the columns to a numeric data type, the contributor instructs the model to
perform a clean-up before performing the conversion using the parameter `errors='coerce'`.
If you sample a column and obtain valuable information to clean it before performing the data
type conversion, but then just use `errors='coerce'` as a parameter without cleaning the data,
the sample becomes pointless: you’re simply eliminating all the rows of the column that were
causing the original error, which you could have done without any sample. The idea is to use
the information gathered while sampling, clean the data, convert the column, and end up
with as much useful data as possible (limited by how good the sampling actually was).
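The difference between the two approaches can be sketched as follows (column and values hypothetical): cleaning with the sampled insight keeps most rows, while coercing blindly drops them all.

```python
import pandas as pd

df = pd.DataFrame({'Total': ['$1,770.00', '#REF!', '$250.00', '$3,100.50']})

# Blind coercion: every '$…' value fails to parse, so nothing survives
blind = pd.to_numeric(df['Total'], errors='coerce').dropna()
print(len(blind))  # 0

# Sample-informed cleaning first, then coerce: only '#REF!' is dropped
cleaned = df['Total'].str.replace('$', '', regex=False).str.replace(',', '', regex=False)
kept = pd.to_numeric(cleaned, errors='coerce').dropna()
print(len(kept))  # 3
```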
Frequent causes of reject:
● Unnecessary / Incorrect sampling
In these cases there is an incorrect or unnecessary sampling of column/s, you can refer
to the following links for specific examples.
Avoid submitting incomplete tasks either because none of the responses reached
the final response section or because both of them have a `missing flow state`
message.
Detailed Example:
See Steps 1-4 below to understand the context and reason for the Code Error.
Note: the `Total` column is used for calculation and therefore needs cleaning before converting
to numeric.
Planner Step 1
Python
CODER_INSTRUCTION:
Coder Step 2
Unset
import pandas as pd
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
# Read the CSV file into a DataFrame
df = pd.read_csv('R_Plan_Budgetcsv_-_BT_Budget.csv')
print(df.info())
Code Output
As we see below df.info() shows that the column Total is not in the right format for the
analysis/calculation to happen and therefore needs conversion (it’s of type Object instead of
numeric).
df.head() shows that the column that needs conversion has values that need to be cleaned up for
the conversion to happen successfully. For instance, if there’s a value such as “$1,770.00” in the
Total column which requires us to remove ‘$’ and ‘,’ before converting the column to numeric so it
can be used for the calculations later.
Unset
<class 'pandas.core.frame.DataFrame'>
[Truncated `df.info()` and `df.head()` output]
Planner Step 3
The model will try to clean up this column based on the information in df.head()
Python
OUTPUT:
To examine the costs of the project at each stage, I'll calculate the Total
cost per stage. To do this, I'll calculate the sum of `Total` for every
`Stage`. Then I'll look for the highest and lowest stage and the overall
distribution.
CODER_INSTRUCTION:
1. Because the `Total` column contains non-numeric values (as seen in the
previous `df.info()` output, it is of type `Object`), I need to clean it.
1. Calculate the Sum of `Total` column per `Stage` by using the groupby method.
Display outputs:
1. Identify the stage with the highest `StageTotal` and print the stage
alongside the `StageTotal` for that stage.
2. Identify the stage with the lowest `StageTotal` and print the stage
alongside the `StageTotal` for that stage.
3. The dataframe is short so we can show the full distribution. Sort the
results by `StageTotal`.
Coder Step 4
The coder writes code based on the following instructions:
Python
df['Total'] = pd.to_numeric(df['Total'])
df_results = df.groupby('Stage')['Total'].sum().reset_index()
print(df_results)
Python
CODE_EXECUTION_ERROR:
[Truncated ValueError traceback from `pd.to_numeric`]
Given the context above, we now want examples where the following steps are critical:
1. Write a plan for inspecting 20 unique rows that will raise an error during conversion
2. Code to sample and output 20 such unique rows
3. Plan that formulates a strategy for cleaning this column based on the Code Output in the
above step
4. Code to clean the column based on the plan above and then proceed with the rest of the
analysis
Note: When trying to convert the column a second time, we will use errors='coerce' so
that all problematic rows are converted to NaNs and do not cause another error. This
should be followed by dropping the rows that now contain NaN values, before
proceeding with the rest of the analysis. See Planner Step 6 and Coder Step 7 for more
details.
Planner Step 5
Unset
OUTPUT:
CODER_INSTRUCTION:
3. Otherwise, print all unique non-numeric values from the `Total` column
Coder Step 6
Python
import numpy as np
# Identify the unique values that fail numeric conversion
invalid_mask = pd.to_numeric(df['Total'], errors='coerce').isna()
invalid_values = df.loc[invalid_mask, 'Total'].unique()
if len(invalid_values) == 0:
    print('All values in `Total` are numeric')
else:
    print(invalid_values)
IMPORTANT: If the instructions don’t result in this type of code, feel free to divide the
instructions into more steps and specify the methods to use.
Code Output:
Unset
In this case the `Total` column has values that can never be converted to numeric, so these
values need to be dropped from the column. See the following Plan and Code for doing this:
It’s very important that you use the information gathered in the sample to clean the data
before converting to numeric using `errors='coerce'`.
Planner Step 6
Unset
OUTPUT:
It seems there are string values such as `#REF!` in the `Total` column that
can't be converted to numeric. I'll update the conversion of the `Total` column
to drop the non-numeric values before proceeding with the rest of the analysis.
I will then calculate the Total cost per stage. To do this, I'll calculate the
sum of `Total` for every `Stage`. Then I'll look for the highest and lowest
stage and the overall distribution.
CODER_INSTRUCTION:
First clean and convert columns to correct types for the analysis later:
1. Convert the `Total` column to numeric for the calculations later, using
errors='coerce' within pd.to_numeric() to convert non-numeric values to NaN
2. Remove all rows with NaN values in the `Total` column before proceeding with
the rest of the analysis
1. Calculate the Sum of `Total` column per `Stage` by using the groupby method.
1. Identify the stage with the highest `StageTotal` and print the stage
alongside the `StageTotal` for that stage.
2. Identify the stage with the lowest `StageTotal` and print the stage
alongside the `StageTotal` for that stage.
3. The dataframe is short so we can show the full distribution. Sort the
results by `StageTotal`.
Coder Step 7
Python
# Removing '$' and ',' from the 'Total' column and converting to numeric, using
# errors='coerce' to convert non-numeric values to NaN
df['Total'] = pd.to_numeric(
    df['Total'].str.replace('$', '', regex=False).str.replace(',', '', regex=False),
    errors='coerce')
df = df.dropna(subset=['Total'])
df_results = df.groupby('Stage')['Total'].sum().reset_index()
print(df_results.to_markdown(index=False))
Note: These are suggestions only, please use as diverse and meaningful alternatives as
possible.
Main Prompt
This is the Code Output from the second turn. As the prompt asks, you need to work with the
`Cost` column. So, it is the one you should analyze to establish the scenario.
Based on the previous ER diagram, scenario 2 is the most suitable. Therefore, it is correct to
clean and make the conversion. If some of the signs shown by the df.head() are not instructed to
be cleaned, the tasker must modify the instructions to address them. In this case it is not
necessary because the model takes care of the ‘$’ currency symbol. In the third turn (just in
scenario 2) it is essential to differentiate the cleaning from the conversion by using different
subheaders. In this case ’Tidy the data’ and ‘Convert the data’.
This is what the fourth turn looks like:
This is the Code Output. It appears that the model falls into a ValueError because there’s a
non-numeric value that was not shown in the df.head() that makes the conversion fail.
When you face an error like this, where there is an issue with the dataset that was impossible
for the df.head() to reveal, you should classify it as an unavoidable error.
This is what the fifth turn looks like:
This is what the sixth turn looks like:
This is the Code Output. By looking at the unique values you can identify that a more complex
cleaning is required.
Unnecessary/Incorrect sampling:
Unnecessary sampling Scenario 1 / 2
The following task snippets exemplify a situation where the model prints unique values
unnecessarily for scenario 1 or 2. These scenarios don’t require sampling because the relevant
columns for the analysis are in the correct data type and don’t require conversion, or the
column/s need a data type conversion, and the visible rows in the data exploration have a
consistent format. For more details about the scenarios go to Scenario type.
Note: These are suggestions only, please use as diverse and meaningful alternatives as
possible.
Main Prompt
This is the turn 2 Code Output. As the prompt asks, you need to work with the `DATE` column.
So, it is the one you should analyze to establish the scenario.
It’s visible that the ‘DATE’ column is an object column that needs to be converted, and the
format in the first 5 rows is consistent.
Turn 3
The model/contributor incorrectly instructs to print the unique values (even worse, it prints
all of the unique values).
Based on the previous ER diagram, scenario 2 is the most suitable. Therefore, printing the
unique values is not the correct approach. The model should have converted the 'DATE' column
to DateTime, assuming that the datetime format in the entire column is consistent with the
format shown in the first five rows of the dataset.
In these scenarios, there is NO NEED FOR SAMPLING, the contributor must re-write the
instructions to make the conversion directly. A good rewrite looks like the following image.
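Such a rewrite boils down to a direct conversion that trusts the format seen in the first five rows. A minimal sketch (the dates are invented for illustration):

```python
import pandas as pd

# The first five rows show a consistent 'YYYY-MM-DD' format, so convert
# directly without any sampling
df = pd.DataFrame({'DATE': ['2021-01-05', '2021-02-10', '2021-03-15',
                            '2021-04-20', '2021-05-25']})
df['DATE'] = pd.to_datetime(df['DATE'], format='%Y-%m-%d')
print(df['DATE'].dtype)  # datetime64[ns]
```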
Main Prompt
This is the main prompt of the task.
This is the turn 2 Code Output. As the prompt asks, you need to work with the ‘Total charges
February’, ‘Total charges January’, ‘Total paid February’, ‘Total paid January’, ‘Final balance
February’, and ‘Final balance January’ columns. It’s necessary to analyze these columns to
establish the scenario.
Turn 3
As evidenced by the first five rows, one value contains both commas and periods, while the
rest contain only periods. Since there is more than one format present in the first five rows,
it is a scenario 3 case and it’s necessary to sample the column values: sample 20 unique,
incompatible values from the column for further analysis (identical to the sampling that has
to be performed after an error occurs).
In this step the contributor/model decided to print the entirety of the unique values of the
relevant columns, which is not the way to handle this situation, causing an immediate rejection
of the task.
It’s also noticeable that the contributor decided to remove white spaces from the column
names “just in case”; that’s also an incorrect instruction (column name spaces can’t be
stripped just in case).
Note: These are suggestions only, please use as diverse and meaningful alternatives as possible.
Main Prompt
Above is a picture of the dataset and highlighted are the relevant columns necessary to solve the
prompt:
This is the third turn, where the model provided incorrect coder instructions. The main problem
here is that the task falls into the third scenario. As you’ll see, the model makes the
conversion right away. In this case the contributor should have modified the instructions
accordingly, since sampling was required.
This is what the fourth turn looks like:
This is the Code Output. The model falls into a ValueError because there are mixed formats
that were visible in the df.head(), yet the conversion is being made assuming a single format.
In cases like this, where errors occur due to incorrect coder instructions, the error is classified
as avoidable. The appropriate action would have been to modify the previous instructions
accordingly.
This is what the fifth turn looks like. The contributor incorrectly proceeded with the analysis
without modifying the previous coder instructions as required.
This is what the sixth turn looks like.
This is the Code Output. Although it seems correct as it does not show any issue, the way the
error was addressed is incorrect.
Sampling Variations
The plan and code for cleaning up should vary based on the output of the Sampling step (where
20 rows are sampled).
Please look at the following suggestions to synthetically generate variations for the output of the
sampling step and then write the follow-up Code Recovery accordingly.
Note: these are suggestions only, please use as diverse and meaningful alternatives as possible.
Note 1: You can also include columns that have multiple related formatting problems. For
example, a mix of commas and periods. Or a mix of currency symbols.
Note 2: Often you want to make it so that the issue is consistent within the column. For example,
if you add white space, it can be added to all column values.
Modifying the instructions in any turn after the EDA is allowed if the changes are based on the
information seen in the first five rows, or any code output.
https://frrautneduar-my.sharepoint.com/:v:/g/personal/pablo_caneva_frra_utn_edu_ar/ESClOr
liA5tImWjV71aqBuoB9wovIXu4kcGPDGSxevo-AQ?e=ylqUki
The primary objective of this section is to identify tasks where at least one response is perfect
and the other contains a loss bucket.
Initially, you’ll get the traditional two final responses the model produced. You have to rate the
responses as you did in vanilla, but a brand new feature is that you’ll need to identify and label
each loss bucket present in each final response.
To label the “Loss Buckets”, you can use the following table Loss buckets
Note: if the responses have no loss bucket, then just select “None of the above”.
After these previous steps, you’ll find the traditional side-by-side rating and the chance to refine
the response, to get the best final response possible.
After the response has been refined, you’ll get a questionnaire asking if this was a final
response. Once you’ve confirmed that it was, the model will regenerate 3 new responses for
the non-chosen response.
Your job consists of choosing the most lossy of the 3 new responses, then completing the
ratings and labelling the corresponding Loss Buckets present in each of the 3 new responses
(if a response has no loss bucket, just select “None of the above”), and comparing them
against the original (the non-chosen response from the previous selection) in a small
Side-By-Side bar.
Finally, you’ll be requested to identify if any of the 5 responses (the refined response, the
non-chosen response, and the three responses generated by the model), present any loss bucket.
And now you’re done; it’s time to “End Session” and submit your completed task.
Loss buckets
Loss buckets are specific characteristics that make a model response bad. You will need to
review the Final Response and identify if any of the loss buckets listed in Table 1 are present.
The desired situation is when the Final Response chosen by the tasker is perfect (whether it
needs to be rewritten or not), and the non-chosen response contains one loss bucket. If this
situation occurs, the task is considered a good one.
● Table in response is not formatted well: When the main prompt requires a table, the final
response should present the requested table. If there are too many rows (more than 20), the
table should display only three rows; otherwise, it should display the complete table. If it’s
truncated, this should be explicitly specified in the response, e.g. “This table is a truncated
view of the records…”. This loss bucket also occurs when the model starts showing the table
from the beginning but stops abruptly, e.g. it printed 14 complete rows and does not complete
the 15th row. Non-marked-down tables should not be included in this category.
● Final Response is Not complete by itself: In order to answer the prompt, the final response
refers to an output produced in an earlier step, yet it does not display it within the final
response. I.e. “The max can be found in the previous step“, “I’ve calculated the highest
average in the last turn…“, “Refer to the table above for…”.
● Unnecessary Insights for Plots: In cases where the main prompt explicitly requests a plot,
the expected final response output is a confirmation message stating that the plot has been
generated. There should be no additional insights regarding the plot or the data. Be aware
that if the prompt does not ask for a plot but the resolution involves displaying one, then
providing insight might be necessary, and this loss bucket does not apply to those situations.
NOTE: insight refers to a valuable or meaningful observation, interpretation, or understanding
gained from analyzing data or text. It often involves identifying patterns, trends,
relationships, or significant information within the given context.
● Final Response Does Not Fully Satisfy User Query: The final response addresses a portion of
the main prompt but does not fulfill all the requested information.
● Missing Flow State: The model fails and displays the `missing flow state` message instead of
a final response.
● Unnecessary Explanation in the Final Response: The model generates unnecessary
explanation in the final response, e.g. offers explanations detailing the steps taken to solve
the prompt.
● The Model Failed in this Step: There’s a model failure in this step; the most common example
is <MALFORMED PLANNER RESPONSE>.
● Use of Plural Pronouns: The model should not use plural pronouns to refer to itself or its
work.
● Uses the File's Name: The final response contains the filename of the CSV file. I.e.: “…based
on the data in ‘AMSA INDUSTRIAL - INVENTORY.csv’.”
● Final Response contains an image tag: When working with plots, the final response may
feature an image tag instead of presenting the plot directly or referring to a plot generated in
a previous step. I.e.: [Image of bar chart showing the frequency of vehicles].
Scenario type
The scenario analysis should be conducted based on the necessary columns to solve the
prompt. There are three scenarios to consider:
Scenario 3: needs sampling before cleaning/conversion.
● df.info() shows that a column is not in the right type for the analysis/calculation to
happen and therefore needs conversion (e.g. numeric or datetime data stored in object
format).
● df.head() shows that either:
○ there is no hint on how to clean it, or
○ the format is not consistent across all visible examples.
Variations examples
In the following table you can find different cases where samples are needed:
1.23E+02
1.23e+02
1994-05-21 12:00:00
1994-05-21T12:00:00Z
1994-05-21 12:00:00+00:00
1994-05-21 12:00:00-07:00
05/21/1994 12:00:00 PDT
21/05/1994 12:00:00 CEST
1994/05/21 12:00:00 UTC
21-May-1994 12:00:00 GMT
May 21, 1994 12:00:00 EST
19940521 12:00:00 MSK
21st May 1994 12:00:00 BST
05-21-94 12:00:00 CDT
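One hedged way to handle a column that mixes two of the variations above is to parse each known format separately and combine the results, coercing whatever still fails. The two formats below were chosen for illustration only:

```python
import pandas as pd

# Two of the datetime variations above, mixed in one hypothetical column
dates = pd.Series(['1994-05-21 12:00:00', '21-May-1994 12:00:00'])

# Parse with the first format, fall back to the second, coerce the rest
parsed = pd.to_datetime(dates, format='%Y-%m-%d %H:%M:%S', errors='coerce')
fallback = pd.to_datetime(dates, format='%d-%b-%Y %H:%M:%S', errors='coerce')
parsed = parsed.fillna(fallback)
print(parsed.isna().sum())  # 0
```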
Step 3
Context
Step 3 (Turn #3) refers to the step after data exploration, where the model is making the plan
and choosing columns for the calculation. In this step, the model needs to know 1) when and
why a column needs to be cleaned and converted to the right format for follow-on
calculations, and 2) how to correctly do the cleaning.
Note: Please input bad tasks here → if the task is not solvable, please input it here.
Workflow
1. Read and review Turn #3’s response – Turn #3 is considered the “Good Example”, but it
will still need to be modified to fit the instructions. Please expand this section, scroll
to the last section, and make edits/improve this response based on the instructions.
2. Identify Scenario type for the task. Review the relevant sections. Use the following
chart for that
3. Write the 5 prompts for the “Bad examples”. Please follow the detailed instructions
below and please follow the style guidelines.
[IMPORTANT] Style Guidelines
EXPLANATION_AND_PLAN is more concise and general and CODER_INSTRUCTIONS are
more detailed / specific and follow the numbered subheadings format.
● New guideline: in Explanation and Plan, state the columns that don’t need
conversion & explain why
● In all cases (where no column cleaning up is needed), please add some rationale
in EXPLANATION_AND_PLAN similar to
○ “Since the column Cost is already of type int64, it can be used as is for
the numerical calculations.”
○ "Given that the column 'Cost' is already in the float64 data type, it is
suitable for our calculations."
Consistency in Coder Instruction:
a. Please ensure there are no inconsistent styles used (such as **Tidy Data**
instead of Tidy Data).
b. Use “Convert to numeric” for consistency instead of “Convert to float”.
c. All column names, methods and variables, including `df.info()` and `df.head()`,
should be wrapped in backticks everywhere (including the first Coder
Instruction block).
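One reason “Convert to numeric” is the better wording: pandas can infer the appropriate numeric dtype, whereas a float cast always forces float64 even when integers suffice. A small sketch (the series values are invented):

```python
import pandas as pd

s = pd.Series(["1300", "2450", "980"])

numeric = pd.to_numeric(s)   # "Convert to numeric": pandas infers int64 here
forced = s.astype(float)     # "Convert to float": always float64

print(numeric.dtype, forced.dtype)  # int64 float64
```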
Bucketized Checklist [IMPORTANT]
Along with the checklist above, please also pay attention to common loss patterns in the
sections below to ensure these problems are not present in Supervised finetuning (SFT)
going forward.
Style
● Columns not properly referenced (missing backticks, using single / double quotes
instead).
● Literal values incorrectly marked (backticks instead of quotes, and vice versa).
● Ambiguous or unclear print / display instructions.
● Unnecessary blank lines (this applies to all of EXPLANATION_AND_PLAN,
CODER_INSTRUCTION and CODE).
Coder Instructions
● Unclear / Ambiguous Coder Instructions.
● Missing or incomplete data transformation steps.
● Missing column names.
● Coder Instruction not numbered / incorrectly numbered.
● Wrong rationale in Coder Instructions.
○ For cleaning.
○ For conversion.
● Code present in Coder Instructions.
● Missing subheaders / Incorrect formatting of subheaders.
Categories
Note: the `Total` and `Price w Markup` columns are used for the calculation and already
have numeric values.
Unset
[df.head() and df.info() output omitted]
RL Instruction
Good Example
- Coder instruction should explicitly acknowledge the correct type of the column(s)
- Coder instruction should explicitly explain that no conversion is needed
- In Explanation and Plan, state the columns that don’t need conversion & explain
why
Bad Examples:
1. Coder instruction identifies the wrong column type and says conversion needed
2. Coder instruction identifies the correct column type but still says conversion needed
(when it’s actually not needed)
a. E.g. convert int to str. But the calculation is only doing group by
b. E.g. convert int to float. But nothing requires float calculation
3. Coder instruction didn’t mention the column type and says conversion needed
4. Coder instruction mentions another column that’s not used in the calculation at all.
5. [Less penalty] Coder instruction didn’t mention the column type and didn’t mention
conversion
Good Example
Unset
EXPLANATION_AND_PLAN:
It seems that the total income can be calculated by using the `Total` and `Price w Markup`
columns. I'll look at the appropriate groups and calculate net income by subtracting the
`Total` column from the `Price w Markup` column and summing up.
Because `Total` and `Price w Markup` are both type int64 (seen from the df.info output),
they don't need conversion for the calculation.
CODER_INSTRUCTION:
1. Filter the data to include the rows where `Group` is equal to 11 or 12.
2. Subtract the `Total` column from the `Price w Markup` column.
3. Sum all the values and print the result. Round to two decimals.
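The good example's instructions might translate into code like the following sketch (the data is invented for illustration; in the real task `df` comes from the uploaded file):

```python
import pandas as pd

# Illustrative data standing in for the task's dataframe
df = pd.DataFrame({
    "Group": [11, 12, 13],
    "Total": [100, 200, 300],
    "Price w Markup": [150, 260, 330],
})

# Filter the rows where `Group` is 11 or 12
filtered = df[df["Group"].isin([11, 12])]

# Subtract `Total` from `Price w Markup`, sum, and round to two decimals
net_income = round((filtered["Price w Markup"] - filtered["Total"]).sum(), 2)
print(net_income)  # 110
```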
Unset
EXPLANATION_AND_PLAN:
It seems that the total income can be calculated by using the `Total` and `Price w Markup`
columns. I'll look at the appropriate groups and calculate net income by subtracting the
`Total` column from the `Price w Markup` column and summing up.
CODER_INSTRUCTION:
1. `Total` column is type 'Object' (seen in the df.info output) which means it contains
non-numeric values. I need to convert it to numeric before calculation.
2. Convert the `Total` column to numeric and remove '$' and ','.
3. Filter the data to include the rows where `Group` is equal to 11 or 12.
4. Subtract the `Total` column from the `Price w Markup` column.
5. Sum all the values and print the result. Round to two decimals.
EXPLANATION_AND_PLAN:
It seems that the total income can be calculated by using the `Total` and `Price w Markup`
columns. I'll look at the appropriate groups and calculate net income by subtracting the
`Total` column from the `Price w Markup` column and summing up.
CODER_INSTRUCTION:
1. `Total` column is type 'int64' (seen in the df.info output). I need to convert it to
numeric before calculation.
2. Convert the `Total` column to numeric and remove '$' and ','.
3. Filter the data to include the rows where `Group` is equal to 11 or 12.
4. Subtract the `Total` column from the `Price w Markup` column.
5. Sum all the values and print the result. Round to two decimals.
Unset
EXPLANATION_AND_PLAN:
It seems that the total income can be calculated by using the `Total` and `Price w Markup`
columns. I'll look at the appropriate groups and calculate net income by subtracting the
`Total` column from the `Price w Markup` column and summing up.
CODER_INSTRUCTION:
1. Filter the data to include the rows where `Group` is equal to 11 or 12.
2. Subtract the `Total` column from the `Price w Markup` column.
3. Sum all the values and print the result. Round to two decimals.
Note: the `Total` column is used for the calculation and needs cleaning before converting to numeric.
Unset
[df.head() and df.info() output omitted]
RL Instruction
Good Example
- Coder instruction should explicitly acknowledge the correct type of the column(s)
- Coder instruction should explicitly mention that the column is not in the desired format
based on the df.info output and therefore needs conversion
- Coder instruction should explicitly explain what’s wrong with the value based on
the df.head() output and what exactly needs to be done for the clean up
- The clean-up instruction should be specific, not generic (e.g. removing specific chars
vs removing all non-numeric chars), and based only on the df.head() output
Bad Examples:
1. Coder instruction identifies the wrong column type and says conversion is not needed
2. Coder instruction didn’t mention anything about column type or conversion
3. [Less penalty] Coder instruction didn’t explain why the conversion is needed or why
the clean up is needed
4. Coder instruction gives unnecessary clean up
a. E.g. give instructions to remove “$” but df.head() is not showing any values that
has “$”
b. E.g. give instructions to remove any non-numeric character from the column
5. Coder instruction gives incorrect clean up
a. E.g. give instructions to replace “,” with “.” when df.head shows a value like
$1,999. This will lead to incorrect result
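Bad example 5a can be checked numerically: with a value like “$1,999”, replacing ',' with '.' silently turns one thousand nine hundred ninety-nine into roughly two. A quick sketch (the series value is invented):

```python
import pandas as pd

raw = pd.Series(["$1,999"])
stripped = raw.str.replace("$", "", regex=False)

# Incorrect clean-up: treats the thousands separator as a decimal point
wrong = pd.to_numeric(stripped.str.replace(",", ".", regex=False))

# Correct clean-up: removes the thousands separator
right = pd.to_numeric(stripped.str.replace(",", "", regex=False))

print(wrong.iloc[0], right.iloc[0])  # 1.999 1999
```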
Good Example
Unset
EXPLANATION_AND_PLAN:
It seems that the total income can be calculated by using the `Total` and `Price w Markup`
columns. I'll look at the appropriate groups and calculate net income by subtracting the
`Total` column from the `Price w Markup` column and summing up.
CODER_INSTRUCTION:
1. `Price w Markup` is type 'int64' which is good for the calculation. No conversion
needed.
2. Because the `Total` column is type 'Object' (seen in the df.info output), which means
it contains non-numeric values, we need to transform it to numeric for the calculation later.
3. Based on the `df.head()` output, the `Total` column has values such as '$1,300' that
contain '$' and ','. I need to remove these chars before converting to numeric.
4. Filter the data to include the rows where `Group` is equal to 11 or 12.
5. Subtract the `Total` column from the `Price w Markup` column.
6. Sum all the values and print the result. Round to two decimals.
Unset
EXPLANATION_AND_PLAN:
It seems that the total income can be calculated by using the `Total` and `Price w Markup`
columns. I'll look at the appropriate groups and calculate net income by subtracting the
`Total` column from the `Price w Markup` column and summing up.
CODER_INSTRUCTION:
1. Both `Total` and `Price w Markup` columns are of type 'int64', which is good for the
calculation; no conversion is needed.
2. Filter the data to include the rows where `Group` is equal to 11 or 12.
3. Subtract the `Total` column from the `Price w Markup` column.
4. Sum all the values and print the result. Round to two decimals.
Bad Example 2) didn’t mention column type or conversion at all
Unset
EXPLANATION_AND_PLAN:
It seems that the total income can be calculated by using the `Total` and `Price w Markup`
columns. I'll look at the appropriate groups and calculate net income by subtracting the
`Total` column from the `Price w Markup` column and summing up.
CODER_INSTRUCTION:
1. Filter the data to include the rows where `Group` is equal to 11 or 12.
2. Subtract the `Total` column from the `Price w Markup` column.
3. Sum all the values and print the result. Round to two decimals.
Unset
EXPLANATION_AND_PLAN:
It seems that the total income can be calculated by using the `Total` and `Price w Markup`
columns. I'll look at the appropriate groups and calculate net income by subtracting the
`Total` column from the `Price w Markup` column and summing up.
CODER_INSTRUCTION:
1. `Price w Markup` is type 'int64' which is good for the calculation. No conversion
needed.
2. Because the `Total` column is type 'Object' (seen in the df.info output), which means
it contains non-numeric values, I need to transform it to numeric for the calculation later.
3. Based on the `df.head()` output, the `Total` column has values such as '$1,300' that
contain '$' and ','. I need to remove these chars before converting to numeric.
4. Filter the data to include the rows where `Group` is equal to 11 or 12.
5. Subtract the `Total` column from the `Price w Markup` column.
6. Sum all the values and print the result. Round to two decimals.
Unset
EXPLANATION_AND_PLAN:
It seems that the total income can be calculated by using the `Total` and `Price w Markup`
columns. I'll look at the appropriate groups and calculate net income by subtracting the
`Total` column from the `Price w Markup` column and summing up.
CODER_INSTRUCTION:
1. `Price w Markup` is type 'int64' which is good for the calculation. No conversion
needed.
2. Because the `Total` column is type 'Object' (seen in the df.info output), which means
it contains non-numeric values, I need to transform it to numeric for the calculation later.
3. Based on the `df.head()` output, the `Total` column has values such as '$1,300' that
contain '$' and ','. I need to remove these chars before converting to numeric.
4. Filter the data to include the rows where `Group` is equal to 11 or 12.
5. Subtract the `Total` column from the `Price w Markup` column.
6. Sum all the values and print the result. Round to two decimals.
Good Example
Unset
EXPLANATION_AND_PLAN:
I will calculate net profit and profit margin of the products from each supplier to find
out which suppliers are providing the most profitable products.
I can calculate sales and net profit by summing up columns `Price * Stock` and `(Price -
Cost) * Stock`, respectively.
However, since the `Cost` column is not numeric, I will check the values that are not
numeric to decide how to convert the `Cost` column for further analysis.
CODER_INSTRUCTION:
From the `df.head()` output, the first five rows of `Cost` are all integer values. However,
the `df.info()` output shows that the `Cost` column is of type object, which means it has
non-numeric values.
1. Find and print the unique non-numeric values in the `Cost` column.
2. If there are more than 20 unique non-numeric values, sample and print 20 of them.
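One way the coder might implement this check (the `Cost` values here are invented for illustration):

```python
import pandas as pd

# Invented `Cost` column: mostly numbers stored as strings, plus junk values
cost = pd.Series(["10", "25", "N/A", "30", "unknown", "15"])

# Values that fail numeric conversion are the non-numeric ones
non_numeric = cost[pd.to_numeric(cost, errors="coerce").isna()].unique()

# Sample at most 20 unique non-numeric values before printing
if len(non_numeric) > 20:
    print(pd.Series(non_numeric).sample(20, random_state=0).tolist())
else:
    print(list(non_numeric))  # ['N/A', 'unknown']
```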
Bad Example
Bad examples here are any steps that do not do the sampling. We can reuse the bad-example
patterns from Scenarios 1) and 2) above.
In this task, your goal is to generate alternative versions of the provided model responses.
Your challenge is to craft a different version of a response for each potential category of loss
buckets. It is important to focus on creating responses with errors that the model might generate.
Before creating the loss bucket responses you should ensure that the selected/rewritten
response is perfect. To achieve this, remove any content that the model would obviously never
generate. This includes eliminating single non-English words, poor grammar, improperly
formatted text, or any content that is clearly not within the model's capabilities.
Regarding the lossy responses, you will encounter different scenarios (or cases) that will
require you to generate a varying number of loss buckets, depending on whether or not the
prompt requires the creation of a plot or table.
Task Instructions
1. Read and Understand the task
○ It is necessary to have a certain level of comprehension about the resolution of the
task before being able to identify and craft loss buckets.
2. Read and Understand the Chosen/Rewritten Final Response
○ This response must be a perfect response.
3. Generate Alternative Responses
○ Follow the category guidelines to create the required number of alternative
responses.
○ Each alternative response should introduce a single type of loss bucket.
○ It’s not mandatory to use the “perfect response” the tasker chose or wrote as a
basis for the new responses, but it is recommended.
○ If the “perfect response” happens to have any previous Loss Bucket, remove
that Loss Bucket to avoid adding extra Loss Buckets to each alternative
response.
○ Your response should be written in the following markdown format:
**Output:**
### End the task session after selecting this response. Click the **Submit**
button to end.
Remember
● Your rewrites should be clearly distinct from one another, each showcasing a unique
type of loss bucket.
● The final response in each case is the only aspect you need to modify. Avoid changing
the task itself.
Seeking loss buckets
Loss buckets are specific characteristics that make a model response bad. The table below
shows all potential loss buckets along with their corresponding explanations:
● Weird/Wrong Final Response Pattern: The final response refers to the previous analysis.
It could include, but is not limited to, “...data analysis objective…”, “...based on the
analysis…” or similar expressions.
● Final Response is Not Complete by Itself: In order to answer the prompt, the final
response refers to an output produced in an earlier step, yet it does not display it
within the final response. E.g. “The max can be found in the previous step“, “I’ve
calculated the highest average in the last turn…“, “Refer to the table above for…”.
● Table in Response is Not Formatted Well (OPTIONAL): When the main prompt requires a
table, the final response should present the requested table. If there are too many rows
(more than 20), the table should display only three rows; otherwise, it should display the
complete table. If the table is truncated, this should be explicitly specified in the
response, e.g. “This table is a truncated view of the records…”. This loss bucket occurs
when the model starts showing the table from the start but stops abruptly, e.g. it prints
14 complete rows and does not complete the 15th row. Non-markdown tables should not be
included in this category.
● Unnecessary Insights for Plots (OPTIONAL): In cases where the main prompt explicitly
requests a plot, the expected final response is a confirmation message stating that the
plot has been generated. There should be no additional insights regarding the plot or the
data. This loss bucket occurs when the final response generates a correct but unnecessary
(not requested) insight. Be aware that if the prompt does not ask for a plot but the
resolution involves displaying one, then providing insight might be necessary, and this
loss bucket does not apply to these situations. NOTE: insight refers to a valuable or
meaningful observation, interpretation, or understanding gained from analyzing data or
text. It often involves identifying patterns, trends, relationships, or significant
information within the given context.
● Image Tags in Final Response (OPTIONAL): When working with plots, the final response may
feature an image tag instead of presenting the plot directly or referring to a plot
generated in a previous step. I.e.: [Image of bar chart showing the frequency of vehicles].
This loss bucket is only possible when plots/graphs have taken part in the resolution of
the task.
● Unnecessary Explanation in the Final Response: The model generates an unnecessary
explanation in the final response, e.g. it offers some explanations detailing the steps
taken to solve the prompt.
● Uses the File's Name: The final response refers to the data by the uploaded file's name
instead of describing the data itself.
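The truncation rule for tables (more than 20 rows → display only three and state the truncation explicitly) can be sketched in code. The `render_table` helper below is hypothetical, written only to illustrate the rule:

```python
import pandas as pd

df = pd.DataFrame({"id": range(25), "value": range(25)})

def render_table(frame: pd.DataFrame) -> str:
    """Markdown table that truncates to 3 rows when there are more than 20."""
    shown = frame.head(3) if len(frame) > 20 else frame
    lines = ["| " + " | ".join(frame.columns) + " |",
             "|" + "|".join("---" for _ in frame.columns) + "|"]
    lines += ["| " + " | ".join(str(v) for v in row) + " |"
              for row in shown.itertuples(index=False)]
    if len(frame) > 20:
        lines.append("This table is a truncated view of the records "
                     f"({len(frame)} rows in total).")
    return "\n".join(lines)

print(render_table(df))
```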
Main Prompt
This is the main prompt to be resolved in the task.
Original Task Resolution
This is the resolution of the original task through code execution and code output.
Final Responses
After reaching a conclusion, the model presents a set of 2 final responses. The tasker who
solved the task selected one of these two options (the one considered best in the tasker's
judgment) and refined it if necessary.
As previously explained, in the case of this task, at least 1 response will be a “perfect”
response, and the other(s) will present one or many loss buckets.
These are the set of Final Responses for the task at hand:
The contributor chose answer 1 and rewrote it to display only the first 3 records, so as
not to display the entire table, in accordance with the rubrics.
Each Loss Bucket will have a window to complete with a Final Answer that contains only the Loss
Bucket requested in that step.
If you have any doubt regarding what a Loss Bucket refers to, you can hover over its
symbol to get a quick explanation.
For example, for the Weird / Wrong Final Response Pattern, you can find the following quick
explanation
To create this Loss Bucket, it is just necessary to add any insight referring to the requested
bar chart, since no insights were requested about it. This insight should be correct (not
incorrect or hallucinated).
[Optional] Image Tags in Final Response.
The following case is an example of the optional Loss Bucket, “Image Tags in Final Response”.
The main prompt requires the creation of a simple plot, just a bar chart.
To create this Loss Bucket, it is just necessary to add an image tag to the final response,
referencing the created bar chart.
[Mandatory] Not a Final Response.
Many times we have found that, once the response to the main prompt has been reached, the
model wants to continue with the analysis.
The idea of this loss bucket is to recreate this situation.
[Mandatory] Unnecessary Explanation in the Final Response.
Sometimes we find that the model continues with some unnecessary suggestions as to how we
might deepen the analysis.
The idea of this loss bucket is to recreate this situation.
[Mandatory] Use of plural Pronouns.
A requirement of the project has been to correct responses in which the model uses plural pronouns (“we”, “our”) instead of the first person singular.
How to complete a task
https://frrautneduar-my.sharepoint.com/:v:/g/personal/pablo_caneva_frra_utn_edu_ar/EYv0sq
b3hspOsz75YKQ83yABzVVkJXshzDmBSobY0I72tQ?e=vj180R