
Intelligent Systems

Practical Study Block 7: Data Pre-processing in Clementine

Preamble


In this session, you will experiment with setting up streams for data understanding and preparation (CRISP-DM stages 2 and 3). This will involve the creation of simple streams using nodes from the Sources, Record Ops, Field Ops, Graphs and Output palettes. As you work, save your streams for future use.

Sourcing the data


We will initially be using the Clementine drug data set with the names header row. Set the working directory (File/Set Directory) to Demo. Place a Var. File node from the Sources palette on the drawing area. Edit the node and choose the file drug1n. Check the Get field names from file box so that the field names are picked up.

Choose the Data tab at the bottom of the editor. Each field should appear, named and with its type, e.g. Range (= numeric), Set (= discrete) and Flag (= binary discrete). Click on Read Values to pull the data through the node (select all fields when asked). You will now see in the Values column the values (or ranges) for each field. To see more details about the values, right-click on the value cell and choose Edit.

Add a Table node from the Output palette and connect it to your Source to give the stream:

Source -> Table

Execute the Table node. This will produce a window allowing you to browse the full example set of 200 records. Note that each attribute has been named using the header row. Also note that executing a stream pulls the data through it, so you do not need the Read Values step above.

For a quicker inspection of the data, place a Data Audit node (from the Output palette) on your workspace and attach it to the source node. Note that you cannot have more than one output node in series in a stream; you must branch. Now edit the Data Audit node. Choose Use Custom fields and pull down the drop list to the right of the fields box. Select all fields to be audited. Tick Graphs and Basic Statistics to be displayed in your audit. Execute.

The audit replicates some of the details seen before but adds a graph giving, for discrete fields, a distribution across the value set and, for ranges, a histogram of values. Both Distribution and Histogram are available separately as nodes in the Graphs palette. The right-hand column of the audit tells you how many valid values you have in each field. This is a quick way to spot the presence and extent of missing values (discussed in Practical 2).
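(An aside, not part of the Clementine exercise.) If you want to cross-check this first-look inspection outside Clementine, the sketch below does roughly the same thing in Python with pandas. It assumes drug1n is a comma-delimited text file with a header row.

    import pandas as pd

    # Read the file, picking up field names from the header row
    # (assumption: comma-delimited; adjust sep= if the file uses tabs)
    df = pd.read_csv("drug1n")

    print(df.head())                   # quick look, cf. browsing a Table node
    print(df.dtypes)                   # storage types, cf. Range/Set/Flag
    print(df.describe(include="all"))  # rough analogue of a Data Audit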

Record operations
From the Record Ops palette, add a Select node between your Source and Table nodes to give the stream:

Source -> Select -> Table

Edit the Select node and enter the following selection criterion:

Age < 30 and Sex == 'M'

Execute from the Table node and check that the records you receive all satisfy this condition. Experiment with other selection conditions. See the Help on the CLEM language if required.
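(For comparison only.) The Select node's test corresponds to an ordinary boolean row filter; a minimal pandas sketch, using the same file assumptions as above:

    import pandas as pd

    df = pd.read_csv("drug1n")

    # Keep only records satisfying the Select condition:
    # Age < 30 and Sex == 'M'
    subset = df[(df["Age"] < 30) & (df["Sex"] == "M")]
    print(len(subset), "records selected")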

Now remove your Select node and replace it with a Sample node. Edit this to choose the first 150 records. Execute the Table node to confirm that this has happened. Now develop a stream to divide the data set into two: a training set consisting of the first 150 records and a test set of the remaining 50. Rename your nodes (from the node menu) to show what they are doing.
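(For comparison only.) The first-150/last-50 split is just positional slicing; a hedged pandas sketch:

    import pandas as pd

    df = pd.read_csv("drug1n")

    train = df.iloc[:150]   # first 150 records, cf. the Sample node
    test = df.iloc[150:]    # the remaining 50 records
    print(len(train), len(test))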

Field operations
Remove the Sample node and add a Type node from the Field Ops palette to give the stream:

Source -> Type -> Table

Browse the Type node settings using its editor. This should be familiar: it is basically the same as the editor for the Source node, but it is sometimes useful to have a Type node downstream in a complex stream where you have added or altered fields along the way.

Add a Filter node to give the stream:

Source -> Type -> Filter -> Table

Edit the Filter node to remove the Sex and Cholesterol attributes. Execute the Table node to confirm. Replace the removed attributes. Now create the stream:

Source -> Derive -> Type -> Table

Edit the Derive node to add a new field giving the value of Na/K. Note that the Clementine language is case sensitive. The calculator icon in the editor brings up the Expression Builder, a tool to help with writing arithmetical expressions. Edit the node to give your new attribute a name. Execute the stream. Check that the Type settings and the Table have included your new field (and its name). Notice that the Type node was placed downstream from the Derive node. It is sensible to do this; otherwise Type will not know about your new attribute.
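(For comparison only.) The Filter and Derive steps correspond to column dropping and column arithmetic; an illustrative pandas sketch (Na and K are fields in the drug data; Na_to_K is a name chosen here purely for illustration):

    import pandas as pd

    df = pd.read_csv("drug1n")

    # Filter node analogue: remove two attributes
    filtered = df.drop(columns=["Sex", "Cholesterol"])

    # Derive node analogue: add a new field holding Na/K
    df["Na_to_K"] = df["Na"] / df["K"]
    print(df.columns.tolist())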

Graphical displays
Replace your Table node with a Distribution node from the Graphs palette. Edit the node and select Drug as the field whose distribution is to be displayed. Also select BP as the overlay field. Execute the node to see the distribution, often called the class frequency profile. DrugY is the most popular; it is referred to as the majority class.

As an experiment, place a Select node into your stream between Source and Derive. Try different selection criteria to see if you can find one which results in a substantial majority for one class, i.e. drug. What can you say about the relevance of BP to the determination of drug?

Now remove the Select node and replace the Distribution node with a Histogram node. Edit it and choose Age as the attribute to be graphed, without any overlay. (Note that the histogram can be used only on continuous attributes.) From the Options tab, try different numbers of bins, i.e. intervals, and execute to see the resulting histogram.
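(For comparison only.) Both charts can be reproduced outside Clementine; a rough pandas/matplotlib sketch, assuming matplotlib is installed:

    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("drug1n")

    # Distribution of Drug with BP as overlay: stacked bar of counts
    pd.crosstab(df["Drug"], df["BP"]).plot(kind="bar", stacked=True)

    # Histogram of Age on a fresh figure; experiment with bin counts
    plt.figure()
    df["Age"].plot(kind="hist", bins=10)
    plt.show()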

Binning a continuous attribute


To bin a numeric attribute (i.e. convert it to discrete), connect your source to a Binning node from the Field Ops palette. Edit it and use the drop-down list at the right to select Age as the field to be binned (you will be shown all the numeric fields). Different ways of selecting boundaries for the binning intervals are available under Binning Method. For the moment, stick with Fixed-Width (the default) and change the bin width to 15 (meaning Age will be binned into sub-intervals each 15 years in length). Add a Table node downstream from your Binning node and execute. You will see a new attribute in the final column called Age_BIN.

Edit your Binning node once more and select the Generate tab at the bottom. You will see the range definitions for each of your bins. Click the Generate Derive button and a Derive node (from the Field Ops palette) will be placed on your workspace. Edit this to see the settings. You can change the default bin labels (1, 2, 3, etc.) to something more meaningful here, e.g. young, middle-aged, etc., if you want: just click on the label. Experiment with different binning strategies. Use the online help for guidance on the various options within the binning editor.

We can also use the Histogram node to bin an attribute. Click at several points along your Age histogram from above to create binning boundaries (these do not have to coincide with the bins used to build the histogram). Note that Clementine refers to bins as bands. You can name each band and then automatically generate a Derive node to create a new discrete attribute whose values are your band names. See the online help for the details.
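(For comparison only.) Fixed-width binning of width 15 amounts to cutting the Age range at 15-year boundaries; a minimal pandas sketch (the bin edges starting at 0 are an illustrative choice, whereas Clementine derives its boundaries from the data):

    import pandas as pd

    df = pd.read_csv("drug1n")

    # Fixed-width bins 15 years wide; the interval labels could be
    # renamed to e.g. "young", "middle-aged", cf. the Derive node
    df["Age_BIN"] = pd.cut(df["Age"], bins=range(0, 105, 15))
    print(df[["Age", "Age_BIN"]].head())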

Saving new data


Having prepared new data through the activities above, you can save it to a file. Use the File node in the Output palette. You should opt to include field names. Field and record delimiters can also be selected in the editor.
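(For comparison only.) The pandas counterpart is a one-line write; the output filename here is an arbitrary choice:

    import pandas as pd

    df = pd.read_csv("drug1n")

    # Write the prepared data with field names in a header row
    df.to_csv("drug1n_prepared.csv", index=False)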

Missing values
Missing values, referred to in Clementine as blanks, are values in a data set that are unknown, uncollected or incorrectly entered. Often such values are nonsensical in the context of their field: if we have a field Age, a negative value is obviously nonsensical and could therefore be defined as a blank. Wrongly typed values are also generally regarded as blanks, because a missing value is often deliberately entered using spaces, a nonsensical value or an empty field.

Before mining, the training set must usually be massaged to clean it up; some algorithms will not accept data having missing values. Several techniques are available for dealing with missing values, and different techniques are appropriate in different situations, depending on the data set size, the number of fields containing blanks and the amount of missing information generally. The techniques fall into four categories:

- omit the offending records;
- omit the offending fields;
- fill in missing values with default values;
- fill in missing values with a value derived from a model.

The first two are dealt with here.
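(For orientation only.) All four categories have simple pandas counterparts; in this sketch the default-fill and model-fill lines are deliberately simplistic stand-ins, and bandsn is assumed to be comma-delimited with '?' marking blanks:

    import pandas as pd

    df = pd.read_csv("bandsn", na_values="?")

    dropped_records = df.dropna()         # 1. omit the offending records
    dropped_fields = df.dropna(axis=1)    # 2. omit the offending fields
    default_filled = df.fillna(0)         # 3. fill with a default value
    # 4. fill with a derived value; a column mean is the crudest "model"
    mean_filled = df.fillna(df.mean(numeric_only=True))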

Missing values in Clementine


We will use a data set from the printing press domain as a running example. The file is bandsn. This data was downloaded from the UCI Machine Learning repository and edited to include the attribute names. The .names file detailing the attributes is provided at the bottom of this document. Missing values are indicated in the bandsn file by '?'.

Place a variable source node on the workspace and edit it to set the file. Several attributes can be removed immediately: timestamp, customer, job number and cylinder number do not contain useful information about the printing process settings; they are mainly just example identifiers. These can be eliminated using the Filter tab in the Source node editor.

Add and execute a Table node and inspect for missing values. Try to gauge where they occur: which fields and records seem to be most affected? You will notice some ? values and also some $Null$ values (these have been converted by Clementine from ?; see below).

Now look through the bands.names file (at the end of this document) to identify all the numeric attributes and check the type they have been given by Clementine. Sometimes numeric attributes are assigned type Set, i.e. discrete. (One reason for this is that using ? as a marker for a missing value forces the type-determining mechanism in Clementine to infer that the attribute values should be stored internally as string and that the attribute type should be Set.) The internal storage type of the attribute is indicated by a symbol in a yellow square in the attribute column of the Types tab of the Source node editor: an A means stored as string; an empty diamond means stored as integer; a diamond containing # means stored as real. Remember that the internal storage type is not the same as the logical or usage type with which we are normally concerned.

To fix any such wrongly typed attribute, go to the Data tab in the Source node editor, right-click on the attribute name in the left column and use Set Storage to change the type to real. Override will then be checked. Now go to the Types tab, right-click on the attribute you are changing and take Select Values followed by <Read>. Finally, to activate the read, press the Read Values button at the top. The type should now change to Range, and the lower and upper values in the data set should appear in the values cell. To do all this for several attributes at once, you can use multiple select. Discrete attributes which happen to use numeric codes may be wrongly typed as numeric; these should be fixed also.
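(For comparison only.) The pandas parallel makes the storage-type point vivid: if ? is read as an ordinary string, the whole column is stored as strings, but declaring ? a missing-value marker at read time lets numeric columns parse as numbers. This assumes bandsn is comma-delimited:

    import pandas as pd

    # Without na_values, any column containing "?" is stored as strings
    raw = pd.read_csv("bandsn")

    # Declaring "?" missing lets numerics parse as real numbers, the
    # analogue of Set Storage -> real plus Read Values in Clementine
    fixed = pd.read_csv("bandsn", na_values="?")
    print(raw.dtypes.head(), fixed.dtypes.head(), sep="\n")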

Defining blanks and cleaning the data


Before missing values can be dealt with, Clementine needs to be told what constitutes a missing value; in this domain, as noted above, it is a '?'. This is referred to as defining blanks.

For numeric attributes with missing value markers which are strings (e.g. ?), provided you have corrected the type to real (if necessary) as described above, Clementine will assume that all strings are missing values. Such values will be treated as system nulls and automatically rewritten as $Null$ (which is why you see such values when you execute the Table node). For missing value markers which conform to type (e.g. ? for a discrete attribute or 999 for a numeric attribute), Clementine must be told explicitly that the value is a missing value marker, i.e. a blank.

Blanks can be defined in the Types tab of the Source node editor. Right-click on the first attribute, grain screened (which is discrete), and select Edit. Check the Define blanks box and add ? as a missing value. Also check the Null and White space boxes (if not already checked). A star will appear in the cell of the Missing column of the editor, indicating that blanking is set for this attribute. (Note that when you next inspect this, Clementine will have expressed ? as a string, i.e. "?".)

This operation must be repeated for all discrete attributes. The quickest way to do this is to copy the definition. Right-click on grain screened and select Copy. Multiple-select all Set attributes and right-click to take Paste Special from the option list that appears. Check Missing and uncheck the other options.

In this data set there are no numeric markers (e.g. -999) for numeric attributes, so you do not have to worry about these. All ? markers for numeric attributes will be identified automatically by Clementine and changed to $Null$, as explained above. For other data sets, however, this could be something you have to deal with. The main reason we need to identify blanks is that most learning algorithms will not accept a blank which disobeys type.
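(For comparison only.) Defining blanks resembles declaring missing-value markers at read time in pandas; the numeric marker below is the hypothetical 999 mentioned above, not a value that actually occurs in bandsn:

    import pandas as pd

    # "?" is a blank everywhere in this data set
    df = pd.read_csv("bandsn", na_values="?")

    # If some numeric field used 999 as its marker (hypothetical here,
    # since bandsn has none), it would need its own definition, e.g.:
    # df["blade pressure"] = df["blade pressure"].replace(999, pd.NA)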


Quality report on blanks


The Quality node (in the Output palette) can be used to generate a report on the occurrence of blanks (once they have been defined). Attach a Quality node downstream from the Source node and edit it. Here you can set the node to look for Empty String, White Space or Blanks. Check the Blanks box. Executing the Quality node will produce a report on the % completeness of values for each field, with reference to the definition of blanks. The report can be sorted in decreasing order of attribute completeness by clicking the % Complete column header until a down arrow appears.
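(For comparison only.) The % Complete figure is simply the share of non-blank values per field; a pandas sketch of the same report, assuming blanks were mapped to NA at read time as above:

    import pandas as pd

    df = pd.read_csv("bandsn", na_values="?")

    # Percentage completeness per field, least complete first
    completeness = df.notna().mean().mul(100).sort_values()
    print(completeness)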

Cleaning the data


Using the Generate menu items on the Quality report, it is possible to clean the data set by removing records and/or fields with missing values. This is achieved by the automatic generation of either a Select node (for records) or a Filter node (for fields). The generated nodes are placed (flashing) at the top left of the workspace and must be incorporated into your stream diagram. Before generating Select and/or Filter nodes, you must select the fields in the report to which the generation will apply.

Experiment with different strategies for removing records and fields. For each strategy, build a stream branching downstream from your Source node incorporating the generated Select and/or Filter nodes. Attach Table and Quality nodes and inspect to make sure there are no missing values remaining. Note the number of records and fields in the cleaned-up data set. Begin with the two simplest strategies:

- remove all records with missing values;
- remove all fields with missing values.

Now try more complex (trade-off) strategies involving the removal of both fields and records. A good approach is to remove the worst few fields first and then select records from those remaining. When you have cleaned up the data satisfactorily, save it using a File node from the Output palette.
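(For comparison only.) A trade-off strategy of this kind can be mocked up quickly in pandas; the choice of the three worst fields is an arbitrary illustration:

    import pandas as pd

    df = pd.read_csv("bandsn", na_values="?")

    # Drop the three fields with the most blanks, then drop any
    # remaining records that still contain a blank
    worst = df.isna().sum().sort_values(ascending=False).head(3).index
    cleaned = df.drop(columns=worst).dropna()
    print(cleaned.shape)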

Filling in values
Sometimes it is appropriate to provide a value when one is missing, rather than eliminate fields and/or records. This is best done when you are sure what the correct value should be (a data understanding task). For example, in the grain screened field, suppose that if a ? appears, it can be assumed that the correct value is NO. This value can be substituted using the Filler node. Place a Filler node after your Source node and edit it. Select grain screened from Fill in fields. From the Replace menu choose Blank values. In the Replace with box type "NO". Notice, again, the string format used for symbolic values. Add a new Table node after your Filler node and execute to check that the filling in has occurred; there should be no ? values remaining.
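(For comparison only.) The Filler step maps onto a targeted fill in pandas; this sketch assumes blanks were read as NA and that the column name follows the .names file:

    import pandas as pd

    df = pd.read_csv("bandsn", na_values="?")

    # Filler node analogue: substitute NO for blanks in one field
    df["grain screened"] = df["grain screened"].fillna("NO")
    print(df["grain screened"].value_counts(dropna=False))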


Appendix: bands.names file


1. Title: Cylinder bands

2. Sources:
   (a) Creator: Bob Evans, RR Donnelley & Sons Co., Gallatin Division, 801 Steam Plant Rd, Gallatin, Tennessee 37066-3396, (615) 452-5170
   (b) Donor: same
   (c) Date: August, 1995

3. Past Usage: Evans, B., and Fisher, D. (1994). Overcoming process delays with decision tree induction. IEEE Expert, Vol. 9, No. 1, 60-66.

4. Relevant Information: Here's the abstract from the above reference:

   ABSTRACT: Machine learning tools show significant promise for knowledge acquisition, particularly when human expertise is inadequate. Recently, process delays known as cylinder banding in rotogravure printing were substantially mitigated using control rules discovered by decision tree induction. Our work exemplifies a more general methodology which transforms the knowledge acquisition task from one in which rules are directly elicited from an expert, to one in which a learning system is responsible for rule generation. The primary responsibilities of the human expert are to evaluate the merits of generated rules, and to guide the acquisition and classification of data necessary for machine induction. These responsibilities require the expert to do what an expert does best: to exercise his or her expertise. This seems a more natural fit to an expert's capabilities than the requirements of traditional methodologies that experts explicitly enumerate the rules that they employ.

5. Number of Instances: 512

6. Number of Attributes: 40 including the class attribute -- 20 attributes are numeric, 20 are nominal


7. Attribute Information:
   1. timestamp: numeric; 19500101 - 21001231
   2. cylinder number: nominal
   3. customer: nominal
   4. job number: nominal
   5. grain screened: nominal; yes, no
   6. ink color: nominal; key, type
   7. proof on ctd ink: nominal; yes, no
   8. blade mfg: nominal; benton, daetwyler, uddeholm
   9. cylinder division: nominal; gallatin, warsaw, mattoon
   10. paper type: nominal; uncoated, coated, super
   11. ink type: nominal; uncoated, coated, cover
   12. direct steam: nominal; use; yes, no *
   13. solvent type: nominal; xylol, lactol, naptha, line, other
   14. type on cylinder: nominal; yes, no
   15. press type: nominal; use; 70 wood hoe, 70 motter, 70 albert, 94 motter
   16. press: nominal; 821, 802, 813, 824, 815, 816, 827, 828
   17. unit number: nominal; 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
   18. cylinder size: nominal; catalog, spiegel, tabloid
   19. paper mill location: nominal; north us, south us, canadian, scandanavian, mid european
   20. plating tank: nominal; 1910, 1911, other
   21. proof cut: numeric; 0-100
   22. viscosity: numeric; 0-100
   23. caliper: numeric; 0-1.0
   24. ink temperature: numeric; 5-30
   25. humifity: numeric; 5-120
   26. roughness: numeric; 0-2
   27. blade pressure: numeric; 10-75
   28. varnish pct: numeric; 0-100
   29. press speed: numeric; 0-4000
   30. ink pct: numeric; 0-100
   31. solvent pct: numeric; 0-100
   32. ESA Voltage: numeric; 0-16
   33. ESA Amperage: numeric; 0-10
   34. wax: numeric; 0-4.0
   35. hardener: numeric; 0-3.0
   36. roller durometer: numeric; 15-120
   37. current density: numeric; 20-50
   38. anode space ratio: numeric; 70-130
   39. chrome content: numeric; 80-120
   40. band type: nominal; class; band, no band *

8. Missing Attribute Values: yes, in 302 examples

9. Class Distribution: (out of 512 total instances)
   -- 312 No band
   -- 200 Band

