7 Data Pre-Processing in Clementine
Record operations
From the Record Ops palette, add a Select node between your Source and Table nodes to give the stream:

Source → Select → Table

Edit the Select node and enter the following selection criterion:

Age < 30 and Sex == 'M'

Execute from the Table node and check that the records you receive all satisfy this condition. Experiment with other selection conditions; see the Help on the CLEM language if required.
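Since Clementine's Select node is GUI-based, a rough Python analogue may help make the idea concrete. This is only a sketch with made-up records, not anything Clementine generates:

```python
# Illustrative analogue of a Select node (made-up data, not Clementine output):
# keep only records satisfying Age < 30 and Sex == 'M'.
records = [
    {"Age": 23, "Sex": "M", "Drug": "drugY"},
    {"Age": 47, "Sex": "F", "Drug": "drugC"},
    {"Age": 28, "Sex": "M", "Drug": "drugX"},
    {"Age": 61, "Sex": "M", "Drug": "drugY"},
]

selected = [r for r in records if r["Age"] < 30 and r["Sex"] == "M"]
print(len(selected))  # 2
```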
Now remove your Select node and replace with a Sample node. Edit this to choose the first 150 records. Execute the Table node to confirm that this has happened. Now develop a stream to divide the data set into two: a training set consisting of the first 150 records and a test set of the remaining 50. Rename your nodes (from the node menu) to show what they are doing.
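The training/test split described above can be sketched as follows (the 200 stand-in records here are hypothetical; in Clementine you would use two Sample nodes, one keeping the first 150 records and one discarding them):

```python
# Sketch of the train/test split: first 150 records -> training set,
# remaining 50 -> test set. Stand-in data for the 200-record drug file.
records = [{"id": i} for i in range(200)]

train = records[:150]   # Sample node: keep first n, n = 150
test  = records[150:]   # Sample node: discard first n, n = 150

print(len(train), len(test))  # 150 50
```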
Field operations
Remove the Sample node and add a Type node from the Field Ops palette to give the stream:

Source → Type → Table

Browse the Type node settings using its editor. This should be familiar: it is basically the same as the editor for the Source node, but it is sometimes useful to have a Type node downstream in a complex stream where you have added or altered fields along the way. Add a Filter node to give the stream:

Source → Type → Filter → Table

Edit the Filter node to remove the Sex and Cholesterol attributes, and execute the Table node to confirm. Then replace the removed attributes. Now create the stream:

Source → Derive → Type → Table

Edit the Derive node to add a new field giving the value of Na/K. Note that the CLEM language is case sensitive. The calculator icon in the editor brings up the Expression Builder, a tool to help with writing arithmetical expressions. Edit the node to give your new attribute a name, then execute the stream and check that the Type settings and the Table include your new field (and its name). Notice that the Type node was placed downstream of the Derive node; this is sensible because otherwise Type would not know about your new attribute.
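The Derive node's computation amounts to adding a new column from existing ones. A minimal Python sketch, with made-up Na and K values:

```python
# Sketch of the Derive node: add a Na_to_K field computed as Na / K.
# The field name and the values are illustrative, not from the real data set.
records = [
    {"Na": 0.792, "K": 0.031},
    {"Na": 0.739, "K": 0.056},
]

for r in records:
    r["Na_to_K"] = r["Na"] / r["K"]   # the new derived attribute

print(records[0]["Na_to_K"])
```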
Graphical displays
Replace your Table node with a Distribution node from the Graphs palette. Edit the node and select Drug as the field whose distribution is to be displayed, and BP as the overlay field. Execute the node to see the distribution, often called the class frequency profile. DrugY is the most popular; it is referred to as the majority class. As an experiment, place a Select node into your stream between Source and Derive. Try different selection criteria to see if you can find one which results in a substantial majority for one class, i.e. drug. What can you say about the relevance of BP to the determination of drug? Now remove the Select node and replace the Distribution node with a Histogram node. Edit it and choose Age as the attribute to be graphed, without any overlay. (Note that the histogram can be used only on continuous attributes.) From the Options tab, try different numbers of bins, i.e. intervals, and execute to see the resulting histogram.
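A class frequency profile is just a count of each class value; the majority class is the one with the largest count. A small sketch with invented counts:

```python
from collections import Counter

# Sketch of the Distribution node: class frequency profile of Drug.
# The values and their frequencies here are made up for illustration.
drugs = ["drugY"] * 5 + ["drugX"] * 3 + ["drugC"] * 2

profile = Counter(drugs)
majority_class, majority_count = profile.most_common(1)[0]
print(majority_class, majority_count)  # drugY 5
```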
Now replace the Histogram node with a Binning node followed by a Table node, and execute. You will see a new attribute in the final column called Age_BIN. Edit your Binning node once more and select the Generate tab at the bottom. You will see the range definitions for each of your bins. Click the Generate Derive button and a Derive node (from the Field Ops palette) will be placed on your workspace. Edit this to see the settings. You can change the default bin labels (1, 2, 3, etc.) to something more meaningful, e.g. young, middle-aged, etc. (just click on the label). Experiment with different binning strategies; use the online help for guidance on the various options within the binning editor. We can also use the Histogram node to bin an attribute. Click at several points along your Age histogram from above to create binning boundaries (these do not have to coincide with the bins used to build the histogram). Note that Clementine refers to these bins as bands. You can name each band and then automatically generate a Derive node to create a new discrete attribute whose values are your band names. See the online help for details.
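Fixed-width binning, one of the simplest binning strategies, divides the attribute's range into n equal intervals and maps each value to its interval number. A sketch (the age values and range are invented):

```python
# Sketch of fixed-width binning: map each value into one of n_bins
# equal-width intervals over [lo, hi]. Ages and range are illustrative.
def bin_index(value, lo, hi, n_bins):
    """Return a 1-based bin number for value in [lo, hi]."""
    width = (hi - lo) / n_bins
    i = int((value - lo) / width) + 1
    return min(i, n_bins)          # clamp the top boundary into the last bin

ages = [15, 23, 30, 47, 74]
bins = [bin_index(a, 15, 74, 3) for a in ages]
print(bins)  # [1, 1, 1, 2, 3]
```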
Missing values
Missing values, referred to in Clementine as blanks, are values in a data set that are unknown, uncollected or incorrectly entered. These values are often nonsensical in the context of their field: if we have a field Age, a negative value is obviously nonsensical, and could therefore be defined as a blank. Wrongly typed values are generally regarded as blanks because a missing value is often deliberately entered using spaces, a nonsensical value or an empty field. Before mining, the training set must usually be massaged to clean it up; some algorithms will not accept data having missing values. Several techniques are available for dealing with missing values, and different techniques are appropriate in different situations, depending on the size of the data set, the number of fields containing blanks and the amount of missing information generally. The techniques fall into four categories: omit the offending records; omit the offending fields; fill in missing values with default values; fill in missing values with a value derived from a model.
Add and execute a Table node and inspect for missing values. Try to gauge where they occur: which fields and records seem to be most affected? You will notice some ? values and also some $null$ values (these have been converted by Clementine from ?; see below).

Now look through the bands.names file (at the end of this document) to identify all the numeric attributes and check the type they have been given by Clementine. Sometimes numeric attributes are assigned type Set, i.e. discrete. (One reason for this is that using ? as a marker for a missing value forces the type-determining mechanism in Clementine to infer that the attribute values should be stored internally as strings and that the attribute type should be Set.) The internal storage type of the attribute is indicated in a yellow square on the attribute column in the Types tab of the Source node editor: an A means stored as string; an empty diamond means stored as integer; a diamond containing # means stored as real. Remember that internal storage type is not the same as the logical or usage type with which we are normally concerned.

To fix any such wrongly typed attribute, go to the Data tab in the Source node editor, right-click on the attribute name in the left column, and use Set Storage to change the storage to real; Override will then be checked. Now go to the Types tab, right-click on the attribute you are changing and take Select Values followed by <Read>. Finally, to activate the read, press the Read Values button at the top. The type should now change to Range, and the lower and upper values found in the data set should appear in the Values cell. To do all this for several attributes at once, you can use multiple select. Discrete attributes which happen to use numeric codes may be wrongly typed as numeric; these should be fixed also.
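The storage fix above is essentially re-parsing a string column as numbers while treating the ? marker as a null. A Python sketch of the same idea, with made-up values:

```python
# Sketch of re-parsing a column whose '?' markers forced string storage:
# convert each value to float, mapping '?' to None (a stand-in for a blank).
raw = ["0.8", "?", "1.2", "?", "0.5"]

def to_real(v):
    return None if v == "?" else float(v)

parsed = [to_real(v) for v in raw]
print(parsed)  # [0.8, None, 1.2, None, 0.5]
```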
Now try more complex (trade-off) strategies involving the removal of both fields and records. A good approach is to remove the worst few fields first and then select records from those remaining. When you have cleaned up the data satisfactorily, save using a File node from the Output palette.
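The trade-off strategy can be sketched in Python: count blanks per field, drop the worst field, then drop records that still contain a blank. The tiny table below is invented purely to show the mechanics:

```python
# Sketch of the trade-off strategy: drop the field with the most missing
# values first, then drop records that still contain a blank. Made-up data.
data = [
    {"a": 1,    "b": None, "c": 3},
    {"a": None, "b": None, "c": 6},
    {"a": 7,    "b": None, "c": 9},
]

# 1. Count blanks per field and remove the worst field.
fields = list(data[0])
missing = {f: sum(r[f] is None for r in data) for f in fields}
worst = max(missing, key=missing.get)          # here: "b", with 3 blanks
for r in data:
    del r[worst]

# 2. Remove records that still have a blank in any remaining field.
clean = [r for r in data if all(v is not None for v in r.values())]
print(len(clean))  # 2
```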
Filling in values
Sometimes it is appropriate to provide a value when one is missing rather than eliminate fields and/or records. This is best done when you are sure what the correct value should be (a data understanding task). For example, in the grain screened field, suppose that if a ? appears, it can be assumed that the correct value is NO. This value can be substituted using the Filler node. Place a Filler node after your Source node and edit it. Select grain screened from Fill in fields. From the Replace menu choose Blank values. In the Replace with box type 'NO'; notice, again, the string format used for symbolic values. Add a new Table node after your Filler node and execute to check that the filling in has occurred: there should be no ? values left in this field.
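The Filler node's behaviour here amounts to a conditional replacement over one field. A minimal sketch with invented records:

```python
# Sketch of the Filler node: replace the '?' blanks in "grain screened"
# with "NO". The records are made up for illustration.
records = [
    {"grain screened": "YES"},
    {"grain screened": "?"},
    {"grain screened": "NO"},
    {"grain screened": "?"},
]

for r in records:
    if r["grain screened"] == "?":        # blank marker in this data set
        r["grain screened"] = "NO"        # assumed correct default value

print(sum(r["grain screened"] == "NO" for r in records))  # 3
```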
7. Attribute Information:
   1. timestamp: numeric; 19500101 - 21001231
   2. cylinder number: nominal
   3. customer: nominal
   4. job number: nominal
   5. grain screened: nominal; yes, no
   6. ink color: nominal; key, type
   7. proof on ctd ink: nominal; yes, no
   8. blade mfg: nominal; benton, daetwyler, uddeholm
   9. cylinder division: nominal; gallatin, warsaw, mattoon
   10. paper type: nominal; uncoated, coated, super
   11. ink type: nominal; uncoated, coated, cover
   12. direct steam: nominal; use; yes, no *
   13. solvent type: nominal; xylol, lactol, naptha, line, other
   14. type on cylinder: nominal; yes, no
   15. press type: nominal; use; 70 wood hoe, 70 motter, 70 albert, 94 motter
   16. press: nominal; 821, 802, 813, 824, 815, 816, 827, 828
   17. unit number: nominal; 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
   18. cylinder size: nominal; catalog, spiegel, tabloid
   19. paper mill location: nominal; north us, south us, canadian, scandanavian, mid european
   20. plating tank: nominal; 1910, 1911, other
   21. proof cut: numeric; 0-100
   22. viscosity: numeric; 0-100
   23. caliper: numeric; 0-1.0
   24. ink temperature: numeric; 5-30
   25. humifity: numeric; 5-120
   26. roughness: numeric; 0-2
   27. blade pressure: numeric; 10-75
   28. varnish pct: numeric; 0-100
   29. press speed: numeric; 0-4000
   30. ink pct: numeric; 0-100
   31. solvent pct: numeric; 0-100
   32. ESA Voltage: numeric; 0-16
   33. ESA Amperage: numeric; 0-10
   34. wax: numeric; 0-4.0
   35. hardener: numeric; 0-3.0
   36. roller durometer: numeric; 15-120
   37. current density: numeric; 20-50
   38. anode space ratio: numeric; 70-130
   39. chrome content: numeric; 80-120
   40. band type: nominal; class; band, no band *

8. Missing Attribute Values: yes, in 302 examples

9. Class Distribution: (out of 512 total instances)
   -- 312 No band
   -- 200 Band