
http://www.sv-europe.com/crisp-dm-methodology/

https://courses.bigdatauniversity.com/courses/course-v1:BigDataUniversity+PA0101EN+2016/courseware/a70fec8899fd464289ee11117f2600ce/1096ce7eb267434f86019372585d06fb/

General Information

 This course is free.

 It is self-paced.

 It can be taken at any time.

 It can be taken as many times as you wish.

 Labs can be performed by downloading the Free trial version of IBM SPSS
Modeler.

 There is only ONE chance to pass the course, but multiple attempts per question
(see the Grading Scheme section for details)

Prerequisites

 None

Recommended skills prior to taking this course

 Basic knowledge of business statistics (recommended but not required)

Learning Objectives
In this course you will learn about:

 Introduction to Data Mining

 CRISP-DM Methodology
 Introduction to IBM SPSS Modeler - predictive data mining workbench

 SPSS Modeler interface

Syllabus

Lesson 1 - Introduction to Data Mining

 Introduction to Data Mining

 CRISP-DM Methodology

 Introduction to SPSS Modeler - predictive data mining workbench

 SPSS Modeler Interface


Lesson 2 - The Data Mining Process

 Business Understanding

 Data Understanding

 Data Preparation

Lesson 3 - Modeling Techniques

 Introduction to Common Modeling Techniques

 Cluster Analysis (Unsupervised Learning)

 Classification & Prediction (Supervised Learning)

 Classification - Training & Testing

 Sampling Data in Classification

 Predictive Modeling Algorithms in SPSS Modeler

 Automated Selection of Algorithms


Lesson 4 - Model Evaluation

 Metrics for Performance Evaluation


 Accuracy as Performance Evaluation tool

 Overcoming Limitations of Accuracy Measure

 ROC Curves
Lesson 5 - Deployment on IBM Bluemix

 Scoring new data

 Deployment of the Predictive Model

 What is IBM Bluemix?

 Predictive Modeling service: Deployment in the Cloud

 SPSS Collaboration and Deployment Services

About the Software

IBM SPSS Modeler is a comprehensive predictive analytics platform designed to bring predictive intelligence to decisions made by individuals, by groups, by systems – by your enterprise as a whole. The following video provides an overview of the product.

IBM SPSS MODELER: THE POWER OF PREDICTIVE INTELLIGENCE

Trial software
Register for the free trial of IBM SPSS Modeler software that can be used in
this course

Community Support
Visit the Predictive Analytics community for up-to-date information about
Predictive Analytics such as blogs, discussions, and more!

Grading scheme
1. The minimum passing mark for the course is 70%, with the following weights (a short worked example follows this section):

 50% - All Review Questions

 50% - The Final Exam

2. Though the Review Questions and the Final Exam each have a nominal passing mark of 50%, the only grade that matters is the overall grade for the course.

3. Review Questions have no time limit. You are encouraged to review the course
material to find the answers.  Please remember that the Review Questions are
worth 50% of your final mark.

4. The final exam has a 1 hour time limit.

5. Attempts are per question in both the Review Questions and the Final Exam:

 One attempt - For True/False questions

 Two attempts - For any question other than True/False

6. There are no penalties for incorrect attempts.

7. Clicking the "Final Check" button when it appears means your submission is FINAL. You will NOT be able to resubmit your answer for that question ever again.
8. Check your grades in the course at any time by clicking on the "Progress" tab.
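
As referenced in item 1, here is a short worked example of how the weights combine into the overall grade. This is an illustrative Python sketch with made-up scores, not part of the course material:

review_questions = 80.0  # hypothetical percent scored across all Review Questions
final_exam = 65.0        # hypothetical percent scored on the Final Exam

# Each component carries a 50% weight; the course is passed at 70% overall.
overall = 0.5 * review_questions + 0.5 * final_exam
print("Overall: %.1f%% - %s" % (overall, "PASS" if overall >= 70.0 else "FAIL"))
# Overall: 72.5% - PASS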

Certificate and Badge Information

When you pass this course you will receive:

 An online downloadable completion certificate.

This will be enabled immediately after passing the course (see section at
the bottom titled "Completion Certificate and Badge")

Change Log
This course was last updated on May 2nd, 2016:

 The course code was changed from DS101EN to PA0101EN.

Copyrights and Trademarks

IBM®, the IBM logo, and ibm.com® are trademarks or registered trademarks of
International Business Machines Corporation in the United States, other countries,
or both. A current list of IBM trademarks is available on the Web at “Copyright and
trademark information” at: ibm.com/legal/copytrade.shtml

References to IBM products or services do not imply that IBM intends to make them available in all countries in which IBM operates.

Netezza®, Netezza Performance Server®, NPS® are trademarks or registered trademarks of IBM International Group B.V., an IBM Company.

Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both.

Microsoft, Windows, Windows NT, and the Windows logo are trademarks of
Microsoft Corporation in the United States, other countries, or both.

Java and all Java-based trademarks and logos are trademarks or registered
trademarks of Oracle and/or its affiliates.

UNIX is a registered trademark of The Open Group in the United States and other
countries.

Other product, company or service names may be trademarks or service marks of others.

About this course

Module 1: Introduction to Data Mining

Learning Objectives

Introduction to Data Mining (11:20)

IBM SPSS Modeler Interface (4:52)

Lab 1 - SPSS Installation (3:35)

Review Questions
Learning objectives
In this lesson you will learn about:

 Introduction to Data Mining

 CRISP-DM Methodology

 Introduction to IBM SPSS Modeler - predictive data mining workbench

 SPSS Modeler Interface

THE DATA MINING PROCESS (12:06)


Hello and welcome back to Predictive Modeling Fundamentals. This is Lesson 2. I'm the product marketing manager for IBM Predictive Analytics, joined by the product manager for IBM Predictive Analytics. In this lesson we pick up where we left off: we will discuss the first steps of the CRISP-DM methodology, cover data preparation and data preprocessing, and then perform data preparation with SPSS Modeler.

Our agenda: we start with the first phase of CRISP-DM, Business Understanding, and discuss the use case we will be using throughout the rest of this course. We then go over Data Understanding and the tools for data exploration available in Modeler, move on to data preparation and preprocessing and review the tools Modeler offers for that, and finish with a lab.

The first phase of CRISP-DM is Business Understanding. Without knowing exactly what the objectives and requirements are for the project, the chances for success are slim. You have to know what you are going to do, what you are going to look for, what data you are going to need, and why you need it. Here we are looking at the sinking of the Titanic as our use case. The Titanic collided with an iceberg, and most people died because of the lack of lifeboats and life jackets. Those most likely to survive were women, children, and the upper class. Our challenge in this project is to analyze the data and predict, per passenger, whether they were likely to survive.

The Titanic dataset comes from Kaggle (kaggle.com), a platform that hosts data mining competitions. A lot of datasets are publicly available there, and it is a great way to learn and explore. For our case the data already comes as separate training and testing datasets, so we do not have to split it into those two sections ourselves.

Some of the tools available in SPSS Modeler for exploring data: the Data Audit node gives you an overview of the data, and there are graph nodes that let you examine fields in various ways, including distributions, histograms, and web graphs that show relationships across categorical values. These are tools we will be using throughout the course.

Data preparation is where we create the initial dataset, and oftentimes the preparation tasks take the majority of a project's time. These steps can be repeated many times, because data in the real world is dirty. It is incomplete: attribute values can be lacking; for example, an age field can hold nothing, which is obviously not correct. It can be noisy, which means it contains errors or outliers. And it can be inconsistent, with conflicting codes or names and discrepancies between different tables. We have to account for all of that and take care of it before we can run a model that actually makes sense.

The key tasks in data preprocessing are: data cleansing, which is filling in missing values, smoothing noisy data, identifying and removing outliers, and resolving inconsistencies; data integration, which is the combination of data from different databases; and data transformation, where we sometimes have to transform fields by normalizing them. For example, an income field in the thousands of dollars would dominate other fields simply because its values are so large, so we normalize it, transforming it to a 0-to-1 range. Finally, we sometimes have to perform data reduction, which can be done with Principal Component Analysis or factor analysis: you build a reduced representation of the data that tries to capture as much of the variability and information as possible in a smaller size, which is necessary especially when you are working with extremely large datasets.

There are several ways to handle noisy data. One of them is binning: you can sort the data and partition it into bins, for example numbers of customers within certain spans. You can also use clustering to identify where the data points group together; the clusters show which values belong together and can also help us identify outliers. And expert knowledge is crucial, because a subject matter expert can detect potential outliers: a person who actually knows the data knows what is at stake. Outliers are data objects that are very different from the general representation of the data; they stand out. Sometimes outliers are errors and have to be corrected, but sometimes outliers actually carry invaluable information and have to be included in the model, and this is where subject matter expertise especially comes into play.

When it comes to data transformation, this covers a few things: smoothing, which removes noise from the data; aggregating data; and normalization, where min-max normalization rescales everything to, for example, the 0-to-1 range, and z-score normalization standardizes values around the mean. We can also create new features or attributes, for example with principal components analysis.

Some of the tools available to us in Modeler for data preparation: first, the Type node. It allows us to define the type of each attribute: continuous data such as income or age; categorical string values; a flag such as yes and no; or nominal values with no particular order. The Type node also lets you set each field's role: it can be an input, which is used as a predictor in the model; a target, which is the outcome we are trying to predict; both, which is used specifically for association rules; none, so the field is not used; or partition, which means we are using this variable to separate the data into training, testing, and validation sets.

Another important issue to deal with when preparing data is handling missing values. It is important to understand and handle missing data before modeling, because of data quality: as the saying goes, garbage in, garbage out, and no model can support business decisions if it is fed bad data. Fields with missing data can produce misleading results if they are not identified before the analysis. There are different types of missing values: a null or no value for a numeric field; a blank or white space for a string variable; and user-specified missing values. For example, age can be coded as 999, which tells you the value was not actually recorded. The Data Audit node helps you handle these data preparation tasks: it helps you analyze the data, identify missing values, and normalize fields if you want to, and it takes care of some of the time-consuming work.

A number of record operations exist in Modeler as we continue the extract, transform, and load (ETL) process. You can sample data, sort it, and balance it out if a certain class is underrepresented. You can aggregate data, derive new variables, reclassify values, partition records into training and testing sets, and use Set to Flag to convert a categorical attribute into a series of dummy variables. So a lot is available here.

You can perform many of these operations with the Derive node and the Expression Builder, where you can compute a new field such as revenue with point-and-click; you can check whether a value is an integer, convert a variable to a different type, and so on. There are a lot of functions available for you in the Expression Builder.

In the lab we are going to load the Titanic data, explore it, and prepare it for modeling, and we will see you in the course.
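
The preprocessing tasks described above (treating user-coded and blank values as missing, filling gaps, min-max and z-score normalization) can also be illustrated outside of Modeler. Below is a minimal Python/pandas sketch with a small made-up Titanic-style table; the values and columns are hypothetical, chosen only to mirror the lesson's examples:

import numpy as np
import pandas as pd

# Hypothetical rows: 999 is a user-specified missing code for Age,
# and the empty string is a blank for the Cabin string field.
df = pd.DataFrame({
    "Age":   [22.0, 999.0, 35.0, np.nan],
    "Fare":  [7.25, 71.28, 8.05, 8.46],
    "Cabin": ["C85", "", "E46", ""],
})

# Data cleansing: recode 999 and blanks as missing, then fill numeric gaps.
df["Age"] = df["Age"].replace(999.0, np.nan)
df["Cabin"] = df["Cabin"].replace("", np.nan)
df["Age"] = df["Age"].fillna(df["Age"].median())

# Data transformation: min-max rescales a field to the 0-to-1 range;
# z-score expresses it as distance from the mean in standard deviations.
df["Fare_minmax"] = (df["Fare"] - df["Fare"].min()) / (df["Fare"].max() - df["Fare"].min())
df["Age_zscore"] = (df["Age"] - df["Age"].mean()) / df["Age"].std()
print(df)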

LAB 2 - DATA EXPLORATION AND PREPARATION (10:48)

www.kaggle.com/wayoflores

Welcome back to Predictive Modeling Fundamentals. In this tutorial we are going to get the data for our project, put it into SPSS Modeler, and start doing some preparation. The first step is getting our dataset; there is a link in the tutorial document that goes along with this series that directs you to the training and testing datasets, but I have it open here: it is kaggle.com/c/titanic/data, and you should see something like I have on the screen. This is a dataset of the passengers on the Titanic with some information about them, such as what they paid for a ticket, their sex, their age, where they were staying on the Titanic, and things like that. Our class variable, the one we are trying to predict, is whether they survived or died when the Titanic sank. There are two CSV files to download, train.csv and test.csv; download those files to wherever you work out of, and then we will be ready to move forward.

Once you have those downloaded, go back to SPSS Modeler and start a new stream. Under the Sources palette we have a Var. File node, and this is what you use for importing text files into Modeler. The dialog here is very easy: just to the right of the File field there is a button that opens a browser to select the file to load, so navigate to where you have your training file; that is the one we want to get in first.

Once you find it, a nice feature is that we can preview the data. When we do this, we see the attributes that we have in the dataset, and it shows the top 10 rows of the CSV file. We have whether they survived, the class, and so on, but you can see the data is not read right: it looks like the class appears as part of the name. That happens sometimes when a CSV file contains quotes and the reader does not know exactly how to handle them. That is all right; what we want to do here is, in the "Double quotes" setting at the bottom, switch from Discard to "Pair and discard". With that changed, get a preview again, and you can see that the text in double quotes is now parsed as intended and everything is as expected: passenger ID, survived as 0 or 1, the class for each of the passengers, then the name, the sex, the age, and a few other details.

So we can click OK, and now, which is pretty nice, the node has been renamed with the name of the file, so we know this is our training data. What we can do next is go to the Output palette and add a Table node; I just did that by double-clicking with the training node selected. A different way you could do it is to drag the node onto the canvas, right-click your input node, click Connect, and click on the node you are trying to connect to. That is one way to connect nodes, and you can get rid of connections by right-clicking and deleting them. Another way, a little quicker, is a shortcut if you are using a mouse with a scroll wheel: if you click down on the wheel while over a node and drag, you can see that it draws a line, and releasing it on the node you want to connect to makes that connection. It is a nice shortcut for making easy connections between nodes.

The Table node just shows, in a visual way, basically a grid of the data within Modeler. I added it, made the connection, right-clicked on it, and clicked Run, and this shows us, instead of just the top 10 rows, all of our data in a nice table. This is good just to get a sense of the data you are going to be working with, but it is not really good for doing any kind of adjustments, and you can see here that we have some null values. So this is a first step, and we are going to continue with some more analysis of the data.

What we can do now, since we have the data in Modeler, is explore it. If you go to the Output palette, you can add a Data Audit node, and once again connect the input data to the Data Audit node; you can see right away it detects that there are 12 fields. I just double-click it, and if I click Run, we get information about the data that we have. This is a nice way to do some initial exploratory analysis. We see graphs: we have some histograms here, and the Age attribute is roughly normally distributed. For our categorical variables, we can see that Sex is pretty well balanced. You can also see the details you get with the descriptive statistics: minimum, maximum, mean, and standard deviation are there for your continuous variables, while for any categorical variable it tells you the number of categories. For Sex we have two, but for Cabin we have a hundred and forty-eight, so Cabin may not be a good predictor for us. We do see that the mean or average age is 29.67. We also have the quality report, which shows 75% complete fields and 20% complete records. This is definitely something you want to do every time you start a project: check to make sure your data is high quality. So that is the exploratory analysis.

The next step is to do some preparation of our data. If you go to the Field Ops palette, you will see Type, and I added it by double-clicking it onto the canvas. What this does is give us, for each field (or attribute, or variable, whatever you want to call it), its measurement type, so whether it is continuous, categorical, flag, or nominal, and the role that it has. Some things we can do at this point to save us some work: we can say that the Survived variable is our target. That means that as we build models, Modeler is going to automatically know that we want to be predicting Survived, whether the passenger survived or not. At this point we are also going to remove some of these fields by giving them the role None. This is probably intuitive, but things like Name are not going to be a good predictor for us; Ticket we can switch to None, as that will not be a good predictor either; and we are also going to remove Cabin the same way. The remaining fields keep the role Input: we are going to be taking our inputs and trying to predict our target. Then click Apply here, and OK.

The next thing we want to do is some more data preparation, but this time we are going to be using an Auto Data Prep node, which can also be found in Field Ops. I am just going to put it on the canvas, connect these two, and go in, and I have options here. I just want to point out that there is a red triangle on the node, and you will see that once we complete this process, that is going to change. In the Objectives tab, we want to keep the setup set as "Balance speed and accuracy". When we go to Settings, we go to "Prepare Inputs and Targets", and here are some checkboxes; these should mostly be fine for what we are doing. I am going to uncheck "Reorder nominal fields", because we want to keep the fields in the same order to make things easier; we can leave the other boxes checked. The other thing we want to do on this screen is uncheck "Transform continuous fields": if we leave it checked, this node will normalize our continuous variables, putting them on a scale based on the standard deviation and mean that it calculates. Just to keep everything as it is and not do any transformations, we leave it unchecked. Then click Apply, and let's click "Analyze Data" to run this node. We get some new transformed class fields, and we also have Age transformed, so this node did its work for us. We can click OK, and that completes our data preparation steps for this lab.
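
For readers who want to double-check the lab outside of Modeler, roughly the same load-explore-prepare steps can be sketched in Python with pandas. This assumes train.csv from the Kaggle Titanic page is in the working directory; the role assignments mirror the Type node settings above:

import pandas as pd

# pandas pairs double quotes by default, which avoids the quoted-name
# parsing issue handled in the Var. File node dialog.
train = pd.read_csv("train.csv")

print(train.head(10))                  # preview: the top 10 rows
print(train.shape)                     # row count and the 12 fields noted in the lab
print(train.describe(include="all"))   # rough analogue of the Data Audit node
print(train.isnull().mean())           # fraction missing per field (data quality)

# Mimic the Type node roles: Survived is the target; Name, Ticket,
# and Cabin get the role "None"; everything else stays an input.
X = train.drop(columns=["Survived", "Name", "Ticket", "Cabin"])
y = train["Survived"]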

Learning objectives
In this lesson you will learn about:

 Introduction to Common Modeling Techniques

 Cluster Analysis (Unsupervised Learning)

 Classification & Prediction (Supervised Learning)

 Classification - Training & Testing


 Sampling Data in Classification

 Predictive Modeling Algorithms in SPSS Modeler

 Automated Selection of Algorithms

MODELING TECHNIQUES (9:06)

Welcome to the Predictive Modeling Fundamentals class. I am the product marketing manager for IBM Predictive Analytics, joined by the product manager for IBM Predictive Analytics. In this session we are going to introduce some common modeling techniques, discuss the difference between supervised and unsupervised learning, and review the algorithms that are available in IBM SPSS Modeler. The agenda for today: we will introduce the two families of techniques; we will cover unsupervised learning, which is cluster analysis; we will discuss supervised learning, that is, classification and prediction; we will go further into classification with training and testing, continuing the Titanic use case we saw in the previous lectures; and we will talk about sampling data, the predictive modeling algorithms in SPSS Modeler, and the automated selection of algorithms.

The common modeling techniques available to us break down into two buckets. Supervised learning describes and distinguishes classes for future predictions based on a training set; common methods are decision trees, regression analysis, k-nearest neighbors, and neural networks. For example, you have historical data on customers who churned and customers who did not. You build a model trained on that data, and then you can apply it to a new customer: we do not know whether that customer will churn, so we make a prediction for them. That is an example of supervised learning. Then there is unsupervised learning, where we analyze data whose labels or outcomes are unknown in order to create groups of objects that are similar to each other within a group but dissimilar to the other groups; that is cluster analysis. Some of the common methods are k-means clustering, hierarchical clustering, and two-step clustering. There are also association rules, where we analyze data for events or instances that occur together: for example, chips and beer are commonly purchased together, and toothpaste and toothbrushes are also purchased together, so we look for these instances. Some of the common algorithms in this field are Apriori and CARMA.

So, unsupervised learning: think of grouping your customers by their behavior from a collection of data. Within these clusters the data points are similar to one another in the same cluster, but at the same time they are dissimilar to the points in any other cluster. Cluster analysis allows us to group a dataset's objects into these clusters when the classes are not provided: we do not know up front what the groups or outcomes will be. We let the model group the objects based on their similarity and dissimilarity to each other, learning by observation rather than learning from examples.

When it comes to supervised learning, classification and prediction, we can do two things. With classification we predict categorical labels: churn or no churn, fraud or no fraud, purchase yes or no. It does not have to be binary; there can be multiple outcomes. For example, we can predict whether customers will buy fewer than three items, more than three items, or more than five but fewer than ten. We construct a classification model based on the training set and use it to classify new data. Prediction, in contrast, models continuous-valued functions, for example predicting unknown or missing numeric values. So classification is for categorical outcomes, and prediction is for continuous ones.

In classification, and in supervised learning in general, training and testing is very important. We want to split our data into a training set and a testing set before we fit the model; for example, we might put 75 percent into training and 25 percent into testing. First we train the model on the training dataset, where the existing classes are known, and then we test it on the remainder of the data that was not included in the training. This allows us to evaluate the model: we compare its accuracy, the percentage of cases correctly classified by the model, on the training and testing datasets. We want to see high accuracy not just on the training data but on the testing data as well. Sometimes we run into problems here: we do really well on the training data, but the model does not do well on the testing data. That is an overfitting problem: the model does not generalize to new data, which means it is probably too complex and has started to capture the noise and quirks of the training set. At that point we have to go back, revisit the model, and perhaps simplify it so that it transfers to future unknown objects; at the same time, we do not want a model so simple that it misses the real patterns.

Why do we sample data? Because we often want to work with a smaller subset of a really big dataset that is still representative of the population. There are different approaches. You can just take a simple random sample, say 30 percent of the original data, but sometimes that may not be appropriate, for example with unbalanced data. Say we are trying to predict whether tumors are benign or malignant: we will have mostly benign cases and only a few malignant ones, and it is critical for us to accurately predict both. With only a few percent of the cases malignant, a simple sample means we would do really well predicting the benign cases but fail to classify the malignant cases correctly. So we have a couple of other approaches: cluster sampling, which samples whole groups or clusters, and stratified sampling, where we select samples independently from non-overlapping strata, for example men and women sampled in equal proportions, or certain regions or socioeconomic groups, so that these groups are appropriately represented and we maintain the original proportions of those variables in our sample.

Modeler has a lot of different algorithms available for all needs, ranging from classification and prediction algorithms (different decision tree algorithms, regression analysis, neural networks, generalized linear models, Cox regression for survival data, support vector machines, k-nearest neighbors) to clustering, where we have k-means, Kohonen, and TwoStep, an anomaly detection algorithm that helps to identify potential outliers, and association algorithms such as Apriori and CARMA.

So what happens if you do not know which algorithm to pick, if you feel like a kid in a candy store: which algorithm is right for you? In Modeler there are automated modeling nodes available that select the best algorithms for your project given what you are trying to predict: the Auto Classifier node for categorical targets, Auto Numeric for predicting continuous fields, Auto Cluster, and time series nodes for forecasting. So with that, let's go ahead with the third lab, where we are going to build a logistic regression model for the Titanic data and then also use the automated modeling feature.
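
As a rough illustration of the train/test split and stratified sampling ideas above, here is a short scikit-learn sketch. It uses the library's bundled breast-cancer dataset to echo the benign/malignant example from the lesson; it is not the course's Titanic lab:

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# A binary benign/malignant classification task.
X, y = load_breast_cancer(return_X_y=True)

# 75% training / 25% testing; stratify=y keeps the original class
# proportions in both parts, i.e. a stratified sample.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
print("training accuracy:", model.score(X_train, y_train))
print("testing accuracy: ", model.score(X_test, y_test))
# A large gap between the two numbers is the overfitting symptom
# described in the lesson.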

MODEL EVALUATION

Learning objectives
In this lesson you will learn about:

 Metrics for Performance Evaluation

 Accuracy as Performance Evaluation tool

 Overcoming Limitations of Accuracy Measure

 ROC Curves
MODEL EVALUATION (9:03)


Hello and welcome to Predictive Modeling Fundamentals. I am the product marketing manager for IBM Predictive Analytics, joined by the product manager for IBM Predictive Analytics. In this session we are going to understand the common metrics for classification model evaluation. We will pick up where we left off in the previous class: we will talk about applying and testing the model, and then we will use these metrics to express how well it performs in terms of accuracy and beyond.

Let's begin by reviewing some of the concepts from previous lectures: data mining techniques, training and testing data, and sampling data. We will then move to metrics for performance evaluation, discuss accuracy as a performance evaluation measure, and cover other measures for overcoming the limitations of accuracy. We will talk about ROC curves for measuring how well a model performs, and then we will go into a lab.

As we saw previously, data mining techniques can be broken up into three main categories. First, supervised learning, which is classification and prediction, where we look at historical data and attempt to predict outcomes or classify items into groups based on that historical data; we describe and distinguish classes for future prediction. Classification deals with categorical outcomes, while prediction deals with continuous ones; we employ decision trees, k-nearest neighbors, and neural networks for classification, and regression for prediction. Second, unsupervised learning, or clustering, where the class labels are unknown to us and we create groups of objects with high intra-class similarity and high inter-class dissimilarity: objects within a cluster are similar to each other, and different clusters are dissimilar to each other. The common clustering methods we use are k-means, hierarchical clustering, and TwoStep. Third, there are association rules, where we look for events that occur together, items commonly bought together; a common method here is Apriori.

For classification and prediction we evaluate model performance by splitting the data: we put the majority, say 60 to 75 percent, into training and the remainder into testing. We train the model on the training section of the historical data, where the existing classes are known (that is the "supervised" part), and then we score the data that was not included in training, the testing set. This separation allows us to compare how the model performed on the training set versus the test set, and in particular to see whether it is generalizable to new data, so that we do not have the overfitting problem where the model does really well on the training data but poorly on the test data. We want to see high accuracy for both training and testing, and then we use the model for scoring new, unknown data.

Sometimes it is important for us to sample the data, because we are dealing with a very large dataset: we need to work with a smaller subset of it, chosen in such a way that the subset is representative of the larger population it came from. We can sometimes take a simple random sample, but that may not be appropriate for unbalanced data, where a large share of records comes from one class and only a few come from another, rarer class that is important for us to predict accurately. In that case we use complex samples such as stratified samples, where we maintain the original proportions of the classes. For example, if we have 90 percent positive and 10 percent negative cases in the original data, our sample should also have 90 percent positive cases and 10 percent negative cases.

Once we have built a model, say one that identifies which products our customers and prospects would likely want, we want to see how well it does. One way to do that is accuracy, and generally we get there by looking at the confusion matrix: a matrix of how the model classified our cases against how they actually are. We have the true positives and true negatives, the cases accurately classified as either yes or no, and then we have the false negatives and false positives. It does not have to be a binary outcome: the matrix can be multi-dimensional with multiple classes, but binary is the easiest to look at and compare.

Accuracy as a performance evaluation measure looks at how many total cases were correctly classified: the true positives plus the true negatives, divided by the total number of cases. It has its limitations, even though it is a very popular tool. For example, suppose you have very few negative cases, and it is extremely important for us to accurately predict those negatives. A model that simply predicts everything to be positive will have really high accuracy and look as if it did really well, but in reality it failed exactly on the cases it should not have failed on. This is especially important when predicting, say, malignant tumors, where it is vital to identify the rare cases; with plain accuracy it is easy to overlook that.

There are other measures to deal with this. One is precision, which is true positives divided by the sum of true positives and false positives: of the cases we predicted to be true, how many actually are. Then there is sensitivity, which is true positives divided by true positives plus false negatives: how many of the actually positive outcomes we predicted to be positive. And there is specificity, which is true negatives divided by true negatives plus false positives.

Finally, we can compare models against each other on performance. For example, we ran one logistic regression and one decision tree, and we want to see how they perform in comparison to each other. For that we use the ROC curve, the receiver operating characteristic curve, which presents the performance of a binary classification model and its ability to distinguish true positives from false positives: it is a graph of the true positive rate against the false positive rate. In the sample screen we have two models built, one curve representing a decision tree and the other representing logistic regression. The decision tree here has done better than the logistic regression because its curve is closer to the upper left corner, and that is what we want: we want the model's curve to be bowed toward the upper left corner as much as possible. So with that, let's move on to the next lab, and we will see you in the next class.
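
The evaluation metrics above follow directly from the four confusion-matrix counts. A short sketch, continuing the hypothetical model and test split from the previous code example:

from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve

# scikit-learn lays the binary confusion matrix out as [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_test, model.predict(X_test)).ravel()

accuracy    = (tp + tn) / (tp + tn + fp + fn)  # all correct / all cases
precision   = tp / (tp + fp)                   # of predicted positives, how many are real
sensitivity = tp / (tp + fn)                   # of actual positives, how many were caught
specificity = tn / (tn + fp)                   # of actual negatives, how many were caught
print(accuracy, precision, sensitivity, specificity)

# The ROC curve plots the true positive rate against the false positive
# rate across score thresholds; an area under the curve near 1.0 means
# the curve hugs the upper left corner.
scores = model.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, scores)
print("AUC:", roc_auc_score(y_test, scores))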

MODELING FOR PREDICTIONS (7:33)

Welcome back to Predictive Modeling Fundamentals. In this video we are going over scoring test data. In this tutorial we are going to keep building off the stream that we have been working on in the past few labs. In the last lab we finally built some models and did some evaluation, so now what we really want to do is actually use the model that we created. We want to limit our predictions to only those we are at least 80% confident about, and the next step after that, using our model in a web application, will be in the next tutorial. So this session is about scoring the testing data in Modeler and limiting our predictions.

You should have a stream that looks something like the stream on my canvas. If you are just jumping in, you will want to go back to the first tutorials, the first labs, and either watch the videos or go through the steps on your own to get to this point; here we are just going to continue moving forward. To get started: we want the testing CSV file that you downloaded in the second lab, the testing dataset for the Titanic data, and we are going to add it to Modeler. It is really the same as we did the first time: go to the Sources palette at the bottom of your screen, click and drag the Var. File node onto the canvas, and double-click it. Once again it is the same process as before: I select the testing CSV file and click Preview, and we see once again that the quotes are parsed incorrectly. That is all right, because we know how to handle this: in the "Double quotes" setting at the bottom of this dialog box, switch from Discard to "Pair and discard", and now, just from that one change, you can see that Name is no longer being read incorrectly and it looks pretty good. Click Apply and click OK.

Now you can see we are basically set up: we can do something similar to what we did with our training data, just with our testing data, although at this point we have already built our model. We have already told the stream what is important to look for, so as long as the testing data has that in place, it is going to be easy for us to apply our model.

Something I have not pointed out so far in the SPSS Modeler environment: on the far right side at the top, we have a few different tabs, and these come in handy. If you are working on multiple projects you can have multiple streams open and quickly jump between them by clicking there. The Outputs tab gives you all the tables, graphs, and previews you have created, so it is a really good way to jump back to what you have been working on. Models is the third tab here, and this is what we are going to use at this point. Since our Auto Classifier model performed better, let's go forward with it. You can think of the model nugget just like something from the palettes at the bottom: click and drag it onto the canvas. To get it connected, I am just going to make a connection directly between the testing data and that model, and to get an output, just to visualize something here, I am going to add a Table node to this as well. Just click Run and see how this works.

OK, so that was really quick, and that is worth pointing out if you are just joining or did not see the last video: when you first create the model, especially the Auto Classifier running through a big dataset and the many different models it is trying to test out, it can take some time. The advantage of what we have here is that rather than creating the model again, we can reuse it on our testing dataset, and it was almost instantaneous to get those results. You can see here that we have our predicted class and our confidence. If you recall, the original testing dataset did not have an actual class of survived or died: all we had was PassengerId through Embarked. Just by running this data through the model, we now have a predicted class and the probability for it.

Something you might want to do in this kind of work is say: that is good, but I do not want to apply this model everywhere; I only want to make a prediction when I am over a certain percentage of confidence in my results. That is what we are going to do now. Under the Record Ops palette we have a node called Select, and this is really powerful, because this is where you can select only the rows, the instances, that meet your criteria. So we connect the scored output to the Select node and go in. The dialog here is kind of a calculator that lets you build your formulas, and there are different ways you can use it; you can type the expression in manually. What we use here is the field $XFC-Survived, which is the confidence number that I showed you in the table; that is our variable, and we want to make sure we only keep rows where it is greater than 0.8, so that we are at least 80% confident in our results. We can click Check at the bottom; it will throw an error to let us know if we did something wrong, but this looks good, so we can click OK here. Now let's add another table: I click on Table, make that connection, and run it. If you recall, our previous table had all the outputs, 418 rows; in this table, where we limited it, we have only 273. Looking at the far right column and doing a spot check, you can see that everything is higher than 0.8, which is what we want, because these are the predictions we feel confident about.

That was it for this tutorial. In the next tutorial we will be setting up a Bluemix account and using our model in a web application, where we can use it dynamically on our website. So we will see you then.
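
The Select node's confidence filter has a one-line analogue in pandas. The sketch below assumes a scored table whose columns follow Modeler's Auto Classifier naming ($XF-Survived for the prediction, $XFC-Survived for the confidence); the rows are made up:

import pandas as pd

# Hypothetical scored output for three test passengers.
scored = pd.DataFrame({
    "PassengerId":   [892, 893, 894],
    "$XF-Survived":  [0, 1, 1],
    "$XFC-Survived": [0.91, 0.62, 0.85],
})

# Keep only predictions we are at least 80% confident about,
# like the Select node expression '$XFC-Survived' > 0.8.
confident = scored[scored["$XFC-Survived"] > 0.8]
print(confident)  # PassengerId 892 and 894 remain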

DEPLOYMENT ON IBM BLUEMIX (6:51)

Welcome to this session on deployment services. The agenda: we will cover what deployment is; second, how to deploy a model and the IBM solutions for doing so; then how to deploy to the cloud; and finally we will talk about SPSS Collaboration and Deployment Services.

When we are working on a data mining project, recall the way we split our data between training and testing before we train our model. Usually what you do is divide the data so that you use, for example, 80% to train the model and keep the rest for testing; the more data you use to train the model, the more accurate your predictions are going to be. So one part is used to train the model, and the second part is scored with it so the predictions can be compared against the observed values. If the dataset is not big enough, you may end up with the wrong balance between the two parts, so the split has to take the size of the data into account.

Going back to the CRISP-DM methodology we saw before, we have six phases. First is business understanding: trying to figure out what we are trying to do in this project. Second is data understanding: what data is available, and how can I get it into my environment? Then comes data preparation, which is basically getting the data ready for modeling. Then we come to modeling and evaluation. But actually creating the model is generally not the end of the project, because of the final phase: deployment. What we try to do here is deploy the model into a business environment so we can get some benefit from it.

For deployment there are different solutions provided by IBM that work well with SPSS. Let me point out three: SPSS Solution Publisher; the Bluemix service, available in the cloud; and IBM SPSS Collaboration and Deployment Services. I am going to focus now on the cloud offering, and I will do it on IBM Bluemix. For those who do not know what it is, IBM Bluemix is a cloud platform: a self-service application hosting environment. The idea is that you are able to deploy applications without having to spend weeks on setup, so application developers focus only on their business logic. As the application developer, you do not have to worry, for example, about how to install or manage the runtimes, frameworks, and libraries; Bluemix offers a bunch of different services that are easy to manage, and you can scale them as you need. Bluemix is based on Cloud Foundry, which is open source and has a very strong and growing community, so you can access it on the net, and there you will find many different services.

The one we are interested in is the Predictive Modeling service, a service offered to developers that enables them to integrate predictive capabilities into their applications. Basically, you develop the model as a stream in SPSS Modeler and upload it to the service, and the service provides you the endpoints, the APIs, so you will be able to call the model from your deployed applications. It is very simple: you create the service, you upload the stream file, and the service provides the scoring endpoint directly.

Finally, we have SPSS Collaboration and Deployment Services, a larger on-premises deployment solution; we will not go deeper into it in this training, but you should know it exists. In the lab that follows, you will deploy your model into the IBM Cloud and use it from an application. Thank you very much.
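
From the application's side, the "upload a stream, get an endpoint" workflow ends in an ordinary REST call. The sketch below is purely illustrative: the URL, path, and payload layout are hypothetical placeholders, not the actual Predictive Modeling service API:

import json
import urllib.request

# Placeholder endpoint; the real service defines its own URL,
# authentication, and request format.
SCORING_URL = "https://example.com/pm/v1/score"

payload = {
    "header": ["Pclass", "Sex", "Age"],   # hypothetical input fields
    "data": [[3, "male", 22.0]],
}
req = urllib.request.Request(
    SCORING_URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read()))  # expected: predicted class and confidence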

QUIZ 1

REVIEW QUESTION 1
 
(1/1 point)

Which of the following applications would require the use of data mining? Select all that apply.

 Predicting the outcome of flipping a fair coin
 Determining which products in a store are likely to be purchased together
 Predicting future stock prices using historical records
 Determining the total number of products sold by a store
 Sorting a student database by gender

Correct: Determining which products in a store are likely to be purchased together; Predicting future stock prices using historical records
You have used 2 of 2 submissions

REVIEW QUESTION 2
 
(1/1 point)

Which of the following is NOT a section of the Modeler Interface?

 Stream Canvas
 Stream, Outputs, and Model Manager
 Nodes
 Palettes
 All of the above are sections of the Modeler Interface

Correct: All of the above are sections of the Modeler Interface
You have used 2 of 2 submissions

REVIEW QUESTION 3
 
(1/1 point)

Which of the following is NOT a part of the Cross-Industry Standard Process for Data Mining (CRISP-DM)?

 Data Storage
 Modeling
 Data Preparation
 Business Understanding
 Evaluation

Correct: Data Storage
You have used 2 of 2 submissions

QUIZ 2

REVIEW QUESTION 1
 
(1 point possible)
Which phase of the data mining process focuses on understanding the
project requirements and objectives?

 Data Preprocessing
 Data Understanding
 Data Exploration
 Business Understanding
 Data Preparation

Submitted: Data Exploration - incorrect (the correct answer is Business Understanding)
You have used 2 of 2 submissions

REVIEW QUESTION 2
 
(1/1 point)

Which Data Preprocessing task focuses on removing outliers and filling in missing values?

 Data Reduction
 Data Transformation
 Data Integration
 Data Cleaning
 None of the above

Correct: Data Cleaning
You have used 2 of 2 submissions

REVIEW QUESTION 3
 
(1/1 point)

The IBM SPSS Modeler supports which data type?

 Ordinal
 Categorical
 Continuous
 Nominal
 All of the above

Correct: All of the above

QUIZ 3

REVIEW QUESTION 1
 
(1/1 point)

Which of the following methods are commonly used for supervised learning tasks? Select all that apply.

 Neural Networks
 Decision Trees
 K-Means
 CARMA
 Regression

Correct: Neural Networks, Decision Trees, Regression
You have used 2 of 2 submissions

REVIEW QUESTION 2
 
(1 point possible)

Classification is a subset of supervised learning that focuses on modeling continuous variables. True or false?

 True
 False

Submitted: True - incorrect (the correct answer is False; classification models categorical outcomes)


You have used 1 of 1 submissions

REVIEW QUESTION 3
 
(1/1 point)

Which of the following algorithms is NOT supported by the SPSS Modeler?

 K-Means
 Logistic Regression
 CARMA
 Apriori
 All of the above algorithms are supported

Correct: All of the above algorithms are supported

QUIZ 4

REVIEW QUESTION 1
 
(1 point possible)

What is the term for a negative data point that is incorrectly classified as positive?

 True Negative
 False Positive
 True Positive
 False Negative
 None of the above

Submitted: None of the above - incorrect (the correct answer is False Positive)
You have used 2 of 2 submissions

REVIEW QUESTION 2
 
(1/1 point)

Which of the following is NOT a cost-sensitive performance metric?

 Specificity
 Precision
 Sensitivity
 Accuracy
 All of the above metrics are cost-sensitive

Correct: Accuracy
You have used 2 of 2 submissions

REVIEW QUESTION 3
 
(1/1 point)

What is the formula for the precision metric?

 (True Positive) / (True Positive + False Negative)
 (True Negative) / (True Negative + False Positive)
 (False Positive) / (True Positive + False Positive)
 (True Positive) / (True Positive + False Positive)
 (False Positive) / (True Negative + True Positive)

Correct: (True Positive) / (True Positive + False Positive)
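
As a quick numeric check of the precision formula (toy counts, not taken from the course data):

# Hypothetical confusion-matrix counts.
tp, fp = 40, 10
precision = tp / (tp + fp)  # (True Positive) / (True Positive + False Positive)
print(precision)            # 0.8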

QUIZ 5

REVIEW QUESTION 1
 
(1/1 point)

In general, the testing dataset should be significantly larger than the training dataset. True or false?

 True
 False

Correct: False


You have used 1 of 1 submissions

REVIEW QUESTION 2
 
(1/1 point)

Which of the following is NOT a model deployment solution?

 SPSS Solution Publisher
 IBM Collaboration and Deployment Services
 CRISP-DM
 Bluemix
 All of the above are model deployment solutions

Correct: CRISP-DM
You have used 1 of 2 submissions

REVIEW QUESTION 3
 
(1 point possible)

Which of the following statements are true of IBM Bluemix? Select all that apply.

 Bluemix generally takes about a week to deploy an app
 Bluemix is supported by a growing community
 Bluemix is closed-source
 Bluemix provides a self-service application-hosting environment
 Bluemix provides built-in load-balancing capabilities
