This is my blog for the Data Mining module.
Analysis in Rapid Miner
I have downloaded the Titanic data and created a scatter plot that graphs age against survival. It shows that both the survivors and those who died came from all ages; a person’s chance of survival did not seem to depend on age. This would indicate that they did not save the younger people first, or the older people first, as may have been suggested. The data is also colour-coded by class: class 1 is blue, class 2 is green, and class 3 is red. The bulk of the red data is in the top left of the graph, which means that the vast majority of the class 3 people were young and they died.
I have downloaded data in relation to a golf club. This has details of whether there is play or not, and in what weather conditions. I have created a scatter plot in Rapid Miner of temperature against play (yes or no). This seems to indicate that play generally happens when the temperature is cooler, apart from two outliers where there was play at very hot temperatures.
I have downloaded data in relation to the payment method for products from a store. There are three methods, cash, cheque and credit card. I want to see if there is any correlation between age and payment method. I have created a scatter plot which graphs age against payment method. Cash payments are pretty evenly spread across all ages. Credit card payments are also across all ages. This may come as a surprise as it might have been expected that older people would not use credit cards as much as younger people. The payments by cheque occur amongst all ages until about age 60, after which there are very few payments by this method.
Finally, I have downloaded a file that outlines details of working hours, conditions and benefits for the employees of a company. I have created a scatter plot that graphs the latest wage increase against the working hours of the employees. The data is predominantly in the top left-hand corner of the graph, meaning that the main bulk of the workforce got small pay increases of under 5%, apart from a handful of people who got an increase of over 5%.
Regression Analysis in R
I have used R to run a regression analysis.
I have used the enrolment data from a university in the US which shows the number of enrolments each year. I want to see if there is any correlation between the enrolment numbers (ENROL) and other factors such as the unemployment rate (UNEMP), the numbers graduating from high school (HGRAD) and the average income per person (INCOME).
R gave me the following results:
Call:
lm(formula = ENROL ~ UNEMP + HGRAD + INCOME, data = datavar)

Residuals:
     Min       1Q   Median       3Q      Max
 -1156.2   -491.5   -3.121    382.6   1226.6

Coefficients:
              Estimate  Std. Error  t value  Pr(>|t|)
(Intercept) -8.379e+03   2.156e+03   -8.985  0.000594 ***
UNEMP        4.689e+02   1.236e+02    4.236  0.000812 ***
HGRAD        4.328e-01   7.523e-02    6.645  1.64e-05 ***
INCOME       4.018e+00   4.658e-03    8.478  4.89e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 661.5 on 1025 degrees of freedom
Multiple R-squared: 0.9652,  Adjusted R-squared: 0.9548
F-statistic: 232.5 on 3 and 1025 DF,  p-value: < 2.4e-14
From this output, we can determine that the intercept is -8379, the coefficient for the unemployment rate is 468.9, the coefficient for number of spring high school graduates is 0.4328, and the coefficient for per person income is 4.018. Therefore, the complete regression equation is:
Autumn Enrolment = -8379 + 468.9 * Unemployment Rate + 0.4328 * Number of Spring High School Graduates + 4.018 * Income per person.
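This equation can be sanity-checked by evaluating it directly. The Python sketch below plugs made-up predictor values into the fitted coefficients from the summary above; the input values (a 7% unemployment rate, 90,000 spring graduates, an income of 2,000) are purely illustrative and are not taken from the dataset:

```python
# Coefficients taken from the R summary output
intercept = -8379
b_unemp = 468.9
b_hgrad = 0.4328
b_income = 4.018

def predict_enrolment(unemp, hgrad, income):
    """Evaluate the fitted regression equation for one set of inputs."""
    return intercept + b_unemp * unemp + b_hgrad * hgrad + b_income * income

# Hypothetical inputs, chosen only to illustrate the calculation
pred = predict_enrolment(unemp=7.0, hgrad=90000, income=2000)
print(round(pred, 1))  # predicted autumn enrolment for these inputs
```

The prediction is just the intercept plus each coefficient multiplied by its variable, which is all the model does once it has been fitted.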
The results from R give us a number of statistics and indicators for us to see how good a fit our model is, and how significant or not is the correlation between the enrolment and the variables we have included.
The stars (for example, ***) indicate the predictive power of each feature in the model. The significance level (as listed by the significance codes shown) measures how likely it is that we would see an estimate this large if the true coefficient were actually zero.
The presence of three stars indicates a p-value below 0.001, which means that the feature is extremely unlikely to be unrelated to the dependent variable. In this case, the intercept and our three variables all have three stars beside them, so we can be confident that the variables we have included are related to the dependent variable.
A common practice is to use a significance level of 0.05 to denote a statistically significant variable. If the model had few features that were statistically significant, it may be cause for concern, since it would indicate that our features are not very predictive of the outcome. In this case, the intercept and our three variables all have significance levels that are very small, much smaller than the 0.05 level that is commonly used. Therefore we can conclude that these variables are all statistically significant, and we are correct to include them. The overall p-value for the model is a very small figure, in the order of 10 to the power of -14. Therefore we can conclude that these results did not occur by chance and that there is strong dependency between the variables.
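The star notation can be read off mechanically from the significance codes line in the R output. The small Python helper below is my own illustration of that mapping, not part of R itself:

```python
def signif_stars(p):
    """Map a p-value to R's significance codes:
    0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1"""
    if p < 0.001:
        return "***"
    elif p < 0.01:
        return "**"
    elif p < 0.05:
        return "*"
    elif p < 0.1:
        return "."
    return " "

# The p-values reported for our three predictors all earn three stars
for p in (0.000812, 1.64e-05, 4.89e-09):
    print(signif_stars(p))
```

Every p-value in our summary falls below the 0.001 cut-off, which is why each row carries three stars.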
The Multiple R-squared value (also called the coefficient of determination) provides a measure of how well our model as a whole explains the values of the dependent variable. It is similar to the correlation coefficient in that the closer the value is to 1.0, the closer the model comes to perfectly explaining the data. Since the R-squared value is 0.9652, we know that over 96 percent of the variation in the dependent variable is explained by our model. Because models with more features always explain more variation, the Adjusted R-squared value corrects R-squared by penalizing models with a large number of independent variables. The Adjusted R-squared value here is 0.9548. So we can conclude that our model is a very good fit for the data. With an R-squared value this high, there is little to be gained from searching for more variables.
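The adjustment follows the standard formula adjusted R² = 1 − (1 − R²)(n − 1)/(n − k − 1), where n is the number of observations and k the number of predictors. The sketch below applies it to hypothetical figures (an R² of 0.96 from 29 rows and 3 predictors); note that plugging rounded values from a printed summary back into this formula will not necessarily reproduce the reported adjusted value exactly:

```python
def adjusted_r2(r2, n, k):
    """Penalise R-squared for the number of predictors k,
    given n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Hypothetical example: with only 29 observations the penalty is visible
print(round(adjusted_r2(0.96, 29, 3), 4))
```

The penalty grows as k rises relative to n, which is exactly why the adjusted value guards against inflating the fit by piling on variables.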
The Residuals section provides summary statistics for the errors in our predictions, some of which are apparently quite substantial. Since a residual is equal to the true value minus the predicted value, the maximum error of 1,226 suggests that the model under-predicted by over 1,200 for at least one observation. The minimum residual value is -1,156 which means that the model over-predicted by over 1,100 for at least one observation. The maximum and minimum residuals are approximately 2 standard deviations out from the mean, which can be expected in a model like this. There are no outlier values that are way off from the rest of the data.
Fifty percent of the errors fall between the 1Q and 3Q values (the first and third quartiles), so the majority of predictions were between 491 over the true value and 383 under it.
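The residuals section is just a five-number summary of the prediction errors. A minimal Python sketch of the same summary, using a small made-up residual vector rather than the real one:

```python
import statistics

def residual_summary(residuals):
    """Return (min, 1Q, median, 3Q, max) of the prediction errors."""
    q1, med, q3 = statistics.quantiles(residuals, n=4)
    return min(residuals), q1, med, q3, max(residuals)

# Hypothetical residuals (each is true value minus predicted value)
errors = [-1100, -480, -20, 5, 390, 1200]
print(residual_summary(errors))
```

A negative residual means the model over-predicted for that observation, a positive one means it under-predicted, which is how the Min and Max rows of the R output should be read.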
Given the preceding performance indicators, our model is performing very well.