Directed Data Mining: Simple Linear Regression
Directed data mining builds models around specific variables within the data set. To use directed data mining methods, a target variable must be chosen; the remaining variables, chosen as predictors, are then used to explain or predict it. Directed data mining (supervised prediction and classification) is useful across many domains and business problems. One example is direct marketing, where data mining allows companies to predict which customers are most likely to respond to a campaign being run. This can reduce the costs of the campaign and increase the response rate among those targeted.
What is Regression Analysis?
One particularly useful directed data mining method is regression analysis. Regression measures the correlation between variables in the data set to reveal any connections between them. If a connection is found, regression can be used to identify the type of relationship that is present.
Preparing your Data for Regression Analysis
The first step in creating a regression analysis is to assign a target variable for the data. This can be done by adding a data set to the diagram, right-clicking it, and editing its variables. In the example below, we have chosen nHome as the target variable, which will allow us to identify the factors significantly associated with an increased number of home runs.
Once the target variable has been assigned, it is important to handle any data that may reduce the fit and relevance of the regression analysis. Two factors that can significantly influence the results of a regression analysis are:
- Missing values
- Skewed data
To reduce this risk, an Impute node or a Replacement node must be added to the diagram. In this case, an Impute node has been added to replace any missing values in numeric variables with the mean value.
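SAS Enterprise Miner handles mean imputation inside the Impute node itself, but the idea is easy to sketch outside the tool. The following Python snippet (the small data frame and its values are hypothetical, not from the tutorial's data set) shows the same operation: every missing value in a numeric column is replaced by that column's mean.

```python
import pandas as pd

# Hypothetical baseball data with missing values in numeric columns
df = pd.DataFrame({
    "nHits": [81, None, 92, 70],
    "nHome": [7, 24, None, 1],
})

# Replace each missing value with its column's mean, mirroring the
# Impute node's behaviour when "Mean" is the chosen imputation method
df_imputed = df.fillna(df.mean(numeric_only=True))
```

After this step no missing values remain, so they can no longer distort the regression fit.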
SAS™ Enterprise Miner contains the Regression node which can be used to perform regression analysis. This node can be found within the Model tab.
The Regression node can be added directly after an Impute or Replacement node within the diagram.
Simple Linear Regression
Simple linear regression is used for numeric (interval) data. In its univariate form, the technique compares two variables to establish whether a relationship is present. The relationship is determined by fitting a linear equation to the data to create a line of best fit.
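As a minimal illustration of fitting a line of best fit (the x and y values below are made up for demonstration), ordinary least squares finds the intercept b0 and slope b1 of the linear equation y ≈ b0 + b1·x:

```python
import numpy as np

# Illustrative data: one predictor x and an interval target y
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])

# Fit the line of best fit y ≈ b0 + b1 * x by ordinary least squares;
# polyfit returns the coefficients highest degree first
b1, b0 = np.polyfit(x, y, deg=1)  # slope ≈ 1.99, intercept ≈ 0.09
```

A slope close to 2 here simply reflects how the toy data were chosen; with real data the fitted coefficients quantify the strength and direction of the link between the two variables.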
Several options are available for the Regression node:
The first option that we are going to look at is the "Regression Type". In this example we are using "Linear Regression". When the target variable is interval-scaled (as nHome is in our example), the regression type is set to Linear Regression automatically.
Now that the regression type has been specified, we can look at the model selection methods. Four selection methods are available:
- Backward – Begins with all input variables in the model and removes variables until the stopping criterion is met.
- Forward – Begins with a null model and adds variables until the stopping criterion is met.
- Stepwise – Begins with a null model and adds variables, but may also remove previously added variables, until the stopping criterion is met.
- None – All input variables are used to fit the model.
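To make the selection methods concrete, here is a simplified sketch of forward selection in Python. It is not the algorithm the Regression node uses internally (SAS applies significance-level entry criteria); this version greedily adds the input that most improves R-squared and stops when no candidate improves it by a minimum gain, and all data below are simulated.

```python
import numpy as np

def r_squared(X, y):
    """R-squared of an ordinary least squares fit of y on X (with intercept)."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return 1.0 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

def forward_select(X, y, names, min_gain=0.05):
    """Greedily add the input that most improves R-squared until no
    candidate improves it by at least min_gain (the stopping criterion)."""
    chosen, best = [], 0.0
    while len(chosen) < X.shape[1]:
        gains = {j: r_squared(X[:, chosen + [j]], y)
                 for j in range(X.shape[1]) if j not in chosen}
        j = max(gains, key=gains.get)
        if gains[j] - best < min_gain:
            break
        chosen.append(j)
        best = gains[j]
    return [names[j] for j in chosen], best

# Toy example: only x1 actually drives the target
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 3.0 * X[:, 0] + 0.5 * rng.normal(size=100)
selected, r2 = forward_select(X, y, ["x1", "x2", "x3"])
```

Backward selection is the mirror image (start with all inputs, drop the least useful one each round), and stepwise alternates between the two.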
For larger data sets it is important to review the Optimisation options. These allow users to specify the maximum number of iterations used for model training, the maximum number of function calls, and the maximum CPU time. This can be particularly useful in large organisations where limited time is available to run models that consume large amounts of CPU power.
Run the model by right-clicking the Regression node and selecting Run.
Assessing the Results
When the regression model is run, the results window displays four internal windows, described below.
1) Score Rankings Overlay: Home Runs in 1986
The Score Rankings Overlay window plots charts that assess the model's performance across different score values. For interval target variables (as in our case), the default chart plots the mean of the predicted values against the mean of the target values for each bin, from the highest predicted score values on the left to the lowest on the right. The plot shows that our model fits the high score values very well, and still fits the middle and lower score values well. Two further charts plot the maximums and minimums of the predicted and target values against each other in each bin.
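The numbers behind this chart can be sketched as follows. Using simulated actuals and predictions (not the tutorial's data), we sort observations by predicted score from highest to lowest, split them into bins, and compute the mean predicted and mean actual value per bin; when the two series track each other closely in every bin, the model fits well across the score range.

```python
import numpy as np

# Hypothetical actuals and predictions from a reasonably accurate model
rng = np.random.default_rng(1)
actual = rng.normal(10.0, 3.0, size=1000)
predicted = actual + rng.normal(0.0, 1.0, size=1000)

# Sort observations by predicted score, highest first, and split them
# into 10 equal-sized bins, as the overlay chart does
order = np.argsort(predicted)[::-1]
bins = np.array_split(order, 10)

# One (mean predicted, mean actual) pair per bin; the overlay chart
# plots these two series against each other
pairs = [(predicted[b].mean(), actual[b].mean()) for b in bins]
```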
2) Fit Statistics
The Fit Statistics window lists the various model fit statistics that can be used to evaluate the model's performance. This can be helpful when creating custom settings for further modelling.
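For an interval target, statistics of this kind can be computed directly from the residuals. The sketch below (with made-up actual and predicted values) shows a few common ones; the exact set and naming in the Fit Statistics window may differ.

```python
import numpy as np

def fit_statistics(actual, predicted):
    """A few common fit statistics of the kind reported in the window."""
    resid = actual - predicted
    sse = float(resid @ resid)
    n = len(actual)
    sst = float(((actual - actual.mean()) ** 2).sum())
    return {
        "ASE": sse / n,                      # average squared error
        "RMSE": (sse / n) ** 0.5,            # root mean squared error
        "MAE": float(np.abs(resid).mean()),  # mean absolute error
        "R2": 1.0 - sse / sst,               # coefficient of determination
    }

stats = fit_statistics(np.array([3.0, 5.0, 7.0]), np.array([2.5, 5.5, 7.0]))
```

Lower error statistics and an R2 closer to 1 indicate a better-fitting model, which is how these values guide further modelling decisions.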
3) Effects Plot
The effects plot shows all the variables used in the model to predict the number of home runs, together with the intercept ("Effect Number 27"), in the graph below. The larger the bar, the more significant the variable is to the predictions; blue bars indicate a positive effect on the target variable and red bars a negative one. Hover over a bar to view that variable's effect (coefficient). The strongest effect in the plot below comes from the derived input variable flagging whether the player plays for Team Atlanta.
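What the effects plot visualises is essentially the fitted coefficient vector: bar length corresponds to effect magnitude and colour to its sign. A small sketch on simulated data (the inputs and coefficients below are invented for illustration):

```python
import numpy as np

# Simulated inputs where the first has a strong positive effect and the
# second a weaker negative one
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
y = 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(0.0, 0.1, size=200)

# Ordinary least squares with an intercept column
X1 = np.column_stack([np.ones(len(y)), X])
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
intercept, effects = coef[0], coef[1:]

# Rank inputs by absolute effect size, as the plot orders its bars;
# the sign of each effect corresponds to the bar colour
order = np.argsort(-np.abs(effects))
```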
4) Output Window
The final window within the results is the Output window. It contains the log of the regression analysis and further information about the variables and their weights within the charts. The final table in the output window contains the full Assessment Score Distribution, which shows the range of each prediction-score interval and the number of observations in the data set that fall within that range.
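The Assessment Score Distribution is, in essence, a histogram of the prediction scores. The snippet below reproduces the idea on simulated scores (the values are not from the tutorial's data): each row pairs a score range with the count of observations falling inside it.

```python
import numpy as np

# Hypothetical prediction scores for the scored data set
rng = np.random.default_rng(3)
scores = rng.normal(10.0, 3.0, size=500)

# Count the observations falling into each of 10 equal-width score
# ranges, analogous to the rows of the Assessment Score Distribution
counts, edges = np.histogram(scores, bins=10)
for lo, hi, n in zip(edges[:-1], edges[1:], counts):
    print(f"{lo:7.2f} to {hi:7.2f}: {n} observations")
```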