Data Mining: Transforming Variables
Data manipulation is an important part of the data mining process. In order to ensure that all of the data is used effectively, it can sometimes be useful to transform the variables. This can reduce the variances in the data and correct incorrect data within the data set. Within SAS Enterprise Miner™ software the Transform Variables node can be used to manipulate variables. The Transform Variables node allows you to make variable transformations that can maximise the correlation with the target variable and normalising variables. This node can also be used to create interaction variables from class variables.
Transform Variables Node
Within SAS Enterprise Miner, the Transform Variable node can be found within the modify tab:
The Transform Variables node can be dragged on to a SAS Enterprise Miner diagram and joined using an arrow to direct the flow of the data through the system:
To view and modify the properties of the Transform Variable node, click on the node and the properties will appear in the left pane.
The ‘Default Methods’ for interval and class variables is to perform no transformation. The default method for missing values is not to treat any missing values as a level.
In order to choose variables for the transformation you can use the ‘…’ button highlighted below to bring up the variables window.
The Formulas window is a useful tool to view any initial trends within the data set. This window displays a histogram to show a plot of the data within the data set. If a histogram shows a significant skewness on the data, this could indicate that a transformation could improve further modelling.
To view the histogram for each of the variables you can click on the name of the variable:
You can see from the histogram above that the Div variable is not normally distributed. This could indicate that the Div variable should be transformed within the data set. Once you have identified the variable that you would like to transform, the formulas window can be closed.
To transform a variable within the data set you will need to choose the ‘…’ next to the Variables property to open up the Variables – Trans window.
The Variables – Trans window allows you to choose the method to use to transform the variable. A number of options are available for this as listed below:
- Best
- Bucket
- Centring
- Equalise
- Exponential
- Inverse
- Log
- Log 10
- Max Correlation
- Max Normal
- Multiple
- Optimal Binning
- Optimal Max Equalise
- Quantile
- Range Standardisation
- Square
- Square Root
- Standardise
By default, no transformation methods are specified for interval or class variables and missing values are not treated as a level within the data set.
To identify the most suitable transformation method for the Div, the ‘Explore…’ button can be helpful.
In order to analyse the data fully, in this case it may be beneficial to split the leagues and divisions as follows:
Current | Country | Region |
AE | America | East |
AW | America | West |
NE | Nebraska | East |
NW | Nebraska | West |
This can be completed using the SAS Code editor, which can be found under the Train heading as below:
The substring function has been used to extract the country code and two new variables are created within the data set.
Choose 'OK' to apply the code and run the Transform Variable node by right-clicking and selecting 'Run'.
On adding any further nodes to the diagram, you will now be able to use the new variables.
The new variables can be viewed by right-clicking the Regression Node and choosing 'Edit Variables'.