Data Mining: Removing Variables
Data manipulation is an important part of the data mining process. Filtering data and removing inaccurate or skewed variables helps to ensure that the analysis is accurate. SAS® Enterprise Miner™ includes two nodes designed for removing unwanted variables and observations.
This tip covers these two nodes and how they can be used:
- Drop Node
- Filter Node
The Drop Node
The Drop Node can be used to remove any unnecessary variables from the Enterprise Miner data sets. Any of the following role types can be dropped from scored data sets:
The Drop Node can be used within decision trees to trim the size of the data sets and metadata during the tree analysis.
The Drop Node can be found within the ribbon under the Modify tab.
The Drop Node can be dragged on to a SAS Enterprise Miner diagram and joined using an arrow to direct the flow of the data through the system:
The Drop Node allows you to specify the variables that you wish to remove from the SAS data set. To view the available options, click on the Drop Node in the diagram and its properties will be displayed in the left pane.
By default, the ‘Drop from Tables’ attribute is set to ‘No’. This indicates that any variables that are selected to be dropped will be removed from the exported metadata only. If this value is set to ‘Yes’ then this node will create data sets instead of views for the data specified.
Within the ‘Drop Selection Options’ you can choose the roles of the variables that you would like to drop from the analysis. These include the roles below:
- Hidden *
- Rejected *
* Variables that have a role of Hidden or Rejected are dropped from the data set by default.
Within the Baseball data set the following roles have been set. With the default settings in the Drop Node, we would expect the logSalary variable to be dropped from the data set.
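The default behaviour can be illustrated outside Enterprise Miner with a short Python sketch. This is not how the node is implemented. It only mirrors the rule described above: variables whose role is Hidden or Rejected are dropped. The `split_by_role` helper is hypothetical, and every role in the example except the Rejected role of logSalary is an assumption made for illustration.

```python
# Roles that the Drop Node removes by default, per the text above.
DEFAULT_DROPPED_ROLES = frozenset({"Hidden", "Rejected"})


def split_by_role(roles, dropped_roles=DEFAULT_DROPPED_ROLES):
    """Return (kept, dropped) variable names, preserving input order."""
    kept, dropped = [], []
    for name, role in roles.items():
        if role in dropped_roles:
            dropped.append(name)
        else:
            kept.append(name)
    return kept, dropped


# Hypothetical role assignments; only logSalary's role comes from the article.
baseball_roles = {
    "Name": "ID",
    "nHits": "Input",
    "nRuns": "Input",
    "logSalary": "Rejected",
}

kept, dropped = split_by_role(baseball_roles)
print("Kept:", kept)
print("Dropped:", dropped)
```

Passing a different `dropped_roles` set plays the part of changing the ‘Drop Selection Options’ in the properties pane.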
To run the Drop Node, right-click on the last node in the sequence and select Run. A green tick indicates that the node has run successfully:
On running the flow with the default settings, the following output log shows that one interval variable with a role of Rejected was found. This variable was removed from the data set.
The Filter Node
The Filter Node enables you to apply a filter to the training data set in order to exclude outliers or other observations that you do not want to include in your data mining analysis. Outliers can greatly affect modelling results and, subsequently, the accuracy and reliability of trained models.
Within SAS Enterprise Miner, the Filter Node can be found in the ribbon within the Sample tab.
The Filter Node can be dragged on to a SAS Enterprise Miner diagram and joined using an arrow to direct the flow of the data through the system:
The Filter Node can be used to remove missing values, to use normalised values, or to customise the filtering method for both class and interval variables.
The ‘Export Table’ option allows you to specify which table to export after training. This value can be set to one of the following:
- Filtered: The default option, this allows the filtered data to be passed through as a view for further processing.
- Excluded: This passes through any filtered out data as a view for further processing.
- All: This passes all of the data through as a view and creates an indicator variable to identify any filtered records.
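The three export settings can be pictured with a minimal Python sketch. It is an illustration only, not the node's implementation: `export_table` and the indicator name `filtered_flag` are hypothetical, the filter rule is supplied by the caller, and plain lists stand in for the views that Enterprise Miner passes on.

```python
def export_table(rows, is_filtered_out, mode="Filtered"):
    """Mimic the three 'Export Table' settings on a list of row dicts."""
    if mode == "Filtered":
        # Rows that survive the filter.
        return [row for row in rows if not is_filtered_out(row)]
    if mode == "Excluded":
        # Rows that the filter removed.
        return [row for row in rows if is_filtered_out(row)]
    if mode == "All":
        # Every row, plus an indicator marking the filtered-out ones.
        return [{**row, "filtered_flag": int(is_filtered_out(row))} for row in rows]
    raise ValueError(f"Unknown export mode: {mode!r}")


# Hypothetical data and filter rule for illustration.
rows = [{"x": 1}, {"x": 100}, {"x": 2}]


def is_outlier(row):
    return row["x"] > 10


for mode in ("Filtered", "Excluded", "All"):
    print(mode, export_table(rows, is_outlier, mode))
```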
The ‘Tables to Filter’ option allows you to specify if you would like just the training data set filtered or if you would like all data sets filtered.
The ‘Distribution Data Sets’ option allows you to specify whether the data sets used for interactive filtering should be created at training time. These data sets are used for histograms and bar charts, which you may use in further analysis.
Class variables, by default, are filtered by Rare Values (Percentage) with a minimum percentage cutoff of 0.01%. This excludes any class variable values that occur in less than 0.01% of the data. The default also keeps any normalised or missing class variable values.
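As a rough illustration of the rare-values rule, the Python sketch below flags levels whose share of the data falls below the cutoff and keeps missing values. It is a sketch under stated assumptions, not the node's algorithm: the helper names are hypothetical, missing values are represented by `None`, and the percentage is computed over non-missing values, which the article does not specify.

```python
from collections import Counter


def rare_levels(values, min_pct=0.01):
    """Return the levels whose share of non-missing values is below min_pct percent."""
    counts = Counter(v for v in values if v is not None)
    total = sum(counts.values())
    if total == 0:
        return set()
    return {level for level, n in counts.items() if 100.0 * n / total < min_pct}


def filter_rare(values, min_pct=0.01, keep_missing=True):
    """Drop values belonging to rare levels; optionally keep missing values."""
    rare = rare_levels(values, min_pct)
    kept = []
    for v in values:
        if v is None:
            if keep_missing:
                kept.append(v)
        elif v not in rare:
            kept.append(v)
    return kept


# Hypothetical class variable: level "C" appears once in 20,000 values (0.005%).
teams = ["A"] * 12000 + ["B"] * 7999 + ["C"] + [None]
kept_teams = filter_rare(teams)
print(rare_levels(teams), len(kept_teams))
```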
Interval variables are filtered using Standard Deviations from the Mean, with missing values also being kept.
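The interval rule can be sketched in the same way. The cutoff of 3 standard deviations and the use of the sample standard deviation are assumptions made for this example, since the article does not state the values the node uses; the helper names are hypothetical and `None` stands in for a missing value.

```python
from statistics import mean, stdev


def std_limits(values, n_std=3.0):
    """Return (lower, upper) limits at n_std sample standard deviations from the mean."""
    present = [v for v in values if v is not None]
    centre = mean(present)
    spread = stdev(present)
    return centre - n_std * spread, centre + n_std * spread


def filter_interval(values, n_std=3.0, keep_missing=True):
    """Keep values inside the limits; optionally keep missing values."""
    lower, upper = std_limits(values, n_std)
    kept = []
    for v in values:
        if v is None:
            if keep_missing:
                kept.append(v)
        elif lower <= v <= upper:
            kept.append(v)
    return kept


# Hypothetical interval variable with one extreme value and one missing value.
salaries = [10.0] * 19 + [100.0, None]
lower, upper = std_limits(salaries)
kept_salaries = filter_interval(salaries)
print(lower, upper, len(kept_salaries))
```

Note that a single extreme value inflates both the mean and the standard deviation, so in very small samples it may not fall outside the limits at all.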
To run the Filter Node, right-click on the last node in the sequence and select Run. A green tick indicates that the node has run successfully:
Running the Filter Node with the default settings excluded 44 observations from the training data set.
The class variable values that were excluded are shown below:
The limits that were used for the interval variables are also displayed in the results window: