Data Mining: Replacing Missing Values

Data preparation is a large part of the data mining process. Missing values within SAS data sets can make SAS data models inaccurate and skew analysis results. To prevent this during the data preparation phase, SAS Enterprise Miner™ software includes two nodes designed specifically for removing or replacing missing values within data sets.

The two nodes available within SAS Enterprise Miner for replacing missing values are:

  • The Replacement node
  • The Impute node

The Replacement node is used to replace missing values of class and interval variables in the data set with specified values. The Impute node, by contrast, replaces missing values of class and interval variables using one of a number of available methods, including the mean of the variable, the median, or a tree-based imputation method for identifying a replacement value. Both of these nodes can be found within the Modify ribbon of SAS Enterprise Miner:

Data Mining Replacing Missing Values image1

Each node can be dragged onto a SAS Enterprise Miner diagram, and nodes can be joined using arrows to direct the flow of the data through the system:

Data Mining Replacing Missing Values image2

To run the nodes, right-click on the last node in the sequence and select Run. A green tick indicates that the node has run successfully:

Data Mining Replacing Missing Values image3

Each node has a number of properties that are specific to its replacement method. These can be important when choosing an appropriate replacement method for your data.

1 The Replacement Node

The Replacement node is found within the Modify ribbon as below:

Data Mining Replacing Missing Values image4

The Replacement node allows users to specify how to replace values of class and interval variables within the SAS data set. The following options are available for the replacement:

Data Mining Replacing Missing Values image5

The default method for interval variables is to replace values that fall outside limits set at a given number of standard deviations from the mean. This can be changed to any of the following:

  • Mean Absolute Deviation
  • User Defined Limits
  • Metadata Limits
  • Extreme Percentiles
  • Modal Center

The user can also define any cut-off values for these interval variables.
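As a rough illustration of the standard deviation limits described above, the following Python sketch computes limits around the mean and replaces any value outside them with the nearest limit. It is not SAS code and does not reproduce the Replacement node itself. The function names, the sample data and the choice to replace with the limit value are assumptions made for illustration only.

```python
from statistics import mean, stdev


def std_limits(values, n_std=3.0):
    """Return (lower, upper) limits at n_std sample standard deviations from the mean.

    Missing values are represented as None and are excluded from the calculation.
    """
    present = [v for v in values if v is not None]
    centre = mean(present)
    spread = stdev(present)
    return centre - n_std * spread, centre + n_std * spread


def replace_outside_limits(values, lower, upper):
    """Replace values outside [lower, upper] with the nearest limit.

    Missing values (None) are left untouched.
    """
    return [v if v is None else min(max(v, lower), upper) for v in values]


# Hypothetical interval variable containing one extreme value and one missing value.
salary = [10, 12, 11, 13, 12, 11, 10, 12, 100, None]
lower, upper = std_limits(salary, n_std=2.0)
cleaned = replace_outside_limits(salary, lower, upper)
print(lower, upper)
print(cleaned)
```

In this sketch the cut-off of two standard deviations plays the role of a user-defined cut-off value. A wider cut-off would leave the extreme value unchanged.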

Class variables can either be replaced using the Replacement Editor or be set to ignore. Ignoring unknown values means that when an existing model is later used for scoring, any unknown (previously unseen) values of the class variable will be ignored.

The Score properties specify which values should be used as replacement values. This can be set to computed, user specified or missing. The Hide option allows the user to specify whether the original variable should be included in or removed from the metadata exported by the node.

The results of the Replacement node include three output windows.

The Total Replacement Counts window displays each variable that has had data replaced and the number of replacements that have been made for each variable:

Data Mining Replacing Missing Values image6
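The kind of summary shown in the Total Replacement Counts window can be approximated outside SAS by comparing each variable before and after replacement. The Python sketch below is an assumption-laden illustration only: the function name, the dictionary layout and the variable names are hypothetical and are not taken from SAS Enterprise Miner.

```python
def replacement_counts(before, after):
    """Count how many values changed for each variable.

    before and after map a variable name to its list of values.
    Only variables with at least one replacement are reported.
    """
    counts = {}
    for name, old_values in before.items():
        changed = sum(1 for old, new in zip(old_values, after[name]) if old != new)
        if changed > 0:
            counts[name] = changed
    return counts


# Hypothetical data: one value replaced in "salary", none in "hits".
before = {"salary": [10, 12, 100], "hits": [5, 7, 9]}
after = {"salary": [10, 12, 80], "hits": [5, 7, 9]}
print(replacement_counts(before, after))
```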

The Interval Variables window displays the calculations for each of the replaced variables. This shows that the standard deviation method has been used to replace values for each of the variables within the “Baseball” data set:

Data Mining Replacing Missing Values image7

The output window contains a summary of the replacements that have been made. This includes the limits and replacement values for the interval variables and also details of the Replacement Counts for each variable:

Data Mining Replacing Missing Values image8

Data Mining Replacing Missing Values image9

Data Mining Replacing Missing Values image10

2 The Impute Node

The Impute node is found within the Modify ribbon as below:

Data Mining Replacing Missing Values image11

The Impute node has more customisable features than the Replacement node and can be used for both interval and class variables:

Data Mining Replacing Missing Values image12

The default imputation method for class input variables in the Impute node can be set to any of the following:

  • Count
  • Tree
  • Distribution
  • Tree Surrogate
  • Default Constant Value

The Default Target Imputation method for class variables can be set to:

  • Count
  • Distribution
  • Default Constant Value

It is also possible to choose normalised values for the output data.

With regard to interval variables, the default imputation method for input variables can be any one of the following:
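To make two of the class imputation options above more concrete, the Python sketch below fills missing class values either with the most frequent level (a simple reading of Count) or with a random draw from the observed levels (a simple reading of Distribution). This is an illustration under those assumptions, not the algorithm SAS Enterprise Miner uses. The function names and the sample data are hypothetical.

```python
import random
from collections import Counter


def impute_count(values):
    """Fill missing values (None) with the most frequent observed level."""
    present = [v for v in values if v is not None]
    most_common_level = Counter(present).most_common(1)[0][0]
    return [most_common_level if v is None else v for v in values]


def impute_distribution(values, seed=0):
    """Fill missing values (None) with random draws from the observed values.

    Drawing from the observed values means each level is chosen in
    proportion to how often it occurs. The seed makes the result repeatable.
    """
    rng = random.Random(seed)
    present = [v for v in values if v is not None]
    return [rng.choice(present) if v is None else v for v in values]


# Hypothetical class variable with two missing values.
league = ["American", "National", "American", None, "American", None]
print(impute_count(league))
print(impute_distribution(league))
```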

  • Mean
  • Median
  • Maximum
  • Minimum
  • Mid-Range
  • Distribution
  • Tree
  • Tree Surrogate
  • Huber
  • Andrew’s Wave
  • Tukey’s Bi-weight
  • Mid-Minimum Spacing
  • Default Constant Value

Each of the above imputation methods apart from Tree or Tree-Surrogate can also be used as the Default Target method for interval target variables.
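The simplest of the interval methods listed above can be sketched in a few lines of Python. The example below covers only Mean, Median and Mid-Range, calculates the replacement value from the non-missing values, and uses it to fill the gaps. It is a minimal sketch rather than SAS code, and the function name and method labels are assumptions chosen for illustration.

```python
from statistics import mean, median


def impute_interval(values, method="mean"):
    """Fill missing values (None) using a statistic of the non-missing values."""
    present = [v for v in values if v is not None]
    if method == "mean":
        fill = mean(present)
    elif method == "median":
        fill = median(present)
    elif method == "mid-range":
        # Mid-range is the midpoint between the minimum and the maximum.
        fill = (min(present) + max(present)) / 2
    else:
        raise ValueError(f"Unsupported method: {method}")
    return [fill if v is None else v for v in values]


# Hypothetical interval variable with one missing value.
hits = [1, None, 3, 8]
for method in ("mean", "median", "mid-range"):
    print(method, impute_interval(hits, method))
```

The mean is sensitive to extreme values, whereas the median is not, which is one reason to compare the methods against the distribution of each variable before choosing a default.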

Missing values can also be replaced with a default character or numeric value. This is set within the Default Constant Value section of the properties. Options can also be set for the method tuning and tree imputation. It is also possible to set the role, such as rejected, that is specified within the indicator values section.
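Replacing missing values with a default constant, while keeping track of which values were imputed, can be sketched as follows. This Python example is illustrative only. The function name, the 0/1 indicator layout and the sample data are assumptions and do not describe how the Impute node stores its indicators.

```python
def impute_constant(values, constant):
    """Fill missing values (None) with a constant.

    Returns the filled values together with an indicator list in which
    1 marks a value that was imputed and 0 marks an original value.
    """
    indicator = [1 if v is None else 0 for v in values]
    filled = [constant if v is None else v for v in values]
    return filled, indicator


# Hypothetical numeric and character variables with missing values.
runs, runs_imputed = impute_constant([4, None, 7], 0)
team, team_imputed = impute_constant(["Boston", None, None], "Unknown")
print(runs, runs_imputed)
print(team, team_imputed)
```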

The results of the Impute node display two output windows. The imputation summary details the variables that have been replaced and the values that they have been replaced with, in this case the MEAN.

Data Mining Replacing Missing Values image13

The output window displays a summary of the variables that have had missing values replaced.

Data Mining Replacing Missing Values image14

Data Mining Replacing Missing Values image15