Data Mining - Sampling Data

Data preparation is a large part of the data mining process, and much of that preparation involves creating representative samples of the data to speed up the mining itself. SAS Enterprise Miner™ software provides nodes that SAS® software users can use to sample their data.

1 Why Do We Sample Data?

Sampling is recommended within SAS Enterprise Miner for larger data sets. SAS Enterprise Miner is geared largely towards machine learning and allows users to create models which can be “trained” to predict outcomes more precisely. During training, a model's weights are tuned so that it predicts outcomes more accurately.

Smaller samples allow models to be trained more rapidly. If the sample is sufficiently representative of the entire data set, then any trends or relationships discovered within the sample can be expected to hold in the entire data set as well.
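
As a rough illustration of why a representative sample can stand in for the full data, the following Python sketch draws a 5% simple random sample from a made-up population and compares the two means. It is not Enterprise Miner code; the population, the seed and the variable names are all assumptions chosen for the example.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical "full data set": 100,000 values with mean 50 and std dev 10
population = [rng.gauss(50.0, 10.0) for _ in range(100_000)]

# Simple random sample of 5% of the rows, drawn without replacement
sample = rng.sample(population, k=len(population) // 20)

pop_mean = statistics.fmean(population)
sample_mean = statistics.fmean(sample)

print(f"population mean: {pop_mean:.2f}")
print(f"sample mean:     {sample_mean:.2f}")
```

With a sample of this size the two means should agree closely, which is the property that makes training on the sample worthwhile.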

2 Sampling Tools within SAS Enterprise Miner

The following nodes are available within SAS Enterprise Miner for the purpose of sampling:

  • Data Partition
  • Sample

Both of these nodes can be found within the Sample ribbon of Enterprise Miner:

Data Mining Sampling Data Image 1

Each node can be dragged onto a SAS Enterprise Miner diagram and joined to other nodes with arrows that direct the flow of data through the system:

Data Mining Sampling Data Image 2

To run the nodes, right-click on the last node in the sequence and select Run. A green tick indicates that the node has run successfully:

Data Mining Sampling Data Image 3

Each sampling node has its own set of properties, specific to that sampling method. These properties can be important when choosing an appropriate sampling method for your data.

2.1 The Data Partition Node

The Data Partition node is found within the Sample ribbon as below:

Data Mining Sampling Data Image 4

The Data Partition node splits the data into three separate data sets for the data mining approach:

  • Training – Used for preliminary model fitting.
  • Validation – Used to tune the model's weights during estimation; this data set is also used for model assessment.
  • Test – A further, held-out data set used for model assessment.

Within the Data Partition node, four options are available for the partitioning of the data:

  • Simple Random 
  • Cluster 
  • Stratified
  • Default

If a class target variable is found within the data, the Default partitioning method resolves to Stratified. If no class target variable is defined, Default resolves to Simple Random sampling instead.

The properties of the Data Partition node can be viewed in the left ribbon once the node has been selected:

Data Mining Sampling Data Image 6

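The output type can be selected as either a data set or a data set view. A data set is beneficial when the output data will be input into further processing nodes, because the partitioned rows are physically stored. A view stores only the instructions for retrieving the rows, so it saves space but is re-evaluated each time it is read.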

In this example the Default partitioning method is chosen, so the data will be partitioned using stratified sampling. This is because the baseball data set used here has a class target variable, position, assigned.
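
To make the idea of stratified partitioning concrete, here is a minimal Python sketch, not Enterprise Miner code. It assumes a hypothetical list of player records with a `position` field and splits each position level separately, so that every partition keeps the same mix of positions as the original data. The function name, field names, seed and row counts are illustrative only.

```python
import random
from collections import Counter, defaultdict

def stratified_partition(rows, target, fractions=(0.4, 0.3, 0.3), seed=12345):
    """Split rows into (train, validate, test), one target level at a time."""
    rng = random.Random(seed)

    # Group the rows by the value of the class target (the strata)
    strata = defaultdict(list)
    for row in rows:
        strata[row[target]].append(row)

    train, validate, test = [], [], []
    for level in sorted(strata):
        members = strata[level]
        rng.shuffle(members)  # random order within the stratum
        n_train = round(len(members) * fractions[0])
        n_valid = round(len(members) * fractions[1])
        train.extend(members[:n_train])
        validate.extend(members[n_train:n_train + n_valid])
        test.extend(members[n_train + n_valid:])  # remainder goes to test
    return train, validate, test

# Hypothetical data: 60 pitchers, 30 catchers and 10 outfielders
positions = ["P"] * 60 + ["C"] * 30 + ["OF"] * 10
rows = [{"id": i, "position": p} for i, p in enumerate(positions)]

train, validate, test = stratified_partition(rows, "position")
print(Counter(r["position"] for r in train))
```

Each partition ends up with the same 6:3:1 ratio of positions as the full data, which a simple random split would only achieve approximately.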

The Random Seed property sets the starting value for the random number generator used to sample the data. Using the same seed on the same input data will produce the same partitions each time the node is run.
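
The effect of a fixed seed can be sketched in a few lines of Python. This illustrates the general principle only and does not reproduce the random number generator Enterprise Miner uses; the function name and seed values are assumptions.

```python
import random

def draw_sample(values, k, seed):
    """Draw k values without replacement using a generator seeded with `seed`."""
    rng = random.Random(seed)
    return rng.sample(values, k)

values = list(range(1000))

first = draw_sample(values, 10, seed=12345)
second = draw_sample(values, 10, seed=12345)  # same seed as `first`
other = draw_sample(values, 10, seed=54321)   # different seed

print(first == second)  # same seed, same input: identical sample
print(first == other)
```

Re-running with the same seed makes results reproducible, which is useful when comparing models trained on the same partitions.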

By default the Data Partition node splits the data into 40% training, 30% validation and 30% test. These percentages can be changed within the Data Set Allocations setting.
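
The 40/30/30 allocation can be pictured with the following Python sketch of a simple random partition. It is an illustration under stated assumptions, not the node's implementation: the `partition` function, its parameter names and the seed are hypothetical.

```python
import random

def partition(rows, train=0.40, validate=0.30, test=0.30, seed=12345):
    """Randomly split rows into train/validate/test by the given fractions."""
    if abs(train + validate + test - 1.0) > 1e-9:
        raise ValueError("allocations must sum to 100%")

    shuffled = list(rows)                  # copy so the input is untouched
    random.Random(seed).shuffle(shuffled)  # random order, repeatable via seed

    n_train = round(len(shuffled) * train)
    n_valid = round(len(shuffled) * validate)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_valid],
        shuffled[n_train + n_valid:],      # whatever is left is the test set
    )

rows = list(range(1000))
train_set, valid_set, test_set = partition(rows)
print(len(train_set), len(valid_set), len(test_set))
```

Passing different fractions, for example `partition(rows, 0.6, 0.2, 0.2)`, mirrors changing the percentages in the Data Set Allocations setting.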

The report options control what is summarised: Interval Targets generates summary statistics, whereas Class Targets generates charts. Both are based on the original and the partitioned data for the target variables.

The results are displayed in a single output window. This gives a summary of the number of variables that were input and the number of observations that have been partitioned into each data set:

Data Mining Sampling Data Image 7

Data Mining Sampling Data Image 8

The output also displays a list of the summary statistics for each of the data sets that have been created using the Data Partition node:

Data Mining Sampling Data Image 10

2.2 The Sample Node

The Sample node is found within the Sample ribbon as below:

Data Mining Sampling Data Image 12

The Sample node, like the Data Partition node, enables users to draw samples from the full data set. These samples can also be created using Simple Random, Stratified or Cluster sampling.

Data Mining Sampling Data Image 13

The output type, as with the Data Partition node, can be selected as either a data set or a data set view. A data set is beneficial when the output data will be input into further processing nodes, because the sampled rows are physically stored. A view stores only the instructions for retrieving the rows, so it saves space but is re-evaluated each time it is read.

In this example the Default sampling method is chosen, so the data will be sampled using stratified sampling. This is because the baseball data set used here has a class target variable, position, assigned.

The Random Seed property sets the starting value for the random number generator used to sample the data. Using the same seed on the same input data will produce the same sample each time the node is run.

The Type property can be set to Percentage, Number of Observations or Computed. Depending on the value selected, the size of the sample is then specified as a percentage of the input data, as a fixed number of observations, or left for the node to compute.
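
The difference between the Percentage and Number of Observations settings can be sketched in Python as follows. This is a hedged illustration, not the node's logic: the function names and the `size_type` strings are hypothetical, and the Computed option is deliberately left out because the text above does not describe how it derives the size.

```python
import random

def sample_size(n_rows, size_type, value):
    """Translate a size setting into a row count, capped at the data size."""
    if size_type == "percentage":
        n = round(n_rows * value / 100)
    elif size_type == "observations":
        n = int(value)
    else:
        raise ValueError(f"unsupported size type: {size_type!r}")
    return max(0, min(n, n_rows))

def draw(rows, size_type, value, seed=12345):
    """Simple random sample of the requested size, without replacement."""
    k = sample_size(len(rows), size_type, value)
    return random.Random(seed).sample(rows, k)

rows = list(range(2000))
by_pct = draw(rows, "percentage", 10)        # 10% of 2000 rows
by_count = draw(rows, "observations", 150)   # exactly 150 rows

print(len(by_pct), len(by_count))
```

A percentage scales with the input data, whereas a fixed count keeps the sample the same size however large the source grows.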

The stratified section allows the stratification methods to be specified.

The Level Based options are available if there is only one stratification variable.

For oversampling, setting Adjust Frequency to Yes creates a biased stratified sample. It is therefore recommended that this is set to No in most cases.

The report options control what is summarised: Interval Targets generates summary statistics, whereas Class Targets generates charts. Both are based on the original and the sample data for the target variables.

The results for the Sample node are displayed in a single output window. This details the number of observations within the original data set and within the sample that has been created:

Data Mining Sampling Data Image 14

A list of the summary statistics is also created for each data set:

Data Mining Sampling Data Image 15