Partition Your Data for Predictive Modelling
This tip will cover a technique called partitioning that is often used when creating a predictive model in data science.
What is Data Partitioning?
Partitioning a data set means splitting it into two, or sometimes three, smaller data sets, called Training, Validation and Test. This technique is best practice when creating a predictive model, but it is only possible when there is enough data to work with. Test data sets are less common due to the volume of data required.
A model that is built to fit one specific data set can look highly predictive on that data simply because it has fitted that sample's quirks. To check that the model will also predict new data well, its accuracy should be tested on a different sample of data.
Data partitioning is used to split the original data set before the model is created so that there is ‘new’ data available to assess the model.
The three data sets that can be used are described as:
Training: The subset of data used to explore the characteristics of the data and to create a model
Validation: Data that remains unseen when building the model. It is used to tune the model parameter estimates
Test: A data set that can be used to measure overall model performance and compare the performance between different candidate models
There are many ways in SAS® to perform this partitioning step. If using SAS Enterprise Miner™, there is a node called Data Partition. Below, four methods in Base SAS will be covered.
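To make the idea concrete before looking at procedures, here is a minimal DATA step sketch of a three-way split. It is an illustration only and not necessarily one of the methods covered in this tip. The input table sashelp.cars, the 60/20/20 proportions, the seed and the output names train, valid and test are all assumptions.

```sas
/* Sketch only: random 60/20/20 split into three data sets.        */
/* sashelp.cars, the proportions and the seed are assumed values.  */
data train valid test;
   set sashelp.cars;
   if _n_ = 1 then call streaminit(12345);  /* fix the random stream */
   u = rand('uniform');                     /* one draw per row, in (0,1) */
   if u < 0.6 then output train;            /* about 60% to Training */
   else if u < 0.8 then output valid;       /* about 20% to Validation */
   else output test;                        /* about 20% to Test */
   drop u;
run;
```

Because each row is assigned independently, the resulting sizes are only approximately 60/20/20. A sampling procedure is the better choice when exact sizes are needed.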
Proc SURVEYSELECT is a general sampling procedure that can be used to partition a data set.
Using the OUTALL option, all rows are kept in the output table samples_train_valid, with the addition of a flag variable (Selected) indicating whether each row is in Training (1) or Validation (0). SAMPSIZE is used to create a training data set of 200 observations.
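A call matching this description might look like the sketch below. The output table name, OUTALL and the sample size of 200 come from the text above. The input table sashelp.cars, METHOD=SRS and the seed value are assumptions added to make the example runnable.

```sas
/* Sketch: flag 200 randomly chosen rows as Training.               */
/* OUTALL keeps every input row and adds the 0/1 variable Selected. */
proc surveyselect data=sashelp.cars        /* assumed input table */
                  out=samples_train_valid
                  method=srs               /* simple random sampling (assumed) */
                  sampsize=200             /* size of the Training set */
                  seed=12345               /* assumed seed, for repeatability */
                  outall;
run;

/* Optional follow-up: split the flagged table into two data sets */
data train valid;
   set samples_train_valid;
   if Selected = 1 then output train;
   else output valid;
run;
```

Keeping a single flagged table is often convenient, since many procedures can filter on Selected with a WHERE statement. The second step is only needed when separate data sets are required.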
Procedures for Regression Analysis
Some regression procedures, such as Proc GLMSELECT and Proc HPLOGISTIC, have additional functionality to split the data into training, validation and test during the procedure.
The example below is using Proc GLMSELECT. If the data set has already been partitioned, the following options can be used on the Proc GLMSELECT statement to indicate which data sets to use as Training, Validation and Test:
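As a hedged sketch of the pre-partitioned case, the DATA=, VALDATA= and TESTDATA= options on the Proc GLMSELECT statement name the three data sets. The data set names, the model variables (taken from sashelp.cars) and the selection settings are assumptions. The first step exists only to make the sketch self-contained.

```sas
/* Setup for the sketch only: build three assumed partitions */
data train valid test;
   set sashelp.cars;
   if _n_ = 1 then call streaminit(12345);
   u = rand('uniform');
   if u < 0.4 then output train;
   else if u < 0.7 then output valid;
   else output test;
   drop u;
run;

/* Tell GLMSELECT which table plays which role */
proc glmselect data=train        /* Training */
               valdata=valid     /* Validation */
               testdata=test;    /* Test */
   /* assumed model; CHOOSE=VALIDATE picks the step with the
      lowest validation error */
   model MSRP = Horsepower Weight EngineSize MPG_City
         / selection=stepwise(choose=validate);
run;
```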
If the data set has not yet been partitioned, the PARTITION statement can be used to indicate how to split it:
The remaining 0.4 will be assigned to Training. An additional SEED= option can be added to the Proc GLMSELECT statement to ensure the same observations are split into the same data sets each time the procedure is run.
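A PARTITION statement consistent with this description could be written as follows. The exact fractions are not shown above, so VALIDATE=0.3 and TEST=0.3 are assumed values chosen only so that 0.4 remains for Training. The input table, model variables and seed are also assumptions.

```sas
/* Sketch: let GLMSELECT split the data itself.                    */
/* Fractions are assumed; they sum to 0.6, leaving 0.4 to Training. */
proc glmselect data=sashelp.cars   /* assumed input table */
               seed=12345;         /* assumed seed: same split on every run */
   partition fraction(validate=0.3 test=0.3);
   model MSRP = Horsepower Weight EngineSize MPG_City
         / selection=stepwise(choose=validate);
run;
```

The fractions are applied row by row at random, so the observation counts per role will be close to, but not exactly, the requested proportions.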
The following table can be seen in the results output:
Partitioning a data set is best practice when creating predictive models. Creating validation and test data sets holds some of the sample aside so that there is new data to assess the model with. There are plenty of methods that can be used to partition data, so select the most appropriate one for each situation.