Creating a Stratified Sample of Data Using Proc SURVEYSELECT
Proc SURVEYSELECT is a SAS/STAT procedure for drawing samples from datasets using a variety of sampling methods. It has wide-ranging applications, including selecting a random sample of people to survey, forming a control group of customers to assess the effectiveness of a marketing campaign, and creating a set of validation data to test a statistical model.
The dataset SASHELP.CARS contains data about a variety of makes and models of car manufactured in Asia, Europe and the USA.
The following code uses Proc SURVEYSELECT to create a sample of approximately one third of the whole dataset whilst preserving the overall proportion of cars from each region of origin in the sample. Note that the dataset must be sorted by any variables used in the STRATA statement, and that the SEED option is used in the Proc SURVEYSELECT statement to allow you to reproduce the results:
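/* Sort by the stratification variable first */
proc sort data = sashelp.cars out = work.cars;
    by origin;
run;

/* Simple random sample of roughly one third within each Origin stratum */
proc surveyselect data = work.cars out = work.control
    method = srs
    samprate = (0.333 0.333 0.333)
    seed = 123456789;
    strata origin;
run;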
Running a quick Proc FREQ on both the original dataset (n=428) and the WORK.CONTROL dataset (n=143) shows that the proportion of cars from each region of origin is similar in the two datasets:
Origin | SASHELP.CARS | WORK.CONTROL |
Asia | 36.92% | 37.06% |
Europe | 28.74% | 28.67% |
USA | 34.35% | 34.27% |
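The comparison above can be reproduced with something like the following sketch. It assumes WORK.CONTROL has already been created as shown earlier; the NOCUM option is only there to suppress the cumulative columns and is a presentational choice, not part of the original check.

```sas
/* Distribution of Origin in the full dataset */
proc freq data = sashelp.cars;
    tables origin / nocum;
run;

/* Distribution of Origin in the stratified sample */
proc freq data = work.control;
    tables origin / nocum;
run;
```

The Percent column of each output table is the figure reported in the comparison.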
Having created the control group, the cars in the control dataset can be excluded from the rest of the data by using a subsetting IF (together with the IN= dataset option) in a MERGE with appropriate BY variables.
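As a minimal sketch of that exclusion step, the code below adds a row identifier before sampling so that the merge has an unambiguous BY variable. The variable car_id, the datasets WORK.CARS_ID, WORK.CONTROL_ID and WORK.REMAINDER, and the flags in_all and in_control are all hypothetical names introduced for illustration; in practice you would use whatever key identifies a record in your own data.

```sas
/* Assumption: car_id is a hypothetical key added purely for illustration */
data work.cars_id;
    set sashelp.cars;
    car_id = _n_;
run;

/* Sort by the stratification variable, then draw the stratified sample */
proc sort data = work.cars_id;
    by origin;
run;

proc surveyselect data = work.cars_id out = work.control_id
    method = srs
    samprate = (0.333 0.333 0.333)
    seed = 123456789;
    strata origin;
run;

/* Both inputs to the MERGE must be sorted by the BY variable */
proc sort data = work.cars_id;
    by car_id;
run;

proc sort data = work.control_id;
    by car_id;
run;

/* Keep only the cars that are not in the control group */
data work.remainder;
    merge work.cars_id    (in = in_all)
          work.control_id (in = in_control keep = car_id);
    by car_id;
    if in_all and not in_control;
run;
```

Keeping only car_id from the control dataset stops the extra variables that Proc SURVEYSELECT adds (such as the selection probability and sampling weight) from being carried into WORK.REMAINDER.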
For further information, please consult the SAS Help documentation for Proc SURVEYSELECT.