Dealing with Duplicate Keys
Suppose we have a data set sorted by a variable called "age", and want to check whether any of the age values are repeated. We can use PROC SORT to sort the data by key and simultaneously remove any duplicate records that are present, the following example demonstrates this.
proc sort data=sashelp.class out=agelist nodupkey ; by age; run;
The following notes are reported in the log showing how many records have been removed. PROC SORT will retain the first occurrence of each value of age.
NOTE: There were 19 observations read from the data set SASHELP.CLASS.
NOTE: 13 observations with duplicate key values were deleted.
The only problem with this is that we have no record of which rows were removed. If this information is required we can utilise the DUPOUT option. The DUPOUT options allow us to remove duplicates as demonstrated in the first example but also retain those rows in a separate data step. The following example demonstrates the use of DUPOUT:
proc sort data=sashelp.class out=agelist nodupkey dupout=dup_age_recs; by age; run;
Viewing the log messages reveals that two new data sets have been created, on with the unique values of age and a second containing all of the removed values.
NOTE: There were 19 observations read from the data set SASHELP.CLASS.
NOTE: 13 observations with duplicate key values were deleted.
NOTE: The data set WORK.AGELIST has 6 observations and 5 variables.
NOTE: The data set WORK.DUP_AGE_RECS has 13 observations and 5 variables.
There are many ways to deal with duplicates within the SAS language including the DATA STEP and PROC SQL.
For further information on these and PROC SORT view the SAS Help pages or visit the SAS online support ate www.support.sas.com.