Sorted
Here are some assorted tips related to PROC SORT. First, let’s look at detecting duplicates.
Here we use the DUPOUT= option to write duplicates to a data set. Suppose that for one particular key there are four matching records; then the first of them will go into the main SORTED data set, and the other three into DUPES (first sketch below). If instead we wanted all four of the duplicates to go into the DUPES data set, we could do it with a follow-up DATA step (second sketch below).
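A minimal sketch of the DUPOUT= approach; the input data set RAW and its key variable ID are illustrative names. NODUPKEY keeps the first record of each BY group in the OUT= data set and diverts the rest to the DUPOUT= data set:

   proc sort data=raw out=sorted nodupkey dupout=dupes;
      by id;                    /* first record per ID goes to SORTED */
   run;                         /* the remaining matches go to DUPES  */

And a sketch of the second technique, again assuming RAW and ID: sort first, then use the FIRST./LAST. flags in a DATA step to divert every record of a duplicated key, the first included, into DUPES:

   proc sort data=raw out=srt;
      by id;
   run;

   data unique dupes;
      set srt;
      by id;
      if first.id and last.id then output unique;  /* key occurs once */
      else output dupes;        /* key is duplicated: all its copies, */
   run;                         /* the first included, land in DUPES  */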
Beware, incidentally, of the NODUPRECS (or NODUP) option. It eliminates duplicate records only if they occur consecutively in the sorted data set, so it is guaranteed to remove all duplicates only if you sort BY all variables.
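A sketch of the safe pattern, again assuming an input data set RAW. Recent SAS releases accept _ALL_ in the BY statement of PROC SORT; on older releases, list every variable explicitly:

   proc sort data=raw out=clean noduprecs;
      by _all_;                 /* duplicates become consecutive, so  */
   run;                         /* NODUPRECS is guaranteed to see them */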
Sorting can be greedy in terms of workspace. If you need to use less, one option is TAGSORT; however, this can significantly increase run time. An alternative technique is to split your data set into smaller chunks, sort the chunks, and then put them back together with an “interleave” DATA step.
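A sketch with two chunks, assuming a data set BIG of one million records with sort key KEY (the names and the split point are illustrative). A SET statement listing both halves, combined with a BY statement, interleaves the two sorted streams:

   proc sort data=big(obs=500000) out=half1;        /* first half     */
      by key;
   run;

   proc sort data=big(firstobs=500001) out=half2;   /* second half    */
      by key;
   run;

   data big_sorted;             /* interleave: at each step, read the */
      set half1 half2;          /* record with the lower KEY from     */
      by key;                   /* whichever half supplies it         */
   run;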
Each of these sorts will use roughly half as much workspace as sorting the whole file at once, while the total run time should be virtually unchanged.
Finally, in data warehousing applications, you may sometimes see sort-related error messages when using the Type II Loader transform in DI Studio. These can arise when SAS and the database being loaded disagree about sort order. (Sometimes the issue is simply one of case sensitivity.) You can eliminate such problems by specifying the SAS system option SORTPGM=SAS, which makes SAS responsible for all sorting.
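For instance, the option can go in the autoexec or in the job’s precode; SORTPGM= is a documented SAS system option whose values are BEST, HOST, and SAS:

   options sortpgm=sas;         /* always use SAS’s own sort routine */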