Linguistic Sorting
From the programmer's point of view, the way Proc SORT orders character data is not always ideal: by default it simply compares values according to the platform's native collating sequence. However, the SORTSEQ option has some useful features which deserve to be more widely known. Let's look first at the default behaviour.
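A minimal sketch of such a test might look like this. The data set names TEST and SORTED_DEFAULT and the range 1 to 20 are assumptions for illustration; only the variable name NUMTXT comes from the discussion here.

```sas
/* Hypothetical test data: NUMTXT is "n" or "N" followed by a number, */
/* with the case of the first character alternating row by row.       */
data test;
  length numtxt $8;
  do i = 1 to 20;
    if mod(i, 2) then numtxt = cats('n', i);  /* odd rows: lower case  */
    else numtxt = cats('N', i);               /* even rows: upper case */
    output;
  end;
  drop i;
run;

/* Default sort: no SORTSEQ option, so the native collating sequence */
/* of the platform is used (ASCII on Windows and Unix).              */
proc sort data=test out=sorted_default;
  by numtxt;
run;

proc print data=sorted_default;
run;
```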
The sorted version of this data set begins:
The test data set is arranged so that the first character of the NUMTXT string alternates between lower case "n" and upper case "N". Because the sorting is case sensitive, all the "N"s come before all the "n"s. And of course the numeric parts of the string are compared character by character, as text, so their order bears little relation to the actual values they represent.
The SORTSEQ option supports many values, such as "Danish", that take account of the alphabetical order used by particular natural languages. However, it also supports the value "Linguistic", which gives case-insensitive sorting.
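As a sketch, the call might look like this. It assumes a data set named TEST with a character variable NUMTXT; the output data set name is also just a placeholder.

```sas
/* Assumes WORK.TEST exists and contains a character variable NUMTXT. */
/* SORTSEQ=LINGUISTIC requests locale-aware collation instead of the  */
/* native byte-order comparison.                                      */
proc sort data=test out=sorted_ling sortseq=linguistic;
  by numtxt;
run;

proc print data=sorted_ling;
run;
```

As far as I understand the documentation, the default strength still uses case as a tie-breaker between values that are otherwise equal. If case should be ignored completely, the STRENGTH=PRIMARY sub-option, written as sortseq=linguistic(strength=primary), is worth trying.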
Yes, that's case insensitive now. But the numbers are still in a pretty useless order. If only there were a sub-option that would fix that... There is; it's called "numeric collation", and it is switched on with NUMERIC_COLLATION=ON.
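Sub-options go in parentheses after LINGUISTIC. A sketch, again assuming a data set named TEST with a character variable NUMTXT (the output name is a placeholder):

```sas
/* Assumes WORK.TEST exists and contains a character variable NUMTXT. */
/* NUMERIC_COLLATION=ON asks for runs of digits inside the string to  */
/* be compared by numeric value rather than character by character.   */
proc sort data=test out=sorted_numeric
          sortseq=linguistic(numeric_collation=on);
  by numtxt;
run;

proc print data=sorted_numeric;
run;
```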
Another sub-option, ALTERNATE_HANDLING=SHIFTED, tells Proc SORT to ignore spaces - so that a value like "N 4", for example, would still come between "n3" and "n5".
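The two sub-options can be combined by listing them inside the same parentheses, separated by a space. The following sketch uses a small made-up data set (SPACED, with values chosen here purely for illustration) that includes one value with an embedded space:

```sas
/* Hypothetical data: one value has a space between letter and digit. */
data spaced;
  length numtxt $8;
  do numtxt = 'n10', 'n5', 'N 4', 'n3';
    output;
  end;
run;

/* NUMERIC_COLLATION=ON compares digit runs by value;                 */
/* ALTERNATE_HANDLING=SHIFTED gives spaces only minor importance      */
/* when values are compared.                                          */
proc sort data=spaced out=sorted_spaced
          sortseq=linguistic(numeric_collation=on
                             alternate_handling=shifted);
  by numtxt;
run;

proc print data=sorted_spaced;
run;
```

Running the same sort without ALTERNATE_HANDLING=SHIFTED and comparing the two listings is an easy way to see what the sub-option changes.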