Removing Unwanted Characters from Text Strings
Various techniques are available. We are going to demonstrate several of them by looking at the problem of removing multiple commas from strings containing lists. First, the two easy approaches: compression and translation.
data _null_; list="Alan,Bill,,Dave,,,Greg,Hank,,Jack,,,,,Otto,Pete"; compressed=compress(list,,'p'); translated=tranwrd(list,",,",","); put compressed= / translated=; run;
The first attempt here uses the COMPRESS function to eliminate all punctuation. The second attempt uses the TRANWRD function to replace each occurrence of doubled commas by a single comma. Neither of these yields quite what we want. Wouldn't it be nice if we could call TRANWRD recursively, and get rid of all the multiple commas?
Well we can! Since SAS 9.2 it has been possible to define your own functions and subroutines using Proc FCMP, and Proc FCMP supports recursion.
proc fcmp outlib=work.myfuncs.char; subroutine decomma(string $); outargs string; string=tranwrd(string,",,",","); if index(string,",,") > 0 then call decomma(string); endsub; quit;
options cmplib=work.myfuncs; data _null_; list="Alan,Bill,,Dave,,,Greg,Hank,,Jack,,,,,Otto,Pete"; call decomma(list); put list=; run;
The Proc FCMP call here defines a subroutine called DECOMMA, to be stored in a "package" called CHAR within a "function library" (physically just a data set) called WORK.MYFUNCS. DECOMMA is defined here as a subroutine (or "CALL routine") rather than a function, for the sake of efficiencyÂ - the string parameter will be passed by reference rather than by value. If, after using TRANWRD , the DECOMMA function finds that the string still contains doubled commas, it calls itself to get rid of them.
An OPTIONS statement is required to make the new routine available to the data step that follows. The output shows that it is doing what we probably wanted.
Another powerful tool in this area is the family of PRX functions. This final program does something slightly more elaborate: it uses a single comma followed by a single space to replace each substring that begins with a comma and contains only commas and spaces.
data _null_; list="Alan,Bill,,Dave, , ,Greg,Hank, ,Jack,,,,, Otto,Pete"; length prxed $40; prx=prxparse("s/,[, ]*/, /"); call prxchange(prx,-1,list,prxed); put prxed= ; run;
prxed=Alan, Bill, Dave, Greg, Hank, Jack, Otto
The Perl regular expression input to PRXPARSE specifies "s" for "substitute", and uses "/" as delimiter. The "old" string, between the first pair of delimiters, consists of a comma followed by any number ("*") of a occurrences of characters in the subclass "[, ]" i.e. commas or spaces. NB "Any number" includes zero!
The parsed expression forms the first parameter to PRXCHANGE. The value "-1" for the second parameter means "make this change as many times as possible".
For full details of the techniques demonstrated here, please refer to the SAS Institute documentation.