Looking for Patterns in Data
With the General Data Protection Regulation (GDPR) effective from this year, it is important to be able to identify personal information in data. Here we consider the particular case of UK National Insurance (NI) numbers.
NI numbers are well-behaved, in that their format is fixed – two prefix letters, six digits, and a suffix letter e.g. “QQ123456C”. The only variation is that spaces are sometimes embedded to separate out pairs of digits, improving readability. Some other kinds of personal information are more variable in their format e.g. phone numbers, postcodes, car registration numbers.
In the SAS world the tool specifically designed for this kind of task is DataFlux, nowadays in the form of Data Management Studio. This can be used to analyse character data in terms of “patterns” (of letters, digits, spaces etc), and report on the frequencies of the various patterns found.
This example uses a table called PEOPLE and focuses on a column called IN_NUM.
Performing a profile analysis in DataFlux Data Management Studio is straightforward. Having set up a “data connection” so that the PEOPLE table is visible, we can then right-click on it and choose “Profile”. The profile has to be named and assigned a location in SAS metadata. A dialogue then enables columns to be selected for profiling.
Clicking on the green triangle here will run the profile on all the selected columns. The report can then be viewed:
The IN_NUM column and the “Pattern frequency distribution” tab have been chosen. This shows that the majority of the values - 115 of them - have the pattern AA999999A, which is that of a NI number. Another 7 have a similar pattern but with embedded spaces. Right-clicking in the tab enables a visualisation of this data to be selected, for example a pie chart.
An equivalent analysis can be performed using SAS code. SAS functions are available corresponding to DataFlux activities. The relevant function here is called DQPATTERN. A little additional code is needed to load and unload the DataFlux QKB (Quality Knowledge Base), which is used by the function. The DQSETUP location used here is typical. Once the QKB has been loaded, the DQ functions can be used freely within SAS, for example in Proc SQL queries. This one produces results similar to those from Data Management Studio:
“ENGBR” is the name of the locale. Notice that in this case the report shows a frequency of 3 for a blank pattern, which Data Management Studio did not mention.
The QKB encapsulates the intelligence used by DataFlux, and includes, among other things, a whole battery of regular expression libraries. It is some of these regular expressions that are driving the pattern recognition processing we have been considering.
The SAS language has its own regular expression facilities, in the form of the PRX functions, and in many cases it is not hard to perform pattern matching using Base SAS code without any DataFlux facilities. For example, to check for the NI patterns we are interested in here:
The regular expression is looking for a string consisting of two upper-case letters (“[A-Z]”), followed immediately by six digits (“\d”) and another upper-case letter. The “^” and “$” anchor the string to the beginning and end respectively of the input value, so it must not contain anything else. The PRXMATCH function returns a value of 1 if a match is found, and 0 otherwise. This time the COMPRESS function has been used to eliminate embedded spaces from the input. The results are:
So once again we can conclude that most of the time the column IN_NUM does contain a string that is structured like a National Insurance number.
SAS provides numerous ways of checking data as required for GDPR. While DataFlux facilities may be the most powerful and convenient, equivalent functionality is usually available in Base SAS code, either with or without the use of the DataFlux QKB, at the cost of some additional programming effort.