Man, Woman, Couple or Company?
Suppose we want to work out, for each name in our contacts database, whether it refers to a company, a man or a woman – or perhaps to more than one person. At first sight this is quite a tricky programming problem – but the SAS Data Quality (DQ) solution can help a lot. This is an optional part of the SAS system, based around DataFlux software.
The first thing to do is to load the DQ system:
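As a minimal sketch of this step, assuming the %DQLOAD autocall macro with its DQLOCALE= and DQSETUPLOC= arguments, the load might look like the following. The path is a hypothetical placeholder, not a real installation location, and should be replaced with the site's own DQSETUP file.

```sas
/* Load the British English locale into memory for the DQ functions. */
/* ASSUMPTION: the DQSETUPLOC path is a hypothetical placeholder;    */
/* substitute the DQSETUP file of your own installation.             */
%dqload(dqlocale=(ENGBR),
        dqsetuploc='C:\path\to\dqsetup.txt');
```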
Notice that a locale has to be specified, here “ENGBR”, so the vocabularies used by the DQ functions that follow will be specifically British. The location shown for the DQSETUP file is typical.
The first step in processing the names will be to identify which are companies and which are individuals. This can be done using an “Identification definition”. The one used here for this purpose is called “Contact Info”, for which the DQIDENTIFY function will return values of “ORG” and “IND” for the two kinds of data we are chiefly interested in. We also try a definition called “Offensive”.
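A minimal sketch of this identification step might look like the following. The data set names CONTACTS and IDENTIFIED and the variables NAME, NAME_TYPE and OFFENSIVE are hypothetical, and the sketch assumes that “Offensive” is available as an identification definition in the loaded locale.

```sas
/* ASSUMPTION: CONTACTS is a hypothetical data set holding one */
/* contact name per row in the character variable NAME.        */
data identified;
    set contacts;
    length name_type offensive $ 20;

    /* Classify the name, e.g. as an organisation or an individual */
    name_type = dqIdentify(name, 'Contact Info');

    /* Run the same text through the "Offensive" definition */
    offensive = dqIdentify(name, 'Offensive');
run;
```

No locale argument is passed to DQIDENTIFY here, so the function falls back on the locale loaded earlier.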
Some records are determined to be neither organisations nor individuals. These include:
The first of these is of type “unknown”, which is fair enough. The second managed a “pass” from the “Offensive” definition, despite including the word “bottom”, and was interpreted as probably an address – which is quite clever, even if not actually correct. No examples will be given here of text that fails the offensiveness check, but it is worth noting that, although all the most notoriously offensive words will be spotted in their “raw” form, adding an “s” on the end apparently makes them acceptable. If you propose to use this definition, some judicious editing of the relevant DQ vocabulary is called for.
Names identified as those of companies include:
Strings such as “Ltd”, “Company”, “& Co”, “plc”, “and Sons” have enabled some companies to be identified as such. The reasons for putting some of the others into this category are less obvious, particularly as Mr Potter’s earlier liaison (above) was classed as “unknown”.
Those identified as individuals include:
This is all non-controversial. Notice that without “Ltd” on the end, “Smythe and Weston” look like a pair of individuals rather than a company. In fact, several of these name strings relate to pairs of individuals. There is a “parse definition” that can be used to separate these out; it is called “Name (Multiple Name)”.
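A sketch of how this parse definition might be applied to the names identified as individuals is given here. The data set names CONTACTS and INDIVIDUALS and the variables NAME, NAME1 and NAME2 are hypothetical, and the variable lengths are guesses rather than requirements.

```sas
/* ASSUMPTION: CONTACTS is a hypothetical data set with one name */
/* per row in the character variable NAME.                       */
data individuals;
    set contacts;

    /* Keep only the records identified as individuals */
    if dqIdentify(name, 'Contact Info') = 'IND';

    length parsed $ 200 name1 name2 $ 100;

    /* Parse the name string into a delimited string of tokens */
    parsed = dqParse(name, 'Name (Multiple Name)');

    /* Pull the two name tokens back out of the parsed string */
    name1 = dqParseTokenGet(parsed, 'Name 1', 'Name (Multiple Name)');
    name2 = dqParseTokenGet(parsed, 'Name 2', 'Name (Multiple Name)');
run;
```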
The parsed version of the name string is held in a variable called PARSED. The function DQPARSETOKENGET is then used to retrieve from this parsed string the two tokens “Name 1” and “Name 2”. If the second of these is non-blank, then it will contain the name of the second individual. The data set output from this includes:
Notice that a degree of intelligence has been shown here: for example, “Mr J P and Mrs Hetty Grainger” now each have their own copy of their shared surname. We can go on to determine the gender of each of these people, using a “gender definition” called “Name”.
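The gender step could be sketched as follows, assuming the DQGENDER function and a hypothetical data set PEOPLE that holds one individual per row in the variable PERSON (for example, the separated names from a multiple-name parse). The single-character codes mentioned in the comment are an assumption about the function's return values.

```sas
/* ASSUMPTION: PEOPLE is a hypothetical data set with one */
/* individual's name per row in the variable PERSON.      */
data genders;
    set people;
    length gender $ 1;

    /* Expected to return a one-character code, e.g. M, F or U (unknown) */
    gender = dqGender(person, 'Name');
run;

/* Release the locale from memory once the DQ work is finished */
%dqunload;
```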
We unload the DQ system after this, having achieved all we set out to do. The output is:
The genders determined all look reasonable. “Dr Fox”, when alone, is of unknown gender, though when arriving in company with Mrs Fox we might perhaps have assumed a male doctor.