The SCD Type 1 Loader in SAS® Data Integration Studio
SAS Data Integration Studio’s ordinary “Table Loader” – which is categorised as an “Access” transformation – can handle load styles of “replace”, “append” and “insert/update”.
SAS DI Studio also provides two Slowly Changing Dimensions (SCD) table loader transformations. Both are categorised as “Data” transformations.
All three of these loader transformations can handle either SAS data sets or database tables. The two SCD loaders offer the ability to automatically generate in the table some information about the history of individual records.
In the literature, an “SCD Type 0” is also defined, as “retain original” i.e. once written, a record never changes. This is the ultimate in slowness of dimension change! It can be done using the ordinary Table Loader, in either “Replace” or “Append” mode. (Serious nerds may wish to investigate SCD types 3, 4, 5, 6 and 7, which have also been defined).
SCD Type 1 means “update existing records”, so there will only ever be one record for a given key value. It is possible to use the Type 1 Loader in such a way that it does no more than the ordinary Table Loader would do in “update/insert” mode, but usually it will be specified so as to handle either a “last update” column, or surrogate keys, or both.
SCD Type 2 means “keep history records”. The Type 2 loader (which will not be considered here in any more detail) enables a table to retain a complete history of all versions of every record – which will mean that the table is liable to use up rather more disk space.
Using the Type 1 Loader in a DI Studio job is quite straightforward, although there are a couple of traps for the unwary.
The general idea is that the programmer can pass all the input data to the Type 1 Loader. The loader will work out for itself which records need updating, and leave the rest alone.
The important “gotcha” here is that the input data must not contain any duplicates i.e. it must not contain more than one record for the same key. If it does, the Type 1 Loader will attempt to write both (or all) of them. If the key column(s) have been specified as such in the registration of the table, then there will be warning messages about the violation of a table constraint. If not, it is possible that duplicate keys will arise in the output table.
These issues can easily be avoided by sorting the input data with the NODUPKEY option, before it is passed to the Type 1 Loader.
The Type 1 Loader can then be added to the job diagram. Notice that it generates four work tables, and it is the last of these that has to be connected to the output table. Hovering the cursor over the work tables will show the significance of each; some of the others may be of interest during debugging.
In the transformation’s properties, as well as the mappings, it is essential to set up the Keys and the Change Columns.
The key(s) are the business Key column(s) that uniquely identify records, and the Change Columns are the columns in which any change would cause the output table to be updated. In this example, every column except the Key column has been identified as a Change Column.
The above properties are sufficient to specify a Type 1 Loader that will work. However, they do nothing that that could not be achieved by using the ordinary Table Loader in “update/insert” mode. It is also possible to use (among other options) a surrogate key, or a “last update” column, or both.
A surrogate key can be specified in the lower part of the “Keys” tab, the upper part of which was just shown.
All that is needed here is to specify a column of the output table which is to act as the surrogate key column. In this case, column ORDKEY has been chosen. This column does not appear in the input data. Leave the “New record expression” as it is shown here, and do not touch the “Define Max Key” button. NewMaxKey is a DATA step variable (not a macro variable) which will contain, at the time the surrogate key is being set for a new record, the highest surrogate key value already used for the table. (This can be confirmed by looking at the “Code” tab of the Properties window.)
In the table, the surrogate key column will contain integer values running consecutively from 1, records being numbered according to the order in which they were added to the table. When a record is updated, its surrogate key will not change. Surrogate keys are particularly useful when the business key (as specified on the Keys tab) is compound i.e. consists of more than one column. In such a case, efficiency should be improved by making the surrogate key the primary key for the table.
As a possible alternative, DI Studio has a “surrogate key” transformation, which can be used to generate surrogate keys for use with, for example, the ordinary Table Loader.
Finally, a “last update” column can be specified. This should be a numeric column with either a date or datetime format, and it indicates when the record was last updated. It is present in the output table, and not in the input data. To specify such a column, go to the “Options” tab:
Here an output column called LAST_UPDATED has been specified, and the expression shown here (which is the default) says that it is to contain the datetime at which the record was last updated. This is widely regarded as the main purpose of the Type 1 Loader.
If information about previous versions of records is going to be required, then the table needs to be structured differently, and the SCD Type 2 Loader used instead.