Data Management Studio – Performing Tasks on Multiple Variables
Data jobs within SAS Data Management Studio allow users to perform a wide range of data processing tasks on data items, including applying business rules and monitoring data. Typically, these actions are performed on specified fields within specified datasets. However, you can make your data jobs and process jobs more versatile by performing tasks on multiple variables.
This tip shows an example of using a data job to list all of the variables within a dataset. Subsequent tasks can then use this list to cycle through each variable and perform the same task on each. This is something a SAS programmer would routinely code, but it is less intuitive in Data Management Studio. A typical use of this is to apply a business rule to each variable.
The data job consists of the following nodes:
- Read the first row of data
- List the variables in the dataset
- Output the required rows
- Create a work dataset with a list of variables
Read the first row of data
Use the SQL query node to read all of the fields from the data.
The query can be made more efficient by reading in just the first row of data, since all we are interested in is the variable names. This is set in the ‘Advanced Properties’.
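The query itself can be minimal, as only the field metadata is needed. A sketch of such a query is shown below; EXAMPLE_TABLE is a placeholder for your own data source:

```
/* Select all fields - the table name is a placeholder */
select * from EXAMPLE_TABLE
```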
List the variables in the dataset
Here we use an expression node (‘list the variables in the dataset’) that makes use of the fieldcount, fieldname and fieldtype functions in the Expression Engine Language (EEL) to extract the number of variables, and each variable’s name and type, from the dataset. The pre-expression initialises the fields to be added, and the expression loops over each variable and assigns values to the fields ‘column_name’ and ‘column_type’.
The pushrow function pushes the current row’s values onto a stack of rows that are output after the current row, producing one output row per variable.
When the pushed row status field is set via the ‘Settings’ tab, a defined field (‘prow’ in this case) is set to ‘true’ or ‘false’ for pushed rows. The output dataset contains added columns ‘column_name’, ‘column_type’ and ‘prow’, which can be viewed via the 'Preview' tab:
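A sketch of the pre-expression and expression is shown below. The field names match those described above; treat this as an illustrative outline rather than a definitive implementation, as exact syntax may vary with your EEL version:

```
// Pre-expression: declare and initialise the fields to be added
string column_name
string column_type
column_name = ""
column_type = ""
```

```
// Expression: loop over every field in the row,
// pushing one output row per variable
integer i
for i = 1 to fieldcount()
begin
   column_name = fieldname(i)
   column_type = fieldtype(i)
   pushrow()
end
```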
Output the required rows
Note that the final row from the previous node is not a pushed row. To avoid a duplicate value in the list of variables, this is handled by a further expression node (‘Output if pushrow is true’). This node uses the return function to determine which rows are included in the output; in this case, only the pushed rows from the previous expression.
Note that if you are only interested in performing subsequent tasks on a certain type of variable, you can filter for it here by modifying the expression. For example, you can modify the expression to return ‘real’ variables only.
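A sketch of the filtering expression, assuming the ‘prow’ and ‘column_type’ fields defined in the earlier node (if ‘prow’ is a string field rather than a boolean, compare it against "true" instead):

```
// Keep only the pushed rows from the previous node
return prow
```

To restrict the output to ‘real’ variables only:

```
// Keep only pushed rows whose type is 'real'
return prow and column_type == "real"
```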
Create a work dataset with a list of variables
The final step is to write the list of variables to a work dataset, achieved with a ‘Work Table Writer’ node. The work table includes just the field ‘column_name’. The node has properties as illustrated.
The data job is now complete. Typically, it would be used within a process job. Process jobs allow looping, so the work dataset output by this data job can be used to loop through each variable in the dataset and perform a task on it. A typical example is applying a custom metric to check the attributes of each value of each variable. A schematic for an example process job is illustrated below.