Dare to Compare

When moving ETL processes to a new environment, it’s important to test that those processes return the same results as they do in the old environment. We might expect that, given the same input from source data, running the same code, we should see the same output. Of course that’s not always the case, because the data warehouse is not an abstract model (it’s sometimes easy to forget this when all you see is code and data), but a combination of many varied, interconnected pieces of hardware and software. When we change any one portion it’s hard to predict what, if any, difference this will make – so when we change all of them at once, we need to expect the unexpected.

So how can we tell what’s changed? The SAS Compare procedure is a very handy tool in this situation, providing a quick, automated means of determining differences in type and value between one dataset and another. Here’s an example of how I’ve used it recently in testing migrated data:

%macro compare(table, num, key) ;

%let old = old_&table ;
%let new = new_&table ;
%let result = res_&table ;
%let diff = df_&table ;

/*Sort both versions by the key before comparing*/
proc sort data=&newlib..&table out=&new;
    by &key ;
run ;
proc sort data=&oldlib..&table out=&old;
    by &key ;
run ;


proc compare compare=&new base=&old out=&result 
  outnoequal outbase outcomp outdif
  outstats=&diff (where=(_type_='NDIF' and _base_ ne 0)) listobs;
  id &key ;
run ;

/*Attach variable metadata from the new table to the difference statistics*/
data _diff ;
  retain dsid ;
  if _n_ = 1 then dsid = open("&new") ;  /*open once; the metadata functions below reuse the handle*/
  set &diff end=last;
  length column $40 table $40 type $40 fmt $80 ;
  table="&table" ;
  column = _var_ ;
  varnum = varnum(dsid, _var_) ;
  type = vartype(dsid, varnum) ;
  length = varlen(dsid, varnum) ;
  fmt = varfmt(dsid, varnum) ;
  if last then rc=close(dsid) ;
run ;

%if &num = 1 %then %do ;
    data diff ;
      set _diff 
      (keep=table column varnum type length fmt _base_ _comp_);
    run ;
%end ;
%else %do ;
    proc append base=diff data=_diff 
      (keep=table column varnum type length fmt _base_ _comp_);
    run ;
%end ;

%mend compare ;

For each dataset in the old library, the compare macro is run to compare it with its counterpart in the new library (&oldlib and &newlib in the above; both librefs are assumed to be assigned outside the macro). There’s a fair bit going on here, so piece by piece, here’s what the macro does.

First, a bit of setup: we are going to be comparing datasets by the supplied &key parameter (rather than by row number), so for the purpose of proc compare we need to sort both sets first. There is of course a new version and an old version of &table, named so that we know which is which. We’re also naming output sets &result as ‘res_&table’ (this will hold rows which differ between versions), and &diff as ‘df_&table’ (this will contain output statistics summarising the comparison).

After sorting the old and new tables, the comparison is run. The old version is supplied as the ‘base=’ argument and the new version as ‘compare=’; the id statement identifies the variable(s) making up the unique key for the sets. The ‘out=’ argument specifies the &result output set, and the options outnoequal, outbase, outcomp and outdif specify respectively that we only keep observations where base and compare differ in some value, that we get the rows from the base set, the rows from the compare set, and a row of differences for each pair of matching observations. The ‘outstats=’ argument is set to &diff to hold a set of summary statistics (which we are restricting to variables with one or more value differences). Finally, the ‘listobs’ option specifies that we get to see a printout of observations where the unique key value is found in only one of the base and compare sets.

In the final part of the macro, we take the &diff set and pull out some metadata about the variables that have tested unequal between sets. All of these &diff sets are appended together to a big output dataset which can be used as a handy summary of differences across the entire library. Note the compare procedure also produces list output which can then be read as a detailed report.
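As for actually looping over the library: a driver along these lines would do it. Here table_keys is a hypothetical control dataset listing each table name alongside its key variable(s), and the %nrstr wrapper is the usual trick to stop %compare executing in the middle of the data step rather than when the pushed code runs:

/*Hypothetical driver: one row per table, with columns table and key*/
data _null_ ;
  set table_keys ;
  call execute(cats('%nrstr(%compare)(', table, ',', _n_, ',', key, ')')) ;
run ;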

Tricks with _TYPE_

In SAS I frequently make use of the very handy SUMMARY procedure. It will produce a wide range of aggregate results at whatever class level you specify – or multiple levels if you like. So given a dataset costs with measure variable total_cost and class variables month, age, tenure and location, I might wish to see one-way analyses of cost by all of these classes, and possibly two-way analysis of month vs age and month vs tenure. So I could write a whole lot of summary procedures like the below:

/*Month 1-way analysis*/
proc summary data=costs nway missing ;
	class month ;
	var   total_cost ;
	output out=cost_month_summary (drop=_type_ rename=(_freq_=count))
		min= max= p5= q1= median= q3= p95= mean= sum= /autoname ;
run;

/*Month by Age 2-way analysis*/
proc summary data=costs nway missing ;
	class month age;
	var   total_cost ;
	output out=cost_month_age_summary (drop=_type_ rename=(_freq_=count))
		min= max= p5= q1= median= q3= p95= mean= sum= /autoname ;
run;
/*etc, etc.*/

Now while that will work, it’s a lot of code to write and maintain. I prefer to request all the required results at once, in one proc:

proc summary data=costs missing ;
	class month age tenure location;
	var   total_cost;
	types /*One-way analyses*/
		  month age tenure location
		  /*two-way*/
		  month*age
		  month*tenure
	;
	output out=cost_summary (rename=(_freq_=count))
		min= max= p5= q1= median= q3= p95= mean= sum= /autoname ;
run;

The change that makes this work is of course the types statement, along with removing the nway option (we no longer want only the full n-way analysis).

The output dataset will then have a column for every class variable, one for each requested statistic (including the renamed _freq_), and crucially we’ve also kept the autogenerated _type_ variable. _Type_ is the variable which tells you which summary request you’re looking at, assigning numbers to each combination of variables according to a simple binary pattern. Basically, given n class variables arranged in logical order A_1 … A_n in the output dataset, working from right to left each is assigned a value from 2^0 … 2^(n-1), so A_m gets 2^(n-m). In our example, if month, age, tenure and location have logical positions 1, 2, 3 and 4 respectively, then month=2^3, age=2^2, tenure=2^1 and location=2^0. That means the requested analysis types get the following _type_ values:

month = 2^3 = 8
age = 2^2 = 4
tenure = 2^1 = 2
location = 2^0 = 1
month*tenure = 2^3 + 2^1 = 10
month*age = 2^3 + 2^2 = 12
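Just to see that pattern in action, here’s a quick throwaway check of my own (the format-based labelling below is the tidier way to do it) that reconstructs the combination from _type_ with bitwise ANDs, using the weights worked out above:

data check_types ;
  set cost_summary ;
  length combo $40 ;
  /*add each class variable whose bit is set in _type_*/
  if band(_type_, 8) then combo = catx('*', combo, 'month') ;
  if band(_type_, 4) then combo = catx('*', combo, 'age') ;
  if band(_type_, 2) then combo = catx('*', combo, 'tenure') ;
  if band(_type_, 1) then combo = catx('*', combo, 'location') ;
run ;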

The numbering scheme works fine for me, but _type_ is still just a number, which is unhelpful to an audience without the technical knowledge, time or patience to decode it, so it would be great to be able to systematically assign it a descriptive label. Of course, given a combination of class variables in a summary dataset, it’s possible to work out the value of _type_ that has been assigned to them. It turns out that this paper from SUGI 31 by Dan Bruns contains the macro %findtype, written by Ron Coleman, which will examine a summary dataset table and return the value of _type_ for the variable list vars:

%macro findtype(table,vars);
   %local index fid flag target;
   %let fid = %sysfunc(open(&table,i));
   %if not &fid %then %goto exit;
   %let vars = %upcase(&vars);
   %let flag = 0;
   %let _type_ = 0;
   %do i = %sysfunc(attrn(&fid,nvars)) %to 1 %by -1;
      %let target = %upcase(%sysfunc(varname(&fid,&i)));
      %if &flag %then %do;
         %let index = 1;
         %do %while (%length(%scan(&vars,&index)) > 0);
            %if &target = %scan(&vars,&index) %then
               %let _type_ = %eval(&_type_ + 2**(&flag - &i - 1));
            %let index = %eval(&index + 1);
         %end;
      %end;
      %else %if &target = _TYPE_ %then
         %let flag = &i;
   %end;
   %let fid = %sysfunc(close(&fid));
   &_type_
   %goto term;
%Exit:;
   -1
%term:
%mend findtype;

What I wanted to do was use this macro to create a format typedesc to look up from _type_ to a description, so that in the example above put(12, typedesc.) would return ‘month*age’. I created a dataset types containing all the variable combinations requested in the summary, and then I used this to create a control data set typefmt:

filename out_tmp temp ;

/*Write a data step to generate a format control set*/
data _null_;
  set types end=last;
  file out_tmp ;
  if _n_=1 then put 'data typefmt; retain fmtname type ; fmtname="typedesc"; type="N";' ;
  line='start=%findtype(cost_summary, '|| strip(types) ||'); label="'|| strip(types) ||'"; output ;' ;
  put line ;
  if last then put 'run ;' ;
run;

%include out_tmp /source2 ;
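For completeness, the types dataset driving this is nothing fancy; it just needs one row per requested combination, so something along these lines would do (I’m typing them in here for illustration):

data types ;
  length types $40 ;
  input types $ ;
  datalines ;
month
age
tenure
location
month*age
month*tenure
;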

Here the variable types acts as the label, and %findtype finds the _type_ value to apply the label to.
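To make that concrete, the program written to out_tmp ends up looking something like this; when it is %included, each %findtype call resolves to the corresponding number (8, 4, 2, 1, 12 and 10):

data typefmt; retain fmtname type ; fmtname="typedesc"; type="N";
start=%findtype(cost_summary, month); label="month"; output ;
start=%findtype(cost_summary, age); label="age"; output ;
start=%findtype(cost_summary, tenure); label="tenure"; output ;
start=%findtype(cost_summary, location); label="location"; output ;
start=%findtype(cost_summary, month*age); label="month*age"; output ;
start=%findtype(cost_summary, month*tenure); label="month*tenure"; output ;
run ;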

From there it’s a simple matter of compiling the format and applying it to _type_ in the summary dataset:

proc format library=work cntlin=typefmt ;
run ;

data cost_summary ;
  set cost_summary ;
  type=put(_type_, typedesc.) ;
run ;
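After that, pulling out a single slice for reporting, say the month-by-age analysis, is as simple as something like:

proc print data=cost_summary noobs ;
  where type = 'month*age' ;
run ;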

Maintaining SAS formats in Excel

For one of the subject areas I deal with day to day, categorical data values are often stored in the database in narrow character fields, so we need to apply SAS formats to get readable values out in reports. To maintain these formats I store them in an Excel workbook, looking roughly like this:
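For illustration, a range such as ‘excel_user’ might hold rows like these (the values are invented):

Fmtname      Start   Label            Type
excel_user   A       Administrator    C
excel_user   P       Power user       C
excel_user   S       Standard user    C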

This is a relatively painless way to keep SAS formats up to date, as well as making them easily accessible to people who need to inspect them. Each format is stored in a named range, with the name of the range matching the name of the format. The column headers are chosen to agree with the SAS Format procedure requirement for an input control data set. The above range, for instance, would be named ‘excel_user’, and would be used to input the required data for a character format (Type=’C’) also named ‘excel_user’ (Fmtname).

So how do I get them into SAS from Excel? Once again, the SAS/ACCESS Excel libname statement is helpful here. First, we assign a libref ‘wb’ to the workbook holding the format ranges, and another libref ‘sasprogs’ which will be used to store the format catalog:

libname wb Excel "&path.\formats.xls" ver=2002 ;
libname sasprogs "&path" ;

Next we need to get a list of all the ranges that we’re interested in here. In this workbook the only named ranges are the ones containing the formats. Any other names that the libname engine detects will be the names of worksheets, which will contain a ‘$’ character. So, we simply exclude these from the query against the dictionary.tables table:

proc sql;
  create table tmp as
  select memname
  from dictionary.tables
  where libname='WB' 
    and index(memname, '$') = 0 ;
quit;

So now the table ‘tmp’ holds a list of all the format range names in the workbook. Now we need to use this table to guide SAS in building those formats. What we need to do is to submit a proc format statement for every one of those names. Like so:

%macro load_fmt(name) ;
  proc format library=sasprogs.sasfmts cntlin=wb.&name ;
  run ;
%mend load_fmt;

Now we simply need to loop through the set of names to load each of the formats into the sasprogs.sasfmts catalog:

data _null_;
  set tmp;
  call execute('%load_fmt('||strip(memname)||')') ;
run;
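For each row of tmp, that pushes a proc format step onto the input stack, so what actually runs is a series of steps along these lines (range names other than excel_user are made up):

proc format library=sasprogs.sasfmts cntlin=wb.EXCEL_USER ;
run ;
proc format library=sasprogs.sasfmts cntlin=wb.EXCEL_STATUS ;
run ;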

All that’s left to do now is a little tidying up:

/*Close the file*/
libname wb clear ;
/*Inspect format contents*/
proc format library=sasprogs.sasfmts fmtlib cntlout=sasprogs.sasfmt_data; run;
/*Add them into the search path*/
options fmtsearch=(sasprogs.sasfmts);

And we’re done – all the formats are now loaded and available in the SAS session. Now if there are any changes, or a new format needs to be created, just add it into the file, save it and re-run this script. Sorted!
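From then on the formats behave like any other, for example (some_table and its user_type variable are stand-ins for real names; the $ prefix is needed because it’s a character format):

proc print data=some_table ;
  format user_type $excel_user. ;
run ;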

SAS macro variable scope

I’m going to assert something fairly uncontroversial: it’s good practice to limit the scope of your variables. The reasoning behind this is fairly easy to follow: the more places a variable’s value can be altered, the better the chance that at some point it will be altered unintentionally.

This is particularly true in the SAS macro language, where the rules governing how variables are resolved are fairly simple, but where it’s easy to lose sight of how conflicting scope can pose a risk. For a slightly contrived example, in the macro %test below, the value of the global variable x is checked. In %set, the value of x is changed:

%let x=Y;

%macro test;
  %if &x = Y %then %put "Flag set";
  %else %put "Flag not set";
%mend test;

%macro set(val);
  %let x=&val;
%mend set;

/*Some code*/
%set(Y);
/*Some other code*/
%test;

This all looks fine, and it is – until you call the macro %pointless_loop between %set and %test:

%macro pointless_loop(num_mths);
%let start=%sysfunc(intnx(month, %sysfunc(date()), -&num_mths, e));
%do i=0 %to &num_mths;
    %let x=%sysfunc(intnx(month, &start, &i, b), date9.);
%end;
%mend pointless_loop;

The problem here is that, by the time this macro is executed, x already exists in the global symbol table. If this wasn’t the case, x would be created in the symbol table local to %pointless_loop, and would be deleted with that symbol table when it finished executing. As it is, when %test runs, the x in the global symbol table will be altered, and will hold a formatted date string rather than the Y or N it’s intended to contain. This type of situation can arise very easily – particularly if macros are being re-used across applications. Ideally, you shouldn’t need to care what names are assigned to variables defined inside them.
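You can see the clash by running the calls in sequence:

%let x=Y;
%set(Y);
%pointless_loop(3);   /* quietly overwrites the global x with a date string */
%test;                /* now prints "Flag not set" */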

How could this be avoided? Well, apart from resisting the temptation to use variable names like x (as a rule, I think variable names should be as descriptive as possible), be explicit. There doesn’t need to be a problem with having x in both the global and local symbol table – just declare the scope before using them:

%global x;
%let x=Y;

%macro pointless_loop(num_mths);
%let start=%sysfunc(intnx(month, %sysfunc(date()), -&num_mths, e));
%local x;
%do i=0 %to &num_mths;
    %let x=%sysfunc(intnx(month, &start, &i, b), date9.);
%end;
%mend pointless_loop;

Now the value of x in the global symbol table will be left untouched.
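Running the same sequence again confirms it:

%set(Y);
%pointless_loop(3);   /* this time the loop's x lives in its own local symbol table */
%test;                /* prints "Flag set" */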