Dare to Compare

When moving ETL processes to a new environment, it’s important to test that those processes return the same results as they do in the old environment. We might expect that, given the same source data and the same code, we should see the same output. Of course that’s not always the case, because the data warehouse is not an abstract model (it’s sometimes easy to forget this when all you see is code and data), but a combination of many varied, interconnected pieces of hardware and software. When we change any one portion it’s hard to predict what, if any, difference this will make – so when we change all of them at once, we need to expect the unexpected.

So how can we tell what’s changed? The SAS Compare procedure is a very handy tool in this situation, providing a quick, automated means of determining differences in type and value between one dataset and another. Here’s an example of how I’ve used it recently in testing migrated data:

%macro compare(table, num, key) ;

%let old = old_&table ;
%let new = new_&table ;
%let result = res_&table ;
%let diff = df_&table ;

proc sort data=&newlib..&table out=&new;
    by &key ;
run ;

proc sort data=&oldlib..&table out=&old;
    by &key ;
run ;


proc compare compare=&new base=&old out=&result 
  outnoequal outbase outcomp outdif
  outstats=&diff (where=(_type_='NDIF' and _base_ ne 0)) listobs;
  id &key ;
run ;

data _diff ;
  /* Open the new table once to look up variable metadata */
  if _n_ = 1 then dsid = open("&new") ;
  retain dsid ;
  set &diff end=last;
  length column $40 table $40 type $40 fmt $80 ;
  table="&table" ;
  column = _var_ ;
  varnum = varnum(dsid, _var_) ;
  type = vartype(dsid, varnum) ;
  length = varlen(dsid, varnum) ;
  fmt = varfmt(dsid, varnum) ;
  if last then rc=close(dsid) ;
run ;

%if &num = 1 %then %do ;
    data diff ;
      set _diff 
      (keep=table column varnum type length fmt _base_ _comp_);
    run ;
%end ;
%else %do ;
    proc append base=diff data=_diff 
      (keep=table column varnum type length fmt _base_ _comp_);
    run ;
%end ;

%mend compare ;

Looking at each dataset in a particular library, the compare macro is run to compare it with its counterpart in the new library (&newlib in the above). There’s a fair bit going on here, so piece by piece, here’s what the macro does.

First, a bit of setup: we are going to be comparing datasets by the supplied &key parameter (rather than by row number), so for the purpose of proc compare we need to sort both sets first. There is of course a new version and an old version of &table, named so that we know which is which. We’re also naming output sets &result as ‘res_&table’ (this will hold rows which differ between versions), and &diff as ‘df_&table’ (this will contain output statistics summarising the comparison).

After sorting the old and new tables, the comparison is run. The old version is supplied as the ‘base=’ argument and the new version as ‘compare=’; the id statement identifies the variable(s) making up the unique key for the sets. The ‘out=’ argument specifies the &result output set, and the options outnoequal, outbase, outcomp and outdif specify, respectively, that only observations with some difference in values between base and compare are written out, that the matching rows from both the base and compare sets are included, and that a row showing the difference for each variable is added. The ‘outstats=’ argument is set to &diff to hold a set of summary statistics (restricted here to variables with one or more value differences). Finally, the ‘listobs’ option prints the observations whose unique key value is found in only one of the base and compare sets.

In the final part of the macro, we take the &diff set and pull out some metadata about the variables that have tested unequal between sets. All of these &diff sets are appended together into a single output dataset, which makes a handy summary of differences across the entire library. Note that the compare procedure also produces list output, which can be read as a detailed report.
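
The driver that loops over the library and calls the macro isn’t shown above. As a minimal sketch (assuming &oldlib and &newlib are already set, and assuming a hypothetical dataset keylist with one row per table, holding the table name in memname and its key variable list in key), it could be as simple as:

/* Hypothetical driver: one %compare call per table to be checked.     */
/* keylist, memname and key are illustrative names, not from the post. */
data _null_ ;
  set keylist ;
  call execute(cats('%compare(', memname, ',', _n_, ',', key, ')')) ;
run ;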


Replacing SQL Joins with the SAS Data Step Hash Object

So it has been a very long time since I posted on here, and a lot has changed in the meantime. Long story short, I got busy, changed jobs, and I’m still busy now, but with the change in scene I have a little more motivation to blog and hopefully some more interesting things to say. I figured I’d start out by posting a quick summary of a technical presentation I gave to the SUNZ quarterly meeting late last year.

The presentation was a brief look at how to use the SAS data step hash object to replace expensive SQL joins. The scenario I had in mind was a typical one working with a data warehouse, where we join from a large, central fact table to several (typically smaller) dimension tables:

[Figure: a central fact table joined to several smaller dimension tables]

The thing is, even when the dimension tables are (relatively) small, each lookup from the fact exacts a cost, in the form of CPU time – so the more rows in the fact, the more this costs. Not only that, but as the dimensions grow (which they naturally will over time), the CPU time required for each lookup will increase. This is why the quick and easy report which took 5 minutes to run when you wrote it a year ago now keeps you (and perhaps the rest of your reporting batch) waiting for the best part of an hour.

The advantage that the hash object offers is essentially that while growth in the fact still adds to the CPU time required, growth in the dimensions does not. The hash object guarantees (except in situations involving pathological data) constant time lookup. There can be monster savings here, with some caveats which I’ll get to later. For the moment, here’s a very quick how to.

First, the (generic) SQL we’re replacing:

proc sql ;
  create table report as
  select
  	dim1.attr1
	, dim2.attr2
	, dim3.attr3
	, dim4.attr4
	, fac.measure1
  from
  	fact1 fac
	inner join dimension1 dim1
		on fac.dim1_pk = dim1.dim1_pk
	inner join dimension2 dim2
		on fac.dim2_pk = dim2.dim2_pk
	inner join dimension3 dim3
		on fac.dim3_pk = dim3.dim3_pk
	inner join dimension4 dim4
		on fac.dim4_pk = dim4.dim4_pk
		;
quit ;

The idea with using the data step hash object to replace this is simple: we add a separate hash object for each dimension, containing the keys we are using to join on and the attributes we are adding into the report table. Then for each row in the fact, if we find a match in all dimensions, we add the row into the report.

The code is as follows:

data report ;
  /* 1 - 'Fake' set statement to add variables into the PDV*/
  if 0 then set
  	fact1 (keep = measure1)
  	dimension1 (keep = dim1_pk attr1)
	dimension2 (keep = dim2_pk attr2)
	dimension3 (keep = dim3_pk attr3)
	dimension4 (keep = dim4_pk attr4)
  ;

  /* 2 - Declare hash objects for each dimension*/
  if _n_ = 1 then do ;
    declare hash dim1 (dataset:"dimension1") ;
    dim1.definekey("dim1_pk") ;
    dim1.definedata("attr1") ;
    dim1.definedone() ;

    declare hash dim2 (dataset:"dimension2") ;
    dim2.definekey("dim2_pk") ;
    dim2.definedata("attr2") ;
    dim2.definedone() ;

    declare hash dim3 (dataset:"dimension3") ;
    dim3.definekey("dim3_pk") ;
    dim3.definedata("attr3") ;
    dim3.definedone() ;

    declare hash dim4 (dataset:"dimension4") ;
    dim4.definekey("dim4_pk") ;
    dim4.definedata("attr4") ;
    dim4.definedone() ;
  end ;

  /* 3 - 'Join' rows to the dimensions by matching with the .find() method*/
  do until (eof) ;
    set fact1 (keep=dim1_pk dim2_pk dim3_pk dim4_pk measure1) end=eof ;
	if dim1.find() = 0 and dim2.find() = 0 and 
		dim3.find() = 0 and dim4.find() = 0 then output ;
  end ;
  stop ;

  drop dim1_pk dim2_pk dim3_pk dim4_pk ;

run ;

As per the comments, the code breaks down into 3 steps:
1 – Fake a set statement: the hash objects are not created until execution time, so the compiler knows nothing about their key and data variables; the ‘if 0 then set’ supplies the column metadata needed to add those variables to the PDV, without actually reading any rows.
2 – Declare and create the hash objects. The definekey, definedata and definedone methods do the work of defining the hash object, after which SAS loops over the tables named in the ‘dataset’ parameter supplied with the declare statement. For each row the key and value pairs are added into the hash object.
3 – Perform the join by matching key values from the fact table into the dimension hash objects (using the hash object find() method). This is where one fundamental difference between the two approaches becomes apparent. We’re now not joining tables on disk, as we were with the SQL join; the fact table on disk is being matched with the hash objects, which are data structures entirely resident in memory.

So is it worth it? In a word, yes – but only if you’re willing to trade off a big (sometimes huge) increase in memory consumption against the CPU time you’ll be getting back. To illustrate this, here are some performance stats from a real-life example.
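
One practical note: to see memory usage reported alongside CPU time in the step statistics in the SAS log, you will generally need the FULLSTIMER system option switched on (the default is the briefer STIMER output):

/* Report full resource usage (including memory) for each step in the log */
options fullstimer ;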

First, a small join – 2 dimensions joined to a small fact table (about 100,000 rows):

[Figure: PROC SQL performance statistics for the small join (100,000-row fact)]

Replacing this with data step code using hash objects:

[Figure: data step hash performance statistics for the small join (100,000-row fact)]

There’s a small saving in CPU time, set against a slight increase in memory consumption. It hardly seems worthwhile replacing this code, but then again it’s not a very long-running report to begin with. The real savings come when we look at the full version – this time, 9 dimensions joined to a somewhat larger fact table (~ 10 million rows). First the SQL join:

[Figure: PROC SQL performance statistics for the full join (9 dimensions, ~10 million-row fact)]

Then, the data step version:

[Figure: data step hash performance statistics for the full join (9 dimensions, ~10 million-row fact)]

Here, replacing the SQL code has reduced the time required by a factor of 10. This is a huge difference and we could be doing backflips and high fives all round right now, but before we kick off the celebrations, take a moment to look at the memory usage.

You’ll see that the memory consumption with the SQL code is less than 300MB, whereas the data step hash code uses a little over 10 times that. In fact, even the data step code against the small fact required over 1GB. The memory usage is linked to the size of the dimensions that you’re shoving into the hash objects, so the decrease in CPU time is being paid for more or less directly with a corresponding increase in memory. So is this a problem? Well, it might be, or it might not be. Obviously it depends on the availability of both these resources – if your server is constantly running out of memory then I’d say yes, it’s a problem. Then again, if your users are always complaining about how long it takes for their reports to run, maybe the hash object is worth a look.

I delivered a slightly longer form of this as a presentation to SUNZ last year. The slideshow is at the link below (pptx); a pdf version is also available from the SUNZ website.

Data Step Hash Object vs SQL Join

Tricks with _TYPE_

In SAS I frequently make use of the very handy SUMMARY procedure. It will produce a wide range of aggregate results at whatever class level you specify – or multiple levels if you like. So given a dataset costs with measure variable total_cost and class variables month, age, tenure and location, I might wish to see one-way analyses of cost by all of these classes, and possibly two-way analysis of month vs age and month vs tenure. So I could write a whole lot of summary procedures like the below:

/*Month 1-way analysis*/
proc summary data=costs nway missing ;
	class month ;
	var   total_cost ;
	output out=cost_month_summary (drop=_type_ rename=(_freq_=count))
		min= max= p5= q1= median= q3= p95= mean= sum= /autoname ;
run;

/*Month by Age 2-way analysis*/
proc summary data=costs nway missing ;
	class month age;
	var   total_cost ;
	output out=cost_month_age_summary (drop=_type_ rename=(_freq_=count))
		min= max= p5= q1= median= q3= p95= mean= sum= /autoname ;
run;
/*etc, etc.*/

Now while that will work, it’s a lot of code to write and maintain. I prefer to request all the required results at once, in one proc:

proc summary data=costs missing ;
	class month age tenure location;
	var   total_cost;
	types /*One-way analyses*/
		  month age tenure location
		  /*two-way*/
		  month*age
		  month*tenure
	;
	output out=cost_summary (rename=(_freq_=count))
		min= max= p5= q1= median= q3= p95= mean= sum= /autoname ;
run;

The changes that make this work are of course the types statement, and removing the nway option (we no longer want only the n-way analysis).

The output dataset will then have a column for every class variable, one for each requested statistic (including the renamed _freq_), and crucially we’ve also kept the autogenerated _type_ variable. _Type_ is the variable which tells you which summary request you’re looking at, assigning numbers to each combination of variables according to a simple binary pattern. Basically, given n class variables arranged in logical order A1 … An in the output dataset, working from right to left each is assigned a value from 2^0 … 2^(n-1), so Am gets 2^(n-m). In our example, if month, age, tenure and location have logical positions 1, 2, 3 and 4 respectively, then month=2^3, age=2^2, tenure=2^1 and location=2^0. That means the requested analysis types get the following _type_ values:

month = 2^3 = 8
age = 2^2 = 4
tenure = 2^1 = 2
location = 2^0 = 1
month*tenure = 2^3 + 2^1 = 10
month*age = 2^3 + 2^2 = 12
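
With those values in hand, a particular analysis can be pulled out of the combined output by filtering on _type_. For example, a minimal sketch of extracting just the month*age cells, using the _type_ value of 12 worked out above (cost_month_age is just an illustrative name):

/* Pull the month*age analysis out of the combined summary (_type_ = 12) */
data cost_month_age ;
  set cost_summary ;
  where _type_ = 12 ;
  keep month age count total_cost_: ;
run ;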

This works fine for me, but the _type_ variable is a number, which is unhelpful to an audience without the technical knowledge, time or patience to decode it, so it would be great to be able to assign it a descriptive label systematically. Of course, given a combination of class variables in a summary dataset, it’s possible to work out the value of _type_ that has been assigned to them. It turns out that this paper from SUGI 31 by Dan Bruns contains the macro %findtype, written by Ron Coleman, which examines a summary dataset &table and returns the value of _type_ corresponding to the variable list &vars:

%macro findtype(table,vars);
   %local index fid flag target;
   %let fid = %sysfunc(open(&table,i));
   %if not &fid %then %goto exit;
   %let vars = %upcase(&vars);
   %let flag = 0;
   %let _type_ = 0;
   %do i = %sysfunc(attrn(&fid,nvars)) %to 1 %by -1;
      %let target = %upcase(%sysfunc(varname(&fid,&i)));
      %if &flag %then %do;
         %let index = 1;
         %do %while (%length(%scan(&vars,&index)) > 0);
            %if &target = %scan(&vars,&index) %then
               %let _type_ = %eval(&_type_ + 2**(&flag - &i - 1));
            %let index = %eval(&index + 1);
         %end;
      %end;
      %else %if &target = _TYPE_ %then
         %let flag = &i;
   %end;
   %let fid = %sysfunc(close(&fid));
   &_type_
   %goto term;
%Exit:;
   -1
%term:
%mend findtype;

What I wanted to do was use this macro to create a format typedesc to look up from _type_ to a description, so that in the example above put(12, typedesc.) would return ‘month*age’. I created a dataset types containing all the variable combinations requested in the summary (one observation per analysis, held in a character variable also named types), and then used this to create a control data set typefmt.
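
The types dataset itself is simple to build; here is a minimal sketch (the dataset and variable names match the description above, and the datalines values are just the analyses requested in the types statement):

/* One observation per requested analysis; the value doubles as the format label */
data types ;
  length types $40 ;
  input types :$40. ;
  datalines ;
month
age
tenure
location
month*age
month*tenure
;

With that in place, a short data _null_ step writes out the code to generate the control data set: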

filename out_tmp temp ;

/*Write a data step to generate a format control set*/
data _null_;
  set types end=last;
  file out_tmp ;
  if _n_=1 then put 'data typefmt; retain fmtname type ; fmtname="typedesc"; type="N";' ;
  line='start=%findtype(cost_summary, '|| strip(types) ||'); label="'|| strip(types) ||'"; output ;' ;
  put line ;
  if last then put 'run ;' ;
run;

%include out_tmp /source2 ;

Here the variable types acts as the label, and %findtype finds the _type_ value to apply the label to.

From there it’s a simple matter of compiling the format and applying it to _type_ in the summary dataset:

proc format library=work cntlin=typefmt ;
run ;

data cost_summary ;
  set cost_summary ;
  type=put(_type_, typedesc.) ;
run ;

Maintaining SAS formats in Excel

For one of the subject areas I deal with day to day, categorical data values are often stored in the database in narrow character fields, so we need to apply SAS formats to get readable values out in reports. To maintain these formats I store them in an Excel workbook, looking roughly like this:
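
(The shape of one of these ranges is roughly as follows, with the Start and Label values invented purely for illustration:)

Fmtname     Type  Start  Label
excel_user  C     A      Administrator
excel_user  C     U      Standard user
excel_user  C     X      External user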

This is a relatively painless way to manage keeping SAS formats up to date, as well as making them easily accessible to people who need to inspect them. Each format is stored in a named range, with the name of the range matching the name of the format. The column headers are chosen to agree with the SAS Format procedure requirement for an input control data set. The above range, for instance, would be named ‘excel_user’, and would be used to input the required data for a character format (Type=’C’) also named ‘excel_user’ (Fmtname).

So how do I get them into SAS from Excel? Once again, the SAS/ACCESS Excel libname statement is helpful here. First, we assign a libref ‘wb’ to the workbook holding the format ranges, and another libref ‘sasprogs’ which will be used to store the format catalog:

libname wb Excel "&path.\formats.xls" ver=2002 ;
libname sasprogs "&path" ;

Next we need to get a list of all the ranges that we’re interested in here. In this workbook the only named ranges are the ones containing the formats. Any other names that the libname engine detects will be the names of worksheets, which will contain a ‘$’ character. So, we simply exclude these from the query against the dictionary.tables table:

proc sql;
  create table tmp as
  select memname
  from dictionary.tables
  where libname='WB' 
    and index(memname, '$') = 0 ;
quit;

So now the table ‘tmp’ holds a list of all the format range names in the workbook. Now we need to use this table to guide SAS in building those formats. What we need to do is to submit a proc format statement for every one of those names. Like so:

%macro load_fmt(name) ;
  proc format library=sasprogs.sasfmts cntlin=wb.&name ;
%mend load_fmt;

Now we simply need to loop through the set of names to load each of the formats into the sasprogs.sasfmts catalog:

data _null_;
  set tmp;
  call execute('%load_fmt('||memname||')') ;
run;

All that’s left to do now is a little tidying up:

/*Close the file*/
libname wb clear ;
/*Inspect format contents*/
proc format library=sasprogs.sasfmts fmtlib cntlout=sasprogs.sasfmt_data; run;
/*Add them into the search path*/
options fmtsearch=(sasprogs.sasfmts);

And we’re done – all the formats are now loaded and available in the SAS session. Now if there are any changes, or a new format needs to be created, just add it into the file, save it and re-run this script. Sorted!

Populating Excel from SAS

A while back, I posted about using the SAS Excel libname engine to read and write to and from Excel files. That turned out to be a reasonably popular topic, and engendered quite a few questions, so I thought I’d follow up with an example of the typical use I’ve made of this facility, along with a couple of lessons learned along the way.

I usually have a pre-formatted template file set up, which I copy to a new file destination and then assign a libname:

data _null_;
    infile "&tmpltdir.\template.xls" recfm=n;  /*Template file*/
    input one_byte $char1. @@;
    file "&rptdir.\report.xls" recfm=n;        /*Report file*/
    put one_byte $char1. @@;
run;

libname wb Excel "&rptdir.\report.xls" ver=2002;

Copying like this (input/put) is a little clunky I think, but it works. X commands are not enabled by default from Enterprise Guide, so this is the best option remaining.

The data will be stored in named ranges set up for that purpose – but before they can be populated, they need to be cleared:

proc datasets library=wb;
  delete Listing Header;
run;
quit;

Listing in this case would be a tabular dataset, and Header is a one row set which contains information like report date, refresh date, titles etc. – any information which is particular to that edition of the report.

The next step is to populate the data in the ranges:

data Header ;
	format startdate enddate refreshdate reportdate date9.;
	progpath="&regref.\sasprogs\&pgmname";
	startdate=&startdate.d;
	enddate=&enddate.d;
	refreshdate="&dwh_d_fmt"d;
	reportdate="&sysdate9"d;
run;

data wb.Header (dblabel=yes);
	set Header;
run;

The code to populate Listing would run much the same way. One point to note here is that sometimes the order of the columns output is important. In this case, it is possible to control the order of columns by either using a retain statement in the data step, or by using proc sql and ordering the columns in the select clause.
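
For example, if the Header columns needed to be written in a specific order, a retain statement listing them before the set statement does the trick (a minimal sketch using the variables defined above):

/* The retain statement adds the variables to the PDV in the listed order, */
/* so the columns land in the Excel range in that order                    */
data wb.Header (dblabel=yes);
	retain progpath startdate enddate refreshdate reportdate;
	set Header;
run;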

Once the ranges have been populated, close the file by clearing the libname:

libname wb clear;

This is important as otherwise the file will be held open by SAS.

And there you have it – the report has now been populated. Actually, there’s a further wrinkle I should mention. For some reason, the report file created often suffers from a peculiar “bloating” effect, where the file size appears to be roughly double what it should be. However, once the file is opened and saved in Excel, the size returns to normal. In a couple of cases involving large numbers of regular reports, I have written a VBA routine to open and save each one in turn…as with the copy step above, I find it a little clunky, but it can save a considerable amount of space.

SAS macro variable scope

I’m going to assert something fairly uncontroversial: it’s good practice to limit the scope of your variables. The reasoning behind this is fairly easy to follow: the more places a variable’s value can be altered, the better the chance that at some point it will be altered unintentionally.

This is particularly true in the SAS macro language, where the rules governing how variables are resolved are fairly simple, but where it’s easy to lose sight of how conflicting scope can pose a risk. For a slightly contrived example, in the macro %test below, the value of the global variable x is checked. In %set, the value of x is changed:

%let x=Y;

%macro test;
  %if &x = Y %then %put "Flag set";
  %else %put "Flag not set";
%mend test;

%macro set(val);
  %let x=&val;
%mend set;

/*Some code*/
%set(Y);
/*Some other code*/
%test;

This all looks fine, and it is – until you call the macro %pointless_loop between %set and %test:

%macro pointless_loop(num_mths);
%let start=%sysfunc(intnx(month, %sysfunc(date()), -&num_mths, e));
%do i=0 %to &num_mths;
    %let x=%sysfunc(intnx(month, &start, &i, b), date9.);
%end;
%mend pointless_loop;

The problem here is that, by the time this macro is executed, x already exists in the global symbol table. If this wasn’t the case, x would be created in the symbol table local to %pointless_loop, and would be deleted with that symbol table when it finished executing. As it is, when %test runs, the x in the global symbol table will be altered, and will hold a formatted date string rather than the Y or N it’s intended to contain. This type of situation can arise very easily – particularly if macros are being re-used across applications. Ideally, you shouldn’t need to care what names are assigned to variables defined inside them.
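
A quick way to see this in action, using the macros exactly as defined above:

%let x=Y;
%set(Y);
%put Before the loop, x=&x;   /* x=Y */
%pointless_loop(3);
%put After the loop, x=&x;    /* x now holds a date9 string such as 01JAN2015, not Y */
%test;                        /* so this prints "Flag not set" */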

How could this be avoided? Well, apart from resisting the temptation to use variable names like x (as a rule, I think variable names should be as descriptive as possible), be explicit. There doesn’t need to be a problem with having x in both the global and local symbol tables – just declare the scope before using them:

%global x;
%let x=Y;

%macro pointless_loop(num_mths);
%let start=%sysfunc(intnx(month, %sysfunc(date()), -&num_mths, e));
%local x;
%do i=0 %to &num_mths;
    %let x=%sysfunc(intnx(month, &start, &i, b), date9.);
%end;
%mend pointless_loop;

Now the value of x in the global symbol table will be left untouched.

Using Proc Transpose to widen a time series

I’ve been finding myself doing this fairly frequently with our reporting lately: starting with a set of unit record data, I use proc summary to get to a sorted, narrow dataset like this:

Report_Month Location Outcome Num_Cases
01Jan2010 Auckland A 3
01Jan2010 Auckland B 5
01Jan2010 Christchurch A 1
01Jan2010 Christchurch B 2
01Jan2010 Christchurch C 5
01Jan2010 Wellington B 1
01Jan2010 Wellington C 7
01Feb2010 Auckland A 5
01Feb2010 Auckland B 4
01Feb2010 Auckland C 1
01Feb2010 Christchurch A 2
01Feb2010 Christchurch B 2
01Feb2010 Christchurch C 5
01Feb2010 Wellington A 2
01Feb2010 Wellington B 3
01Feb2010 Wellington C 1
01Mar2010 Auckland A 6
01Mar2010 Auckland B 3
01Mar2010 Christchurch B 1
01Mar2010 Christchurch C 6
01Mar2010 Wellington A 1
01Mar2010 Wellington B 3
01Apr2010 Auckland A 2
01Apr2010 Wellington C 5

Whereas the requirement would be to get to a wider set like this:

Report_Month Location A B C
01Jan2010 Auckland 3 5 0
01Jan2010 Christchurch 1 2 5
01Jan2010 Wellington 0 1 7
01Feb2010 Auckland 5 4 1
01Feb2010 Christchurch 2 2 5
01Feb2010 Wellington 2 3 1
01Mar2010 Auckland 6 3 0
01Mar2010 Christchurch 0 1 6
01Mar2010 Wellington 1 3 0
01Apr2010 Auckland 2 0 0
01Apr2010 Wellington 0 0 5

There are two ways I know of to achieve this transformation in SAS code. One would be to create the array of variables needed in a data step and populate them for each by group (in this case, Report_Month and Location would be the by variables) in turn. This approach has got a lot to recommend it – most importantly to my mind the data step is both flexible and straightforward.
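
To make the comparison concrete, here’s a rough sketch of what that data step version might look like for this example (hard-coding the three outcome columns; wide_ds is just an illustrative name):

/* One output row per Report_Month/Location, with a column per outcome */
data wide_ds ;
  set narrow ;
  by report_month location ;
  array cases {3} A B C ;
  retain A B C ;
  /* Reset the counts at the start of each by group */
  if first.location then do i = 1 to dim(cases) ;
    cases{i} = 0 ;
  end ;
  /* Place the count in the column matching the outcome value */
  j = whichc(outcome, 'A', 'B', 'C') ;
  if j > 0 then cases{j} = num_cases ;
  if last.location then output ;
  drop outcome num_cases i j ;
run ;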

On the other hand…if you’re dealing with a lot of variables the code required can get pretty verbose. Which is why a lazy programmer like me finds proc transpose to be a better solution – in this (admittedly simple) case, very little code is required:

proc transpose data=narrow
               out=wide (drop=_name_ _label_);
  by report_month location;
  id outcome;
run;

The code is short, and explicit – what’s intended here will be clearer to another programmer than the typical array processing idiom used in the data step.

Here’s a wrinkle, though: the transpose procedure quite rightly determines that in this case, it needs to create 3 new columns in the output dataset wide, to allow for the 3 outcome values A, B and C occurring in narrow. What if there were actually 4 outcomes that had to be counted, as in the below?

Report_Month Location A B C D
01Jan2010 Auckland 3 5 0 0
01Jan2010 Christchurch 1 2 5 0
01Jan2010 Wellington 0 1 7 0
01Feb2010 Auckland 5 4 1 0
01Feb2010 Christchurch 2 2 5 0
01Feb2010 Wellington 2 3 1 0
01Mar2010 Auckland 6 3 0 0
01Mar2010 Christchurch 0 1 6 0
01Mar2010 Wellington 1 3 0 0
01Apr2010 Auckland 2 0 0 0
01Apr2010 Wellington 0 0 5 0

The data step is going to be able to cope a lot better with this sort of situation, but let’s just assume that I’ve already written way too much code to change horses now. OK, here’s one way to fix this:

proc sql;
  create table wider
  (report_month num format=date9.
   , location char(20)
   , A num
   , B num
   , C num
   , D num
   );
quit;

proc datasets library=work;
  append base=wider data=wide force;
run;
quit;

Using the force option with append means the wider dataset will accept the wide dataset being appended to it, even with column D missing. Problem solved and proc transpose gets to stay.

Documentation for both proc transpose and proc datasets is available in the SAS online documentation.