Replacing SQL Joins with the SAS Data Step Hash Object

So, it has been a very long time since I posted here, and a lot has changed in the meantime. Long story short: I got busy, changed jobs, and am still busy now, but with the change of scene I have a little more motivation to blog and hopefully some more interesting things to say. I figured I'd start by posting a quick summary of a technical presentation I gave at the SUNZ quarterly meeting late last year.

The presentation was a brief look at how to use the SAS data step hash object to replace expensive SQL joins. The scenario I had in mind was a typical one working with a data warehouse, where we join from a large, central fact table to several (typically smaller) dimension tables:

[Figure: join from a large central fact table to four smaller dimension tables]

The thing is, even when the dimension tables are (relatively) small, each lookup from the fact exacts a cost, in the form of CPU time – so the more rows in the fact, the more this costs. Not only that, but as the dimensions grow (which they naturally will over time), the CPU time required for each lookup will also increase. This is why the quick and easy report which took 5 minutes to run when you wrote it a year ago now keeps you (and perhaps the rest of your reporting batch) waiting for the best part of an hour.

The advantage that the hash object offers is essentially that while growth in the fact still adds to the CPU time required, growth in the dimensions does not: the hash object guarantees (except in situations involving pathological data) constant-time lookup. There can be monster savings here, with some caveats which I'll get to later. For the moment, here's a very quick how-to.

First, the (generic) SQL we’re replacing:

proc sql ;
  create table report as
  select
  	dim1.attr1
	, dim2.attr2
	, dim3.attr3
	, dim4.attr4
	, fac.measure1
  from
  	fact1 fac
	inner join dimension1 dim1
		on fac.dim1_pk = dim1.dim1_pk
	inner join dimension2 dim2
		on fac.dim2_pk = dim2.dim2_pk
	inner join dimension3 dim3
		on fac.dim3_pk = dim3.dim3_pk
	inner join dimension4 dim4
		on fac.dim4_pk = dim4.dim4_pk
		;
quit ;

The idea with using the data step hash object to replace this is simple: we add a separate hash object for each dimension, containing the keys we are using to join on and the attributes we are adding into the report table. Then for each row in the fact, if we find a match in all dimensions, we add the row into the report.

The code is as follows:

data report ;
  /* 1 - 'Fake' set statement to add variables into the PDV*/
  if 0 then set
  	fact1 (keep = measure1)
  	dimension1 (keep = dim1_pk attr1)
	dimension2 (keep = dim2_pk attr2)
	dimension3 (keep = dim3_pk attr3)
	dimension4 (keep = dim4_pk attr4)
  ;

  /* 2 - Declare hash objects for each dimension*/
  if _n_ = 1 then do ;
  	declare hash dim1 (dataset:"dimension1") ;
	dim1.definekey("dim1_pk") ;
	dim1.definedata("attr1") ;
	dim1.definedone() ;
	
  	declare hash dim2 (dataset:"dimension2") ;
	dim2.definekey("dim2_pk") ;
	dim2.definedata("attr2") ;
	dim2.definedone() ;
	
  	declare hash dim3 (dataset:"dimension3") ;
	dim3.definekey("dim3_pk") ;
	dim3.definedata("attr3") ;
	dim3.definedone() ;
	
  	declare hash dim4 (dataset:"dimension4") ;
	dim4.definekey("dim4_pk") ;
	dim4.definedata("attr4") ;
	dim4.definedone() ;
  end ;

  /* 3 - 'Join' rows to the dimensions by matching with the .find() method*/
  do until (eof) ;
  	set fact1 (keep=dim1_pk dim2_pk dim3_pk dim4_pk measure1) end=eof ;
	if dim1.find() = 0 and dim2.find() = 0 and 
		dim3.find() = 0 and dim4.find() = 0 then output ;
  end ;
  stop ;

  drop dim1_pk dim2_pk dim3_pk dim4_pk ;

run ;

As per the comments, the code breaks down into 3 steps:
1 – Fake a set statement: the data step compiler knows nothing about the contents of the hash objects at compile time, so we need to supply it with column metadata to assist with the formation of the PDV.
2 – Declare and create the hash objects. The definekey, definedata and definedone methods do the work of defining each hash object, after which SAS loops over the table named in the ‘dataset’ argument supplied to the declare statement, adding the key and data values from each row into the hash object.
3 – Perform the join by matching key values from the fact table into the dimension hash objects (using the hash object find() method). This is where one fundamental difference between the two approaches becomes apparent. We’re now not joining tables on disk, as we were with the SQL join; the fact table on disk is being matched with the hash objects, which are data structures entirely resident in memory.
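Note that because a row is only output when every find() call returns 0, this reproduces the inner join in the original SQL. If you wanted left-join behaviour instead, a minimal sketch (reusing the names above) would replace step 3 of the program with something like the following, clearing the attribute variables before each lookup so that a failed find() leaves them missing rather than carrying over the previous row's values:

  /* 3 (alternative) - left-join style lookup */
  do until (eof) ;
    set fact1 (keep=dim1_pk dim2_pk dim3_pk dim4_pk measure1) end=eof ;
    call missing(attr1, attr2, attr3, attr4) ;
    rc = dim1.find() ;
    rc = dim2.find() ;
    rc = dim3.find() ;
    rc = dim4.find() ;
    output ;
  end ;
  stop ;
  drop rc ;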

So is it worth it? In a word, yes – but only if you’re willing to trade off a big (sometimes huge) increase in memory consumption against the CPU time you’ll be getting back. To illustrate this, here are some performance stats from a real-life example.

First, a small join – 2 dimensions joined to a small fact table (about 100,000 rows):

[Figure: PROC SQL performance statistics, 100K-row fact]

Replacing this with data step code using hash objects:

[Figure: data step hash performance statistics, 100K-row fact]

There’s a small saving in CPU time, set against a slight increase in memory consumption. It hardly seems worthwhile replacing this code, but then again it’s not a very long-running report to begin with. The real savings come when we look at the full version – this time, 9 dimensions joined to a somewhat larger fact table (~ 10 million rows). First the SQL join:

[Figure: PROC SQL performance statistics, ~10 million row fact]

Then, the data step version:

[Figure: data step hash performance statistics, ~10 million row fact]

Here, replacing the SQL code has reduced the time required by a factor of 10. This is a huge difference and we could be doing backflips and high fives all round right now, but before we kick off the celebrations, take a moment to look at the memory usage.

You’ll see that the memory consumption with the SQL code is less than 300MB, whereas the data step hash code uses a little over 10 times that. In fact, even the data step code against the small fact required over 1GB. The memory usage is linked to the size of the dimensions that you’re shoving into the hash objects, so the decrease in CPU time is being paid for more or less directly with a corresponding increase in memory. So is this a problem? Well, it might be, or it might not be. Obviously it depends on the availability of both these resources – if your server is constantly running out of memory then I’d say yes, it’s a problem. Then again, if your users are always complaining about how long it takes for their reports to run, maybe the hash object is worth a look.
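If you want to gauge the memory cost up front, the hash object's num_items and item_size attributes give a rough (and somewhat optimistic) estimate of the loaded object's footprint. Here's a minimal sketch using dimension1 from the example above:

data _null_ ;
  if 0 then set dimension1 (keep = dim1_pk attr1) ;
  declare hash dim1 (dataset:"dimension1") ;
  dim1.definekey("dim1_pk") ;
  dim1.definedata("attr1") ;
  dim1.definedone() ;
  /* item_size is the size of one item in bytes; multiplied by
     num_items it approximates the memory held by the hash */
  mb = (dim1.num_items * dim1.item_size) / 1024**2 ;
  put "Approximate size of dim1 hash: " mb 8.1 " MB" ;
  stop ;
run ;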

I delivered a slightly longer form of this as a presentation to SUNZ last year. The slideshow is at the link below (pptx), and a PDF version is also available from the SUNZ website.

Data Step Hash Object vs SQL Join

Lookup tables in SAS

Recently there was a change to a reference table in a database we use frequently for reporting. For a lot of our regular reports this won’t be troublesome, as the changed values will come through with the extract. However, with some reports, for one reason or another, the new reference data will not come through, so we need to replace the changed field in some datasets. There are many ways to do this in SAS, so I thought I’d run through a couple of them here. The reference data that’s changing is the mapping from the field court to the field region, which I extracted into the table court_reg below:

proc sql;
connect to oledb(init_string="Provider=msdaora ;
User ID=&name; Password=&password; 
data source=odwp") ;
  create table court_reg as
  select * from connection to oledb
 (select distinct
    crt.short_name as court,
    reg.name as new_region
  from 
    court crt
    inner join region_type reg
      on crt.region_type_code = reg.code
  );
  disconnect from oledb;
quit;

What I need to do is use the court field as a key to pull a new value for region into an existing dataset. The pattern for the problem, then, is to pull values from a small dataset (< 500 rows) into a large dataset (> 100,000 rows). In Excel, this could easily be accomplished with a VLOOKUP into the small table from the large one. Here are some ways I could do much the same thing in SAS:

Formats

It’s a relatively simple thing to define a format from a SAS dataset: just rename the variables (‘start’ is the value to be formatted, and ‘label’ is the formatted value we want), assign the format’s name to ‘fmtname’ and its type to ‘type’, then use proc format with the cntlin option to read the dataset into the format, like so:

data tmp_format;
  retain fmtname 'regtype' type 'C';
  set work.court_reg (rename=(court=start
                              new_region=label));
run;

proc format cntlin=tmp_format fmtlib;
run;

Once that’s done, the new values are just a put away:

data sasdata.newset;
  set sasdata.oldset (drop=region);
  region=put(court, $regtype.);
run;

A format has the advantage of being fast (well, potentially faster than a merge/join, anyway) as it uses a binary tree lookup to assign the value. Balancing that is the memory requirement; the entire format must fit into memory. Whether this is an issue or not depends on the size of the dataset being used to create the format and the type of the values being applied (in this case, region is a 10-byte character value).

Data Step with Merge

Another option is an approach which uses a lot less memory: merge both sets by the key field and replace the old value with the new. I’ll only sketch the code for this below, as in this case, in my opinion, it’s not a good approach. While the merge does use less memory than the format (it only loads one observation from each set into the program data vector at a time), and sequential access to the data sets makes the search very fast, the trade-off is the increased overhead of either sorting or indexing both sets. The bigger the large set, the more costly this becomes.
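For reference, the shape of the approach would be something like this (a minimal sketch, assuming one row per court in court_reg):

proc sort data=sasdata.oldset; by court; run;
proc sort data=work.court_reg; by court; run;

data sasdata.newset (drop=new_region);
  merge sasdata.oldset (in=in_old drop=region)
        work.court_reg;
  by court;
  if in_old; /* keep only rows from the original dataset */
  region = new_region;
run;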

Creating a temporary array

Rather than creating a format, another approach is to create a temporary array of the values to be mapped. This has the advantage of being much quicker than the format, as the lookup is indexed to the location of the values in the array. The disadvantage is the requirement for numeric values to be used for the array index. We can get around this by using an informat created from the order of the observations in the small dataset, for instance:

data tmp_format;
  retain fmtname 'crtarray' type 'I';
  set work.court_reg (keep=court 
                      rename=(court=start));
  label=put(_n_, 8.);
run;

proc format cntlin=tmp_format fmtlib;
run;

/*gets the size of the array required*/
data _null_;
  if 0 then set work.court_reg nobs=nobs;
  call symputx("nobs", nobs);
  stop;
run;

data sasdata.newset (drop=old_region);
  /*create and populate the array*/
  array regions {&nobs} $ 10 _temporary_;
  if _n_ = 1 then do i=1 to &nobs;
    set work.court_reg;
    crtid=input(court, crtarray.);
    regions{crtid}=new_region;
  end;
  /*lookup into the array*/
  set sasdata.oldset (rename=(region=old_region));
  crtid=input(court, crtarray.);
  region=regions{crtid};
run;

This informat can then be used to assign the values of region to positions in the array. However, the fact that each lookup requires a call to the informat appears to mean that the array is not going to achieve any better performance in this case. Not only that, but the code is convoluted by the need to determine the size of the array to begin with.

The Data Step Hash Object

Another option is to use a hash table – an associative array of key-value mappings (think hash as in Perl, or dictionary as in Python). The lookup is faster than using a format, as the hash object uses an efficient algorithm to translate the key values into array positions, providing a gain similar to the one you’d get from an array. The hash object is also more flexible than a format, as it allows more than one item to be stored per key, and these can be a mixture of character and numeric variables. The key itself can be either character or numeric, or a combination of variables, as sketched below.
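For example, a declaration with a two-variable key and mixed data items might look like this (the lookup dataset and the case_type and reg_code variables are hypothetical, purely for illustration):

data _null_;
  /* supply column metadata for the compiler, as before */
  if 0 then set lookup (keep=court case_type new_region reg_code);
  declare hash h (dataset: 'lookup');
  h.defineKey('court', 'case_type');      /* composite key */
  h.defineData('new_region', 'reg_code'); /* character and numeric data */
  h.defineDone();
  stop;
run;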

Here’s the hash object in action:

data sasdata.newset (drop=old_region);
  length court $ 20 new_region $ 10;
  /*initialise the hash object*/
  if _n_ = 1 then do;
    declare hash reg(dataset: 'court_reg');
    reg.defineKey('court');
    reg.defineData('new_region');
    reg.defineDone();
    call missing(court, new_region);
  end;
  /*retrieve data*/
  set sasdata.oldset (rename=(region=old_region));
  if reg.find()=0 then do;
    region=new_region;
    output;
  end;
run;

I like the efficiency of this approach, but more than that, I like the code: it’s simple, concise, and the intention is explicit. There are a couple of good references to the use of the hash object here and here, and a detailed explanation of how it’s implemented here.