Recently there was a change to a reference table in a database we use frequently for reporting. For a lot of our regular reports this won’t be troublesome, as the changed values will come through with the extract. However, with some reports for one reason and another the new reference data will not come through, so we need to replace the changed field in some datasets. There are many ways to do this in SAS, so I thought I’d run through a couple of them here. The reference data that’s changing is the mapping from the field court to the field region, which I extracted in the table court_reg below:
proc sql;
connect to oledb(init_string="Provider=msdaora ;
User ID=&name; Password=&password;
data source=odwp") ;
create table court_reg as
select * from connection to oledb
(select distinct
crt.short_name as court,
reg.name as new_region
from
court crt
inner join region_type reg
on crt.region_type_code = reg.code
);
disconnect from oledb;
quit;
What I need to do is to use the court field as a key to pull a new value for region into an existing dataset. The pattern for the problem, then, is to pull values from a small dataset (< 500 rows) into a large dataset (> 100,000 rows). In Excel, this could be easily accomplished by using VLOOKUP into the small table from the large one. Here’s some ways I could do much the same thing in SAS:
Formats
It’s a relatively simple thing to define a format from a SAS dataset: just do some renaming of the variables (‘start’ is the variable to be formatted, and ‘label’ is the formatted value we want) and assign the format name to ‘fmtname’ and its type to ‘type’, then use proc format to read the values into the format ‘fmtname’, like so:
data tmp_format;
retain fmtname 'regtype' type 'C';
set work.court_reg (rename=(court=start
new_region=label));
run;
proc format cntlin=tmp_format fmtlib;
run;
Once that’s done, the new values are just a put away:
data sasdata.newset;
set sasdata.oldset (drop=region);
region=put(court, $regtype.);
run;
A format has the advantage of being fast (well, potentially faster than a merge/join, anyway) as it uses a binary tree lookup to assign the value. Balancing that is the memory requirement; the entire format must fit into memory. Whether this is an issue or not depends on the size of the dataset being used to create the format and the type of the values being applied (in this case, region is a 10-byte character value).
Data Step with Merge
Another option is an approach which uses a lot less memory: merge both sets by the key field and replace the old value with the new. I’m actually not going to post the code for this here, as in this case in my opinion it’s not a good idea. While the merge does use less memory than the format (it only loads one observation from each set into the program data vector at a time), and sequential access into the data sets makes the search very fast, the trade-off is the increased overhead of either sorting or indexing both sets. The bigger the large set, the more costly this will become.
Creating a temporary array
Rather than creating a format, another approach is to create a temporary array of the values to be mapped. This has the advantage of being much quicker than the format, as the lookup is indexed to the location of the values in the array. The disadvantage is the requirement for numeric values to be used for the array index. We can get around this by using an informat created from the order of the observations in the small dataset, for instance:
data tmp_format;
retain fmtname 'regarray' type 'IN';
set work.court_reg (keep=court
rename=(court=start);
label=put(_n_, 8.);
run;
proc format cntlin=tmp_format fmtlib;
run;
/*gets the size of the array required*/
data _null_;
if 0 then set work.court_reg nobs=nobs;
call symputx("nobs", nobs);
stop;
run;
data sasdata.newset (drop=old_region);
/*create and populate the array*/
array regions {&nobs} $ _temporary_;
if _n_ = 1 then do i=1 to &nobs;
set work.court_reg;
crtid=input(court, crtarray.);
regions{crtid}=new_region;
end;
/*lookup into the array*/
set sasdata.oldset (rename=(region=old_region));
crtid=input(court, crtarray.)
region=regions{crtid};
run;
This informat can then be used to assign the values of region into positions in the array. However, the fact that this assignment requires a lookup into a format appears to mean that the array is not going to achieve any better performance in this case. Not only that, but the code is convoluted by the need to determine the size of the array to begin with.
The Data Step Hash Object
Another option is to use a hash table – an associative array of key-value mappings (think hash as in Perl, or dictionary as in Python). The lookup is faster than the use of a format, as the hash object uses an efficient algorithm to translate the key values into array positions, providing a similar gain to that you’d get with the use of an array. The hash object is also more flexible than a format, as it allows more than one item to be stored per key, and these can be a mixture of character and numeric variables. The key itself can be either character or numeric, or a combination of variables.
Here’s the hash object in action:
data sasdata.newset (drop=old_region);
length court $ 20 new_region $ 10;
/*initialise the hash object*/
if _n_ = 1 then do;
declare hash reg(dataset: 'court_reg');
reg.defineKey('court');
reg.defineData('new_region');
reg.defineDone();
call missing(court, new_region);
end;
/*retrieve data*/
set set sasdata.oldset (rename=(region=old_region));
if reg.find()=0 then do;
region=new_region;
output;
end;
run;
I like the efficiency of this approach, but more than that, I like the code. It’s simple, concise, and the intention is explicit. There’s a couple of good references to the use of the hash object here and here, and a detailed explanation of how it’s implemented here.