Maintaining SAS formats in Excel

For one of the subject areas I deal with day to day, categorical data values are often stored in the database in narrow character fields, so we need to apply SAS formats to get readable values out in reports. To maintain these formats I store them in an Excel workbook, looking roughly like this:

This is a relatively painless way to manage keeping SAS formats up to date, as well as making them easily accessible to people who need to inspect them. Each format is stored in a named range, with the name of the range matching the name of the format. The column headers are chosen to agree with the SAS Format procedure requirement for an input control data set. The above range, for instance, would be named ‘excel_user’, and would be used to input the required data for a character format (Type=’C’) also named ‘excel_user’ (Fmtname).

So how do I get them into SAS from Excel? Once again, the SAS/ACCESS Excel libname statement is helpful here. First, we assign a libref ‘wb’ to the workbook holding the format ranges, and another libref ‘sasprogs’ which will be used to store the format catalog:

libname wb Excel "&path.\formats.xls" ver=2002 ;
libname sasprogs "&path" ;

Next we need to get a list of all the ranges that we’re interested in here. In this workbook the only named ranges are the ones containing the formats. Any other names that the libname engine detects will be the names of worksheets, which will contain a ‘$’ character. So, we simply exclude these from the query against the dictionary.tables table:

proc sql;
  create table tmp as
  select memname
  from dictionary.tables
  where libname='WB' 
    and index(memname, '$') = 0 ;
quit;

So now the table ‘tmp’ holds a list of all the format range names in the workbook. Now we need to use this table to guide SAS in building those formats. What we need to do is to submit a proc format statement for every one of those names. Like so:

%macro load_fmt(name) ;
  proc format library=sasprogs.sasfmts cntlin=wb.&name ;
%mend load_fmt;

Now we simply need to loop through the set of names to load each of the formats into the sasprogs.sasfmts catalog:

data _null_;
  set tmp;
  call execute('%load_fmt('||memname||')') ;
run;

All that’s left to do now is a little tidying up:

/*Close the file*/
libname wb clear ;
/*Inspect format contents*/
proc format library=sasprogs.sasfmts fmtlib cntlout=sasprogs.sasfmt_data; run;
/*Add them into the search path*/
options fmtsearch=(library sasfmts);

And we’re done – all the formats are now loaded and available in the SAS session. Now if there are any changes, or a new format needs to be created, just add it into the file, save it and re-run this script. Sorted!

Lookup tables in SAS

Recently there was a change to a reference table in a database we use frequently for reporting. For a lot of our regular reports this won’t be troublesome, as the changed values will come through with the extract. However, with some reports for one reason and another the new reference data will not come through, so we need to replace the changed field in some datasets. There are many ways to do this in SAS, so I thought I’d run through a couple of them here. The reference data that’s changing is the mapping from the field court to the field region, which I extracted in the table court_reg below:

proc sql;
connect to oledb(init_string="Provider=msdaora ;
User ID=&name; Password=&password; 
data source=odwp") ;
  create table court_reg as
  select * from connection to oledb
 (select distinct
    crt.short_name as court,
    reg.name as new_region
  from 
    court crt
    inner join region_type reg
      on crt.region_type_code = reg.code
  );
  disconnect from oledb;
quit;

What I need to do is to use the court field as a key to pull a new value for region into an existing dataset. The pattern for the problem, then, is to pull values from a small dataset (< 500 rows) into a large dataset (> 100,000 rows). In Excel, this could be easily accomplished by using VLOOKUP into the small table from the large one. Here’s some ways I could do much the same thing in SAS:

Formats

It’s a relatively simple thing to define a format from a SAS dataset: just do some renaming of the variables (‘start’ is the variable to be formatted, and ‘label’ is the formatted value we want) and assign the format name to ‘fmtname’ and its type to ‘type’, then use proc format to read the values into the format ‘fmtname’, like so:

data tmp_format;
  retain fmtname 'regtype' type 'C';
  set work.court_reg (rename=(court=start
                              new_region=label));
run;

proc format cntlin=tmp_format fmtlib;
run;

Once that’s done, the new values are just a put away:

data sasdata.newset;
  set sasdata.oldset (drop=region);
  region=put(court, $regtype.);
run;

A format has the advantage of being fast (well, potentially faster than a merge/join, anyway) as it uses a binary tree lookup to assign the value. Balancing that is the memory requirement; the entire format must fit into memory. Whether this is an issue or not depends on the size of the dataset being used to create the format and the type of the values being applied (in this case, region is a 10-byte character value).

Data Step with Merge

Another option is an approach which uses a lot less memory: merge both sets by the key field and replace the old value with the new. I’m actually not going to post the code for this here, as in this case in my opinion it’s not a good idea. While the merge does use less memory than the format (it only loads one observation from each set into the program data vector at a time), and sequential access into the data sets makes the search very fast, the trade-off is the increased overhead of either sorting or indexing both sets. The bigger the large set, the more costly this will become.

Creating a temporary array

Rather than creating a format, another approach is to create a temporary array of the values to be mapped. This has the advantage of being much quicker than the format, as the lookup is indexed to the location of the values in the array. The disadvantage is the requirement for numeric values to be used for the array index. We can get around this by using an informat created from the order of the observations in the small dataset, for instance:

data tmp_format;
  retain fmtname 'regarray' type 'IN';
  set work.court_reg (keep=court 
                      rename=(court=start);
  label=put(_n_, 8.);
run;

proc format cntlin=tmp_format fmtlib;
run;

/*gets the size of the array required*/
data _null_;
  if 0 then set work.court_reg nobs=nobs;
  call symputx("nobs", nobs);
  stop;
run;

data sasdata.newset (drop=old_region);
  /*create and populate the array*/
  array regions {&nobs} $ _temporary_;
  if _n_ = 1 then do i=1 to &nobs;
    set work.court_reg;
    crtid=input(court, crtarray.);
    regions{crtid}=new_region;
  end;
  /*lookup into the array*/
  set sasdata.oldset (rename=(region=old_region));
  crtid=input(court, crtarray.) 
  region=regions{crtid};
run;

This informat can then be used to assign the values of region into positions in the array. However, the fact that this assignment requires a lookup into a format appears to mean that the array is not going to achieve any better performance in this case. Not only that, but the code is convoluted by the need to determine the size of the array to begin with.

The Data Step Hash Object

Another option is to use a hash table – an associative array of key-value mappings (think hash as in Perl, or dictionary as in Python). The lookup is faster than the use of a format, as the hash object uses an efficient algorithm to translate the key values into array positions, providing a similar gain to that you’d get with the use of an array. The hash object is also more flexible than a format, as it allows more than one item to be stored per key, and these can be a mixture of character and numeric variables. The key itself can be either character or numeric, or a combination of variables.

Here’s the hash object in action:

data sasdata.newset (drop=old_region);
  length court $ 20 new_region $ 10;
  /*initialise the hash object*/
  if _n_ = 1 then do;
    declare hash reg(dataset: 'court_reg');
    reg.defineKey('court');
    reg.defineData('new_region');
    reg.defineDone();
    call missing(court, new_region);
  end;
  /*retrieve data*/
  set set sasdata.oldset (rename=(region=old_region));
  if reg.find()=0 then do;
    region=new_region;
    output;
  end;
run;

I like the efficiency of this approach, but more than that, I like the code. It’s simple, concise, and the intention is explicit. There’s a couple of good references to the use of the hash object here and here, and a detailed explanation of how it’s implemented here.