Using Proc Transpose to widen a time series


I find I’m doing this fairly frequently recently with our reporting: starting with a set of unit record data, I use proc summary to get to a sorted, narrow dataset like this:

Report_Month Location Outcome Num_Cases
01Jan2010 Auckland A 3
01Jan2010 Auckland B 5
01Jan2010 Christchurch A 1
01Jan2010 Christchurch B 2
01Jan2010 Christchurch C 5
01Jan2010 Wellington B 1
01Jan2010 Wellington C 7
01Feb2010 Auckland A 5
01Feb2010 Auckland B 4
01Feb2010 Auckland C 1
01Feb2010 Christchurch A 2
01Feb2010 Christchurch B 2
01Feb2010 Christchurch C 5
01Feb2010 Wellington A 2
01Feb2010 Wellington B 3
01Feb2010 Wellington C 1
01Mar2010 Auckland A 6
01Mar2010 Auckland B 3
01Mar2010 Christchurch B 1
01Mar2010 Christchurch C 6
01Mar2010 Wellington A 1
01Mar2010 Wellington B 3
01Apr2010 Auckland A 2
01Apr2010 Wellington C 5

Whereas the requirement would be to get to a wider set like this:

Report_Month Location A B C
01Jan2010 Auckland 3 5 0
01Jan2010 Christchurch 1 2 5
01Jan2010 Wellington 0 1 7
01Feb2010 Auckland 5 4 1
01Feb2010 Christchurch 2 2 5
01Feb2010 Wellington 2 3 1
01Mar2010 Auckland 6 3 0
01Mar2010 Christchurch 0 1 6
01Mar2010 Wellington 1 3 0
01Apr2010 Auckland 2 0 0
01Apr2010 Wellington 0 0 5

There are two ways I know of to achieve this transformation in SAS code. One would be to create the array of variables needed in a data step and populate them for each by group (in this case, Report_Month and Location would be the by variables) in turn. This approach has got a lot to recommend it – most importantly to my mind the data step is both flexible and straightforward.

On the other hand…if you’re dealing with a lot of variables the code required can get pretty verbose. Which is why a lazy programmer like me finds proc transpose to be a better solution – in this (admittedly simple) case, very little code is required:

proc transpose data=narrow
               out=wide (drop=_name_ _label_);
  by report_month location;
  id outcome;
run;

The code is short, and explicit – what’s intended here will be clearer to another programmer than the typical array processing idiom used in the data step.

Here’s a wrinkle, though: the transpose procedure quite rightly determines that in this case, it needs to create 3 new columns in the output dataset wide, to allow for the 3 outcome values A, B and C occurring in narrow. What if there were actually 4 outcomes that had to be counted, as in the below?

Report_Month Location A B C D
01Jan2010 Auckland 3 5 0 0
01Jan2010 Christchurch 1 2 5 0
01Jan2010 Wellington 0 1 7 0
01Feb2010 Auckland 5 4 1 0
01Feb2010 Christchurch 2 2 5 0
01Feb2010 Wellington 2 3 1 0
01Mar2010 Auckland 6 3 0 0
01Mar2010 Christchurch 0 1 6 0
01Mar2010 Wellington 1 3 0 0
01Apr2010 Auckland 2 0 0 0
01Apr2010 Wellington 0 0 5 0

The data step is going to be able to cope a lot better with this sort of situation, but let’s just assume that I’ve already written way too much code to change horses now. OK, here’s one way to fix this:

proc sql;
  create table wider
  (report_month num format=date9.
   , location char(20)
   , A num
   , B num
   , C num
   , D num
   );
quit;

proc datasets library=work force;
  append base=wider data=wide;
run;
quit;

Using the force option with append means the wider dataset will accept the wide dataset being appended to it, even with column D missing. Problem solved and proc transpose gets to stay.

Documentation for proc transpose is here and for proc datasets is here.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s