I find I’m doing this fairly frequently recently with our reporting: starting with a set of unit record data, I use proc summary to get to a sorted, narrow dataset like this:
Report_Month | Location | Outcome | Num_Cases |
01Jan2010 | Auckland | A | 3 |
01Jan2010 | Auckland | B | 5 |
01Jan2010 | Christchurch | A | 1 |
01Jan2010 | Christchurch | B | 2 |
01Jan2010 | Christchurch | C | 5 |
01Jan2010 | Wellington | B | 1 |
01Jan2010 | Wellington | C | 7 |
01Feb2010 | Auckland | A | 5 |
01Feb2010 | Auckland | B | 4 |
01Feb2010 | Auckland | C | 1 |
01Feb2010 | Christchurch | A | 2 |
01Feb2010 | Christchurch | B | 2 |
01Feb2010 | Christchurch | C | 5 |
01Feb2010 | Wellington | A | 2 |
01Feb2010 | Wellington | B | 3 |
01Feb2010 | Wellington | C | 1 |
01Mar2010 | Auckland | A | 6 |
01Mar2010 | Auckland | B | 3 |
01Mar2010 | Christchurch | B | 1 |
01Mar2010 | Christchurch | C | 6 |
01Mar2010 | Wellington | A | 1 |
01Mar2010 | Wellington | B | 3 |
01Apr2010 | Auckland | A | 2 |
01Apr2010 | Wellington | C | 5 |
Whereas the requirement would be to get to a wider set like this:
Report_Month | Location | A | B | C |
01Jan2010 | Auckland | 3 | 5 | 0 |
01Jan2010 | Christchurch | 1 | 2 | 5 |
01Jan2010 | Wellington | 0 | 1 | 7 |
01Feb2010 | Auckland | 5 | 4 | 1 |
01Feb2010 | Christchurch | 2 | 2 | 5 |
01Feb2010 | Wellington | 2 | 3 | 1 |
01Mar2010 | Auckland | 6 | 3 | 0 |
01Mar2010 | Christchurch | 0 | 1 | 6 |
01Mar2010 | Wellington | 1 | 3 | 0 |
01Apr2010 | Auckland | 2 | 0 | 0 |
01Apr2010 | Wellington | 0 | 0 | 5 |
There are two ways I know of to achieve this transformation in SAS code. One would be to create the array of variables needed in a data step and populate them for each by group (in this case, Report_Month and Location would be the by variables) in turn. This approach has got a lot to recommend it – most importantly to my mind the data step is both flexible and straightforward.
On the other hand…if you’re dealing with a lot of variables the code required can get pretty verbose. Which is why a lazy programmer like me finds proc transpose to be a better solution – in this (admittedly simple) case, very little code is required:
proc transpose data=narrow out=wide (drop=_name_ _label_); by report_month location; id outcome; run;
The code is short, and explicit – what’s intended here will be clearer to another programmer than the typical array processing idiom used in the data step.
Here’s a wrinkle, though: the transpose procedure quite rightly determines that in this case, it needs to create 3 new columns in the output dataset wide, to allow for the 3 outcome values A, B and C occurring in narrow. What if there were actually 4 outcomes that had to be counted, as in the below?
Report_Month | Location | A | B | C | D |
01Jan2010 | Auckland | 3 | 5 | 0 | 0 |
01Jan2010 | Christchurch | 1 | 2 | 5 | 0 |
01Jan2010 | Wellington | 0 | 1 | 7 | 0 |
01Feb2010 | Auckland | 5 | 4 | 1 | 0 |
01Feb2010 | Christchurch | 2 | 2 | 5 | 0 |
01Feb2010 | Wellington | 2 | 3 | 1 | 0 |
01Mar2010 | Auckland | 6 | 3 | 0 | 0 |
01Mar2010 | Christchurch | 0 | 1 | 6 | 0 |
01Mar2010 | Wellington | 1 | 3 | 0 | 0 |
01Apr2010 | Auckland | 2 | 0 | 0 | 0 |
01Apr2010 | Wellington | 0 | 0 | 5 | 0 |
The data step is going to be able to cope a lot better with this sort of situation, but let’s just assume that I’ve already written way too much code to change horses now. OK, here’s one way to fix this:
proc sql; create table wider (report_month num format=date9. , location char(20) , A num , B num , C num , D num ); quit; proc datasets library=work force; append base=wider data=wide; run; quit;
Using the force option with append means the wider dataset will accept the wide dataset being appended to it, even with column D missing. Problem solved and proc transpose gets to stay.
Documentation for proc transpose is here and for proc datasets is here.