proc datasets – Number Cruncher

I find I’m doing this fairly frequently recently with our reporting: starting with a set of unit record data, I use proc summary to get to a sorted, narrow dataset like this:

Report_Month	Location	Outcome	Num_Cases
01Jan2010	Auckland	A	3
01Jan2010	Auckland	B	5
01Jan2010	Christchurch	A	1
01Jan2010	Christchurch	B	2
01Jan2010	Christchurch	C	5
01Jan2010	Wellington	B	1
01Jan2010	Wellington	C	7
01Feb2010	Auckland	A	5
01Feb2010	Auckland	B	4
01Feb2010	Auckland	C	1
01Feb2010	Christchurch	A	2
01Feb2010	Christchurch	B	2
01Feb2010	Christchurch	C	5
01Feb2010	Wellington	A	2
01Feb2010	Wellington	B	3
01Feb2010	Wellington	C	1
01Mar2010	Auckland	A	6
01Mar2010	Auckland	B	3
01Mar2010	Christchurch	B	1
01Mar2010	Christchurch	C	6
01Mar2010	Wellington	A	1
01Mar2010	Wellington	B	3
01Apr2010	Auckland	A	2
01Apr2010	Wellington	C	5

Whereas the requirement would be to get to a wider set like this:

Report_Month	Location	A	B	C
01Jan2010	Auckland	3	5	0
01Jan2010	Christchurch	1	2	5
01Jan2010	Wellington	0	1	7
01Feb2010	Auckland	5	4	1
01Feb2010	Christchurch	2	2	5
01Feb2010	Wellington	2	3	1
01Mar2010	Auckland	6	3	0
01Mar2010	Christchurch	0	1	6
01Mar2010	Wellington	1	3	0
01Apr2010	Auckland	2	0	0
01Apr2010	Wellington	0	0	5

There are two ways I know of to achieve this transformation in SAS code. One would be to create the array of variables needed in a data step and populate them for each by group (in this case, Report_Month and Location would be the by variables) in turn. This approach has got a lot to recommend it – most importantly to my mind the data step is both flexible and straightforward.

On the other hand…if you’re dealing with a lot of variables the code required can get pretty verbose. Which is why a lazy programmer like me finds proc transpose to be a better solution – in this (admittedly simple) case, very little code is required:

proc transpose data=narrow
               out=wide (drop=_name_ _label_);
  by report_month location;
  id outcome;
run;

The code is short, and explicit – what’s intended here will be clearer to another programmer than the typical array processing idiom used in the data step.

Here’s a wrinkle, though: the transpose procedure quite rightly determines that in this case, it needs to create 3 new columns in the output dataset wide, to allow for the 3 outcome values A, B and C occurring in narrow. What if there were actually 4 outcomes that had to be counted, as in the below?

Report_Month	Location	A	B	C	D
01Jan2010	Auckland	3	5	0	0
01Jan2010	Christchurch	1	2	5	0
01Jan2010	Wellington	0	1	7	0
01Feb2010	Auckland	5	4	1	0
01Feb2010	Christchurch	2	2	5	0
01Feb2010	Wellington	2	3	1	0
01Mar2010	Auckland	6	3	0	0
01Mar2010	Christchurch	0	1	6	0
01Mar2010	Wellington	1	3	0	0
01Apr2010	Auckland	2	0	0	0
01Apr2010	Wellington	0	0	5	0

The data step is going to be able to cope a lot better with this sort of situation, but let’s just assume that I’ve already written way too much code to change horses now. OK, here’s one way to fix this:

proc sql;
  create table wider
  (report_month num format=date9.
   , location char(20)
   , A num
   , B num
   , C num
   , D num
   );
quit;

proc datasets library=work force;
  append base=wider data=wide;
run;
quit;

Using the force option with append means the wider dataset will accept the wide dataset being appended to it, even with column D missing. Problem solved and proc transpose gets to stay.

Documentation for proc transpose is here and for proc datasets is here.

M	T	W	T	F	S	S
		1	2	3	4	5
6	7	8	9	10	11	12
13	14	15	16	17	18	19
20	21	22	23	24	25	26
27	28	29	30	31

	Alex on More Regex Functions, part…
	geoffness on Regular Expressions – ne…
	mikeg_de on Regular Expressions – ne…
	geoffness on Type or Class?
	vbanewbie on Type or Class?

Number Cruncher

Tag: proc datasets

Using Proc Transpose to widen a time series