July 2010 – Number Cruncher

I find I’m doing this fairly frequently recently with our reporting: starting with a set of unit record data, I use proc summary to get to a sorted, narrow dataset like this:

Report_Month	Location	Outcome	Num_Cases
01Jan2010	Auckland	A	3
01Jan2010	Auckland	B	5
01Jan2010	Christchurch	A	1
01Jan2010	Christchurch	B	2
01Jan2010	Christchurch	C	5
01Jan2010	Wellington	B	1
01Jan2010	Wellington	C	7
01Feb2010	Auckland	A	5
01Feb2010	Auckland	B	4
01Feb2010	Auckland	C	1
01Feb2010	Christchurch	A	2
01Feb2010	Christchurch	B	2
01Feb2010	Christchurch	C	5
01Feb2010	Wellington	A	2
01Feb2010	Wellington	B	3
01Feb2010	Wellington	C	1
01Mar2010	Auckland	A	6
01Mar2010	Auckland	B	3
01Mar2010	Christchurch	B	1
01Mar2010	Christchurch	C	6
01Mar2010	Wellington	A	1
01Mar2010	Wellington	B	3
01Apr2010	Auckland	A	2
01Apr2010	Wellington	C	5

Whereas the requirement would be to get to a wider set like this:

Report_Month	Location	A	B	C
01Jan2010	Auckland	3	5	0
01Jan2010	Christchurch	1	2	5
01Jan2010	Wellington	0	1	7
01Feb2010	Auckland	5	4	1
01Feb2010	Christchurch	2	2	5
01Feb2010	Wellington	2	3	1
01Mar2010	Auckland	6	3	0
01Mar2010	Christchurch	0	1	6
01Mar2010	Wellington	1	3	0
01Apr2010	Auckland	2	0	0
01Apr2010	Wellington	0	0	5

There are two ways I know of to achieve this transformation in SAS code. One would be to create the array of variables needed in a data step and populate them for each by group (in this case, Report_Month and Location would be the by variables) in turn. This approach has got a lot to recommend it – most importantly to my mind the data step is both flexible and straightforward.

On the other hand…if you’re dealing with a lot of variables the code required can get pretty verbose. Which is why a lazy programmer like me finds proc transpose to be a better solution – in this (admittedly simple) case, very little code is required:

proc transpose data=narrow
               out=wide (drop=_name_ _label_);
  by report_month location;
  id outcome;
run;

The code is short, and explicit – what’s intended here will be clearer to another programmer than the typical array processing idiom used in the data step.

Here’s a wrinkle, though: the transpose procedure quite rightly determines that in this case, it needs to create 3 new columns in the output dataset wide, to allow for the 3 outcome values A, B and C occurring in narrow. What if there were actually 4 outcomes that had to be counted, as in the below?

Report_Month	Location	A	B	C	D
01Jan2010	Auckland	3	5	0	0
01Jan2010	Christchurch	1	2	5	0
01Jan2010	Wellington	0	1	7	0
01Feb2010	Auckland	5	4	1	0
01Feb2010	Christchurch	2	2	5	0
01Feb2010	Wellington	2	3	1	0
01Mar2010	Auckland	6	3	0	0
01Mar2010	Christchurch	0	1	6	0
01Mar2010	Wellington	1	3	0	0
01Apr2010	Auckland	2	0	0	0
01Apr2010	Wellington	0	0	5	0

The data step is going to be able to cope a lot better with this sort of situation, but let’s just assume that I’ve already written way too much code to change horses now. OK, here’s one way to fix this:

proc sql;
  create table wider
  (report_month num format=date9.
   , location char(20)
   , A num
   , B num
   , C num
   , D num
   );
quit;

proc datasets library=work force;
  append base=wider data=wide;
run;
quit;

Using the force option with append means the wider dataset will accept the wide dataset being appended to it, even with column D missing. Problem solved and proc transpose gets to stay.

Documentation for proc transpose is here and for proc datasets is here.

Public Function CountDistinct(ByRef vArr As Variant) As Variant Dim dict As Scripting.Dictionary Dim lVal As Long Set dict = New Scripting.Dictionary For lVal = LBound(vArr) To UBound(vArr) dict(vArr(lVal)) = vArr(lVal) Next lVal CountDistinct = dict.Count End Function ?CountDistinct(array(1,2,3,3,4)) 4

M	T	W	T	F	S	S
			1	2	3	4
5	6	7	8	9	10	11
12	13	14	15	16	17	18
19	20	21	22	23	24	25
26	27	28	29	30	31

	Alex on More Regex Functions, part…
	geoffness on Regular Expressions – ne…
	mikeg_de on Regular Expressions – ne…
	geoffness on Type or Class?
	vbanewbie on Type or Class?

Number Cruncher

Month: July 2010

Using Proc Transpose to widen a time series

Count Distinct Values