Using Proc Transpose to widen a time series

I find I’m doing this fairly frequently recently with our reporting: starting with a set of unit record data, I use proc summary to get to a sorted, narrow dataset like this:

Report_Month Location Outcome Num_Cases
01Jan2010 Auckland A 3
01Jan2010 Auckland B 5
01Jan2010 Christchurch A 1
01Jan2010 Christchurch B 2
01Jan2010 Christchurch C 5
01Jan2010 Wellington B 1
01Jan2010 Wellington C 7
01Feb2010 Auckland A 5
01Feb2010 Auckland B 4
01Feb2010 Auckland C 1
01Feb2010 Christchurch A 2
01Feb2010 Christchurch B 2
01Feb2010 Christchurch C 5
01Feb2010 Wellington A 2
01Feb2010 Wellington B 3
01Feb2010 Wellington C 1
01Mar2010 Auckland A 6
01Mar2010 Auckland B 3
01Mar2010 Christchurch B 1
01Mar2010 Christchurch C 6
01Mar2010 Wellington A 1
01Mar2010 Wellington B 3
01Apr2010 Auckland A 2
01Apr2010 Wellington C 5

Whereas the requirement would be to get to a wider set like this:

Report_Month Location A B C
01Jan2010 Auckland 3 5 0
01Jan2010 Christchurch 1 2 5
01Jan2010 Wellington 0 1 7
01Feb2010 Auckland 5 4 1
01Feb2010 Christchurch 2 2 5
01Feb2010 Wellington 2 3 1
01Mar2010 Auckland 6 3 0
01Mar2010 Christchurch 0 1 6
01Mar2010 Wellington 1 3 0
01Apr2010 Auckland 2 0 0
01Apr2010 Wellington 0 0 5

There are two ways I know of to achieve this transformation in SAS code. One would be to create the array of variables needed in a data step and populate them for each by group (in this case, Report_Month and Location would be the by variables) in turn. This approach has got a lot to recommend it – most importantly to my mind the data step is both flexible and straightforward.

On the other hand…if you’re dealing with a lot of variables the code required can get pretty verbose. Which is why a lazy programmer like me finds proc transpose to be a better solution – in this (admittedly simple) case, very little code is required:

proc transpose data=narrow
               out=wide (drop=_name_ _label_);
  by report_month location;
  id outcome;

The code is short, and explicit – what’s intended here will be clearer to another programmer than the typical array processing idiom used in the data step.

Here’s a wrinkle, though: the transpose procedure quite rightly determines that in this case, it needs to create 3 new columns in the output dataset wide, to allow for the 3 outcome values A, B and C occurring in narrow. What if there were actually 4 outcomes that had to be counted, as in the below?

Report_Month Location A B C D
01Jan2010 Auckland 3 5 0 0
01Jan2010 Christchurch 1 2 5 0
01Jan2010 Wellington 0 1 7 0
01Feb2010 Auckland 5 4 1 0
01Feb2010 Christchurch 2 2 5 0
01Feb2010 Wellington 2 3 1 0
01Mar2010 Auckland 6 3 0 0
01Mar2010 Christchurch 0 1 6 0
01Mar2010 Wellington 1 3 0 0
01Apr2010 Auckland 2 0 0 0
01Apr2010 Wellington 0 0 5 0

The data step is going to be able to cope a lot better with this sort of situation, but let’s just assume that I’ve already written way too much code to change horses now. OK, here’s one way to fix this:

proc sql;
  create table wider
  (report_month num format=date9.
   , location char(20)
   , A num
   , B num
   , C num
   , D num

proc datasets library=work force;
  append base=wider data=wide;

Using the force option with append means the wider dataset will accept the wide dataset being appended to it, even with column D missing. Problem solved and proc transpose gets to stay.

Documentation for proc transpose is here and for proc datasets is here.


Count Distinct Values

I’m sorry for the lack of posts over the last few months – what can I say, I’ve been busy. I had meant to continue the series on data access patterns from here and I put a revision of the code here, but I haven’t had the time to do much more. I may get back to it eventually, but in the meantime I thought I’d post something else which I’ve found dictionaries to be quite useful for.

The below is a function which gives a count of the distinct items in an array. As items are added, the dictionary will prevent duplication of key values – and the Count method gives the number of keys in the dictionary.

Public Function CountDistinct(ByRef vArr As Variant) As Variant

    Dim dict As Scripting.Dictionary
    Dim lVal As Long
    Set dict = New Scripting.Dictionary
    For lVal = LBound(vArr) To UBound(vArr)
        dict(vArr(lVal)) = vArr(lVal)
    Next lVal
    CountDistinct = dict.Count

End Function


A much simpler implementation is available in Python, using the set datatype:

>>> def CountDistinct(vals):
...   return len(set(vals))
>>> CountDistinct([1, 2, 3, 3, 4])