Pivot Table Compatibility Issues

Like a lot of Excel users, I’ve done plenty with pivot tables over the years. A typical reporting job looks something like what I did today:

  1. run a SAS query against a database, summarise the results and output them to csv
  2. open the csv file in Excel and create a pivot table off the summarised data
  3. format the pivot table, selecting the fields to aggregate by, and generate a pivot chart for a nice visual

This week I joined the pilot for our rollout of Office 2007, so I did all of the above in Excel 2007. Having followed the improvements to pivot table functionality in this version on the Excel team blog (some posts archived here), and having tried out some fairly trivial pivot tables in the trial version I downloaded at home, I was keen to see what I could do with them in this case. I wasn’t disappointed: there are a lot of really nice enhancements to both form and function, and the end result was a good-looking report.

The wrinkle came when I sent the document to a co-worker – on saving to 97/2003 format I was advised that some functionality would not be available. It turns out to be more a case of none of the functionality being available: the pivot table cannot be refreshed, filters cannot be changed… basically it becomes a static report, which of course is not at all what I wanted. I’d been bitten by the fact that pivot tables are versioned, just like Excel itself. The pivot table I’d created was version 12; the highest version my co-worker can use in Excel 2003 is version 10.

So ok, I should have expected that in a new version of Excel, correct? After all, if there’s new functionality available in the new version, it seems pretty reasonable to expect that functionality not to be available in an old version, no? Well, yes…but I had naively expected that saving the file in 97/2003 format would have ‘downgraded’ the pivot table version, possibly disabling functionality that isn’t supported in 2003. Turns out that cannot be done after the pivot table has been created – the Excel team blog posted an explanation of why this is the case a couple of years ago, which obviously I had either missed or forgotten.

In any case, the way around the issue appears to be to ensure the file is saved in 97/2003 format before creating the pivot table. This is a fairly straightforward fix, but unfortunately it’s going to create some issues for us. We use a lot of code for our regular reporting (served to Excel 97/2003 users) that creates pivot tables, and I’m pretty sure most of it creates the pivot table before saving the file. Revising all of that is going to be a whole lot of nastiness…

Related: a post from a while back on funnybroad’s blog, Dear Microsoft Office 2007 team

Ubuntu 9.04 released

Canonical has announced the availability of Ubuntu 9.04 (Jaunty Jackalope) for download (free as always). This time around, in addition to the desktop and server editions, the Netbook Remix is available for your cute wee mini-puters – awww. There are a number of options for download – as well as the live CD iso through http and torrent, there is the alternate (text-based) installation CD (handy if you have less than the recommended 256 MB of RAM) and the DVD image, which is a lot bigger but contains all the language packs. If you want to try Ubuntu with a less complicated process and no disk partitioning required, then Wubi may be worth a crack.

If you’re already running Ubuntu you also have the option of upgrading your system through Update Manager, but I wouldn’t recommend it. I’ve been running Ubuntu for three years and have tried this twice on different machines, and both times I’ve had serious problems as a result. While some people may have no trouble upgrading this way, in general it’s much safer to re-install entirely. Doing this means you will need some way to recover your files and configuration. Backing up /home to an external drive and copying it back after re-installing will work, but if you have /home on a separate partition there’s no need – as long as you don’t format that partition during installation, all your files and your application and desktop configuration will be preserved for the new version.
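
For the external-drive route, something along these lines is all that’s needed (just a sketch – /media/backup is a made-up mount point, and the trailing slashes matter to rsync):

sudo rsync -a /home/ /media/backup/home-backup/
# ...re-install, then restore with:
sudo rsync -a /media/backup/home-backup/ /home/

The -a flag preserves permissions, ownership and timestamps, which matters for all those hidden configuration files.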

I’ll post a how-to when I’ve installed Jaunty, which won’t be for a few weeks yet. I’m waiting for the Ubuntu CD through shipit as downloading a CD image as well as all the updates required afterwards takes up a fairly large proportion of my monthly data allowance. I’ll just have to cope with my recently outdated system until then…

SQL passthrough in SAS

Our main data source at work is an Oracle data warehouse, which we (mainly) query from SAS. There are two main ways we can do this – one is to use a libname statement with the OLEDB engine. You can then extract and process data directly from the Oracle tables exactly as you would from a SAS dataset, in a data step, like so:

libname oralib oledb
    Provider=MSDAORA
    Properties=("User ID"=&name
    Password=&password "Data Source"=oradw);

data newset;
  set oralib.oldset;
run;

This, however, is far less efficient than the alternative, which is to use the Proc SQL pass-through facility to extract the data before processing it:

proc sql;
  connect to oledb
    (init_string="Provider=MSDAORA; User ID=&name;
    Password=&password; Data Source=oradw");
  create table newset as
  select * from connection to oledb
  (select * from oldset);
quit;

Here’s a nice visual explanation of why this is a more efficient data access method:


To paraphrase, with the pass-through any selection or processing of the data is done in the database before it is pulled into SAS. This makes it significantly faster, and the more tables being referenced, the greater the gain in speed over the libname engine.
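
As a rough sketch of what that looks like in practice (reusing the made-up oldset table from above, and assuming it has a rate column), the where clause in the inner query below is evaluated by Oracle, so only the qualifying rows ever cross over to SAS:

proc sql;
  connect to oledb
    (init_string="Provider=MSDAORA; User ID=&name;
    Password=&password; Data Source=oradw");
  /* the filtering happens in Oracle, not in SAS */
  create table highrates as
  select * from connection to oledb
  (select *
   from oldset
   where rate >= 10000);
quit;

The same idea applies to joins – when the join is written inside the inner query, Oracle does the joining rather than SAS.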

However, one thing you do miss out on with the pass-through is the ability to use results from an earlier query stored in a temporary dataset to restrict data extracted from a later one. For example, the following stores a dataset id_list in the SAS workspace:

proc sql;
  create table id_list as
  select
    a.id
  from
    oralib.table1 a
    inner join oralib.table2 b
        on a.emp_id = b.sp_emp_id
  where
    b.rate >= 10000
  group by a.id
  ;
quit;

id_list can then be used in subsequent queries, like so:

proc sql;
  create table using_ids as
  select *
  from oralib.table3 c
  inner join id_list d
    on c.id = d.id
  ;
quit;

Of course, the same end result can be achieved in a pass-through by using a subquery:

proc sql;
  connect to oledb
    (init_string="Provider=MSDAORA; User ID=&name;
    Password=&password; Data Source=oradw");
  create table using_ids as
  select * from connection to oledb
  (select *
  from table3 c
  where c.id in
    (select
        a.id
     from
        table1 a
        inner join table2 b
            on a.emp_id = b.sp_emp_id
     where
        b.rate >= 10000
     group by a.id
    )
  );
quit;

What I don’t like about this approach is the overhead of evaluating the inner query before going to work on the outer one. With a simple query like the one above, the slow-down is not likely to be that noticeable, but as more tables are joined and more complex selection criteria are added, it becomes more of a concern. In terms of time taken, SQL code like this can often be greater than the sum of its parts.

An alternative approach is to create a temporary table on the database, which can then be referenced in the pass-through:

proc sql;
  create table oralib.id_list as
  select
    a.id
  from
    oralib.table1 a
    inner join oralib.table2 b
        on a.emp_id = b.sp_emp_id
  where
    b.rate >= 10000
  group by a.id
  ;
quit;

libname oralib clear; /*the libname can interfere with the pass-through*/

proc sql;
  connect to oledb
    (init_string="Provider=MSDAORA; User ID=&name;
    Password=&password; Data Source=oradw");
  create table using_ids as
  select * from connection to oledb
  (select *
  from table3 c
  inner join id_list d
    on c.id = d.id
  );
quit;

This of course assumes that the user has permissions to create tables on the database.
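
It also leaves id_list sitting on the database afterwards, so it’s worth cleaning up once the reporting is done. Something like the following should do it (a sketch, assuming the same connection details as above) – the execute statement passes the drop straight through to Oracle:

proc sql;
  connect to oledb
    (init_string="Provider=MSDAORA; User ID=&name;
    Password=&password; Data Source=oradw");
  execute (drop table id_list) by oledb;
  disconnect from oledb;
quit;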

FBReader in Ubuntu 8.10

A lot of mornings, on the train in to work, I like to read a book on my laptop. FBReader is a simple, attractive open source e-book reader that handles a wide range of formats and provides a structured view of your e-book collection.
[screenshot: FBReader]

FBReader is available from the Ubuntu repositories – just a click away through Synaptic or Add/Remove Applications:

[screenshot: Add/Remove Applications]
Alternatively, of course, you could just type at the command line:

sudo apt-get install fbreader

And you’re good to go – start adding your e-books. The library can be viewed as a tree, and books tagged however you like:
[screenshot: the FBReader library view]

Supported formats include epub, fb2, oeb, rtf, html and plain text, among others. Notable exceptions are pdf (apparently there are plans to implement support for text-only pdf files) and doc. There are also a few really nice features, including direct reading from tar and zip archives, automatic library building, automatic table of contents generation, text search and preserving the last book and last read position between sessions. Happy reading!

Names in Excel

First off, I should apologise for the complete lack of posts over the last few weeks. It’s been a rather busy time for a month or so now, and while there hasn’t been a lack of things to write about, there has certainly been a lack of energy to actually write about them.

Anyway, I posted a few weeks back about the modelling life cycle, where I talked about the desired outcomes of implementing a standard ‘best practice’ modelling process. One of these was effective documentation of the model – effective in that everyone who needs to build, use, test and maintain a model has access to the information they need to understand its intent. The documentation should primarily be external to the model itself, but in my opinion, a best practice model design should also apply an approach adhering to the principles of self-documenting code. That is, the model should be built in a way that is easy to understand, use and extend without the need for extensive reference to external documentation.

Names in Excel are a key part of that approach. While it is entirely possible to build a model which lives up to its specification without using names at all, I can almost guarantee that appropriate and judicious use of names would shorten development time, produce formulas that are easier for the tester and/or user to understand, and produce something that would be easier to extend or modify. There are many references out there to how names can be used in Excel, so without trying to repeat or summarise all of those, I’ll just highlight what I see as the main benefits of each technique in this context.

The first and most obvious way to use names is in defining named ranges. A cell named inpDiscountRate is meaningful to both a developer and a user – the prefix inp signals that the range is an input or assumption, and DiscountRate tells them what it denotes in the context of the model. This means that a formula referencing the discount rate in the model is far more comprehensible at first glance than one which refers to, say, $B$6. It’s also helpful to be able to name the lookup tables and arrays used in the model. For instance, in a table where an ID field is located in column A, there are a couple of ways to retrieve the ID value from elsewhere in the table. One is to name the whole column (excluding the header), maybe using a dynamic range name, defining ptrID as

=OFFSET($A$1,1,0,COUNTA($A:$A)-1,1)

Referring to ptrID (the ptr prefix denotes a pointer) in a formula in another column will then pick up the value in the ID column by implicit intersection. Another way to accomplish the same feat is to define a relative range name, again ptrID, as $A6 while the active cell is, say, B6. The reference is row-relative, so a reference to ptrID picks up the cell in the same row as the formula it appears in, but always looks in column A. I prefer using relative references as they’re less work to code, but either way we get the same benefit: a formula which has more meaning than one which relies on cell addresses.
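
As a quick illustration (tblRates here is a made-up name for a separate two-column lookup table of rates by ID), a formula filled down a column of the main table can pull in the rate for each row’s ID without a single cell address in sight:

=VLOOKUP(ptrID,tblRates,2,FALSE)

Every copy of the formula is identical, which also makes the column much easier to review.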

Names can also contain constants, thereby eliminating the need for using ‘magic numbers’ in formulae. Using

=ptrCellToLeft*inpDiscountRate^(1/cNumPeriods)

where cNumPeriods is defined to have the value 4, means that not only is the logic in the formula more transparent than it would be otherwise, but it is also more flexible. A change in granularity from quarterly to monthly requires no more than changing the value of cNumPeriods to 12.
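
Setting one of these up is just a matter of defining a name whose ‘Refers to’ box contains a constant instead of a range reference, for example:

Name:       cNumPeriods
Refers to:  =4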

Finally, a name can also encapsulate a formula. The benefit I see in a named formula is principally that it can serve to simplify otherwise unwieldy calculations. However, I prefer not to make too much use of them in this way, as in my opinion they can also serve to obscure the logic involved. In most cases, if a formula is too long to understand easily at first reading, a better approach to simplifying it is to split it into two or more cells. Assuming those cells are themselves assigned names, the end result will look pretty similar, but it is easier for the reader to access the logic. Also, this approach could conceivably improve performance – a really long formula can have a lot of dependencies, so by splitting it up we can reduce the amount of recalculation required when any one of those changes.
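
For example (all of the names below are invented purely for illustration), rather than burying everything in one cell:

=ptrRevenue*(1-inpTaxRate)*(1+inpGrowthRate)^ptrPeriod/(1+inpDiscountRate)^ptrPeriod

the revenue part can sit in its own named column, say calcNetRevenue:

=ptrRevenue*(1-inpTaxRate)*(1+inpGrowthRate)^ptrPeriod

leaving the final cell with nothing but the discounting:

=calcNetRevenue/(1+inpDiscountRate)^ptrPeriod

The end result is the same, but each piece of logic can be read and tested on its own.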

I should also make mention of the excellent Name Manager add-in, available from JKP Application Development Services. Working with the built-in Insert > Name dialogs can be very time-consuming, but this add-in greatly simplifies creating, modifying and documenting names. The built-in Name Manager added in Excel 2007 is a great improvement, but as was pointed out in a comment on an earlier post, it still doesn’t match the functionality of this very handy add-in.