64-bit Excel

The MS Excel team just posted on their blog about a 64-bit version of Excel in the upcoming 2010 version. Nominally, this means, as they put it, “…a version of Excel that runs natively in that [64-bit] environment and takes advantage of 64-bit strengths. One of the more obvious benefits is the lifting of the addressable memory constraint we were dealing with in 32-bit Windows.”

OK, OK, so that’s true. Out-of-memory conditions are one of the problems that have traditionally plagued Excel, and I’m really stoked to see this being addressed. But one thing I’m really not stoked about – and a few people have already commented on this – is the unfortunate wording of the explanation of the need for 64-bit Excel: “In short: enormous capacity; humungous workbooks.” In short, no, no, NOOOOOO!!!

Let me put my objection into context. I see Excel as filling a number of roles in the BI space. It’s got a calculation engine and pivot tables, which make it a great tool for analysis. It’s got a charting engine and formatting, which make it an excellent presentation layer. It’s got VBA (for the time being anyway), which allows it to be used as a serviceable small-scale rapid application development platform. However, one thing it is not, and should never be used as, is a database. The occasions when it is used that way almost inevitably end in tears, and quite often in mission-critical systems which simply stop working as intended – models which no longer calculate, or “humungous workbooks” which no longer open.

I was approached recently by someone in the team implementing the rollout of Office 2007 at work, asking about a number of very large workbooks (built in Excel 2003) which appeared not to open in Excel 2007. “How large is large?” I asked, thinking it’d be in the 50-100 MB range. Turns out these monsters approached 500 MB – several sheets containing 30K rows with 100+ columns filled with VLOOKUPs, OFFSETs, SUMIFs and array functions. They do open (and re-calculate) in Excel 2003, although you do have to wait a few minutes in either case. In fact, they do eventually open in Excel 2007, but only after a 5-hour wait.

Luckily, these behemoths are not in current use – that was the scenario the project team was concerned about. If they had been part of a business-critical system, we would have been in serious trouble. The bottom line is that this is a great example of really poor spreadsheet design, which should have as a fundamental guiding principle the idea of restricting data to the minimum needed to achieve the desired result. Really large datasets do NOT belong in spreadsheets, and I’ll take a lot of convincing to believe that the enhancements offered in a 64-bit version of Excel can change that.

Names in Excel

First off, I should apologise for the complete lack of posts over the last few weeks. It’s been a rather busy time for a month or so now, and while there hasn’t been a lack of things to write about, there has certainly been a lack of energy to actually write about them.

Anyway, I posted a few weeks back about the modelling life cycle, where I talked about the desired outcomes of implementing a standard ‘best practice’ modelling process. One of these was effective documentation of the model – effective in that everyone who needs to build, use, test and maintain a model has access to the information they need to understand its intent. The documentation should primarily be external to the model itself, but in my opinion a best practice model design should also adhere to the principles of self-documenting code. That is, the model should be built in a way that is easy to understand, use and extend without extensive reference to external documentation.

Names in Excel are a key part of that approach. While it is entirely possible to build a model which lives up to its specification without using names at all, I can almost guarantee that appropriate and judicious use of names would shorten development time, produce formulas that are easier for the tester and/or user to understand, and produce something that would be easier to extend or modify. There are many references out there to how names can be used in Excel, so without trying to repeat or summarise all of those, I’ll just highlight what I see as the main benefits of each technique in this context.

The first and most obvious way to use names is in defining named ranges. A cell named inpDiscountRate is meaningful to both a developer and a user – the prefix inp signals that the range is an input or assumption, and DiscountRate tells them what it denotes in the context of the model. This means that a formula referencing the discount rate in the model is far more comprehensible at first glance than something which has a reference to, say, $B$6. It’s also helpful to be able to name lookup tables and arrays used in the model. For instance, in a table where an ID field is located in column A, there are a couple of ways to retrieve the ID value from elsewhere in the table. One is to name the whole column (excluding the header), maybe using a dynamic range name, defining ptrID as

=OFFSET($A$1,1,0,COUNTA($A:$A)-1,1)

Referring to ptrID (the ptr prefix denotes a pointer) in a formula in another column will then pick up the value in the ID column by implicit intersection. Another way to accomplish the same feat is to define a relative range name, again ptrID, as $A6 while the active cell is, say, B6. The reference is row-relative, so a reference to ptrID picks up the cell in the same row as the formula it appears in, but always looks in column A. I prefer using relative references as they’re less work to code, but either way we get the same benefit: a formula which has more meaning than one which relies on cell addresses.
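To make that concrete, here’s a minimal sketch (the layout is hypothetical): suppose the IDs sit in A2:A100 and ptrID is defined in either of the ways above. A formula such as

=IF(ISNUMBER(ptrID),"valid","missing")

entered in C6 and filled down will test A6, then A7, and so on, without a single cell address appearing in the formula itself.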

Names can also contain constants, thereby eliminating the need for using ‘magic numbers’ in formulae. Using

=ptrCellToLeft*inpDiscountRate^(1/cNumPeriods)

where cNumPeriods is defined to have the value 4, means that not only is the logic in the formula more transparent than it would be otherwise, but it is also more flexible. A change in granularity from quarterly to monthly requires no more than changing the value of cNumPeriods to 12.
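For what it’s worth, defining a constant name like this needs no cell at all – in the Define Name dialog (or the Excel 2007 Name Manager) it’s just

Name: cNumPeriods
Refers to: =4

so the number lives in exactly one place, with a meaningful label attached.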

Finally, a name can also encapsulate a formula. The benefit I see in a named formula is principally that it can serve to simplify otherwise unwieldy calculations. However, I prefer not to make too much use of them in this way, as in my opinion they can also serve to obscure the logic involved. In most cases, if a formula is too long to understand easily at first reading, a better approach to simplifying it is to split it into two or more cells. Assuming those cells are themselves assigned names, the end result will look pretty similar, but it is easier for the reader to access the logic. Also, this approach could conceivably improve performance – a really long formula can have a lot of dependencies, so by splitting it up we can reduce the amount of recalculation required when any one of those changes.
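To make the trade-off concrete, here’s a rough sketch reusing the names from the example above (calcPeriodRate itself is invented): a defined name calcPeriodRate whose Refers to box contains

=inpDiscountRate^(1/cNumPeriods)

lets the earlier formula be written as =ptrCellToLeft*calcPeriodRate. That’s tidy, but a reader has to open the name definition to see what’s inside it. Putting the same expression in its own cell and naming that cell calcPeriodRate gives a formula that looks identical, while leaving the intermediate value visible on the sheet.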

I should also mention the excellent Name Manager add-in, available from JKP Application Development Services. Working with the built-in Insert > Name dialogs can be very time-consuming, but this add-in greatly simplifies creating, modifying and documenting names. The built-in Name Manager added in Excel 2007 is a great improvement, but as was pointed out in a comment on an earlier post, it still doesn’t match the functionality of this very handy add-in.

The modelling life cycle

Recently I decided to propose the use of a framework in the spreadsheet modelling process in my business unit. The idea for the framework is based on the slightly dated but still excellent document Spreadsheet Modelling Best Practice, available for download through the European Spreadsheet Risk Interest Group website. Basically, my motivation was provided by one particular modelling process I saw in action recently, where someone with no real skills in spreadsheet design (let’s call them Noddy) was tasked with building a model. This model is going to inform several key internal and external stakeholders in our business about downstream effects of a significant change to a business process. High visibility, high risk. The thing is, I know I’m going to be called upon at some stage to help out with this, and I also know it’s going to be a hard ask to get details from Noddy – I’ve seen some of what he calls documentation, and it’s nowhere near as specific as it needs to be.

With all this in mind, the question for me is: how can we avoid this situation in the future? The Best Practice paper seemed to offer a nice framework for ensuring two things from any modelling process:

  • Constructive engagement from everyone that needs to be engaged; and
  • Documentation which allows for ease of testing, maintenance and re-use.

The core idea is simple – any model built according to best practice goes through a six-stage life cycle, as follows:

  1. Scope
  2. Specify
  3. Design
  4. Build
  5. Test
  6. Use

This is iterated as required. The outcomes from each stage are concrete and serve as inputs into later stages. For instance, the outcome of the specification stage is an unambiguous statement of how the model will calculate the required results; this guides how the model is designed and then built, and serves as a framework to test against.

One of the key weaknesses with our current setup is that people like Noddy often charge into the build stage without any regard for the first three stages, which in my opinion are the most important for getting the model right. Once you know the questions the model has to answer, have the data you need, know exactly how you’re going to answer those questions, and know what it’s going to look like, building it in Excel is a trivial process (given, of course, a fair knowledge of worksheet functions and the Excel UI).

Building Models with Pivot Tables

Pivot Table Love @ Smurf on Spreadsheets
This is a really interesting post, mainly because of the discussion it’s generated – there’s definitely a diversity of opinion out there about the use of pivot tables. Personally, I have used them more often than not when building models, for the simple reason that once you’ve gotten over the initial learning curve, they’re a very easy way to make sense of large amounts of data.

My use of pivot tables has evolved over time, in tandem with my growth in understanding of best practice in spreadsheet modelling. When I first started using them, all I would do was put a truckload of data on one worksheet and use the wizard to build a table and present some summary stats on another. Which can be fine, but I think what I failed to appreciate at that point is the fact that there are lots of ways to extract summary stats, many of which are far less expensive in terms of file size. For instance, array functions, SUMIF/COUNTIF, or one of the many variations on the use of SUMPRODUCT (if you’re not using too many of them), can be very convenient in that they will save on the expense of creating the pivot cache. The database functions (DCOUNT, DSUM, DAVERAGE) are also very useful, and often enable the extraction of stats using far more complex criteria than would otherwise be available, directly from the dataset.
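A hedged sketch of the kind of thing I mean – the layout (Region in column B, Sales in column E, data in rows 2 to 5000) is invented for illustration:

=SUMPRODUCT(($B$2:$B$5000="East")*($E$2:$E$5000>1000))

counts the East-region rows with sales over 1000, and with a two-row criteria range at $J$1:$K$2 (the headers Region and Sales in the first row, East and >1000 beneath them),

=DSUM($A$1:$E$5000,"Sales",$J$1:$K$2)

returns the corresponding sales total – neither requires a pivot cache.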

What I find myself doing more now is using pivot tables, in combination with GETPIVOTDATA, as an intermediate step on the way to building the inputs and calculations that do the work in the model. The question of file size often comes up, but unless the dataset is under, say, 10K rows, I’ll keep it outside the model. The SAS add-in for Office is a great help in that regard, allowing a SAS dataset to be opened directly into a pivot table, or stored processes and even ad hoc queries to be run on the server from inside the worksheet.
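By way of a hedged illustration (the field names and pivot location are hypothetical): with a pivot table anchored at cell $A$3 on a sheet called Pivot, summarising a Sales field by Region and Month, a calculation cell in the model can pull exactly the figure it needs with

=GETPIVOTDATA("Sales",Pivot!$A$3,"Region","East","Month","Jan")

which keeps returning the right number if the pivot table is refreshed or rearranged, as long as that figure is still shown somewhere in the table.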

The really attractive feature of pivot tables, though, is that once they’re built they’re very flexible, and reasonably intuitive for the end user to change as needed. Of course, whether I use them in that way or not is going to depend on the objective of the model and the needs of the end user.

I, too, am in spreadsheet hell

Ok, let me start by saying that the use of the term “hell” is probably a little too strong for the situation I’m about to describe. But I just stumbled across an old post in Smurf on Spreadsheets linking to a paper written by the author, which struck an immediate chord with me. Without reviewing the contents of the paper, I’ll just mention a couple of the points which stood out:

  • The “key man” dependency problem – critical spreadsheets and systems of spreadsheets in many corporate environments have been built, or at least designed, principally by one person. This one person hands the system over to the prospective business owner or end user with either little or no documentation, technical or otherwise. No-one else has the same perspective on the overall structure of the system, or sufficient knowledge of its inner workings to be able to support it. So when that person leaves the building, where does that leave the user when something goes wrong? Not in a good place…

  • External linking – It’s an extremely common (and risky) practice to link related spreadsheets within a system or application in a rigid, predetermined fashion in order to collate, summarise or manipulate data, slicing and dicing on the way to producing a presentation layer (perhaps a management dashboard or other regular report) at the end. This often means that the accuracy of the results and/or the completion of the process is extremely sensitive to both the structure of the data and to the containing file structure. As a result, an error in the process often means a protracted, tortuous investigation to trace the point at which it originated.

So my situation relates to both of these points. When I started in my current job, about a year ago, I was designated as the support person for a monthly business reporting process in another team. My predecessor had been the person who “designed” this process, which involves about 60-70 spreadsheets in total, 30-odd of which are raw data entered into templates, with the rest summarising this data in various ways, resulting in 3 or 4 key reports for various consumers, mostly upper management.

The issue I have is that the process is, overall, a complete mess. The importing of data and regeneration of the consolidated reports is handled by a bunch of VBA routines, which is why I was nominated as the support person (Geoff knows VB, he can do that). Imagine my horror to discover that the ongoing maintenance of the code required, each and every month, copying and pasting every single routine used the previous month and then replacing within it every single reference to that month – month names, dates, file paths, everything. There were no variables used in the code, anywhere – it had basically all been recorded, then updated, every month for the last three years. Anecdotal evidence suggested the bloke who had been doing this spent around 3-4 days every month on it and a couple of other similar processes.

Well, I was having none of that, so I spent around 3 weeks painstakingly untangling the intent of the code and then refactoring, until I was finally left with something that requires no regular maintenance – broadly along the lines of the sketch below.
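To give a flavour of the change, here’s a minimal sketch of the approach. To be clear, this is not the actual code: the procedure name, folder structure and workbook names are all invented for illustration.

Sub ImportMonthlyData()
    Dim dtPeriod As Date
    Dim sMonth As String
    Dim sPath As String
    Dim wbRaw As Workbook

    ' Everything month-specific is derived from a single date, instead of
    ' being hard-coded into dozens of recorded routines.
    dtPeriod = DateSerial(Year(Date), Month(Date) - 1, 1)  ' previous month
    sMonth = Format(dtPeriod, "mmm yyyy")                  ' e.g. "Mar 2009"
    sPath = "\\server\reporting\" & Format(dtPeriod, "yyyy") & "\" & sMonth & "\"

    Set wbRaw = Workbooks.Open(sPath & "Raw data " & sMonth & ".xls")
    ' ... consolidation into the summary workbooks goes here ...
    wbRaw.Close SaveChanges:=False
End Sub

That dealt with the code. It still leaves the real problem, though, which does require regular attention.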

Most of the cells in the monthly summaries represent the last link in a chain of dependency stretching back over several workbooks. It works, mostly – but where it doesn’t, leaving incorrect or #REF values, it is a real pain in the arse finding where it’s gone wrong. The person who runs the reports calls me up most months to report just that happening, and investigation often reveals nothing more than links not being updated. Or occasionally I’ll discover a link pointing to the wrong place, or a template incorrectly filled out. What the process needs is essentially complete redesign, but given the current constraints on budget in our business, it’s not going to happen anytime soon. In fact, I can see this stretching out over the next 5-10 years, and when I’ve gone, the next person to support it will be in much the same situation I was. I’ve made some attempts at documentation, but I really don’t have the time now for anything else.

The perils of large spreadsheets

Ok, we’ve all been there. What started out as a relatively compact workbook has blossomed into a gargantuan mess that requires calculation set to Manual to change anything, and takes a full minute just to open. It’s not nice, particularly when the workbook in question is one that is relied upon heavily. I “owned”, or should I say “inherited”, one of these a couple of years back, which was used by over a hundred people in a call centre – and loathed by all of them. I didn’t like it much either, as I had to field all the calls and emails asking for it to be “fixed”.

So what to do? In a sense, of course, there’s nothing to fix, as long as everything does recalculate as it should. If it doesn’t, or worse, if Excel crashes when the workbook is being used, then there is a problem, and one which usually warrants a complete redesign. Unfortunately that happens more often than the average user realises. While Excel’s memory management is continually improving with new releases (and the average memory available to it is increasing), you can only push it so far before it gives up the ghost. However, let’s assume for the sake of argument that you’re not in that awful place right now. So while there’s nothing broken as such, the performance you’re getting out of it is not what it should be. What can you do to improve it? Here are a couple of things to try.

  • Cull unnecessary data. There are cases where workbooks (particularly some financial models) do require several sheets with thousands of rows. But it’s worth questioning whether it’s all absolutely necessary. And does everything need to be in the same workbook? It may be that you can get significant gains by splitting the workbook into two or more pieces, which can then be hyperlinked – and by grouping related data this can make for a more user-friendly tool.
  • Re-use the same cache for your pivot tables if you can. Excel will very helpfully prompt you to do just that when it realises you’re about to create a pivot table using the same range or database that forms the cache for another pivot table. It also explains why – you will save memory and your workbook will be smaller. This is almost certainly a good thing. And don’t make the mistake of creating a separate cache for a subset of a previously used range (another common error) – you can always use the whole thing and exclude what you don’t want from within the pivot table after it’s been created.
  • Avoid formula repetition, wherever possible. Too many people take the “Jurassic Park” approach to building formulas, and assume that just because it is possible to do something, they should do it. Not true – here’s an example: =IF(VLOOKUP($M6,$B$2:$H$100,3,FALSE)>AVERAGE($B$2:$B6),
    VLOOKUP($M6,$B$2:$H$100,3,FALSE)-AVERAGE($B$2:$B6),
    VLOOKUP($M6,$B$2:$H$100,3,FALSE)+AVERAGE($B$2:$B6))

    Now you may ask, what’s wrong with that? Well, look at it from Excel’s point of view:
    Excel: So what do you want me to do?
    You: Well, first look up this thing in the first column of that table, and go along 3 columns from it and get the value there.
    Excel: Got it – easy!
    You: Good. Now take the mean of these numbers here and compare it with the number you just got. If it’s smaller, then look up this thing in the first column of that table, go along 3 columns and get the value there.
    Excel: Wait – you asked me to do that already.
    You: Well now I want you to do it again.
    Excel: OK then – it’s your party.
    You: Right. Now take the mean of these numbers here,…
    Excel: Seriously? I just did that before.
    You: Do it again, and subtract it from the number you just got.
    Excel: Alright – done. What if the number wasn’t smaller?
    You: Well, in that case, look up this thing in this table here…
    Excel: I can feel a migraine coming on…

    You get the picture. Of course, the problem is that Excel won’t complain when you ask it to do something silly like this – but any reasonable person would suggest that you calculate the results of the VLOOKUP and AVERAGE functions once, in separate cells, and then use the values however you like, as many times as you like, without creating a job that takes twice as much recalculation as necessary (there’s a sketch of this after the list).
  • A related issue is the (over)use of VLOOKUP. Where you need to look up the same value more than once – even in separate cells, returning data from different columns – Excel still has to repeat the same search of the first column unnecessarily. MATCH is your friend here, in combination with either OFFSET or INDEX. For example:
    A1=VLOOKUP(M6,$B$2:$H$100,3,FALSE)
    A2=VLOOKUP(M6,$B$2:$H$100,4,FALSE)
    A3=VLOOKUP(M6,$B$2:$H$100,7,FALSE)

    could be replaced with:
    A1=MATCH(M6,$B$2:$B$100,0)
    A2=INDEX($D$2:$D$100,A1)
    A3=INDEX($E$2:$E$100,A1)
    A4=INDEX($H$2:$H$100,A1)

    Over big tables this can represent a huge saving in calculation time.
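Coming back to the formula-repetition example above, here’s a minimal sketch of the split-out version (the choice of N6 and O6 as helper cells is arbitrary):

N6=VLOOKUP($M6,$B$2:$H$100,3,FALSE)
O6=AVERAGE($B$2:$B6)

and the original cell then reduces to =IF(N6>O6,N6-O6,N6+O6), so each of the expensive pieces is calculated exactly once.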

And that’s all I’ve got time for right now. Good luck!