Names in Excel

First off, I should apologise for the complete lack of posts over the last few weeks. It’s been a rather busy time for a month or so now, and while there hasn’t been a lack of things to write about, there has certainly been a lack of energy to actually write about them.

Anyway, I posted a few weeks back about the modelling life cycle, where I talked about the desired outcomes of implementing a standard ‘best practice’ modelling process. One of these was effective documentation of the model – effective in that everyone who needs to build, use, test and maintain a model has access to the information they need to understand its intent. The documentation should primarily be external to the model itself, but in my opinion, a best practice model design should also adhere to the principles of self-documenting code. That is, the model should be built in a way that is easy to understand, use and extend without extensive reference to external documentation.

Names in Excel are a key part of that approach. While it is entirely possible to build a model which lives up to its specification without using names at all, I can almost guarantee that appropriate and judicious use of names would shorten development time, produce formulas that are easier for the tester and/or user to understand, and produce something that would be easier to extend or modify. There are many references out there to how names can be used in Excel, so without trying to repeat or summarise all of those, I’ll just highlight what I see as the main benefits of each technique in this context.

The first and most obvious way to use names is in defining named ranges. A cell named inpDiscountRate is meaningful to both a developer and a user – the prefix inp signals that the range is an input or assumption, and DiscountRate tells them what it denotes in the context of the model. This means that a formula referencing the discount rate in the model is far more comprehensible at first glance than one which refers to, say, $B$6. It’s also helpful to be able to name lookup tables and arrays used in the model. For instance, in a table where an ID field is located in column A, there are a couple of ways to retrieve the ID value from elsewhere in the table. One is to name the whole column (excluding the header), maybe using a dynamic range name, defining ptrID as

=OFFSET($A$1,1,0,COUNTA($A:$A)-1,1)

Referring to ptrID (the ptr prefix denotes a pointer) in a formula in another column will then pick up the value in the ID column by implicit intersection. Another way to accomplish the same thing is to define a relative range name, again ptrID, as $A6 while the active cell is, say, B6. The reference is row-relative, so a reference to ptrID picks up the cell in the same row as the formula it appears in, but always looks in column A. I prefer using relative references as they’re less work to set up, but either way we get the same benefit: a formula which has more meaning than one which relies on cell addresses.
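To sketch how that reads in practice (a hypothetical example: the column layout and the tblStatus name are assumed for illustration, not taken from any particular model), a duplicate check and a lookup elsewhere in the same table might look like

=IF(COUNTIF($A:$A,ptrID)>1,"Duplicate","")
=VLOOKUP(ptrID,tblStatus,2,FALSE)

Either definition of ptrID works here – the dynamic column name resolves to the current row by implicit intersection, while the relative name points there directly – and both formulas are far easier to read than the same logic written against $A6.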

Names can also contain constants, thereby eliminating the need for ‘magic numbers’ in formulae. Using

=ptrCellToLeft*inpDiscountRate^(1/cNumPeriods)

where cNumPeriods is defined to have the value 4, means that not only is the logic in the formula more transparent than it would be otherwise, but it is also more flexible. A change in granularity from quarterly to monthly requires no more than changing the value of cNumPeriods to 12.
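Such a constant doesn’t even need to occupy a cell: the Refers To box in the Define Name dialog (or the Name Manager add-in mentioned below) will happily hold a value directly. As a minimal sketch, with names that are purely illustrative,

cNumPeriods   refers to   =4
cInflation    refers to   =0.025

gives names that can be used in formulae anywhere in the workbook, with nothing on a worksheet to accidentally overwrite.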

Finally, a name can also encapsulate a formula. The benefit I see in a named formula is principally that it can serve to simplify otherwise unwieldy calculations. However, I prefer not to make too much use of them in this way, as in my opinion they can also serve to obscure the logic involved. In most cases, if a formula is too long to understand easily at first reading, a better approach to simplifying it is to split it across two or more cells. Assuming those cells are themselves assigned names, the end result will look pretty similar, but it is easier for the reader to access the logic. This approach could also conceivably improve performance – a really long formula can have a lot of dependencies, so splitting it up reduces the amount of recalculation required when any one of those dependencies changes.
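For what it’s worth, a named formula behaves like a relative name with a calculation attached. As a hypothetical sketch (reusing the ptrCellToLeft pointer from above, and assuming an inpTaxRate input that isn’t part of the example model), fxGrossUp could be defined as

=ptrCellToLeft*(1+inpTaxRate)

so that entering =fxGrossUp in any cell grosses up its left-hand neighbour. Handy, but it’s exactly the sort of thing that hides logic from someone scanning the worksheet, which is why I use this technique sparingly.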

I also should make mention of the excellent Name Manager add-in, available from JKP Application Development Services. Working with the Insert Names dialog can be very time-consuming, but this add-in greatly simplifies the creation, modification and documentation of names. The addition of the built-in Name Manager in Excel 2007 is a great improvement, but as was pointed out in a comment on an earlier post, it still doesn’t equal the functionality of this very handy add-in.

The modelling life cycle

Recently I decided to propose the use of a framework in the spreadsheet modelling process in my business unit. The idea for the framework is based on the slightly dated but still excellent document Spreadsheet Modelling Best Practice, available for download through the European Spreadsheet Risk Interest Group website. Basically, my motivation was provided by one particular modelling process I saw in action recently, where someone with no real skills in spreadsheet design (let’s call him Noddy) was tasked with building a model. This model is going to inform several key internal and external stakeholders in our business about downstream effects of a significant change to a business process. High visibility, high risk. The thing is, I know I’m going to be called upon at some stage to help out with this, and I also know it’s going to be a hard ask to get details from Noddy – I’ve seen some of what he calls documentation, and it’s nowhere near as specific as it needs to be.

With all this in mind, the question for me is: how can we avoid this situation in the future? The Best Practice paper seemed to offer a nice framework for guaranteeing two things from any modelling process:

  • Constructive engagement from everyone who needs to be engaged; and
  • Documentation which allows for ease of testing, maintenance and re-use.

The core idea is simple – any model built according to best practice goes through a six-stage life cycle, as follows:

  1. Scope
  2. Specify
  3. Design
  4. Build
  5. Test
  6. Use

This is iterated as required. The outcomes from each stage are concrete, and serve as inputs into later stages. For instance, the outcome of the specification stage is an unambiguous statement of how the model will calculate the results required; this guides how the model is designed and then built, and serves as a framework to test against.

One of the key weaknesses with our current setup is that people like Noddy often charge into the build stage without any regard for the first three stages, which in my opinion are the most important to getting the model right. Building a model in Excel is a trivial process (given, of course, a fair knowledge of worksheet functions and the Excel UI) once you know the questions it has to answer, have the data you need, know exactly how you’re going to answer those questions, and know what the model is going to look like.

Building Models with Pivot Tables

Pivot Table Love @ Smurf on Spreadsheets
This is a really interesting post, mainly because of the discussion it’s generated – there’s definitely a diversity of opinion out there about the use of pivot tables. Personally, I have used them more often than not when building models, for the simple reason that once you’ve gotten over the initial learning curve, they’re a very easy way to make sense of large amounts of data.

My use of pivot tables has evolved over time, in tandem with my growing understanding of best practice in spreadsheet modelling. When I first started using them, all I would do was put a truckload of data on one worksheet and use the wizard to build a table and present some summary stats on another. Which can be fine, but what I failed to appreciate at that point is that there are lots of ways to extract summary stats, many of which are far less expensive in terms of file size. For instance, array formulas, SUMIF/COUNTIF, or one of the many variations on SUMPRODUCT (provided you’re not using too many of them) can be very convenient, in that they avoid the expense of creating the pivot cache. The database functions (DCOUNT, DSUM, DAVERAGE) are also very useful, and often allow stats to be extracted directly from the dataset using far more complex criteria than would otherwise be available.
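To illustrate the sort of alternatives I mean (a sketch only: the column layout, the dataSales and critEast names, and the 10,000-row extent are all hypothetical), a total of Sales for the East region, from a dataset with Region in column A, Product in column B and Sales in column C, could come from any of

=SUMIF($A$2:$A$10000,"East",$C$2:$C$10000)
=SUMPRODUCT(($A$2:$A$10000="East")*($B$2:$B$10000="Widgets")*$C$2:$C$10000)
=DSUM(dataSales,"Sales",critEast)

where the SUMPRODUCT version adds a second condition on Product to show multiple criteria, dataSales covers the dataset including its header row, and critEast is a small criteria range (the Region header with East beneath it). None of these needs a pivot cache, although the SUMPRODUCT approach recalculates over the whole range each time.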

What I find myself doing more now is using pivot tables, in combination with GETPIVOTDATA, as an intermediate step on the way to building the inputs and calculations that do the work in the model. The question of file size often comes up, but unless the dataset is under, say, 10K rows, I’ll keep it outside the model. The SAS add-in for Office is a great option in that regard, allowing a SAS dataset to be opened directly into a pivot table, or stored processes and even ad hoc queries to be run on the server from inside the worksheet.
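As a rough sketch of that intermediate step (the field names, the $A$3 anchor and the ptrQuarter name are assumptions for illustration), a calculation cell might pull a figure out of the table with something like

=GETPIVOTDATA("Sales",$A$3,"Region","East","Quarter",ptrQuarter)

which keeps working when the pivot table is re-arranged, as long as the referenced items are still visible in the table.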

The really attractive feature of pivot tables, though, is that once they’re built they’re very flexible, and reasonably intuitive for the end user to change as needed. Of course, whether I use them in that way or not is going to depend on the objective of the model and the needs of the end user.