Type or Class?

I’ve been kept quite busy recently building an application for work. It’s a fairly standard task – take a number of regularly (monthly) populated workbooks from offices around the country and consolidate the data in them at a regional and national level. Aside from a little fluffing around with validation at the beginning and end of the process, the flow of the consolidation procedure runs like this:

Locate workbooks to be processed, add them to a collection
For each workbook in the collection:
	Open it
	Read the data in the workbook into memory
	Close the workbook

Open consolidation workbook
For each block of data collected:
	write into the appropriate place in the workbook

Save and close

Now that’s all fairly straightforward. The issue I found myself faced with a couple of days ago relates to keeping the data from each of the workbooks in memory. There’s a fair variety of data in them, so I elected to use a user-defined type (UDT) to hold it:

Public Type uCheckSheetData
    sLocation As String
    dtMonth As Date
    vErrors As Variant
    lNumCases As Long
    sComment As String
End Type

I then formed a collection which would hold the UDTs – or so I thought. For those of you who’ve tried this before, you’ve probably also seen the rather obscure error message:

Compile error:

Only user-defined types defined in public object modules can be coerced to or from a variant or passed to late-bound functions

After a fair amount of head-scratching and wtf?ing I finally decided that Google is my friend and found a solution: UDTs can’t be added to collections – use a class. Classes in VBA may not be all that they are in other languages, but they do represent a more object-oriented approach, and can accomplish all and more that a UDT can. They just represent a bit more work to code. For example, here’s how I replaced the UDT uCheckSheetData with the class CCheckSheet:

Option Explicit
' Replaces public type as class can be added to 
' collections/dictionaries

Private sLocation As String
Private dtMonth As Date
Private vErrors As Variant
Private lNumCases As Long
Private sComment As String

Public Property Get Location() As String
    Location = sLocation
End Property

Public Property Let Location(ByVal strIn As String)
    sLocation = strIn
End Property

Public Property Get Month() As Date
    Month = dtMonth
End Property

Public Property Let Month(ByVal dtIn As Date)
    dtMonth = dtIn
End Property

Public Property Get Errors() As Variant
    Errors = vErrors
End Property

Public Property Let Errors(ByVal vArr As Variant)
    vErrors = vArr
End Property

Public Property Get NumCases() As Long
    NumCases = lNumCases
End Property

Public Property Let NumCases(ByVal lIn As Long)
    lNumCases = lIn
End Property

Public Property Get Comment() As String
    Comment = sComment
End Property

Public Property Let Comment(ByVal strIn As String)
    sComment = strIn
End Property

Now it might be argued that a lot of that code is not really necessary. I could have defined a set of public variables in the class and just exposed them that way, rather than exposing public properties with Get and Let. However, one benefit of writing these procedures is that a lot of other code can be included in them, for instance to validate the data passed in to the class members – something that has to be done outside of a UDT. There’s also the possibility to include class methods and trap events and a whole lot of other stuff, which in this case is not really required. Chip Pearson has an excellent introductory discussion of classes in VBA at Classes in VBA. Which, by the way, includes a prominent warning not to make classes auto-instancing (using Dim as New syntax when declaring a class variable, otherwise known as a really bad idea). There’s also a great chapter in the book Professional Excel Development covering classes, which also warns against auto-instancing (did I mention that was a really bad idea?).

Finally, there’s a rather interesting discussion here about why VB won’t let you add a UDT to a collection.

The modelling life cycle

Recently I decided to propose the use of a framework in the spreadsheet modelling process in my business unit. The idea for the framework is based on the slightly dated but still excellent document Spreadsheet Modelling Best Practice, available for download through the European Spreadsheet Risk Interest Group website. Basically, my motivation was provided by one particular modelling process I saw in action recently, where someone with no real skills in spreadsheet design (let’s call them Noddy) was tasked with building a model. This model is going to inform several key internal and external stakeholders in our business about downstream effects of a significant change to a business process. High visibility, high risk. The thing is, I know I’m going to be called upon at some stage to help out with this, and I also know it’s going to be a hard ask to get details from Noddy – I’ve seen some of what he calls documentation, and it’s nowhere near as specific as it needs to be.

With all this in mind, the question for me is: how can we avoid this situation in the future? The Best Practice paper seemed to offer a nice framework to ensure that two things are guaranteed from any modelling process:

  • Constructive engagement from everyone that needs to be engaged; and
  • Documentation which allows for ease of testing, maintenance and re-use.

The core idea is simple – any model built according to best practice goes through a six-stage life cycle, as follows:

  1. Scope
  2. Specify
  3. Design
  4. Build
  5. Test
  6. Use

This is iterated as required. The outcomes from each stage are concrete, and serve as inputs into later stages – for instance, the outcome of the specification stage is an unambiguous statement of how the model will calculate the results required – this guides how the model is designed and then built, and serves as a framework to test against.

One of the key weaknesses with our current setup is that people like Noddy often charge into the build stage without any regard for the first three stages, which in my opinion are the most important to getting the model right. Building a model in Excel, once you know the questions it has to answer, have the data you need, know exactly how you’re going to answer those questions, and what it’s going to look like, is a trivial process (given, of course, a fair knowledge of worksheet functions and the Excel UI).

Finding primes with Python

One of the Euler problems requires finding all the primes below a certain number x.  Without thinking about this very much, it’s possible to take a very easy approach. Let’s say, look at all the numbers below x and check whether each of them is prime. In pseudocode, this would go as follows:
for each y less than x
if y is prime, add it to the list
return the list

The only requirement for this to work is a test to determine whether y is prime. Well, that’s easy enough. This would do the trick:

import math

def is_prime(n):
  """
  pre: n +ve integer
  returns: true if n prime, else false
  """
  for i in range(2, int(math.sqrt(n)) + 1):
    if n % i == 0:
      return False
  return True

So, putting that into code, we’d get:

def primes_under(n):
  """
  pre: n +ve integer
  returns: list of all primes under n
  """
  result = [2]
  # only need to test odd numbers
  for x in xrange(3, n, 2):
    if is_prime(x):
      result.append(x)
  return result

Now the only problem with this approach is that, as the numbers get larger, the prime test takes longer. For n = 10,007 (which is prime) the test requires 100 division operations. Even removing even numbers from the list we’re checking, this mounts up to hundreds of thousands of operations getting through the full list. If we wanted to test up to, say, 1,000,000, the number of operations required at that point is up in the hundreds of millions. Clearly, this will get prohibitively expensive after this point – so it’s time to start thinking about a smarter way to do this.

One way which does not require nearly as many operations is to use a prime sieve. This was an idea that occurred to Eratosthenes a couple of thousand years ago – it still impresses me to consider its effectiveness and its simplicity. Here’s a quick rundown of the algorithm from its Wikipedia entry:

  1. Create a contiguous list of numbers from two to some highest number n.
  2. Strike out from the list all multiples of two (4, 6, 8 etc.).
  3. The list’s next number that has not been struck out is a prime number.
  4. Strike out from the list all multiples of the number you identified in the previous step.
  5. Repeat steps 3 and 4 until you reach a number that is greater than the square root of n (the highest number in the list).
  6. All the remaining numbers in the list are prime.

There is a trade-off here: this approach is far faster than testing all numbers (and the bigger the number, the more noticeable this gain is), but it requires the space in memory to hold the array. Quite often this is the expense of improvement: reduce one cost and another increases.
This is how the algorithm looks in Python:

import math

def primes_under(n):
  """
  pre: n +ve integer
  returns: list of all primes under n
  """
  sieve = [True]*n
  for x in xrange(4, n, 2):
    sieve[x] = False
  # any multiple of num above sqrt(n) is also a multiple of a
  # lower prime, hence has already been eliminated
  for num in xrange(3, int(math.sqrt(n)) + 1, 2):
    if sieve[num]:
      for y in xrange(num**2, n, 2*num):
        sieve[y] = False
  result = [x for x in range(2, n) if sieve[x]]
  return result