Summing a random sample

A colleague came to me recently with a problem:
I have a bunch of numbers in a spreadsheet, and I need to take a random sample of 50 of them, and sum it. I can do the sample by random sorting, but I need to repeat this several times, so I’d like a function that I can repeat without having to muck around.

Well, sampling – I have some code which will do just that. After that, I think I should be able to manage summing the array.

Here’s the code for creating a random sample:

Private Sub SampleArray(ByRef avBigset As Variant, ByRef avSmallset As Variant)
    ' SampleArray populates a random sample avSmallset from an array avBigset
    ' without replacement (each element in avBigset is considered once only)
    Dim lRemainder As Long
    Dim lSize As Long
    Dim lOb As Long
    Dim lPickit As Long
        
    ' Make sure we're dealing with arrays...
    If Not IsArray(avBigset) Or Not IsArray(avSmallset) Then Exit Sub
    ' Initialise
    lRemainder = UBound(avBigset)
    lSize = UBound(avSmallset)
    Randomize 0
    
    Do While lSize > 0  ' Still some left to pick up
        lOb = lOb + 1
        If Rnd < lSize / lRemainder Then
            lPickit = lPickit + 1
            avSmallset(lPickit) = avBigset(lOb)
            lSize = lSize - 1
        End If
        lRemainder = lRemainder - 1
    Loop    ' Sample complete
    
End Sub

The issue is that it’s built to work with 1d arrays, so given a range as input in the function, I would need to push the values of the range into an array. I could do that with something simple like the below:

Private Function arr(ByRef rng As Range) As Variant
    ' Convert range to 1-D array
    Dim i As Long
    Dim j As Long
    Dim k As Long
    Dim rowCount As Long
    Dim colcount As Long
    Dim lim As Long
    Dim sArr() As String
    Dim vArr As Variant
    
    vArr = rng.Value2
    
    ' dimension array
    rowCount = UBound(vArr)
    colcount = UBound(vArr, 2)
    lim = rowCount * colcount
    ReDim sArr(1 To lim)
    k = 1
    
    ' populate
    For i = 1 To rowCount
        For j = 1 To colcount
            sArr(k) = vArr(i, j)
            k = k + 1
        Next j
    Next i
    
    arr = sArr
    
End Function

This will do the trick, but performance-wise it works out pretty horribly. The bigger the array, the longer this step will take, and it’s all useless, unnecessary overhead. The better approach is I think to revise the SampleArray function so it will work with a range rather than an array.

Actually this turns out to be very easy. The SampleArray code loops over an array, but with a range, we can just loop over the Cells collection. The rest of the code is pretty much identical, except that we’ll use a For Each…Next loop rather than Do While:

Private Sub SampleRange(ByRef rngBigset As Variant, ByRef avSample As Variant)
    ' SampleRange populates a random sample avSample from a range rngBigset
    ' without replacement (each element in rngBigset is considered once only)
    Dim lRemainder As Long
    Dim lSize As Long
    Dim lPickit As Long
    Dim rngCell As Excel.Range
    
    
    ' Initialise
    lRemainder = rngBigset.Cells.Count
    lSize = UBound(avSample)
    Randomize 0
    
    For Each rngCell In rngBigset.Cells
        If lSize <= 0 Then Exit For ' Sample complete
        If Rnd < lSize / lRemainder Then
            lPickit = lPickit + 1
            avSample(lPickit) = rngCell
            lSize = lSize - 1
        End If
        lRemainder = lRemainder - 1
    Next rngCell

End Sub

So is there a significant performance gain in doing it this way, taking out the array function? Unsurprisingly, yes there is. I created two functions, SAMPLESUM, which requires the conversion to an array beforehand:

Public Function SAMPLESUM(ByRef in_range As Range, ByVal sample_size As Long) As Variant
    ' Sum a random sample of size sample_size from in_range
    Dim sample_array() As Double
    Dim in_array As Variant
    Dim index As Long
    
    ' set up large and small arrays
    in_array = arr(in_range)
    ReDim sample_array(1 To WorksheetFunction.Min(UBound(in_array), sample_size))
    
    ' get sample
    SampleArray in_array, sample_array
    ' sum sample array
    SAMPLESUM = WorksheetFunction.Sum(sample_array)
    
End Function

and SAMPLESUM2, working directly with the range:

Public Function SAMPLESUM2(ByRef in_range As Range, ByVal sample_size As Long) As Variant
    ' Sum a random sample of size sample_size from in_range
    Dim sample_array() As Double
    Dim index As Long
    
    ' set up large and small arrays
    ReDim sample_array(1 To WorksheetFunction.Min(in_range.Cells.Count, sample_size))
    
    ' get sample
    SampleRange in_range, sample_array
    ' sum sample array
    SAMPLESUM2 = WorksheetFunction.Sum(sample_array)

End Function

I then set up a function which timed 100 calculations of each version, taking a sample of size 10 from a range of numbers with 50,000 rows and 10 columns:

Public Sub timeit()
Dim starttime As Double
Dim endtime As Double
Dim totaltime As Double
Dim i As Integer
Dim numrange As Excel.Range

Set numrange = Range("A1:J50000")


starttime = Timer

For i = 1 To 100
    SAMPLESUM numrange, 10
Next i

endtime = Timer
totaltime = endtime - starttime

Debug.Print "Time taken for 100 calculations (SAMPLESUM): " & totaltime & " seconds"

starttime = Timer

For i = 1 To 100
    SAMPLESUM2 numrange, 10
Next i

endtime = Timer
totaltime = endtime - starttime

Debug.Print "Time taken for 100 calculations (SAMPLESUM2): " & totaltime & " seconds"

End Sub

Which output:

Time taken for 100 calculations (SAMPLESUM): 44.140625 seconds
Time taken for 100 calculations (SAMPLESUM2): 12.546875 seconds

I think I’ll go with door number 2.

Regular Expressions – new version

Just a quick post tonight to let you all know, I’ve added a new page for downloads, which contains a new version of the regular expressions add-in, compatible with Excel 2007 and later. I’ve added in a couple of utility functions for convenience (mine more than yours but you might find them useful), and a form to evaluate regular expressions against test strings. And there’s even documentation!

The documentation gives a (very) brief intro to the uses and abuses of regular expressions, a run-down of the worksheet functions in the add-in and some examples of their use. Here are a couple of those, I hope you find them useful.

Matching cells which contain variations on a word

There are some words in the English language which Americans, god bless them, spell in their own special way. However, given input on the spreadsheet from users who spell both ways (correctly and incorrectly), you may wish to match both variations of words like ‘realise’ (‘realize’) and ‘colour’ (‘color’).
The pattern to match realise/realize is simple: \breali(s|z)e\b
The word boundary markers ensure we are looking at a complete word, and the alternation of (s|z) means that we match both versions.
Applying the ISRXMATCH formula demonstrates this is successful:

Validating Email Addresses

Given a list of email addresses in a column on a spreadsheet, we wish to ensure that these stick to a form which at least obeys some of the rules governing the format of email addresses. As these are going to be used by a script to send emails, we wish to minimise the number of undeliverable responses due to invalid addresses. The basic rules we specify for these addresses are as follows:
The username part of the address contains one or more alphanumeric characters, and possibly some additional special characters. This is followed by a single @ sign, followed by the domain name, which consists of one or more alphanumeric and special characters, ending with a dot followed by the top-level domain. This must contain only alphanumeric characters, and there must be between 2 and 6 of these. The address should be the entire content of the cell, so the beginning and ending anchors are used at the start and end of the pattern. Case is unimportant, so the case_sensitive flag is set to false.
The pattern is as follows: ^[a-z0-9_%.+-]+@[a-z0-9-.]+\.[a-z]{2,6}$
This is then used in ISRXMATCH – a valid email address according to our rules above will return true:

The second address in the list fails due to the whitespace in the username, whereas the fourth fails because the domain name does not include a top-level domain part of a dot followed by 2-6 letters.
I borrowed this regex from http://www.regular-expressions.info/email.html. As well as a couple of alternative regexes to cover some edge cases which the above doesn’t catch, this page also discusses why email addresses can be tricky, and why you shouldn’t go overboard trying to cover every exception.

An Undo Stack in VBA

I’m working on an Excel add-in which will allow a user to perform operations on cells. I’d also like to allow the user to ‘undo’ these operations. Excel itself, of course, does have an Undo button (shortcut Ctrl + Z), but this works only to undo changes made in the user interface, and is not going to help for changes made by VBA. This means the add-in will need to remember each operation performed and have them ready to supply for un-doing should the user want to do so.

Many people have created their own general solutions for undoing changes made in VBA – for instance, here’s one from Jan Karel Pieterse (who also created the hugely useful Name Manager add-in): Undo With Excel VBA. However I decided not to use a solution like this – it’s quite a lot of code to add to a project, and it is very general.

In this case I’m happy to write a more specific solution, which copes with undoing a specific action, by simply delegating the work of undoing the action to the object which did the action in the first place. Here’s an example:

' CellTest Class - allows calling code to change the value of a
' cell and then change it back

'*****************************************************************
' Private Variables
'*****************************************************************
Private mCell As Range
Private vOrigFormula As Variant

'*****************************************************************
' Public Properties
'*****************************************************************
Public Property Get Address() As String
  Address = mCell.Address
End Property

Public Property Set Cell(ByRef rCell As Range)
  Set mCell = rCell
  vOrigFormula = rCell.Formula
End Property

'*****************************************************************
' Public Methods
'*****************************************************************
Public Sub Change(ByVal sText As String)
  mCell.Value = sText
End Sub

Public Sub ChangeBack()
  mCell.Formula = vOrigFormula
End Sub

So what about storing up a list of operations for undoing? This is ideally suited to a stack. Stacks are neat data structures – they don’t do much, but they do it well and they’re really easy to code. Here’s one I picked up from the VBA Developer’s Handbook and tweaked a little to suit. There are two classes, StackItem and Stack. StackItem simply stores a value and a reference to another StackItem object – the next item on the stack.

' StackItem class holds a reference to the object it refers to and
' a reference to the next item in the stack

Public Value As Variant
Public NextItem As StackItem

Stack stores a reference to the item on the top of the stack, and exposes the functions Push (to push a new item on to the top) and Pop (to remove the top item and return a reference to it). Properties IsEmpty and Peek tell you if there are any items in the stack, and what the value of the top item is (without removing it), respectively. There’s a very nice visual explanation of how this looks here.

' Stack holds a reference to the top item in the stack data structure
' and manages the adding and removing of stack items

Private siTop As StackItem

'*****************************************************************
' Public Properties
'*****************************************************************
Public Property Get IsEmpty() As Boolean
    IsEmpty = siTop Is Nothing
End Property

Public Property Get Peek() As Variant
    If IsObject(siTop.Value) Then
        Set Peek = siTop.Value
    Else
        Peek = siTop.Value
    End If
End Property

'*****************************************************************
' Public Methods
'*****************************************************************
Public Function Push(ByRef varIn As Variant) As Boolean

    Dim siNew As StackItem

On Error GoTo PushError

    Set siNew = New StackItem
    If IsObject(varIn) Then
        Set siNew.Value = varIn
    Else
        siNew.Value = varIn
    End If
    Set siNew.NextItem = siTop
    Set siTop = siNew
    
    Push = True
    
Exit Function

PushError:
    Push = False
End Function

Public Function Pop() As Variant

    If IsObject(siTop.Value) Then
        Set Pop = siTop.Value
    Else
        Pop = siTop.Value
    End If
    Set siTop = siTop.NextItem

End Function

So now to store an actions for undoing, I just need to push the object which performed it onto the undo stack. Every time I wish to actually undo an action, I just pop the object reference off, and instruct it to undo.

Next time, I’ll demonstrate how this looks in practice.

Count Distinct Values

I’m sorry for the lack of posts over the last few months – what can I say, I’ve been busy. I had meant to continue the series on data access patterns from here and I put a revision of the code here, but I haven’t had the time to do much more. I may get back to it eventually, but in the meantime I thought I’d post something else which I’ve found dictionaries to be quite useful for.

The below is a function which gives a count of the distinct items in an array. As items are added, the dictionary will prevent duplication of key values – and the Count method gives the number of keys in the dictionary.

Public Function CountDistinct(ByRef vArr As Variant) As Variant

    Dim dict As Scripting.Dictionary
    Dim lVal As Long
    
    Set dict = New Scripting.Dictionary
    
    For lVal = LBound(vArr) To UBound(vArr)
        dict(vArr(lVal)) = vArr(lVal)
    Next lVal
    
    CountDistinct = dict.Count

End Function

?CountDistinct(array(1,2,3,3,4))
 4

A much simpler implementation is available in Python, using the set datatype:

>>> def CountDistinct(vals):
...   return len(set(vals))
... 
>>> CountDistinct([1, 2, 3, 3, 4])
4

Dictionary vs Collection

A pretty common requirement in programming is to be able to create a group of related objects, and put them into some sort of container. This means they can effectively be treated as one object, while still being accessible individually. For example, if I’m tracking the stats of a sports team over the course of a game, I could create a bunch of Player objects, all of which have individual attributes, but which I can view collectively as a Team object.

One way to group the Players would be to put them into a Team array. A weakness of this approach is that the only way to access one of the Players is to use their position in the array. This is ok for arrays where the positions of the members aren’t going to change, but if this isn’t the case it gets harder to keep track of things the longer the array persists in memory and the more changes need to be made to it. What is useful in this case is to be able to retrieve a Player object by using a key that uniquely identifies them, such as their name for instance, and not have to worry about which position in the Team array they are in.

The VB Collection object allows you to do this, by giving you the option of defining a ‘key’ value when the item is added to it – for instance, the below adds the range object identified by the name RangeName to a collection, and uses RangeName as the key:

Dim col As Collection
Dim rng As Excel.Range
Dim sRangeName As String
Set col = New Collection
sRangeName = "RangeName"
Set rng = Range(sRangeName)
col.Add rng, sRangeName

Now, when you wish to retrieve a particular item from the collection, you have the option of either specifying its position (index), or using the key value. If you know this was the first item added and its position hasn’t changed, you can use

col.Item(1)

(VB collections are 1-based). Otherwise, you can use

col.Item(sRangeName)

You can also iterate over the collection using For Each…Next syntax, which is both more concise and faster than accessing an array one item at a time.

Collections are fine – and pretty much all the container structures in the Excel object model use them, which is why you can say

For Each wb in Workbooks

or

For Each cell in rng.Cells

However, I hardly ever use them for implementing my own containers – where key-based lookup is required I prefer the VBScript Dictionary. The syntax for adding objects to a dictionary is very similar to adding items to a collection, apart from the reversal of the key and item arguments:

Dim dict As Scripting.Dictionary
Dim rng As Excel.Range
Dim sRangeName As String
Set dict = New Scripting.Dictionary
sRangeName = "RangeName"
Set rng = Range(sRangeName)
dict.Add sRangeName, rng

as is the retrieval of items:

dict.Item(sRangeName)

except for the fact that you have to use the key (which is safer in any case).

What really sells the dictionary to me over the collection are the following 3 methods:

dict.Exists(k) – boolean to let you know whether or not the key k has been added to the dictionary. This is very useful, as it avoids the need for the clunky idiom you’re forced to use to check the same thing with collections:

Public Function Exists(ByRef col As Collection, _
                    ByVal sKey As String) As Boolean

    Dim lCheck as Long
    On Error Resume Next
    lCheck = VarType(col.Item(sKey))
    If Err.Number = 0 Then
        Exists = True
    Else
        Exists = False
    End If

End Function

dict.Keys – Return an array containing all keys in the dictionary, which can then be iterated over. This for me is where the collection really falls down, as while it’s certainly possible to iterate over the items in a collection, the key cannot be returned from the item – which means it’s not available. A good example of why you might want this is the following from the CRXRowDict.Test function in the last post:

For iKey = LBound(mdicValues.Keys) To UBound(mdicValues.Keys)
        sKey = mdicValues.Keys(iKey)
        Set objRegex = mdicValues(sKey)
        sTest = clsRow.Item(sKey)
        Debug.Print objRegex.Pattern, sTest
        ' if any test fails, return false
        If Not objRegex.Test(sTest) Then
            bResult = False
            Exit For
        End If
    Next iKey

The key is required in two places – once to retrieve the matching regexp object from the dictionary and once to retrieve the test string from the CRowDict instance. The only way I could see to achieve the same result with a collection would be to populate and iterate over a separate array – not pretty.

dict.RemoveAll – As the name suggests, clean out all key-item pairs in the dictionary. No need to loop through either keys or items here, quick and easy.