Summing a random sample

A colleague came to me recently with a problem:
I have a bunch of numbers in a spreadsheet, and I need to take a random sample of 50 of them, and sum it. I can do the sample by random sorting, but I need to repeat this several times, so I’d like a function that I can repeat without having to muck around.

Well, sampling – I have some code which will do just that. After that, I think I should be able to manage summing the array.

Here’s the code for creating a random sample:

Private Sub SampleArray(ByRef avBigset As Variant, ByRef avSmallset As Variant)
    ' SampleArray populates a random sample avSmallset from an array avBigset
    ' without replacement (each element in avBigset is considered once only)
    Dim lRemainder As Long
    Dim lSize As Long
    Dim lOb As Long
    Dim lPickit As Long
        
    ' Make sure we're dealing with arrays...
    If Not IsArray(avBigset) Or Not IsArray(avSmallset) Then Exit Sub
    ' Initialise
    lRemainder = UBound(avBigset)
    lSize = UBound(avSmallset)
    Randomize 0
    
    Do While lSize > 0  ' Still some left to pick up
        lOb = lOb + 1
        If Rnd < lSize / lRemainder Then
            lPickit = lPickit + 1
            avSmallset(lPickit) = avBigset(lOb)
            lSize = lSize - 1
        End If
        lRemainder = lRemainder - 1
    Loop    ' Sample complete
    
End Sub

The issue is that it’s built to work with 1d arrays, so given a range as input in the function, I would need to push the values of the range into an array. I could do that with something simple like the below:

Private Function arr(ByRef rng As Range) As Variant
    ' Convert range to 1-D array
    Dim i As Long
    Dim j As Long
    Dim k As Long
    Dim rowCount As Long
    Dim colcount As Long
    Dim lim As Long
    Dim sArr() As String
    Dim vArr As Variant
    
    vArr = rng.Value2
    
    ' dimension array
    rowCount = UBound(vArr)
    colcount = UBound(vArr, 2)
    lim = rowCount * colcount
    ReDim sArr(1 To lim)
    k = 1
    
    ' populate
    For i = 1 To rowCount
        For j = 1 To colcount
            sArr(k) = vArr(i, j)
            k = k + 1
        Next j
    Next i
    
    arr = sArr
    
End Function

This will do the trick, but performance-wise it works out pretty horribly. The bigger the array, the longer this step will take, and it’s all useless, unnecessary overhead. The better approach is I think to revise the SampleArray function so it will work with a range rather than an array.

Actually this turns out to be very easy. The SampleArray code loops over an array, but with a range, we can just loop over the Cells collection. The rest of the code is pretty much identical, except that we’ll use a For Each…Next loop rather than Do While:

Private Sub SampleRange(ByRef rngBigset As Variant, ByRef avSample As Variant)
    ' SampleRange populates a random sample avSample from a range rngBigset
    ' without replacement (each element in rngBigset is considered once only)
    Dim lRemainder As Long
    Dim lSize As Long
    Dim lPickit As Long
    Dim rngCell As Excel.Range
    
    
    ' Initialise
    lRemainder = rngBigset.Cells.Count
    lSize = UBound(avSample)
    Randomize 0
    
    For Each rngCell In rngBigset.Cells
        If lSize <= 0 Then Exit For ' Sample complete
        If Rnd < lSize / lRemainder Then
            lPickit = lPickit + 1
            avSample(lPickit) = rngCell
            lSize = lSize - 1
        End If
        lRemainder = lRemainder - 1
    Next rngCell

End Sub

So is there a significant performance gain in doing it this way, taking out the array function? Unsurprisingly, yes there is. I created two functions, SAMPLESUM, which requires the conversion to an array beforehand:

Public Function SAMPLESUM(ByRef in_range As Range, ByVal sample_size As Long) As Variant
    ' Sum a random sample of size sample_size from in_range
    Dim sample_array() As Double
    Dim in_array As Variant
    Dim index As Long
    
    ' set up large and small arrays
    in_array = arr(in_range)
    ReDim sample_array(1 To WorksheetFunction.Min(UBound(in_array), sample_size))
    
    ' get sample
    SampleArray in_array, sample_array
    ' sum sample array
    SAMPLESUM = WorksheetFunction.Sum(sample_array)
    
End Function

and SAMPLESUM2, working directly with the range:

Public Function SAMPLESUM2(ByRef in_range As Range, ByVal sample_size As Long) As Variant
    ' Sum a random sample of size sample_size from in_range
    Dim sample_array() As Double
    Dim index As Long
    
    ' set up large and small arrays
    ReDim sample_array(1 To WorksheetFunction.Min(in_range.Cells.Count, sample_size))
    
    ' get sample
    SampleRange in_range, sample_array
    ' sum sample array
    SAMPLESUM2 = WorksheetFunction.Sum(sample_array)

End Function

I then set up a function which timed 100 calculations of each version, taking a sample of size 10 from a range of numbers with 50,000 rows and 10 columns:

Public Sub timeit()
Dim starttime As Double
Dim endtime As Double
Dim totaltime As Double
Dim i As Integer
Dim numrange As Excel.Range

Set numrange = Range("A1:J50000")


starttime = Timer

For i = 1 To 100
    SAMPLESUM numrange, 10
Next i

endtime = Timer
totaltime = endtime - starttime

Debug.Print "Time taken for 100 calculations (SAMPLESUM): " & totaltime & " seconds"

starttime = Timer

For i = 1 To 100
    SAMPLESUM2 numrange, 10
Next i

endtime = Timer
totaltime = endtime - starttime

Debug.Print "Time taken for 100 calculations (SAMPLESUM2): " & totaltime & " seconds"

End Sub

Which output:

Time taken for 100 calculations (SAMPLESUM): 44.140625 seconds
Time taken for 100 calculations (SAMPLESUM2): 12.546875 seconds

I think I’ll go with door number 2.

Advertisements

Some new regular expression functions

There are a couple of SAS functions I have found to be quite useful, and that I miss when I’m working in Excel. Probably the more useful of the two is the SCAN function. You give SCAN a text string, an integer n and optionally some delimiters, and SCAN will give you back the nth ‘word’ in a string, where the ‘words’ are delimited either by the characters you supplied or by default delimiter characters. For instance,

data _null_;
word5 = SCAN("The/number(of+potential.interactions|among^a)large*group<of-people,is$proportional.to!the square/of!their&number.", 5);
put word5=;
run;

will print

word5=interactions

to the log. If n is negative, it will count backwards (from right to left). So

data _null_;
word7 = SCAN("The/number(of+potential.interactions|among^a)large*group<of-people,is$proportional.to!the square/of!their&number.",-7);
put word7=;
run;

will print

word7=proportional

There’s also the function COUNTW, which will also take a text string and some delimiters as arguments, and returns the number of words in the string.

data _null_;
count = COUNTW("Consider, for example, the question of tax policy.");
put count=;
run;

produces

count=8

Once again regular expressions turn out to be ideally suited to translating this functionality to Excel worksheet functions. I’ve included them in Regex.xla as RXSCAN and RXCOUNTW:

Public Function RXSCAN(ByVal stringval As String, ByVal n As Integer, _
                        Optional ByVal dlm As String = "") As Variant
' RXSCAN - Returns the nth word from stringval, where
' words are delimited by any non-word character by default,
' or a specific character set by optional string dlm.

Dim objRegex As VBScript_RegExp_55.RegExp
Dim colMatch As VBScript_RegExp_55.MatchCollection
Dim charlist As String
Dim char As String
Dim i As Integer

' Build pattern
If dlm = "" Then ' Just looking for A-Za-z0-9_
  charlist = "\w+"
Else ' Negate character class specified
  charlist = "[^"
  For i = 1 To Len(dlm)
    char = Mid$(dlm, i, 1)
    If InStr("[\^-", char) Then char = "\" & char ' escape
    charlist = charlist & char
  Next i
  charlist = charlist & "]+"
End If

' Initialise Regexp object, get match collection
Set objRegex = New VBScript_RegExp_55.RegExp
With objRegex
  .Pattern = charlist
  .Global = True
  Set colMatch = .Execute(stringval)
End With

If colMatch.Count = 0 Then ' No words detected
    RXSCAN = CVErr(xlErrNA)
Else
  If Abs(n) > colMatch.Count Or n = 0 Then
    RXSCAN = CVErr(xlErrNum)
  Else
    If n > 0 Then ' Read from left to right
      RXSCAN = colMatch(n - 1)
    Else ' Right to left
      RXSCAN = colMatch(colMatch.Count + n)
    End If
  End If
End If

End Function

Public Function RXCOUNTW(ByVal stringval As String, _
                        Optional ByVal dlm As String = "") As Variant
' RXCOUNTW - Returns the count of words from stringval, where
' words are delimited by any non-word character by default,
' or a specific character set by optional string dlm.
Dim objRegex As VBScript_RegExp_55.RegExp
Dim colMatch As VBScript_RegExp_55.MatchCollection
Dim charlist As String
Dim char As String
Dim i As Integer

' Build pattern
If dlm = "" Then ' Just looking for A-Za-z0-9_
  charlist = "\w+"
Else ' Negate character class specified
  charlist = "[^"
  For i = 1 To Len(dlm)
    char = Mid$(dlm, i, 1)
    If InStr("[\^-", char) Then char = "\" & char ' escape
    charlist = charlist & char
  Next i
  charlist = charlist & "]+"
End If

' Initialise Regexp object, get match collection
Set objRegex = New VBScript_RegExp_55.RegExp
With objRegex
  .Pattern = charlist
  .Global = True
  Set colMatch = .Execute(stringval)
End With

If colMatch.Count = 0 Then ' No words detected
  RXCOUNTW = CVErr(xlErrNA)
Else
  RXCOUNTW = colMatch.Count
End If

End Function

I’ve also moved the add-in file to a dropbox folder. You can now find Regex.xla at http://dl.dropbox.com/u/29927305/Regex.xla

Regex Functions in Excel, part 3

Part 3 of 4

I’ve left this project alone for quite a while, until finally this weekend I had a little time to put into working on it. In part 2 I was still a little undecided as to how I was going to implement some of the worksheet functions, but now I am more or less satisfied with them. I’ve bundled the functions into Regex.xla, which is now available for download. The project is open for viewing, so code may be inspected or modified as you see fit – please let me know if you discover any errors or anything that could have been done better/smarter.

This is the first general purpose add-in I’ve created (I have written several application-specific add-ins), so I’d be interested to know if anyone has any feedback on this one. Some things I have not included which I’d like to:

  • Help files – I have started writing some html to provide users with an explanation of the functions and examples of their use, but I’m finding this very time-intensive. Next release I’ll include them.
  • A Regex Find/Replace dialog – still a work in progress, I’ve decided to leave this until I make a decision about whether to release the add-in as a compiled .dll file or just leave it as a .xla. At this point I have yet to test whether or not there would be any performance benefit in compiling the add-in in VB6 (or VB.NET?), which would be the key benefit I see in doing so.
  • A basic regular expression syntax included – while there are plenty of good sites to look at for regex syntax, it’d be great to include a quick overview of how they work and maybe a short list of FAQ-type answers to the usual ‘how do I do x?’ questions. Next release.

So, what’s included so far? Rather than post the entire source code for all five functions, I thought I’d put in all the header comments here, as they provide a reasonable explanation of their purpose (Roy MacLean wrote an excellent post on comments in code which motivated me putting these in – ordinarily until now my commenting in the header has been pretty sparse). So here they are:

Public Function RXFIND(ByRef find_pattern As Variant, _
                        ByRef within_text As Variant, _
                        Optional ByVal start_num As Long, _
                        Optional ByVal case_sensitive As Boolean) As Long
' RXFIND - Returns the starting position of text matching the regex pattern
' find_pattern in the string within_text, if a match is found.
' Optional long start_num specifies the number of the character in within_text
' to start matching. Default=0.
' Optional boolean case_sensitive makes the pattern case sensitive if true,
' insensitive otherwise. Default=true.

Public Function ISRXMATCH(ByRef find_pattern As Variant, _
                        ByRef within_text As Variant, _
                        Optional ByVal case_sensitive As Boolean) As Boolean
' ISRXMATCH - Returns true if the regex pattern find_pattern is matched
' in the string within_text, false otherwise.
' Optional boolean case_sensitive makes the pattern case sensitive if true,
' insensitive otherwise. Default=true.

Public Function RXGET(ByRef find_pattern As Variant, _
                        ByRef within_text As Variant, _
                        Optional ByVal submatch As Long, _
                        Optional ByVal start_num As Long, _
                        Optional ByVal case_sensitive As Boolean) As String
' RXGET - Looks for a match for regular expression pattern find_pattern
' in the string within_text and returns it if found, error otherwise.
' Optional long submatch may be used to return the corresponding submatch
' if specified - otherwise the entire match is returned.
' Optional long start_num specifies the number of the character to start
' searching for in within_text. Default=0.
' Optional boolean case_sensitive makes the regex pattern case sensitive
' if true, insensitive otherwise. Default=true.

Public Function RXMATCH(ByVal find_pattern As Variant, _
                    ByVal within_range As Variant, _
                    Optional ByVal case_sensitive As Boolean) As Long
' RXMATCH - Searches for the first cell matching regular expression pattern
' find_pattern in the range argument within_range.
' Optional boolean case_sensitive makes the regular expression case
' sensitive if true, insensitive otherwise. Default=true.
' within_range should be a single row or column range. If the range has
' multiple rows and columns, only the first row is searched.

Public Function RXSUB(ByRef within_text As Variant, _
                    ByRef old_text As Variant, _
                    ByRef new_text As Variant, _
                    Optional ByVal start_num As Long, _
                    Optional ByVal case_sensitive As Boolean, _
                    Optional ByVal is_global As Boolean) As String
' RXSUB - Function to substitute text matching regex pattern old_text with
' string argument new_text in the string within_text.
' Optional long start_num specifies which character to start replacement from,
' default=0.
' Optional boolean case_sensitive will make the pattern case sensitive if true
' and insensitive otherwise. Default=true.
' Optional boolean is_global will substitute new_text for *all* occurrences
' matching old_text if true, only the first otherwise. Default=false.

In my final post on this project, I’ll include some examples of these functions in use, and hopefully have an updated add-in with help files and a find/replace dialog. Maybe also a discussion of why I did or didn’t go with a compiled .dll. Maybe also a regex syntax. Maybe I’ll need to make this series into 5 posts…

[EDIT: I’ve changed the download link to a fileden url, this can be accessed directly now.]

The subtle art of indirection

…any problem in computing can be solved by adding another level of indirection…

Butler Lampson in Authentication in distributed systems: Theory and practice

I have fond memories of the first time I cracked open a book on programming in C, and encountered pointers. Pointers are variables which hold the memory address of another variable, which the programmer can then manipulate. Without going into detail about why they are or aren’t a good thing (there are some quite vigorous arguments over that), I mention them in connection with the indirection operator *, which acts on a pointer p by dereferencing it, returning the contents of the object that it points to.

Indirection in C has a (sort of) counterpart in the Excel INDIRECT() function. INDIRECT takes a string argument and evaluates it, returning the value of whatever that string refers to. So placing =INDIRECT(B1) in cell A1 creates a direct reference to cell B1. If cell B1 contains the text “C1”, then “C1” is the argument INDIRECT evaluates. A1 will contain an indirect reference to cell C1, and will return the value contained in C1. Although this might not seem that useful at first (you could after all obtain the same result by just entering “=C1” in cell A1), the use of indirect references is a powerful tool for building dynamic formulae. Chip Pearson’s site has some excellent examples of these.

One particular example I have used in the past relates to Data Validation. A common scenario with validation is to select “List”, and then to enter a named range reference in the source box. For instance, supposing you wish to restrict entries to a list of month names, contained in a range named “Months” – so, you enter “Months” in the source box. But if you wish to change the source, you need to go through the Data -> Validation dialog, and hard-code the new range reference in there. An alternative is to enter the name of the source range in a cell, and then return the source range to the validation dialog by entering “=INDIRECT(<cell address>)” in the source box. This now means that updating the cell with a new range reference will automatically update the validation applied.

Regex Functions in Excel

Part 2 of 4

Continuing with my exploration of regular expressions in Excel, from Part 1 – so far I’ve got to the stage of writing a couple of functions, and planning a few more. As promised, I’ll detail here what the add-in I’m constructing should deliver to the user.

First, a suite of worksheet functions:

  • RXFIND(find_pattern, within_text, [start_num], [case_sensitive])
    – This is probably the easiest one to get right – a simple emulation of the functionality of the Excel built-in function FIND. As with FIND, the inputs are the pattern to test for, followed by the text to look within. Optionally, the start position can be specified (Default=0) as can case-sensitivity (Default=TRUE). Although Excel implements a separate SEARCH function which is case-insensitive, I’d rather save myself repeating the same code and added a parameter in to allow the user to specify whether case matters or not. Parenthetically, my belief, well-founded or not, is that case always matters.
    Here’s the code I’ve come up with so far: 

    Public Function RXFIND(ByRef find_pattern As Variant, _
                        ByRef within_text As Variant, _
                        Optional ByVal start_num As Variant, _
                        Optional ByVal case_sensitive As Boolean) As Long     
    
    Dim objRegex As VBScript_RegExp_55.RegExp
    Dim colMatch As VBScript_RegExp_55.MatchCollection
    Dim vbsMatch As VBScript_RegExp_55.Match
    Dim sMatchString As String
    
    Set objRegex = New VBScript_RegExp_55.RegExp
    
    ' Initialise Regex object
    With objRegex
        .Global = False
        ' Default is case sensitive
        If IsMissing(case_sensitive) Then
            .IgnoreCase = False
        Else: .IgnoreCase = Not case_sensitive
        End If
        .Pattern = find_pattern
    End With
    
    If IsMissing(start_num) Then start_num = 0
    sMatchString = Right$(within_text, Len(within_text) - start_num)
    
    ' Create Match collection
    Set colMatch = objRegex.Execute(sMatchString)
    If colMatch.Count = 0 Then ' No match
        RXFIND = 0
    Else
        Set vbsMatch = colMatch(0)
        RXFIND = vbsMatch.FirstIndex + start_num + 1
    End If
    End Function
  • RXSUB(text,  old_text,  new_text, [instance_num], [case_sensitive], [is_global])
    – This one is similar to the Excel built-in SUBSTITUTE, although as with RXFIND, it allows a couple of extra parameters. It’s also a little trickier to code, as the optional is_global parameter (Default=FALSE) set to TRUE renders instance_num (Default=0) meaningless. I still haven’t got this in a fit state to post the code yet, but it differs only a little from RXFIND, in that instead of using the FirstIndex property and Execute method of the RegExp object, it only needs to access the Replace method, and of course it returns a string rather than a long. Work in progress.
  • RXGET(find_pattern, within_text, [start_num], [case_sensitive])
    I still haven’t completely figured out what to have this one do yet, but it’s going to rely on the Value property of the Match object, and will possibly allow for returning Submatched groups. Another work in progress.
  • RXMATCH(find_pattern, within_array, [case_sensitive])
    Returns the position of the first exact match with the find_pattern from within_array. Yet to begin.

As well as worksheet functions, I’ve been tossing around the idea of a Find/Replace dialog with regex bells and whistles. I have got no further than mimicking the Excel built-in dialog so far (i.e. no functionality as yet), but I am hopeful I’ll have something that actually does something soon. For reasons I’ll explain a little further in the next post, I am undecided as to whether I should stick with the standard user form. Anyway, that’ll do for now…next time I’ll talk a little about some of the key decisions that need to be made – late vs early binding, .NET vs the VBScript library, VB6 add-in vs .xla.

The perils of large spreadsheets

Ok, we’ve all been there. What started out as a relatively compact workbook has blossomed into a gargantuan mess that requires calculation set to Manual to change anything, and takes a full minute just to open. It’s not nice, particularly when the workbook in question is one that is relied upon heavily. I “owned”, or should I say “inherited”, one of these a couple of years back, which was used by over a hundred people in a call centre – and loathed by all of them. I didn’t like it much either, as I had to field all the calls and emails asking for it to be “fixed”.

So what to do? In a sense, of course, there’s nothing to fix, as long as everything does recalculate as it should. If not, or worse, if Excel crashes when it’s being used, then there is a problem, which usually warrants a complete redesign. Unfortunately that happens more often than the average user realises. While Excel’s memory management is continually improving with new releases (and the average memory available to it is increasing), you can only push it so far before it gives up the ghost. However, let’s assume for the sake of argument you’re not in that awful place right now. So while there’s nothing broken as such, the performance you’re getting out of it is not what it should be. What can you do to improve it? Here’s a couple of things to try.

  • Cull unnecessary data. There are cases where workbooks (particularly some financial models) do require several sheets with thousands of rows. But it’s worth questioning whether it’s all absolutely necessary. And does everything need to be in the same workbook? It may be that you can get significant gains by splitting the workbook into two or more pieces, which can then be hyperlinked – and by grouping related data this can make for a more user-friendly tool.
  • Re-use the same cache for your pivot tables if you can. Excel will very helpfully prompt you to do just that when it realises you’re about to create a pivot table using the same range or database that forms the cache for another pivot table. It also explains why – you will save memory and your workbook will be smaller. This is almost certainly a good thing. And don’t make the mistake of creating a separate cache for a subset of a previously used range (another common error) – you can always use the whole thing and exclude what you don’t want from within the pivot table after it’s been created.
  • Avoid formula repetition, wherever possible. Too many people take the “Jurassic Park” approach to building formulas, and assume that just because it is possible to do something, they should do it. Not true – here’s an example: =IF(VLOOKUP($M6,$B$2:$H$100,3,FALSE)>AVERAGE($B$2:$B6),
    VLOOKUP($M6,$B$2:$H$100,3,FALSE)-AVERAGE($B$2:$B6),
    VLOOKUP($M6,$B$2:$H$100,3,FALSE)+AVERAGE($B$2:$B6))

    Now you may ask, what’s wrong with that? Well, look at it from Excel’s point of view:
    Excel: So what do you want me to do?
    You: Well, first look up this thing in the first column of that table, and go along 3 columns from it and get the value there.
    Excel: Got it – easy!
    You: Good. Now take the mean of these numbers here and compare it with the number you just got. If it’s smaller, then look up this thing in the first column of that table, go along 3 columns and get the value there.
    Excel: Wait – you asked me to do that already.
    You: Well now I want you to do it again.
    Excel: OK then – it’s your party.
    You: Right. Now take the mean of these numbers here,…
    Excel: Seriously? I just did that before.
    You: Do it again, and subtract it from the number you just got.
    Excel: Alright – done. What if the number wasn’t smaller?
    You: Well, in that case, look up this thing in this table here…
    Excel: I can feel a migraine coming on…

    You get the picture. Of course, the problem is that Excel won’t complain when you ask it to do something silly like this – but any reasonable person would suggest that you calculate the results of the VLOOKUP and AVERAGE functions once, in separate cells, and then use the values however you like, as many times as you like, without creating a job that takes twice as much recalculation as would otherwise be necessary.
  • A related issue is the (over)use of VLOOKUP. Where you need to look up the same value more than once, even in separate cells, returning data from different columns, Excel is still having to repeat the same task unnecessarily. MATCH is your friend here, in combination with either OFFSET or INDEX. For example:
    A1=(VLOOKUP(M6,$B$2:$H$100,3,FALSE)
    A2=(VLOOKUP(M6,$B$2:$H$100,4,FALSE)
    A3=(VLOOKUP(M6,$B$2:$H$100,7,FALSE)

    could be replaced with:
    A1=MATCH(M6,$B$2:$H$100,0)
    A2=INDEX($D$2:$D$100,A1)
    A3=INDEX($E$2:$E$100,A1)
    A4=INDEX($H$2:$H$100,A1)

    Over big tables this can represent a huge saving in calculation time.

And that’s all I’ve got time for right now. Good luck!