Summing a random sample

A colleague came to me recently with a problem:
I have a bunch of numbers in a spreadsheet, and I need to take a random sample of 50 of them, and sum it. I can do the sample by random sorting, but I need to repeat this several times, so I’d like a function that I can repeat without having to muck around.

Well, sampling – I have some code which will do just that. After that, I think I should be able to manage summing the array.

Here’s the code for creating a random sample:

Private Sub SampleArray(ByRef avBigset As Variant, ByRef avSmallset As Variant)
    ' SampleArray populates a random sample avSmallset from an array avBigset
    ' without replacement (each element in avBigset is considered once only)
    Dim lRemainder As Long
    Dim lSize As Long
    Dim lOb As Long
    Dim lPickit As Long
        
    ' Make sure we're dealing with arrays...
    If Not IsArray(avBigset) Or Not IsArray(avSmallset) Then Exit Sub
    ' Initialise
    lRemainder = UBound(avBigset)
    lSize = UBound(avSmallset)
    Randomize 0
    
    Do While lSize > 0  ' Still some left to pick up
        lOb = lOb + 1
        If Rnd < lSize / lRemainder Then
            lPickit = lPickit + 1
            avSmallset(lPickit) = avBigset(lOb)
            lSize = lSize - 1
        End If
        lRemainder = lRemainder - 1
    Loop    ' Sample complete
    
End Sub

The issue is that it’s built to work with 1d arrays, so given a range as input in the function, I would need to push the values of the range into an array. I could do that with something simple like the below:

Private Function arr(ByRef rng As Range) As Variant
    ' Convert range to 1-D array
    Dim i As Long
    Dim j As Long
    Dim k As Long
    Dim rowCount As Long
    Dim colcount As Long
    Dim lim As Long
    Dim sArr() As String
    Dim vArr As Variant
    
    vArr = rng.Value2
    
    ' dimension array
    rowCount = UBound(vArr)
    colcount = UBound(vArr, 2)
    lim = rowCount * colcount
    ReDim sArr(1 To lim)
    k = 1
    
    ' populate
    For i = 1 To rowCount
        For j = 1 To colcount
            sArr(k) = vArr(i, j)
            k = k + 1
        Next j
    Next i
    
    arr = sArr
    
End Function

This will do the trick, but performance-wise it works out pretty horribly. The bigger the array, the longer this step will take, and it’s all useless, unnecessary overhead. The better approach is I think to revise the SampleArray function so it will work with a range rather than an array.

Actually this turns out to be very easy. The SampleArray code loops over an array, but with a range, we can just loop over the Cells collection. The rest of the code is pretty much identical, except that we’ll use a For Each…Next loop rather than Do While:

Private Sub SampleRange(ByRef rngBigset As Variant, ByRef avSample As Variant)
    ' SampleRange populates a random sample avSample from a range rngBigset
    ' without replacement (each element in rngBigset is considered once only)
    Dim lRemainder As Long
    Dim lSize As Long
    Dim lPickit As Long
    Dim rngCell As Excel.Range
    
    
    ' Initialise
    lRemainder = rngBigset.Cells.Count
    lSize = UBound(avSample)
    Randomize 0
    
    For Each rngCell In rngBigset.Cells
        If lSize <= 0 Then Exit For ' Sample complete
        If Rnd < lSize / lRemainder Then
            lPickit = lPickit + 1
            avSample(lPickit) = rngCell
            lSize = lSize - 1
        End If
        lRemainder = lRemainder - 1
    Next rngCell

End Sub

So is there a significant performance gain in doing it this way, taking out the array function? Unsurprisingly, yes there is. I created two functions, SAMPLESUM, which requires the conversion to an array beforehand:

Public Function SAMPLESUM(ByRef in_range As Range, ByVal sample_size As Long) As Variant
    ' Sum a random sample of size sample_size from in_range
    Dim sample_array() As Double
    Dim in_array As Variant
    Dim index As Long
    
    ' set up large and small arrays
    in_array = arr(in_range)
    ReDim sample_array(1 To WorksheetFunction.Min(UBound(in_array), sample_size))
    
    ' get sample
    SampleArray in_array, sample_array
    ' sum sample array
    SAMPLESUM = WorksheetFunction.Sum(sample_array)
    
End Function

and SAMPLESUM2, working directly with the range:

Public Function SAMPLESUM2(ByRef in_range As Range, ByVal sample_size As Long) As Variant
    ' Sum a random sample of size sample_size from in_range
    Dim sample_array() As Double
    Dim index As Long
    
    ' set up large and small arrays
    ReDim sample_array(1 To WorksheetFunction.Min(in_range.Cells.Count, sample_size))
    
    ' get sample
    SampleRange in_range, sample_array
    ' sum sample array
    SAMPLESUM2 = WorksheetFunction.Sum(sample_array)

End Function

I then set up a function which timed 100 calculations of each version, taking a sample of size 10 from a range of numbers with 50,000 rows and 10 columns:

Public Sub timeit()
Dim starttime As Double
Dim endtime As Double
Dim totaltime As Double
Dim i As Integer
Dim numrange As Excel.Range

Set numrange = Range("A1:J50000")


starttime = Timer

For i = 1 To 100
    SAMPLESUM numrange, 10
Next i

endtime = Timer
totaltime = endtime - starttime

Debug.Print "Time taken for 100 calculations (SAMPLESUM): " & totaltime & " seconds"

starttime = Timer

For i = 1 To 100
    SAMPLESUM2 numrange, 10
Next i

endtime = Timer
totaltime = endtime - starttime

Debug.Print "Time taken for 100 calculations (SAMPLESUM2): " & totaltime & " seconds"

End Sub

Which output:

Time taken for 100 calculations (SAMPLESUM): 44.140625 seconds
Time taken for 100 calculations (SAMPLESUM2): 12.546875 seconds

I think I’ll go with door number 2.

Advertisements

Excel 2007 speed test

Up until now, I have been using Excel 2003 at both home and work, and have had no strong urges to switch to 2007. Since its release I have seen rather a lot of comment about its performance, with assessments ranging from fair to rubbish, for instance on Newton Excel Bach and at keyongtech. The Excel team blog also has a recent post asking for examples from people who’ve been experiencing slower VBA performance with Excel 2007 than with 2003. So up until now I’ve held off. However, my employer is about 3 months away from rolling out Office 2007 nationwide, and so I thought it’d be worthwhile taking a look.

So this evening I downloaded a copy of the trial version of Office 2007 at home and tried out a (very basic) speed test. The code below repeats a primality test 10,000 times and outputs to cells in the active sheet:

Public Function ISPRIME(ByVal lTest As Long) As Boolean
‘ Returns TRUE for a prime, FALSE for composite
Dim lIn As Long

lIn = 2
ISPRIME = True
Do While lIn <= Sqr(lTest) If lTest Mod lIn = 0 Then ISPRIME = False Exit Do End If lIn = lIn + 1 Loop End Function Public Sub SpeedTest() Dim rngTest As Range Dim sglStart As Single Dim sglFinish As Single sglStart = Timer For Each rngTest In ActiveSheet.Range("A1:J1000") If ISPRIME(rngTest.Row + (rngTest.Column - 1) * 1000) Then rngTest.Value = "Prime!" Else: rngTest.Value = "Not prime..." End If Next rngTest sglFinish = Timer Debug.Print "Total seconds elapsed = " & sglFinish - sglStart End Sub[/sourcecode] Running this 10 times in 2003 gave the following results: Total seconds elapsed = 6 Total seconds elapsed = 6.15625 Total seconds elapsed = 6.171875 Total seconds elapsed = 5.953125 Total seconds elapsed = 6.203125 Total seconds elapsed = 6.234375 Total seconds elapsed = 6.6875 Total seconds elapsed = 6.203125 Total seconds elapsed = 5.953125 Total seconds elapsed = 5.9375 and in 2007: Total seconds elapsed = 9.84375 Total seconds elapsed = 9.296875 Total seconds elapsed = 9.203125 Total seconds elapsed = 9.171875 Total seconds elapsed = 9.21875 Total seconds elapsed = 9.140625 Total seconds elapsed = 9.140625 Total seconds elapsed = 10.09375 Total seconds elapsed = 9.1875 Total seconds elapsed = 9.171875 The main point of the exercise for me was to evaluate whether there was a significant difference in speed of data transfer from VBA to Excel. Not only has this been an issue for my code in the past (i.e. based on experience this is one of the first things I generally look at when optimising performance), but a number of people have commented that VBA in 2007 does this a lot slower than it does in 2003. While the results above are hardly concrete evidence, I think at first glance this does seem to be borne out. Incidentally, I tried running a similar test without transferring the data to a worksheet, using the code below: [sourcecode language='vb'] Public Sub SpeedTest2() Dim lTest As Long Dim bResult As Boolean Dim sglStart As Single Dim sglFinish As Single sglStart = Timer For lTest = 2 To 10000 bResult = ISPRIME(lTest) Next lTest sglFinish = Timer Debug.Print "Total seconds elapsed = " & sglFinish - sglStart End Sub [/sourcecode] and the results showed very little difference between the two (as I would have expected, really). Here's 2003: Total seconds elapsed = 0.078125 Total seconds elapsed = 0.078125 Total seconds elapsed = 0.0625 Total seconds elapsed = 0.0625 Total seconds elapsed = 0.078125 Total seconds elapsed = 0.078125 Total seconds elapsed = 0.09375 Total seconds elapsed = 0.078125 Total seconds elapsed = 0.078125 Total seconds elapsed = 0.078125 and in 2007: Total seconds elapsed = 0.09375 Total seconds elapsed = 0.078125 Total seconds elapsed = 0.0625 Total seconds elapsed = 0.078125 Total seconds elapsed = 0.078125 Total seconds elapsed = 0.078125 Total seconds elapsed = 0.078125 Total seconds elapsed = 0.078125 Total seconds elapsed = 0.078125 Total seconds elapsed = 0.078125