Summing a random sample

A colleague came to me recently with a problem:
I have a bunch of numbers in a spreadsheet, and I need to take a random sample of 50 of them, and sum it. I can do the sample by random sorting, but I need to repeat this several times, so I’d like a function that I can repeat without having to muck around.

Well, sampling – I have some code which will do just that. After that, I think I should be able to manage summing the array.

Here’s the code for creating a random sample:

Private Sub SampleArray(ByRef avBigset As Variant, ByRef avSmallset As Variant)
    ' SampleArray populates a random sample avSmallset from an array avBigset
    ' without replacement (each element in avBigset is considered once only)
    Dim lRemainder As Long
    Dim lSize As Long
    Dim lOb As Long
    Dim lPickit As Long
        
    ' Make sure we're dealing with arrays...
    If Not IsArray(avBigset) Or Not IsArray(avSmallset) Then Exit Sub
    ' Initialise
    lRemainder = UBound(avBigset)
    lSize = UBound(avSmallset)
    Randomize 0
    
    Do While lSize > 0  ' Still some left to pick up
        lOb = lOb + 1
        If Rnd < lSize / lRemainder Then
            lPickit = lPickit + 1
            avSmallset(lPickit) = avBigset(lOb)
            lSize = lSize - 1
        End If
        lRemainder = lRemainder - 1
    Loop    ' Sample complete
    
End Sub

The issue is that it’s built to work with 1d arrays, so given a range as input in the function, I would need to push the values of the range into an array. I could do that with something simple like the below:

Private Function arr(ByRef rng As Range) As Variant
    ' Convert range to 1-D array
    Dim i As Long
    Dim j As Long
    Dim k As Long
    Dim rowCount As Long
    Dim colcount As Long
    Dim lim As Long
    Dim sArr() As String
    Dim vArr As Variant
    
    vArr = rng.Value2
    
    ' dimension array
    rowCount = UBound(vArr)
    colcount = UBound(vArr, 2)
    lim = rowCount * colcount
    ReDim sArr(1 To lim)
    k = 1
    
    ' populate
    For i = 1 To rowCount
        For j = 1 To colcount
            sArr(k) = vArr(i, j)
            k = k + 1
        Next j
    Next i
    
    arr = sArr
    
End Function

This will do the trick, but performance-wise it works out pretty horribly. The bigger the array, the longer this step will take, and it’s all useless, unnecessary overhead. The better approach is I think to revise the SampleArray function so it will work with a range rather than an array.

Actually this turns out to be very easy. The SampleArray code loops over an array, but with a range, we can just loop over the Cells collection. The rest of the code is pretty much identical, except that we’ll use a For Each…Next loop rather than Do While:

Private Sub SampleRange(ByRef rngBigset As Variant, ByRef avSample As Variant)
    ' SampleRange populates a random sample avSample from a range rngBigset
    ' without replacement (each element in rngBigset is considered once only)
    Dim lRemainder As Long
    Dim lSize As Long
    Dim lPickit As Long
    Dim rngCell As Excel.Range
    
    
    ' Initialise
    lRemainder = rngBigset.Cells.Count
    lSize = UBound(avSample)
    Randomize 0
    
    For Each rngCell In rngBigset.Cells
        If lSize <= 0 Then Exit For ' Sample complete
        If Rnd < lSize / lRemainder Then
            lPickit = lPickit + 1
            avSample(lPickit) = rngCell
            lSize = lSize - 1
        End If
        lRemainder = lRemainder - 1
    Next rngCell

End Sub

So is there a significant performance gain in doing it this way, taking out the array function? Unsurprisingly, yes there is. I created two functions, SAMPLESUM, which requires the conversion to an array beforehand:

Public Function SAMPLESUM(ByRef in_range As Range, ByVal sample_size As Long) As Variant
    ' Sum a random sample of size sample_size from in_range
    Dim sample_array() As Double
    Dim in_array As Variant
    Dim index As Long
    
    ' set up large and small arrays
    in_array = arr(in_range)
    ReDim sample_array(1 To WorksheetFunction.Min(UBound(in_array), sample_size))
    
    ' get sample
    SampleArray in_array, sample_array
    ' sum sample array
    SAMPLESUM = WorksheetFunction.Sum(sample_array)
    
End Function

and SAMPLESUM2, working directly with the range:

Public Function SAMPLESUM2(ByRef in_range As Range, ByVal sample_size As Long) As Variant
    ' Sum a random sample of size sample_size from in_range
    Dim sample_array() As Double
    Dim index As Long
    
    ' set up large and small arrays
    ReDim sample_array(1 To WorksheetFunction.Min(in_range.Cells.Count, sample_size))
    
    ' get sample
    SampleRange in_range, sample_array
    ' sum sample array
    SAMPLESUM2 = WorksheetFunction.Sum(sample_array)

End Function

I then set up a function which timed 100 calculations of each version, taking a sample of size 10 from a range of numbers with 50,000 rows and 10 columns:

Public Sub timeit()
Dim starttime As Double
Dim endtime As Double
Dim totaltime As Double
Dim i As Integer
Dim numrange As Excel.Range

Set numrange = Range("A1:J50000")


starttime = Timer

For i = 1 To 100
    SAMPLESUM numrange, 10
Next i

endtime = Timer
totaltime = endtime - starttime

Debug.Print "Time taken for 100 calculations (SAMPLESUM): " & totaltime & " seconds"

starttime = Timer

For i = 1 To 100
    SAMPLESUM2 numrange, 10
Next i

endtime = Timer
totaltime = endtime - starttime

Debug.Print "Time taken for 100 calculations (SAMPLESUM2): " & totaltime & " seconds"

End Sub

Which output:

Time taken for 100 calculations (SAMPLESUM): 44.140625 seconds
Time taken for 100 calculations (SAMPLESUM2): 12.546875 seconds

I think I’ll go with door number 2.

Random Sampling in VBA

Occasionally we get asked to provide random samples from our database, and it’s the sort of thing that can be done very easily in SAS.

For example, here’s a SAS routine I have often used to extract a sample sampleset of size 100 without replacement from a large dataset bigset (I can’t remember where this code originally came from but I definitely can’t take credit for it):

data sampleset (drop=remainder size);
size=100;
remainder=totobs;

do while (size > 0);
pickit + 1;
if ranuni(0) < size/remainder then do; set bigset point=pickit nobs=totobs; output; size=size - 1; end; remainder=remainder - 1; end; stop; run;[/sourcecode] While this is usually sufficient, I have sometimes wondered whether it would be possible to use the same logic to extract a sample in VBA. So tonight I decided to test it out. The following code appears to do the trick quite nicely, but be warned, I haven't done a lot of testing yet. [sourcecode language='vb']Public Sub SampleArray(ByRef avBigset As Variant, ByRef avSmallset As Variant) ' SampleArray populates a random sample avSmallset from an array avBigset ' without replacement (each element in avBigset is considered once only) Dim lRemainder As Long Dim lSize As Long Dim lOb As Long Dim lPickit As Long ' Make sure we're dealing with arrays... If Not IsArray(avBigset) Or Not IsArray(avSmallset) Then Exit Sub ' Initialise lRemainder = UBound(avBigset) lSize = UBound(avSmallset) Randomize 0 Do While lSize > 0 ‘ Still some left to pick up
lOb = lOb + 1
If Rnd < lSize / lRemainder Then lPickit = lPickit + 1 avSmallset(lPickit) = avBigset(lOb) lSize = lSize - 1 End If lRemainder = lRemainder - 1 Loop ' Sample complete End Sub[/sourcecode] It coped fairly well with populating a sample array of size 1,000 from an array of size 10,000,000, but as I say I haven't tested too extensively so I don't yet know its limits, or whether there is a much more efficient method out there. Has anyone else used VBA in this way? What sort of method would you use?