Summing a random sample

A colleague came to me recently with a problem:
I have a bunch of numbers in a spreadsheet, and I need to take a random sample of 50 of them, and sum it. I can do the sample by random sorting, but I need to repeat this several times, so I’d like a function that I can repeat without having to muck around.

Well, sampling – I have some code which will do just that. After that, I think I should be able to manage summing the array.

Here’s the code for creating a random sample:

Private Sub SampleArray(ByRef avBigset As Variant, ByRef avSmallset As Variant)
    ' SampleArray populates a random sample avSmallset from an array avBigset
    ' without replacement (each element in avBigset is considered once only)
    Dim lRemainder As Long
    Dim lSize As Long
    Dim lOb As Long
    Dim lPickit As Long
        
    ' Make sure we're dealing with arrays...
    If Not IsArray(avBigset) Or Not IsArray(avSmallset) Then Exit Sub
    ' Initialise
    lRemainder = UBound(avBigset)
    lSize = UBound(avSmallset)
    Randomize 0
    
    Do While lSize > 0  ' Still some left to pick up
        lOb = lOb + 1
        If Rnd < lSize / lRemainder Then
            lPickit = lPickit + 1
            avSmallset(lPickit) = avBigset(lOb)
            lSize = lSize - 1
        End If
        lRemainder = lRemainder - 1
    Loop    ' Sample complete
    
End Sub

The issue is that it’s built to work with 1d arrays, so given a range as input in the function, I would need to push the values of the range into an array. I could do that with something simple like the below:

Private Function arr(ByRef rng As Range) As Variant
    ' Convert range to 1-D array
    Dim i As Long
    Dim j As Long
    Dim k As Long
    Dim rowCount As Long
    Dim colcount As Long
    Dim lim As Long
    Dim sArr() As String
    Dim vArr As Variant
    
    vArr = rng.Value2
    
    ' dimension array
    rowCount = UBound(vArr)
    colcount = UBound(vArr, 2)
    lim = rowCount * colcount
    ReDim sArr(1 To lim)
    k = 1
    
    ' populate
    For i = 1 To rowCount
        For j = 1 To colcount
            sArr(k) = vArr(i, j)
            k = k + 1
        Next j
    Next i
    
    arr = sArr
    
End Function

This will do the trick, but performance-wise it works out pretty horribly. The bigger the array, the longer this step will take, and it’s all useless, unnecessary overhead. The better approach is I think to revise the SampleArray function so it will work with a range rather than an array.

Actually this turns out to be very easy. The SampleArray code loops over an array, but with a range, we can just loop over the Cells collection. The rest of the code is pretty much identical, except that we’ll use a For Each…Next loop rather than Do While:

Private Sub SampleRange(ByRef rngBigset As Variant, ByRef avSample As Variant)
    ' SampleRange populates a random sample avSample from a range rngBigset
    ' without replacement (each element in rngBigset is considered once only)
    Dim lRemainder As Long
    Dim lSize As Long
    Dim lPickit As Long
    Dim rngCell As Excel.Range
    
    
    ' Initialise
    lRemainder = rngBigset.Cells.Count
    lSize = UBound(avSample)
    Randomize 0
    
    For Each rngCell In rngBigset.Cells
        If lSize <= 0 Then Exit For ' Sample complete
        If Rnd < lSize / lRemainder Then
            lPickit = lPickit + 1
            avSample(lPickit) = rngCell
            lSize = lSize - 1
        End If
        lRemainder = lRemainder - 1
    Next rngCell

End Sub

So is there a significant performance gain in doing it this way, taking out the array function? Unsurprisingly, yes there is. I created two functions, SAMPLESUM, which requires the conversion to an array beforehand:

Public Function SAMPLESUM(ByRef in_range As Range, ByVal sample_size As Long) As Variant
    ' Sum a random sample of size sample_size from in_range
    Dim sample_array() As Double
    Dim in_array As Variant
    Dim index As Long
    
    ' set up large and small arrays
    in_array = arr(in_range)
    ReDim sample_array(1 To WorksheetFunction.Min(UBound(in_array), sample_size))
    
    ' get sample
    SampleArray in_array, sample_array
    ' sum sample array
    SAMPLESUM = WorksheetFunction.Sum(sample_array)
    
End Function

and SAMPLESUM2, working directly with the range:

Public Function SAMPLESUM2(ByRef in_range As Range, ByVal sample_size As Long) As Variant
    ' Sum a random sample of size sample_size from in_range
    Dim sample_array() As Double
    Dim index As Long
    
    ' set up large and small arrays
    ReDim sample_array(1 To WorksheetFunction.Min(in_range.Cells.Count, sample_size))
    
    ' get sample
    SampleRange in_range, sample_array
    ' sum sample array
    SAMPLESUM2 = WorksheetFunction.Sum(sample_array)

End Function

I then set up a function which timed 100 calculations of each version, taking a sample of size 10 from a range of numbers with 50,000 rows and 10 columns:

Public Sub timeit()
Dim starttime As Double
Dim endtime As Double
Dim totaltime As Double
Dim i As Integer
Dim numrange As Excel.Range

Set numrange = Range("A1:J50000")


starttime = Timer

For i = 1 To 100
    SAMPLESUM numrange, 10
Next i

endtime = Timer
totaltime = endtime - starttime

Debug.Print "Time taken for 100 calculations (SAMPLESUM): " & totaltime & " seconds"

starttime = Timer

For i = 1 To 100
    SAMPLESUM2 numrange, 10
Next i

endtime = Timer
totaltime = endtime - starttime

Debug.Print "Time taken for 100 calculations (SAMPLESUM2): " & totaltime & " seconds"

End Sub

Which output:

Time taken for 100 calculations (SAMPLESUM): 44.140625 seconds
Time taken for 100 calculations (SAMPLESUM2): 12.546875 seconds

I think I’ll go with door number 2.