A colleague came to me recently with a problem:

I have a bunch of numbers in a spreadsheet, and I need to take a random sample of 50 of them, and sum it. I can do the sample by random sorting, but I need to repeat this several times, so I’d like a function that I can repeat without having to muck around.

Well, sampling – I have some code which will do just that. After that, I think I should be able to manage summing the array.

Here’s the code for creating a random sample:

Private Sub SampleArray(ByRef avBigset As Variant, ByRef avSmallset As Variant)
' SampleArray populates a random sample avSmallset from an array avBigset
' without replacement (each element in avBigset is considered once only)
Dim lRemainder As Long
Dim lSize As Long
Dim lOb As Long
Dim lPickit As Long
' Make sure we're dealing with arrays...
If Not IsArray(avBigset) Or Not IsArray(avSmallset) Then Exit Sub
' Initialise
lRemainder = UBound(avBigset)
lSize = UBound(avSmallset)
Randomize 0
Do While lSize > 0 ' Still some left to pick up
lOb = lOb + 1
If Rnd < lSize / lRemainder Then
lPickit = lPickit + 1
avSmallset(lPickit) = avBigset(lOb)
lSize = lSize - 1
End If
lRemainder = lRemainder - 1
Loop ' Sample complete
End Sub

The issue is that it’s built to work with 1d arrays, so given a range as input in the function, I would need to push the values of the range into an array. I could do that with something simple like the below:

Private Function arr(ByRef rng As Range) As Variant
' Convert range to 1-D array
Dim i As Long
Dim j As Long
Dim k As Long
Dim rowCount As Long
Dim colcount As Long
Dim lim As Long
Dim sArr() As String
Dim vArr As Variant
vArr = rng.Value2
' dimension array
rowCount = UBound(vArr)
colcount = UBound(vArr, 2)
lim = rowCount * colcount
ReDim sArr(1 To lim)
k = 1
' populate
For i = 1 To rowCount
For j = 1 To colcount
sArr(k) = vArr(i, j)
k = k + 1
Next j
Next i
arr = sArr
End Function

This will do the trick, but performance-wise it works out pretty horribly. The bigger the array, the longer this step will take, and it’s all useless, unnecessary overhead. The better approach is I think to revise the SampleArray function so it will work with a range rather than an array.

Actually this turns out to be very easy. The SampleArray code loops over an array, but with a range, we can just loop over the Cells collection. The rest of the code is pretty much identical, except that we’ll use a For Each…Next loop rather than Do While:

Private Sub SampleRange(ByRef rngBigset As Variant, ByRef avSample As Variant)
' SampleRange populates a random sample avSample from a range rngBigset
' without replacement (each element in rngBigset is considered once only)
Dim lRemainder As Long
Dim lSize As Long
Dim lPickit As Long
Dim rngCell As Excel.Range
' Initialise
lRemainder = rngBigset.Cells.Count
lSize = UBound(avSample)
Randomize 0
For Each rngCell In rngBigset.Cells
If lSize <= 0 Then Exit For ' Sample complete
If Rnd < lSize / lRemainder Then
lPickit = lPickit + 1
avSample(lPickit) = rngCell
lSize = lSize - 1
End If
lRemainder = lRemainder - 1
Next rngCell
End Sub

So is there a significant performance gain in doing it this way, taking out the array function? Unsurprisingly, yes there is. I created two functions, SAMPLESUM, which requires the conversion to an array beforehand:

Public Function SAMPLESUM(ByRef in_range As Range, ByVal sample_size As Long) As Variant
' Sum a random sample of size sample_size from in_range
Dim sample_array() As Double
Dim in_array As Variant
Dim index As Long
' set up large and small arrays
in_array = arr(in_range)
ReDim sample_array(1 To WorksheetFunction.Min(UBound(in_array), sample_size))
' get sample
SampleArray in_array, sample_array
' sum sample array
SAMPLESUM = WorksheetFunction.Sum(sample_array)
End Function

and SAMPLESUM2, working directly with the range:

Public Function SAMPLESUM2(ByRef in_range As Range, ByVal sample_size As Long) As Variant
' Sum a random sample of size sample_size from in_range
Dim sample_array() As Double
Dim index As Long
' set up large and small arrays
ReDim sample_array(1 To WorksheetFunction.Min(in_range.Cells.Count, sample_size))
' get sample
SampleRange in_range, sample_array
' sum sample array
SAMPLESUM2 = WorksheetFunction.Sum(sample_array)
End Function

I then set up a function which timed 100 calculations of each version, taking a sample of size 10 from a range of numbers with 50,000 rows and 10 columns:

Public Sub timeit()
Dim starttime As Double
Dim endtime As Double
Dim totaltime As Double
Dim i As Integer
Dim numrange As Excel.Range
Set numrange = Range("A1:J50000")
starttime = Timer
For i = 1 To 100
SAMPLESUM numrange, 10
Next i
endtime = Timer
totaltime = endtime - starttime
Debug.Print "Time taken for 100 calculations (SAMPLESUM): " & totaltime & " seconds"
starttime = Timer
For i = 1 To 100
SAMPLESUM2 numrange, 10
Next i
endtime = Timer
totaltime = endtime - starttime
Debug.Print "Time taken for 100 calculations (SAMPLESUM2): " & totaltime & " seconds"
End Sub

Which output:

Time taken for 100 calculations (SAMPLESUM): 44.140625 seconds

Time taken for 100 calculations (SAMPLESUM2): 12.546875 seconds

I think I’ll go with door number 2.