A colleague came to me recently with a problem:
I have a bunch of numbers in a spreadsheet, and I need to take a random sample of 50 of them, and sum it. I can do the sample by random sorting, but I need to repeat this several times, so I’d like a function that I can repeat without having to muck around.
Well, sampling – I have some code which will do just that. After that, I think I should be able to manage summing the array.
Here’s the code for creating a random sample:
Private Sub SampleArray(ByRef avBigset As Variant, ByRef avSmallset As Variant) ' SampleArray populates a random sample avSmallset from an array avBigset ' without replacement (each element in avBigset is considered once only) Dim lRemainder As Long Dim lSize As Long Dim lOb As Long Dim lPickit As Long ' Make sure we're dealing with arrays... If Not IsArray(avBigset) Or Not IsArray(avSmallset) Then Exit Sub ' Initialise lRemainder = UBound(avBigset) lSize = UBound(avSmallset) Randomize 0 Do While lSize > 0 ' Still some left to pick up lOb = lOb + 1 If Rnd < lSize / lRemainder Then lPickit = lPickit + 1 avSmallset(lPickit) = avBigset(lOb) lSize = lSize - 1 End If lRemainder = lRemainder - 1 Loop ' Sample complete End Sub
The issue is that it’s built to work with 1d arrays, so given a range as input in the function, I would need to push the values of the range into an array. I could do that with something simple like the below:
Private Function arr(ByRef rng As Range) As Variant ' Convert range to 1-D array Dim i As Long Dim j As Long Dim k As Long Dim rowCount As Long Dim colcount As Long Dim lim As Long Dim sArr() As String Dim vArr As Variant vArr = rng.Value2 ' dimension array rowCount = UBound(vArr) colcount = UBound(vArr, 2) lim = rowCount * colcount ReDim sArr(1 To lim) k = 1 ' populate For i = 1 To rowCount For j = 1 To colcount sArr(k) = vArr(i, j) k = k + 1 Next j Next i arr = sArr End Function
This will do the trick, but performance-wise it works out pretty horribly. The bigger the array, the longer this step will take, and it’s all useless, unnecessary overhead. The better approach is I think to revise the SampleArray function so it will work with a range rather than an array.
Actually this turns out to be very easy. The SampleArray code loops over an array, but with a range, we can just loop over the Cells collection. The rest of the code is pretty much identical, except that we’ll use a For Each…Next loop rather than Do While:
Private Sub SampleRange(ByRef rngBigset As Variant, ByRef avSample As Variant) ' SampleRange populates a random sample avSample from a range rngBigset ' without replacement (each element in rngBigset is considered once only) Dim lRemainder As Long Dim lSize As Long Dim lPickit As Long Dim rngCell As Excel.Range ' Initialise lRemainder = rngBigset.Cells.Count lSize = UBound(avSample) Randomize 0 For Each rngCell In rngBigset.Cells If lSize <= 0 Then Exit For ' Sample complete If Rnd < lSize / lRemainder Then lPickit = lPickit + 1 avSample(lPickit) = rngCell lSize = lSize - 1 End If lRemainder = lRemainder - 1 Next rngCell End Sub
So is there a significant performance gain in doing it this way, taking out the array function? Unsurprisingly, yes there is. I created two functions, SAMPLESUM, which requires the conversion to an array beforehand:
Public Function SAMPLESUM(ByRef in_range As Range, ByVal sample_size As Long) As Variant ' Sum a random sample of size sample_size from in_range Dim sample_array() As Double Dim in_array As Variant Dim index As Long ' set up large and small arrays in_array = arr(in_range) ReDim sample_array(1 To WorksheetFunction.Min(UBound(in_array), sample_size)) ' get sample SampleArray in_array, sample_array ' sum sample array SAMPLESUM = WorksheetFunction.Sum(sample_array) End Function
and SAMPLESUM2, working directly with the range:
Public Function SAMPLESUM2(ByRef in_range As Range, ByVal sample_size As Long) As Variant ' Sum a random sample of size sample_size from in_range Dim sample_array() As Double Dim index As Long ' set up large and small arrays ReDim sample_array(1 To WorksheetFunction.Min(in_range.Cells.Count, sample_size)) ' get sample SampleRange in_range, sample_array ' sum sample array SAMPLESUM2 = WorksheetFunction.Sum(sample_array) End Function
I then set up a function which timed 100 calculations of each version, taking a sample of size 10 from a range of numbers with 50,000 rows and 10 columns:
Public Sub timeit() Dim starttime As Double Dim endtime As Double Dim totaltime As Double Dim i As Integer Dim numrange As Excel.Range Set numrange = Range("A1:J50000") starttime = Timer For i = 1 To 100 SAMPLESUM numrange, 10 Next i endtime = Timer totaltime = endtime - starttime Debug.Print "Time taken for 100 calculations (SAMPLESUM): " & totaltime & " seconds" starttime = Timer For i = 1 To 100 SAMPLESUM2 numrange, 10 Next i endtime = Timer totaltime = endtime - starttime Debug.Print "Time taken for 100 calculations (SAMPLESUM2): " & totaltime & " seconds" End Sub
Which output:
Time taken for 100 calculations (SAMPLESUM): 44.140625 seconds
Time taken for 100 calculations (SAMPLESUM2): 12.546875 seconds
I think I’ll go with door number 2.