A colleague came to me recently with a problem:

I have a bunch of numbers in a spreadsheet, and I need to take a random sample of 50 of them, and sum it. I can do the sample by random sorting, but I need to repeat this several times, so I’d like a function that I can repeat without having to muck around.

Well, sampling – I have some code which will do just that. After that, I think I should be able to manage summing the array.

Here’s the code for creating a random sample:

Private Sub SampleArray(ByRef avBigset As Variant, ByRef avSmallset As Variant) ' SampleArray populates a random sample avSmallset from an array avBigset ' without replacement (each element in avBigset is considered once only) Dim lRemainder As Long Dim lSize As Long Dim lOb As Long Dim lPickit As Long ' Make sure we're dealing with arrays... If Not IsArray(avBigset) Or Not IsArray(avSmallset) Then Exit Sub ' Initialise lRemainder = UBound(avBigset) lSize = UBound(avSmallset) Randomize 0 Do While lSize > 0 ' Still some left to pick up lOb = lOb + 1 If Rnd < lSize / lRemainder Then lPickit = lPickit + 1 avSmallset(lPickit) = avBigset(lOb) lSize = lSize - 1 End If lRemainder = lRemainder - 1 Loop ' Sample complete End Sub

The issue is that it’s built to work with 1d arrays, so given a range as input in the function, I would need to push the values of the range into an array. I could do that with something simple like the below:

Private Function arr(ByRef rng As Range) As Variant ' Convert range to 1-D array Dim i As Long Dim j As Long Dim k As Long Dim rowCount As Long Dim colcount As Long Dim lim As Long Dim sArr() As String Dim vArr As Variant vArr = rng.Value2 ' dimension array rowCount = UBound(vArr) colcount = UBound(vArr, 2) lim = rowCount * colcount ReDim sArr(1 To lim) k = 1 ' populate For i = 1 To rowCount For j = 1 To colcount sArr(k) = vArr(i, j) k = k + 1 Next j Next i arr = sArr End Function

This will do the trick, but performance-wise it works out pretty horribly. The bigger the array, the longer this step will take, and it’s all useless, unnecessary overhead. The better approach is I think to revise the SampleArray function so it will work with a range rather than an array.

Actually this turns out to be very easy. The SampleArray code loops over an array, but with a range, we can just loop over the Cells collection. The rest of the code is pretty much identical, except that we’ll use a For Each…Next loop rather than Do While:

Private Sub SampleRange(ByRef rngBigset As Variant, ByRef avSample As Variant) ' SampleRange populates a random sample avSample from a range rngBigset ' without replacement (each element in rngBigset is considered once only) Dim lRemainder As Long Dim lSize As Long Dim lPickit As Long Dim rngCell As Excel.Range ' Initialise lRemainder = rngBigset.Cells.Count lSize = UBound(avSample) Randomize 0 For Each rngCell In rngBigset.Cells If lSize <= 0 Then Exit For ' Sample complete If Rnd < lSize / lRemainder Then lPickit = lPickit + 1 avSample(lPickit) = rngCell lSize = lSize - 1 End If lRemainder = lRemainder - 1 Next rngCell End Sub

So is there a significant performance gain in doing it this way, taking out the array function? Unsurprisingly, yes there is. I created two functions, SAMPLESUM, which requires the conversion to an array beforehand:

Public Function SAMPLESUM(ByRef in_range As Range, ByVal sample_size As Long) As Variant ' Sum a random sample of size sample_size from in_range Dim sample_array() As Double Dim in_array As Variant Dim index As Long ' set up large and small arrays in_array = arr(in_range) ReDim sample_array(1 To WorksheetFunction.Min(UBound(in_array), sample_size)) ' get sample SampleArray in_array, sample_array ' sum sample array SAMPLESUM = WorksheetFunction.Sum(sample_array) End Function

and SAMPLESUM2, working directly with the range:

Public Function SAMPLESUM2(ByRef in_range As Range, ByVal sample_size As Long) As Variant ' Sum a random sample of size sample_size from in_range Dim sample_array() As Double Dim index As Long ' set up large and small arrays ReDim sample_array(1 To WorksheetFunction.Min(in_range.Cells.Count, sample_size)) ' get sample SampleRange in_range, sample_array ' sum sample array SAMPLESUM2 = WorksheetFunction.Sum(sample_array) End Function

I then set up a function which timed 100 calculations of each version, taking a sample of size 10 from a range of numbers with 50,000 rows and 10 columns:

Public Sub timeit() Dim starttime As Double Dim endtime As Double Dim totaltime As Double Dim i As Integer Dim numrange As Excel.Range Set numrange = Range("A1:J50000") starttime = Timer For i = 1 To 100 SAMPLESUM numrange, 10 Next i endtime = Timer totaltime = endtime - starttime Debug.Print "Time taken for 100 calculations (SAMPLESUM): " & totaltime & " seconds" starttime = Timer For i = 1 To 100 SAMPLESUM2 numrange, 10 Next i endtime = Timer totaltime = endtime - starttime Debug.Print "Time taken for 100 calculations (SAMPLESUM2): " & totaltime & " seconds" End Sub

Which output:

Time taken for 100 calculations (SAMPLESUM): 44.140625 seconds

Time taken for 100 calculations (SAMPLESUM2): 12.546875 seconds

I think I’ll go with door number 2.

I’d love to have some more context behind the question. I can’t think of a good reason to sum anything that’s random. Wouldn’t the result just be a random number? Anyway, here’s my solution:

{=RANDBETWEEN(SUM(SMALL(A1:A1000,ROW(1:50))),SUM(LARGE(A1:A1000,ROW(1:50))))}

I concede that there may not actually be a subset of 50 items that adds up to this.

Thanks Dick, that’s an elegant solution, and a good point. However as every integer between the top and bottom sum is equally likely, the distribution of answers you get will be uniform over the interval, whereas the distribution of the population being sampled may not be (in fact in this case I’m pretty sure it was skewed). The numbers being summed were from a real-life set, rather than being randomly generated, so sums of random samples from it would reflect that distribution.

As to why my colleague wanted the function, I believe he was doing some Monte Carlo analysis. I’ll follow up with him to see how he got on, it would make an interesting post to see the method he used.