# Summing a random sample

A colleague came to me recently with a problem:
I have a bunch of numbers in a spreadsheet, and I need to take a random sample of 50 of them, and sum it. I can do the sample by random sorting, but I need to repeat this several times, so I’d like a function that I can repeat without having to muck around.

Well, sampling – I have some code which will do just that. After that, I think I should be able to manage summing the array.

Here’s the code for creating a random sample:

```Private Sub SampleArray(ByRef avBigset As Variant, ByRef avSmallset As Variant)
' SampleArray populates a random sample avSmallset from an array avBigset
' without replacement (each element in avBigset is considered once only)
Dim lRemainder As Long
Dim lSize As Long
Dim lOb As Long
Dim lPickit As Long

' Make sure we're dealing with arrays...
If Not IsArray(avBigset) Or Not IsArray(avSmallset) Then Exit Sub
' Initialise
lRemainder = UBound(avBigset)
lSize = UBound(avSmallset)
Randomize 0

Do While lSize > 0  ' Still some left to pick up
lOb = lOb + 1
If Rnd < lSize / lRemainder Then
lPickit = lPickit + 1
avSmallset(lPickit) = avBigset(lOb)
lSize = lSize - 1
End If
lRemainder = lRemainder - 1
Loop    ' Sample complete

End Sub```

The issue is that it’s built to work with 1d arrays, so given a range as input in the function, I would need to push the values of the range into an array. I could do that with something simple like the below:

```Private Function arr(ByRef rng As Range) As Variant
' Convert range to 1-D array
Dim i As Long
Dim j As Long
Dim k As Long
Dim rowCount As Long
Dim colcount As Long
Dim lim As Long
Dim sArr() As String
Dim vArr As Variant

vArr = rng.Value2

' dimension array
rowCount = UBound(vArr)
colcount = UBound(vArr, 2)
lim = rowCount * colcount
ReDim sArr(1 To lim)
k = 1

' populate
For i = 1 To rowCount
For j = 1 To colcount
sArr(k) = vArr(i, j)
k = k + 1
Next j
Next i

arr = sArr

End Function```

This will do the trick, but performance-wise it works out pretty horribly. The bigger the array, the longer this step will take, and it’s all useless, unnecessary overhead. The better approach is I think to revise the SampleArray function so it will work with a range rather than an array.

Actually this turns out to be very easy. The SampleArray code loops over an array, but with a range, we can just loop over the Cells collection. The rest of the code is pretty much identical, except that we’ll use a For Each…Next loop rather than Do While:

```Private Sub SampleRange(ByRef rngBigset As Variant, ByRef avSample As Variant)
' SampleRange populates a random sample avSample from a range rngBigset
' without replacement (each element in rngBigset is considered once only)
Dim lRemainder As Long
Dim lSize As Long
Dim lPickit As Long
Dim rngCell As Excel.Range

' Initialise
lRemainder = rngBigset.Cells.Count
lSize = UBound(avSample)
Randomize 0

For Each rngCell In rngBigset.Cells
If lSize <= 0 Then Exit For ' Sample complete
If Rnd < lSize / lRemainder Then
lPickit = lPickit + 1
avSample(lPickit) = rngCell
lSize = lSize - 1
End If
lRemainder = lRemainder - 1
Next rngCell

End Sub```

So is there a significant performance gain in doing it this way, taking out the array function? Unsurprisingly, yes there is. I created two functions, SAMPLESUM, which requires the conversion to an array beforehand:

```Public Function SAMPLESUM(ByRef in_range As Range, ByVal sample_size As Long) As Variant
' Sum a random sample of size sample_size from in_range
Dim sample_array() As Double
Dim in_array As Variant
Dim index As Long

' set up large and small arrays
in_array = arr(in_range)
ReDim sample_array(1 To WorksheetFunction.Min(UBound(in_array), sample_size))

' get sample
SampleArray in_array, sample_array
' sum sample array
SAMPLESUM = WorksheetFunction.Sum(sample_array)

End Function```

and SAMPLESUM2, working directly with the range:

```Public Function SAMPLESUM2(ByRef in_range As Range, ByVal sample_size As Long) As Variant
' Sum a random sample of size sample_size from in_range
Dim sample_array() As Double
Dim index As Long

' set up large and small arrays
ReDim sample_array(1 To WorksheetFunction.Min(in_range.Cells.Count, sample_size))

' get sample
SampleRange in_range, sample_array
' sum sample array
SAMPLESUM2 = WorksheetFunction.Sum(sample_array)

End Function```

I then set up a function which timed 100 calculations of each version, taking a sample of size 10 from a range of numbers with 50,000 rows and 10 columns:

```Public Sub timeit()
Dim starttime As Double
Dim endtime As Double
Dim totaltime As Double
Dim i As Integer
Dim numrange As Excel.Range

Set numrange = Range("A1:J50000")

starttime = Timer

For i = 1 To 100
SAMPLESUM numrange, 10
Next i

endtime = Timer
totaltime = endtime - starttime

Debug.Print "Time taken for 100 calculations (SAMPLESUM): " & totaltime & " seconds"

starttime = Timer

For i = 1 To 100
SAMPLESUM2 numrange, 10
Next i

endtime = Timer
totaltime = endtime - starttime

Debug.Print "Time taken for 100 calculations (SAMPLESUM2): " & totaltime & " seconds"

End Sub
```

Which output:
``` Time taken for 100 calculations (SAMPLESUM): 44.140625 seconds Time taken for 100 calculations (SAMPLESUM2): 12.546875 seconds ```

I think I’ll go with door number 2.

## 2 thoughts on “Summing a random sample”

1. dickkusleika says:

I’d love to have some more context behind the question. I can’t think of a good reason to sum anything that’s random. Wouldn’t the result just be a random number? Anyway, here’s my solution:

{=RANDBETWEEN(SUM(SMALL(A1:A1000,ROW(1:50))),SUM(LARGE(A1:A1000,ROW(1:50))))}

I concede that there may not actually be a subset of 50 items that adds up to this.

• geoffness says:

Thanks Dick, that’s an elegant solution, and a good point. However as every integer between the top and bottom sum is equally likely, the distribution of answers you get will be uniform over the interval, whereas the distribution of the population being sampled may not be (in fact in this case I’m pretty sure it was skewed). The numbers being summed were from a real-life set, rather than being randomly generated, so sums of random samples from it would reflect that distribution.

As to why my colleague wanted the function, I believe he was doing some Monte Carlo analysis. I’ll follow up with him to see how he got on, it would make an interesting post to see the method he used.