# Summing a random sample

A colleague came to me recently with a problem:
I have a bunch of numbers in a spreadsheet, and I need to take a random sample of 50 of them, and sum it. I can do the sample by random sorting, but I need to repeat this several times, so I’d like a function that I can repeat without having to muck around.

Well, sampling – I have some code which will do just that. After that, I think I should be able to manage summing the array.

Here’s the code for creating a random sample:

```Private Sub SampleArray(ByRef avBigset As Variant, ByRef avSmallset As Variant)
' SampleArray populates a random sample avSmallset from an array avBigset
' without replacement (each element in avBigset is considered once only)
Dim lRemainder As Long
Dim lSize As Long
Dim lOb As Long
Dim lPickit As Long

' Make sure we're dealing with arrays...
If Not IsArray(avBigset) Or Not IsArray(avSmallset) Then Exit Sub
' Initialise
lRemainder = UBound(avBigset)
lSize = UBound(avSmallset)
Randomize 0

Do While lSize > 0  ' Still some left to pick up
lOb = lOb + 1
If Rnd < lSize / lRemainder Then
lPickit = lPickit + 1
avSmallset(lPickit) = avBigset(lOb)
lSize = lSize - 1
End If
lRemainder = lRemainder - 1
Loop    ' Sample complete

End Sub```

The issue is that it’s built to work with 1d arrays, so given a range as input in the function, I would need to push the values of the range into an array. I could do that with something simple like the below:

```Private Function arr(ByRef rng As Range) As Variant
' Convert range to 1-D array
Dim i As Long
Dim j As Long
Dim k As Long
Dim rowCount As Long
Dim colcount As Long
Dim lim As Long
Dim sArr() As String
Dim vArr As Variant

vArr = rng.Value2

' dimension array
rowCount = UBound(vArr)
colcount = UBound(vArr, 2)
lim = rowCount * colcount
ReDim sArr(1 To lim)
k = 1

' populate
For i = 1 To rowCount
For j = 1 To colcount
sArr(k) = vArr(i, j)
k = k + 1
Next j
Next i

arr = sArr

End Function```

This will do the trick, but performance-wise it works out pretty horribly. The bigger the array, the longer this step will take, and it’s all useless, unnecessary overhead. The better approach is I think to revise the SampleArray function so it will work with a range rather than an array.

Actually this turns out to be very easy. The SampleArray code loops over an array, but with a range, we can just loop over the Cells collection. The rest of the code is pretty much identical, except that we’ll use a For Each…Next loop rather than Do While:

```Private Sub SampleRange(ByRef rngBigset As Variant, ByRef avSample As Variant)
' SampleRange populates a random sample avSample from a range rngBigset
' without replacement (each element in rngBigset is considered once only)
Dim lRemainder As Long
Dim lSize As Long
Dim lPickit As Long
Dim rngCell As Excel.Range

' Initialise
lRemainder = rngBigset.Cells.Count
lSize = UBound(avSample)
Randomize 0

For Each rngCell In rngBigset.Cells
If lSize <= 0 Then Exit For ' Sample complete
If Rnd < lSize / lRemainder Then
lPickit = lPickit + 1
avSample(lPickit) = rngCell
lSize = lSize - 1
End If
lRemainder = lRemainder - 1
Next rngCell

End Sub```

So is there a significant performance gain in doing it this way, taking out the array function? Unsurprisingly, yes there is. I created two functions, SAMPLESUM, which requires the conversion to an array beforehand:

```Public Function SAMPLESUM(ByRef in_range As Range, ByVal sample_size As Long) As Variant
' Sum a random sample of size sample_size from in_range
Dim sample_array() As Double
Dim in_array As Variant
Dim index As Long

' set up large and small arrays
in_array = arr(in_range)
ReDim sample_array(1 To WorksheetFunction.Min(UBound(in_array), sample_size))

' get sample
SampleArray in_array, sample_array
' sum sample array
SAMPLESUM = WorksheetFunction.Sum(sample_array)

End Function```

and SAMPLESUM2, working directly with the range:

```Public Function SAMPLESUM2(ByRef in_range As Range, ByVal sample_size As Long) As Variant
' Sum a random sample of size sample_size from in_range
Dim sample_array() As Double
Dim index As Long

' set up large and small arrays
ReDim sample_array(1 To WorksheetFunction.Min(in_range.Cells.Count, sample_size))

' get sample
SampleRange in_range, sample_array
' sum sample array
SAMPLESUM2 = WorksheetFunction.Sum(sample_array)

End Function```

I then set up a function which timed 100 calculations of each version, taking a sample of size 10 from a range of numbers with 50,000 rows and 10 columns:

```Public Sub timeit()
Dim starttime As Double
Dim endtime As Double
Dim totaltime As Double
Dim i As Integer
Dim numrange As Excel.Range

Set numrange = Range("A1:J50000")

starttime = Timer

For i = 1 To 100
SAMPLESUM numrange, 10
Next i

endtime = Timer
totaltime = endtime - starttime

Debug.Print "Time taken for 100 calculations (SAMPLESUM): " & totaltime & " seconds"

starttime = Timer

For i = 1 To 100
SAMPLESUM2 numrange, 10
Next i

endtime = Timer
totaltime = endtime - starttime

Debug.Print "Time taken for 100 calculations (SAMPLESUM2): " & totaltime & " seconds"

End Sub
```

Which output:
``` Time taken for 100 calculations (SAMPLESUM): 44.140625 seconds Time taken for 100 calculations (SAMPLESUM2): 12.546875 seconds ```

I think I’ll go with door number 2.

# Random Sampling in VBA

Occasionally we get asked to provide random samples from our database, and it’s the sort of thing that can be done very easily in SAS.

For example, here’s a SAS routine I have often used to extract a sample sampleset of size 100 without replacement from a large dataset bigset (I can’t remember where this code originally came from but I definitely can’t take credit for it):

data sampleset (drop=remainder size);
size=100;
remainder=totobs;

do while (size > 0);
pickit + 1;
if ranuni(0) < size/remainder then do; set bigset point=pickit nobs=totobs; output; size=size - 1; end; remainder=remainder - 1; end; stop; run;[/sourcecode] While this is usually sufficient, I have sometimes wondered whether it would be possible to use the same logic to extract a sample in VBA. So tonight I decided to test it out. The following code appears to do the trick quite nicely, but be warned, I haven't done a lot of testing yet. [sourcecode language='vb']Public Sub SampleArray(ByRef avBigset As Variant, ByRef avSmallset As Variant) ' SampleArray populates a random sample avSmallset from an array avBigset ' without replacement (each element in avBigset is considered once only) Dim lRemainder As Long Dim lSize As Long Dim lOb As Long Dim lPickit As Long ' Make sure we're dealing with arrays... If Not IsArray(avBigset) Or Not IsArray(avSmallset) Then Exit Sub ' Initialise lRemainder = UBound(avBigset) lSize = UBound(avSmallset) Randomize 0 Do While lSize > 0 ‘ Still some left to pick up
lOb = lOb + 1
If Rnd < lSize / lRemainder Then lPickit = lPickit + 1 avSmallset(lPickit) = avBigset(lOb) lSize = lSize - 1 End If lRemainder = lRemainder - 1 Loop ' Sample complete End Sub[/sourcecode] It coped fairly well with populating a sample array of size 1,000 from an array of size 10,000,000, but as I say I haven't tested too extensively so I don't yet know its limits, or whether there is a much more efficient method out there. Has anyone else used VBA in this way? What sort of method would you use?