Dare to Compare

When moving ETL processes to a new environment, it’s important to test that those processes return the same results as they do in the old environment. We might expect that, given the same input from source data and running the same code, we should see the same output. Of course that’s not always the case, because the data warehouse is not an abstract model (it’s sometimes easy to forget this when all you see is code and data), but a combination of many varied, interconnected pieces of hardware and software. When we change any one portion it’s hard to predict what, if any, difference this will make – so when we change all of them at once, we need to expect the unexpected.

So how can we tell what’s changed? The SAS Compare procedure is a very handy tool in this situation, providing a quick, automated means of determining differences in type and value between one dataset and another. Here’s an example of how I’ve used it recently in testing migrated data:

%macro compare(table, num, key) ;

%let old = old_&table ;
%let new = new_&table ;
%let result = res_&table ;
%let diff = df_&table ;

proc sort data=&newlib..&table out=&new;
    by &key ;
run ;

proc sort data=&oldlib..&table out=&old;
    by &key ;
run ;


proc compare compare=&new base=&old out=&result 
  outnoequal outbase outcomp outdif
  outstat=&diff (where=(_type_='NDIF' and _base_ <> 0)) listobs;
  id &key ;
run ;

data _diff ;
  dsid = open("&new") ;
  set &diff end=last;
  length column $40 table $40 type $40 fmt $80 ;
  table="&table" ;
  column = _var_ ;
  varnum = varnum(dsid, _var_) ;
  type = vartype(dsid, varnum) ;
  length = varlen(dsid, varnum) ;
  fmt = varfmt(dsid, varnum) ;
  if last then rc=close(dsid) ;
run ;

%if &num = 1 %then %do ;
    data diff ;
      set _diff 
      (keep=table column varnum type length fmt _base_ _comp_);
    run ;
%end ;
%else %do ;
    proc append base=diff data=_diff 
      (keep=table column varnum type length fmt _base_ _comp_);
    run ;
%end ;

%mend compare ;

For each dataset in a particular library, the compare macro is run to compare it with its counterpart in the new library (&newlib in the above). There’s a fair bit going on here, so piece by piece, here’s what the macro does.

First, a bit of setup: we are going to be comparing datasets by the supplied &key parameter (rather than by row number), so for the purpose of proc compare we need to sort both sets first. There is of course a new version and an old version of &table, named so that we know which is which. We’re also naming output sets &result as ‘res_&table’ (this will hold rows which differ between versions), and &diff as ‘df_&table’ (this will contain output statistics summarising the comparison).

After sorting the old and new tables, the comparison is run. The old version is supplied as the ‘base=’ argument and the new version as ‘compare=’; the id statement identifies the variable(s) making up the unique key for the sets. The ‘out=’ argument names the &result output set, and the options outnoequal, outbase, outcomp and outdif specify, respectively, that only observations with some difference in values between base and compare are written out, that we get the rows from both the base and the compare set, and that we also get a row showing the differences for each variable. The ‘outstat=’ argument is set to &diff to hold a set of summary statistics (which we are restricting to variables with one or more value differences). Finally, the ‘listobs’ option prints the observations whose unique key value is found in only one of the base and compare sets.

In the final part of the macro, we take the &diff set and pull out some metadata about the variables that have tested unequal between sets. All of these &diff sets are appended together into a big output dataset which serves as a handy summary of differences across the entire library. Note that the compare procedure also produces list output which can be read as a detailed report.
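As for how the macro gets driven across the whole library: here’s a sketch of one way to do it. The control dataset table_keys (one row per table, with its name and key variables) and the use of _n_ for the &num parameter are my own inventions for illustration, not part of the original code.

/* Sketch only: loop %compare over a control dataset listing each table
   and its key variable(s). */
%let oldlib = oldlib ;   /* libref pointing at the old environment */
%let newlib = newlib ;   /* libref pointing at the new environment */

data _null_ ;
  set table_keys ;       /* assumed columns: table_name, key_vars */
  call execute(cats('%nrstr(%compare)(', table_name, ',', _n_, ',', key_vars, ');')) ;
run ;

The %nrstr wrapper just stops the macro from executing in the middle of the data step; each call is queued up and run once the step has finished.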

A couple of PowerShell scripts

I’ve been making a little more use of PowerShell recently. It’s a very handy addition to the Windows environment: certainly a lot easier to use than the traditional command line, and much easier to find your way around. I figured I’d share a couple of the scripts I’ve used for some specific file-management tasks, in case someone finds them handy.

Recently we’ve been upgrading our SAS environment, so we’ve had to move a great deal of our data onto new servers. One of the concerns we had with migrating data was that some of our users had been storing disconnected SAS datasets in one of the directories to be migrated. Not really a great practice, and we didn’t want to waste bandwidth transporting over stale data unless there was a real need to. So, we needed a list of datasets – just the really big ones – that had been saved, so that the users responsible could justify the need for them in the new environment, or (preferably), bid them a fond farewell:

# the directory we are migrating
$badpath = "\\no\big\datasets\here\please"
# output to this file
$outfile = "\\list\of\big\datasets.csv"

<# 
  Extract a list of SAS datasets over 100MB in size, and
  write out the name, last modified date, size and location
  to the output file.
#>
Get-ChildItem $badpath -Recurse -Filter *.sas7bdat |
 ? {$_.Length -gt 100MB} |
 Select-Object Name, LastWriteTime, 
 @{Name="SizeInMB";Expression={$_.Length/1MB}},
 @{Name="Path";Expression={$_.directory}} |
 Export-Csv -Path $outfile

The Get-ChildItem cmdlet looks at all items in the folder $badpath, and the Recurse parameter tells it to look in all subfolders as well. The Filter parameter limits the results of this search to SAS datasets. That gets piped to the Where-Object cmdlet (here shortened to ?), which allows only the files over the size limit in the script block through to the next pipe. Finally, Select-Object takes a selection of properties from the objects coming through, and Export-Csv sends them to the output file. Note that because we know we are dealing with dataset files, the objects coming through will have type FileInfo. If we had not specified SAS datasets in the filter we would have had to include another cmdlet to exclude non-file items like directories and archives.

Another task involved searching through our SAS source code for something we knew was going to cause us problems in the new environment. Without going into too much detail as to the why of this, we needed to find any location where the developer had tested for whether a variable was equal to -1. As it turned out, due to the way SAS Data Integration Studio works, -1 gets added into generated code in a lot of places, so we needed a regular expression pattern to limit the cases returned. We didn’t care about the instances where special ETL macro variables were set to -1, or were compared with -1. So rather than looking for just the literal “-1” string in code, we needed a negative lookbehind regex to exclude these cases:

# Source code directory
$jobdir = "\\SAS\ETL\Code\Goes\Here"
# Output file
$outfile = "\\List\Of\Problem\Lines.csv"
# Occurrence of -1 that we care about
$pattern = "(?<!((etls_\w+\s+=\s)|(etls_recnt[,=])))-1"

<#
  Extract lines from SAS ETL source code matching the pattern
  specified and write out the line to the output file
#>
Select-String -Pattern "$pattern" -Path $jobdir\*.sas |
 Select-Object Filename, LineNumber, Line |
  Export-Csv $outfile

Select-String is pretty much the PowerShell equivalent of grep. It returns a list of MatchInfo objects which, as with the FileInfo objects in the other example, all get passed to Select-Object to pull out the properties we want to know about to be exported to the output file.

PowerShell documentation is available at https://technet.microsoft.com/en-us/library/bb978526.aspx, but the built-in help (F1, or Get-Help from the prompt) is just as easy to find your way around. Happy scripting!

Replacing SQL Joins with the SAS Data Step Hash Object

So it has been a very long time since I posted on here, a lot of changes in the meantime. Long story short, I got busy, changed jobs, still busy now, but with the change in scene I have a little more motivation to blog and hopefully some more interesting things to say. I figured I’d start out by posting a quick summary of a technical presentation I gave to the SUNZ quarterly meeting late last year.

The presentation was a brief look at how to use the SAS data step hash object to replace expensive SQL joins. The scenario I had in mind was a typical one working with a data warehouse, where we join from a large, central fact table to several (typically smaller) dimension tables:

[Figure: a large central fact table joined to several dimension tables]

The thing is, even when the dimension tables are (relatively) small, each lookup from the fact exacts a cost, in the form of CPU time – so the more rows in the fact, the more this costs. Not only that, but as the dimensions grow (which they naturally will over time), the CPU time required for each lookup will increase. This is why the quick and easy report which took 5 minutes to run when you wrote it a year ago now keeps you (and perhaps the rest of your reporting batch) waiting for the best part of an hour.

The advantage that the hash object offers is essentially that while growth in the fact still adds to the CPU time required, growth in the dimensions does not. The hash object guarantees (except in situations involving pathological data) constant time lookup. There can be monster savings here, with some caveats which I’ll get to later. For the moment, here’s a very quick how to.

First, the (generic) SQL we’re replacing:

proc sql ;
  create table report as
  select
  	dim1.attr1
	, dim2.attr2
	, dim3.attr3
	, dim4.attr4
	, fac.measure1
  from
  	fact1 fac
	inner join dimension1 dim1
		on fac.dim1_pk = dim1.dim1_pk
	inner join dimension2 dim2
		on fac.dim2_pk = dim2.dim2_pk
	inner join dimension3 dim3
		on fac.dim3_pk = dim3.dim3_pk
	inner join dimension4 dim4
		on fac.dim4_pk = dim4.dim4_pk
		;
quit ;

The idea with using the data step hash object to replace this is simple: we add a separate hash object for each dimension, containing the keys we are using to join on and the attributes we are adding into the report table. Then for each row in the fact, if we find a match in all dimensions, we add the row into the report.

The code is as follows:

data report ;
  /* 1 - 'Fake' set statement to add variables into the PDV*/
  if 0 then set
  	fact1 (keep = measure1)
  	dimension1 (keep = dim1_pk attr1)
	dimension2 (keep = dim2_pk attr2)
	dimension3 (keep = dim3_pk attr3)
	dimension4 (keep = dim4_pk attr4)
  ;

  /* 2 - Declare hash objects for each dimension*/
  if _n_ = 1 then do ;
    declare hash dim1 (dataset:"dimension1") ;
    dim1.definekey("dim1_pk") ;
    dim1.definedata("attr1") ;
    dim1.definedone() ;

    declare hash dim2 (dataset:"dimension2") ;
    dim2.definekey("dim2_pk") ;
    dim2.definedata("attr2") ;
    dim2.definedone() ;

    declare hash dim3 (dataset:"dimension3") ;
    dim3.definekey("dim3_pk") ;
    dim3.definedata("attr3") ;
    dim3.definedone() ;

    declare hash dim4 (dataset:"dimension4") ;
    dim4.definekey("dim4_pk") ;
    dim4.definedata("attr4") ;
    dim4.definedone() ;
  end ;

  /* 3 - 'Join' rows to the dimensions by matching with the .find() method*/
  do until (eof) ;
    set fact1 (keep=dim1_pk dim2_pk dim3_pk dim4_pk measure1) end=eof;
	if dim1.find() = 0 and dim2.find() = 0 and 
		dim3.find() = 0 and dim4.find() = 0 then output ;
  end ;
  stop ;

  drop dim1_pk dim2_pk dim3_pk dim4_pk ;

run ;

As per the comments, the code breaks down into 3 steps:
1 – Fake a set statement: the variables used as hash keys and data have to exist in the PDV before the hash objects are created, so we use an ‘if 0 then set’ (which never executes) to supply the compiler with the column metadata it needs to build the PDV.
2 – Declare and create the hash objects. The definekey, definedata and definedone methods do the work of defining each hash object, after which SAS loads the table named in its ‘dataset’ parameter, adding the key and data values from each row into the hash object.
3 – Perform the join by matching key values from the fact table into the dimension hash objects (using the hash object find() method). This is where one fundamental difference between the two approaches becomes apparent: we’re no longer joining tables on disk, as we were with the SQL join; the fact table on disk is being matched against the hash objects, which are data structures entirely resident in memory.
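One thing worth pointing out about step 3 (this is my own aside, not part of the original code): requiring find() to return 0 for every dimension gives inner-join behaviour, so fact rows with no match in a dimension are silently dropped. If you wanted left-join behaviour instead, a sketch of the variant would output every fact row and clear the looked-up attributes first:

  /* Variant of step 3 (sketch): left-join behaviour. Every fact row is
     output; call missing() clears the attribute values first so rows
     with a failed find() don't carry over values from the previous row. */
  do until (eof) ;
    set fact1 (keep=dim1_pk dim2_pk dim3_pk dim4_pk measure1) end=eof ;
    call missing(attr1, attr2, attr3, attr4) ;
    rc1 = dim1.find() ;
    rc2 = dim2.find() ;
    rc3 = dim3.find() ;
    rc4 = dim4.find() ;
    output ;
  end ;
  stop ;

  drop dim1_pk dim2_pk dim3_pk dim4_pk rc1-rc4 ;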

So is it worth it? In a word, yes – but only if you’re willing to trade off a big (sometimes huge) increase in memory consumption against the CPU time you’ll be getting back. To illustrate this, here’s some performance stats from a real-life example.
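(A quick aside, and an assumption on my part rather than anything from the original presentation: per-step CPU and memory figures like the ones below are what the SAS log gives you once the FULLSTIMER option is switched on.)

/* Report memory used as well as CPU and real time for each step */
options fullstimer ;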

First, a small join – 2 dimensions joined to a small fact table (about 100,000 rows):

[Figure: log performance statistics for the PROC SQL join against the 100,000-row fact]

Replacing this with data step code using hash objects:

[Figure: log performance statistics for the data step hash version against the 100,000-row fact]

There’s a small saving in CPU time, set against a slight increase in memory consumption. It hardly seems worthwhile replacing this code, but then again it’s not a very long-running report to begin with. The real savings come when we look at the full version – this time, 9 dimensions joined to a somewhat larger fact table (~ 10 million rows). First the SQL join:

[Figure: log performance statistics for the PROC SQL join against the ~10 million-row fact]

Then, the data step version:

[Figure: log performance statistics for the data step hash version against the ~10 million-row fact]

Here, replacing the SQL code has reduced the time required by a factor of 10. This is a huge difference and we could be doing backflips and high fives all round right now, but before we kick off the celebrations, take a moment to look at the memory usage.

You’ll see that the memory consumption with the SQL code is less than 300MB, whereas the data step hash code uses a little over 10 times that. In fact, even the data step code against the small fact required over 1GB. The memory usage is linked to the size of the dimensions that you’re shoving into the hash objects, so the decrease in CPU time is being paid for more or less directly with a corresponding increase in memory. So is this a problem? Well, it might be, or it might not be. Obviously it depends on the availability of both these resources – if your server is constantly running out of memory then I’d say yes, it’s a problem. Then again, if your users are always complaining about how long it takes for their reports to run, maybe the hash object is worth a look.

I delivered a slightly longer form of this as a presentation to SUNZ last year. The slideshow is at the link below (pptx); a pdf version is also available from the SUNZ website.

Data Step Hash Object vs SQL Join

Tricks with _TYPE_

In SAS I frequently make use of the very handy SUMMARY procedure. It will produce a wide range of aggregate results at whatever class level you specify – or multiple levels if you like. So given a dataset costs with measure variable total_cost and class variables month, age, tenure and location, I might wish to see one-way analyses of cost by all of these classes, and possibly two-way analysis of month vs age and month vs tenure. So I could write a whole lot of summary procedures like the below:

/*Month 1-way analysis*/
proc summary data=costs nway missing ;
	class month ;
	var   total_cost ;
	output out=cost_month_summary (drop=_type_ rename=(_freq_=count))
		min= max= p5= q1= median= q3= p95= mean= sum= /autoname ;
run;

/*Month by Age 2-way analysis*/
proc summary data=costs nway missing ;
	class month age;
	var   total_cost ;
	output out=cost_month_age_summary (drop=_type_ rename=(_freq_=count))
		min= max= p5= q1= median= q3= p95= mean= sum= /autoname ;
run;
/*etc, etc.*/

Now while that will work, it’s a lot of code to write and maintain. I prefer to request all the required results at once, in one proc:

proc summary data=costs missing ;
	class month age tenure location;
	var   total_cost;
	types /*One-way analyses*/
		  month age tenure location
		  /*two-way*/
		  month*age
		  month*tenure
	;
	output out=cost_summary (rename=(_freq_=count))
		min= max= p5= q1= median= q3= p95= mean= sum= /autoname ;
run;

The change that makes this work is of course the types statement, together with removing the nway option (we no longer want just the n-way analysis).

The output dataset will then have a column for every class variable, one for each requested statistic (including the renamed _freq_), and crucially we’ve also kept the autogenerated _type_ variable. _Type_ is the variable which tells you which summary request you’re looking at, assigning numbers to each combination of variables according to a simple binary pattern. Basically, given n class variables arranged in logical order A1 … An in the output dataset, working from right to left each is assigned a value from 2^0 … 2^(n-1), so Am gets 2^(n-m). In our example, if month, age, tenure and location have logical positions 1, 2, 3 and 4 respectively, then month = 2^3, age = 2^2, tenure = 2^1 and location = 2^0. That means the requested analysis types get the following _type_ values:

month = 2^3 = 8
age = 2^2 = 4
tenure = 2^1 = 2
location = 2^0 = 1
month*tenure = 2^3 + 2^1 = 10
month*age = 2^3 + 2^2 = 12
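So, for example, if I wanted just the month*age cells from the combined output set, I could filter on that value:

/* Extract just the month*age summary rows (_type_ = 8 + 4 = 12) */
data cost_month_age ;
	set cost_summary ;
	where _type_ = 12 ;
run ;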

This works fine for me, but the _type_ variable is a number, which is unhelpful to an audience without the technical knowledge, time or patience to decode it, so it would be great to be able to systematically assign it a descriptive label. Of course, given a combination of class variables in a summary dataset, it’s possible to work out the value of _type_ that has been assigned to them. It turns out that this paper from SUGI 31 by Dan Bruns contains the macro %findtype, written by Ron Coleman, which will examine a summary dataset (&table) and return the value of _type_ corresponding to the variable list &vars:

%macro findtype(table,vars);
   %local index fid flag target;
   %let fid = %sysfunc(open(&table,i));
   %if not &fid %then %goto exit;
   %let vars = %upcase(&vars);
   %let flag = 0;
   %let _type_ = 0;
   %do i = %sysfunc(attrn(&fid,nvars)) %to 1 %by -1;
      %let target = %upcase(%sysfunc(varname(&fid,&i)));
      %if &flag %then %do;
         %let index = 1;
         %do %while (%length(%scan(&vars,&index)) > 0);
            %if &target = %scan(&vars,&index) %then
               %let _type_ = %eval(&_type_ + 2**(&flag - &i - 1));
            %let index = %eval(&index + 1);
         %end;
      %end;
      %else %if &target = _TYPE_ %then
         %let flag = &i;
   %end;
   %let fid = %sysfunc(close(&fid));
   &_type_
   %goto term;
%Exit:;
   -1
%term:
%mend findtype;

What I wanted to do was use this macro to create a format typedesc that looks up a description from _type_, so that in the example above put(12, typedesc.) would return ‘month*age’. I created a dataset types containing all the variable combinations requested in the summary, and then used this to create a control data set typefmt.
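The types dataset itself isn’t shown above; for this example it just needs one row per requested analysis, so something like the following sketch (using the same combinations as the types statement earlier) would do:

/* One row per requested analysis, matching the types statement above */
data types ;
	length types $40 ;
	do types = 'month', 'age', 'tenure', 'location', 'month*age', 'month*tenure' ;
		output ;
	end ;
run ;

With types in place, the control set typefmt is generated by writing a small data step out to a temporary file and %including it: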

filename out_tmp temp ;

/*Write a data step to generate a format control set*/
data _null_;
  set types end=last;
  file out_tmp ;
  if _n_=1 then put 'data typefmt; retain fmtname type ; fmtname="typedesc"; type="N";' ;
  line='start=%findtype(cost_summary, '|| types ||'); label="'|| types ||'"; output ;' ;
  put line ;
  if last then put 'run ;' ;
run;

%include out_tmp /source2 ;

Here the variable types acts as the label, and %findtype finds the _type_ value to apply the label to.
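For our example, the generated program written to out_tmp ends up looking something like this (illustrative; the exact lines depend on what’s in types):

data typefmt; retain fmtname type ; fmtname="typedesc"; type="N";
start=%findtype(cost_summary, month); label="month"; output ;
start=%findtype(cost_summary, age); label="age"; output ;
start=%findtype(cost_summary, tenure); label="tenure"; output ;
start=%findtype(cost_summary, location); label="location"; output ;
start=%findtype(cost_summary, month*age); label="month*age"; output ;
start=%findtype(cost_summary, month*tenure); label="month*tenure"; output ;
run ;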

From there it’s a simple matter of compiling the format and applying it to _type_ in the summary dataset:

proc format library=work cntlin=typefmt ;
run ;

data cost_summary ;
  set cost_summary ;
  type=put(_type_, typedesc.) ;
run ;

How do you SQL in Excel?

QueryCell is the Excel add-in from Oak Focus Software that brings SQL into Excel. I reviewed version 1.4 some time ago, and a few days ago lead developer Sam Howley told me about the release of version 2.0.

The new version is a complete rewrite and offers a noticeable improvement in speed, stability and responsiveness, in addition to support for 64-bit Excel. As always I’m impressed with the clean, simple look and feel of the interface, which slides out the editor when you need to use it and tucks it away when you’re done. You can manipulate data from inside the current workbook, or externally through an ODBC connection.

Sam has very kindly agreed once again to give away some free licences to Number Cruncher readers. All you have to do to get yourself one is to send me an email describing how you’ve used SQL in Excel. Could be bringing external data in through a query, or querying tables in the current workbook, could be using the Excel Data menu and MS Query to add in a connection, or scripting an ADO connection using VBA, or using QueryCell. Surprise me! Send me a description of what the information need was and how you went about resolving it.

Email me at the address on the About page, with ‘SQL in Excel’ in the subject line, by 10:00 pm NZDT on Saturday 14 July. I’ll judge the best/most interesting of these and publish them in a follow-up post after the weekend, so it’s important that you’re ok with other people reading what you send me.

Lags and Unintended Consequences

Analytic functions like LAG, LEAD, FIRST_VALUE and LAST_VALUE are a very useful addition to Oracle SQL, enabling you to retrieve values from other rows in a result set without the need for self-joins. LAG, for instance, will give you the value of a column from the previous row in a group of rows.

Here’s an example of where I used this recently. I was attempting to monitor transfers of cases between offices by reading from a table CASE_MGMT_SEQ, which recorded which offices managed which cases, holding a row for each office in the ‘management sequence’ of the case history. Here’s roughly what that looked like:

The column MGMT_SEQ tells us the position of each row in the case management sequence. Each row in the case sequence ends on the same date as the next row starts, and it is possible (although not shown in this example) for the next row to be in the same office as in the previous row. From this view we can tell that case A was managed in Wellington from Feb 1 to Feb 20 this year, then was moved to Auckland from Feb 20 to Feb 25, and finally moved back to Wellington, where it remained until Mar 10.

The SQL to track transfers needed to tell me (among other things; I’ve simplified this considerably) which case we’re transferring, when the transfer happened, where it was transferred to, and where it was transferred from:

WITH TRANSFERS AS
  (SELECT CASE_ID ,
    LAG(OFFICE) OVER (PARTITION BY CASE_ID ORDER BY CASE_ID, MGMT_SEQ) AS PREV_OFFICE,
    OFFICE ,
    START_DATE,
    END_DATE
  FROM CASE_MGMT_SEQ
  WHERE START_DATE BETWEEN TO_DATE(:STDATE, 'yyyy/mm/dd') AND TO_DATE(:ENDDATE, 'yyyy/mm/dd')
  )
SELECT 
  CASE_ID,
  PREV_OFFICE AS TRANSFER_FROM,
  OFFICE AS TRANSFER_TO,
  START_DATE AS TRANSFER_DATE
FROM TRANSFERS
WHERE (OFFICE  = :OFFICE
OR PREV_OFFICE = :OFFICE)
AND OFFICE    <> PREV_OFFICE;

The report returns any transfers in or out of a specified office over a period bounded by STDATE and ENDDATE.

But something weird is happening. I try running this report with the OFFICE parameter set to Wellington, STDATE set to 2012/02/21 and ENDDATE set to 2012/02/27, and here’s what I get:

There’s a transfer of case C from Wellington to Hamilton on Feb 24, certainly. The row where case C is managed in Wellington from Feb 23 to Feb 24 shouldn’t be counted as a transfer, as there is no previous office – the lag returns a null in this case, so the condition OFFICE <> PREV_OFFICE in the WHERE clause evaluates to null and the row is filtered out. But case A’s transfer into Wellington from Auckland on Feb 25 is missing, and there is definitely a previous office in that case. What’s going on?

The problem here is that I wasn’t paying attention to the order in which the clauses execute. It’s natural to assume that because the SELECT clause comes first, it gets executed first. In fact, a SELECT-FROM-WHERE query is executed FROM-WHERE-SELECT. This means that inside the TRANSFERS subquery, the WHERE clause filters out rows with a start date outside the date bounds before the lag function gets to calculate the previous office. Here’s what I get when I just run the TRANSFERS sub-query with the same date parameters:

The row where case A was managed in Auckland from Feb 20 to Feb 25 is filtered out by WHERE as the start date is not within the specified bounds. Hence the window that LAG uses in SELECT to calculate the previous office has no row to look back to for the next row in the case A sequence, and this causes a null to be generated. When the WHERE clause in the outer query compares OFFICE and PREV_OFFICE, a null is returned and so this row doesn’t make it through.

So what can we do about this? Well, some people will say wrap the null values in NVL. That’s fine if you want a default value to come through, but in this case I would actually like to see the previous office – the logical fault lies in the sub-query rather than the outer query. The problem arises because the previous row is not being passed through to SELECT, so we need to get it included somehow.

In the end this was quite simple – we’re only filtering based on START_DATE, but if we include the same filter on END_DATE then the previous row will come through, by virtue of the fact that each row in the sequence ends on the same day the subsequent row starts. Here’s how the report is modified with a change to the WHERE clause in the sub-query:

WHERE START_DATE BETWEEN TO_DATE(:STDATE, 'yyyy/mm/dd') AND TO_DATE(:ENDDATE, 'yyyy/mm/dd')
  OR END_DATE BETWEEN TO_DATE(:STDATE, 'yyyy/mm/dd') AND TO_DATE(:ENDDATE, 'yyyy/mm/dd')

And the transfer from Auckland to Wellington shows up:

Reference for LAG and other analytic functions: http://psoug.org/reference/analytic_functions.html

Summing a random sample

A colleague came to me recently with a problem:
I have a bunch of numbers in a spreadsheet, and I need to take a random sample of 50 of them, and sum it. I can do the sample by random sorting, but I need to repeat this several times, so I’d like a function that I can repeat without having to muck around.

Well, sampling – I have some code which will do just that. After that, I think I should be able to manage summing the array.

Here’s the code for creating a random sample:

Private Sub SampleArray(ByRef avBigset As Variant, ByRef avSmallset As Variant)
    ' SampleArray populates a random sample avSmallset from an array avBigset
    ' without replacement (each element in avBigset is considered once only)
    Dim lRemainder As Long
    Dim lSize As Long
    Dim lOb As Long
    Dim lPickit As Long
        
    ' Make sure we're dealing with arrays...
    If Not IsArray(avBigset) Or Not IsArray(avSmallset) Then Exit Sub
    ' Initialise
    lRemainder = UBound(avBigset)
    lSize = UBound(avSmallset)
    Randomize 0
    
    Do While lSize > 0  ' Still some left to pick up
        lOb = lOb + 1
        If Rnd < lSize / lRemainder Then
            lPickit = lPickit + 1
            avSmallset(lPickit) = avBigset(lOb)
            lSize = lSize - 1
        End If
        lRemainder = lRemainder - 1
    Loop    ' Sample complete
    
End Sub

The issue is that it’s built to work with 1-D arrays, so given a range as input to the function, I would need to push the values of the range into an array. I could do that with something simple like the below:

Private Function arr(ByRef rng As Range) As Variant
    ' Convert range to 1-D array
    Dim i As Long
    Dim j As Long
    Dim k As Long
    Dim rowCount As Long
    Dim colcount As Long
    Dim lim As Long
    Dim sArr() As String
    Dim vArr As Variant
    
    vArr = rng.Value2
    
    ' dimension array
    rowCount = UBound(vArr)
    colcount = UBound(vArr, 2)
    lim = rowCount * colcount
    ReDim sArr(1 To lim)
    k = 1
    
    ' populate
    For i = 1 To rowCount
        For j = 1 To colcount
            sArr(k) = vArr(i, j)
            k = k + 1
        Next j
    Next i
    
    arr = sArr
    
End Function

This will do the trick, but performance-wise it works out pretty horribly. The bigger the array, the longer this step will take, and it’s all useless, unnecessary overhead. The better approach, I think, is to revise the SampleArray function so it works with a range rather than an array.

Actually this turns out to be very easy. The SampleArray code loops over an array, but with a range, we can just loop over the Cells collection. The rest of the code is pretty much identical, except that we’ll use a For Each…Next loop rather than Do While:

Private Sub SampleRange(ByRef rngBigset As Variant, ByRef avSample As Variant)
    ' SampleRange populates a random sample avSample from a range rngBigset
    ' without replacement (each element in rngBigset is considered once only)
    Dim lRemainder As Long
    Dim lSize As Long
    Dim lPickit As Long
    Dim rngCell As Excel.Range
    
    
    ' Initialise
    lRemainder = rngBigset.Cells.Count
    lSize = UBound(avSample)
    Randomize 0
    
    For Each rngCell In rngBigset.Cells
        If lSize <= 0 Then Exit For ' Sample complete
        If Rnd < lSize / lRemainder Then
            lPickit = lPickit + 1
            avSample(lPickit) = rngCell
            lSize = lSize - 1
        End If
        lRemainder = lRemainder - 1
    Next rngCell

End Sub

So is there a significant performance gain in doing it this way, taking out the array function? Unsurprisingly, yes there is. I created two functions, SAMPLESUM, which requires the conversion to an array beforehand:

Public Function SAMPLESUM(ByRef in_range As Range, ByVal sample_size As Long) As Variant
    ' Sum a random sample of size sample_size from in_range
    Dim sample_array() As Double
    Dim in_array As Variant
    Dim index As Long
    
    ' set up large and small arrays
    in_array = arr(in_range)
    ReDim sample_array(1 To WorksheetFunction.Min(UBound(in_array), sample_size))
    
    ' get sample
    SampleArray in_array, sample_array
    ' sum sample array
    SAMPLESUM = WorksheetFunction.Sum(sample_array)
    
End Function

and SAMPLESUM2, working directly with the range:

Public Function SAMPLESUM2(ByRef in_range As Range, ByVal sample_size As Long) As Variant
    ' Sum a random sample of size sample_size from in_range
    Dim sample_array() As Double
    Dim index As Long
    
    ' set up large and small arrays
    ReDim sample_array(1 To WorksheetFunction.Min(in_range.Cells.Count, sample_size))
    
    ' get sample
    SampleRange in_range, sample_array
    ' sum sample array
    SAMPLESUM2 = WorksheetFunction.Sum(sample_array)

End Function

I then set up a routine which timed 100 calculations of each version, taking a sample of size 10 from a range of numbers with 50,000 rows and 10 columns:

Public Sub timeit()
Dim starttime As Double
Dim endtime As Double
Dim totaltime As Double
Dim i As Integer
Dim numrange As Excel.Range

Set numrange = Range("A1:J50000")


starttime = Timer

For i = 1 To 100
    SAMPLESUM numrange, 10
Next i

endtime = Timer
totaltime = endtime - starttime

Debug.Print "Time taken for 100 calculations (SAMPLESUM): " & totaltime & " seconds"

starttime = Timer

For i = 1 To 100
    SAMPLESUM2 numrange, 10
Next i

endtime = Timer
totaltime = endtime - starttime

Debug.Print "Time taken for 100 calculations (SAMPLESUM2): " & totaltime & " seconds"

End Sub

Which output:

Time taken for 100 calculations (SAMPLESUM): 44.140625 seconds
Time taken for 100 calculations (SAMPLESUM2): 12.546875 seconds

I think I’ll go with door number 2.