How do you SQL in Excel?

QueryCell is the Excel add-in from Oak Focus Software that brings SQL into Excel. I reviewed version 1.4 some time ago, and a few days ago lead developer Sam Howley told me about the release of version 2.0.

The new version is a complete rewrite and offers a noticeable improvement in speed, stability and responsiveness, in addition to support for 64-bit Excel. As always I’m impressed with the clean, simple look and feel of the interface, which slides out the editor when you need to use it and tucks it away when you’re done. You can manipulate data from inside the current workbook, or externally through an ODBC connection.

Sam has very kindly agreed once again to give away some free licences to Number Cruncher readers. All you have to do to get yourself one is to send me an email describing how you’ve used SQL in Excel. Could be bringing external data in through a query, or querying tables in the current workbook, could be using the Excel Data menu and MS Query to add in a connection, or scripting an ADO connection using VBA, or using QueryCell. Surprise me! Send me a description of what the information need was and how you went about resolving it.

Email me at the address at about, with ‘SQL in Excel’ in the subject line, by 10:00 pm NZDT on Saturday 14 July. I’ll judge the best/most interesting of these and publish them in a follow-up post after the weekend, so it’s important that you’re ok with other people reading what you send me.

Summing a random sample

A colleague came to me recently with a problem:
I have a bunch of numbers in a spreadsheet, and I need to take a random sample of 50 of them, and sum it. I can do the sample by random sorting, but I need to repeat this several times, so I’d like a function that I can repeat without having to muck around.

Well, sampling – I have some code which will do just that. After that, I think I should be able to manage summing the array.

Here’s the code for creating a random sample:

Private Sub SampleArray(ByRef avBigset As Variant, ByRef avSmallset As Variant)
    ' SampleArray populates a random sample avSmallset from an array avBigset
    ' without replacement (each element in avBigset is considered once only)
    Dim lRemainder As Long
    Dim lSize As Long
    Dim lOb As Long
    Dim lPickit As Long
        
    ' Make sure we're dealing with arrays...
    If Not IsArray(avBigset) Or Not IsArray(avSmallset) Then Exit Sub
    ' Initialise
    lRemainder = UBound(avBigset)
    lSize = UBound(avSmallset)
    Randomize 0
    
    Do While lSize > 0  ' Still some left to pick up
        lOb = lOb + 1
        If Rnd < lSize / lRemainder Then
            lPickit = lPickit + 1
            avSmallset(lPickit) = avBigset(lOb)
            lSize = lSize - 1
        End If
        lRemainder = lRemainder - 1
    Loop    ' Sample complete
    
End Sub

The issue is that it’s built to work with 1d arrays, so given a range as input in the function, I would need to push the values of the range into an array. I could do that with something simple like the below:

Private Function arr(ByRef rng As Range) As Variant
    ' Convert range to 1-D array
    Dim i As Long
    Dim j As Long
    Dim k As Long
    Dim rowCount As Long
    Dim colcount As Long
    Dim lim As Long
    Dim sArr() As String
    Dim vArr As Variant
    
    vArr = rng.Value2
    
    ' dimension array
    rowCount = UBound(vArr)
    colcount = UBound(vArr, 2)
    lim = rowCount * colcount
    ReDim sArr(1 To lim)
    k = 1
    
    ' populate
    For i = 1 To rowCount
        For j = 1 To colcount
            sArr(k) = vArr(i, j)
            k = k + 1
        Next j
    Next i
    
    arr = sArr
    
End Function

This will do the trick, but performance-wise it works out pretty horribly. The bigger the array, the longer this step will take, and it’s all useless, unnecessary overhead. The better approach is I think to revise the SampleArray function so it will work with a range rather than an array.

Actually this turns out to be very easy. The SampleArray code loops over an array, but with a range, we can just loop over the Cells collection. The rest of the code is pretty much identical, except that we’ll use a For Each…Next loop rather than Do While:

Private Sub SampleRange(ByRef rngBigset As Variant, ByRef avSample As Variant)
    ' SampleRange populates a random sample avSample from a range rngBigset
    ' without replacement (each element in rngBigset is considered once only)
    Dim lRemainder As Long
    Dim lSize As Long
    Dim lPickit As Long
    Dim rngCell As Excel.Range
    
    
    ' Initialise
    lRemainder = rngBigset.Cells.Count
    lSize = UBound(avSample)
    Randomize 0
    
    For Each rngCell In rngBigset.Cells
        If lSize <= 0 Then Exit For ' Sample complete
        If Rnd < lSize / lRemainder Then
            lPickit = lPickit + 1
            avSample(lPickit) = rngCell
            lSize = lSize - 1
        End If
        lRemainder = lRemainder - 1
    Next rngCell

End Sub

So is there a significant performance gain in doing it this way, taking out the array function? Unsurprisingly, yes there is. I created two functions, SAMPLESUM, which requires the conversion to an array beforehand:

Public Function SAMPLESUM(ByRef in_range As Range, ByVal sample_size As Long) As Variant
    ' Sum a random sample of size sample_size from in_range
    Dim sample_array() As Double
    Dim in_array As Variant
    Dim index As Long
    
    ' set up large and small arrays
    in_array = arr(in_range)
    ReDim sample_array(1 To WorksheetFunction.Min(UBound(in_array), sample_size))
    
    ' get sample
    SampleArray in_array, sample_array
    ' sum sample array
    SAMPLESUM = WorksheetFunction.Sum(sample_array)
    
End Function

and SAMPLESUM2, working directly with the range:

Public Function SAMPLESUM2(ByRef in_range As Range, ByVal sample_size As Long) As Variant
    ' Sum a random sample of size sample_size from in_range
    Dim sample_array() As Double
    Dim index As Long
    
    ' set up large and small arrays
    ReDim sample_array(1 To WorksheetFunction.Min(in_range.Cells.Count, sample_size))
    
    ' get sample
    SampleRange in_range, sample_array
    ' sum sample array
    SAMPLESUM2 = WorksheetFunction.Sum(sample_array)

End Function

I then set up a function which timed 100 calculations of each version, taking a sample of size 10 from a range of numbers with 50,000 rows and 10 columns:

Public Sub timeit()
Dim starttime As Double
Dim endtime As Double
Dim totaltime As Double
Dim i As Integer
Dim numrange As Excel.Range

Set numrange = Range("A1:J50000")


starttime = Timer

For i = 1 To 100
    SAMPLESUM numrange, 10
Next i

endtime = Timer
totaltime = endtime - starttime

Debug.Print "Time taken for 100 calculations (SAMPLESUM): " & totaltime & " seconds"

starttime = Timer

For i = 1 To 100
    SAMPLESUM2 numrange, 10
Next i

endtime = Timer
totaltime = endtime - starttime

Debug.Print "Time taken for 100 calculations (SAMPLESUM2): " & totaltime & " seconds"

End Sub

Which output:

Time taken for 100 calculations (SAMPLESUM): 44.140625 seconds
Time taken for 100 calculations (SAMPLESUM2): 12.546875 seconds

I think I’ll go with door number 2.

Regular Expressions – new version

Just a quick post tonight to let you all know, I’ve added a new page for downloads, which contains a new version of the regular expressions add-in, compatible with Excel 2007 and later. I’ve added in a couple of utility functions for convenience (mine more than yours but you might find them useful), and a form to evaluate regular expressions against test strings. And there’s even documentation!

The documentation gives a (very) brief intro to the uses and abuses of regular expressions, a run-down of the worksheet functions in the add-in and some examples of their use. Here are a couple of those, I hope you find them useful.

Matching cells which contain variations on a word

There are some words in the English language which Americans, god bless them, spell in their own special way. However, given input on the spreadsheet from users who spell both ways (correctly and incorrectly), you may wish to match both variations of words like ‘realise’ (‘realize’) and ‘colour’ (‘color’).
The pattern to match realise/realize is simple: \breali(s|z)e\b
The word boundary markers ensure we are looking at a complete word, and the alternation of (s|z) means that we match both versions.
Applying the ISRXMATCH formula demonstrates this is successful:

Validating Email Addresses

Given a list of email addresses in a column on a spreadsheet, we wish to ensure that these stick to a form which at least obeys some of the rules governing the format of email addresses. As these are going to be used by a script to send emails, we wish to minimise the number of undeliverable responses due to invalid addresses. The basic rules we specify for these addresses are as follows:
The username part of the address contains one or more alphanumeric characters, and possibly some additional special characters. This is followed by a single @ sign, followed by the domain name, which consists of one or more alphanumeric and special characters, ending with a dot followed by the top-level domain. This must contain only alphanumeric characters, and there must be between 2 and 6 of these. The address should be the entire content of the cell, so the beginning and ending anchors are used at the start and end of the pattern. Case is unimportant, so the case_sensitive flag is set to false.
The pattern is as follows: ^[a-z0-9_%.+-]+@[a-z0-9-.]+\.[a-z]{2,6}$
This is then used in ISRXMATCH – a valid email address according to our rules above will return true:

The second address in the list fails due to the whitespace in the username, whereas the fourth fails because the domain name does not include a top-level domain part of a dot followed by 2-6 letters.
I borrowed this regex from http://www.regular-expressions.info/email.html. As well as a couple of alternative regexes to cover some edge cases which the above doesn’t catch, this page also discusses why email addresses can be tricky, and why you shouldn’t go overboard trying to cover every exception.

The Five Reasons Google Docs Is Not Ready To Replace Excel

Today’s post is from guest contributor Olivia Lennox.

Not too long after the initial excitement of having an application that looks exactly as a spreadsheet is meant to open up in seconds within your browser, Google Docs starts to let you down. Everything from performance on large sheets to the presentation of data is substandard when compared to even the Excel that ships with the 2003 version of Microsoft Office.

Here are the five main areas Google need to tackle before power spreadsheet users can even think about moving from Excel.

1. Graphical Presentation Of Data

One of the first tasks we attempted to replicate was to store and present some stock market performance data from an online share dealing account. To be fair to Google Docs Spreadsheet getting used to entering data was simple and the chart was quickly on it’s own sheet after clicking the add chart button and choosing a few options.

Here was where the vast power of Excel that we take for granted was obviously lacking. How about adding trend lines to scatter charts? This seems currently to be an either/or only feature in Google Docs which is disappointing. In fact only a limited amount of customization is possible at all in the Graphing function. The end result of this test was a graph with the correct headings, colours and lines on Excel and a poor compromise of one line, and unwanted colour and style in Google Docs.

With Excel hardly at the zenith of the data presentation world (some apps like Jaspersoft are near this level) it’s not really possible to consider Google Docs a suitable contender while it rests a long, long way behind Excel.

2. Load And Usage Performance For Large Sheets

Many organisations use Excel to manage very large amounts of data and do modelling on that data using some pretty complicated formulas. I’d say being able to handle these sheets with ease on a mid-range PC typical of many office settings should be a very high priority.

Unfortunately Google Docs significantly disappoints. In some simple tests with opening one large sheet, amending some fields and waiting for the recalculation Excel actually finished before we were able to enter the field on Google. This was repeated on five separate mid range machines with above average Internet connections and two gigabytes of RAM.

3. The Range Of Data Analysis And Presentation Tools In Excel Blows Google Away

Google does have the basics right here. Sorting data and pivot tables worked well and relatively quickly (although on the machines we tested Excel was always faster as noted above) and was easy to figure out for an experienced Spreadsheet user.

The Data Analysis TookPak in Excel however is just so powerful it’ll be a long time before any Spreadsheet can match it – let alone an online-only solution. Many challenging University level and beyond projects can be completed using these tools as most statistical functions are supported. A good guide can be found here which also uncovers some of the pitfalls of using a spreadsheet for some of these tasks.

4. Change In Scripting Language

Whilst this is not necessarily a fair criticism of Google Docs they have chosen to use Javascript as their scripting language whereas Microsoft uses VBA (Visual Basic for Applications). Not only are more casual users familiar with BASIC from their education – it takes school children who’ve learned BASIC just a few hours to get scripting in Excel at a simple level but existing spreadsheets are now significantly more difficult to port to Google Docs. Any corporation with a range of large sheets containing scripts needs to think very carefully about a move.

5. Security Concerns

“Security consultant Ade Barkah checked in with us to alert us to a couple of serious security issues associated to Google Docs, the web-based office software from the world’s most famous search engine company, giving a whole new meaning to its mission to make the world’s information universally accessible.”
TechCrunch 2009

Throughout the years since it’s launched there have been a number of so far minor security issues with Google Docs. However minor these have been and however quickly Google has fixed them corporations used to controlling their own security will be concerned with these issues. Microsoft hasn’t always had a perfect reputation for security but to be fair to them spreadsheets on your own private server haven’t been busy sharing a few graphs of private information with the World Wide Web.

Combining chart types

Wow, it’s been a long time. This is a post I had planned for some time ago but didn’t get around to until now – as many All Black supporters might say, about bloody time

In a made-up scenario I have two rates to report on, Metric 1 and Metric 2. An indicator of good performance is that Metric 2 sits 5-10% above Metric 1. So when I plot the monthly data, as well as the two Metrics time series I’ll add in another series called Difference, that will be calculated as Metric 2 – Metric 1. I’ll also add in the constant series Target 1 (5%) and Target 2 (10%), and what I want to see is the Difference series sitting between these two targets. Here’s what that looks like:

Now, I don’t know what you think of that, but to me it’s confusing. The difference between Metric 1 and 2 and where it sits in relation to the 5% and 10% marks is the key indicator I need to follow, but both Metric 1 and 2 are also important to provide context. Putting them all together on the same line chart makes it harder than it needs to be to see what the indicator is doing.

Thankfully, there is a much simpler way to present this, which avoids the need for the eye to track all five time series together on the chart, and also manages to signal much more clearly where the key indicator is in relation to its target. To do this I can use the relatively simple expedient of combining different chart types.

First, I right-click the Metric 1 time series, select ‘Change Series Chart Type…’ and pick out the ‘Stacked Area’ chart type:

I also format the fill of the area series to a red colour:

Next, I make the same change of chart type to the Target 1 time series. I change the colour to orange (or amber if you prefer), and the name of the series to ‘0-5% above’:

I remove both the Difference and the Target 2 time series from the chart, and add the Target 1 series in again, by copying the data range from the worksheet, selecting the chart and pasting. Once that’s done, I again change the chart type of the new series to stacked area, set the fill colour to green and rename it to ‘5-10% above’. I also change the line colour of the Metric 2 series to a dark blue to provide a little more contrast against the area charts:

And that’s done. We now don’t need to show the ‘Difference’ series at all, as the magnitude of the difference between Metric 1 and 2 is clear without it. I also find the chart a lot simpler and more visually appealing. I should note in passing that I very rarely use the stacked area chart type, but it makes sense to me in this context.

Combining chart types like this is a simple and effective but under-used method – I guess that’s one side effect of the chart wizard, people assume that once the chart type is selected, that’s it. I first came across the idea at Jon Peltier’s site (this page has a lot of cool examples), it’s been a handy tool ever since.

An Undo Stack in VBA

I’m working on an Excel add-in which will allow a user to perform operations on cells. I’d also like to allow the user to ‘undo’ these operations. Excel itself, of course, does have an Undo button (shortcut Ctrl + Z), but this works only to undo changes made in the user interface, and is not going to help for changes made by VBA. This means the add-in will need to remember each operation performed and have them ready to supply for un-doing should the user want to do so.

Many people have created their own general solutions for undoing changes made in VBA – for instance, here’s one from Jan Karel Pieterse (who also created the hugely useful Name Manager add-in): Undo With Excel VBA. However I decided not to use a solution like this – it’s quite a lot of code to add to a project, and it is very general.

In this case I’m happy to write a more specific solution, which copes with undoing a specific action, by simply delegating the work of undoing the action to the object which did the action in the first place. Here’s an example:

' CellTest Class - allows calling code to change the value of a
' cell and then change it back

'*****************************************************************
' Private Variables
'*****************************************************************
Private mCell As Range
Private vOrigFormula As Variant

'*****************************************************************
' Public Properties
'*****************************************************************
Public Property Get Address() As String
  Address = mCell.Address
End Property

Public Property Set Cell(ByRef rCell As Range)
  Set mCell = rCell
  vOrigFormula = rCell.Formula
End Property

'*****************************************************************
' Public Methods
'*****************************************************************
Public Sub Change(ByVal sText As String)
  mCell.Value = sText
End Sub

Public Sub ChangeBack()
  mCell.Formula = vOrigFormula
End Sub

So what about storing up a list of operations for undoing? This is ideally suited to a stack. Stacks are neat data structures – they don’t do much, but they do it well and they’re really easy to code. Here’s one I picked up from the VBA Developer’s Handbook and tweaked a little to suit. There are two classes, StackItem and Stack. StackItem simply stores a value and a reference to another StackItem object – the next item on the stack.

' StackItem class holds a reference to the object it refers to and
' a reference to the next item in the stack

Public Value As Variant
Public NextItem As StackItem

Stack stores a reference to the item on the top of the stack, and exposes the functions Push (to push a new item on to the top) and Pop (to remove the top item and return a reference to it). Properties IsEmpty and Peek tell you if there are any items in the stack, and what the value of the top item is (without removing it), respectively. There’s a very nice visual explanation of how this looks here.

' Stack holds a reference to the top item in the stack data structure
' and manages the adding and removing of stack items

Private siTop As StackItem

'*****************************************************************
' Public Properties
'*****************************************************************
Public Property Get IsEmpty() As Boolean
    IsEmpty = siTop Is Nothing
End Property

Public Property Get Peek() As Variant
    If IsObject(siTop.Value) Then
        Set Peek = siTop.Value
    Else
        Peek = siTop.Value
    End If
End Property

'*****************************************************************
' Public Methods
'*****************************************************************
Public Function Push(ByRef varIn As Variant) As Boolean

    Dim siNew As StackItem

On Error GoTo PushError

    Set siNew = New StackItem
    If IsObject(varIn) Then
        Set siNew.Value = varIn
    Else
        siNew.Value = varIn
    End If
    Set siNew.NextItem = siTop
    Set siTop = siNew
    
    Push = True
    
Exit Function

PushError:
    Push = False
End Function

Public Function Pop() As Variant

    If IsObject(siTop.Value) Then
        Set Pop = siTop.Value
    Else
        Pop = siTop.Value
    End If
    Set siTop = siTop.NextItem

End Function

So now to store an actions for undoing, I just need to push the object which performed it onto the undo stack. Every time I wish to actually undo an action, I just pop the object reference off, and instruct it to undo.

Next time, I’ll demonstrate how this looks in practice.

Maintaining SAS formats in Excel

For one of the subject areas I deal with day to day, categorical data values are often stored in the database in narrow character fields, so we need to apply SAS formats to get readable values out in reports. To maintain these formats I store them in an Excel workbook, looking roughly like this:

This is a relatively painless way to manage keeping SAS formats up to date, as well as making them easily accessible to people who need to inspect them. Each format is stored in a named range, with the name of the range matching the name of the format. The column headers are chosen to agree with the SAS Format procedure requirement for an input control data set. The above range, for instance, would be named ‘excel_user’, and would be used to input the required data for a character format (Type=’C’) also named ‘excel_user’ (Fmtname).

So how do I get them into SAS from Excel? Once again, the SAS/ACCESS Excel libname statement is helpful here. First, we assign a libref ‘wb’ to the workbook holding the format ranges, and another libref ‘sasprogs’ which will be used to store the format catalog:

libname wb Excel "&path.\formats.xls" ver=2002 ;
libname sasprogs "&path" ;

Next we need to get a list of all the ranges that we’re interested in here. In this workbook the only named ranges are the ones containing the formats. Any other names that the libname engine detects will be the names of worksheets, which will contain a ‘$’ character. So, we simply exclude these from the query against the dictionary.tables table:

proc sql;
  create table tmp as
  select memname
  from dictionary.tables
  where libname='WB' 
    and index(memname, '$') = 0 ;
quit;

So now the table ‘tmp’ holds a list of all the format range names in the workbook. Now we need to use this table to guide SAS in building those formats. What we need to do is to submit a proc format statement for every one of those names. Like so:

%macro load_fmt(name) ;
  proc format library=sasprogs.sasfmts cntlin=wb.&name ;
%mend load_fmt;

Now we simply need to loop through the set of names to load each of the formats into the sasprogs.sasfmts catalog:

data _null_;
  set tmp;
  call execute('%load_fmt('||memname||')') ;
run;

All that’s left to do now is a little tidying up:

/*Close the file*/
libname wb clear ;
/*Inspect format contents*/
proc format library=sasprogs.sasfmts fmtlib cntlout=sasprogs.sasfmt_data; run;
/*Add them into the search path*/
options fmtsearch=(library sasfmts);

And we’re done – all the formats are now loaded and available in the SAS session. Now if there are any changes, or a new format needs to be created, just add it into the file, save it and re-run this script. Sorted!