How do you SQL in Excel?

QueryCell is the Excel add-in from Oak Focus Software that brings SQL into Excel. I reviewed version 1.4 some time ago, and a few days ago lead developer Sam Howley told me about the release of version 2.0.

The new version is a complete rewrite and offers a noticeable improvement in speed, stability and responsiveness, in addition to support for 64-bit Excel. As always I’m impressed with the clean, simple look and feel of the interface, which slides out the editor when you need to use it and tucks it away when you’re done. You can manipulate data from inside the current workbook, or externally through an ODBC connection.

Sam has very kindly agreed once again to give away some free licences to Number Cruncher readers. All you have to do to get yourself one is to send me an email describing how you’ve used SQL in Excel. It could be bringing external data in through a query, querying tables in the current workbook, using the Excel Data menu and MS Query to add a connection, scripting an ADO connection using VBA, or using QueryCell. Surprise me! Send me a description of what the information need was and how you went about resolving it.

Email me at the address on the About page, with ‘SQL in Excel’ in the subject line, by 10:00 pm NZDT on Saturday 14 July. I’ll judge the best/most interesting of these and publish them in a follow-up post after the weekend, so it’s important that you’re ok with other people reading what you send me.


Lags and Unintended Consequences

Analytic functions like LAG, LEAD, FIRST_VALUE and LAST_VALUE are a very useful addition to Oracle SQL, enabling retrieval of aggregate results without the need for self-joins. LAG, for instance, will allow you to get the value of a column from the previous row in a group of rows.

Here’s an example of where I used this recently. I was attempting to monitor transfers of cases between offices by reading from a table, CASE_MGMT_SEQ, that records which offices managed which cases, with one row for each office in the ‘management sequence’ of the case history. Here’s roughly what that looked like:

The column MGMT_SEQ tells us the position of each row in the case management sequence. Each row in the case sequence ends on the same date as the next row starts, and it is possible (although not shown in this example) for the next row to be in the same office as in the previous row. From this view we can tell that case A was managed in Wellington from Feb 1 to Feb 20 this year, then was moved to Auckland from Feb 20 to Feb 25, and finally moved back to Wellington, where it remained until Mar 10.
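To make the LAG behaviour concrete, here’s a minimal sketch in Python using the standard sqlite3 module (SQLite 3.25 or later supports LAG). It is not the Oracle table itself: only case A’s three rows come from the description above, CASE_ID is an assumed column name, and dates are stored as ISO text because SQLite has no DATE type.

```python
import sqlite3

# Hypothetical stand-in for CASE_MGMT_SEQ. CASE_ID is an assumed column name;
# the three rows are case A's history as described in the text.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE CASE_MGMT_SEQ ("
    "CASE_ID TEXT, MGMT_SEQ INTEGER, OFFICE TEXT, START_DATE TEXT, END_DATE TEXT)"
)
conn.executemany(
    "INSERT INTO CASE_MGMT_SEQ VALUES (?, ?, ?, ?, ?)",
    [
        ("A", 1, "Wellington", "2012-02-01", "2012-02-20"),
        ("A", 2, "Auckland", "2012-02-20", "2012-02-25"),
        ("A", 3, "Wellington", "2012-02-25", "2012-03-10"),
    ],
)

# LAG looks one row back within each case, ordered by MGMT_SEQ.
rows = conn.execute(
    """
    SELECT CASE_ID, MGMT_SEQ, OFFICE,
           LAG(OFFICE) OVER (PARTITION BY CASE_ID ORDER BY MGMT_SEQ) AS PREV_OFFICE
    FROM CASE_MGMT_SEQ
    ORDER BY CASE_ID, MGMT_SEQ
    """
).fetchall()

for row in rows:
    print(row)
```

The first row of the sequence has nothing to look back to, so its PREV_OFFICE comes back as None (NULL); the other two rows pick up Wellington and Auckland respectively.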

The SQL to track transfers needed to tell me (among other things; I’ve simplified this considerably) which case we’re transferring, when the transfer happened, where it was transferred to, and where it was transferred from:

    OFFICE ,

For a specified office, the report returns any transfers in or out over a period bounded by STDATE and ENDDATE.

But something weird is happening. I try running this report with the OFFICE parameter set to Wellington, STDATE set to 2012/02/21 and ENDDATE set to 2012/02/27, and here’s what I get:

There’s a transfer of case C from Wellington to Hamilton on Feb 24, certainly. The row where case C is managed in Wellington from Feb 23 to Feb 24 shouldn’t be counted as a transfer, as there is no previous office: the lag returns a null here, so the condition OFFICE <> PREV_OFFICE in the WHERE clause evaluates to null and the row is filtered out. But there should also be case A’s transfer into Wellington from Auckland on Feb 25, and it’s missing. There is definitely a previous office in that case. What’s going on?

The problem here is that I wasn’t paying attention to the order in which the clauses execute. It’s natural to assume that because the SELECT clause comes first, it gets executed first. In fact, a SELECT-FROM-WHERE query is logically evaluated in the order FROM-WHERE-SELECT. This means that inside the TRANSFERS sub-query, the WHERE clause filters out rows with a start date outside the date bounds before the LAG function gets to calculate the previous office. Here’s what I get when I just run the TRANSFERS sub-query with the same date parameters:

The row where case A was managed in Auckland from Feb 20 to Feb 25 is filtered out by WHERE as the start date is not within the specified bounds. Hence the window that LAG uses in SELECT to calculate the previous office has no row to look back to for the next row in the case A sequence, and this causes a null to be generated. When the WHERE clause in the outer query compares OFFICE and PREV_OFFICE, a null is returned and so this row doesn’t make it through.
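The effect is easy to reproduce on a toy copy of the data. The sketch below again uses Python’s sqlite3 module rather than Oracle, so plain ISO date strings stand in for TO_DATE. CASE_ID is an assumed column name, the end date of case C’s Hamilton row is invented for the example, and the sub-query is my guess at the shape of the original, not a copy of it.

```python
import sqlite3

# Hypothetical stand-in for CASE_MGMT_SEQ (CASE_ID is an assumed column name).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE CASE_MGMT_SEQ ("
    "CASE_ID TEXT, MGMT_SEQ INTEGER, OFFICE TEXT, START_DATE TEXT, END_DATE TEXT)"
)
conn.executemany(
    "INSERT INTO CASE_MGMT_SEQ VALUES (?, ?, ?, ?, ?)",
    [
        ("A", 1, "Wellington", "2012-02-01", "2012-02-20"),
        ("A", 2, "Auckland", "2012-02-20", "2012-02-25"),
        ("A", 3, "Wellington", "2012-02-25", "2012-03-10"),
        ("C", 1, "Wellington", "2012-02-23", "2012-02-24"),
        ("C", 2, "Hamilton", "2012-02-24", "2012-03-01"),  # end date assumed
    ],
)

# WHERE is applied before LAG is computed, so LAG only sees the surviving rows.
SUBQUERY = """
SELECT CASE_ID, START_DATE, OFFICE,
       LAG(OFFICE) OVER (PARTITION BY CASE_ID ORDER BY MGMT_SEQ) AS PREV_OFFICE
FROM CASE_MGMT_SEQ
WHERE START_DATE BETWEEN :stdate AND :enddate
ORDER BY CASE_ID, MGMT_SEQ
"""

params = {"stdate": "2012-02-21", "enddate": "2012-02-27"}
subquery_rows = conn.execute(SUBQUERY, params).fetchall()

for row in subquery_rows:
    print(row)
```

Case A’s Wellington row starting Feb 25 survives the filter, but its Auckland predecessor (starting Feb 20) does not, so PREV_OFFICE for that row is None instead of Auckland.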

So what can we do about this? Well, some people will say wrap the null values in NVL. That’s fine if you want a default value to come through, but in this case I would actually like to see the previous office – the logical fault lies in the sub-query rather than the outer query. The problem arises because the previous row is not being passed through to SELECT, so we need to get it included somehow.

In the end this was quite simple – we’re only filtering based on START_DATE, but if we include the same filter on END_DATE then the previous row will come through, by virtue of the fact that each row in the sequence ends on the same day the subsequent row starts. Here’s how the report is modified with a change to the WHERE clause in the sub-query:

  OR END_DATE BETWEEN TO_DATE(:STDATE, 'yyyy/mm/dd') AND TO_DATE(:ENDDATE, 'yyyy/mm/dd')

And the transfer from Auckland to Wellington shows up:
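To check the fix end to end, here’s a minimal sketch of the whole report in Python with sqlite3. The same caveats apply: it is SQLite rather than Oracle, CASE_ID and TRANSFER_DATE are assumed names, the Hamilton end date is invented, and the outer WHERE clause is only my simplified guess at how the office filter might look.

```python
import sqlite3

# Hypothetical stand-in for CASE_MGMT_SEQ (CASE_ID is an assumed column name).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE CASE_MGMT_SEQ ("
    "CASE_ID TEXT, MGMT_SEQ INTEGER, OFFICE TEXT, START_DATE TEXT, END_DATE TEXT)"
)
conn.executemany(
    "INSERT INTO CASE_MGMT_SEQ VALUES (?, ?, ?, ?, ?)",
    [
        ("A", 1, "Wellington", "2012-02-01", "2012-02-20"),
        ("A", 2, "Auckland", "2012-02-20", "2012-02-25"),
        ("A", 3, "Wellington", "2012-02-25", "2012-03-10"),
        ("C", 1, "Wellington", "2012-02-23", "2012-02-24"),
        ("C", 2, "Hamilton", "2012-02-24", "2012-03-01"),  # end date assumed
    ],
)

# The extra END_DATE condition lets the previous row through to the window,
# because each row ends on the day the next one starts.
REPORT = """
WITH TRANSFERS AS (
    SELECT CASE_ID, START_DATE AS TRANSFER_DATE, OFFICE,
           LAG(OFFICE) OVER (PARTITION BY CASE_ID ORDER BY MGMT_SEQ) AS PREV_OFFICE
    FROM CASE_MGMT_SEQ
    WHERE START_DATE BETWEEN :stdate AND :enddate
       OR END_DATE BETWEEN :stdate AND :enddate
)
SELECT CASE_ID, TRANSFER_DATE, PREV_OFFICE, OFFICE
FROM TRANSFERS
WHERE OFFICE <> PREV_OFFICE          -- NULL PREV_OFFICE rows drop out here
  AND :office IN (OFFICE, PREV_OFFICE)
ORDER BY TRANSFER_DATE
"""

params = {"office": "Wellington", "stdate": "2012-02-21", "enddate": "2012-02-27"}
report_rows = conn.execute(REPORT, params).fetchall()

for row in report_rows:
    print(row)
```

With the extra condition, both the Feb 24 move of case C to Hamilton and the Feb 25 move of case A back from Auckland come through. The Auckland row itself still gets a null PREV_OFFICE (its own predecessor ended before STDATE), so it doesn’t show up as a spurious transfer.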

Reference for LAG and other analytic functions: