Regular Expressions – new version

Just a quick post tonight to let you all know, I’ve added a new page for downloads, which contains a new version of the regular expressions add-in, compatible with Excel 2007 and later. I’ve added in a couple of utility functions for convenience (mine more than yours but you might find them useful), and a form to evaluate regular expressions against test strings. And there’s even documentation!

The documentation gives a (very) brief intro to the uses and abuses of regular expressions, a run-down of the worksheet functions in the add-in and some examples of their use. Here are a couple of those, I hope you find them useful.

Matching cells which contain variations on a word

There are some words in the English language which Americans, god bless them, spell in their own special way. However, given input on the spreadsheet from users who spell both ways (correctly and incorrectly), you may wish to match both variations of words like ‘realise’ (‘realize’) and ‘colour’ (‘color’).
The pattern to match realise/realize is simple: \breali(s|z)e\b
The word boundary markers ensure we are looking at a complete word, and the alternation of (s|z) means that we match both versions.
Applying the ISRXMATCH formula demonstrates this is successful:

Validating Email Addresses

Given a list of email addresses in a column on a spreadsheet, we wish to ensure that these stick to a form which at least obeys some of the rules governing the format of email addresses. As these are going to be used by a script to send emails, we wish to minimise the number of undeliverable responses due to invalid addresses. The basic rules we specify for these addresses are as follows:
The username part of the address contains one or more alphanumeric characters, and possibly some additional special characters. This is followed by a single @ sign, followed by the domain name, which consists of one or more alphanumeric and special characters, ending with a dot followed by the top-level domain. This must contain only alphanumeric characters, and there must be between 2 and 6 of these. The address should be the entire content of the cell, so the beginning and ending anchors are used at the start and end of the pattern. Case is unimportant, so the case_sensitive flag is set to false.
The pattern is as follows: ^[a-z0-9_%.+-]+@[a-z0-9-.]+\.[a-z]{2,6}$
This is then used in ISRXMATCH – a valid email address according to our rules above will return true:

The second address in the list fails due to the whitespace in the username, whereas the fourth fails because the domain name does not include a top-level domain part of a dot followed by 2-6 letters.
I borrowed this regex from http://www.regular-expressions.info/email.html. As well as a couple of alternative regexes to cover some edge cases which the above doesn’t catch, this page also discusses why email addresses can be tricky, and why you shouldn’t go overboard trying to cover every exception.

Advertisements

Regex Functions in Excel, part 3

Part 3 of 4

I’ve left this project alone for quite a while, until finally this weekend I had a little time to put into working on it. In part 2 I was still a little undecided as to how I was going to implement some of the worksheet functions, but now I am more or less satisfied with them. I’ve bundled the functions into Regex.xla, which is now available for download. The project is open for viewing, so code may be inspected or modified as you see fit – please let me know if you discover any errors or anything that could have been done better/smarter.

This is the first general purpose add-in I’ve created (I have written several application-specific add-ins), so I’d be interested to know if anyone has any feedback on this one. Some things I have not included which I’d like to:

  • Help files – I have started writing some html to provide users with an explanation of the functions and examples of their use, but I’m finding this very time-intensive. Next release I’ll include them.
  • A Regex Find/Replace dialog – still a work in progress, I’ve decided to leave this until I make a decision about whether to release the add-in as a compiled .dll file or just leave it as a .xla. At this point I have yet to test whether or not there would be any performance benefit in compiling the add-in in VB6 (or VB.NET?), which would be the key benefit I see in doing so.
  • A basic regular expression syntax included – while there are plenty of good sites to look at for regex syntax, it’d be great to include a quick overview of how they work and maybe a short list of FAQ-type answers to the usual ‘how do I do x?’ questions. Next release.

So, what’s included so far? Rather than post the entire source code for all five functions, I thought I’d put in all the header comments here, as they provide a reasonable explanation of their purpose (Roy MacLean wrote an excellent post on comments in code which motivated me putting these in – ordinarily until now my commenting in the header has been pretty sparse). So here they are:

Public Function RXFIND(ByRef find_pattern As Variant, _
                        ByRef within_text As Variant, _
                        Optional ByVal start_num As Long, _
                        Optional ByVal case_sensitive As Boolean) As Long
' RXFIND - Returns the starting position of text matching the regex pattern
' find_pattern in the string within_text, if a match is found.
' Optional long start_num specifies the number of the character in within_text
' to start matching. Default=0.
' Optional boolean case_sensitive makes the pattern case sensitive if true,
' insensitive otherwise. Default=true.

Public Function ISRXMATCH(ByRef find_pattern As Variant, _
                        ByRef within_text As Variant, _
                        Optional ByVal case_sensitive As Boolean) As Boolean
' ISRXMATCH - Returns true if the regex pattern find_pattern is matched
' in the string within_text, false otherwise.
' Optional boolean case_sensitive makes the pattern case sensitive if true,
' insensitive otherwise. Default=true.

Public Function RXGET(ByRef find_pattern As Variant, _
                        ByRef within_text As Variant, _
                        Optional ByVal submatch As Long, _
                        Optional ByVal start_num As Long, _
                        Optional ByVal case_sensitive As Boolean) As String
' RXGET - Looks for a match for regular expression pattern find_pattern
' in the string within_text and returns it if found, error otherwise.
' Optional long submatch may be used to return the corresponding submatch
' if specified - otherwise the entire match is returned.
' Optional long start_num specifies the number of the character to start
' searching for in within_text. Default=0.
' Optional boolean case_sensitive makes the regex pattern case sensitive
' if true, insensitive otherwise. Default=true.

Public Function RXMATCH(ByVal find_pattern As Variant, _
                    ByVal within_range As Variant, _
                    Optional ByVal case_sensitive As Boolean) As Long
' RXMATCH - Searches for the first cell matching regular expression pattern
' find_pattern in the range argument within_range.
' Optional boolean case_sensitive makes the regular expression case
' sensitive if true, insensitive otherwise. Default=true.
' within_range should be a single row or column range. If the range has
' multiple rows and columns, only the first row is searched.

Public Function RXSUB(ByRef within_text As Variant, _
                    ByRef old_text As Variant, _
                    ByRef new_text As Variant, _
                    Optional ByVal start_num As Long, _
                    Optional ByVal case_sensitive As Boolean, _
                    Optional ByVal is_global As Boolean) As String
' RXSUB - Function to substitute text matching regex pattern old_text with
' string argument new_text in the string within_text.
' Optional long start_num specifies which character to start replacement from,
' default=0.
' Optional boolean case_sensitive will make the pattern case sensitive if true
' and insensitive otherwise. Default=true.
' Optional boolean is_global will substitute new_text for *all* occurrences
' matching old_text if true, only the first otherwise. Default=false.

In my final post on this project, I’ll include some examples of these functions in use, and hopefully have an updated add-in with help files and a find/replace dialog. Maybe also a discussion of why I did or didn’t go with a compiled .dll. Maybe also a regex syntax. Maybe I’ll need to make this series into 5 posts…

[EDIT: I’ve changed the download link to a fileden url, this can be accessed directly now.]

The subtle art of indirection

…any problem in computing can be solved by adding another level of indirection…

Butler Lampson in Authentication in distributed systems: Theory and practice

I have fond memories of the first time I cracked open a book on programming in C, and encountered pointers. Pointers are variables which hold the memory address of another variable, which the programmer can then manipulate. Without going into detail about why they are or aren’t a good thing (there are some quite vigorous arguments over that), I mention them in connection with the indirection operator *, which acts on a pointer p by dereferencing it, returning the contents of the object that it points to.

Indirection in C has a (sort of) counterpart in the Excel INDIRECT() function. INDIRECT takes a string argument and evaluates it, returning the value of whatever that string refers to. So placing =INDIRECT(B1) in cell A1 creates a direct reference to cell B1. If cell B1 contains the text “C1”, then “C1” is the argument INDIRECT evaluates. A1 will contain an indirect reference to cell C1, and will return the value contained in C1. Although this might not seem that useful at first (you could after all obtain the same result by just entering “=C1” in cell A1), the use of indirect references is a powerful tool for building dynamic formulae. Chip Pearson’s site has some excellent examples of these.

One particular example I have used in the past relates to Data Validation. A common scenario with validation is to select “List”, and then to enter a named range reference in the source box. For instance, supposing you wish to restrict entries to a list of month names, contained in a range named “Months” – so, you enter “Months” in the source box. But if you wish to change the source, you need to go through the Data -> Validation dialog, and hard-code the new range reference in there. An alternative is to enter the name of the source range in a cell, and then return the source range to the validation dialog by entering “=INDIRECT(<cell address>)” in the source box. This now means that updating the cell with a new range reference will automatically update the validation applied.

Regex Functions in Excel

Part 2 of 4

Continuing with my exploration of regular expressions in Excel, from Part 1 – so far I’ve got to the stage of writing a couple of functions, and planning a few more. As promised, I’ll detail here what the add-in I’m constructing should deliver to the user.

First, a suite of worksheet functions:

  • RXFIND(find_pattern, within_text, [start_num], [case_sensitive])
    – This is probably the easiest one to get right – a simple emulation of the functionality of the Excel built-in function FIND. As with FIND, the inputs are the pattern to test for, followed by the text to look within. Optionally, the start position can be specified (Default=0) as can case-sensitivity (Default=TRUE). Although Excel implements a separate SEARCH function which is case-insensitive, I’d rather save myself repeating the same code and added a parameter in to allow the user to specify whether case matters or not. Parenthetically, my belief, well-founded or not, is that case always matters.
    Here’s the code I’ve come up with so far: 

    Public Function RXFIND(ByRef find_pattern As Variant, _
                        ByRef within_text As Variant, _
                        Optional ByVal start_num As Variant, _
                        Optional ByVal case_sensitive As Boolean) As Long     
    
    Dim objRegex As VBScript_RegExp_55.RegExp
    Dim colMatch As VBScript_RegExp_55.MatchCollection
    Dim vbsMatch As VBScript_RegExp_55.Match
    Dim sMatchString As String
    
    Set objRegex = New VBScript_RegExp_55.RegExp
    
    ' Initialise Regex object
    With objRegex
        .Global = False
        ' Default is case sensitive
        If IsMissing(case_sensitive) Then
            .IgnoreCase = False
        Else: .IgnoreCase = Not case_sensitive
        End If
        .Pattern = find_pattern
    End With
    
    If IsMissing(start_num) Then start_num = 0
    sMatchString = Right$(within_text, Len(within_text) - start_num)
    
    ' Create Match collection
    Set colMatch = objRegex.Execute(sMatchString)
    If colMatch.Count = 0 Then ' No match
        RXFIND = 0
    Else
        Set vbsMatch = colMatch(0)
        RXFIND = vbsMatch.FirstIndex + start_num + 1
    End If
    End Function
  • RXSUB(text,  old_text,  new_text, [instance_num], [case_sensitive], [is_global])
    – This one is similar to the Excel built-in SUBSTITUTE, although as with RXFIND, it allows a couple of extra parameters. It’s also a little trickier to code, as the optional is_global parameter (Default=FALSE) set to TRUE renders instance_num (Default=0) meaningless. I still haven’t got this in a fit state to post the code yet, but it differs only a little from RXFIND, in that instead of using the FirstIndex property and Execute method of the RegExp object, it only needs to access the Replace method, and of course it returns a string rather than a long. Work in progress.
  • RXGET(find_pattern, within_text, [start_num], [case_sensitive])
    I still haven’t completely figured out what to have this one do yet, but it’s going to rely on the Value property of the Match object, and will possibly allow for returning Submatched groups. Another work in progress.
  • RXMATCH(find_pattern, within_array, [case_sensitive])
    Returns the position of the first exact match with the find_pattern from within_array. Yet to begin.

As well as worksheet functions, I’ve been tossing around the idea of a Find/Replace dialog with regex bells and whistles. I have got no further than mimicking the Excel built-in dialog so far (i.e. no functionality as yet), but I am hopeful I’ll have something that actually does something soon. For reasons I’ll explain a little further in the next post, I am undecided as to whether I should stick with the standard user form. Anyway, that’ll do for now…next time I’ll talk a little about some of the key decisions that need to be made – late vs early binding, .NET vs the VBScript library, VB6 add-in vs .xla.

The perils of large spreadsheets

Ok, we’ve all been there. What started out as a relatively compact workbook has blossomed into a gargantuan mess that requires calculation set to Manual to change anything, and takes a full minute just to open. It’s not nice, particularly when the workbook in question is one that is relied upon heavily. I “owned”, or should I say “inherited”, one of these a couple of years back, which was used by over a hundred people in a call centre – and loathed by all of them. I didn’t like it much either, as I had to field all the calls and emails asking for it to be “fixed”.

So what to do? In a sense, of course, there’s nothing to fix, as long as everything does recalculate as it should. If not, or worse, if Excel crashes when it’s being used, then there is a problem, which usually warrants a complete redesign. Unfortunately that happens more often than the average user realises. While Excel’s memory management is continually improving with new releases (and the average memory available to it is increasing), you can only push it so far before it gives up the ghost. However, let’s assume for the sake of argument you’re not in that awful place right now. So while there’s nothing broken as such, the performance you’re getting out of it is not what it should be. What can you do to improve it? Here’s a couple of things to try.

  • Cull unnecessary data. There are cases where workbooks (particularly some financial models) do require several sheets with thousands of rows. But it’s worth questioning whether it’s all absolutely necessary. And does everything need to be in the same workbook? It may be that you can get significant gains by splitting the workbook into two or more pieces, which can then be hyperlinked – and by grouping related data this can make for a more user-friendly tool.
  • Re-use the same cache for your pivot tables if you can. Excel will very helpfully prompt you to do just that when it realises you’re about to create a pivot table using the same range or database that forms the cache for another pivot table. It also explains why – you will save memory and your workbook will be smaller. This is almost certainly a good thing. And don’t make the mistake of creating a separate cache for a subset of a previously used range (another common error) – you can always use the whole thing and exclude what you don’t want from within the pivot table after it’s been created.
  • Avoid formula repetition, wherever possible. Too many people take the “Jurassic Park” approach to building formulas, and assume that just because it is possible to do something, they should do it. Not true – here’s an example: =IF(VLOOKUP($M6,$B$2:$H$100,3,FALSE)>AVERAGE($B$2:$B6),
    VLOOKUP($M6,$B$2:$H$100,3,FALSE)-AVERAGE($B$2:$B6),
    VLOOKUP($M6,$B$2:$H$100,3,FALSE)+AVERAGE($B$2:$B6))

    Now you may ask, what’s wrong with that? Well, look at it from Excel’s point of view:
    Excel: So what do you want me to do?
    You: Well, first look up this thing in the first column of that table, and go along 3 columns from it and get the value there.
    Excel: Got it – easy!
    You: Good. Now take the mean of these numbers here and compare it with the number you just got. If it’s smaller, then look up this thing in the first column of that table, go along 3 columns and get the value there.
    Excel: Wait – you asked me to do that already.
    You: Well now I want you to do it again.
    Excel: OK then – it’s your party.
    You: Right. Now take the mean of these numbers here,…
    Excel: Seriously? I just did that before.
    You: Do it again, and subtract it from the number you just got.
    Excel: Alright – done. What if the number wasn’t smaller?
    You: Well, in that case, look up this thing in this table here…
    Excel: I can feel a migraine coming on…

    You get the picture. Of course, the problem is that Excel won’t complain when you ask it to do something silly like this – but any reasonable person would suggest that you calculate the results of the VLOOKUP and AVERAGE functions once, in separate cells, and then use the values however you like, as many times as you like, without creating a job that takes twice as much recalculation as would otherwise be necessary.
  • A related issue is the (over)use of VLOOKUP. Where you need to look up the same value more than once, even in separate cells, returning data from different columns, Excel is still having to repeat the same task unnecessarily. MATCH is your friend here, in combination with either OFFSET or INDEX. For example:
    A1=(VLOOKUP(M6,$B$2:$H$100,3,FALSE)
    A2=(VLOOKUP(M6,$B$2:$H$100,4,FALSE)
    A3=(VLOOKUP(M6,$B$2:$H$100,7,FALSE)

    could be replaced with:
    A1=MATCH(M6,$B$2:$H$100,0)
    A2=INDEX($D$2:$D$100,A1)
    A3=INDEX($E$2:$E$100,A1)
    A4=INDEX($H$2:$H$100,A1)

    Over big tables this can represent a huge saving in calculation time.

And that’s all I’ve got time for right now. Good luck!