Regex Functions in Excel, part 3

Part 3 of 4

I’ve left this project alone for quite a while, until finally this weekend I had a little time to put into working on it. In part 2 I was still a little undecided as to how I was going to implement some of the worksheet functions, but now I am more or less satisfied with them. I’ve bundled the functions into Regex.xla, which is now available for download. The project is open for viewing, so code may be inspected or modified as you see fit – please let me know if you discover any errors or anything that could have been done better/smarter.

This is the first general purpose add-in I’ve created (I have written several application-specific add-ins), so I’d be interested to know if anyone has any feedback on this one. Some things I have not included which I’d like to:

  • Help files – I have started writing some html to provide users with an explanation of the functions and examples of their use, but I’m finding this very time-intensive. Next release I’ll include them.
  • A Regex Find/Replace dialog – still a work in progress, I’ve decided to leave this until I make a decision about whether to release the add-in as a compiled .dll file or just leave it as a .xla. At this point I have yet to test whether or not there would be any performance benefit in compiling the add-in in VB6 (or VB.NET?), which would be the key benefit I see in doing so.
  • A basic regular expression syntax included – while there are plenty of good sites to look at for regex syntax, it’d be great to include a quick overview of how they work and maybe a short list of FAQ-type answers to the usual ‘how do I do x?’ questions. Next release.

So, what’s included so far? Rather than post the entire source code for all five functions, I thought I’d put in all the header comments here, as they provide a reasonable explanation of their purpose (Roy MacLean wrote an excellent post on comments in code which motivated me putting these in – ordinarily until now my commenting in the header has been pretty sparse). So here they are:

Public Function RXFIND(ByRef find_pattern As Variant, _
                        ByRef within_text As Variant, _
                        Optional ByVal start_num As Long, _
                        Optional ByVal case_sensitive As Boolean) As Long
' RXFIND - Returns the starting position of text matching the regex pattern
' find_pattern in the string within_text, if a match is found.
' Optional long start_num specifies the number of the character in within_text
' to start matching. Default=0.
' Optional boolean case_sensitive makes the pattern case sensitive if true,
' insensitive otherwise. Default=true.

Public Function ISRXMATCH(ByRef find_pattern As Variant, _
                        ByRef within_text As Variant, _
                        Optional ByVal case_sensitive As Boolean) As Boolean
' ISRXMATCH - Returns true if the regex pattern find_pattern is matched
' in the string within_text, false otherwise.
' Optional boolean case_sensitive makes the pattern case sensitive if true,
' insensitive otherwise. Default=true.

Public Function RXGET(ByRef find_pattern As Variant, _
                        ByRef within_text As Variant, _
                        Optional ByVal submatch As Long, _
                        Optional ByVal start_num As Long, _
                        Optional ByVal case_sensitive As Boolean) As String
' RXGET - Looks for a match for regular expression pattern find_pattern
' in the string within_text and returns it if found, error otherwise.
' Optional long submatch may be used to return the corresponding submatch
' if specified - otherwise the entire match is returned.
' Optional long start_num specifies the number of the character to start
' searching for in within_text. Default=0.
' Optional boolean case_sensitive makes the regex pattern case sensitive
' if true, insensitive otherwise. Default=true.

Public Function RXMATCH(ByVal find_pattern As Variant, _
                    ByVal within_range As Variant, _
                    Optional ByVal case_sensitive As Boolean) As Long
' RXMATCH - Searches for the first cell matching regular expression pattern
' find_pattern in the range argument within_range.
' Optional boolean case_sensitive makes the regular expression case
' sensitive if true, insensitive otherwise. Default=true.
' within_range should be a single row or column range. If the range has
' multiple rows and columns, only the first row is searched.

Public Function RXSUB(ByRef within_text As Variant, _
                    ByRef old_text As Variant, _
                    ByRef new_text As Variant, _
                    Optional ByVal start_num As Long, _
                    Optional ByVal case_sensitive As Boolean, _
                    Optional ByVal is_global As Boolean) As String
' RXSUB - Function to substitute text matching regex pattern old_text with
' string argument new_text in the string within_text.
' Optional long start_num specifies which character to start replacement from,
' default=0.
' Optional boolean case_sensitive will make the pattern case sensitive if true
' and insensitive otherwise. Default=true.
' Optional boolean is_global will substitute new_text for *all* occurrences
' matching old_text if true, only the first otherwise. Default=false.

In my final post on this project, I’ll include some examples of these functions in use, and hopefully have an updated add-in with help files and a find/replace dialog. Maybe also a discussion of why I did or didn’t go with a compiled .dll. Maybe also a regex syntax. Maybe I’ll need to make this series into 5 posts…

[EDIT: I’ve changed the download link to a fileden url, this can be accessed directly now.]

Advertisements

More fun with Excel 2007…

Work’s keeping me rather busy at the moment, so I haven’t had as much time as I would like on here. I did think, though, that I should spare a moment to write a little more about my continuing exploration of Excel 07. I have to admit to being perhaps a little harsh in my first evaluation, so it’s only fair that I post a little positive feedback where I think it’s due.

Over the last couple of days I’ve been re-configuring a monthly balance sheet used by an NPO my dad does some work for. I have to say, once I got started exploring the new UI, and once I got past the where-the-hell-has-feature-x-gone stage, I was really impressed. Here are some of the things that really stood out for me:

  1. Formula auto-completion. When you write a lot of formulae, the Intellisense drop-down is a godsend. ‘Nuff said.
  2. Name manager. This is a very cool feature, much easier to use than the standard define names dialog. I particularly like the ability to easily change the scope of a name using a drop-down list. No more copy/paste-ing sheet names! Also, comments on names – fantastic idea, which is really going to save time spent on documentation.
  3. Tables. Lists have morphed slightly, into tables, and they are much easier to work with than previously. Structured references, in combination with the Intellisense feature and the ability to edit the table name, are going to save me a lot of work I think.
  4. It’s pretty. (I admit it, I’m a sucker for eye candy on the desktop.) While there are some aspects of the interface I am still coming to terms with (the ribbon being one), a lot of the changes to styles, formats and the way they’re applied have functional benefits, as well as just being prettier than they were in 03. Basically, the less time I have to spend formatting a spreadsheet the better, and the new setup allows more to be done with fewer clicks. Good.

So there you have it – I’m warming to 2007. Finally.

Project Euler #2

Ok, I don’t intend to post all of them, but here’s the second problem:

# Problem 2
# Each new term in the Fibonacci sequence is generated by adding the previous
# two terms. By starting with 1 and 2, the first 10 terms will be:

# 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, …

# Find the sum of all the even-valued terms in the sequence which do not
# exceed four million.
def gen_Fib_even(a, b, below):
“””pre: a, b, below +ve integers
returns: generator with an iterator to obtain the Fibonacci sequence
“””
while (b < below): if b % 2 == 0: yield b c = b b = a + b a = c print sum(gen_Fib_even(1, 1, 4000000))[/sourcecode]

Project Euler #1

Tonight I decided I’d give the problems at Project Euler a go using Python…here is my first shot at problem 1:

If we list all the natural numbers below 10 that are multiples of 3 or 5, we get 3, 5, 6 and 9. The sum of these multiples is 23.

Find the sum of all the multiples of 3 or 5 below 1000.

Code:

def sum_multiples(num1, num2, upper):
“””
pre: num1, num2, upper +ve integers
num1, num2 < upper returns: sum of all multiples of num1 and num2 below upper """ result = 0 for x1 in xrange(0, upper, num1): result += x1 for x2 in xrange(0, upper, num2): if x2 % num1 != 0: result += x2 return result[/sourcecode] Looking through some of the solutions on the forum, there were some really insightful and elegant ways proposed to retrieve the result, often reducing the problem to a single calculation. I'm really looking forward to seeing what comes out of the later solutions.

The subtle art of indirection

…any problem in computing can be solved by adding another level of indirection…

Butler Lampson in Authentication in distributed systems: Theory and practice

I have fond memories of the first time I cracked open a book on programming in C, and encountered pointers. Pointers are variables which hold the memory address of another variable, which the programmer can then manipulate. Without going into detail about why they are or aren’t a good thing (there are some quite vigorous arguments over that), I mention them in connection with the indirection operator *, which acts on a pointer p by dereferencing it, returning the contents of the object that it points to.

Indirection in C has a (sort of) counterpart in the Excel INDIRECT() function. INDIRECT takes a string argument and evaluates it, returning the value of whatever that string refers to. So placing =INDIRECT(B1) in cell A1 creates a direct reference to cell B1. If cell B1 contains the text “C1”, then “C1” is the argument INDIRECT evaluates. A1 will contain an indirect reference to cell C1, and will return the value contained in C1. Although this might not seem that useful at first (you could after all obtain the same result by just entering “=C1” in cell A1), the use of indirect references is a powerful tool for building dynamic formulae. Chip Pearson’s site has some excellent examples of these.

One particular example I have used in the past relates to Data Validation. A common scenario with validation is to select “List”, and then to enter a named range reference in the source box. For instance, supposing you wish to restrict entries to a list of month names, contained in a range named “Months” – so, you enter “Months” in the source box. But if you wish to change the source, you need to go through the Data -> Validation dialog, and hard-code the new range reference in there. An alternative is to enter the name of the source range in a cell, and then return the source range to the validation dialog by entering “=INDIRECT(<cell address>)” in the source box. This now means that updating the cell with a new range reference will automatically update the validation applied.

Excel 2007 speed test

Up until now, I have been using Excel 2003 at both home and work, and have had no strong urges to switch to 2007. Since its release I have seen rather a lot of comment about its performance, with assessments ranging from fair to rubbish, for instance on Newton Excel Bach and at keyongtech. The Excel team blog also has a recent post asking for examples from people who’ve been experiencing slower VBA performance with Excel 2007 than with 2003. So up until now I’ve held off. However, my employer is about 3 months away from rolling out Office 2007 nationwide, and so I thought it’d be worthwhile taking a look.

So this evening I downloaded a copy of the trial version of Office 2007 at home and tried out a (very basic) speed test. The code below repeats a primality test 10,000 times and outputs to cells in the active sheet:

Public Function ISPRIME(ByVal lTest As Long) As Boolean
‘ Returns TRUE for a prime, FALSE for composite
Dim lIn As Long

lIn = 2
ISPRIME = True
Do While lIn <= Sqr(lTest) If lTest Mod lIn = 0 Then ISPRIME = False Exit Do End If lIn = lIn + 1 Loop End Function Public Sub SpeedTest() Dim rngTest As Range Dim sglStart As Single Dim sglFinish As Single sglStart = Timer For Each rngTest In ActiveSheet.Range("A1:J1000") If ISPRIME(rngTest.Row + (rngTest.Column - 1) * 1000) Then rngTest.Value = "Prime!" Else: rngTest.Value = "Not prime..." End If Next rngTest sglFinish = Timer Debug.Print "Total seconds elapsed = " & sglFinish - sglStart End Sub[/sourcecode] Running this 10 times in 2003 gave the following results: Total seconds elapsed = 6 Total seconds elapsed = 6.15625 Total seconds elapsed = 6.171875 Total seconds elapsed = 5.953125 Total seconds elapsed = 6.203125 Total seconds elapsed = 6.234375 Total seconds elapsed = 6.6875 Total seconds elapsed = 6.203125 Total seconds elapsed = 5.953125 Total seconds elapsed = 5.9375 and in 2007: Total seconds elapsed = 9.84375 Total seconds elapsed = 9.296875 Total seconds elapsed = 9.203125 Total seconds elapsed = 9.171875 Total seconds elapsed = 9.21875 Total seconds elapsed = 9.140625 Total seconds elapsed = 9.140625 Total seconds elapsed = 10.09375 Total seconds elapsed = 9.1875 Total seconds elapsed = 9.171875 The main point of the exercise for me was to evaluate whether there was a significant difference in speed of data transfer from VBA to Excel. Not only has this been an issue for my code in the past (i.e. based on experience this is one of the first things I generally look at when optimising performance), but a number of people have commented that VBA in 2007 does this a lot slower than it does in 2003. While the results above are hardly concrete evidence, I think at first glance this does seem to be borne out. Incidentally, I tried running a similar test without transferring the data to a worksheet, using the code below: [sourcecode language='vb'] Public Sub SpeedTest2() Dim lTest As Long Dim bResult As Boolean Dim sglStart As Single Dim sglFinish As Single sglStart = Timer For lTest = 2 To 10000 bResult = ISPRIME(lTest) Next lTest sglFinish = Timer Debug.Print "Total seconds elapsed = " & sglFinish - sglStart End Sub [/sourcecode] and the results showed very little difference between the two (as I would have expected, really). Here's 2003: Total seconds elapsed = 0.078125 Total seconds elapsed = 0.078125 Total seconds elapsed = 0.0625 Total seconds elapsed = 0.0625 Total seconds elapsed = 0.078125 Total seconds elapsed = 0.078125 Total seconds elapsed = 0.09375 Total seconds elapsed = 0.078125 Total seconds elapsed = 0.078125 Total seconds elapsed = 0.078125 and in 2007: Total seconds elapsed = 0.09375 Total seconds elapsed = 0.078125 Total seconds elapsed = 0.0625 Total seconds elapsed = 0.078125 Total seconds elapsed = 0.078125 Total seconds elapsed = 0.078125 Total seconds elapsed = 0.078125 Total seconds elapsed = 0.078125 Total seconds elapsed = 0.078125 Total seconds elapsed = 0.078125

Random Sampling in VBA

Occasionally we get asked to provide random samples from our database, and it’s the sort of thing that can be done very easily in SAS.

For example, here’s a SAS routine I have often used to extract a sample sampleset of size 100 without replacement from a large dataset bigset (I can’t remember where this code originally came from but I definitely can’t take credit for it):

data sampleset (drop=remainder size);
size=100;
remainder=totobs;

do while (size > 0);
pickit + 1;
if ranuni(0) < size/remainder then do; set bigset point=pickit nobs=totobs; output; size=size - 1; end; remainder=remainder - 1; end; stop; run;[/sourcecode] While this is usually sufficient, I have sometimes wondered whether it would be possible to use the same logic to extract a sample in VBA. So tonight I decided to test it out. The following code appears to do the trick quite nicely, but be warned, I haven't done a lot of testing yet. [sourcecode language='vb']Public Sub SampleArray(ByRef avBigset As Variant, ByRef avSmallset As Variant) ' SampleArray populates a random sample avSmallset from an array avBigset ' without replacement (each element in avBigset is considered once only) Dim lRemainder As Long Dim lSize As Long Dim lOb As Long Dim lPickit As Long ' Make sure we're dealing with arrays... If Not IsArray(avBigset) Or Not IsArray(avSmallset) Then Exit Sub ' Initialise lRemainder = UBound(avBigset) lSize = UBound(avSmallset) Randomize 0 Do While lSize > 0 ‘ Still some left to pick up
lOb = lOb + 1
If Rnd < lSize / lRemainder Then lPickit = lPickit + 1 avSmallset(lPickit) = avBigset(lOb) lSize = lSize - 1 End If lRemainder = lRemainder - 1 Loop ' Sample complete End Sub[/sourcecode] It coped fairly well with populating a sample array of size 1,000 from an array of size 10,000,000, but as I say I haven't tested too extensively so I don't yet know its limits, or whether there is a much more efficient method out there. Has anyone else used VBA in this way? What sort of method would you use?