Dictionary vs Collection

A pretty common requirement in programming is to be able to create a group of related objects, and put them into some sort of container. This means they can effectively be treated as one object, while still being accessible individually. For example, if I’m tracking the stats of a sports team over the course of a game, I could create a bunch of Player objects, all of which have individual attributes, but which I can view collectively as a Team object.

One way to group the Players would be to put them into a Team array. A weakness of this approach is that the only way to access one of the Players is to use their position in the array. This is ok for arrays where the positions of the members aren’t going to change, but if this isn’t the case it gets harder to keep track of things the longer the array persists in memory and the more changes need to be made to it. What is useful in this case is to be able to retrieve a Player object by using a key that uniquely identifies them, such as their name for instance, and not have to worry about which position in the Team array they are in.

The VB Collection object allows you to do this, by giving you the option of defining a ‘key’ value when the item is added to it – for instance, the below adds the range object identified by the name RangeName to a collection, and uses RangeName as the key:

Dim col As Collection
Dim rng As Excel.Range
Dim sRangeName As String
Set col = New Collection
sRangeName = "RangeName"
Set rng = Range(sRangeName)
col.Add rng, sRangeName

Now, when you wish to retrieve a particular item from the collection, you have the option of either specifying its position (index), or using the key value. If you know this was the first item added and its position hasn’t changed, you can use

col.Item(1)

(VB collections are 1-based). Otherwise, you can use

col.Item(sRangeName)

You can also iterate over the collection using For Each…Next syntax, which is both more concise and faster than accessing an array one item at a time.

Collections are fine – and pretty much all the container structures in the Excel object model use them, which is why you can say

For Each wb in Workbooks

or

For Each cell in rng.Cells

However, I hardly ever use them for implementing my own containers – where key-based lookup is required I prefer the VBScript Dictionary. The syntax for adding objects to a dictionary is very similar to adding items to a collection, apart from the reversal of the key and item arguments:

Dim dict As Scripting.Dictionary
Dim rng As Excel.Range
Dim sRangeName As String
Set dict = New Scripting.Dictionary
sRangeName = "RangeName"
Set rng = Range(sRangeName)
dict.Add sRangeName, rng

as is the retrieval of items:

dict.Item(sRangeName)

except for the fact that you have to use the key (which is safer in any case).

What really sells the dictionary to me over the collection are the following 3 methods:

dict.Exists(k) – boolean to let you know whether or not the key k has been added to the dictionary. This is very useful, as it avoids the need for the clunky idiom you’re forced to use to check the same thing with collections:

Public Function Exists(ByRef col As Collection, _
                    ByVal sKey As String) As Boolean

    Dim lCheck as Long
    On Error Resume Next
    lCheck = VarType(col.Item(sKey))
    If Err.Number = 0 Then
        Exists = True
    Else
        Exists = False
    End If

End Function

dict.Keys – Return an array containing all keys in the dictionary, which can then be iterated over. This for me is where the collection really falls down, as while it’s certainly possible to iterate over the items in a collection, the key cannot be returned from the item – which means it’s not available. A good example of why you might want this is the following from the CRXRowDict.Test function in the last post:

For iKey = LBound(mdicValues.Keys) To UBound(mdicValues.Keys)
        sKey = mdicValues.Keys(iKey)
        Set objRegex = mdicValues(sKey)
        sTest = clsRow.Item(sKey)
        Debug.Print objRegex.Pattern, sTest
        ' if any test fails, return false
        If Not objRegex.Test(sTest) Then
            bResult = False
            Exit For
        End If
    Next iKey

The key is required in two places – once to retrieve the matching regexp object from the dictionary and once to retrieve the test string from the CRowDict instance. The only way I could see to achieve the same result with a collection would be to populate and iterate over a separate array – not pretty.

dict.RemoveAll – As the name suggests, clean out all key-item pairs in the dictionary. No need to loop through either keys or items here, quick and easy.

More Regex Functions, part 2

Part 2 of 3

After building the form in Part 1, the next stage of coding the regex filter involves implementing the logic for deciding which rows of the list to be filtered will pass through the filter. Each row in the criteria range (rngCriteria) can be thought of as a test made up of a set of criteria. In order for a row in the list range (rngSource) to pass the test, all of these criteria must be met. If the list row passes the test for any row in rngCriteria, then it will be included in the eventual output.

Given a criteria range with m rows (excluding the header row) and n columns, this can be modelled in pseudo-SQL:

SELECT * FROM rngSource AS a
WHERE 
	(Crit[1,1](a) AND ... AND Crit[1,n](a))
	OR ...
	OR (Crit[m,1](a) AND ... AND Crit[m,n](a))

where Crit[i,j](a) is a boolean function representing the criterion in row i, column j.

For the built-in advanced filter, Excel constructs these criteria by parsing the text in the criteria range – a large range of tests are available, including the use of relational operators, wildcard and exact text matches. For an introduction to use of the advanced filter, see the tutorial at Contextures.

For our purposes, though, we’ll be allowing only regular expression pattern matches on the text in the list rows. While in theory it would be possible to mimic all the built-in advanced filter functionality and add regular expression tests into the mix, I’m not attempting to extend the advanced filter at this point – instead I’m creating an alternative to it.

In order to test the rows, I decided to implement a couple of classes to contain them – one for the rows in the list and extract ranges, and one which works slightly differently for the criteria range. In both classes, the core idea is to use the Scripting.Dictionary object to store key-value pairs, where the keys are provided by the header rows and the values by the text in each cell.

The first of these, CRowDict, is what I intend to use for the rows in the list and extract ranges:

Option Explicit

' CRowDict - wraps dictionary, holds key-value pairings
' for each row in the list

'********************************************************************
' Class private attributes
'********************************************************************

' holds dictionary object to be populated
Private mdicValues As Scripting.Dictionary
Private Const msCLASS As String = "CRowDict"

'********************************************************************
' Class public properties
'********************************************************************
Public Property Get Item(ByVal sKey As String) As Variant
    If mdicValues.Exists(sKey) Then
        Item = mdicValues.Item(sKey)
    Else
        Err.Raise glNOTFOUND, msCLASS, gsNOTFOUND & sKey
    End If
End Property

'********************************************************************
' Class public methods
'********************************************************************

Public Sub Populate(ByRef rngHeaders As Excel.Range, _
                ByRef rngValues As Excel.Range)
    ' Populates dictionary from the row rngValues
    
    Dim colNum As Long
    
    For colNum = 1 To rngHeaders.Cells.Count
        mdicValues.Add rngHeaders(1, colNum).Text, _
                        rngValues(1, colNum).Text
    Next colNum

End Sub

Public Sub Display()
    ' Convenience method for testing
    Dim vKey As Variant
    For Each vKey In mdicValues.Keys
        Debug.Print vKey, mdicValues.Item(vKey)
    Next vKey
End Sub

'********************************************************************
' Class private methods
'********************************************************************

Private Sub Class_Initialize()
    Set mdicValues = New Scripting.Dictionary
End Sub

The idea here is to allow us to access the text values in each row by name rather than position – this allows us the flexibility to change the order and/or number of columns returned to the extract range.

The class CRXRowDict is very similar, but stores RegExp objects rather than text. It also includes a Test method, which operates on CRowDict instances and returns True if all of the required text values match the corresponding regex patterns:

Option Explicit

' CRXRowDict - wraps dictionary, holds key-value pairings
' for each row in the list, where values are RegExp objects

'********************************************************************
' Class private attributes
'********************************************************************

' holds dictionary object to be populated
Private mdicValues As Scripting.Dictionary
Private Const msCLASS As String = "CRXRowDict"

'********************************************************************
' Class public properties
'********************************************************************
Public Property Get Item(ByVal sKey As String) _
                            As VBScript_RegExp_55.RegExp
    If mdicValues.Exists(sKey) Then
        Set Item = mdicValues.Item(sKey)
    Else
        Err.Raise glNOTFOUND, msCLASS, gsNOTFOUND & sKey
    End If
End Property

'********************************************************************
' Class public methods
'********************************************************************

Public Sub Populate(ByRef rngHeaders As Excel.Range, _
                ByRef rngValues As Excel.Range)
    ' Populates dictionary from the row rngValues
    
    Dim colNum As Long
    Dim objRegex As VBScript_RegExp_55.RegExp
    
    ' create new regexp object for each cell in the row
    For colNum = 1 To rngHeaders.Cells.Count
        Set objRegex = New VBScript_RegExp_55.RegExp
        objRegex.Pattern = rngValues(1, colNum).Text
        mdicValues.Add rngHeaders(1, colNum).Text, objRegex
    Next colNum

End Sub

Public Function Test(ByRef clsRow As CRowDict) As Boolean
    ' Executes the test method for each regexp object
    ' against the corresponding item in clsRow
    
    Dim objRegex As VBScript_RegExp_55.RegExp
    Dim iKey As Integer
    Dim sKey As String
    Dim sTest As String
    Dim bResult As Boolean
    
    ' Assume true to begin with
    bResult = True
    
    For iKey = LBound(mdicValues.Keys) To UBound(mdicValues.Keys)
        sKey = mdicValues.Keys(iKey)
        Set objRegex = mdicValues(sKey)
        sTest = clsRow.Item(sKey)
        Debug.Print objRegex.Pattern, sTest
        ' if any test fails, return false
        If Not objRegex.Test(sTest) Then
            bResult = False
            Exit For
        End If
    Next iKey
    
    Test = bResult
    
End Function

'********************************************************************
' Class private methods
'********************************************************************

Private Sub Class_Initialize()
    Set mdicValues = New Scripting.Dictionary
End Sub

In the final part of this series (which I hope to put out by next week) I’ll be putting this all together, showing how these classes are used to implement the filter function.

Why I’d like Python at work

One of the main reasons I like Python as a scripting language is its brevity – in particular I enjoy using things like list comprehensions, which enable you to escape the use of for loops when dealing with lists. Lists in Python are powerful and when used in combination with map and filter, allow you to do a lot more with a lot less code. This is one of the things I really miss when I write code in VBA/VBScript, where you’re often stuck dealing with either arrays or collections from the object model. Dive Into Python offers an interesting perspective on this.

Dick Kusleika posted a VBA sub to open the newest file in a folder over at Daily Dose of Excel today, which made use of the Scripting library to iterate over .csv files in a folder, find the newest and open it. Dick mentioned wanting an easier way to find the newest than looping over all of them, which got me thinking. If the Scripting.Folder object had a .NewestFile property, then that would obviously do it, but failing that, I could not see any other way than to loop through all files to examine the create dates.

In Python, however, without looping, the exercise becomes a two-line function:

import os, os.path, sys

def newest(ext, path=os.getcwd()):
  """
  pre: path a valid absolute filepath or None, ext a file 
  extension (e.g. '.py')
  returns: newest (last modified on *nix system, last created 
  on Windows) file from path
  """
  file_dict = dict([(fname, \
    os.path.getctime(os.path.join(path, fname))) \
    for fname in os.listdir(path) \
    if os.path.splitext(fname)[1] == ext])
  return max(file_dict, key=lambda x: file_dict[x])

I’m not sure at all whether there would be any performance benefit from doing it this way as opposed to looping over the files, but the real benefit of this is that it frees you from the worry of how to construct a for loop over the list of filenames returned from os.listdir(). All you need to worry about is which one of them you want.

Regex Functions in Excel, part 3

Part 3 of 4

I’ve left this project alone for quite a while, until finally this weekend I had a little time to put into working on it. In part 2 I was still a little undecided as to how I was going to implement some of the worksheet functions, but now I am more or less satisfied with them. I’ve bundled the functions into Regex.xla, which is now available for download. The project is open for viewing, so code may be inspected or modified as you see fit – please let me know if you discover any errors or anything that could have been done better/smarter.

This is the first general purpose add-in I’ve created (I have written several application-specific add-ins), so I’d be interested to know if anyone has any feedback on this one. Some things I have not included which I’d like to:

  • Help files – I have started writing some html to provide users with an explanation of the functions and examples of their use, but I’m finding this very time-intensive. Next release I’ll include them.
  • A Regex Find/Replace dialog – still a work in progress, I’ve decided to leave this until I make a decision about whether to release the add-in as a compiled .dll file or just leave it as a .xla. At this point I have yet to test whether or not there would be any performance benefit in compiling the add-in in VB6 (or VB.NET?), which would be the key benefit I see in doing so.
  • A basic regular expression syntax included – while there are plenty of good sites to look at for regex syntax, it’d be great to include a quick overview of how they work and maybe a short list of FAQ-type answers to the usual ‘how do I do x?’ questions. Next release.

So, what’s included so far? Rather than post the entire source code for all five functions, I thought I’d put in all the header comments here, as they provide a reasonable explanation of their purpose (Roy MacLean wrote an excellent post on comments in code which motivated me putting these in – ordinarily until now my commenting in the header has been pretty sparse). So here they are:

Public Function RXFIND(ByRef find_pattern As Variant, _
                        ByRef within_text As Variant, _
                        Optional ByVal start_num As Long, _
                        Optional ByVal case_sensitive As Boolean) As Long
' RXFIND - Returns the starting position of text matching the regex pattern
' find_pattern in the string within_text, if a match is found.
' Optional long start_num specifies the number of the character in within_text
' to start matching. Default=0.
' Optional boolean case_sensitive makes the pattern case sensitive if true,
' insensitive otherwise. Default=true.

Public Function ISRXMATCH(ByRef find_pattern As Variant, _
                        ByRef within_text As Variant, _
                        Optional ByVal case_sensitive As Boolean) As Boolean
' ISRXMATCH - Returns true if the regex pattern find_pattern is matched
' in the string within_text, false otherwise.
' Optional boolean case_sensitive makes the pattern case sensitive if true,
' insensitive otherwise. Default=true.

Public Function RXGET(ByRef find_pattern As Variant, _
                        ByRef within_text As Variant, _
                        Optional ByVal submatch As Long, _
                        Optional ByVal start_num As Long, _
                        Optional ByVal case_sensitive As Boolean) As String
' RXGET - Looks for a match for regular expression pattern find_pattern
' in the string within_text and returns it if found, error otherwise.
' Optional long submatch may be used to return the corresponding submatch
' if specified - otherwise the entire match is returned.
' Optional long start_num specifies the number of the character to start
' searching for in within_text. Default=0.
' Optional boolean case_sensitive makes the regex pattern case sensitive
' if true, insensitive otherwise. Default=true.

Public Function RXMATCH(ByVal find_pattern As Variant, _
                    ByVal within_range As Variant, _
                    Optional ByVal case_sensitive As Boolean) As Long
' RXMATCH - Searches for the first cell matching regular expression pattern
' find_pattern in the range argument within_range.
' Optional boolean case_sensitive makes the regular expression case
' sensitive if true, insensitive otherwise. Default=true.
' within_range should be a single row or column range. If the range has
' multiple rows and columns, only the first row is searched.

Public Function RXSUB(ByRef within_text As Variant, _
                    ByRef old_text As Variant, _
                    ByRef new_text As Variant, _
                    Optional ByVal start_num As Long, _
                    Optional ByVal case_sensitive As Boolean, _
                    Optional ByVal is_global As Boolean) As String
' RXSUB - Function to substitute text matching regex pattern old_text with
' string argument new_text in the string within_text.
' Optional long start_num specifies which character to start replacement from,
' default=0.
' Optional boolean case_sensitive will make the pattern case sensitive if true
' and insensitive otherwise. Default=true.
' Optional boolean is_global will substitute new_text for *all* occurrences
' matching old_text if true, only the first otherwise. Default=false.

In my final post on this project, I’ll include some examples of these functions in use, and hopefully have an updated add-in with help files and a find/replace dialog. Maybe also a discussion of why I did or didn’t go with a compiled .dll. Maybe also a regex syntax. Maybe I’ll need to make this series into 5 posts…

[EDIT: I’ve changed the download link to a fileden url, this can be accessed directly now.]

Regex Functions in Excel

Part 2 of 4

Continuing with my exploration of regular expressions in Excel, from Part 1 – so far I’ve got to the stage of writing a couple of functions, and planning a few more. As promised, I’ll detail here what the add-in I’m constructing should deliver to the user.

First, a suite of worksheet functions:

  • RXFIND(find_pattern, within_text, [start_num], [case_sensitive])
    – This is probably the easiest one to get right – a simple emulation of the functionality of the Excel built-in function FIND. As with FIND, the inputs are the pattern to test for, followed by the text to look within. Optionally, the start position can be specified (Default=0) as can case-sensitivity (Default=TRUE). Although Excel implements a separate SEARCH function which is case-insensitive, I’d rather save myself repeating the same code and added a parameter in to allow the user to specify whether case matters or not. Parenthetically, my belief, well-founded or not, is that case always matters.
    Here’s the code I’ve come up with so far: 

    Public Function RXFIND(ByRef find_pattern As Variant, _
                        ByRef within_text As Variant, _
                        Optional ByVal start_num As Variant, _
                        Optional ByVal case_sensitive As Boolean) As Long     
    
    Dim objRegex As VBScript_RegExp_55.RegExp
    Dim colMatch As VBScript_RegExp_55.MatchCollection
    Dim vbsMatch As VBScript_RegExp_55.Match
    Dim sMatchString As String
    
    Set objRegex = New VBScript_RegExp_55.RegExp
    
    ' Initialise Regex object
    With objRegex
        .Global = False
        ' Default is case sensitive
        If IsMissing(case_sensitive) Then
            .IgnoreCase = False
        Else: .IgnoreCase = Not case_sensitive
        End If
        .Pattern = find_pattern
    End With
    
    If IsMissing(start_num) Then start_num = 0
    sMatchString = Right$(within_text, Len(within_text) - start_num)
    
    ' Create Match collection
    Set colMatch = objRegex.Execute(sMatchString)
    If colMatch.Count = 0 Then ' No match
        RXFIND = 0
    Else
        Set vbsMatch = colMatch(0)
        RXFIND = vbsMatch.FirstIndex + start_num + 1
    End If
    End Function
  • RXSUB(text,  old_text,  new_text, [instance_num], [case_sensitive], [is_global])
    – This one is similar to the Excel built-in SUBSTITUTE, although as with RXFIND, it allows a couple of extra parameters. It’s also a little trickier to code, as the optional is_global parameter (Default=FALSE) set to TRUE renders instance_num (Default=0) meaningless. I still haven’t got this in a fit state to post the code yet, but it differs only a little from RXFIND, in that instead of using the FirstIndex property and Execute method of the RegExp object, it only needs to access the Replace method, and of course it returns a string rather than a long. Work in progress.
  • RXGET(find_pattern, within_text, [start_num], [case_sensitive])
    I still haven’t completely figured out what to have this one do yet, but it’s going to rely on the Value property of the Match object, and will possibly allow for returning Submatched groups. Another work in progress.
  • RXMATCH(find_pattern, within_array, [case_sensitive])
    Returns the position of the first exact match with the find_pattern from within_array. Yet to begin.

As well as worksheet functions, I’ve been tossing around the idea of a Find/Replace dialog with regex bells and whistles. I have got no further than mimicking the Excel built-in dialog so far (i.e. no functionality as yet), but I am hopeful I’ll have something that actually does something soon. For reasons I’ll explain a little further in the next post, I am undecided as to whether I should stick with the standard user form. Anyway, that’ll do for now…next time I’ll talk a little about some of the key decisions that need to be made – late vs early binding, .NET vs the VBScript library, VB6 add-in vs .xla.

Stand Back – I know Regular Expressions

Part 1 of 4

Yes, it’s true, I do – and that post has to be one of my favourite from the xkcd strip. I’ve used them many times over the last couple of years, in projects ranging from Python scripts written on weekends to critical SAS scripts I’ve written for work. So what are they?

Well, I’m not going to get into all of the gory details here, I’ll leave that to Wikipedia. But for the purpose of a quick and simple explanation, regular expressions are a very powerful tool used to match text. You know how, when you’re using the autofilter in Excel, you can match an arbitrary string of characters by using the wildcard *, or a single arbitrary character using ? So for instance, you could match the words “retail” and “detail” with the expressions ?etail or *ail.

Well, regular expressions offer the ability to do that sort of matching – but they offer far more besides. The core idea is that a regular expression pattern offers a compact and unambiguous way to denote a set of character strings. For example, the pattern \b\w+(e|d)\b denotes the set of all words which end with either e or d. And the pattern \d{3,5}\s+(Jim|Bob) matches all strings of digits with length at least 3 and at most 5, followed by whitespace, followed by either “Jim” or “Bob”.

As I said, I’m not going to get into all the details right here, so I’ll stop short of a proper exploration of regular expression syntax, which would be available as part of most introductory tutorials for Perl, Ruby or Python, to name but a few. However, I do want to outline how all this relates to my current project. Actually, in a sense, it is my current project.

You see, as soon as I discovered regular expressions, I started wondering why such a powerful tool was not implemented in as versatile an application as Excel, with all of its bells and whistles. While we do have things like the Find dialog, the MATCH() function and the Auto and Advanced filters, the ability to harness the power and simplicity of regular expression pattern-matching would really come in handy on occasion. Well, it turns out that the ability is there – it’s just slightly hidden. Microsoft included a regular expression engine in VBScript back in 1999, the latest version of which can be used in a VBA project by adding a reference to it, like so:

 

Add a reference to the Microsoft VBScript Regular Expression Library

Add a reference to the Microsoft VBScript Regular Expression Library

Now, while all the power of this engine is available to the VBA developer, my thinking is that it would be helpful to make this available where the rubber meets the road for the majority of Excel users – on the worksheet. My idea is to create a suite of worksheet functions and forms which can be used as an alternative to built-in functions, allowing users to harness the power of regular expressions. Next time, I’ll detail the functions I intend to include, with a brief summary of their implementation.