A couple of PowerShell scripts

I’ve been making a little more use of PowerShell recently. It’s a very handy addition to the Windows environment: it’s much easier to use than the traditional command line, and much easier to find your way around. I figured I’d share a couple of the scripts I’ve used for some specific file-management tasks, in case someone finds them handy.

Recently we’ve been upgrading our SAS environment, so we’ve had to move a great deal of our data onto new servers. One of the concerns we had with migrating data was that some of our users had been storing disconnected SAS datasets in one of the directories to be migrated. Not really a great practice, and we didn’t want to waste bandwidth moving stale data across unless there was a real need to. So we needed a list of the datasets – just the really big ones – that had been saved there, so that the users responsible could justify the need for them in the new environment or (preferably) bid them a fond farewell:

# the directory we are migrating
$badpath = "\\no\big\datasets\here\please"
# output to this file
$outfile = "\\list\of\big\datasets.csv"

<# 
  Extract a list of SAS datasets over 100MB in size, and
  write out the name, last modified date, size and location
  to the output file.
#>
Get-ChildItem $badpath -Recurse -Filter *.sas7bdat |
 ? {$_.Length -gt 100MB} |
 Select-Object Name, LastWriteTime, 
 @{Name="SizeInMB";Expression={$_.Length/1MB}},
 @{Name="Path";Expression={$_.directory}} |
 Export-Csv -Path $outfile

The Get-ChildItem cmdlet looks at all items in the folder $badpath, and the Recurse parameter tells it to look in all subfolders as well. The Filter parameter limits the results of this search to SAS datasets. That gets piped to the Where-Object cmdlet (here shortened to ?), which allows only the items passing the size test in the script block through to the next pipe. Finally, Select-Object takes a selection of properties from the objects coming through Where-Object, and the Export-Csv cmdlet sends them to an output file. Note that as we know we are dealing with dataset files, the objects coming through will have type FileInfo. If we had not specified SAS datasets in the filter, we would have had to include another cmdlet to exclude non-file items like directories and archives.
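For instance, without the -Filter parameter, a rough sketch of the same size check (not part of the migration script) could use the PSIsContainer property to drop directories from the pipeline before testing the length:

# sketch only: keep files, drop folders, then apply the size test
Get-ChildItem $badpath -Recurse |
 ? {-not $_.PSIsContainer -and $_.Length -gt 100MB}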

Another task involved searching through our SAS source code for something we knew was going to cause us problems in the new environment. Without going into too much detail as to the why of this, we needed to find any location where the developer had tested for whether a variable was equal to -1. As it turned out, due to the way SAS Data Integration Studio works, -1 gets added into generated code in a lot of places, so we needed a regular expression pattern to limit the cases returned. We didn’t care about the instances where special ETL macro variables were set to -1, or were compared with -1. So rather than looking for just the literal “-1” string in code, we needed a negative lookbehind regex to exclude these cases:

# Source code directory
$jobdir = "\\SAS\ETL\Code\Goes\Here"
# Output file
$outfile = "\\List\Of\Problem\Lines.csv"
# Occurrence of -1 that we care about
$pattern = "(?<!--((etls_\w+\s+=\s)|(etls_recnt[,=])))-1"

<#
  Extract lines from SAS ETL source code matching the pattern
  specified and write out the line to the output file
#>
Select-String -Pattern "$pattern" -Path $jobdir\*.sas |
 Select-Object Filename, LineNumber, Line |
  Export-Csv $outfile

Select-String is pretty much the PowerShell equivalent of grep. It returns a list of MatchInfo objects which, like the FileInfo objects in the previous example, get passed to Select-Object to pull out the properties we care about, and then to Export-Csv to write them to the output file.
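If you want to see what other properties those MatchInfo objects carry, piping a quick run through Get-Member will list them:

# list the members of the MatchInfo objects returned by Select-String
Select-String -Pattern "$pattern" -Path $jobdir\*.sas | Get-Member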

PowerShell documentation is available at https://technet.microsoft.com/en-us/library/bb978526.aspx but you’ll find the built-in help (F1, or the Get-Help cmdlet) just as easy to navigate. Happy scripting!


Regular Expressions – new version

Just a quick post tonight to let you all know that I’ve added a new page for downloads, containing a new version of the regular expressions add-in, compatible with Excel 2007 and later. I’ve added a couple of utility functions for convenience (mine more than yours, but you might find them useful), and a form to evaluate regular expressions against test strings. And there’s even documentation!

The documentation gives a (very) brief intro to the uses and abuses of regular expressions, a run-down of the worksheet functions in the add-in, and some examples of their use. Here are a couple of those; I hope you find them useful.

Matching cells which contain variations on a word

There are some words in the English language which Americans, god bless them, spell in their own special way. However, given input on the spreadsheet from users who spell both ways (correctly and incorrectly), you may wish to match both variations of words like ‘realise’ (‘realize’) and ‘colour’ (‘color’).
The pattern to match realise/realize is simple: \breali(s|z)e\b
The word boundary markers ensure we are looking at a complete word, and the alternation of (s|z) means that we match both versions.
Applying the ISRXMATCH formula demonstrates this is successful:
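With the word to be tested in cell A2, the formula looks something like this (I’m assuming here the pattern-first argument order used by the add-in’s other functions):

=ISRXMATCH("\breali(s|z)e\b", A2)

This should return TRUE for both “realise” and “realize”.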

Validating Email Addresses

Given a list of email addresses in a column on a spreadsheet, we wish to ensure that these stick to a form which at least obeys some of the rules governing the format of email addresses. As these are going to be used by a script to send emails, we wish to minimise the number of undeliverable responses due to invalid addresses. The basic rules we specify for these addresses are as follows:
  1. The username part of the address contains one or more alphanumeric characters, and possibly some additional special characters.
  2. This is followed by a single @ sign, followed by the domain name, which consists of one or more alphanumeric and special characters.
  3. The domain name ends with a dot followed by the top-level domain, which must contain only alphanumeric characters, and there must be between 2 and 6 of these.

The address should be the entire content of the cell, so the beginning and ending anchors are used at the start and end of the pattern. Case is unimportant, so the case_sensitive flag is set to false.
The pattern is as follows: ^[a-z0-9_%.+-]+@[a-z0-9-.]+\.[a-z]{2,6}$
This is then used in ISRXMATCH – a valid email address according to our rules above will return true:
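With an address in cell B2, and assuming the case_sensitive flag is passed as the third argument, that’s something like:

=ISRXMATCH("^[a-z0-9_%.+-]+@[a-z0-9-.]+\.[a-z]{2,6}$", B2, FALSE)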

The second address in the list fails due to the whitespace in the username, whereas the fourth fails because the domain name does not end with a top-level domain part – a dot followed by 2-6 letters.
I borrowed this regex from http://www.regular-expressions.info/email.html. As well as giving a couple of alternative regexes to cover edge cases which the above doesn’t catch, that page discusses why email addresses can be tricky, and why you shouldn’t go overboard trying to cover every exception.

Some new regular expression functions

There are a couple of SAS functions I have found to be quite useful, and that I miss when I’m working in Excel. Probably the more useful of the two is the SCAN function. You give SCAN a text string, an integer n and optionally some delimiters, and SCAN will give you back the nth ‘word’ in the string, where the ‘words’ are delimited either by the characters you supplied or by a default set of delimiter characters. For instance,

data _null_;
word5 = SCAN("The/number(of+potential.interactions|among^a)large*group<of-people,is$proportional.to!the square/of!their&number.", 5);
put word5=;
run;

will print

word5=interactions

to the log. If n is negative, it will count backwards (from right to left). So

data _null_;
word7 = SCAN("The/number(of+potential.interactions|among^a)large*group<of-people,is$proportional.to!the square/of!their&number.",-7);
put word7=;
run;

will print

word7=proportional

There’s also the COUNTW function, which takes a text string and (optionally) some delimiters as arguments, and returns the number of words in the string.

data _null_;
count = COUNTW("Consider, for example, the question of tax policy.");
put count=;
run;

produces

count=8

Once again regular expressions turn out to be ideally suited to translating this functionality to Excel worksheet functions. I’ve included them in Regex.xla as RXSCAN and RXCOUNTW:

Public Function RXSCAN(ByVal stringval As String, ByVal n As Integer, _
                        Optional ByVal dlm As String = "") As Variant
' RXSCAN - Returns the nth word from stringval, where
' words are delimited by any non-word character by default,
' or a specific character set by optional string dlm.

Dim objRegex As VBScript_RegExp_55.RegExp
Dim colMatch As VBScript_RegExp_55.MatchCollection
Dim charlist As String
Dim char As String
Dim i As Integer

' Build pattern
If dlm = "" Then ' Just looking for A-Za-z0-9_
  charlist = "\w+"
Else ' Negate character class specified
  charlist = "[^"
  For i = 1 To Len(dlm)
    char = Mid$(dlm, i, 1)
    If InStr("[\^-", char) Then char = "\" & char ' escape
    charlist = charlist & char
  Next i
  charlist = charlist & "]+"
End If

' Initialise Regexp object, get match collection
Set objRegex = New VBScript_RegExp_55.RegExp
With objRegex
  .Pattern = charlist
  .Global = True
  Set colMatch = .Execute(stringval)
End With

If colMatch.Count = 0 Then ' No words detected
    RXSCAN = CVErr(xlErrNA)
Else
  If Abs(n) > colMatch.Count Or n = 0 Then
    RXSCAN = CVErr(xlErrNum)
  Else
    If n > 0 Then ' Read from left to right
      RXSCAN = colMatch(n - 1)
    Else ' Right to left
      RXSCAN = colMatch(colMatch.Count + n)
    End If
  End If
End If

End Function

Public Function RXCOUNTW(ByVal stringval As String, _
                        Optional ByVal dlm As String = "") As Variant
' RXCOUNTW - Returns the count of words from stringval, where
' words are delimited by any non-word character by default,
' or a specific character set by optional string dlm.
Dim objRegex As VBScript_RegExp_55.RegExp
Dim colMatch As VBScript_RegExp_55.MatchCollection
Dim charlist As String
Dim char As String
Dim i As Integer

' Build pattern
If dlm = "" Then ' Just looking for A-Za-z0-9_
  charlist = "\w+"
Else ' Negate character class specified
  charlist = "[^"
  For i = 1 To Len(dlm)
    char = Mid$(dlm, i, 1)
    If InStr("[\^-", char) Then char = "\" & char ' escape
    charlist = charlist & char
  Next i
  charlist = charlist & "]+"
End If

' Initialise Regexp object, get match collection
Set objRegex = New VBScript_RegExp_55.RegExp
With objRegex
  .Pattern = charlist
  .Global = True
  Set colMatch = .Execute(stringval)
End With

If colMatch.Count = 0 Then ' No words detected
  RXCOUNTW = CVErr(xlErrNA)
Else
  RXCOUNTW = colMatch.Count
End If

End Function
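Entered as worksheet formulas with the default delimiters, these should reproduce the SAS results above:

=RXSCAN("The/number(of+potential.interactions|among^a)large*group<of-people,is$proportional.to!the square/of!their&number.", 5)
=RXCOUNTW("Consider, for example, the question of tax policy.")

returning “interactions” and 8 respectively.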

I’ve also moved the add-in file to a dropbox folder. You can now find Regex.xla at http://dl.dropbox.com/u/29927305/Regex.xla

Building an Excel currency converter, part 2

To recap where we left off in the first part, we’d managed to import XML data from the RSS feed for NZD rates at themoneyconverter.com into a table (which I’ve named Rates_Table). This gave us a list of rates which we can use – however, the format is not quite right. In the “description” member of “item”, the rate is expressed in a structured text format, like so: “1 New Zealand Dollar = 2.96886 Argentine Peso”. Now of course that would be useful as it is, if I were prepared to manually type out the 2.96886 part myself to convert from NZ dollars to pesos. Which of course I’m not, lazy data analyst that I am.

What I’m going to do instead is extract the two bits of that text I need to know: the rate (the numeric part, in this case 2.96886) and the foreign currency description (the text that comes after the number, in this case “Argentine Peso”). This can be done with Excel text functions like MID and FIND, of course, and that is a perfectly legitimate way to go about it. However, unless you can find a really clever way to do it, you’ll end up with a really long, convoluted and awkward formula. Which is a great reason to use regular expression functions instead. RXGET, for instance, is designed to do exactly this sort of extraction. Here’s how I went about it:

First, we need to determine the regular expression pattern that fits the text. The pattern I chose is
=\s(\d*\.\d+)\s(.+)
or to break it down a little,

  1. an equals sign followed by a space: =\s;
  2. followed by a group, which in turn contains zero or more digits, followed by a decimal point, followed by one or more digits: (\d*\.\d+). This will be used to capture the rate part of the expression. Note that the decimal point needs to be escaped here, as an unescaped dot matches an arbitrary character in a regular expression;
  3. followed by another space: \s;
  4. followed by a second group, which contains one or more arbitrary characters: (.+). This will capture everything after the space, which in this case will be the currency description. Here the dot should not be escaped, as we need it to match any character.

Once I know that this pattern works, I add a new sheet titled ‘Reference’ and enter the pattern into a cell that I name Rate_Pattern. Now I can use it to extract the rate and the currency using RXGET. I add a column with the header rate at the right-hand end of the Rates_Table and enter in it the RXGET formula, with the following arguments:

This corresponds to the formula:
=RXGET(Rate_Pattern,Rates_Table[[#This Row],[description]],1).
Note that the Submatch argument has the value 1, as we’re extracting the 1st matched group from the expression. To get the currency description, in another new column with the header currency I enter the same formula with the Submatch argument changed to 2 to extract the second group. So here’s how the table looks right now (click on the image to take a closer look):

In the final part of this series I’ll complete the converter by adding in a selector to pick out a rate from the table by specifying the currency.

Regex Functions updated

I’m taking a brief pause on the way to completing the currency converter to revise the functions in the Excel regular expressions add-in, Regex.xla. I’ve been rather slack in leaving it this long – it was pointed out to me several months ago that the way I passed error values back to the calling cell did not work as intended if the return type was not a Variant. For instance, in RXGET the intent of

RXGET = CVErr(xlErrNA)

is to have Excel display #N/A in the cell. Because the function returned a String rather than a Variant, however, Excel displayed #VALUE! instead. So I’ve altered the return types to Variant in all the functions, to allow more meaningful error values to be returned.

It was also pointed out more recently that using IsMissing to test for the presence of an optional parameter and provide a default, like so:

If IsMissing(start_num) Then start_num = 0

will not work, as IsMissing works only on Variants, not on primitive data types. Actually, it’s a better idea to specify default parameter values anyway – then there’s no need for this logic. So that’s what I’ve done.
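The declarations now look something like this (a sketch only – the parameter names and exact list here are illustrative rather than copied from the add-in):

Public Function RXGET(ByVal pattern As String, ByVal text As String, _
                      Optional ByVal submatch As Integer = 0, _
                      Optional ByVal start_num As Long = 0) As Variant
    ' Variant return type: error values like CVErr(xlErrNA) now display correctly
    ' Optional parameters with defaults: no IsMissing checks required
    ' ... body as before ...
End Function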

Thanks to Colin and mifrey for pointing out the above issues. I’ve replaced the add-in on fileden, and if you like, the code changes can be inspected at the Regex functions page.

More Regex Functions, part 2

Part 2 of 3

After building the form in Part 1, the next stage of coding the regex filter is to implement the logic that decides which rows of the list pass through the filter. Each row in the criteria range (rngCriteria) can be thought of as a test made up of a set of criteria. In order for a row in the list range (rngSource) to pass the test, all of these criteria must be met. If the list row passes the test for any row in rngCriteria, then it will be included in the eventual output.

Given a criteria range with m rows (excluding the header row) and n columns, this can be modelled in pseudo-SQL:

SELECT * FROM rngSource AS a
WHERE 
	(Crit[1,1](a) AND ... AND Crit[1,n](a))
	OR ...
	OR (Crit[m,1](a) AND ... AND Crit[m,n](a))

where Crit[i,j](a) is a boolean function representing the criterion in row i, column j.

For the built-in advanced filter, Excel constructs these criteria by parsing the text in the criteria range – a wide range of tests is available, including relational operators, wildcard matches and exact text matches. For an introduction to using the advanced filter, see the tutorial at Contextures.

For our purposes, though, we’ll be allowing only regular expression pattern matches on the text in the list rows. While in theory it would be possible to mimic all the built-in advanced filter functionality and add regular expression tests into the mix, I’m not attempting to extend the advanced filter at this point – instead I’m creating an alternative to it.

In order to test the rows, I decided to implement a couple of classes to contain them – one for the rows in the list and extract ranges, and one which works slightly differently for the criteria range. In both classes, the core idea is to use the Scripting.Dictionary object to store key-value pairs, where the keys are provided by the header rows and the values by the text in each cell.

The first of these, CRowDict, is what I intend to use for the rows in the list and extract ranges:

Option Explicit

' CRowDict - wraps dictionary, holds key-value pairings
' for each row in the list

'********************************************************************
' Class private attributes
'********************************************************************

' holds dictionary object to be populated
Private mdicValues As Scripting.Dictionary
Private Const msCLASS As String = "CRowDict"

'********************************************************************
' Class public properties
'********************************************************************
Public Property Get Item(ByVal sKey As String) As Variant
    If mdicValues.Exists(sKey) Then
        Item = mdicValues.Item(sKey)
    Else
        Err.Raise glNOTFOUND, msCLASS, gsNOTFOUND & sKey
    End If
End Property

'********************************************************************
' Class public methods
'********************************************************************

Public Sub Populate(ByRef rngHeaders As Excel.Range, _
                ByRef rngValues As Excel.Range)
    ' Populates dictionary from the row rngValues
    
    Dim colNum As Long
    
    For colNum = 1 To rngHeaders.Cells.Count
        mdicValues.Add rngHeaders(1, colNum).Text, _
                        rngValues(1, colNum).Text
    Next colNum

End Sub

Public Sub Display()
    ' Convenience method for testing
    Dim vKey As Variant
    For Each vKey In mdicValues.Keys
        Debug.Print vKey, mdicValues.Item(vKey)
    Next vKey
End Sub

'********************************************************************
' Class private methods
'********************************************************************

Private Sub Class_Initialize()
    Set mdicValues = New Scripting.Dictionary
End Sub

The idea here is to allow us to access the text values in each row by name rather than position – this allows us the flexibility to change the order and/or number of columns returned to the extract range.
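As a rough sketch of the intended use (the real calls come in the final part, and the column header here is made up):

Dim clsRow As CRowDict
Set clsRow = New CRowDict
' key the first data row by the header row above it
clsRow.Populate rngSource.Rows(1), rngSource.Rows(2)
Debug.Print clsRow.Item("Surname")    ' hypothetical column header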

The class CRXRowDict is very similar, but stores RegExp objects rather than text. It also includes a Test method, which operates on CRowDict instances and returns True if all of the required text values match the corresponding regex patterns:

Option Explicit

' CRXRowDict - wraps dictionary, holds key-value pairings
' for each row in the list, where values are RegExp objects

'********************************************************************
' Class private attributes
'********************************************************************

' holds dictionary object to be populated
Private mdicValues As Scripting.Dictionary
Private Const msCLASS As String = "CRXRowDict"

'********************************************************************
' Class public properties
'********************************************************************
Public Property Get Item(ByVal sKey As String) _
                            As VBScript_RegExp_55.RegExp
    If mdicValues.Exists(sKey) Then
        Set Item = mdicValues.Item(sKey)
    Else
        Err.Raise glNOTFOUND, msCLASS, gsNOTFOUND & sKey
    End If
End Property

'********************************************************************
' Class public methods
'********************************************************************

Public Sub Populate(ByRef rngHeaders As Excel.Range, _
                ByRef rngValues As Excel.Range)
    ' Populates dictionary from the row rngValues
    
    Dim colNum As Long
    Dim objRegex As VBScript_RegExp_55.RegExp
    
    ' create new regexp object for each cell in the row
    For colNum = 1 To rngHeaders.Cells.Count
        Set objRegex = New VBScript_RegExp_55.RegExp
        objRegex.Pattern = rngValues(1, colNum).Text
        mdicValues.Add rngHeaders(1, colNum).Text, objRegex
    Next colNum

End Sub

Public Function Test(ByRef clsRow As CRowDict) As Boolean
    ' Executes the test method for each regexp object
    ' against the corresponding item in clsRow
    
    Dim objRegex As VBScript_RegExp_55.RegExp
    Dim iKey As Integer
    Dim sKey As String
    Dim sTest As String
    Dim bResult As Boolean
    
    ' Assume true to begin with
    bResult = True
    
    For iKey = LBound(mdicValues.Keys) To UBound(mdicValues.Keys)
        sKey = mdicValues.Keys(iKey)
        Set objRegex = mdicValues(sKey)
        sTest = clsRow.Item(sKey)
        Debug.Print objRegex.Pattern, sTest
        ' if any test fails, return false
        If Not objRegex.Test(sTest) Then
            bResult = False
            Exit For
        End If
    Next iKey
    
    Test = bResult
    
End Function

'********************************************************************
' Class private methods
'********************************************************************

Private Sub Class_Initialize()
    Set mdicValues = New Scripting.Dictionary
End Sub

In the final part of this series (which I hope to put out by next week) I’ll be putting this all together, showing how these classes are used to implement the filter function.

More Regex Functions

And finally, I’m back! I overcame the inertia yesterday and got back to work on the regular expressions add-in. The next goal is to include regular expression equivalents for Excel’s built-in advanced filter and find/replace functions. So far I’ve concentrated on just the filter part, which I hope to have finished fairly soon. However, as there’s a fair amount of code involved, I’ll split this into 2 or 3 posts.

Firstly, the advanced filter dialog. Not being a big fan of re-inventing the wheel, I decided to build pretty much exactly the same form as the one Excel presents you with for an ordinary advanced filter. Here’s what my version looks like:

 

All that this form needs to do is collect 3 key pieces of information to control the filter process:

  1. What ranges will the filter operate on?
  2. Will the source range (List Range) be filtered in place, or will the filtered results be copied to another place (Copy to)?
  3. Does the filtered list need to display unique records only?

Here’s the code behind the form. The public properties List, Criteria and Extract return the addresses for the ranges in (1) from the RefEdit controls, and the booleans CopyTo and UniqueOnly return the values for (2) and (3) respectively.

Option Explicit

Private mbOK As Boolean

'**********************************************************
' Form class public properties
'**********************************************************

' User clicked OK
Public Property Get OK() As Boolean
    OK = mbOK
End Property
'**********************************************************
' Ranges selected for filter
'**********************************************************
Public Property Get List() As String
' List range to be filtered
    List = refList.Value
End Property

Public Property Get Criteria() As String
' Range containing criteria for filter
    Criteria = refCriteria.Value
End Property

Public Property Get Extract() As String
' Range to copy filtered rows to
    Extract = refExtract.Value
End Property

'**********************************************************
' Properties modifying filter operation
'**********************************************************

' User selected Copy to range option button
Public Property Get CopyTo() As Boolean
    CopyTo = optCopy.Value
End Property

' User selected Unique values only checkbox
Public Property Get UniqueOnly() As Boolean
    UniqueOnly = chkUnique.Value
End Property

'**********************************************************
' Form controls
'**********************************************************
Private Sub cmdCancel_Click()
    mbOK = False
    Me.Hide
End Sub

Private Sub cmdOk_Click()
    mbOK = True
    Me.Hide
End Sub

Private Sub optCopy_Click()
    
    ' enable refExtract
    lblExtract.ForeColor = vbButtonText
    With refExtract
        .Enabled = True
        .BackColor = vbWindowBackground
    End With
    
End Sub

Private Sub optInPlace_Click()

    ' grey out and disable refExtract
    lblExtract.ForeColor = vbGrayText
    With refExtract
        .Enabled = False
        .BackColor = vbInactiveCaptionText
    End With
    
End Sub

Private Sub UserForm_Initialize()

    ' Set default values for List range and optInPlace
    refList.Value = Application.ActiveCell.CurrentRegion.Address
    optInPlace.Value = True

End Sub

Private Sub UserForm_QueryClose(Cancel As Integer, CloseMode As Integer)

    If CloseMode = vbFormControlMenu Then
        cmdCancel_Click
        Cancel = True
    End If

End Sub

And here’s the sub GetRanges, which ‘reads’ the form:

Private Sub GetRanges(ByRef rngSource As Excel.Range, _
                ByRef rngCriteria As Excel.Range, _
                ByRef rngExtract As Excel.Range, _
                ByRef bUnique As Boolean)
                
    Dim frmFilter As FRXFilter

    Set frmFilter = New FRXFilter
    frmFilter.Show
    
    If frmFilter.OK Then
        With frmFilter
            If .List <> "" Then _
                Set rngSource = Range(.List)
            If .Criteria <> "" Then _
                Set rngCriteria = Range(.Criteria)
            If .CopyTo Then _
                Set rngExtract = Range(.Extract)
            bUnique = .UniqueOnly
        End With
    End If
                
End Sub

I also want to include some error-handling to cope with bad range addresses, but apart from that it’s pretty much there. Next time I’ll post the sub at the next level up, calling GetRanges and a couple of other routines to perform the filtering.
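In the meantime, here’s a rough sketch of the kind of range-address validation I have in mind inside GetRanges (illustrative only – not yet in the add-in):

Dim rngTest As Excel.Range
' attempt the conversion, trapping the error raised by a bad address
On Error Resume Next
Set rngTest = Range(frmFilter.List)
On Error GoTo 0
If rngTest Is Nothing Then
    MsgBox "'" & frmFilter.List & "' is not a valid List range address."
End If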