A couple of PowerShell scripts

I’ve been making a little more use of PowerShell recently. It’s a very handy addition to the Windows environment: a lot easier to use than the traditional command line, and much easier to find your way around. I figured I’d share a couple of the scripts I’ve used for some specific file-management tasks, in case someone finds them handy.

Recently we’ve been upgrading our SAS environment, which meant moving a great deal of our data onto new servers. One of the concerns we had with migrating data was that some of our users had been storing disconnected SAS datasets in one of the directories to be migrated. Not really a great practice, and we didn’t want to waste bandwidth shipping stale data across unless there was a real need to. So we needed a list of the datasets – just the really big ones – that had been saved, so that the users responsible could justify the need for them in the new environment, or (preferably) bid them a fond farewell:

# the directory we are migrating
$badpath = "\\no\big\datasets\here\please"
# output to this file
$outfile = "\\list\of\big\datasets.csv"

<# 
  Extract a list of SAS datasets over 100MB in size, and
  write out the name, last modified date, size and location
  to the output file.
#>
Get-ChildItem $badpath -Recurse -Filter *.sas7bdat |
 ? {$_.Length -gt 100MB} |
 Select-Object Name, LastWriteTime,
 @{Name="SizeInMB";Expression={$_.Length/1MB}},
 @{Name="Path";Expression={$_.Directory}} |
 Export-Csv -Path $outfile -NoTypeInformation

The Get-ChildItem cmdlet looks at all items in the folder $badpath, and the Recurse parameter tells it to look in all subfolders as well. The Filter parameter limits the results of the search to SAS datasets. That gets piped to the Where-Object cmdlet (here shortened to its ? alias), which lets only the files passing the size test in the script block through to the next pipe. Finally, Select-Object takes a selection of properties of the objects coming through, and the Export-Csv cmdlet sends them to the output file. Note that because we know we are dealing with dataset files, the objects coming through will have type FileInfo. If we had not specified SAS datasets in the filter, we would have needed another cmdlet to exclude non-file items like directories and archives.
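If you want to prototype this kind of scan outside PowerShell, the same filtering logic can be sketched in Python. This is an illustrative equivalent only: the function name and returned field names are my own, and the size threshold is a parameter rather than the 100MB literal.

```python
from pathlib import Path

def find_big_datasets(root, min_bytes=100 * 1024 * 1024, suffix=".sas7bdat"):
    """Recursively list files under root that are larger than min_bytes,
    returning the same fields the PowerShell pipeline exports."""
    results = []
    # rglob plays the role of -Recurse plus -Filter
    for p in Path(root).rglob("*" + suffix):
        # the size test mirrors the Where-Object script block
        if p.is_file() and p.stat().st_size > min_bytes:
            results.append({
                "Name": p.name,
                "LastWriteTime": p.stat().st_mtime,
                "SizeInMB": p.stat().st_size / (1024 * 1024),
                "Path": str(p.parent),
            })
    return results
```

From here the list of dicts can be handed to the csv module, much as Export-Csv consumes the objects coming down the pipe.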

Another task involved searching through our SAS source code for something we knew was going to cause us problems in the new environment. Without going into too much detail as to the why of this, we needed to find any location where the developer had tested for whether a variable was equal to -1. As it turned out, due to the way SAS Data Integration Studio works, -1 gets added into generated code in a lot of places, so we needed a regular expression pattern to limit the cases returned. We didn’t care about the instances where special ETL macro variables were set to -1, or were compared with -1. So rather than looking for just the literal “-1” string in code, we needed a negative lookbehind regex to exclude these cases:

# Source code directory
$jobdir = "\\SAS\ETL\Code\Goes\Here"
# Output file
$outfile = "\\List\Of\Problem\Lines.csv"
# Occurrence of -1 that we care about
$pattern = "(?<!--((etls_\w+\s+=\s)|(etls_recnt[,=])))-1"

<#
  Extract lines from SAS ETL source code matching the pattern
  specified and write out the line to the output file
#>
Select-String -Pattern $pattern -Path $jobdir\*.sas |
 Select-Object Filename, LineNumber, Line |
 Export-Csv -Path $outfile -NoTypeInformation

Select-String is pretty much the PowerShell equivalent of grep. It returns a list of MatchInfo objects which, as with the FileInfo objects in the other example, get passed to Select-Object to pull out the properties we care about, and Export-Csv then writes them to the output file.
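If you’d like to experiment with the lookbehind idea before running it over a whole codebase, Python’s re module is handy for that. One caveat: unlike the .NET regex engine PowerShell uses, Python’s standard re only allows fixed-width lookbehinds, so the sketch below uses a simplified pattern covering just the etls_recnt cases, and the sample lines are invented for illustration.

```python
import re

# Match -1 unless it is immediately preceded by "etls_recnt," or
# "etls_recnt=" (a simplified, fixed-width version of the lookbehind)
pattern = re.compile(r"(?<!etls_recnt[,=])-1")

lines = [
    "if status = -1 then abort;",  # the kind of -1 test we care about
    "etls_recnt=-1;",              # generated ETL code: excluded
    "put etls_recnt,-1;",          # generated ETL code: excluded
]
matches = [l for l in lines if pattern.search(l)]
```

Only the first line survives the filter, which is exactly the behaviour we want from the full pattern in Select-String.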

PowerShell documentation is available at https://technet.microsoft.com/en-us/library/bb978526.aspx but you may find the built-in help (Get-Help, or F1) in PowerShell just as easy to find your way around. Happy scripting!

Regular Expressions – new version

Just a quick post tonight to let you all know, I’ve added a new page for downloads, which contains a new version of the regular expressions add-in, compatible with Excel 2007 and later. I’ve added in a couple of utility functions for convenience (mine more than yours but you might find them useful), and a form to evaluate regular expressions against test strings. And there’s even documentation!

The documentation gives a (very) brief intro to the uses and abuses of regular expressions, a run-down of the worksheet functions in the add-in and some examples of their use. Here are a couple of those; I hope you find them useful.

Matching cells which contain variations on a word

There are some words in the English language which Americans, god bless them, spell in their own special way. However, given input on the spreadsheet from users who spell both ways (correctly and incorrectly), you may wish to match both variations of words like ‘realise’ (‘realize’) and ‘colour’ (‘color’).
The pattern to match realise/realize is simple: \breali(s|z)e\b
The word boundary markers ensure we are looking at a complete word, and the alternation of (s|z) means that we match both versions.
Applying the ISRXMATCH formula demonstrates this is successful:
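Outside Excel, the same pattern can be exercised with any regex engine. Here’s a quick check in Python (the test strings are just made up for illustration, and re.IGNORECASE stands in for the add-in’s case_sensitive flag):

```python
import re

pattern = re.compile(r"\breali(s|z)e\b", re.IGNORECASE)

expected = {
    "I realise now": True,
    "They realize it": True,
    "It was realised": False,  # no word boundary between 'realise' and 'd'
    "Realize": True,           # case-insensitive match
}
results = {s: bool(pattern.search(s)) for s in expected}
```

Note that the word boundaries mean inflected forms like ‘realised’ don’t match; you’d need to extend the pattern if you wanted those too.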

Validating Email Addresses

Given a list of email addresses in a column on a spreadsheet, we wish to ensure that these stick to a form which at least obeys some of the rules governing the format of email addresses. As these are going to be used by a script to send emails, we wish to minimise the number of undeliverable responses due to invalid addresses. The basic rules we specify for these addresses are as follows:
  1. The username part of the address contains one or more alphanumeric characters, and possibly some additional special characters.
  2. This is followed by a single @ sign, followed by the domain name, which consists of one or more alphanumeric and special characters, ending with a dot followed by the top-level domain.
  3. The top-level domain must contain only letters, and there must be between 2 and 6 of them.
  4. The address should be the entire content of the cell, so the beginning and ending anchors are used at the start and end of the pattern.
  5. Case is unimportant, so the case_sensitive flag is set to false.
The pattern is as follows: ^[a-z0-9_%.+-]+@[a-z0-9-.]+\.[a-z]{2,6}$
This is then used in ISRXMATCH – a valid email address according to our rules above will return true:

The second address in the list fails due to the whitespace in the username, whereas the fourth fails because its domain name does not include a top-level domain part (a dot followed by 2-6 letters).
I borrowed this regex from http://www.regular-expressions.info/email.html. As well as a couple of alternative regexes to cover some edge cases which the above doesn’t catch, this page also discusses why email addresses can be tricky, and why you shouldn’t go overboard trying to cover every exception.
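To see the rules in action outside the spreadsheet, here’s the same pattern tried against a few made-up addresses in Python (again, re.IGNORECASE plays the part of setting case_sensitive to false):

```python
import re

email = re.compile(r"^[a-z0-9_%.+-]+@[a-z0-9-.]+\.[a-z]{2,6}$", re.IGNORECASE)

addresses = [
    "jane.doe@example.com",  # valid
    "jane doe@example.com",  # invalid: whitespace in the username
    "JANE.DOE@EXAMPLE.COM",  # valid: case is unimportant
    "jane.doe@example",      # invalid: no top-level domain part
]
valid = [a for a in addresses if email.match(a)]
```

The two failures here are the same two kinds of failure described above: a bad character in the username, and a missing top-level domain.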

Some new regular expression functions

There are a couple of SAS functions I have found to be quite useful, and that I miss when I’m working in Excel. Probably the more useful of the two is the SCAN function. You give SCAN a text string, an integer n and optionally some delimiters, and SCAN will give you back the nth ‘word’ in a string, where the ‘words’ are delimited either by the characters you supplied or by default delimiter characters. For instance,

data _null_;
word5 = SCAN("The/number(of+potential.interactions|among^a)large*group<of-people,is$proportional.to!the square/of!their&number.", 5);
put word5=;
run;

will print

word5=interactions

to the log. If n is negative, it will count backwards (from right to left). So

data _null_;
word7 = SCAN("The/number(of+potential.interactions|among^a)large*group<of-people,is$proportional.to!the square/of!their&number.",-7);
put word7=;
run;

will print

word7=proportional

There’s also the COUNTW function, which takes a text string and, optionally, some delimiters, and returns the number of words in the string.

data _null_;
count = COUNTW("Consider, for example, the question of tax policy.");
put count=;
run;

produces

count=8
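For anyone who wants to play with this behaviour without a SAS session handy, here’s a rough Python approximation. It’s a sketch only: treating words as runs of \w characters mimics the examples above, but it isn’t an exact reproduction of SAS’s default delimiter list.

```python
import re

def scan(s, n, dlm=None):
    """Rough analogue of SAS SCAN: the nth word of s, counting
    from the right when n is negative; None if out of range."""
    pat = r"\w+" if dlm is None else "[^" + re.escape(dlm) + "]+"
    words = re.findall(pat, s)
    if n == 0 or abs(n) > len(words):
        return None
    return words[n - 1] if n > 0 else words[n]

def countw(s, dlm=None):
    """Rough analogue of SAS COUNTW: the number of words in s."""
    pat = r"\w+" if dlm is None else "[^" + re.escape(dlm) + "]+"
    return len(re.findall(pat, s))
```

Passing a dlm string builds a negated character class from it, which is the same trick the VBA functions below use.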

Once again, regular expressions turn out to be ideally suited to translating this functionality into Excel worksheet functions. I’ve included them in Regex.xla as RXSCAN and RXCOUNTW:

Public Function RXSCAN(ByVal stringval As String, ByVal n As Integer, _
                        Optional ByVal dlm As String = "") As Variant
' RXSCAN - Returns the nth word from stringval, where
' words are delimited by any non-word character by default,
' or a specific character set by optional string dlm.

Dim objRegex As VBScript_RegExp_55.RegExp
Dim colMatch As VBScript_RegExp_55.MatchCollection
Dim charlist As String
Dim char As String
Dim i As Integer

' Build pattern
If dlm = "" Then ' Just looking for A-Za-z0-9_
  charlist = "\w+"
Else ' Negate character class specified
  charlist = "[^"
  For i = 1 To Len(dlm)
    char = Mid$(dlm, i, 1)
    If InStr("[]\^-", char) Then char = "\" & char ' escape characters special inside a class
    charlist = charlist & char
  Next i
  charlist = charlist & "]+"
End If

' Initialise Regexp object, get match collection
Set objRegex = New VBScript_RegExp_55.RegExp
With objRegex
  .Pattern = charlist
  .Global = True
  Set colMatch = .Execute(stringval)
End With

If colMatch.Count = 0 Then ' No words detected
    RXSCAN = CVErr(xlErrNA)
Else
  If Abs(n) > colMatch.Count Or n = 0 Then
    RXSCAN = CVErr(xlErrNum)
  Else
    If n > 0 Then ' Read from left to right
      RXSCAN = colMatch(n - 1)
    Else ' Right to left
      RXSCAN = colMatch(colMatch.Count + n)
    End If
  End If
End If

End Function

Public Function RXCOUNTW(ByVal stringval As String, _
                        Optional ByVal dlm As String = "") As Variant
' RXCOUNTW - Returns the count of words from stringval, where
' words are delimited by any non-word character by default,
' or a specific character set by optional string dlm.
Dim objRegex As VBScript_RegExp_55.RegExp
Dim colMatch As VBScript_RegExp_55.MatchCollection
Dim charlist As String
Dim char As String
Dim i As Integer

' Build pattern
If dlm = "" Then ' Just looking for A-Za-z0-9_
  charlist = "\w+"
Else ' Negate character class specified
  charlist = "[^"
  For i = 1 To Len(dlm)
    char = Mid$(dlm, i, 1)
    If InStr("[]\^-", char) Then char = "\" & char ' escape characters special inside a class
    charlist = charlist & char
  Next i
  charlist = charlist & "]+"
End If

' Initialise Regexp object, get match collection
Set objRegex = New VBScript_RegExp_55.RegExp
With objRegex
  .Pattern = charlist
  .Global = True
  Set colMatch = .Execute(stringval)
End With

If colMatch.Count = 0 Then ' No words detected
  RXCOUNTW = CVErr(xlErrNA)
Else
  RXCOUNTW = colMatch.Count
End If

End Function

I’ve also moved the add-in file to a dropbox folder. You can now find Regex.xla at http://dl.dropbox.com/u/29927305/Regex.xla

Building an Excel currency converter, part 2

To recap where we left off in the first part, we’d managed to import XML data from the RSS feed for NZD rates at themoneyconverter.com into a table (which I’ve named Rates_Table). This gave us a list of rates which we can use – however, the format is not quite right. In the “description” member of “item”, the rate is expressed in a structured text format, like so: “1 New Zealand Dollar = 2.96886 Argentine Peso”. Now of course that would be useful as it is, if I were prepared to manually type out the 2.96886 part myself to convert from NZ dollars to pesos. Which of course I’m not, lazy data analyst that I am.

What I’m going to do instead is extract the two parts of that text I need to know: the rate (the numeric bit, in this case 2.96886) and the foreign currency description (the character string that comes after the numeric bit, in this case “Argentine Peso”). This could be done with text functions like MID and FIND in Excel, of course, and that’s a perfectly legitimate way to go about it. However, unless you can find a really clever way to do it, you’ll end up with a long, convoluted and awkward formula. Which is a great reason to use regular expression functions instead. RXGET, for instance, is designed to do exactly this sort of extraction. Here’s how I went about it:

First, we need to determine the regular expression pattern that fits the text. The pattern I chose is
=\s(\d*\.\d+)\s(.+)
or to break it down a little,

  1. an equals sign followed by a space: =\s;
  2. followed by a group, which in turn contains zero or more digits, followed by a decimal point, followed by one or more digits: (\d*\.\d+). This will be used to capture the rate part of the expression. Note that the decimal point needs to be escaped here, as an unescaped dot signifies an arbitrary character in a regular expression;
  3. followed by another space: \s;
  4. followed by a second group, which contains one or more arbitrary characters: (.+). This will capture everything after the space, which in this case will be the currency description. Here the dot should not be escaped, as we need it to match any character.
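Before wiring the pattern into the worksheet, it’s easy to confirm what the two groups capture. Here’s that check in Python (purely illustrative; in the workbook the extraction is done by RXGET):

```python
import re

text = "1 New Zealand Dollar = 2.96886 Argentine Peso"
match = re.search(r"=\s(\d*\.\d+)\s(.+)", text)

rate = float(match.group(1))  # first group: the numeric rate
currency = match.group(2)     # second group: the currency description
```

The two groups correspond to the Submatch arguments 1 and 2 passed to RXGET below.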

Once I know that this pattern works, I add a new sheet titled ‘Reference’ and enter the pattern into a cell that I name Rate_Pattern. Now I can use it to extract the rate and the currency using RXGET. I add a column with the header rate at the right-hand end of the Rates_Table and enter in it the RXGET formula, with the following arguments:

This corresponds to the formula:
=RXGET(Rate_Pattern,Rates_Table[[#This Row],[description]],1).
Note that the Submatch argument has the value 1, as we’re extracting the 1st matched group from the expression. To get the currency description, in another new column with the header currency I enter the same formula with the Submatch argument changed to 2 to extract the second group. So here’s how the table looks right now (click on the image to take a closer look):

In the final part of this series I’ll complete the converter by adding in a selector to pick out a rate from the table by specifying the currency.

Regex Functions updated

I’m taking a brief pause on the way to completing the currency converter to revise the functions in the Excel regular expressions add-in, Regex.xla. I’ve been really rather slack in leaving it this long – it was pointed out to me several months ago that the way I had passed error values back to the calling cell did not work as intended if the return type was not Variant. For instance, in RXGET the intent of

RXGET = CVErr(xlErrNA)

is to have Excel display #N/A in the cell. As the function was returning a String rather than a Variant, however, it displayed #VALUE! instead. So I’ve altered the return types to Variant in all the functions, to allow more meaningful error values to be returned.

It was also pointed out more recently that using IsMissing to test for the presence of an optional parameter and provide a default, like so:

If IsMissing(start_num) Then start_num = 0

will not work, as IsMissing works only on Variants, not on primitive data types. It’s a better idea to specify default parameter values anyway; then there’s no need for this logic. So that’s what I’ve done.

Thanks to Colin and mifrey for pointing out the above issues. I’ve replaced the add-in on fileden, and if you like, the code changes can be inspected at the Regex functions page.