A couple of PowerShell scripts

I’ve been making a little more use of PowerShell recently. It’s a very handy addition to the Windows environment: it’s a lot easier to use, and to find your way around, than the traditional command line. I figured I’d share a couple of the scripts I’ve written for some specific file management tasks, in case someone else finds them handy.

Recently we’ve been upgrading our SAS environment, which meant moving a great deal of our data onto new servers. One concern with the migration was that some of our users had been storing disconnected SAS datasets in one of the directories to be moved. That’s not a great practice, and we didn’t want to waste bandwidth transferring stale data unless there was a real need to. So we needed a list of the datasets (just the really big ones) that had been saved there, so that the users responsible could either justify keeping them in the new environment or, preferably, bid them a fond farewell:

# the directory we are migrating
$badpath = "\\no\big\datasets\here\please"
# output to this file
$outfile = "\\list\of\big\datasets.csv"

<# 
  Extract a list of SAS datasets over 100MB in size, and
  write out the name, last modified date, size and location
  to the output file.
#>
Get-ChildItem $badpath -Recurse -Filter *.sas7bdat |
 ? {$_.Length -gt 100MB} |
 Select-Object Name, LastWriteTime,
 @{Name="SizeInMB";Expression={$_.Length/1MB}},
 @{Name="Path";Expression={$_.Directory}} |
 Export-Csv -Path $outfile -NoTypeInformation # omit the #TYPE header line

The Get-ChildItem cmdlet lists all items in the folder $badpath, and the -Recurse parameter tells it to look in all subfolders as well. The -Filter parameter limits the results of the search to SAS datasets. That gets piped to the Where-Object cmdlet (here shortened to ?), which passes only those files over the size limit in its script block through to the next pipe. Finally, Select-Object takes a selection of properties from the objects coming through Where-Object, and the Export-Csv cmdlet writes them to the output file. Note that because we know we are dealing with dataset files, the objects coming through will all have type FileInfo. If we had not restricted the filter to SAS datasets, we would have needed another step to exclude non-file items such as directories.
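
As a minimal sketch of what that extra step might look like (the share name here is made up), filtering on the PSIsContainer property keeps directories out of the pipeline; on PowerShell 3.0 and later, the -File switch on Get-ChildItem does the same job:

# Hypothetical folder, for illustration only
$somepath = "\\some\share"

# With no -Filter to guarantee files, drop containers (directories)
# so that only FileInfo objects reach the size test
Get-ChildItem $somepath -Recurse |
 ? {-not $_.PSIsContainer} |
 ? {$_.Length -gt 100MB}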

Another task involved searching through our SAS source code for something we knew was going to cause us problems in the new environment. Without going into too much detail as to the why of this, we needed to find any location where the developer had tested for whether a variable was equal to -1. As it turned out, due to the way SAS Data Integration Studio works, -1 gets added into generated code in a lot of places, so we needed a regular expression pattern to limit the cases returned. We didn’t care about the instances where special ETL macro variables were set to -1, or were compared with -1. So rather than looking for just the literal “-1” string in code, we needed a negative lookbehind regex to exclude these cases:

# Source code directory
$jobdir = "\\SAS\ETL\Code\Goes\Here"
# Output file
$outfile = "\\List\Of\Problem\Lines.csv"
# Occurrence of -1 that we care about
$pattern = "(?<!--((etls_\w+\s+=\s)|(etls_recnt[,=])))-1"

<#
  Extract lines from SAS ETL source code matching the pattern
  specified and write out the line to the output file
#>
Select-String -Pattern $pattern -Path $jobdir\*.sas |
 Select-Object Filename, LineNumber, Line |
 Export-Csv $outfile -NoTypeInformation

Select-String is pretty much the PowerShell equivalent of grep. It returns a list of MatchInfo objects which, like the FileInfo objects in the other example, get passed to Select-Object to pull out the properties we want, and Export-Csv writes those out to the output file.
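
To convince yourself the lookbehind does what it should, a quick sanity check at the console (against some made-up lines of code) looks something like this:

# Assignments to, and comparisons with, the ETL macro variables are skipped...
"etls_rowcount = -1" -match $pattern    # False
"etls_recnt,-1" -match $pattern         # False
# ...but a bare comparison with -1 is still caught
"if myvar = -1 then" -match $pattern    # True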

PowerShell documentation is available at https://technet.microsoft.com/en-us/library/bb978526.aspx, but the built-in help (F1, or the Get-Help cmdlet) is just as easy to use. Happy scripting!

Why I’d like Python at work

One of the main reasons I like Python as a scripting language is its brevity. In particular, I enjoy list comprehensions, which let you avoid writing explicit for loops over lists. Lists in Python are powerful, and in combination with map and filter they let you do a lot more with a lot less code. This is one of the things I really miss when I write code in VBA/VBScript, where you’re often stuck dealing with either arrays or collections from the object model. Dive Into Python offers an interesting perspective on this.
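
For anyone who hasn’t seen these, here’s a throwaway example of the kind of thing I mean (the file names are made up):

# The loop-and-append style familiar from VBA/VBScript...
names = ["a.csv", "b.txt", "c.csv"]
csvs = []
for n in names:
    if n.endswith(".csv"):
        csvs.append(n)

# ...collapses to a single list comprehension,
csvs = [n for n in names if n.endswith(".csv")]

# or, equivalently, to a call to filter
csvs = list(filter(lambda n: n.endswith(".csv"), names))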

Dick Kusleika posted a VBA sub to open the newest file in a folder over at Daily Dose of Excel today, which made use of the Scripting library to iterate over .csv files in a folder, find the newest and open it. Dick mentioned wanting an easier way to find the newest than looping over all of them, which got me thinking. If the Scripting.Folder object had a .NewestFile property, then that would obviously do it, but failing that, I could not see any other way than to loop through all files to examine the create dates.

In Python, however, the exercise becomes, in essence, a two-line function with no explicit loop:

import os
import os.path

def newest(ext, path=None):
  """
  pre: path a valid absolute directory path or None, ext a file
  extension (e.g. '.py')
  returns: newest file from path, by ctime (creation time on
  Windows; last metadata change on *nix systems)
  """
  if path is None:
    path = os.getcwd()
  file_dict = dict([(fname,
    os.path.getctime(os.path.join(path, fname)))
    for fname in os.listdir(path)
    if os.path.splitext(fname)[1] == ext])
  return max(file_dict, key=lambda x: file_dict[x])
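
Usage is then a one-liner (the folder here is hypothetical, and note that max() will raise a ValueError if the folder contains no files with the given extension):

# Hypothetical folder; returns just the filename, not the full path
print(newest('.csv', r'C:\data\reports'))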

I’m not at all sure whether there’s any performance benefit to doing it this way as opposed to looping over the files, but the real benefit is that it frees you from worrying about how to construct a for loop over the list of filenames returned by os.listdir(). All you need to think about is which one of them you want.