64-bit Excel

The MS Excel team just posted on their blog about a 64-bit version of Excel in the upcoming 2010 version. Nominally, this means, as they put it, “…a version of Excel that runs natively in that [64-bit] environment and takes advantage of 64-bit strengths. One of the more obvious benefits is the lifting of the addressable memory constraint we were dealing with in 32-bit Windows.”

OK, OK, so that’s true. Out-of-memory conditions are one of the problems that have traditionally plagued Excel, and I’m really stoked to see this being addressed. But one thing I’m really not stoked about – and a few people have already commented on this – is the unfortunate wording of the explanation of the need for 64-bit Excel: “In short: enormous capacity; humungous workbooks.” In short: no, no, NOOOOOO!!!

Let me put my objection into context. I see Excel as filling a number of roles in the BI space. It’s got a calculation engine and pivot tables, which make it a great tool for analysis. It’s got a charting engine and formatting, which make it an excellent presentation layer. It’s got VBA (for the time being, anyway), which allows it to be used as a serviceable small-scale rapid application development platform. However, one thing it is not – and should never be used as – is a database. The occasions when it is used that way almost inevitably end in tears, and quite often in mission-critical systems which just quit working as intended – models which no longer calculate, or “humungous workbooks” which no longer open.

I was approached recently by someone in the team implementing the rollout of Office 2007 at work, asking about a number of very large workbooks (built in Excel 2003) which appeared not to open in Excel 2007. “How large is large?” I asked, thinking it’d be in the 50-100 MB range. Turns out these monsters approached 500 MB – several sheets containing 30K rows with 100+ columns filled with VLOOKUPs, OFFSETs, SUMIFs and array functions. They do open (and re-calculate) in Excel 2003, although you do have to wait a few minutes in either case. In fact, they do eventually open in Excel 2007, but only after a 5-hour wait.

Luckily, these behemoths are not in current use, which was the scenario that concerned the project team. If they had been part of a business-critical system, we would have been in serious trouble. The bottom line is, this is a great example of really poor spreadsheet design, which should have as a fundamental guiding principle the restriction of data to the minimum needed to achieve the desired result. Really large datasets do NOT belong in spreadsheets, and I’ll take a lot of convincing to believe that the enhancements offered in a 64-bit version of Excel can change that.

Conversations with Excel, part 3

In the first and second posts in this series I looked at some ways to read from and write to Excel files using SAS. The motivation for doing this is replacing code which uses the DDE protocol to communicate with the Excel application – that works fine using PC SAS on the desktop, but when SAS is on the server, there’s no application to talk to. Instead, SAS uses the SAS/ACCESS interface to ‘speak’ directly with Excel files.

A similar facility is available for working with Excel files in Python, by making use of the packages available from http://www.python-excel.org/. The conversation is enabled through three packages:

xlrd allows you to read data and formatting information from Excel files.

xlwt allows you to create and write to Excel files.

xlutils contains a bunch of utilities for using both xlrd and xlwt, and is required to copy or modify existing Excel files.
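
As a quick taster of the write side, creating a workbook with xlwt looks like this (a minimal sketch – the file and sheet names are invented for illustration):

>>> from xlwt import Workbook
>>> wb = Workbook()
>>> sht = wb.add_sheet('dataSht')
>>> sht.write(0, 0, 'Date')
>>> sht.write(0, 1, 'Data_1')
>>> wb.save('Template_out.xls')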

There’s comprehensive documentation for all three packages at the site, and a tutorial with some examples of how they can be used. You can also visit the active Google group for support.

This means that you have the ability to work with Excel files, using Python on a platform where Excel is not available. Usage is fairly straightforward – for instance, to read the data contained on the sheet ‘dataSht’ in the workbook ‘Template_test.xls’:

>>> from xlrd import open_workbook
>>> wb = open_workbook('Template_test.xls')
>>> sht = wb.sheet_by_name('dataSht')
>>> for row in range(sht.nrows):
...   values = [sht.cell(row, col).value for col in range(sht.ncols)]
...   print values
... 
[u'Template_test.xls', '', '', '', '', '', '', '']
['', '', '', '', '', '', '', 96.0]
[u'Date', u'Data_1', u'Data_2', u'Data_3', u'Data_4', u'Data_5', '', '']
[40043.0, 0.71426979592069983, 0.29721317486837506, 0.14380949875339866, 0.70981460809707642, 0.19360692566260695, '', '']
[40044.0, 0.30376328527927399, 0.75381017150357366, 0.26589830825105309, 0.30413518520072103, 0.41826989687979221, '', '']
[40045.0, 0.99682421330362558, 0.0027025360614061356, 0.64853132842108607, 0.27574777463451028, 0.99392103916034102, '', '']
[40046.0, 0.14693491021171212, 0.93810823513194919, 0.32732625165954232, 0.77697453368455172, 0.35358203155919909, '', '']
[40047.0, 0.43824125546962023, 0.20211741980165243, 0.6220957413315773, 0.28986502904444933, 0.85634097876027226, '', '']
[40048.0, 0.3646774091757834, 0.33247592020779848, 0.84804946463555098, 0.36496656434610486, 0.0059830849058926105, '', '']
[40049.0, 0.6037151631899178, 0.079236360732465982, 0.30319626023992896, 0.74752466194331646, 0.7890509688295424, '', '']
[40050.0, 0.49680318590253592, 0.051287947688251734, 0.54286114033311605, 0.76270149415358901, 0.35542313288897276, '', '']
[40051.0, 0.96113103721290827, 0.75952570792287588, 0.35812566895037889, 0.60966236609965563, 0.03527348255738616, '', '']
[40052.0, 0.35204670811071992, 0.75659727631136775, 0.97338171768933535, 0.67937295977026224, 0.53357180999591947, '', '']
[40053.0, 0.32696374924853444, 0.11761421523988247, 0.73568923026323318, 0.94905949058011174, 0.4074792442843318, '', '']
[40054.0, 0.59203020902350545, 0.31373690022155643, 0.73995516449213028, 0.44007967365905643, 0.67870346456766129, '', '']
[40055.0, 0.74593824986368418, 0.043794836848974228, 0.75793982530012727, 0.049134510103613138, 0.79131949925795197, '', '']
[40056.0, 0.54699079459533095, 0.54593769600614905, 0.84260744694620371, 0.089851934928447008, 0.30863919015973806, '', '']
[40057.0, 0.19803057983517647, 0.050982972607016563, 0.068164898082613945, 0.55615624878555536, 0.66064533870667219, '', '']
[40058.0, 0.1034383806400001, 0.90820295689627528, 0.41724261501803994, 0.076820098329335451, 0.58757591666653752, '', '']
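
Note that the dates in the first column come back as Excel serial numbers (40043.0 and so on). xlrd provides the xldate_as_tuple function to convert these, given the workbook’s datemode – assuming the usual 1900 date system:

>>> import xlrd
>>> xlrd.xldate_as_tuple(40043.0, wb.datemode)
(2009, 8, 18, 0, 0, 0)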

There’s also support for reading named ranges in much the same way as with the SAS Excel libname engine, although it also has the same problem of being unable to evaluate dynamic ranges. It is possible to extract the formula from the ‘RefersTo’ string, so I think it may be possible to parse it to get an address to read from – I’ll follow up on this later.
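
For a static name, the pattern looks something like this – a sketch, where ‘rngData’ is a hypothetical named range (area2d() raises an error for dynamic, formula-based names, which is the problem mentioned above):

>>> name = wb.name_map['rngdata'][0]    # name_map keys are lower-cased
>>> sht, rlo, rhi, clo, chi = name.area2d()
>>> data = [[sht.cell(r, c).value for c in range(clo, chi)]
...         for r in range(rlo, rhi)]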

Enjoy!

Project Euler #26

Problem 26 at Project Euler asks

A unit fraction contains 1 in the numerator. The decimal representation of the unit fractions with denominators 2 to 10 are given:

1/2 = 0.5
1/3 = 0.(3)
1/4 = 0.25
1/5 = 0.2
1/6 = 0.1(6)
1/7 = 0.(142857)
1/8 = 0.125
1/9 = 0.(1)
1/10 = 0.1

Where 0.1(6) means 0.166666…, and has a 1-digit recurring cycle. It can be seen that 1/7 has a 6-digit recurring cycle.

Find the value of d < 1000 for which 1/d contains the longest recurring cycle in its decimal fraction part.

This is a problem about repeated division by the same number – how many divisions can you do before you start repeating yourself? The division algorithm can be expressed as follows:

def div_by(rem, div):
    # Scale the remainder up by powers of 10 until it exceeds the
    # divisor (bringing down zeros, as in long division), then
    # return the new remainder
    y = min([x for x in range(5) if rem*(10**x) > div])
    return rem*(10**y) % div
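
For example, tracing successive remainders for 1/7 recovers the six-remainder cycle behind its 6-digit recurring decimal:

>>> r = 1.0
>>> for _ in range(7):
...     r = div_by(r, 7)
...     print r,
...
3.0 2.0 6.0 4.0 5.0 1.0 3.0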

So then I need to use this to repeatedly divide by the test value d – len_pattern calls div_by until it detects a repeated remainder:

def len_pattern(init, rems, d):
    # Keep dividing until a remainder shows up for the second time;
    # the cycle is the stretch between its two occurrences
    rem = div_by(init, d)
    if rem in rems:
        return len(rems) - rems.index(rem)
    else:
        rems.append(rem)
        return len_pattern(rem, rems, d)
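
A quick sanity check against the problem statement – the cycle for 7 should have length 6:

>>> len_pattern(1.0, [], 7)
6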

I then have to use this to find out which d < 1000 has the longest cycle. However, I don't need to call it on all the numbers under 1000 – the only ones I really need to test are the primes. The upper bound on the length of the cycle is n-1 for any number n (for example the length of the cycle generated by 7 is 6), and primes are the only numbers which generate cycles of this length. So I need to use the list generated by the primes_under() function, then call len_pattern on each of these, returning the maximum length.
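
The primes_under() function lives in my EulerFunc module, which isn’t shown here – a simple sieve of Eratosthenes along these lines would do the job:

def primes_under(n):
    # Sieve of Eratosthenes: all primes below n
    sieve = [True] * n
    sieve[0] = sieve[1] = False
    for i in range(2, int(n**0.5) + 1):
        if sieve[i]:
            sieve[i*i::i] = [False] * len(sieve[i*i::i])
    return [i for i, flag in enumerate(sieve) if flag]

Putting it all together: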

def get_longest():
    from EulerFunc import primes_under
    divs = dict([(div, len_pattern(1.0, [], div)) \
        for div in primes_under(1000) if div > 5])
    longest = max(divs, key=lambda x: divs[x])
    print "Longest cycle is for %d with length %d" % \
        (longest, divs[longest])

if __name__ == '__main__':
    from timeit import Timer
    t = Timer("get_longest()",  "from __main__ import get_longest")
    print t.timeit(number=100)

Longest cycle is for 983 with length 884
Longest cycle is for 983 with length 884
Longest cycle is for 983 with length 884
.
.
.
Longest cycle is for 983 with length 884
151.166881084

So it takes around 1.5 seconds per run on average. Not bad, but I’ve seen people post Python solutions which run in under a second, so I’m sure this could be optimised…
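
One obvious direction is to drop the floats and work digit-by-digit with integer arithmetic. This also sidesteps a subtlety in div_by: it can bring down several zeros in one call (for d = 983 the first call jumps from remainder 1 straight to 17, three decimal digits at once), so len_pattern is really counting division steps rather than digits of the cycle. A minimal sketch, not benchmarked:

def cycle_length(d):
    # Length of the recurring cycle of 1/d for d coprime to 10:
    # multiply the remainder by 10 until it returns to 1, counting
    # one decimal digit per step
    rem, length = 10 % d, 1
    while rem != 1:
        rem = (rem * 10) % d
        length += 1
    return length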

Accessing Dynamic Named Ranges

I was planning to continue the series with a post on accessing Excel data from Python – this is a brief diversion related to a problem I’ve encountered recently with using SAS to read Excel files.

I have a bunch of Excel files which I need to read every month. The relevant data is contained in a dynamic named range – that is, a name where the RefersTo string is typically a formula of the form

=OFFSET(fixedCell,0,0,COUNTA(columnInRange),rangeWidth)

This is useful when the amount of data varies from file to file – in this case, some files will have 5 rows and some will have 5,000. The use of the COUNTA() function means that the height of the range is evaluated by Excel, so it adjusts automatically as data is added or removed. That’s great when you’re reading those ranges in Excel, but it can cause problems when you’re trying to get at them without the benefit of the Excel calculation engine. Case in point: the read_file macro from a previous post, which uses the SAS Excel libname engine, fails to read dynamic ranges. I’m not 100% sure that it’s impossible (the wonderful folks at SAS tech support are looking into it for me), but it has always failed so far.

So, I need a workaround. I came up with the following – it’s a bit of a clunker, but it works. The basic idea: run an Excel macro over the files which copies the data from the dynamic range into a static range in a new workbook.

Private Sub ReadFile(ByVal sBookName As String)
' Reads sitting time files in INPATH, copies all time entries
' to a new workbook in OUTPATH
Dim wbkFile As Excel.Workbook
Dim wbkOut As Excel.Workbook
Dim rngData As Excel.Range
Dim rngOut As Excel.Range
Dim sBranchName As String
Dim sMonth As String


' Specify required values for output range
Set wbkFile = Application.Workbooks.Open(INPATH & sBookName)
Set rngData = wbkFile.Names("rngTimeData").RefersToRange
sBranchName = wbkFile.Names("BranchName").RefersToRange.Value
sMonth = Format$(Evaluate(wbkFile.Names("Month").RefersTo), _
        "mmyy")

' Add a new workbook and specify the range to dump data into
Set wbkOut = Application.Workbooks.Add()
Set rngOut = wbkOut.Worksheets(1).Cells(1, 1)
Set rngOut = rngOut.Resize(rngData.Rows.Count, _
        rngData.Columns.Count)
' transfer values across, name the range, save and close the new
' workbook
With rngOut
    .Value = rngData.Value
End With
wbkOut.Names.Add "TimeData", rngOut


wbkFile.Close False
sBranchName = Replace$(sBranchName, " ", "_")
sBranchName = Left$(sBranchName, InStr(sBranchName, "_DC") - 1)
wbkOut.SaveAs OUTPATH & sBranchName & "_" & sMonth, _
        xlWorkbookNormal
wbkOut.Close


End Sub

Public Sub GetFiles()
' Read all files in INPATH, save copies into OUTPATH and check 
' which ones are missing
Dim fsObj As Scripting.FileSystemObject
Dim inFolder As Scripting.Folder
Dim inFile As Scripting.File


Application.ScreenUpdating = False


Set fsObj = CreateObject("Scripting.FileSystemObject")
Set inFolder = fsObj.GetFolder(INPATH)


For Each inFile In inFolder.Files
    ReadFile inFile.Name
Next inFile


OutputDirectoryContents


Application.ScreenUpdating = True


End Sub

OutputDirectoryContents is a macro which writes the names of all the files in the directory OUTPATH to the main worksheet, then runs an advanced filter over the table of files that should be in there, returning those that aren’t (the Criteria range is set to File Found =FALSE):

Private Sub OutputDirectoryContents()

' Lists contents of OUTPATH on the front sheet and returns any
' missing filenames
Dim fsObj As Scripting.FileSystemObject
Dim outFolder As Scripting.Folder
Dim outFile As Scripting.File
Dim wksht As Excel.Worksheet
Dim rngFiles As Excel.Range
Dim startCell As Excel.Range
Dim rngSource As Excel.Range
Dim rngCriteria As Excel.Range
Dim rngExtract As Excel.Range


' set required variables...
Set fsObj = CreateObject("Scripting.FileSystemObject")
Set outFolder = fsObj.GetFolder(OUTPATH)
Set wksht = ThisWorkbook.Worksheets(1)
Set rngFiles = ThisWorkbook.Names("Files_Found").RefersToRange
Set startCell = rngFiles(1)
Set rngSource = ThisWorkbook.Names("FileWishlist").RefersToRange
Set rngCriteria = ThisWorkbook.Names("rngCriteria").RefersToRange
Set rngExtract = ThisWorkbook.Names("rngExtract").RefersToRange


' Clear, then repopulate, range of files found in the specified folder
rngFiles.ClearContents


startCell.Value = Format$(DateAdd("m", -1, Date), "mmm-yyyy")
For Each outFile In outFolder.Files
    Set startCell = startCell.Offset(1)
    startCell.Value = outFile.Name
Next outFile


' Refresh filter to see which ones are missing...
RefreshAdvancedFilter rngCriteria, rngExtract, rngSource, , True


End Sub

Public Sub RefreshAdvancedFilter(ByRef rngCriteria As Excel.Range, _
               ByRef rngExtract As Excel.Range, _
               ByRef rngSource As Excel.Range, _
               Optional ByVal IsCopy As Boolean = True, _
               Optional ByVal IsUnique As Boolean = False)
' Refreshes advanced filter of the source range
Dim lCopy As Long


lCopy = IIf(IsCopy, xlFilterCopy, xlFilterInPlace)


rngSource.AdvancedFilter Action:=lCopy, criteriarange:=rngCriteria, _
    copytorange:=rngExtract, unique:=IsUnique


End Sub

Once I’m sure that all the files that need to be there are present, I can run the SAS macro get_all_files over the contents of OUTPATH to create the datasets I want. It’s slightly convoluted (and a little wasteful, in that the data is duplicated), but it gets the job done, so I’m happy.

Conversations with Excel, part 2

In the first post of this series I talked about accessing data from Excel workbooks using the SAS Excel libname engine, as a replacement for data access through DDE. However, it’s also possible to use SQL pass-through to perform the same task – again, this makes the code relatively simple and familiar in comparison with the machinery required for DDE. For example, the read_file macro can be rewritten like so:

%macro read_file(fname);
*** Libnames ;
* sas dataset directory;
libname sasdata "&path.\&adhocref.\sasdata" ;

* strip off the file extension;
%let namelen=%eval(%length(&fname)-4);
%let fshortname=%substr(&fname, 1, &namelen);

proc sql;
  connect to excel(path="&inputdir.\&fname");
  create table sasdata.&fshortname as
  select * from connection to excel
  (select * from 'rngData');
  disconnect from excel;
quit;

%mend read_file;

So why prefer this to the libname? I think it’s probably more a question of style than anything – I haven’t compared performance, but with the amount of data in the spreadsheets I’m accessing it’s quick either way. I’ve gotten a lot more used to SQL pass-through over the last year or so, and I now find it a more natural way to think about a data access problem than the data step.

For example, given a spreadsheet with a few thousand rows of employee hours worked in the range EmployeeTimeData, and a list of employees with their department and email address in the range emp_list, the following pass-through query allows me to calculate totals for each employee and return them to a SAS dataset:

*** Subtotal hours worked for each employee;
proc sql;
  connect to excel(path="&inputdir.\&fname");
  create table sasdata.emp_hrs_email as
  select * from connection to excel
   (select
           a.Name_of_Employee as emp_name,
           b.DEPARTMENT as emp_department,
           b.EMAIL_ADDRESS as emp_email,
           sum(a.Time_in_Hours) as total_time
    from
      'EmployeeTimeData' a inner join
      'emp_list' b on
           a.Name_of_Employee = b.NAME_FOR_DOCUMENTATION
    group by
            a.Name_of_Employee,
            b.DEPARTMENT,
            b.EMAIL_ADDRESS
    );
quit;

It’s also possible to make the conversation go the other way, passing non-query statements with the execute statement:

*** Drop and re-create table emp_list;
proc sql;
  connect to excel as xldb(path="&inputdir.\&fname");
  execute(drop table 'emp_list') by xldb;
  execute(create table 'emp_list'
               (NAME_FOR_DOCUMENTATION VARCHAR(20) not null,
                DEPARTMENT VARCHAR(10),
                EMAIL_ADDRESS VARCHAR(20))
             ) by xldb;
quit;

Documentation for using the pass-through facility to access Excel/Access files in SAS 9.1.3 is here.

Next time I’ll take a look at some other ways to start conversations with Excel, this time using Python.

Conversations with Excel, part 1

One of our most frequent requirements at work is transporting data between SAS and Excel. For the most part with our ad-hoc reporting the ‘conversation’ is one-way: the extraction and transformation is performed in SAS, then the data is output to a file destination in a format that Excel can read, via ODS to CSV, HTML, XML, or tab-delimited text files. The output file can then be opened in Excel and any after-work analysis or presentation performed there – formatting, worksheet formulae, pivot tables and charts can all be added.

However, with regular reports it would be nice to automate a lot of this after-work by keeping a template with the formatting, formulae etc already complete, and having SAS simply update the underlying data. This can be done manually via a copy-paste operation from the output file or by using VBA to automate the transfer, and this is perfectly satisfactory – but this is not the only way. A lot of our legacy code uses the very useful but also slightly outdated DDE protocol – I won’t go into the detail of how DDE is used in SAS (usually through a filename statement) but there are plenty of good references available, for example here. DDE allows SAS to control the Excel application via the Excel 4 macro language, to open Excel files and to specify locations within those files to read from or write to. The only real problem we have with DDE is that, as we use SAS Enterprise Guide on the desktop as a thin client and SAS is on the server, all that legacy code is unable to find the Excel application to talk to.

The SAS v9 Excel libname engine (in the SAS/ACCESS package) to a large extent fills the gap left by DDE, without the need to control an instance of the Excel application. The libname statement creates a library out of the Excel workbook, where the tables in the library are the worksheets and named ranges in the workbook. This makes the code relatively simple – for example, I use the following macro to read the range ‘rngData’ from the file ‘fname’ into the SAS dataset ‘fshortname’:

%macro read_file(fname);
*** Libnames ;
* sas dataset directory;
libname sasdata "&path.\&adhocref.\sasdata" ;

* Open file and create libref;
libname returnWB EXCEL "&inputdir.\&fname" USEDATE=NO;

* strip off the file extension;
%let namelen=%eval(%length(&fname)-4);
%let fshortname=%substr(&fname, 1, &namelen);

data sasdata.&fshortname.;
  set returnWB.rngData;
run;

* Close file and clear libname;
libname returnWB clear;

%mend read_file;

In this application, I then use the SAS functions dopen, dnum, dread and dclose to iterate over files in the source directory and create a bunch of SAS datasets from their specified ranges:

%macro get_all_files;

%let fref=flist;
%let dname=%sysfunc(filename(fref, &inputdir));
%let did=%sysfunc(dopen(&fref));
%let numfiles=%sysfunc(dnum(&did));

%do i=1 %to &numfiles;
  %read_file(%sysfunc(dread(&did, &i)));
%end;

%let dname=%sysfunc(dclose(&did));

%mend get_all_files;

The advantage of having a libname created is then that the data can be viewed and manipulated in much the same way as any other SAS dataset. The conversation can of course go the other way, simply by changing the order of the dataset references in the data step:

%macro writeto_file(fname);
*** Libnames ;
* sas dataset directory;
libname sasdata "&path.\&adhocref.\sasdata" ;

* Open file and create libref;
libname returnWB EXCEL "&inputdir.\&fname" USEDATE=NO;

* strip off the file extension;
%let namelen=%eval(%length(&fname)-4);
%let fshortname=%substr(&fname, 1, &namelen);

data returnWB.rngData;
  set sasdata.&fshortname.;
run;

* Close file and clear libname;
libname returnWB clear;

%mend writeto_file;

The macro now creates the range rngData in the workbook (assuming it doesn’t already exist – if it does, this method will not work). The article De-Mystifying the SAS Libname Engine in Microsoft Excel contains some great examples of how to access, replace, create and delete ranges in Excel workbooks.

Next time I’ll look at an alternative: SQL pass-through (also through SAS/ACCESS), and the reason this may be preferable.