Count Distinct Values

I’m sorry for the lack of posts over the last few months – what can I say, I’ve been busy. I had meant to continue the series on data access patterns from here and I put a revision of the code here, but I haven’t had the time to do much more. I may get back to it eventually, but in the meantime I thought I’d post something else which I’ve found dictionaries to be quite useful for.

The below is a function which gives a count of the distinct items in an array. As items are added, the dictionary will prevent duplication of key values – and the Count method gives the number of keys in the dictionary.

Public Function CountDistinct(ByRef vArr As Variant) As Variant

    Dim dict As Scripting.Dictionary
    Dim lVal As Long
    Set dict = New Scripting.Dictionary
    For lVal = LBound(vArr) To UBound(vArr)
        dict(vArr(lVal)) = vArr(lVal)
    Next lVal
    CountDistinct = dict.Count

End Function


A much simpler implementation is available in Python, using the set datatype:

>>> def CountDistinct(vals):
...   return len(set(vals))
>>> CountDistinct([1, 2, 3, 3, 4])

Python Data Access patterns, part 1

Recently I’ve been reading Clifton Nock’s book Data Access Patterns: Database Interactions in Object-Oriented Applications, which as you’d expect from the title, covers a lot of patterns to do with data access. The first part of the book is dedicated to decoupling patterns, which seems to me to be pretty fundamental to good design of applications which connect with databases. If you want an application to interact with a database, then in general it’s a good idea to place the code handling data access into a separate component. The idea is to encapsulate physical database calls in logical operations, which the application can then use, as opposed to mixing them into the rest of the application code.

The benefits of keeping application code (code which needs to know about objects in the application domain) separate from data access code (code which needs to know about how to get at the database) are many. Maintenance is a lot easier – by exposing only logical operations to your application you can change anything related to the detail of how these are implemented without worrying about breaking code elsewhere in the application. The application code itself is cleaner, as it can focus on the objects without worrying about how these are stored physically. This also affords a nice division of labour, so you could have someone working on the business logic in the application who knows nothing of the database behind it.

Here’s an example of how such a data access component could be implemented in Python, where the application is interacting with a SQLite database. The DataAccessor class (apologies to Clifton Nock here, I both stole the class name and modelled the code pretty closely on his Java version) handles connecting to the database and exposes the operations the application might require: read, insert, update, delete. For the sake of brevity I’ve only included the read and insert operations here – update and delete follow a pretty similar pattern to insert.

import apsw
import os

def sql_trace(stmt, bindings):
    'Echoes all SQL executed'
    print "SQL:", stmt
    if bindings:
        print "Bindings:", bindings
    return True

class DataAccessor(object):
    Class to handle data access using apsw sqlite wrapper
    def __init__(self, dbpath, echo=False):
            if os.path.exists(dbpath):
                self.conn = apsw.Connection(dbpath)
                self.cur = self.conn.cursor()
                if echo:
                raise IOError('Database not found: ' + dbpath)
        except apsw.CantOpenError as detail:
            print "Unable to open db file: ", dbpath, detail

    def read(self, table, columns=None, where_row=None, sort_cols=None):
        '''Executes a SELECT statement against table.

        table                 -- name of the table to be read
        columns (optional)    -- list of columns to be read
                              from table
        where_row (optional)  -- dict used to build WHERE
        sort_cols (optional)  -- list of (column, order) pairs
                              used to specify order of the
                              rows returned. Needs to be of
                              the form ('<column>', 'ASC'|'DESC')

        Returns: rows returned from the SELECT statement.
            stmt = 'SELECT '
            if columns:
                stmt += ', '.join(columns)
                stmt += '*'

            # from clause
            stmt += "\nFROM " + table

            # where clause
            if where_row:
                stmt += "\nWHERE "
                stmt += "\n  AND ".join([col + "=:" + col \
                                    for col in where_row])

            # order clause
            if sort_cols:
                stmt += "\nORDER BY "
                stmt += ', '.join([col[0] + ' ' + col[1] \
                                    for col in sort_cols])

            stmt += ';'

            # submit and return results
            args = where_row and (stmt, where_row) or (stmt,)

            results = columns and [dict(zip(columns, row)) \
                for row in self.cur.execute(*args)] \
                or [row for row in self.cur.execute(*args)]

            return results

        except apsw.SQLError as sql:
            print 'Error in SQL submitted:', sql
            print 'SQL:', stmt
            if where_row:
                print 'Bindings:', where_row

        except apsw.Error as error:
            print 'APSW Error: ', error

        except Exception as error:
            print 'Error reading from database:', error


    def insert(self, table, values):
        '''Executes an INSERT statement against table.

        table           -- name of the table to be written to
        values          -- list of rows (dicts) to be inserted

        Returns: None
            # build list of column names
            cols = values[0].keys()

            # generate insert statement
            stmt = 'INSERT INTO ' + table + ' ('
            stmt += ', '.join(cols)
            stmt += ') VALUES ('
            stmt += ', '.join([":%s" % col for col in cols])
            stmt += ')'

            # submit

            self.cur.execute('BEGIN IMMEDIATE')
            self.cur.executemany(stmt, values)

            return self.conn.changes()

        except apsw.SQLError as sql:
            print 'Error in SQL submitted:', sql
            print 'SQL:', stmt

        except apsw.Error as error:
            print 'APSW Error: ', error

        except Exception as error:
            print 'Error submitting insert:', error


I’ve used the apsw SQLite wrapper here, but if at some point I decided to switch to pysqlite, or to use a MySQL database, I could do so without greatly affecting the calling code. The potential drawback here is that what you gain is offset by the loss of control. It may well be that all your application needs to do is to read from or write to a single table at a time, but what about if it needs to execute a join? Drop or create a table or view? What if you need to do specify a more complex where condition than “column=value”?

It may well be possible to rewrite this class to expose these operations, but I could see this getting to be a lot of work. In the next part I’ll look at some ORM (Object Relational Map) solutions which get around this nicely by mapping an OO structure to a relational model.

Project Euler #18 and #67

I’m hoping to get back to work on the regular expressions filter soon – should have some time over the Christmas/New Year break – but in the meantime, I took a look at the Euler problems related to number triangles. Both problems ask for the maximal path through adjacent points in the triangle – from problem 18:

By starting at the top of the triangle below and moving to adjacent numbers on the row below, the maximum total from top to bottom is 23.

7 4
2 4 6
8 5 9 3

That is, 3 + 7 + 4 + 9 = 23.

Find the maximum total from top to bottom of the triangle below:

95 64
17 47 82
18 35 87 10
20 04 82 47 65
19 01 23 75 03 34
88 02 77 73 07 63 67
99 65 04 28 06 16 70 92
41 41 26 56 83 40 80 70 33
41 48 72 33 47 32 37 16 94 29
53 71 44 65 25 43 91 52 97 51 14
70 11 33 28 77 73 17 78 39 68 17 57
91 71 52 38 17 14 91 43 58 50 27 29 48
63 66 04 68 89 53 67 30 73 16 69 87 40 31
04 62 98 27 23 09 70 98 73 93 38 53 60 04 23

It is of course possible to work out the sum of every single path, by starting at the top and trying every combination of left and right choices to get to the bottom, but for a triangle with n rows this works out at 2n-1 sums to calculate. This makes the brute-force approach impractical for the larger triangle in problem 67 (100 rows); it also involves a whole lot of unnecessary calculation.

The key to seeing this is understanding that (1) every point in the triangle above the bottom row is itself the apex of a sub-triangle, that has its own maximal path, and (2) where any sub-triangles intersect, this intersection contains another sub-triangle – so working out all the paths through the two intersecting triangles involves working out every path through the intersection twice. For example, in the triangle

7 4
2 4 6
8 5 9 3

the two sub-triangles with apexes at 7 and 4 both contain the sub-triangle

5 9

with two possible paths. So calculating every possible path through the larger triangle involves traversing both these paths twice, once for each sub-triangle containing them. It’s easy to see that this gets worse quickly as the triangle gets larger.

A smarter approach is what might be termed a ‘bottom-up’ method. Working from the base upwards, calculate the maximal path for every point in each row. Once you have the maxima for two adjacent points in a row, the maximum for the point above both these points is simply the sum of this point and the maximum of these two values. The advantage of this method is that the maximum for any point in the triangle is calculated once and only once. Here’s how this looks in Python.

First, a helper function get_triangle to read the string in and get a list of rows of integers making up the triangle:

def get_triangle(triangle_string):
    #reverse to work from the base up
    return [[int(x) for x in row.split()] \
        for row in string_list]

Then, the main routine max_path works through each row, progressively replacing the list of maxima at each point in the previous row with the same for the current row:

def max_path(triangle_string):
    triangle = get_triangle(triangle_string)
    #every point in the base is its own maximum
    max_list = triangle[0]
    for row in range(1, len(triangle)):
        #add the greater of the two adjacent maxima to
        #each point in the row
        max_list = \
            [triangle[row][x] + max(max_list[x], max_list[x+1]) \
             for x in range(len(triangle[row]))]
    return max_list[0]

That’s all for now – time permitting I will be posting the next part of the regular expression filter sometime in the next couple of days. Until then, hoping you all had a restful and happy Christmas and are taking some time off to be with family. Take care…

Conversations with Excel, part 3

In the first and second posts in this series I looked at some ways to read from and write to Excel files using SAS. The motivation for doing this is replacing code which uses the DDE protocol to communicate with the Excel application – which works fine using PC SAS on the desktop, but when SAS is on the server, there’s no application to talk to. SAS uses the SAS/ACCESS interface to ‘speak’ directly with Excel files.

A similar facility is available for working with Excel files in Python, by making use of the packages available from The conversation is enabled through three packages:

xlrd allows you to read data and formatting information from Excel files.

xlwt allows you to create and write to Excel files.

xlutils contains a bunch of utilities for using both xlrd and xlwt, and is required to copy or modify existing Excel files.

There’s comprehensive documentation for all three packages at the site, and a tutorial with some examples of how they can be used. You can also visit the active Google group for support.

This means that you have the ability to work with Excel files, using Python on a platform where Excel is not available. Usage is fairly straightforward – for instance, to read the data contained on the sheet ‘dataSht’ in the workbook ‘Template_test.xls’:

>>> from xlrd import open_workbook
>>> wb = open_workbook('Template_test.xls')
>>> sht = wb.sheet_by_name('dataSht')
>>> for row in range(sht.nrows):
...   values = [sht.cell(row, col).value for col in range(sht.ncols)]
...   print values
[u'Template_test.xls', '', '', '', '', '', '', '']
['', '', '', '', '', '', '', 96.0]
[u'Date', u'Data_1', u'Data_2', u'Data_3', u'Data_4', u'Data_5', '', '']
[40043.0, 0.71426979592069983, 0.29721317486837506, 0.14380949875339866, 0.70981460809707642, 0.19360692566260695, '', '']
[40044.0, 0.30376328527927399, 0.75381017150357366, 0.26589830825105309, 0.30413518520072103, 0.41826989687979221, '', '']
[40045.0, 0.99682421330362558, 0.0027025360614061356, 0.64853132842108607, 0.27574777463451028, 0.99392103916034102, '', '']
[40046.0, 0.14693491021171212, 0.93810823513194919, 0.32732625165954232, 0.77697453368455172, 0.35358203155919909, '', '']
[40047.0, 0.43824125546962023, 0.20211741980165243, 0.6220957413315773, 0.28986502904444933, 0.85634097876027226, '', '']
[40048.0, 0.3646774091757834, 0.33247592020779848, 0.84804946463555098, 0.36496656434610486, 0.0059830849058926105, '', '']
[40049.0, 0.6037151631899178, 0.079236360732465982, 0.30319626023992896, 0.74752466194331646, 0.7890509688295424, '', '']
[40050.0, 0.49680318590253592, 0.051287947688251734, 0.54286114033311605, 0.76270149415358901, 0.35542313288897276, '', '']
[40051.0, 0.96113103721290827, 0.75952570792287588, 0.35812566895037889, 0.60966236609965563, 0.03527348255738616, '', '']
[40052.0, 0.35204670811071992, 0.75659727631136775, 0.97338171768933535, 0.67937295977026224, 0.53357180999591947, '', '']
[40053.0, 0.32696374924853444, 0.11761421523988247, 0.73568923026323318, 0.94905949058011174, 0.4074792442843318, '', '']
[40054.0, 0.59203020902350545, 0.31373690022155643, 0.73995516449213028, 0.44007967365905643, 0.67870346456766129, '', '']
[40055.0, 0.74593824986368418, 0.043794836848974228, 0.75793982530012727, 0.049134510103613138, 0.79131949925795197, '', '']
[40056.0, 0.54699079459533095, 0.54593769600614905, 0.84260744694620371, 0.089851934928447008, 0.30863919015973806, '', '']
[40057.0, 0.19803057983517647, 0.050982972607016563, 0.068164898082613945, 0.55615624878555536, 0.66064533870667219, '', '']
[40058.0, 0.1034383806400001, 0.90820295689627528, 0.41724261501803994, 0.076820098329335451, 0.58757591666653752, '', '']

There’s also support for reading named ranges in much the same way as with the SAS Excel libname engine, although it also has the same problem of being unable to evaluate dynamic ranges. It is possible to extract the formula from the ‘RefersTo’ string, so I think it may be possible to parse it to get an address to read from – I’ll follow up on this later.


Project Euler #26

Problem 26 at Project Euler asks

A unit fraction contains 1 in the numerator. The decimal representation of the unit fractions with denominators 2 to 10 are given:

^(1)/_(2) = 0.5

^(1)/_(3) = 0.(3)

^(1)/_(4) = 0.25

^(1)/_(5) = 0.2

^(1)/_(6) = 0.1(6)

^(1)/_(7) = 0.(142857)

^(1)/_(8) = 0.125

^(1)/_(9) = 0.(1)

^(1)/_(10) = 0.1

Where 0.1(6) means 0.166666…, and has a 1-digit recurring cycle. It can be seen that ^(1)/_(7) has a 6-digit recurring cycle.

Find the value of d < 1000 for which ^(1)/_(d) contains the longest recurring cycle in its decimal fraction part.

This is a problem about repeated division by the same number – how many divisions can you do before you start repeating yourself? The division algorithm can be expressed as follows:

def div_by(rem, div):
    y = min([x for x in range(5) if rem*(10**x) > div])
    return rem*(10**y) % div

So then I need to use this to repeatedly divide by the test value d – len_pattern calls div_by until it detects a repeated remainder:

def len_pattern(init, rems, d):
    rem=div_by(init, d)
    if rem in rems:
        return len(rems) - rems.index(rem)
        return len_pattern(rem, rems, d)

I then have to use this to find out which d < 1000 has the longest cycle. However, I don't need to call it on all the numbers under 1000 – the only ones I really need to test are the primes. The upper bound on the length of the cycle is n-1 for any number n (for example the length of the cycle generated by 7 is 6), and primes are the only numbers which generate cycles of this length. So I need to use the list generated by the primes_under() function, then call len_pattern on each of these, returning the maximum length.

def get_longest():
    from EulerFunc import primes_under
    divs = dict([(div, len_pattern(1.0, [], div)) \
        for div in primes_under(1000) if div > 5])
    longest=max(divs,  key=lambda x: divs[x])
    print "Longest cycle is for %d with length %d" % \
        (longest, divs[longest])

if __name__ == '__main__':
    from timeit import Timer
    t = Timer("get_longest()",  "from __main__ import get_longest")
    print t.timeit(number=100)

>>> Longest cycle is for 983 with length 884
Longest cycle is for 983 with length 884
Longest cycle is for 983 with length 884
Longest cycle is for 983 with length 884

So it takes on average around 1.5 seconds to run. Not bad, I see some people had Python code which ran in under a second, so I’m sure this could be optimised…

Functional Programming in Python

Before I get started I should acknowledge a number of sources that form the basis for this post. First and foremost, Guido van Rossum and Fred Drake, Jr.’s excellent ‘Functional Programming HOWTO’ for Python 2.6.2, and secondly David Mertz’s book ‘Text Processing in Python’, which both served to convince me of the benefits of functional programming. Check them out, they’re freely available and both extremely well written (although neither is really suitable for beginners). Also the Abelson and Sussman text ‘Structure and Interpretation of Computer Programs’ from which I borrow a couple of examples.

The programming languages I started out learning were mostly procedural: in C, a program solves a problem by having the computer execute a set of instructions in a particular order. I then moved on to object-oriented programming in languages like VB, Ruby and Python – solving a problem is accomplished by designing, reading and changing the internal state of objects which model the problem. Both of these models are imperative, in that they are characterised by telling the computer to do something. There is typically a lot of assignment to the different variables which make up the state machine of the program. For a very simple example, the function sum_to below finds the sum of integers up to the argument n (the nth trianglular number):

def sum_to(n):
  a = 0
  for x in range(n + 1):
    a += x
  return a

>>> sum_to(10)

This relies on the repeated assignment of new values to the variable a, which holds the value which is eventually returned. The state of the variables x and a is maintained throughout the program’s execution.

Functional programming, as the name suggests, is centred on functions – a program written in a functional style relies on the evaluation of functions, which turn input into output without the need to waste code on maintaining internal state. The sum_to function could be rewritten in a functional style as follows:

def sum_to(n):
  return sum(range(n+1))

>>> sum_to(10)

Python fully supports writing functional code, although as the examples above illustrate, you don’t have to. So why would you? No doubt there is some advantage to not having to maintain a potentially complex internal state machine, which does make functional code easier to write IMO; I guess potentially also there is some performance advantage to be gained by not having so much clutter in the namespace. However, I find the two best reasons for using functional code are

  1. It encourages functional decomposition. This is a good thing. Rather than make the classic mistake of trying to solve a complex problem with a single monolithic procedure (which is of course usually possible but which will probably have those who maintain your code cursing your name for years to come), the program flow will ideally go from input through a series of small, easily debugged functions to eventual output. This is an approach which I think appeals to me at least in part because it’s similar to what I do when I construct a spreadsheet model. In fact, worksheet functions in any spreadsheet are a very good example of purely functional programming – each function simply takes input and produces output. It also appeals because it improves code reusability and composability.
  2. Higher-order functions. These are like function constructors: functions which take functions as arguments and return functions as outputs. This is an extremely convenient abstraction, as it enables you to concentrate on the concepts involved in the problem, rather than getting lost in the detail of how the concepts are handled.

Python has a lot of builtin tools to support higher-order functions: generator expressions and list comprehensions provide concise statements for large structures, and the reduce() function iteratively applies a function to an sequence. To take an example from SICP, a useful functional abstraction is the summation of a series of terms. We can construct a ‘function factory’ for the general series, which takes as  arguments the function used to construct the terms, and the iterator that supplies the argument to that function:

def sum_terms(term, objIter):
  sum_terms: function constructor
  Usage: term is a function taking a single argument
         objIter is an iterable object returning the arguments for term
  Returns: a function to calculate a sum of terms in a series
  return sum([term(x) for x in objIter])

Because this abstracts out the process of summation, the problem of constructing a large series is decomposed nicely into two smaller sub-problems. So, for instance, I can construct the triangular sum function from earlier by splitting the task into the construction of the term (the identity function) and the range to apply it to (for an argument n, the range of integers from 1 to n):

>>> identity = lambda x: x
>>> inc = lambda n: range(1, n + 1)
>>> sum_triangle = lambda x: sum_terms(identity, inc(x))
>>> sum_triangle(10)

I can also create a function for summing a series of cubes: \sum_{i=1}^n i^3

>>> cube = lambda x: x**3
>>> sum_cubes = lambda x: sum_terms(cube, inc(x))
>>> sum_cubes(10)

A slightly more complex series can give an approximate numerical solution for the definite integral of a function over an interval:

\int_a^b f(x) \, dx = \delta*\sum_{k=1}^{(b-a)/\delta} f(a+\delta/2 +k*\delta)

In this case, the arguments a and b determine the length of the interval and dx sets the width of the subintervals used:

def integral(func, a, b, dx):
  def rng(a, b, dx): 
    return ((a + dx/2 + k*dx) for k in range(int((b-a)/dx)))
  return dx*sum_terms(func, rng(a, b, dx))

This then returns an integration function which can be applied to any function. Testing it out on the cube function:

>>> integral(cube, 0, 1, 0.01)

which evaluates exactly to 1/4.

Why I’d like Python at work, part 2

Automating Excel with Python

As well as using Python to interact with the filesystem, it’s possible to use COM automation to open Excel and access its functionality. I decided to take a look at Mark Hammond’s PythonWin IDE, which contains the win32com.client package to provide a wrapper for the Excel automation object. For example, the function newest from part 1 could be implemented as follows:

# - opens latest Excel file
import os, os.path, win32com.client

def newest(ext, path=os.getcwd()):
    '''pre: path a valid absolute filepath (optional), ext a file
    extension (e.g. '.xls')
    returns: name of the newest (last modified on *nix systems,
    last created on Windows) file with extension ext from path
    file_dict = dict([(fname, \
      os.path.getctime(os.path.join(path, fname))) \
      for fname in os.listdir(path) \
      if os.path.splitext(fname)[1] == ext])
    return os.path.join(path, \
                        max(file_dict, key=lambda x: file_dict[x]))

if __name__ == __main__:
  xlApp = win32com.client.Dispatch("Excel.Application")
  lastFileName = newest('.xls')
  lastFile = xlApp.Workbooks.Open(lastFileName)

  # Do some stuff with lastFile...

  xlApp = None

So, what stuff can you do? One thing that would be nice is to be able to open an Excel file and run a macro in it. I decided to test this out by saving a file named ‘PythonTest.xls’, containing the macro ‘helloFromPython’:

Public Sub helloFromPython()
Sheet1.Range("B1:B100").Value = "Hello from Python!"
End Sub

The script then creates an Excel automation object, which opens the file and calls the macro in it:

# - opens named Excel file fName, runs named macro
# macName
import os, os.path, win32com.client

def run_macro(fName, macName, path=os.getcwd()):
  pre: fName is the name of a valid Excel file with macro macName
  post: fName!macName is run, fName saved and closed
  fName = os.path.join(path, fName)
  xlApp = win32com.client.Dispatch("Excel.Application")
  fTest = xlApp.Workbooks.Open(fName)
  macName = fTest.Name + '!' + macName
  xlApp = None

if __name__ == "__main__":
    run_macro("PythonTest.xls", "helloFromPython")

Run the script, open the file and voila!


What I’m thinking would be really nice is to be able to automate a lot of our regular reporting by doing this – a single script could be run to kick off a batch of reports. I wonder, though, about how efficient using Python to do this would be as opposed to using a VBA/VBScript routine to do the same thing…does anyone else use Python for this sort of task?