Count Distinct Values

I’m sorry for the lack of posts over the last few months – what can I say, I’ve been busy. I had meant to continue the series on data access patterns from here and I put a revision of the code here, but I haven’t had the time to do much more. I may get back to it eventually, but in the meantime I thought I’d post something else that I’ve found dictionaries quite useful for.

Below is a function which gives a count of the distinct items in an array. As items are added, the dictionary prevents duplication of key values – and its Count method then gives the number of distinct keys. (Note that the early-bound Scripting.Dictionary used here needs a reference to the Microsoft Scripting Runtime.)

Public Function CountDistinct(ByRef vArr As Variant) As Variant

    Dim dict As Scripting.Dictionary
    Dim lVal As Long
    
    Set dict = New Scripting.Dictionary
    
    'assigning to the same key twice simply overwrites it,
    'so duplicates collapse into a single entry
    For lVal = LBound(vArr) To UBound(vArr)
        dict(vArr(lVal)) = vArr(lVal)
    Next lVal
    
    CountDistinct = dict.Count

End Function

?CountDistinct(array(1,2,3,3,4))
 4

A much simpler implementation is available in Python, using the set datatype:

>>> def CountDistinct(vals):
...   return len(set(vals))
... 
>>> CountDistinct([1, 2, 3, 3, 4])
4

Python Data Access patterns, part 1

Recently I’ve been reading Clifton Nock’s book Data Access Patterns: Database Interactions in Object-Oriented Applications, which as you’d expect from the title, covers a lot of patterns to do with data access. The first part of the book is dedicated to decoupling patterns, which seems to me to be pretty fundamental to good design of applications which connect with databases. If you want an application to interact with a database, then in general it’s a good idea to place the code handling data access into a separate component. The idea is to encapsulate physical database calls in logical operations, which the application can then use, as opposed to mixing them into the rest of the application code.

The benefits of keeping application code (code which needs to know about objects in the application domain) separate from data access code (code which needs to know about how to get at the database) are many. Maintenance is a lot easier – by exposing only logical operations to your application you can change anything related to the detail of how these are implemented without worrying about breaking code elsewhere in the application. The application code itself is cleaner, as it can focus on the objects without worrying about how these are stored physically. This also affords a nice division of labour, so you could have someone working on the business logic in the application who knows nothing of the database behind it.

Here’s an example of how such a data access component could be implemented in Python, where the application is interacting with a SQLite database. The DataAccessor class (apologies to Clifton Nock here, I both stole the class name and modelled the code pretty closely on his Java version) handles connecting to the database and exposes the operations the application might require: read, insert, update, delete. For the sake of brevity I’ve only included the read and insert operations here – update and delete follow a pretty similar pattern to insert.

import apsw
import os

def sql_trace(stmt, bindings):
    'Echoes all SQL executed'
    print "SQL:", stmt
    if bindings:
        print "Bindings:", bindings
    return True

class DataAccessor(object):
    '''
    Class to handle data access using apsw sqlite wrapper
    '''
    def __init__(self, dbpath, echo=False):
        try:
            if os.path.exists(dbpath):
                self.conn = apsw.Connection(dbpath)
                self.cur = self.conn.cursor()
                if echo:
                    self.cur.setexectrace(sql_trace)
            else:
                raise IOError('Database not found: ' + dbpath)
        except apsw.CantOpenError as detail:
            print "Unable to open db file: ", dbpath, detail
            raise

    def read(self, table, columns=None, where_row=None, sort_cols=None):
        '''Executes a SELECT statement against table.

        Arguments:
        table                 -- name of the table to be read
        columns (optional)    -- list of columns to be read
                              from table
        where_row (optional)  -- dict used to build WHERE
                              clause
        sort_cols (optional)  -- list of (column, order) pairs
                              used to specify order of the
                              rows returned. Needs to be of
                              the form ('<column>', 'ASC'|'DESC')

        Returns: rows returned from the SELECT statement.
        '''
        try:
            stmt = 'SELECT '
            if columns:
                stmt += ', '.join(columns)
            else:
                stmt += '*'

            # from clause
            stmt += "\nFROM " + table

            # where clause
            if where_row:
                stmt += "\nWHERE "
                stmt += "\n  AND ".join([col + "=:" + col \
                                    for col in where_row])

            # order clause
            if sort_cols:
                stmt += "\nORDER BY "
                stmt += ', '.join([col[0] + ' ' + col[1] \
                                    for col in sort_cols])

            stmt += ';'

            # submit and return results
            args = (stmt, where_row) if where_row else (stmt,)

            if columns:
                results = [dict(zip(columns, row))
                           for row in self.cur.execute(*args)]
            else:
                results = [row for row in self.cur.execute(*args)]

            return results

        except apsw.SQLError as sql:
            print 'Error in SQL submitted:', sql
            print 'SQL:', stmt
            if where_row:
                print 'Bindings:', where_row

        except apsw.Error as error:
            print 'APSW Error: ', error

        except Exception as error:
            print 'Error reading from database:', error

        # note: deliberately no cursor close here - closing the
        # shared cursor would break any subsequent calls

    def insert(self, table, values):
        '''Executes an INSERT statement against table.

        Arguments:
        table           -- name of the table to be written to
        values          -- list of rows (dicts) to be inserted

        Returns: rows changed, as reported by conn.changes()
        '''
        try:
            # build list of column names
            cols = values[0].keys()

            # generate insert statement
            stmt = 'INSERT INTO ' + table + ' ('
            stmt += ', '.join(cols)
            stmt += ') VALUES ('
            stmt += ', '.join([":%s" % col for col in cols])
            stmt += ')'

            # submit

            self.cur.execute('BEGIN IMMEDIATE')
            self.cur.executemany(stmt, values)
            self.cur.execute('COMMIT')

            return self.conn.changes()

        except apsw.SQLError as sql:
            print 'Error in SQL submitted:', sql
            print 'SQL:', stmt
            self.cur.execute('ROLLBACK')

        except apsw.Error as error:
            print 'APSW Error: ', error
            self.cur.execute('ROLLBACK')

        except Exception as error:
            print 'Error submitting insert:', error
            self.cur.execute('ROLLBACK')

        # as in read, the shared cursor is left open for reuse
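
To give a feel for the calling side, here’s a hypothetical usage sketch – the ‘expenses.db’ file and ‘category’ table are made up for illustration:

# hypothetical database file and table, for illustration only
da = DataAccessor('expenses.db')

da.insert('category', [{'cat_code': 'GRO', 'cat_name': 'groceries'},
                       {'cat_code': 'UTL', 'cat_name': 'utilities'}])

rows = da.read('category',
               columns=['cat_code', 'cat_name'],
               where_row={'cat_code': 'GRO'},
               sort_cols=[('cat_name', 'ASC')])
print rows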

I’ve used the apsw SQLite wrapper here, but if at some point I decided to switch to pysqlite, or to use a MySQL database, I could do so without greatly affecting the calling code. The potential drawback is that the simplicity you gain is offset by a loss of control. It may well be that all your application needs to do is read from or write to a single table at a time, but what if it needs to execute a join? Drop or create a table or view? What if you need to specify a more complex where condition than “column=value”?

It may well be possible to rewrite this class to expose these operations, but I could see that turning into a lot of work. In the next part I’ll look at some ORM (Object-Relational Mapping) solutions which get around this nicely by mapping an OO structure to a relational model.

Project Euler #18 and #67

I’m hoping to get back to work on the regular expressions filter soon – should have some time over the Christmas/New Year break – but in the meantime, I took a look at the Euler problems related to number triangles. Both problems ask for the maximal path through adjacent points in the triangle – from problem 18:

By starting at the top of the triangle below and moving to adjacent numbers on the row below, the maximum total from top to bottom is 23.

3
7 4
2 4 6
8 5 9 3

That is, 3 + 7 + 4 + 9 = 23.

Find the maximum total from top to bottom of the triangle below:

75
95 64
17 47 82
18 35 87 10
20 04 82 47 65
19 01 23 75 03 34
88 02 77 73 07 63 67
99 65 04 28 06 16 70 92
41 41 26 56 83 40 80 70 33
41 48 72 33 47 32 37 16 94 29
53 71 44 65 25 43 91 52 97 51 14
70 11 33 28 77 73 17 78 39 68 17 57
91 71 52 38 17 14 91 43 58 50 27 29 48
63 66 04 68 89 53 67 30 73 16 69 87 40 31
04 62 98 27 23 09 70 98 73 93 38 53 60 04 23

It is of course possible to work out the sum of every single path, by starting at the top and trying every combination of left and right choices to get to the bottom, but for a triangle with n rows this works out at 2^(n-1) sums to calculate. This makes the brute-force approach impractical for the larger triangle in problem 67 (100 rows); it also involves a whole lot of unnecessary calculation.
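
To see why, here’s what the naive recursion might look like – a sketch only, taking the triangle as a list of rows with the apex first (note that this is the opposite order to the reversed list get_triangle builds below):

def max_path_brute(tri, row=0, col=0):
    # try every left/right choice below this point -
    # 2^(n-1) complete paths, with nothing cached
    if row == len(tri) - 1:
        return tri[row][col]
    return tri[row][col] + max(max_path_brute(tri, row + 1, col),
                               max_path_brute(tri, row + 1, col + 1))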

The key to seeing this is understanding that (1) every point in the triangle above the bottom row is itself the apex of a sub-triangle, that has its own maximal path, and (2) where any sub-triangles intersect, this intersection contains another sub-triangle – so working out all the paths through the two intersecting triangles involves working out every path through the intersection twice. For example, in the triangle

3
7 4
2 4 6
8 5 9 3

the two sub-triangles with apexes at 7 and 4 both contain the sub-triangle

4
5 9

with two possible paths. So calculating every possible path through the larger triangle involves traversing both these paths twice, once for each sub-triangle containing them. It’s easy to see that this gets worse quickly as the triangle gets larger.

A smarter approach is what might be termed a ‘bottom-up’ method. Working from the base upwards, calculate the maximal path for every point in each row. Once you have the maxima for two adjacent points in a row, the maximum for the point above both these points is simply the sum of this point and the maximum of these two values. The advantage of this method is that the maximum for any point in the triangle is calculated once and only once. Here’s how this looks in Python.

First, a helper function get_triangle to read the string in and get a list of rows of integers making up the triangle:

def get_triangle(triangle_string):
    string_list = triangle_string.strip().split('\n')
    # reverse to work from the base up
    string_list.reverse()
    return [[int(x) for x in row.split()]
            for row in string_list]

Then, the main routine max_path works through each row, progressively replacing the list of maxima at each point in the previous row with the same for the current row:

def max_path(triangle_string):
    triangle = get_triangle(triangle_string)
    # every point in the base is its own maximum
    max_list = triangle[0]
    for row in range(1, len(triangle)):
        # add the greater of the two adjacent maxima to
        # each point in the row
        max_list = [triangle[row][x] + max(max_list[x], max_list[x+1])
                    for x in range(len(triangle[row]))]
    return max_list[0]
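
Running this on the small example triangle from the problem statement confirms the expected answer:

>>> max_path('''
... 3
... 7 4
... 2 4 6
... 8 5 9 3
... ''')
23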

That’s all for now – time permitting I will be posting the next part of the regular expression filter sometime in the next couple of days. Until then, hoping you all had a restful and happy Christmas and are taking some time off to be with family. Take care…

SQLite in Ubuntu 9.04

There are a lot of good choices for a free/open-source DBMS in Ubuntu (or anywhere for that matter): MySQL and PostgreSQL would be the two that stand out most prominently. But when you’re developing a small-scale application which requires a small-scale database (whether or not you intend to scale up later), a client-server DBMS like that is the proverbial sledgehammer, with your data structure as the walnut. SQLite offers a file-based transactional SQL database engine, without the installation/configuration/administration time associated with a server-based DB. Think MS Access, cross-platform, minus all the forms.

I’m currently building a proof-of-concept application (for migrating a spreadmart system), using the Django framework to put a web form presentation layer across a db backend. SQLite is ideal for this sort of task, in that the development time is shortened – I can pretty much get down to writing code as soon as the design is complete. An interface to SQLite (sqlite3) has been built into Python since version 2.5, so you can create and work with SQLite databases from Python without installing anything extra. For command-line access, though, you’ll need to install the sqlite3 package:

sudo apt-get install sqlite3

or just open Synaptic and search for sqlite3.

So, for instance, to create a database file ‘expenses.db’, create a category table and insert one row into it, I can do the following from the command line:

$ sqlite3 expenses.db
SQLite version 3.6.10
Enter ".help" for instructions
Enter SQL statements terminated with a ";"
sqlite> create table category(
   ...> id integer primary key,
   ...> cat_code varchar(3),
   ...> cat_name varchar(20)
   ...> );
sqlite> insert into category values(
   ...> null, 'GRO', 'groceries'
   ...> );
sqlite> select * from category;
1|GRO|groceries

To do the same thing in Python:

>>> import sqlite3
>>> conn = sqlite3.connect('expenses.db')
>>> conn.execute('''create table category(
... id integer primary key,
... cat_code varchar(3),
... cat_name varchar(20)
... )''')
<sqlite3.Cursor object at 0xb77ce2c0>
>>> conn.commit()
>>> conn.execute('''insert into category values(
... null, 'GRO', 'groceries')''')
<sqlite3.Cursor object at 0xb77ce290>
>>> conn.commit()
>>> for row in conn.execute('select * from category'):
...   print row
... 
(1, u'GRO', u'groceries')
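
As an aside, rather than embedding values in the SQL string you’d normally let sqlite3 bind them for you – for example, to add a second (made-up) category:

conn.execute('insert into category values (null, ?, ?)',
             ('FUE', 'fuel'))
conn.commit()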

If you prefer to have a GUI, you can install the SQLite Database Browser:

sudo apt-get install sqlitebrowser

This is a very handy (albeit rather basic) visual tool which gives you access to pretty much all the same functionality as you have from the command line:
[Screenshot: SQLite Database Browser with expenses.db open]

So that’s SQLite – simple, easy to use and very handy. Incidentally there’s a great blog post here extolling the virtues of SQLite, both in itself and in combination with Excel. It’s intriguing reading and makes a great case for the benefits of using a “fractional horsepower” database. I’ll post more shortly on what this approach has allowed me to do with both Django and Excel.

My new favourite editor

The innocent-seeming question of “Which text editor do you use?” is the source of considerable conflict in the Linux world – mostly of a humorous nature. I’ve seen a number of polls across various forums, and while there’s no clear winner in most cases, the two front runners would probably be vi and emacs. I’ve tried them both at various times but I have to say I don’t particularly care for either – for me there’s just too much work involved in learning to use them, and they lack a lot of the features found in most modern IDEs. For that reason I’ve mostly stuck with gedit – plain jane it may be, but it’s easy to use and doesn’t require a tutorial to learn.

But I still miss all the good stuff like tab autocompletion, calltips, code folding, etc. I know that a lot of those features can be enabled in gedit by installing plugins, but it’s just a lot easier when they’re built-in, so the other day I decided to give Komodo Edit a try. It’s not available in the Ubuntu repositories, but it can be downloaded direct (and for free) from ActiveState. Installation is easy and is well documented – you need to add the path to the komodo executable to your PATH variable after running the install script, but once that’s done you’re away:

[Screenshot: Komodo Edit 5.1 start page]

Overall, I like the look and feel of it, and all of those ‘nice’ features I expect in a modern IDE are there – with the exception of a debugging tool, which would require Komodo IDE (the paid version). It’s also very easy to enable language-specific support, for a very large range of programming languages. I’ve recently been playing around with Django (more about that later), and enabling syntax support for it was as easy as adding a few extra directories to the Python import directories in the ‘Preferences’ screen:

[Screenshot: Komodo Edit preferences]

I’ll still be using gedit for quick and dirty stuff, but for anything serious, this is my editor of choice.

Conversations with Excel, part 3

In the first and second posts in this series I looked at some ways to read from and write to Excel files using SAS. The motivation for doing this is replacing code which uses the DDE protocol to communicate with the Excel application – that works fine using PC SAS on the desktop, but when SAS is on the server, there’s no application to talk to. Instead, SAS uses the SAS/ACCESS interface to ‘speak’ directly with Excel files.

A similar facility is available for working with Excel files in Python, by making use of the packages available from http://www.python-excel.org/. The conversation is enabled through three packages:

xlrd allows you to read data and formatting information from Excel files.

xlwt allows you to create and write to Excel files.

xlutils contains a bunch of utilities for using both xlrd and xlwt, and is required to copy or modify existing Excel files.

There’s comprehensive documentation for all three packages at the site, and a tutorial with some examples of how they can be used. You can also visit the active Google group for support.

This means you can work with Excel files using Python on a platform where Excel is not available. Usage is fairly straightforward – for instance, to read the data contained on the sheet ‘dataSht’ in the workbook ‘Template_test.xls’:

>>> from xlrd import open_workbook
>>> wb = open_workbook('Template_test.xls')
>>> sht = wb.sheet_by_name('dataSht')
>>> for row in range(sht.nrows):
...   values = [sht.cell(row, col).value for col in range(sht.ncols)]
...   print values
... 
[u'Template_test.xls', '', '', '', '', '', '', '']
['', '', '', '', '', '', '', 96.0]
[u'Date', u'Data_1', u'Data_2', u'Data_3', u'Data_4', u'Data_5', '', '']
[40043.0, 0.71426979592069983, 0.29721317486837506, 0.14380949875339866, 0.70981460809707642, 0.19360692566260695, '', '']
[40044.0, 0.30376328527927399, 0.75381017150357366, 0.26589830825105309, 0.30413518520072103, 0.41826989687979221, '', '']
[40045.0, 0.99682421330362558, 0.0027025360614061356, 0.64853132842108607, 0.27574777463451028, 0.99392103916034102, '', '']
[40046.0, 0.14693491021171212, 0.93810823513194919, 0.32732625165954232, 0.77697453368455172, 0.35358203155919909, '', '']
[40047.0, 0.43824125546962023, 0.20211741980165243, 0.6220957413315773, 0.28986502904444933, 0.85634097876027226, '', '']
[40048.0, 0.3646774091757834, 0.33247592020779848, 0.84804946463555098, 0.36496656434610486, 0.0059830849058926105, '', '']
[40049.0, 0.6037151631899178, 0.079236360732465982, 0.30319626023992896, 0.74752466194331646, 0.7890509688295424, '', '']
[40050.0, 0.49680318590253592, 0.051287947688251734, 0.54286114033311605, 0.76270149415358901, 0.35542313288897276, '', '']
[40051.0, 0.96113103721290827, 0.75952570792287588, 0.35812566895037889, 0.60966236609965563, 0.03527348255738616, '', '']
[40052.0, 0.35204670811071992, 0.75659727631136775, 0.97338171768933535, 0.67937295977026224, 0.53357180999591947, '', '']
[40053.0, 0.32696374924853444, 0.11761421523988247, 0.73568923026323318, 0.94905949058011174, 0.4074792442843318, '', '']
[40054.0, 0.59203020902350545, 0.31373690022155643, 0.73995516449213028, 0.44007967365905643, 0.67870346456766129, '', '']
[40055.0, 0.74593824986368418, 0.043794836848974228, 0.75793982530012727, 0.049134510103613138, 0.79131949925795197, '', '']
[40056.0, 0.54699079459533095, 0.54593769600614905, 0.84260744694620371, 0.089851934928447008, 0.30863919015973806, '', '']
[40057.0, 0.19803057983517647, 0.050982972607016563, 0.068164898082613945, 0.55615624878555536, 0.66064533870667219, '', '']
[40058.0, 0.1034383806400001, 0.90820295689627528, 0.41724261501803994, 0.076820098329335451, 0.58757591666653752, '', '']
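
Writing is just as straightforward with xlwt – here’s a minimal sketch (the output filename and the values written are made up for illustration):

import xlwt

wb_out = xlwt.Workbook()
sht_out = wb_out.add_sheet('dataSht')

# a couple of header cells, then one row of data
sht_out.write(0, 0, 'Date')
sht_out.write(0, 1, 'Data_1')
sht_out.write(1, 0, 40043.0)
sht_out.write(1, 1, 0.714269796)

wb_out.save('Template_out.xls')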

There’s also support for reading named ranges in much the same way as with the SAS Excel libname engine, although it also has the same problem of being unable to evaluate dynamic ranges. It is possible to extract the formula from the ‘RefersTo’ string, so I think it may be possible to parse it to get an address to read from – I’ll follow up on this later.
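
As a sketch of how that looks (assuming the workbook contains a defined name ‘data_range’ – the name here is hypothetical), xlrd exposes defined names through the workbook’s name_map, and for a simple static range area2d() gives back the sheet and bounds:

# 'data_range' is a hypothetical defined name
name = wb.name_map['data_range'][0]
sht, rlo, rhi, clo, chi = name.area2d()
data = [[sht.cell(r, c).value for c in range(clo, chi)]
        for r in range(rlo, rhi)]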

Enjoy!

Project Euler #26

Problem 26 at Project Euler asks

A unit fraction contains 1 in the numerator. The decimal representation of the unit fractions with denominators 2 to 10 are given:

1/2 = 0.5
1/3 = 0.(3)
1/4 = 0.25
1/5 = 0.2
1/6 = 0.1(6)
1/7 = 0.(142857)
1/8 = 0.125
1/9 = 0.(1)
1/10 = 0.1

Where 0.1(6) means 0.166666…, and has a 1-digit recurring cycle. It can be seen that 1/7 has a 6-digit recurring cycle.

Find the value of d < 1000 for which 1/d contains the longest recurring cycle in its decimal fraction part.

This is a problem about repeated division by the same number – how many divisions can you do before you start repeating yourself? The division algorithm can be expressed as follows:

def div_by(rem, div):
    # one step of long division: scale the remainder up by the
    # smallest power of 10 that takes it past the divisor,
    # then return the new remainder
    y = min([x for x in range(5) if rem*(10**x) > div])
    return rem*(10**y) % div
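
For example, dividing 1 by 7 produces the chain of remainders 1 → 3 → 2 → 6 → 4 → 5, after which it returns to 1:

>>> div_by(1, 7)
3
>>> div_by(3, 7)
2
>>> div_by(2, 7)
6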

So then I need to use this to repeatedly divide by the test value d – len_pattern calls div_by until it detects a repeated remainder:

def len_pattern(init, rems, d):
    rem=div_by(init, d)
    if rem in rems:
        return len(rems) - rems.index(rem)
    else:
        rems.append(rem)
        return len_pattern(rem, rems, d)
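
Checking this against the problem statement, 1/7 should have a 6-digit recurring cycle:

>>> len_pattern(1, [], 7)
6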

I then need to find which d < 1000 gives the longest cycle. I don’t need to call this on every number under 1000, though – the only ones worth testing are the primes. The cycle length for 1/n is at most n-1 (for example, the cycle generated by 7 has length 6), and only primes can achieve that upper bound, so the longest cycle will come from a prime near the top of the range. So I take the list generated by the primes_under() function, call len_pattern on each of these, and return the one with the maximum cycle length.
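
primes_under() comes from my EulerFunc module – any simple sieve does the job, something along these lines:

def primes_under(n):
    # basic sieve of Eratosthenes - plenty fast for n = 1000
    flags = [True] * n
    flags[0:2] = [False, False]
    for i in range(2, int(n ** 0.5) + 1):
        if flags[i]:
            for j in range(i * i, n, i):
                flags[j] = False
    return [i for i in range(n) if flags[i]]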

def get_longest():
    from EulerFunc import primes_under
    # skip 2, 3 and 5: 1/2 and 1/5 terminate, and 1/3 has
    # only a 1-digit cycle
    divs = dict([(div, len_pattern(1.0, [], div))
                 for div in primes_under(1000) if div > 5])
    longest = max(divs, key=lambda x: divs[x])
    print "Longest cycle is for %d with length %d" % \
        (longest, divs[longest])

if __name__ == '__main__':
    from timeit import Timer
    t = Timer("get_longest()",  "from __main__ import get_longest")
    print t.timeit(number=100)

Longest cycle is for 983 with length 884
Longest cycle is for 983 with length 884
Longest cycle is for 983 with length 884
.
.
.
Longest cycle is for 983 with length 884
151.166881084

So it takes on average around 1.5 seconds per run (151 seconds for 100 runs). Not bad, though I’ve seen people post Python code which ran in under a second, so I’m sure this could be optimised…