SQLite in Ubuntu 9.04

There are a lot of good choices for a free/open-source DBMS in Ubuntu (or anywhere for that matter): MySQL and PostgreSQL would be the two that stand out most prominently. But when you’re developing a small-scale application which requires a small-scale database (whether or not you intend to scale up later), a client-server DBMS like that is the proverbial sledgehammer, with your data structure as the walnut. SQLite offers a file-based transactional SQL database engine, without the installation/configuration/administration time associated with a server-based DB. Think MS Access, cross-platform, minus all the forms.

I’m currently building a proof-of-concept application (for migrating a spreadmart system), using the Django framework to put a web form presentation layer across a db backend. SQLite is ideal for this sort of task, in that the development time is shortened – I can pretty much get down to writing code as soon as the design is complete. An interface to SQLite (the sqlite3 module) has been built into Python since 2.5, so to create an SQLite database from Python you just write the appropriate code – there’s nothing extra to install. To work with SQLite from the command line instead, you’ll need to install the sqlite3 package:

sudo apt-get install sqlite3

or just open Synaptic and search for sqlite3.

So, for instance, to create a database file ‘expenses.db’, create a category table and populate one row in it, I can do the following from the command line:

$ sqlite3 expenses.db
SQLite version 3.6.10
Enter ".help" for instructions
Enter SQL statements terminated with a ";"
sqlite> create table category(
   ...> id integer primary key,
   ...> cat_code varchar(3),
   ...> cat_name varchar(20)
   ...> );
sqlite> insert into category values(
   ...> null, 'GRO', 'groceries'
   ...> );
sqlite> select * from category;
1|GRO|groceries

To do the same thing in Python:

>>> import sqlite3
>>> conn = sqlite3.connect('expenses.db')
>>> conn.execute('''create table category(
... id integer primary key,
... cat_code varchar(3),
... cat_name varchar(20)
... )''')
<sqlite3.Cursor object at 0xb77ce2c0>
>>> conn.commit()
>>> conn.execute('''insert into category values(
... null, 'GRO', 'groceries')''')
<sqlite3.Cursor object at 0xb77ce290>
>>> conn.commit()
>>> for row in conn.execute('select * from category'):
...   print row
... 
(1, u'GRO', u'groceries')
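As an aside, if the inserted values were to come from elsewhere (say, a web form), it’s safer to let sqlite3 bind them via ? placeholders than to splice them into the SQL string yourself. A minimal sketch, using an in-memory database (the special ':memory:' name) so it’s self-contained:

```python
import sqlite3

# ':memory:' creates a throwaway in-memory database
conn = sqlite3.connect(':memory:')
conn.execute('''create table category(
    id integer primary key,
    cat_code varchar(3),
    cat_name varchar(20)
)''')
# '?' placeholders let the sqlite3 module quote the values safely
conn.execute('insert into category values(?, ?, ?)',
             (None, 'GRO', 'groceries'))
conn.commit()
rows = conn.execute('select * from category').fetchall()
```

Passing None for the integer primary key lets SQLite assign the id, just as null did in the command-line session above.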

If you prefer to have a GUI, you can install the SQLite Database Browser:

sudo apt-get install sqlitebrowser

This is a very handy (albeit rather basic) visual tool which gives you access to pretty much all the same functionality as you have from the command line:
[Screenshot: expenses.db open in SQLite Database Browser]

So that’s SQLite – simple, easy to use and very handy. Incidentally there’s a great blog post here extolling the virtues of SQLite on its own and also in combination with Excel. It’s intriguing reading, and it makes a great case for the benefits of using a “fractional horsepower” database. I’ll post more shortly on what this approach has allowed me to do with both Django and Excel.


Lookup tables in SAS

Recently there was a change to a reference table in a database we use frequently for reporting. For a lot of our regular reports this won’t be troublesome, as the changed values will come through with the extract. However, for one reason or another the new reference data will not come through for some reports, so we need to replace the changed field in some datasets. There are many ways to do this in SAS, so I thought I’d run through a couple of them here. The reference data that’s changing is the mapping from the field court to the field region, which I extracted into the table court_reg below:

proc sql;
connect to oledb(init_string="Provider=msdaora ;
User ID=&name; Password=&password; 
data source=odwp") ;
  create table court_reg as
  select * from connection to oledb
 (select distinct
    crt.short_name as court,
    reg.name as new_region
  from 
    court crt
    inner join region_type reg
      on crt.region_type_code = reg.code
  );
  disconnect from oledb;
quit;

What I need to do is use the court field as a key to pull a new value for region into an existing dataset. The pattern for the problem, then, is to pull values from a small dataset (< 500 rows) into a large dataset (> 100,000 rows). In Excel, this could easily be accomplished with a VLOOKUP into the small table from the large one. Here are some ways I could do much the same thing in SAS:
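Before the SAS versions, it may help to see the pattern stripped to its essentials in Python (the language of the earlier examples): the small table becomes a dictionary, and the “VLOOKUP” is a lookup into it for every row of the large set. The court codes and regions here are invented for illustration:

```python
# Small lookup table: court -> new_region (made-up values)
court_reg = {'AKL': 'Northern', 'WLG': 'Central', 'CHC': 'Southern'}

# Large dataset: each row carries a court code and a stale region value
oldset = [
    {'court': 'AKL', 'region': 'North'},
    {'court': 'CHC', 'region': 'South'},
]

# The "VLOOKUP": replace each row's region via the small table
newset = [dict(row, region=court_reg[row['court']]) for row in oldset]
```

Every SAS technique below – format, merge, array, hash – is really just a different implementation of this one small-into-large lookup.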

Formats

Defining a format from a SAS dataset is relatively simple: rename the variables (‘start’ is the value to be formatted, and ‘label’ is the formatted value we want), assign the format’s name to ‘fmtname’ and its type to ‘type’, then use proc format with the cntlin= option to read the dataset into the format, like so:

data tmp_format;
  retain fmtname 'regtype' type 'C';
  set work.court_reg (rename=(court=start
                              new_region=label));
run;

proc format cntlin=tmp_format fmtlib;
run;

Once that’s done, the new values are just a put away:

data sasdata.newset;
  set sasdata.oldset (drop=region);
  region=put(court, $regtype.);
run;

A format has the advantage of being fast (well, potentially faster than a merge/join, anyway) as it uses a binary tree lookup to assign the value. Balancing that is the memory requirement; the entire format must fit into memory. Whether this is an issue or not depends on the size of the dataset being used to create the format and the type of the values being applied (in this case, region is a 10-byte character value).
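To picture that binary lookup, here’s a rough Python sketch using the bisect module over sorted start values – an illustration of the idea only, not of how SAS implements formats internally, and with made-up codes:

```python
import bisect

# Sorted 'start' values and their corresponding 'label' values (invented)
starts = ['AKL', 'CHC', 'WLG']
labels = ['Northern', 'Southern', 'Central']

def put(value, default=''):
    """Binary-search the start values, roughly as a format lookup does."""
    i = bisect.bisect_left(starts, value)
    if i < len(starts) and starts[i] == value:
        return labels[i]
    return default  # values with no entry fall through unformatted
```

Note that both lists have to sit in memory for the whole run – the same trade-off the format makes.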

Data Step with Merge

Another option is an approach which uses a lot less memory: merge both sets by the key field and replace the old value with the new. I’m actually not going to post the code for this here, as in this case in my opinion it’s not a good idea. While the merge does use less memory than the format (it only loads one observation from each set into the program data vector at a time), and sequential access into the data sets makes the search very fast, the trade-off is the increased overhead of either sorting or indexing both sets. The bigger the large set, the more costly this will become.

Creating a temporary array

Rather than creating a format, another approach is to create a temporary array of the values to be mapped. This has the advantage of being much quicker than the format, as the lookup is indexed to the location of the values in the array. The disadvantage is the requirement for numeric values to be used for the array index. We can get around this by using an informat created from the order of the observations in the small dataset, for instance:

data tmp_format;
  retain fmtname 'crtarray' type 'IN';
  set work.court_reg (keep=court 
                      rename=(court=start));
  label=put(_n_, 8.);
run;

proc format cntlin=tmp_format fmtlib;
run;

/*gets the size of the array required*/
data _null_;
  if 0 then set work.court_reg nobs=nobs;
  call symputx("nobs", nobs);
  stop;
run;

data sasdata.newset (drop=old_region i crtid);
  /*create and populate the array*/
  array regions {&nobs} $ _temporary_;
  if _n_ = 1 then do i=1 to &nobs;
    set work.court_reg;
    crtid=input(court, crtarray.);
    regions{crtid}=new_region;
  end;
  /*lookup into the array*/
  set sasdata.oldset (rename=(region=old_region));
  crtid=input(court, crtarray.);
  region=regions{crtid};
run;

This informat can then be used to assign the values of region to positions in the array. However, every lookup into the array still requires a lookup into the informat first, so the array isn’t going to achieve any better performance in this case. Not only that, but the code is convoluted by the need to determine the size of the array to begin with.
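Stripped down to Python (again with invented values), the two-step shape of the approach is easier to see – and so is why it buys nothing here: every lookup still has to map the key to a position before indexing the array:

```python
# Step 1: the "informat" - map each court to its observation number
court_index = {'AKL': 0, 'CHC': 1, 'WLG': 2}

# Step 2: the "temporary array" - regions stored by that position
regions = ['Northern', 'Southern', 'Central']

# Every lookup pays for both steps: resolve the key to an index,
# then index the array - no cheaper than a single direct lookup
court = 'WLG'
region = regions[court_index[court]]
```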

The Data Step Hash Object

Another option is to use a hash table – an associative array of key-value mappings (think hash as in Perl, or dictionary as in Python). The lookup is faster than the use of a format, as the hash object uses an efficient algorithm to translate the key values into array positions, providing a gain similar to the one you’d get with an array. The hash object is also more flexible than a format, as it allows more than one item to be stored per key, and these can be a mixture of character and numeric variables. The key itself can be either character or numeric, or a combination of variables.
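Since the comparison to a Python dictionary is already on the table, here’s the same find-and-assign pattern as a dict sketch, including the multiple data items per key that a format can’t store (the values are invented):

```python
# Hash: court -> (region, island), i.e. two data items per key (made up)
reg = {'AKL': ('Northern', 'North'), 'CHC': ('Southern', 'South')}

oldset = [{'court': 'AKL'}, {'court': 'XXX'}]
newset = []
for row in oldset:
    found = reg.get(row['court'])  # like reg.find() returning 0 on a hit
    if found is not None:          # only matched rows are output
        region, island = found
        newset.append(dict(row, region=region, island=island))
```

Note the unmatched row is silently dropped, just as the find()=0 test below only outputs observations with a matching key.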

Here’s the hash object in action:

data sasdata.newset (drop=old_region new_region);
  length court $ 20 new_region $ 10;
  /*initialise the hash object*/
  if _n_ = 1 then do;
    declare hash reg(dataset: 'court_reg');
    reg.defineKey('court');
    reg.defineData('new_region');
    reg.defineDone();
    call missing(court, new_region);
  end;
  /*retrieve data*/
  set sasdata.oldset (rename=(region=old_region));
  if reg.find()=0 then do;
    region=new_region;
    output;
  end;
run;

I like the efficiency of this approach, but more than that, I like the code. It’s simple, concise, and the intention is explicit. There are a couple of good references to the use of the hash object here and here, and a detailed explanation of how it’s implemented here.

My new favourite editor

The innocent-seeming question of “Which text editor do you use?” is the source of considerable conflict in the Linux world – mostly of a humorous nature. I’ve seen a number of polls across various forums, and while there’s no clear winner in most cases, the two front runners would probably be vi and emacs. I’ve tried them both at various times but I have to say I don’t particularly care for either – for me there’s just too much work involved in learning to use them, and they lack a lot of the features found in most modern IDEs. For that reason I’ve mostly stuck with gedit – plain jane it may be, but it’s easy to use and doesn’t require a tutorial to learn.

But I still miss all the good stuff like tab autocompletion, calltips, code folding, etc. I know that a lot of those features can be enabled in gedit by installing plugins, but it’s just a lot easier when they’re built-in, so the other day I decided to give Komodo Edit a try. It’s not available in the Ubuntu repositories, but it can be downloaded direct (and for free) from ActiveState. Installation is easy and is well documented – you need to add the path to the komodo executable to your PATH variable after running the install script, but once that’s done you’re away:

[Screenshot: Komodo Edit 5.1 start page]

Overall, I like the look and feel of it – and all of those ‘nice’ features that I expect in a modern IDE are there, with the exception of a debugging tool, which would require Komodo IDE (the paid version). It’s very easy to enable language-specific support, for a very large range of programming languages. I’ve recently been playing around with Django (more about that later), and enabling syntax support for it was as easy as adding a few extra directories to the Python import directories in the ‘Preferences’ screen:

[Screenshot: Komodo Edit Preferences screen]

I’ll still be using gedit for quick and dirty stuff, but for anything serious, this is my editor of choice.