Conversations with Excel, part 2

In the first post of this series I talked about accessing data from Excel workbooks using the SAS Excel libname engine, as a replacement for data access through DDE. However, it’s also possible to use SQL pass-through to perform the same task – again, this makes the code relatively simple and familiar in comparison with the machinery required for DDE. For example, the read_file macro can be rewritten like so:

%macro read_file(fname);
*** Libnames ;
* sas dataset directory;
libname sasdata "&path.\&adhocref.\sasdata" ;

* strip off the file extension;
%let namelen=%eval(%length(&fname)-4);
%let fshortname=%substr(&fname, 1, &namelen);

proc sql;
  connect to excel(path="&inputdir.\&fname");
  create table sasdata.&fshortname as
  select * from connection to excel
  (select * from 'rngData');
  disconnect from excel;
quit;

%mend read_file;

So why prefer this to the libname? Well I think it’s probably more a question of style than anything – I haven’t compared performance but really with the amount of data in the spreadsheets I’m accessing it’s quick either way. I’ve gotten a lot more used to using sql pass-through over the last year or so, and now I tend to find it a more natural way to think about a data access problem than using the data step.

For an example, given a spreadsheet with a few thousand rows of employee hours worked in the range EmployeeTimeData, and a list of employees with their department and email address in the range emp_list, the following pass-through query allows me to calculate totals for each employee and return them to a sas dataset:

*** Subtotal hours worked for each employee;
proc sql;
  connect to excel(path="&inputdir.\&fname");
  create table sasdata.emp_hrs_email as
  select * from connection to excel
   (select
           a.Name_of_Employee as emp_name,
           b.DEPARTMENT as emp_department,
           b.EMAIL_ADDRESS as emp_email,
           sum(a.Time_in_Hours) as total_time
    from
      'EmployeeTimeData' a inner join
      'emp_list' b on
           a.Name_of_Employee = b.NAME_FOR_DOCUMENTATION
    group by
            a.Name_of_Employee,
            b.DEPARTMENT,
            b.EMAIL_ADDRESS
    );
quit;

It’s also possible to make the conversation go the other way, passing non-query statements with the execute statement:

*** Drop and re-create table emp_list;
proc sql;
  connect to excel as xldb(path="&inputdir.\&fname");
  execute(drop table 'emp_list') by xldb;
  execute(create table 'emp_list'
               (NAME_FOR_DOCUMENTATION VARCHAR(20) not null,
                DEPARTMENT VARCHAR(10),
                EMAIL_ADDRESS VARCHAR(20))
             ) by xldb;
quit;

Documentation for using the pass-through facility to access Excel/Access files in SAS 9.1.3 is here.

Next time I’ll take a look at some other ways to start conversations with Excel, this time using Python.

Advertisements

SQL passthrough in SAS

Our main data source at work is an Oracle data warehouse, which we (mainly) query from SAS. There are two main ways we can do this – one is to use a libname statement with the OLEDB engine. You can then extract and process data directly from the Oracle tables exactly as you would from a SAS dataset, in a data step, like so:

libname oralib oledb
    Provider=MSDAORA
    Properties=("User ID"=&name;
    Password=&password; "Data Source"=oradw);

data newset;
  set oralib.oldset;
run;

This, however, is far less efficient than the alternative, which is to use the Proc SQL pass-through facility to extract the data before processing it:

proc sql;
  connect to oledb
    init_string="Provider=msdaora; User ID=&name;
    Password=&password; Data Source=oradw";
  create table newset as
  select * from connection to oledb
  (select * from oldset);
quit;

Here’s a nice visual explanation of why this is a more efficient data access method:


To paraphrase, with the pass-through any selection or processing of data is done in the database before pulling it into SAS. This makes it significantly faster, and the more tables being referenced the better the gain in speed over the use of the libname engine.

However, one thing you do miss out on with the pass-through is the ability to use results from an earlier query stored in a temporary dataset to restrict data extracted from a later one. For example, the following stores a dataset id_list in the SAS workspace:

proc sql;
  create table id_list as
  select
    a.id
  from
    oralib.table1 a
    inner join oralib.table2 b
        on a.emp_id = b.sp_emp_id
  where
    b.rate >= 10000
  group by a.id
  ;
quit;

id_list can then be used in subsequent queries, like so:

proc sql;
  create table using_ids as
  select *
  from oralib.table3 c
  inner join id_list d
    on c.id = d.id
  ;
quit;

Of course, the same end result can be achieved in a pass-through by using a subquery:

proc sql;
  connect to oledb
    (init_string="Provider=MSDAORA; User ID=&name;
    Password=&password; Data Source=oradw");
  create table using_ids as
  select * from connection to oledb
  (select *
  from table3 c
  where c.id in
    (select
        a.id
     from
        table1 a
        inner join table2 b
            on a.emp_id = b.sp_emp_id
     where
        b.rate >= 10000
     group by a.id
    )
  );
quit;

What I don’t like about this approach is the overhead of evaluating the inner query before going to work on the outer one. With a simple query like the above, the slow-down is not likely to be that noticeable, but as more tables and joined and more complex selection criteria are added, this becomes more of a concern. In terms of time taken, SQL code like this can often be greater than the sum of its parts.
An alternative approach is to create a temporary table on the database, which can then be referenced in the pass-through:

proc sql;
  create table oralib.id_list as
  select
    a.id
  from
    oralib.table1 a
    inner join oralib.table2 b
        on a.emp_id = b.sp_emp_id
  where
    b.rate >= 10000
  group by a.id
  ;
quit;

libname oralib clear; /*the libname can interfere with the pass-through*/

proc sql;
  connect to oledb
    (init_string="Provider=MSDAORA; User ID=&name;
    Password=&password; Data Source=oradw");
  create table using_ids as
  select * from connection to oledb
  (select *
  from table3 c
  inner join id_list d
    on c.id = d.id
  );
quit;

This of course assumes that the user has permissions to create tables on the database.