To recap where we left off in the first part, we’d managed to import xml data from the rss feed for NZD rates at themoneyconverter.com into a table (which I’ve named Rates_Table). This gave us a list of rates which we can use – however the format is not quite right. In the “description” member of “item”, the rate is expressed in a structured text format, like so: “1 New Zealand Dollar = 2.96886 Argentine Peso”. Now of course that would be useful as it is, if I were prepared to manually type out the 2.96886 part myself to convert from NZ dollars to pesos. Which of course I’m not, lazy data analyst that I am.
What I’m going to do instead is to extract the two relevant parts of that text, the two bits that I need to know: the rate (the numeric bit, in this case 2.96886) and the foreign currency description (the character string that comes after the numeric bit, in this case “Argentine Peso”). This is something that can be done with text functions like MID and FIND in Excel, of course, and this is a perfectly legitimate way to go about it. However, unless you can find a really clever way to do it, you’ll end up with a really long, convoluted and awkward formula. Which is a great reason to use regular expression functions instead. RXGET for instance, is designed to do exactly this sort of extraction. Here’s how I went about it:
First, we need to determine the regular expression pattern that fits the text. The pattern I chose is
or to break it down a little,
- an equals sign followed by a space: =\s;
- followed by a group, which in turn contains zero or more digits, followed by a decimal point, followed by one or more digits: (\d*\.\d+) This will be used to capture the rate part of the expression. Note that the decimal point needs to be escaped here, as it signifies an arbitrary character in a regular expression;
- followed by another space: \s;
- followed by a second group, which contains one or more arbitrary characters: (.+) This will capture everything after the space, which in this case will be the currency description. Here the decimal point should not be escaped as we need it to match any character.
Once I know that this pattern works, I add a new sheet titled ‘Reference’ and enter the pattern into a cell that I name Rate_Pattern. Now I can use it to extract the rate and the currency using RXGET. I add a column with the header rate at the right-hand end of the Rates_Table and enter in it the RXGET formula, with the following arguments:
This corresponds to the formula:
Note that the Submatch argument has the value 1, as we’re extracting the 1st matched group from the expression. To get the currency description, in another new column with the header currency I enter the same formula with the Submatch argument changed to 2 to extract the second group. So here’s how the table looks right now (click on the image to take a closer look):
In the final part of this series I’ll complete the converter by adding in a selector to pick out a rate from the table by specifying the currency.