Reading Data from a Website
The URL access method of the FILENAME statement enables a web page to be used as the input to a DATA step. The statement itself is simple enough, but the resulting input is raw HTML. Extracting useful data from it is rarely straightforward, and is usually best approached using the PRX (Perl regular expression) functions.
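As a minimal sketch of the access method on its own, the step below echoes the first few records of a page to the log, so that the raw HTML can be inspected before any patterns are written. The fileref TIPS and the address are placeholders for illustration, not the real Amadeus page.

```sas
/* Hypothetical fileref and address: substitute the page you want to read */
filename tips url "http://www.example.com/tips.html";

data _null_;
  infile tips lrecl=4096 truncover;
  input;                            /* load one record into the _INFILE_ buffer */
  if _n_ <= 10 then put _infile_;   /* echo the first ten records to the log    */
run;
```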
This example tackles the problem of producing a list of the tips currently available on the Amadeus website. The site displays a number of tips, each in a different category. Examination of the HTML code shows that the string “tipCat” introduces a tip category, with the name of the category the last thing before the next “</a>” tag. Then “Latest Tip” introduces the tip, with the name of the tip the last thing before the next “</a>”. To make matters more interesting, it turns out that the tip category and the tip name may or may not be in the same HTML record, and the same HTML record may contain details of more than one tip. But it’s nothing we can’t handle…
The INFILE statement (1) refers to the FILENAME URL fileref for the Amadeus tips page and specifies a record length of 4096, which appears to be sufficient in this case. A buffer, BUF, is declared, big enough to hold two such records.
Three Perl regular expressions are defined and retained (2). RXTIPFUL (3) gives a match only when a record contains a complete category name and tip name. RXCATNAME is used to extract the category name. Since it is quite a complicated expression, it has been commented in detail using the “?#” syntax.
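The idea of compiling an expression once and retaining its identifier can be sketched as follows. The pattern is an assumption based on the description of the HTML given earlier, not the actual RXTIPFUL, and the test string is made up.

```sas
data _null_;
  length buf $200;
  retain rxtipful;                 /* keep the pattern id across iterations */
  if _n_ = 1 then                  /* compile once only                     */
    rxtipful = prxparse('/tipCat.*?<\/a>.*?Latest Tip.*?<\/a>/');  /* assumed pattern */

  /* Made-up HTML fragment holding one complete category and tip */
  buf = '<a class="tipCat" href="#">Base SAS</a> <b>Latest Tip</b> <a href="#">Using PRX</a>';

  found = prxmatch(rxtipful, buf); /* start position of the match, 0 if none */
  put found=;
run;
```

PRXMATCH returns zero when the buffer does not yet hold both names, which is the condition that calls for another record to be read.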
The initial “s” indicates a substitution expression, and the vertical bar character “|” has been used as the delimiter (4). The “?” in “.*?” (5) indicates “non-greedy matching”, so that this part of the expression uses the smallest number of characters that enables a match to be achieved. This is because, where the buffer contains more than one match, we want to deal with the first of them first. The subexpression at (6) matches the name of the tip category: all the text before “</a>”, as far back as the preceding “>”. The final part of the expression (7) specifies that the output from a PRXCHANGE call that uses this expression is “$1”, that is, the value of the subexpression, which is the name of the tip category. The final “x” specifies that the whole thing is an “extended regular expression”; all this means is that white space may be included in it, enabling it to be laid out comprehensibly.
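A substitution expression of this general shape might look like the sketch below. It is an illustrative stand-in for RXCATNAME, not the expression used in the example: the pattern, the variable RXCAT and the test string are all assumptions. The pieces are concatenated only so that each can sit on its own line with its “?#” comment.

```sas
data _null_;
  length buf $200 category $50;
  retain rxcat;
  if _n_ = 1 then rxcat = prxparse(
      's|'
   || ' .*? tipCat   (?# skip, non-greedily, to the first category marker) '
   || ' .*? >        (?# shortest run up to a tag end that lets the rest match) '
   || ' ( [^<>]* )   (?# subexpression 1: the text of the category name) '
   || ' </a>         (?# the closing tag that ends the name) '
   || ' .*           (?# swallow the rest so the whole value is replaced) '
   || '|$1|x');      /* replace everything matched with subexpression 1 */

  /* Made-up HTML fragment */
  buf = '<a class="tipCat" href="#">Base SAS</a> <b>Latest Tip</b> <a href="#">Using PRX</a>';

  category = prxchange(rxcat, 1, buf);   /* one substitution on the buffer */
  put category=;
run;
```

Because the pattern begins with “.*?” and ends with “.*”, a successful match covers the whole buffer, so the substitution leaves only the captured name. The “x” option applies to the pattern only, so the replacement part is written without extra white space.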
The RXTIPTITLE expression is very similar, but presented without any helpful comments.
The logic at (8) is that, after “tipCat” has been found, we check whether the buffer also contains both the category name and the tip name. If it does not, the next record is read in and appended to the buffer. At (9), having read one category name and tip name from the buffer, we discard that portion of the buffer contents and loop around in case what remains in the buffer contains details of another tip.
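The read, append, extract and discard cycle can be sketched as below. This is a simplified illustration under stated assumptions, not the code of the example: the patterns, the names RXFULL, RXCAT, RXTIP and RXDROP, and the two test records written to a temporary file are all made up. The test data splits the first tip across two records and puts a second tip in the same record.

```sas
filename demo temp;                      /* stand-in for the URL fileref */
data _null_;
  file demo;
  put '<a class="tipCat" href="#">Base SAS</a>';
  put '<b>Latest Tip</b> <a href="#">Using PRX</a> <a class="tipCat" href="#">Macro</a> <b>Latest Tip</b> <a href="#">Quoting</a>';
run;

data tips (keep=category tip);
  infile demo lrecl=4096 truncover end=eof;
  length buf $8192 rec $4096 category tip $200;
  retain rxfull rxcat rxtip rxdrop;
  if _n_ = 1 then do;                    /* assumed patterns, compiled once */
    rxfull = prxparse('/tipCat.*?<\/a>.*?Latest Tip.*?<\/a>/');
    rxcat  = prxparse('s|.*?tipCat.*?>([^<>]*)</a>.*|$1|');
    rxtip  = prxparse('s|.*?Latest Tip.*?>([^<>]*)</a>.*|$1|');
    rxdrop = prxparse('s|.*?tipCat.*?</a>.*?Latest Tip.*?</a>||');
  end;

  input rec $char4096.;
  buf = rec;
  if index(buf, 'tipCat') then do;
    /* Category found but tip incomplete: append the next record */
    if not prxmatch(rxfull, buf) and not eof then do;
      input rec $char4096.;
      buf = catx(' ', buf, rec);
    end;
    /* Extract the first tip, drop it from the buffer, and try again */
    do while (prxmatch(rxfull, buf));
      category = prxchange(rxcat, 1, buf);
      tip      = prxchange(rxtip, 1, buf);
      output;
      buf = prxchange(rxdrop, 1, buf);
    end;
  end;
run;

proc print data=tips;
run;
```

Each pass of the DO WHILE loop removes at least the text that RXFULL matched, so the buffer shrinks on every pass and the loop ends once no complete tip remains.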
Typical results from the final Proc PRINT are: