This example uses RCurl to download an HTML
document and then follow each relative link within that document.
The goal is to be able to start the download and pass this to the
xmlTreeParsexmlEventParse
functions for processing.
As that parser needs more input, it calls fetches data
from the HTTP response stream.
To do this, we need to use the
<b>multi</b> interface for libcurl.
The idea is quite simple. We initiate the download and associate a
"writer" to slurp up the body of the HTTP response. This is registered
with libcurl and is invoked whenever libcurl is in control and is
processing the HTTP response. If there is information to be read on
the HTTP stream from the server, this function reads it and appends it
to a variable
pending.
The second part of this is that we need a function that is called by
xmlEventParse which can provide input the XML parser.
Of course, it will use the content coming from the HTTP server that is
collected in the function getHTTPResponse. So we create a sibling
function that shares the state of the getHTTPResponse function and so
can see the contents of the variable
pending. When the
XML parser demands some input, our function
supplyXMLContent checks to see if pending has
non-trivial content (i.e. is not the empty string). If it has some
content, it returns that. Otherwise, it tells libcurl to read some
more from the HTTP stream. When it hands control to libcurl in this
way, libcurl will invoke our
getHTTPResponse
function, populating the contents of
pending. So when
libcurl yields control, we will now have content to pass to the XML
parser.
The only additional issue that we have to deal with in this setup is
that the XML event parser asks for input up to a certain size. We
cannot necessarily give it all of the content of
pending. If
pending has more characters
than the XML parser wants, we must give it the first
maxLen characters and then leave the remainder in
pending for the next request from the XML parser.
The following generator function defines the two
functions that do the pulling of the text from libcurl
and the pushing to the XML parser.
HTTPReaderXMLParser =
function(curl, verbose = FALSE, save = FALSE)
{
pending = ""
text = character()
getHTTPResponse =
function(txt) {
pending <<- paste(pending, txt, sep = "")
if(save)
text <<- c(text, txt)
if(verbose) {
cat("Getting more information from HTTP response\n")
print(pending)
}
"" # Give back something real.
}
supplyXMLContent =
function(maxLen) {
if(verbose)
cat("Getting data for XML parser\n")
if(pending == "") {
if(verbose)
cat("Need to fetch more data for XML parser from HTTP response\n")
while(pending == "") {
status = curlMultiPerform(curl, multiple = TRUE)
if(status[2] == 0)
break
}
}
if(pending == "") {
# There is no more input available from this request.
return(character())
}
# Now, we have the text, and we return at most maxLen - 1
# characters
if(nchar(pending) >= maxLen) {
ans = substring(pending, 1, maxLen-1)
pending <<- substring(pending, maxLen)
} else {
ans = pending
pending <<- ""
}
if(verbose)
cat("Sending '", ans, "' to XML\n", sep = "")
ans
}
list(getHTTPResponse = getHTTPResponse,
supplyXMLContent = supplyXMLContent,
pending = function() pending,
text = function() paste(text, collapse = "")
)
}
|
---|
The remaining part is how we combine these pieces with
RCurl and the XML packages to do the parsing in this
asynchronous, interleaved manner.
The code below performs the basic steps
uri = "http://www.omegahat.org/RCurl/philosophy.xml"
uri = "http://www.omegahat.org/RDoc/overview.xml"
handle = getCurlMultiHandle()
streams = HTTPReaderXMLParser(handle)
handle = getURLAsynchronous(uri,
write = streams$getHTTPResponse,
multiHandle = handle, perform = FALSE)
links = getDocbookLinks()
xmlEventParse(streams$supplyXMLContent, handlers = links, saxVersion = 2)
links$links()
|
---|
We create a 'multi handle'. This gives us the asynchronous behavior
that returns control back to us from libcurl rather than sending the
request and slurping back all the data in one single action. Next, we
create our functions to do the pulling and pushing of text from HTTP
to the XML parser. This is done with the call to
getURLAsynchronous. Note that we tell it not to
actually perform the request. We are just setting it up to be done
when the XML parser requests input. This is important as this call
must return so that we can call
xmlEventParse.
[1] The next step is to establish the
XML event parser. We provide a collection of handlers that process
the XML content in the way that we want (see below). And now we are
off, and the XML parser will request input and the functions will read
from the HTTP stream.
To process the links within the Docbook document, we are looking for
each ulink element and fetching its url attribute. So we can provide
a collection of handlers that consist of a function only for ulink.
And it need only look at the attributes it is given and determine if
there is a url entry.
If there is, it appends the value to its internal collection of links.
When we are finished the parsing, we can ask for this collection
of links using the additional function links.
getDocbookLinks =
function()
{
links = character()
ulink = function(name, attrs, ns, namespaces) {
if("url" %in% names(attrs))
links[length(links) + 1 ] <<- attrs["url"]
}
list(ulink = ulink,
links = function() links)
}
|
---|
To run this code, we need to
load both the
RCurl
and
XML
packages.
library(RCurl)
library(XML)
|
---|