Asynchronous, concurrent link processing
The strategy in this approach is to start parsing the original document.
We do this in almost exactly the same way that we do in the
xmlParser example.
That is, we create a multi curl handle and a function that will feed
data from the HTTP response to the XML parser when it is required.
We then put the download of the original, top-level document
on the stack for the multi handle.
uri = "http://www.omegahat.org/index.html"
uri = "http://www.omegahat.org/RCurl/philosophy.xml"
multiHandle = getCurlMultiHandle()
streams = HTTPReaderXMLParser(multiHandle, save = TRUE)
curl = getCurlHandle(URL = uri, writefunction = streams$getHTTPResponse)
multiHandle = push(multiHandle, curl)
At this point, the HTTP request has not actually been performed
and therefore there is no data. And this is good: we want to start
the XML parser first. So we establish the handlers that will process
the elements of interest in our document,
e.g. a ulink element for a Docbook document, or an
a element for an HTML document.
The function
downloadLinks is used to create these handlers.
And now we are ready to start the XML parser
via a call to
xmlEventParse.
links = downloadLinks(multiHandle, "http://www.omegahat.org", "ulink", "url", verbose = TRUE)
xmlEventParse(streams$supplyXMLContent, handlers = links, saxVersion = 2)
At this point, the XML parser asks for some input.
It calls supplyXMLContent, and this fetches data from the
HTTP reply. In our case, this causes the HTTP request
to be sent to the server, and we wait until we get the first
part of the document.
The XML parser then takes this chunk and parses it.
When it encounters an element of interest, i.e. a ulink,
it calls the appropriate handler function given in
links.
This handler gets the URI of the link and then arranges to
add to the multi handle an HTTP request to fetch that document.
The next time the multi curl handle is asked to get
input for the XML parser, it will send that new HTTP request
and the response will become available.
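To make the data flow concrete, here is a minimal sketch of what a
reader like HTTPReaderXMLParser might look like. The buffer handling,
the contents accessor, and the assumption that curlMultiPerform
reports the remaining transfers in a numHandlesRemaining element are
ours, based on the description above rather than the actual
definition in the xmlParser example.
HTTPReaderXMLParser =
function(curlm, save = FALSE)
{
   pending = character()  # chunks received from libcurl, not yet parsed
   all = character()      # the full document, kept only when save = TRUE

     # The write callback for libcurl: stash each chunk as it arrives.
   getHTTPResponse = function(txt) {
      pending <<- c(pending, txt)
      if(save)
         all <<- c(all, txt)
      nchar(txt, "bytes")   # report the number of bytes consumed
   }

     # Called by xmlEventParse() whenever it needs more input.
   supplyXMLContent = function(len = 8192) {
        # Drive the multi handle until a chunk arrives or all
        # the transfers have finished.
      while(length(pending) == 0) {
         status = curlMultiPerform(curlm)
         if(status[["numHandlesRemaining"]] == 0 && length(pending) == 0)
            return("")      # an empty string signals the end of the input
      }
      txt = paste(pending, collapse = "")
      pending <<- character()
      txt
   }

   list(getHTTPResponse = getHTTPResponse,
        supplyXMLContent = supplyXMLContent,
        contents = function() paste(all, collapse = ""))
}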
The write handler for the new HTTP request
simply collects all the text for the document
into a single string. We use
basicTextGatherer
for this.
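For reference, in the simple synchronous case basicTextGatherer works
like this: its update function serves as the write callback and
value() returns the accumulated text.
h = basicTextGatherer()
curlPerform(url = "http://www.omegahat.org/RCurl/philosophy.xml",
            writefunction = h$update)
doc = h$value()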
There is one last little detail before we can access the
results. It is possible that the XML event parser will have
digested all its input before the downloads for
the other documents have finished.
There will be nothing causing libcurl to return
to process those HTTP responses.
So they may be stuck in limbo, with input pending
but nobody paying attention.
To ensure that this doesn't happen, we can use the
complete function to complete all the pending transactions
on the multi handle.
complete(multiHandle)
And now that we have guaranteed that all the processing
is done (or an error has occurred), we can access the results.
The result of calling
downloadLinks
gives us a function to access the downloaded documents.
links$contents()
To get the original document as well, we have to look inside
the streams object and ask it for the contents
that it downloaded.
This is why we called
HTTPReaderXMLParser
with
TRUE
for the
save argument.
The definition of the XML event handlers is reasonably straightforward
at this point.
We need a handler function for the link element
that adds an HTTP request for the link document
to the multi curl handle.
And we need a way to get the resulting text back
when the request is completed.
We maintain a list of text gatherer objects
in the variable
docs.
These are indexed by the URIs of the documents being
downloaded.
The function that processes a link element in the XML document
merely determines whether the document is already being
downloaded, to avoid duplicating the work.
If it is not, it pushes a new request for that document
onto the multi curl handle and returns.
This is the function
op.
There are details involved in dealing with relative links.
We have ignored them here and deal only with
links that have an explicit
http: prefix; a sketch of how relative links might be resolved
follows the function definition below.
downloadLinks =
function(curlm, base, elementName = "a", attr = "href", verbose = FALSE)
{
     # text gatherers for the documents being downloaded, indexed by URI.
   docs = list()

     # fetch the text of each downloaded document.
   contents = function() {
      sapply(docs, function(x) x$value())
   }

   ans = list(docs = function() docs,
              contents = contents)

     # handler for the link element: queue an HTTP request for the
     # target document, unless we are already downloading it.
   op = function(name, attrs, ns, namespaces) {
      if(verbose)
         cat("<op>", name, paste(names(attrs), attrs, sep = " = ", collapse = ", "), "\n")
      if(attr %in% names(attrs)) {
         u = attrs[attr]
           # only deal with links that have an explicit http: prefix.
         if(length(grep("^http:", u)) == 0)
            return(FALSE)
         if(!(u %in% names(docs))) {
            if(verbose)
               cat("Adding", u, "to document list\n")
            write = basicTextGatherer()
            curl = getCurlHandle(URL = u, writefunction = write$update)
            curlm <<- push(curlm, curl)
            docs[[u]] <<- write
         }
      }
      TRUE
   }

   ans[[elementName]] = op
   ans
}
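As an illustration of how relative links might be resolved rather than
skipped, the http: check in op could be replaced with something like
the following. getRelativeURL is from the XML package; the helper
itself is hypothetical, not part of the original function.
  # Hypothetical helper: resolve a link against the base URI
  # instead of discarding relative links.
resolveLink = function(u, base) {
   if(length(grep("^[a-z]+:", u)) > 0)
      u                          # already an absolute URI
   else
      getRelativeURL(u, base)    # resolve relative to the base document
}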