This document aims to provide a basic overview of the RCurl package.
It doesn't try to provide all the details. The R function help files
and the libcurl documentation have all the relevant information.
Since the package is an interface to libcurl, it is important to use
the documentation for it regarding features, options, etc. You can
consult the libcurl documentation and the libcurl examples (in C code).
The RCurl package provides three primary high-level entry points.
These allow us to fetch a URL and submit forms. The functions are
getURL, getForm and postForm. The first is relatively
straightforward, given the name; it allows us to fetch the contents of
a URI. The other two functions provide ways to submit a form using
the GET or POST methods. These are quite different internally, but
for users, both require a set of name-value pairs giving the
parameters for the form submission. The difference is in how the form
is submitted and the POST method allows us to submit/upload files,
binary content, etc.
Let us look at the getURL function.
At its simplest, this is much like the facilities in standard R
for reading the contents of a URL.
We can fetch a URI with a command such as
getURL("http://www.omegahat.org/RCurl/index.html")
The idea is that we specify the URI.
There are several other arguments to this function,
but for the most part we don't need them.
We can use HTTPS to fetch URIs securely.
For example,
getURL("https://sourceforge.net")
This is already more than we can do with the
regular connections or the built-in download facilities
in R. (Using an external program allows HTTPS access.)
There are three different sets of arguments for the getURL
function. One is named curl and we will cover this in
the section called “CURL Handles”. This is merely a way to accumulate
requests on a single connection with shared options.
The write argument is again rather specialized.
It allows us to specify an R function that is called each time
libcurl has some text as part of the HTTP response.
It hands this text (as a sequence of bytes) to the function
so that it can process it in whatever way it deems fit.
This corresponds to the writefunction option
for the libcurl operation described next. We have it as an explicit
argument simply because we need to use it to get the return value
in a single action as the default behavior.
The third set of arguments is the most general and is handled
by the ... in the getURL function. With
this, one can specify name-value pairs governing the actual request.
There are numerous possible settings that one can specify. The basic
idea is that one can set options provided by the
curl_easy_setopt routine. These allow us to
set parameters for many different aspects of the request. For
example, we can specify additional headers for the HTTP request, or
include a password for the Web site.
The set of possible options can be determined
via the function getCurlOptionsConstants,
and the set of names for the different options
can be found via the command
names(getCurlOptionsConstants())
This is a collection of names of options that are understood
by many of the functions in the RCurl package.
At present, there are 113 possible options.
sort(names(getCurlOptionsConstants()))
[1] "autoreferer" "buffersize"
[3] "cainfo" "capath"
[5] "closepolicy" "connecttimeout"
[7] "cookie" "cookiefile"
[9] "cookiejar" "cookiesession"
[11] "crlf" "customrequest"
[13] "debugdata" "debugfunction"
[15] "dns.cache.timeout" "dns.use.global.cache"
[17] "egdsocket" "encoding"
[19] "errorbuffer" "failonerror"
[21] "file" "filetime"
[23] "followlocation" "forbid.reuse"
[25] "fresh.connect" "ftp.create.missing.dirs"
[27] "ftp.response.timeout" "ftp.ssl"
[29] "ftp.use.eprt" "ftp.use.epsv"
[31] "ftpappend" "ftplistonly"
[33] "ftpport" "header"
[35] "headerfunction" "http.version"
[37] "http200aliases" "httpauth"
[39] "httpget" "httpheader"
[41] "httppost" "httpproxytunnel"
[43] "infile" "infilesize"
[45] "infilesize.large" "interface"
[47] "ipresolve" "krb4level"
[49] "low.speed.limit" "low.speed.time"
[51] "maxconnects" "maxfilesize"
[53] "maxfilesize.large" "maxredirs"
[55] "netrc" "netrc.file"
[57] "nobody" "noprogress"
[59] "nosignal" "port"
[61] "post" "postfields"
[63] "postfieldsize" "postfieldsize.large"
[65] "postquote" "prequote"
[67] "private" "progressdata"
[69] "progressfunction" "proxy"
[71] "proxyauth" "proxyport"
[73] "proxytype" "proxyuserpwd"
[75] "put" "quote"
[77] "random.file" "range"
[79] "readfunction" "referer"
[81] "resume.from" "resume.from.large"
[83] "share" "ssl.cipher.list"
[85] "ssl.ctx.data" "ssl.ctx.function"
[87] "ssl.verifyhost" "ssl.verifypeer"
[89] "sslcert" "sslcertpasswd"
[91] "sslcerttype" "sslengine"
[93] "sslengine.default" "sslkey"
[95] "sslkeypasswd" "sslkeytype"
[97] "sslversion" "stderr"
[99] "tcp.nodelay" "telnetoptions"
[101] "timecondition" "timeout"
[103] "timevalue" "transfertext"
[105] "unrestricted.auth" "upload"
[107] "url" "useragent"
[109] "userpwd" "verbose"
[111] "writefunction" "writeheader"
[113] "writeinfo"
Each of these and what it controls is described in the libcurl man(ual) page
for curl_easy_setopt, and that is the authoritative documentation.
Anything we provide here is merely repetition or additional
explanation.
The names of the options require a slight explanation. These
correspond to symbolic names in the C code of libcurl. For example,
the option url in R corresponds to CURLOPT_URL in C. Firstly, uppercase
letters are annoying to type and read, so we have mapped them to lower
case letters in R. We have also removed the prefix "CURLOPT_" since
we know the context in which the option names are being used. And
lastly, any option names that have a _ (after we have removed the
CURLOPT_ prefix) are changed to replace the '_' with a '.' so we can
type them in R without having to quote them. For example, combining
these three rules, "CURLOPT_URL" becomes url and
CURLOPT_NETRC_FILE becomes netrc.file.
That is the mapping scheme.
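The three renaming rules can be sketched in a few lines of base R. The
helper name curlopt_to_r is hypothetical and is not part of RCurl; it
only illustrates the mapping described above.

```r
# Hypothetical helper (not an RCurl function): apply the three renaming
# rules to one or more libcurl C-level option names.
curlopt_to_r <- function(cname) {
  name <- sub("^CURLOPT_", "", cname)   # drop the CURLOPT_ prefix
  name <- tolower(name)                 # map to lower case
  gsub("_", ".", name, fixed = TRUE)    # replace each '_' with '.'
}

curlopt_to_r(c("CURLOPT_URL", "CURLOPT_NETRC_FILE"))
```

The call returns the R-level names url and netrc.file, matching the
two examples in the text.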
The code that handles options in RCurl automatically
maps the user's inputs to lower case. This means
that you can use any mixture of upper and lower case
that makes your code more readable to you and others.
For example, we might write
writeFunction = basicTextGatherer()
or
HTTPHeader = c(Accept="text/html")
We specify one or more options by using the names. To make
interactive use easier, we perform partial matching on the names
relative to the set of known names. So, for example, we could specify
getURL("http://www.omegahat.org/RCurl/testPassword",
verbose = TRUE)
or, more succinctly,
getURL("http://www.omegahat.org/RCurl/testPassword",
v = TRUE)
Obviously, the first is more readable and less ambiguous.
Please use the full form when writing
"software". But you might use the abbreviated form when
working interactively.
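To make the trade-off concrete, here is a minimal sketch of
case-insensitive partial matching using base R's pmatch. The helper
resolve_option and the short vector of known names are assumptions made
for illustration; RCurl's own resolution code may differ in detail.

```r
# Hypothetical helper: resolve a possibly abbreviated, mixed-case option
# name against a vector of known (lower-case) option names.
resolve_option <- function(name, known) {
  i <- pmatch(tolower(name), known)   # NA if no match or an ambiguous one
  if (is.na(i))
    stop("unknown or ambiguous option name: ", name)
  known[i]
}

known <- c("url", "useragent", "userpwd", "verbose")
resolve_option("v", known)           # unique prefix of "verbose"
resolve_option("UserAgent", known)   # case is ignored
# resolve_option("u", known) would stop with an error, because three
# known names start with "u".
```

The ambiguous case is the reason to prefer full names in code that
others will read or that must keep working as new options are added.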
Each option expects a certain type of value from R.
For example, the following options
expect a number or logical value.
[1] "autoreferer" "buffersize"
[3] "closepolicy" "connecttimeout"
[5] "cookiesession" "crlf"
[7] "dns.cache.timeout" "dns.use.global.cache"
[9] "failonerror" "followlocation"
[11] "forbid.reuse" "fresh.connect"
[13] "ftp.create.missing.dirs" "ftp.response.timeout"
[15] "ftp.ssl" "ftp.use.eprt"
[17] "ftp.use.epsv" "ftpappend"
[19] "ftplistonly" "header"
[21] "http.version" "httpauth"
[23] "httpget" "httpproxytunnel"
[25] "infilesize" "ipresolve"
[27] "low.speed.limit" "low.speed.time"
[29] "maxconnects" "maxfilesize"
[31] "maxredirs" "netrc"
[33] "nobody" "noprogress"
[35] "nosignal" "port"
[37] "post" "postfieldsize"
[39] "proxyauth" "proxyport"
[41] "proxytype" "put"
[43] "resume.from" "ssl.verifyhost"
[45] "ssl.verifypeer" "sslengine.default"
[47] "sslversion" "tcp.nodelay"
[49] "timecondition" "timeout"
[51] "timevalue" "transfertext"
[53] "unrestricted.auth" "upload"
[55] "verbose"
The connecttimeout option gives the maximum number
of seconds the connection should take before
raising an error, so this is a number.
The header option, on the other hand,
is merely a flag to indicate whether header information
from the response should be included.
So this can be a logical value (or a number that is
0 to say FALSE or non-zero for TRUE).
At present, all numbers passed from R are converted to
long when used in libcurl.
Many options are specified as strings.
For example, we can specify
the user password for a URI as
getURL("http://www.omegahat.org/RCurl/testPassword/index.html", userpwd = "bob:duncantl", verbose = TRUE)
Note that we also turned on the "verbose" option so that we can see what libcurl is doing.
This is extremely convenient when trying to understand why things aren't
working (or are working in a particular way!).
Another example of using strings is to
specify a referer URI and a user-agent.
getURL("http://www.omegahat.org/RCurl/index.html", useragent="RCurl", referer="http://www.omegahat.org")
(Again, you might want to turn on the "verbose" option
to see what libcurl is doing with this information.)
The libcurl facilities allow us not only to set our own values for
fields used in the HTTP request header (such as the referer or
user-agent), but also to set an entire collection of new
fields or replacements for any existing field. We do this in R using
the httpheader option for libcurl and we specify a value which is a
named character vector.
For example, suppose we want to provide a value
for the Accept field and add a new field named,
say, Made-up-field.
We could do this in the request as
getURL("http://www.omegahat.org/RCurl", httpheader = c(Accept="text/html", 'Made-up-field' = "bob"))
If you turn on the verbose option again for this request, you will see these
fields being set.
> getURL("http://www.omegahat.org", httpheader = c(Accept="text/html", 'Made-up-field' = "bob"), verbose = TRUE)
* About to connect() to www.omegahat.org port 80
* Connected to www.omegahat.org (169.237.46.32) port 80
> GET / HTTP/1.1
Host: www.omegahat.org
Pragma: no-cache
Accept: text/html
Made-up-field: bob
(Note that not all servers will tolerate setting header fields arbitrarily
and may return an error.)
The key thing to note is that headers are specified as name-value
pairs in a character vector. R takes these and pastes the name and
value together and passes the resulting character vector to libcurl.
So while it is convenient to express the headers as
c(name = "value", name = "value")
if you already have the data in the form
c("name: value", "name: value")
you can use that directly.
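The pasting step is easy to reproduce in base R, which shows why the
two forms are interchangeable. This is only a sketch of the idea; the
exact code RCurl uses internally is not shown here.

```r
# A named character vector of header fields, as in the example above.
headers <- c(Accept = "text/html", "Made-up-field" = "bob")

# Paste each name and value together into "name: value" lines.
header_lines <- paste(names(headers), headers, sep = ": ")
header_lines
```

The result is the character vector "Accept: text/html",
"Made-up-field: bob", which is the second form described above.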
Some of the libcurl options expect a C routine. For example, when
libcurl is receiving the response from the HTTP server, it will call
the C routine specified via the option CURLOPT_WRITEFUNCTION each time
it has a full buffer of bytes. While it would be possible to let users
specify a C routine from R, we currently don't
support this. Instead, it is more natural to specify an R function
which is to be called when appropriate. And this is indeed how we do
things in RCurl. One can specify a function for the
writefunction, writeheader and debugfunction options. (We can add
support for the others such as readfunction.) To use these is quite
simple. We expect an R function that takes a single argument which is
the character string of bytes to process. The function can do what it
wants with this argument. Typically, it will accumulate it in a
persistent variable (e.g. using closures) or process it on-the-fly,
such as adding to a plot, passing it to an HTML parser, and so on.
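The closure approach can be sketched without any network access. The
function make_gatherer below is hypothetical and only modelled on the
idea described above; returning the byte count from update is a choice
made for this sketch, not a statement about what libcurl requires.

```r
# Hypothetical gatherer: a set of functions sharing one persistent
# variable through a closure.
make_gatherer <- function() {
  chunks <- character()                 # persistent state
  update <- function(txt) {
    chunks <<- c(chunks, txt)           # append the newly received text
    nchar(txt, type = "bytes")          # report how many bytes were taken
  }
  value <- function() paste(chunks, collapse = "")
  reset <- function() chunks <<- character()
  list(update = update, value = value, reset = reset)
}

# Simulate a callback being invoked twice with successive pieces of text.
g <- make_gatherer()
g$update("HTTP/1.1 200 OK\r\n")
g$update("Content-Type: text/html\r\n")
g$value()
```

Because the state lives in the closure, each call to make_gatherer
yields an independent accumulator, so separate requests do not
interfere with one another.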
The function basicTextGatherer is an example
of the idea and this mechanism is used in
getURL. Suppose, for some reason, we wanted
to read the header information that was returned by the HTTP server in the
response to our request. (This has interesting things like cookies,
content type, etc. that libcurl uses internally, but we may also want
to process.) Then we would firstly use the header option to turn on
the libcurl facility to report the response header information.
If we just do this, the header information will be included in
the text that getURL returns.
This is fine, but we will have to separate it out
by finding the first line, etc.
Instead, it is easier to ask libcurl to hand the header
information to us separately from the text/body of the response.
We can do this by creating a callback function via the
basicTextGatherer function.
h = basicTextGatherer()
txt = getURL("http://www.omegahat.org/RCurl", header = TRUE, headerfunction = h$update)
All we have done is create a collection of functions (stored in
h) and passed the update callback to libcurl. Each
time libcurl receives more of the headers, it calls this function with
the header text. It may call this just once or several times. This
depends on how large the header information is, how libcurl buffers
the information, etc.
Having called getURL, we have the text
from the URI. The header information is available from
h, specifically its value function element.
h$value()
The debugGatherer function provides another example of a
callback that can be used with libcurl. If we set the "verbose"
option to TRUE, libcurl will provide a lot of information about
its actions. By default, these will be written on the console
(e.g. stderr). In some cases, we would not want these to be on the
screen but instead, for example, displayed in a GUI or stored in a
variable for closer examination. We can do this by providing a
callback function for the debugging output via the
debugfunction option for libcurl.
The debugGatherer is a simple
one that merely accumulates its inputs in different
categories and makes them available via
the value function.
The setup is easy:
d = debugGatherer()
x = getURL("http://www.omegahat.org/RCurl", debugfunction=d$update, verbose = TRUE)
At the end of the request, again we have the text from the URI in
x, but we also have the debugging information.
libcurl has called our
update function
each time it has some information (either from the
HTTP server or from its own internal dialog).
> names(d$value())
[1] "text" "headerIn" "headerOut" "dataIn" "dataOut"
The headerIn and headerOut fields report the
text of the header for the response from the Web server
and for our request respectively.
Similarly, the dataIn and dataOut fields give
the body of the response and request.
And the text is just messages from libcurl.
We should note that not all options are (currently) meaningful in R.
For example, it is not currently possible to redirect
standard error for libcurl to a different FILE* via
the "stderr" option. (In the future, we may be able to specify an R
function for writing errors from libcurl, but we have not put that in
yet.)
The RCurl package provides many additional
mechanisms for downloading URIs that R does not currently have
built-in. But perhaps the most pressing reason for developing the
RCurl package was the need to submit forms. The [odbAccess] package is
a package that can read an HTML page with one or more forms and create
an S function for each form that allows S users to submit the form
programmatically rather than requiring interactively browsing the
page, saving the result to a file and then loading it into R.
In order for these functions to work, we need to be able to submit the
contents of the form from S as if it came from a regular browser.
We use RCurl to do this.
There are two mechanisms used for submitting HTML forms: GET and POST.
Both take a set of name-value pairs giving the arguments
to parameterize the call.
The difference between the mechanisms is how these
name-value pairs are delivered to the HTTP server.
The GET method puts the name-value pairs of parameters
at the end of the URI name,
e.g.
http://www.omegahat.org/cgi-bin/form.pl?a=1&b=2
The POST method expects the name-value pairs to be
sent as the body of the HTTP request,
each put in its own "paragraph" or stanza.
This is more complicated but supports sending binary data, etc.
Which of the GET and POST mechanisms is appropriate is specified within
the HTML form itself via the method attribute of the FORM element. To
the user, however, the browser
takes care of figuring out the correct way to deliver the name-value
pairs specified by the user when interacting with the components of
the form. In RCurl, we don't have access to the original HTML form so
we cannot tell what mechanism to use. It is up to the caller to
determine whether to use getForm or postForm
depending on the value of the
method attribute in the original HTML
file.
After determining whether to use
POST or GET, the interface to the functions is typically the same to the
user. Essentially, she need only specify the name-value pairs for
each of the form elements. We do this via a named list or named
character vector. (The list simply allows us to have objects of
types other than strings!) We must specify all the fields,
including the hidden fields, if the processor on the HTTP server
is to make sense of it. RCurl doesn't try to interpret the name-value
pairs, but just transports them.
Let's look at an example of sending a query to Google
(via HTTP rather than its API).
getForm("http://www.google.com/search", hl="en", lr="", ie="ISO-8859-1", q="RCurl", btnG="Search")
The result is the HTML you would ordinarily see in your browser.
You might use an HTML parser to process it.
What is important in the example is that we are specifying the required fields
in the query as named arguments to R.
getForm takes care of bringing them together and constructing
the full URI name. Note that libcurl also handles escaping the
special characters, e.g. converting a space to %20.
Note that if you wanted to explicitly do this escaping on a string
rather than having libcurl implicitly do it, you can
use curlEscape.
Similarly, there is a function curlUnescape
to reverse the escaping and make a string "human-readable".
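To show what the constructed URI looks like, the following sketch
builds a GET-style URI from name-value pairs using only base R. The
helper build_get_uri is hypothetical, it uses utils::URLencode rather
than libcurl's escaping, and it escapes only the values, so it
illustrates the idea rather than reproducing what RCurl does
internally.

```r
# Hypothetical helper: append escaped name=value pairs to a URI, as a
# GET form submission would.
build_get_uri <- function(uri, params) {
  values <- vapply(as.character(params), utils::URLencode, character(1),
                   reserved = TRUE, USE.NAMES = FALSE)
  query <- paste(names(params), values, sep = "=", collapse = "&")
  paste0(uri, "?", query)
}

build_get_uri("http://www.omegahat.org/cgi-bin/form.pl", c(a = "1", b = "2"))
build_get_uri("http://www.google.com/search", c(q = "R Curl", hl = "en"))
```

The first call reproduces the form.pl URI shown earlier; in the second,
the space in the query value is escaped as %20.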
postForm is almost identical.
Let's submit a POST form to
http://www.speakeasy.org/~cgires/perl_form.cgi
postForm("http://www.speakeasy.org/~cgires/perl_form.cgi",
"some_text" = "Duncan",
"choice" = "Ho",
"radbut" = "eep",
"box" = "box1, box2"
)
Here, the form elements are named some_text, choice, radbut, box. We
have simply provided values for them. Again, the result is the
regular response from the HTTP server.
Sometimes we already have the arguments in a list. It is slightly
more complex then to pass them to the function via the ...
argument. The two form submission functions in RCurl
(getForm and postForm) also accept the name-value
arguments via the .params parameter. This arises in programmatic
access to the functions rather than interactive use.
Since we use ... for the name-value pairs of the form, we cannot
specify the libcurl options (unambiguously) in this way and we require
that any such options to control the HTTP request at the libcurl level
be passed via the .opts parameter. RCurl and libcurl
construct the HTTP request and after that, the request is just like a
regular URI download. All of the usual techniques for reading the
response, its header, etc. work.
The functions we have presented above are the high-level entry points
that allow R users to make the common-style HTTP requests. The RCurl
package is capable of more however. It provides access to the basic
libcurl primitives which one can use to compose more complicated and
non-standard HTTP requests. For the most part, one merely specifies
libcurl options by name to the different functions and these take
effect for that call. An alternative model (used more in C code) is
that we first create a libcurl object to represent the HTTP request,
then we customize it by setting options and then we invoke the
request. This is far more involved than we need in R. There is a
simplicity about the getURL function that
removes the need to know about the internal C structure representing
the call. However, there are occasions when it is useful to know
about this and exploit it. Specifically, one can create an instance
of this libcurl "handle" and use it in several requests. This has the
advantage that we do not have to set the options in each call, but
rather can do this just once. This saves a marginal amount of time in
R by reducing the computations, but it will be essentially negligible
relative to the network latency involved in the request itself. What
is more important is that if the sequence of requests is to the same
server, the libcurl engine can maintain the connection to the server
and avoid having to reestablish it each time. This handshaking is
quite expensive, so reusing the "handle" in such situations can yield
non-trivial performance gains. It is also even possible to "pipeline"
requests by sending multiple requests before getting the answer back
for the first one. This again can improve performance.
Now that we both know about the internal libcurl structures and know
why we might be interested in reusing them across requests, the
question remains how do we do this. It is quite easy. Each of the
"action" functions in the package (i.e. those that work with libcurl
directly) has a parameter named curl. For each of
these functions, the default value is getCurlHandle()
and what this means is that, if
no value is given for curl, a new handle is created for
the duration of this call.
So it is easy for us to create such a handle before calling
one of these functions and then pass that as the value
for curl.
For example, we can make two requests to the
www.omegahat.org site using the same handle
as follows:
handle = getCurlHandle()
a = getURL("http://www.omegahat.org/RCurl", curl = handle)
b = getURL("http://www.omegahat.org/", curl = handle)
It is important to remember that if we set any options in any of the
calls, these will be set in the libcurl handle and these will persist
across requests unless they are reset. For example, if we had set the
header=TRUE option in the first call
above, it would remain set for the second call. This can sometimes be
inconvenient. In such cases, either use separate libcurl handles, or
reset the options.
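The persistence of options can be illustrated with a toy model that
needs no network. Everything below is hypothetical: an R environment
stands in for a libcurl handle, and toy_request merely records options
the way a real request would leave them set in the handle.

```r
# Toy model only: an environment standing in for a libcurl handle.
new_toy_handle <- function() {
  h <- new.env()
  h$options <- list()
  h
}

# Store the given options in the handle and return the options that a
# request made with this handle would use.
toy_request <- function(handle, ...) {
  handle$options <- utils::modifyList(handle$options, list(...))
  handle$options
}

h <- new_toy_handle()
toy_request(h, header = TRUE)      # first call sets header
toy_request(h, verbose = TRUE)     # header is still set in this call
toy_request(new_toy_handle(), verbose = TRUE)   # fresh handle: no header
```

The second call shows the behaviour to watch for: an option set for
one request stays in effect for later requests on the same handle.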
The function dupCurlHandle allows us to
create a new libcurl handle that is an exact copy of the
existing one. This allows us to quickly reuse
existing settings without having them affect
other requests.
(The data in the option values are not copied.)
See curl_easy_duphandle.
By reusing libcurl handles, we avoid reallocating
a new one and potentially benefit from improved connectivity.
One downside, however, when reusing handles is that the options we set in
R need to be copied as C data since they will persist across
R function calls in the libcurl handle itself.
As a result, there are additional computations needed.
Again, this is negligible in almost all cases
and will be dominated by the network speed.
libcurl doesn't have any explicit function for fetching a
URL. Instead, it uses a powerful but simple interface which involves
merely setting the options in the libcurl handle as desired and then
invoking the request. So one just prepares the request and forces it
to be sent. This is done via the curlPerform
function in R. This is how getURL is actually
implemented.
It is hopefully clear that it is the libcurl options that make this
interface work and allow us to make interesting queries. From
specifying the URI to how to read the text, to providing passwords,
it is the options that are critical. For the most part, these
options are passed by name to functions in RCurl via the ...
mechanism in R and the .opts argument.
These two collections of arguments are merged, with those
in ... overriding corresponding ones in the .opts object.
Why do we have the .opts argument?
The reason is similar to the .params argument
in the form functions: often we have the options
in a list and it is not as convenient to use
the ... approach.
Having both allows the caller/programmer to use
whichever is most convenient.
One case in which the .opts argument is useful is if we
want to prepare a set of options that are to be used in all (or a set
of) calls. We can combine these arguments into a list just once and
then pass them to each HTTP request easily by simply using that
variable. Since we merge the values in ... and
.opts, this works nicely.
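The precedence rule can be sketched in base R. The helper
merge_options is hypothetical and is not how RCurl is implemented; it
only demonstrates that values given through ... override those in
.opts while the remaining shared options are kept.

```r
# Hypothetical helper: merge per-call options (...) over shared ones
# (.opts), with the per-call values taking precedence.
merge_options <- function(..., .opts = list()) {
  utils::modifyList(as.list(.opts), list(...))
}

common <- list(header = TRUE, userpwd = "bob:duncantl")
merge_options(verbose = TRUE, .opts = common)    # adds verbose
merge_options(header = FALSE, .opts = common)    # overrides header
```

In the second call the shared header value is replaced for that call
only; the common list itself is unchanged and can be reused.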
To create such a list of options, we can use the function
curlOptions. This creates an S3-style object
with class CURLOptions. This function never
involves libcurl, but sorts out the names of the options by using
partial matching and returns an R object with the options as
name-value pairs in a list.
The fact that this is a class means that if we access any elements, the
full names are used, even when we set an element. This means that the
names are kept resolved as we use it in R and correspond unambiguously
to real libcurl options.
We can use this function something like the following.
opts = curlOptions(header = TRUE, userpwd = "bob:duncantl", netrc = TRUE)
getURL("http://www.omegahat.org/RCurl/testPassword/index.html", verbose = TRUE, .opts = opts)
Here we create the options ahead of time and use them in a call while
specifying additional options (i.e. "verbose").
Some readers will have noticed that we could achieve the same effect
of having a set of fixed options that are used in a collection of
calls by reusing a libcurl handle. We could create the handle, set
the common options, and then use that handle in the set of calls.
This is indeed a natural and often good way to do things.
The following code does what we want.
h = getCurlHandle(header = TRUE, userpwd = "bob:duncantl", netrc = TRUE)
getURL("http://www.omegahat.org/RCurl/testPassword/index.html", verbose = TRUE, curl = h)
The first line creates a new handle and fills in the three
"persistent" options. These are in the handle itself, not in R at
this stage. Now, when we perform the request via
getURL, we specify this libcurl handle and
provide the "verbose" option.
The function curlSetOpt is used implicitly in
the code above and this actually sets the option-values in a libcurl
handle. It can also be used to simply resolve them.