Recently, a student of mine asked me about automating the collection of data from the HTML form at http://www.wrh.noaa.gov/forecast/xml/xml.php. The intent is to collect the forecasts twice a day and process the XML into a data frame. Getting the content of the URL, parsing it, and extracting the data is quite a straightforward application of the RCurl and XML packages along with XPath and getNodeSet().
Automating the collection involves the cron facility on a Linux machine, and there she ran into some troubles which are interesting to note as a learning experience. Firstly, why use a Linux machine? Because we want a machine that is on the network all the time, rather than a laptop or home machine. It is also nice if that machine is backed up and reliable.
One can “google” cron and find crontab. So one needs to learn the syntax for specifying a cron job. We have one entry per line that specifies the minute, hour, day of month, month and day of week.
We can have ranges, e.g. 2-4. An asterisk * in a field means all possible values, i.e. first-last.
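As a concrete illustration, here is a hypothetical entry that would run a collection script twice a day, at 06:30 and 18:30 (the script path is made up); the first five whitespace-separated fields are the schedule:

```shell
# Fields: minute  hour  day-of-month  month  day-of-week  command
entry='30 6,18 * * * /home/me/collectForecasts.sh'
# Pull out just the five schedule fields:
echo "$entry" | awk '{print $1, $2, $3, $4, $5}'
# → 30 6,18 * * *
```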
The last field is a shell command. The first question is: what shell? The second question is: what are the settings? In other words, is our .profile read? Is the .login read? The answers to these questions are available from the documentation, but it is interesting to consider how we can derive the answers ourselves with a series of tests.
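One such test: cron hands the command field to sh -c (unless a SHELL variable is set in the crontab), and we can ask that shell to identify itself. A sketch of the idea, run directly here; in a real test you would put the command in a cron job and redirect the output to a file:

```shell
# $0 inside `sh -c` is the name the shell was invoked as:
sh -c 'echo "running under: $0"'
# → running under: sh
```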
The problem the student had was that she could run R fine in an interactive shell, but in the cron job she was getting errors. She worked through a sequence of problems and ended up with an issue when loading the RCurl package: the error message (in the output from cron) indicated that it couldn’t load libcurl.so.
So let’s get to work.
Probably the simplest thing to do is to set up a cron job that prints the environment variables. Put the following into a file, say, myCronJobs:
* * * * * env > /tmp/myEnv
and then run the shell command crontab myCronJobs to install it.
It is always a good idea to see if the job(s) were set with crontab -l
Now, wait a minute. Literally, that is, until the job runs (the entry above runs every minute). Alternatively, quickly create the file /tmp/myEnv and run
tail -f /tmp/myEnv
and watch the output to see the lines appear. BTW, with GNU tail you can even start watching a file that isn’t there yet: plain tail -f complains if the file doesn’t exist, but tail -F retries until it appears.
This shows you what environment variables are set and you can compare this to your regular shell by running the env command in it. Quite a difference!
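The comparison can be mechanized. Here is a sketch using two tiny, made-up environment dumps standing in for the real inputs (which would be `env | sort` from the interactive shell and `sort /tmp/myEnv` from the cron job); comm shows the lines present in one file but not the other:

```shell
# Toy environment dumps, already sorted as comm requires:
printf 'HOME=/home/me\nLD_LIBRARY_PATH=/usr/local/lib\nPATH=/usr/bin\n' > /tmp/shellEnv
printf 'HOME=/home/me\nPATH=/usr/bin\n' > /tmp/cronEnv
# Variables set in the interactive shell but missing from the cron job:
comm -23 /tmp/shellEnv /tmp/cronEnv
# → LD_LIBRARY_PATH=/usr/local/lib
```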
Note the value of the LD_LIBRARY_PATH environment variable in the regular shell and in the cron job. What directories are missing from the cron job’s version? Now let’s try to find libcurl.so. We might use the locate command. And indeed, on that machine it is in a directory that is not included in the LD_LIBRARY_PATH environment variable for the cron job. So we have to set that explicitly in our cron script with a call to

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib

or whatever the relevant settings are (using : to separate the directory names).
Another approach would be to update ld.so.conf so that the system’s dynamic loader knows to look in that directory. But one needs administrator privileges to do that.
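For the record, on a typical glibc-based Linux system the administrator-level fix would look something like the following (the directory is an example, and the commands must be run as root):

```shell
# Register the directory with the dynamic loader once and for all:
#   echo /usr/local/lib > /etc/ld.so.conf.d/local.conf
#   ldconfig
```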
Note that the way to get the output is via shell redirection. And we can, and should, use redirection to send standard output and standard error to appropriate files, or to combine them into the same output. Note that cron runs the command with the regular shell, sh, which is probably a variant of bash. So you need to use
export LD_LIBRARY_PATH=/usr/local/lib ; /usr/local/bin/Rscript -e 'library(RCurl); print(getURLContent)' > /tmp/Routput 2>&1
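The redirections there follow sh semantics: > file sends standard output to the file, and the subsequent 2>&1 points standard error at the same place; the order matters, since 2>&1 means "wherever stdout goes right now". A small demonstration:

```shell
# '>' redirects stdout; '2>&1' then makes stderr follow it:
{ echo "to stdout"; echo "to stderr" 1>&2; } > /tmp/Rdemo 2>&1
cat /tmp/Rdemo
# → to stdout
# → to stderr
```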