Omegahat Statistical Computing

Ideas for statistical computing

Twitter API with OAuth2 using R

Posted by omegahat on October 13, 2014

I just put together some code to collect tweets from Twitter’s search API for some students at Davis.  A brief document describing the approach and the code itself is available at https://github.com/duncantl/TwitterOAuth2.git. It is not completely robust, but it does illustrate how to

  • use OAuth2 for the application-only authentication,
  • deal with rate-limiting, and
  • cursor through the result set of a single query.

The OAuth2 approach gives us a higher rate limit.  There is also code to use the OAuth 1.0a mechanism by signing the request directly with the ROAuth package. This is quite simple with the ROAuth:::signRequest() function.

I am not the first to do this, and other people have posted aspects of it in various places.  This code tries to show all the pieces together.
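To make the rate-limiting and cursoring steps concrete, here is a minimal sketch in R. It is not the code from the repository. It assumes the x-rate-limit-remaining and x-rate-limit-reset response headers and the search_metadata$next_results field that, as far as I recall, Twitter's v1.1 search API provides. fetchPage is a hypothetical function you would supply yourself (say, a wrapper around RCurl that adds the bearer token), and secondsToWait and collectTweets are names made up for this illustration.

```r
# Seconds to sleep before the next request, given the response headers.
# 'now' is an argument so the function can be checked without a clock.
secondsToWait =
function(headers, now = as.numeric(Sys.time()))
{
  remaining = as.integer(headers[["x-rate-limit-remaining"]])
  reset = as.numeric(headers[["x-rate-limit-reset"]])   # epoch seconds
  if(remaining > 0)
     0
  else
     max(0, reset - now) + 1
}

# Follow search_metadata$next_results until there are no more pages.
# fetchPage is a hypothetical function(query) that returns
#   list(headers = <named list>, body = <parsed JSON as an R list>)
collectTweets =
function(query, fetchPage, maxPages = 10, sleep = Sys.sleep)
{
  ans = list()
  for(i in seq_len(maxPages)) {
     page = fetchPage(query)
     ans = c(ans, page$body$statuses)
       # the query string for the next page, or NULL when we are done
     query = page$body$search_metadata$next_results
     if(is.null(query))
        break
     sleep(secondsToWait(page$headers))
  }
  ans
}
```

Passing fetchPage and sleep in as arguments keeps the looping logic separate from the HTTP details, so it can be tried out with canned responses before going anywhere near the real API.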


Posted in Uncategorized

Data Science & Data Engineering

Posted by omegahat on April 15, 2014

I was very happy to participate in an NRC workshop on “TRAINING STUDENTS TO EXTRACT VALUE FROM BIG DATA”.  The discussion was very interesting and there was a terrific mix of very impressive participants.  Near the end, there was some mild contention. One dimension of it points to a distinction that may be useful when we talk about Data Science generally.  Many of us talk about databases in Data Science. However, some people think of them in very ambitious, technically advanced and interesting ways.  Practitioners, and people thinking from the data analysis perspective, typically think of using databases in a very simple-minded manner, and perhaps as something dictated by the provider of the data.  As a result, many topics that are important in database design and implementation are not as relevant to consumers of databases.  So perhaps it is important to think of two types of data scientist: the consumers of data, who focus on data analysis, and a different group, the “data engineers”, who design data products and implementations and who can architect important frameworks for the analysts to use.

So data science may be better described as having sub-categories of data analysts and data engineers.

Posted in Uncategorized

Rllvm

Posted by omegahat on September 1, 2010

Over the past 10 years, I have been torn between building a new statistical computing environment
and trying to overhaul R. There are many issues on both sides. But the key thing is to
enable doing new and better things in statistical computing rather than just making the existing things
easier and more user-friendly.

If we are to continue with R for the next few years, it is essential that it get faster.
There are many aspects to this. One is compiling interpreted R code into something faster.
LLVM is a toolkit for building compilers that generate machine code. So in the past few days
I have looked into this and developed an R package that provides R bindings to some of
the LLVM functionality.

The package is available from http://www.omegahat.org/Rllvm, as are several examples
of its use.
I used the package to implement a compiled version of one of Luke Tierney’s compilation examples,
which uses a loop in R to add 1 to each element of a vector. The compiled version is about
100 times faster than the interpreted R code. It is still slower than x + 1
in R, which is implemented in C and does more. The compiled version is also faster than the byte-code interpreter approaches, so this is a promising start.
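For readers who want to see the baseline being compared against, here is a rough sketch of the kind of loop in question. It is my guess at the shape of the example, not Luke Tierney's code, and it does not use Rllvm at all. It only contrasts the explicit loop, a byte-compiled version made with cmpfun() from the compiler package, and the vectorized x + 1. The names addOne and addOneCompiled are made up for the illustration.

```r
library(compiler)

# The explicit loop: add 1 to each element of x, one element at a time.
addOne =
function(x)
{
  for(i in seq_along(x))
     x[i] = x[i] + 1
  x
}

# The same function run through the byte-code compiler.
addOneCompiled = cmpfun(addOne)

x = as.numeric(1:1e6)

# All three approaches must agree before the timings mean anything.
stopifnot(identical(addOne(x), x + 1),
          identical(addOneCompiled(x), x + 1))

system.time(addOne(x))
system.time(addOneCompiled(x))
system.time(x + 1)
```

The timings depend on the machine and the version of R. Recent versions of R byte-compile functions automatically, so there the first two timings may be very close.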

Of course, it would be nicer to leverage an existing compiler! (Think SBCL and building on top of LISP).

Posted in Language, R, Uncategorized

Rffi

Posted by omegahat on September 1, 2010

A few weeks ago, I posted the Rffi package on the Omegahat repository.
It is an interface to libffi, which is a portable mechanism for invoking native routines
without having to write and compile any wrapper routines in the native language.
In other words, we can use this in R to call C routines using only R code.
This enables us to call arbitrary routines and get back arbitrary values, including structures,
arrays, unions, etc.

One could use the RGCCTranslationUnit package to obtain descriptions of routines and data
structures and then generate the interfaces to those routines via functions in Rffi.

Writing or generating C/C++ code for wrappers (see RGCCTranslationUnit) is still the way to
go in many ways, but Rffi is very convenient for dynamic invocations without any write-and-compile
setup costs.

As usual, you can install this from source from the Omegahat repository with

install.packages("Rffi", repos = "http://www.omegahat.org/R", type = "source")

but you will need to have installed libffi first.

Posted in Language, R, Uncategorized

RXQuery

Posted by omegahat on March 24, 2010

I have posted a new version of the RXQuery package, which interfaces to the Zorba XQuery engine. This version makes the package compatible with the 1.0.0 release of Zorba for external functions.

The package allows one to use XQuery from within R and to use R functions within XQuery scripts.

Posted in R, Uncategorized, XML

Package Releases

Posted by omegahat on March 20, 2010

I just put a new version of the XML package on the Omegahat repository.

There is a new version of the RKML package which handles large datasets much more rapidly.

Also, I posted a new package named RJSCanvasDevice which implements an R graphics device that creates JavaScript code that can subsequently be displayed on a JavaScript canvas in an HTML document.

Posted in R

Posted by omegahat on March 17, 2010

Hin-Tak Leung mailed me about a problem with certain malformed XML documents from FlowJo. There are namespace prefixes (prefix:nodeName) with no corresponding namespace declarations (xmlns:prefix="uri"). How do we fix these? The XML parser can read such a document, but it raises errors. We can catch these errors, work out which prefixes are missing, add the corresponding namespace declarations to the document, and then re-parse the result. Here is the code. It will make it into the XML package.

library(XML)

fixXMLNamespaces =
  #
  #  call as
  #    dd = fixXMLNamespaces("~/v75_step6.wsp", .namespaces = MissingNS)
  #  or
  #    dd = fixXMLNamespaces("~/v75_step6.wsp", gating = "http://www.crap.org", 'data-type' = "http://www.morecrap.org")
  #
function(doc = "~/v75_step6.wsp", ..., .namespaces = list(...))
{
    # collect the error messages rather than printing them as they occur
  e = xmlErrorCumulator(, FALSE)
  doc = xmlParse(doc, error = e)

    # e is a function, so look at the messages stored in its environment
  msgs = unique(environment(e)$messages)
  if(length(msgs) == 0)
     return(doc)

     # find the ones that refer to prefixes that are not defined
  ns = grep("^Namespace prefix .* not defined", msgs, value = TRUE)
  ns = unique(gsub("Namespace prefix ([^ ]+) .*", "\\1", ns))

    # convert the namespaces to a named character vector of URIs
  if(is(.namespaces, "list"))
    .namespaces = structure(as.character(unlist(.namespaces)), names = names(.namespaces))

    # keep only the missing prefixes for which the caller supplied a URI
  uris = .namespaces[intersect(ns, names(.namespaces))]
  if(length(uris)) {
       # now set those namespaces on the root of the document
     mapply(function(id, uri)
              newXMLNamespace(xmlRoot(doc), uri, id),
            names(uris), uris)
       # serialize and re-parse so the nodes pick up the new declarations
     xmlParse(saveXML(doc), asText = TRUE)
  } else
     doc
}

(I’ve made some minor changes thanks to Hin-Tak’s suggestions, but haven’t tested them.)
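The step that pulls the undefined prefixes out of the error messages can be tried on its own, without the XML package or a FlowJo file. In this small sketch the messages are made up for illustration; they only imitate the form the regular expressions in the function expect, and the real parser's messages may differ in detail.

```r
# Made-up messages in the form the function's regular expressions expect.
msgs = c("Namespace prefix gating on Node is not defined",
         "Namespace prefix data-type on Parameter is not defined",
         "Namespace prefix gating on Node is not defined",
         "Opening and ending tag mismatch: a line 3 and b")

# Keep only the messages about undefined prefixes ...
ns = grep("^Namespace prefix .* not defined", unique(msgs), value = TRUE)
# ... and extract the prefix, i.e. the word after "Namespace prefix".
prefixes = unique(gsub("Namespace prefix ([^ ]+) .*", "\\1", ns))

# The caller supplies the URI for each prefix (same dummy URIs as above).
.namespaces = c(gating = "http://www.crap.org",
                "data-type" = "http://www.morecrap.org")
uris = .namespaces[prefixes]
uris
```

Each element of uris is then a prefix and URI pair that is declared on the root node with newXMLNamespace().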

Posted in R, Uncategorized, XML

Posting blog entries directly from R.

Posted by omegahat on March 14, 2010

While looking more at how others were preparing blog content about R, I saw that at least one person was uploading content via a Python script. I like programmatic solutions, and since I am writing a book on XML and Web Technologies, including Web services, I looked into this. The mechanism used is XML-RPC. I have an XMLRPC package for R, so we can quickly use it to provide functionality in R that allows each of us to

  • query information from our blog
  • post blog items, append to a post, create new pages, add categories, etc.

So the RWordpress package is the result.

Correction: WordPress is taking the URL and capitalizing the p in RWordpress. This seems to happen only for words containing “wordpress”. So I have renamed the R package to RWordPress. The link on this page (even while being lower-case p in the HTML) now corresponds to the new package name. Thanks Tal.

Posted in Uncategorized

Blogging directly from R – example

Posted by omegahat on March 14, 2010

This post is submitted directly from R using the RWordpress package.

newPost(list(description = 'This post is submitted directly from R using the RWordpress package.', title = 'Blogging directly from R - example'))

Posted in Uncategorized

Debugging the cron mechanism

Posted by omegahat on March 13, 2010

Recently, a student of mine asked me about automating the collection of data from the HTML form http://www.wrh.noaa.gov/forecast/xml/xml.php. The intent is to collect the forecasts twice a day and process the XML into a data frame. Getting the content of the URL, parsing it and extracting the data is a fairly straightforward application of the RCurl and XML packages along with XPath and getNodeSet().

Automating the collection involves the cron facility on a Linux machine. And there she ran into some troubles which are somewhat interesting to note as a learning experience. Firstly, why use a Linux machine? Because we want a machine that is on the network all the time rather than using a laptop or home machine. Also, it is nice if that machine is backed up and reliable.

One can “google” cron and find crontab. So one needs to learn the syntax for specifying a cron job. Each job is one line whose first five fields specify the minute, hour, day of month, month and day of week.
We can have ranges, e.g. 2-4. An asterisk * in a field means all possible values, i.e. first-last.
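To make the field syntax concrete, here is a small sketch in R that expands one time field into the values it stands for. It handles only the asterisk, single numbers, ranges such as 2-4 and comma-separated lists such as 6,18 (which cron also accepts); step values such as */5 are left out. The function name expandCronField and the script path in the example entry are made up for illustration.

```r
# Expand one cron time field into the integer values it matches.
# first and last give the legal range for that field, e.g. 0 and 23 for hours.
expandCronField =
function(field, first, last)
{
  if(field == "*")
     return(first:last)

    # a field can be a comma-separated list of numbers and ranges
  parts = strsplit(field, ",", fixed = TRUE)[[1]]
  vals = lapply(parts, function(p) {
                   r = as.integer(strsplit(p, "-", fixed = TRUE)[[1]])
                   if(length(r) == 2) r[1]:r[2] else r
                })
  sort(unique(unlist(vals)))
}

# A hypothetical entry that runs a job twice a day, at 06:00 and 18:00.
entry = "0 6,18 * * * /path/to/collectForecasts.sh"
fields = strsplit(entry, " +")[[1]]

expandCronField(fields[1], 0, 59)   # minute
expandCronField(fields[2], 0, 23)   # hour
expandCronField(fields[3], 1, 31)   # day of month
```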

The last field is a shell command. The first question is: what shell? The second question is: what are the settings? In other words, is our .profile read? Is the .login read? The answers to these questions are available from the documentation. But it is interesting to consider how we can derive the answers ourselves with a series of tests.

The problem the student had was that she could run R fine in an interactive shell, but in the cron job, she was getting errors. She worked through a sequence of problems and ended up with an issue when loading the RCurl package, with an error message (in the output from cron) that indicated that it couldn’t load libcurl.so.

So let’s get to work.

Probably the simplest thing to do is set a cron job that prints the environment variables. Put the following into a file, say, myCronJobs:

* * * * * env > /tmp/myEnv

and then run the shell command

crontab myCronJobs

It is always a good idea to see if the job(s) were set with

crontab -l

Now, wait a minute. Literally, that is: the job runs at the start of the next minute. Alternatively, quickly create the file /tmp/myEnv and run

tail -f /tmp/myEnv

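and watch the output to see the lines appear. BTW, you can’t run tail -f on a file that isn’t there, which is why we create it first. (With GNU tail, the -F option will wait for the file to appear.)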

This shows you what environment variables are set and you can compare this to your regular shell by running the env command in it. Quite a difference!
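Since we are R users, we can also make this comparison in R rather than by eye. The following is a minimal sketch: parseEnv and compareEnv are names made up for this illustration, and it assumes each line written by env has the form NAME=value, so a value that itself contains a newline would confuse it.

```r
# Turn the lines written by env (NAME=value) into a named character vector.
parseEnv =
function(lines)
{
  lines = grep("=", lines, fixed = TRUE, value = TRUE)
  structure(sub("^[^=]*=", "", lines), names = sub("=.*", "", lines))
}

# Compare the cron job's environment with another one, by default that of
# the current R session (which inherits the interactive shell's settings).
compareEnv =
function(cronEnv, shellEnv = Sys.getenv())
{
    # drop the class that Sys.getenv() puts on its result
  shellEnv = structure(as.character(shellEnv), names = names(shellEnv))
  common = intersect(names(cronEnv), names(shellEnv))
  list(missingInCron = setdiff(names(shellEnv), names(cronEnv)),
       differ = common[cronEnv[common] != shellEnv[common]])
}
```

Once the cron job has written its file, run compareEnv(parseEnv(readLines("/tmp/myEnv"))) from R started in your regular shell. The missingInCron element lists the variables the cron job does not have at all, and differ lists those that are set in both but to different values.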

Note the value of the LD_LIBRARY_PATH environment variable when run in the regular shell and in the cron job. What directories are missing from the cron job’s version? Now let’s try to find libcurl.so. We might use the locate command. And indeed, on that machine it is in a directory that is not included in the LD_LIBRARY_PATH environment variable for the cron job. So we have to explicitly set that in our cron script with a call to

export LD_LIBRARY_PATH=/usr/local/lib:/usr/lib:/usr/local/lib64

or whatever the relevant settings are (using : to separate the directory names).
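If locate is not available, the same check can be made from within R, which has the advantage that it can be run inside the cron job itself. This is a small sketch; findLibrary is a name made up for the illustration, and by default it looks only in the directories listed in LD_LIBRARY_PATH, not in the system's default library directories.

```r
# Which of the given directories contain a file whose name starts with lib?
# By default, search the directories in the current LD_LIBRARY_PATH.
findLibrary =
function(lib = "libcurl.so",
         dirs = strsplit(Sys.getenv("LD_LIBRARY_PATH"), ":", fixed = TRUE)[[1]])
{
    # glob2rx() turns "libcurl.so*" into a regular expression, so this
    # also matches versioned names such as libcurl.so.4
  list.files(dirs, pattern = glob2rx(paste0(lib, "*")), full.names = TRUE)
}

Sys.getenv("LD_LIBRARY_PATH")
findLibrary()
```

An empty result from the cron job and a non-empty one from the interactive shell points to the same missing directory as above.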

Another approach would be to update ld.so.conf so that the system’s dynamic loader knows to look in that directory. But one needs administrator privileges to do that.

Note that the way to get the output is via shell redirection. We can and should use redirection to send standard output and standard error to appropriate files, or combine them into the same file. Note also that cron runs the command with the regular shell, sh, which is probably a variant of bash. So you need to use that shell’s syntax for setting the variable and redirecting the output:

  export LD_LIBRARY_PATH=/usr/local/lib ;  /usr/local/bin/Rscript -e 'library(RCurl); print(getURLContent)' > /tmp/Routput 2>&1  

Posted in General Computing