Omegahat Statistical Computing

Ideas for statistical computing

Posted by omegahat on March 17, 2010

Hin-Tak Leung mailed me about a problem with certain malformed XML documents from FlowJo. They use namespace prefixes (prefix:nodeName) that have no corresponding namespace declarations (xmlns:prefix="uri"). How do we fix these? The XML parser can still read such a document, but it raises an error for each undeclared prefix. We can collect these errors as the document is parsed and post-process them to find the undeclared prefixes. Then we add the missing namespace declarations to the root node of the document and re-parse the result. Here is the code. It will make it into the XML package.

fixXMLNamespaces =
  #
  #  call as
  #    dd = fixXMLNamespaces("~/v75_step6.wsp", .namespaces = MissingNS)
  #  or
  #   dd = fixXMLNamespaces("~/v75_step6.wsp", gating = "http://www.crap.org", 'data-type' = "http://www.morecrap.org")
  #
function(doc = "~/v75_step6.wsp", ..., .namespaces = list(...)) 
{
    # collect the error messages
  e = xmlErrorCumulator(, FALSE)
  doc = xmlParse(doc, error = e)

    # e is a function, so test the messages it collected, not length(e)
  if(length(environment(e)$messages) == 0)
     return(doc)

     # find the ones that refer to prefixes that are not defined
  ns = grep("^Namespace prefix .* not defined", unique(environment(e)$messages), value = TRUE)
  ns = unique(gsub("Namespace prefix ([^ ]+) .*", "\\1", ns))

    # now set those name spaces on the root of the document
  if(is(.namespaces, "list"))
    .namespaces = structure(as.character(unlist(.namespaces)), names = names(.namespaces))

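    # keep only the prefixes the caller gave a URI for; indexing by an unknown name yields NA
  uris = .namespaces[intersect(ns, names(.namespaces))]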
  if(length(uris)) {
     mapply(function(id, uri)
              newXMLNamespace(xmlRoot(doc), uri, id),
            names(uris), uris)
     xmlParse(saveXML(doc), asText = TRUE)
  } else
     doc
}

(I've made some minor changes thanks to Hin-Tak's suggestions, but haven't tested them.)
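
The prefix-extraction and lookup steps in the function above can be tried on their own in base R, without parsing anything. This is only a sketch: the messages below are illustrative strings written in the style of the parser's "Namespace prefix ... not defined" errors, and the element names, the `known` table and its URI are made up for the example.

```r
# Illustrative messages (assumed wording; element names are made up).
msgs <- c("Namespace prefix gating on Polygon is not defined",
          "Namespace prefix data-type for name on Param is not defined",
          "Namespace prefix gating on Polygon is not defined",
          "Opening and ending tag mismatch: a line 3 and b")

# Keep only the undeclared-prefix messages, then pull out the prefix itself.
hits <- grep("^Namespace prefix .* not defined", unique(msgs), value = TRUE)
prefixes <- unique(gsub("Namespace prefix ([^ ]+) .*", "\\1", hits))

# Hypothetical prefix-to-URI table, as a caller might supply via .namespaces.
known <- c(gating = "http://example.org/gating")

# Look up only the prefixes we have a URI for, and note the rest.
uris <- known[intersect(prefixes, names(known))]
unresolved <- setdiff(prefixes, names(known))
```

Here `prefixes` holds "gating" and "data-type", `uris` holds the single entry for "gating", and `unresolved` holds "data-type". Reporting the unresolved prefixes, for example with warning(), would tell the caller which URIs still need to be supplied.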

4 Responses to “”

  1. I came across this blog while googling for something else :-). I see you have avoided reparsing if no error as I suggested. I have also since looked at the earlier version of this piece of code in detail and made some changes myself – I don’t have it here (different computer) but the most important part in my change is possibly that I have changed the line:
    xmlParse(saveXML(doc), asText = TRUE)
    to
    new.doc <- xmlParse(saveXML(doc), asText = TRUE)
    free(doc)
    new.doc
    to free the memory of the old document and avoid a memory leak. I don't know if that makes sense and "adds value" or is just being tedious, but I'd rather do it than be sorry.

  2. omegahat said

    Hi Hin-Tak

    The garbage collection model since XML_2.6.0 (AFAIR) means that the call to free() is not needed, in theory. However, it wasn’t working the entire time because of a trivial bug. So it is useful in some versions, but generally, the model doesn’t need it and the document will be freed when it and no nodes within it are referenced by R.

    • Hi Duncan,

      But there hasn’t been a release *since* 2.6? (Until 2.8 this weekend). What are the pros and cons of doing an explicit free()? It is definitely much more user-friendly if resources are automatically freed the usual R way. There aren’t any obscenely large XML documents in the wild (e.g. hundreds of MB), but it is generally good housekeeping, at least for package writing as opposed to one-off usage code, to deallocate manually at a chosen time.

      External pointers and weak references (and their life-cycle, i.e. creation and destruction) are a topic I confess I don’t completely understand.
