
xml's Introduction

The Packages/ directory contains source tar.gz files for the package.

See index.html for a description of the package and the installation procedures.

In this GitHub repository, the package is not laid out in the standard R package format. It was initially developed in 1999 for use in both S-Plus and R, each of which requires a different structure.

make ADMIN=1 

copies the files to an appropriate structure for R. It currently requires some supporting tools from the Omegahat admin facilities.

xml's People

Contributors

duncantl, lawremi, omegahat, yvkschaefer


xml's Issues

install error

** preparing package for lazy loading
Creating a generic function for ‘source’ from package ‘base’ in package ‘XML’
Error in rematchDefinition(definition, fdef, mnames, fnames, signature) : 
  methods can add arguments to the generic ‘source’ only if '...' is an argument to the generic
Error : unable to load R code in package ‘XML’
ERROR: lazy loading failed for package ‘XML’
* removing ‘/home/cdsw/R/XML’
The downloaded source packages are in
  ‘/tmp/RtmpqPIPM9/downloaded_packages’
Warning message:
In install.packages("XML") :
  installation of package ‘XML’ had non-zero exit status

Repo up to date with package?

I'm trying to install this package but am getting the error:
package ‘XML’ is not available (for R version 3.6.0)

This led me to look at the DESCRIPTION file, but that specifies a dependency on R >= 1.2.0, whereas the CRAN page says R 4.0.

I'm not sure whether I'm confusing two different packages or what is happening.
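A quick way to check what CRAN actually advertises for the running R version (a minimal sketch using base R's available.packages(); the columns shown are standard fields of its result):

# Show the version and R dependency CRAN currently lists for XML
ap <- available.packages()
ap["XML", c("Version", "Depends")]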

Partial argument match

You might want to change seq(along = nodes) to seq(along.with = nodes). If getOption("warnPartialMatchArgs") is TRUE, the partial match throws an unnecessary warning (at least with R 4.1.3 and later).

for(i in seq(along = nodes)) {
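For reference, a minimal sketch of the suggested change (the nodes value below is a stand-in for illustration):

nodes <- list("a", "b")               # stand-in value
for (i in seq(along.with = nodes)) {  # full argument name, no partial-match warning
  print(i)
}
# seq_along(nodes) is an equivalent, slightly more idiomatic spelling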

Unable to get all files using getHTMLExternalFiles

Hi,

I have an SVN repository with multiple subdirectories, and I want the paths of all XML files in those subdirectories. I tried the following, but I get only the top-level links rather than the file paths. Any help would be awesome.

doc <- htmlParse(path)
getHTMLExternalFiles(doc, xpQuery = "//a/@href", recursive = TRUE)

This function gives me the result:

[1] "../"                           "Olympus/"                      "Reflections/"                 
[4] "StopTimePref/"                 "http://subversion.tigris.org/"

Screenshots of the repository listing are attached to the original issue.

I want to get the full paths of the XML files.
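A possible workaround (a sketch, not an existing package feature): walk the SVN directory listing manually with getHTMLLinks(), recursing into subdirectory links and collecting *.xml hrefs. The starting URL below is hypothetical.

library(XML)

listXmlFiles <- function(url) {
  links <- getHTMLLinks(url)               # all href values on the listing page
  links <- setdiff(links, "../")           # skip the parent-directory link
  xmls  <- links[grepl("\\.xml$", links)]  # XML files at this level
  dirs  <- links[grepl("/$", links)]       # subdirectories to recurse into
  found <- paste0(url, xmls)
  for (d in dirs)
    found <- c(found, listXmlFiles(paste0(url, d)))
  found
}

# listXmlFiles("http://svn.example.org/repo/")  # hypothetical repository root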

Add option to load/parse XML selectively [feature request]

Below is an example where I read in and parse an XML file. After storing the data in a data table, I delete many columns.

It would be very helpful to have an option to parse XML files selectively: files that are too big for direct processing (because of memory restrictions) could then be handled by reading in only the data needed for further analyses.

# Load Package 
library(XML)
library(data.table)

# Set up toy example 
xmlText  <- "<posts>
  <row Id='1' PostTypeId='1' AcceptedAnswerId='8' CreationDate='2012-12-11T20:37:08.823' Score='42' ViewCount='5761' Body='&lt;p&gt;Assuming the world in the One Piece universe is round, then there is not really a beginning or and end of the grand line.&lt;/p&gt;&#xA;&#xA;&lt;p&gt;The Straw Hats started out at the first half and now are sailing across the second half.&#xA;Wouldn.t it have been quicker to set sail in the opposite direction where they started?&lt;/p&gt;&#xA;' OwnerUserId='21' LastEditorUserId='88' LastEditDate='2013-06-20T03:28:03.750' LastActivityDate='2013-11-29T11:23:22.793' Title='The treasure in One Piece is at the end of the grand line. But isn.t that the same as the beginning?' Tags='&lt;one-piece&gt;' AnswerCount='4' CommentCount='0' />
  <row Id='2' PostTypeId='1' AcceptedAnswerId='33' CreationDate='2012-12-11T20:39:40.780' Score='10' ViewCount='161' Body='&lt;p&gt;In the middle of &lt;em&gt;The Dark Tournament&lt;/em&gt;, Yusuke Urameshi gets to fully inherit Genkai.s power of the &lt;em&gt;Spirit Wave&lt;/em&gt; by absorbing a ball of energy from her.&lt;/p&gt;&#xA;&#xA;&lt;p&gt;However, this process turns into an excruciating trial for Yusuke, almost killing him, and keeping him doubled over in extreme pain for a long period of time, so much so that his Spirit Animal, Poo, is also in pain and flies to him to try to help.&lt;/p&gt;&#xA;&#xA;&lt;p&gt;My question is, why is it such a painful procedure to learn and absorb this power?&lt;/p&gt;&#xA;' OwnerUserId='26' LastEditorUserId='247' LastEditDate='2013-02-26T17:02:31.570' LastActivityDate='2013-06-20T03:31:39.187' Title='Why does absorbing the Spirit Wave from Genkai involve such a painful process?' Tags='&lt;yu-yu-hakusho&gt;' AnswerCount='1' CommentCount='0' />
  <row Id='3' PostTypeId='1' AcceptedAnswerId='148' CreationDate='2012-12-11T20:42:47.447' Score='6' ViewCount='1468' Body='&lt;p&gt;In Sora no Otoshimono, Ikaros carries around a watermelon like a pet and likes watermelons and pretty much anything else round.  At one point she even has a watermelon garden and attacks all the bugs that get near the melons.&lt;/p&gt;&#xA;&#xA;&lt;p&gt;What.s the significance of the watermelon and why does she carry one around?&lt;/p&gt;&#xA;' OwnerUserId='29' LastActivityDate='2014-01-15T21:01:55.043' Title='What.s the significance of the watermelon in Sora no Otoshimono?' Tags='&lt;sora-no-otoshimono&gt;' AnswerCount='2' CommentCount='1' />
</posts>"

# Parse XML
doc <- xmlParse(xmlText, asText=TRUE)
r <- xmlRoot(doc)

# Convert XML to data table
# See: http://www.carlboettiger.info/2013/07/22/XML-parsing-strategies
d <- as.data.table(XML:::xmlAttrsToDataFrame(getNodeSet(r, path = "row")))

# Delete columns which are not used in further analyses
d[, c("Score", "ViewCount", "Body", "LastEditorUserId", "LastEditDate", 
      "LastActivityDate", "Title", "Tags", "AnswerCount", "CommentCount"):=NULL]

Add XSLT 1.0 transformation method

As a feature request, consider adding XSLT 1.0 support to the omegahat/XML project. As a robust DOM library that already supports XPath 1.0, omegahat/XML would benefit from XSLT 1.0 support, allowing users to transform XML files with external XSL stylesheets to generate new XML, HTML, or even text output.

Currently, in R one must use the xslt library, a sister extension of xml2, which is a different XML package adopted into the tidyverse ecosystem. An integrated XSLT processor in XML would keep to the base R and S flavor and allow users to leverage existing code and libraries to integrate complex, nested, and dense XML files into the R environment.

One recurring question about R on StackOverflow is how to import nested XML into a data frame. For flat (few nests), element-centric (no attributes) documents, xmlToDataFrame is a very convenient function. But for more complex XML files with attribute values, various looped calls to lapply and xpathSApply are required to bind the returned vectors into data frames.

One solution I have advocated in many SO answers is to use XSLT: a W3C standards-compliant, well-known, special-purpose, declarative language used regularly in industry. With its functional, recursive nature, it can transform any XML into any other XML, HTML, or text output for end-use needs such as flattening to a 2D row-by-column structure for R data frames. However, running XSLT in R today requires a mix of tools, as illustrated below.

library(XML)
library(xml2)   # read_xml()
library(xslt)   # xml_xslt()

# LOAD XML AND XSL
input <- read_xml("/path/to/input.xml")
style <- read_xml("/path/to/xslt_script.xsl")

# TRANSFORM INPUT INTO OUTPUT
new_xml <- xml_xslt(input, style)
output <- as.character(new_xml)

# PARSE OUTPUT FROM STRING
doc <- xmlParse(output, asText = TRUE)

# BUILD DATAFRAME (XPath depends on the transformed output)
df <- xmlToDataFrame(doc, nodes = getNodeSet(doc, "//race"))

Alternatively, one can use command-line tools such as xsltproc on Unix-like machines:

library(XML)

# COMMAND LINE CALL TO xsltproc (ALTERNATIVE TO THE xslt PACKAGE)
# usage: xsltproc -o <output> <stylesheet> <input>
system("xsltproc -o /path/to/output.xml /path/to/xslt_script.xsl /path/to/input.xml")

# PARSE OUTPUT FROM FILE
doc <- xmlParse("/path/to/output.xml")

# BUILD DATAFRAME
df <- xmlToDataFrame(doc, nodes = getNodeSet(doc, '//race'))

And on Windows, via a PowerShell script that interfaces with the built-in .NET System.Xml.Xsl class:

PowerShell

param ($xml, $xsl, $output)

if (-not $xml -or -not $xsl -or -not $output) {
    Write-Host "& .\xslt.ps1 [-xml] xml-input [-xsl] xsl-input [-output] transform-output"
    exit;
}

trap [Exception]{
    Write-Host $_.Exception;
}

$xslt = New-Object System.Xml.Xsl.XslCompiledTransform;
$xslt.Load($xsl);
$xslt.Transform($xml, $output);

Write-Host "generated" $output;

R

library(XML)

system(paste0('Powershell.exe -File',
              ' "C:\\Path\\To\\PowerShell\\Script.ps1"',
              ' "C:\\Path\\To\\Input.xml"',
              ' "C:\\Path\\To\\XSLT\\Script.xsl"', 
              ' "C:\\Path\\To\\Output.xml"'))

df <- xmlToDataFrame("C:\\Path\\To\\Output.xml")

Many open-source DOM libraries, including Python's lxml, PHP's xsl class, Perl's XML::LibXSLT, and even R's xslt package, use the libxslt C library, which provides the needed entry points:

doc = xmlParseFile(...);
style = xsltParseStylesheetFile(...);
res = xsltApplyStylesheet(style, doc, params);

As a good project to galvanize activity on this awesome omegahat/XML package, please consider XSLT 1.0 support in the near future.

Inconsistent conversion for xmlToList

The conversion of XML documents to lists using xmlToList is inconsistent. An XML document of the following format:

<h>
  <i>
    <s>1</s>
    <s>1</s>
    <s>1</s>
  </i>
  <j>
    <s>2</s>
    <s>2</s>
  </j>
</h>

is converted in three different ways depending on the exact number of <s> elements under the <i> and <j> elements. A reproducible example, along with an implementation that works as expected, can be found on Rpubs.

This issue also popped up on StackOverflow some time ago.
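For reference, a minimal inline reproduction (a sketch; compare the shapes of the two results):

library(XML)
doc1 <- xmlParse("<h><i><s>1</s><s>1</s><s>1</s></i><j><s>2</s><s>2</s></j></h>")
str(xmlToList(doc1))
# Now with two <s> elements under both <i> and <j>:
doc2 <- xmlParse("<h><i><s>1</s><s>1</s></i><j><s>2</s><s>2</s></j></h>")
str(xmlToList(doc2))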

pull in HTML data with extra JavaScript reveal?

I want to pull in data from 538, but I want the full data, which is revealed by clicking "Show more polls". Is there any way for the function to access the additional rows of the table?

http://projects.fivethirtyeight.com/2016-election-forecast/national-polls/

The code for pulling in the top-level data is:

require(XML)
polls.html <- htmlTreeParse("http://projects.fivethirtyeight.com/2016-election-forecast/national-polls/",
                            useInternalNodes = TRUE)
parsedDoc <- readHTMLTable(polls.html, stringsAsFactors = FALSE)
pollData <- data.frame(parsedDoc[4])
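readHTMLTable() only sees the HTML the server returns; rows added client-side by JavaScript never reach the parser. One way around this is to drive a real browser, click the control, and re-parse the expanded page. A sketch with RSelenium (the CSS selector for the button is a guess, not taken from the actual page):

library(RSelenium)
library(XML)

drv   <- rsDriver(browser = "firefox")
remDr <- drv$client
remDr$navigate("http://projects.fivethirtyeight.com/2016-election-forecast/national-polls/")

# Click "Show more polls" (hypothetical selector)
btn <- remDr$findElement(using = "css selector", ".show-more-polls")
btn$clickElement()

# Re-parse the now-expanded page source
page    <- htmlParse(remDr$getPageSource()[[1]], asText = TRUE)
allData <- readHTMLTable(page, stringsAsFactors = FALSE)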

Still significant memory leak on Windows

Hi Duncan,

it's been a while so I thought I'd check back if you found out anything about the cause of the memory leak when using XML on Windows.

I'm sure you have a thousand more interesting things to do, but I would so much appreciate it if you could fix this bug. It keeps coming back at me and slows down all of my web-scraping efforts. And given that more and more cool packages depend on your package (e.g. RSelenium or rvest), this issue propagates to all of them as well.

Thank you so much,
Janko


Here is a slightly updated version of my investigations:

Preliminaries

require("rvest")
require("XML")

Functions

getTaskMemoryByPid <- function(
  pid = Sys.getpid()
) {
  cmd <- sprintf("tasklist /FI \"pid eq %s\" /FO csv", pid)
  mem <- read.csv(text=shell(cmd, intern = TRUE), stringsAsFactors=FALSE)[,5]
  mem <- as.numeric(gsub("\\.|\\s|K", "", mem))/1000
  mem
}  
getCurrentMemoryStatus <- function() {
  mem_os <- getTaskMemoryByPid()
  mem_r  <- memory.size()
  list(r = mem_r, os = mem_os, ratio = mem_os/mem_r)
}
memoryLeak <- function(
  x = system.file("exampleData", "mtcars.xml", package="XML"),
  n = 10000,
  use_text = FALSE,
  xpath = FALSE,
  free_doc = FALSE,
  clean_up = FALSE,
  detailed = FALSE,
  use_rvest = FALSE,
  user_agent = httr::user_agent("Mozilla/5.0")
) {
  if(use_text) {
    x <- readLines(x)
  }
  ## Before //
  prof_1  <- memory.profile()
  mem_before <- getCurrentMemoryStatus()

  ## Per run //
  mem_perrun <- lapply(1:n, function(ii) {
    doc <- if (!use_rvest) {
      xmlParse(x, asText = use_text)
    } else if (file.exists(x)) {
      ## From disk //
      rvest::html(x)
    } else {
      ## From web //
      rvest::html_session(x, user_agent)
    }
    if (xpath) {
      res <- xpathApply(doc = doc, path = "/blah", fun = xmlValue)
      rm(res)
    }
    if (free_doc) {
      free(doc)
    }
    rm(doc)
    out <- NULL
    if (detailed) {
      out <- list(
        profile = memory.profile(),
        size = memory.size()
      )
    } 
    out
  })
  has_perrun <- any(sapply(mem_perrun, length) > 0)
  if (!has_perrun) {
    mem_perrun <- NULL
  } 

  ## Garbage collect //
  mem_gc <- NULL
  if(clean_up) {
    gc()
    tmp <- gc()
    mem_gc <- list(gc_mb = tmp["Ncells", "(Mb)"])
  }

  ## After //
  prof_2  <- memory.profile()
  mem_after <- getCurrentMemoryStatus()

  ## Return value //
  if (detailed) {
    list(
      before = mem_before, 
      perrun = mem_perrun, 
      gc = mem_gc, 
      after = mem_after, 
      comparison_r = data.frame(
        before = prof_1, 
        after = prof_2, 
        increase = round((prof_2/prof_1)-1, 4)
      ),
      increase_r = (mem_after$r/mem_before$r)-1,
      increase_os = (mem_after$os/mem_before$os)-1
    )
  } else {
    list(
      before_after = data.frame(
        r = c(mem_before$r, mem_after$r),
        os = c(mem_before$os, mem_after$os)
      ),
      increase_r = (mem_after$r/mem_before$r)-1,
      increase_os = (mem_after$os/mem_before$os)-1
    )
  }
}

Memory status before anything has ever been requested

getCurrentMemoryStatus()

Generate additional offline example content

s <- html_session("http://had.co.nz/")
tmp <- capture.output(httr::content(s$response))
write(tmp, file = "hadley.html")
# html("hadley.html")

s <- html_session(
  "http://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=ssd",
  httr::user_agent("Mozilla/5.0"))
tmp <- capture.output(httr::content(s$response))
write(tmp, file = "amazon.html")
# html("amazon.html")

getCurrentMemoryStatus()

Profiling

################
## Mtcars.xml ##
################

res <- memoryLeak(n = 50000, detailed = FALSE)
fpath <- file.path(tempdir(), "memory-profile-1.1.rdata")
save(res, file = fpath)

res <- memoryLeak(n = 50000, clean_up = TRUE, detailed = FALSE)
fpath <- file.path(tempdir(), "memory-profile-1.2.rdata")
save(res, file = fpath)

res <- memoryLeak(n = 50000, clean_up = TRUE, free_doc = TRUE, detailed = FALSE)
fpath <- file.path(tempdir(), "memory-profile-1.3.rdata")
save(res, file = fpath)

###################
## www.had.co.nz ##
###################

## Offline //
res <- memoryLeak(x = "hadley.html", n = 50000, detailed = FALSE, use_rvest = TRUE)
fpath <- file.path(tempdir(), "memory-profile-2.1.rdata")
save(res, file = fpath)

res <- memoryLeak(x = "hadley.html", n = 50000, clean_up = TRUE, 
  detailed = FALSE, use_rvest = TRUE)
fpath <- file.path(tempdir(), "memory-profile-2.2.rdata")
save(res, file = fpath)

res <- memoryLeak(x = "hadley.html", n = 50000, clean_up = TRUE, 
    free_doc = TRUE, detailed = FALSE, use_rvest = TRUE)
fpath <- file.path(tempdir(), "memory-profile-2.3.rdata")
save(res, file = fpath)

## Online (PLEASE USE "POLITE" VALUE FOR `n`!!!) //
.url <- "http://had.co.nz/"
res <- memoryLeak(x = .url, n = 50, detailed = FALSE, use_rvest = TRUE)
fpath <- file.path(tempdir(), "memory-profile-3.1.rdata")
save(res, file = fpath)

res <- memoryLeak(x = .url, n = 50, clean_up = TRUE, detailed = FALSE, use_rvest = TRUE)
fpath <- file.path(tempdir(), "memory-profile-3.2.rdata")
save(res, file = fpath)

res <- memoryLeak(x = .url, n = 50, clean_up = TRUE, 
    free_doc = TRUE, detailed = FALSE, use_rvest = TRUE)
fpath <- file.path(tempdir(), "memory-profile-3.3.rdata")
save(res, file = fpath)

####################
## www.amazon.com ##
####################

## Offline //
res <- memoryLeak(x = "amazon.html", n = 50000, detailed = FALSE, use_rvest = TRUE)
fpath <- file.path(tempdir(), "memory-profile-4.1.rdata")
save(res, file = fpath)

res <- memoryLeak(x = "amazon.html", n = 50000, clean_up = TRUE, 
  detailed = FALSE, use_rvest = TRUE)
fpath <- file.path(tempdir(), "memory-profile-4.2.rdata")
save(res, file = fpath)

res <- memoryLeak(x = "amazon.html", n = 50000, clean_up = TRUE, 
    free_doc = TRUE, detailed = FALSE, use_rvest = TRUE)
fpath <- file.path(tempdir(), "memory-profile-4.3.rdata")
save(res, file = fpath)

## Online (PLEASE USE "POLITE" VALUE FOR `n`!!!) //
.url <- "http://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=ssd"
res <- memoryLeak(x = .url, n = 50, detailed = FALSE, use_rvest = TRUE)
fpath <- file.path(tempdir(), "memory-profile-4.1.rdata")
save(res, file = fpath)

res <- memoryLeak(x = .url, n = 50, clean_up = TRUE, detailed = FALSE, use_rvest = TRUE)
fpath <- file.path(tempdir(), "memory-profile-4.2.rdata")
save(res, file = fpath)

res <- memoryLeak(x = .url, n = 50, clean_up = TRUE, 
    free_doc = TRUE, detailed = FALSE, use_rvest = TRUE)
fpath <- file.path(tempdir(), "memory-profile-4.3.rdata")
save(res, file = fpath)

htmlTreeParse Error

There seems to be a bug in htmlTreeParse() in XML 3.98-1.4 on R 3.3.2. Here's a minimal example:

link = "http://anson.ucdavis.edu/~mueller/cveng13.html"
tree = htmlTreeParse(link)
tree_body = tree$children$html[[2]]
tree_div = getNodeSet(tree_body, path="//div")

The error message is:

Failed to parse QName 'padding-left:'
Failed to parse QName 'padding-bottom:'
Failed to parse QName 'padding-top:'
Comment must not contain '--' (double-hyphen)
Comment must not contain '--' (double-hyphen)
Comment must not contain '--' (double-hyphen)
Error: 1: Failed to parse QName 'padding-left:'
2: Failed to parse QName 'padding-bottom:'
3: Failed to parse QName 'padding-top:'
4: Comment must not contain '--' (double-hyphen)
5: Comment must not contain '--' (double-hyphen)
6: Comment must not contain '--' (double-hyphen)

This error does not occur with htmlParse().
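A workaround sketch based on that observation: parse with htmlParse(), which returns an internal document that getNodeSet() can query directly.

library(XML)
link <- "http://anson.ucdavis.edu/~mueller/cveng13.html"
doc  <- htmlParse(link)            # internal document, no QName errors
divs <- getNodeSet(doc, "//div")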

trim option does not work with xmlValue when recursive = FALSE

I am using xpathSApply and xmlValue as below:

xpathSApply(page, "//div[@Class = 'raised ']", xmlValue, trim = TRUE)

This works as intended and trims the \n and whitespace from around the nodes I'm trying to get at.

However, if I add the argument recursive = FALSE, to get only the head node:

xpathSApply(page, "//div[@Class = 'raised ']", xmlValue, recursive = FALSE, trim = TRUE)

The whitespace and \n are back in the result. If I run the gsub code that makes up the trim logic on that result, it gives me the desired output.

Looking at the code (https://github.com/omegahat/XML/blob/4d95a6eff4a7c3ac3dce7c5f9d0dd3862e175cf6/R/xmlNodes.R), it looks like trim is invoked outside the conditional function that handles recursive = FALSE. This would probably be fixable by adding a trim statement to the two outputs within that conditional function.
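Until then, a workaround sketch (page stands for the poster's parsed document; trimws() has been in base R since 3.2.0):

vals <- xpathSApply(page, "//div[@Class = 'raised ']", xmlValue, recursive = FALSE)
vals <- trimws(vals)  # apply the trimming manually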

R: saveXML prefix standalone="yes" outputting as standalone="true"

I am attempting to create a simple XML file sourced from an R data frame, with a preamble/header that includes standalone="yes", so the declaration reads:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>

However, when I include this in the prefix argument of saveXML, as in:

prefix = '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>\n'

the preamble reads instead as:

<?xml version="1.0" encoding="UTF-8" standalone="true"?>

I've read through much of the documentation but cannot figure out how to do this, and I do not understand why it converts "yes" to "true". Is this an issue with how the prefix is saved? This happens when I provide the file argument; when file is omitted, the XML output printed to the console shows standalone="yes" correctly. For reference, here is reproducible code that successfully creates an XML file from an R data frame, but with the incorrect preamble:

data <- structure(list(page = c("Page One", "Page Two"),
                       base64 = c("Very Long String thats a base64 character",
                                  "Very Long String thats a base64 character")),
                  .Names = c("page", "base64"), row.names = 1:2, class = "data.frame")
names(data) <- c("title", "page")

library(XML)
xml <- xmlTree()

xml$addTag("report", close = FALSE, attrs = c(type = "enhanced"))
xml$addTag("pages", close = FALSE)
for (i in 1:nrow(data)) {
  xml$addTag("page", close = FALSE)
  for (j in names(data)) {
    xml$addTag(j, data[i, j])
  }
  xml$closeTag()
}
xml$closeTag()
xml$closeTag()
saveXML(xml, file = "test.xml",
        prefix = '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>\n')
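A workaround sketch based on the observation above that the console/string form is correct: serialize without the file argument and write the string out manually.

out <- saveXML(xml, prefix = '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>\n')
writeLines(out, "test.xml")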

Error adding tag after open tag with xmlOutputBuffer

Hi,

I'm getting an error when running the following code:

doc <- xmlOutputBuffer()
doc$addTag("bravo", close = FALSE)
doc$addTag("charlie", "forename")

Error in if (!startingTag && !is.null(namespace) && namespace == nameSpace &&  :
  missing value where TRUE/FALSE needed

This is how xmlOutputBuffer appears in the documentation:

xmlOutputBuffer(dtd = NULL, nameSpace = "", buf = NULL, nsURI = NULL,
                header = "<?xml version=\"1.0\"?>")

And this is how it appears in the source code:

xmlOutputBuffer <- function(dtd = NULL, nameSpace = NULL, buf = NULL, nsURI = NULL,
                            header = "<?xml version=\"1.0\"?>") {...}

When the function is called with nameSpace = "", the error does not appear.
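A workaround sketch based on that observation: pass nameSpace = "" explicitly, matching the documented default rather than the NULL default in the source.

library(XML)
doc <- xmlOutputBuffer(nameSpace = "")
doc$addTag("bravo", close = FALSE)
doc$addTag("charlie", "forename")
doc$value()  # inspect the buffer contents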
