Ticket #95 (closed enhancement: fixed)

Opened 1 year ago

Last modified 1 year ago

[PATCH] Improved Hash.from_xml

Reported by: has.s..@gmail.com Assigned to:
Priority: medium Milestone: 0.3.x
Component: Merb Keywords:
Cc: chris@octopod.info

Description

The Hpricot version of the Hash.from_xml is very inefficient. It internally creates a document structure in memory and is then translated into the hash representation required. This makes it very slow.

This patch is based on the SAX parsing method and generates the hash almost directly as it's read.

From a benchmark of an xml feed run 100 times, this version vs the Hpricot version is 3.1 times faster. This is done on my iMac with plenty of other things running so take it with a grain of salt, although I did average it over 5 runs.

This patch relies on the libxml-ruby gem <=0.3.8.4 (the last stable). This may be an issue as this is not supported to run on windows. There has recently been discussion on the libxml list to add this support.

Attachments

hash_to_xml_libxml.diff (8.4 kB) - added by has.s..@gmail.com on 07/23/07 05:59:01.
from_xml using libxml-ruby SAX Parser
libxml_VS_Hpricot_from_xml.rb (9.4 kB) - added by has.s..@gmail.com on 07/23/07 20:12:05.
A Simple Benchmark Script for libxml-ruby vs Hpricot
libxml_VS_Hpricot_VS_REXML_from_xml.rb (12.1 kB) - added by chr..@octopod.info on 07/26/07 15:59:05.
The previous benchmark with a slightly modified version of the libxml sax code that uses REXML fast stream parsing
hash_to_xml_rexml.diff (5.8 kB) - added by chr..@octopod.info on 07/26/07 16:09:59.
Patch containg the rexml version, all credit to has.sox@gmail.com for the hard work :)
benchmark_with_confidence_intervals.rb (13.1 kB) - added by jon.egil.stra..@gmail.com on 07/27/07 06:34:42.

Change History

07/23/07 05:59:01 changed by has.s..@gmail.com

  • attachment hash_to_xml_libxml.diff added.

from_xml using libxml-ruby SAX Parser

07/23/07 09:42:54 changed by jon.egil.stra..@gmail.com

This is interesting. Could you please attach the benchmark xml-feed and benchmark-script. It would be interesting to reproduce.

All the best

07/23/07 10:12:41 changed by chr..@octopod.info

In my experience, libxml-ruby leaks really badly when creating xml [1]. Perhaps it doesn't when using SAX, but we'd need to be very sure of that. It also doesn't have an active maintainer at the moment and needs some work [2]. I'm not sure we should be including it as a dependency until there's some progress on these issues.

[1] http://rubyforge.org/tracker/index.php?func=detail&aid=7945&group_id=494&atid=1971 [2] http://rubyforge.org/pipermail/libxml-devel/2007-July/000340.html

07/23/07 15:45:28 changed by has.s..@gmail.com

I've also had experience with libxml-ruby leaking. That was when I was manipulating Nodes though and I actually put it away because of this on my project. I have not had those issues when just reading XML though.

You're right about the development of libxml-ruby. It's disappointingly inactive at the moment. I followed the thread listed above and it sounds promising but at the moment it sucks.

I wasn't expecting this patch to make it into core at the moment with the current status of libxml-ruby, but I did want to get the idea of using a SAX parser for this going. I've used Hpricot before and for valid xml I've found it very slow compared to libxml-ruby. Having said that, Hpricot doesn't leak from what I've seen.

I will follow up witht the benchmark script in a bit.

07/23/07 20:12:05 changed by has.s..@gmail.com

  • attachment libxml_VS_Hpricot_from_xml.rb added.

A Simple Benchmark Script for libxml-ruby vs Hpricot

07/26/07 13:04:37 changed by chr..@octopod.info

Have you tried good ol' REXML SAX?

07/26/07 13:56:23 changed by jon.egil.stra..@gmail.com

from irc:

(10:04:31 PM) ezmobius: i dont want to depend on lib-xml at all
(10:04:41 PM) ezmobius: hpricot has binany gems for even windows

octopod: your comment is interesting, I look forward to see the benchmark-script and results.

07/26/07 15:59:05 changed by chr..@octopod.info

  • attachment libxml_VS_Hpricot_VS_REXML_from_xml.rb added.

The previous benchmark with a slightly modified version of the libxml sax code that uses REXML fast stream parsing

07/26/07 16:00:42 changed by chr..@octopod.info

On my local machine (15" C2D 2.2GHz MBP)

Rehearsal ----------------------------------------------- sax 4.240000 0.070000 4.310000 ( 4.403817) hpricot 10.370000 0.150000 10.520000 ( 10.748726) rexml based 5.650000 0.070000 5.720000 ( 5.895491)


user system total real

sax 4.220000 0.060000 4.280000 ( 4.353719) hpricot 10.160000 0.140000 10.300000 ( 10.472697) rexml based 5.580000 0.050000 5.630000 ( 5.685909)

07/26/07 16:01:43 changed by chr..@octopod.info

bah! should have previewed first:

Rehearsal -----------------------------------------------
sax           4.240000   0.070000   4.310000 (  4.403817)
hpricot      10.370000   0.150000  10.520000 ( 10.748726)
rexml based   5.650000   0.070000   5.720000 (  5.895491)
------------------------------------- total: 20.550000sec

                  user     system      total        real
sax           4.220000   0.060000   4.280000 (  4.353719)
hpricot      10.160000   0.140000  10.300000 ( 10.472697)
rexml based   5.580000   0.050000   5.630000 (  5.685909)

07/26/07 16:09:59 changed by chr..@octopod.info

  • attachment hash_to_xml_rexml.diff added.

Patch containg the rexml version, all credit to has.s..@gmail.com for the hard work :)

07/26/07 16:14:05 changed by chr..@octopod.info

  • cc set to chr..@octopod.info.

07/26/07 17:42:32 changed by has.s..@gmail.com

That's great Chris. I didn't even think to use REXML, what with all those reports that it was very slow I didn't even stop to think about using it.

07/27/07 06:33:41 changed by jon.egil.stra..@gmail.com

A benchmark report with confidence intervals. To conclude one significantly different from another (with 95% certainty) look for non-overlapping conf-intervals.

The benchmark-script is almost identical to chris' and has.sox, credit to them!

Note how various feeds give various results:

All time in seconds.

-- http://feeds.feedburner.com/37signals/beMH (n=1000) --
           mean     95% interval
hpricot    0.14   ( 0.07 .. 0.20 )
libxml     0.14   ( 0.09 .. 0.20 )
rexml      0.07   ( 0.01 .. 0.12 )


-- http://brainspl.at/xml/rss20/feed.xml (n=1000) --
           mean     95% interval
hpricot    0.05   ( 0.01 .. 0.09 )
libxml     0.01   (-0.01 .. 0.04 )
rexml      0.02   (-0.01 .. 0.05 )


-- http://jobs.37signals.com/categories/2/jobs;rss (n=1000) --
           mean     95% interval
hpricot    0.28   ( 0.21 .. 0.36 )
libxml     0.11   ( 0.05 .. 0.18 )
rexml      0.10   ( 0.06 .. 0.14 )


-- http://www.aftenposten.no/eksport/rss-1_0/?seksjon=ece_frontpage (n=1000) --
           mean     95% interval
hpricot    0.05   ( 0.01 .. 0.08 )
libxml     0.01   (-0.01 .. 0.03 )
rexml      0.02   (-0.00 .. 0.05 )


-- http://merb.devjavu.com/projects/merb/report/9?format=rss (n=1000) --
           mean     95% interval
hpricot    0.43   ( 0.37 .. 0.49 )
libxml     0.19   ( 0.11 .. 0.27 )
rexml      0.16   ( 0.09 .. 0.23 )


-- http://odeo.com/profile/OdeoMusic/rss (n=1000) --
           mean     95% interval
hpricot    0.32   ( 0.25 .. 0.39 )
libxml     0.07   ( 0.02 .. 0.12 )
rexml      0.12   ( 0.07 .. 0.17 )

This ticket is really rich in information, that's great. I feel chris and has.sox should decide the final outcome, then we can commit and mark as fixed.

07/27/07 06:34:42 changed by jon.egil.stra..@gmail.com

  • attachment benchmark_with_confidence_intervals.rb added.

07/27/07 08:16:44 changed by has.s..@gmail.com

There really is a lot of great information in this ticket. Thanx Chris and Jon for getting involved in this.

Looking a Jons' results I would have to vote for the REXML version. It is consistant, and looks to be fast across most feeds tested.

Cheers Daniel

07/27/07 08:50:37 changed by chr..@octopod.info

Thanks for the proper benchmarking Jon, what did you use for that?

Regarding which one to use, I think rexml should win as it's faster and part of ruby stdlib. However before choosing, one thing I was thinking about these benchmarks is whether they reflect the general case of xml file that would be turned into a hash. As I understand it, this is mainly going to be used for turning REST style XML into a params hash. If this is the case we should try with a bunch of randomly generated smaller files. If I'm wrong then ignore this :)

07/27/07 08:56:15 changed by chr..@octopod.info

OK, I just looked at the code to see how the benchmarking worked. I thought it might be something I've been meaning to check out for a while http://eigenclass.org/hiki.rb?adaptative+benchmark

07/28/07 14:47:35 changed by e.@brainspl.at

  • status changed from new to closed.
  • resolution set to fixed.

(In [350]) Commiting the REXML Hash.from_xml code. Nice work to everyone involved, i love's me some benchmarks. Closes #95