Based on "Mojo::DOM and Web::Scraper" from Charsbar::Note.
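Lately I often use Web::Query in place of Web::Scraper, so I added Web::Query to the benchmark as well. In terms of speed it turned out to be no different from Web::Scraper, so adding it didn't tell me much.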
use strict;
use warnings;
use 5.010;
use Path::Extended;
use Web::Query;
use Web::Scraper;
use Mojolicious;
use Mojo::UserAgent;
use Mojo::DOM;
use Data::Dump qw/dump/;
use Benchmark qw/cmpthese/;
use Test::More;
use Test::Differences;

# Print the versions in use so the results can be compared later.
say "perl: $^V";
for (qw/Mojolicious Web::Query Web::Scraper HTML::Selector::XPath/) {
    say "$_: " . $_->VERSION;
}

# Set to a true value to check that all implementations return the same data.
my $test = 0;

for (qw( www.yahoo.co.jp )) {
    # Cache the page on disk so every run parses the same HTML.
    my $file = file($_.'.html');
    my $html;
    if ($file->exists) {
        $html = $file->slurp;
    }
    else {
        my $ua = Mojo::UserAgent->new;
        # pretend to be IE 8
        $ua->name('Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30; .NET CLR 3.0.04506.648)');
        $html = $ua->get("http://$_/")->res->body;
        $file->save($html);
    }

    if ($test) {
        ok get_links_mojo($html);
        ok get_topics_mojo($html);
        ok get_shortcut_mojo($html);
        eq_or_diff [get_links_mojo($html)]    => [get_links_scraper($html)];
        eq_or_diff [get_topics_mojo($html)]   => [get_topics_scraper($html)];
        eq_or_diff [get_shortcut_mojo($html)] => [get_shortcut_scraper($html)];
        eq_or_diff [get_links_mojo($html)]    => [get_links_wq($html)];
        eq_or_diff [get_topics_mojo($html)]   => [get_topics_wq($html)];
        eq_or_diff [get_shortcut_mojo($html)] => [get_shortcut_wq($html)];
    }

    # Benchmark 1: collect the href of every link.
    # "scraper2" reuses a scraper object built once, outside the timed code.
    {
        my $scraper = scraper { process 'a' => 'links[]' => '@href' };
        cmpthese(100, {
            mojo     => sub { get_links_mojo($html) },
            wq       => sub { get_links_wq($html) },
            scraper  => sub { get_links_scraper($html) },
            scraper2 => sub { get_links_scraper2($html, $scraper) },
        });
    }

    # Benchmark 2: collect the text of each topic list item.
    {
        cmpthese(100, {
            mojo    => sub { get_topics_mojo($html) },
            wq      => sub { get_topics_wq($html) },
            scraper => sub { get_topics_scraper($html) },
        });
    }

    # Benchmark 3: fetch the text of a single, specific link.
    {
        cmpthese(100, {
            mojo    => sub { get_shortcut_mojo($html) },
            wq      => sub { get_shortcut_wq($html) },
            scraper => sub { get_shortcut_scraper($html) },
        });
    }
}

done_testing if $test;

sub get_links_mojo {
    my $html = shift;
    my $dom = Mojo::DOM->new($html);
    @{ $dom->find('a')->map(sub { shift->{href} }) };
}

sub get_links_wq {
    my $html = shift;
    my $wq = Web::Query->new_from_html($html);
    $wq->find('a')->attr('href');
}

sub get_links_scraper {
    my $html = shift;
    my $scraper = scraper { process 'a' => 'links[]' => '@href' };
    @{ $scraper->scrape($html)->{links} };
}

sub get_links_scraper2 {
    my ($html, $scraper) = @_;
    @{ $scraper->scrape($html)->{links} || [] };
}

sub get_topics_mojo {
    my $html = shift;
    my $dom = Mojo::DOM->new($html);
    @{ $dom->find('div#topicsfb ul.emphasis li')->map(sub { shift->all_text(0) }) };
}

sub get_topics_wq {
    my $html = shift;
    my $wq = Web::Query->new_from_html($html);
    $wq->find('div#topicsfb ul.emphasis li')->text;
}

sub get_topics_scraper {
    my $html = shift;
    my $scraper = scraper { process 'div#topicsfb ul.emphasis li' => 'topics[]' => 'TEXT' };
    @{ $scraper->scrape($html)->{topics} || [] };
}

sub get_shortcut_mojo {
    my $html = shift;
    my $dom = Mojo::DOM->new($html);
    $dom->at('ul.shortcut li a[href="r/pnp"]')->text;
}

sub get_shortcut_wq {
    my $html = shift;
    my $wq = Web::Query->new_from_html($html);
    $wq->find('ul.shortcut li a[href="r/pnp"]')->text;
}

sub get_shortcut_scraper {
    my $html = shift;
    my $scraper = scraper {
        process 'ul.shortcut li a[href="r/pnp"]' => pnp => 'TEXT';
        result 'pnp';
    };
    $scraper->scrape($html);
}
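Running the benchmark shows that Mojo::DOM is the fastest of the three, at roughly 1.7 to 2 times the rate of the others. (Or rather, it shows that HTML::TreeBuilder::XPath is slow.)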
perl: v5.16.0
Mojolicious: 3.05
Web::Query: 0.08
Web::Scraper: 0.36
HTML::Selector::XPath: 0.14

           Rate  scraper scraper2       wq     mojo
scraper  9.24/s       --      -1%      -4%     -42%
scraper2 9.32/s       1%       --      -4%     -41%
wq       9.66/s       5%       4%       --     -39%
mojo     15.9/s      72%      71%      65%       --

          Rate scraper      wq    mojo
scraper 7.70/s      --     -1%    -42%
wq      7.74/s      1%      --    -42%
mojo    13.3/s     73%     72%      --

          Rate scraper      wq    mojo
scraper 8.30/s      --     -0%    -50%
wq      8.34/s      1%      --    -50%
mojo    16.6/s    100%    100%      --
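For anyone unfamiliar with the cmpthese layout: each percentage cell compares the rate of the row entry with the rate of the column entry. The following is a minimal sketch that recomputes the cells of the first table from its Rate column. The helper name relative_pct is made up for this illustration, and because it starts from the rounded rates printed in the table, its output should only be expected to approximate what Benchmark itself prints.

```perl
use strict;
use warnings;

# Rates (iterations per second) copied from the first table above.
my %rate = (scraper => 9.24, scraper2 => 9.32, wq => 9.66, mojo => 15.9);

# Hypothetical helper: percentage by which the row entry is faster (+)
# or slower (-) than the column entry, rounded to a whole number.
sub relative_pct {
    my ($row_rate, $col_rate) = @_;
    return sprintf '%.0f', ($row_rate / $col_rate - 1) * 100;
}

# Walk the entries from slowest to fastest, as cmpthese does.
my @names = sort { $rate{$a} <=> $rate{$b} } keys %rate;
for my $row (@names) {
    for my $col (@names) {
        next if $row eq $col;
        printf "%-8s vs %-8s %4d%%\n",
            $row, $col, relative_pct($rate{$row}, $rate{$col});
    }
}
```

Read this way, the "72%" in the mojo row means Mojo::DOM completed about 1.72 times as many iterations per second as the scraper entry.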
In ordinary use I pair these modules with HTML::TreeBuilder::LibXML anyway, so I don't think the slowness is a real problem.
Here is the benchmark after calling HTML::TreeBuilder::LibXML->replace_original:
perl: v5.16.0
Mojolicious: 3.05
Web::Query: 0.08
Web::Scraper: 0.36
HTML::Selector::XPath: 0.14

           Rate     mojo scraper2  scraper       wq
mojo     16.1/s       --     -80%     -80%     -83%
scraper2 79.4/s     394%       --      -1%     -15%
scraper  80.0/s     398%       1%       --     -14%
wq       93.5/s     481%      18%      17%       --

          Rate    mojo      wq scraper
mojo    13.4/s      --    -89%    -89%
wq       122/s    807%      --     -1%
scraper  123/s    819%      1%      --

          Rate    mojo scraper      wq
mojo    16.6/s      --    -87%    -87%
scraper  130/s    681%      --     -1%
wq       132/s    691%      1%      --
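The switch used for this second run can be sketched as follows. This is a minimal example, assuming HTML::TreeBuilder::LibXML and Web::Scraper are installed. The sample HTML string is invented for illustration and is not the page used in the benchmark.

```perl
use strict;
use warnings;
use Web::Scraper;
use HTML::TreeBuilder::LibXML;

# Make HTML::TreeBuilder::XPath->new return the libxml2-backed tree.
# Call this once, before any scraping takes place.
HTML::TreeBuilder::LibXML->replace_original();

# Hypothetical sample input, standing in for the cached page.
my $html = '<html><body><a href="/a">A</a><a href="/b">B</a></body></html>';

# Same scraper definition as get_links_scraper in the script above.
my $scraper = scraper { process 'a' => 'links[]' => '@href' };
my $links   = $scraper->scrape($html)->{links} || [];

print "$_\n" for @$links;
```

Nothing else in the scraping code has to change, which is presumably why the scraper and wq entries speed up while the mojo entry, which does not use HTML::TreeBuilder at all, stays at the same rate.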