Mojo::DOM, Web::Scraper, and Web::Query

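Adapted from "Mojo::DOM and Web::Scraper" on Charsbar::Note.
Lately I often use Web::Query instead of Web::Scraper, so I added Web::Query to the benchmark as well. In terms of speed it turned out to be no different from Web::Scraper, so adding it did not tell me much.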

use strict;
use warnings;
use 5.010;
use Path::Extended;
use Web::Query;
use Web::Scraper;
use Mojolicious;
use Mojo::UserAgent;
use Mojo::DOM;
use Data::Dump qw/dump/;
use Benchmark qw/cmpthese/;
use Test::More;
use Test::Differences;

say "perl: $^V";
for (qw/Mojolicious Web::Query Web::Scraper HTML::Selector::XPath/) {
  say "$_: " . $_->VERSION;
}

my $test = 0;  # set to 1 to check that all implementations return the same results

for (qw( www.yahoo.co.jp )) {
  my $file = file($_.'.html');
  my $html;
  if ($file->exists) {
    $html = $file->slurp;
  } else {
    my $ua = Mojo::UserAgent->new;
    # pretend to be IE 8
    $ua->name('Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30; .NET CLR 3.0.04506.648)');
    $html = $ua->get("http://$_/")->res->body;
    $file->save($html);
  }

  if ($test) {
    ok get_links_mojo($html);
    ok get_topics_mojo($html);
    ok get_shortcut_mojo($html);

    eq_or_diff [get_links_mojo($html)]    => [get_links_scraper($html)];
    eq_or_diff [get_topics_mojo($html)]   => [get_topics_scraper($html)];
    eq_or_diff [get_shortcut_mojo($html)] => [get_shortcut_scraper($html)];

    eq_or_diff [get_links_mojo($html)]    => [get_links_wq($html)];
    eq_or_diff [get_topics_mojo($html)]   => [get_topics_wq($html)];
    eq_or_diff [get_shortcut_mojo($html)] => [get_shortcut_wq($html)];
  }

  {
    my $scraper = scraper { process 'a' => 'links[]' => '@href' };
    cmpthese(100, {
      mojo     => sub { get_links_mojo($html) },
      wq       => sub { get_links_wq($html) },
      scraper  => sub { get_links_scraper($html) },
      scraper2 => sub { get_links_scraper2($html, $scraper) },
    });
  }

  {
    cmpthese(100, {
      mojo     => sub { get_topics_mojo($html) },
      wq       => sub { get_topics_wq($html) },
      scraper  => sub { get_topics_scraper($html) },
    });
  }

  {
    cmpthese(100, {
      mojo     => sub { get_shortcut_mojo($html) },
      wq       => sub { get_shortcut_wq($html) },
      scraper  => sub { get_shortcut_scraper($html) },
    });
  }
}

done_testing if $test;

sub get_links_mojo {
  my $html = shift;
  my $dom = Mojo::DOM->new($html);
  @{ $dom->find('a')->map(sub { shift->{href} }) };
}

sub get_links_wq {
  my $html = shift;
  my $wq = Web::Query->new_from_html($html);
  $wq->find('a')->attr('href');
}

sub get_links_scraper {
  my $html = shift;
  my $scraper = scraper { process 'a' => 'links[]' => '@href' };
  @{ $scraper->scrape($html)->{links} || [] };
}

sub get_links_scraper2 {
  my ($html, $scraper) = @_;
  @{ $scraper->scrape($html)->{links} || [] };
}

sub get_topics_mojo {
  my $html = shift;
  my $dom = Mojo::DOM->new($html);
  @{ $dom->find('div#topicsfb ul.emphasis li')->map(sub { shift->all_text(0) }) };
}

sub get_topics_wq {
  my $html = shift;
  my $wq = Web::Query->new_from_html($html);
  $wq->find('div#topicsfb ul.emphasis li')->text;
}

sub get_topics_scraper {
  my $html = shift;
  my $scraper = scraper { process 'div#topicsfb ul.emphasis li' => 'topics[]' => 'TEXT' };
  @{ $scraper->scrape($html)->{topics} || [] };
}

sub get_shortcut_mojo {
  my $html = shift;
  my $dom = Mojo::DOM->new($html);
  $dom->at('ul.shortcut li a[href="r/pnp"]')->text;
}

sub get_shortcut_wq {
  my $html = shift;
  my $wq = Web::Query->new_from_html($html);
  $wq->find('ul.shortcut li a[href="r/pnp"]')->text;
}

sub get_shortcut_scraper {
  my $html = shift;
  my $scraper = scraper { process 'ul.shortcut li a[href="r/pnp"]' => pnp => 'TEXT'; result 'pnp' };
  $scraper->scrape($html);
}

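Running the benchmark shows that Mojo::DOM is fast. (Or rather, HTML::TreeBuilder::XPath is slow.) The three tables are, in order, the links, topics, and shortcut benchmarks.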

perl: v5.16.0
Mojolicious: 3.05
Web::Query: 0.08
Web::Scraper: 0.36
HTML::Selector::XPath: 0.14
           Rate  scraper scraper2       wq     mojo
scraper  9.24/s       --      -1%      -4%     -42%
scraper2 9.32/s       1%       --      -4%     -41%
wq       9.66/s       5%       4%       --     -39%
mojo     15.9/s      72%      71%      65%       --
          Rate scraper      wq    mojo
scraper 7.70/s      --     -1%    -42%
wq      7.74/s      1%      --    -42%
mojo    13.3/s     73%     72%      --
          Rate scraper      wq    mojo
scraper 8.30/s      --     -0%    -50%
wq      8.34/s      1%      --    -50%
mojo    16.6/s    100%    100%      --
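
As a reading aid for the tables above: each percentage cell compares the row's rate with the column's rate. The following is a minimal sketch of that arithmetic, using rates copied from the first table. The helper name `percent_vs` is hypothetical. Benchmark itself works from unrounded rates, so cells recomputed from the printed values can be off by a point or so.

```perl
use strict;
use warnings;

# Relative difference, in percent, of the row's rate against the column's rate.
# percent_vs is a hypothetical helper name used only for this illustration.
sub percent_vs {
    my ($row_rate, $col_rate) = @_;
    return sprintf '%.0f', ($row_rate / $col_rate - 1) * 100;
}

# Rates (iterations per second) copied from the first table above.
my %rate = (scraper => 9.24, wq => 9.66, mojo => 15.9);

# Row "mojo", column "scraper": how much faster mojo is than scraper.
printf "mojo vs scraper: %s%%\n", percent_vs($rate{mojo}, $rate{scraper});

# Row "scraper", column "mojo": how much slower scraper is than mojo.
printf "scraper vs mojo: %s%%\n", percent_vs($rate{scraper}, $rate{mojo});
```

With these inputs the two lines print 72% and -42%, matching the corresponding cells of the first table.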

In normal use, though, I pair these modules with HTML::TreeBuilder::LibXML, so I don't think this is a problem.
Here is the benchmark after calling HTML::TreeBuilder::LibXML->replace_original;:

perl: v5.16.0
Mojolicious: 3.05
Web::Query: 0.08
Web::Scraper: 0.36
HTML::Selector::XPath: 0.14
           Rate     mojo scraper2  scraper       wq
mojo     16.1/s       --     -80%     -80%     -83%
scraper2 79.4/s     394%       --      -1%     -15%
scraper  80.0/s     398%       1%       --     -14%
wq       93.5/s     481%      18%      17%       --
          Rate    mojo      wq scraper
mojo    13.4/s      --    -89%    -89%
wq       122/s    807%      --     -1%
scraper  123/s    819%      1%      --
          Rate    mojo scraper      wq
mojo    16.6/s      --    -87%    -87%
scraper  130/s    681%      --     -1%
wq       132/s    691%      1%      --
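
The second run differs from the first only in that HTML::TreeBuilder::LibXML->replace_original; was called before any HTML was parsed. A minimal sketch of that setup follows, assuming the call sits near the top of the script, after the use lines. The placement and the placeholder HTML are assumptions, not taken from the original script.

```perl
use strict;
use warnings;
use Web::Scraper;
use HTML::TreeBuilder::LibXML;

# Make HTML::TreeBuilder::XPath->new hand back libxml2-backed trees.
# Called in void context the replacement stays in effect for the whole run,
# so do it once, before any scraper or Web::Query object parses HTML.
HTML::TreeBuilder::LibXML->replace_original;

# Placeholder HTML for illustration only (not the Yahoo! page from the benchmark).
my $html = '<html><body><a href="/a">A</a><a href="/b">B</a></body></html>';

# Same scraper definition as in the benchmark; no other code changes are needed.
my $scraper = scraper { process 'a' => 'links[]' => '@href' };
my @links   = @{ $scraper->scrape($html)->{links} || [] };

print scalar(@links), " links found\n";
```

Mojo::DOM has its own parser and does not go through HTML::TreeBuilder::XPath, which fits the mojo rows staying at roughly the same rate in both runs.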