How do I pretty-print HTML with Nokogiri?

Question

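I've written a web crawler in Ruby and I'm using Nokogiri::HTML to parse the pages. I need to print a page out, and while messing around in IRB I noticed a pretty_print method. However, it takes parameters and I can't figure out what it wants.

My crawler caches the HTML of web pages and writes it to a file on my local machine. I'd like to "pretty print" the HTML so that it looks nice and is properly formatted.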

mislav · Accepted Answer

By "pretty printing" an HTML page, I presume you mean reformatting the HTML structure with proper indentation. Nokogiri doesn't support this; the pretty_print method is for the "pp" library, and its output is useful only for debugging.
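To illustrate why pretty_print is confusing to call by hand: it is the hook that Ruby's pp library invokes, and its parameter is a printer object rather than a formatting option. The following is a minimal sketch using only the standard library; the Page class and its fields are hypothetical and are not how Nokogiri implements its own pretty_print.

```ruby
require 'pp'
require 'stringio'

# Hypothetical class, used only to show the hook that pp calls.
class Page
  def initialize(url, title)
    @url = url
    @title = title
  end

  # pp calls this method and passes in its printer object (named q by convention).
  def pretty_print(q)
    q.group(1, '#<Page', '>') do
      q.breakable
      q.text "url=#{@url.inspect}"
      q.comma_breakable
      q.text "title=#{@title.inspect}"
    end
  end
end

# Capture the pp output in a string instead of writing to $stdout.
out = StringIO.new
PP.pp(Page.new('http://example.com/', 'Example'), out)
puts out.string
```

The result is an inspect-style dump of the object for debugging, not reformatted markup.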

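There are several projects that understand HTML well enough to reformat it without destroying whitespace that is actually significant (the famous one is HTML Tidy), but Googling for this turned up a post titled "Pretty printing XHTML with Nokogiri and XSLT".

It comes down to this: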

    xsl = Nokogiri::XSLT(File.open("pretty_print.xsl"))
    html = Nokogiri(File.open("source.html"))
    puts xsl.apply_to(html).to_s

Of course, you have to download the linked XSL file to your filesystem. I tried it very quickly on my machine and it works like a charm.

Phrogz · Answer

The answer by @mislav is somewhat wrong. Nokogiri does support pretty-printing if you:

Parse the document as XML
Tell Nokogiri to ignore whitespace-only nodes ("blanks") during parsing
Use to_xhtml or to_xml and specify pretty-printing parameters

In action:

    html = '<section> <h1>Main Section 1</h1><p>Intro</p> <section> <h2>Subhead 1.1</h2><p>Meat</p><p>MOAR MEAT</p> </section><section> <h2>Subhead 1.2</h2><p>Meat</p> </section></section>'

    require 'nokogiri'
    doc = Nokogiri::XML(html, &:noblanks)
    puts doc
    #=> <section>
    #=>   <h1>Main Section 1</h1>
    #=>   <p>Intro</p>
    #=>   <section>
    #=>     <h2>Subhead 1.1</h2>
    #=>     <p>Meat</p>
    #=>     <p>MOAR MEAT</p>
    #=>   </section>
    #=>   <section>
    #=>     <h2>Subhead 1.2</h2>
    #=>     <p>Meat</p>
    #=>   </section>
    #=> </section>

    puts doc.to_xhtml( indent:3, indent_text:"." )
    #=> <section>
    #=> ...<h1>Main Section 1</h1>
    #=> ...<p>Intro</p>
    #=> ...<section>
    #=> ......<h2>Subhead 1.1</h2>
    #=> ......<p>Meat</p>
    #=> ......<p>MOAR MEAT</p>
    #=> ...</section>
    #=> ...<section>
    #=> ......<h2>Subhead 1.2</h2>
    #=> ......<p>Meat</p>
    #=> ...</section>
    #=> </section>

bronson · Answer

This worked for me:

    pretty_html = Nokogiri::HTML(html).to_xhtml(indent: 3)

I tried the REXML version from another answer, but it corrupted some of my documents. And I hate to bring xslt into a new project. Both feel outdated. :)

Julien · Answer

You can try REXML:

    require "rexml/document"
    doc = REXML::Document.new(xml)
    doc.write($stdout, 2)
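If the indented markup is needed as a string (for example, to write it to a cache file) rather than printed to $stdout, the formatter can be driven directly. This is a minimal sketch, assuming the input is well-formed XML; the sample markup and the variable names are illustrative only.

```ruby
require "rexml/document"
require "rexml/formatters/pretty"

# Illustrative input; REXML requires well-formed XML.
xml = "<section><h1>Title</h1><p>Intro</p></section>"
doc = REXML::Document.new(xml)

# Indent by 2 spaces; compact keeps short text-only elements on one line.
formatter = REXML::Formatters::Pretty.new(2)
formatter.compact = true

# The formatter appends to anything that responds to <<, including a String.
pretty = +""
formatter.write(doc, pretty)
puts pretty
```

Because REXML is a strict XML parser, real-world HTML that is not well-formed will raise a parse error instead of being reformatted.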

pariser · Answer

My solution was to add a print method to the actual Nokogiri objects. After you run the code in the snippet below, you'll be able to write node.print, and it will pretty-print the contents. No xslt required :-)

    require 'nokogiri'

    Nokogiri::XML::Node.class_eval do
      # Print every Node by default (will be overridden by CharacterData)
      define_method :should_print? do
        true
      end

      # Duplicate this node, replace the contents of the duplicated node with a
      # newline. With this content substitution, the #to_s method conveniently
      # returns a string with the opening tag (e.g. `<a href="foo">`) on the first
      # line and the closing tag on the second (e.g. `</a>`, provided that the
      # current node is not a self-closing tag).
      #
      # Now, print the open tag preceded by the correct amount of indentation, then
      # recursively print this node's children (with extra indentation), and then
      # print the close tag (if there is a closing tag)
      define_method :print do |indent=0|
        duplicate = self.dup
        duplicate.content = "\n"
        open_tag, close_tag = duplicate.to_s.split("\n")
        puts((" " * indent) + open_tag)
        self.children.select(&:should_print?).each { |child| child.print(indent + 2) }
        puts((" " * indent) + close_tag) if close_tag
      end
    end

    Nokogiri::XML::CharacterData.class_eval do
      # Only print CharacterData if there's non-whitespace content
      define_method :should_print? do
        content =~ /\S+/
      end

      # Replace all consecutive whitespace characters by a single space; precede the
      # output by a certain amount of indentation; print this text.
      # (gsub, not sub, so that every whitespace run is collapsed, as described.)
      define_method :print do |indent=0|
        puts((" " * indent) + to_s.strip.gsub(/\s+/, ' '))
      end
    end