How do I work around Groovy's XmlSlurper refusing to parse HTML because of DOCTYPE and DTD restrictions?

Question

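I'm trying to copy an element in an HTML coverage report, so that the coverage totals appear at the top of the report as well as at the bottom.

The HTML starts like this, and I believe it is well-formed: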

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
    <head>
    <meta http-equiv="Content-Type" content="text/html;charset=UTF-8" />
    <link rel="stylesheet" href=".resources/report.css" type="text/css" />
    <link rel="shortcut icon" href=".resources/report.gif" type="image/gif" />
    <title>Unified coverage</title>
    <script type="text/javascript" src=".resources/sort.js"></script>
    </head>
    <body onload="initialSort(['breadcrumb', 'coveragetable'])">

Groovy's XmlSlurper complains as follows:

    doc = new XmlSlurper( /* false, false, false */ ).parse("index.html")
    [Fatal Error] index.html:1:48: DOCTYPE is disallowed when the feature "http://apache.org/xml/features/disallow-doctype-decl" set to true.
    DOCTYPE is disallowed when the feature "http://apache.org/xml/features/disallow-doctype-decl" set to true.

Enabling DOCTYPE:

    doc = new XmlSlurper(false, false, true).parse("index.html")
    [Fatal Error] index.html:1:148: External DTD: Failed to read external DTD 'xhtml1-strict.dtd', because 'http' access is not allowed due to restriction set by the accessExternalDTD property.
    External DTD: Failed to read external DTD 'xhtml1-strict.dtd', because 'http' access is not allowed due to restriction set by the accessExternalDTD property.

    doc = new XmlSlurper(false, true, true).parse("index.html")
    [Fatal Error] index.html:1:148: External DTD: Failed to read external DTD 'xhtml1-strict.dtd', because 'http' access is not allowed due to restriction set by the accessExternalDTD property.
    External DTD: Failed to read external DTD 'xhtml1-strict.dtd', because 'http' access is not allowed due to restriction set by the accessExternalDTD property.

    doc = new XmlSlurper(true, true, true).parse("index.html")
    External DTD: Failed to read external DTD 'xhtml1-strict.dtd', because 'http' access is not allowed due to restriction set by the accessExternalDTD property.

    doc = new XmlSlurper(true, false, true).parse("index.html")
    External DTD: Failed to read external DTD 'xhtml1-strict.dtd', because 'http' access is not allowed due to restriction set by the accessExternalDTD property.

So I think I've covered all the constructor options. There must be a way to get this working without resorting to regexes and risking the wrath of Tony the Pony.

android.weasel · Accepted Answer

Tsk.

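    parser = new XmlSlurper()
    // allow the document to carry a DOCTYPE declaration
    parser.setFeature("http://apache.org/xml/features/disallow-doctype-decl", false)
    // do not try to fetch the external DTD that the DOCTYPE points at
    parser.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false)
    parser.parse(it)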
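For anyone who wants to try this without the coverage report at hand, here is a minimal sketch that applies the same two features to an inline string. The XHTML snippet is a cut-down stand-in for the header in the question, and `parseText` is used in place of `parse` only so that no file is needed. The import assumes Groovy 3 or later; on Groovy 2.x, drop it, since `groovy.util.XmlSlurper` is imported by default there.

```groovy
import groovy.xml.XmlSlurper  // assumption: Groovy 3+; remove this line on Groovy 2.x

// Cut-down stand-in for the report header shown in the question (not the real report)
def xhtml = '''<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
  <head><title>Unified coverage</title></head>
  <body><p>ok</p></body>
</html>'''

def parser = new XmlSlurper()
// accept the DOCTYPE declaration instead of rejecting the document
parser.setFeature("http://apache.org/xml/features/disallow-doctype-decl", false)
// skip loading the external DTD, so no http access is attempted
parser.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false)

def doc = parser.parseText(xhtml)
def rootName = doc.name()
def title = doc.head.title.text()
println "${rootName}: ${title}"
```

Both features are set on the slurper before anything is parsed. The first one lifts the DOCTYPE ban, and the second keeps the non-validating parser from going after `xhtml1-strict.dtd`.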

ataylor · Answer

Although your HTML happens to be well-formed XML as well, a more general solution for parsing HTML is to use a true HTML parser. I've used the TagSoup parser in the past, and it handles real-world HTML quite well.

TagSoup provides a SAX parser (its Parser class implements the org.xml.sax.XMLReader interface) that can be passed to XmlSlurper in the constructor. For example:

    @Grab('org.ccil.cowan.tagsoup:tagsoup:1.2.1')
    import org.ccil.cowan.tagsoup.Parser

    def doc = new XmlSlurper(new Parser()).parse("index.html")