Extracting the children of a specific XML element type

Question

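Given a specific XML element (that is, a specific tag name) and a snippet of XML data, I want to extract the children from each occurrence of that element. More concretely, I have the following snippet of (not valid) XML data: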

<!-- data.xml -->
<instance ab=1 >
  <a1>aa</a1>
  <a2>aa</a2>
</instance>
<instance ab=2 >
  <b1>bb</b1>
  <b2>bb</b2>
</instance>
<instance ab=3 >
  <c1>cc</c1>
  <c2>cc</c2>
</instance>

I need a script or command that takes this data as input and produces the following output:

<a1>aa</a1><a2>aa</a2>
<b1>bb</b1><b2>bb</b2>
<c1>cc</c1><c2>cc</c2>

I would like the solution to use standard text-processing tools such as sed or awk.

I tried the following sed command, but it did not work:

sed -n '/<Sample/,/<\/Sample/p' data.xml

igal · Answer

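If you really want sed- or awk-like command-line processing for XML files, you should consider using a command-line tool built for processing XML, such as xmlstarlet, which is used in the example below.

Also be aware that there are several XML-specific programming/query languages, XPath (used below) among them.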

To be valid XML, your data needs a root node and your attribute values must be quoted. In other words, your data file should look like this:

<!-- data.xml -->
<instances>
  <instance ab='1'>
    <a1>aa</a1>
    <a2>aa</a2>
  </instance>
  <instance ab='2'>
    <b1>bb</b1>
    <b2>bb</b2>
  </instance>
  <instance ab='3'>
    <c1>cc</c1>
    <c2>cc</c2>
  </instance>
</instances>
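To make the two requirements concrete, here is a minimal sketch (not part of the original answer) that feeds a shortened version of each snippet to Python's standard `xml.etree.ElementTree` parser. The helper name `is_well_formed` and the shortened sample strings are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if text parses as XML, False otherwise (hypothetical helper)."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Shortened form of the question's data:
# unquoted attribute values and no single root element
original = """<instance ab=1 >
<a1>aa</a1>
</instance>
<instance ab=2 >
<b1>bb</b1>
</instance>"""

# Shortened form of the corrected data:
# quoted attribute values, everything wrapped in one root element
corrected = """<instances>
<instance ab='1'><a1>aa</a1></instance>
<instance ab='2'><b1>bb</b1></instance>
</instances>"""

print(is_well_formed(original))   # the strict parser rejects this
print(is_well_formed(corrected))  # the strict parser accepts this
```

Quoting the attributes alone would not be enough: two sibling `<instance>` elements without a common root are still rejected by a strict parser.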

If your data is formatted as valid XML, you can use XPath with xmlstarlet to get exactly what you want with a very concise command:

xmlstarlet sel -t -m '//instance' -c "./*" -n data.xml

This produces the following output:

<a1>aa</a1><a2>aa</a2>
<b1>bb</b1><b2>bb</b2>
<c1>cc</c1><c2>cc</c2>

Or you could use Python (my personal favorite choice). Here is a Python script that accomplishes the same task:

#!/usr/bin/env python3
"""extract_instance_children.py"""

import sys
import xml.etree.ElementTree as ET

# Load the data
tree = ET.parse(sys.argv[1])
root = tree.getroot()

# Extract and output the child elements of each "instance" element
for instance in root.iter("instance"):
    # encoding="unicode" makes tostring() return str rather than bytes;
    # strip() removes the whitespace that follows each child in the file
    print("".join(ET.tostring(child, encoding="unicode").strip() for child in instance))

Here is how to run the script (it needs Python 3):

python3 extract_instance_children.py data.xml

This uses the xml package from the Python standard library, which is also a strict XML parser.

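If you are not concerned with having properly formatted XML and just want to parse a text file that looks roughly like the one you presented, you can certainly get the result you want using only shell scripting and standard command-line tools. Here is an awk script (as requested):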

#!/usr/bin/env awk
# extract_instance_children.awk

BEGIN {
    addchild=0;
    children="";
}

{
    # Opening tag for "instance" element - set the "addchild" flag
    if($0 ~ "^ *<instance[^<>]+>") {
        addchild=1;
    }

    # Closing tag for "instance" element - reset "children" string and "addchild" flag, print children
    else if($0 ~ "^ *</instance>" && addchild == 1) {
        addchild=0;
        printf("%s\n", children);
        children="";
    }

    # Concatenating child elements - strip whitespace
    else if (addchild == 1) {
        gsub(/^[ \t]+/,"",$0);
        gsub(/[ \t]+$/,"",$0);
        children=children $0;
    }
}

To run the script from a file, use a command like this:

awk -f extract_instance_children.awk data.xml
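For readers who find the flag-based logic easier to follow in Python, the sketch below mirrors the same three-branch state machine: set a flag on an opening tag, emit and reset on a closing tag, and otherwise accumulate stripped lines while the flag is set. It is an illustrative re-expression, not part of the original answer; the function name `extract_children` and the inline sample are assumptions.

```python
import re

# Same patterns as the awk script
OPEN = re.compile(r"^ *<instance[^<>]+>")
CLOSE = re.compile(r"^ *</instance>")


def extract_children(lines):
    """Yield the concatenated child lines of each instance block (hypothetical helper)."""
    inside = False
    children = ""
    for line in lines:
        if OPEN.search(line):
            # Opening tag: start collecting
            inside = True
        elif CLOSE.search(line) and inside:
            # Closing tag: emit what was collected and reset
            inside = False
            yield children
            children = ""
        elif inside:
            # Child line: strip surrounding blanks and append
            children += line.strip(" \t\n")


sample = """<instance ab=1 >
  <a1>aa</a1>
  <a2>aa</a2>
</instance>
<instance ab=2 >
  <b1>bb</b1>
  <b2>bb</b2>
</instance>""".splitlines()

for row in extract_children(sample):
    print(row)
```

Like the awk version, this is purely line-oriented: it assumes each tag sits on its own line and does not parse XML in any real sense.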

And here is a Bash script that produces the desired output:

#!/bin/bash
# extract_instance_children.bash

# Keep track of whether or not we're inside of an "instance" element
instance=0

# Loop through the lines of the file
# (read trims leading/trailing whitespace; -r keeps backslashes literal)
while read -r line; do

    # Set the instance flag to true if we come across an opening tag
    if echo "${line}" | grep -q '<instance.*>'; then
        instance=1

    # Set the instance flag to false and print a newline if we come across a closing tag
    elif echo "${line}" | grep -q '</instance>'; then
        instance=0
        echo

    # If we're inside an instance tag then print the child element
    # (the '%s' format keeps any % characters in the data from being interpreted)
    elif [[ ${instance} == 1 ]]; then
        printf '%s' "${line}"
    fi

done < "${1}"

Run it like this:

bash extract_instance_children.bash data.xml

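Or, going back to Python once more, you could use the Beautiful Soup package. Beautiful Soup is far more flexible in its ability to parse invalid XML than the standard Python XML module (and every other XML parser I have come across). Here is a Python script that uses Beautiful Soup to achieve the desired result: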

#!/usr/bin/env python3
"""extract_instance_children.py"""

import sys
from bs4 import BeautifulSoup as Soup

with open(sys.argv[1], 'r') as xmlfile:
    soup = Soup(xmlfile.read(), "html.parser")

for instance in soup.find_all('instance'):
    # recursive=False restricts the search to direct child elements
    print(''.join(str(child) for child in instance.find_all(True, recursive=False)))

Isaac · Answer

This might help:

#!/bin/bash
awk -vtag=instance -vp=0 '{
    if($0~("^<"tag)){p=1;next}
    if($0~("^</"tag)){p=0;printf("\n");next}
    if(p==1){$1=$1;printf("%s",$0)}
}' infile

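I assume that the "Sample" text in your example is a mistake, and I am keeping this simple.

The p variable determines when to print. The assignment $1=$1 makes awk rebuild the record from its fields, which removes the leading spaces.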
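As a rough illustration of what `$1=$1` does, assuming awk's default field splitting and output separator: assigning to any field makes awk rejoin all fields with a single space, so leading and trailing blanks vanish and interior runs of whitespace collapse. The Python analogue below is only a sketch, and the function name `rebuild_record` is invented for it.

```python
def rebuild_record(line):
    """Approximate awk's $1=$1 under default settings (hypothetical helper)."""
    # split() with no arguments drops leading/trailing whitespace and
    # treats any run of whitespace as one separator, like awk's default FS
    return " ".join(line.split())


print(repr(rebuild_record("   <a1>aa</a1>  ")))
print(repr(rebuild_record("  <a1>a   a</a1>")))
```

The second call shows the side effect worth keeping in mind: whitespace inside an element's text is collapsed too, which is harmless for the sample data but may matter for other input.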