python-lxml: how do I get an element's children by tag name?

Question

I have an XML file like this:

    <page>
      <title>title1</title>
      <subtitle>subtitle</subtitle>
      <ns>0</ns>
      <id>1</id>
      <text>hello world!@</text>
    </page>
    <page>
      <title>title2</title>
      <ns>0</ns>
      <id>1</id>
      <text>hello world</text>
    </page>

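How can I get the text of each page? Right now I have a list of the page elements. The following code prints the text element of the second page, but not of the first. Is there a way to get a child element by tag name, something like element['text']?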

    for i in pages:
        print(i[3])  # fourth child: <id> on the first page, <text> on the second

Satish Garg · Answer

You can write code like this:

    from lxml import html

    xml = """<page>
      <title>title1</title>
      <subtitle>subtitle</subtitle>
      <ns>0</ns>
      <id>1</id>
      <text>hello world!@</text>
    </page>
    <page>
      <title>title2</title>
      <ns>0</ns>
      <id>1</id>
      <text>hello world</text>
    </page>"""

    root = html.fromstring(xml)
    # Select the text content of every <text> child of a <page>
    print(root.xpath('//page/text/text()'))

The result is:

['hello world!@', 'hello world']
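If you would rather stay with the XML parser, here is a minimal sketch of the same lookup using find-style methods. It assumes the fragments are wrapped in a single root element (a hypothetical <pages> wrapper, not part of the question's file), because two top-level <page> elements are not well-formed XML on their own. findtext() looks up a direct child by tag name, which is about as close as you get to element['text']. The import falls back to the standard library's ElementTree, which shares these methods.

```python
try:
    from lxml import etree
except ImportError:  # the standard library offers the same find()/findtext() API
    import xml.etree.ElementTree as etree

fragments = """<page> <title>title1</title> <subtitle>subtitle</subtitle> <ns>0</ns> <id>1</id> <text>hello world!@</text> </page>
<page> <title>title2</title> <ns>0</ns> <id>1</id> <text>hello world</text> </page>"""

# Assumption: wrap the fragments in a hypothetical <pages> root so they parse as XML
root = etree.fromstring("<pages>" + fragments + "</pages>")

# findtext("text") returns the text of the first direct child named <text>,
# regardless of its position among the other children
page_texts = [page.findtext("text") for page in root.findall("page")]
print(page_texts)  # ['hello world!@', 'hello world']
```

Because the lookup is by name, the extra <subtitle> in the first page no longer shifts the result the way an index such as i[3] does.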

Wolfgang Fahl · Answer

To keep things simple, I use a "Node" helper class that returns a dict.

    class Node():
        @staticmethod
        def childTexts(node):
            # Map each direct child's tag name to its text
            texts = {}
            for child in list(node):
                texts[child.tag] = child.text
            return texts

Usage example:

    from lxml import etree

    xml = """<pages>
      <page>
        <title>title1</title>
        <subtitle>subtitle</subtitle>
        <ns>0</ns>
        <id>1</id>
        <text>hello world!@</text>
      </page>
      <page>
        <title>title2</title>
        <ns>0</ns>
        <id>1</id>
        <text>hello world</text>
      </page>
    </pages>
    """

    root = etree.fromstring(xml)
    for node in root.xpath('//page'):
        texts = Node.childTexts(node)
        print(texts)

Result:

    {'title': 'title1', 'subtitle': 'subtitle', 'ns': '0', 'id': '1', 'text': 'hello world!@'}
    {'title': 'title2', 'ns': '0', 'id': '1', 'text': 'hello world'}
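The same mapping can be sketched as a dict comprehension, which then allows the element['text']-style access the question asks about. The function name child_texts and the single-page sample are illustrative assumptions, not part of the answer above. Note that with either version a repeated child tag overwrites the earlier entry, so this only suits pages where each tag appears once.

```python
try:
    from lxml import etree
except ImportError:  # the standard library's ElementTree iterates children the same way
    import xml.etree.ElementTree as etree


def child_texts(node):
    """Map each direct child's tag name to its text (hypothetical helper)."""
    return {child.tag: child.text for child in node}


page = etree.fromstring(
    "<page><title>title2</title><ns>0</ns><id>1</id><text>hello world</text></page>"
)
texts = child_texts(page)

print(texts["text"])          # hello world
print(texts.get("subtitle"))  # None, because this page has no <subtitle>
```

Using .get() for optional tags such as subtitle avoids a KeyError on pages that lack them.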

mj_whales · Answer

This tutorial helped me with a similar task:

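Each iteration looks for the tags named 'id' and 'text'. If a tag is not found, the string 'None' is returned instead. The result of each iteration is then appended to a list, which you can print in a form similar to a data frame.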

    import lxml.etree as ET

    # Assumes xml holds the <pages> document shown in the previous answer
    root = ET.fromstring(xml)

    # Initialise a list to append results to
    list_of_results = []

    # Loop through the pages to search for text
    for page in root:
        id = page.findtext('id', default='None')
        text = page.findtext('text', default='None')
        list_of_results.append([id, text])

    # Print list
    print(list_of_results)

Result:

[['1', 'hello world!@'], ['1', 'hello world']]

If you only want to print the text, just remove the id line (and drop id from the list you append).
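A text-only variant might look like the sketch below. The sample document is made up for illustration: its second page deliberately has no <text> child, to show the default being returned.

```python
try:
    from lxml import etree as ET
except ImportError:  # the standard library's findtext() takes the same default argument
    import xml.etree.ElementTree as ET

# Made-up sample: the second page has no <text> element
root = ET.fromstring(
    "<pages>"
    "<page><id>1</id><text>hello world!@</text></page>"
    "<page><id>2</id></page>"
    "</pages>"
)

# Collect only the text of each page; missing tags yield the string 'None'
texts = [page.findtext("text", default="None") for page in root]
print(texts)  # ['hello world!@', 'None']
```

Leaving out default= makes findtext() return the None object rather than a string, which can be easier to filter on afterwards.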