PDFファイルからテキストとテキスト座標を抽出する方法は？

Question

PDFMinerを使用してPDFファイルからすべてのテキストボックスとテキストボックスの座標を抽出します。

他の多くのスタックオーバーフローの投稿では、すべてのテキストを順序どおりに抽出する方法を扱っていますが、テキストとテキストの場所を取得する中間ステップを実行するにはどうすればよいですか？

PDFファイルを指定すると、出力は次のようになります。

 489, 41, "Signature" 500, 52, "b" 630, 202, "a_g_i_r"

pnj · Accepted Answer

改行は最終出力でアンダースコアに変換されます。これは私が見つけた最小限の実用的な解決策です。

from pdfminer.pdfparser import PDFParser from pdfminer.pdfdocument import PDFDocument from pdfminer.pdfpage import PDFPage from pdfminer.pdfpage import PDFTextExtractionNotAllowed from pdfminer.pdfinterp import PDFResourceManager from pdfminer.pdfinterp import PDFPageInterpreter from pdfminer.pdfdevice import PDFDevice from pdfminer.layout import LAParams from pdfminer.converter import PDFPageAggregator import pdfminer # Open a PDF file. fp = open('/Users/me/Downloads/test.pdf', 'rb') # Create a PDF parser object associated with the file object. parser = PDFParser(fp) # Create a PDF document object that stores the document structure. # Password for initialization as 2nd parameter document = PDFDocument(parser) # Check if the document allows text extraction. If not, abort. if not document.is_extractable: raise PDFTextExtractionNotAllowed # Create a PDF resource manager object that stores shared resources. rsrcmgr = PDFResourceManager() # Create a PDF device object. device = PDFDevice(rsrcmgr) # BEGIN LAYOUT ANALYSIS # Set parameters for analysis. laparams = LAParams() # Create a PDF page aggregator object. device = PDFPageAggregator(rsrcmgr, laparams=laparams) # Create a PDF interpreter object. interpreter = PDFPageInterpreter(rsrcmgr, device) def parse_obj(lt_objs): # loop over the object list for obj in lt_objs: # if it's a textbox, print text and location if isinstance(obj, pdfminer.layout.LTTextBoxHorizontal): print "%6d, %6d, %s" % (obj.bbox[0], obj.bbox[1], obj.get_text().replace('
', '_')) # if it's a container, recurse Elif isinstance(obj, pdfminer.layout.LTFigure): parse_obj(obj._objs) # loop over all pages in the document for page in PDFPage.create_pages(document): # read the page into a layout object interpreter.process_page(page) layout = device.get_result() # extract text from this object parse_obj(layout._objs)

Mark Amery · Answer

PDF内のすべてのテキストブロックの左上隅をリストするコピーアンドペースト対応の例を次に示します。これは、「フォームXオブジェクト」を含まないPDFで機能するはずです。それらにテキストがあります：

_from pdfminer.layout import LAParams, LTTextBox from pdfminer.pdfpage import PDFPage from pdfminer.pdfinterp import PDFResourceManager from pdfminer.pdfinterp import PDFPageInterpreter from pdfminer.converter import PDFPageAggregator fp = open('yourpdf.pdf', 'rb') rsrcmgr = PDFResourceManager() laparams = LAParams() device = PDFPageAggregator(rsrcmgr, laparams=laparams) interpreter = PDFPageInterpreter(rsrcmgr, device) pages = PDFPage.get_pages(fp) for page in pages: print('Processing next page...') interpreter.process_page(page) layout = device.get_result() for lobj in layout: if isinstance(lobj, LTTextBox): x, y, text = lobj.bbox[0], lobj.bbox[3], lobj.get_text() print('At %r is text: %s' % ((x, y), text)) _

上記のコードは、PDFMinerドキュメントのレイアウト分析の実行の例に加えて、pnj（ https://stackoverflow.com/a/22898159/1709587 ）とMattによる例に基づいています。スウェイン（ https://stackoverflow.com/a/25262470/1709587 ）。これらの前の例からいくつかの変更を加えました。

私はPDFPage.get_pages()を使用します。これは、ドキュメントを作成し、_is_extractable_をチェックしてPDFPage.create_pages()に渡すための省略表現です
PDFMinerは現在、テキスト内のテキストを適切に処理することができないため、LTFiguresを処理する必要はありません。

LAParamsを使用すると、PDFの個々の文字をPDFMinerによって行とテキストボックスに魔法のようにグループ化する方法を制御するいくつかのパラメーターを設定できます。このようなグループ化がまったく発生する必要のあるものであることに驚いた場合は、 pdf2txt docs で正当化されます。

実際のPDFファイルでは、オーサリングソフトウェアによっては、実行中にテキスト部分がいくつかのチャンクに分割される場合があります。したがって、テキスト抽出では、テキストチャンクをスプライスする必要があります。

LAParamsのパラメーターは、ほとんどのPDFMinerと同様にドキュメント化されていませんが、ソースコードで確認できますまたはPythonシェルでhelp(LAParams)を呼び出すことで確認できます。パラメータのsomeの意味は、 https://pdfminer-docs.readthedocs.io/pdfminer_index.html#pdf2txt-py に記載されています。コマンドラインで_pdf2text_に引数として渡すこともできます。

上記のlayoutオブジェクトはLTPageであり、「レイアウトオブジェクト」の反復可能オブジェクトです。これらのレイアウトオブジェクトはそれぞれ、次のタイプのいずれかになります...

LTTextBox
LTFigure
LTImage
LTLine
LTRect

...またはそのサブクラス。（特に、テキストボックスはおそらくすべてLTTextBoxHorizontalsになります。）

LTPageの構造の詳細は、次のドキュメントの画像に示されています。

Tree diagram of the structure of an <code>LTPage</code>. Of relevance to this answer: it shows that an <code>LTPage</code> contains the 5 types listed above, and that an <code>LTTextBox</code> contains <code>LTTextLine</code>s plus unspecified other stuff, and that an <code>LTTextLine</code> contains <code>LTChar</code>s, <code>LTAnno</code>s, <code>LTText</code>s, and unspecified other stuff.

上記の各タイプには、（x0、y0、/を保持する_.bbox_プロパティがありますx1、y1）オブジェクトの左、下、右、上それぞれの座標を含むタプル。 y座標は、ページのbottomからの距離として与えられます。代わりに、上から下に行くY軸で作業する方が便利な場合は、ページの_.mediabox_の高さからY軸を引くことができます。

_x0, y0, x1, y1 = some_lobj.bbox y0 = page.mediabox[3] - y1 y1 = page.mediabox[3] - y0 _

bboxに加えて、LTTextBoxesには、上記の.get_text()メソッドもあり、テキストコンテンツを文字列として返します。各LTTextBoxは、LTChars（PDFによって明示的に描画された文字、bbox）およびLTAnnos（PDFMinerがテキストボックスのコンテンツの文字列表現に追加する余分なスペース）のコレクションであることに注意してください。 bboxがありません）。

この回答の冒頭のコード例では、これら2つのプロパティを組み合わせて、テキストの各ブロックの座標を示しています。

最後に、unlike上記の他のスタックオーバーフローの回答ですが、LTFiguresに再帰することはありません。 LTFiguresにはテキストを含めることができますが、PDFMinerはそのテキストをLTTextBoxesにグループ化できないようです（の例PDFを試してみてくださいhttps://stackoverflow.com/a/27104504/1709587 ）代わりに、LTFigureオブジェクトを直接含むLTCharを生成します。原則として、これらを文字列にまとめる方法を理解することはできますが、PDFMiner（バージョン20181108現在）ではそれを行うことができません。

しかし、うまくいけば、解析する必要のあるPDFは、テキストが含まれているフォームXObjectsを使用しないので、この警告は当てはまりません。