Retrieving subfolder names in an S3 bucket with boto3

Question

Using boto3, I can access my AWS S3 bucket:

    import boto3

    s3 = boto3.resource('s3')
    bucket = s3.Bucket('my-bucket-name')

The bucket currently contains a folder first-level, which itself contains several subfolders named with a timestamp, for instance 1456753904534. I need to know the names of these subfolders for another job I am doing, and I wonder whether boto3 can retrieve them for me.

So I tried:

    objs = bucket.meta.client.list_objects(Bucket='my-bucket-name')

This gives a dictionary whose key 'Contents' lists all the third-level files instead of the second-level timestamp directories. In fact, I get a list containing entries such as:

    {u'ETag': '"etag"', u'Key': 'first-level/1456753904534/part-00014', u'LastModified': datetime.datetime(2016, 2, 29, 13, 52, 24, tzinfo=tzutc()),
     u'Owner': {u'DisplayName': 'owner', u'ID': 'id'},
     u'Size': size, u'StorageClass': 'storageclass'}

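You can see that the specific files, in this case part-00014, are retrieved, while I would like to get only the names of the directories. In principle I could strip the directory name out of every path, but it is ugly and expensive to retrieve everything at the third level just to get the second level!

I also tried something reported here:

    for o in bucket.objects.filter(Delimiter='/'):
        print(o.key)

but I do not get the folders at the desired level.

Is there a way to solve this?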

mootmoot · Accepted Answer

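S3 is an object store; it has no real directory structure. The "/" is rather cosmetic. One reason people want a directory structure is that it lets them maintain, prune, or add a tree in their application. For S3, you treat such a structure as a sort of index or search tag.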
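To make the "cosmetic slash" point concrete, here is a minimal sketch in plain Python with no AWS calls. It models a bucket as a flat list of key strings and shows that a "folder" is nothing more than a shared prefix. All keys except the first are made up for illustration, and `keys_under` is a hypothetical helper name.

```python
# A bucket is conceptually a flat collection of key strings.
# Only the first key comes from the question; the others are hypothetical.
bucket_keys = [
    "first-level/1456753904534/part-00014",
    "first-level/1456753904534/part-00015",
    "first-level/1456753999999/part-00000",
    "other/readme.txt",
]


def keys_under(keys, prefix):
    """Return the keys that start with prefix; that is all a 'folder' is."""
    return [k for k in keys if k.startswith(prefix)]


under_first_level = keys_under(bucket_keys, "first-level/")
print(under_first_level)
```

Nothing in this model stores "first-level/" as a separate entity, which is why listing calls are expressed in terms of prefixes.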

To manipulate objects in S3 you need boto3.client or boto3.resource. For example, to list all objects:

    import boto3

    s3 = boto3.client("s3")
    all_objects = s3.list_objects(Bucket='bucket-name')

http://boto3.readthedocs.org/en/latest/reference/services/s3.html#S3.Client.list_objects

In fact, if the S3 object names are stored using the '/' separator, you can use the Python os.path functions to extract the folder prefix.

    import os

    s3_key = 'first-level/1456753904534/part-00014'
    filename = os.path.basename(s3_key)
    foldername = os.path.dirname(s3_key)

    # if you are not using a conventional delimiter, e.g. '#'
    s3_key = 'first-level#1456753904534#part-00014'
    filename = s3_key.split("#")[-1]
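As a small variation, the sketch below uses `posixpath` instead of `os.path`, on the assumption that the code might also run on Windows, where `os.path` treats backslashes as separators. It collects the distinct immediate parent "folder" name of each key. The second and third keys and the helper name `parent_folder_names` are hypothetical, and this approach still needs the full key listing, so it only suits small prefixes.

```python
import posixpath


def parent_folder_names(keys):
    """Return the distinct immediate parent 'folder' name of each key, sorted."""
    return sorted({posixpath.basename(posixpath.dirname(key)) for key in keys})


keys = [
    "first-level/1456753904534/part-00014",
    "first-level/1456753904534/part-00015",  # hypothetical
    "first-level/1456754000000/part-00000",  # hypothetical
]
print(parent_folder_names(keys))
```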

A reminder about boto3: boto3.resource is a nice high-level API. There are pros and cons to using boto3.client versus boto3.resource. If you develop an internal shared library, boto3.resource gives you a black-box layer over the resources being used.

Dipankar · Answer

The code below returns only the "subfolders" inside a "folder" of an S3 bucket.

    import boto3

    bucket = 'my-bucket'
    # Make sure the prefix ends with a slash
    prefix = 'prefix-name-with-slash/'

    client = boto3.client('s3')
    result = client.list_objects(Bucket=bucket, Prefix=prefix, Delimiter='/')
    # 'CommonPrefixes' is absent when there are no subfolders
    for o in result.get('CommonPrefixes', []):
        print('sub folder : ', o.get('Prefix'))

For more details, see https://github.com/boto/boto3/issues/134.
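A single listing call returns at most 1000 entries, so a folder with many subfolders needs pagination. The following is a sketch of the same idea using the `list_objects_v2` paginator. The function name `iter_subfolders` is made up, and `client` is assumed to be a `boto3.client('s3')` passed in by the caller.

```python
def iter_subfolders(client, bucket, prefix):
    """Yield each 'subfolder' prefix directly under prefix, page by page.

    client is assumed to behave like boto3.client('s3').
    """
    paginator = client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix, Delimiter="/"):
        # 'CommonPrefixes' is missing from a page that has no subfolders.
        for entry in page.get("CommonPrefixes", []):
            yield entry["Prefix"]
```

With a real client this would be used as `for name in iter_subfolders(boto3.client('s3'), 'my-bucket', 'prefix-name-with-slash/'): print(name)`.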

itz-azhar · Answer

It took me a long time to figure out, but here is a simple way to list the contents of a subfolder in an S3 bucket using boto3. Hope it helps.

    import boto3

    prefix = "folderone/foldertwo/"
    s3 = boto3.resource('s3')
    bucket = s3.Bucket(name="bucket_name_here")
    FilesNotFound = True
    for obj in bucket.objects.filter(Prefix=prefix):
        print('{0}:{1}'.format(bucket.name, obj.key))
        FilesNotFound = False
    if FilesNotFound:
        print("ALERT", "No file in {0}/{1}".format(bucket.name, prefix))

Pierre D · Answer

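Short answer:

Use Delimiter='/'. This avoids a recursive listing of your bucket. Some answers here wrongly suggest doing a full listing and using string manipulation to retrieve the directory names. This could be horribly inefficient. Remember that S3 has virtually no limit on the number of objects a bucket can contain. So imagine that, between bar/ and foo/, you have a trillion objects: you would wait a very long time to get ['bar/', 'foo/'].
Use Paginators. For the same reason (S3 is an engineer's approximation of infinity), you must list through pages and avoid storing the whole listing in memory. Instead, consider your "lister" as an iterator, and handle the stream it produces.
Use boto3.client, not boto3.resource. The resource version does not seem to handle the Delimiter option well. If you have a resource, say bucket = boto3.resource('s3').Bucket(name), you can get the corresponding client with bucket.meta.client.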
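To illustrate what Delimiter='/' asks S3 to do, here is a local emulation in plain Python. It is only a model of the documented grouping behaviour, not S3's implementation, and the function name `group_by_delimiter` and the sample keys are made up. The point is that keys continuing past the delimiter are rolled up into one common prefix each, so the trillion objects under bar/ would collapse into the single entry 'bar/'.

```python
def group_by_delimiter(keys, prefix="", delimiter="/"):
    """Emulate how a listing with Prefix and Delimiter splits keys.

    Returns (contents, common_prefixes): keys directly under prefix, and
    the distinct prefixes ending at the first delimiter after prefix.
    """
    contents, common_prefixes = [], []
    for key in sorted(keys):
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        head, sep, _ = rest.partition(delimiter)
        if sep:  # the key continues past a delimiter: roll it up
            rolled = prefix + head + delimiter
            if rolled not in common_prefixes:
                common_prefixes.append(rolled)
        else:
            contents.append(key)
    return contents, common_prefixes


keys = ["bar/1", "bar/2", "foo/1", "top.txt"]
print(group_by_delimiter(keys))
```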

Long answer:

The following is an iterator that I use for simple buckets (no version handling).

    import os
    from collections import namedtuple
    from operator import attrgetter

    import boto3

    S3Obj = namedtuple('S3Obj', ['key', 'mtime', 'size', 'ETag'])


    def s3list(bucket, path, start=None, end=None, recursive=True, list_dirs=True,
               list_objs=True, limit=None):
        """
        Iterator that lists a bucket's objects under path, (optionally) starting with
        start and ending before end.

        If recursive is False, then list only the "depth=0" items (dirs and objects).

        If recursive is True, then list recursively all objects (no dirs).

        Args:
            bucket:
                a boto3.resource('s3').Bucket().
            path:
                a directory in the bucket.
            start:
                optional: start key, inclusive (may be a relative path under path, or
                absolute in the bucket)
            end:
                optional: stop key, exclusive (may be a relative path under path, or
                absolute in the bucket)
            recursive:
                optional, default True. If True, lists only objects. If False, lists
                only depth 0 "directories" and objects.
            list_dirs:
                optional, default True. Has no effect in recursive listing. On
                non-recursive listing, if False, then directories are omitted.
            list_objs:
                optional, default True. If False, then objects are omitted.
            limit:
                optional. If specified, then lists at most this many items.

        Returns:
            an iterator of S3Obj.

        Examples:
            # set up
            >>> s3 = boto3.resource('s3')
            ... bucket = s3.Bucket(name)

            # iterate through all S3 objects under some dir
            >>> for p in s3list(bucket, 'some/dir'):
            ...     print(p)

            # iterate through up to 20 S3 objects under some dir, starting with foo_0010
            >>> for p in s3list(bucket, 'some/dir', limit=20, start='foo_0010'):
            ...     print(p)

            # non-recursive listing under some dir:
            >>> for p in s3list(bucket, 'some/dir', recursive=False):
            ...     print(p)

            # non-recursive listing under some dir, listing only dirs:
            >>> for p in s3list(bucket, 'some/dir', recursive=False, list_objs=False):
            ...     print(p)
        """
        kwargs = dict()
        if start is not None:
            if not start.startswith(path):
                start = os.path.join(path, start)
            # note: need to use a string just smaller than start, because
            # the list_object API specifies that start is excluded (the first
            # result is *after* start).
            kwargs.update(Marker=__prev_str(start))
        if end is not None:
            if not end.startswith(path):
                end = os.path.join(path, end)
        if not recursive:
            kwargs.update(Delimiter='/')
            if not path.endswith('/'):
                path += '/'
        kwargs.update(Prefix=path)
        if limit is not None:
            kwargs.update(PaginationConfig={'MaxItems': limit})

        paginator = bucket.meta.client.get_paginator('list_objects')
        for resp in paginator.paginate(Bucket=bucket.name, **kwargs):
            q = []
            if 'CommonPrefixes' in resp and list_dirs:
                q = [S3Obj(f['Prefix'], None, None, None) for f in resp['CommonPrefixes']]
            if 'Contents' in resp and list_objs:
                q += [S3Obj(f['Key'], f['LastModified'], f['Size'], f['ETag']) for f in resp['Contents']]
            # note: even with sorted lists, it is faster to sort(a+b)
            # than heapq.merge(a, b) at least up to 10K elements in each list
            q = sorted(q, key=attrgetter('key'))
            if limit is not None:
                q = q[:limit]
                limit -= len(q)
            for p in q:
                if end is not None and p.key >= end:
                    return
                yield p


    def __prev_str(s):
        # Build a string that sorts just before s, to use as an exclusive Marker.
        if len(s) == 0:
            return s
        s, c = s[:-1], ord(s[-1])
        if c > 0:
            s += chr(c - 1)
        s += ''.join(['\u7FFF' for _ in range(10)])
        return s

Test:

The following is helpful for testing the behavior of the paginator and list_objects. It creates a number of dirs and files. Since pages hold up to 1000 entries, we use a multiple of that for dirs and files. dirs contains only directories (each having one object). mixed contains a mix of dirs and objects, with a ratio of 2 objects for each dir (plus one object under each dir, of course; S3 stores only objects).

    import concurrent.futures
    import os

    def genkeys(top='tmp/test', n=2000):
        for k in range(n):
            if k % 100 == 0:
                print(k)
            for name in [
                os.path.join(top, 'dirs', f'{k:04d}_dir', 'foo'),
                os.path.join(top, 'mixed', f'{k:04d}_dir', 'foo'),
                os.path.join(top, 'mixed', f'{k:04d}_foo_a'),
                os.path.join(top, 'mixed', f'{k:04d}_foo_b'),
            ]:
                yield name

    # bucket is a boto3.resource('s3').Bucket(), as above
    with concurrent.futures.ThreadPoolExecutor(max_workers=32) as executor:
        executor.map(lambda name: bucket.put_object(Key=name, Body='hi\n'.encode()), genkeys())

The resulting structure is:

    ./dirs/0000_dir/foo
    ./dirs/0001_dir/foo
    ./dirs/0002_dir/foo
    ...
    ./dirs/1999_dir/foo
    ./mixed/0000_dir/foo
    ./mixed/0000_foo_a
    ./mixed/0000_foo_b
    ./mixed/0001_dir/foo
    ./mixed/0001_foo_a
    ./mixed/0001_foo_b
    ./mixed/0002_dir/foo
    ./mixed/0002_foo_a
    ./mixed/0002_foo_b
    ...
    ./mixed/1999_dir/foo
    ./mixed/1999_foo_a
    ./mixed/1999_foo_b

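By doctoring the code given above for s3list a little to inspect the responses from the paginator, you can observe some fun facts:

The Marker is really exclusive. Given Marker=topdir + 'mixed/0500_foo_a', the listing starts after that key (as per the AmazonS3 API), i.e., with .../mixed/0500_foo_b. That is the reason for __prev_str().
Using Delimiter, when listing mixed/, each response from the paginator contains 666 keys and 334 common prefixes. It is pretty good at not building enormous responses.
By contrast, when listing dirs/, each response from the paginator contains 1000 common prefixes (and no keys).
Passing a limit in the form of PaginationConfig={'MaxItems': limit} limits only the number of keys, not the common prefixes. We deal with that by further truncating the stream of our iterator.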
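The last point, truncating the stream on the consumer side, can also be sketched with `itertools.islice` rather than the manual counter used in s3list. This is a swapped-in technique for illustration, not the answer's own code. `take` and `fake_listing` are hypothetical names, and the generator merely stands in for a lazy listing that mixes common prefixes and keys.

```python
from itertools import islice


def take(stream, limit=None):
    """Yield at most limit items from stream; limit=None means no cap."""
    return islice(stream, limit)


def fake_listing():
    """Stand-in for a lazy listing that mixes prefixes and keys."""
    for i in range(1_000_000):
        yield f"mixed/{i:07d}_dir/" if i % 3 == 0 else f"mixed/{i:07d}_foo"


first_five = list(take(fake_listing(), 5))
print(first_five)
```

Because both the generator and `islice` are lazy, only the first five items are ever produced here, which mirrors the aim of not materializing the whole listing.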

Sophie Muspratt · Answer

I had the same issue but managed to resolve it using boto3.client and list_objects_v2 with the Bucket and StartAfter parameters.

    import boto3

    s3client = boto3.client('s3')
    bucket = 'my-bucket-name'
    startAfter = 'firstlevelFolder/secondLevelFolder'

    theobjects = s3client.list_objects_v2(Bucket=bucket, StartAfter=startAfter)
    for obj in theobjects['Contents']:
        print(obj['Key'])

The output of the code above would display the following:

    firstlevelFolder/secondLevelFolder/item1
    firstlevelFolder/secondLevelFolder/item2

Boto3 list_objects_v2 Documentation

To extract only the directory name secondLevelFolder, I used the Python method split():

    import boto3

    s3client = boto3.client('s3')
    bucket = 'my-bucket-name'
    startAfter = 'firstlevelFolder/secondLevelFolder'

    theobjects = s3client.list_objects_v2(Bucket=bucket, StartAfter=startAfter)
    for obj in theobjects['Contents']:
        directoryName = obj['Key'].split('/')
        print(directoryName[1])

The output of the code above would display the following:

    secondLevelFolder
    secondLevelFolder

Python split() documentation

If you would like to get the directory name and the content item name, replace the print line with the following:

    print("{}/{}".format(directoryName[1], directoryName[2]))

And the following would be output:

    secondLevelFolder/item1
    secondLevelFolder/item2

Hope this helps.
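One thing to keep in mind is that StartAfter only sets where the listing begins, so keys that sort after the folder can also come back. The sketch below works on a response-shaped list of dicts, keeps only keys under the wanted first level, and drops the repeated directory names. The function name `second_level_names` and the 'zzz/unrelated' key are hypothetical.

```python
def second_level_names(contents, first_level):
    """Return distinct second-level names under first_level, in listing order."""
    names = {}
    for obj in contents:
        parts = obj["Key"].split("/")
        # Need at least first-level/second-level/item, under the right first level.
        if len(parts) >= 3 and parts[0] == first_level:
            names[parts[1]] = None  # dict keeps insertion order and drops repeats
    return list(names)


contents = [
    {"Key": "firstlevelFolder/secondLevelFolder/item1"},
    {"Key": "firstlevelFolder/secondLevelFolder/item2"},
    {"Key": "zzz/unrelated"},  # hypothetical key that sorts after the folder
]
print(second_level_names(contents, "firstlevelFolder"))
```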

CpILL · Answer

The big realisation with S3 is that there are no folders or directories, just keys. The apparent folder structure is simply prepended to the filename to become the 'Key', so to list the contents of some/path/to/the/file/ in myBucket you can try:

    import boto3

    s3 = boto3.client('s3')
    for obj in s3.list_objects_v2(Bucket="myBucket", Prefix="some/path/to/the/file/")['Contents']:
        print(obj['Key'])

which would give you something like:

    some/path/to/the/file/yo.jpg
    some/path/to/the/file/meAndYou.gif
    ...
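Indexing the response with ['Contents'] raises a KeyError when nothing matches the prefix, because boto3 omits that key from an empty result. A slightly more defensive sketch follows. `list_keys` is a made-up name, and `client` is assumed to be a `boto3.client('s3')` supplied by the caller.

```python
def list_keys(client, bucket, prefix):
    """Return the keys under prefix from a single list_objects_v2 call.

    A single call returns at most 1000 keys; client is assumed to
    behave like boto3.client('s3').
    """
    response = client.list_objects_v2(Bucket=bucket, Prefix=prefix)
    # 'Contents' is absent from the response when nothing matches.
    return [obj["Key"] for obj in response.get("Contents", [])]
```

With a real client this would be called as `list_keys(boto3.client('s3'), 'myBucket', 'some/path/to/the/file/')`.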

cem · Answer

The following works for me. S3 objects:

    s3://bucket/
        form1/
           section11/
              file111
              file112
           section12/
              file121
        form2/
           section21/
              file211
              file112
           section22/
              file221
              file222
              ...
           ...
        ...

Using:

    from boto3.session import Session

    session = Session()
    s3client = session.client('s3')
    resp = s3client.list_objects(Bucket=bucket, Prefix='', Delimiter="/")
    forms = [x['Prefix'] for x in resp['CommonPrefixes']]

we get:

    form1/
    form2/
    ...

With:

    resp = s3client.list_objects(Bucket=bucket, Prefix='form1/', Delimiter="/")
    sections = [x['Prefix'] for x in resp['CommonPrefixes']]

we get:

    form1/section11/
    form1/section12/
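The two calls above descend one level at a time. As a rough sketch, the same step can be repeated recursively to walk several levels. `walk_prefixes` and its `max_depth` parameter are hypothetical, and `client` is assumed to be an S3 client like the `s3client` above.

```python
def walk_prefixes(client, bucket, prefix="", max_depth=2):
    """Yield every 'folder' prefix under prefix, down to max_depth levels.

    client is assumed to behave like boto3.client('s3').
    """
    if max_depth <= 0:
        return
    paginator = client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix, Delimiter="/"):
        for entry in page.get("CommonPrefixes", []):
            yield entry["Prefix"]
            # Descend into this prefix before moving on to its siblings.
            yield from walk_prefixes(client, bucket, entry["Prefix"], max_depth - 1)
```

Each level costs at least one request per prefix, so this is cheap for shallow trees but can add up for wide ones.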

Paul Zielinski · Answer

The AWS CLI does this (presumably without fetching and iterating through all keys in the bucket) when you run aws s3 ls s3://my-bucket/, so I figured there must be a way using boto3.

https://github.com/aws/aws-cli/blob/0fedc4c1b6a7aee13e2ed10c3ada778c702c22c3/awscli/customizations/s3/subcommands.py#L499

It looks like they indeed use Prefix and Delimiter. By modifying that code a bit, I was able to write a function that gets all directories at the root level of a bucket:

    import boto3


    def list_folders_in_bucket(bucket):
        paginator = boto3.client('s3').get_paginator('list_objects')
        folders = []
        iterator = paginator.paginate(Bucket=bucket, Prefix='', Delimiter='/', PaginationConfig={'PageSize': None})
        for response_data in iterator:
            prefixes = response_data.get('CommonPrefixes', [])
            for prefix in prefixes:
                prefix_name = prefix['Prefix']
                if prefix_name.endswith('/'):
                    folders.append(prefix_name.rstrip('/'))
        return folders

Pirheas · Answer

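First of all, there is no real folder concept in S3. You can definitely have a file at '/folder/subfolder/myfile.txt' with neither the folder nor the subfolder existing.

To "simulate" a folder in S3, you must create an empty file with a '/' at the end of its name (see Amazon S3 boto - how to create a folder?).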
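In boto3 terms, that "empty file with a trailing slash" could be sketched as follows. `create_folder_marker` is a hypothetical helper, and `client` is assumed to be a `boto3.client('s3')` passed in by the caller.

```python
def create_folder_marker(client, bucket, folder):
    """Create a zero-byte object whose key ends with '/'.

    client is assumed to behave like boto3.client('s3').
    Returns the key that was written.
    """
    key = folder if folder.endswith("/") else folder + "/"
    client.put_object(Bucket=bucket, Key=key, Body=b"")
    return key
```

The marker is an ordinary object, so it appears in listings like any other key that starts with the same prefix.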

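For your problem, you should probably use the method get_all_keys with two parameters, prefix and delimiter (this is the Bucket API of the legacy boto library, as the link below shows):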

https://github.com/boto/boto/blob/develop/boto/s3/bucket.py#L427

    for key in bucket.get_all_keys(prefix='first-level/', delimiter='/'):
        print(key.name)

Acumenus · Answer

Using `boto3.resource`

This builds upon the answer by itz-azhar to apply an optional limit. It is obviously substantially simpler to use than the boto3.client version.

    import logging
    from typing import List, Optional

    import boto3
    from boto3_type_annotations.s3 import ObjectSummary  # pip install boto3_type_annotations

    log = logging.getLogger(__name__)


    def s3_list(bucket_name: str, prefix: str, *, limit: Optional[int] = None) -> List[ObjectSummary]:
        """Return a list of S3 object summaries."""
        # Ref: https://stackoverflow.com/a/57718002/
        return list(boto3.resource("s3").Bucket(bucket_name).objects.limit(count=limit).filter(Prefix=prefix))


    if __name__ == "__main__":
        s3_list("noaa-gefs-pds", "gefs.20190828/12/pgrb2a", limit=10_000)

Using `boto3.client`

This uses list_objects_v2 and builds upon the answer by CpILL to allow retrieving more than 1000 objects.

    import logging
    from typing import cast, List

    import boto3

    log = logging.getLogger(__name__)


    def s3_list(bucket_name: str, prefix: str, *, limit: int = cast(int, float("inf"))) -> List[dict]:
        """Return a list of S3 object summaries."""
        # Ref: https://stackoverflow.com/a/57718002/
        s3_client = boto3.client("s3")
        contents: List[dict] = []
        continuation_token = None
        if limit <= 0:
            return contents
        while True:
            max_keys = min(1000, limit - len(contents))
            request_kwargs = {"Bucket": bucket_name, "Prefix": prefix, "MaxKeys": max_keys}
            if continuation_token:
                log.info(  # type: ignore
                    "Listing %s objects in s3://%s/%s using continuation token ending with %s with %s objects listed thus far.",
                    max_keys, bucket_name, prefix, continuation_token[-6:], len(contents))  # pylint: disable=unsubscriptable-object
                response = s3_client.list_objects_v2(**request_kwargs, ContinuationToken=continuation_token)
            else:
                log.info("Listing %s objects in s3://%s/%s with %s objects listed thus far.", max_keys, bucket_name, prefix, len(contents))
                response = s3_client.list_objects_v2(**request_kwargs)
            assert response["ResponseMetadata"]["HTTPStatusCode"] == 200
            # 'Contents' is absent when nothing matches the prefix
            contents.extend(response.get("Contents", []))
            is_truncated = response["IsTruncated"]
            if (not is_truncated) or (len(contents) >= limit):
                break
            continuation_token = response["NextContinuationToken"]
        assert len(contents) <= limit
        log.info("Returning %s objects from s3://%s/%s.", len(contents), bucket_name, prefix)
        return contents


    if __name__ == "__main__":
        s3_list("noaa-gefs-pds", "gefs.20190828/12/pgrb2a", limit=10_000)
