Saving a pandas DataFrame with h5py for interoperability with other HDF5 readers

Question

Here is a sample DataFrame:

    import pandas as pd

    NaN = float('nan')
    ID = [1, 2, 3, 4, 5, 6, 7]
    A = [NaN, NaN, NaN, 0.1, 0.1, 0.1, 0.1]
    B = [0.2, NaN, 0.2, 0.2, 0.2, NaN, NaN]
    C = [NaN, 0.5, 0.5, NaN, 0.5, 0.5, NaN]
    columns = {'A':A, 'B':B, 'C':C}
    df = pd.DataFrame(columns, index=ID)
    df.index.name = 'ID'
    print(df)

          A    B    C
    ID
    1   NaN  0.2  NaN
    2   NaN  NaN  0.5
    3   NaN  0.2  0.5
    4   0.1  0.2  NaN
    5   0.1  0.2  0.5
    6   0.1  NaN  0.5
    7   0.1  NaN  NaN

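pandas has a PyTables-based HDFStore, which is an easy way to serialize and deserialize data frames efficiently. However, those datasets are not so easy to load directly with another reader such as h5py or MATLAB. How can I save a DataFrame using h5py so that it can easily be loaded back with a different HDF5 reader?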

Jeff · Accepted Answer

The pandas HDFStore format is standard HDF5, with a convention for how to interpret the metadata. This is described in the pandas documentation.

    In [54]: df.to_hdf('test.h5','df',mode='w',format='table',data_columns=True)

    In [55]: h = h5py.File('test.h5')

    In [56]: h['df']['table']
    Out[56]: <HDF5 dataset "table": shape (7,), type "|V32">

    In [64]: h['df']['table'][:]
    Out[64]:
    array([(1, nan, 0.2, nan), (2, nan, nan, 0.5), (3, nan, 0.2, 0.5),
           (4, 0.1, 0.2, nan), (5, 0.1, 0.2, 0.5), (6, 0.1, nan, 0.5),
           (7, 0.1, nan, nan)],
          dtype=[('index', '<i8'), ('A', '<f8'), ('B', '<f8'), ('C', '<f8')])

    In [57]: h['df']['table'].attrs.items()
    Out[57]:
    [(u'CLASS', 'TABLE'),
     (u'VERSION', '2.7'),
     (u'TITLE', ''),
     (u'FIELD_0_NAME', 'index'),
     (u'FIELD_1_NAME', 'A'),
     (u'FIELD_2_NAME', 'B'),
     (u'FIELD_3_NAME', 'C'),
     (u'FIELD_0_FILL', 0),
     (u'FIELD_1_FILL', 0.0),
     (u'FIELD_2_FILL', 0.0),
     (u'FIELD_3_FILL', 0.0),
     (u'index_kind', 'integer'),
     (u'A_kind', "(lp1\nS'A'\na."),
     (u'A_meta', 'N.'),
     (u'A_dtype', 'float64'),
     (u'B_kind', "(lp1\nS'B'\na."),
     (u'B_meta', 'N.'),
     (u'B_dtype', 'float64'),
     (u'C_kind', "(lp1\nS'C'\na."),
     (u'C_meta', 'N.'),
     (u'C_dtype', 'float64'),
     (u'NROWS', 7)]

    In [58]: h.close()

The data is completely readable by any HDF5 reader. Some of the metadata is pickled, so care must be taken when interpreting it.
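As a rough illustration of reading such a file without pandas' HDFStore, the sketch below pulls the compound `table` dataset out with h5py and rebuilds a DataFrame from it. It assumes the layout shown in the session above (a group holding a dataset called `table` whose first field is `index`) and simply ignores the pickled `*_kind` attributes; the helper name `read_pandas_table` is made up for this example.

```python
import h5py
import pandas as pd


def read_pandas_table(path, key="df"):
    """Read a pandas 'table'-format node using only h5py.

    Assumption: the node is a group named `key` that contains a compound
    dataset called 'table' with a field named 'index', as in the session
    shown above. The pickled metadata attributes are not consulted.
    """
    with h5py.File(path, "r") as h5:
        records = h5[key]["table"][:]  # numpy structured array
    # Use the 'index' field as the DataFrame index, the rest as columns
    return pd.DataFrame.from_records(records, index="index")
```

Called as `read_pandas_table('test.h5', 'df')`, this should give back the columns `A`, `B` and `C`, although the index comes back named `index` rather than `ID`, because that is how the field is stored.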

Phil · Answer

Here is my approach to solving this problem. I am hoping that either someone else has a better solution or that my approach is helpful to others.

First, define a function that creates a numpy structured array (not a record array) from a pandas DataFrame.

    import numpy as np

    def df_to_sarray(df):
        """
        Convert a pandas DataFrame object to a numpy structured array.
        This is functionally equivalent to but more efficient than
        np.array(df.to_records(index=False))

        :param df: the data frame to convert
        :return: a numpy structured array representation of df
        """
        v = df.values
        cols = df.columns
        # Field names must be str (not bytes) on Python 3
        types = [(str(k), df[k].dtype.type) for k in cols]
        dtype = np.dtype(types)
        z = np.zeros(v.shape[0], dtype)
        # Copy the data column by column into the structured array
        for (i, k) in enumerate(z.dtype.names):
            z[k] = v[:, i]
        return z

Use reset_index to make a new DataFrame that includes the index as part of its data, then convert that DataFrame to a structured array.

    sa = df_to_sarray(df.reset_index())
    sa

    array([(1, nan, 0.2, nan), (2, nan, nan, 0.5), (3, nan, 0.2, 0.5),
           (4, 0.1, 0.2, nan), (5, 0.1, 0.2, 0.5), (6, 0.1, nan, 0.5),
           (7, 0.1, nan, nan)],
          dtype=[('ID', '<i8'), ('A', '<f8'), ('B', '<f8'), ('C', '<f8')])

Save that structured array to an HDF5 file.

    import h5py

    with h5py.File('mydata.h5', 'w') as hf:
        hf['df'] = sa

Load the HDF5 dataset.

    with h5py.File('mydata.h5', 'r') as hf:
        sa2 = hf['df'][:]

Extract the ID column and delete it from sa2.

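    import numpy.lib.recfunctions as nprec

    ID = sa2['ID']
    # usemask=False returns a plain structured array instead of a masked array
    sa2 = nprec.drop_fields(sa2, 'ID', usemask=False)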

Create a DataFrame from sa2, with ID as the index.

    df2 = pd.DataFrame(sa2, index=ID)
    df2.index.name = 'ID'
    print(df2)

          A    B    C
    ID
    1   NaN  0.2  NaN
    2   NaN  NaN  0.5
    3   NaN  0.2  0.5
    4   0.1  0.2  NaN
    5   0.1  0.2  0.5
    6   0.1  NaN  0.5
    7   0.1  NaN  NaN
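The steps above can be folded into a pair of helpers. This is only a sketch under a few assumptions: all columns (and the index) are numeric, the names `save_df` and `load_df` are invented here, and the `index_name` attribute is an ad hoc convention of this example rather than anything h5py or pandas defines.

```python
import h5py
import numpy as np
import pandas as pd


def save_df(df, path, name="df"):
    """Store a numeric DataFrame as one compound HDF5 dataset."""
    flat = df.reset_index()  # move the index into an ordinary column
    dtype = np.dtype([(str(c), flat[c].dtype) for c in flat.columns])
    sa = np.zeros(len(flat), dtype=dtype)
    for c in flat.columns:
        sa[str(c)] = flat[c].to_numpy()
    with h5py.File(path, "w") as hf:
        ds = hf.create_dataset(name, data=sa)
        # Hypothetical convention: remember which field held the index
        ds.attrs["index_name"] = str(flat.columns[0])


def load_df(path, name="df"):
    """Rebuild the DataFrame written by save_df."""
    with h5py.File(path, "r") as hf:
        ds = hf[name]
        sa = ds[:]
        index_name = ds.attrs["index_name"]
    if isinstance(index_name, bytes):  # older h5py versions may return bytes
        index_name = index_name.decode()
    return pd.DataFrame.from_records(sa, index=index_name)
```

Other readers can ignore the attribute and simply treat the first field of the compound dataset as one more column.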

iipr · Answer

In case it is helpful for anyone, I took this post from Guillaume and Phil and changed it a bit for my needs, with the help of ankostis. We read the pandas DataFrame from a CSV file.

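Mainly I adapted it for strings, because you cannot store an object in an HDF5 file (I believe). First, check which column types are numpy objects. Then check the largest length in each such column, and fix that column to be a string of that length. The rest is quite similar to the other post.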

    import numpy as np

    def df_to_sarray(df):
        """
        Convert a pandas DataFrame object to a numpy structured array.
        Also, for every column of a str type, convert it into
        a 'bytes' str literal of length = max(len(col)).

        :param df: the data frame to convert
        :return: a numpy structured array representation of df
        """

        def make_col_type(col_type, col):
            try:
                if 'numpy.object_' in str(col_type.type):
                    maxlens = col.dropna().str.len()
                    if maxlens.any():
                        # Fixed-width bytes field sized to the longest string
                        maxlen = int(maxlens.max())
                        col_type = 'S%d' % maxlen
                    else:
                        col_type = 'f2'
                return col.name, col_type
            except Exception:
                print(col.name, col_type, col_type.type, type(col))
                raise

        v = df.values
        types = df.dtypes
        numpy_struct_types = [make_col_type(types[col], df.loc[:, col]) for col in df.columns]
        dtype = np.dtype(numpy_struct_types)
        z = np.zeros(v.shape[0], dtype)
        for (i, k) in enumerate(z.dtype.names):
            # This is in case you have problems with the encoding, remove the if branch if not
            try:
                if dtype[i].str.startswith('|S'):
                    z[k] = df[k].str.encode('latin').astype('S')
                else:
                    z[k] = v[:, i]
            except Exception:
                print(k, v[:, i])
                raise

        return z, dtype
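The string handling can also be looked at in isolation. The sketch below is a variation on the idea, not the code above: it measures the width after encoding (so multi-byte characters are counted in bytes), and it replaces missing values with empty strings, which is an assumption of this example. The helper name `to_fixed_bytes` is hypothetical.

```python
import numpy as np
import pandas as pd


def to_fixed_bytes(col, encoding="utf-8"):
    """Convert a string column to a fixed-width numpy bytes array.

    Assumption: missing values become empty strings. The field width is
    the longest value measured in encoded bytes, with a minimum of 1.
    """
    encoded = col.fillna("").str.encode(encoding).tolist()
    width = max(max(map(len, encoded), default=0), 1)
    return np.array(encoded, dtype="S%d" % width)
```

For example, `to_fixed_bytes(pd.Series(["ab", None, "abc"]))` yields an array of dtype `S3`, which h5py can store as a fixed-length string field.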

So the workflow would be:

    import h5py
    import pandas as pd

    # Read a CSV file
    # Here we assume col_dtypes is a dictionary that contains the dtypes of the columns
    df = pd.read_table('./data.csv', sep='\t', dtype=col_dtypes)
    # Transform the DataFrame into a structured numpy array and get the dtype
    sa, saType = df_to_sarray(df)

    # Open/create the HDF5 file
    f = h5py.File('test.hdf5', 'a')
    # Save the structured array
    f.create_dataset('someData', data=sa, dtype=saType)
    # Retrieve it and check it is ok when you transform it into a pandas DataFrame
    sa2 = f['someData'][:]
    df2 = pd.DataFrame(sa2)
    print(df2.head())
    f.close()

Also, this way you are able to inspect the file from HDFView, even when using gzip compression, for example.
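For reference, a minimal sketch of what writing with gzip compression might look like in h5py. The function name `save_compressed` and the compression level are choices made for this example; `compression` and `compression_opts` are the `create_dataset` keywords h5py provides for this.

```python
import h5py
import numpy as np


def save_compressed(path, sa, name="someData", level=4):
    """Write a structured array as a gzip-compressed HDF5 dataset."""
    with h5py.File(path, "w") as f:
        f.create_dataset(
            name,
            data=sa,
            compression="gzip",      # built-in HDF5 filter, readable by other tools
            compression_opts=level,  # gzip level, 0 (none) to 9 (maximum)
        )


def load_compressed(path, name="someData"):
    """Read the dataset back; decompression is transparent to the reader."""
    with h5py.File(path, "r") as f:
        return f[name][:]
```

Because gzip is part of the standard HDF5 filter set, no extra plugin should be needed on the reading side.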