How to subclass pandas DataFrame?

Question

Subclassing pandas classes seems to be a common need, but I could not find any references on the subject. (It seems the pandas developers are still working on it: https://github.com/pydata/pandas/issues/6 ).

There are a few SO threads on the topic, but I am hoping someone here can give a more systematic account of the current best way to subclass pandas.DataFrame so that it satisfies two general requirements:

    import numpy as np
    import pandas as pd

    class MyDF(pd.DataFrame):
        # how to subclass pandas DataFrame?
        pass

    mydf = MyDF(np.random.randn(3,4), columns=['A','B','C','D'])
    print(type(mydf))  # <class '__main__.MyDF'>

    # Requirement 1: Instances of MyDF, when calling standard methods of DataFrame,
    # should produce instances of MyDF.
    mydf_sub = mydf[['A','C']]
    print(type(mydf_sub))  # <class 'pandas.core.frame.DataFrame'>

    # Requirement 2: Attributes attached to instances of MyDF, when calling standard
    # methods of DataFrame, should still attach to the output.
    mydf.myattr = 1
    mydf_cp1 = MyDF(mydf)
    mydf_cp2 = mydf.copy()
    print(hasattr(mydf_cp1, 'myattr'))  # False
    print(hasattr(mydf_cp2, 'myattr'))  # False

Also, are there any significant differences when subclassing pandas.Series? Thank you.

cjrieds · Answer

There is now an official guide on how to subclass pandas data structures, and it covers both DataFrame and Series.

The guide is available here: http://pandas.pydata.org/pandas-docs/stable/internals.html#subclassing-pandas-data-structures

The guide mentions this subclassed DataFrame from the Geopandas project as a good example: https://github.com/geopandas/geopandas/blob/master/geopandas/geodataframe.py
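For the Series part of the question, here is a minimal sketch, assuming a reasonably recent pandas release, of how a Series subclass and a DataFrame subclass can be linked. It uses the `_constructor_sliced` and `_constructor_expanddim` properties, which are not mentioned elsewhere in this thread; the class names are illustrative only.

```python
import pandas as pd


class SubclassedSeries(pd.Series):
    @property
    def _constructor(self):
        # Used when an operation returns another Series.
        return SubclassedSeries

    @property
    def _constructor_expanddim(self):
        # Used when an operation turns the Series into a DataFrame.
        return SubclassedDataFrame


class SubclassedDataFrame(pd.DataFrame):
    @property
    def _constructor(self):
        # Used when an operation returns another DataFrame.
        return SubclassedDataFrame

    @property
    def _constructor_sliced(self):
        # Used when an operation returns a single column or row.
        return SubclassedSeries


df = SubclassedDataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
col = df["A"]            # selecting one column yields the Series subclass
frame = col.to_frame()   # expanding back yields the DataFrame subclass

print(type(col).__name__)
print(type(frame).__name__)
```

So the main difference for Series is which companion property you define: a Series subclass points to its DataFrame counterpart, and the DataFrame subclass points back to the Series.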

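As in HYRY's answer, there seem to be two things you are trying to accomplish:

1. When calling methods on an instance of your class, return an instance of the correct type (your type). For that, you only need to add a _constructor property that returns your type.
2. Add attributes that stay attached to copies of your object. To do this, you need to list the names of these attributes in the special _metadata class attribute.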

Here is an example:

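    from pandas import DataFrame

    class SubclassedDataFrame(DataFrame):
        # Names listed in _metadata are carried over to the results of pandas operations
        _metadata = ['added_property']
        added_property = 1  # This will be passed to copies

        @property
        def _constructor(self):
            return SubclassedDataFrame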
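To check both requirements against this class, a short usage sketch follows. It assumes a reasonably recent pandas release; the sample data and the value "custom" are made up for illustration.

```python
import pandas as pd


class SubclassedDataFrame(pd.DataFrame):
    # Attribute names listed here are copied onto results by pandas.
    _metadata = ["added_property"]
    added_property = 1  # class-level default

    @property
    def _constructor(self):
        return SubclassedDataFrame


df = SubclassedDataFrame({"A": [1, 2], "B": [3, 4], "C": [5, 6]})
df.added_property = "custom"  # hypothetical per-instance value

sub = df[["A", "C"]]  # Requirement 1: column selection keeps the subclass
cp = df.copy()        # Requirement 2: the attribute follows the copy

print(type(sub).__name__, sub.added_property)
print(type(cp).__name__, cp.added_property)
```

Both results should report the subclass name and the value set on the original frame, since `added_property` is listed in `_metadata`.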

HYRY · Answer

For requirement 1, just define _constructor:

    import pandas as pd
    import numpy as np

    class MyDF(pd.DataFrame):
        @property
        def _constructor(self):
            # pandas calls this to build the result of most operations
            return MyDF

    mydf = MyDF(np.random.randn(3,4), columns=['A','B','C','D'])
    print(type(mydf))
    mydf_sub = mydf[['A','C']]
    print(type(mydf_sub))

I don't think there is a simple solution for requirement 2. I think you need to override __init__ or copy, or do something in _constructor, for example:

    import pandas as pd
    import numpy as np

    class MyDF(pd.DataFrame):
        # comma-separated names of the attributes to carry over
        _attributes_ = "myattr1,myattr2"

        def __init__(self, *args, **kw):
            super(MyDF, self).__init__(*args, **kw)
            # when built from another MyDF, copy its attributes onto self
            if len(args) == 1 and isinstance(args[0], MyDF):
                args[0]._copy_attrs(self)

        def _copy_attrs(self, df):
            for attr in self._attributes_.split(","):
                df.__dict__[attr] = getattr(self, attr, None)

        @property
        def _constructor(self):
            # return a factory that builds a MyDF and copies the attributes over
            def f(*args, **kw):
                df = MyDF(*args, **kw)
                self._copy_attrs(df)
                return df
            return f

    mydf = MyDF(np.random.randn(3,4), columns=['A','B','C','D'])
    print(type(mydf))

    mydf_sub = mydf[['A','C']]
    print(type(mydf_sub))

    mydf.myattr1 = 1
    mydf_cp1 = MyDF(mydf)
    mydf_cp2 = mydf.copy()
    print(mydf_cp1.myattr1, mydf_cp2.myattr1)