Save / load scipy sparse csr_matrix in portable data format

Question

How do you save/load a scipy sparse csr_matrix in a portable format? The scipy sparse matrix is created on Python 3 (Windows 64-bit) to run on Python 2 (Linux 64-bit). Initially, I used pickle (with protocol=2 and fix_imports=True), but this didn't work going from Python 3.2.2 (Windows 64-bit) to Python 2.7.2 (Windows 32-bit) and produced the error:

    TypeError: ('data type not understood', <built-in function _reconstruct>, (<type 'numpy.ndarray'>, (0,), '[98]')).

Next, I tried numpy.save and numpy.load as well as scipy.io.mmwrite() and scipy.io.mmread(), and none of these methods worked either.

Henry Thornton · Accepted Answer

Edit: SciPy 0.19 now has scipy.sparse.save_npz and scipy.sparse.load_npz.

    from scipy import sparse

    sparse.save_npz("yourmatrix.npz", your_matrix)
    your_matrix_back = sparse.load_npz("yourmatrix.npz")

For both functions, the file argument may also be a file-like object (i.e. the result of open) instead of a filename.
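To illustrate the file-like variant, here is a minimal sketch that round-trips a matrix through an in-memory buffer. The example matrix and the use of io.BytesIO are assumptions made for illustration; any binary file object opened with open should behave the same way.

```python
import io

import numpy as np
from scipy import sparse

# Small example matrix (made-up data, for illustration only)
matrix = sparse.csr_matrix(np.array([[0, 0, 3], [4, 0, 0], [0, 5, 0]]))

# Write the matrix into an in-memory binary buffer instead of a named file
buffer = io.BytesIO()
sparse.save_npz(buffer, matrix)

# Rewind the buffer before reading it back
buffer.seek(0)
restored = sparse.load_npz(buffer)

# True when no element differs between the original and the restored matrix
same = (matrix != restored).nnz == 0
```

Because the format is stored inside the archive, the restored matrix comes back as CSR without an extra conversion step.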

Got an answer from the Scipy user group:

A csr_matrix has 3 data attributes that matter: .data, .indices, and .indptr. All are simple ndarrays, so numpy.save will work on them. Save the three arrays with numpy.save or numpy.savez, load them back with numpy.load, and then recreate the sparse matrix object with:

    new_csr = csr_matrix((data, indices, indptr), shape=(M, N))

So for example:

    import numpy as np
    from scipy.sparse import csr_matrix

    def save_sparse_csr(filename, array):
        # np.savez appends ".npz" to filename if it is not already there
        np.savez(filename, data=array.data, indices=array.indices,
                 indptr=array.indptr, shape=array.shape)

    def load_sparse_csr(filename):
        # filename must include the ".npz" extension here
        loader = np.load(filename)
        return csr_matrix((loader['data'], loader['indices'], loader['indptr']),
                          shape=loader['shape'])
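To make the role of the three arrays concrete, the following sketch prints nothing and writes no files; it only pulls the arrays out of a tiny, made-up matrix and rebuilds the matrix from them. The matrix values are an assumption chosen so the arrays are easy to read.

```python
import numpy as np
from scipy.sparse import csr_matrix

dense = np.array([[0, 0, 0],
                  [1, 0, 0],
                  [0, 2, 0]])
m = csr_matrix(dense)

# data:    the stored non-zero values, row by row
# indices: the column index of each stored value
# indptr:  row i owns the slice data[indptr[i]:indptr[i + 1]]
data, indices, indptr = m.data, m.indices, m.indptr

# The three arrays plus the shape are enough to recreate the matrix
rebuilt = csr_matrix((data, indices, indptr), shape=m.shape)
```

Here row 0 is empty, which is why indptr starts with two zeros. The shape has to be saved separately because trailing empty columns cannot be recovered from the three arrays alone.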

Frank Zalkow · Answer

Although you write that scipy.io.mmwrite and scipy.io.mmread don't work for you, I want to add how they work. This question is the no. 1 Google hit, so I myself started with np.savez and pickle.dump before switching to the simple and obvious SciPy functions. They work for me and shouldn't be overlooked by those who haven't tried them yet.

    from scipy import sparse, io

    m = sparse.csr_matrix([[0,0,0],[1,0,0],[0,1,0]])
    m  # <3x3 sparse matrix of type '<type 'numpy.int64'>' with 2 stored elements in Compressed Sparse Row format>

    io.mmwrite("test.mtx", m)
    del m

    newm = io.mmread("test.mtx")
    newm  # <3x3 sparse matrix of type '<type 'numpy.int32'>' with 2 stored elements in COOrdinate format>
    newm.tocsr()  # <3x3 sparse matrix of type '<type 'numpy.int32'>' with 2 stored elements in Compressed Sparse Row format>
    newm.toarray()  # array([[0, 0, 0], [1, 0, 0], [0, 1, 0]], dtype=int32)

Dennis Golomazov · Answer

Here is a performance comparison of the three most upvoted answers, run in a Jupyter notebook. The input is a 1M x 100K random sparse matrix with density 0.001, containing 100M non-zero values:

    from scipy.sparse import random

    matrix = random(1000000, 100000, density=0.001, format='csr')
    matrix
    # <1000000x100000 sparse matrix of type '<type 'numpy.float64'>' with 100000000 stored elements in Compressed Sparse Row format>

`io.mmwrite`/`io.mmread`

    from scipy import io

    %time io.mmwrite('test_io.mtx', matrix)
    # CPU times: user 4min 37s, sys: 2.37 s, total: 4min 39s
    # Wall time: 4min 39s

    %time matrix = io.mmread('test_io.mtx')
    # CPU times: user 2min 41s, sys: 1.63 s, total: 2min 43s
    # Wall time: 2min 43s

    matrix
    # <1000000x100000 sparse matrix of type '<type 'numpy.float64'>' with 100000000 stored elements in COOrdinate format>

Filesize: 3.0G.

(Note that the format has changed from csr to coo.)

`np.savez`/`np.load`

    import numpy as np
    from scipy.sparse import csr_matrix

    def save_sparse_csr(filename, array):
        # note that .npz extension is added automatically
        np.savez(filename, data=array.data, indices=array.indices,
                 indptr=array.indptr, shape=array.shape)

    def load_sparse_csr(filename):
        # here we need to add .npz extension manually
        loader = np.load(filename + '.npz')
        return csr_matrix((loader['data'], loader['indices'], loader['indptr']),
                          shape=loader['shape'])

    %time save_sparse_csr('test_savez', matrix)
    # CPU times: user 1.26 s, sys: 1.48 s, total: 2.74 s
    # Wall time: 2.74 s

    %time matrix = load_sparse_csr('test_savez')
    # CPU times: user 1.18 s, sys: 548 ms, total: 1.73 s
    # Wall time: 1.73 s

    matrix
    # <1000000x100000 sparse matrix of type '<type 'numpy.float64'>' with 100000000 stored elements in Compressed Sparse Row format>

Filesize: 1.1G.

`cPickle`

    import cPickle as pickle

    def save_pickle(matrix, filename):
        with open(filename, 'wb') as outfile:
            pickle.dump(matrix, outfile, pickle.HIGHEST_PROTOCOL)

    def load_pickle(filename):
        with open(filename, 'rb') as infile:
            matrix = pickle.load(infile)
        return matrix

    %time save_pickle(matrix, 'test_pickle.mtx')
    # CPU times: user 260 ms, sys: 888 ms, total: 1.15 s
    # Wall time: 1.15 s

    %time matrix = load_pickle('test_pickle.mtx')
    # CPU times: user 376 ms, sys: 988 ms, total: 1.36 s
    # Wall time: 1.37 s

    matrix
    # <1000000x100000 sparse matrix of type '<type 'numpy.float64'>' with 100000000 stored elements in Compressed Sparse Row format>

Filesize: 1.1G.

Note: cPickle doesn't work with very large objects (see this answer). In my experience, it didn't work for a 2.7M x 50k matrix with 270M non-zero values. The np.savez solution worked well.
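The size limit above was observed with cPickle on Python 2. On Python 3.4 and later, pickle protocol 4 was added with support for very large objects, so a sketch like the one below may avoid the problem there. This is an assumption and was not tested at the matrix sizes mentioned above. The file name and matrix dimensions are made up for illustration, and a protocol 4 pickle cannot be read from Python 2, so it does not help with the Python 3 to Python 2 transfer in the question.

```python
import pickle
import tempfile
from pathlib import Path

from scipy import sparse

# Small random CSR matrix standing in for a much larger one
matrix = sparse.random(1000, 500, density=0.01, format='csr', random_state=0)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'matrix.pkl'  # hypothetical file name

    # Protocol 4 requires Python 3.4+ and is not readable from Python 2
    with open(path, 'wb') as outfile:
        pickle.dump(matrix, outfile, protocol=4)

    with open(path, 'rb') as infile:
        restored = pickle.load(infile)
```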

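Conclusion

(Based on this simple test for CSR matrices.) cPickle is the fastest method, but it doesn't work with very large matrices. np.savez is only slightly slower, while io.mmwrite is much slower, produces a bigger file, and restores to the wrong format. So np.savez is the winner here.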

Victor Deplasse · Answer

Now you can use scipy.sparse.save_npz: https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.save_npz.html

Joe Kington · Answer

Assuming you have scipy on both machines, you can just use pickle.

However, be sure to specify a binary protocol when pickling numpy arrays. Otherwise you'll wind up with a huge file.

At any rate, you should be able to do this:

    import cPickle as pickle
    import numpy as np
    import scipy.sparse

    # Just for testing, let's make a dense array and convert it to a csr_matrix
    x = np.random.random((10,10))
    x = scipy.sparse.csr_matrix(x)

    with open('test_sparse_array.dat', 'wb') as outfile:
        pickle.dump(x, outfile, pickle.HIGHEST_PROTOCOL)

You can then load it with:

    import cPickle as pickle

    with open('test_sparse_array.dat', 'rb') as infile:
        x = pickle.load(infile)

x0s · Answer

As of scipy 0.19.0, you can save and load sparse matrices this way:

    from scipy import sparse

    data = sparse.csr_matrix((3, 4))

    # Save
    sparse.save_npz('data_sparse.npz', data)

    # Load
    data = sparse.load_npz("data_sparse.npz")

Yuval · Answer

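I'll add my two cents: for me, npz is not portable, because I cannot use it to export my matrix easily to non-Python clients (e.g. PostgreSQL; glad to be corrected). So I would have liked to get CSV output for the sparse matrix (much like what you get when you print() the sparse matrix). How to achieve this depends on the representation of the sparse matrix. For a CSR matrix, the following code produces CSV output. You can adapt it for other representations.

    import numpy as np

    def csr_matrix_tuples(m):
        # np.unique collapses repeated indptr values, so empty rows are skipped
        uindptr, uindptr_i = np.unique(m.indptr, return_index=True)
        # uindptr_i holds first occurrences, so the row that owns the slice
        # [start:end] is the one just before the first occurrence of end
        row_numbers = uindptr_i[1:] - 1
        for i, (start_index, end_index) in zip(row_numbers, zip(uindptr[:-1], uindptr[1:])):
            for j, data in zip(m.indices[start_index:end_index], m.data[start_index:end_index]):
                yield (i, j, data)

    # my_csr_matrix stands for your own CSR matrix
    for i, j, data in csr_matrix_tuples(my_csr_matrix):
        print(i, j, data, sep=',')

From what I've tested, this is about 2 times slower than save_npz in the current implementation.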
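For representations other than CSR, one possible adaptation is to convert to COO first, since a COO matrix already stores one (row, column, value) triple per element. The sketch below is an alternative to the generator above, not the answer's own method. The helper name write_sparse_csv and the sample matrix are made up for illustration.

```python
import csv
import io

import numpy as np
from scipy import sparse

def write_sparse_csv(matrix, stream):
    """Write one 'row,column,value' line per stored element of matrix."""
    coo = matrix.tocoo()  # works for any scipy sparse format
    writer = csv.writer(stream, lineterminator='\n')
    # .tolist() converts NumPy scalars to plain Python numbers
    for i, j, value in zip(coo.row.tolist(), coo.col.tolist(), coo.data.tolist()):
        writer.writerow([i, j, value])

m = sparse.csr_matrix(np.array([[0, 0, 0], [1, 0, 0], [0, 2, 0]]))
out = io.StringIO()
write_sparse_csv(m, out)
text = out.getvalue()
```

The resulting text holds one line per stored element. As with the generator above, the matrix shape is not part of the output and has to be communicated separately.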

dlorch · Answer

Here is what I used to save a lil_matrix:

    import numpy as np
    from scipy.sparse import lil_matrix

    def save_sparse_lil(filename, array):
        # use np.savez_compressed(..) for compression
        np.savez(filename, dtype=array.dtype.str, data=array.data,
                 rows=array.rows, shape=array.shape)

    def load_sparse_lil(filename):
        # data and rows are object arrays, which recent NumPy versions
        # only load when allow_pickle=True is passed
        loader = np.load(filename, allow_pickle=True)
        result = lil_matrix(tuple(loader["shape"]), dtype=str(loader["dtype"]))
        result.data = loader["data"]
        result.rows = loader["rows"]
        return result

I must say I found NumPy's np.load(..) to be very slow. This is my current solution, which I feel runs much faster:

    from scipy.sparse import lil_matrix
    import numpy as np
    import json

    def lil_matrix_to_dict(myarray):
        # json cannot serialize NumPy arrays, so convert to nested lists
        result = {
            "dtype": myarray.dtype.str,
            "shape": myarray.shape,
            "data": myarray.data.tolist(),
            "rows": myarray.rows.tolist()
        }
        return result

    def as_object_array(lists):
        # build a 1-D object array whose elements are the per-row lists
        arr = np.empty(len(lists), dtype=object)
        for i, row in enumerate(lists):
            arr[i] = row
        return arr

    def lil_matrix_from_dict(mydict):
        result = lil_matrix(tuple(mydict["shape"]), dtype=mydict["dtype"])
        result.data = as_object_array(mydict["data"])
        result.rows = as_object_array(mydict["rows"])
        return result

    def load_lil_matrix(filename):
        result = None
        with open(filename, "r", encoding="utf-8") as infile:
            mydict = json.load(infile)
            result = lil_matrix_from_dict(mydict)
        return result

    def save_lil_matrix(filename, myarray):
        with open(filename, "w", encoding="utf-8") as outfile:
            mydict = lil_matrix_to_dict(myarray)
            json.dump(mydict, outfile)

Guy s · Answer

I was asked to send the matrix in a simple and generic format:

    <x,y,value>

I ended up with this:

    import numpy as np

    def save_sparse_matrix(m, filename):
        # write one "row,column,value" line per non-zero entry
        nonZeros = np.array(m.nonzero())
        with open(filename, 'w') as thefile:
            for entry in range(nonZeros.shape[1]):
                thefile.write("%s,%s,%s\n" % (nonZeros[0, entry], nonZeros[1, entry],
                                              m[nonZeros[0, entry], nonZeros[1, entry]]))

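Reading such an `<x,y,value>` file back could look like the sketch below. The function name load_sparse_triples and the sample data are assumptions for illustration. Since the text format does not record the matrix shape, the caller has to supply it.

```python
import io

import numpy as np
from scipy.sparse import coo_matrix

def load_sparse_triples(source, shape):
    """Read 'row,column,value' lines and return a CSR matrix.

    source is a file name or file-like object; shape must be given by
    the caller because the text format does not store it.
    """
    # ndmin=2 keeps a 2-D result even when the file has a single line
    triples = np.loadtxt(source, delimiter=',', ndmin=2)
    rows = triples[:, 0].astype(int)
    cols = triples[:, 1].astype(int)
    values = triples[:, 2]
    return coo_matrix((values, (rows, cols)), shape=shape).tocsr()

# In-memory stand-in for a file written in the <x,y,value> format
sample = io.StringIO("1,0,1.5\n2,1,2.5\n")
m = load_sparse_triples(sample, shape=(3, 3))
```

Note that np.loadtxt parses every column as float here, so integer-valued matrices come back with a float dtype unless converted afterwards.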
Thomas Ahle · Answer

This works for me:

    import numpy as np
    import scipy.sparse as sp

    file = 'matrices.npz'  # placeholder: any file name or file-like object

    x = sp.csr_matrix([1,2,3])
    y = sp.csr_matrix([2,3,4])
    np.savez(file, x=x, y=y)
    # recent NumPy versions need allow_pickle=True to load object arrays
    npz = np.load(file, allow_pickle=True)

    >>> npz['x'].tolist()
    <1x3 sparse matrix of type '<class 'numpy.int64'>' with 3 stored elements in Compressed Sparse Row format>

    >>> npz['x'].tolist().toarray()
    array([[1, 2, 3]], dtype=int64)

The trick was to call .tolist() to convert the shape 0 object array to the original object.
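The sketch below shows what np.savez stores in this case, using an in-memory buffer instead of a file (an assumption made to keep the example self-contained). The sparse matrix is wrapped in a 0-dimensional object array and pickled, which is why allow_pickle=True is needed on recent NumPy versions and why this approach inherits pickle's portability limits.

```python
import io

import numpy as np
import scipy.sparse as sp

x = sp.csr_matrix(np.array([[1, 2, 3]]))

buffer = io.BytesIO()
np.savez(buffer, x=x)  # x is wrapped in a 0-d object array and pickled
buffer.seek(0)

# Object arrays are only loaded when allow_pickle=True is passed
npz = np.load(buffer, allow_pickle=True)
wrapped = npz['x']

# For a 0-d array, .item() returns the wrapped object, like .tolist()
restored = wrapped.item()
```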
