Why does this Haskell code run slower with -O?

Question

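This piece of Haskell code runs much slower with `-O`, even though `-O` is supposed to be non-dangerous. Can anyone tell me what happened? In case it matters, it is an attempt to solve this problem, and it uses binary search and a persistent segment tree: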

    import Control.Monad
    import Data.Array

    data Node =
          Leaf   Int           -- value
        | Branch Int Node Node -- sum, left child, right child
    type NodeArray = Array Int Node

    -- create an empty node with range [l, r)
    create :: Int -> Int -> Node
    create l r
        | l + 1 == r = Leaf 0
        | otherwise  = Branch 0 (create l m) (create m r)
        where m = (l + r) `div` 2

    -- Get the sum in range [0, r). The range of the node is [nl, nr)
    sumof :: Node -> Int -> Int -> Int -> Int
    sumof (Leaf val) r nl nr
        | nr <= r   = val
        | otherwise = 0
    sumof (Branch sum lc rc) r nl nr
        | nr <= r   = sum
        | r  > nl   = (sumof lc r nl m) + (sumof rc r m nr)
        | otherwise = 0
        where m = (nl + nr) `div` 2

    -- Increase the value at x by 1. The range of the node is [nl, nr)
    increase :: Node -> Int -> Int -> Int -> Node
    increase (Leaf val) x nl nr = Leaf (val + 1)
    increase (Branch sum lc rc) x nl nr
        | x < m     = Branch (sum + 1) (increase lc x nl m) rc
        | otherwise = Branch (sum + 1) lc (increase rc x m nr)
        where m = (nl + nr) `div` 2

    -- signature said it all
    tonodes :: Int -> [Int] -> [Node]
    tonodes n = reverse . tonodes' . reverse
        where
            tonodes' :: [Int] -> [Node]
            tonodes' (h:t) = increase h' h 0 n : s' where s'@(h':_) = tonodes' t
            tonodes' _ = [create 0 n]

    -- find the minimum m in [l, r] such that (predicate m) is True
    binarysearch :: (Int -> Bool) -> Int -> Int -> Int
    binarysearch predicate l r
        | l == r      = r
        | predicate m = binarysearch predicate l m
        | otherwise   = binarysearch predicate (m+1) r
        where m = (l + r) `div` 2

    -- main, literally
    main :: IO ()
    main = do
        [n, m] <- fmap (map read . words) getLine
        nodes <- fmap (listArray (0, n) . tonodes n . map (subtract 1) . map read . words) getLine
        replicateM_ m $ query n nodes
        where
            query :: Int -> NodeArray -> IO ()
            query n nodes = do
                [p, k] <- fmap (map read . words) getLine
                print $ binarysearch (ok nodes n p k) 0 n
                where
                    ok :: NodeArray -> Int -> Int -> Int -> Int -> Bool
                    ok nodes n p k s = (sumof (nodes ! min (p + s + 1) n) s 0 n) - (sumof (nodes ! max (p - s) 0) s 0 n) >= k

(This is exactly the same code as on Code Review, but this question addresses a different problem.)

Here is my input generator in C++:

    #include <cstdio>
    #include <cstdlib>
    using namespace std;

    int main (int argc, char * argv[]) {
        srand(1827);
        int n = 100000;
        if(argc > 1)
            sscanf(argv[1], "%d", &n);
        printf("%d %d\n", n, n);
        for(int i = 0; i < n; i++)
            printf("%d%c", rand() % n + 1, i == n - 1 ? '\n' : ' ');
        for(int i = 0; i < n; i++) {
            int p = rand() % n;
            int k = rand() % n + 1;
            printf("%d %d\n", p, k);
        }
    }

In case you don't have a C++ compiler available, this is the result of `./gen.exe 1000`.

This is the execution result on my computer:

    $ ghc --version
    The Glorious Glasgow Haskell Compilation System, version 7.8.3
    $ ghc -fforce-recomp 1827.hs
    [1 of 1] Compiling Main             ( 1827.hs, 1827.o )
    Linking 1827.exe ...
    $ time ./gen.exe 1000 | ./1827.exe > /dev/null
    real    0m0.088s
    user    0m0.015s
    sys     0m0.015s
    $ ghc -fforce-recomp -O 1827.hs
    [1 of 1] Compiling Main             ( 1827.hs, 1827.o )
    Linking 1827.exe ...
    $ time ./gen.exe 1000 | ./1827.exe > /dev/null
    real    0m2.969s
    user    0m0.000s
    sys     0m0.045s

And this is the heap profile summary:

    $ ghc -fforce-recomp -rtsopts ./1827.hs
    [1 of 1] Compiling Main             ( 1827.hs, 1827.o )
    Linking 1827.exe ...
    $ ./gen.exe 1000 | ./1827.exe +RTS -s > /dev/null
          70,207,096 bytes allocated in the heap
           2,112,416 bytes copied during GC
             613,368 bytes maximum residency (3 sample(s))
              28,816 bytes maximum slop
                   3 MB total memory in use (0 MB lost due to fragmentation)

                                        Tot time (elapsed)  Avg pause  Max pause
      Gen  0       132 colls,     0 par    0.00s    0.00s     0.0000s    0.0004s
      Gen  1         3 colls,     0 par    0.00s    0.00s     0.0006s    0.0010s

      INIT    time    0.00s  (  0.00s elapsed)
      MUT     time    0.03s  (  0.03s elapsed)
      GC      time    0.00s  (  0.01s elapsed)
      EXIT    time    0.00s  (  0.00s elapsed)
      Total   time    0.03s  (  0.04s elapsed)

      %GC     time       0.0%  (14.7% elapsed)

      Alloc rate    2,250,213,011 bytes per MUT second

      Productivity 100.0% of total user, 83.1% of total elapsed

    $ ghc -fforce-recomp -O -rtsopts ./1827.hs
    [1 of 1] Compiling Main             ( 1827.hs, 1827.o )
    Linking 1827.exe ...
    $ ./gen.exe 1000 | ./1827.exe +RTS -s > /dev/null
       6,009,233,608 bytes allocated in the heap
         622,682,200 bytes copied during GC
             443,240 bytes maximum residency (505 sample(s))
              48,256 bytes maximum slop
                   3 MB total memory in use (0 MB lost due to fragmentation)

                                        Tot time (elapsed)  Avg pause  Max pause
      Gen  0     10945 colls,     0 par    0.72s    0.63s     0.0001s    0.0004s
      Gen  1       505 colls,     0 par    0.16s    0.13s     0.0003s    0.0005s

      INIT    time    0.00s  (  0.00s elapsed)
      MUT     time    2.00s  (  2.13s elapsed)
      GC      time    0.87s  (  0.76s elapsed)
      EXIT    time    0.00s  (  0.00s elapsed)
      Total   time    2.89s  (  2.90s elapsed)

      %GC     time      30.3%  (26.4% elapsed)

      Alloc rate    3,009,412,603 bytes per MUT second

      Productivity  69.7% of total user, 69.4% of total elapsed

Joachim Breitner · Accepted Answer

I guess it is time this question gets a proper answer.

What happened to your code with `-O`

Let me zoom in on your `main` function and rewrite it slightly:

    main :: IO ()
    main = do
        [n, m] <- fmap (map read . words) getLine
        line <- getLine
        let nodes = listArray (0, n) . tonodes n . map (subtract 1) . map read . words $ line
        replicateM_ m $ query n nodes

Clearly, the intention here is that the `NodeArray` is created once and then used in every one of the `m` invocations of `query`.

Unfortunately, GHC transforms this code to, effectively,

    main = do
        [n, m] <- fmap (map read . words) getLine
        line <- getLine
        replicateM_ m $ do
            let nodes = listArray (0, n) . tonodes n . map (subtract 1) . map read . words $ line
            query n nodes

and you can immediately see the problem: `nodes` is now rebuilt from `line` on every one of the `m` iterations.

What is the state hack, and why does it destroy my program's performance

The reason is the state hack, which says (roughly): "When something is of type `IO a`, assume it is called only once." The official documentation is not much more elaborate:

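`-fno-state-hack`

Turn off the "state hack" whereby any lambda with a `State#` token as argument is considered to be single-entry, hence it is considered OK to inline things inside it. This can improve performance of IO and ST monad code, but it runs the risk of reducing sharing.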

Roughly, the idea is as follows. Suppose you define a function with an `IO` type and a `where` clause, e.g.

    foo x = do
        putStrLn y
        putStrLn y
      where y = ...x...

Something of type `IO a` can be viewed as something of type `RealWorld -> (RealWorld, a)`. In that view, the above becomes (roughly)

    foo x =
       let y = ...x... in
       \world1 ->
         let (world2, ()) = putStrLn y world1
             (world3, ()) = putStrLn y world2
         in  (world3, ())

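A call to `foo` would (typically) look like `foo argument world`. But the definition of `foo` takes only one argument, and the other one is consumed only later by a local lambda expression. That makes for a very slow call to `foo`. It would be much faster if the code looked like this: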

    foo x world1 =
       let y = ...x...
           (world2, ()) = putStrLn y world1
           (world3, ()) = putStrLn y world2
       in  (world3, ())

This is called eta-expansion, and it is done on various grounds (e.g. by analyzing the function's definition, by checking how it is being called, and, in this case, by type-directed heuristics).

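Unfortunately, this degrades performance if the call to `foo` is actually of the form `let fooArgument = foo argument`, i.e. with an argument but no world passed (yet). In the original code, if `fooArgument` is then used several times, `y` is still calculated only once and shared. In the modified code, `y` is recalculated every time, which is precisely what happened to your `nodes`.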
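The two shapes of `foo` can be written out as a small pure model. This is only a sketch under loud assumptions: `World`, `Action`, `putLine`, `expensive`, `fooShared` and `fooExpanded` are invented names for illustration, and the "world" is a plain list of printed lines rather than GHC's real `RealWorld` token. It shows where `y` is bound in each shape; GHC's optimiser is free to transform this toy code too, so it is not a benchmark.

```haskell
-- Toy model only: World and Action are stand-ins, not GHC's real IO representation.
type World = [String]               -- the "world" is just a log of printed lines
type Action a = World -> (World, a)

-- Hypothetical replacement for putStrLn in this model: append to the log.
putLine :: String -> Action ()
putLine s world = (world ++ [s], ())

-- Stand-in for the "...x..." computation in the text.
expensive :: Int -> String
expensive x = show (sum [1 .. x])

-- Shape before eta-expansion: y is bound outside the lambda, so a partial
-- application such as `fooShared 10` can hold on to y and reuse it.
fooShared :: Int -> Action ()
fooShared x =
  let y = expensive x
  in \world1 ->
       let (world2, ()) = putLine y world1
           (world3, ()) = putLine y world2
       in (world3, ())

-- Shape after eta-expansion: y is bound under the world argument, so it is
-- computed again every time the action is run.
fooExpanded :: Int -> Action ()
fooExpanded x world1 =
  let y = expensive x
      (world2, ()) = putLine y world1
      (world3, ()) = putLine y world2
  in (world3, ())

main :: IO ()
main = do
  -- The `let fooArgument = foo argument` situation: one partial application,
  -- run twice against successive worlds.
  let fooArgument = fooShared 100000
      (w1, ()) = fooArgument []
      (w2, ()) = fooArgument w1
  mapM_ putStrLn w2
  -- Both shapes produce the same log; they differ only in sharing.
  print (fst (fooShared 10 []) == fst (fooExpanded 10 []))
```

Both definitions denote the same function, which is why the compiler may swap one for the other; the only observable difference is how often `expensive` runs when the partial application is reused.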

Can things be fixed?

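Possibly. See #9388 for an attempt at doing so. The problem with fixing it is that it will cost performance in the many cases where the transformation is fine, even though the compiler cannot know that for certain. And there are probably cases where it is technically not fine, i.e. sharing is lost, but it is still beneficial because the speedup from the faster calls outweighs the extra cost of the recalculation. So it is not clear where to go from here.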
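In the meantime, a blunt workaround (not a fix for the heuristic itself) is to switch the state hack off for the affected module with the `-fno-state-hack` flag quoted above. The module below is a hypothetical, self-contained sketch: `prefixSums` and `query` are invented stand-ins for `tonodes` and the question's `query`. Whether the table would really be rebuilt on every iteration without the flag depends on the GHC version and optimisation level, so treat this as an illustration of where the pragma goes, not as a measured result.

```haskell
{-# OPTIONS_GHC -fno-state-hack #-}
-- Hypothetical example module; the names below are illustrative only.

import Control.Monad (replicateM_)
import Data.Array (Array, listArray, (!))

-- A table of prefix sums: prefixSums n ! i == 1 + 2 + ... + i.
-- It plays the role of the NodeArray that should be built exactly once.
prefixSums :: Int -> Array Int Int
prefixSums n = listArray (0, n) (scanl (+) 0 [1 .. n])

-- One "query" against the shared table.
query :: Array Int Int -> Int -> IO ()
query table i = print (table ! i)

main :: IO ()
main = do
  let n = 1000
      table = prefixSums n   -- intended to be built once and shared
  replicateM_ 3 (query table n)
```

The same flag can be passed on the command line, e.g. `ghc -O -fno-state-hack 1827.hs`. As the documentation warns, the state hack exists because it can improve the performance of IO and ST code, so turning it off may slow other parts of a program down.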
