Looking for an efficient integer square root algorithm for ARM Thumb2

Question

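I am looking for a fast, integer-only algorithm to find the square root (the integer part of it) of an unsigned integer. The code must perform very well on ARM Thumb 2 processors. It can be assembly language or C code.

Any hints are welcome.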

Craig McQueen · Answer

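"Integer Square Roots" by Jack W. Crenshaw could be useful as another reference.

The C Snippets Archive also has an integer square root implementation. It goes beyond the integer result and calculates extra fractional (fixed-point) bits of the answer. (Update: unfortunately, the C Snippets Archive is now defunct. The link points to the web archive of the page.) Here is the code from the C Snippets Archive: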

    #include <string.h>   /* memcpy */

    #define BITSPERLONG 32

    /* 3UL (not 3L) so the mask shift is done in unsigned arithmetic */
    #define TOP2BITS(x) (((x) & (3UL << (BITSPERLONG-2))) >> (BITSPERLONG-2))

    struct int_sqrt {
        unsigned sqrt, frac;
    };

    /* usqrt:
        ENTRY x: unsigned long
        EXIT  returns floor(sqrt(x) * pow(2, BITSPERLONG/2))

        Since the square root never uses more than half the bits
        of the input, we use the other half of the bits to contain
        extra bits of precision after the binary point.

        EXAMPLE
            suppose BITSPERLONG = 32
            then    usqrt(144) = 786432 = 12 * 65536
                    usqrt(32) = 370727 = 5.66 * 65536

        NOTES
            (1) change BITSPERLONG to BITSPERLONG/2 if you do not want
                the answer scaled.  Indeed, if you want n bits of
                precision after the binary point, use BITSPERLONG/2+n.
                The code assumes that BITSPERLONG is even.
            (2) This is really better off being written in Assembly.
                The line marked below is really a "arithmetic shift left"
                on the double-long value with r in the upper half
                and x in the lower half.  This operation is typically
                expressible in only one or two Assembly instructions.
            (3) Unrolling this loop is probably not a bad idea.

        ALGORITHM
            The calculations are the base-two analogue of the square
            root algorithm we all learned in grammar school.  Since we're
            in base 2, there is only one nontrivial trial multiplier.

            Notice that absolutely no multiplications or divisions are
            performed.  This means it'll be fast on a wide range of
            processors.
    */

    void usqrt(unsigned long x, struct int_sqrt *q)
    {
        unsigned long a = 0L;   /* accumulator      */
        unsigned long r = 0L;   /* remainder        */
        unsigned long e = 0L;   /* trial product    */

        int i;

        for (i = 0; i < BITSPERLONG; i++)   /* NOTE 1 */
        {
            r = (r << 2) + TOP2BITS(x); x <<= 2;   /* NOTE 2 */
            a <<= 1;
            e = (a << 1) + 1;
            if (r >= e)
            {
                r -= e;
                a++;
            }
        }
        memcpy(q, &a, sizeof(long));
    }

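I settled on the following code. It is essentially taken from the Wikipedia article on methods of computing square roots, changed to use stdint.h types such as uint32_t. Strictly speaking, the return type could be changed to uint16_t.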

    #include <stdint.h>

    /**
     * \brief    Fast Square root algorithm
     *
     * Fractional parts of the answer are discarded. That is:
     *      - SquareRoot(3) --> 1
     *      - SquareRoot(4) --> 2
     *      - SquareRoot(5) --> 2
     *      - SquareRoot(8) --> 2
     *      - SquareRoot(9) --> 3
     *
     * \param[in] a_nInput - unsigned integer for which to find the square root
     *
     * \return Integer square root of the input value.
     */
    uint32_t SquareRoot(uint32_t a_nInput)
    {
        uint32_t op  = a_nInput;
        uint32_t res = 0;
        uint32_t one = 1uL << 30; // The second-to-top bit is set: use 1u << 14 for uint16_t type; use 1uL<<30 for uint32_t type

        // "one" starts at the highest power of four <= the argument.
        while (one > op)
        {
            one >>= 2;
        }

        while (one != 0)
        {
            if (op >= res + one)
            {
                op = op - (res + one);
                res = res + 2 * one;
            }
            res >>= 1;
            one >>= 2;
        }
        return res;
    }

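The nice thing is that a fairly simple modification returns the "rounded" answer, which I found useful in a certain application for greater accuracy. Note that in this case the return type must be uint32_t, because the rounded square root of 2³²-1 is 2¹⁶.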

    /**
     * \brief    Fast Square root algorithm, with rounding
     *
     * This does arithmetic rounding of the result. That is, if the real answer
     * would have a fractional part of 0.5 or greater, the result is rounded up to
     * the next integer.
     *      - SquareRootRounded(2) --> 1
     *      - SquareRootRounded(3) --> 2
     *      - SquareRootRounded(4) --> 2
     *      - SquareRootRounded(6) --> 2
     *      - SquareRootRounded(7) --> 3
     *      - SquareRootRounded(8) --> 3
     *      - SquareRootRounded(9) --> 3
     *
     * \param[in] a_nInput - unsigned integer for which to find the square root
     *
     * \return Integer square root of the input value.
     */
    uint32_t SquareRootRounded(uint32_t a_nInput)
    {
        uint32_t op  = a_nInput;
        uint32_t res = 0;
        uint32_t one = 1uL << 30; // The second-to-top bit is set: use 1u << 14 for uint16_t type; use 1uL<<30 for uint32_t type

        // "one" starts at the highest power of four <= the argument.
        while (one > op)
        {
            one >>= 2;
        }

        while (one != 0)
        {
            if (op >= res + one)
            {
                op = op - (res + one);
                res = res + 2 * one;
            }
            res >>= 1;
            one >>= 2;
        }

        /* Do arithmetic rounding to nearest integer */
        if (op > res)
        {
            res++;
        }

        return res;
    }
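For readers wondering why the single test `op > res` is enough to round, here is a short derivation. It assumes (as is the case for this digit-by-digit method) that when the loop ends, `res` holds r = floor(sqrt(n)) and `op` holds the remainder n - r², where n is `a_nInput`.

```latex
\begin{aligned}
\sqrt{n} \ge r + \tfrac{1}{2}
  &\iff n \ge r^2 + r + \tfrac{1}{4} \\
  &\iff n - r^2 \ge r + \tfrac{1}{4} \\
  &\iff \mathrm{op} > r \qquad \text{(since op and } r \text{ are integers)}
\end{aligned}
```

For example, n = 7 gives r = 2 and op = 3 > 2, so the result is rounded up to 3, while n = 6 gives op = 2, which is not greater than 2, so the result stays 2.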

Dave Gamble · Answer

If exact accuracy isn't required, I have a fast approximation for you that uses 260 bytes of RAM (you could halve that, but don't).

    int ftbl[33]={0,1,1,2,2,4,5,8,11,16,22,32,45,64,90,128,181,256,362,512,724,1024,1448,2048,2896,4096,5792,8192,11585,16384,23170,32768,46340};
    int ftbl2[32]={ 32768,33276,33776,34269,34755,35235,35708,36174,36635,37090,37540,37984,38423,38858,39287,39712,40132,40548,40960,41367,41771,42170,42566,42959,43347,43733,44115,44493,44869,45241,45611,45977};

    int fisqrt(int val)
    {
        int cnt=0;
        int t=val;
        while (t) {cnt++;t>>=1;}      /* cnt = number of significant bits */
        if (6>=cnt)    t=(val<<(6-cnt));
        else           t=(val>>(cnt-6));

        return (ftbl[cnt]*ftbl2[t&31])>>15;
    }

Here's the code to generate the tables:

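    #include <math.h>
    #include <stdio.h>

    int ftbl[33];
    int ftbl2[32];

    int main(void)
    {
        /* ftbl[n] = floor(sqrt(2^(n-1))), with ftbl[0] = 0 */
        ftbl[0]=0;
        for (int i=0;i<32;i++) ftbl[i+1]=sqrt(pow(2.0,i));
        printf("int ftbl[33]={0");
        for (int i=0;i<32;i++) printf(",%d",ftbl[i+1]);
        printf("};\n");

        /* ftbl2[i] = sqrt(1 + i/32) as 15-bit fixed point */
        for (int i=0;i<32;i++) ftbl2[i]=sqrt(1.0+i/32.0)*32768;
        printf("int ftbl2[32]={");
        for (int i=0;i<32;i++) printf("%c%d",(i)?',':' ',ftbl2[i]);
        printf("};\n");
        return 0;
    }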

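Over the range 1→2²⁰ the maximum error is 11, and over the range 1→2³⁰ it is about 256. You could use larger tables to reduce this. Note that the error is always negative: when the result is wrong, it is LESS than the correct value.

You might do well to follow this with a refining stage.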
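One possible refining stage, as a minimal sketch rather than anything from the original answer: take the approximate root, apply a single Newton step, then nudge the result until it is exactly floor(sqrt(n)). The name `refine_isqrt` is hypothetical, and the sketch assumes the guess is already reasonably close, otherwise the correction loops take many iterations.

```c
#include <stdint.h>

/* Hypothetical helper: polish an approximate root into floor(sqrt(n)).
 * Assumes 'guess' is already close to the true root. */
uint32_t refine_isqrt(uint32_t n, uint32_t guess)
{
    if (n == 0) {
        return 0;
    }
    if (guess == 0) {
        guess = 1;                      /* avoid dividing by zero */
    }

    /* One Newton step; 64-bit sum so guess + n/guess cannot overflow. */
    uint32_t r = (uint32_t)(((uint64_t)guess + n / guess) / 2);

    /* Final correction: step down while too large, up while too small. */
    while ((uint64_t)r * r > n) {
        r--;
    }
    while ((uint64_t)(r + 1) * (r + 1) <= n) {
        r++;
    }
    return r;
}
```

It would be called as `refine_isqrt(val, fisqrt(val))`. The two correction loops cost at most a handful of iterations when the guess is within a few percent of the true root.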

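The idea is simple enough: (ab)^0.5 = a^0.5 × b^0.5.

So we take the input X = A×B, where A = 2^N and 1 ≤ B < 2.

Then we have a lookup table for sqrt(2^N) and a lookup table for sqrt(B), 1 ≤ B < 2. We store the lookup table for sqrt(2^N) as integers, which might be a mistake (testing shows no ill effects), and we store the lookup table for sqrt(B) as 15-bit fixed point.

We know that 1 ≤ sqrt(2^N) < 65536, so that's 16 bits, and we know that we can really only multiply 16 bits × 15 bits on an ARM without fear of reprisal, so that's what we do.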

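In terms of implementation, the while(t) {cnt++;t>>=1;} loop is effectively a count-leading-bits instruction (CLB), so if your version of the chipset has one, you're winning! Also, the shift would be easy to implement with a bidirectional shifter, if you have one.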
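As a sketch of that replacement, assuming a GCC or Clang toolchain: the `__builtin_clz` builtin typically compiles to a single CLZ instruction on Thumb-2 cores. The helper name `bit_length` is hypothetical, and the zero check is needed because `__builtin_clz(0)` is undefined.

```c
#include <stdint.h>

/* Hypothetical helper: number of significant bits in v, i.e. the value
 * that the while(t) {cnt++;t>>=1;} loop leaves in cnt.
 * Assumes GCC/Clang, where __builtin_clz counts leading zero bits. */
static inline int bit_length(uint32_t v)
{
    if (v == 0) {
        return 0;                 /* __builtin_clz(0) is undefined */
    }
    return 32 - __builtin_clz(v);
}
```

With this, `cnt = bit_length(val)` gives the same count as the loop for non-negative inputs, without iterating once per bit.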

There's a Lg[N] algorithm for counting the highest set bit here.

In terms of magic numbers, for changing table sizes: THE magic number for ftbl2 is 32, though note that 6 (Lg[32]+1) is used for the shifting.

S.Lott · Answer

One common approach is bisection.

    hi = number
    lo = 0
    mid = ( hi + lo ) / 2
    mid2 = mid*mid
    while( lo < hi-1 and mid2 != number ) {
        if( mid2 < number ) {
            lo = mid
        } else {
            hi = mid
        }
        mid = ( hi + lo ) / 2
        mid2 = mid*mid
    }

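Something like that should work reasonably well. It makes log2(number) tests, doing log2(number) multiplies and divides. Since the divide is a divide by 2, you can replace it with a >>.

The terminating condition may not be spot on, so be sure to test a variety of integers to make sure the division by 2 doesn't incorrectly oscillate between two even values; they would differ by more than 1.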
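A minimal C sketch of the bisection idea, under the assumption of a 32-bit unsigned input. It is not the answer's exact pseudocode: it keeps the invariant lo² ≤ n < hi² and stops when the two bounds are adjacent, which sidesteps the oscillation concern. The function name `isqrt_bisect` is made up for illustration.

```c
#include <stdint.h>

/* Bisection sketch for floor(sqrt(n)).
 * Invariant: lo*lo <= n < hi*hi. */
uint32_t isqrt_bisect(uint32_t n)
{
    uint32_t lo = 0;        /* 0*0 <= n always holds */
    uint32_t hi = 65536;    /* 65536^2 = 2^32 exceeds every uint32_t */

    while (hi - lo > 1) {
        uint32_t mid = lo + ((hi - lo) >> 1);   /* divide by 2 as a shift */
        if ((uint64_t)mid * mid <= n) {
            lo = mid;       /* mid is not too big: raise the lower bound */
        } else {
            hi = mid;       /* mid is too big: lower the upper bound */
        }
    }
    return lo;
}
```

Because the interval starts at width 2¹⁶ and halves each time, the loop always runs exactly 16 iterations, one multiply each.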

Gutskalk · Answer

Most algorithms are based on simple ideas but are implemented in a far more complicated way than necessary. I took the idea from here: http://ww1.microchip.com/downloads/en/AppNotes/91040a.pdf (by Ross M. Fosler) and made it into a very short C function:

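    uint16_t int_sqrt32(uint32_t x)
    {
        uint16_t res=0;
        uint16_t add= 0x8000;   /* bit currently being tried, from the top down */
        int i;
        for(i=0;i<16;i++)
        {
            uint16_t temp=res | add;
            /* cast first: uint16_t operands promote to signed int,
               and 0xFFFF*0xFFFF would overflow a 32-bit int */
            uint32_t g2=(uint32_t)temp*temp;
            if (x>=g2)
            {
                res=temp;
            }
            add>>=1;
        }
        return res;
    }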

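This compiles to 5 cycles/bit on my Blackfin. I believe compiled code will in general be faster if you use for loops instead of while loops, and you get the added benefit of deterministic time (although that depends to some extent on how your compiler optimizes the if statement).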

Yazou · Answer

It depends on how the sqrt function is used. I often use an approximation to make a fast version. For example, when I need to compute the module (magnitude) of a vector:

Module = SQRT( x^2 + y^2)

I use:

Module = MAX( x,y) + Min(x,y)/2

This can be coded in 3 or 4 instructions as:

    If (x > y )
        Module = x + (y >> 1);
    Else
        Module = y + (x >> 1);
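Written out as C, a sketch of this max + min/2 estimate might look like the following. The function name `approx_magnitude` is invented for illustration, and the inputs are assumed to be non-negative and small enough that the sum does not overflow. Note the parentheses around the shift: in C, `+` binds more tightly than `>>`.

```c
#include <stdint.h>

/* Approximate sqrt(x*x + y*y) as max(x, y) + min(x, y)/2.
 * Assumes x + y/2 (or y + x/2) fits in 32 bits. */
uint32_t approx_magnitude(uint32_t x, uint32_t y)
{
    if (x > y) {
        return x + (y >> 1);    /* larger value plus half the smaller */
    }
    return y + (x >> 1);
}
```

The estimate never falls below the true magnitude and overestimates by up to about 12% (the worst case is when the smaller component is half the larger one), so it suits comparisons and rough thresholds rather than exact results.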

Philip · Answer

It's not fast, but it's small and simple:

    int isqrt(int n)
    {
        int b = 0;

        while(n >= 0)
        {
            n = n - b;
            b = b + 1;
            n = n - b;
        }

        return b - 1;
    }
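The loop above works because each pass subtracts the next odd number (b + (b + 1) = 2b + 1), and the sum of the first k odd numbers is k². A sketch that makes this explicit for unsigned inputs follows; the name `isqrt_odd_sum` is hypothetical.

```c
#include <stdint.h>

/* Count how many consecutive odd numbers 1, 3, 5, ... can be
 * subtracted from n; since 1 + 3 + ... + (2k-1) = k*k, that
 * count is floor(sqrt(n)). */
uint32_t isqrt_odd_sum(uint32_t n)
{
    uint32_t root = 0;
    uint32_t odd  = 1;

    while (n >= odd) {
        n   -= odd;     /* remove the next odd number */
        odd += 2;
        root++;
    }
    return root;
}
```

Like the original, this takes about sqrt(n) iterations, so it is only attractive when inputs are small or code size matters more than speed.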

Ber · Answer

I have settled on something similar to the binary digit-by-digit algorithm described in this Wikipedia article.

warren · Answer

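Here is a solution in Java that combines integer log_2 and Newton's method to create a loop-free algorithm. As a downside, it needs division. The commented-out lines are required to upconvert to a 64-bit algorithm.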

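    private static final int debruijn= 0x07C4ACDD;
    //private static final long debruijn= ( ~0x0218A392CD3D5DBFL)>>>6;

    // Lookup tables filled in by the static initializer below
    // (not declared in the original snippet).
    private static final int[] DeBruijnArray= new int[32];
    private static final int[] SQRT= new int[32];

    static
    {
      for(int x= 0; x<32; ++x)
      {
        // int here so the product wraps to 32 bits before the shift;
        // the original used "final long v= ~( -2L<<(x));", which the 64-bit variant needs
        final int v= ~( -2<<(x));
        DeBruijnArray[(v*debruijn)>>>27]= x; //>>>58
      }
      for(int x= 0; x<32; ++x)
        SQRT[x]= (int) (Math.sqrt((1L<<DeBruijnArray[x])*Math.sqrt(2)));
    }

    public static int sqrt(final int num)
    {
      int y;
      if(num==0)
        return num;
      {
        int v= num;
        v|= v>>>1; // first round up to one less than a power of 2
        v|= v>>>2;
        v|= v>>>4;
        v|= v>>>8;
        v|= v>>>16;
        //v|= v>>>32;
        y= SQRT[(v*debruijn)>>>27]; //>>>58
      }
      //y= (y+num/y)>>>1;
      y= (y+num/y)>>>1;
      y= (y+num/y)>>>1;
      y= (y+num/y)>>>1;
      return y*y>num?y-1:y;
    }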

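How it works: the first part produces a square root accurate to about three bits. Each line y= (y+num/y)>>>1; doubles the accuracy in bits. The last line eliminates the "roof" roots (results one too high) that can be generated.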

Bumsik Kim · Answer

If you only need it for ARM Thumb 2 processors, the CMSIS DSP library by ARM is your best bet. It is made by the people who designed the Thumb 2 processors.

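Actually, you may not even need an algorithm, just a specialized square root hardware instruction such as VSQRT. ARM maintains math and DSP algorithm implementations that are highly optimized for Thumb 2 capable processors and that use hardware such as VSQRT where it is available. You can get the source code: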

arm_sqrt_f32()
arm_sqrt_q15.c / arm_sqrt_q31.c (q15 and q31 are fixed-point data types specialized for the DSP core that often comes with ARM Thumb 2 compatible processors.)

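ARM also maintains precompiled binaries of CMSIS DSP that guarantee the best possible performance with ARM Thumb architecture-specific instructions, so consider linking against them when you use the library. You can get the binaries here.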

Ken Turkowski · Answer

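This method is similar to long division: you construct a guess for the next digit of the root, do a subtraction, and enter the digit if the difference meets certain criteria. In the binary version, the only choice for the next digit is 0 or 1, so you always guess 1, do the subtraction, and enter a 1 unless the difference is negative.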
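To make the long-division analogy concrete, here is a minimal integer-only sketch for 32-bit inputs. It is not the code behind the link below; the function name `isqrt_digit_by_digit` is an assumption for illustration. Each iteration brings down two bits of the input, guesses that the next root bit is 1, and keeps the guess only if the subtraction would not go negative.

```c
#include <stdint.h>

/* Binary "long division" square root sketch: returns floor(sqrt(x)). */
uint32_t isqrt_digit_by_digit(uint32_t x)
{
    uint32_t root = 0;   /* root bits found so far */
    uint32_t rem  = 0;   /* running remainder */

    for (int i = 0; i < 16; i++) {
        rem = (rem << 2) | (x >> 30);       /* bring down the next two bits */
        x <<= 2;
        root <<= 1;                         /* make room for the next root bit */

        uint32_t trial = (root << 1) | 1;   /* amount to subtract if the bit is 1 */
        if (rem >= trial) {                 /* difference would be non-negative */
            rem -= trial;
            root |= 1;                      /* enter a 1 */
        }
    }
    return root;
}
```

The loop uses only shifts, compares, and subtractions, and always runs 16 iterations, so its timing does not depend on the input value.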

http://www.realitypixels.com/turk/opensource/index.html#FractSqrt

Kde · Answer

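I recently ran into the same task on an ARM Cortex-M3 (STM32F103CBT6), and after searching the Internet I came up with the following solution. It is not the fastest compared with the solutions offered here, but it has good accuracy (maximum error is 1, i.e. one LSB, over the entire UI32 input range) and relatively good speed (about 1.3 million square roots per second on a 72 MHz ARM Cortex-M3, or about 55 cycles per root including the function call).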

    // FastIntSqrt is based on Wikipedia article:
    // https://en.wikipedia.org/wiki/Methods_of_computing_square_roots
    // Which involves Newton's method which gives the following iterative formula:
    //
    // X(n+1) = (X(n) + S/X(n))/2
    //
    // Thanks to ARM CLZ instruction (which counts how many bits in a number are
    // zeros starting from the most significant one) we can very successfully
    // choose the starting value, so just three iterations are enough to achieve
    // maximum possible error of 1. The algorithm uses division, but fortunately
    // it is fast enough here, so square root computation takes only about 50-55
    // cycles with maximum compiler optimization.
    uint32_t FastIntSqrt (uint32_t value)
    {
        if (!value)
            return 0;

        uint32_t xn = 1 << ((32 - __CLZ (value))/2);
        xn = (xn + value/xn)/2;
        xn = (xn + value/xn)/2;
        xn = (xn + value/xn)/2;
        return xn;
    }

I'm using IAR, and it produces the following assembler code:

            SECTION `.text`:CODE:NOROOT(1)
            THUMB
    _Z11FastIntSqrtj:
            MOVS     R1,R0
            BNE.N    ??FastIntSqrt_0
            MOVS     R0,#+0
            BX       LR
    ??FastIntSqrt_0:
            CLZ      R0,R1
            RSB      R0,R0,#+32
            MOVS     R2,#+1
            LSRS     R0,R0,#+1
            LSL      R0,R2,R0
            UDIV     R3,R1,R0
            ADDS     R0,R3,R0
            LSRS     R0,R0,#+1
            UDIV     R2,R1,R0
            ADDS     R0,R2,R0
            LSRS     R0,R0,#+1
            UDIV     R1,R1,R0
            ADDS     R0,R1,R0
            LSRS     R0,R0,#+1
            BX       LR               ;; return