Why doesn't my gigabit bond deliver 150 MB/s or more of throughput?

Question

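I connected two PowerEdge 6950 servers directly to each other, crossover style (using straight-through cables), on two different PCIe adapters.

I get a gigabit link on each of these lines (1000 MBit, full duplex, flow control in both directions).

Now I am trying to bond these interfaces into bond0 using the rr algorithm on both sides (I want to get 2000 MBit for a single IP session).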

When I tested the throughput by transferring /dev/zero to /dev/null using dd bs=1M and netcat in TCP mode, the result fell short of the 150 MB/s or more that I expected.
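For readers who want to reproduce the measurement, the test described above can be sketched roughly as follows. The host name `peer`, port 5001, and the helper name `zero_stream` are placeholders chosen for illustration, not details from the original setup, and the listen syntax differs between netcat variants.

```shell
#!/bin/sh
# Sketch of the dd + netcat throughput test (host, port and names are assumptions).

# Emit COUNT mebibytes of zeros on stdout; dd prints its rate summary to stderr.
zero_stream() {
    dd if=/dev/zero bs=1M count="$1"
}

# On the receiving host, discard everything arriving on TCP port 5001:
#   nc -l -p 5001 > /dev/null     # traditional netcat
#   nc -l 5001 > /dev/null        # OpenBSD netcat
#
# On the sending host, push 8 GiB through the bond to the peer:
#   zero_stream 8192 | nc peer 5001
```

dd reports its rate once it has finished writing into the pipe, which can be slightly before the last bytes have left the sender's socket buffers, so short runs tend to read a little high.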

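When I use the lines individually, I get about 98 MB/s on each line if each line carries traffic in a different direction. If the traffic on both lines goes in the "same" direction, I get 70 MB/s on one line and 90 MB/s on the other.

After reading the bonding readme (/usr/src/linux/Documentation/networking/bonding.txt), I found the following section useful: (13.1.1 MT Single Switch Topology)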

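balance-rr: This mode is the only mode that will permit a single TCP/IP connection to stripe traffic across multiple interfaces. It is therefore the only mode that will allow a single TCP/IP stream to utilize more than one interface's worth of throughput. This comes at a cost, however: the striping often results in peer systems receiving packets out of order, causing TCP/IP's congestion control system to kick in, often by retransmitting segments.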
 It is possible to adjust TCP/IP's congestion limits by altering the net.ipv4.tcp_reordering sysctl parameter. The usual default value is 3, and the maximum useful value is 127. For a four interface balance-rr bond, expect that a single TCP/IP stream will utilize no more than approximately 2.3 interface's worth of throughput, even after adjusting tcp_reordering. Note that this out of order delivery occurs when both the sending and receiving systems are utilizing a multiple interface bond. Consider a configuration in which a balance-rr bond feeds into a single higher capacity network channel (e.g., multiple 100Mb/sec ethernets feeding a single gigabit ethernet via an etherchannel capable switch). In this configuration, traffic sent from the multiple 100Mb devices to a destination connected to the gigabit device will not see packets out of order. However, traffic sent from the gigabit device to the multiple 100Mb devices may or may not see traffic out of order, depending upon the balance policy of the switch. Many switches do not support any modes that stripe traffic (instead choosing a port based upon IP or MAC level addresses); for those devices, traffic flowing from the gigabit device to the many 100Mb devices will only utilize one interface. 

I have now changed that parameter from 3 to 127 on both connected servers, for all lines (4).

After bonding again I get about 100 MB/s, but still no more than that.

Any ideas why?

Update: hardware details from lspci -v:

24:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)
    Subsystem: Intel Corporation PRO/1000 PT Dual Port Server Adapter
    Flags: bus master, fast devsel, latency 0, IRQ 24
    Memory at dfe80000 (32-bit, non-prefetchable) [size=128K]
    Memory at dfea0000 (32-bit, non-prefetchable) [size=128K]
    I/O ports at dcc0 [size=32]
    Capabilities: [c8] Power Management version 2
    Capabilities: [d0] MSI: Mask- 64bit+ Count=1/1 Enable-
    Capabilities: [e0] Express Endpoint, MSI 00
    Kernel driver in use: e1000
    Kernel modules: e1000

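Update with the final results:

8589934592 bytes (8.6 GB) copied, 35.8489 s, 240 MB/s

I changed a lot of TCP/IP and low-level driver options, including enlarging the network buffers. That is why dd now shows numbers above 200 MB/s: dd terminates while there is still output waiting to be transferred (in the send buffers).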
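The reported rate can be checked by hand from the byte count and elapsed time in the dd output. A minimal sketch, assuming decimal megabytes (10^6 bytes, the unit GNU dd prints) and a hypothetical helper name `mb_per_s`:

```shell
#!/bin/sh
# mb_per_s BYTES SECONDS -> throughput in MB/s (1 MB = 10^6 bytes), one decimal.
mb_per_s() {
    awk -v b="$1" -v s="$2" 'BEGIN { printf "%.1f\n", b / s / 1000000 }'
}

# Figures from the dd run above.
mb_per_s 8589934592 35.8489

# Raw ceiling of two 1000 Mbit/s links: 250,000,000 bytes in one second.
mb_per_s 250000000 1
```

This comes to roughly 239.6 MB/s, which dd rounds to 240 MB/s, against a raw ceiling of 250 MB/s for two gigabit links before any protocol overhead.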

Update 2011-08-05: settings that were changed to reach the goal (/etc/sysctl.conf):

# See http://www-didc.lbl.gov/TCP-tuning/linux.html
# raise TCP max buffer size to 16 MB. default: 131071
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
# raise autotuning TCP buffer limits
# min, default and max number of bytes to use
# Defaults:
#net.ipv4.tcp_rmem = 4096 87380 174760
#net.ipv4.tcp_wmem = 4096 16384 131072
# Tuning:
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
# Default: Backlog 300
net.core.netdev_max_backlog = 2500
#
# Oracle-DB settings:
fs.file-max = 6815744
fs.aio-max-nr = 1048576
net.ipv4.ip_local_port_range = 9000 65500
kernel.shmmax = 2147659776
kernel.sem = 1250 256000 100 1024
net.core.rmem_default = 262144
net.core.wmem_default = 262144
#
# Tuning for network-bonding according to bonding.txt:
net.ipv4.tcp_reordering=127

Special settings for the bond device (SLES: /etc/sysconfig/network/ifcfg-bond):

MTU='9216'
LINK_OPTIONS='txqueuelen 10000'

Note that setting the largest possible MTU was the key to the solution.
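To get a feel for what a larger MTU changes, the per-frame arithmetic can be sketched as below. The overhead figures are assumptions stated in the comments (IPv4 and TCP headers without options, untagged Ethernet), and the helper names `payload_pct` and `frames_per_s` are made up for this illustration.

```shell
#!/bin/sh
# Assumed per-frame costs (IPv4 + TCP without options, untagged Ethernet):
#   40 bytes of IP/TCP headers inside the MTU,
#   38 bytes of Ethernet framing outside it
#   (preamble 8 + header 14 + FCS 4 + inter-frame gap 12).

# payload_pct MTU -> share of wire time that carries TCP payload, in percent.
payload_pct() {
    awk -v mtu="$1" 'BEGIN { printf "%.1f\n", (mtu - 40) * 100 / (mtu + 38) }'
}

# frames_per_s MTU -> full-size frames per second needed to fill 2 x 1 Gbit/s.
frames_per_s() {
    awk -v mtu="$1" 'BEGIN { printf "%d\n", 250000000 / (mtu + 38) }'
}

for mtu in 1500 9000 9216; do
    echo "MTU $mtu: $(payload_pct "$mtu")% payload, $(frames_per_s "$mtu") frames/s"
done
```

Under these assumptions the payload share rises only from about 95% to about 99%, while the frame rate needed to fill both links drops from roughly 160,000 to under 28,000 per second. One plausible reading, not confirmed by the thread, is that the lower packet rate matters more than the header savings, since every frame costs interrupt and stack processing and is a potential reordering event under balance-rr.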

Tuning of the rx/tx buffers of the network cards involved:

/usr/sbin/ethtool -G eth2 rx 2048 tx 2048
/usr/sbin/ethtool -G eth4 rx 2048 tx 2048

user842313 · Accepted Answer

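Some time ago I had a similar problem when trying to speed up a drbd synchronization over two gigabit links. In the end I managed to get a sync speed of about 150 MB/s. These were the settings I applied on both nodes: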

ifconfig bond0 mtu 9000
ifconfig bond0 txqueuelen 10000
echo 3000 > /proc/sys/net/core/netdev_max_backlog
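On systems where ifconfig is no longer installed, the same three settings can be expressed with iproute2 and sysctl. The sketch below only prints the commands so they can be reviewed before being run as root; the function name `bond_tuning_cmds` is a made-up helper, and the values are simply the ones from the answer above.

```shell
#!/bin/sh
# bond_tuning_cmds IFACE -> print iproute2/sysctl equivalents of the settings above.
# Nothing is executed here; review the output, then run it as root.
bond_tuning_cmds() {
    cat <<EOF
ip link set dev $1 mtu 9000
ip link set dev $1 txqueuelen 10000
sysctl -w net.core.netdev_max_backlog=3000
EOF
}

bond_tuning_cmds bond0
```

These changes do not survive a reboot; persistent configuration belongs in the distribution's network configuration files and in /etc/sysctl.conf.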

You could also try enabling interrupt coalescing on your network cards (with ethtool --coalesce), if it is not already enabled.

ashmere · Answer

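If you have configured jumbo frames on the NICs, check that you have also configured your switches to support the high MTU.

Jumbo frames give a great performance gain on gigabit networks, but you need to make sure they are configured end to end (on both the source and destination servers, and on the network switches between them).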
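One common way to confirm that jumbo frames really work end to end is to ping with the "don't fragment" flag set and a payload sized to fill the MTU exactly. A minimal sketch, assuming Linux iputils ping and IPv4; the host name `peer` and both function names are placeholders:

```shell
#!/bin/sh
# icmp_payload MTU -> largest ICMP echo payload that fits in one IPv4 packet
# (MTU minus 20-byte IPv4 header minus 8-byte ICMP header).
icmp_payload() {
    echo $(( $1 - 20 - 8 ))
}

# check_path_mtu HOST MTU -> send 3 pings with "don't fragment" set (iputils -M do).
# Fails if any device on the path cannot carry packets of that size.
check_path_mtu() {
    ping -M do -c 3 -s "$(icmp_payload "$2")" "$1"
}

# Example (host name is a placeholder):
#   check_path_mtu peer 9000
```

If the ping with the full-size payload fails while a smaller one succeeds, some interface or switch port along the path is still at a lower MTU.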

user48838 · Answer

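The PowerEdge 6950 appears to be limited to PCI slots that share 133 MB/s across the entire bus. There may be I/O limitations in the system bus architecture itself.

Apart from testing on other systems with different hardware and I/O architectures, the cabling may also come into play. Possible factors include the cable rating (5e vs. 6) as well as the length (shorter is not always better).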

Will - TechToolbox · Answer

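Using jumbo frames helps a great deal, as long as your switch and NICs support them. If you have an unmanaged switch, you most likely will not get the bandwidth you are after, but that is not the case if you are binding the ports together on the switch. Here is something I learned a long time ago: 65% of the time, it is a physical issue. Are you using cat6 cable?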

Julien Vehent · Answer

Jumbo frames?

ifconfig <interface> mtu 9000

Chopper3 · Answer

Have you configured this two-way trunk on the switch? If not, it will not work like that; it will just work in active/passive mode and use only one of the 1 Gbps links.