OMG We have so many TCP retransmits... Or not.
So you came into the office and a chat window pops up, saying a Solaris 11 zone is having issues. The problem was already found, TCP retransmits all over the place...
$ netstat -s -P tcp | /usr/xpg4/bin/grep -E 'tcpOutDataBytes|tcpRetransBytes' tcpOutDataSegs =3943655938 tcpOutDataBytes =868103525 tcpRetransSegs =588588 tcpRetransBytes =516389125
Not bad, that's a 60 % retransmission rate.
Is it? We recall our SA-400 Solaris System performance management training and start investigating.
First, let's check if netstat
is based on dtrace, procfs or kstat...
$ truss -ftioctl netstat -s -P tcp 28803: ioctl(3, I_PUSH, "tcp") = 0 28803: ioctl(3, I_PUSH, "udp") = 0 28803: ioctl(3, I_PUSH, "icmp") = 0 28803: ioctl(4, KSTAT_IOC_CHAIN_ID, 0x00000000) = 3714269 28803: ioctl(4, KSTAT_IOC_READ, "kstat_headers") Err#12 ENOMEM 28803: ioctl(4, KSTAT_IOC_READ, "kstat_headers") = 3714269 28803: ioctl(1, TCGETA, 0xFF59C734) = 0 ...
So it's using the venerable kstat provider...
$ kstat -p tcp:0:tcp:outDataBytes -p tcp:0:tcp:retransBytes tcp:0:tcp:outDataBytes 2065463549 tcp:0:tcp:retransBytes 517472499
That rings some bells... Let's check the source (should still be valid for Solaris 11.3) at http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/inet/tcp/tcp_stats.c#470
470 { "outDataBytes", KSTAT_DATA_UINT32, 0 }, 471 { "retransBytes", KSTAT_DATA_UINT32, 0 },
Looks like the counters are using uint32_t integer types. See Fixed width integer types
And uint32_t can get how big exactly (/usr/include/stdint.h includes /usr/include/sys/stdint.h which includes /usr/include/sys/int_limits.h)?
$ grep -w UINT32_MAX /usr/include/sys/* ... /usr/include/sys/int_limits.h:#define UINT32_MAX (4294967295U)
So we get an outDataBytes integer overflow every UINT32_MAX bytes.
$ kstat -p tcp:0:tcp:outDataBytes -p tcp:0:tcp:retransBytes 5 tcp:0:tcp:outDataBytes 4260947220 tcp:0:tcp:retransBytes 519501630 tcp:0:tcp:outDataBytes 4292542537 tcp:0:tcp:retransBytes 519502160 wait for it... tcp:0:tcp:outDataBytes 66427847 tcp:0:tcp:retransBytes 519528716
Guess our 60 % TCP retransmission rate is not what it seems.
Let's use DTrace just to make sure.
# cat tcp_retransmits.d #!/usr/sbin/dtrace -s #pragma D option quiet tcp:::send / (args[2]->ip_plength - args[4]->tcp_offset) > 0 / { @transmit[args[2]->ip_daddr, args[4]->tcp_dport] = sum(args[2]->ip_plength - args[4]->tcp_offset); } tcp:::send / (args[2]->ip_plength - args[4]->tcp_offset) > 0 && args[3]->tcps_retransmit == 1/ { @retransmit[args[2]->ip_daddr, args[4]->tcp_dport] = sum(args[2]->ip_plength - args[4]->tcp_offset); } tick-5s { printf("%-25s %-15s %-15s %-15s\n", "Remote host", "Port", "BytesSent", "BytesResent"); printa("%-25s %-15d %@-15d %@-15d\n", @transmit, @retransmit); clear(@transmit); clear(@retransmit); } # ./tcp_retransmits.d Remote host Port BytesSent BytesResent ... 10.x.x.x 34286 130646 170 10.x.x.x 53483 5790314 0 10.x.x.x 48394 8183188 2896 10.x.x.x 47664 8602988 0 10.x.x.x 35623 8713439 0 10.x.x.x 52682 8725822 1448 10.x.x.x 48103 8803098 0 10.x.x.x 36497 9101003 510
Nothing to worry about. We have tiny retransmits. Looks like our network is fine.
Do things get better on Solaris 12/11.next? Well netstat is still based on kstat1 and kstat2 doesn't seem to use 64bit integers...
# uname -a SunOS s12test 5.12 s12_120 sun4v sparc sun4v # kstat2 -p kstat:/mib2/tcp/tcp/0\;retransBytes -p kstat:/mib2/tcp/tcp/0\;outDataBytes -i 5 ... kstat:/mib2/tcp/tcp/0;outDataBytes 4239688713 kstat:/mib2/tcp/tcp/0;retransBytes 125307 kstat:/mib2/tcp/tcp/0;outDataBytes 142297693 kstat:/mib2/tcp/tcp/0;retransBytes 125307
So that's a nope :(
No comments:
Post a Comment