OMG We have so many TCP retransmits... Or not.
So you come into the office and a chat window pops up, saying a Solaris 11 zone is having issues. The problem has supposedly been found already: TCP retransmits all over the place...
$ netstat -s -P tcp | /usr/xpg4/bin/grep -E 'tcpOutDataBytes|tcpRetransBytes'
tcpOutDataSegs =3943655938 tcpOutDataBytes =868103525
tcpRetransSegs =588588 tcpRetransBytes =516389125
Not bad, that's a 60 % retransmission rate.
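For reference, the 60 % figure is nothing more than the two byte counters from the netstat output divided by each other. A minimal sketch of that arithmetic (the function name is made up for illustration, and it takes the counters at face value):

```python
def retransmission_rate(retrans_bytes: int, out_bytes: int) -> float:
    """Naive ratio of retransmitted bytes to sent data bytes."""
    return retrans_bytes / out_bytes


# Values copied from the netstat -s -P tcp output above.
tcp_out_data_bytes = 868_103_525
tcp_retrans_bytes = 516_389_125

rate = retransmission_rate(tcp_retrans_bytes, tcp_out_data_bytes)
print(f"{rate:.1%}")  # prints 59.5%
```

This only means something if both counters cover the same amount of traffic.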
Is it? We recall our SA-400 Solaris System performance management training and start investigating.
First, let's check if netstat is based on dtrace, procfs or kstat...
$ truss -ftioctl netstat -s -P tcp
28803: ioctl(3, I_PUSH, "tcp") = 0
28803: ioctl(3, I_PUSH, "udp") = 0
28803: ioctl(3, I_PUSH, "icmp") = 0
28803: ioctl(4, KSTAT_IOC_CHAIN_ID, 0x00000000) = 3714269
28803: ioctl(4, KSTAT_IOC_READ, "kstat_headers") Err#12 ENOMEM
28803: ioctl(4, KSTAT_IOC_READ, "kstat_headers") = 3714269
28803: ioctl(1, TCGETA, 0xFF59C734) = 0
...
So it's using the venerable kstat provider...
$ kstat -p tcp:0:tcp:outDataBytes -p tcp:0:tcp:retransBytes
tcp:0:tcp:outDataBytes 2065463549
tcp:0:tcp:retransBytes 517472499
That rings some bells... Let's check the source (should still be valid for Solaris 11.3) at http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/inet/tcp/tcp_stats.c#470
470 { "outDataBytes", KSTAT_DATA_UINT32, 0 },
471 { "retransBytes", KSTAT_DATA_UINT32, 0 },
Looks like both counters are of type uint32_t. See "Fixed width integer types".
And how big exactly can a uint32_t get (/usr/include/stdint.h includes /usr/include/sys/stdint.h, which includes /usr/include/sys/int_limits.h)?
$ grep -w UINT32_MAX /usr/include/sys/*
...
/usr/include/sys/int_limits.h:#define UINT32_MAX (4294967295U)
So outDataBytes wraps around to zero every UINT32_MAX + 1 bytes, i.e. after every 4 GiB of data sent, while the much slower retransBytes counter keeps growing.
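To see how badly a wrapped counter can distort the ratio, here is a small sketch. The "true" totals are hypothetical numbers picked purely for illustration (we have no idea how often the real counter has wrapped), and as_uint32 is just a helper name invented here:

```python
UINT32_MAX = 4294967295    # from /usr/include/sys/int_limits.h
MODULUS = UINT32_MAX + 1   # a 32-bit unsigned counter wraps after 2**32 values


def as_uint32(true_total: int) -> int:
    """Value a 32-bit unsigned counter would report for a true running total."""
    return true_total % MODULUS


# Hypothetical true totals, assumed for illustration only.
true_out_bytes = 50 * MODULUS + 868_103_525   # sent counter has wrapped 50 times
true_retrans_bytes = 516_389_125              # retransmit counter never wrapped

reported_out = as_uint32(true_out_bytes)
reported_retrans = as_uint32(true_retrans_bytes)

apparent_rate = reported_retrans / reported_out
actual_rate = true_retrans_bytes / true_out_bytes
print(f"apparent: {apparent_rate:.1%}, actual: {actual_rate:.2%}")
```

Under these assumptions the reported counters still suggest roughly 60 %, while the real rate is a fraction of a percent.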
$ kstat -p tcp:0:tcp:outDataBytes -p tcp:0:tcp:retransBytes 5
tcp:0:tcp:outDataBytes 4260947220
tcp:0:tcp:retransBytes 519501630
tcp:0:tcp:outDataBytes 4292542537
tcp:0:tcp:retransBytes 519502160
wait for it...
tcp:0:tcp:outDataBytes 66427847
tcp:0:tcp:retransBytes 519528716
Guess our 60 % TCP retransmission rate is not what it seems.
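If you are stuck with the 32-bit kstat counters, the usual workaround is to look at deltas between two samples and take the difference modulo 2^32. A rough sketch using the last two samples from the kstat output above, assuming the counter wrapped at most once between them (counter_delta is a name made up for this example):

```python
MODULUS = 2**32  # 32-bit unsigned counters wrap at UINT32_MAX + 1


def counter_delta(previous: int, current: int, modulus: int = MODULUS) -> int:
    """Bytes counted between two samples, tolerating a single wraparound."""
    return (current - previous) % modulus


# Samples taken from the kstat output above (before and after the wrap).
out_delta = counter_delta(4_292_542_537, 66_427_847)
retrans_delta = counter_delta(519_502_160, 519_528_716)

interval_rate = retrans_delta / out_delta
print(f"sent {out_delta} bytes, resent {retrans_delta} bytes, "
      f"rate {interval_rate:.3%}")
```

For that interval the retransmission rate comes out well below 0.1 %, a long way from 60 %.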
Let's use DTrace just to make sure.
# cat tcp_retransmits.d
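#!/usr/sbin/dtrace -s

#pragma D option quiet

/*
 * TCP payload bytes = IP payload length minus TCP header length.
 * Sum them for every segment that carries data (retransmissions included),
 * keyed by remote address and remote port.
 */
tcp:::send
/ (args[2]->ip_plength - args[4]->tcp_offset) > 0 /
{
    @transmit[args[2]->ip_daddr, args[4]->tcp_dport] =
        sum(args[2]->ip_plength - args[4]->tcp_offset);
}

/* Same sum, but only for segments flagged as retransmissions. */
tcp:::send
/ (args[2]->ip_plength - args[4]->tcp_offset) > 0 &&
    args[3]->tcps_retransmit == 1 /
{
    @retransmit[args[2]->ip_daddr, args[4]->tcp_dport] =
        sum(args[2]->ip_plength - args[4]->tcp_offset);
}

/* Every 5 seconds: print both aggregations side by side, then reset them. */
tick-5s
{
    printf("%-25s %-15s %-15s %-15s\n",
        "Remote host", "Port", "BytesSent", "BytesResent");
    printa("%-25s %-15d %@-15d %@-15d\n",
        @transmit, @retransmit);
    clear(@transmit);
    clear(@retransmit);
}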
# ./tcp_retransmits.d
Remote host               Port            BytesSent       BytesResent
...
10.x.x.x                  34286           130646          170
10.x.x.x                  53483           5790314         0
10.x.x.x                  48394           8183188         2896
10.x.x.x                  47664           8602988         0
10.x.x.x                  35623           8713439         0
10.x.x.x                  52682           8725822         1448
10.x.x.x                  48103           8803098         0
10.x.x.x                  36497           9101003         510
Nothing to worry about. We have tiny retransmits. Looks like our network is fine.
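To put a number on "tiny", we can add up the rows shown above. This is only a sketch: the output was truncated ("..."), so the totals cover just the visible connections of one 5-second interval, not the whole zone.

```python
# (remote port, bytes sent, bytes resent) for the rows visible above.
rows = [
    (34286, 130_646, 170),
    (53483, 5_790_314, 0),
    (48394, 8_183_188, 2_896),
    (47664, 8_602_988, 0),
    (35623, 8_713_439, 0),
    (52682, 8_725_822, 1_448),
    (48103, 8_803_098, 0),
    (36497, 9_101_003, 510),
]

total_sent = sum(sent for _, sent, _ in rows)
total_resent = sum(resent for _, _, resent in rows)

visible_rate = total_resent / total_sent
print(f"sent {total_sent} bytes, resent {total_resent} bytes, "
      f"rate {visible_rate:.4%}")
```

For the visible rows that works out to around a hundredth of a percent.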
Do things get better on Solaris 12/11.next? Well, netstat is still based on kstat1, and kstat2 doesn't seem to use 64-bit integers for these counters either...
# uname -a
SunOS s12test 5.12 s12_120 sun4v sparc sun4v
# kstat2 -p kstat:/mib2/tcp/tcp/0\;retransBytes -p kstat:/mib2/tcp/tcp/0\;outDataBytes -i 5
...
kstat:/mib2/tcp/tcp/0;outDataBytes 4239688713
kstat:/mib2/tcp/tcp/0;retransBytes 125307
kstat:/mib2/tcp/tcp/0;outDataBytes 142297693
kstat:/mib2/tcp/tcp/0;retransBytes 125307
So that's a nope :(