Hi all, just wanted to share a recent experience I had investigating a packet drop issue on a linux system. Hope it’s useful to anyone facing similar issues.
Recently I was alerted by our monitoring system that it detected network packet loss. The loss was not excessive, only a couple of hundred packets, nothing compared to the billions of packets flowing through our systems. There was no observable degradation of the functionality of the system. The number of lost packets was low and they were TCP packets, so TCP retransmission would overcome the loss. However, packet loss is often an early sign that something is wrong with the systems, i.e. load is building up beyond capacity. So I took the alert seriously and started searching for the cause. There can be various reasons for packet loss. It can be that the network transport is unreliable and packet loss is natural, the network link could be congested, applications cannot handle the offered load. In this case the network link was reliable fibre, the network link was far from congested. The monitoring system indicated that the packet loss was on a linux box running several server applications that process data sent by our trading machines for analysis by our traders. The packet drops were on incoming traffic. The drop count is shown by the ifconfig command.
ens3f0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500 ether 00:0f:53:2b:87:e0 txqueuelen 1000 (Ethernet) RX packets 2155496544 bytes 767987963091 (715.2 GiB) RX errors 0 dropped 69740 overruns 0 frame 0 TX packets 2335238951 bytes 663467243034 (617.9 GiB) TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0 device interrupt 16
It seemed most likely that the applications received more traffic than they could handle. Given that on this box one application was running close to its maximum capacity, I expected that this application could not handle the load during some periods during the day, something we had seen before. So I analyzed the performance metrics of this application. CPU load was not very high, the incoming traffic was not out of the ordinary. All metrics indicated that the application was running normally. Nothing indicated that it was overloaded.
This is a single threaded application handling approximately 7000 TCP connections. The CPU load showed some spikes, but nothing excessive. Some of the packet loss periods coincided with those spikes, but not all. Also some periods of high load did not show any packet loss. The same held for the network traffic measured at the NIC level. The periods with dropped packets did not coincide with the periods having the highest packet rate. I also checked the other applications on the box and as expected their load was way less and did not show any signs of overload.
This looked odd. The network seems fine, the applications do not seem overload, yet packets are dropped. Could the monitoring system be wrong? We captured network traffic with tcpdump. If the monitoring system showed packet loss, but the TCP streams in the capture do not, then the monitoring system could be wrong. It could still be that the monitoring system is right as the packets are dropped on the machine, possibly after capturing. The capture did show gaps in the TCP sequence numbers. Tcpdump may not be able to capture all traffic and create those gaps by itself. In this case, the gaps were real as later in time the retransmissions for those missing packets could be seen, i.e. the clients sending those packets did not receive acknowledgements and retransmit the packets.
The capture confirmed the alerts from the monitoring system. We learned something useful from the capture though. The packet drops were seen on TCP streams from all applications running on the box, not only for the one heavy application! That means application performance could not be the cause. Still I was not 100% convinced that it was not something in the application. What else could it be? I decided to do an experiment:
- Run a server that accepts incoming connections, but does not read any data from the connected socket
- Connect a client to the server
- Keep sending data from that client to the server
At some point the server should drop the incoming data. How would this packet drop show in the metrics?
During the experiment I watched the output of the netstat -atn command on the machine running the server:
Proto Recv-Q Send-Q Local Address Foreign Address State tcp 5250096 0 172.30.12.53:24232 172.30.12.52:50049 ESTABLISHED
As expected the size of the receive queue (Recv-Q) kept growing. It grew till ~5MB and then the queue was full, no more packets could be received. Tcpdump showed retransmissions, coming in slower and slower. The TCP exponential backoff algorithm did its work at the client side.
(here comes the last acknowledge packet) 14:59:30.461175 IP 172.30.12.52.50110 > 172.30.12.53.24232: Flags [P.], seq 5021029:5021461, ack 0, win 15, options [nop,nop,TS val 3705262642 ecr 216579590], length 432 14:59:30.501125 IP 172.30.12.53.24232 > 172.30.12.52.50110: Flags [.], ack 5021461, win 32, options [nop,nop,TS val 216579630 ecr 3705262642], length 0
(retransmissions coming in, slower and slower) 14:59:30.501182 IP 172.30.12.52.50110 > 172.30.12.53.24232: Flags [P.], seq 5021461:5021785, ack 0, win 15, options [nop,nop,TS val 3705262682 ecr 216579630], length 324 14:59:30.740948 IP 172.30.12.52.50110 > 172.30.12.53.24232: Flags [P.], seq 5021461:5021785, ack 0, win 15, options [nop,nop,TS val 3705262922 ecr 216579630], length 324 14:59:31.221942 IP 172.30.12.52.50110 > 172.30.12.53.24232: Flags [P.], seq 5021461:5021785, ack 0, win 15, options [nop,nop,TS val 3705263403 ecr 216579630], length 324
To my surprise the ‘dropped’ count in ifconfig did not increase! I searched for other metrics to see where those dropped packets showed up. I found that the following counters, displayed by netstat -s incremented with each incoming retransmission.
18415 packets pruned from receive queue because of socket buffer overrun 50 packets pruned from receive queue
Now I was convinced that the packet drops that we had in our production system were not due to performance issues in the application itself. But then what was it? The packets arrived at the machine, the application can handle the load, still the packets were dropped. What dropped it? As a software developer I was aware of network issues and application performance, but I never really thought about what is between the NIC and my application. Something there must be dropping packets. In between is the network driver and the kernel, something I never really looked into. Now I had to. The NIC, a Solarflare card, has a ring buffer of 1024 slots by default. Every incoming packet is put into this buffer through DMA. Each is 2K and can hold 1 packet at most, i.e. a small packet of only several bytes will occupy a full slot. This ring buffer can be increased to 4096 slots. We did and then there were no more packet drops!
Apparently the buffer overflowed and increasing the buffer stopped the packets from being dropped. Still I felt unsatisfied as I did not really understand why the buffer overflowed. After all the NIC was a 10GB/s interface and the incoming traffic was well below that rate. When a packet arrives at the NIC the following happens:
- An IRQ is raised
- This IRQ eventually triggers a task net_rx_action
- The net_rx_action task takes packets from the NIC ring buffer and puts them in socket buffer (skb) structures
- The skb’s are passed up to the protocol layers, IP, TCP
The net_rx_action has a limited budget to run. The budget is determined by 2 constraints:
- A packet budget defined by net.core.netdev_budget – This is the amount of packets that can be handled at most in a single run.
- A time constraint of 2 jiffies.
If either of these constraints is reached, then the task is stopped. Remaining packets will be left in the ring buffer till the task runs again. This is to prevent net_rx_action hogging the kernel. When this happens the squeeze_time counter in the /proc/net/softnet_stat statistics gets incremented. These statistics show 1 line per CPU core. Each counter is 4 bytes. The 3rd counter is the squeeze_time counter. Looking at the core handling the IRQ’s from the NIC we saw that the squeeze_time counter got incremented once in a while:
6cbc9e8e 00000000 0005e467 00000000 00000000 00000000 00000000 00000000 00000000 00000000
Monitoring this count showed a pattern similar to the packet drops that we had first. The packet drops stopped due to increased buffer size, but at the times we had drops before we now observed time squeezes. By default the packet budget net.core.netdev_budget is 300. This seems quite a low number and apparently we receive bursts of packets exceeding this number, probably exceeding the default ring buffer size of 1024. Searching the internet showed that the default budget is too low for handling 10GB/s. Hogging the CPU core did not seem to be an issue as we had dedicated one CPU core to handle the network IRQ’s. So far this core never exceeded a load of 15%.
Increasing the budget indeed reduced the number of time squeezes. Still I was puzzled why this happened in the first place as the box did not receive anything close to 10GB/s or did it? All traffic rate metrics have a sampling rate of 1s. What if there are some exceptionally large bursts in sub-second intervals?
With the packet drops gone, we captured network traffic once more. This time I made an I/O graph for incoming packets with a 1ms interval using Wireshark to get an insight into traffic rates. This is what I got:
Three sharp peaks stand out and in between lower peaks can be seen.The 3 peaks show that packet rate goes up to ~700 packets/ms. This is quite high and perhaps the rate is even higher on some sub-millisecond intervals. The peaks are caused by hundreds of clients sending a small packet to the server at the same time.
Although the average amount of traffic was moderate, there were micro bursts of traffic due to synchronized behavior of the clients. The micro bursts were the cause of the packet drops we saw. They are somewhat like microscopic denial of service attacks.
When this issue started I thought it would be a case of application performance issues. I did not expect to end up in the depths of the kernel. As with all complex issues, there are multiple causes for the packets loss:
- Multiple clients sending data in a synchronized manner cause micro traffic bursts at the server
- The default settings of the net_rx_task kernel task are not sufficient to process these bursts in one go
- The default size of the ring buffer is too small to buffer all packets from these bursts
- The packets making up the bursts are small, so the data rate measured in bytes stays well below the maximum
The art of software development requires one to think in layers of abstraction. Running an application requires layers upon layers upon layers with the hardware at the bottom. Most of the time we never really think about all those layers as they just do their job, making themselves invisible to us. When problems arise we search in the top layer, our application. This is where we find most of our problems. But once in a while the problems are in those layers below. To find those you need to be aware what is beneath your application. You need expertise from others. I got help from colleagues with expertise in networks and the Linux kernel to help me out.
In this case there is also the effect from multiple clients behaving in a synchronized way causing those bursts. Every single client worked as expected, but together they did not.
Without our monitoring systems we would not have known about this problem. Most likely the amount of traffic would have grown over time, possibly making the bursts larger or more frequent up to the point where functionality of the system starts to degrade, i.e. unresponsive applications on the trader’s desktop. Doing this investigation under such conditions would have been much more stressful.
No matter how well you design and build your systems, unforeseen issues will happen. You need monitoring to find these issues before they become harmful.