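[Abstract] Early in the morning LVS suddenly went down. SMS and email alerts came in nonstop, and we poor SAs jumped online to take a look. It turned out that the stupid LVS was repeatedly kicking out and re-adding the backend realservers, which made the service extremely unstable. Judging from the logs, the problem was on the backend realservers. PS: our operating system is the equally stupid Debian 6. Looking at the logs and state of the realservers themselves, I found a very large number of connections in the FIN_WAIT2 state: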
Dec 10 01:39:45 sudops.com kernel: [1696088.973658] Out of socket memory
Dec 10 01:39:45 sudops.com kernel: [1696088.973666] Out of socket memory
Dec 10 01:39:45 sudops.com kernel: [1696088.973675] Out of socket memory

TCP connection states:

# netstat -an|awk '{print $NF}' | sort | uniq -c | sort -nr | head -10
234725 FIN_WAIT2
84975 ESTABLISHED
14376 TIME_WAIT
10515 FIN_WAIT1
256 SYN_RECV
187 LAST_ACK
186 SYN_SENT
173 LISTEN
172 CLOSE_WAIT
129 0.0.0.0:*

# cat /proc/net/sockstat
sockets: used 66022
TCP: inuse 189515 orphan 132272 tw 76580 alloc 190763 mem 59842
UDP: inuse 129 mem 125
UDPLITE: inuse 0
RAW: inuse 0
FRAG: inuse 0 memory 0

# cat /proc/sys/net/ipv4/tcp_max_orphans
262144
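The netstat pipeline above prints the last field of every line, which is why a row such as "129 0.0.0.0:*" shows up: UDP rows have no state column, so their last field is the foreign address. A minimal sketch of a stricter count follows. It assumes the usual `netstat -ant` column layout (state in column 6), and the function name `tcp_state_counts` is made up for this example.

```shell
#!/bin/sh
# Count TCP sockets per state from `netstat -ant`-style input on stdin.
# Only lines whose first field starts with "tcp" are counted, and the
# state is read from column 6, so header lines and UDP rows are skipped.
# (Hypothetical helper; assumes the standard netstat column layout.)
tcp_state_counts() {
    awk '$1 ~ /^tcp/ { count[$6]++ }
         END { for (s in count) print count[s], s }' |
        sort -k1,1nr -k2,2
}

# Example, to be run on the realserver:
#   netstat -ant | tcp_state_counts | head -10
```

The output keeps the same "count state" shape as the original command, so before and after figures stay comparable.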
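I suspected that too many orphaned sockets were causing the Out of socket memory errors. So I raised net.ipv4.tcp_max_orphans and lowered net.ipv4.tcp_fin_timeout at the same time, to cut down the number of connections sitting in FIN states.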
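The orphan hypothesis can be checked by comparing the orphan count in /proc/net/sockstat with the tcp_max_orphans limit. Here is a rough sketch: the function name `orphan_usage` is hypothetical, and it only reads the "TCP:" line format shown above. One caveat, stated as an assumption rather than something verified on this machine: kernels of that generation appear to double the orphan count for some sockets before comparing it with the limit, so the warning may start at roughly half of tcp_max_orphans. That would fit the 132272 orphans seen here against a limit of 262144.

```shell
#!/bin/sh
# orphan_usage MAX_ORPHANS
# Reads /proc/net/sockstat-style text on stdin and prints the orphan
# count, the limit, and the percentage of the limit in use.
# (Hypothetical helper; it relies on the "TCP: ... orphan N ..." format.)
orphan_usage() {
    awk -v max="$1" '
        $1 == "TCP:" {
            for (i = 2; i < NF; i++)
                if ($i == "orphan") orphans = $(i + 1)
        }
        END {
            # Fail if no orphan field was found or the limit is not positive.
            if (orphans == "" || max + 0 <= 0) exit 1
            printf "orphans=%d max=%d used=%d%%\n",
                   orphans, max, int(orphans * 100 / max)
        }'
}

# Example, to be run on the realserver:
#   orphan_usage "$(cat /proc/sys/net/ipv4/tcp_max_orphans)" < /proc/net/sockstat
#
# The two settings can be changed at runtime (as root) with:
#   sysctl -w net.ipv4.tcp_max_orphans=3276800
#   sysctl -w net.ipv4.tcp_fin_timeout=5
```

Changes made with `sysctl -w` do not survive a reboot, so the same values also need to go into /etc/sysctl.conf.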
For kernel parameter tuning, see my other article: Tuning TCP/IP and kernel parameters on Linux (linux下TCP/IP及内核参数优化调优)
Everything is back to normal now. The basic parameters and status values are as follows:
# cat /etc/sysctl.conf
net.ipv4.tcp_max_orphans = 3276800
vm.swappiness=0
fs.file-max = 1491124
net.ipv4.tcp_max_tw_buckets = 10000
net.ipv4.tcp_max_syn_backlog = 262144
net.ipv4.conf.default.accept_source_route = 0
net.ipv4.tcp_fin_timeout = 5
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_syncookies = 1
net.ipv4.tcp_keepalive_probes = 3
net.ipv4.tcp_keepalive_intvl = 15
net.ipv4.tcp_max_syn_backlog = 8388608
net.core.netdev_max_backlog = 8388608
net.ipv4.tcp_keepalive_time = 1200
net.ipv4.tcp_window_scaling = 0
net.ipv4.tcp_sack = 1
net.ipv4.tcp_timestamps = 1
net.ipv4.ip_local_port_range = 1024 65000
net.ipv4.icmp_ignore_bogus_error_responses = 1
net.ipv4.tcp_rmem = 4096 87380 4194304
net.ipv4.tcp_wmem = 4096 16384 4194304
net.ipv4.tcp_mem = 94500000 915000000 927000000
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216

# netstat -an|awk '{print $NF}' | sort | uniq -c | sort -nr | head -10
465079 ESTABLISHED
25749 FIN_WAIT2
10886 FIN_WAIT1
1504 CLOSE_WAIT
606 TIME_WAIT
256 SYN_RECV
245 SYN_SENT
173 LISTEN
147 LAST_ACK
129 0.0.0.0:*

# cat /proc/net/sockstat
sockets: used 475830
TCP: inuse 495505 orphan 27302 tw 10000 alloc 495507 mem 252960
UDP: inuse 129 mem 125
UDPLITE: inuse 0
RAW: inuse 0
FRAG: inuse 0 memory 0

# cat /proc/sys/net/ipv4/tcp_max_orphans
3276800
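The sysctl.conf above sets net.ipv4.tcp_max_syn_backlog twice (262144 and later 8388608). Since the file is applied line by line, the later value should be the one that takes effect. A small sketch for spotting such repeats is below. The function name `sysctl_duplicates` is made up for this example, and it assumes simple `key = value` lines.

```shell
#!/bin/sh
# Print every sysctl key that is assigned more than once on stdin,
# together with the value from its last assignment.
# (Hypothetical helper; assumes plain "key = value" lines.)
sysctl_duplicates() {
    awk -F= '
        /^[ \t]*[#;]/ || /^[ \t]*$/ { next }   # skip comments and blank lines
        {
            key = $1; val = $2
            gsub(/[ \t]/, "", key)             # keys contain no whitespace
            sub(/^[ \t]+/, "", val)            # trim the value on both sides
            sub(/[ \t]+$/, "", val)
            seen[key]++
            last[key] = val
        }
        END {
            for (k in seen)
                if (seen[k] > 1) print k " = " last[k]
        }' | sort
}

# Example:
#   sysctl_duplicates < /etc/sysctl.conf
```

The value actually in effect can be confirmed with `sysctl net.ipv4.tcp_max_syn_backlog`.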
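Because this is a typical keepalive application, each backend server currently holds around 500,000 ESTABLISHED connections, and there are several dozen realservers in total. Looks like the servers are holding up really well, and the user count is not bad either, ha.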