I think the general principles of network troubleshooting are:
- Find out at which level of the TCP/IP stack (or some other stack) the problem occurs.
- Understand what the correct system behavior is, and how the current state deviates from it.
- Try to express the problem in one sentence or in a few words.
- Using the information obtained from the buggy system, your own experience, and the experience of other people (Google, various forums, etc.), try to solve the problem until you succeed (or fail).
- If you fail, ask other people for help or advice.
As for me, I usually gather all the required information with whatever tools are needed and try to match it against my experience. Deciding which level of the network stack contains the bug helps to rule out unlikely causes. Using other people’s experience helps to solve problems quickly, but it often means that I solve a problem without understanding it, and if the problem occurs again, I cannot tackle it without the Internet.
And in general, I don’t know how I solve network problems. It seems that there is some magic function in my brain named SolveNetworkProblem(information_about_system_state, my_experience, people_experience), which sometimes returns exactly the right answer and sometimes fails (like here: TCP dies on a Linux laptop).
I usually use tools from this set for network debugging:
- ip addr – for obtaining information about network interfaces.
- ping – for checking whether the target host is reachable from my machine. ping can also be used for basic DNS diagnostics: ping a host by IP address and then by hostname to see whether DNS works at all. Then use mtr to look at what is going on along the way there.
- dig – diagnose everything DNS.
- dmesg | less, dmesg | tail, or dmesg | grep -i error – for understanding what the Linux kernel thinks about some trouble.
- netstat … | grep smth – my most common use of the netstat command, which shows information about TCP connections. I often filter its output with grep. See also iproute2 (the new standard suite of Linux networking tools) and lsof -ai tcp -c some-cmd.
- telnet <host> <port> – very useful for talking to various TCP services (e.g. over SMTP or HTTP); it also checks whether we can connect to a given TCP port at all.
- iptables-save (on Linux) – to dump the full iptables tables.
- ethtool – get all the network interface card parameters (status of the link, speed, offload parameters…).
- socat – the Swiss army tool to test all network protocols (UDP, multicast, SCTP…). Especially useful (more so than telnet) with a few …
- iperf – to test bandwidth availability.
- … (x509 …) – to debug all SSL/TLS/PKI issues.
- wireshark – the powerful tool for capturing and analyzing network traffic, which lets you catch and analyze many network bugs.
- iftop – show big users on the network/router.
- iptstate (on Linux) – current view of the firewall’s connection tracking.
- arp (or the new, on Linux, ip neigh) – show the ARP table status.
- route or the newer (on Linux) ip route – show the routing table status.
- … tusc (depending on the system) – a useful tool that shows which system calls the problem process makes, along with the error codes (errno) when system calls fail. This information is often enough to understand the system behavior and solve the problem. Alternatively, breakpoints on some networking functions in gdb can show when they are called and with which arguments.
- to investigate firewall issues on Linux: iptables -nvL shows how many packets are matched by each rule (iptables -Z to zero the counters). The LOG target inserted in the firewall chains is useful to see which packets reach them and how they have already been transformed when they get there. To get further, … (ulogd) will log the full packet.
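The "can I connect to this TCP port at all" check that telnet <host> <port> gives you can also be scripted. The following is only a minimal sketch in Python; the function name can_connect and the 3-second default timeout are my own choices for illustration, not part of any tool mentioned above.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        # create_connection resolves the host name (if any) and performs
        # the TCP handshake; the context manager closes the socket again.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # OSError covers refused connections, timeouts, unreachable
        # networks and name-resolution failures alike.
        return False
```

A False result only says that the handshake did not complete; it does not tell you whether a firewall, a dead service, or DNS is to blame, so it is a first filter rather than a diagnosis.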
A surprising number of “network problems” boil down to DNS problems of one kind or another. Initial troubleshooting should use ping -n w.x.y.z in order to leave out DNS resolution of a hostname, and just check IP connectivity. After that, use route -n to check the default IP route without DNS resolution.
After verifying IP connectivity and routing, dig can yield information. Remember that “locking up” can indicate that DNS timeouts are occurring.
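The order described here (IP connectivity first, name resolution second) can be written down as a small decision function. This is a rough sketch, not a replacement for ping and dig: the names ping_ip, resolves and diagnose are hypothetical, and ping_ip assumes the Linux iputils flags -n, -c and -W.

```python
import socket
import subprocess


def ping_ip(ip, timeout_s=2):
    """Send one ICMP echo via the system ping (Linux iputils flags assumed)."""
    result = subprocess.run(
        ["ping", "-n", "-c", "1", "-W", str(timeout_s), ip],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def resolves(hostname):
    """Return True if the system resolver can turn hostname into an address."""
    try:
        socket.gethostbyname(hostname)
        return True
    except OSError:
        return False


def diagnose(ip, hostname, ping=ping_ip, resolve=resolves):
    """Check IP connectivity before blaming (or clearing) DNS."""
    if not ping(ip):
        return "no IP connectivity: check link, addresses and routes first"
    if not resolve(hostname):
        return "IP works but name resolution fails: suspect DNS"
    return "IP connectivity and name resolution both work"
```

The ping and resolve parameters exist only so that the logic can be exercised without touching the network; in normal use you would call diagnose with an IP address you know should answer and a hostname you expect to resolve.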
Don’t forget to check the existence and contents of /etc/resolv.conf. DHCP clients change that file with every lease, and sometimes they get it wrong, or if disk space is tight, an update might not happen.
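Checking that file by eye is usually enough, but the same check can be sketched in a few lines of Python. The function names parse_nameservers and check_resolv_conf are made up for this example, and the parser only looks at nameserver lines, ignoring search, options and the rest.

```python
from pathlib import Path


def parse_nameservers(text):
    """Return the addresses on 'nameserver' lines of resolv.conf-style text."""
    servers = []
    for line in text.splitlines():
        # Ignore anything after a '#' or ';' comment marker.
        line = line.split("#", 1)[0].split(";", 1)[0].strip()
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


def check_resolv_conf(path="/etc/resolv.conf"):
    """Report whether the file exists and which nameservers it lists."""
    p = Path(path)
    if not p.is_file():
        return f"{path} is missing"
    servers = parse_nameservers(p.read_text())
    if not servers:
        return f"{path} has no nameserver lines"
    return f"{path} lists: " + ", ".join(servers)
```

An empty or missing file after a DHCP renewal is exactly the kind of failure described above, and it is quicker to spot this way than by waiting for lookups to time out.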
Cabling problems can exist. If you have access to the hardware, ensure that the cables are all plugged in and mechanically engaged. If you can see routers or ethernet interfaces, ensure that the link lights are on.
Remotely, you have to depend on ethtool:

    [root@flask ~]# ethtool eth0
    Settings for eth0:
        Supported ports: [ TP MII ]
        Supported link modes:   10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
        Supported pause frame use: No
        Supports auto-negotiation: Yes
        Advertised link modes:  10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
        Advertised pause frame use: Symmetric
        Advertised auto-negotiation: Yes
        Speed: 10Mb/s
        Duplex: Half
        Port: MII
        PHYAD: 24
        Transceiver: internal
        Auto-negotiation: on
        Supports Wake-on: g
        Wake-on: d
        Current message level: 0x00000001 (1)
                               drv
        Link detected: yes
“Link detected: yes” is good, but 10Mb/s and Half duplex are not good, as the NIC on that computer can do better. I need to figure out if the NIC is goofed up or the cable is. Another computer plugged into the same router says 100Mb/s, Full duplex.
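The comparison I did by eye (negotiated speed and duplex versus what the NIC advertises) can be automated when there are many machines to check. This is a sketch under the assumption that the ethtool output has the field names shown above; advertised_modes and link_report are names invented for this example.

```python
import re

MODE_RE = re.compile(r"(\d+)base\w+/(Half|Full)")


def advertised_modes(text):
    """Return [(speed_mbps, duplex), ...] from the 'Advertised link modes' field."""
    modes, in_section = [], False
    for line in text.splitlines():
        if "Advertised link modes:" in line:
            in_section = True
            line = line.split("Advertised link modes:", 1)[1]
        elif in_section and ":" in line:
            break  # the next "Key: value" field ends the list
        if in_section:
            modes += [(int(s), d) for s, d in MODE_RE.findall(line)]
    return modes


def rank(mode):
    # Higher speed wins; at equal speed Full beats Half.
    return (mode[0], mode[1] == "Full")


def link_report(text):
    """Summarise link state and flag a link negotiated below the best advertised mode."""
    link = re.search(r"Link detected:\s*(yes|no)", text)
    if not link or link.group(1) != "yes":
        return "no link detected"
    speed = re.search(r"Speed:\s*(\d+)Mb/s", text)
    duplex = re.search(r"Duplex:\s*(Half|Full)", text)
    if not speed or not duplex:
        return "link up, speed/duplex unknown"
    current = (int(speed.group(1)), duplex.group(1))
    best = max(advertised_modes(text), key=rank, default=current)
    if rank(current) < rank(best):
        return (f"link up at {current[0]}Mb/s {current[1]}, "
                f"but NIC advertises {best[0]}Mb/s {best[1]}")
    return f"link up at {current[0]}Mb/s {current[1]}"
```

Fed the output above, this reports a link at 10Mb/s Half while the NIC advertises 100Mb/s Full. It cannot tell whether the NIC or the cable is at fault; it only flags that the link came up below what the card offers.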