Hello ForumAdmin, thank you for your reply. My response below (preceded by "--"):
ForumAdmin wrote: What you are describing sounds very different in that it could only really happen if some resource is being depleted. There is not much I can suggest other than:
1. Set the DEBUG option in csf.conf to 1 (you can increase it up to 4 but it will create a huge log file if you do) and keep a close eye on lfd.log for what is happening. You should obviously check lfd.log for the time you are seeing issues to see if anything is apparent.
-- OK, will try this.
2. Using lsof and strace on some of the processes may help identify the cause of the issue, though it may be difficult on a slow-running system.
-- Will do this as well.
3. Ensure that the root account has no ulimits set.
-- Root has no ulimit set (it reports "unlimited"). I also verified that there is no ulimit in:
/root/.profile
/etc/profile
/etc/pam.d/sshd
/etc/pam.d/su
/etc/rc.local
(I have about 40 instances running Squeeze that *do* have a ulimit set for fail2ban, and their CSF has *never* exhibited the problem.)
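For reference, a minimal Python sketch of how limits can be read from inside a process. The selection of limits is my own assumption, and it reports the limits of the process that runs it (e.g. a root shell), not those of a running lfd:

```python
import resource

# Limits that seem worth a look when a daemon appears resource-starved
# (an assumed selection, not a list taken from csf/lfd).
NAMES = ["RLIMIT_NOFILE", "RLIMIT_NPROC", "RLIMIT_AS", "RLIMIT_CPU"]


def limits_report():
    """Return {limit name: (soft, hard)} for the current process."""
    report = {}
    for name in NAMES:
        const = getattr(resource, name, None)
        if const is None:
            continue  # this limit is not defined on the current platform
        soft, hard = resource.getrlimit(const)
        report[name] = tuple(
            "unlimited" if value == resource.RLIM_INFINITY else value
            for value in (soft, hard)
        )
    return report


if __name__ == "__main__":
    for name, (soft, hard) in limits_report().items():
        print(f"{name}: soft={soft} hard={hard}")
```

Run as root, this should agree with what "ulimit -a" shows in the same session.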
4. If you are using Virtuozzo/OpenVZ then all bets are off as it can be such a nightmare of a system to manage and we would strongly recommend anything else (e.g. Xen, KVM, etc).
-- The issue is occurring on KVM / Wheezy instances. Most instances (40+) are running on KVM. I only have 3 OpenVZ instances (running Squeeze), and they are not being reported here because they have *never* exhibited the problem.
5. Only run csf with the default settings in case any alterations you make are causing timing conflicts.
-- I have been trying to track down the issue by changing *one* config item at a time between the default config and my config to see when it occurs. Yes, this is tedious and time-consuming.
6. Ensure any monitored logs are not being flooded and overloading lfd.
-- I don't believe this is occurring. However, Wheezy logs SSH login failures a bit differently (i.e., "Bye Bye [preauth]"), and this was not covered by the default sshd.conf / ssh-ddos.conf filters of fail2ban. As a result, quite a few of these lines show up in the logs with each unauthorized SSH login attempt before the source is null-routed or blocked by fail2ban. I have added a new regex to tighten up the blocking of attempts logged this way. (These instances only permit public key authentication, do not permit root login, and need to have the SSH port open for their intended purpose. Again, no issues whatsoever on the Debian Squeeze instances.)
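To illustrate the kind of pattern involved, here is a minimal Python sketch. The sample log line and the regex are assumptions for illustration only, not the exact rule I added; a real fail2ban failregex would use the <HOST> placeholder rather than a hand-written IPv4 group:

```python
import re

# Assumed shape of the sshd "[preauth]" disconnect line; IPv4 only.
PREAUTH_RE = re.compile(
    r"sshd\[\d+\]: Received disconnect from "
    r"(?P<host>\d{1,3}(?:\.\d{1,3}){3}): \d+: Bye Bye \[preauth\]\s*$"
)


def preauth_host(line):
    """Return the source IP if the line is a '[preauth]' disconnect, else None."""
    match = PREAUTH_RE.search(line)
    return match.group("host") if match else None


# Illustrative sample line (hostname, PID and address are made up).
sample = (
    "Jan  1 00:00:00 host sshd[1234]: "
    "Received disconnect from 192.0.2.10: 11: Bye Bye [preauth]"
)
```

Calling preauth_host(sample) yields the address from the sample line, while ordinary sshd lines (e.g. an accepted public key login) return None.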
Edit: 7. The resolver processes might also suggest slow-to-respond nameservers - ensure those listed in /etc/resolv.conf are working correctly and quickly.
-- They appear to be working fine (i.e., query times as low as 1 msec; nominally 50 msec). I timed them with "dig", tried my own DNS resolvers, and installed "unbound" on each instance to do resolution from localhost, in an attempt to rule out DNS timing/latency issues.
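As a rough cross-check of the "dig" timings, lookups can also be timed through the system resolver. This is a minimal Python sketch, not what I actually ran; the function name and the default of 3 repeats are my own choices, and results may be affected by any local caching:

```python
import socket
import time


def time_lookup(name, resolve=socket.gethostbyname, repeats=3):
    """Return elapsed times in milliseconds for `repeats` lookups of `name`."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        try:
            resolve(name)
        except OSError:
            # A failed lookup is still timed: slow failures matter here too.
            pass
        timings.append((time.perf_counter() - start) * 1000.0)
    return timings


if __name__ == "__main__":
    for ms in time_lookup("example.com"):
        print(f"{ms:.1f} ms")
```

Unlike "dig", this goes through the same resolver path (/etc/resolv.conf, nsswitch) that ordinary processes use, so a large gap between the two would itself be a hint.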
Again, thank you for your quick reply. I will try some of your suggestions above as well.