Rumour has it that bot traffic now accounts for just over 50% of web requests. While legitimate bots serve a purpose, not all bots are legitimate, and too much bot traffic can cause problems ranging from slow-loading websites to websites going offline entirely. The big problem is that the people running these bots are trying to evade the detection and blocking mechanisms that server owners deploy to prevent those issues. So what’s going on, why is it happening, and what can you do about it?
If you don’t want to read all the preamble, skip ahead to the section on how to block this traffic.
What is a bot?
A bot is a program that reads websites. Often these bots will read and store the contents of websites for analysis that serves some kind of purpose later down the line.
As these bots are computer programs, they’re not really reading websites the way you and I do as humans; they’re mostly just fetching web page content and storing it.
A lot of bots work like this: legitimate bots (like Googlebot), non-legitimate bots (such as hacking or probing bots), and the bots in the grey area in between (data aggregators) all operate in roughly the same way.
Which bots are the problem?
Too much of any kind of bot traffic is a problem, legitimate or otherwise.
The legitimate bots tend to operate in a manner that doesn’t cause problems for server owners. You can also give instructions to these kinds of bots using robots.txt files, which can help define what should and shouldn’t be crawled and, more importantly, how frequently pages should be requested. These types of bots are usually not a problem.
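As a rough illustration, a minimal robots.txt along these lines asks well-behaved crawlers to stay out of certain paths and to slow down (the paths are just placeholders, and Crawl-delay is a non-standard directive that some crawlers, including Googlebot, ignore):
User-agent: *
Disallow: /admin/
Disallow: /search/
Crawl-delay: 10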
Non-legitimate bots are a problem when it comes to things like hacking and harvesting email addresses to send spam to. These bots will only tend to cause server-wide problems if websites that you’re hosting attract this kind of activity. Hosting a lot of poorly secured and rarely updated CMS-based sites will often attract this type of unwanted attention.
This type of bot activity only really causes a problem if there’s a large volume of it, and if there is, your problem isn’t so much the bots themselves; it’s what they’re attracted to (websites that can be compromised). The bots are more of a side effect.
Although problems can occur, server-wide problems tend to be occasional and intermittent, can usually be dealt with using pattern matching and IP-based blocking, and can be dealt with in the longer term by applying security measures to websites. The key point here is that you can usually do something about these kinds of bots.
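As a sketch of that kind of traditional IP blocking, a misbehaving range can be refused in Apache 2.4 configuration (the range below is a documentation placeholder, not a real offender):
<RequireAll>
# Allow everyone except the offending range (placeholder address)
Require all granted
Require not ip 203.0.113.0/24
</RequireAll>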
The bots in the grey area tend to be focused on data aggregation, and they fall into two categories. The first type of data aggregation bot has legitimate intent. The SEO crawlers used by Moz, Semrush and Ahrefs are all examples of data aggregation with a legitimate purpose, as they feed the SEO tools used by website owners.
Their activity can be annoying and can sometimes cause server-wide problems (if it overlaps with other types of crawling), but these bots also tend to adhere to robots.txt, respect 429 responses, and use consistent user agents, so you can usually do something about them crawling your stuff if you really want to.
The second category of data aggregators is the big problem. These bots read a lot of web pages in a short space of time, they don’t respect robots.txt or 429 responses, and they deliberately try to avoid the detection, blocking and/or crawl rate reduction mechanisms that you might have in place.
These bots are the dinner party equivalent of an uninvited guest that eats everyone’s dinner straight from the oven. These bots hitting your server can cause issues ranging from sites loading slowly to all sites going offline.
If you’re running a lot of sites and these non-legitimate data aggregator bots cause a universal “no websites load” type of problem, that’s your phone ringing off the hook, followed by lots of questions about what’s being done to prevent that kind of thing in the future.
What are the problematic bots doing, why, and why is this happening now?
What the problematic bots are doing is reading, and/or trying to discover and then read, lots of web pages in a short space of time.
We don’t completely know why this is happening but we believe that the intent is to harvest data that’s found in web page content. It’s likely that this data will be sold, although beyond that we can’t really say much for sure.
It may be the case that some of this data is used to train AI. The issue has become more prevalent as AI has become more abundant, but whether AI is making crawling easier to carry out or driving more crawling is open to question.
We think that deploying crawlers has become more accessible.
In part this is due to cloud providers (Microsoft Azure COUGH!) offering free tiers of publicly accessible cloud servers. This means that someone can run a server on the internet at no cost.
Even with the above taken into account, crawling used to require a reasonable degree of knowledge to get up and running (which effectively bottlenecked the amount of it that took place), but AI being available, again for free, has effectively reduced the amount of knowledge needed to deploy a crawler.
What this all means is that more people can deploy crawlers with less (or sometimes no) financial expense. If the data that’s harvested is sold, this is effectively free money for the party doing the crawling, but has a cost for hosting providers in terms of electricity, time, additional software, and a support overhead. So thanks for that. [redacted]!
What can you do about these bots?
This is where things start getting a bit tricky. These bots, the ones doing the excessive crawling and causing you server-wide issues, are deliberately trying to evade detection and blocking.
Traditional blocking tends to be IP address based, but these bots use residential IP proxy services, such as anyip(dot)com, to cycle IP addresses. This makes their requests look, from a server perspective, like they’re coming from different systems. The cycling of IPs also means that IP-based blocking won’t work (that’s the whole purpose of this type of service), as you have no way of telling which IP addresses will be cycled to next. In many cases, if you look at your logs you’ll see a very large number of requests coming from a very large number of IP addresses, with some IP addresses making as few as one or two requests each.
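If you want to see this pattern in your own logs, a quick count of requests per client IP makes it obvious (a minimal sketch; the log path is an assumption, so adjust it for your distro or panel):
# Count requests per client IP, busiest first
awk '{print $1}' /var/log/apache2/access.log | sort | uniq -c | sort -rn | head -25
With this kind of traffic you’ll typically see a long tail of addresses making only one or two requests each.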
ModSecurity does offer pattern-matching-based blocking, although in some cases this falls back to IP blocking, and you can only do things like user agent based blocking effectively in newer versions of ModSecurity. The newer versions can help if there’s a constant user agent in use, but in many cases there isn’t. Just as the cycling of IP addresses is used to defeat IP blocking, the spoofing of randomised user agents is used by many bots to defeat user agent based blocking.
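For completeness, when a bot does keep a consistent user agent, a ModSecurity rule along these lines can deny it (“badbot” is a placeholder string and the rule ID is arbitrary; it’s only useful while the user agent stays constant):
# Deny requests whose User-Agent matches the placeholder pattern "badbot"
SecRule REQUEST_HEADERS:User-Agent "@rx (?i)badbot" "id:1009901,phase:1,deny,status:403,log,msg:'Blocked scraper user agent'"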
This randomisation is taken a step further by these bots. Anything that can be randomised will be randomised, so that you can’t use pattern matching to block. If there’s randomisation, there’s no pattern, and if there’s no pattern, you don’t have much you can use to block these bots.
If you’re reading this in disbelief you might consider taking a look at some of these subreddits dedicated to, or related to, scraping and crawling (catch you in a couple of months, right?):
https://www.reddit.com/r/webscraping/
https://www.reddit.com/r/scrapingtheweb/
https://www.reddit.com/r/scrapinghub/
https://www.reddit.com/r/scrapy/
https://www.reddit.com/r/WebScrapingInsider/
https://www.reddit.com/r/hacking/
https://www.reddit.com/r/ComplexWebScraping/
https://www.reddit.com/r/WebDataDiggers/
https://www.reddit.com/r/scrapetalk/
https://www.reddit.com/r/webscrapping/
https://www.reddit.com/r/ResidentialProxies/
https://www.reddit.com/r/AI_Agents/
With the above taken into account there is one pattern that these bots can’t do anything about, and that’s the resolve of the British public when it comes to not doing sketchy things with their computers.
What? What’s that got to do with any of this?
I know that’s a bit of a tangent, so let me provide a bit of context.
We’re lucky to be in the position where we primarily provide web hosting for people and businesses in the UK.
Very little of this bot traffic originates from the UK, I would guess because members of the UK public aren’t volunteering their internet connections to the likes of anyip(dot)com.
For once, I find myself glad that GCSE computing has historically consisted of learning how to use MS Office and has had very little to do with actual computing. The net result is that the general British public, unless self-motivated or professionally orientated otherwise, are really bad with computers. And because most of the UK population is pretty bad with computers, they’re usually not interested in anything that sounds remotely dubious, which anyip(dot)com does. They’re [redacted] too, by the way, just so we’re clear on that.
The net result of the above means that we can use country blocking, to a degree, to mitigate these waves of bot traffic.
Here’s how.
How to stop high volume scraping or horrific bot traffic with mod_remoteip and mod_maxminddb.
A combination of mod_remoteip and mod_maxminddb can be used to block the countries from which scraping is originating. Customers can also use .htaccess files to set up their own country specific blocking or allowing.
Although this is a bit of a sledgehammer to crack a nut, there are no patterns, so what else can you do? Cry? Pray? Hunt down all the residential IP proxy providers and unplug their stuff? It’s not very likely, is it?
The main thing to bear in mind with this technique is that there is still an element of IP address involvement here, as in “this IP address is in that country”. The reason that you can’t use mod_maxminddb alone is the use of CDNs.
If you host a lot of sites, and some of these use CDNs such as Cloudflare, you can inadvertently block traffic from the CDN rather than traffic from the scraping bot. Using mod_remoteip avoids inadvertently blocking CDN IP addresses by ensuring the client’s real IP address is restored before the country specific blocking (done by mod_maxminddb) takes place.
You’re effectively using mod_remoteip so that you block the scraper’s IP in another country, rather than the CDN IP that the scraper’s requests arrive from.
Using mod_maxminddb without mod_remoteip would result in CDN IPs in blocked countries being blocked, rather than scrapers, which would make websites using CDNs appear intermittently down for some visitors.
Deploying mod_remoteip.
If you’re using cPanel, this article covers how to restore visitors’ IPs with mod_remoteip on cPanel servers: https://support.cpanel.net/hc/en-us/articles/360051107513-How-to-restore-visitors-IP-with-mod-remoteip
If you’re running vanilla Apache, you should be able to:
1. Enable the mod_remoteip Module:
Debian/Ubuntu: Run:
sudo a2enmod remoteip
RHEL / Alma / Rocky / CentOS / cPanel / Enhance:
mod_remoteip is usually already compiled with Apache
You don’t normally add a LoadModule line manually on these systems
Instead, you verify it is loaded:
[root@cp8 ~]# httpd -M | grep remoteip
remoteip_module (shared)
If it is, you’re good. If it isn’t (which is rare), you’d add:
LoadModule remoteip_module modules/mod_remoteip.so
to either:
/etc/httpd/conf.modules.d/00-base.conf, or
a vendor-approved include file.
2. Configure Apache
Edit your Apache configuration file to recognise the proxy header:
Debian/Ubuntu: /etc/apache2/conf.d/remoteip.conf
RHEL/CentOS/Enhance:
Pre-virtual host include in /etc/httpd/conf/httpd.conf or equivalent.
Only one RemoteIPHeader directive can be used when you do this.
Use this if you want to support Cloudflare and/or other proxies or CDNs (e.g. QUIC.cloud, load balancers):
RemoteIPHeader X-Forwarded-For
Use this only if the site(s) are exclusively behind Cloudflare and will never be accessed via any other proxy:
RemoteIPHeader CF-Connecting-IP
# Define your proxy IP addresses (put Cloudflare's IP addresses in this section)
RemoteIPTrustedProxy 173.245.48.0/20
RemoteIPTrustedProxy 103.21.244.0/22
RemoteIPTrustedProxy 103.22.200.0/22
RemoteIPTrustedProxy 103.31.4.0/22
RemoteIPTrustedProxy 141.101.64.0/18
RemoteIPTrustedProxy 108.162.192.0/18
RemoteIPTrustedProxy 190.93.240.0/20
RemoteIPTrustedProxy 188.114.96.0/20
RemoteIPTrustedProxy 197.234.240.0/22
RemoteIPTrustedProxy 198.41.128.0/17
RemoteIPTrustedProxy 162.158.0.0/15
RemoteIPTrustedProxy 104.16.0.0/13
RemoteIPTrustedProxy 104.24.0.0/14
RemoteIPTrustedProxy 172.64.0.0/13
RemoteIPTrustedProxy 131.0.72.0/22
You can get Cloudflare’s IP addresses from here and you can get QUIC.cloud IP addresses from here.
You’ll need to keep the defined IP lists up-to-date as Cloudflare and QUIC.cloud sometimes add new ranges.
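One way to keep the Cloudflare part of the list current is to regenerate it from Cloudflare’s published plain-text range lists. A minimal sketch, assuming you write the directives to an include file that your Apache configuration pulls in (the output path is an assumption):
# Rebuild a RemoteIPTrustedProxy include from Cloudflare's published ranges
curl -s https://www.cloudflare.com/ips-v4 https://www.cloudflare.com/ips-v6 | sed 's/^/RemoteIPTrustedProxy /' > /etc/httpd/conf/cloudflare-trusted-proxies.conf
# Reload Apache so the new ranges take effect
apachectl graceful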
Your whole block should look like this if you’re covering all proxies and load balancers:
# Use X-Forwarded-For for standard proxies
RemoteIPHeader X-Forwarded-For
RemoteIPTrustedProxy 173.245.48.0/20
RemoteIPTrustedProxy 103.21.244.0/22
RemoteIPTrustedProxy 103.22.200.0/22
RemoteIPTrustedProxy 103.31.4.0/22
RemoteIPTrustedProxy 141.101.64.0/18
RemoteIPTrustedProxy 108.162.192.0/18
RemoteIPTrustedProxy 190.93.240.0/20
RemoteIPTrustedProxy 188.114.96.0/20
RemoteIPTrustedProxy 197.234.240.0/22
RemoteIPTrustedProxy 198.41.128.0/17
RemoteIPTrustedProxy 162.158.0.0/15
RemoteIPTrustedProxy 104.16.0.0/13
RemoteIPTrustedProxy 104.24.0.0/14
RemoteIPTrustedProxy 172.64.0.0/13
RemoteIPTrustedProxy 131.0.72.0/22
3. Update Log Format
Ensure your log format uses %a to log the actual client IP instead of %h (which logs the proxy IP) in your global Apache config:
Debian/Ubuntu: /etc/apache2/apache2.conf or a file in /etc/apache2/conf-available/ that you then a2enconf and reload Apache.
RHEL/CentOS/Enhance: /etc/httpd/conf/httpd.conf or a pre-virtual-host include (so it applies globally).
Using:
LogFormat "%a %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined
4. Restart Apache
Ubuntu/Debian:
sudo systemctl restart apache2
CentOS/Bitnami:
sudo apachectl restart
or
sudo /opt/bitnami/ctlscript.sh restart apache
5. Verify the installation
Verify mod_remoteip is in use by running:
apachectl -M | grep remoteip
And you should see:
remoteip_module (shared)
Once you’ve done all that, Apache “sees” the client/bot IP, rather than the CDN IPs.
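A quick sanity check at this point: visit a Cloudflare-proxied site you host and confirm that your own public IP, rather than a Cloudflare address, appears in the newest log entries (the log path is an assumption; adjust it for your distro or panel):
# The most recent entries should show real visitor IPs, not Cloudflare ranges
tail -n 5 /var/log/apache2/access.log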
Deploying mod_maxminddb.
MaxMind is the thing that does the “this IP is in this country” part. Deploying this is really a two-part job: installing mod_maxminddb and defining the database of IPs.
To install mod_maxminddb:
1. Get the maxminddb package:
wget https://github.com/maxmind/mod_maxminddb/releases/download/1.3.0/mod_maxminddb-1.3.0.tar.gz
Extract the package, then install maxminddb:
tar -zxf mod_maxminddb-1.3.0.tar.gz
cd mod_maxminddb-1.3.0
./configure
make install
On some systems, the module may end up in /usr/lib/apache2/modules/ (Debian/Ubuntu) or /etc/httpd/modules/ (RHEL/CentOS), so adjust the LoadModule path accordingly in step 3.
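If ./configure or make complains about missing headers, you’ll need the Apache development files and the libmaxminddb library first. A sketch of the usual prerequisites (package names are assumptions and vary by distro; libmaxminddb-devel may need the EPEL repository):
# RHEL / Alma / Rocky (httpd-devel provides apxs)
yum install -y gcc make httpd-devel libmaxminddb-devel
# Debian / Ubuntu
apt-get install -y gcc make apache2-dev libmaxminddb-dev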
2. Get a local copy of the maxminddb database:
mkdir -p /usr/local/share/GeoIP
wget -O /usr/local/share/GeoIP/GeoLite2-Country.mmdb https://git.io/GeoLite2-Country.mmdb
3. Configure maxmind to use the database:
RHEL/CentOS/Enhance:
nano /etc/apache2/conf.d/mod_maxminddb.conf
Debian/Ubuntu:
nano /etc/apache2/conf-available/mod_maxminddb.conf and then enable:
a2enconf mod_maxminddb
Then add:
# Load the module (adjust path if needed)
LoadModule maxminddb_module modules/mod_maxminddb.so
# Enable MaxMindDB
MaxMindDBEnable On
# Define the database
MaxMindDBFile COUNTRY_DB /usr/local/share/GeoIP/GeoLite2-Country.mmdb
# Create environment variable for the country code
MaxMindDBEnv MM_COUNTRY_CODE COUNTRY_DB/country/iso_code
The last line is case-sensitive: COUNTRY_DB/country/iso_code is correct as per mod_maxminddb.
Then restart apache.
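The GeoLite2 database goes stale over time as IP allocations move between countries, so it’s worth refreshing it periodically. A minimal sketch using the same download location as above (the cron file path is illustrative, and MaxMind’s geoipupdate tool is the more robust option if you have a licence key):
# /etc/cron.d/geolite2-refresh - refresh the country database weekly, then reload Apache
0 3 * * 0 root wget -q -O /usr/local/share/GeoIP/GeoLite2-Country.mmdb https://git.io/GeoLite2-Country.mmdb && apachectl graceful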
Blocking and allowing countries.
There are a few options here, blocking and allowing are both possible either globally, or on a per site basis.
In some cases, if you see any regularity in countries from which scraping originates, you can deploy a global rule to block these countries, then remove it when the bots have given up.
Global server wide country blocking and allowing.
You can use a pre-virtual host include to configure country blocking or allowing on the server as a whole. The <Directory "/home"> part is what applies this to all sites under /home.
If you’d like to block China and Brazil, you’d look up their ISO A2 country codes (CN and BR).
You’ll also need to restart apache for these changes to come into effect.
Then block them using this in a pre-virtual host include:
<Directory "/home">
<IfModule mod_maxminddb.c>
RewriteEngine on
RewriteCond %{ENV:MM_COUNTRY_CODE} ^(BR|CN)$
RewriteRule .* - [F]
</IfModule>
</Directory>
Conversely, if you’d like to allow a list of European countries, you can invert the country condition, then define a list of countries allowed to access sites (again using their ISO A2 country codes). Countries that aren’t listed will be blocked.
<Directory "/home">
<IfModule mod_maxminddb.c>
RewriteEngine on
RewriteCond %{ENV:MM_COUNTRY_CODE} !^(GB|IE|AD|AT|BE|BG|HR|CY|CZ|DK|EE|FI|FR|DE|GR|HU|IT|LV|LT|LU|MT|NL|PL|PT|RO|SK|ES|SE)$
RewriteRule .* - [F]
</IfModule>
</Directory>
Per site country blocking and allowing.
The rules above can be used in .htaccess files specific to individual sites to block or allow countries to access these sites.
Some guidance says this won’t work.
Using mod_maxminddb can work in .htaccess if the module is loaded and the environment variable (MM_COUNTRY_CODE) is set before .htaccess is processed.
In practice, this depends on your Apache/OpenLiteSpeed configuration:
MaxMindDBEnable On must be enabled in the server context (not just inside a <VirtualHost> or <Directory> block).
If AllowOverride permits mod_rewrite directives in .htaccess (usually it does), then RewriteCond %{ENV:MM_COUNTRY_CODE} can work.
The reason many guides say it won’t work in .htaccess is that, in a vanilla Apache setup, environment variables set in the main config aren’t always carried into per-directory contexts; hosting setups like shared cPanel hosting often make it work because of how the module interacts with per-directory rewrites.
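If you want to check whether the variable is actually visible to a particular site, one simple test is to echo it into a response header from that site’s .htaccess (the header name is arbitrary; remove these lines once you’ve checked):
<IfModule mod_headers.c>
# Expose the detected country code for debugging, e.g. "X-MM-Country: GB"
Header set X-MM-Country "%{MM_COUNTRY_CODE}e"
</IfModule>
Request a page with curl -I and look at the header; if the value comes back empty or null, the environment variable isn’t reaching .htaccess on that setup.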
The rules are the same minus the <Directory "/home"> part.
So if you’d like to block China and Brazil on a per site basis, you’d look up their ISO A2 country codes.
Then add this to the site’s .htaccess file:
<IfModule mod_maxminddb.c>
RewriteEngine on
RewriteCond %{ENV:MM_COUNTRY_CODE} ^(BR|CN)$
RewriteRule .* - [F]
</IfModule>
Alternatively if you’d like to allow a list of European countries on a per site basis, you can add this to the site’s .htaccess:
<IfModule mod_maxminddb.c>
RewriteEngine on
RewriteCond %{ENV:MM_COUNTRY_CODE} !^(GB|IE|AD|AT|BE|BG|HR|CY|CZ|DK|EE|FI|FR|DE|GR|HU|IT|LV|LT|LU|MT|NL|PL|PT|RO|SK|ES|SE)$
RewriteRule .* - [F]
</IfModule>
It is also possible to combine allowing and blocking on a per site basis like this (again, this goes in the .htaccess file):
<IfModule mod_maxminddb.c>
RewriteEngine on
# Block countries explicitly
RewriteCond %{ENV:MM_COUNTRY_CODE} ^(BR|CN)$
RewriteRule .* - [F,L]
# Allow only these European countries; block all others
RewriteCond %{ENV:MM_COUNTRY_CODE} !^(GB|IE|AD|AT|BE|BG|HR|CY|CZ|DK|EE|FI|FR|DE|GR|HU|IT|LV|LT|LU|MT|NL|PL|PT|RO|SK|ES|SE)$
RewriteRule .* - [F,L]
</IfModule>
Frequently Asked Questions: Blocking Bot Traffic
What is a bot?
A bot is a program that reads websites automatically, often storing content for analysis. Some bots are legitimate, like search engine crawlers, while others are non-legitimate or malicious.
Which bots are problematic?
Non-legitimate bots cause issues such as scraping, spam harvesting, or hacking attempts. Even some data aggregator bots can overwhelm servers if they ignore rules like robots.txt or use randomized IPs and user agents.
Why is bot traffic increasing now?
Bot traffic has increased due to wider availability of AI, cloud servers with free tiers, and easier deployment methods. More people can run crawlers, often harvesting data for resale or analysis.
What problems can excessive bot traffic cause?
High volumes of bot traffic can slow websites, cause intermittent downtime, increase server load, and trigger support issues if unmanaged.
How can I block or manage these bots?
You can mitigate bot traffic using server-level tools like mod_remoteip and mod_maxminddb to restore real client IPs and implement country-based blocking. IP-based and user agent pattern blocking can also help, but sophisticated bots may circumvent these.
Can country blocking help?
Yes. For example, if most unwanted bot traffic originates outside your target region, you can block countries at the server level or in per-site .htaccess files using MaxMind DB.
Do legitimate bots get blocked?
If configured carefully, legitimate bots like Googlebot or SEO crawlers are usually unaffected. Always review allowed IPs and follow standards like robots.txt.
Does using a CDN affect bot blocking?
Yes. CDNs like Cloudflare can mask the original client IP. Using mod_remoteip ensures the real client IP is captured before applying rules.
What about residential proxy bots?
These are harder to block because they rotate IPs and use randomized user agents. Country-level blocking or rate limiting may help reduce the impact.
Can country blocking be applied on a per-site basis?
Yes, as long as mod_remoteip and mod_maxminddb are enabled server-wide. Some shared hosts may allow .htaccess-level configuration for per-site control.
Where can I learn more about implementing these measures?
Refer to the documentation for mod_remoteip and mod_maxminddb, MaxMind GeoLite2 databases, and Apache RewriteCond rules for country blocking.