Bot Traffic

Fear and Loathing in Bot-Vegas (Otherwise Known as the Internet).

The internet has always been rife with automated traffic, crawlers, bots, monitoring and non-human activity. Recently this has begun to grow at a disproportionate rate. Today, for the first time in 10 years of working in web hosting, I saw something that genuinely worried me, which is why I’m writing this blog post.

What is a Bot?

A bot, in the context of the internet, is an automated program or collection of programs that read and/or interact with online services. I would say that they interact with websites, but it’s not just websites that the bots show an interest in. Bots do things with email services, and with other services operated on publicly accessible servers on the internet, such as checking links in emails, or trying to break into your stuff. Bad robot. Stop that!

There have to be bots, to some degree. Take adding a website to Google search, for example. That involves a bot reading your website, and then using what it’s read to populate part of a giant database of websites. That database is, in part, what’s used to establish where your website should appear in search engine results for certain search terms.
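
To make that a bit more concrete, here’s a very rough sketch of the sort of thing a search bot does at its simplest: fetch a page, keep a copy of what it read, collect the links, and move on to the next page. This isn’t Googlebot’s actual code (that isn’t public), just a minimal illustration using Python’s standard library; a real crawler would also respect robots.txt, rate limits and so on.

    # Minimal crawler sketch (illustrative only, not how any real search bot works).
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import Request, urlopen

    class LinkCollector(HTMLParser):
        """Collects href values from <a> tags on a page."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(start_url, max_pages=5):
        """Fetch pages breadth-first from start_url and return {url: html}."""
        to_visit, seen, index = [start_url], set(), {}
        while to_visit and len(index) < max_pages:
            url = to_visit.pop(0)
            if url in seen:
                continue
            seen.add(url)
            request = Request(url, headers={"User-Agent": "example-bot/0.1"})
            try:
                with urlopen(request, timeout=10) as response:
                    html = response.read().decode("utf-8", errors="replace")
            except OSError:
                continue  # skip pages that fail to load
            index[url] = html  # a search bot would feed this into its database
            parser = LinkCollector()
            parser.feed(html)
            to_visit.extend(urljoin(url, link) for link in parser.links
                            if not link.startswith(("mailto:", "javascript:", "#")))
        return index

    if __name__ == "__main__":
        pages = crawl("https://example.com/")
        print(f"Read {len(pages)} page(s)")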

Without the Googlebot, the internet wouldn’t be as useful as it currently is. If you can remember what the internet was like before Google, you’ll probably remember hours of searching a never-ending list of semi-related results until you found what you were looking for. Google’s use of their bot and database made the internet a lot more accessible, simply by providing the most relevant result (based on your search query) at the top of the search results. When relevance-based search results came about, the internet became a lot more accessible (and a lot more helpful) to a much wider audience than it had initially appealed to.

There’s too much information on the internet to be processed by humans, so Google had to have a bot.

And if Google have a bot, other organisations are going to have bots.

Bots don’t just read websites. Bots can also interact with websites. A lot of contact form spam is generated by bots, which is why you often get asked to tick a box that says “I’m not a robot” when using a contact form.

Good Bots and Bad Bots.

The purpose of bots varies considerably. As I’ve mentioned above, the purpose of the Googlebot is to read your website and pass that information back to Google. There are also bad bots that try to spam your contact form. And there are bots that are neither good nor bad, such as robots that collect data used for marketing.

Examples of Good Bots.

  • Robots used by search engines. I’m talking about the Googlebot, the Bingbot, Yandex’s bot(s) and bots belonging to lesser-known search engines, such as timpi.io. These bots all serve a purpose, which is ultimately to help you find relevant information online. These bots are helping you, and making the internet more accessible.
  • Robots used by SEO tools. SEO tools such as Moz, Semrush and AHRefs all have their own bots. These bots are used to collect information about websites to do things like find backlinks, calculate how competitive certain search terms are, and possibly work out search volumes for keywords as well. Whilst the purpose of these bots is to help webmasters gain greater visibility of their website(s), the amount of traffic they generate can be high, and a bit annoying for systems administrators (like me) when it causes high CPU usage, as this makes websites slow down… which customers complain about. These bots are well intentioned, but accidentally annoying.
  • Data Aggregators. These bots harvest data that’s available on the internet. The prime example of this is Common Crawl, which has read an epic amount of the internet, stored it, and made it available for anyone that wants to use it. One of ChatGPT’s early iterations was trained using a data set from the Common Crawl database. I thought that ChatGPT might initially be obsessed with cats doing funny things and working out how to generate memes, but unfortunately things weren’t so innocent in ChatGPT’s early days. On the plus side, you can be informed about how to make better pizza with glue. Just like the bots used by SEO tools, these bots are well intentioned, but annoying.
  • Bots that protect you. I’ll admit we’re not entirely sure about this one, but we do think this happens. We’ve had a few customers complain about website performance, and when we’ve looked into it, the site has been handling a lot more requests than normal. Often many of the website requests originate from mail providers, and the volume of these causes the website to slow down. Coincidentally (or not), many of the customers asking why their website is slow have just sent a mail shot. What we think is happening is that mail providers are protecting their customers by scanning links in marketing emails for ones that lead to malware. If a link leads to malware, the email doesn’t get delivered. That kind of thing. A customer’s website is slow, but at least nobody got their computers infected. It kind of makes sense.

Examples of Bots We’re Not Sure About.

  • Security Scanning Bots. I’m talking about organisations like Censys and Threatview. Although we can see why these bots scan things (so that they can tell their customers if something is suspect), do they have to do this to all the sites on an entire server? What’s the environmental footprint of this? More on that later. And can’t you do this with some kind of browser plugin rather than reading 400 websites in their entirety in 20 minutes? The intent is good, but the effect is unnecessarily resource heavy.
  • Marketing Bots. Zoominfo is an example of one of these (although there are many others). Zoominfo provides a kind of giant business directory that your business gets into by being on the internet. Zoominfo’s bot finds your website, and you’re listed. Their objective seems to be to link buyers to sellers. Lots of traffic, a paid-for service and lots of mentions of things like “ideal customer profile” and “buying signals”. Tasty if you’re into that kind of thing, I guess. Sucks if you’re a systems administrator that answers the phone.
  • AI Bots. Whilst these are technically data aggregator type bots, we’re not very sure about them. The reason we’re not sure about them is that they’re fairly new, and their activity varies according to organisation. Take Anthropic’s Claude bot, for example. This scraped a lot of sites on our platform, but we didn’t notice much, as its crawl rate was fairly unintrusive and it didn’t cause us any problems. Meta/Facebook have AI as well, and their bot crawls at an aggressive rate for a short period of time. We see a short burst of high CPU load that drops off quite quickly. That said, we don’t know if this is their AI bot, or if this is some other Facebook crawling mechanism. It’s not like they tell us. I’m sure you can see why we’re not very sure about these types of bots. Amazon’s bot we had to block (again, we’re not completely sure if this was an AI bot or not), but we didn’t really have much choice about this due to the number of servers alerting for CPU load (there’s a rough sketch of what we mean by “blocking” just below).
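
For what it’s worth, when I say “block”, this is roughly the idea. In practice we do this in the web server or firewall configuration rather than in application code, and the user agent strings below are made-up examples, not a recommendation:

    # Rough sketch of user-agent based blocking (in reality this lives in web
    # server / firewall config). The substrings below are made-up examples.
    BLOCKED_UA_SUBSTRINGS = ("ExampleHeavyBot", "AnotherAggressiveCrawler")

    def is_blocked(user_agent):
        """Return True if the user agent matches anything on the blocklist."""
        ua = (user_agent or "").lower()
        return any(fragment.lower() in ua for fragment in BLOCKED_UA_SUBSTRINGS)

    def blocking_middleware(app):
        """WSGI middleware that answers 403 to blocked user agents."""
        def wrapper(environ, start_response):
            if is_blocked(environ.get("HTTP_USER_AGENT", "")):
                start_response("403 Forbidden", [("Content-Type", "text/plain")])
                return [b"Forbidden"]
            return app(environ, start_response)
        return wrapper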

Examples of Bad Bots.

  • Spambots. These harvest email addresses from websites and they automatically complete and submit contact forms. Spam (or unsolicited email) is the objective here. You can thank these guys for having to click on pictures of bicycles lots of times just to send a message.
  • Probing bots. These bots are looking for vulnerabilities in websites that can be used to compromise or attack a website. They’ll scan hundreds of websites on a server, then attack or compromise websites that are vulnerable. These guys are why you need to apply your updates (if you’re using a CMS) or secure your code (if you’ve written your own website).
  • Brute forcing bots. Brute forcing is the repeated guessing of usernames and passwords to gain unauthorised access to something that can be logged into. It’s the protection that’s deployed to prevent these kinds of bots breaking in that causes your IP address to become blocked when you can’t remember a password and start guessing it. These bots are also the reason why you need security plugins and web application firewalls in your CMS (there’s a rough sketch below of the pattern they leave in the logs).

All of these bad bots are usually created by hackers. Why would a hacker sit there trying to guess your password by typing it, when they can write a program that can carry out many more guesses per second than a human ever could? They wouldn’t. They’d write that program, run it, then go to the pub. And that’s why these bots exist.
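
Speaking of the brute forcing bots, the tell-tale sign is lots of login attempts from the same IP address in a short space of time. Here’s a rough sketch of how you might spot that in an access log; the log format and the login paths are assumptions on my part, and tools like fail2ban do this properly:

    # Rough sketch: count login attempts per IP in an access log and flag anything
    # suspicious. Assumes a common/combined log format where the IP is the first
    # field and the request line is the quoted "METHOD path protocol" section.
    import re
    from collections import Counter

    LOGIN_PATHS = ("/wp-login.php", "/xmlrpc.php")  # assumed CMS login endpoints
    THRESHOLD = 50  # attempts before we treat an IP as brute forcing

    LINE_RE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(\S+) (\S+)')

    def suspicious_ips(log_path):
        attempts = Counter()
        with open(log_path, errors="replace") as log:
            for line in log:
                match = LINE_RE.match(line)
                if not match:
                    continue
                ip, method, path = match.groups()
                if method == "POST" and path.split("?")[0] in LOGIN_PATHS:
                    attempts[ip] += 1
        return {ip: count for ip, count in attempts.items() if count >= THRESHOLD}

    if __name__ == "__main__":
        for ip, count in suspicious_ips("access.log").items():
            print(f"{ip}: {count} login attempts")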

Bots + Bots = What?

As you’ve probably gathered by now, there are LOADS OF BOTS. These bots can also do things a lot faster than humans can. Bots also don’t need to sleep. The rough picture is that you’ve got a lot of automated things, doing a lot of stuff, to a lot of online services, all of the time.

What does this all add up to?

A lot of website requests, a lot of network traffic, a lot of bandwidth consumption, a lot of processing power, and a lot of electricity. It doesn’t sound very appealing when put like that, does it?

Estimates from Cloudflare suggest that about 40% of internet traffic is non-human. Because we’re hosting websites, we think we receive a greater proportion of website requests from robots than from humans: of every 10 website requests, we reckon somewhere between 6 and 8 are from robots.
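
That 6-to-8-in-10 figure isn’t something we can prove precisely, but you can get a rough feel for your own sites by classifying requests by user agent. Here’s a crude sketch; the keyword list is mine and far from exhaustive, and plenty of bots fake a browser user agent, so treat the result as a lower bound:

    # Crude sketch: estimate what share of requests come from self-identified bots
    # by looking at the user agent (the last quoted field in combined log format).
    # The keyword list is illustrative and incomplete; bots pretending to be
    # browsers won't be counted, so this underestimates the real bot share.
    BOT_KEYWORDS = ("bot", "crawl", "spider", "slurp", "petalsearch",
                    "semrush", "ahrefs", "gptbot", "chatgpt", "ccbot")

    def bot_share(log_path):
        human = bot = 0
        with open(log_path, errors="replace") as log:
            for line in log:
                try:
                    user_agent = line.rsplit('"', 2)[-2].lower()
                except IndexError:
                    continue
                if any(keyword in user_agent for keyword in BOT_KEYWORDS):
                    bot += 1
                else:
                    human += 1
        total = human + bot
        return (bot / total * 100) if total else 0.0

    if __name__ == "__main__":
        print(f"Bot share of requests: {bot_share('access.log'):.1f}%")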

You might wonder why these bots are doing all this. Whilst the purpose of the bad bots does vary, it’s usually along the lines of using server resources for a nefarious purpose: email generation for spamming, or processing power for bitcoin mining. The good bots, and the ones we’re not too sure about, they’re after data. Lots and lots of data. Us humans, we’re obsessed with data, and we’re using bots to get it!

All the infrastructure that’s used by these bots, both to get to the website, for the website to be served, and for the discovered data to be stored and retrieved, uses electricity. Lots and lots of electricity. And this is on top of the AI infrastructure behind the respective bot; it’s electricity used by the provider whose websites the bot is reading. From our perspective, that’s our electricity bill going up, due to the high-volume reading of websites that these AI bots undertake.

Imagine being able to cut your electricity bill by 40% just like that, by blocking these robots!

Funnily enough, that was the job I was given in 2019.

2019: Year of the Bot.

It was lockdown. We were all working from home, or on furlough, worrying about when we’d next be able to use real toilet paper. A lot of people were worrying about that kind of thing, but we were worrying about our network traffic, bandwidth and electricity bill, which just seemed to be getting higher and higher and higher.

At first we thought this was a symptom of more people working from home and turning to online services, but we were also seeing things like spikes in CPU load on servers due to high volumes of website requests in short periods of time.

After a bit of investigation I found that something called “petal bot” was mass-crawling websites on all our servers. This, on top of all the “normal” bot traffic (as mentioned above), made the total amount of bot traffic considerable; at least noticeably more than normal.

The petal bot belongs to Huawei.

Do you remember Donald Trump banning Huawei from buying US tech? This US tech included Android, which was, at the time, the operating system Huawei’s smartphones used. Huawei were effectively banned from using a suite of Google services, some of which were integral to arms of their business.

Huawei didn’t take this lying down, and instead invented their own versions of what had been taken away from them by Donald Trump’s ban. This included creating their own search engine, Petal Search, and it was this search engine’s bot that I was seeing crawling our estate.

Thanks, Don, for the extra on the electricity bill… [slow clap].

To me, this seemed nuts: one country (about 3500 miles away from our data centre) could make a political move against another country (about 5000 miles away from our data centre in the other direction) that resulted in our data centre consuming more electricity.

The effect didn’t end there. We weren’t the only people to notice this, and some of the other people that noticed this (probably) also operate bots.

Since Don’s 2019 tech tantrum, we’ve seen bot traffic increase and increase. Although some of this was probably off the back of bot operators thinking “Well, if Petal Search can get away with a crawl rate like that… why are we holding back?!?!?”, it wasn’t the only factor contributing to the increase in bot traffic.

AI and Bots.

As I’ve already mentioned above, the behaviour of AI bots varies according to organisation. AI bots are also a fairly new addition to the automated traffic landscape.

One of the other bots that became more prevalent around lockdown time was the CCBot, which is Common Crawl’s bot. It was a dataset provided by Common Crawl that was used to train early ChatGPT models.

ChatGPT has since become very popular. AI as a whole has, and it’s being integrated into more and more things every day. Despite its popularity, AI is hungry not just for data, but also for electricity.

ChatGPT uses about 2.9 Wh of energy per query. That’s roughly what it takes to charge a smartphone battery by 20%. I’ve typed 5 queries into ChatGPT so far while writing this, so that’s a full smartphone charge. Ironically, one of these queries was “what uses about 2.9 Wh of energy?”.

2.9 Wh of energy is about 10 times more electricity than a standard Google search.

Consider that ChatGPT answers about 200 million queries per day. That’s about 621.4 MWh every day, which is roughly a day’s electricity for tens of thousands of homes. I’ve just used another 20%’s worth of a smartphone charge finding that out.
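
For anyone checking my working, the back-of-an-envelope version is below. The per-home figure assumes roughly 7.5 kWh of electricity per UK household per day (my assumption), and the 621.4 MWh estimate I’ve quoted presumably uses a slightly higher query count than my round 200 million:

    # Back-of-an-envelope arithmetic for the figures above.
    WH_PER_QUERY = 2.9           # reported estimate, Wh per ChatGPT query
    QUERIES_PER_DAY = 200e6      # roughly 200 million queries per day
    KWH_PER_HOME_PER_DAY = 7.5   # assumed average UK household electricity use

    daily_wh = WH_PER_QUERY * QUERIES_PER_DAY
    daily_mwh = daily_wh / 1e6                       # 1 MWh = 1,000,000 Wh
    homes = daily_wh / (KWH_PER_HOME_PER_DAY * 1e3)

    print(f"{daily_mwh:.0f} MWh per day")            # ~580 MWh with these round numbers
    print(f"roughly {homes:,.0f} homes' worth of daily electricity")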

The likes of ChatGPT and other AI need information to be able to do what they do. They get this information as a product of bot activity. Bot activity that involves reading data hosted with external parties.

Not only does ChatGPT consume electricity by answering queries, it also causes the likes of us (as a hosting provider) to consume electricity when its bots or data aggregators read websites on our platform. Although I’ve used our platform as an example here, it’s unlikely that it’s just us that’s subject to this activity. This is probably going on for all web hosting providers.

If you consider that there are well over a billion websites out there, that’s a lot of reading, and an awful lot of electricity. That’s also an awful lot of data to store (which also uses electricity) and a lot of data used for training… yep, that also uses electricity.

The queries themselves are a smaller part of a bigger picture of energy consumption associated with AI.

The people running these bots have a lot of money and power. So much so that some of them are going to buy energy from a fleet of mini nuclear reactors, and others are buying power stations so they can build data centres next to them. That must be nice for them. Whilst this does allow them to power their AI models, it doesn’t help anyone who’s hosting the data that the AI models will use bots to read.

Providers like ourselves, and most likely most other web hosting providers, don’t have the option of buying power stations or harnessing the power of the atom for our own means. Yet we have an electricity overhead that’s a product of people who do have those means.

What Worried Me About These Bots.

Earlier today, one of our servers alerted with high CPU load.

I could see the web server was handling a LOT of requests.

And when I started looking at the web server logs, I saw a lot of this:

[31/Jan/2025:09:59:49 +0000] “GET /robots.txt HTTP/2” 200 0 “-” “Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; ChatGPT-User/1.0; +https://openai.com/bot”

This is ChatGPT crawling sites on one of our servers. This was the first time I’d seen this. In the past it had been the CCBot mining data that was ultimately used to feed ChatGPT. Now ChatGPT is feeding itself.

The log line above is, apparently, ChatGPT visiting a web page when a user asks a question in ChatGPT. The page linked in that user agent string also says:

“ChatGPT-User governs which sites these user requests can be made to. It is not used for crawling the web in any automatic fashion, nor to crawl content for generative AI training.”

What I saw today was a server with a CPU load issue, due to ChatGPT making 145136 requests to a couple of hundred websites in the space of 39 minutes (around 60 requests per second). The websites held on this server are varied in nature, and we’ve got another 19 servers like this, yet this was the only one subject to this type of activity.

This wasn’t an occasional “obtain a data set then use that to train”. This wasn’t the result of a person asking ChatGPT a question. This was almost definitely a ChatGPT crawler being pointed at a server, with the intention of scraping everything it could.
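
If you want to check for this sort of thing on your own servers, the gist of what I did is below: pull out the requests with that user agent and work out the rate over the time window they cover. The same log format assumptions as earlier apply.

    # Rough sketch: count requests from a given user agent in an access log and
    # work out the average request rate over the window they span. Assumes the
    # timestamp appears as [day/Mon/year:HH:MM:SS +0000], as in the line above.
    import re
    from datetime import datetime

    UA_MARKER = "ChatGPT-User"
    TS_RE = re.compile(r"\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2}) ")

    def burst_stats(log_path, marker=UA_MARKER):
        count, first, last = 0, None, None
        with open(log_path, errors="replace") as log:
            for line in log:
                if marker not in line:
                    continue
                match = TS_RE.search(line)
                if not match:
                    continue
                stamp = datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S")
                count += 1
                first = stamp if first is None else min(first, stamp)
                last = stamp if last is None else max(last, stamp)
        window = (last - first).total_seconds() if count > 1 else 0
        rate = count / window if window else float(count)
        return count, window / 60, rate

    if __name__ == "__main__":
        requests, minutes, per_second = burst_stats("access.log")
        # e.g. 145136 requests over ~39 minutes comes out at roughly 62 per second
        print(f"{requests} requests over {minutes:.0f} minutes (~{per_second:.0f}/s)")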

Here’s the problem. AI needs data to use for training. The more data you have, the more training, and the better the bot. Whilst this is a very rough outline of a more complicated situation, training data is a priority for AI.

If ChatGPT accumulates more training data than other AI models or organisations, this will give them a competitive advantage. I wonder what other organisations developing AI would do about this competitive advantage. I also wonder what the effect of that would be.

What I saw today looks a bit like the tip of an iceberg that’s rapidly melting. We’re now likely to be looking at an ongoing AI bot overhead, in addition to all the other bot and human traffic that’s going on all the time, and in addition to all the energy consumption I’ve mentioned above. And that’s the best case scenario.

The worst case is some sort of AI training data arms race.

More AI overhead. More electricity used by our data centre. More data being stored. More AI training.

Where does all this lead? Is it going to be worth it? All this traffic, data storage, computing power and energy consumption? All this against a backdrop of global warming, wars and a breakdown in social cohesion. Should reading the internet be where this effort is focussed?

It might be worth it in the end. Then again, instead of the problems with nuclear fusion being overcome to provide near limitless free, green energy, we might just end up with lots and lots of Shrimp Jesus.

We are, after all, talking about the internet aren’t we?
