This evening I posted an article about an Indiana State Police trooper who uses his position of power to proselytize to motorists he stops. That resulted in Twitter crawling my web server. Which would be fine, but the first four requests, all within a 715 ms interval, were GET /robots.txt.
Every single request came from the same address. Every single response was an HTTP 200 status that included the contents of the robots.txt file. Every single response took less than 1 ms. What the fuck? How hard is it to avoid duplicate requests from a queue (hint: it’s pretty fucking easy)?
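To back up the "pretty easy" claim, here is a minimal sketch of a fetch queue that refuses to enqueue a URL that is already waiting. I have no idea how Twitter's crawler is actually built, so everything here is an assumption: the `DedupQueue` class, its method names, and the example.com URL are all hypothetical, and a real crawler would presumably also cache the fetched robots.txt for a while rather than only deduplicating pending requests.

```python
from collections import deque


class DedupQueue:
    """FIFO queue that ignores a URL that is already waiting to be fetched.

    Hypothetical illustration only; not based on any real crawler's code.
    """

    def __init__(self):
        self._queue = deque()   # URLs in the order they should be fetched
        self._pending = set()   # the same URLs, for O(1) membership checks

    def push(self, url):
        """Enqueue url. Return False (and do nothing) if it is already pending."""
        if url in self._pending:
            return False
        self._pending.add(url)
        self._queue.append(url)
        return True

    def pop(self):
        """Remove and return the oldest pending URL."""
        url = self._queue.popleft()
        self._pending.discard(url)
        return url

    def __len__(self):
        return len(self._queue)


# Four back-to-back requests for the same file collapse into one fetch.
queue = DedupQueue()
accepted = [queue.push("http://example.com/robots.txt") for _ in range(4)]
```

After that last line, `accepted` holds one `True` followed by three `False` values, and the queue contains a single entry. The whole trick is one set lookup per request.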
I went to the Twitter website in the hope of finding an email address or web form where I could provide some constructive feedback regarding their web crawler. If one exists, I couldn’t find it after searching for nearly ten minutes.