atmire / counter-robots
Official list of user agents that are regarded as robots/spiders by COUNTER
License: MIT License
The following expression creates an error in the .NET programming language:
lycos[\_\+]
The character "_" (char code 95) does not need escaping because it has no special meaning in regular expressions; escaping it causes a runtime error in .NET.
The character "+" (char code 43) is a special character in regular expressions, but inside a character class it is not required to escape characters other than "\", "-" and "]".
lycos[\_\+]
should be changed to lycos[_+]
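A quick check of the corrected pattern (Python sketch; Python's regex engine happens to tolerate the redundant escapes, unlike .NET, so this only demonstrates what the simplified character class matches):

```python
import re

# The simplified character class: "lycos" followed by "_" or "+".
FIXED = re.compile(r"lycos[_+]")

assert FIXED.search("lycos_crawler")        # "_" inside a class needs no escape
assert FIXED.search("lycos+spider")         # "+" loses its quantifier meaning inside a class
assert FIXED.search("Mozilla/5.0") is None  # ordinary browser UA is untouched
```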
User Agent missing from the current list: "DTS Agent"
We saw this user agent at one of our clients. It appears to be an e-mail harvesting robot. I could not find an official page, but a web search turned up similar reports, e.g. http://www.internetofficer.com/web-robot/dts/
Hi!
Unfortunately a few builds of the Moto G 50 have a user-agent like:
Mozilla/5.0 (Linux; Android 12; moto g (50)5G Build/S1RSS32.38-20-9-1; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/106.0.5249.79 Mobile Safari/537.36
This is matched by the rule "rss" in COUNTER_Robots_list.json#L1024.
While I get that this rule targets RSS readers, would it maybe be possible to use a more specific regex to avoid false positives?
Maybe rss[^a-z0-9]
would work?
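The proposed tightening can be sanity-checked quickly (Python sketch; the feed-reader UA below is just an illustrative sample, not taken from the list):

```python
import re

# Proposed tightening: "rss" must be followed by a non-alphanumeric character.
PROPOSED = re.compile(r"rss[^a-z0-9]", re.IGNORECASE)

moto = ("Mozilla/5.0 (Linux; Android 12; moto g (50)5G Build/S1RSS32.38-20-9-1; wv) "
        "AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 "
        "Chrome/106.0.5249.79 Mobile Safari/537.36")
reader = "Tiny Tiny RSS/21.06 (https://tt-rss.org/)"  # hypothetical feed-reader UA

assert PROPOSED.search(moto) is None  # build id "S1RSS32" no longer triggers the rule
assert PROPOSED.search(reader)        # "RSS/" is still caught
```

Note that this variant would miss a UA ending in "rss" with nothing after it; something like rss([^a-z0-9]|$) would cover that case too.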
I think these patterns will exclude anyone using the Firefox or Opera browsers from being counted:
^Mozilla$
^Mozilla.4.0$
^Mozilla/4.0+(compatible;)$
^Mozilla/4.0+(compatible;+ICS)$
^Mozilla/4.5+[en]+(Win98;+I)$
^Mozilla.5.0$
^Mozilla/5.0+(compatible;+MSIE+6.0;+Windows+NT+5.0)$
^Mozilla/5.0+like+Gecko$
^Mozilla/5.0(\s|+)Gecko/20100115(\s|+)Firefox/3.6$
^Opera/4$
User Agent missing from the current list: "ArchiveTeam"
The "ArchiveTeam" user agent belongs to https://www.archiveteam.org/index.php?title=Main_Page
This is a robot that is archiving the web.
Could you please consider updating the rule that excludes anything matching 'google'? It also excludes usage via the Google Web Light service:
https://developers.google.com/search/docs/advanced/mobile/web-light
Mobile use with UAs like this is excluded from COUNTER reports; I believe this may be genuine human usage, and therefore incorrectly excluded:
Mozilla/5.0 (Linux; Android 4.2.1; en-us; Nexus 5 Build/JOP40D) AppleWebKit/535.19 (KHTML, like Gecko; googleweblight) Chrome/38.0.1025.166 Mobile Safari/535.19
API+scraper
G-i-g-a-b-o-t
Grammarly/1.0 (http://www.grammarly.com) or Grammarly/1.0+(http://www.grammarly.com)
Riddler+(http://riddler.io/about)
EasyBib+AutoCite+(http://autocite-info.citation-api.com/)
I have thousands of hits in my server logs from clients that have the literal string "User-Agent" in their HTTP User-Agent header (sometimes even twice!). For example:
User-Agent: Drupal (+http://drupal.org/)
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.64 Safari/537.31
User-Agent:Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0) IKU/7.0.5.9226;IKUCID/IKU;
User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)
User-Agent:User-Agent:Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; GTB7.5; .NET4.0C)IKU/6.7.6.12189;IKUCID/IKU;IKU/6.7.6.12189;IKUCID/IKU;
User-Agent:Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0) IKU/7.0.5.9226;IKUCID/IKU;
I don't know about other protocols, but in HTTP there's no reputable client that does this. As far as I can tell these are bots pretending to be real user agents and doing it poorly. In my case, the IPs associated with these are all on Amazon AWS and the requests are clearly not from a human user.
Should we add User-Agent: to the list of robots? Let me know what you think.
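A minimal sketch of what such a rule would catch, assuming the list's case-insensitive matching semantics (the sample UAs are abbreviated from the log excerpts above):

```python
import re

# Treat a literal "User-Agent:" prefix inside the UA value itself as a bot signal.
BROKEN_PREFIX = re.compile(r"^user-agent:", re.IGNORECASE)

assert BROKEN_PREFIX.search("User-Agent: Drupal (+http://drupal.org/)")
assert BROKEN_PREFIX.search("User-Agent:User-Agent:Mozilla/4.0 (compatible; MSIE 8.0)")
assert BROKEN_PREFIX.search("Mozilla/5.0 (Windows NT 6.1; WOW64)") is None
```

Anchoring with ^ keeps the rule from firing on a UA that merely mentions the string somewhere in the middle.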
QQBrowser is a browser used in China. An example user agent string is "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3641.400 QQBrowser/10.4.3284.400"
Do you have a list of robots targeted by the "core" field?
User Agent missing from the current list: "okhttp" (regex "^okhttp$")
okhttp is a Java HTTP client library (see square/okhttp#1063). This is not a real browser and could be considered a bot.
User Agent missing from the current list: "HttpClient"
HttpClient is a Java HTTP client library. This is not a real browser and could be considered a bot.
I have several thousand hits from this user agent in my logs:
ReactorNetty/0.9.3.RELEASE
It seems to be the Reactor Netty client/server framework for the JVM.
User-agent matching is a case-insensitive substring match.
So, as per https://support.google.com/adsense/answer/99376?hl=en&visit_id=1-636565585661074629-419118318&rd=1,
Mediapartners-Google should be added instead of mediapartners-google
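Since matching is case-insensitive, both capitalizations behave identically at match time; the change is about the list documenting the user agent as Google spells it. A quick check (Python sketch):

```python
import re

ua = "Mediapartners-Google"

# With case-insensitive matching, either spelling of the pattern matches;
# the question is only which capitalization the list should record.
assert re.search("mediapartners-google", ua, re.IGNORECASE)
assert re.search("Mediapartners-Google", ua, re.IGNORECASE)
```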
Atmire - please add the following new exclusions:
Regex: A6-Indexer
Comments: A6 Corp's bot
Regex: The+Knowledge+AI
Comments: https://www.webmasterworld.com/search_engine_spiders/4896765.htm
Regex: Genieo
Comments: Listed as a bad bot by Distil Networks, see: https://www.distilnetworks.com/bot-directory/bot/genieo/
Hi!
A Motorola Edge 30 for example has the user agent:
Mozilla/5.0 (Linux; Android 13; motorola edge 30 Build/T1RD33.116-33-3; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/113.0.5672.132 Mobile Safari/537.36
Because "motorola" inevitably contains the term "motor", the device is flagged as a robot.
I did a bit of digging to figure out where this term might come from and found this entry on robotstxt.org.
According to this, it comes from a 1996 web crawler used "to build the database for the www.webindex.de".
Today www.webindex.de redirects to www.cybercon.de, its owner, and a soft redirect appears to have been in place since the end of 2004 (see http://web.archive.org/web/20041129080254/www.webindex.de).
So I think it's safe to say that this project has been gone for almost 20 years now and the string can be removed from the list.
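The false positive described above is easy to reproduce, assuming the list's case-insensitive substring semantics (Python sketch):

```python
import re

ua = ("Mozilla/5.0 (Linux; Android 13; motorola edge 30 Build/T1RD33.116-33-3; wv) "
      "AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 "
      "Chrome/113.0.5672.132 Mobile Safari/537.36")

# The bare substring pattern "motor" fires on every Motorola device UA.
assert re.search("motor", ua, re.IGNORECASE)
```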
Most dates are in the format yyyy-MM-dd, but some entries use yyyy-dd-MM.
Shouldn't all dates be in the same format?
I just noticed the user agent Java/21 in my server access log. We currently have the following pattern in COUNTER-Robots:
^java\/\d{1,2}.\d
This pattern matches java/1.8
but not Java/21
(see https://regex101.com/r/pweujD/1). Also, I'm just realizing that the dot should be escaped so it is interpreted as a literal dot, not a regex metacharacter.
I suggest the pattern be updated to be ^java\/\d+
or perhaps even just ^java
. Both are enough to uniquely identify the user agent.
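The difference between the current and proposed patterns can be verified directly (Python sketch; Python's regex syntax matches the list's here):

```python
import re

OLD = re.compile(r"^java\/\d{1,2}.\d", re.IGNORECASE)  # current pattern
NEW = re.compile(r"^java\/\d+", re.IGNORECASE)         # proposed pattern

assert OLD.search("java/1.8")         # the old pattern catches dotted versions...
assert OLD.search("Java/21") is None  # ...but misses the single-number form
assert NEW.search("java/1.8")         # the proposed pattern catches both
assert NEW.search("Java/21")
```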
User Agent missing from the current list: "oaDOI" (regex "^oaDOI$")
We saw this user agent at one of our clients. I think it belongs to https://oadoi.org/, a service that indexes open access articles, and should be considered a robot.
User agent string: Docoloc
The Docoloc bot is a web scraper used by the Docoloc plagiarism detection service. It searches papers for text fragments that also occur in other documents, and it respects robots.txt.
(Ref: https://www.distilnetworks.com/bot-directory/bot/docoloc/)
These two patterns do almost the same thing. The first is a bit more restrictive, and the second completely covers the first, so for simplicity I'd suggest removing the first and keeping the second.
{
"pattern": "^voyager\\/",
"last_changed": "2017-08-08"
},
{
"pattern": "voyager\\/",
"last_changed": "2017-08-08"
},
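The redundancy is straightforward to confirm: any string the anchored pattern matches is also matched by the unanchored one, while the reverse does not hold (Python sketch; the sample UAs are illustrative):

```python
import re

ANCHORED = re.compile(r"^voyager\/", re.IGNORECASE)
UNANCHORED = re.compile(r"voyager\/", re.IGNORECASE)

ua = "voyager/1.0"
prefixed = "SomeBot voyager/2.0"  # hypothetical UA with a leading product token

# Everything the anchored pattern matches, the unanchored one matches too...
assert ANCHORED.search(ua) and UNANCHORED.search(ua)
# ...and the unanchored pattern additionally covers non-leading occurrences.
assert ANCHORED.search(prefixed) is None and UNANCHORED.search(prefixed)
```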
Currently, there is a "generated" folder with a text list, processed as documented in the readme.
An important use of this repository will be as a git submodule within other repositories. When another project references this one, can it be assumed that the "generated" folder will be populated with a fresh version of the text-only patterns? Or is the project which references this one responsible for regenerating this from the JSON?
For example, when a new robot is identified and incorporated by a pull request, will generated/COUNTER_Robots_list.txt
reflect the change immediately, or will there be a formal release where the change is reflected, or is the text derivative not to be considered authoritative?
User Agent missing from the current list: "LinkTiger"
LinkTiger is a website link checking product and could thus be considered as a robot. More information can be found here https://linktiger.com/product/.
I have 10,000 hits from this user agent in my logs:
Jersey/2.6 (HttpUrlConnection 1.8.0_152)
As far as I know this is the Jersey REST framework, which is not a human user.