atmire / counter-robots
Official list of user agents that are regarded as robots/spiders by COUNTER
License: MIT License
The following expression creates an error in the .NET programming language:
lycos[\_\+]
The character "_" (char code 95) does not need escaping because it has no special meaning in regular expressions; escaping it causes a runtime error in .NET.
The character "+" (char code 43) is a special character in regular expressions, but inside a character class it is not required to escape characters other than "\", "-" and "]".
lycos[\_\+]
should be changed to lycos[_+]
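A quick check of the corrected pattern (Python sketch; Python's regex engine happens to tolerate the redundant escapes, unlike .NET, so this only demonstrates what the simplified character class matches):

```python
import re

# The simplified character class: "lycos" followed by "_" or "+".
FIXED = re.compile(r"lycos[_+]")

assert FIXED.search("lycos_crawler")        # "_" inside a class needs no escape
assert FIXED.search("lycos+spider")         # "+" loses its quantifier meaning inside a class
assert FIXED.search("Mozilla/5.0") is None  # ordinary browser UA is untouched
```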
User Agent missing from the current list: "DTS Agent"
We saw this user agent at one of our clients. It appears to be an e-mail harvesting robot. I could not find an official page, but a web search turned up similar reports, e.g. http://www.internetofficer.com/web-robot/dts/
Hi!
Unfortunately a few builds of the Moto G 50 have a user-agent like:
Mozilla/5.0 (Linux; Android 12; moto g (50)5G Build/S1RSS32.38-20-9-1; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/106.0.5249.79 Mobile Safari/537.36
This is matched by the rule "rss" in COUNTER_Robots_list.json#L1024.
While I get that this rule targets RSS readers, would it maybe be possible to use a more specific regex to avoid false positives?
Maybe rss[^a-z0-9]
would work?
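The proposed tightening can be sanity-checked quickly (Python sketch; the feed-reader UA below is just an illustrative sample, not taken from the list):

```python
import re

# Proposed tightening: "rss" must be followed by a non-alphanumeric character.
PROPOSED = re.compile(r"rss[^a-z0-9]", re.IGNORECASE)

moto = ("Mozilla/5.0 (Linux; Android 12; moto g (50)5G Build/S1RSS32.38-20-9-1; wv) "
        "AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 "
        "Chrome/106.0.5249.79 Mobile Safari/537.36")
reader = "Tiny Tiny RSS/21.06 (https://tt-rss.org/)"  # hypothetical feed-reader UA

assert PROPOSED.search(moto) is None  # build id "S1RSS32" no longer triggers the rule
assert PROPOSED.search(reader)        # "RSS/" is still caught
```

Note that this variant would miss a UA ending in "rss" with nothing after it; something like rss([^a-z0-9]|$) would cover that case too.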
I think these patterns will exclude anyone using the Firefox or Opera browsers from being counted:
^Mozilla$
^Mozilla.4.0$
^Mozilla/4.0+(compatible;)$
^Mozilla/4.0+(compatible;+ICS)$
^Mozilla/4.5+[en]+(Win98;+I)$
^Mozilla.5.0$
^Mozilla/5.0+(compatible;+MSIE+6.0;+Windows+NT+5.0)$
^Mozilla/5.0+like+Gecko$
^Mozilla/5.0(\s|+)Gecko/20100115(\s|+)Firefox/3.6$
^Opera/4$
User Agent missing from the current list: "ArchiveTeam"
The "ArchiveTeam" user agent belongs to https://www.archiveteam.org/index.php?title=Main_Page
This is a robot that is archiving the web.
Could you please consider updating the rule that excludes anything matching 'google'? It also excludes usage via the Google Web Light service:
https://developers.google.com/search/docs/advanced/mobile/web-light
Mobile use with UAs like this is excluded from COUNTER reports; I believe this may be genuine human usage, and therefore incorrectly excluded:
Mozilla/5.0 (Linux; Android 4.2.1; en-us; Nexus 5 Build/JOP40D) AppleWebKit/535.19 (KHTML, like Gecko; googleweblight) Chrome/38.0.1025.166 Mobile Safari/535.19
API+scraper
G-i-g-a-b-o-t
Grammarly/1.0 (http://www.grammarly.com) or Grammarly/1.0+(http://www.grammarly.com)
Riddler+(http://riddler.io/about)
EasyBib+AutoCite+(http://autocite-info.citation-api.com/)
I have thousands of hits in my server logs from clients that have the literal string "User-Agent" in their HTTP User-Agent header (sometimes even twice!). For example:
User-Agent: Drupal (+http://drupal.org/)
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.64 Safari/537.31
User-Agent:Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0) IKU/7.0.5.9226;IKUCID/IKU;
User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)
User-Agent:User-Agent:Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; GTB7.5; .NET4.0C)IKU/6.7.6.12189;IKUCID/IKU;IKU/6.7.6.12189;IKUCID/IKU;
User-Agent:Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0) IKU/7.0.5.9226;IKUCID/IKU;
I don't know about other protocols, but in HTTP there's no reputable client that does this. As far as I can tell these are bots pretending to be real user agents and doing it poorly. In my case, the IPs associated with these are all on Amazon AWS and the requests are clearly not from a human user.
Should we add User-Agent: to the list of robots? Let me know what you think.
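A minimal sketch of what such a rule would catch, assuming the list's case-insensitive matching semantics (the sample UAs are abbreviated from the log excerpts above):

```python
import re

# Treat a literal "User-Agent:" prefix inside the UA value itself as a bot signal.
BROKEN_PREFIX = re.compile(r"^user-agent:", re.IGNORECASE)

assert BROKEN_PREFIX.search("User-Agent: Drupal (+http://drupal.org/)")
assert BROKEN_PREFIX.search("User-Agent:User-Agent:Mozilla/4.0 (compatible; MSIE 8.0)")
assert BROKEN_PREFIX.search("Mozilla/5.0 (Windows NT 6.1; WOW64)") is None
```

Anchoring with ^ keeps the rule from firing on a UA that merely mentions the string somewhere in the middle.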
QQBrowser is a browser used in China. An example user agent string is "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3641.400 QQBrowser/10.4.3284.400"
Do you have a list of robots targeted by the "core" field?
User Agent missing from the current list: "okhttp" (regex "^okhttp$")
okhttp is a Java HTTP client library (see square/okhttp#1063). This is not a real browser and could be considered a bot.
User Agent missing from the current list: "HttpClient"
HttpClient is a Java HTTP client library. This is not a real browser and could be considered a bot.
I have several thousand hits from this user agent in my logs:
ReactorNetty/0.9.3.RELEASE
It seems to be the Reactor Netty client/server framework for the JVM.
User-agent matching is a case-insensitive substring match.
So, as per https://support.google.com/adsense/answer/99376?hl=en&visit_id=1-636565585661074629-419118318&rd=1,
Mediapartners-Google should be added instead of mediapartners-google
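Since matching is case-insensitive, both capitalizations behave identically at match time; the change is about the list documenting the user agent as Google spells it. A quick check (Python sketch):

```python
import re

ua = "Mediapartners-Google"

# With case-insensitive matching, either spelling of the pattern matches;
# the question is only which capitalization the list should record.
assert re.search("mediapartners-google", ua, re.IGNORECASE)
assert re.search("Mediapartners-Google", ua, re.IGNORECASE)
```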
Atmire - please add the following new exclusions:
Regex: A6-Indexer
Comments: A6 Corp's bot
Regex: The+Knowledge+AI
Comments: https://www.webmasterworld.com/search_engine_spiders/4896765.htm
Regex: Genieo
Comments: Listed as a bad bot by Distil Networks, see: https://www.distilnetworks.com/bot-directory/bot/genieo/
Hi!
A Motorola Edge 30 for example has the user agent:
Mozilla/5.0 (Linux; Android 13; motorola edge 30 Build/T1RD33.116-33-3; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/113.0.5672.132 Mobile Safari/537.36
Because "motorola" inevitably contains the term "motor", the device is flagged as a robot.
I did a bit of digging to figure out where this term might come from and found this entry on robotstxt.org.
According to this, it comes from a 1996 web crawler used "to build the database for the www.webindex.de".
Today www.webindex.de redirects to www.cybercon.de, its owner, and a soft redirect appears to have been in place since the end of 2004 (see http://web.archive.org/web/20041129080254/www.webindex.de).
So I think it's safe to say that this project has been gone for almost 20 years now and the string can be removed from the list.
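The false positive described above is easy to reproduce, assuming the list's case-insensitive substring semantics (Python sketch):

```python
import re

ua = ("Mozilla/5.0 (Linux; Android 13; motorola edge 30 Build/T1RD33.116-33-3; wv) "
      "AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 "
      "Chrome/113.0.5672.132 Mobile Safari/537.36")

# The bare substring pattern "motor" fires on every Motorola device UA.
assert re.search("motor", ua, re.IGNORECASE)
```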
Most dates are in the format yyyy-MM-dd, but some entries use yyyy-dd-MM.
Shouldn't all dates be in the same format?
I just noticed the user agent Java/21 in my server access log. We currently have the following pattern in COUNTER-Robots:
^java\/\d{1,2}.\d
This pattern matches java/1.8
but not Java/21
(see https://regex101.com/r/pweujD/1). Also, I'm just realizing that the dot should be escaped so it is interpreted as a literal dot, not a regex metacharacter.
I suggest the pattern be updated to be ^java\/\d+
or perhaps even just ^java
. Both are enough to uniquely identify the user agent.
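The difference between the current and proposed patterns can be verified directly (Python sketch; Python's regex syntax matches the list's here):

```python
import re

OLD = re.compile(r"^java\/\d{1,2}.\d", re.IGNORECASE)  # current pattern
NEW = re.compile(r"^java\/\d+", re.IGNORECASE)         # proposed pattern

assert OLD.search("java/1.8")         # the old pattern catches dotted versions...
assert OLD.search("Java/21") is None  # ...but misses the single-number form
assert NEW.search("java/1.8")         # the proposed pattern catches both
assert NEW.search("Java/21")
```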
User Agent missing from the current list: "oaDOI" (regex "^oaDOI$")
We saw this user agent at one of our clients. I think it belongs to https://oadoi.org/, a service that indexes open access articles, and should be considered a robot.
User agent string: Docoloc
The Docoloc bot is a web scraper used by the Docoloc plagiarism detection service. It searches papers for text fragments that also occur in other documents, and it respects robots.txt.
(Ref: https://www.distilnetworks.com/bot-directory/bot/docoloc/)
These two patterns do almost the same thing. The first is a bit more restrictive, and the second completely covers the first, so for simplicity I'd suggest removing the first and keeping the second.
{
"pattern": "^voyager\\/",
"last_changed": "2017-08-08"
},
{
"pattern": "voyager\\/",
"last_changed": "2017-08-08"
},
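The redundancy is straightforward to confirm: any string the anchored pattern matches is also matched by the unanchored one, while the reverse does not hold (Python sketch; the sample UAs are illustrative):

```python
import re

ANCHORED = re.compile(r"^voyager\/", re.IGNORECASE)
UNANCHORED = re.compile(r"voyager\/", re.IGNORECASE)

ua = "voyager/1.0"
prefixed = "SomeBot voyager/2.0"  # hypothetical UA with a leading product token

# Everything the anchored pattern matches, the unanchored one matches too...
assert ANCHORED.search(ua) and UNANCHORED.search(ua)
# ...and the unanchored pattern additionally covers non-leading occurrences.
assert ANCHORED.search(prefixed) is None and UNANCHORED.search(prefixed)
```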
Currently, there is a "generated" folder with a text list, processed as documented in the readme.
An important use of this repository will be as a git submodule within other repositories. When another project references this one, can it be assumed that the "generated" folder will be populated with a fresh version of the text-only patterns? Or is the project which references this one responsible for regenerating this from the JSON?
For example, when a new robot is identified and incorporated by a pull request, will generated/COUNTER_Robots_list.txt
reflect the change immediately, or will there be a formal release where the change is reflected, or is the text derivative not to be considered authoritative?
User Agent missing from the current list: "LinkTiger"
LinkTiger is a website link checking product and could thus be considered as a robot. More information can be found here https://linktiger.com/product/.
I have 10,000 hits from this user agent in my logs:
Jersey/2.6 (HttpUrlConnection 1.8.0_152)
As far as I know this is the Jersey REST framework, which is not a human user.