
counter-robots's People

Contributors

ajnyga, alanorth, atmire-github, bram-atmire, ctgraham, davidatmire, heydevhey, jmvezic, jnugent, jonas-atmire, mrabro, philipvis, tomdesair, trianglepb, yanadepauw


counter-robots's Issues

Escaped characters that don't need to be escaped

The following expression creates an error in the .NET programming language:
lycos[\_\+]

The character "_" (char code 95) has no special meaning in regular expressions, so it does not need escaping. Escaping it raises an error in .NET at runtime.
The character "+" (char code 43) is a special character in regular expressions, but within a character set no characters other than "\", "-", "]" and a leading "^" require escaping.

lycos[\_\+] should be changed to lycos[_+]
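As a quick sanity check outside .NET, the sketch below uses Python's `re` module to compare the two spellings. Python is an assumption made for illustration only: its parser tolerates the `\_` escape that .NET rejects, which is what allows both forms to be compiled side by side. The sample strings are made up.

```python
import re

# Python's re module accepts "\_" and "\+" inside a character class,
# so both spellings can be compiled and compared here.
escaped = re.compile(r"lycos[\_\+]")
simplified = re.compile(r"lycos[_+]")

# Hypothetical sample strings, not taken from real logs.
samples = ["lycos_spider", "lycos+crawler", "lycos-bot", "lycos"]

# Map each sample to (matched by escaped form, matched by simplified form).
results = {
    s: (bool(escaped.search(s)), bool(simplified.search(s))) for s in samples
}

for sample, (old, new) in results.items():
    print(f"{sample!r}: escaped={old} simplified={new}")
```

If the two columns agree for every sample, the simplified pattern can replace the escaped one without changing which user agents are matched.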

DTS Agent

User Agent missing from the current list: "DTS Agent"

We saw this user agent at one of our clients. It seems to be an e-mail harvesting robot. I could not find an official page, but a web search turned up similar reports as well as this page: http://www.internetofficer.com/web-robot/dts/

Motorola Moto G 50 detected as robot due to "rss" pattern

Hi!
Unfortunately a few builds of the Moto G 50 have a user-agent like:
Mozilla/5.0 (Linux; Android 12; moto g (50)5G Build/S1RSS32.38-20-9-1; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/106.0.5249.79 Mobile Safari/537.36

(more examples)

This is matched by the rule "rss" in COUNTER_Robots_list.json#L1024.
While I understand that this rule targets RSS readers, would it be possible to use a more specific regex to avoid false positives?
Maybe rss[^a-z0-9] would work?
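A minimal sketch of how the suggested pattern could be checked, assuming patterns are applied as case-insensitive substring searches. The feed reader user agent is hypothetical and only serves as a positive example.

```python
import re

# User agent quoted in this issue.
MOTO_G50 = (
    "Mozilla/5.0 (Linux; Android 12; moto g (50)5G Build/S1RSS32.38-20-9-1; wv) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 "
    "Chrome/106.0.5249.79 Mobile Safari/537.36"
)
# Hypothetical feed reader user agent, made up for this illustration.
FEED_READER = "ExampleRSS/1.0"

CURRENT_RULE = r"rss"
SUGGESTED_RULE = r"rss[^a-z0-9]"


def is_robot(pattern: str, user_agent: str) -> bool:
    """Return True if the pattern occurs anywhere in the user agent."""
    return re.search(pattern, user_agent, re.IGNORECASE) is not None


for rule in (CURRENT_RULE, SUGGESTED_RULE):
    print(rule, is_robot(rule, MOTO_G50), is_robot(rule, FEED_READER))
```

One caveat with this suggestion: because `[^a-z0-9]` must consume a character, a user agent that ends in "rss" would no longer match.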

Firefox and Opera browsers excluded?

I think these patterns will exclude anyone using the Firefox or Opera browsers from being counted:
^Mozilla$
^Mozilla.4.0$
^Mozilla/4.0+(compatible;)$
^Mozilla/4.0+(compatible;+ICS)$
^Mozilla/4.5+[en]+(Win98;+I)$
^Mozilla.5.0$
^Mozilla/5.0+(compatible;+MSIE+6.0;+Windows+NT+5.0)$
^Mozilla/5.0+like+Gecko$
^Mozilla/5.0(\s|+)Gecko/20100115(\s|+)Firefox/3.6$
^Opera/4$
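One way to check this concern is to run a few of the listed patterns against a full browser user agent. The sketch below assumes case-insensitive matching and uses three patterns exactly as displayed above. The two browser strings are illustrative examples, not taken from real logs.

```python
import re

# Three of the patterns listed above, copied as displayed.
PATTERNS = [r"^Mozilla$", r"^Mozilla.5.0$", r"^Opera/4$"]

# Illustrative full browser user agents (assumed for this sketch).
FIREFOX = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) "
    "Gecko/20100101 Firefox/115.0"
)
OPERA = "Opera/9.80 (Windows NT 6.1) Presto/2.12.388 Version/12.16"


def matched_by(user_agent: str) -> list[str]:
    """Return the patterns that match the given user agent."""
    return [p for p in PATTERNS if re.search(p, user_agent, re.IGNORECASE)]


print(matched_by(FIREFOX))
print(matched_by(OPERA))
print(matched_by("Mozilla/5.0"))
```

Because each pattern is anchored with both `^` and `$`, it can only match when the whole user agent is that bare string, which suggests full browser user agents are not affected by these entries.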

Google Web Light exclusion

Please could you consider updating the rule that excludes anything matching 'google', as it excludes usage via the Google Web Light service:
https://developers.google.com/search/docs/advanced/mobile/web-light
Mobile use with UAs like this is excluded from COUNTER reports; I believe this may be genuine human usage, and therefore incorrectly excluded:
Mozilla/5.0 (Linux; Android 4.2.1; en-us; Nexus 5 Build/JOP40D) AppleWebKit/535.19 (KHTML, like Gecko; googleweblight) Chrome/38.0.1025.166 Mobile Safari/535.19
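The sketch below shows the current behaviour and one possible refinement, assuming case-insensitive substring matching. The negative lookahead `google(?!weblight)` is a hypothetical candidate for discussion, not a pattern from the list.

```python
import re

# User agent quoted in this issue.
WEB_LIGHT = (
    "Mozilla/5.0 (Linux; Android 4.2.1; en-us; Nexus 5 Build/JOP40D) "
    "AppleWebKit/535.19 (KHTML, like Gecko; googleweblight) "
    "Chrome/38.0.1025.166 Mobile Safari/535.19"
)
# Googlebot's published desktop crawler user agent, used as a control.
GOOGLEBOT = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

CURRENT_RULE = r"google"
CANDIDATE_RULE = r"google(?!weblight)"  # hypothetical refinement


def is_robot(pattern: str, user_agent: str) -> bool:
    """Return True if the pattern occurs anywhere in the user agent."""
    return re.search(pattern, user_agent, re.IGNORECASE) is not None


for rule in (CURRENT_RULE, CANDIDATE_RULE):
    print(rule, is_robot(rule, WEB_LIGHT), is_robot(rule, GOOGLEBOT))
```

The lookahead only skips "google" when it is immediately followed by "weblight", so crawler user agents such as Googlebot would still be excluded.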

User agents with "User-Agent" in their string

I have thousands of hits in my server logs from clients that have the literal string "User-Agent" in their HTTP User-Agent protocol header (sometimes even twice!). For example:

  • User-Agent: Drupal (+http://drupal.org/)
  • User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.64 Safari/537.31
  • User-Agent:Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0) IKU/7.0.5.9226;IKUCID/IKU;
  • User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)
  • User-Agent:User-Agent:Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; GTB7.5; .NET4.0C)IKU/6.7.6.12189;IKUCID/IKU;IKU/6.7.6.12189;IKUCID/IKU;
  • User-Agent:Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0) IKU/7.0.5.9226;IKUCID/IKU;

I don't know about other protocols, but in HTTP there's no reputable client that does this. As far as I can tell these are bots pretending to be real user agents and doing it poorly. In my case, the IPs associated with these are all on Amazon AWS and the requests are clearly not from a human user.

Should we add User-Agent: to the list of robots? Let me know what you think.
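For illustration, here is a minimal sketch of how the proposed `User-Agent:` entry would behave against some of the examples above, assuming patterns are applied as case-insensitive substring searches. The control string is the second example with its prefix removed.

```python
import re

PROPOSED_RULE = r"User-Agent:"

# Examples quoted in this issue.
SUSPICIOUS = [
    "User-Agent: Drupal (+http://drupal.org/)",
    "User-Agent:Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; "
    "Trident/6.0) IKU/7.0.5.9226;IKUCID/IKU;",
    "User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)",
]
# Control: a browser user agent without the stray header name.
NORMAL = (
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.31 "
    "(KHTML, like Gecko) Chrome/26.0.1410.64 Safari/537.31"
)


def is_robot(user_agent: str) -> bool:
    """Return True if the header name leaked into the header value."""
    return re.search(PROPOSED_RULE, user_agent, re.IGNORECASE) is not None


flagged = [ua for ua in SUSPICIOUS if is_robot(ua)]
print(len(flagged), "of", len(SUSPICIOUS), "suspicious agents flagged")
print("control flagged:", is_robot(NORMAL))
```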

"core" filter removes QQBrowser usage

QQBrowser is a browser used in China. An example user agent string is "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3641.400 QQBrowser/10.4.3284.400"

Do you have a list of robots targeted by the "core" pattern?

okhttp

User Agent missing from the current list: "okhttp" (regex "^okhttp$")

okhttp is a Java HTTP client library (see square/okhttp#1063). This is not a real browser and could be considered as a bot.

HttpClient

User Agent missing from the current list: "HttpClient"

HttpClient is a Java HTTP client library. This is not a real browser and could be considered as a bot.

Motorola smartphones detected as robot due to "motor" pattern

Hi!
A Motorola Edge 30, for example, has the user agent:

Mozilla/5.0 (Linux; Android 13; motorola edge 30 Build/T1RD33.116-33-3; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/113.0.5672.132 Mobile Safari/537.36

Because that string inevitably contains the term "motor", it is flagged as a robot.

I did a bit of digging to figure out where this term might come from and found an entry on robotstxt.org.
According to that entry, it comes from a web crawler from 1996 used "to build the database for the www.webindex.de".
Today www.webindex.de redirects to www.cybercon.de, the site of its owners, and it seems that a soft redirect was already in place at the end of 2004 (according to http://web.archive.org/web/20041129080254/www.webindex.de).

So I think it's safe to say that this project has not been around for almost 20 years and the string can be removed from the list.

Java pattern is too specific and has a syntax error

I just noticed the user agent Java/21 in my server access log. We currently have the following pattern in COUNTER-Robots:

^java\/\d{1,2}.\d

This pattern matches java/1.8 but not Java/21 (see https://regex101.com/r/pweujD/1). Also, I'm just realizing that the dot should be escaped so it is interpreted as a literal dot, not a regex metacharacter.

I suggest the pattern be updated to be ^java\/\d+ or perhaps even just ^java. Both are enough to uniquely identify the user agent.
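The behaviour described above can be reproduced with a short sketch, assuming case-insensitive matching. The string "java/1x8" is contrived to show what the unescaped dot allows.

```python
import re

CURRENT_RULE = r"^java\/\d{1,2}.\d"
PROPOSED_RULE = r"^java\/\d+"

# The first two agents come from this issue; the third is contrived to
# show that the unescaped dot accepts any character, not just ".".
AGENTS = ["java/1.8", "Java/21", "java/1x8"]


def matches(pattern: str, user_agent: str) -> bool:
    """Return True if the pattern matches the user agent."""
    return re.search(pattern, user_agent, re.IGNORECASE) is not None


for agent in AGENTS:
    print(agent, matches(CURRENT_RULE, agent), matches(PROPOSED_RULE, agent))
```

The current pattern needs a digit after the separator character, which "Java/21" does not have, while `^java\/\d+` only requires at least one digit after the slash.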

oaDOI

User Agent missing from the current list: "oaDOI" (regex "^oaDOI$")

We saw this user agent at one of our clients. I think this user agent belongs to https://oadoi.org/, a service that indexes open access articles. It should be considered as a robot.

Redundant pattern "voyager"

These two patterns do almost the same thing. The first one is a bit more restrictive, but the second one completely covers the first. So, for the sake of simplicity, I'd suggest removing the first and keeping the second.

  {
    "pattern": "^voyager\\/",
    "last_changed": "2017-08-08"
  },

  {
    "pattern": "voyager\\/",
    "last_changed": "2017-08-08"
  },
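A small sketch of the redundancy, assuming case-insensitive matching. The sample user agents are hypothetical and were made up for this illustration.

```python
import re

ANCHORED = r"^voyager\/"
UNANCHORED = r"voyager\/"

# Hypothetical user agent strings, not taken from real logs.
SAMPLES = [
    "Voyager/1.0",
    "Mozilla/4.0 (compatible; Voyager/2.0)",
    "Mozilla/5.0 (X11; Linux x86_64)",
]


def hit(pattern: str, user_agent: str) -> bool:
    """Return True if the pattern occurs in the user agent."""
    return re.search(pattern, user_agent, re.IGNORECASE) is not None


# Every agent matched by the anchored pattern is also matched by the
# unanchored one, so dropping the anchored entry loses nothing.
covered = all(hit(UNANCHORED, ua) for ua in SAMPLES if hit(ANCHORED, ua))
print("anchored pattern fully covered:", covered)

for ua in SAMPLES:
    print(ua, hit(ANCHORED, ua), hit(UNANCHORED, ua))
```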

Document expectations for "generated" folder

Currently, there is a "generated" folder with a text list, processed as documented in the readme.

An important use of this repository will be as a git submodule within other repositories. When another project references this one, can it be assumed that the "generated" folder will be populated with a fresh version of the text-only patterns? Or is the project which references this one responsible for regenerating this from the JSON?

For example, when a new robot is identified and incorporated by a pull request, will generated/COUNTER_Robots_list.txt reflect the change immediately, or will there be a formal release where the change is reflected, or is the text derivative not to be considered authoritative?
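If the referencing project ends up being responsible for regeneration, the step could look roughly like the sketch below. It assumes the JSON file is an array of objects with a "pattern" key, as in the entries quoted in the "voyager" issue above. The function names and the one-pattern-per-line output format are hypothetical and are not taken from the repository's readme.

```python
import json

# Inline sample in the assumed shape of COUNTER_Robots_list.json.
SAMPLE_JSON = r"""
[
  {"pattern": "^voyager\\/", "last_changed": "2017-08-08"},
  {"pattern": "voyager\\/", "last_changed": "2017-08-08"}
]
"""


def patterns_from_json(json_text: str) -> list[str]:
    """Extract the regex patterns from the JSON list, in file order."""
    return [entry["pattern"] for entry in json.loads(json_text)]


def to_text_list(patterns: list[str]) -> str:
    """Render the patterns one per line (assumed text-only format)."""
    return "\n".join(patterns) + "\n"


patterns = patterns_from_json(SAMPLE_JSON)
text_list = to_text_list(patterns)
print(text_list, end="")
```

A consuming project could run a step like this at build time and so avoid depending on whether the committed text file is current.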

LinkTiger

User Agent missing from the current list: "LinkTiger"

LinkTiger is a website link checking product and could thus be considered as a robot. More information can be found here https://linktiger.com/product/.
