
[Proposal] Rarity score from RegEx about pywhat HOT 2 OPEN

nodtem66 commented on May 28, 2024


Comments (2)

bee-san commented on May 28, 2024

This is cool! Is Regexp Score the 'rarity' in this case? Can you build a function that other people can use to easily test this? Maybe put it into /scripts?

If you do Regexp Score mod 10 it'll put it into the 10-point range for us :) And I might suggest rounding the result in another column too (so 0.51 becomes 1, 0.49 becomes 0) just to see what that's like :D
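
The two columns suggested above could be computed as in this minimal sketch. The helper names `to_ten_point` and `to_rounded` and the sample raw scores are hypothetical, chosen only for illustration; they are not part of pywhat.

```python
def to_ten_point(raw_score: float) -> float:
    """Wrap a raw score into the range [0, 10) with modulo."""
    return raw_score % 10


def to_rounded(raw_score: float) -> int:
    """Round to the nearest integer (Python rounds exact .5 ties to even)."""
    return round(raw_score)


# Hypothetical raw scores, used only to illustrate the two columns.
for raw in (0.49, 0.51, 7.3, 12.5):
    print(f"{raw:>5}  mod10={to_ten_point(raw):.2f}  rounded={to_rounded(raw)}")
```

Note that modulo wraps instead of rescaling: a raw score of 12.5 maps to 2.5, which ends up below 7.3. If the ordering of scores matters, scaling or clamping might be a better fit than `% 10`.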

Thanks so much for this, this is absolutely great 🔥 A non-subjective, formal way to define rarity would be absolutely amazing :)


nodtem66 commented on May 28, 2024

It seems promising, but it's just a prototype and needs a lot of work.
For example:

What are the best parameters for RegExScore (repeat_score, in_score, ascii_score, digit_score, etc.)?
Some regexes with high rarity produce a low score, such as:

Google Cloud Platform API Key
Regexp = "(?i)^([0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12})$"
rarity = 0.8
score = 0.07

This means we need a more accurate metric for scoring regexes.

By the way, you can try the script in Google Colab.
This is the RegExScore class source code:

import sre_parse  # deprecated since Python 3.11, still importable with a DeprecationWarning
import string

# calculate score from literal string (e.g. prefix, suffix)
# use weight metric (*_score) parameters
class RegExScore():
  def __init__(self, 
               repeat_score = 0.01, # score for quantifier `{0,}`
               in_score = 0.1,      # score for character set `[*]`
               ascii_score=1.0,     # score for a fixed ascii letter `a-zA-Z`
               digit_score=0.2,     # score for a fixed digit `0-9`
               literal_default_score=0.01, # score for any other literal (whitespace, punctuation)
               debug=False # print the debug message
    ):
    self.repeat_score = repeat_score
    self.in_score = in_score
    self.ascii_score = ascii_score
    self.digit_score = digit_score
    self.literal_default_score = literal_default_score
    self.debug = debug

  def calculate(self, regexp:str):
    return self.token_score(sre_parse.parse(regexp))

  def token_score(self, tokens:tuple):
    score = 0
    for _token in tokens:
      if self.debug:
        print("Loop: ", _token)

      # add the score from subpattern `()`
      if _token[0] == sre_parse.SUBPATTERN:
        _, _, _, child = _token[1]
        if self.debug:
          print(_token[0], len(child))
        score += self.token_score(child)
      # add score from quantifier `{min,max}`
      elif _token[0] == sre_parse.MAX_REPEAT:
        _min, _max, child = _token[1]
        # use min for unbounded quantifiers (`*`, `+`, `{n,}`), otherwise max
        _score = self.repeat_score * (_min if _max == sre_parse.MAXREPEAT else _max)
        if self.debug:
          print('\tscore:', _score)
        score += _score + self.token_score(child)
      # add score from mean of branch group `A|B|C|D`
      elif _token[0] == sre_parse.BRANCH:
        _, branch = _token[1]
        if self.debug:
          print('\tbranch:', len(branch))
        sub_score = 0
        for child in branch:
          sub_score += self.token_score(child)
        score += sub_score / float(len(branch))
      # add score from character set `[]`
      elif _token[0] == sre_parse.IN:
        if self.debug:
          print('\tscore:', self.in_score)
        score += self.in_score
      # add score from fixed literal
      elif _token[0] == sre_parse.LITERAL:
        literal = chr(_token[1])
        if self.debug:
          print('\tchr:', literal)
        if literal in string.ascii_letters:
          score += self.ascii_score
        elif literal in string.digits:
          score += self.digit_score
        else:
          score += self.literal_default_score
    return score
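
To see what `token_score` actually walks over, it can help to print the parse tree of a pattern. The following is a minimal sketch, not part of the class above: the helper `walk` is hypothetical, and the import fallback assumes that the private `re._parser` module (Python 3.11+) exposes the same names as `sre_parse`.

```python
import warnings

# Prefer the private re._parser (Python 3.11+); fall back to the
# deprecated sre_parse module on older interpreters.
try:
    import re._parser as sre_parse
except ImportError:
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", DeprecationWarning)
        import sre_parse


def walk(tokens, depth=0):
    """Yield (depth, opcode name) for each token, descending into children."""
    for op, arg in tokens:
        yield depth, op.name
        if op == sre_parse.SUBPATTERN:
            # arg is (group, add_flags, del_flags, child)
            yield from walk(arg[-1], depth + 1)
        elif op in (sre_parse.MAX_REPEAT, sre_parse.MIN_REPEAT):
            # arg is (min, max, child)
            yield from walk(arg[2], depth + 1)
        elif op == sre_parse.BRANCH:
            # arg is (None, [alternative, ...])
            for alternative in arg[1]:
                yield from walk(alternative, depth + 1)


# The Google Cloud Platform API Key pattern quoted earlier in this comment.
GCP_KEY = "(?i)^([0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12})$"

for depth, name in walk(sre_parse.parse(GCP_KEY)):
    print("  " * depth + name)
```

For this pattern the tree contains only anchors, one group, three repeated character sets, and two hyphen literals. There are no letter or digit literals, so `ascii_score` and `digit_score` never contribute, which may be one reason the score stays low despite the high rarity.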

Feel free to comment or suggest your thoughts.
I'm looking forward to discussing this with anyone.
