
Comments (2)

bee-san commented on May 28, 2024

This is cool! Is Regexp Score the 'rarity' in this case? Can you build a function that other people can use to easily test this? Maybe put it into /scripts?

If you do Regexp Score mod 10 it'll put it into the 10-point range for us :) And I might suggest rounding the result in another column too (so 0.51 becomes 1, 0.49 becomes 0) just to see what that's like :D

thanks so much for this, this is absolutely great 🔥 A non-subjective formal way to define rarity would be absolutely amazing :)

from pywhat.

nodtem66 commented on May 28, 2024

It seems promising, but it's just a prototype and needs a lot of work.
For example:

  • What are the best parameters for RegExScore? (repeat_score, in_score, ascii_score, digit_score, etc.)
  • Some regexes with high rarity produce a low score, for example:
    Google Cloud Platform API Key
    Regexp = "(?i)^([0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12})$"
    rarity = 0.8
    score = 0.07
  • This means we need a more accurate metric for scoring regexes.
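To see why this pattern scores low, it can help to look at what the parser actually hands to the scorer. The sketch below is not part of the original script: `count_ops` is a hypothetical helper that walks the parse tree and counts token types. It assumes the parser is importable as `sre_parse`, or as the private `re._parser` on Python 3.11+, where `sre_parse` is deprecated.

```python
from collections import Counter

try:
    # Python 3.11+: private module, may change between versions.
    from re import _parser as sre_parse
except ImportError:
    # Older Pythons ship the parser as a top-level module.
    import sre_parse


def count_ops(tokens, counts=None):
    """Recursively count token types in a parsed pattern (hypothetical helper)."""
    counts = Counter() if counts is None else counts
    for op, arg in tokens:
        counts[op] += 1
        if op == sre_parse.SUBPATTERN:
            count_ops(arg[3], counts)   # arg = (group, add_flags, del_flags, child)
        elif op == sre_parse.MAX_REPEAT:
            count_ops(arg[2], counts)   # arg = (min, max, child)
        elif op == sre_parse.BRANCH:
            for child in arg[1]:        # arg = (None, [branch, ...])
                count_ops(child, counts)
    return counts


pattern = r"(?i)^([0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12})$"
counts = count_ops(sre_parse.parse(pattern))
for op, n in counts.items():
    print(op, n)
```

For this pattern the tree contains three quantified character sets (three `MAX_REPEAT`, three `IN`) and two `LITERAL` tokens, both hyphens. There are no fixed letters or digits, so the highest weights (`ascii_score`, `digit_score`) never apply.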

By the way, you can try the script in Google Colab.
This is the RegExScore class source code:

import sre_parse  # deprecated since Python 3.11, but still importable
import string

# calculate score from literal string (e.g. prefix, suffix)
# use weight metric (*_score) parameters
class RegExScore():
  def __init__(self, 
               repeat_score = 0.01, # score for quantifier `{0,}`
               in_score = 0.1,      # score for character set `[*]`
               ascii_score=1.0,     # score for a fixed ascii letter `a-zA-Z`
               digit_score=0.2,     # score for a fixed digit `0-9`
               literal_default_score=0.01, # score for any other literal (e.g. whitespace, punctuation)
               debug=False # print the debug message
    ):
    self.repeat_score = repeat_score
    self.in_score = in_score
    self.ascii_score = ascii_score
    self.digit_score = digit_score
    self.literal_default_score = literal_default_score
    self.debug = debug

  def calculate(self, regexp:str):
    return self.token_score(sre_parse.parse(regexp))

  def token_score(self, tokens:tuple):
    score = 0
    for _token in tokens:
      if self.debug:
        print("Loop: ", _token)

      # add the score from subpattern `()`
      if _token[0] == sre_parse.SUBPATTERN:
        _, _, _, child = _token[1]
        if self.debug:
          print(_token[0], len(child))
        score += self.token_score(child)
      # add score from quantifier `{min,max}`
      elif _token[0] == sre_parse.MAX_REPEAT:
        _min, _max, child = _token[1]
        # weight by the upper bound; use the lower bound when the quantifier is unbounded
        _score = self.repeat_score * (_min if _max == sre_parse.MAXREPEAT else _max)
        if self.debug:
          print('\tscore:', _score)
        score += _score + self.token_score(child)
      # add score from mean of branch group `A|B|C|D`
      elif _token[0] == sre_parse.BRANCH:
        _, branch = _token[1]
        if self.debug:
          print('\tbranch:', len(branch))
        sub_score = 0
        for child in branch:
          sub_score += self.token_score(child)
        score += sub_score / float(len(branch))
      # add score from character set `[]`
      elif _token[0] == sre_parse.IN:
        if self.debug:
          print('\tscore:', self.in_score)
        score += self.in_score
      # add score from fixed literal
      elif _token[0] == sre_parse.LITERAL:
        literal = chr(_token[1])
        if self.debug:
          print('\tchr:', literal)
        if literal in string.ascii_letters:
          score += self.ascii_score
        elif literal in string.digits:
          score += self.digit_score
        else:
          score += self.literal_default_score
    return score
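
A small harness, along the lines requested earlier in the thread, could put the hand-assigned rarity next to the computed score and a rounded score. This is only a sketch of what such a helper might look like, not code from pywhat: `compare_scores` and its row layout are hypothetical, and `scorer` is any callable that maps a regex string to a number, for example `RegExScore().calculate`.

```python
def compare_scores(scorer, entries):
    """Return one row per entry: (name, rarity, score, rounded score).

    `entries` is an iterable of (name, regexp, rarity) tuples.
    `scorer` is any callable taking a regex string and returning a float.
    """
    rows = []
    for name, regexp, rarity in entries:
        score = scorer(regexp)
        rows.append((name, rarity, score, round(score)))
    return rows


entries = [
    ("Google Cloud Platform API Key",
     r"(?i)^([0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12})$",
     0.8),
]

# Stand-in scorer for demonstration only; swap in RegExScore().calculate.
def demo_scorer(regexp):
    return 0.07

for name, rarity, score, rounded in compare_scores(demo_scorer, entries):
    print(f"{name}: rarity={rarity} score={score:.2f} rounded={rounded}")
```

Note that Python's built-in `round` rounds exact halves to the nearest even integer, so `round(0.5)` is 0 while `round(0.51)` is 1.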

Feel free to comment or suggest your thoughts.
I'm looking forward to discussing this with anyone.
