Comments (2)
This is cool! Is Regexp Score the 'rarity' in this case? Can you build a function that other people can use to easily test this? Maybe put it into /scripts?
If you do Regexp Score mod 10 it'll put it into the 10-point range for us :) And I might suggest rounding the result in another column too (so 0.51 becomes 1, 0.49 becomes 0) just to see what that's like :D
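The rounding idea can be sketched in a few lines. This is only an illustration: `round_half_up` and the sample rows are hypothetical names and values, not part of pywhat. It uses `math.floor(score + 0.5)` rather than the built-in `round`, because Python's `round` sends exact halves to the nearest even integer (`round(0.5)` is `0`).

```python
import math


def round_half_up(score):
    """Round to the nearest integer, with exact halves going up."""
    return math.floor(score + 0.5)


# Hypothetical sample rows: (regex name, score).
rows = [("Example A", 0.51), ("Example B", 0.49), ("Example C", 0.07)]

# Append the rounded value as an extra column.
table = [(name, score, round_half_up(score)) for name, score in rows]

for name, score, rounded in table:
    print(f"{name}\t{score}\t{rounded}")
```

With these made-up rows, 0.51 lands in the `1` column and 0.49 in the `0` column, which is the behaviour described above.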
Thanks so much for this, this is absolutely great!
from pywhat.
It seems promising, but it's just a prototype and needs a lot of work.
For example:
- What are the best parameters for RegExScore? (repeat_score, in_score, ascii_score, digit_score, etc.)
- Some regexes with high rarity produce a low score, such as:
  Google Cloud Platform API Key
  Regexp = "(?i)^([0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12})$"
  rarity = 0.8
  score = 0.07
- This means we need a more accurate metric for scoring regexes.
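One way to see why a pattern like the one above scores low is to look at which token types it parses into. The sketch below is an assumption-laden illustration, not part of the prototype: `count_tokens` is a hypothetical helper, and it relies on the same internal parser module the class uses (`re._parser` on Python 3.11+, `sre_parse` before that), which is not a stable public API.

```python
from collections import Counter

try:
    from re import _parser as sre_parse  # Python 3.11+
except ImportError:
    import sre_parse  # older Python versions


def count_tokens(pattern):
    """Count opcode names in the parse tree of a regular expression."""
    counts = Counter()

    def walk(tokens):
        for op, arg in tokens:
            counts[str(op)] += 1
            if op == sre_parse.SUBPATTERN:
                walk(arg[-1])  # last element is the nested pattern
            elif op in (sre_parse.MAX_REPEAT, sre_parse.MIN_REPEAT):
                walk(arg[2])  # (min, max, child)
            elif op == sre_parse.BRANCH:
                for branch in arg[1]:
                    walk(branch)

    walk(sre_parse.parse(pattern))
    return counts


pattern = "(?i)^([0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12})$"
print(count_tokens(pattern))
```

For this pattern the only fixed literals are the two `-` separators; everything else is a character set inside a quantifier. Those are the token types with the smallest default weights, so the metric has little to reward.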
By the way, you can try the script in Google Colab.
This is the RegExScore class source code:
import sre_parse  # internal parser of the `re` module; deprecated since Python 3.11
import string

# Calculate a score from the literal parts of a regex (e.g. prefix, suffix),
# weighted by the *_score parameters.
class RegExScore():
    def __init__(self,
                 repeat_score=0.01,           # score per repetition of a quantifier `{min,max}`
                 in_score=0.1,                # score for a character set `[...]`
                 ascii_score=1.0,             # score for a fixed ASCII letter `a-zA-Z`
                 digit_score=0.2,             # score for a fixed digit `0-9`
                 literal_default_score=0.01,  # score for any other literal (whitespace, punctuation)
                 debug=False                  # print debug messages
                 ):
        self.repeat_score = repeat_score
        self.in_score = in_score
        self.ascii_score = ascii_score
        self.digit_score = digit_score
        self.literal_default_score = literal_default_score
        self.debug = debug

    def calculate(self, regexp: str):
        return self.token_score(sre_parse.parse(regexp))

    def token_score(self, tokens):
        # `tokens` is a parsed pattern: an iterable of (opcode, argument) pairs
        score = 0
        for _token in tokens:
            if self.debug:
                print("Loop: ", _token)
            # add the score from subpattern `()`
            if _token[0] == sre_parse.SUBPATTERN:
                _, _, _, child = _token[1]
                if self.debug:
                    print(_token[0], len(child))
                score += self.token_score(child)
            # add score from quantifier `{min,max}`
            elif _token[0] == sre_parse.MAX_REPEAT:
                _min, _max, child = _token[1]
                # use `_min` when the upper bound is unbounded, otherwise `_max`
                _count = _min if _max == sre_parse.MAXREPEAT else _max
                _score = self.repeat_score * _count
                if self.debug:
                    print('\tscore:', _score)
                score += _score + self.token_score(child)
            # add score from mean of branch group `A|B|C|D`
            elif _token[0] == sre_parse.BRANCH:
                _, branch = _token[1]
                if self.debug:
                    print('\tbranch:', len(branch))
                sub_score = 0
                for child in branch:
                    sub_score += self.token_score(child)
                score += sub_score / float(len(branch))
            # add score from character set `[]`
            elif _token[0] == sre_parse.IN:
                if self.debug:
                    print('\tscore:', self.in_score)
                score += self.in_score
            # add score from fixed literal
            elif _token[0] == sre_parse.LITERAL:
                literal = chr(_token[1])
                if self.debug:
                    print('\tchr:', literal)
                if literal in string.ascii_letters:
                    score += self.ascii_score
                elif literal in string.digits:
                    score += self.digit_score
                else:
                    score += self.literal_default_score
        return score
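For anyone who wants a single function to experiment with, say as a starting point for something in /scripts, the same rules can be condensed as below. This is a sketch, not the prototype itself: `score_regex` and `DEFAULT_WEIGHTS` are hypothetical names, the weights are copied from the constructor defaults above, and the import fallback assumes the internal parser lives at `re._parser` on Python 3.11+.

```python
import string

try:
    from re import _parser as sre_parse  # Python 3.11+
except ImportError:
    import sre_parse  # older Python versions

# Default weights copied from the RegExScore constructor above.
DEFAULT_WEIGHTS = {
    "repeat": 0.01,
    "in": 0.1,
    "ascii": 1.0,
    "digit": 0.2,
    "other": 0.01,
}


def score_regex(pattern, weights=None):
    """Hypothetical helper: score a pattern with the same rules as RegExScore."""
    w = {**DEFAULT_WEIGHTS, **(weights or {})}

    def walk(tokens):
        total = 0.0
        for op, arg in tokens:
            if op == sre_parse.SUBPATTERN:
                total += walk(arg[-1])
            elif op == sre_parse.MAX_REPEAT:
                lo, hi, child = arg
                # Use the lower bound when the upper bound is unbounded.
                count = lo if hi == sre_parse.MAXREPEAT else hi
                total += w["repeat"] * count + walk(child)
            elif op == sre_parse.BRANCH:
                branches = arg[1]
                total += sum(walk(b) for b in branches) / len(branches)
            elif op == sre_parse.IN:
                total += w["in"]
            elif op == sre_parse.LITERAL:
                ch = chr(arg)
                if ch in string.ascii_letters:
                    total += w["ascii"]
                elif ch in string.digits:
                    total += w["digit"]
                else:
                    total += w["other"]
        return total

    return walk(sre_parse.parse(pattern))


# Try a different weighting without touching the defaults.
print(score_regex("[a-z]{3}", weights={"in": 0.5}))
```

Passing a partial `weights` dictionary overrides only the named entries, which makes it easy to compare parameter choices side by side.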
Feel free to comment or share your thoughts.
I'm looking forward to discussing this with anyone.