markbaggett / freq
This is a repository for freq.py and freq_server.py
License: MIT License
I'm not sure what your process for contributing to this project is. I have a few things that I would like to tackle to help out the project. Is there a specific process you have in place for this, or is it simply fork, make the changes, and then open a PR?
Did you want to create this as a pip-installable package?
Hey Mark,
Over the next couple of weeks I wanted to work on this project. One of the things I'd like to do is reorg the structure of the code here to something like this:
/root (freq)
- freq (source)
- bin (.exe, systemd, upstart, etc)
- test
- setup.py
- README.md
This would organize the code a little better, allow it to be packaged, published, and installed with pip, and make it easier to create better installers for local systems. What are your thoughts on this?
Hi,
Is it possible to add a new feature that checks the domain's rank in Alexa? This way, we can compare the freq score with the Alexa rank and reduce false positives. It would be perfect if we could query for just the freq score, just the Alexa rank, or both.
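A minimal sketch of what such a lookup could look like, not an existing feature of freq.py or freq_server.py. It assumes a local ranking file with one "rank,domain" row per line; the names load_ranks, report, and score_fn are hypothetical, and score_fn stands in for whatever function returns the freq score.

```python
import csv
import io


def load_ranks(csv_file):
    """Read "rank,domain" rows into a {domain: rank} dict (assumed file layout)."""
    ranks = {}
    for row in csv.reader(csv_file):
        if len(row) != 2:
            continue  # skip blank or malformed rows
        rank, domain = row
        ranks[domain.strip().lower()] = int(rank)
    return ranks


def report(domain, ranks, score_fn, fields=("freq", "rank")):
    """Return only the requested fields: the freq score, the rank, or both."""
    result = {}
    if "freq" in fields:
        result["freq"] = score_fn(domain)  # hypothetical scoring callback
    if "rank" in fields:
        result["rank"] = ranks.get(domain.lower())  # None when not ranked
    return result


if __name__ == "__main__":
    sample = io.StringIO("1,google.com\n2,youtube.com\n")
    ranks = load_ranks(sample)
    # A constant score is used here purely as a placeholder.
    print(report("google.com", ranks, score_fn=lambda d: 0.0))
    print(report("google.com", ranks, score_fn=lambda d: 0.0, fields=("rank",)))
```

Keeping the ranks in a dict means the rank check adds one hash lookup per query once the file is loaded.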
If a new word is added to the table using tally_str and that word has already been processed, the table ends up with duplicate data. The tool should keep track of already processed words and check that no duplicates are created.
A simple solution could be to keep a set() internal to FreqCounter and check that the word hasn't already been processed.
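The idea could be sketched as a small wrapper like the one below. This is an illustration under assumptions, not the project's code: DedupTally and PairCounter are hypothetical names, and PairCounter is only a stand-in for FreqCounter so the example runs on its own.

```python
class DedupTally:
    """Wrap any object that has tally_str(word) and skip repeated words."""

    def __init__(self, counter):
        self.counter = counter
        self.seen = set()  # words that have already been tallied

    def tally_str(self, word):
        """Tally word once; return True if it was new, False if skipped."""
        if word in self.seen:
            return False
        self.seen.add(word)
        self.counter.tally_str(word)
        return True


class PairCounter:
    """Hypothetical stand-in for FreqCounter: counts adjacent character pairs."""

    def __init__(self):
        self.pairs = {}

    def tally_str(self, word):
        for first, second in zip(word, word[1:]):
            key = (first, second)
            self.pairs[key] = self.pairs.get(key, 0) + 1


if __name__ == "__main__":
    tally = DedupTally(PairCounter())
    print(tally.tally_str("google"))  # new word, tallied
    print(tally.tally_str("google"))  # repeat, skipped
```

Note that the set only lives for the duration of one run; remembering processed words across runs would require storing them alongside the table.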
On the line below, the key is never longer than a one-character string, so the second condition will always evaluate to true:
https://github.com/MarkBaggett/freq/blob/master/freq.py#L23
if self.parent.ignore_case and (key.islower() or key.isupper()):
It can therefore be changed to:
if self.parent.ignore_case:
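A quick way to check how that sub-expression behaves for one-character keys is sketched below; case_check is a hypothetical helper, not part of freq.py. Cased letters give True, while characters with no case (digits, punctuation) give False. Since str.lower() and str.upper() leave such characters unchanged, dropping the check should not alter results as long as the branch only changes the key's case, which is an assumption about the surrounding code.

```python
def case_check(key):
    """Evaluate the sub-expression from the condition for a single key."""
    return key.islower() or key.isupper()


if __name__ == "__main__":
    for key in ("a", "A", "7", "-"):
        # Also show that lower() is a no-op for characters without case.
        print(repr(key), case_check(key), repr(key.lower()))
```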
This repo has no license applied. I believe it is intended to be publicly available, in which case it is probably important to clarify that intent (and explicitly limit author liability) with a common OSS license.
Hi,
I was wondering if you could help me use an HTTP proxy with your freq_server.py code.
Thanks
One of the CentOS 6 servers uses upstart. Here's the conf file I put in /etc/init/ to run this as a service:
description "Freq Server"
start on filesystem or runlevel [2345]
stop on runlevel [!2345]
respawn
script
    export HOME="/usr/bin"
    echo $$ > /var/run/freqserver.pid
    exec python /opt/freqserver/freq_server.py -ip <IP> <port> /opt/freqserver/english_lowercase.freq
end script
pre-start script
    echo "[`date`] Freq Server starting" >> /var/log/messages
end script
pre-stop script
    echo "[`date`] Freq Server stopping" >> /var/log/messages
end script
Is there a set of test cases that can be run to ensure the lib is working properly?
This looks like a very cool project. One thing I noticed when reviewing the code is that there are numerous Python coding style violations. Tools like flake8, pylint, and black can help a lot with this.
https://www.python.org/dev/peps/pep-0008/
https://pypi.org/project/flake8/
https://github.com/PyCQA/pylint
https://github.com/ambv/black
One of my favorites was passing bulk data to the older version.
diff --git a/freq.py b/freq.py
index 6466ee0..f01a460 100755
--- a/freq.py
+++ b/freq.py
@@ -160,6 +160,7 @@ if __name__ == "__main__":
     parser.add_argument('-m','--measure',required=False,help='Measure likelihood of a given string',dest='measure')
     parser.add_argument('-n','--normal',required=False,help='Update the table based on the following normal string',dest='normal')
     parser.add_argument('-f','--normalfile',required=False,help='Update the table based on the contents of the normal file',dest='normalfile')
+    parser.add_argument('-b','--bulk',required=False,help='File containing a list of strings to test',dest='bulkfile')
     parser.add_argument('-p','--print',action='store_true',required=False,help='Print a table of the most likely letters in order',dest='printtable')
     parser.add_argument('-c','--create',action='store_true',required=False,help='Create a new empty frequency table',dest='create')
     parser.add_argument('-v','--verbose',action='store_true',required=False,help='show calculation process',dest='verbose')
@@ -199,4 +200,12 @@ if __name__ == "__main__":
             for eachline in filehandle:
                 fc.tally_str(eachline.decode("latin1"))
     if args.measure: print(fc.probability(args.measure))
+    if args.bulkfile:
+        try:
+            with open(args.bulkfile, 'r') as bulkFile:
+                for line in bulkFile:
+                    line = line.strip()
+                    print("%-30s %s" %(line, fc.probability(line)))
+        except FileNotFoundError:
+            sys.stderr.write("Failed to find bulk file: %s\n" %(args.bulkfile))
     fc.save(args.freqtable)
I work at a company called Layer 3 Communications in Atlanta, we use Haskell. I wrote a Haskell implementation of freq, which you can see here: https://github.com/chessai/freq
We have a datatype called a FreqTrain which is essentially a nested Map, based off the original code located here. The problem with it is that lookups are slow, because lookups on a Map are log(n), and it's nested, so that's log^2(n), and we perform a number of lookups equal to the length of the string minus 1, or (k - 1) * log^2(n). However, we get O(1) lookup by turning a 'FreqTrain' into a 'Freq', and inlining the results of what the lookup would be to get even better constant factors. Benchmarks show that 'Freq's are about 100 times faster to read from than 'FreqTrain's. Also, training from files takes about 2 seconds.
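The general idea of turning a nested lookup structure into a flat, directly indexed one can be sketched in Python as follows. This is only an illustration of the technique, not the Haskell implementation: flatten and lookup are hypothetical names, and the sketch assumes every character has a code point below 256. Python dicts are already hash tables, so the sketch shows the data layout rather than a measured speedup.

```python
def flatten(nested, size=256):
    """Copy a nested {first: {second: count}} mapping into one flat list."""
    flat = [0] * (size * size)
    for first, inner in nested.items():
        for second, count in inner.items():
            # Each (first, second) pair maps to exactly one slot.
            flat[ord(first) * size + ord(second)] = count
    return flat


def lookup(flat, first, second, size=256):
    """Read the count for a pair with a single index computation."""
    return flat[ord(first) * size + ord(second)]


if __name__ == "__main__":
    nested = {"g": {"o": 3, "l": 1}, "o": {"o": 2, "g": 1}}
    flat = flatten(nested)
    print(lookup(flat, "g", "o"))
    print(lookup(flat, "z", "z"))  # unseen pairs read as 0
```

The trade-off is memory: the flat table reserves a slot for every possible pair, whether or not it was ever seen during training.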
How do you configure and use this tool?
There seems to be a typo on the following line, but this line would only cause an error if the frequency table specified could not be found.
Technically it is valid syntax, but the object does not exist.
Line 201 in 48010fd
Hi Mark,
I just pulled so-freqserver (which runs the same command line as yours):
docker pull securityonionsolutions/so-freqserver
docker run -p 10004:10004 securityonionsolutions/so-freqserver
How come it returns 2 scores when I query it? When I installed freq_server from scratch I never had that.
root@ubuntu:/opt# curl http://localhost:10004/measure/crimsoncore.be
(10.0239, 8.6545)
Cheers,
Luk
I created a freqtable from the top 1 million domain names but loading it takes 5 minutes:
$ time freq.py -v -m 'google' test.freq
Ignoring Case: True
Ignoring Characters:
~`!@#$%^&*()_+-
All pairs: ['go', 'og', 'le', 'oo', 'gl']
Probability of go: [0.19683947542892255]
Probability of og: [0.19683947542892255, 0.07902049839017421]
Probability of le: [0.19683947542892255, 0.07902049839017421, 0.23942299582507484]
Probability of oo: [0.19683947542892255, 0.07902049839017421, 0.23942299582507484, 0.10635868072983433]
Probability of gl: [0.19683947542892255, 0.07902049839017421, 0.23942299582507484, 0.10635868072983433, 0.1602173315437978]
Average Probability: 15.637179638356075%
Letter1:248460000 Letter2:48691000 - This pair g:248460000 o:48691000
Letter1:822381000 Letter2:93875000 - This pair o:573921000 g:45184000
Letter1:1203860000 Letter2:184714000 - This pair l:381479000 e:90839000
Letter1:1777781000 Letter2:245530000 - This pair o:573921000 o:60816000
Letter1:2026241000 Letter2:285162000 - This pair g:248460000 l:39632000
Total Word Probability: 285162000/2026241000 = 14.073449308349797
(15.6372, 14.0734)
real 5m6.152s
user 5m2.743s
sys 0m3.197s
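The verbose output above suggests where the two values in the final tuple come from: the first matches the average of the per-pair probabilities, and the second matches the summed pair counts divided by the summed first-letter counts. The sketch below only redoes that arithmetic with the numbers printed above; the variable names are made up for illustration and it does not call freq.py.

```python
# Per-pair probabilities copied from the verbose output.
pair_probabilities = [
    0.19683947542892255,
    0.07902049839017421,
    0.23942299582507484,
    0.10635868072983433,
    0.1602173315437978,
]

# (first-letter count, pair count) for each pair, from the "This pair" columns.
pair_counts = [
    (248460000, 48691000),  # g, o
    (573921000, 45184000),  # o, g
    (381479000, 90839000),  # l, e
    (573921000, 60816000),  # o, o
    (248460000, 39632000),  # g, l
]

# First value: mean of the per-pair probabilities, as a percentage.
average = sum(pair_probabilities) / len(pair_probabilities) * 100

# Second value: total pair occurrences over total first-letter occurrences.
letter_total = sum(letter for letter, _ in pair_counts)
pair_total = sum(pair for _, pair in pair_counts)
word = pair_total / letter_total * 100

scores = (round(average, 4), round(word, 4))
print(scores)  # (15.6372, 14.0734)
```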