markbaggett / freq

This is a repository for freq.py and freq_server.py

License: MIT License

Python 100.00%

freq's Issues

Contributing

I'm not sure what your process for contributing to this project is. I have a few things I would like to tackle to help out. Is there a specific process in place, or is it simply fork, make the changes, and open a PR?

pip package

Do you want to make this a pip-installable package?

package reorg

Hey Mark,

Over the next couple of weeks I wanted to work on this project. One of the things I'd like to do is reorg the structure of the code here to something like this:

/root (freq)
- freq (source)
- bin (.exe, systemd, upstart, etc)
- test
- setup.py
- README.md

This would organize the code a little better, allow the package to be installed and uploaded to PyPI, and make it easier to create installers for local systems. What are your thoughts on this?

Combining freq score with Alexa top 1M

Hi,

Is it possible to add a new feature that checks the rank of the domain in the Alexa top 1M? That way we could compare the freq score with the Alexa rank and reduce false positives. Ideally we could query for just the freq score, just the Alexa rank, or both.
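
A minimal sketch of how such a combined lookup might work. It assumes the Alexa list is available as a CSV of `rank,domain` rows; `load_ranks` and `combined` are hypothetical names, and `freq_score` stands in for whatever value freq returns for the domain.

```python
import csv
import io

def load_ranks(csv_file):
    """Build a domain -> rank dict from 'rank,domain' rows (assumed format)."""
    ranks = {}
    for row in csv.reader(csv_file):
        # Skip blank or malformed rows, including a possible header line.
        if len(row) != 2 or not row[0].strip().isdigit():
            continue
        rank, domain = row
        ranks[domain.strip().lower()] = int(rank)
    return ranks

def combined(domain, freq_score, ranks):
    """Return the freq score next to the rank (None if the domain is unranked)."""
    return {
        "domain": domain,
        "freq": freq_score,
        "alexa_rank": ranks.get(domain.lower()),
    }

# A two-line stand-in for the real list, used here only for illustration.
sample = io.StringIO("1,google.com\n2,youtube.com\n")
ranks = load_ranks(sample)
print(combined("google.com", 5.0, ranks))
```

A caller could then treat a low freq score as suspicious only when `alexa_rank` is `None` or above some threshold.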

duplicate words can be tallied

If a new word is added to the table using tally_str and that word has already been processed, its pairs are counted a second time, which effectively duplicates data. The tool should keep track of the words it has already processed and skip duplicates.

A simple solution could be to keep a set() internal to FreqCounter and check that the word hasn't already been processed.
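
The set-based check could look roughly like this. `DedupTally` is a hypothetical stand-in written for illustration, not the real FreqCounter; only the method name `tally_str` is borrowed from freq.py.

```python
from collections import Counter

class DedupTally:
    """Hypothetical counter that tallies character pairs once per distinct word."""

    def __init__(self):
        self.pairs = Counter()  # pair -> number of times seen
        self.seen = set()       # words that have already been tallied

    def tally_str(self, word):
        """Tally the pairs in word; return False if the word was seen before."""
        if word in self.seen:
            return False
        self.seen.add(word)
        for first, second in zip(word, word[1:]):
            self.pairs[first + second] += 1
        return True

counter = DedupTally()
counter.tally_str("google")
counter.tally_str("google")  # ignored: already processed
print(counter.pairs["go"])
```

One trade-off is that the set grows with the number of distinct words, and it would have to be saved alongside the table for the check to survive between runs.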

An always true statement

On the line below, there is never a case where the key will be longer than a one char string, and therefore the second statement will always evaluate to true:

https://github.com/MarkBaggett/freq/blob/master/freq.py#L23

if self.parent.ignore_case and (key.islower() or key.isupper()):

can therefore be changed to:

if self.parent.ignore_case:
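
Before making this change, it may be worth checking how `str.islower()` and `str.isupper()` behave on one-character strings. The snippet below shows standard Python behavior only and is not taken from freq.py; `is_cased` is a hypothetical helper. Both methods return False for characters that have no case, such as digits and punctuation, so the second clause is not true for every one-character key.

```python
def is_cased(key):
    """Mirror the second clause of the condition for a single key."""
    return key.islower() or key.isupper()

# One-character keys of different kinds: letters, a digit, punctuation, a space.
for key in ["a", "Z", "1", "-", " "]:
    print(repr(key), is_cased(key))
```

If the table can contain non-letter keys, dropping the clause would change behavior for them.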

No license

This repo has no license applied. I believe it is intended to be publicly available, in which case it is probably important to clarify that intent (and explicitly limit author liability) with a common OSS license.

proxy

Hi,
I was wondering if you could help me use an HTTP proxy inside your freq_server.py code.
Thanks

Mjraha592

upstart config file

One of the CentOS 6 servers uses upstart. Here's the conf file I put in /etc/init/ to run this as a service:


description     "Freq Server"
start on filesystem or runlevel [2345]
stop on runlevel [!2345]

respawn

script
      export HOME="/usr/bin"
      echo $$ > /var/run/freqserver.pid
      exec python /opt/freqserver/freq_server.py -ip <IP> <port> /opt/freqserver/english_lowercase.freq
end script

pre-start script
      echo "[`date`] Freq Server starting" >> /var/log/messages
end script

pre-stop script
      echo "[`date`] Freq Server stopping" >> /var/log/messages
end script

Test Cases

Is there a set of test cases that can be run to ensure the lib is working properly?

Could use a lot of PEP8 style refactoring

This looks like a very cool project. One thing I noticed when reviewing the code is that there are numerous Python coding style violations. Tools like flake8, pylint, and black can help a lot with this.

https://www.python.org/dev/peps/pep-0008/

https://pypi.org/project/flake8/
https://github.com/PyCQA/pylint
https://github.com/ambv/black

Bulk Mode Missing

One of my favorite features of the older version was passing bulk data to it.

diff --git a/freq.py b/freq.py
index 6466ee0..f01a460 100755
--- a/freq.py
+++ b/freq.py
@@ -160,6 +160,7 @@ if __name__ == "__main__":
     parser.add_argument('-m','--measure',required=False,help='Measure likelihood of a given string',dest='measure')
     parser.add_argument('-n','--normal',required=False,help='Update the table based on the following normal string',dest='normal')
     parser.add_argument('-f','--normalfile',required=False,help='Update the table based on the contents of the normal file',dest='normalfile')
+    parser.add_argument('-b','--bulk',required=False,help='File containing a list of strings to test',dest='bulkfile')
     parser.add_argument('-p','--print',action='store_true',required=False,help='Print a table of the most likely letters in order',dest='printtable')
     parser.add_argument('-c','--create',action='store_true',required=False,help='Create a new empty frequency table',dest='create')
     parser.add_argument('-v','--verbose',action='store_true',required=False,help='show calculation process',dest='verbose')
@@ -199,4 +200,12 @@ if __name__ == "__main__":
             for eachline in filehandle:
                 fc.tally_str(eachline.decode("latin1"))
     if args.measure: print(fc.probability(args.measure))
+    if args.bulkfile:
+        try:
+            with open(args.bulkfile, 'r') as bulkFile:
+                for line in bulkFile:
+                    line = line.strip()
+                    print("%-30s %s" %(line, fc.probability(line)))
+        except FileNotFoundError:
+            sys.stderr.write("Failed to find bulk file: %s\n" %(args.bulkfile))
     fc.save(args.freqtable)

I wrote a Haskell implementation

I work at a company called Layer 3 Communications in Atlanta, where we use Haskell. I wrote a Haskell implementation of freq, which you can see here: https://github.com/chessai/freq

We have a datatype called a FreqTrain which is essentially a nested Map, based off the original code located here. The problem with it is that lookups are slow, because lookups on a Map are log(n), and it's nested, so that's log^2(n), and we perform a number of lookups equal to the length of the string minus 1, or (k - 1) * log^2(n). However, we get O(1) lookup by turning a 'FreqTrain' into a 'Freq', and inlining the results of what the lookup would be to get even better constant factors. Benchmarks show that 'Freq's are about 100 times faster to read from than 'FreqTrain's. Also, training from files takes about 2 seconds.
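
The idea of trading a nested map for a flat table can be illustrated in Python. This sketch is not the Haskell code from the linked repository; `flatten` and `lookup` are hypothetical names, and it assumes every character has a code point below 256 so that a pair maps to one index in a 256 x 256 array.

```python
def flatten(nested):
    """Convert {first: {second: count}} into a flat list.

    The entry for a pair lives at index ord(first) * 256 + ord(second).
    """
    flat = [0] * (256 * 256)
    for first, followers in nested.items():
        for second, count in followers.items():
            flat[ord(first) * 256 + ord(second)] = count
    return flat

def lookup(flat, first, second):
    """Single indexed read instead of two nested map lookups."""
    return flat[ord(first) * 256 + ord(second)]

# A tiny nested table used only for illustration.
nested = {"g": {"o": 3, "l": 1}, "o": {"o": 2, "g": 1}}
flat = flatten(nested)
print(lookup(flat, "g", "o"))
```

Python dicts are hash tables rather than balanced trees, so the asymptotic gain described above for Haskell's Map does not carry over directly; the sketch only shows the shape of the transformation.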

How to use this tool?

How do you configure and use this tool?

Error on raise exception

There seems to be a typo on the following line, but this line would only cause an error if the specified frequency table could not be found.

Technically it is valid syntax, but the object does not exist.

freq/freq_server.py

Line 201 in 48010fd

raiseeerver.fcs[eachtable]

get 2 scores returned

Hi Mark,

I just pulled the so-freqserver image (which runs the same command line as yours):

docker pull securityonionsolutions/so-freqserver
docker run -p 10004:10004 securityonionsolutions/so-freqserver

How come it returns 2 scores when I query it? When I installed freq_server from scratch I never had that.

root@ubuntu:/opt# curl http://localhost:10004/measure/crimsoncore.be
(10.0239, 8.6545)

Cheers,
Luk
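
A client that has to cope with both response shapes could parse the body along these lines. This is a sketch that assumes the body is a Python-style literal, as in the curl output above, either a single number or a tuple of numbers; `parse_scores` is a hypothetical helper and not part of freq_server.py.

```python
import ast

def parse_scores(body):
    """Parse a response body into a tuple of floats.

    Accepts either a bare number or a tuple such as '(10.0239, 8.6545)'.
    """
    value = ast.literal_eval(body.strip())
    if isinstance(value, tuple):
        return tuple(float(v) for v in value)
    return (float(value),)

print(parse_scores("(10.0239, 8.6545)"))
print(parse_scores("10.0239\n"))
```

`ast.literal_eval` only evaluates literals, so it is a safer choice than `eval` for text received over the network.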

Loading a table is slow

I created a freqtable from the top 1 million domain names but loading it takes 5 minutes:

$ time freq.py -v -m 'google' test.freq
Ignoring Case: True
Ignoring Characters: 
	~`!@#$%^&*()_+-
All pairs: ['go', 'og', 'le', 'oo', 'gl']
Probability of go: [0.19683947542892255]
Probability of og: [0.19683947542892255, 0.07902049839017421]
Probability of le: [0.19683947542892255, 0.07902049839017421, 0.23942299582507484]
Probability of oo: [0.19683947542892255, 0.07902049839017421, 0.23942299582507484, 0.10635868072983433]
Probability of gl: [0.19683947542892255, 0.07902049839017421, 0.23942299582507484, 0.10635868072983433, 0.1602173315437978]
Average Probability: 15.637179638356075% 


Letter1:248460000 Letter2:48691000  - This pair g:248460000 o:48691000
Letter1:822381000 Letter2:93875000  - This pair o:573921000 g:45184000
Letter1:1203860000 Letter2:184714000  - This pair l:381479000 e:90839000
Letter1:1777781000 Letter2:245530000  - This pair o:573921000 o:60816000
Letter1:2026241000 Letter2:285162000  - This pair g:248460000 l:39632000
Total Word Probability: 285162000/2026241000 = 14.073449308349797
(15.6372, 14.0734)

real	5m6.152s
user	5m2.743s
sys	0m3.197s
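
The verbose output above suggests how the two numbers in the final tuple are derived: the first is the mean of the per-pair probabilities, and the second is the sum of the pair counts divided by the sum of the first-letter totals. The sketch below reproduces that arithmetic from the printed values. It is inferred from the trace rather than from the source code, and the variable names are my own.

```python
# Per-pair probabilities as printed in the trace.
pair_probabilities = [
    0.19683947542892255,
    0.07902049839017421,
    0.23942299582507484,
    0.10635868072983433,
    0.1602173315437978,
]

# "This pair" values from the trace: first-letter total and pair count.
first_letter_totals = [248460000, 573921000, 381479000, 573921000, 248460000]
pair_counts = [48691000, 45184000, 90839000, 60816000, 39632000]

# First score: average of the pair probabilities, as a percentage.
average_probability = sum(pair_probabilities) / len(pair_probabilities) * 100

# Second score: running totals compared once at the end, as a percentage.
word_probability = sum(pair_counts) / sum(first_letter_totals) * 100

scores = (round(average_probability, 4), round(word_probability, 4))
print(scores)
```

The rounded results match the `(15.6372, 14.0734)` line in the trace.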