
Usage data sanitization (resh issue, closed)

curusarn commented on July 18, 2024
Usage data sanitization

from resh.

Comments (7)

curusarn commented on July 18, 2024

paths -> replace dirs except for common ones
git url -> replace url, replace dirs of path

replace parts of arguments if possible:
scp user@host:path/path -> scp HASH@HASH:HASH/HASH

replace whole arguments if they don't match any special form
git commit -m "message" -> git commit -m HASH

keep standard arguments and commands (?)


curusarn commented on July 18, 2024

Almost done: https://github.com/curusarn/resh/tree/dev_2


curusarn commented on July 18, 2024

I'm handling different types of data differently.

Types

Single value entries

e.g. username, hostname (usually sensitive information)

  1. replace with its hash
    • no exceptions, no whitelist
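
A minimal sketch of the single-value rule above. The thread does not say which hash function or digest length is used, so SHA-256 truncated to 12 hex characters is an assumption, and `hash_token` and `sanitize_single_value` are hypothetical names.

```python
import hashlib


def hash_token(value: str, length: int = 12) -> str:
    """Return a short, deterministic hex digest of value.

    Assumption: SHA-256 and a 12-character prefix; the thread names neither.
    """
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:length]


def sanitize_single_value(value: str) -> str:
    # Single-value entries (username, hostname) are always hashed:
    # no whitelist lookup and no single-character exception.
    return hash_token(value)
```

Because the digest is deterministic, the same username always maps to the same hash, so records from one user can still be grouped after sanitization.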

Paths

  1. split by /
  2. replace each part with its hash
    • unless it's in the whitelist or it's only one character long
  3. join the parts back together with /
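
The path steps above could look roughly like this. The whitelist contents, the hash function (SHA-256, 12 hex characters) and the function names are assumptions made for illustration only.

```python
import hashlib

# Hypothetical whitelist; the real one is described under "Whitelisting".
WHITELIST = {"home", "usr", "etc", "bin", "tmp"}


def hash_token(value: str, length: int = 12) -> str:
    # Assumption: SHA-256 truncated to 12 hex characters.
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:length]


def sanitize_path(path: str) -> str:
    parts = path.split("/")  # step 1: split by /
    sanitized = [
        # step 2: keep whitelisted parts and parts of at most one character
        # (this also keeps the empty part produced by a leading slash)
        part if len(part) <= 1 or part in WHITELIST else hash_token(part)
        for part in parts
    ]
    return "/".join(sanitized)  # step 3: join the parts back together
```

For example, `sanitize_path("/home/alice/x")` keeps `home` and `x` and hashes only `alice`.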

Git origin URL

  1. parse the URL using this library https://github.com/whilp/git-urls
  2. replace each part with its hash
    • unless it's in the whitelist or it's only one character long
  3. serialize the sanitized URL back into a string
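
A rough sketch of the same idea in Python. It does not use the git-urls library linked above; the standard library's `urlsplit` plus a small regex for scp-like remotes stands in for it. What counts as a "part" is not spelled out in the thread, so treating the scheme, user, dot-separated host labels, port and slash-separated path segments as parts is an assumption, as are the whitelist contents and all function names.

```python
import hashlib
import re
from urllib.parse import urlsplit

# Hypothetical whitelist entries.
WHITELIST = {"https", "ssh", "git", "github", "com"}

# scp-like remotes such as user@host:path (no scheme, path not starting with /).
SCP_LIKE = re.compile(r"^(?:(?P<user>[^@/]+)@)?(?P<host>[^:/]+):(?P<path>[^/].*)$")


def hash_token(value: str, length: int = 12) -> str:
    # Assumption: SHA-256 truncated to 12 hex characters.
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:length]


def keep_or_hash(part: str) -> str:
    return part if len(part) <= 1 or part in WHITELIST else hash_token(part)


def sanitize_joined(text: str, sep: str) -> str:
    return sep.join(keep_or_hash(p) for p in text.split(sep))


def sanitize_git_url(url: str) -> str:
    match = SCP_LIKE.match(url)
    if match:
        user = match.group("user")
        prefix = keep_or_hash(user) + "@" if user else ""
        host = sanitize_joined(match.group("host"), ".")
        return prefix + host + ":" + sanitize_joined(match.group("path"), "/")
    parts = urlsplit(url)
    netloc = sanitize_joined(parts.hostname or "", ".")
    if parts.username:
        # Any password in the URL is dropped rather than hashed.
        netloc = keep_or_hash(parts.username) + "@" + netloc
    if parts.port is not None:
        netloc += ":" + keep_or_hash(str(parts.port))
    return keep_or_hash(parts.scheme) + "://" + netloc + sanitize_joined(parts.path, "/")
```

In this sketch a path segment such as `resh.git` is hashed as a whole, so the `.git` suffix is lost; splitting segments on `.` as well would preserve whitelisted extensions.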

Command line

I need to replace the command and arguments separately so that I can analyze partial matches later.
However, I don't want to parse bash.
I'm doing the following:

  1. split the line into consecutive strings of letters and/or digits (tokens)
    • command options are detected and left unhashed
  2. replace each token with its hash
    • unless it's in the whitelist or it's only one character long
  3. join everything back together, leaving the non-alphanumeric characters in place
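
One way the tokenizing approach above might be sketched without parsing bash. How options are detected is not described in the thread; here a whitespace-separated word that starts with `-` or `--` followed by a name is assumed to be an option, and only the option name is left unhashed. The whitelist contents, the hash function and the function names are likewise assumptions.

```python
import hashlib
import re

# Hypothetical whitelist entries.
WHITELIST = {"git", "commit", "scp"}

TOKEN = re.compile(r"[A-Za-z0-9]+")  # consecutive letters and/or digits
OPTION = re.compile(r"^(--?[A-Za-z0-9][A-Za-z0-9-]*)(.*)$", re.S)


def hash_token(value: str, length: int = 12) -> str:
    # Assumption: SHA-256 truncated to 12 hex characters.
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:length]


def keep_or_hash(token: str) -> str:
    return token if len(token) <= 1 or token in WHITELIST else hash_token(token)


def sanitize_text(text: str) -> str:
    # Replace each alphanumeric token; everything else stays where it is.
    return TOKEN.sub(lambda m: keep_or_hash(m.group(0)), text)


def sanitize_word(word: str) -> str:
    option = OPTION.match(word)
    if option:
        # Keep the option name, sanitize any attached value (e.g. --name=value).
        return option.group(1) + sanitize_text(option.group(2))
    return sanitize_text(word)


def sanitize_command_line(line: str) -> str:
    # Split on whitespace but keep the whitespace so the layout is preserved.
    pieces = re.split(r"(\s+)", line)
    return "".join(p if p.isspace() else sanitize_word(p) for p in pieces)
```

With this sketch, `git commit -m "fix bug"` keeps `git`, `commit`, `-m` and the quotes, and hashes `fix` and `bug` separately. Since quoting is not parsed, a word inside a quoted string that happens to start with `-` would also be treated as an option.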

Whitelisting

I created a whitelist containing various common strings.

  • directories in /
  • commands installed by default on Ubuntu, Debian or Fedora
  • bash and zsh keywords and builtins
  • file-extensions
  • git subcommands
  • some more stuff added by hand:
    • "com", "cz", ...
    • "vim", "emacs", ...
    • "Makefile", "Dockerfile", ...
    • ...

TL;DR

I pretty much hash everything except:

  • Commandline options
  • All non-alphanumeric chars
  • Single-letter or single-digit strings
  • Anything whitelisted (see above)


curusarn commented on July 18, 2024
  • TODO: add more file extensions to the whitelist

  • TODO: show this to people


curusarn commented on July 18, 2024

I have shown this to three of my colleagues. Everyone was okay with the result. One suggestion I got was that the data is sanitized too much.


curusarn commented on July 18, 2024

I have found a couple of file extension databases. However, they don't seem very fond of other people using their data. I have asked FileInfo.com for permission to use their data.

  • WAIT for fileinfo.com to message you back


curusarn commented on July 18, 2024

I have added a few common TLDs to the whitelist. Source: https://www.hayksaakian.com/most-popular-tlds/

