benandow / htmltoplaintext

This tool converts HTML representations of privacy policies to plaintext. Full details of the approach can be found in Appendix A of our PolicyLint paper in USENIX Security 2019.

License: GNU General Public License v3.0



HTML Privacy Policy to Plaintext Converter

This repository hosts the source code for converting HTML representations of privacy policies to plaintext. The preprocessing is intended to enable deeper NLP processing (e.g., POS tagging, dependency parsing, NER). To that end, it includes a heuristic that forms complete sentences from formatted lists; as a result, the output may contain duplicated text and may not preserve word frequencies. The current implementation uses the langdetect Python module to ignore non-English policies. Full details of the approach can be found in Appendix A of the PolicyLint paper listed below.
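To make the two preprocessing ideas above concrete, here is a minimal sketch. It is not the repository's implementation (Appendix A of the paper describes the actual heuristic). The function names expand_list and is_english are hypothetical, and the rule of prepending a list's lead-in clause to every item is an assumption chosen to show why the output can contain duplicated text. The only third-party call used is langdetect's detect function, which returns a language code such as "en".

```python
from typing import Callable, List, Optional


def expand_list(lead_in: str, items: List[str]) -> List[str]:
    """Join a list's lead-in clause with each item to form standalone sentences.

    Hypothetical helper: the lead-in is repeated once per item, which is why
    this kind of preprocessing duplicates text and skews word frequencies.
    """
    stem = lead_in.strip().rstrip(":").rstrip()
    sentences = []
    for item in items:
        body = item.strip().rstrip(";,.").strip()
        # Drop a trailing conjunction such as "...; and" left over from the list.
        for conj in (" and", " or"):
            if body.endswith(conj):
                body = body[: -len(conj)].rstrip(";,. ")
        if body:
            sentences.append(f"{stem} {body}.")
    return sentences


def is_english(text: str, detect: Optional[Callable[[str], str]] = None) -> bool:
    """Return True when the detector labels the text as English ("en")."""
    if detect is None:
        # Imported lazily so the sketch runs even where langdetect is not installed.
        from langdetect import detect
    try:
        return detect(text) == "en"
    except Exception:
        # Treat undetectable input (e.g. empty text) as non-English.
        return False


if __name__ == "__main__":
    for sentence in expand_list(
        "We may share your information with:",
        ["advertising partners;", "analytics providers; and", "law enforcement agencies."],
    ):
        print(sentence)
```

Passing the detector as a parameter is a design choice of this sketch only. It keeps the list expansion testable without the langdetect dependency.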

Instructions:

  • Place HTML privacy policies in ./ext/html_policies

  • Build the Docker image: $ ./build.sh

  • Run the Docker image: $ ./run.sh

  • The output will be in ./ext/plaintext_policies
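The steps above can also be scripted. The following is a rough sketch, assuming it is run against a local checkout of the repository. The helper names (stage_policies, build_and_run, list_outputs) are hypothetical, and filtering on the .html extension is an assumption, since the README does not say which file names the converter accepts. Only the directory names and the build.sh and run.sh scripts come from the instructions.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def stage_policies(src_dir: Path, repo_root: Path) -> List[Path]:
    """Copy *.html files from src_dir into <repo_root>/ext/html_policies."""
    dest = repo_root / "ext" / "html_policies"
    dest.mkdir(parents=True, exist_ok=True)
    staged = []
    for path in sorted(src_dir.iterdir()):
        # Assumption: inputs are identified by a .html extension.
        if path.is_file() and path.suffix.lower() == ".html":
            staged.append(Path(shutil.copy2(path, dest / path.name)))
    return staged


def build_and_run(repo_root: Path) -> None:
    """Run build.sh, then run.sh, stopping at the first failing script."""
    for script in ("./build.sh", "./run.sh"):
        subprocess.run([script], cwd=repo_root, check=True)


def list_outputs(repo_root: Path) -> List[Path]:
    """Return whatever files ended up in <repo_root>/ext/plaintext_policies."""
    out_dir = repo_root / "ext" / "plaintext_policies"
    if not out_dir.is_dir():
        return []
    return sorted(p for p in out_dir.iterdir() if p.is_file())
```

A typical call sequence would be stage_policies(Path("my_policies"), Path(".")), then build_and_run(Path(".")), then list_outputs(Path(".")), where my_policies is a placeholder for wherever the HTML files currently live.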

Publication

Full details of the approach can be found in Appendix A of the PolicyLint paper listed below:

Benjamin Andow, Samin Yaseer Mahmud, Wenyu Wang, Justin Whitaker, William Enck, Bradley Reaves, Kapil Singh, and Tao Xie. PolicyLint: Investigating Internal Privacy Policy Contradictions on Google Play, Proceedings of the USENIX Security Symposium (SECURITY), August 2019. Santa Clara, CA, USA.

License

The HTML Privacy Policy to Plaintext Converter is licensed under the GPL v3.0 License (see LICENSE.txt).
