
Generates confusables for Han script using ML techniques

License: Other



ML Confusables Generator

Confusables are pairs of characters that look so similar (for example ‘ν’ and ‘v’) that they can be used in spoofing attacks. The wide range of characters supported by Unicode makes such attacks possible. The security mechanisms listed in UTS #39 use confusable data (https://www.unicode.org/Public/security/latest/confusables.txt) to combat them. The purpose of this project is to identify novel pairs of confusables using representation learning and custom distance metrics.
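
As a small illustration of why such pairs are hard to spot, the sketch below shows that the two example characters are distinct code points with different Unicode names. It assumes the first character is Greek small letter nu (U+03BD) and uses only Python's standard unicodedata module. The describe helper is a name chosen for this illustration and is not part of the project.

```python
import unicodedata


def describe(ch):
    """Return a (code point, Unicode name) pair for a single character."""
    return f"U+{ord(ch):04X}", unicodedata.name(ch)


greek_nu = "\u03bd"  # Greek small letter nu (assumed to be the first example character)
latin_v = "\u0076"   # Latin small letter v

for ch in (greek_nu, latin_v):
    code_point, name = describe(ch)
    print(ch, code_point, name)

# The glyphs look alike, but the strings compare unequal.
print(greek_nu == latin_v)
```

Because the comparison is done on code points, not on rendered glyphs, ordinary string equality cannot detect this kind of substitution.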


Getting Started

Prerequisite

Installation

  1. Download and install Docker.
  2. Clone this repository with git clone and cd into it.
  3. Make sure all submodules are updated: git submodule update --init --recursive.

Launch Jupyter Notebook Container

  1. In project source folder, run ./scripts/start.sh.
  2. In any browser, go to localhost:8888.
  3. Copy the token from terminal to browser to access Jupyter Notebook.

Launch Command Line Environment Container

  1. In project source folder, run ./scripts/start_cli.sh.
  2. Execute setup script ./scripts/setup.sh.

Interactive Shell in Running Container

  1. Run docker ps to get container id/name.
  2. Run docker exec -it [CONTAINER_NAME/ID] /bin/bash.

Exit Docker Container

  • In the Jupyter Notebook terminal, press Ctrl+C.
  • In the command-line interface, run exit.

Usage

Han Script Confusable Generation

  1. From link, download the full_data.zip file (pre-generated images) and unzip it in the data/ directory.
  2. From link, download full_data_triplet1.0_meta.tsv and full_data_triplet1.0_vec.tsv (pre-generated embeddings and labels) into the embeddings/ directory.
  3. Create representation clustering object:
    from rep_cls import RepresentationClustering
    rc = RepresentationClustering(embedding_file='embeddings/full_data_triplet1.0_vec.tsv',
                                  label_file='embeddings/full_data_triplet1.0_meta.tsv',
                                  img_dir='data/full_data/')
  4. Generate confusables for a specific character:
    rc.get_confusables_for_char('褢')
    >>> ['裹', '裏', '裛', '裏']
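
Conceptually, a lookup like the one in the steps above can be pictured as a nearest-neighbour search over embedding vectors. The following is a minimal sketch under stated assumptions, not the implementation of RepresentationClustering. It assumes one tab-separated vector per line in the vector file and one label per line in the metadata file. It also uses plain Euclidean distance as a stand-in for the project's custom metrics. The function names load_embeddings and nearest_chars and the toy 2-D data are invented for this illustration.

```python
import io
import math


def load_embeddings(vec_file, meta_file):
    """Pair each label line with the vector on the same line number."""
    labels = [line.strip() for line in meta_file if line.strip()]
    vectors = [
        [float(x) for x in line.split("\t")]
        for line in vec_file
        if line.strip()
    ]
    if len(labels) != len(vectors):
        raise ValueError("label and vector counts differ")
    return dict(zip(labels, vectors))


def nearest_chars(embeddings, char, k=2):
    """Return the k labels closest to `char` by Euclidean distance."""
    target = embeddings[char]
    others = (
        (math.dist(target, vec), label)
        for label, vec in embeddings.items()
        if label != char
    )
    return [label for _, label in sorted(others)[:k]]


# Toy 2-D data invented for this sketch; real embeddings are much larger.
toy_vec = io.StringIO("0.0\t0.0\n0.1\t0.0\n0.0\t0.3\n5.0\t5.0\n")
toy_meta = io.StringIO("褢\n裹\n裏\n一\n")

toy_embeddings = load_embeddings(toy_vec, toy_meta)
print(nearest_chars(toy_embeddings, "褢", k=2))
```

With real files, the two io.StringIO objects would be replaced by files opened with encoding="utf-8", assuming the files follow the layout described above.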

Full Walk-through

Check main.ipynb.

Pre-trained CNN model

From link, download the TripletTransferTF folder (pre-trained model) into the ckpts/ directory.

Source file generation

  • To regenerate source files, in source/ directory, run python generate_source_file.py.
  • To check how the source file is selected, see source/Radical-stroke_Index_Analysis.ipynb.

Repo Contents

Main Components

  • main.ipynb: Notebook for setting up, building and deploying confusable detector. Also serves as tutorial script.
  • vis_gen.py: Contains VisualGenerator, class for generating visualization of characters.
  • rep_gen.py: Contains RepresentationGenerator, class for generating representations (embeddings) used for clustering.
  • rep_cls.py: Contains RepresentationClustering, class for clustering representations and finding confusables.
  • distance_metrics.py: Contains Distance, a factory class that defines distance metrics for different image formats. Also contains the enumeration class ImgFormat.

CNN Model Training Scripts

  • configs/sample_config.ini: Sample configuration for model training. To start your own training procedure, create new configuration file following the same format.
  • custom_train.py: Contains ModelTrainer, class that executes training procedure.
  • dataset_builder.py: Contains DatasetBuilder, class that invokes data pre-processing functions for TensorFlow dataset generation.
  • model_builder.py: Contains ModelBuilder, class that creates and initializes TensorFlow models.
  • data_preprocessing.py: Image pre-processing functions.
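
For readers unfamiliar with .ini files, the sketch below shows how such a configuration can be read with Python's standard configparser module. The section name and keys are hypothetical and were chosen only for illustration. The real names are defined in configs/sample_config.ini, and this is not how ModelTrainer necessarily loads its settings.

```python
import configparser

# Hypothetical contents: the section and key names here are invented for
# illustration and may not match configs/sample_config.ini.
EXAMPLE_INI = """
[TRAINING]
batch_size = 32
learning_rate = 0.001
"""


def load_training_options(text):
    """Parse INI text and return typed values from the hypothetical section."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser["TRAINING"]
    return {
        "batch_size": section.getint("batch_size"),
        "learning_rate": section.getfloat("learning_rate"),
    }


print(load_training_options(EXAMPLE_INI))
```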

Dataset Source File

  • source/Radical-stroke_Index_Analysis.ipynb: Jupyter Notebook for radical-stroke analysis and dataset selection.
  • source/generate_source_file.py: Contains functions that produce the same result as the Jupyter Notebook file.
  • source/charset_*k.txt: Selected Unicode code points.
  • source/randset_*k.txt: Randomly selected Unicode code points.
  • source/full_dataset.txt: Full dataset containing 21028 code points, used for clustering.

Shell Scripts

All scripts are expected to be run from the base directory. For example, run ./scripts/start.sh instead of ./start.sh.

  • scripts/start.sh: Launch a Docker container with Jupyter Notebook.
  • scripts/start_cli.sh: Launch a Docker container with bash.
  • scripts/setup.sh: Sets up the environment and installs all packages. Should be run inside the container.
  • scripts/install_fonts.sh: Installs the required fonts. Called from setup.sh.
  • scripts/download_*.sh: Scripts for downloading pre-established data, model or embeddings from Google Drive.

Unit Tests

  • *_test.py: Run python [MODULE]_test.py for all the unit tests for [MODULE].py.

Utility functions (in utils.py)

  • calculate_from_path: Calculate distance between the two images specified by file path.
  • train_test_split: Split dataset (already created) into training and testing datasets.
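
To make the idea of a train/test split concrete, here is a plain-Python sketch. It is a stand-in for illustration only: the project's train_test_split operates on an already-created dataset, and its signature is not shown here. The name split_items and its parameters are hypothetical.

```python
import random


def split_items(items, test_fraction=0.2, seed=0):
    """Shuffle a copy of `items` and split it into (train, test) lists."""
    if not 0.0 <= test_fraction <= 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(items)
    # A seeded generator makes the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


train_items, test_items = split_items(range(10), test_fraction=0.2, seed=42)
print(len(train_items), len(test_items))
```

Fixing the seed keeps the two subsets stable, which matters when results from different training runs are compared.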

Placeholder Folders

  • data/: Default visualization directory.
  • ckpts/: Default model directory.
  • embeddings/: Default embedding directory.

Testing

All tests are expected to be run inside the CLI container.

Run All Unit Tests

In root folder, run python -m unittest discover -s . -p '*_test.py'.

Run Individual Unit Test

In root folder, run python [MODULE]_test.py.
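
For orientation, a file matching the *_test.py pattern could look like the sketch below, which uses Python's standard unittest module. The helper normalize_label and the test class are hypothetical and do not come from the repository. They only show the shape that the discovery command picks up.

```python
import unittest


def normalize_label(label):
    """Hypothetical helper under test: strip surrounding whitespace."""
    return label.strip()


class NormalizeLabelTest(unittest.TestCase):
    def test_strips_whitespace(self):
        self.assertEqual(normalize_label(" 褢\n"), "褢")

    def test_keeps_clean_label(self):
        self.assertEqual(normalize_label("裹"), "裹")


def run_tests():
    """Load and run the test case above, returning the unittest result."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(NormalizeLabelTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)


# A standalone *_test.py file would usually end with a call to
# unittest.main() under an `if __name__ == "__main__":` guard instead.
result = run_tests()
print(result.testsRun, result.wasSuccessful())
```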

Copyright & Licenses

Copyright © 2020-2024 Unicode, Inc. Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the United States and other countries.

The project is released under LICENSE.

A CLA is required to contribute to this project - please refer to the CONTRIBUTING.md file (or start a Pull Request) for more information.
