Coder Social home page Coder Social logo

ca-address-parser's Introduction

ca-address-parser

This is a minimal implementation of a Canada address parser built using spaCy NLP library. This blog post covers the implementation and execution details at length. View on YouTube

screenshot

Prerequisites

Folders

  • config
  • corpus
    • dataset
    • rules
    • spacy-docbins
  • output
    • models
      • model-best
      • model-last

Training corpus

A sample corpus of Canadaian addresses to train/test the parser is present under corpus/dataset folder. JSON based rules required by Entity ruler are present under corpus/rules

Config

config contains files for initializing training parameters:

base_config.cfg: Initializes pipeline and training batch size. base_config_er.cfg: Similar as base_config but with additional entity ruler settings. config.cfg: Pre filled config file obtained after executing inti fill-config. config_er.cfg: Pre filled config file with additional entity ruler settings.

Output

output contains final trained models (with and without entity rules)

Training prerequisites

Before starting the training process, we need to:

i) Obtain a pre-filled training config which has the required training parameters.

ii) Build spacy-docbin (binary serialized representation) files for training and test dataset.

Pre-filled training config: Below command can be executed from command-line to get a pre filled config file. This would take as input the base_config.cfg file and churn out the pre-filled training config file: config.cfg.

python -m spacy init fill-config config/base_config.cfg config/config.cfg

Similarly, to get entity-ruler based config, pointing this command to the base_config_er.cfg would churn out the pre filled config : config_er.cfg

Prepare spacy-docbins: Finally, a spacy-docbin file can be obtained by executing ca_training_data_prep.py.

python ca_training_data_prep.py

This would take raw csv training/test datasets as inputs and churn out docbin files under corpus/spacy-docbins folder.

Training loop execution

To start the training process, below train command can be executed:

python -m spacy train config/config.cfg --paths.train corpus/spacy-docbins/train.spacy --paths.dev corpus/spacy-docbins/test.spacy --output output/models --training.eval_frequency 10 --training.max_steps 300

This saves the output NER models under output folder.

Predictions

Predictions for a few sample Canadian addresses can be checked by executing ca_predict.py

python ca_predict.py

Output

Address string -> 16, 15 SILVER SPRINGS WY NW, AIRDRIE
Parsed address -> [('16', 'ADDR_UNIT'), ('15', 'ADDR_STREET_NUMBER'), ('SILVER SPRINGS', 'ADDR_STREET_NAME'), ('WY', 'ADDR_STREET_TYPE'), ('NW', 'ADDR_STREET_DIRECTION'), ('AIRDRIE', 'ADDR_CITY')]
******
Address string -> 46, 1008 WOODSIDE WY NW, AIRDRIE
Parsed address -> [('46', 'ADDR_UNIT'), ('1008', 'ADDR_STREET_NUMBER'), ('WOODSIDE', 'ADDR_STREET_NAME'), ('WY', 'ADDR_STREET_TYPE'), ('NW', 'ADDR_STREET_DIRECTION'), ('AIRDRIE', 'ADDR_CITY')]
******
Address string -> 9, 209 WOODSIDE DR NW, AIRDRIE
Parsed address -> [('9', 'ADDR_UNIT'), ('209', 'ADDR_STREET_NUMBER'), ('WOODSIDE', 'ADDR_STREET_NAME'), ('DR', 'ADDR_STREET_TYPE'), ('NW', 'ADDR_STREET_DIRECTION'), ('AIRDRIE', 'ADDR_CITY')]
******
Address string -> 42, 4 STONEGATE DR NW, AIRDRIE
Parsed address -> [('42', 'ADDR_UNIT'), ('4', 'ADDR_STREET_NUMBER'), ('STONEGATE', 'ADDR_STREET_NAME'), ('DR', 'ADDR_STREET_TYPE'), ('NW', 'ADDR_STREET_DIRECTION'), ('AIRDRIE', 'ADDR_CITY')]
******
Address string -> 7, 209 WOODSIDE DR NW, AIRDRIE
Parsed address -> [('7', 'ADDR_UNIT'), ('209', 'ADDR_STREET_NUMBER'), ('WOODSIDE', 'ADDR_STREET_NAME'), ('DR', 'ADDR_STREET_TYPE'), ('NW', 'ADDR_STREET_DIRECTION'), ('AIRDRIE', 'ADDR_CITY')]
******
Address string -> 306, 190 KANANASKIS WAY, CANMORE
Parsed address -> [('306', 'ADDR_UNIT'), ('190', 'ADDR_STREET_NUMBER'), ('KANANASKIS', 'ADDR_STREET_NAME'), ('WAY', 'ADDR_UNIT'), ('CANMORE', 'ADDR_STREET_NUMBER')]

Download more data

https://www.statcan.gc.ca/en/lode/databases/oda

ca-address-parser's People

Watchers

 avatar  avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.