This package contains a library for extracting contact information from HTML pages.
- American contact information extraction
- Netherlands open hours extraction
- Ireland open hours extraction
- Ireland contact address extraction
Clone the repository, then install package requirements (package requires lxml, scikit-learn and python-wapiti):
$ pip install -r requirements.txt
then install package itself:
$ python setup.py install
>>> import wapiti >>> from webstruct.wapiti import WapitiChunker >>> from sklearn.externals import joblib
Load trained model ('wfe.joblib' and 'model.wapiti' files must exists):
>>> feature_encoder = joblib.load('wfe.joblib') >>> wapiti_model = wapiti.Model(model='model.wapiti') >>> ner = WapitiChunker(wapiti_model, feature_encoder)
Get a HTML page somewhere:
>>> import requests >>> page = requests.get(some_url)
and extract information:
>>> for text, label in ner.transform(page.text, page.encoding): ... if label != 'O': ... print("%6s %s" % (label, text)) TEL 800-4-Altman ( 425-8626 ) EMAIL [email protected] ORG Altman Lighting Co. Inc. STREET 57 Alexander Street CITY Yonkers STATE NY ZIPCODE 10701 TEL 1-800-4-ALTMAN ( 425-8626 )
Model should be trained before usage. See '../notebooks/train-token-model.ipynb' IPython notebook for an example.
Make sure nose is installed, then run runtests.sh
script.