fastdatascience / faststylometry

Stylometry library for Burrows' Delta method

Home Page: https://fastdatascience.com/fast-stylometry-python-library/

License: MIT License

Jupyter Notebook 75.95% Python 24.05%
natural-language-processing nlp stylometry

faststylometry's Introduction

Fast Stylometry Python library: Natural Language Processing tool

You can run the walkthrough notebook in Google Colab with a single click: Open In Colab

Fast Stylometry - Burrows Delta NLP technique

Developed by Fast Data Science. Fast Data Science develops products, offers consulting services, and training courses in natural language processing (NLP). Subscribe to our blog for regular news from the NLP universe.

Source code at https://github.com/fastdatascience/faststylometry

Tutorial at https://fastdatascience.com/fast-stylometry-python-library/

Fast Stylometry is a Python library for calculating Burrows' Delta, an algorithm that measures how similar the writing styles of documents are. This kind of analysis is known as forensic stylometry.
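To make the idea concrete, here is a minimal pure-Python sketch of the classic Burrows' Delta calculation: the mean absolute difference between the z-scores of word frequencies in two texts. It is an illustration only, not the library's implementation. The function names, the toy token lists, and the use of the population standard deviation across authors are all assumptions made for this example.

```python
from collections import Counter
from statistics import mean, pstdev


def relative_frequencies(tokens, vocabulary):
    """Return the share of `tokens` taken up by each word in `vocabulary`."""
    counts = Counter(tokens)
    return [counts[word] / len(tokens) for word in vocabulary]


def burrows_delta(author_tokens, test_tokens, vocabulary):
    """Return {author: delta} comparing each known author with the test text."""
    author_freqs = {
        author: relative_frequencies(tokens, vocabulary)
        for author, tokens in author_tokens.items()
    }
    test_freqs = relative_frequencies(test_tokens, vocabulary)
    n_words = len(vocabulary)
    # Spread of each word's frequency across the known authors (assumed: population std).
    spreads = [pstdev([freqs[i] for freqs in author_freqs.values()]) for i in range(n_words)]
    deltas = {}
    for author, freqs in author_freqs.items():
        # |z_author - z_test| simplifies to |f_author - f_test| / spread: the mean cancels.
        # Words with zero spread carry no information and are skipped.
        diffs = [abs(freqs[i] - test_freqs[i]) / spreads[i] for i in range(n_words) if spreads[i] > 0]
        deltas[author] = mean(diffs)
    return deltas


# Toy token lists, invented purely for illustration.
known = {
    "author_a": "the the the of and and".split(),
    "author_b": "the of of of and of".split(),
}
unknown = "the and the of the and".split()
print(burrows_delta(known, unknown, ["the", "of", "and"]))
```

In this toy example the unknown text has the same word proportions as author_a, so author_a receives the smaller delta. A smaller delta means a closer stylistic match.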

Installing the Fast Stylometry Python package

You can install from PyPI.

pip install faststylometry

Using the Fast Stylometry NLP library for the first time

We recommend that you follow the walkthrough notebook titled Burrows Delta Walkthrough.ipynb in order to understand how the library works. If you don't have the correct environment set up on your machine, you can run the walkthrough notebook in Google Colab instead.

Usage examples

Demonstration of Burrows' Delta on a small corpus downloaded from Project Gutenberg.

We will test the Burrows' Delta code on two "unknown" texts: Sense and Sensibility by Jane Austen, and Villette by Charlotte Bronte. Both authors are in our training corpus.

You can get the training corpus by cloning https://github.com/fastdatascience/faststylometry; the data is in the data folder. Alternatively, you can call download_examples() from Python after importing Fast Stylometry:

from faststylometry import download_examples
download_examples()

Create a corpus

The Burrows Delta Walkthrough.ipynb Jupyter notebook is the best place to start, but here are the basic commands to use the library:

To create a corpus and add books, the pattern is as follows:

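from faststylometry import Corpus

# Placeholder: in practice this is the whole text of the book as one string.
book_text = "It is a truth universally acknowledged..."

corpus = Corpus()
corpus.add_book("Jane Austen", "Pride and Prejudice", book_text)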

Here is the pattern for creating a corpus and adding books from a directory on your system. You can also use the method util.load_corpus_from_folder(folder, pattern).

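import os
import re

from faststylometry.corpus import Corpus

folder = "data/train"  # directory containing files named Author_-_Title.txt

corpus = Corpus()
for root, _, files in os.walk(folder):
    for filename in files:
        # Only load text files that follow the Author_-_Title.txt naming pattern.
        if filename.endswith(".txt") and "_-_" in filename:
            with open(os.path.join(root, filename), "r", encoding="utf-8") as f:
                text = f.read()
            # Strip the extension, then split once into author and book title.
            author, book = re.sub(r'\.txt$', '', filename).split("_-_", 1)

            corpus.add_book(author, book, text)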
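The loop above depends on files being named in the form Author_-_Title.txt. The sketch below isolates that parsing step so it can be checked on its own. The helper name parse_author_and_title is hypothetical and is not part of the library.

```python
import os


def parse_author_and_title(filename, separator="_-_"):
    """Split 'Author_-_Title.txt' into (author, title), or return None if it does not match."""
    stem, extension = os.path.splitext(os.path.basename(filename))
    if extension.lower() != ".txt" or separator not in stem:
        return None
    # Split only once, so a title that itself contains the separator stays intact.
    author, title = stem.split(separator, 1)
    return author, title


print(parse_author_and_title("austen_-_pride_and_prejudice.txt"))
print(parse_author_and_title("notes.txt"))
```

Returning None for files that do not match makes it easy to log and skip stray files instead of failing while unpacking.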

Example 1

Download some example data (Project Gutenberg texts) from the Fast Stylometry repository:

from faststylometry import download_examples
download_examples()

Load a corpus and calculate Burrows' Delta

from faststylometry.util import load_corpus_from_folder
from faststylometry.en import tokenise_remove_pronouns_en
from faststylometry.burrows_delta import calculate_burrows_delta

train_corpus = load_corpus_from_folder("data/train")

train_corpus.tokenise(tokenise_remove_pronouns_en)

test_corpus_sense_and_sensibility = load_corpus_from_folder("data/test", pattern="sense")

test_corpus_sense_and_sensibility.tokenise(tokenise_remove_pronouns_en)

calculate_burrows_delta(train_corpus, test_corpus_sense_and_sensibility)

This returns a pandas DataFrame of Burrows' Delta scores. Lower scores indicate a closer stylistic match.

Example 2

Using the probability calibration functionality, you can calculate the probability of two books being by the same author.

from faststylometry.probability import predict_proba, calibrate
calibrate(train_corpus)
predict_proba(train_corpus, test_corpus_sense_and_sensibility)

This outputs a pandas DataFrame of probabilities.
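The tracebacks in the issues further down this page suggest that calibrate() fits a scikit-learn LogisticRegression model on delta values. The sketch below illustrates that general idea in pure Python: a one-feature logistic model that maps a delta value to a same-author probability. It is not the library's code. The function names, the gradient-descent fitting, and the delta values are all invented for illustration.

```python
import math


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def fit_logistic(deltas, same_author, learning_rate=0.5, epochs=5000):
    """Fit p(same author | delta) = sigmoid(w * delta + b) by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(deltas)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(deltas, same_author):
            error = sigmoid(w * x + b) - y  # gradient of the log loss w.r.t. the logit
            grad_w += error * x
            grad_b += error
        w -= learning_rate * grad_w / n
        b -= learning_rate * grad_b / n
    return w, b


def predict_same_author(delta, w, b):
    return sigmoid(w * delta + b)


# Made-up calibration pairs: small deltas for same-author pairs, large ones otherwise.
deltas = [0.4, 0.6, 0.7, 1.4, 1.6, 1.9]
labels = [1, 1, 1, 0, 0, 0]
w, b = fit_logistic(deltas, labels)
print(predict_same_author(0.5, w, b), predict_same_author(1.8, w, b))
```

The fitted weight is negative, so the predicted probability falls as the delta grows. This is the expected behaviour for a distance-like score.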

Who to contact

Thomas Wood at Fast Data Science

Contributing to the project

If you'd like to contribute to this project, you can contact us at https://fastdatascience.com/ or make a pull request on our GitHub repository. You can also raise an issue.

Developing the library

Automated tests

Test code is in the tests/ folder and uses unittest.

The testing tool tox is used in the automation with GitHub Actions CI/CD.

Use tox locally

Install tox and run it:

pip install tox
tox

In our configuration, tox checks the source distribution using check-manifest, runs setuptools's check, and runs the unit tests using pytest. check-manifest requires your repository to be git-initialized (git init) and its files to be at least added (git add .). You don't need to install check-manifest or pytest yourself; tox installs them in a separate environment.

The automated tests are run against several Python versions, but on your machine you might have only one. If that version is Python 3.9, run:

tox -e py39

Thanks to the GitHub Actions automation, you don't need to generate distribution files locally. If you do want to build them yourself, see the manual re-release steps below.

Continuous integration/deployment to PyPI

This package is based on the template https://pypi.org/project/example-pypi-package/

This package

  • uses GitHub Actions for both testing and publishing
  • is tested on every push to the master or main branch, and is published when a release is created
  • includes test files in the source distribution
  • uses setup.cfg for version single-sourcing (setuptools 46.4.0+)

Re-releasing the package manually

The code to re-release Fast Stylometry on PyPI is as follows:

source activate py311
pip install twine
rm -rf dist
python setup.py sdist
twine upload dist/*

Who worked on the Fast Stylometry NLP library?

The tool was developed by:

  • Thomas Wood, Natural Language Processing consultant and data scientist at Fast Data Science.

License of Fast Stylometry library

MIT License. Copyright (c) 2023 Fast Data Science

Citing the Fast Stylometry library

If you are undertaking research in AI, NLP, or other areas, and are publishing your findings, I would be grateful if you could please cite the project.

Wood, T.A., Fast Stylometry [Computer software] (1.0.4). Data Science Ltd. DOI: 10.5281/zenodo.11096941, accessed at https://fastdatascience.com/fast-stylometry-python-library, Fast Data Science (2024)

A BibTeX entry for LaTeX users is:

@software{faststylometry,
    author = {Wood, T.A.},
    title  = {Fast Stylometry (Computer software), Version 1.0.4},
    year   = {2024},
    url = {https://fastdatascience.com/fast-stylometry-python-library/},
    doi = {10.5281/zenodo.11096941},
}

faststylometry's Issues

Assertion error

Hi, I have a strange Assertion Error:
packages version:
numpy==1.24.3
pandas==2.0.0
scikit-learn==1.3.0
nltk==3.7

CODE:

from faststylometry import Corpus
from faststylometry import load_corpus_from_folder
from faststylometry import tokenise_remove_pronouns_en
from faststylometry import calculate_burrows_delta
from faststylometry import predict_proba, calibrate

import nltk
nltk.download("punkt")

train_corpus = load_corpus_from_folder("faststylometry/data/train")
train_corpus.tokenise(tokenise_remove_pronouns_en)

test_corpus = Corpus()
test_corpus.add_book("NONE", "MOCx", 'this is an example.')
test_corpus.add_book("NONE", "MOCxx", 'this is another example')
test_corpus.tokenise(tokenise_remove_pronouns_en)

calculate_burrows_delta(train_corpus, test_corpus, vocab_size = 50000)

ERROR:

AssertionError Traceback (most recent call last)
Cell In[4], line 1
----> 1 calculate_burrows_delta(train_corpus, test_corpus, vocab_size = 50000)

File /anaconda/envs/azureml_py38/lib/python3.8/site-packages/faststylometry/burrows_delta.py:205, in calculate_burrows_delta(train_corpus, test_corpus, vocab_size, words_to_exclude, tok_match_pattern)
196 def calculate_burrows_delta(train_corpus: Corpus, test_corpus: Corpus, vocab_size: int = 50, words_to_exclude: set = {},
197 tok_match_pattern: str = r'^[a-z][a-z]+$') -> pd.DataFrame:
198 """
199 Calculate the Burrows' Delta statistic for the test corpus vs every author's subcorpus in the training corpus.
200 :param train_corpus: A corpus of known authors, which we will use as a benchmark to compare to the test corpus by an unknown author.
(...)
203 :return: A DataFrame of Burrows' Delta values for each author in the training corpus.
204 """
--> 205 get_top_tokens(train_corpus, vocab_size, words_to_exclude, tok_match_pattern)
206 get_token_counts(train_corpus)
207 get_token_counts_by_author(train_corpus)

File /anaconda/envs/azureml_py38/lib/python3.8/site-packages/faststylometry/burrows_delta.py:48, in get_top_tokens(corpus, vocab_size, words_to_exclude, tok_match_pattern)
39 def get_top_tokens(corpus: Corpus, vocab_size: int, words_to_exclude: set, tok_match_pattern: str) -> list:
40 """
41 Identify the n highest ranking tokens in the corpus.
42
(...)
45 :return: A list of the n most common tokens in the corpus.
46 """
---> 48 assert (len(corpus.tokens) > 0) # you should have tokenised the corpus.
50 token_freqs = Counter()
51 for token_seq in corpus.tokens:

AssertionError:

Error while running

Hi. I want to try your library, but I always get the same error:
Traceback (most recent call last):
File "/stylometry/main.py", line 28, in <module>
calculate_burrows_delta(train_corpus, test_corpus, vocab_size = 50)
File "/stylometry/venv/lib/python3.7/site-packages/faststylometry/burrows_delta.py", line 148, in calculate_burrows_delta
get_token_proportions(test_corpus)
File "/stylometry/venv/lib/python3.7/site-packages/faststylometry/burrows_delta.py", line 99, in get_token_proportions
token_proportions = corpus.df_token_counts_by_author.to_numpy() / corpus.df_total_token_counts_by_author.to_numpy()
ValueError: operands could not be broadcast together with shapes (0,51) (0,2)

This error occurs while running your example.
Could you check this please?
Thanks in advance!

Examples don't work

If I run the examples:

from faststylometry.util import load_corpus_from_folder
from faststylometry.en import tokenise_remove_pronouns_en
from faststylometry.burrows_delta import calculate_burrows_delta

train_corpus = load_corpus_from_folder("faststylometry/data/train")

train_corpus.tokenise(tokenise_remove_pronouns_en)

test_corpus_sense_and_sensibility = load_corpus_from_folder("faststylometry/data/test", pattern="sense")

test_corpus_sense_and_sensibility.tokenise(tokenise_remove_pronouns_en)

calculate_burrows_delta(train_corpus, test_corpus_sense_and_sensibility)

AssertionError Traceback (most recent call last)
in
11 test_corpus_sense_and_sensibility.tokenise(tokenise_remove_pronouns_en)
12
---> 13 calculate_burrows_delta(train_corpus, test_corpus_sense_and_sensibility)

1 frames
/usr/local/lib/python3.8/dist-packages/faststylometry/burrows_delta.py in get_top_tokens(corpus, vocab_size)
17 """
18
---> 19 assert (len(corpus.tokens) > 0) # you should have tokenised the corpus.
20
21 token_freqs = Counter()

AssertionError:

The train_corpus.tokenise(tokenise_remove_pronouns_en) doesn't seem to do anything.
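One possible explanation, offered as an assumption rather than a confirmed diagnosis, is that the folder path does not resolve from the current working directory. The README's loading loop relies on os.walk, which yields nothing for a missing folder, so the corpus would be empty and tokenising it would produce no tokens. A quick stdlib check along these lines can rule that out before loading. The helper name count_corpus_files is hypothetical.

```python
import os


def count_corpus_files(folder, separator="_-_"):
    """Count .txt files under `folder` whose names contain the author/title separator."""
    if not os.path.isdir(folder):
        return 0
    matches = 0
    for _, _, files in os.walk(folder):
        for filename in files:
            if filename.endswith(".txt") and separator in filename:
                matches += 1
    return matches


folder = "faststylometry/data/train"
if count_corpus_files(folder) == 0:
    print(f"No corpus files found under {os.path.abspath(folder)}")
```

If the count is zero, fix the path (or run download_examples() first) before calling load_corpus_from_folder.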

calibrate() doesn't work if the corpus is just in 2 files

If train_corpus is built from just 2 files (author1_-_title.txt, author2_-_title.txt), then calibrate(train_corpus) raises an error:

calibrate(train_corpus)

lib/python3.10/dist-packages/numpy/core/_methods.py:265: RuntimeWarning: Degrees of freedom <= 0 for slice
  ret = _var(a, axis=axis, dtype=dtype, out=out, ddof=ddof,
/usr/local/lib/python3.10/dist-packages/numpy/core/_methods.py:257: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
/usr/local/lib/python3.10/dist-packages/numpy/core/_methods.py:265: RuntimeWarning: Degrees of freedom <= 0 for slice
  ret = _var(a, axis=axis, dtype=dtype, out=out, ddof=ddof,
/usr/local/lib/python3.10/dist-packages/numpy/core/_methods.py:257: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-26-cc70d17a9b30> in <cell line: 1>()
----> 1 calibrate(train_corpus)

5 frames
/usr/local/lib/python3.10/dist-packages/faststylometry/probability.py in calibrate(corpus, model)
     77     ground_truths, delta_values = get_calibration_curve(corpus)
     78 
---> 79     model.fit(np.reshape(delta_values, (-1, 1)), ground_truths)
     80 
     81     corpus.probability_model = model

/usr/local/lib/python3.10/dist-packages/sklearn/linear_model/_logistic.py in fit(self, X, y, sample_weight)
   1194             _dtype = [np.float64, np.float32]
   1195 
-> 1196         X, y = self._validate_data(
   1197             X,
   1198             y,

/usr/local/lib/python3.10/dist-packages/sklearn/base.py in _validate_data(self, X, y, reset, validate_separately, **check_params)
    582                 y = check_array(y, input_name="y", **check_y_params)
    583             else:
--> 584                 X, y = check_X_y(X, y, **check_params)
    585             out = X, y
    586 

/usr/local/lib/python3.10/dist-packages/sklearn/utils/validation.py in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, estimator)
   1104         )
   1105 
-> 1106     X = check_array(
   1107         X,
   1108         accept_sparse=accept_sparse,

/usr/local/lib/python3.10/dist-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator, input_name)
    919 
    920         if force_all_finite:
--> 921             _assert_all_finite(
    922                 array,
    923                 input_name=input_name,

/usr/local/lib/python3.10/dist-packages/sklearn/utils/validation.py in _assert_all_finite(X, allow_nan, msg_dtype, estimator_name, input_name)
    159                 "#estimators-that-handle-nan-values"
    160             )
--> 161         raise ValueError(msg_err)
    162 
    163 

ValueError: Input X contains NaN.
LogisticRegression does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values
