imperialite / biberpy

This project forked from ssharoff/biberpy

Python version for Doug Biber's Multidimensional Analysis (MDA)

License: GNU General Public License v3.0

Python 90.72% R 8.53% Shell 0.75%

biberpy's Introduction

Multilingual Python version for Biber's Multi-Dimensional Analysis (MDA)

This is a Python version of Biber's tagger for English, French, Russian and Spanish as well as R scripts for Multi-Dimensional Analysis.

In 1988 Doug Biber developed an approach to analysing the composition of a corpus using a set of extractable features and factor analysis. See his first book on the topic: [https://books.google.co.uk/books?id=CVTPaSSYEroC]

An article-length description is available from [http://www.aclweb.org/anthology/J93-2001]

While this approach is more than 30 years old by now, Biber's attention to designing extractable features is still important for understanding how linguistic features vary in texts of different kinds. The following article shows a way of linking the features to the predictions of a neural classifier:

@Article{sharoff21rs,
  author = {Sharoff, Serge},
  title =  {Genre Annotation for the Web: text-external and text-internal perspectives},
  journal = {Register studies},
  year =   2021,
  volume = {(in press)}}

[http://corpus.leeds.ac.uk/serge/publications/2021-register.pdf]

The features proposed by Biber and implemented in my tagger include:

  • Lexical features, such as:
    • publicVerbs = acknowledge, admit, agree, assert, claim, complain, declare, deny...
    • timeAdverbials = afterwards, again, earlier, early, eventually, formerly, immediately,...
    • amplifiers = absolutely, altogether, completely, enormously, entirely,...
  • Part-of-speech features, such as:
    • Nominalisations (nouns ending in -tion, -ness, -ment)
    • Prepositions
    • Past tense verbs
  • Syntactic features, such as:
    • be as the main verb
    • that deletions
    • pied piping (as in Which house did she buy ...? where house moved from its expected position after buy)
  • Text-level features, such as:
    • Average word length
    • Average sentence length
    • Type/token ratio
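As a rough illustration of the text-level features listed above, here is a minimal sketch. It is not the code used in biber-dim.py: the function name `text_level_features`, the regex-based sentence splitting and the tokenisation are all assumptions made for the example, and the real script may tokenise and normalise differently.

```python
import re

def text_level_features(text):
    """Compute three text-level features for one document (naive sketch)."""
    # Assumption: a sentence ends with ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Assumption: a token is a run of letters/digits, optionally with an apostrophe part.
    tokens = re.findall(r"[^\W_]+(?:'[^\W_]+)?", text.lower())
    if not tokens:
        return {"avg_word_length": 0.0,
                "avg_sentence_length": 0.0,
                "type_token_ratio": 0.0}
    return {
        # Mean number of characters per token.
        "avg_word_length": sum(len(t) for t in tokens) / len(tokens),
        # Mean number of tokens per sentence.
        "avg_sentence_length": len(tokens) / len(sentences),
        # Distinct word forms divided by all tokens (computed over the whole text).
        "type_token_ratio": len(set(tokens)) / len(tokens),
    }

if __name__ == "__main__":
    print(text_level_features("The cat sat. The cat ran away."))
```

The type/token ratio here is computed over the whole text, so it is only comparable between documents of similar length.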

In our Intellitext project (2011-2012) we implemented these features in Perl as part of a Web interface to corpora; see:

@inproceedings{wilson10paclic,
  title={Advanced corpus solutions for humanities researchers},
  author={Wilson, James and Hartley, Anthony and Sharoff, Serge and Stephenson, Paul},
  booktitle={Proc PACLIC 24},
  pages={36--43},
  month={November},
  address={Tohoku University},
  year={2010}
}

[http://www.aclweb.org/anthology/Y10-1089]

Since then, I have ported the feature extractor from Perl to Python to help with my research on text classification. I have also ported the available word lists to French and Russian, and Gonzalo Cruz has done this for Spanish, so that the Biber dimensions of variation can be extracted across languages.

The arguments for the script are self-explanatory (run python3 biber-dim.py -h). A typical invocation would be:

python3 biber-dim.py -l en <brown.ol >brown.dat

The default format for the corpus file is one line per document. Another possibility is to use a JSON file, which can be produced from a CoNLL file with existing tagging data:

udpipe --tokenize --tag english-ewt.udpipe <brown.ol | ./restoredocids.py | conll2json.py >brown.json
python3 biber-dim.py -f json -l en <brown.json >brown-json.dat

The script assumes that the current folder contains a file with language-specific properties with the name LANGUAGE.properties and a frequency list with the name LANGUAGE.tag.num. The format of the lists of properties is as follows:

privateVerbs = anticipate,assume,believe,conclude,decide,demonstrate

The property ids are fixed (the label privateVerbs is used for all languages), while the word lists are language-specific.
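The following sketch shows one way such a properties file could be read and used for counting a lexical feature. It assumes one `label = word,word,...` entry per line and skips lines without `=`. The helper name `parse_properties` is hypothetical and is not taken from biber-dim.py.

```python
def parse_properties(lines):
    """Parse 'label = word1,word2,...' lines into a dict mapping labels to word sets."""
    properties = {}
    for line in lines:
        line = line.strip()
        if not line or "=" not in line:
            continue  # assumption: blank lines and lines without '=' carry no data
        label, _, words = line.partition("=")
        properties[label.strip()] = {w.strip() for w in words.split(",") if w.strip()}
    return properties

if __name__ == "__main__":
    sample = ["privateVerbs = anticipate,assume,believe,conclude,decide,demonstrate"]
    props = parse_properties(sample)
    # Count how many lemmas of a (toy) document belong to the privateVerbs list.
    lemmas = ["i", "believe", "they", "decide", "late"]
    hits = sum(1 for lemma in lemmas if lemma in props["privateVerbs"])
    print(hits)
```

Because the labels are shared across languages, the same counting code can be reused with an `en`, `fr`, `ru` or `es` word list.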

If the source file format is not JSON, the POS tags and lemmas come from a frequency list:

num word lemma pos UD morph
1625260 years year NOUN Number=Plur
399401 went go VERB Tense=Past
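To make the role of the frequency list concrete, here is a hedged sketch of loading it into a lookup table from word form to lemma, POS tag and morphology. It assumes whitespace-separated columns in the order shown above and keeps the most frequent analysis for each word form. How biber-dim.py actually resolves ambiguous forms is not stated here, so treat that choice, and the name `load_frequency_list`, as assumptions.

```python
def load_frequency_list(lines):
    """Map each word form to (count, lemma, pos, morph), keeping the most frequent entry."""
    lexicon = {}
    for line in lines:
        fields = line.split()
        # Skip a header row or malformed lines: the first column must be a count.
        if len(fields) < 5 or not fields[0].isdigit():
            continue
        count = int(fields[0])
        word, lemma, pos, morph = fields[1:5]
        if word not in lexicon or count > lexicon[word][0]:
            lexicon[word] = (count, lemma, pos, morph)
    return lexicon

if __name__ == "__main__":
    sample = [
        "num word lemma pos UD morph",
        "1625260 years year NOUN Number=Plur",
        "399401 went go VERB Tense=Past",
    ]
    lexicon = load_frequency_list(sample)
    print(lexicon["went"])
```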

Such a frequency list can be obtained, for example, from an available CoNLL-U file with annotations in the Universal Dependencies format by running:

cut -f 2-4,6 -s CONLLU.file | sort | uniq -c | sort -nsr >CONLLU.num
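For readers without the Unix tools at hand, the pipeline above can be approximated in Python. This is a sketch under the assumption that the input follows the standard tab-separated CoNLL-U column order, so that fields 2-4 and 6 are FORM, LEMMA, UPOS and FEATS. The function name `conllu_to_frequency_rows` is hypothetical, and the order of rows with equal counts may differ from the shell output.

```python
from collections import Counter

def conllu_to_frequency_rows(lines):
    """Count (form, lemma, upos, feats) tuples, most frequent first."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if "\t" not in line:
            continue  # like `cut -s`: drop comment and blank lines without tabs
        cols = line.split("\t")
        if len(cols) < 6:
            continue  # not a full token line
        # Columns 2-4 and 6 (1-based), i.e. indices 1, 2, 3 and 5.
        counts[(cols[1], cols[2], cols[3], cols[5])] += 1
    return counts.most_common()

if __name__ == "__main__":
    sample = [
        "# text = He went",
        "1\tHe\the\tPRON\tPRP\tCase=Nom\t2\tnsubj\t_\t_",
        "2\twent\tgo\tVERB\tVBD\tTense=Past\t0\troot\t_\t_",
        "",
        "1\tShe\tshe\tPRON\tPRP\tCase=Nom\t2\tnsubj\t_\t_",
        "2\twent\tgo\tVERB\tVBD\tTense=Past\t0\troot\t_\t_",
    ]
    for (form, lemma, upos, feats), n in conllu_to_frequency_rows(sample):
        print(n, form, lemma, upos, feats)
```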

My biber-dim.py script produces a tab-separated table with values for each dimension. This can be taken to R for factor analysis and plot making:

Rscript biber-mda.R brown-biber.dat brown-annot.dat

The annot.dat file is optional. It assigns each text in the original .ol file to a genre category, so that the texts can be displayed on a plot with meaningful annotations. Here is the plot for the top-level genre categories of the Brown Corpus:

[Figure: MDA plot for the Brown Corpus]
