
pyfoma's People

Contributors

dhdaines, mhulden, michaelpginn, mpsilfve

pyfoma's Issues

Escaping the wildcard character '.'

I have been trying to escape '.' so that it matches a literal period instead of acting as the wildcard character, but none of the escaping methods I tried work. I've tried all of the following:
x = FST.re("A'.'")
x = FST.re("A\.")
x = FST.re("A\\.")
x = FST.re("A\\\.")
x = FST.re(r"A\.")
x = FST.re(r"A.")
None of them does what I want. Interestingly, single quotes do escape two periods, just not one, so this code matches an 'A' followed by two periods:
x = FST.re("A'..'")
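As a side note on the attempts above, several of them hand the regex compiler exactly the same string, because Python processes backslashes before pyfoma ever sees them. The sketch below only inspects Python string literals and does not call pyfoma. The strings are built with chr(92) where writing the literal directly would trigger Python's invalid-escape warning.

```python
BACKSLASH = chr(92)  # a single backslash character

# Each entry is (how the literal was written in the issue, the string Python produces).
attempts = [
    ("double-quoted, one backslash", "A" + BACKSLASH + "."),
    ("double-quoted, two backslashes", "A\\."),
    ("double-quoted, three backslashes", "A\\" + BACKSLASH + "."),
    ("raw string, one backslash", r"A\."),
    ("raw string, no backslash", r"A."),
]

for label, value in attempts:
    print(f"{label:34} -> {value!r} (length {len(value)})")

# Number of genuinely different strings that reach the regex compiler.
distinct = {value for _, value in attempts}
print(len(distinct), "distinct strings")
```

So the one-backslash, two-backslash, and raw one-backslash attempts are the same test, which narrows the question to how the regex compiler itself treats a backslash or quoted period.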

pyfoma's AT&T format is not understood by foma

pyfoma takes some shortcuts when outputting AT&T format that are not really part of the format. They might not be accepted by OpenFST, and definitely aren't accepted by foma:

  • epsilon labels are just the empty string
  • transitions without transduction only output one label (even if the machine is otherwise a transducer ... thus fstcompile --acceptor won't work)

There isn't any universal standard for what to call epsilon: OpenFST just uses index 0 in the symbol table, whatever it happens to correspond to, while foma seems to use @0@. So there should probably be an actual method for outputting AT&T format along with symbol tables.
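Until such a method exists, a post-processing step along these lines might work as a stopgap. This is only a sketch: it assumes the output is tab-separated and unweighted, so that a 3-field line is an arc with a single label and a 4-field line is an arc with input and output labels. The function name normalize_att is made up for illustration and is not part of pyfoma.

```python
EPSILON = "@0@"  # the spelling foma seems to use; other tools may need another


def normalize_att(text, epsilon=EPSILON):
    """Expand shorthand AT&T lines into explicit 4-column transducer arcs.

    Assumes tab-separated, unweighted input (an assumption, not a guarantee).
    """
    out = []
    for line in text.split("\n"):
        if line == "":
            continue
        fields = line.split("\t")
        if len(fields) == 1:
            # A lone state number marks a final state; keep it unchanged.
            out.append(line)
        elif len(fields) == 3:
            # Single-label arc: repeat the label on both tapes.
            src, dst, label = fields
            label = label or epsilon
            out.append("\t".join([src, dst, label, label]))
        elif len(fields) == 4:
            # Two-label arc: only replace empty labels with the epsilon name.
            src, dst, ilabel, olabel = fields
            out.append("\t".join([src, dst, ilabel or epsilon, olabel or epsilon]))
        else:
            raise ValueError(f"unexpected number of fields: {line!r}")
    return "\n".join(out) + "\n"


sample = "0\t1\ta\n1\t2\t\tb\n2\t3\tc\t\n3\n"
print(normalize_att(sample), end="")
```

Weighted machines would need a different rule, since a trailing weight column makes the field count ambiguous.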

input and output tapes share the same symbol table causing occasional surprise

This is a somewhat marginal problem which can pop up when composing (somewhat artificial) FSTs:

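from pyfoma import FST

# Maps the characters f o o to the characters b a r.
f1 = FST.re("(foo):(bar)")
print(list(f1.apply("foo")))
# prints ["bar"]
# Inserts the multi-character symbol 'foo', then rewrites b a r as b a z.
f2 = FST.re("'':'foo' b a r:z")
print(list(f1.compose(f2).apply("foo")))
# prints [] ... you might expect it to print "foobaz"?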

I'm not sure this is a serious problem as I'm having a hard time coming up with a less contrived example. Feel free to close this if you can't think of one either :)

Feature request: define multi-character symbols for regex compiler

Because pyfoma's regex language, unlike foma/xfst, separates characters by default, what might appear to be single input symbols in a regex frequently are not. This is quite obviously true in the case where your data is in NFKD / NFD form, but much more perniciously so in NFKC / NFC, where, for instance, č is a single character, and so is ḥ, but x̌ is not.

This means that rewrite rules in particular may not do what you (or someone reading your code) might expect at first glance.
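To make the normalization point concrete, here is a small check using only the standard library. It counts code points after NFC and NFD normalization for the three examples mentioned above. The inputs are spelled with explicit combining marks so the result does not depend on how this page was encoded.

```python
import unicodedata

# Base letter plus combining mark, written with escapes to avoid ambiguity.
samples = {
    "c with caron": "c\u030c",
    "h with dot below": "h\u0323",
    "x with caron": "x\u030c",
}


def nfc_length(text):
    """Number of code points after NFC normalization."""
    return len(unicodedata.normalize("NFC", text))


def nfd_length(text):
    """Number of code points after NFD normalization."""
    return len(unicodedata.normalize("NFD", text))


for name, text in samples.items():
    print(f"{name:18} NFC={nfc_length(text)} NFD={nfd_length(text)}")
```

The first two have precomposed forms and collapse to one code point under NFC, while x with caron stays at two, so an unquoted x̌ in a regex would be read as two separate symbols.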

The defensive linguist will resort to putting single quotes around everything, which is probably a good idea, but this leads to rules that are, well, kind of ugly. Also there are lots of situations where something is fairly obviously not a single character but it would be nice to treat it as one.

This could for instance be an extra argument to FST.regex and FST.rlg, e.g.:

MULTICHARS = "kʷ kʷ̓ x̌".split()
rule = FST.re(some_regex, multichar_tokens=MULTICHARS)

I might make a PR to see if this is easily doable...
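For illustration, the behaviour I have in mind is roughly a longest-match tokenizer. The sketch below is plain Python and is not pyfoma's actual tokenizer; the function tokenize is hypothetical, and multichar_tokens above is only the proposed argument name, not an existing one.

```python
def tokenize(text, multichars):
    """Split text into symbols, preferring the longest declared multichar symbol."""
    # Longest first, so a symbol is never shadowed by one of its own prefixes.
    ordered = sorted((m for m in multichars if m), key=len, reverse=True)
    tokens = []
    i = 0
    while i < len(text):
        for symbol in ordered:
            if text.startswith(symbol, i):
                tokens.append(symbol)
                i += len(symbol)
                break
        else:
            # No declared symbol starts here: fall back to a single code point.
            tokens.append(text[i])
            i += 1
    return tokens


# Same symbols as MULTICHARS above, written with escapes for clarity.
MULTICHARS = ["k\u02b7", "k\u02b7\u0313", "x\u030c"]
print(tokenize("k\u02b7\u0313ax\u030c", MULTICHARS))
```

Sorting by length matters here because kʷ is a prefix of kʷ̓, and a first-match strategy in declaration order would split the latter.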

`eliminate_flags` changes name/package unexpectedly

In the released version of pyfoma, one creates a flag-eliminated fst with:

from pyfoma.eliminate_flags import eliminate_flags
efst = eliminate_flags(some_other_fst)

This has been changed to something more logical:

from pyfoma.algorithms import eliminate_flags
efst = eliminate_flags(some_other_fst)

Unfortunately, the old function has also changed its name, so the original code will break. If there is no urgent reason to change pyfoma.eliminate_flags.eliminate_flags to pyfoma.eliminate_flags.eliminate_fst_flags then I suggest changing it back so older code will still work.
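In the meantime, code that has to run against both versions could look the function up defensively. This is a rough sketch: the helper first_available is made up for illustration, the three locations are just the ones named in this issue, and it assumes all of them accept the same arguments, which I have not verified.

```python
import importlib


def first_available(candidates):
    """Return the first attribute found among (module_name, attribute) pairs."""
    for module_name, attribute in candidates:
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            # Module (or its parent package) is missing in this installation.
            continue
        if hasattr(module, attribute):
            return getattr(module, attribute)
    raise ImportError(f"none of the candidates could be imported: {candidates}")


# Locations mentioned in this issue, newest first.
CANDIDATES = [
    ("pyfoma.algorithms", "eliminate_flags"),
    ("pyfoma.eliminate_flags", "eliminate_flags"),
    ("pyfoma.eliminate_flags", "eliminate_fst_flags"),
]

# Usage, with pyfoma installed:
# eliminate_flags = first_available(CANDIDATES)
# efst = eliminate_flags(some_other_fst)
```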

Term negation can't be applied to multi-character symbols (and complement operator is unimplemented)

In foma/xfst, you can use multi-character symbols in "character classes" and then take the complement of these classes, for example:

def U [ u | uː ];
def Unround kw -> k || \U _ .#.;

There doesn't seem to be any way to do this in pyfoma without some kind of workaround, such as transforming the multi-character symbol into a single character. I had thought I could use the complement operator ~, but it doesn't appear to be implemented yet.

It's not really clear to me how this works in foma since the resulting FSTs have magical @ symbols in them when you look at them with graphviz or export them as AT&T format, but it does seem to work :)
