dishmint / lexicalcases

Extract substrings matching a lexical pattern

Home Page: https://www.paclets.com/FaizonZaman/LexicalCases

License: MIT License

Mathematica 100.00%
text-mining text pattern-matching text-analysis wolfram-language wolfram-mathematica linguistics text-search

lexicalcases's Introduction

LexicalCases [EXPERIMENTAL]

Extract substrings matching a lexical pattern.

Install

Load the paclet from the Paclet Repository

PacletInstall[ResourceObject["FaizonZaman/LexicalCases"]]
Needs["LexicalCases`"]

Supports Wolfram Language v14.0+

Usage

Search strings, files, or Wikipedia articles for a lexical pattern.

oosp = ExampleData[{"Text", "OriginOfSpecies"}];
oospPattern = BoundToken[WordToken[2], BoundToken["specie"|"species"]];

oospResults = LexicalCases[oosp, oospPattern]

All Text Content Types can be used; however, some take unreasonably long to expand, especially types meant to represent a hefty piece of text, like a topic type. The basic part-of-speech types are good ones to start with:

alice = ExampleData[{"Text", "AliceInWonderland"}];
alicePattern = "Alice" ~~ TypeToken["Verb"] ~~ TypeToken["Adverb"];

aliceResults = LexicalCases[alice, alicePattern]

Use lexical patterns in StringCases, StringPosition, and StringMatchQ by wrapping the pattern with LexicalPattern.

Here's an example creating an operator of StringCases:

aliceOp = StringCases[LexicalPattern["Alice" ~~ TypeToken["Verb"] ~~ TypeToken["Adverb"]]];
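The same wrapped pattern can be sketched with StringMatchQ as well (per the claim above; whether a given string matches depends on the pattern covering the whole string):

```wl
(* assumes LexicalCases` is loaded; StringMatchQ tests whether the whole string matches *)
StringMatchQ[
  "Alice went quickly",
  LexicalPattern["Alice" ~~ TypeToken["Verb"] ~~ TypeToken["Adverb"]]
]
```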

The paclet documentation includes additional examples; you can also visit LexicalCases on the Wolfram Paclet Repository.

lexicalcases's People

Contributors

dishmint

lexicalcases's Issues

[LexicalCases] Incorrect Matches

[Screenshot omitted: 2021-11-29 10:11 PM]

"calculatin" did not appear in the second match. This may be because StringCases looks for matching substrings, so "g" was considered an appropriate match. It could also be that the text-type expansion picked up "g" as a noun or adjective.

[Wikipedia] Match and Missing counts are incorrect

When searching 5000 articles, a Missing count of 9000 was returned. This is impossible: there should be at most one Missing["NoMatchFound"] per article without matches.

A similar issue occurred for match counts: the reported count was effectively the number of articles a match appeared in, not the number of occurrences of the word.
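The distinction can be sketched with hypothetical per-article results (`matchesPerArticle` is an assumed name, not from the paclet):

```wl
(* matchesPerArticle: one list of matches per searched article (hypothetical data) *)
matchesPerArticle = {{"a", "b"}, {}, {"a"}, {}};

Total[Length /@ matchesPerArticle]   (* total occurrences: 3 *)
Count[matchesPerArticle, {}]         (* articles without matches: 2, so at most two Missing["NoMatchFound"] *)
Count[matchesPerArticle, Except[{}]] (* articles containing at least one match: 2 *)
```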

[OptionalLexicalPattern] Need to consider implication of surrounding patterns in 0 instance case

OptionalLexicalPattern needs to be resolved differently. Note the location of the optional in the pattern below. OptionalLexicalPattern matches its argument or an empty string, so when the optional argument is not present, the result is a sequence of two whitespace characters where only one should be.

LexicalPattern["Alice ", TextType["Verb"], " ", TextType["Preposition"], " ", OptionalLexicalPattern["the"], " ", TextType["Noun"], WordBoundary]
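One possible fix, sketched with plain string patterns rather than the paclet's actual implementation: fold the preceding space into the optional, so that a zero-length match doesn't leave a double space.

```wl
(* naive: a space on each side of the optional requires two spaces when "the" is absent *)
StringMatchQ["Alice went to  market", "Alice went to " ~~ ("the" | "") ~~ " market"]  (* True: double space *)

(* folding the space into the optional avoids that *)
StringMatchQ["Alice went to market", "Alice went to" ~~ (" the" | "") ~~ " market"]   (* True *)
```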

Strings with whitespace need to be PatternSequences

https://github.com/dishmint/TextSequenceCases/blob/6100ad93e2a7255d89382f55b3042772ebac03cd/TextSequenceCases.wl#L47-L54

[Screenshot omitted: 2021-08-28 6:01 PM]

Strings with whitespace will not match in SequenceCases, so results like "Elon Musk" probably need to become PatternSequence["Elon", "Musk"].
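A minimal sketch of that conversion (`ToPatternSequence` is a hypothetical helper name):

```wl
(* split a multiword string into a PatternSequence of its words *)
ToPatternSequence[s_String] := PatternSequence @@ TextWords[s]

ToPatternSequence["Elon Musk"]  (* PatternSequence["Elon", "Musk"] *)
```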

Code to test:

tp3 = TextPattern[TextType["Adjective"], "books", OptionalTextPattern["from" | "by"], TextType["Person"]];
TextSequenceCases["I've been reading some cool books by Elon Musk.", tp3]

Implement patterns as StringExpressions

[Screenshot omitted: 2021-11-25 12:39 AM]

In the above example I'm getting the verbs, not the quantities.

I'm wondering if I can use native pattern syntax; this might be much easier. I wouldn't have to define a bunch of utility functions to get the behavior the pattern functions already offer. If anything, I just need to keep the TextType heads.

Support option to return a TextType Association

https://github.com/dishmint/TextSequenceCases/blob/fca402def6ac23cef589908277048a7c705259a5/TextSequenceCases.wl#L115-L118

It would be useful to have an association returned where each element in the result has a key corresponding to its TextType.

So, instead of this:

{{"generally", "extinct", "species"}, {"aboriginally", "distinct", "species"}, ...}

You'd get something like:

{
    <| "Adverb" -> "generally", "Adjective" -> "extinct", "Text" -> "species"|>,
    <| "Adverb" -> "aboriginally", "Adjective" -> "distinct", "Text" -> "species"|>,
    ...
    }

[Screenshot omitted: 2021-08-28 7:52 PM]

This would avoid a retroactive tagging step on the user's part when performing analysis on the results.
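The tagging can be sketched like this, assuming the ordered list of TextType names for the pattern is known:

```wl
(* types: the TextType of each slot in the pattern, in order (assumed known) *)
types = {"Adverb", "Adjective", "Text"};

AssociationThread[types -> #] & /@ {
  {"generally", "extinct", "species"},
  {"aboriginally", "distinct", "species"}
}
(* {<|"Adverb" -> "generally", "Adjective" -> "extinct", "Text" -> "species"|>, ...} *)
```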

ConvertToWikipediaSearchQuery needs refactoring

https://github.com/dishmint/TextSequenceCases/blob/2df7a87349f99ee5a7b16970d63c98e132ffd28a/TextSequenceCases.wl#L97-L103

Calling ConvertToWikipediaSearchQuery on this pattern produces an empty string, which WikipediaSearch can't handle. The cause is the deletion of OrderlessTextPattern objects.

TextPattern[TextType["Adjective"], OrderlessTextPattern["movie" | "movies", OptionalTextPattern["from" | "by"], TextType["Person"]]]

[Screenshot omitted: 2021-09-02 1:29 AM]

OrderlessTextPatterns shouldn't be deleted. Instead, the TextPattern should expand to cover all orderings, or just one, since these queries are only meant as keywords for WikipediaSearch (the function name should reflect that: ConvertToWikipediaSearchKeywords).

So then, what happens with TextPatterns whose arguments dissolve, thereby producing " "? A random sample of Wikipedia articles would suffice.
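The "expand all orderings" idea can be sketched with Permutations over the extracted keywords (the keyword list here is a hypothetical extraction from the pattern):

```wl
(* each ordering of the extracted keywords becomes a candidate search phrase *)
keywords = {"movies", "by", "person"};

StringRiffle /@ Permutations[keywords]
(* {"movies by person", "movies person by", "by movies person", ...} *)
```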

Special characters in source text need escaping

This is a problem... VerbPhrase uses up a lot of memory. I can try it on a small text to see if the same issue occurs. This query should be possible, but the scope of fixing it might be beyond the development of this function.

[Screenshot omitted: 2021-09-10 10:21 PM]

Rename TextPatternCases to LexicalCases

I think LexicalCases is sleeker and more descriptive. TextTypes are essentially lexical categories, so having Lexical in the function name is fitting.

TextSequenceSummary

The result should be a TextSequenceSummary object with accessors for:

  • Data
  • Relative Counts
  • MatchFrequencyPerSentence (?)
    • can't check the sentence unless the source text is tokenized by WordBoundary instead of Whitespace.
    • Would there be a performance loss from tokenizing by WordBoundary (thereby preserving punctuation, etc.)?

[LexicalCases] Delegate service definitions to separate files

The service functionality should be split up. A separate package file for each supported service, for example, LexicalCasesWikipedia.wl, and LexicalCasesArXiv.wl. Each file would contain code for query parsing consistent with that service. This would clean up LexicalCases.wl and make it easier to read.

Originally posted by @dishmint in #1 (comment)

[LexicalCases] Support list of strings as input

Support a list of strings as the first argument of LexicalCases and have it work like LexicalCasesFromWikipedia; that is, instead of associating matches with an article, associate them with a file name. Though I suppose the question is: would you want a separate result for each text, or an aggregate result?
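The per-text variant can be sketched as a hypothetical dispatch (index keys stand in for file names; this is not the paclet's implementation):

```wl
(* dispatch a list of sources to per-text searches, keyed by index *)
LexicalCases[texts : {__String}, patt_] :=
  AssociationThread[Range@Length[texts] -> (LexicalCases[#, patt] & /@ texts)]
```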

[Documentation] Add Notes doc for comments on best practices

  • Suppressing output for a speed increase
  • Increasing the MaxItems option for more match opportunities
  • Partial matches for TextTypes (no explicit WordBoundary) are supported by default
  • Full-word matches can be ensured by adding WordBoundary, supported via the BoundedString function

[LexicalCasesOnString] StringPosition of list-matches will give incorrect results

Map[AssociationThread[{"Match", "Position"} -> #] &]@With[
    {cases = MatchTrim[OptionValue["StringTrim"]]@DeleteDuplicates@StringCases[source, RX]},
    Thread[{cases, Map[StringPosition[source, #] &][cases]}]
]

This needs to change because matches returned as lists will not give correct results from StringPosition. I implemented it this way because the threading returned a same-length error, but StringPosition needs to search for each pattern, and then the matches and positions need to be combined appropriately.

Example pattern:

LexicalPattern[adv : TextType["Adverb"], adj : TextType["Adjective"], "music"] :> {adv, adj}
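One way to handle list matches, sketched as a hypothetical helper (`positionOf` is not from the paclet; it assumes the parts were separated by single spaces in the source):

```wl
(* locate a match in the source, whether it is a plain string or a list of parts *)
positionOf[source_String, match_String] := StringPosition[source, match]
positionOf[source_String, match_List]   := StringPosition[source, StringRiffle[match, " "]]

positionOf["loud happy music", {"loud", "happy"}]  (* {{1, 10}} *)
```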

Support String Pattern Symbols

  • DigitCharacter
  • LetterCharacter
  • WhitespaceCharacter
  • WordCharacter
  • WordBoundary

Except doesn't work on words, but it does work on the character types above.

  • Except

The explicit nature of LexicalPatterns doesn't warrant the use of Longest or Shortest.

  • Longest
  • Shortest

Support Replacement Rules in TextPattern

Replacement rules should work:

TextPatternCases[sourcetext, TextPattern["this is a", adj:TextType["Adjective"], TextType["Noun"]] :> <|"Adjective" -> adj |>]

Improve Text Tokenization

  • Tokenizing the text into words doesn't respect sentence boundaries

One option is compiling the text pattern to a RegularExpression; that way, the source text doesn't need to be tokenized into a list of words.
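The idea can be sketched with a plain regular expression standing in for a compiled lexical pattern (the actual compilation step is not shown):

```wl
(* a compiled string pattern searches the raw text directly; no word list needed *)
StringCases["She sang beautifully and left quickly.", RegularExpression["\\b\\w+ly\\b"]]
(* {"beautifully", "quickly"} *)
```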

[LexicalCases] Add File support

File specs should be valid input, that is, expressions with the head File:

LexicalCases[File["path/to/file"], LexicalPattern[...]]
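A minimal sketch of the dispatch (hypothetical; assumes plain-text files and is not the paclet's implementation):

```wl
(* read the file's text, then defer to the string form *)
LexicalCases[file_File, patt_] := LexicalCases[Import[file, "Text"], patt]
```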

[LexicalCasesOnString] Is it more performant to convert LP to SE for all source text before searching

Right now I'm calling LexicalPatternToStringExpression per source at the same step as searching. I'm wondering if I should generate all the string expressions before searching. Then I could use MapThread:

MapThread[LexicalCasesOnString, {{source1, source2, ...}, {pattern1, pattern2, ...}}]

Or use MapIndexed

texts = {text1, text2, ...};
MapIndexed[LexicalCasesOnString[texts[[First[#2]]], #1] &, {pattern1, pattern2, ...}]

(parallelized?)
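The parallel variant can be sketched with ParallelMap over precompiled source/pattern pairs (`sources` and `patterns` are assumed lists; LexicalPatternToStringExpression is the internal function named above):

```wl
(* compile every pattern once, then search source/pattern pairs in parallel *)
exprs = LexicalPatternToStringExpression /@ patterns;
ParallelMap[Apply[LexicalCasesOnString], Transpose[{sources, exprs}]]
```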

I'll also need to do some profiling of the code before coming to any conclusions. All the more reason to pacletize, so I can profile the code from Workbench.
