dishmint / lexicalcases

Extract substrings matching a lexical pattern

Home Page: https://www.paclets.com/FaizonZaman/LexicalCases

License: MIT License

Mathematica 100.00%
text-mining text pattern-matching text-analysis wolfram-language wolfram-mathematica linguistics text-search

lexicalcases's Introduction

LexicalCases [EXPERIMENTAL]

Extract substrings matching a lexical pattern.

Install

Load the paclet from the Paclet Repository

PacletInstall[ResourceObject["FaizonZaman/LexicalCases"]]
Needs["LexicalCases`"]

Supports Wolfram Language v14.0+

Usage

Search strings, files, or Wikipedia articles for a lexical pattern.

oosp = ExampleData[{"Text", "OriginOfSpecies"}];
oospPattern = BoundToken[WordToken[2], BoundToken["specie"|"species"]];

oospResults = LexicalCases[oosp, oospPattern]

All Text Content Types can be used; however, some take unreasonably long to expand, especially types meant to represent a hefty piece of text, like a topic type. The basic part-of-speech types are good ones to start with:

alice = ExampleData[{"Text", "AliceInWonderland"}];
alicePattern = "Alice" ~~ TypeToken["Verb"] ~~ TypeToken["Adverb"];

aliceResults = LexicalCases[alice, alicePattern]

Use lexical patterns in StringCases, StringPosition, and StringMatchQ by wrapping the pattern with LexicalPattern.

Here's an example creating an operator of StringCases:

aliceOp = StringCases[LexicalPattern["Alice" ~~ TypeToken["Verb"] ~~ TypeToken["Adverb"]]];
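The same wrapped pattern can be sketched with StringMatchQ as well (per the claim above; whether a given string matches depends on the pattern covering the whole string):

```wl
(* assumes LexicalCases` is loaded; StringMatchQ tests whether the whole string matches *)
StringMatchQ[
  "Alice went quickly",
  LexicalPattern["Alice" ~~ TypeToken["Verb"] ~~ TypeToken["Adverb"]]
]
```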

The paclet documentation includes additional examples; you can also visit LexicalCases on the Wolfram Paclet Repository.

lexicalcases's People

Contributors

dishmint

lexicalcases's Issues

[LexicalCases] Incorrect Matches

[Screenshot omitted: 2021-11-29 10:11 PM]

"calculatin" did not appear in the second match. This may be because StringCases looks for matching substrings, so "g" was considered an appropriate match. It could also be that the text-type expansion picked up "g" as a noun or adjective.

[Wikipedia] Match and Missing counts are incorrect

When searching 5000 articles, a Missing count of 9000 was returned. This is impossible: there should be at most one Missing["NoMatchFound"] per article without matches.

A similar issue occurred for match counts: the reported count was effectively the number of articles a match appeared in, not the number of occurrences of the word.
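The distinction can be sketched with hypothetical per-article results (`matchesPerArticle` is an assumed name, not from the paclet):

```wl
(* matchesPerArticle: one list of matches per searched article (hypothetical data) *)
matchesPerArticle = {{"a", "b"}, {}, {"a"}, {}};

Total[Length /@ matchesPerArticle]   (* total occurrences: 3 *)
Count[matchesPerArticle, {}]         (* articles without matches: 2, so at most two Missing["NoMatchFound"] *)
Count[matchesPerArticle, Except[{}]] (* articles containing at least one match: 2 *)
```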

[OptionalLexicalPattern] Need to consider implication of surrounding patterns in 0 instance case

OptionalLexicalPattern needs to be resolved differently. Note the location of the optional in the pattern below. OptionalLexicalPattern matches its argument or an empty string, so when the optional argument is not present, the result is a sequence of two whitespace characters where only one should be.

LexicalPattern["Alice ", TextType["Verb"], " ", TextType["Preposition"], " ", OptionalLexicalPattern["the"], " ", TextType["Noun"], WordBoundary]
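One possible fix, sketched with plain string patterns rather than the paclet's actual implementation: fold the preceding space into the optional, so that a zero-length match doesn't leave a double space.

```wl
(* naive: a space on each side of the optional requires two spaces when "the" is absent *)
StringMatchQ["Alice went to  market", "Alice went to " ~~ ("the" | "") ~~ " market"]  (* True: double space *)

(* folding the space into the optional avoids that *)
StringMatchQ["Alice went to market", "Alice went to" ~~ (" the" | "") ~~ " market"]   (* True *)
```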

Strings with whitespace need to be PatternSequences

https://github.com/dishmint/TextSequenceCases/blob/6100ad93e2a7255d89382f55b3042772ebac03cd/TextSequenceCases.wl#L47-L54

[Screenshot omitted: 2021-08-28 6:01 PM]

Strings with whitespace will not match in SequenceCases, so results like "Elon Musk" probably need to become PatternSequence["Elon", "Musk"].
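A minimal sketch of that conversion (`ToPatternSequence` is a hypothetical helper name):

```wl
(* split a multiword string into a PatternSequence of its words *)
ToPatternSequence[s_String] := PatternSequence @@ TextWords[s]

ToPatternSequence["Elon Musk"]  (* PatternSequence["Elon", "Musk"] *)
```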

Code to test:

tp3 = TextPattern[TextType["Adjective"], "books", OptionalTextPattern["from" | "by"], TextType["Person"]];
TextSequenceCases["I've been reading some cool books by Elon Musk.", tp3]

Implement patterns as StringExpressions

[Screenshot omitted: 2021-11-25 12:39 AM]

In the above example I'm getting the verbs, not the quantities.

I'm wondering if I can use native pattern syntax; this might be much easier. I wouldn't have to define a bunch of utility functions to get the behavior the pattern functions already offer. If anything, I just need to keep the TextType heads.

Support option to return a TextType Association

https://github.com/dishmint/TextSequenceCases/blob/fca402def6ac23cef589908277048a7c705259a5/TextSequenceCases.wl#L115-L118

It would be useful to have an association returned where each element in the result has a key corresponding to its TextType.

So, instead of this:

{{"generally", "extinct", "species"}, {"aboriginally", "distinct", "species"}, ...}

You'd get something like:

{
    <| "Adverb" -> "generally", "Adjective" -> "extinct", "Text" -> "species"|>,
    <| "Adverb" -> "aboriginally", "Adjective" -> "distinct", "Text" -> "species"|>,
    ...
    }

[Screenshot omitted: 2021-08-28 7:52 PM]

This would avoid a retroactive tagging step on the user's part when performing analysis on the results.
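The tagging can be sketched like this, assuming the ordered list of TextType names for the pattern is known:

```wl
(* types: the TextType of each slot in the pattern, in order (assumed known) *)
types = {"Adverb", "Adjective", "Text"};

AssociationThread[types -> #] & /@ {
  {"generally", "extinct", "species"},
  {"aboriginally", "distinct", "species"}
}
(* {<|"Adverb" -> "generally", "Adjective" -> "extinct", "Text" -> "species"|>, ...} *)
```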

ConvertToWikipediaSearchQuery needs refactoring

https://github.com/dishmint/TextSequenceCases/blob/2df7a87349f99ee5a7b16970d63c98e132ffd28a/TextSequenceCases.wl#L97-L103

Calling ConvertToWikipediaSearchQuery on this pattern produces an empty string, which WikipediaSearch can't handle. The cause is the deletion of OrderlessTextPattern objects.

TextPattern[TextType["Adjective"], OrderlessTextPattern["movie" | "movies", OptionalTextPattern["from" | "by"], TextType["Person"]]]

[Screenshot omitted: 2021-09-02 1:29 AM]

OrderlessTextPatterns shouldn't be deleted. Instead, the TextPattern should expand to cover all orderings, or just one, since these queries are only meant as keywords for WikipediaSearch (the function name should reflect that: ConvertToWikipediaSearchKeywords).

So then, what happens with TextPatterns whose arguments dissolve, thereby producing " "? A random sample of Wikipedia articles would suffice.
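The "expand all orderings" idea can be sketched with Permutations over the extracted keywords (the keyword list here is a hypothetical extraction from the pattern):

```wl
(* each ordering of the extracted keywords becomes a candidate search phrase *)
keywords = {"movies", "by", "person"};

StringRiffle /@ Permutations[keywords]
(* {"movies by person", "movies person by", "by movies person", ...} *)
```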

Special characters in source text need escaping

This is a problem... VerbPhrase uses up a lot of memory. I can try it on a small text to see if the same issue occurs. This query should be possible, but the scope of fixing it might be beyond the development of this function.

[Screenshot omitted: 2021-09-10 10:21 PM]

Rename TextPatternCases to LexicalCases

I think LexicalCases is sleeker and more descriptive. TextTypes are essentially lexical categories, so having Lexical in the function name is fitting.

TextSequenceSummary

The result should be a TextSequenceSummary object with accessors for:

  • Data
  • Relative Counts
  • MatchFrequencyPerSentence (?)
    • can't check the sentence unless the source text is tokenized by WordBoundary instead of Whitespace.
    • Would there be a performance loss from tokenizing by WordBoundary (thereby preserving punctuation, etc.)?

[LexicalCases] Delegate service definitions to separate files

The service functionality should be split up. A separate package file for each supported service, for example, LexicalCasesWikipedia.wl, and LexicalCasesArXiv.wl. Each file would contain code for query parsing consistent with that service. This would clean up LexicalCases.wl and make it easier to read.

Originally posted by @dishmint in #1 (comment)

[LexicalCases] Support list of strings as input

Support a list of strings as the first argument of LexicalCases and have it work like LexicalCasesFromWikipedia; that is, instead of associating matches with an article, associate them with a file name. Though I suppose the question is: would you want a separate result for each text, or an aggregate result?
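The per-text variant can be sketched as a hypothetical dispatch (index keys stand in for file names; this is not the paclet's implementation):

```wl
(* dispatch a list of sources to per-text searches, keyed by index *)
LexicalCases[texts : {__String}, patt_] :=
  AssociationThread[Range@Length[texts] -> (LexicalCases[#, patt] & /@ texts)]
```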

[Documentation] Add Notes doc for comments on best practices

  • Suppressing output for a speed increase
  • Increasing the MaxItems option for more match opportunities
  • Partial matches for TextTypes (no explicit WordBoundary) are supported by default
  • Full-word matches can be ensured by adding WordBoundary, supported via the BoundedString function

[LexicalCasesOnString] StringPosition of list-matches will give incorrect results

Map[AssociationThread[{"Match", "Position"} -> #] &]@With[
    {cases = MatchTrim[OptionValue["StringTrim"]]@DeleteDuplicates@StringCases[source, RX]},
    Thread[{cases, Map[StringPosition[source, #] &][cases]}]
]

This needs to change because matches returned as lists will not give correct results from StringPosition. I implemented it this way because the threading returned a same-length error, but StringPosition needs to search for each pattern, and then the matches and positions need to be combined appropriately.

Example pattern:

LexicalPattern[adv : TextType["Adverb"], adj : TextType["Adjective"], "music"] :> {adv, adj}
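One way to handle list matches, sketched as a hypothetical helper (`positionOf` is not from the paclet; it assumes the parts were separated by single spaces in the source):

```wl
(* locate a match in the source, whether it is a plain string or a list of parts *)
positionOf[source_String, match_String] := StringPosition[source, match]
positionOf[source_String, match_List]   := StringPosition[source, StringRiffle[match, " "]]

positionOf["loud happy music", {"loud", "happy"}]  (* {{1, 10}} *)
```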

Support String Pattern Symbols

  • DigitCharacter
  • LetterCharacter
  • WhitespaceCharacter
  • WordCharacter
  • WordBoundary

Except doesn't work on words, but it does work on the character types above.

  • Except

The explicit nature of LexicalPatterns doesn't warrant the use of Longest or Shortest.

  • Longest
  • Shortest

Support Replacement Rules in TextPattern

Replacement rules should work:

TextPatternCases[sourcetext, TextPattern["this is a", adj:TextType["Adjective"], TextType["Noun"]] :> <|"Adjective" -> adj |>]

Improve Text Tokenization

  • Tokenizing the text into words doesn't respect sentence boundaries

One option is compiling the text pattern to a RegularExpression; that way, the source text doesn't need to be tokenized into a list of words.
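The idea can be sketched with a plain regular expression standing in for a compiled lexical pattern (the actual compilation step is not shown):

```wl
(* a compiled string pattern searches the raw text directly; no word list needed *)
StringCases["She sang beautifully and left quickly.", RegularExpression["\\b\\w+ly\\b"]]
(* {"beautifully", "quickly"} *)
```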

[LexicalCases] Add File support

File specs should be valid input, that is, expressions with the head File:

LexicalCases[File["path/to/file"], LexicalPattern[...]]
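A minimal sketch of the dispatch (hypothetical; assumes plain-text files and is not the paclet's implementation):

```wl
(* read the file's text, then defer to the string form *)
LexicalCases[file_File, patt_] := LexicalCases[Import[file, "Text"], patt]
```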

[LexicalCasesOnString] Is it more performant to convert LP to SE for all source text before searching

Right now I'm calling LexicalPatternToStringExpression per source at the same step as searching. I'm wondering if I should generate all the string expressions before searching. Then I could use MapThread:

MapThread[LexicalCasesOnString, {{source1, source2, ...}, {pattern1, pattern2, ...}}]

Or use MapIndexed

texts = {text1, text2, ...};
MapIndexed[LexicalCasesOnString[texts[[First[#2]]], #1] &, {pattern1, pattern2, ...}]

(parallelized?)
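The parallel variant can be sketched with ParallelMap over precompiled source/pattern pairs (`sources` and `patterns` are assumed lists; LexicalPatternToStringExpression is the internal function named above):

```wl
(* compile every pattern once, then search source/pattern pairs in parallel *)
exprs = LexicalPatternToStringExpression /@ patterns;
ParallelMap[Apply[LexicalCasesOnString], Transpose[{sources, exprs}]]
```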

I'll also need to do some profiling of the code before coming to any conclusions. All the more reason to pacletize, so I can profile the code from Workbench.
