prosaic's Introduction

                               o
       _   ,_    __   ,   __,      __
     |/ \_/  |  /  \_/ \_/  |  |  /
     |__/    |_/\__/  \/ \_/|_/|_/\___/
    /|
    \|

prosaic

being a prose scraper & cut-up poetry generator

by vilmibm

using nltk

and licensed under the GPL.

what is prosaic?

prosaic is a tool for cutting up large quantities of text and rearranging it to form poetic works.

prerequisites

  • postgresql 9.0+
  • python 3.5+
  • linux (it probably works on a mac, i dunno)
  • you might need some -dev libraries and/or gcc to get nltk to compile

database setup

Prosaic requires a postgresql database. Once you've got postgresql installed, run the following to create a database prosaic can access (assumes you're on linux; refer to google to perform steps like this on osx/windows):

sudo su postgres
createuser prosaic -P
# at password prompt, type prosaic and hit enter
createdb prosaic -O prosaic
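
To verify the setup before moving on, you can connect with psql (a quick sanity check; this assumes postgres is listening on localhost):

psql -U prosaic -h localhost -d prosaic
# enter the prosaic password at the prompt; \q quits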

quick start

sudo pip install prosaic
prosaic source new pride_and_prejudice pandp.txt
prosaic source new hackers hackers_screenplay.txt
prosaic corpus new pride_and_hackers
prosaic corpus link pride_and_hackers pride_and_prejudice
prosaic corpus link pride_and_hackers hackers
prosaic poem new -cpride_and_hackers -thaiku

and so I warn you.
We will know where we have gone
ALL: HACK THE PLANET

See the full tutorial for more detailed instructions. There is also a CLI reference.

use as a library

This is a little complex right now; I'm working on a simpler API.

from io import StringIO
from prosaic.cfg import DEFAULT_DB
from prosaic.models import Database, Source, Corpus, get_session
from prosaic.parsing import process_text
from prosaic.generate import poem_from_template

db = Database(**DEFAULT_DB)

source = Source(name='some_name')
process_text(db, source, StringIO('some very long string of text'))

session = get_session(db)
corpus = Corpus(name='sweet corpus', sources=[source])
session.add(corpus)
session.commit()

# poem_from_template returns raw line records from the database:
poem_lines = poem_from_template([{'syllables': 5}, {'syllables': 7}, {'syllables': 5}],
                                db,
                                corpus.id)

# pull the raw text out of each line record and print it:
for line in poem_lines:
    print(line[0])
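
To turn the result into a finished text file, join the raw lines yourself (a small follow-up using only the line access shown above):

poem_text = '\n'.join(line[0] for line in poem_lines)
with open('poem.txt', 'w') as f:
    f.write(poem_text)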

use on the web

There was a web wrapper at prosaic.party, but it had some functionality and performance issues, so I've taken it down for now.

write a template

Templates are currently stored as json files (or passed from within code as python dictionaries) representing an array of json objects, each one describing a line of poetry.

A template describes a "desired" poem. Prosaic uses the template to approximate a piece given what text it has in its database. Running prosaic repeatedly with the same template will almost always yield different results.

You can see available templates with prosaic template ls, edit them with prosaic template edit <template name>, and add your own with prosaic template new <template name>.

The rules available are:

  • syllables: integer number of syllables you'd like on a line
  • alliteration: true or false; whether you'd like to see alliteration on a line
  • keyword: string containing a word you want to see on a line
  • fuzzy: string; matches a line that occurs near a source sentence containing this keyword.
  • rhyme: define a rhyme scheme. For example, a couplet template would be: [{"rhyme":"A"}, {"rhyme":"A"}]
  • blank: if set to true, emits a blank line in the output; useful for separating stanzas.

example template

[{"syllables": 10, "keyword": "death", "rhyme": "A"},
 {"syllables": 12, "fuzzy": "death", "rhyme": "B"},
 {"syllables": 10, "rhyme": "A"},
 {"syllables": 10, "rhyme": "B"},
 {"syllables": 8, "fuzzy": "death", "rhyme": "C"},
 {"syllables": 10, "rhyme": "C"}]

full CLI reference

Check out the CLI reference documentation.

how does prosaic work?

prosaic is two parts: a text parser and a poem writer. a human selects text files to feed to prosaic, who will chunk the text up into phrases and tag them with metadata. the human then links each of these parsed text files to a corpus.

once a corpus is prepared, a human then writes (or reuses) a poem template (in json) that describes a desired poetic structure (number of lines, rhyme scheme, topic) and provides it to prosaic, who then uses the weltanschauung algorithm to randomly approximate a poem according to the template.

my personal workflow is to build a highly thematic corpus (for example, thirty-one cyberpunk novels) and, for each poem, a custom template. I then run prosaic between five and twenty times, each time saving and discarding lines or whole stanzas. finally, I augment the piece with original lines and clean up any grammar or pronoun-agreement issues in what prosaic emitted. the end result is a human-computer collaborative work. you are, of course, welcome to use prosaic however you see fit.

developing

Patches are more than welcome if they come with tests. Tests should always be green in master; if not, please let me know! To run the tests:

# assuming you have pip install'd prosaic from source into an activated venv:
cd test
py.test

changelog

  • 6.1.1
    • fix error handling; this was preventing sources from being made.
  • 6.1.0
    • default to a system-wide nltk_data directory; won't download and install to ~ if found. the path is /usr/share/nltk_data. this is probably only useful on systems where prosaic is installed globally for multiple users (like on tilde.town).
    • not tied to a release, but the readme has database setup instructions now.
  • 6.0.0
    • I guess I forgot to change-log 5.x, oops
    • process_text now takes a read()able object (instead of a string), and takes a database config object as its first parameter
    • parsing is faster, at the cost of some precision
    • slightly saner DB engine handling
  • 4.0.0
    • Port to postgresql + sqlalchemy
    • Completely rewrite command line interface
    • Add a --verbose flag and muzzle the logging that used to happen unless it's present
    • Support a configuration file (~/.prosaic/prosaic.conf) for specifying database connections and default template
    • Rename some modules
    • Remove some vestigial features
  • 3.5.4 - update nltk dependency so prosaic works on python 3.5
  • 3.5.3 - mysterious release i don't know
  • 3.5.2 - handle weird double escaping issues
  • 3.5.1 - fix stupid typo
  • 3.5.0 - prosaic now respects the environment variables PROSAIC_DBNAME, PROSAIC_DBPORT and PROSAIC_DBHOST. These are used if not overridden from the command line. If neither environment variables nor CLI args are provided, static defaults are used (these are unchanged).
  • 3.4.0 - flurry of improvements to text pre-processing which makes output much cleaner.
  • 3.3.0 - blank rule; can now add blank lines to output for marking stanzas.
  • 3.2.0 - alliteration support!
  • 3.1.0 - can now install prosaic as a command line tool!! also docs!
  • 3.0.0 - lateral port to python (sorry hy), but there are some breaking naming changes.
  • 2.0.0 - shiny new CLI UI. run hy __init__.hy -h to see/explore the subcommands.
  • 1.0.0 - it works

further reading

Copyright notices

The source code contains a copy of the CMUdict pronunciation dictionary, copyright (c) 2015, Alexander Rudnicky.

prosaic's People

Contributors

bgschiller, dependabot[bot], equaa, vilmibm


prosaic's Issues

consider tracking sources in-database

it's a little frustrating to have a source's source-of-truth exist solely on disk. it would be nice to have a structured list of sources. see prosaicweb's weird hacky source handling for some use cases showing why it's nice to have a logical idea of sources (read: the ability to create corpora at a logical level instead of the database collection level).

before doing this work, prosaic should probably be ported to use SQL (postgresql and/or sqlite). See #24.

prosaic requires NTLK punkt tokenizer

When I run the commands in the README, specifically: hy __init__.hy load some_text0.txt some_mongo_db_name

I get

LookupError:
**********************************************************************
  Resource 'tokenizers/punkt/english.pickle' not found.  Please
  use the NLTK Downloader to obtain the resource:  >>>
  nltk.download()
  Searched in:
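
The standard fix, per the message above, is to fetch the missing tokenizer data with NLTK's downloader:

python3 -c "import nltk; nltk.download('punkt')"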

support meta key for a poem template

currently, a template can be thought of as a list of sets of rules. it would be nice, though, to be able to specify poem-level options like meter (see #3) or nodupes. in that case, a template could be thought of as {lines: [], meta: {}}. consider supporting both forms from a template file. #21 should probably be done before this.
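
A sketch of the proposed shape (hypothetical; neither form is implemented, and the meta keys shown are just the ones mentioned above):

{"meta": {"meter": "iambic", "nodupes": true},
 "lines": [{"syllables": 10, "rhyme": "A"},
           {"syllables": 10, "rhyme": "A"}]}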

get off of mongo

Mongo is not the worst choice for this and its structured queries make ruleset->query composition pretty easy. however, it would be nice to be able to have relationality (perhaps between templates, sources, and phrases). mongo is also annoying to administer and query.

consider:

  1. selecting a structured query library for sql
  2. wrapping both sqlite and postgresql or just postgresql
  3. storing phrases as json types (would hard-depend on pgsql)

Take advantage of static typing

Basically no one uses prosaic so it's probably fine to release a 4.x.z that hard-depends on python 3.5 and its static typing support.

I want to add this support because I like static typing. There is not really a concrete reason beyond that (besides all of the nice things that come with static typing).

Adding a source works, but produces error message

I'm trying to add the transcript of mean_girls, with prosaic source new mean_girls mean_girls.txt. The command prints an error message, but exits successfully.

It seems to have added the source to my db, though.


I'd be happy to try and fix this myself, but do you have any advice? Perhaps the error is thrown part-way through processing the file, so most of the phrases have already been added?

Thanks!

allow user to attach weights to rules

will probably mean making the template parser accept either a scalar value for a rule key or an object that specifies a weight (as a number between 0 and 1).
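
One possible shape for that (illustrative only; the value/weight keys are assumptions, not implemented syntax):

[{"syllables": 10,
  "keyword": {"value": "death", "weight": 0.8}}]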

template args

kind of a far-flung idea, but templates could put arguments (like $1, $2, etc.) into strings and have them interpolated based on command line args.

this might be a horrible idea.
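
Still, a minimal sketch of how the interpolation could work (hypothetical; none of this exists in prosaic):

def interpolate(raw_template: str, args: list) -> str:
    # replace $1, $2, ... with positional command line args
    for i, arg in enumerate(args, start=1):
        raw_template = raw_template.replace('$' + str(i), arg)
    return raw_template

# interpolate('[{"keyword": "$1", "syllables": 10}]', ['death'])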

remove sh

sh is really slow and makes basic commands (i.e. template editing) take way too long. replace it with direct calls to the standard os library.
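
For the template-editing case, the replacement might look roughly like this (a sketch using subprocess rather than raw os calls, same idea; the EDITOR fallback is an assumption):

import os
import subprocess

def edit_template(path):
    # spawn the user's editor directly instead of shelling out through sh
    editor = os.environ.get('EDITOR', 'vi')
    subprocess.run([editor, path])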

broken on python 3.5 (probably)

looks like nltk is throwing up and i blame python 3.5

    from nltk.tokenize import RegexpTokenizer
  File "/home/vilmibm/Documents/gamefaqz/prosaicv/lib/python3.5/site-packages/nltk/tokenize/__init__.py", line 65, in <module>
    from nltk.tokenize.regexp   import (RegexpTokenizer, WhitespaceTokenizer,
  File "/home/vilmibm/Documents/gamefaqz/prosaicv/lib/python3.5/site-packages/nltk/tokenize/regexp.py", line 201, in <module>
    blankline_tokenize = BlanklineTokenizer().tokenize
  File "/home/vilmibm/Documents/gamefaqz/prosaicv/lib/python3.5/site-packages/nltk/tokenize/regexp.py", line 172, in __init__
    RegexpTokenizer.__init__(self, r'\s*\n\s*\n\s*', gaps=True)
  File "/home/vilmibm/Documents/gamefaqz/prosaicv/lib/python3.5/site-packages/nltk/tokenize/regexp.py", line 119, in __init__
    self._regexp = compile_regexp_to_noncapturing(pattern, flags)
  File "/home/vilmibm/Documents/gamefaqz/prosaicv/lib/python3.5/site-packages/nltk/internals.py", line 54, in compile_regexp_to_noncapturing
    return sre_compile.compile(convert_regexp_to_noncapturing_parsed(sre_parse.parse(pattern)), flags=flags)
  File "/home/vilmibm/Documents/gamefaqz/prosaicv/lib/python3.5/site-packages/nltk/internals.py", line 50, in convert_regexp_to_noncapturing_parsed
    parsed_pattern.pattern.groups = 1

No entry points defined

While Prosaic is now packaged and on PyPI, its setup.py defines no entry points, meaning it can only be used as a library (unless one manually does the CLI invocation laid out in the readme).

The setup.py should lay out an executable script that allows one to load text / create poems.
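
For reference, the usual setuptools fix looks like this (a sketch; the prosaic.__main__:main target is a guess at where the CLI entry point would live):

from setuptools import setup

setup(
    name='prosaic',
    # ... existing arguments ...
    entry_points={
        'console_scripts': [
            # hypothetical target module and function:
            'prosaic = prosaic.__main__:main',
        ],
    },
)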

tests

for the love of god, please

default rules for templates

ability to set rules that go into every line; most useful in conjunction with max-syllables and min-syllables probably.
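
One way the merge could work internally (an illustrative sketch; the merge precedence is an assumption):

def apply_defaults(defaults: dict, lines: list) -> list:
    # per-line rules win over the poem-wide defaults
    return [{**defaults, **line} for line in lines]

# apply_defaults({'max-syllables': 12}, [{'syllables': 5}, {'keyword': 'sea'}])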

rule(s) for PoS features

Think up some basic part of speech pattern rules and expose them; or perhaps do something like:

{"has_pronoun": true} ?

donno

__init__.hy seems to use both prosaic.* imports and relative imports

Again, trying to run the initial load command, referred to in the README as: hy __init__.hy load some_text0.txt some_mongo_db_name

If I run it from the main github checkout directory:
$ hy prosaic/__init__.hy load molly_webster.txt molly_pro

I get:

  File "/Users/<SNIP>/code/prosaic/prosaic/cthulhu.hy", line 25, in <module>
    (import [dogma [keyword-rule
ImportError: No module named 'dogma'

But if I run it like this, from within the directory:
$ hy __init__.hy load ../molly_webster.txt molly_pro

I get

  File "__init__.hy", line 28, in <module>
    (import [prosaic.nyarlathotep [process-txt!]])
ImportError: No module named 'prosaic.nyarlathotep'

This is after what looked like a successful run of $ python setup.py install

make sentence processing faster

I put effort into speeding up generation but not parsing. There's low-hanging fruit there. Namely, I can use a threadpool when doing the final "process sentence" pass.

Also, I'm doing regex cleanup per sentence; I feel like it's got to be faster to apply a combined regex over the whole raw text, no matter how big. That'd be a good thing to benchmark and prove.
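
The threadpool half of that is a small change; roughly (a sketch, with process_sentence and sentences standing in for whatever the real parsing pass uses):

from concurrent.futures import ThreadPoolExecutor

def process_all(sentences):
    # fan the per-sentence work out across a pool of worker threads
    with ThreadPoolExecutor() as pool:
        return list(pool.map(process_sentence, sentences))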
