amperser / proselint

A linter for prose.

Home Page: http://proselint.com

License: BSD 3-Clause "New" or "Revised" License

JavaScript 29.05% HTML 28.07% CSS 0.30% Python 38.75% Shell 0.21% Ruby 0.04% SCSS 3.57% Procfile 0.01%

linter prose writer advice knowledge language style

proselint's People

inothnagel patchranger andrenanninga carreau uran198 timbuckley utahdave hashin comsecninja tkab traviswhitaker shirish93 jonboiser drkarl darkmfalz jstewmon fdb decorator1991 indentlabs dan-han-101 tchen0123 oshaikh13 jsenn amitms shelltips llsourcell drinks ciarand aenachronism heyheymanman byrass alecglassford ksslng lcd047 palerdot david-a-wheeler viccuad ayberkt joshmgrant saul johnny4000 gwater ericcarraway leftees efueger abaldwinhunter freakyjoe talkingavocado hkjallbring ivarvong vikasgorur willingc rugk catherineh benbaror kylesezhi yilihong infotexture asmeurer pksjce krausefx iemcd chreekat jwworth eric-switzer adammichaelwood themadav silky safezpa j10sanders robaley mcarthurgill jkulesza tommorris bpedroso dhutty mkeshav kristinita mavit jwilk-forks blueyed tkfu ran4 abesto documize bryant1410 feilaoda thearchiver akhmerov kropp mpacer kerri-hicks ivaturi vkt0r yeradis aairey ekmartin code-wins gigahawk ryanmccarl

proselint's Issues

Rubric for hard-to-implement features

Develop a systematic way to describe features in the extracted sources that are hard to implement, noting why they are hard and any clues as to which sources might offer a solution.

Architecture for sharing processed data across rules

If we need to use more computationally intense analyses in multiple rules (e.g., nltk and syntax parsing to identify whether "while" is being used as a conjunction or an adverb), it would make more sense to memoize the output so that other rules can access it rather than rerun the analysis.

This should be a fairly general system that automatically builds the data structures so that they can be shared across individual checks, possibly triggered by some kind of require-type statement that can appear in more than one rule.
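
One way the "require" idea could look is sketched below. This is only an illustration under assumptions: the `analysis` and `requires` decorators, the `run_rules` driver, and the two toy rules are hypothetical names, and `compute_tokens` is a cheap stand-in for an expensive step such as an nltk parse.

```python
ANALYSES = {}          # analysis name -> function that computes it from text
CALLS = {"tokens": 0}  # counts real computations, for demonstration only


def analysis(name):
    """Register a shared analysis under `name` (hypothetical API)."""
    def register(func):
        ANALYSES[name] = func
        return func
    return register


def requires(*names):
    """Declare which shared analyses a rule needs (hypothetical API)."""
    def mark(rule):
        rule.requires = names
        return rule
    return mark


@analysis("tokens")
def compute_tokens(text):
    """Stand-in for an expensive step such as an nltk parse."""
    CALLS["tokens"] += 1
    return text.lower().split()


@requires("tokens")
def check_while(text, tokens):
    """Toy rule: token indices of 'while'."""
    return [i for i, tok in enumerate(tokens) if tok == "while"]


@requires("tokens")
def check_very(text, tokens):
    """Toy rule: token indices of 'very'."""
    return [i for i, tok in enumerate(tokens) if tok == "very"]


def run_rules(text, rules):
    """Run each rule, computing every required analysis at most once."""
    cache = {}
    results = {}
    for rule in rules:
        needed = {}
        for name in getattr(rule, "requires", ()):
            if name not in cache:
                cache[name] = ANALYSES[name](text)
            needed[name] = cache[name]
        results[rule.__name__] = rule(text, **needed)
    return results


text = "While it rained, we stayed in and were very bored."
results = run_rules(text, [check_while, check_very])
```

Both rules ask for "tokens", but `compute_tokens` runs only once per call to `run_rules`, which is the sharing behavior the issue asks for.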

Sort errors by the position in which they occur
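
A minimal sketch, assuming each error is a tuple whose first two fields are a line number and a column offset (the codes and messages below are made up for illustration):

```python
errors = [
    (3, 14, "PL002", "hypothetical message"),
    (1, 40, "PL001", "hypothetical message"),
    (1, 7, "PL003", "hypothetical message"),
]

# Sort on (line, column) only, so the code and message never affect the order.
by_position = sorted(errors, key=lambda e: (e[0], e[1]))
```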

Have American English and British English modes?

Create plugin for emacs

Choose a sensible naming/numbering scheme for errors

The convention is a capital letter followed by a 3-digit code. For example, pep257 uses codes like D100, D302, etc. It might be nice for us to use a 3-letter code for the source of the advice and a 3-digit code for the specific check, e.g., DFW201. The numeric codes can then be organized across sources according to higher-level categories of errors. For example, 100-level codes might be for overused words, phrases, idioms, symbols, and grammatical structures. The 200-level codes might be for nonsensical structures, such as DFW's comparing uncomparables. This fails if a particular author has more than 99 pieces of advice of a particular kind, but if we run into that problem, then we're doing great. If that happens, it might also suggest that our errors could use some compression (e.g., by merging all the overused single words into one check).

The URLs it leads to are nice and compact, too: http://lifelinter.com/DFW201.
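
To make the proposed scheme concrete, here is a small sketch of how such codes might be parsed. The function name `parse_code` and the `CATEGORIES` table are hypothetical; only the 100-level and 200-level meanings come from the proposal above.

```python
import re

# Three capital letters for the source, three digits for the check.
CODE_RE = re.compile(r"^([A-Z]{3})(\d{3})$")

# Only the two categories suggested in the issue; wording is illustrative.
CATEGORIES = {
    1: "overused words and structures",
    2: "nonsensical structures",
}


def parse_code(code):
    """Split a code such as 'DFW201' into (source, number, category)."""
    match = CODE_RE.match(code)
    if match is None:
        raise ValueError("not a valid error code: %r" % code)
    source = match.group(1)
    number = int(match.group(2))
    category = CATEGORIES.get(number // 100, "uncategorized")
    return source, number, category
```

A pep257-style code such as D100 is rejected here because it has only one letter.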

Extract rules from Butterick's Practical Typography

http://practicaltypography.com/

Extract rules from Intelligent Editing website

http://www.intelligentediting.com/resources/

Check for weasel words

Stub placed in checks.

Create a demo where famous authors edit their own text

e.g., DFW on DFW.

Create a Microsoft Word plugin/app

Figure out the right name for a "check"

edit
correction
rule
error
suggestion
guideline
tip
recommendation
pointer

False alarms, corpora, QA, and contributing back

Steps:

1. Implement a check.
2. Apply the check to a huge corpus (e.g., all of Wikipedia).
3. Find false alarms and open them as issues on GitHub.
4. If the fault is in the original source text, notify the author.

Check for common typographical issues

2 x 4 vs. 2 × 4
2-4 vs. 2–4
Bose-Einstein condensate vs. Bose–Einstein condensate
--- vs. —
+/- vs. ±

(Take a look at Jordan's typography talk for some examples.)
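
Several of these could plausibly start as simple regular expressions. The sketch below is a rough first pass with hypothetical names (`TYPOGRAPHY`, `check_typography`). The digit-hyphen-digit pattern is a guess that will also flag dates and phone numbers, and the Bose-Einstein case is left out because it needs knowledge of proper names.

```python
import re

# (pattern, suggested replacement) pairs, written with Unicode escapes.
TYPOGRAPHY = [
    (re.compile(r"(?<=\d )x(?= \d)"), "\u00d7"),   # 2 x 4   -> multiplication sign
    (re.compile(r"(?<=\d)-(?=\d)"), "\u2013"),     # 2-4     -> en dash
    (re.compile(r"(?<!-)---(?!-)"), "\u2014"),     # ---     -> em dash
    (re.compile(r"\+/-"), "\u00b1"),               # +/-     -> plus-minus sign
]


def check_typography(text):
    """Return (offset, matched_text, suggestion) tuples sorted by offset."""
    found = []
    for pattern, suggestion in TYPOGRAPHY:
        for m in pattern.finditer(text):
            found.append((m.start(), m.group(0), suggestion))
    return sorted(found)
```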

Run checks in parallel

There's an opportunity to run the linter in a way that's massively parallel. The main insights here are that many of the rules can be run independently of each other and that they can be run independently on separate parts of the text (e.g., at the paragraph level).
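
A minimal sketch of that idea, using a thread pool from the standard library and two toy checks (all names here are hypothetical). Threads keep the example simple, but for CPU-bound checks in CPython a process pool would probably be needed to see a real speedup.

```python
import re
from concurrent.futures import ThreadPoolExecutor


def check_very(paragraph):
    """Toy check: offsets of the word 'very'."""
    return [m.start() for m in re.finditer(r"\bvery\b", paragraph)]


def check_double_space(paragraph):
    """Toy check: offsets of two or more consecutive spaces."""
    return [m.start() for m in re.finditer(r" {2,}", paragraph)]


CHECKS = [check_very, check_double_space]


def lint_parallel(text, max_workers=4):
    """Run every check on every paragraph.

    Returns a dict mapping (paragraph_index, check_name) to a list of
    offsets within that paragraph.
    """
    paragraphs = text.split("\n\n")
    # Each (paragraph, check) pair is an independent job.
    jobs = [(i, check) for i in range(len(paragraphs)) for check in CHECKS]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        outputs = pool.map(lambda job: job[1](paragraphs[job[0]]), jobs)
        return {
            (i, check.__name__): out
            for (i, check), out in zip(jobs, outputs)
        }
```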

Create a plugin for Pages

Create a Google docs plugin/app

Extract rules from DFW's "Tense Present"

http://instruct.westvalley.edu/lafave/DFW_present_tense.html

This is a particularly nice essay because the first page or two is just a list of idioms and grammatical structures that should be avoided.

Get in touch with Bryan Garner

His book is so thorough and authoritative that it would be great to get him involved in some way, perhaps as an advisor. It would also be amazing if he (or Oxford University Press) allowed deeper integration with the text of his book.

http://www.amazon.com/Garners-Modern-American-Usage-Garner/dp/0195382757

Create a sports detector

One of the entries in GMAU is:

answer back is a common REDUNDANCY, especially in BrE—e.g.: “Hilary and Piers du Pre seem determined to wreak the ultimate revenge on their sister by discrediting her while she lies—unable to answer back [read answer]—in her grave.” Julian Lloyd Webber, “An Insult to Jackie’s Memory,” Daily Telegraph, 4 Jan. 1999, at 15.

In AmE, the phrase is fairly common in sportswriting in the sense “to equal an opponent’s recent scoring effort”—e.g.:
• “Even when the Cougars did score, the Herd answered back in an instant.” Joe Davidson, “Herd Remain on a Roll,” Sacramento Bee, 21 Nov. 1998, at D1.
• “Jake Armstrong quickly answered back for the Knights, but the two-goal cushion was short-lived.” Joe Connor, “La Jolla, Bishop’s Tie One On in Wester,” San Diego Union-Trib., 16 Dec. 1998, at D6.

Some writers have used the sport phrase metaphorically—e.g.: “The last time somebody tried to impose prohibition on Chicago, the city answered back with Al Capone.” Peter Annin, “Prohibition Revisited?” Newsweek, 7 Dec. 1998, at 68. Despite the currency of this usage, answer can carry the entire load by itself.

LANGUAGE-CHANGE INDEX answer back for answer (outside sports): Stage 3

This pattern, where there is an exception to a rule when talking about a particular topic (or where a rule applies only when talking about that topic), will come up many times.
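
A topic-conditioned rule might be structured roughly as follows. This is a crude sketch under assumptions: the word list, the threshold, and both function names are hypothetical, and a real topic detector would need much more than keyword counting.

```python
import re

# Tiny illustrative vocabulary; not a serious sports lexicon.
SPORTS_WORDS = {"goal", "score", "scored", "team", "game", "season", "coach"}

ANSWER_BACK_RE = re.compile(r"\banswer(?:s|ed|ing)?\s+back\b", re.IGNORECASE)


def looks_like_sports(text, threshold=2):
    """Guess the topic: at least `threshold` distinct sports words appear."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return len(words & SPORTS_WORDS) >= threshold


def check_answer_back(text):
    """Flag 'answer back' unless the text looks like sportswriting."""
    if looks_like_sports(text):
        return []
    return [m.start() for m in ANSWER_BACK_RE.finditer(text)]
```

The same shape (a topic guess that gates a rule) would apply to rules that fire only inside a topic, with the condition inverted.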

Extract rules from Strunk & White

Create plugin for vim

Create plugin for Sublime Text

It's et al., not et. al

http://writingcenter.waldenu.edu/34.htm
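
A minimal sketch of this check as a regular expression (the names `ET_AL_RE` and `check_et_al` are hypothetical). It only catches a period placed after "et" and does not try to detect a missing period after "al".

```python
import re

# "et" followed by a period, optional whitespace, then "al" with or
# without its own period.
ET_AL_RE = re.compile(r"\bet\.\s*al\b\.?")


def check_et_al(text):
    """Return (offset, matched_text, suggestion) for each misuse."""
    return [(m.start(), m.group(0), "et al.") for m in ET_AL_RE.finditer(text)]
```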

Figure out test inheritance

Extract rules from Safire's "On Language" columns

Apply memoized rule checks at the paragraph level

Rules are currently defined as functions over the full text of the document. It would be better to apply the functions to each paragraph separately. The reason for this is that, for many documents (especially large ones), most of the paragraphs will not change between saves or keystrokes, such that when these functions are memoized, most of the linter computations will be available right away.
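
The effect can be sketched with `functools.lru_cache`, assuming paragraphs are separated by blank lines. `lint_paragraph` is a hypothetical stand-in for running all rules on one paragraph, and `RUNS` exists only to show which paragraphs were actually recomputed.

```python
import functools
import re

RUNS = []  # paragraphs that were really linted, for demonstration only


@functools.lru_cache(maxsize=4096)
def lint_paragraph(paragraph):
    """Stand-in for running every rule on one paragraph."""
    RUNS.append(paragraph)
    return tuple(m.start() for m in re.finditer(r"\bvery\b", paragraph))


def lint_document(text):
    """Lint each paragraph separately; return (paragraph_index, offset) pairs."""
    return [
        (i, offset)
        for i, paragraph in enumerate(text.split("\n\n"))
        for offset in lint_paragraph(paragraph)
    ]


draft_1 = "A very long day.\n\nNothing to report."
draft_2 = "A very long day.\n\nNothing very new to report."
first = lint_document(draft_1)
second = lint_document(draft_2)
```

On the second save only the edited paragraph is linted again; the unchanged first paragraph is served from the cache.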

Extract rules from Garner's usage guide

http://www.amazon.com/Garners-Modern-American-Usage-Garner/dp/0195382757

Create a .proselintrc file

https://github.com/jshint/jshint/blob/master/examples/.jshintrc

Integrate into Sublime Text as a linter

Add rule about consistency

See, e.g., http://www.intelligentediting.com/

Add rule to detect Black English Vernacular

Using "a" vs. "an"

This is the first entry of Garner's Modern American Usage.

There's a discussion about implementing it on StackOverflow.
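
As a starting point, a naive spelling-based heuristic with small exception lists might look like the sketch below. Everything here is an assumption for illustration: the exception sets are far from complete, the names are hypothetical, and a serious version would need pronunciation data, since the choice depends on sound rather than spelling.

```python
import re

# Words whose first sound does not match their first letter (incomplete).
VOWEL_SOUND = {"hour", "honest", "honor", "heir"}
CONSONANT_SOUND = {"university", "unique", "unit", "user", "one", "european"}

ARTICLE_RE = re.compile(r"\b(a|an)\s+([A-Za-z]+)", re.IGNORECASE)


def expected_article(word):
    """Guess 'a' or 'an' for `word` from its spelling plus the exceptions."""
    w = word.lower()
    if w in VOWEL_SOUND:
        return "an"
    if w in CONSONANT_SOUND:
        return "a"
    return "an" if w[0] in "aeiou" else "a"


def check_a_an(text):
    """Return (offset, matched_text, expected_article) for each mismatch."""
    errors = []
    for m in ARTICLE_RE.finditer(text):
        expected = expected_article(m.group(2))
        if m.group(1).lower() != expected:
            errors.append((m.start(), m.group(0), expected))
    return errors
```

This will misfire when "a" is a letter name or part of an abbreviation, so it is only a baseline to test against the entry in Garner.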

Build online writing editor using http://codemirror.net/?

Create plugin for Atom editor

Detect sexism and biased language

Extract rules from Orwell's "Politics and the English Language"

https://www.mtholyoke.edu/acad/intrel/orwell46.htm

Don't lint quoted text

If I quote someone, the linter shouldn't try to correct me on their prose.
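
One simple approach is to blank out quoted spans before running the checks, replacing them with spaces so that reported offsets still line up with the original text. The sketch below assumes quotations are delimited by straight or curly double quotes on a single line; `mask_quotes` is a hypothetical helper.

```python
import re

# Straight double quotes, or a curly opening/closing pair, on one line.
QUOTE_RE = re.compile(r'"[^"\n]*"|\u201c[^\u201d\n]*\u201d')


def mask_quotes(text):
    """Replace each quoted span with spaces of the same length."""
    return QUOTE_RE.sub(lambda m: " " * len(m.group(0)), text)
```

Block quotations and single-quoted speech are not handled here and would need their own patterns.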

Great writing should come back nearly clean

It would be good to include an automated test suite that runs the linter over writing by a great author that has already been heavily edited and copyedited (e.g., an essay from The New Yorker that went on to win the Pulitzer Prize in nonfiction). The linter should be nearly silent.

Unincorporated clichés from GMAU

The following need some more thought before including.

"inclement weather", ?
"there is wide support" in politics
boasts as a transitive verb,
choreograph used figuratively,
giveth ... taketh away
orchestrate in nonmusical contexts
venerable when used for 'old'

It would also be good to go through all the clichés and think of variant forms that might appear.

Create an API

Create a plugin system

I want a single code file for each check that:

Implements the check.
Includes a docstring that is autogenerated into a web page.
Includes test cases that do and do not raise an error.

Here is a sample of what I'm imagining:

"""DFW001: Comparing uncomparables.

---
layout:     post
error_code: DFW201
source:     David Foster Wallace
title:      PL001&#58; Comparing an uncomparable
date:       2014-06-10 12:31:19
summary:    Comparing an uncomparable.
categories: check

---

David Foster Wallace says:

> This is one of a class of adjectives, sometimes called "uncomparables", that
can be a little tricky. Among other uncomparables are precise, exact, correct,
entire, accurate, preferable, inevitable, possible, false; there are probably
two dozen in all. These adjectives all describe absolute, non-negotiable
states: something is either false or it's not; something is either
inevitable or it's not. Many writers get careless and try to modify
uncomparables with comparatives like more and less or intensives like very. But
if you really think about them, the core assertions in sentences like "War is
becoming increasingly inevitable as Middle East tensions rise"; "Their cost
estimate was more accurate than the other firms'"; and "As a mortician, he has
a very unique attitude" are nonsense. If something is inevitable, it is bound
to happen; it cannot be bound to happen and then somehow even more bound to
happen. Unique already means one-of-a-kind, so the adj. phrase very unique is
at best redundant and at worst stupid, like "audible to the ear" or
"rectangular in shape". You can blame the culture of marketing for some of
this difficulty. As the number and rhetorical volume of US ads increase, we
become inured to hyperbolic language, which then forces marketers to load
superlatives and uncomparables with high-octane modifiers (special --- very
special --- Super-special! --- Mega-Special!!), and so on. A deeper issue
implicit in the problem of uncomparables is the dissimilarities between
Standard Written English and the language of advertising. Advertising English,
which probably deserves to be studied as its own dialect, operates under
different syntactic rules than SWE, mainly because AE's goals and assumptions
are different. Sentences like "We offer a totally unique dining experience";
"Come on down and receive your free gift"; and "Save up to 50 per cent... and
more!" are perfectly OK in Advertising English — but this is because
Advertising English is aimed at people who are not paying close attention.
If your audience is by definition involuntary, distracted and numbed, then free
gift and totally unique stand a better chance of penetrating — and simple
penetration is what AE is all about. One axiom of Standard Written English is
that your reader is paying close attention and expects you to have done the
same.
"""

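import re


def check(text):
    """Return an error tuple for each comparator + uncomparable pair found."""
    error_code = "PL001"
    msg = "Comparison of an uncomparable."  # do formatting thing

    comparators = [
        "very",
        "more",
        "less",
        "extremely",
        "increasingly"
    ]

    uncomparables = [
        "unique",
        "correct",
        "inevitable",
        "possible",
        "false",
        "true"
    ]

    errors = []
    for comp in comparators:
        for uncomp in uncomparables:
            # Raw strings keep \s and \b intact; \b stops matches that
            # begin inside a longer word (e.g. "furthermore unique").
            pattern = r"\b" + comp + r"\s+" + uncomp + r"\b"
            occurrences = [
                m.start() for m in re.finditer(pattern, text, re.IGNORECASE)]
            for o in occurrences:
                errors.append((1, o, error_code, msg))
    return errors


def test1():
    # One case that should raise an error and one that should not.
    text = "As a mortician, he has a very unique attitude."
    assert check(text) == [
        (1, text.index("very"), "PL001", "Comparison of an uncomparable.")]
    assert check("He has a unique attitude.") == []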

Check for lexical illusions

https://github.com/btford/write-good/blob/master/lib/lexical-illusions.js
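
In Python, a check in the same spirit could be a single regular expression with a backreference. This is a sketch with hypothetical names and is not a port of the linked file; it will also flag legitimate repetitions such as "that that".

```python
import re

# A word, whitespace (including a line break), then the same word again.
REPEAT_RE = re.compile(r"\b(\w+)\s+\1\b", re.IGNORECASE)


def check_lexical_illusions(text):
    """Return (offset, repeated_word) for each immediately repeated word."""
    return [(m.start(), m.group(1)) for m in REPEAT_RE.finditer(text)]
```

Allowing `\s+` to span a newline matters, because the illusion is hardest to spot when the repeated word is split across a line break.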

Working out how I can best contribute using GitHub/git

I may need your advice on this one.

I know that making pull requests requires setting up a separate fork of the repo (or at least I think I know that), and I managed to add my fork as a repo, but I'm afraid of pushing changes and overwriting anything you've done.

Or should I not worry about that? This is the kind of thing that is most frustrating about trying to work on these projects: I don't want to break anything, but I'm not always sure how to set things up so that everything properly follows version control protocol.

Regex over semantics?

http://www.clips.ua.ac.be/pages/pattern-search