pnevyk / tipograph

A little javascript library and command line tool that makes your written content more typographically correct.

License: MIT License

JavaScript 8.56% HTML 91.44%
converter curly-quotes dash quotes typography typography-rules

Contributors: dependabot[bot], djfarly, mhulse, pnevyk

tipograph's Issues

Tipograph gets thrown off with multiple 2-digit years in a string (or something?)

Hi there,

I think I may have come across a bug in Tipograph relating to smartening a string that has multiple 2-digit years inside the string (or something?). Here’s a test case:

replace.all('I wasn\'t a particular fan of the music in the \'80s. And then she blurted, "I thought you said, \'I don\'t like \'80s music\'?"');

Or if it might help, here’s that string in a more human-readable form:

I wasn't a particular fan of the music in the '80s. And then she blurted, "I thought you said, 'I don't like '80s music'?"

Here’s the result:

I wasn’t a particular fan of the music in the ‘80s. And then she blurted, “I thought you said, ’I don’t like ‘80s music’?”

And here’s the result that I’d expect:

I wasn’t a particular fan of the music in the ’80s. And then she blurted, “I thought you said, ‘I don’t like ’80s music’?”

If you happen to be viewing this issue on GitHub.com, you might be stuck seeing this in San Francisco, which happens to have nearly identical left-single-quote and right-single-quote characters. So if it might help, here’s a screenshot of those test cases but with Helvetica:

[Screenshot: tipograph test case, rendered in Helvetica]

Complete rewrite [Proposal]

The current codebase of Tipograph is old and I have some ideas to make it better.

Tools

  • ES2015 JavaScript with Flow type annotations.

Interface

The Node.js interface should follow common Node conventions: no capital letters.

// ES2015 modules
import { replace, languages } from 'tipograph';
replace('text'); // replace.all()
replace.quotes('text');

// CommonJS
const { replace, languages } = require('tipograph');

Engine

Regexes are a quick and reasonably clear way to express rules. But a regex-based solution has some drawbacks: it has to go through the whole source for every rule, the input has to be plain text (HTML support is currently quite hacky), and I think it can choke on very large inputs.

I am thinking about a different approach based on the theory of finite-state transducers. This would be the data pipeline in the new architecture:

  1. Parse the input file, which can be in any format, and emit two types of tokens: format and content. Format tokens are passed along the pipeline unchanged and then copied to the output. Content tokens are subject to further analysis. The advantage is that Tipograph can eventually support any input format without changes to the core engine.
  2. Tokenize the content tokens into smaller units that are meaningful for typographic analysis. This needs further analysis, but I am thinking of units such as word, space, number, quote and so on.
  3. Feed these tokens into a finite-state transducer, which emits typographically correct tokens. The transducer is driven by its state, so I believe behavior such as quote substitution can be achieved with this architecture. The challenge will be supporting customizable rules as well as turning individual substitutions on and off.
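
The three steps above could be sketched roughly as follows. This is only an illustration under simplifying assumptions, not Tipograph's engine: the names splitFormat, tokenize, createQuoteTransducer and convert are hypothetical, the format parser only recognizes HTML-like tags, and the transducer only handles straight double quotes with a single open/closed state.

```javascript
// Step 1: split the input into format tokens (HTML-like tags, passed through
// untouched) and content tokens (plain text, analysed further).
function splitFormat(input) {
    return input.split(/(<[^>]*>)/).filter(function (part) {
        return part !== '';
    }).map(function (part) {
        var isTag = part[0] === '<' && part[part.length - 1] === '>';
        return { type: isTag ? 'format' : 'content', value: part };
    });
}

// Step 2: tokenize content into word, space, number, quote and other units.
function tokenize(text) {
    var pattern = /([A-Za-z]+)|(\s+)|(\d+)|(")|([\s\S])/g;
    var tokens = [];
    var match;
    while ((match = pattern.exec(text)) !== null) {
        var type = match[1] ? 'word'
            : match[2] ? 'space'
            : match[3] ? 'number'
            : match[4] ? 'quote'
            : 'other';
        tokens.push({ type: type, value: match[0] });
    }
    return tokens;
}

// Step 3: a tiny transducer. Its only state is whether a double quote is
// currently open; every input token is mapped to one output token.
function createQuoteTransducer() {
    var open = false;
    return function (token) {
        if (token.type !== 'quote') {
            return token;
        }
        open = !open;
        return { type: 'quote', value: open ? '\u201C' : '\u201D' };
    };
}

// Wire the steps together; format tokens are copied to the output as-is.
function convert(input) {
    var step = createQuoteTransducer();
    return splitFormat(input).map(function (part) {
        if (part.type === 'format') {
            return part.value;
        }
        return tokenize(part.value).map(function (token) {
            return step(token).value;
        }).join('');
    }).join('');
}
```

Because the transducer state lives outside the per-part loop, a quotation that spans several tags is still opened and closed correctly, while quotes inside tag attributes are never touched.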

This is going to be a long road and I do not have much time for it right now. But hopefully, in the future I (we?) will make Tipograph a much better tool. If you have any comments, feel free to post them here.

Markdown format

Markdown is one of the most widely used formats for writing content, so Tipograph should support it. The parser should follow the CommonMark spec and possibly be extended with some widely used extensions.

There is a very basic parser in the scripts/readme.js file in this repository, but I do not think it is a good place to start.
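
As a rough illustration of what a Markdown format would need to do, the sketch below marks code as format so that typographic rules never touch it. It is far from CommonMark compliant and is not the parser from scripts/readme.js: the name splitMarkdown is hypothetical, and only fenced code blocks and single-backtick inline code spans are recognized.

```javascript
// Hypothetical sketch: split Markdown source into format tokens (code, left
// untouched) and content tokens (prose, open to typographic rules).
function splitMarkdown(source) {
    // With one capturing group, split() puts the captured code at odd indices.
    var parts = source.split(/(`{3}[\s\S]*?`{3}|`[^`\n]+`)/);
    var tokens = [];
    parts.forEach(function (value, index) {
        if (value === '') {
            return;
        }
        tokens.push({
            type: index % 2 === 1 ? 'format' : 'content',
            value: value
        });
    });
    return tokens;
}
```

A real implementation would more likely build on a full CommonMark parser, since constructs such as indented code blocks, links and HTML blocks also need to be passed through unchanged.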

Changes information

Optionally collect the changes that rules make to the source text. This can be used to tell the user what has changed, and it also helps when implementing WYSIWYG editors, where the cursor must be moved to account for the changes.

The changes can be identified by computing an edit distance/alignment between the original and the converted text. The vast majority of characters are not affected by the conversion, so something more clever than traditional dynamic programming over the whole texts should be used.

This change information can be retrieved via a callback passed to the conversion. The callback's return value also overrides the value returned from the "typo" function. The callback is, of course, optional.

var typo = tipograph();

// keep the original behavior
var converted = typo(original, function (converted, changes) {
    // process the changes
    return converted;
});

// make a structure
var contentAndChanges = typo(original, function (converted, changes) {
    return { content: converted, changes: changes };
});

// stream
var fs = require('fs');

fs.createReadStream('input.txt')
    .pipe(tipograph.createStream({ /* options */ }, function (converted, changes) {
        return { content: converted, changes: changes };
    }))
    .pipe(/* a stream that processes contentAndChanges */);

The changes object will have an interface similar to the following:

var original = '"Foo --- bar"';
var converted = '\u201CFoo\u200A\u2014\u200Abar\u201D';
// Array<[fromRange, toRange]>
var changes = [
    [[0, 1], [0, 1]], // '"' -> '\u201C'
    [[4, 9], [4, 7]], // ' --- ' -> '\u200A\u2014\u200A'
    [[12, 13], [10, 11]] // '"' -> '\u201D'
];
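
To make the ranges concrete, the following sketch resolves each change into the substrings it refers to. The helper name describeChanges is hypothetical and not part of Tipograph; it only assumes that the ranges are half-open [start, end) index pairs, as in the example above.

```javascript
// Hypothetical helper: resolve every [fromRange, toRange] pair into the
// substrings it covers. Ranges are treated as half-open: [start, end).
function describeChanges(original, converted, changes) {
    return changes.map(function (change) {
        var from = change[0];
        var to = change[1];
        return {
            from: original.slice(from[0], from[1]),
            to: converted.slice(to[0], to[1])
        };
    });
}

var original = '"Foo --- bar"';
var converted = '\u201CFoo\u200A\u2014\u200Abar\u201D';
var changes = [
    [[0, 1], [0, 1]],
    [[4, 9], [4, 7]],
    [[12, 13], [10, 11]]
];

var described = describeChanges(original, converted, changes);
// described[1] is { from: ' --- ', to: '\u200A\u2014\u200A' }
```

A WYSIWYG editor could use the same ranges to shift the cursor: every change that ends before the cursor moves it by the difference between the lengths of the two ranges.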

Rough idea of the algorithm:

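// Finds the first pair of equal characters in the two strings.
// Returns [a, b] where fst[a] == snd[b] and for all i, j, i < a, j < b, fst[i] != snd[j].
// Returns null if no such a, b exist.
// NOTE: searching by increasing a + b is one possible way to satisfy this contract.
function align(fst, snd) {
    for (var sum = 0; sum <= fst.length + snd.length - 2; sum++) {
        // keep b = sum - a inside snd
        for (var a = Math.max(0, sum - snd.length + 1); a <= sum && a < fst.length; a++) {
            if (fst[a] === snd[sum - a]) {
                return [a, sum - a];
            }
        }
    }
    return null;
}

function computeChanges(original, converted) {
    // artificial chars which always match each other
    original += '\0';
    converted += '\0';

    var changes = [];
    var i = 0;
    var j = 0;
    while (i < original.length && j < converted.length) {
        if (original[i] === converted[j]) {
            i++;
            j++;
        } else {
            var alignment;
            var bound = 5;

            // NOTE: this loop is guaranteed to terminate because of '\0' at the ends
            do {
                alignment = align(original.slice(i, i + bound), converted.slice(j, j + bound));
                bound *= 2;
            } while (alignment === null);

            // everything before the aligned pair is one change
            changes.push([[i, i + alignment[0]], [j, j + alignment[1]]]);
            // skip the change and the matching character that follows it
            i += alignment[0] + 1;
            j += alignment[1] + 1;
        }
    }

    // remove the artificial '\0' from the end
    converted = converted.slice(0, -1);

    return [converted, changes];
}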

Postprocessing

For some use cases it might be handy to replace Unicode characters in the converted text with format-specific sequences (e.g. \u2026 to &hellip; for HTML and \textellipsis for LaTeX).
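
A minimal sketch of such a postprocessing step might look like this. The names SEQUENCES and postprocess are hypothetical, only the ellipsis entries come from the text above, and the remaining entries are standard HTML entities and LaTeX text commands added for illustration. The LaTeX commands are followed by {} here so that they do not swallow a following space.

```javascript
// Hypothetical replacement tables, keyed by output format.
var SEQUENCES = {
    html: {
        '\u2026': '&hellip;',
        '\u2014': '&mdash;',
        '\u201C': '&ldquo;',
        '\u201D': '&rdquo;'
    },
    latex: {
        '\u2026': '\\textellipsis{}',
        '\u2014': '\\textemdash{}',
        '\u201C': '\\textquotedblleft{}',
        '\u201D': '\\textquotedblright{}'
    }
};

// Replace every character that has an entry in the chosen table and leave
// all other characters untouched.
function postprocess(text, format) {
    if (!Object.prototype.hasOwnProperty.call(SEQUENCES, format)) {
        throw new Error('Unknown format: ' + format);
    }
    var table = SEQUENCES[format];
    return text.split('').map(function (ch) {
        return Object.prototype.hasOwnProperty.call(table, ch) ? table[ch] : ch;
    }).join('');
}
```

Keeping the tables as plain data would let users add their own formats or override single entries without touching the conversion rules.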

TypeError: Cannot read property 'all' of undefined

Hello,

Thanks for this code!

I have this function:

import tipograph from 'tipograph';
makePrettyItem: function(item) {
	return tipograph.Replace.all( ... );
},

I am getting:

		return _tipograph2.default.Replace.all( ... );
		                                   ^

TypeError: Cannot read property 'all' of undefined

I can't seem to figure out how to migrate this old syntax to the new ...

Can you point me in the right direction?
