Coder Social home page Coder Social logo

lexi's Introduction

                                     lexi                                      

lexi is a simple proof-of-concept for a fast, streaming lexer. this is designed
to lex with only a partial view of the source code available at any moment, and
without needing to take up more memory than necessary.

it is fed some portion of the source text, which it stores in a buffer. that
buffer is then tokenized as usual, until the end of the source buffer is
reached. at any point, more source text can be fed into the lexer. when this
happens, the buffer is shifted so that only the non-processed source is kept,
with the new source text being appended onto that. this way, the memory
footprint can be kept low, even for large amounts of source text.

the demo itself works with pipes - it reads input text from stdin and outputs
each token onto its own line with the token type, lexeme length, and lexeme. it
currently lexes any string of non-(ascii-whitespace) characters as one token.

    $ wc -w readme
    281 readme
    $ cat readme | ./lexi | wc -l
    281


-- performance -----------------------------------------------------------------

lexi consistently uses less than 10KiB for large amounts of random data. this
has been tested on 1M, 100M, 1G, and 5G piped from /dev/urandom, as well as on
`bible.txt` and `world192.txt` from the canterbury corpus and the large corpus
from https://corpus.canterbury.ac.nz/descriptions.

    $ cat bible.txt | /usr/bin/time -f "%M" ./lexi > /dev/null
    2536
    $ cat world192.txt | /usr/bin/time -f "%M" ./lexi > /dev/null
    2592
    $ cat /dev/urandom | head -c 5G | /usr/bin/time -f "%M" ./lexi > /dev/null
    2496

the results can vary depending on what is and how it is measured but in general,
lexi never seems to cross 1MiB. 

the point is, lexi needs *far less* than 5G of memory to lex 5G of data.

lexi's People

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.