chunkedbase.jl's People

Contributors

drvi

chunkedbase.jl's Issues

Extend synchronization primitives to allow cross-chunk sync

Imagine we want to do type inference on the columns of a CSV file. We'd start with an initial guess for each column and then process the file in chunks, as we always do. Now one of the workers discovers that a column that was until now considered an `Int` has to be a `Float64`. How should we spread this information to the other chunks? Should this be a responsibility of the consume context? Should we define a specific callback that would work as a barrier (we'd stop parsing, wait for all results to enter the barrier, sync their schemas, and release them)? Doing this in `sync_tasks` is problematic, because that only synchronizes chunks belonging to one of the two buffers.

More design work is needed on this issue.
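
As one starting point for that discussion, here is a minimal sketch of a lock-protected schema that workers could widen as they discover new column types. All names (`SharedSchema`, `widen!`, `snapshot`) are hypothetical and are not part of ChunkedBase.jl. The sketch relies only on Base's `promote_type` and `ReentrantLock`.

```julia
# Hypothetical shared schema; none of these names come from ChunkedBase.jl.
struct SharedSchema
    lock::ReentrantLock
    types::Vector{Type}
end

SharedSchema(initial::Vector{Type}) = SharedSchema(ReentrantLock(), copy(initial))

# Widen column `col` so it can also hold values of type `T`.
# Returns true when the shared schema actually changed.
function widen!(schema::SharedSchema, col::Int, ::Type{T}) where {T}
    lock(schema.lock) do
        old = schema.types[col]
        widened = promote_type(old, T)
        schema.types[col] = widened
        return widened !== old
    end
end

# Read a consistent copy of the current schema.
snapshot(schema::SharedSchema) = lock(() -> copy(schema.types), schema.lock)

schema = SharedSchema(Type[Int, Int, String])

# Simulate workers that each discovered something about one column.
discoveries = [(1, Int), (2, Float64), (1, Int), (2, Int)]
@sync for (col, T) in discoveries
    Threads.@spawn widen!(schema, col, T)
end

snapshot(schema)  # column 2 has been widened to Float64
```

This only covers sharing the discovery. Chunks that were already parsed under the narrower type would still need to be converted or re-parsed, which is the part a barrier-style callback would have to address.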

Make `MIN_TASK_SIZE_IN_BYTES` configurable

`MIN_TASK_SIZE_IN_BYTES` is a global constant that we use to decide how much work (in bytes) is worth splitting into multiple tasks. Currently, this is set to 16 KiB, but this number has not been researched at all. We should find a good default value for it and also make it easily configurable (maybe as a field of `ChunkingContext`).
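
A rough sketch of what a configurable threshold could look like follows. `TaskSizingConfig` and `task_count` are hypothetical stand-ins, and the splitting rule (every task gets at least the minimum number of bytes, capped at the worker count) is an assumption rather than ChunkedBase.jl's actual logic.

```julia
# Hypothetical stand-in for a configurable context; the real ChunkingContext differs.
struct TaskSizingConfig
    nworkers::Int
    min_task_size_in_bytes::Int
end

# Keyword constructor that keeps the current 16 KiB value as the default.
function TaskSizingConfig(nworkers::Integer; min_task_size_in_bytes::Integer = 16 * 1024)
    nworkers >= 1 || throw(ArgumentError("nworkers must be positive"))
    min_task_size_in_bytes >= 1 || throw(ArgumentError("min_task_size_in_bytes must be positive"))
    return TaskSizingConfig(nworkers, min_task_size_in_bytes)
end

# Assumed rule: use as many tasks as possible while every task still
# receives at least `min_task_size_in_bytes`, never exceeding `nworkers`.
function task_count(cfg::TaskSizingConfig, nbytes::Integer)
    nbytes <= 0 && return 0
    return clamp(fld(nbytes, cfg.min_task_size_in_bytes), 1, cfg.nworkers)
end

default_cfg = TaskSizingConfig(8)
coarse_cfg = TaskSizingConfig(8; min_task_size_in_bytes = 256 * 1024)

task_count(default_cfg, 10 * 1024)    # below the threshold: a single task
task_count(default_cfg, 1024 * 1024)  # plenty of work: capped at 8 workers
task_count(coarse_cfg, 1024 * 1024)   # larger threshold: fewer, bigger tasks
```

Carrying the value in a per-context field like this would also make it easy to benchmark several thresholds in one session when researching a better default.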

Improve error messages for `FatalLexingError`

These can happen when the buffer is too small to fit the largest row in the input, when the lexer is misconfigured, or when the file is not parseable. It's hard to know which of these is the problem, but we should try to do a bit of analysis on the current chunk (like looking at how strings are escaped and comparing that with our escape character, or checking whether there are any newlines in the chunk) to give our best guess, along with the evidence for it.
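
A minimal sketch of that kind of chunk analysis is shown below. The helper `guess_lexing_failure`, its keyword arguments, and its heuristics are all hypothetical; they only illustrate how evidence could be gathered from the raw bytes and are not taken from ChunkedBase.jl.

```julia
# Hypothetical helper; not part of ChunkedBase.jl.
# Scans a chunk of raw bytes and returns human-readable hints with evidence.
function guess_lexing_failure(chunk::AbstractVector{UInt8};
                              newline::UInt8 = UInt8('\n'),
                              quotechar::UInt8 = UInt8('"'),
                              escapechar::UInt8 = UInt8('"'))
    hints = String[]
    backslash = UInt8('\\')

    n_newlines = count(==(newline), chunk)
    n_quotes = count(==(quotechar), chunk)
    # Count backslashes that are immediately followed by a quote character.
    n_backslash_quotes = count(firstindex(chunk):lastindex(chunk) - 1) do i
        chunk[i] == backslash && chunk[i + 1] == quotechar
    end

    if n_newlines == 0
        push!(hints, "no newline byte found in $(length(chunk)) bytes: the buffer " *
                     "may be smaller than the longest row, or the newline setting is wrong")
    end
    if isodd(n_quotes)
        push!(hints, "odd number of quote characters ($n_quotes): a quoted field may " *
                     "be unterminated or escaped differently than configured")
    end
    if escapechar != backslash && n_backslash_quotes > 0
        push!(hints, "found $n_backslash_quotes backslash-escaped quote(s), but the " *
                     "configured escape character is '$(Char(escapechar))'")
    end
    return hints
end

# A row that uses backslash escapes while the lexer expects doubled quotes.
hints = guess_lexing_failure(codeunits("a,\"he said \\\"hi\\\"\",b"))
foreach(println, hints)
```

The returned hints could be appended to the exception message so that the user sees both the guess and the counts that support it.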

Make tracing easier to set up

The repo contains a small tracing framework, which is currently all commented out, and a `src/_traces.jl` file which visualizes the collected traces using GLMakie. Ideally, the tracing framework would be hidden behind a toggle that is easy to switch without modifying source code, and the plotting capability would be implemented via package extensions.

A comment on #7 suggests a way to implement the on/off switch.
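
For illustration, one simple way to build such a switch is a runtime flag checked by a macro. This is a sketch under assumptions and not necessarily the approach proposed in the linked comment. Every name here (`TRACING_ENABLED`, `@trace_event`, `enable_tracing!`, and the `SKETCH_TRACING` environment variable) is hypothetical.

```julia
# Hypothetical names throughout; this is not the tracing code that ships in the repo.
# The flag can be preset from the environment, so no source edit is needed.
const TRACING_ENABLED = Ref(get(ENV, "SKETCH_TRACING", "0") == "1")
const TRACE_EVENTS = Tuple{Symbol,UInt64}[]
const TRACE_LOCK = ReentrantLock()

# Flip the switch at runtime.
enable_tracing!(on::Bool = true) = (TRACING_ENABLED[] = on; nothing)

# Append a named event with a nanosecond timestamp.
function record_event(name::Symbol)
    lock(TRACE_LOCK) do
        push!(TRACE_EVENTS, (name, time_ns()))
    end
    return nothing
end

# Expands to a guarded call, so a disabled trace point costs one load and one branch.
macro trace_event(name)
    return :(TRACING_ENABLED[] && record_event($(esc(name))))
end

# Stand-in for real work that contains two trace points.
function parse_chunk_sketch(n)
    @trace_event :chunk_start
    total = sum(1:n)
    @trace_event :chunk_end
    return total
end

enable_tracing!(false)
parse_chunk_sketch(10)  # records nothing
enable_tracing!()
parse_chunk_sketch(10)  # records :chunk_start and :chunk_end
```

A runtime flag keeps a small cost at every trace point even when tracing is off. If that matters, a compile-time preference (for example via Preferences.jl) could remove the disabled branch entirely, at the price of recompiling the package when the setting changes.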

TagBot trigger issue

This issue is used to trigger TagBot; feel free to unsubscribe.

If you haven't already, you should update your TagBot.yml to include issue comment triggers.
Please see this post on Discourse for instructions and more details.

If you'd like for me to do this for you, comment TagBot fix on this issue.
I'll open a PR within a few hours, please be patient!

Consider retiring the `AbstractResultBuffer`

We should keep specializing on the result buffer type, but `AbstractResultBuffer` doesn't carry any real interface with it, so we should abandon it and accept e.g. `Dict`s and `Vector`s as valid result buffer types.
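
A small sketch of how specialization can be kept without an abstract supertype: the function below takes the buffer as a type parameter, so Julia compiles a separate specialization for each concrete buffer type that supports `setindex!`. The name `fill_buffer!` is hypothetical and does not mirror the ChunkedBase.jl API.

```julia
# Hypothetical function; it only requires that `buf[row] = value` works.
# The type parameter `B` makes the specialization on the buffer type explicit.
function fill_buffer!(buf::B, values) where {B}
    for (row, value) in enumerate(values)
        buf[row] = value
    end
    return buf
end

# A preallocated Vector and an empty Dict both work as result buffers.
vec_buf = fill_buffer!(Vector{Float64}(undef, 3), [1.5, 2.5, 3.5])
dict_buf = fill_buffer!(Dict{Int,Float64}(), [1.5, 2.5, 3.5])
```

With this style, the contract is whatever methods the consuming code calls on the buffer, which is effectively what the abstract type provides today.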
