Coder Social home page Coder Social logo

Comments (15)

cuviper avatar cuviper commented on August 17, 2024

but utf-8 is designed to make such "alignment" easy, you just have to check for bytes with the 0x80 bit set I think

Yes, need to look at the upper two bits, so something like c & 0xc0 == 0x80 is a continuation byte.

One thing that may be challenging is that you don't know how many substrings there will be, unless you pre-scan the whole thing. Even the number of characters is not known, only the number of bytes. The current par_iter internals only work with an ExactParallelIterator at the base.

from rayon.

edre avatar edre commented on August 17, 2024

&str has a handy is_char_boundary to do the bit twidding for you.

You can treat the string as a BoundedParallelIterator where every byte is potentially the start of a split. Shouldn't be any different semantics than a filter.

from rayon.

cuviper avatar cuviper commented on August 17, 2024

Yes, treating it as bounded should be fine. My point is that internal::bridge is not set up to handle that as a starting point. So you might need something like (0..s.len()) as your base iterator, then filter that for offsets that are characters and match as a separator. This seems much different than what @nikomatsakis outlined.

from rayon.

nikomatsakis avatar nikomatsakis commented on August 17, 2024

@cuviper you don't have to use bridge, you can just drive the consumer directly

from rayon.

nikomatsakis avatar nikomatsakis commented on August 17, 2024

(But, I agree we probably want a general purpose helper we can reuse -- until now we never had a "base producer" that didn't handle exact counts.)

from rayon.

edre avatar edre commented on August 17, 2024

The obvious implementation (start at the middle byte, then scan for the first split point) has some algorithmically bad worst-case behavior in the current bridge implementation.

If the input is a large string with no valid split points, the first split will scan the second half of the string, and return (original_string, ""). The next split will start at 1/4 way through the list, scan the next 3/4, and return (original_string, ""). The critical path is O(n log n), worse than the O(n) iterative version.

from rayon.

cuviper avatar cuviper commented on August 17, 2024

It wouldn't be crazy for the splits to keep some state tracking what has already been scanned.

from rayon.

edre avatar edre commented on August 17, 2024

Or the splitter could find the split that includes the midpoint and explicitly include it in one of the halves.

from rayon.

nikomatsakis avatar nikomatsakis commented on August 17, 2024

@edre I think the algorithm I had envisioned was somewhat different than the one you've described.

e.g., to split on whitespace, I had imagined something like:

  • start at middle byte X:
    • if it is whitespace, good, split at ..X and X..
    • if it is not whitespace, scan forward till we find whitespace at Y, split at ..Y and Y..
      • if we don't find anything, scan backward till we find whitespace at Y, split at ..Y and Y..X

something like that, anyway. Seems like would be O(n) at most?

from rayon.

nikomatsakis avatar nikomatsakis commented on August 17, 2024

@cuviper did you implement this?

from rayon.

cuviper avatar cuviper commented on August 17, 2024

par_lines snuck into #188, but I haven't done whitespace yet. I was contemplating how to do matching with arbitrary functions (especially char::is_whitespace), then got side-tracked.

from rayon.

nikomatsakis avatar nikomatsakis commented on August 17, 2024

Ah, ok, because @schuster was fishing about for other projects. =)

from rayon.

nikomatsakis avatar nikomatsakis commented on August 17, 2024

I agree arbitrary functions would be good, but starting with just whitespace might be ok too. Though it seems like supporting some arbitrary F: Fn(char) -> bool would be relatively straightforward, no? I haven't thought about it too hard. Anyway, @schuster, if you're interested in taking a look at implementing par_split_whitespace() -- or some more general form -- let us know.

from rayon.

schuster avatar schuster commented on August 17, 2024

Thanks. I'll look at rev() for now, but if I give up on that one for some reason, maybe I'll come back to this.

from rayon.

cuviper avatar cuviper commented on August 17, 2024

I got it the way I wanted in #214. The challenge for Fn was that I wanted to support it in the same par_split function alongside char separators, just as str::split does with its Pattern trait.

from rayon.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.