I was thinking that if you have an &str , you can

Yes, treating it as bounded should be fine. My point is that <code class="notranslate

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

`par_split_whitespace` for string about rayon HOT 15 CLOSED

rayon-rs commented on August 17, 2024

`par_split_whitespace` for string

from rayon.

Comments (15)

cuviper commented on August 17, 2024

but utf-8 is designed to make such "alignment" easy, you just have to check for bytes with the 0x80 bit set I think

Yes, need to look at the upper two bits, so something like c & 0xc0 == 0x80 is a continuation byte.

One thing that may be challenging is that you don't know how many substrings there will be, unless you pre-scan the whole thing. Even the number of characters is not known, only the number of bytes. The current par_iter internals only work with an ExactParallelIterator at the base.

from rayon.

edre commented on August 17, 2024

&str has a handy is_char_boundary to do the bit twidding for you.

You can treat the string as a BoundedParallelIterator where every byte is potentially the start of a split. Shouldn't be any different semantics than a filter.

from rayon.

cuviper commented on August 17, 2024

Yes, treating it as bounded should be fine. My point is that internal::bridge is not set up to handle that as a starting point. So you might need something like (0..s.len()) as your base iterator, then filter that for offsets that are characters and match as a separator. This seems much different than what @nikomatsakis outlined.

from rayon.

nikomatsakis commented on August 17, 2024

@cuviper you don't have to use bridge, you can just drive the consumer directly

from rayon.

nikomatsakis commented on August 17, 2024

(But, I agree we probably want a general purpose helper we can reuse -- until now we never had a "base producer" that didn't handle exact counts.)

from rayon.

edre commented on August 17, 2024

The obvious implementation (start at the middle byte, then scan for the first split point) has some algorithmically bad worst-case behavior in the current bridge implementation.

If the input is a large string with no valid split points, the first split will scan the second half of the string, and return (original_string, ""). The next split will start at 1/4 way through the list, scan the next 3/4, and return (original_string, ""). The critical path is O(n log n), worse than the O(n) iterative version.

from rayon.

cuviper commented on August 17, 2024

It wouldn't be crazy for the splits to keep some state tracking what has already been scanned.

from rayon.

edre commented on August 17, 2024

Or the splitter could find the split that includes the midpoint and explicitly include it in one of the halves.

from rayon.

nikomatsakis commented on August 17, 2024

@edre I think the algorithm I had envisioned was somewhat different than the one you've described.

e.g., to split on whitespace, I had imagined something like:

start at middle byte X:
- if it is whitespace, good, split at ..X and X..
- if it is not whitespace, scan forward till we find whitespace at Y, split at ..Y and Y..
  - if we don't find anything, scan backward till we find whitespace at Y, split at ..Y and Y..X

something like that, anyway. Seems like would be O(n) at most?

from rayon.

nikomatsakis commented on August 17, 2024

@cuviper did you implement this?

from rayon.

cuviper commented on August 17, 2024

par_lines snuck into #188, but I haven't done whitespace yet. I was contemplating how to do matching with arbitrary functions (especially char::is_whitespace), then got side-tracked.

from rayon.

nikomatsakis commented on August 17, 2024

Ah, ok, because @schuster was fishing about for other projects. =)

from rayon.

nikomatsakis commented on August 17, 2024

I agree arbitrary functions would be good, but starting with just whitespace might be ok too. Though it seems like supporting some arbitrary F: Fn(char) -> bool would be relatively straightforward, no? I haven't thought about it too hard. Anyway, @schuster, if you're interested in taking a look at implementing par_split_whitespace() -- or some more general form -- let us know.

from rayon.

schuster commented on August 17, 2024

Thanks. I'll look at rev() for now, but if I give up on that one for some reason, maybe I'll come back to this.

from rayon.

cuviper commented on August 17, 2024

I got it the way I wanted in #214. The challenge for Fn was that I wanted to support it in the same par_split function alongside char separators, just as str::split does with its Pattern trait.

from rayon.

`par_split_whitespace` for string about rayon HOT 15 CLOSED

Comments (15)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent