Comments (14)
I'm going down the path of trying https://docs.mongodb.com/manual/reference/command/parallelCollectionScan/ with a configurable number of cursors, each read in a separate goroutine. I need to feature-check this and only make it available for MongoDB 2.6+.
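Roughly, the reading side might look something like this with the mgo driver (gopkg.in/mgo.v2). This is only a sketch with illustrative names (parallelRead, scanResult, the handle callback), not gtm's actual implementation:

```go
package scan

import (
	"log"
	"sync"

	"gopkg.in/mgo.v2"
	"gopkg.in/mgo.v2/bson"
)

// scanResult mirrors the reply shape of parallelCollectionScan:
// { cursors: [ { cursor: { firstBatch: [...], ns: "db.coll", id: <int64> } } ], ok: 1 }
type scanResult struct {
	Cursors []struct {
		Cursor struct {
			FirstBatch []bson.Raw `bson:"firstBatch"`
			NS         string     `bson:"ns"`
			Id         int64      `bson:"id"`
		} `bson:"cursor"`
	} `bson:"cursors"`
}

// parallelRead asks the server for up to numCursors cursors over db.coll and
// drains each one in its own goroutine, calling handle for every document.
func parallelRead(session *mgo.Session, db, coll string, numCursors int, handle func(bson.M)) error {
	cmd := bson.D{
		{Name: "parallelCollectionScan", Value: coll},
		{Name: "numCursors", Value: numCursors},
	}
	var result scanResult
	// The server may return fewer cursors than requested.
	if err := session.DB(db).Run(cmd, &result); err != nil {
		return err
	}
	c := session.DB(db).C(coll)
	var wg sync.WaitGroup
	for _, cur := range result.Cursors {
		wg.Add(1)
		go func(firstBatch []bson.Raw, cursorID int64) {
			defer wg.Done()
			// NewIter continues reading from the server-side cursor id after
			// the firstBatch returned by the command is exhausted.
			iter := c.NewIter(nil, firstBatch, cursorID, nil)
			for {
				doc := bson.M{}
				if !iter.Next(&doc) {
					break
				}
				handle(doc)
			}
			if err := iter.Close(); err != nil {
				log.Println("cursor error:", err)
			}
		}(cur.Cursor.FirstBatch, cur.Cursor.Id)
	}
	wg.Wait()
	return nil
}
```

Each goroutine could also be given its own copied session (session.Copy()) so every cursor gets a dedicated connection rather than sharing one socket.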
from gtm.
@makhdumi thank you for pointing me to the mongo-hadoop-connector code. I've now added splitVector support to monstache and gtm because of that. I'll look into the oplog read lag at some point.
from gtm.
gtm now supports splitting collections and reading them in multiple goroutines, with separate connections, using range selectors.
from gtm.
Hi, the direct read implementation was changed recently in this commit, due to some feedback received in this issue with respect to a 10 million document collection. It went from multiple cursors, each with coordinated skips and limits, to what you see today: a single pass that pages forward with a gte query.
The report was closed on feedback that the situation got better for the large 10m collection after the change. It would be nice to go back to concurrent readers while keeping the improvements for the other fella.
I will be happy to look at a PR. Maybe you could base it off the code prior to the commit referenced above; at that point multi-cursor was optional via DirectReadersPerCol. As far as the chunking method goes, maybe you could make that configurable in the Options passed in.
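For reference, the gte-style read is essentially keyset pagination on _id. A simplified sketch of the pattern (illustrative names, not the actual monstache code):

```go
package paging

import (
	"gopkg.in/mgo.v2"
	"gopkg.in/mgo.v2/bson"
)

// directRead pages through the whole collection sorted by _id, resuming each
// batch from the last _id seen instead of coordinating skip+limit across cursors.
func directRead(coll *mgo.Collection, batchSize int, handle func(bson.M)) error {
	var lastID interface{}
	for {
		sel := bson.M{}
		if lastID != nil {
			sel["_id"] = bson.M{"$gt": lastID} // resume just after the last document read
		}
		var docs []bson.M
		if err := coll.Find(sel).Sort("_id").Limit(batchSize).All(&docs); err != nil {
			return err
		}
		if len(docs) == 0 {
			return nil // collection exhausted
		}
		for _, d := range docs {
			handle(d)
		}
		lastID = docs[len(docs)-1]["_id"]
	}
}
```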
from gtm.
By the way, with regard to sharded vs. non-sharded: gtm should be connecting directly to the shards and not going through a mongos connection, so I'm not sure the query logic needs to take sharding into account. The example suggests connecting to the config server, getting the shards, using a MultiContext (a session for each shard), and setting up a listener on the config server oplog to handle new shards being added. Maybe I'm missing something?
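For context, "getting the shards" just means reading the config server's config.shards collection. A minimal sketch (addresses and names are illustrative, not gtm code):

```go
package main

import (
	"fmt"

	"gopkg.in/mgo.v2"
)

// shardDoc matches documents in the config server's config.shards collection.
type shardDoc struct {
	ID   string `bson:"_id"`  // e.g. "shard0000"
	Host string `bson:"host"` // e.g. "rs-shard0/host1:27017,host2:27017"
}

func main() {
	// Illustrative config server address.
	cfg, err := mgo.Dial("configsvr.example.com:27019")
	if err != nil {
		panic(err)
	}
	defer cfg.Close()

	var shards []shardDoc
	if err := cfg.DB("config").C("shards").Find(nil).All(&shards); err != nil {
		panic(err)
	}
	for _, s := range shards {
		// A separate session (and oplog listener) would be dialed per shard here.
		fmt.Printf("shard %s at %s\n", s.ID, s.Host)
	}
}
```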
from gtm.
The gte change from skip+limit makes sense. With multiple cursors, I was thinking of multiple range cursors using gte and lt.
Knowing the sharding info just helps with determining the ranges to query on (using the chunks collection) and what the shard key is.
Otherwise, if you don't know that the collection is sharded, you have to query off of _id, i.e. treat it as an unsharded collection? You also have to more or less guess good ranges: a simple method would be to just use the total collection size and the min/max _id (I use this for unsharded collections and it performs badly in our case, since the _ids are not at all evenly distributed). Another way would be to "binary search" for each range.
When I use the simple method that results in bad chunking, I get a ~35 Mbps average. But for sharded collections, when I use the ranges already available in the chunks collection, I get a pretty stable ~80-90 MBps average.
I didn't think of handling new chunks being added or rebalanced though, which is important (it just wasn't an issue in my specific situation). I think that makes it a lot trickier.
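Here is roughly what the chunks-based approach looks like with mgo. It is a sketch with illustrative names: it ignores chunk splits/rebalancing and assumes a single-field shard key, so it is not a drop-in implementation:

```go
package chunked

import (
	"sync"

	"gopkg.in/mgo.v2"
	"gopkg.in/mgo.v2/bson"
)

// chunk matches the min/max bounds stored per chunk in config.chunks.
type chunk struct {
	Min bson.M `bson:"min"` // e.g. {"userId": MinKey} for a {userId: 1} shard key
	Max bson.M `bson:"max"`
}

// readByChunks loads the pre-computed ranges for db.coll from the config
// server and reads each range concurrently with a $gte/$lt query on the shard key.
func readByChunks(configSess, dataSess *mgo.Session, db, coll, shardKey string, handle func(bson.M)) error {
	ns := db + "." + coll
	var chunks []chunk
	if err := configSess.DB("config").C("chunks").Find(bson.M{"ns": ns}).All(&chunks); err != nil {
		return err
	}
	var wg sync.WaitGroup
	for _, ck := range chunks {
		wg.Add(1)
		go func(ck chunk) {
			defer wg.Done()
			s := dataSess.Copy() // dedicated connection per range
			defer s.Close()
			sel := bson.M{shardKey: bson.M{"$gte": ck.Min[shardKey], "$lt": ck.Max[shardKey]}}
			iter := s.DB(db).C(coll).Find(sel).Iter()
			for {
				doc := bson.M{}
				if !iter.Next(&doc) {
					break
				}
				handle(doc)
			}
			iter.Close()
		}(ck)
	}
	wg.Wait()
	return nil
}
```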
from gtm.
I was totally unaware of parallelCollectionScan!
Knowing about this helps a lot with my own stuff too, thank you. I'm looking forward to the implementation in gtm.
from gtm.
The problem is it doesn't seem to work very well with WiredTiger; it looks like it always returns 1 cursor for that engine. And I'm actually seeing slightly better results on 1 million docs with the current code. I guess I need to try other engines to see whether it helps in any case.
https://jira.mongodb.org/browse/SERVER-17688
from gtm.
mmapv1 is working great though. Multiple cursors get returned and it is pretty fast.
from gtm.
@makhdumi the parallelCollectionScan stuff has been committed to master now. Please let me know how that goes. Of course, it only works for mmapv1 currently, but it should transparently improve when WiredTiger is enhanced.
from gtm.
Just added commit 453cc54, which puts each cursor on a separate connection.
This gives pretty good throughput for 10 million tiny docs. The syncing to ES is turned off in this run, so it measures only MongoDB read performance via mmapv1.
time go run monstache.go -direct-read-namespace test.test -exit-after-direct-reads -direct-read-cursors 100
INFO 2018/03/16 21:35:08 Successfully connected to MongoDB version 3.6.3
INFO 2018/03/16 21:35:08 Parallel collection scan command returned 17/100 cursors requested for test.test
INFO 2018/03/16 21:35:08 Starting 17 go routines to read test.test
10000000
INFO 2018/03/16 21:35:58 Shutting down
real 0m52.016s
user 2m9.244s
sys 0m24.640s
With the code that doesn't utilize the parallel collection scan, it takes 3x as long.
time go run monstache.go -direct-read-namespace test.test -exit-after-direct-reads -direct-read-cursors 1
INFO 2018/03/16 21:40:39 Successfully connected to MongoDB version 3.6.3
INFO 2018/03/16 21:40:39 Parallel collection scan command returned 1/1 cursors requested for test.test
INFO 2018/03/16 21:40:39 Reverting to single-threaded collection read
10000000
INFO 2018/03/16 21:43:21 Shutting down
real 2m43.619s
user 2m37.740s
sys 0m35.496s
from gtm.
You might want to give monstache another try. Recent changes have removed the storage of metadata in MongoDB for the purpose of later deletion. This may have been what was slowing it down for you.
from gtm.
That might have been it, thank you. I actually had to abandon monstache after I was seeing ~70 ms between each oplog read, but I didn't have time to dig further, so I'm not sure if it was my setup or monstache itself.
Thank you for the initial dump improvements as well. I hadn't seen the latest one.
from gtm.
Thanks. I'm looking into adding splitVector support. It seems parallelCollectionScan is going away when mmapv1 does.
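For reference, invoking splitVector from Go could look roughly like this. This is a hedged sketch, not gtm code: the exact options, and which database the internal command must be run against, should be verified.

```go
package split

import (
	"gopkg.in/mgo.v2"
	"gopkg.in/mgo.v2/bson"
)

// splitPoints asks the server for split keys over db.coll using the internal
// splitVector command, targeting chunks of roughly maxChunkSizeMB megabytes.
// It returns the boundary documents, e.g. [{"_id": ...}, {"_id": ...}, ...],
// which can then be turned into $gte/$lt range queries.
func splitPoints(session *mgo.Session, db, coll string, maxChunkSizeMB int) ([]bson.M, error) {
	var result struct {
		SplitKeys []bson.M `bson:"splitKeys"`
	}
	cmd := bson.D{
		{Name: "splitVector", Value: db + "." + coll},
		{Name: "keyPattern", Value: bson.M{"_id": 1}},
		{Name: "maxChunkSize", Value: maxChunkSizeMB},
	}
	err := session.DB(db).Run(cmd, &result)
	return result.SplitKeys, err
}
```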
from gtm.