Multithread direct reads (gtm issue, CLOSED)

rwynn commented on May 26, 2024
Multithread direct reads

from gtm.

Comments (14)

rwynn commented on May 26, 2024

I'm going down the path of trying https://docs.mongodb.com/manual/reference/command/parallelCollectionScan/ with a configurable number of cursors, each read in a separate goroutine. I need to feature-check this and only make it available for MongoDB 2.6+.
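The "one goroutine per cursor" idea can be sketched with the standard library alone. This is only an illustration of the fan-in pattern, not gtm's code: the `Cursor` interface, `sliceCursor`, and `readAll` are hypothetical names, and the in-memory cursors stand in for whatever cursors the server would return.

```go
package main

import (
	"fmt"
	"sync"
)

// Cursor is a hypothetical stand-in for a database cursor; it is not a
// type from gtm or from any MongoDB driver.
type Cursor interface {
	// Next returns the next document, or false once the cursor is exhausted.
	Next() (doc string, ok bool)
}

// sliceCursor is an in-memory Cursor used only for illustration.
type sliceCursor struct {
	docs []string
	pos  int
}

func (c *sliceCursor) Next() (string, bool) {
	if c.pos >= len(c.docs) {
		return "", false
	}
	doc := c.docs[c.pos]
	c.pos++
	return doc, true
}

// readAll drains every cursor in its own goroutine and funnels the
// documents into one channel, which is closed once all readers finish.
func readAll(cursors []Cursor) <-chan string {
	out := make(chan string)
	var wg sync.WaitGroup
	for _, cur := range cursors {
		wg.Add(1)
		go func(c Cursor) {
			defer wg.Done()
			for doc, ok := c.Next(); ok; doc, ok = c.Next() {
				out <- doc
			}
		}(cur)
	}
	// Close the channel only after every reader goroutine has returned.
	go func() {
		wg.Wait()
		close(out)
	}()
	return out
}

func main() {
	cursors := []Cursor{
		&sliceCursor{docs: []string{"a1", "a2"}},
		&sliceCursor{docs: []string{"b1"}},
		&sliceCursor{docs: []string{"c1", "c2", "c3"}},
	}
	count := 0
	for range readAll(cursors) {
		count++
	}
	fmt.Println("documents read:", count) // prints "documents read: 6"
}
```

Documents from different cursors arrive interleaved in no particular order, so a consumer of this pattern cannot rely on ordering across cursors.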

rwynn commented on May 26, 2024

@makhdumi thank you for pointing me to the mongo-hadoop-connector code. I've added split vector support to monstache and gtm now because of that. I'll take a look into the oplog read lag at some point.

rwynn commented on May 26, 2024

gtm now supports splitting collections and reading them in multiple goroutines, each with a separate connection, using range selectors.
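As a rough sketch of what "range selectors" could look like once split points are known, the following turns a sorted list of split keys into contiguous half-open ranges. It assumes integer `_id` values for simplicity (real collections often use ObjectIds), and `Range` and `rangesFromSplits` are hypothetical names, not gtm's API.

```go
package main

import "fmt"

// Range is a hypothetical half-open selector on _id: Min <= _id < Max.
// A nil bound means the range is unbounded on that side.
type Range struct {
	Min, Max *int
}

// String renders the range for display, using -inf/+inf for open bounds.
func (r Range) String() string {
	lo, hi := "-inf", "+inf"
	if r.Min != nil {
		lo = fmt.Sprint(*r.Min)
	}
	if r.Max != nil {
		hi = fmt.Sprint(*r.Max)
	}
	return "[" + lo + ", " + hi + ")"
}

// rangesFromSplits turns sorted split keys into len(splits)+1 contiguous,
// non-overlapping ranges that together cover every possible key.
func rangesFromSplits(splits []int) []Range {
	ranges := make([]Range, 0, len(splits)+1)
	var prev *int
	for i := range splits {
		key := &splits[i]
		ranges = append(ranges, Range{Min: prev, Max: key})
		prev = key
	}
	return append(ranges, Range{Min: prev, Max: nil})
}

func main() {
	for _, r := range rangesFromSplits([]int{100, 200, 300}) {
		fmt.Println(r)
	}
}
```

Each range could then be handed to its own reader, for example as a query with a lower bound (gte) and an upper bound (lt) on `_id`.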

rwynn commented on May 26, 2024

Hi, the direct read implementation was changed recently in this commit:

58b0ade

The change was made in response to feedback in this issue about a 10 million document collection:

rwynn/monstache#44

It went from multiple cursors, each with coordinated skips and limits, to what you see today with a gte query.
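To make the contrast with skip and limit concrete, here is an in-memory sketch of reading in batches by resuming from the last key seen rather than skipping. It is not the code from the commit referenced above: `fetchFrom` and `readInBatches` are hypothetical helpers, a sorted slice of unique integer ids stands in for an indexed `_id` field, and the batching itself is an assumption made for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// fetchFrom mimics a query of the form "_id >= start, sorted by _id,
// limited to limit results" against a sorted slice of unique ids.
func fetchFrom(ids []int, start, limit int) []int {
	i := sort.SearchInts(ids, start) // first index with ids[i] >= start
	end := i + limit
	if end > len(ids) {
		end = len(ids)
	}
	return ids[i:end]
}

// readInBatches walks the whole slice without any skip: each batch
// resumes just past the last id returned by the previous batch.
func readInBatches(ids []int, limit int) [][]int {
	var batches [][]int
	if len(ids) == 0 || limit < 1 {
		return batches
	}
	start := ids[0]
	for {
		batch := fetchFrom(ids, start, limit)
		if len(batch) == 0 {
			break
		}
		batches = append(batches, batch)
		start = batch[len(batch)-1] + 1
	}
	return batches
}

func main() {
	ids := []int{1, 2, 5, 8, 13, 21, 34}
	fmt.Println(readInBatches(ids, 3)) // prints [[1 2 5] [8 13 21] [34]]
}
```

The point of the pattern is that each lookup is an index seek to `start`, whereas a skip has to walk past all skipped entries, which grows more expensive the deeper into the collection a reader gets.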

The report was closed on feedback that the situation got better for the large 10m collection after the change. It would be nice to go back to concurrent readers but keep the improvements for the other fella.

I will be happy to look at a PR. Maybe you could base it off the code prior to the commit referenced above. At that point multi-cursor was optional via DirectReadersPerCol. As for the chunking method, maybe you could specify that in the Options passed in.

rwynn commented on May 26, 2024

By the way, with regard to sharded vs. non-sharded collections, gtm should be connecting directly to the shards and not going through a mongos connection. So, I'm not sure the query logic needs to take sharding into account. The example suggests connecting to the config server, getting the shards, using a MultiContext, which would be a session for each shard, and setting up a listener on the config server oplog to address new shards being added. Maybe I'm missing something?

makhdumi commented on May 26, 2024

The gte change from skip+limit makes sense. With multiple cursors, I was thinking of multiple range cursors with gte and lt.

Knowing the sharding info just helps with determining the ranges to query on (using the chunks collection), and what the sharding key is.

Otherwise I guess if you don't know that the collection is sharded, you have to query off of _id, i.e. treat it as an unsharded collection? You also have to sort of guess the good ranges: e.g. a simple method would be to just use the total collection size and the min/max _id (I use this for unsharded collections and it performs badly in our case, since the _ids are not at all evenly distributed). Another way would be to "binary search" for each range.

When I use the simple method that results in bad chunking, I get the ~35 Mbps average. But for sharded collections, when I use the already-available ranges in the chunks collection, I get a pretty stable ~80-90 MBps average.
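The "simple method" and its weakness can be sketched as follows, assuming integer `_id` values and a caller-chosen number of ranges. `Bounds`, `evenRanges`, and `countPerRange` are hypothetical names for illustration only. The example data is deliberately skewed to show how equal-width ranges can end up holding very unequal numbers of documents.

```go
package main

import "fmt"

// Bounds is a hypothetical selector on a numeric _id: Gte <= _id < Lt.
type Bounds struct{ Gte, Lt int }

// evenRanges divides the closed interval [minID, maxID] into at most n
// ranges of (nearly) equal width, ignoring how the ids are distributed.
func evenRanges(minID, maxID, n int) []Bounds {
	if n < 1 || maxID < minID {
		return nil
	}
	span := maxID - minID + 1
	if n > span {
		n = span
	}
	width, extra := span/n, span%n
	out := make([]Bounds, 0, n)
	lo := minID
	for i := 0; i < n; i++ {
		w := width
		if i < extra {
			w++ // spread the remainder over the first ranges
		}
		out = append(out, Bounds{Gte: lo, Lt: lo + w})
		lo += w
	}
	return out
}

// countPerRange reports how many ids fall into each range.
func countPerRange(ids []int, ranges []Bounds) []int {
	counts := make([]int, len(ranges))
	for _, id := range ids {
		for i, r := range ranges {
			if id >= r.Gte && id < r.Lt {
				counts[i]++
				break
			}
		}
	}
	return counts
}

func main() {
	ids := []int{1, 2, 3, 4, 5, 6, 7, 1000} // unevenly distributed ids
	ranges := evenRanges(1, 1000, 4)
	fmt.Println(ranges)                     // prints [{1 251} {251 501} {501 751} {751 1001}]
	fmt.Println(countPerRange(ids, ranges)) // prints [7 0 0 1]
}
```

With this data one reader gets seven of the eight documents and two readers get none, which is the kind of bad chunking that ranges taken from real split points (such as the chunks collection) avoid.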

I didn't think of handling when new chunks are added or rebalanced though, which is important (wasn't in my specific situation). I think that makes it a lot trickier.

makhdumi commented on May 26, 2024

I was totally unaware of parallelCollectionScan!

Knowing about this helps a lot with my own stuff too, thank you. I'm looking forward to the implementation in gtm.

rwynn commented on May 26, 2024

Problem is it doesn't seem to work very well with WiredTiger. Looks like it always returns 1 cursor for that engine. And I'm actually seeing slightly better results on 1 million docs with the current code. I guess I need to try other engines to see if it is better in any case.

https://jira.mongodb.org/browse/SERVER-17688

rwynn commented on May 26, 2024

mmapv1 is working great though. Multiple cursors get returned and it is pretty fast.

rwynn commented on May 26, 2024

@makhdumi the parallelCollectionScan stuff has been committed to master now. Please let me know how that goes. Of course, it only works for mmapv1 currently, but should transparently upgrade when WiredTiger is enhanced.

rwynn commented on May 26, 2024

Just added commit 453cc54 which puts each cursor on a separate connection.

This gives pretty good throughput for 10 million tiny docs. The syncing to ES is turned off in the run, so this is just mongo read performance via mmapv1.

time go run monstache.go -direct-read-namespace test.test -exit-after-direct-reads -direct-read-cursors 100
INFO 2018/03/16 21:35:08 Successfully connected to MongoDB version 3.6.3
INFO 2018/03/16 21:35:08 Parallel collection scan command returned 17/100 cursors requested for test.test
INFO 2018/03/16 21:35:08 Starting 17 go routines to read test.test
10000000
INFO 2018/03/16 21:35:58 Shutting down

real    0m52.016s
user    2m9.244s
sys     0m24.640s

With the code that doesn't utilize the collection scan it takes about 3x as long (2m43s vs. 52s).

time go run monstache.go -direct-read-namespace test.test -exit-after-direct-reads -direct-read-cursors 1
INFO 2018/03/16 21:40:39 Successfully connected to MongoDB version 3.6.3
INFO 2018/03/16 21:40:39 Parallel collection scan command returned 1/1 cursors requested for test.test
INFO 2018/03/16 21:40:39 Reverting to single-threaded collection read
10000000
INFO 2018/03/16 21:43:21 Shutting down

real    2m43.619s
user    2m37.740s
sys     0m35.496s

rwynn commented on May 26, 2024

@makhdumi

You might want to give monstache another try. Recent changes have removed the storage of metadata in MongoDB for the purposes of later deletion. This may have been what was slowing it down for you.

makhdumi commented on May 26, 2024

That might have been it, thank you. I actually had to abandon monstache after I was seeing ~70 ms between each oplog read, but I didn't have time to dig further, so I'm not sure if it was my setup or monstache itself.

Thank you for the initial dump improvements as well. I hadn't seen the latest one.

rwynn commented on May 26, 2024

Thanks. I’m looking into adding split vector support. Seems parallel collection scan is going away when mmapv1 does.
