
Input pipeline (chemprop, closed, 8 comments)

wengong-jin commented on August 18, 2024
Input pipeline


Comments (8)

yangkevin2 commented on August 18, 2024

Hi Florian,

Most of the slow parts should already be parallelized (in particular, the main training loop should be run on a GPU with CUDA; it'll be very, very slow on a CPU). Is there a particular part that's especially slow for you?

Kevin
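
For readers landing here: a minimal sketch of running a training loop under CUDA in PyTorch. The model, optimizer, and data below are placeholder stand-ins, not chemprop's actual code; the point is only where the `.to(device)` calls go.

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Linear(300, 1).to(device)          # placeholder model
optimizer = torch.optim.Adam(model.parameters())
loss_fn = nn.MSELoss()

features = torch.randn(256, 300)              # one fake batch
targets = torch.randn(256, 1)

# Move each batch onto the same device as the model before the forward pass.
features, targets = features.to(device), targets.to(device)
loss = loss_fn(model(features), targets)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```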


FloMru commented on August 18, 2024

OK, that's odd.
For a batch size of 256 I get around 1 it/s, and my 1080 Ti hovers around 10% utilization.
Because my dataset has around 4 million samples, you can imagine that I wait quite a while for an epoch to finish. :-)
If the slow parts are already parallelized, then the problem has to be on my side.
Do you have an idea what it could be?


yangkevin2 commented on August 18, 2024

Ah, we've never run it on anything quite so big. I guess that would explain why it's taking forever. So it takes ~4 hr per epoch?

Our code caches a lot of the computation during the first epoch, though, so the first epoch is the slowest epoch; subsequent epochs should be roughly 4x faster. (Though the cache for a dataset of that size would use something like half a terabyte of RAM... so if you end up having trouble with memory you can chunk your dataset using the --num_chunks option, which also turns off the caching.)
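
For illustration, the caching idea amounts to memoizing featurization so that only the first epoch pays the cost. A rough sketch with hypothetical names (not chemprop's actual implementation); the memory trade-off is that the cache grows with the number of unique molecules:

```python
# Hypothetical memoization sketch; `featurize` stands in for the expensive
# per-molecule graph construction.
FEATURE_CACHE = {}

def featurize(smiles: str):
    return len(smiles)  # placeholder for building a molecular graph

def featurize_cached(smiles: str):
    if smiles not in FEATURE_CACHE:
        FEATURE_CACHE[smiles] = featurize(smiles)  # first epoch pays the cost
    return FEATURE_CACHE[smiles]                   # later epochs are dict lookups
```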

We may also look into parallelizing some of the CPU computation that happens with each batch, if you're still running into trouble; just let us know. (We haven't done this parallelization yet because we usually just cache that computation during the first epoch.)


FloMru commented on August 18, 2024

> Ah, we've never run it on anything quite so big. I guess that would explain why it's taking forever. So it takes ~4 hr per epoch?

Yeah, roughly; it's more in the 3-hour range.

> Our code caches a lot of the computation during the first epoch, though, so the first epoch is the slowest epoch; subsequent epochs should be roughly 4x faster. (Though the cache for a dataset of that size would use something like half a terabyte of RAM... so if you end up having trouble with memory you can chunk your dataset using the --num_chunks option, which also turns off the caching.)

I should have mentioned that I already had to turn off the caching (unfortunately), because the dataset was using up all my memory.

> We may also look into parallelizing some of the CPU computation that happens with each batch, if you're still running into trouble; just let us know. (We haven't done this parallelization yet because we usually just cache that computation during the first epoch.)

I also started looking into it:
You call the featurization step (mol2graph) directly in the arguments of the encoder's forward pass (e.g., for the MPN it's in mpn.py, line 335).
For parallelization, wouldn't it be better to call this somewhere earlier?
Do you know a better place to start looking for good parallelization opportunities?
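
For illustration, one way to hoist featurization out of forward() is to run it in the DataLoader's collate function, so worker processes featurize batches while the GPU trains. A sketch with hypothetical names (mol2graph is stubbed here; this is not chemprop's actual API):

```python
import torch
from torch.utils.data import DataLoader, Dataset

def mol2graph(smiles_batch):
    return smiles_batch  # stub; real code builds molecular graphs

class SmilesDataset(Dataset):
    def __init__(self, smiles_list, targets):
        self.smiles_list, self.targets = smiles_list, targets

    def __len__(self):
        return len(self.smiles_list)

    def __getitem__(self, idx):
        return self.smiles_list[idx], self.targets[idx]

def collate(batch):
    smiles, targets = zip(*batch)
    # Featurize here, in a DataLoader worker, before the batch reaches the model.
    return mol2graph(list(smiles)), torch.tensor(targets)

loader = DataLoader(SmilesDataset(["CCO", "c1ccccc1"], [0.5, 1.2]),
                    batch_size=2, collate_fn=collate, num_workers=2)
```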


yangkevin2 commented on August 18, 2024

Yeah, from some profiling we've run in the past, I believe the mol2graph step is the slowest CPU-based step, so that's probably the best place to start. We can look into parallelizing it too.
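
As a sketch of what parallelizing that step could look like, a process pool can fan per-molecule featurization out across CPU cores (hypothetical helper names, not chemprop's code):

```python
from multiprocessing import Pool

def featurize_one(smiles: str):
    return len(smiles)  # placeholder for per-molecule graph construction

def featurize_parallel(smiles_list, workers: int = 4):
    # chunksize amortizes inter-process overhead across many small tasks
    with Pool(workers) as pool:
        return pool.map(featurize_one, smiles_list, chunksize=256)

if __name__ == "__main__":
    graphs = featurize_parallel(["CCO", "c1ccccc1", "CCN"] * 1000)
```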


yangkevin2 commented on August 18, 2024

Hi Florian,

Try pulling the master branch again. You can use the new option --no_cache to turn off the caching without having to hack it, and the new option --parallel_featurization to run the CPU-based featurization asynchronously with the model (which will probably become the default in the near future). We observed a ~75% speedup over the previous version with the cache turned off, running with this option on a dataset of about 100k. (This was rather surprising to us too; even though we typically cache from the second epoch onwards, it seems the featurization was still taking more time than we thought.) If you find that it's using too much RAM, you can just decrease the value of the flag --batches_per_queue_group, which would likely cost only a small performance hit. Hope this helps! And please let us know if anything goes wrong when using these new options.
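
To illustrate the idea behind asynchronous featurization (not chemprop's actual implementation): a background worker featurizes batches and pushes them onto a bounded queue while the main loop trains. The queue's maxsize plays the same RAM-bounding role that --batches_per_queue_group plays in the real flag.

```python
import queue
import threading

def featurize(batch):
    return batch  # stub standing in for mol2graph

def producer(batches, out_queue):
    for batch in batches:
        out_queue.put(featurize(batch))  # blocks while the queue is full
    out_queue.put(None)                  # sentinel: no more batches

batches = [["CCO"] * 256 for _ in range(10)]
q = queue.Queue(maxsize=4)  # smaller maxsize => less RAM held in flight
threading.Thread(target=producer, args=(batches, q), daemon=True).start()

while (featurized := q.get()) is not None:
    pass  # the training step would consume `featurized` here
```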

(There are a lot of other code changes since we finally merged our dev branch, but the basic interface should still be the same; most of the new code is just our experimental options. We just merged so that we could sync the branches before making some helpful engineering changes, like the one described above. Please let us know if you encounter any problems, though.)


FloMru commented on August 18, 2024

Hi Kevin,

That was fast!
Thanks a lot; I'll try the new code as soon as possible, but probably not until the new year.
I'll give you feedback, too.

I wish you happy holidays!

Best wishes,
Florian


yangkevin2 commented on August 18, 2024

Sounds good, happy holidays to you too!
