foldingathome / openmm


This project forked from openmm/openmm


OpenMM is a toolkit for molecular simulation using high performance GPU code.

CMake 0.76% Shell 0.10% Python 9.45% HTML 0.01% Makefile 0.06% TeX 0.10% C++ 73.28% C 3.65% Cuda 4.53% Common Lisp 0.01% Batchfile 0.07% PowerShell 0.01% SWIG 0.23% Rich Text Format 7.75%

openmm's People

Contributors

andysim, chayast, chrisdembia, craabreu, dwtowner, frabjous5, jchodera, jing-huang, jlmaccal, joaorodrigues, kyleabeauchamp, leeping, leucinw, mark-mb, mj-harvey, mjschnie, mpharrigan, olllom, peastman, proteneer, rafwiewiora, rmcgibbo, saurabhbelsare, schwancr, sherm1, smikes, sunhwan, swails, thtrummer, z-gong


openmm's Issues

Merge changes from 7.5

I've created a new branch core22-openmm-7.5.0 and merged all the changes from 7.5 into it. We probably want to make that the new default branch.

Considerations on project size (atom count) and AMD (GCN) performance

For some time now I have been trying to analyze the low performance of AMD GPUs (GCN) on small projects and to find improvements. I now see a possible reason for the low performance on small-atom-count simulations. That observation - if confirmed - could help project owners systematically decide which projects will have reasonable performance on AMD GCN-type GPUs and which will not, and thus which projects to assign to which GPUs.

AMD GCN has a wavefront size of 64 threads; however, these 64 threads are not executed in parallel across a full compute unit (CU), but on one SIMD unit (out of 4 per CU), serialized into four consecutive 16-thread parts, so a wavefront takes 4 cycles. Each CU contains 4 SIMD units, and the Radeon VII (my case) has 60 CUs. Consequently, fully occupying the GPU requires not 3840 threads, as one might assume (wavefront size 64 * 60 CUs), but 15360 threads (wavefront size 64 * 4 SIMDs/CU * 60 CUs), executed over 4 cycles. That makes the architecture very wide, wider even than the NVIDIA RTX 3090, but that wide thread count is executed over four cycles rather than one (effectively 3840 threads per cycle).

If my conclusion is correct, it would explain why AMD GCN devices are much more sensitive to small projects than even the widest NVIDIA GPUs. I have to admit that I don't really know how NVIDIA handles its warps, but as far as I understand, they are executed in parallel (not partially serialized as on AMD GCN).

If we correlate the above thoughts with an excerpt of the kernel call characteristics for a sample project, 16921(79,24,68),

Executing Kernel computeNonbonded, workUnits 61440, blockSize 64, size 61440
Executing Kernel reduceForces, workUnits 21024, blockSize 128, size 21120
Executing Kernel integrateLangevinPart1, workUnits 21016, blockSize 64, size 21056
Executing Kernel applySettleToPositions, workUnits 6810, blockSize 64, size 6848
Executing Kernel applyShakeToPositions, workUnits 173, blockSize 64, size 192
Executing Kernel integrateLangevinPart2, workUnits 21016, blockSize 64, size 21056
Executing Kernel clearFourBuffers, workUnits 320760, blockSize 128, size 122880
Executing Kernel findBlockBounds, workUnits 21016, blockSize 64, size 21056
Executing Kernel sortShortList, workUnits 256, blockSize 256, size 256
Executing Kernel sortBoxData, workUnits 21016, blockSize 64, size 21056
Executing Kernel findBlocksWithInteractions, workUnits 21016, blockSize 256, size 21248
Executing Kernel updateBsplines, workUnits 21016, blockSize 64, size 21056
Executing Kernel computeRange, workUnits 256, blockSize 256, size 256
Executing Kernel assignElementsToBuckets, workUnits 21016, blockSize 64, size 21056
Executing Kernel computeBucketPositions, workUnits 256, blockSize 256, size 256
Executing Kernel copyDataToBuckets, workUnits 21016, blockSize 64, size 21056
Executing Kernel sortBuckets, workUnits 21120, blockSize 128, size 21120
Executing Kernel gridSpreadCharge, workUnits 21016, blockSize 64, size 21056
Executing Kernel finishSpreadCharge, workUnits 157464, blockSize 64, size 61440
Executing Kernel packForwardData, workUnits 78732, blockSize 64, size 61440
Executing Kernel execFFT, workUnits 78732, blockSize 108, size 78732
Executing Kernel execFFT, workUnits 78732, blockSize 108, size 78732
Executing Kernel execFFT, workUnits 78732, blockSize 108, size 78732
Executing Kernel unpackForwardData, workUnits 78732, blockSize 64, size 61440
Executing Kernel gridEvaluateEnergy, workUnits 157464, blockSize 64, size 61440
Executing Kernel reciprocalConvolution, workUnits 157464, blockSize 64, size 61440
Executing Kernel packBackwardData, workUnits 78732, blockSize 64, size 61440
Executing Kernel execFFT, workUnits 78732, blockSize 108, size 78732
Executing Kernel execFFT, workUnits 78732, blockSize 108, size 78732
Executing Kernel execFFT, workUnits 78732, blockSize 108, size 78732
Executing Kernel unpackBackwardData, workUnits 78732, blockSize 64, size 61440
Executing Kernel gridInterpolateForce, workUnits 21016, blockSize 64, size 21056
Executing Kernel computeBondedForces, workUnits 23455, blockSize 64, size 23488

we see that many kernels of that project run with 21016 threads (the project is listed with 21000 atoms in the project list). Assuming that different kernels are not executed in parallel, the first 15360 threads will fully occupy the GPU, but the remaining 5656 threads will use only about 1/3 of its capacity.

Other projects, like RUN9 of the benchmark projects (with 4071 atoms), will load the Radeon VII at only about 25% for many of the important kernels.

Conversely, in large projects the partially occupied tail makes up only a small share of the total work, which is why GCN devices generally perform well there.

In conclusion, projects with an atom count slightly below an integer multiple of the effective thread width (15360 for the Radeon VII, gfx906) should run well, while projects slightly above one will run rather poorly (especially when the total atom count is not very high). So a project with 15000 atoms will probably run well, while a project with 16000 atoms will run rather poorly. For GPU utilization optimization, that relation should probably be considered in addition to a general classification of projects to groups of GPUs.
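
To make the argument concrete, here is a minimal sketch of the occupancy estimate described above (my own illustration, not anything from OpenMM), assuming one thread per atom, no kernels running concurrently, and the Radeon VII figures quoted above:

import math

# Effective thread width of a GCN GPU: wavefront size * SIMDs per CU * CUs.
# For the Radeon VII (gfx906): 64 * 4 * 60 = 15360 threads per full pass.
def effective_width(wavefront_size=64, simds_per_cu=4, num_cus=60):
    return wavefront_size * simds_per_cu * num_cus

# Average occupancy when a kernel launches num_threads threads and the GPU
# processes them in whole passes of `width` threads each.
def average_occupancy(num_threads, width=effective_width()):
    passes = math.ceil(num_threads / width)
    return num_threads / (passes * width)

for n in (4071, 15000, 16000, 21016):
    print(f"{n:6d} threads -> ~{average_occupancy(n):.0%} average occupancy")
# 4071 -> ~27%, 15000 -> ~98%, 16000 -> ~52%, 21016 -> ~68%

With those assumptions, a 15000-atom project lands just under one full pass (~98%), while 16000 atoms spills into a second, mostly empty pass (~52%), matching the conclusion above.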

A critical cross-check of the above thoughts is welcome!

cl2.hpp?

@peastman: I'm seeing this warning during compilation---should we do anything about it?

[ 88%] Building CXX object platforms/opencl/sharedTarget/CMakeFiles/OpenMMOpenCL.dir/__/src/OpenCLArray.cpp.o
In file included from /home/conda/openmm/platforms/opencl/src/OpenCLArray.cpp:27:
In file included from /home/conda/openmm/platforms/opencl/./include/OpenCLArray.h:35:
/home/conda/openmm/platforms/opencl/src/cl.hpp:155:9: warning: This version of the OpenCL Host API C++ bindings is deprecated, please use cl2.hpp instead. [-W#pragma-messages]
#pragma message("This version of the OpenCL Host API C++ bindings is deprecated, please use cl2.hpp instead.")

Debugging further NaNs

@peastman: @ThWuensche mentioned in the slack:

With the new version that issue is fixed, but unfortunately with 13424 I get NaNs further on in the runs, though later than before. I captured a few WUs and will run them on local OpenMM (FoldingAtHome/openmm and peastman/openmm) this evening.
I have tried 13424(2186,46,2) locally with the current FoldingAtHome/openmm and it works. However, through F@H it failed with "particle coordinate nan" at step 2007. Can you find something in the logs that would explain that difference? Actually, the behaviour on F@H looks similar to what we had before both patches, but it's quite unlikely that the patches are missing.

It seems likely that we must still be missing some bugfixes that appear in 7.5.0 but not in this patched 7.4.2. I'll try a quick core22 test build from openmm/openmm master right now to see if we can verify this is the case.

Add travis

This is just a reminder to add travis-ci, which may require grabbing more recent versions of .travis.yml and devtools/.

Different performance from test WU run on FAH and on openMM

Over the last few days I have been experimenting with the AMD HIP port of OpenMM on FAH test WUs from the 17102 test project on my Radeon VIIs. I have compared the ns/day results from the FAH benchmark with ns/day values from the same systems run on openmm master (7.5) with the HIP platform and on openmm 7.4.2 as per the branch run in FAH core22.

That the results with the HIP platform differ from those with the OpenCL platform is expected. However, I have also seen a performance difference (of about 10%) between the FAH-reported ns/day values and the ns/day values from a local run of the same system in OpenMM.

For example, RUN10:

  • FAH benchmark results 15 ns/day
  • openMM HIP 13.2 ns/day
  • openMM OpenCL 12.2 ns/day

Or, for example, RUN13:

  • FAH benchmark results 51 ns/day
  • openMM HIP 47.1 ns/day
  • openMM OpenCL 42.9 ns/day

So the results on OpenMM OpenCL appear to be 10%-20% lower than those on FAH, which I don't understand. I would expect the opposite, since the FAH runs also include checkpointing.

It would be good to understand the differences and achieve similar results between OpenCL runs in FAH and OpenCL runs on local OpenMM. As long as there are significant differences, meaningful benchmarks are not possible until a new approach is integrated into a new FAH core, which is a big effort. Being able to run benchmarks in advance directly in OpenMM would help to analyse the performance effects of different changes.

This is the script I used, derived from the script to generate the 17101 (and probably 17102) test WUs:

from simtk import openmm, unit
import time
import os

template = """
<config>
 <numSteps v="{numSteps}"/>  
 <xtcFreq v="{xtcFreq}"/>
 <checkpointFreq v="{checkpointFreq}"/>
 <precision v="mixed"/>
 <xtcAtoms v="solute"/> 
</config>

"""

nsteps = 50000
wu_duration = 10*unit.minutes
ncheckpoints_per_wu = 4

from glob import glob
runs = glob('RUNS/17102*')
runs.sort()

platform = openmm.Platform.getPlatformByName('HIP')
print(platform.getOpenMMVersion())
platform.setPropertyDefaultValue('Precision', 'mixed')

def load(run, filename):    
    with open(os.path.join(run, filename), 'rt') as infile:        
        return openmm.XmlSerializer.deserialize(infile.read())

for run in runs:
    run = run + "/01/"
    print(run)

    # Read core.xml
    coredata = dict()    
    coredata['checkpointFreq'] = 0 #int(nsteps_per_wu / ncheckpoints_per_wu)
    coredata['numSteps'] = 0 #ncheckpoints_per_wu * coredata['checkpointFreq']
    coredata['xtcFreq'] = 0 #coredata['numSteps']

    system = load(run, 'system.xml')
    state = load(run, 'state.xml')
    integrator = load(run, 'integrator.xml')
    
    context = openmm.Context(system, integrator, platform)
    context.setState(state)
    
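    # Time nsteps of dynamics and report throughput in ns/day for comparison with the FAH benchmark values.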
    initial_time = time.time()
    integrator.step(nsteps)
    state = context.getState()
    elapsed_time = (time.time() - initial_time) * unit.seconds
    time_per_step = elapsed_time / nsteps
    ns_per_day = (nsteps * integrator.getStepSize()) / elapsed_time / (unit.nanoseconds/unit.day)
    nsteps_per_wu = int(wu_duration / time_per_step)
    
    print(f'{run} {system.getNumParticles()} particles : {ns_per_day:.1f} ns/day : {coredata}')

What happened to windowsExportCuda.h?

Hello there,

I just wanted to ask what happened to the file windowsExportCuda.h from version 7.4.2.

Is it no longer needed, has it been moved, or something similar?
Thanks for your answers.

Cut release for core21-0.0.13

We apparently neglected to cut a release for the version of core21 compiled for 0.0.12.

This is just a reminder that we should cut a release for 0.0.13.

Update to OpenMM 7.2 code?

@peastman: While OpenMM 7.2 has just been released, the non-forcefield code has been super stable for quite some time. What do you think about updating the core code to match 7.2 and pushing ahead with a modern core 22? @gbowman can help coordinate, and hopefully @jcoffland should be able to lead in putting this together.

simulations always blow up

Hi,

I am trying to run an OpenMM test internally on FAH. The NVT simulation runs successfully on my workstation;
however, it keeps blowing up when I grab it from FAH internally. I think there may be something wrong with the *.xml files. Could anyone help me out? Thanks.

--Hongbin

from simtk.openmm.app import *
from simtk.openmm import *
from simtk.unit import *

pdb = PDBFile('alanine.pdb')

forcefield = ForceField('amber99sb.xml', 'tip3p.xml')
system = forcefield.createSystem(pdb.topology, nonbondedMethod=PME, nonbondedCutoff=1*nanometer, constraints=HBonds)
integrator = LangevinIntegrator(300*kelvin, 1/picosecond, 0.002*picoseconds)


simulation = Simulation(pdb.topology, system, integrator)
simulation.context.setPositions(pdb.positions)
simulation.context.setVelocitiesToTemperature(300*kelvin)

with open('system.xml', 'w') as f:
    system_xml = XmlSerializer.serialize(system)
    f.write(system_xml)

with open('integrator.xml', 'w') as f:
    integrator_xml = XmlSerializer.serialize(integrator)
    f.write(integrator_xml)


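# Capture the full state (positions, velocities, forces, energies, parameters) so it can be written to state.xml below.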
state = simulation.context.getState(getPositions=True, getVelocities=True, getForces=True, getEnergy=True, getParameters=True, enforcePeriodicBox=True)

with open('state.xml', 'w') as f:
    f.write(XmlSerializer.serialize(state))
    print('saved state.xml')

Is SHAKE being misused by Core_22 projects?

FAHCore_22 is being developed using a recent version of OpenMM. It is encountering a significant number of "bad state" failures.

In a discussion of that new FAHCore, a possibly unrelated fact was mentioned: simulations can be accelerated by using larger time steps provided the mass of the H atoms is redistributed, which led me to this paper: "Long-Time-Step Molecular Dynamics through Hydrogen Mass Repartitioning" by Chad W. Hopkins et al. of the University of Florida.

I do not have first-hand knowledge of whether that approach is being applied to the projects that are failing (somebody should check). Neither do I know whether the SHAKE routine is being used (either the Windows version or the Linux version for GPUs).

Near the end of that paper, there is a warning that SHAKE will fail with time steps greater than 1 fs. Could that be what's causing our Bad State errors?

If so, could a simple check of the timestep be incorporated into calls to SHAKE, warning that there may be an upcoming failure? Something along the lines sketched below, perhaps.
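
One possible shape for such a check, as a minimal sketch only: it assumes the core has the deserialized System and Integrator at hand, and the 1 fs threshold and the 1.5 amu cutoff for "not repartitioned" hydrogens are illustrative values from the discussion above, not anything taken from OpenMM or the FAHCore:

from simtk import unit

def warn_if_timestep_risky(system, integrator, threshold=1.0*unit.femtoseconds):
    # Only worry about step sizes above the threshold.
    dt = integrator.getStepSize()
    if dt <= threshold:
        return
    # Particles still carrying roughly their physical ~1 amu mass have not been
    # repartitioned (mass 0 means a virtual site or frozen particle, so skip those).
    light_particles = any(
        0*unit.amu < system.getParticleMass(i) < 1.5*unit.amu
        for i in range(system.getNumParticles()))
    if system.getNumConstraints() > 0 and light_particles:
        print(f"WARNING: step size {dt} exceeds {threshold} with constrained, "
              "non-repartitioned light atoms; SHAKE may fail to converge.")

For reference, hydrogen mass repartitioning itself can be requested in OpenMM through the hydrogenMass argument of ForceField.createSystem().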

Obviously I may be completely wrong about all of this.

What code was used in core18?

We don't have a core18 branch. Is there at least a commit hash that stores what was used to build core18?

The donors are asking us to fix some issues with core18, but I'm not even sure we have any record of how to build it or what went into it.
