The USER-OMP package was developed by Axel Kohlmeyer at Temple University. It provides multi-threaded versions of most pair styles, nearly all bonded styles (bond, angle, dihedral, improper), several Kspace styles, and a few fix styles. The package currently uses the OpenMP interface for multi-threading.
Here is a quick overview of how to use the USER-OMP package, assuming one or more 16-core nodes. More details follow.
use -fopenmp with CCFLAGS and LINKFLAGS in Makefile.machine make yes-user-omp make mpi # build with USER-OMP package, if settings added to Makefile.mpi make omp # or Makefile.omp already has settings
lmp_mpi -sf omp -pk omp 16 < in.script # 1 MPI task, 16 threads mpirun -np 4 lmp_mpi -sf omp -pk omp 4 -in in.script # 4 MPI tasks, 4 threads/task mpirun -np 32 -ppn 4 lmp_mpi -sf omp -pk omp 4 -in in.script # 8 nodes, 4 MPI tasks/node, 4 threads/task
Required hardware/software:
Your compiler must support the OpenMP interface. You should have one or more multi-core CPUs so that multiple threads can be launched by each MPI task running on a CPU.
Building LAMMPS with the USER-OMP package:
The lines above illustrate how to include/build with the USER-OMP package in two steps, using the "make" command. Or how to do it with one command as described in Section 4 of the manual.
Note that the CCFLAGS and LINKFLAGS settings in Makefile.machine must include "-fopenmp". Likewise, if you use an Intel compiler, the CCFLAGS setting must include "-restrict".
Run with the USER-OMP package from the command line:
The mpirun or mpiexec command sets the total number of MPI tasks used by LAMMPS (one or multiple per compute node) and the number of MPI tasks used per node. E.g. the mpirun command in MPICH does this via its -np and -ppn switches. Ditto for OpenMPI via -np and -npernode.
You need to choose how many OpenMP threads per MPI task will be used by the USER-OMP package. Note that the product of MPI tasks * threads/task should not exceed the physical number of cores (on a node), otherwise performance will suffer.
As in the lines above, use the "-sf omp" command-line switch, which will automatically append "omp" to styles that support it. The "-sf omp" switch also issues a default package omp 0 command, which will set the number of threads per MPI task via the OMP_NUM_THREADS environment variable.
You can also use the "-pk omp Nt" command-line switch, to explicitly set Nt = # of OpenMP threads per MPI task to use, as well as additional options. Its syntax is the same as the package omp command whose doc page gives details, including the default values used if it is not specified. It also gives more details on how to set the number of threads via the OMP_NUM_THREADS environment variable.
Or run with the USER-OMP package by editing an input script:
The discussion above for the mpirun/mpiexec command, MPI tasks/node, and threads/MPI task is the same.
Use the suffix omp command, or you can explicitly add an "omp" suffix to individual styles in your input script, e.g.
pair_style lj/cut/omp 2.5
You must also use the package omp command to enable the USER-OMP package. When you do this you also specify how many threads per MPI task to use. The command doc page explains other options and how to set the number of threads via the OMP_NUM_THREADS environment variable.
Speed-ups to expect:
Depending on which styles are accelerated, you should look for a reduction in the "Pair time", "Bond time", "KSpace time", and "Loop time" values printed at the end of a run.
You may see a small performance advantage (5 to 20%) when running a USER-OMP style (in serial or parallel) with a single thread per MPI task, versus running standard LAMMPS with its standard un-accelerated styles (in serial or all-MPI parallelization with 1 task/core). This is because many of the USER-OMP styles contain similar optimizations to those used in the OPT package, described in Section 5.3.5.
With multiple threads/task, the optimal choice of number of MPI tasks/node and OpenMP threads/task can vary a lot and should always be tested via benchmark runs for a specific simulation running on a specific machine, paying attention to guidelines discussed in the next sub-section.
A description of the multi-threading strategy used in the USER-OMP package and some performance examples are presented here
Guidelines for best performance:
For many problems on current generation CPUs, running the USER-OMP package with a single thread/task is faster than running with multiple threads/task. This is because the MPI parallelization in LAMMPS is often more efficient than multi-threading as implemented in the USER-OMP package. The parallel efficiency (in a threaded sense) also varies for different USER-OMP styles.
Using multiple threads/task can be more effective under the following circumstances:
Additional performance tips are as follows:
Restrictions:
None.