
xpmem's Introduction

# Introduction
This is an experimental version of XPMEM based on a version provided by Cray
and uploaded to https://code.google.com/p/xpmem. This version supports Linux
kernels 3.12 and newer (tested up to 5.8.x). Keep in mind there may be bugs,
and this version may cause kernel panics, crash your code, eat your cat, etc.

XPMEM is a Linux kernel module that enables a process to map the
memory of another process into its virtual address space. Source code
can be obtained by cloning the Git repository, cloning the original
Mercurial repository, or downloading a tarball from the link above.

The XPMEM API has three main functions:

  xpmem_make()    
  xpmem_get()
  xpmem_attach()

A process calls xpmem_make() to export a region of its virtual address
space. Other processes can then attach to the region by calling
xpmem_get() and xpmem_attach(). After a memory region is attached, it
is accessed via direct loads and stores. This enables upper-level
protocols such as MPI and SHMEM to perform single-copy address-space
to address-space transfers, completely at user-level.
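
A minimal sketch of the flow, assuming the standard libxpmem user-level
API (error handling omitted; buf and size are placeholders):

  #include <xpmem.h>

  /* Exporting process: share [buf, buf + size) with other processes. */
  xpmem_segid_t segid = xpmem_make(buf, size, XPMEM_PERMIT_MODE, (void *)0666);
  /* ... communicate segid to the consumer, e.g. via a file or a pipe ... */

  /* Attaching process: map the exported region into its address space. */
  xpmem_apid_t apid = xpmem_get(segid, XPMEM_RDWR, XPMEM_PERMIT_MODE, NULL);
  struct xpmem_addr addr = { .apid = apid, .offset = 0 };
  void *mapped = xpmem_attach(addr, size, NULL);

  /* 'mapped' now aliases 'buf' in the exporter; plain loads and stores
     are single-copy transfers. Tear down in reverse order: */
  xpmem_detach(mapped);
  xpmem_release(apid);
  xpmem_remove(segid);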

Note, there is a limitation on the use of an attached region: any
system call that invokes get_user_pages() on the region from the
non-owning process will get EFAULT. This includes pthread mutexes
and condition variables, and System V semaphores. We intend to address
this limitation in a future release.
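
For illustration, a hedged sketch of the failure mode (reusing the
'mapped' attachment from the example above); this is the mechanism
underneath the pthread and System V failures just mentioned:

  #include <errno.h>
  #include <linux/futex.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  int *futex_word = (int *)mapped;   /* lives in an xpmem attachment */

  *futex_word = 0;                   /* direct stores are fine */

  /* futex() resolves the address via get_user_pages(), so from the
     non-owning process this is expected to fail with EFAULT. */
  long rc = syscall(SYS_futex, futex_word, FUTEX_WAIT, 0, NULL, NULL, 0);
  /* expected: rc == -1 && errno == EFAULT */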

XPMEM regions are free to have "holes" in them, meaning virtual memory
regions that are not allocated. This makes XPMEM somewhat more
flexible than mmap(). A process could, for example, export a region
via XPMEM starting at address 0 and extending 4 GB. Accesses to
allocated (valid) virtual addresses in this region proceed normally,
and pages are mapped between address spaces on demand. A segfault will
occur if the source process or any other process mapping the region
tries to access an unallocated (invalid) virtual address in the
region.

# Known issues

* Memory regions mapped with XPMEM cannot be pinned with
  [ibv_reg_mr](https://linux.die.net/man/3/ibv_reg_mr)

xpmem's People

Contributors

alex--m, artemy-mellanox, hjelmn, jdinan, jsquyres, shamisp, shogo-matsumoto, tonycurtis, tzafrir-mellanox, yosefe


xpmem's Issues

Fail to build on 5.8/x86_64: implicit declaration of function 'flush_tlb_mm_range'

I tried building xpmem with kernel 5.11 and got the following error:

  CC [M]  /home/tzafrirc/Proj/Packs/xpmem/xpmem/kernel/xpmem_mmu_notifier.o
In file included from /home/tzafrirc/Proj/Packs/xpmem/xpmem/kernel/xpmem_mmu_notifier.c:20:
./arch/x86/include/asm/tlb.h: In function ‘tlb_flush’:
./arch/x86/include/asm/tlb.h:24:2: error: implicit declaration of function ‘flush_tlb_mm_range’; did you mean ‘flush_icache_range’? [-Werror=implicit-function-declaration]
   24 |  flush_tlb_mm_range(tlb->mm, start, end, stride_shift, tlb->freed_tables);
      |  ^~~~~~~~~~~~~~~~~~
      |  flush_icache_range

Kernel 5.7 is OK, but kernels 5.8 onward break. I tried changing the order of includes to place asm/tlbflush.h before asm/tlb.h, but this function does not seem to be available to modules.

HPE Adapt (1 of at least 4) - Non-native huge page support

HPE is planning to work with the community to enhance this xpmem repo, to provide the functionality currently supported by our in-house COS-only version. This issue is number 1 in a series of at least 4, and seeks to identify the known issues that must be addressed.

Use conditional compilation to allow support of non-native huge pages, when built for a kernel that supports this feature (e.g., COS)

audit kernel module namespace security

We'd like to use XPMEM in environments with user namespaces, and perhaps other namespaces, enabled. A prerequisite for this is for the kernel module to be secure when used from inside namespaces.

For example, the UID used for access checking needs to be the real host UID, not the in-namespace UID. But I think there are other things to check too.
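
For illustration only (a sketch, not the module's current code), such a
check would resolve the caller's credentials against the initial user
namespace:

#include <linux/cred.h>
#include <linux/uidgid.h>
#include <linux/user_namespace.h>

/* Resolve the caller's UID in the initial (host) user namespace, so that
   a container root (in-namespace UID 0) is not mistaken for the host UID
   that the access check was written against. */
uid_t host_uid = from_kuid(&init_user_ns, current_uid());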

I spoke with @jlowellwofford today and he told me he might be able to help with this.

Performance issue

Hi,

I may have found a performance issue in XPMEM.

  1. xpmem_attach() uses remap_pfn_range() to map pages into the attaching process's VMA.
  2. But this function splits 2 MB huge pages into 4 KB pages.
  3. 4 KB TLB misses will therefore increase, which is a performance issue.

Is there any way to fix this issue?
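
For context, a simplified sketch of the mapping step from point 1 (assumed
variable names, not verbatim xpmem code). Because remap_pfn_range() only
installs 4 KB PTEs, a 2 MB huge page in the source ends up as 512 separate
entries in the attacher; a fix would likely require PMD-level insertion
(e.g. a huge_fault handler built on vmf_insert_pfn_pmd()), which the
current fault path does not attempt:

/* Map one source page frame into the attaching process's VMA. */
int err = remap_pfn_range(vma,                /* attacher's VMA         */
                          vaddr & PAGE_MASK,  /* faulting user address  */
                          pfn,                /* source page frame      */
                          PAGE_SIZE,          /* one 4 KB page          */
                          vma->vm_page_prot);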

Zombie/defunct processes caused by xpmem?

I'm using xpmem in our home-brew application (Open MPI + our own xpmem usage for in-node communication) on an AMD EPYC cluster running RHEL 7.7 (Maipo), kernel 3.10.0-1062.9.1.el7.x86_64. Sometimes after the application finishes, multiple compute nodes have plenty of zombie/defunct processes that never die. Looking at the stack of some of those processes, I see this:

[<ffffffffb6acbf5e>] __synchronize_srcu+0xfe/0x150
[<ffffffffb6acbfcd>] synchronize_srcu+0x1d/0x20
[<ffffffffb6c1c10d>] mmu_notifier_unregister+0xad/0xe0
[<ffffffffc0b5e614>] xpmem_mmu_notifier_unlink+0x54/0x97 [xpmem]
[<ffffffffc0b5a13d>] xpmem_flush+0x13d/0x1c0 [xpmem]
[<ffffffffb6c47ce7>] filp_close+0x37/0x90
[<ffffffffb6c6b0b8>] put_files_struct+0x88/0xe0
[<ffffffffb6c6b1b9>] exit_files+0x49/0x50
[<ffffffffb6aa2022>] do_exit+0x2b2/0xa50
[<ffffffffb6aa283f>] do_group_exit+0x3f/0xa0
[<ffffffffb6aa28b4>] SyS_exit_group+0x14/0x20
[<ffffffffb718dede>] system_call_fastpath+0x25/0x2a
[<ffffffffffffffff>] 0xffffffffffffffff

So they seem to be hanging in some XPMEM-related process cleanup. This is strange for a few reasons: I checked, and in the code I match each xpmem_attach with an xpmem_detach. Also, it seems strange that the kernel would be unable to end a process just because xpmem is unable to perform cleanup.

Does anyone have any ideas as to what might be the problem here?

Thanks a lot!

separate prefix for module?

Hi,

when building xpmem for a modules environment with ./configure --prefix=/cm/shared/apps/xpmem/2.6.5, the kernel module installs to /cm/shared/apps/xpmem/2.6.5/lib/modules/`uname -r`

Is it possible to add a kernel module install path that is separate from the prefix install path?

torel@n005:~/workspace/XPMEM/xpmem-2.6.5-aarch64/kernel$ egrep -i -e "libdir|kerneldir|moduledir" Makefile
pkglibdir = $(libdir)/xpmem
am__installdirs = "$(DESTDIR)$(initdir)" "$(DESTDIR)$(moduledir)"
kerneldir = /lib/modules/4.15.0-117-generic/build/
libdir = ${exec_prefix}/lib
moduledir = $(libdir)/modules/4.15.0-117-generic/

Brgds,
Tor

`PDE_DATA` problems on AARCH64

In xpmem_pfn.c I noticed the following lines:

#if LINUX_VERSION_CODE < KERNEL_VERSION(5,17,0)
#define pde_data(indoe) PDE_DATA(inode)
#elif LINUX_VERSION_CODE < KERNEL_VERSION(3,10,0)
#define pde_data(inode) ((PDE(inode)->data))
#endif

First, I believe #define pde_data(indoe) PDE_DATA(inode) is a typo ("indoe" instead of "inode")?

Second, isn't it more logical to check for versions below 3.10.0 first (as below)? As written, the second branch is unreachable, since any kernel older than 3.10.0 also satisfies the first check.

#if LINUX_VERSION_CODE < KERNEL_VERSION(3,10,0)
#define pde_data(inode) ((PDE(inode)->data))
#elif LINUX_VERSION_CODE < KERNEL_VERSION(5,17,0)
#define pde_data(inode) PDE_DATA(inode)
#endif

Third, I tried to compile it on Rocky Linux release 9.3 (Blue Onyx) on AARCH64 with kernel version 5.14.0-362.18.1.el9_3.0.1.aarch64; that kernel already uses pde_data instead of PDE_DATA.

Build errors on kernel 6.1 and higher

During a build on kernel 6.4.2, the following error message appears:

$ make
[...]
  CC [M]  /home/pfriese/build/xpmem/kernel/xpmem_attach.o
/home/pfriese/build/xpmem/kernel/xpmem_attach.c: In function ‘xpmem_attach’:
/home/pfriese/build/xpmem/kernel/xpmem_attach.c:510:62: error: ‘struct vm_area_struct’ has no member named ‘vm_next’
  510 |                                 ; existing_vma = existing_vma->vm_next) {
      |                                                              ^~
/home/pfriese/build/xpmem/kernel/xpmem_attach.c:530:23: error: assignment of read-only member ‘vm_flags’
  530 |         vma->vm_flags |=
[...]

The first issue originates from commit 763ecb0 on 2022-09-27, titled "mm: remove the vma linked list".

The second issue originates from commit bc292ab on 2023-02-10, titled "mm: introduce vma->vm_flags wrapper functions".
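
Hedged sketches of the corresponding fixes (version cutoffs approximate;
xpmem_walk_and_flag is a name invented here and VM_DONTCOPY is a
placeholder, not the flags xpmem actually sets):

#include <linux/mm.h>
#include <linux/version.h>

static void xpmem_walk_and_flag(struct mm_struct *mm, unsigned long vaddr)
{
    struct vm_area_struct *vma;

#if LINUX_VERSION_CODE >= KERNEL_VERSION(6, 1, 0)
    /* (1) commit 763ecb0: vm_next is gone; walk with the VMA iterator. */
    VMA_ITERATOR(vmi, mm, 0);
    for_each_vma(vmi, vma) {
#else
    for (vma = mm->mmap; vma; vma = vma->vm_next) {
#endif
        if (vma->vm_start <= vaddr && vaddr < vma->vm_end) {
#if LINUX_VERSION_CODE >= KERNEL_VERSION(6, 3, 0)
            /* (2) commit bc292ab: vm_flags is write-protected; use the
               wrapper instead of assigning directly. */
            vm_flags_set(vma, VM_DONTCOPY);
#else
            vma->vm_flags |= VM_DONTCOPY;
#endif
        }
    }
}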

openeuler build xpmem error

I had a problem compiling and installing xpmem on an ARM machine.
My system is openEuler 20.03 SP3 and the kernel version is 4.19.90-2112.8.0.0131.oe1.aarch64.
I have two versions of xpmem. One is of unknown origin, but it installs successfully on CentOS 7.6 (kernel version 4.14.0-115.el7a.0.1.aarch64). When I compile that version on my openEuler system, configure succeeds, but the build fails with the following error:
/home/host64/mpi/xpmem-master/kernel/xpmem_attach.c: In function ‘xpmem_clear_PTEs_of_att’:
/home/host64/mpi/xpmem-master/kernel/xpmem_attach.c:809:7: error: void value not ignored as it ought to be
809 | ret = zap_vma_ptes (vma, unpin_at, invalidate_len);
| ^
make[4]: *** [scripts/Makefile.build:304: /home/host64/mpi/xpmem-master/kernel/xpmem_attach.o] Error 1
make[3]: *** [Makefile:1524: module/home/host64/mpi/xpmem-master/kernel] Error 2
make[3]: Leaving directory '/usr/src/kernels/4.19.90-2112.8.0.0131.oe1.aarch64'
make[2]: *** [Makefile:557: xpmem.ko] Error 2
make[2]: Leaving directory '/home/host64/mpi/xpmem-master/kernel'
make[1]: *** [Makefile:490: all-recursive] Error 1
make[1]: Leaving directory '/home/host64/mpi/xpmem-master'
make: *** [Makefile:376: all] Error 2
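
The first error says a void return value is being assigned: upstream
changed zap_vma_ptes() to return void around kernel 4.18, and this
4.19-based vendor kernel follows suit. A plausible guard (a sketch; the
exact cutoff may differ in vendor kernels) for the call site in
xpmem_clear_PTEs_of_att() would be:

#include <linux/version.h>

#if LINUX_VERSION_CODE >= KERNEL_VERSION(4, 18, 0)
    zap_vma_ptes(vma, unpin_at, invalidate_len);  /* void on newer kernels */
    ret = 0;
#else
    ret = zap_vma_ptes(vma, unpin_at, invalidate_len);
#endif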
###################################
I thought the kernel version was too high, so I downloaded xpmem-2.6.2 and xpmem-2.6.3 instead. However, an error message was displayed during configure:
configure: error: Default library path or default prefix must be specified
The contents of the config.log file are as follows:
configure:9545: checking for memset
configure:9545: gcc -o conftest -g -O2 conftest.c >&5
conftest.c:56:6: warning: conflicting types for built-in function 'memset'; expected 'void *(void *, int, long unsigned int)' [-Wbuiltin-declaration-mismatch]
56 | char memset ();
| ^~~~~~
conftest.c:44:1: note: 'memset' is declared in header '<string.h>'
43 | # include <limits.h>
44 | #else
configure:9545: $? = 0
configure:9545: result: yes
configure:9581: error: Default library path or default prefix must be specified
################
I don't know if anyone has encountered this and knows how to solve it.
Thank you.

PBS_MOM killed on job exit (xpmem_close_handler)

xpmem_close_handler is forcing a SIGKILL of the current thread group. In certain cases that means it is killing off the PBS MOM. Our initial guess is that there is some kind of race to free up memory when user job processes are exiting, and the appropriate xpmem_detach isn't winning the race.

Original Stack trace recovered via systemtap:

0xffffffffa1170e57 [stap_7ea83ef4a45a84471fe8e1b5f7f3ed52_33201+0x8e57/0x0]
0xffffffffa117297e [stap_7ea83ef4a45a84471fe8e1b5f7f3ed52_33201+0xa97e/0x0]
0xffffffffa1172f8e [stap_7ea83ef4a45a84471fe8e1b5f7f3ed52_33201+0xaf8e/0x0]
0xffffffffa1174295 [stap_7ea83ef4a45a84471fe8e1b5f7f3ed52_33201+0xc295/0x0]
0xffffffffa116801d [stap_7ea83ef4a45a84471fe8e1b5f7f3ed52_33201+0x1d/0x0]
0xffffffff810932b5 : __send_signal+0x245/0x450 [kernel]
0xffffffff8101bfe4 : try_stack_unwind+0x194/0x1b0 [kernel]
0xffffffff8101ae04 : dump_trace+0x64/0x3b0 [kernel]
0xffffffffa1172e88 [stap_7ea83ef4a45a84471fe8e1b5f7f3ed52_33201+0xae88/0x0]
0xffffffffa1172f8e [stap_7ea83ef4a45a84471fe8e1b5f7f3ed52_33201+0xaf8e/0x0]
0xffffffffa1174295 [stap_7ea83ef4a45a84471fe8e1b5f7f3ed52_33201+0xc295/0x0]
0xffffffffa116801d [stap_7ea83ef4a45a84471fe8e1b5f7f3ed52_33201+0x1d/0x0]
0xffffffff810932b5 : __send_signal+0x245/0x450 [kernel]
0xffffffff810934fe : send_signal+0x3e/0x80 [kernel]
0xffffffff81093d30 : force_sig_info+0xb0/0xe0 [kernel]
0xffffffff81093d76 : force_sig+0x16/0x20 [kernel]
0xffffffffa0238a01 : xpmem_close_handler+0x151/0x270 [xpmem]
0xffffffff811d774d : remove_vma+0x2d/0x70 [kernel]
0xffffffff811db09a : exit_mmap+0xea/0x150 [kernel]
0xffffffff81082edf : mmput+0x4f/0x110 [kernel]

We enabled xpmem_debug and captured the following trace along with dmesg log.

The job was started at 16:30, so you can extract the relevant log with: grep "Jun 11 16:3" r1i6n18.gbe.ice.issp.u-tokyo.ac.jp

20180611.tar.gz

HPE Adapt (4 of at least 4) - Reliance on kernel functions not normally exposed.

HPE is planning to work with the community to enhance this xpmem repo, to provide the functionality currently supported by our in-house COS-only version. This issue is number 4 in a series of at least 4, and seeks to identify the known issues that must be addressed.

HPE xpmem currently relies on a small number of kernel functions that are not normally exposed to kernel modules. These may be side effects of other HPE xpmem specifics, and may therefore be addressed in turn by #55, #56, #57. This is a placeholder to track any stragglers. Details to follow.

Cannot build xpmem against a kernel built only with modules_prepare

I often build modules against a partially built kernel: building the target 'modules_prepare' gives a tree that is good enough for building the module.

The configure script, however, requires Module.symvers, which is not generated by this target. My very basic workaround (which does not check for modules_prepare properly, and may not work with some older kernels) is:

--- a/m4/ac_path_kernel_source.m4
+++ b/m4/ac_path_kernel_source.m4
@@ -24,7 +24,7 @@ AC_DEFUN([AC_PATH_KERNEL_SOURCE_SEARCH],
         /usr/src/linux-source-${vers} \
         /usr/src/linux /lib/modules/${vers}/source
     do
-      if test -e $dir/Module.symvers ; then
+      if test -e $dir/include/generated/autoconf.h ; then
         kerneldir=`dirname $dir/Makefile`/ || continue
         no_kernel=no
         break

Softlockup using xpmem and BUG: scheduling while atomic

Dear Nathan Hjelm,

I am experiencing a quite reproducible softlockup using xpmem and Open MPI 1.10.2.

The tested kernel versions are 4.3.5 vanilla and openSUSE's 4.1.15-8-default, using gcc version 4.8.5 (SUSE Linux).
Using the xpmem version from 08.2015 works fine.

If you need more info, just let me know, I'll be happy to help you.

To reproduce the error, just compile osu-micro-benchmarks-4.4.1 and run the osu_alltoall / allgather benchmark a couple of times; usually it hangs after the 10th run. The benchmark completes without an error, but the finalizing step seems to hang.
If you then kill the mpirun process the following will happen:

If you then try to rerun the benchmark, the console freezes, and you cannot remove the xpmem module; it is still in use.

[58264.252334] BUG: scheduling while atomic: osu_alltoall/20072/0x00000002
[58264.252746] Modules linked in: xpmem(O) r8152(O) mii rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache iptable_filter ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack ip_tables x_tables svcrdma(O) knem(O) vtsspp(O) sep3_15(O) pax(O) rdma_ucm(O) ib_ucm(O) rdma_cm(O) iw_cm(O) configfs af_packet ib_ipoib(O) inet_lro ib_cm(O) ib_uverbs(O) ib_umad(O) mlx4_ib(O) ib_sa(O) ib_mad(O) ib_core(O) ib_addr(O) iscsi_ibft iscsi_boot_sysfs ast ttm drm_kms_helper x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel ipmi_ssif drm syscopyarea sysfillrect sysimgblt kvm crct10dif_pclmul crc32_pclmul aesni_intel aes_x86_64 lrw gf128mul iTCO_wdt ie31200_edac joydev edac_core iTCO_vendor_support lpc_ich glue_helper mfd_core i2c_i801 ablk_helper shpchp pcspkr
[58264.252773] cryptd 8250_fintek tpm_tis tpm_crb ipmi_si ipmi_msghandler battery tpm video thermal processor button fan nfsd auth_rpcgss nfs_acl lockd grace sr_mod cdrom uas usb_storage hid_generic usbhid btrfs xor raid6_pq crc32c_intel megaraid_sas igb i2c_algo_bit xhci_pci dca ehci_pci xhci_hcd ehci_hcd mlx4_core(O) e1000e mlx_compat(O) ptp usbcore pps_core usb_common sunrpc sg [last unloaded: xpmem]
[58264.252791] CPU: 4 PID: 20072 Comm: osu_alltoall Tainted: G W O 4.1.15-8-default #1
[58264.252792] Hardware name: Supermicro X10SLL-F/X10SLL-F, BIOS 3.0 04/24/2015
[58264.252792] 0000000000000000 ffff880096813c08 ffffffff81659718 ffff88081fd16700
[58264.252794] ffffffff8165497b ffff880096813c58 ffffffff8165bfb2 0000000000000200
[58264.252795] ffff8807b6662390 0000000000000000 ffff880096814000 ffff880096813da8
[58264.252796] Call Trace:
[58264.252804] [] dump_trace+0x8c/0x340
[58264.252806] [] show_stack_log_lvl+0xfc/0x1a0
[58264.252808] [] show_stack+0x21/0x50
[58264.252811] [] dump_stack+0x47/0x67
[58264.252813] [] __schedule_bug+0x4b/0x59
[58264.252815] [] thread_return+0x629/0x637
[58264.252817] [] schedule+0x37/0x90
[58264.252819] [] schedule_timeout+0x242/0x2d0
[58264.252821] [] wait_for_completion+0x9d/0x110
[58264.252824] [] __synchronize_srcu+0xcc/0x100
[58264.252829] [] mmu_notifier_unregister+0xa0/0xd0
[58264.252832] [] xpmem_flush+0x84/0x180 [xpmem]
[58264.252846] [] filp_close+0x2a/0x70
[58264.252849] [] put_files_struct+0x88/0xe0
[58264.252853] [] do_exit+0x2b7/0xb80
[58264.252855] [] do_group_exit+0x3d/0xa0
[58264.252856] [] SyS_exit_group+0x14/0x20
[58264.252858] [] system_call_fastpath+0x16/0x75
[58264.252862] [<00007f550b8b41c9>] 0x7f550b8b41c9

Sometimes another error occurs:

[ 149.718738] hugetlbfs: osu_alltoall (2940): Using mlock ulimits for SHM_HUGETLB is deprecated
[ 268.648471] NMI watchdog: BUG: soft lockup - CPU#6 stuck for 23s! [osu_alltoall:3141]
[ 268.648988] Modules linked in: rpcsec_gss_krb5 nfsv4 nfs fscache iptable_filter ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack ip_tables x_tables svcrdma(O) xpmem(O) knem(O) sep3_15(O) pax(O) af_packet rdma_ucm(O) ib_ucm(O) rdma_cm(O) iw_cm(O) configfs ib_ipoib(O) inet_lro ib_cm(O) ib_uverbs(O) ib_umad(O) mlx4_ib(O) ib_sa(O) ib_mad(O) ib_core(O) ib_addr(O) x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm ext4 crct10dif_pclmul crc32_pclmul mbcache jbd2 aesni_intel aes_x86_64 lpc_ich lrw gf128mul glue_helper i2c_i801 mlx4_core(O) mlx_compat(O) ablk_helper cryptd ipmi_si mfd_core ie31200_edac ipmi_msghandler battery edac_core processor nfsd thermal button fan auth_rpcgss oid_registry nfs_acl lockd grace hid_generic usbhid
[ 268.649014] crc32c_intel megaraid_sas xhci_pci ehci_pci ehci_hcd xhci_hcd e1000e usbcore usb_common sunrpc
[ 268.649019] CPU: 6 PID: 3141 Comm: osu_alltoall Tainted: G O 4.3.5-mini #1
[ 268.649020] Hardware name: Supermicro X10SLL-F/X10SLL-F, BIOS 3.0 04/24/2015
[ 268.649021] task: ffff8807dd7d5640 ti: ffff8807c3a70000 task.ti: ffff8807c3a70000
[ 268.649021] RIP: 0010:[] [] queued_spin_lock_slowpath+0x14c/0x160
[ 268.649026] RSP: 0018:ffff8807c3a73de8 EFLAGS: 00000202
[ 268.649027] RAX: 0000000000000101 RBX: ffff8807d8bd0d00 RCX: 0000000000000001
[ 268.649027] RDX: 0000000000000101 RSI: 0000000000000001 RDI: ffffffffa031c6c8
[ 268.649028] RBP: ffff8807c3a73de8 R08: 0000000000000101 R09: 00000000d8be4801
[ 268.649029] R10: ffffea001f62f900 R11: ffffffff812b6f6f R12: ffff8807dd7d5640
[ 268.649029] R13: 0000000000000000 R14: ffff8807f6d279e8 R15: 0000000000000025
[ 268.649030] FS: 0000000000000000(0000) GS:ffff88081fd80000(0000) knlGS:0000000000000000
[ 268.649031] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 268.649031] CR2: 00007faa9a818500 CR3: 0000000001a0c000 CR4: 00000000001406e0
[ 268.649032] Stack:
[ 268.649033] ffff8807c3a73df8 ffffffff8158f050 ffff8807c3a73e18 ffffffffa03172cf
[ 268.649034] ffff8807d8bd0c00 ffff8807f6d279c0 ffff8807c3a73e40 ffffffff8117f0af
[ 268.649035] 0000000000000221 0000000000000000 ffff8807f6d279c0 ffff8807c3a73e80
[ 268.649036] Call Trace:
[ 268.649040] [] _raw_spin_lock+0x20/0x30
[ 268.649043] [] xpmem_flush+0x6f/0x160 [xpmem]
[ 268.649045] [] filp_close+0x2f/0x70
[ 268.649047] [] put_files_struct+0x83/0xe0
[ 268.649048] [] exit_files+0x41/0x50
[ 268.649050] [] do_exit+0x28b/0xad0
[ 268.649053] [] ? do_audit_syscall_entry+0x66/0x70
[ 268.649054] [] do_group_exit+0x3f/0xa0
[ 268.649055] [] SyS_exit_group+0x14/0x20
[ 268.649057] [] entry_SYSCALL_64_fastpath+0x16/0x6e
[ 268.649058] Code: 01 48 8b 02 48 85 c0 75 0a f3 90 48 8b 02 48 85 c0 74 f6 c7 40 08 01 00 00 00 e9 4f ff ff ff 83 fa 01 75 07 e9 4c ff ff ff f3 90 <8b> 07 84 c0 75 f8 b8 01 00 00 00 5d 66 89 07 c3 0f 1f 40 00 0f

Kind regards,
Tobias Kloeffel

HPE Adapt (2 of at least 4) - Handle pages being unmapped.

HPE is planning to work with the community to enhance this xpmem repo, to provide the functionality currently supported by our in-house COS-only version. This issue is number 2 in a series of at least 4, and seeks to identify the known issues that must be addressed.

HPE COS xpmem relies on kernel patches to handle the scenario where pages are unmapped out from under xpmem (I believe this is known to occur with profilers/debuggers). I am told that the Linux kernel deprecated the notification mechanism previously used.

Need to investigate whether this xpmem is susceptible to the issue, or if it has already been addressed.

Build error on cpus_allowed/cpus_mask with kernel 4.18.0-240.1.1

Building xpmem fails with the error below using kernel 4.18.0-240.1.1 on CentOS 8.3.

kernel/xpmem_pfn.c:255:25: error: 'struct task_struct' has no member named 'cpus_allowed'; did you mean 'nr_cpus_allowed'?
saved_mask = current->cpus_allowed;
^~~~~~~~~~~~

The #if statement below in kernel/xpmem_pfn.c checks for kernel version 5.3.0 or newer before using cpus_mask.
However, kernel 4.18.0-240.1.1 on CentOS 8.3 introduced "RH_KABI_RENAME(cpumask_t cpus_allowed, cpumask_t cpus_mask);" in /usr/src/kernels/4.18.0-240.1.1.el8_3.x86_64/include/linux/sched.h, so cpus_mask has to be used instead of cpus_allowed even though the kernel is older than 5.3.0.
Thus it fails to compile.

#if LINUX_VERSION_CODE >= KERNEL_VERSION(5, 3, 0)
saved_mask = current->cpus_mask;
#else
saved_mask = current->cpus_allowed;
#endif
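
A plausible workaround (a sketch, relying on the RHEL_RELEASE_CODE /
RHEL_RELEASE_VERSION macros that Red Hat derived kernels define in
linux/version.h; XPMEM_HAVE_CPUS_MASK is a name invented here) is to
special-case backported kernels alongside the upstream version check. A
configure-time compile probe for the cpus_mask member would be more
robust still:

#include <linux/sched.h>
#include <linux/version.h>

#if defined(RHEL_RELEASE_CODE)
#define XPMEM_HAVE_CPUS_MASK (RHEL_RELEASE_CODE >= RHEL_RELEASE_VERSION(8, 3))
#else
#define XPMEM_HAVE_CPUS_MASK (LINUX_VERSION_CODE >= KERNEL_VERSION(5, 3, 0))
#endif

#if XPMEM_HAVE_CPUS_MASK
saved_mask = current->cpus_mask;
#else
saved_mask = current->cpus_allowed;
#endif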

HPE Adapt (3 of at least 4) - Avoid page fault deadlock, stemming from copy-on-write race.

HPE is planning to work with the community to enhance this xpmem repo, to provide the functionality currently supported by our in-house COS-only version. This issue is number 3 in a series of at least 4, and seeks to identify the known issues that must be addressed.

HPE xpmem currently relies on kernel mods (outside the xpmem kernel module) to avoid a known page fault deadlock, caused by a copy-on-write race condition. More details to follow.

build failure on fedora29

kernel release... 5.2.18-100.fc29.x86_64
gcc (GCC) 8.3.1 20190223 (Red Hat 8.3.1-2)

build fails with
CC [M] /home/dkokron/play/XPMEM/xpmem/kernel/xpmem_attach.o
/home/dkokron/play/XPMEM/xpmem/kernel/xpmem_attach.c:351:11: error: initialization of ‘vm_fault_t (*)(struct vm_fault *)’ {aka ‘unsigned int (*)(struct vm_fault *)’} from incompatible pointer type ‘int (*)(struct vm_fault *)’ [-Werror=incompatible-pointer-types]
.fault = xpmem_fault_handler
^~~~~~~~~~~~~~~~~~~
/home/dkokron/play/XPMEM/xpmem/kernel/xpmem_attach.c:351:11: note: (near initialization for ‘xpmem_vm_ops.fault’)
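
vm_fault_t was introduced around kernel 4.17, so a plausible fix (a
sketch; the cutoff is approximate) is to adopt the new return type with a
fallback typedef for older kernels:

#include <linux/mm.h>
#include <linux/version.h>

#if LINUX_VERSION_CODE < KERNEL_VERSION(4, 17, 0)
typedef int vm_fault_t;   /* fallback for kernels predating vm_fault_t */
#endif

static vm_fault_t xpmem_fault_handler(struct vm_fault *vmf)
{
    /* ... existing fault logic ... */
    return VM_FAULT_SIGBUS;
}

static const struct vm_operations_struct xpmem_vm_ops = {
    .fault = xpmem_fault_handler,
};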

Runtime failures in XPMEM when running MVAPICH2-X or OpenMPI+UCX on a POWER9 system

We recently tried XPMEM builds of MVAPICH2-X and OpenMPI+UCX on a POWER9 system. However, we are seeing issues when running. Please see the details and the reproducer below.

XPMEM version: https://github.com/hjelmn/xpmem as of cae86010097cd85f0e749736dc86850f85f7edbc

  • UCX version:
commit c22daf4fb1e408aedf5a7dc11ed72f87c0f27cc9
Merge: d669d54 e07fd32
Author: Yossi Itigin <[email protected]>
Date:   Fri Nov 6 17:17:08 2020 +0200

    Merge pull request #5881 from brminich/topic/iodemo_rt_exceeded

    TEST/IODEMO/AZP: Fix client tmo option in IODEMO
  • UCX configuration:
--disable-logging --disable-debug --disable-assertions --disable-params-check --enable-optimizations --with-xpmem=/opt/xpmem

  • OpenMPI: tarball 4.0.5

  • OpenMPI configuration:
 --with-ucx=/home/users/hashmij/xpmem-work/ucx/install --without-verbs
  • Kernel version
$ uname -r 
4.14.0-115.18.1.el7a.ppc64le
  • System details
$ lscpu
Architecture:          ppc64le
Byte Order:            Little Endian
CPU(s):                160
On-line CPU(s) list:   0-159
Thread(s) per core:    4
Core(s) per socket:    20
Socket(s):             2
NUMA node(s):          6
Model:                 2.2 (pvr 004e 1202)
Model name:            POWER9, altivec supported
CPU max MHz:           3800.0000
CPU min MHz:           2300.0000
L1d cache:             32K
L1i cache:             32K
L2 cache:              512K
L3 cache:              10240K
NUMA node0 CPU(s):     0-79
NUMA node8 CPU(s):     80-159
NUMA node252 CPU(s):
NUMA node253 CPU(s):
NUMA node254 CPU(s):
NUMA node255 CPU(s):
  • Reproducer

Build osu-microbenchmarks with OpenMPI built with UCX (with xpmem transport support).

./configure CC=mpicc CXX=mpicxx

Run basic osu_latency test

$ OMPI_DIR/bin/mpirun -np 2 -x UCX_TLS=self,sm ./install/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency
  • UCX output
[gorgon:39733:0:39733] Caught signal 7 (Bus error: nonexistent physical address)
[gorgon:39732:0:39732] Caught signal 7 (Bus error: nonexistent physical address)

/home/users/hashmij/xpmem-work/ucx/src/uct/sm/mm/base/mm_ep.c: [ uct_mm_ep_get_remote_seg() ]
      ...
       81
       82     /* slow path - attach new segment */
       83     return uct_mm_ep_attach_remote_seg(ep, seg_id, length, address_p);
==>    84 }
       85
       86
       87 /* send a signal to remote interface using Unix-domain socket */


/home/users/hashmij/xpmem-work/ucx/src/uct/sm/mm/base/mm_ep.c: [ uct_mm_ep_get_remote_seg() ]
      ...
       81
       82     /* slow path - attach new segment */
       83     return uct_mm_ep_attach_remote_seg(ep, seg_id, length, address_p);
==>    84 }
       85
       86
       87 /* send a signal to remote interface using Unix-domain socket */

==== backtrace (tid:  39733) ====
 0 0x0000000000056410 ucs_debug_print_backtrace()  /home/users/hashmij/xpmem-work/ucx/src/ucs/debug/debug.c:656
 1 0x0000000000019b3c uct_mm_ep_get_remote_seg()  /home/users/hashmij/xpmem-work/ucx/src/uct/sm/mm/base/mm_ep.c:84
 2 0x0000000000019d38 uct_mm_ep_t_new()  /home/users/hashmij/xpmem-work/ucx/src/uct/sm/mm/base/mm_ep.c:194
 3 0x0000000000015558 uct_ep_create()  /home/users/hashmij/xpmem-work/ucx/src/uct/base/uct_iface.c:550
 4 0x000000000006598c ucp_wireup_connect_lane_to_iface()  /home/users/hashmij/xpmem-work/ucx/src/ucp/wireup/wireup.c:805
 5 0x000000000006598c ucp_wireup_connect_lane()  /home/users/hashmij/xpmem-work/ucx/src/ucp/wireup/wireup.c:888
 6 0x000000000006598c ucp_wireup_init_lanes()  /home/users/hashmij/xpmem-work/ucx/src/ucp/wireup/wireup.c:1207
 7 0x0000000000022448 ucp_ep_create_to_worker_addr()  /home/users/hashmij/xpmem-work/ucx/src/ucp/core/ucp_ep.c:421
 8 0x00000000000236f0 ucp_ep_create_api_to_worker_addr()  /home/users/hashmij/xpmem-work/ucx/src/ucp/core/ucp_ep.c:674
 9 0x00000000000236f0 ucp_ep_create()  /home/users/hashmij/xpmem-work/ucx/src/ucp/core/ucp_ep.c:740
10 0x0000000000005ca4 mca_pml_ucx_add_proc_common()  pml_ucx.c:0
11 0x0000000000005f68 mca_pml_ucx_add_procs()  ???:0
12 0x0000000000117bb4 ompi_mpi_init()  ???:0
13 0x00000000000ac1b8 MPI_Init()  ???:0
14 0x0000000010001510 main()  /home/users/hashmij/xpmem-work/osu_benchmarks-ompi/mpi/pt2pt/osu_latency.c:37
15 0x0000000000025200 generic_start_main.isra.0()  libc-start.c:0
16 0x00000000000253f4 __libc_start_main()  ???:0
=================================
[gorgon:39733] *** Process received signal ***
[gorgon:39733] Signal: Bus error (7)
[gorgon:39733] Signal code:  (-6)
[gorgon:39733] Failing at address: 0x3cab00009b35
[gorgon:39733] [ 0] [0x7fffb51004d8]
[gorgon:39733] [ 1] /home/users/hashmij/xpmem-work/ucx/install/lib/libuct.so.0(+0x19b3c)[0x7fffb1bc9b3c]
[gorgon:39733] [ 2] /home/users/hashmij/xpmem-work/ucx/install/lib/libuct.so.0(uct_mm_ep_t_new+0x68)[0x7fffb1bc9d38]
[gorgon:39733] [ 3] /home/users/hashmij/xpmem-work/ucx/install/lib/libuct.so.0(uct_ep_create+0x78)[0x7fffb1bc5558]
[gorgon:39733] [ 4] ==== backtrace (tid:  39732) ====
/home/users/hashmij/xpmem-work/ucx/install/lib/libucp.so.0(ucp_wireup_init_lanes+0x8bc)[0x7fffb1c7598c]
[gorgon:39733] [ 5]  0 0x0000000000056410 ucs_debug_print_backtrace()  /home/users/hashmij/xpmem-work/ucx/src/ucs/debug/debug.c:656
 1 0x0000000000019b3c uct_mm_ep_get_remote_seg()  /home/users/hashmij/xpmem-work/ucx/src/uct/sm/mm/base/mm_ep.c:84
 2 0x0000000000019d38 uct_mm_ep_t_new()  /home/users/hashmij/xpmem-work/ucx/src/uct/sm/mm/base/mm_ep.c:194
 3 0x0000000000015558 uct_ep_create()  /home/users/hashmij/xpmem-work/ucx/src/uct/base/uct_iface.c:550
 4 0x000000000006598c ucp_wireup_connect_lane_to_iface()  /home/users/hashmij/xpmem-work/ucx/src/ucp/wireup/wireup.c:805
 5 0x000000000006598c ucp_wireup_connect_lane()  /home/users/hashmij/xpmem-work/ucx/src/ucp/wireup/wireup.c:888
 6 0x000000000006598c ucp_wireup_init_lanes()  /home/users/hashmij/xpmem-work/ucx/src/ucp/wireup/wireup.c:1207
/home/users/hashmij/xpmem-work/ucx/install/lib/libucp.so.0(ucp_ep_create_to_worker_addr+0x98)[0x7fffb1c32448]
[gorgon:39733] [ 6] /home/users/hashmij/xpmem-work/ucx/install/lib/libucp.so.0(ucp_ep_create+0x6f0)[0x7fffb1c336f0]
[gorgon:39733] [ 7] /home/users/hashmij/xpmem-work/openmpi-4.0.5/install/lib/openmpi/mca_pml_ucx.so(+0x5ca4)[0x7fffb1cf5ca4]
[gorgon:39733]  7 0x0000000000022448 ucp_ep_create_to_worker_addr()  /home/users/hashmij/xpmem-work/ucx/src/ucp/core/ucp_ep.c:421
 8 0x00000000000236f0 ucp_ep_create_api_to_worker_addr()  /home/users/hashmij/xpmem-work/ucx/src/ucp/core/ucp_ep.c:674
 9 0x00000000000236f0 ucp_ep_create()  /home/users/hashmij/xpmem-work/ucx/src/ucp/core/ucp_ep.c:740
10 0x0000000000005ca4 mca_pml_ucx_add_proc_common()  pml_ucx.c:0
[ 8] /home/users/hashmij/xpmem-work/openmpi-4.0.5/install/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_add_procs+0x98)[0x7fffb1cf5f68]
[gorgon:39733] [ 9] 11 0x0000000000005f68 mca_pml_ucx_add_procs()  ???:0
12 0x0000000000117bb4 ompi_mpi_init()  ???:0
13 0x00000000000ac1b8 MPI_Init()  ???:0
14 0x0000000010001510 main()  /home/users/hashmij/xpmem-work/osu_benchmarks-ompi/mpi/pt2pt/osu_latency.c:37
15 0x0000000000025200 generic_start_main.isra.0()  libc-start.c:0
16 0x00000000000253f4 __libc_start_main()  ???:0
=================================
/home/users/hashmij/xpmem-work/openmpi-4.0.5/install/lib/libmpi.so.40(ompi_mpi_init+0xc54)[0x7fffb50a7bb4]
[gorgon:39733] [10] [gorgon:39732] *** Process received signal ***
[gorgon:39732] Signal: Bus error (7)
[gorgon:39732] Signal code:  (-6)
[gorgon:39732] Failing at address: 0x3cab00009b34
[gorgon:39732] [ 0] [0x7fffbc4704d8]
[gorgon:39732] [ 1] /home/users/hashmij/xpmem-work/openmpi-4.0.5/install/lib/libmpi.so.40(MPI_Init+0x98)[0x7fffb503c1b8]
[gorgon:39733] [11] ./install/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency[0x10001510]
/home/users/hashmij/xpmem-work/ucx/install/lib/libuct.so.0(+0x19b3c)[0x7fffb8f39b3c]
[gorgon:39732] [ 2] [gorgon:39733] [12] /home/users/hashmij/xpmem-work/ucx/install/lib/libuct.so.0(uct_mm_ep_t_new+0x68)[0x7fffb8f39d38]
[gorgon:39732] [ 3] /home/users/hashmij/xpmem-work/ucx/install/lib/libuct.so.0(uct_ep_create+0x78)[0x7fffb8f35558]
[gorgon:39732] [ 4] /lib64/libc.so.6(+0x25200)[0x7fffb4ab5200]
[gorgon:39733] [13] /home/users/hashmij/xpmem-work/ucx/install/lib/libucp.so.0(ucp_wireup_init_lanes+0x8bc)[0x7fffb8fe598c]
[gorgon:39732] [ 5] /home/users/hashmij/xpmem-work/ucx/install/lib/libucp.so.0(ucp_ep_create_to_worker_addr+0x98)[0x7fffb8fa2448]
[gorgon:39732] [ 6] /lib64/libc.so.6(__libc_start_main+0xc4)[0x7fffb4ab53f4]
[gorgon:39733] *** End of error message ***
/home/users/hashmij/xpmem-work/ucx/install/lib/libucp.so.0(ucp_ep_create+0x6f0)[0x7fffb8fa36f0]
[gorgon:39732] [ 7] /home/users/hashmij/xpmem-work/openmpi-4.0.5/install/lib/openmpi/mca_pml_ucx.so(+0x5ca4)[0x7fffb9065ca4]
[gorgon:39732] [ 8] /home/users/hashmij/xpmem-work/openmpi-4.0.5/install/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_add_procs+0x98)[0x7fffb9065f68]
[gorgon:39732] [ 9] /home/users/hashmij/xpmem-work/openmpi-4.0.5/install/lib/libmpi.so.40(ompi_mpi_init+0xc54)[0x7fffbc417bb4]
[gorgon:39732] [10] /home/users/hashmij/xpmem-work/openmpi-4.0.5/install/lib/libmpi.so.40(MPI_Init+0x98)[0x7fffbc3ac1b8]
[gorgon:39732] [11] ./install/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency[0x10001510]
[gorgon:39732] [12] /lib64/libc.so.6(+0x25200)[0x7fffbbe25200]
[gorgon:39732] [13] /lib64/libc.so.6(__libc_start_main+0xc4)[0x7fffbbe253f4]
[gorgon:39732] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node gorgon exited on signal 7 (Bus error).
--------------------------------------------------------------------------
  • Kernel Log
$ dmesg | tail
...
[6890564.268944] xpmem_fault_handler: pfn mismatch: 466880 != 1930663
[6890564.269032] xpmem_fault_handler: pfn mismatch: 2088245 != 1302727

CC: @shamisp @hjelmn

cray-xpmem.pc: prefix, exec_prefix not set

prefix and exec_prefix are not set in cray-xpmem.pc; this leaves these variables empty when processed by pkgconf/pkg-config, resulting in erroneous includedir and libdir variables. In my case, a spurious -L/lib got into my link line for Open MPI, causing my linker to attempt to link 32-bit versions of libm, libc, etc. into a 64-bit build. Needless to say, that did not go well.

This occurs because Autoconf variables are not expanded recursively during substitution. Simply adding the following to the top of cray-xpmem.pc.in should be sufficient to fix this.

prefix=@prefix@
exec_prefix=@exec_prefix@

I believe that module.in has a similar issue.

Build on 4.10 fails

This is from Ubuntu 16.04, which is based on 4.10.

xpmem_attach.c: In function ‘xpmem_fault_handler’:
xpmem_attach.c:172:41: error: ‘struct vm_fault’ has no member named ‘virtual_address’

The kernel structure vm_fault changed from 4.9 to 4.10, replacing virtual_address with address. The xpmem source handles this for kernels >= 4.11, but doesn't handle 4.10 correctly. A quick attempt to change the kernel version check from 4.11 to 4.10 did not work.

4.10 needs an xpmem_fault_handler() with 2 arguments, but with the vaddr pulled from vmf->address; i.e., the else branch from line 162 combined with the if branch from line 168. Trying to select this combination creates a duplicate variable (vma), and it's not clear to me which one to use.
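
A plausible arrangement for all three cases (a sketch, untested; the
fault logic is elided): keep the two-argument prototype on 4.10 but read
the address from vmf->address, and avoid redeclaring vma there since it
is already a parameter:

#include <linux/mm.h>
#include <linux/version.h>

#if LINUX_VERSION_CODE >= KERNEL_VERSION(4, 11, 0)
static int xpmem_fault_handler(struct vm_fault *vmf)
{
    struct vm_area_struct *vma = vmf->vma; /* vma moved into struct vm_fault */
    unsigned long vaddr = (unsigned long)vmf->address;
#elif LINUX_VERSION_CODE >= KERNEL_VERSION(4, 10, 0)
static int xpmem_fault_handler(struct vm_area_struct *vma, struct vm_fault *vmf)
{
    /* vma is a parameter here: do not declare it again in the body */
    unsigned long vaddr = (unsigned long)vmf->address;
#else
static int xpmem_fault_handler(struct vm_area_struct *vma, struct vm_fault *vmf)
{
    unsigned long vaddr = (unsigned long)vmf->virtual_address;
#endif
    /* ... common fault logic using vma and vaddr ... */
    return VM_FAULT_SIGBUS;
}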

XPMEM runtime warning/error

I am trying to use XPMEM with Open MPI 4.x, and have used the configure command below for Open MPI 4.1.0:
$ ompi_info --all|grep 'command line'
Configure command line: '--prefix=/home/server/ompi4_xmem' '--with-xpmem=/home/server/xpmm' '--enable-mpi-fortran' '--enable-mpi-cxx' '--enable-shared=yes' '--enable-static=yes' '--enable-mpi1-compatibility'
User-specified command line parameters passed to ROMIO's configure script
Complete set of command-line parameters passed to ROMIO's configure script

But I am getting a warning/error when running the built-in FFTW MPI benchmark.
$ mpirun --map-by core -rank-by core --bind-to core ./mpi-bench -s ic1000000
WARNING: Could not generate an xpmem segment id for this process'
address space.
The vader shared memory BTL will fall back on another single-copy
mechanism if one is available. This may result in lower performance.
Local host: lib-server-03
Error code: 2 (No such file or directory)
Problem: ic1000000, setup: 580.97 ms, time: 1.76 ms, ``mflops'': 56555
[lib-server-03:1297333] 127 more processes have sent help message help-btl-vader.txt / xpmem-make-failed
[lib-server-03:1297333] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

Due to the above warning/error, I am not sure whether the MPI program is using XPMEM or CMA.
Can you please help me resolve this warning/error?

Thanks in advance.

Fix github "About" description

Hi, it looks like the current GitHub description, "Linux Cross-Memory Attach", references CMA. IIRC, XPMEM is an abbreviation for "Cross-Partition Memory".

Add redhat standard spec file

There has been a request to RPM-ify XPMEM. Though it isn't my distro of choice, I plan to add a spec file that follows Red Hat's guidelines.
