

Infiniswap: Efficient Memory Disaggregation

Infiniswap is a remote memory paging system designed specifically for an RDMA network. It opportunistically harvests and transparently exposes unused memory to unmodified applications by dividing the swap space of each machine into many slabs and distributing them across many machines' remote memory. Because one-sided RDMA operations bypass remote CPUs, Infiniswap leverages the power of many choices to perform decentralized slab placements and evictions.

Extensive benchmarks on workloads from memory-intensive applications ranging from in-memory databases such as VoltDB and Memcached to popular big data software Apache Spark, PowerGraph, and GraphX show that Infiniswap provides order-of-magnitude performance improvements when working sets do not completely fit in memory. Simultaneously, it boosts cluster memory utilization by almost 50%.

Detailed design and performance benchmarks are available in our NSDI'17 paper.

Prerequisites

The following prerequisites are required to use Infiniswap:

  • Software

    • Operating system: Ubuntu 14.04 (kernel 3.13.0, also tested on 4.4.0/4.11.0)
    • Container: LXC (or any other container technologies) with cgroup (memory and swap) enabled
    • RDMA NIC driver: MLNX_OFED 3.2/3.3/3.4/4.1 (4.1 recommended); select the version that matches your operating system.
  • Hardware

    • Mellanox ConnectX-3/4 (InfiniBand)
    • An empty and unused disk partition

Code Organization

The Infiniswap codebase is organized under three directories.

  • infiniswap_bd: Infiniswap block device (kernel module).
  • infiniswap_daemon: Infiniswap daemon (user-level process) that exposes its local memory as remote memory.
  • setup: scripts for setup and installation.

Important Parameters

There are several important parameters to configure in Infiniswap:

  • Infiniswap block device (in infiniswap_bd/infiniswap.h)

    1. BACKUP_DISK [disk partition]
      The disk partition used as the backup disk by the Infiniswap block device.
      To list disk partitions and check their status, run "sudo fdisk -l".
    2. STACKBD_SIZE_G [size in GB]
      The size of the Infiniswap block device (and of its backup disk).
    3. MAX_SGL_LEN [num of pages]
      The maximum number of pages that can be included in a single swap-out (IO) request.
    4. BIO_PAGE_CAP [num of pages]
      The upper bound on MAX_SGL_LEN.
    5. MAX_MR_SIZE_GB [size]
      The maximum number of slabs this block device can get from a single Infiniswap daemon. Each slab is 1GB.
    // example, in "infiniswap.h" 
    #define BACKUP_DISK "/dev/sda4"  
    #define STACKBD_SIZE_G 12  // 12GB
    #define MAX_SGL_LEN 32  // 32 x 4KB = 128KB, it's the max size for a single "struct bio" object.
    #define BIO_PAGE_CAP 32
    #define MAX_MR_SIZE_GB 32 //this infiniswap block device can get 32 slabs from each infiniswap daemon.
  • Infiniswap daemon (in infiniswap_daemon/rdma-common.h)

    1. MAX_FREE_MEM_GB [size]
      The maximum amount (in GB) of remote memory this daemon can provide (from the free memory of its local host).
    2. MAX_MR_SIZE_GB [size]
      The maximum number of slabs this daemon can provide to a single infiniswap block device.
      This value should be the same as "MAX_MR_SIZE_GB" in "infiniswap.h".
    3. MAX_CLIENT [number]
      The maximum number of infiniswap block devices that can connect to a single daemon.
    4. FREE_MEM_EVICT_THRESHOLD [size in GB]
      This is the "HeadRoom" mentioned in our paper.
      When the remaining free memory of the host machine drops below this threshold, the infiniswap daemon starts evicting mapped slabs.
    // example, in "rdma-common.h" 
    #define MAX_CLIENT 32     
    
    /* The following should be set based on the
    * memory information (DRAM capacity, regular memory usage, ...)
    * of the host machine running the infiniswap daemon.
    */
    #define MAX_FREE_MEM_GB 32    
    #define MAX_MR_SIZE_GB  32    
    #define FREE_MEM_EVICT_THRESHOLD 8    

How to configure those parameters?

  • If you use the provided installation script (setup/install.sh), you can configure these parameters by changing the values of the variables in setup/install.sh before installation. setup/install.sh declares each variable and the parameter it maps to; edit the values as needed. For example,

    #stackbd (backup) disk size, also the total size of remote memory of this bd
    #(STACKBD_SIZE), default is 12
    stackbd_size=12
  • If you choose to build Infiniswap manually, you need to pass configuration options to the configure command. You can list the definitions of those options with

    # after ./autogen.sh
    ./configure --help

    See its Optional Features section, for example:

    --enable-stackbd_size   User defines the size of stackbd (backup) disk which
                            should be >= the size of remote memory, default is
                            12
    

    For example, if your Infiniswap block device has 24GB of space in both its backup disk and remote memory, you need to run

    ./configure --enable-stackbd_size=24

How to Build and Install

In a simple one-to-one experiment, we have two machines (M1 and M2).
Applications run in a container on M1, and M1 needs remote memory from M2.
We need to install the infiniswap block device on M1 and the infiniswap daemon on M2.

  1. Setup InfiniBand NIC on both machines:
cd setup  
# ./ib_setup.sh <ip>    
# assume all IB NICs are connected in the same LAN (192.168.0.x)
# M1:192.168.0.11, M2:192.168.0.12
sudo ./ib_setup.sh 192.168.0.11
  2. Compile infiniswap daemon on M2:
cd setup
# edit the parameters in install.sh 
./install.sh daemon
  3. Install infiniswap block device on M1:
cd setup
# edit the parameters in install.sh
./install.sh bd
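
Before moving on, you can optionally sanity-check the RDMA link between M1 and M2. The commands below are standard InfiniBand utilities shipped with MLNX_OFED (not part of the Infiniswap scripts) and are only a suggested check:

# on both machines: the HCA port state should be "Active"
ibstat
# on M2: start an RDMA ping-pong server
rping -s -a 192.168.0.12 -v
# on M1: connect to the server on M2 (Ctrl-C to stop)
rping -c -a 192.168.0.12 -v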

Or, to build Infiniswap manually:

  • Infiniswap daemon
cd infiniswap_daemon
./autogen.sh
./configure [options] 
make
  • Infiniswap block device
cd infiniswap_bd
./autogen.sh
./configure [options] 
make
sudo make install

If you want to change the parameters of Infiniswap, you can add options when running configure. Please read "How to configure those parameters?" above for details.
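
For example, to rebuild the Infiniswap block device with a 24GB backup disk, a typical sequence simply combines the documented --enable-stackbd_size option with the build steps above (a sketch; run "./configure --help" to see the other options):

cd infiniswap_bd
./autogen.sh
./configure --enable-stackbd_size=24
make
sudo make install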

How to Run

  1. Start infiniswap daemon on M2:

    cd infiniswap_daemon   
    # ./infiniswap-daemon <ip> <port> 
    # pick an unused port number
    ./infiniswap-daemon 192.168.0.12 9400
  2. Prepare server (portal) list on M1:

    # Edit the portal.list file (<infiniswap path>/setup/portal.list)
    # portal.list format; the port number for each server is the one you chose above.
    Line1: number of servers
    Line2: <server1 ip>:<port>  
    Line3: <server2 ip>:<port>
    Line4: ...
    
    # in this example, M1 only has one server
    1
    192.168.0.12:9400
  3. Disable existing swap partitions on M1:

    # check existing swap partitions
    sudo swapon -s
    
    # disable existing swap partitions
    sudo swapoff <swap partitions>
  4. Create an infiniswap block device on M1:

    cd setup
    # create block device: nbdx-infiniswap0
    # make nbdx-infiniswap0 a swap partition
    sudo ./infiniswap_bd_setup.sh
    # If you have the error: 
    #   "insmod: ERROR: could not insert module infiniswap.ko: Invalid parameters"
    # or get the following message from kernel (dmesg):
    #   "infiniswap: disagrees about version of symbol: xxxx"
    # You need a proper Module.symvers file for the MLNX_OFED driver (kernel module)
    #
    cd infiniswap_bd
    make clean
    cd ../setup
    # Solution 1 (copy the Module.symvers file from MLNX_OFED dkms folder):
    # provide mlnx_ofed_version: 3.2,3.3,3.4,4.1, or not (default is 4.*)
    ./get_module.symvers.sh {mlnx_ofed_version}
    # ./get_module.symvers.sh 4.1
    # Or solution 2 (generate a new Module.symvers file)
    ./create_Module.symvers.sh {mlnx_ofed_version}
    # Then, recompile infiniswap block device from step 3 in "How to Build and Install"
  5. Configure memory limitation of container (LXC)

    # edit "memory.limit_in_bytes" in "config" file of container (LXC)
    
    # For example, this container on M1 can use 5GB local memory at most.
    # Additional memory data will be stored in the remote memory provided by M2.   
    lxc.cgroup.memory.limit_in_bytes = 5G

Now you can start your applications (in the container).
Application data that exceeds the container's memory limit will be swapped out to remote memory.
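
To confirm that swapped-out pages actually go to the Infiniswap device rather than a local partition, you can check the active swap space and watch swap traffic on M1 while the application runs. These are standard Linux commands; nbdx-infiniswap0 is the device created by infiniswap_bd_setup.sh above.

# nbdx-infiniswap0 should be listed as the active swap space
sudo swapon -s
# watch the si/so (swap-in/swap-out) columns while the application runs
vmstat 1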

FAQ

  1. Does infiniswap support transparent huge pages?
    Yes. Infiniswap relies on the swap mechanism of the stock Linux kernel. Current kernels (we have tested up to 4.10) split a huge page into basic (4KB) pages before swapping it out.
    (In mm/vmscan.c, shrink_page_list() calls split_huge_page_to_list() to split the huge page.)
    Therefore, whether transparent huge pages are enabled or not makes no difference to infiniswap.
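    If you want to double-check the THP setting on your host, the standard sysfs knob can be read as follows (informational only; as explained above, Infiniswap behaves the same either way):
    cat /sys/kernel/mm/transparent_hugepage/enabled
    # e.g. "[always] madvise never" means THP is currently enabled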

  2. Can we use Docker containers instead of LXC?
    Yes. Infiniswap requires a container-based environment, but it has no dependency on LXC. Any container technology that can limit memory resources and enable swapping should work.
    We haven't tried Docker yet. If you find any problems when running infiniswap in a Docker environment, please contact us.
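    For reference, here is an untested sketch of how the LXC memory limit above might translate to Docker: "--memory" caps local memory and "--memory-swap=-1" allows unlimited swap, which would then go to the Infiniswap device. We have not verified this setup ourselves.
    # untested sketch; equivalent in spirit to "lxc.cgroup.memory.limit_in_bytes = 5G"
    docker run -it --memory=5g --memory-swap=-1 ubuntu:14.04 /bin/bash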

  3. Invalid parameters error when inserting the module?
    There are two ways of compiling infiniswap: 1) against the inbox driver, or 2) against Mellanox OFED. With the inbox driver, you compile/link against the kernel headers/modules; with Mellanox OFED, you must compile/link against the OFED headers/modules. This should be handled by the configure script; refer to the Makefile that links the OFED modules.

  4. Other compatibility issues

    • lookup_bdev() takes different input arguments when the corresponding kernel patch is applied. By default, we assume the patch is not installed. If your OS has this patch, you should:
      • If you use setup/install.sh, please set
        # setup/install.sh
        have_lookup_bdev_patch=1  #the default value is 0.
      • Or, if you build infiniswap_bd manually, add --enable-lookup_bdev in the configuration step.

Contact

This work is by Juncheng Gu, Youngmoon Lee, Yiwen Zhang, Mosharaf Chowdhury, and Kang G. Shin. You can email us at infiniswap at umich dot edu, file issues, or submit pull requests.


Issues

Add support for swap-in request with multiple pages in Linux kernel 4.x

In Linux kernel 4.x, a swap-in (read) request can batch multiple pages, i.e., a single swap-in request may contain multiple pages.
Infiniswap was initially developed on kernel 3.13, which only allows a single page per swap-in request; therefore, current Infiniswap can only accept swap-in requests with one page.
Multi-page swap-in requests should be supported for kernel 4.x.

NULL bio structure

Newer kernels (4.x and above) generate requests with a NULL bio.
Stackbd tries to clone it and panics with a NULL pointer exception.

failed to create device: [Errno 1] Operation not permitted Invalid create_device operation

Following the README.md, we reach the "How to Run" step. When we run the command "sudo ./infiniswap_bd_setup.sh", we get the following errors:

failed to create device: [Errno 1] Operation not permitted
Invalid create_device operation

Reading the nbdxadm file, we find that the error occurs in this block:

try:
    print nbdxdevice
    with open(os.path.join(device_path, 'device'), 'w') as device_file:
        device_file.write(nbdxdevice)
    print 123
except IOError, e:
    print 'failed to create device: %s' % e
    os.rmdir(device_path)
    return 1

It seems that the script can open and create the directory and the device file, but it cannot write to or modify the "device" file.
How can we solve this problem? We hope you can help us.

Is this project still being developed?

I'm intrigued by what this might be able to provide, but I'm curious whether it has been abandoned or whether there is ongoing development out of tree. If so, will that work be merged in at some point?

BUG: unable to handle kernel NULL pointer dereference at (null)

While the program is running, I always hit a NULL pointer dereference. How can I fix this problem?

247 [ 1425.681532] BUG: unable to handle kernel NULL pointer dereference at (null)
248 [ 1425.701767] IP: [] stackbd_make_request2+0x74/0x150 [infiniswap]
249 [ 1425.722366] PGD 0
250 [ 1425.731941] Oops: 0002 [#1] SMP
251 [ 1425.742142] Modules linked in: infiniswap(OE) rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) configfs ib_ipoib(OE) ib_cm(OE) ib_uverbs(OE) ib_umad(OE) mlx5_ib(OE) mlx5_core(OE) mlx4_ib(OE) ib_core(OE) mlx4_en(OE) mlx4_core(OE) mlx_compat(OE) veth 8021q garp mrp nfsv3 ipod(OE) ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter ip_tables xt_conntrack x_tables nf_nat nf_conntrack br_netfilter bridge stp llc overlay aufs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ipmi_ssif x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm ast ttm drm_kms_helper drm joydev input_leds fb_sys_fops syscopyarea sysfillrect dcdbas sysimgblt irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd ioatdma mei_me sb_edac lpc_ich mei edac_core shpchp wmi ipmi_si 8250_fintek ipmi_msghandler acpi_pad mac_hid lp parport nfsd auth_rpcgss nfs_acl nfs lockd grace sunrpc fscache hid_generic usbhid hid igb i2c_algo_bit ixgbe(OE) vxlan ip6_udp_tunnel udp_tunnel dca ahci ptp libahci pps_core fjes [last unloaded: mlx_compat]
252 [ 1426.081679] CPU: 8 PID: 11134 Comm: python3 Tainted: G OE 4.4.0-140-generic #166~14.04.1-Ubuntu
253 [ 1426.111690] Hardware name: Dell Inc. PowerEdge C6220 II/09N44V, BIOS 2.3.1 01/02/2014
254 [ 1426.132262] task: ffff880849d1aa00 ti: ffff880846f1c000 task.ti: ffff880846f1c000
255 [ 1426.152463] RIP: 0010:[] [] stackbd_make_request2+0x74/0x150 [infiniswap]
256 [ 1426.182476] RSP: 0000:ffff880846f1fb60 EFLAGS: 00010086
257 [ 1426.202285] RAX: 0000000000000000 RBX: ffff880be3321f00 RCX: 0000000000000000
258 [ 1426.231659] RDX: 0000000000000001 RSI: 0000000000000004 RDI: 0000000000000046
259 [ 1426.251873] RBP: ffff880846f1fb70 R08: 0000000000000007 R09: 0000000002080020
260 [ 1426.272087] R10: 0000000000000000 R11: ffffffff81cd7fe8 R12: ffff880851b4d5c0
261 [ 1426.292334] R13: 000000000000000e R14: 0000000000001000 R15: ffff88084ed92cc0
262 [ 1426.312646] FS: 00007f84dc364180(0000) GS:ffff88105e600000(0000) knlGS:0000000000000000
263 [ 1426.342119] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
264 [ 1426.362056] CR2: 0000000000000000 CR3: 00000008470c0000 CR4: 0000000000160670
265 [ 1426.382240] Stack:
266 [ 1426.391847] ffff880851b4d5c0 ffff880036763000 ffff880846f1fbf8 ffffffffc0021100
267 [ 1426.412434] ffffe8fffee07800 ffff880846f1fbb0 ffffffff813b5f5f ffff880851b4d800
268 [ 1426.441635] 0000000382349000 000000000000000e 000000000000000e ffff880036763908
269 [ 1426.462237] Call Trace:
270 [ 1426.471676] [] IS_queue_rq+0x100/0x3b0 [infiniswap]
271 [ 1426.491829] [] ? part_round_stats+0x4f/0x60
272 [ 1426.511655] [] blk_mq_make_request+0x225/0x420
273 [ 1426.531656] [] generic_make_request+0xfd/0x2b0
274 [ 1426.551645] [] submit_bio+0x77/0x150
275 [ 1426.562431] [] ? end_swap_bio_write+0x80/0x80
276 [ 1426.582262] [] ? map_swap_page+0x12/0x20
277 [ 1426.602015] [] swap_readpage+0xc1/0xe0
278 [ 1426.612631] [] read_swap_cache_async+0x28/0x40
279 [ 1426.632457] [] swapin_readahead+0xfd/0x190
280 [ 1426.652428] [] handle_mm_fault+0x9a4/0x1b60
281 [ 1426.672294] [] ? __schedule+0x296/0x820
282 [ 1426.691870] [] ? __schedule+0x296/0x820
283 [ 1426.711656] [] __do_page_fault+0x19e/0x430
284 [ 1426.722431] [] do_page_fault+0x22/0x30
285 [ 1426.742059] [] page_fault+0x28/0x30
286 [ 1426.752640] Code: e9 cb 00 00 00 48 89 02 48 89 05 d8 7a 00 00 48 8b 1b 48 85 db 74 49 48 8b 15 09 6b 1d c2 be 20 00 08 02 48 89 df e8 fc 25 39 c1 <48> c7 00 00 00 00 00 48 8b 15 ae 7a 00 00 48 c7 40 48 30 06 02
287 [ 1426.822448] RIP [] stackbd_make_request2+0x74/0x150 [infiniswap]
288 [ 1426.842441] RSP
289 [ 1426.852608] CR2: 0000000000000000
290 [ 1426.874461] ---[ end trace 1887d9117942a1fc ]---
291 [ 1426.881991] ------------[ cut here ]------------
292 [ 1426.892431] WARNING: CPU: 8 PID: 11134 at /build/linux-lts-xenial-0Iypes/linux-lts-xenial-4.4.0/kernel/exit.c:661 do_exit+0x50/0xae0()
293 [ 1426.932225] Modules linked in: infiniswap(OE) rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) configfs ib_ipoib(OE) ib_cm(OE) ib_uverbs(OE) ib_umad(OE) mlx5_ib(OE) mlx5_core(OE) mlx4_ib(OE) ib_core(OE) mlx4_en(OE) mlx4_core(OE) mlx_compat(OE) veth 8021q garp mrp nfsv3 ipod(OE) ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter ip_tables xt_conntrack x_tables nf_nat nf_conntrack br_netfilter bridge stp llc overlay aufs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ipmi_ssif x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm ast ttm drm_kms_helper drm joydev input_leds fb_sys_fops syscopyarea sysfillrect dcdbas sysimgblt irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd ioatdma mei_me sb_edac lpc_ich mei edac_core shpchp wmi ipmi_si 8250_fintek ipmi_msghandler acpi_pad mac_hid lp parport nfsd auth_rpcgss nfs_acl nfs lockd grace sunrpc fscache hid_generic usbhid hid igb i2c_algo_bit ixgbe(OE) vxlan ip6_udp_tunnel udp_tunnel dca ahci ptp libahci pps_core fjes [last unloaded: mlx_compat]
294 [ 1427.262584] CPU: 8 PID: 11134 Comm: python3 Tainted: G D OE 4.4.0-140-generic #166~14.04.1-Ubuntu
295 [ 1427.292590] Hardware name: Dell Inc. PowerEdge C6220 II/09N44V, BIOS 2.3.1 01/02/2014
296 [ 1427.322117] 0000000000000000 ffff880846f1f880 ffffffff813e9d19 0000000000000000
297 [ 1427.342590] ffffffff81cbf198 ffff880846f1f8b8 ffffffff81082786 0000000000000009
298 [ 1427.371826] ffff880846f1fab8 0000000000000046 ffff880849d1aa00 0000000000000000
299 [ 1427.392216] Call Trace:
300 [ 1427.401760] [] dump_stack+0x63/0x8a
301 [ 1427.412545] [] warn_slowpath_common+0x86/0xc0
302 [ 1427.432196] [] warn_slowpath_null+0x1a/0x20
303 [ 1427.452162] [] do_exit+0x50/0xae0
304 [ 1427.471790] [] oops_end+0x93/0xd0
305 [ 1427.482202] [] no_context+0x10d/0x370
306 [ 1427.502159] [] __bad_area_nosemaphore+0x109/0x210
307 [ 1427.521995] [] bad_area+0x43/0x4a
308 [ 1427.541554] [] __do_page_fault+0x381/0x430
309 [ 1427.552367] [] do_page_fault+0x22/0x30
310 [ 1427.571996] [] page_fault+0x28/0x30
311 [ 1427.591565] [] ? stackbd_make_request2+0x74/0x150 [infiniswap]
312 [ 1427.611949] [] IS_queue_rq+0x100/0x3b0 [infiniswap]
313 [ 1427.631973] [] ? part_round_stats+0x4f/0x60
314 [ 1427.651578] [] blk_mq_make_request+0x225/0x420
315 [ 1427.671571] [] generic_make_request+0xfd/0x2b0
316 [ 1427.682555] [] submit_bio+0x77/0x150
317 [ 1427.702149] [] ? end_swap_bio_write+0x80/0x80
318 [ 1427.722190] [] ? map_swap_page+0x12/0x20
319 [ 1427.741585] [] swap_readpage+0xc1/0xe0
320 [ 1427.752376] [] read_swap_cache_async+0x28/0x40
321 [ 1427.772234] [] swapin_readahead+0xfd/0x190
322 [ 1427.792197] [] handle_mm_fault+0x9a4/0x1b60
323 [ 1427.811935] [] ? __schedule+0x296/0x820
324 [ 1427.831567] [] ? __schedule+0x296/0x820
325 [ 1427.842353] [] __do_page_fault+0x19e/0x430
326 [ 1427.862168] [] do_page_fault+0x22/0x30
327 [ 1427.881928] [] page_fault+0x28/0x30
328 [ 1427.901549] ---[ end trace 1887d9117942a1fd ]---

Extend infiniswap code for kernel 4.x

Infiniswap has some compatibility issues on Linux kernel 4.x.
We need to check and update the APIs used in the following parts:

  1. Block IO
  2. rdma_cm
  3. ib verbs

Failure with kernel 4.11.0-13-generic and MLNX OFED 4.1

Hello, I'm trying to get Infiniswap to work on kernel 4.11 to see whether there is a performance benefit from improvements in the Linux block layer over 3.13, but I'm getting the error messages below on the bd client. Are these the components that the README refers to when mentioning 4.11 kernel support?

  • Ubuntu 16.04.1 LTS
  • kernel 4.11.0-13-generic
  • MLNX OFED MLNX OFED 4.1-1.0.2.0

A nearly identical setup worked with 14.04 and kernel 4.4 (same OFED). The module loads fine and nbdxadm establishes the connection. However, mkswap /dev/infiniswap0 produces the log messages below, and when using the swap device, all I/O goes to disk.

Client logs:

[  610.783911] rdma_resolve_addr - rdma_resolve_route successful
[  610.783916] IS_setup_qp: enabling unsafe global rkey
[  610.783983] created pd ffff8f4a5902e280
[  610.785580] created cq ffff8f4a578b6a00
[  610.786266] created qp ffff8f4a5853f000
[  610.786267] IS: IS_setup_buffers called on cb ffff8f425bb5e000
[  610.786267] IS: size of IS_rdma_info 584
[  610.786270] IS: cb->mem=1 
[  610.786271] IS: IS_setup_buffers, in cb->mem==DMA 
[  610.786292] IS: allocated & registered buffers...
[  610.809119] cma_event type 9 cma_id ffff8f4a585df000 (parent)
[  610.809121] ESTABLISHED
[  610.809126] rdma_connect successful
[  610.809188] IS: client receives unknown msg
[  610.809189] IS: recv wc error: -1

I've added debugging lines to confirm that the received message size is 584, but the type received is 0. You can see below that the daemon thinks the type sent is 4.

Daemon output:

listening on port 9400.
rdma_session_init, get free_mem 61
rdma_session_init, allocated mem 48
free_mem, is called, last 13 GB, weight: 0.700000, 0.300000
received connection request.
connection build
send_free_mem_size , 48
message size = 584
RDMA sending type 4

blocked for more than 120 seconds

I always get the following error. Is there any way to fix it?

162 [ 843.072392] INFO: task python3:5604 blocked for more than 120 seconds.
163 [ 843.092463] Tainted: G OE 4.4.0-140-generic #166~14.04.1-Ubuntu
164 [ 843.112863] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
165 [ 843.133422] python3 D ffff880838017788 0 5604 3943 0x00000000
166 [ 843.133427] ffff880838017788 0000000100000000 ffff880857d15400 ffff880838018000
167 [ 843.133429] 0000000000000000 7fffffffffffffff ffff88107fff5908 ffffffff818205c0
168 [ 843.133432] ffff8808380177a0 ffffffff8181fe75 ffff88085ee97300 ffff880838017848
169 [ 843.133434] Call Trace:
170 [ 843.133443] [] ? bit_wait+0x50/0x50
171 [ 843.133446] [] schedule+0x35/0x80
172 [ 843.133449] [] schedule_timeout+0x23b/0x2d0
173 [ 843.133455] [] ? blk_flush_plug_list+0xc4/0x200
174 [ 843.133460] [] ? ktime_get+0x3e/0xb0
175 [ 843.133462] [] ? bit_wait+0x50/0x50
176 [ 843.133464] [] io_schedule_timeout+0xa6/0x110
177 [ 843.133467] [] bit_wait_io+0x1b/0x60
178 [ 843.133469] [] __wait_on_bit+0x62/0x90
179 [ 843.133473] [] ? pageout.isra.43+0x1a5/0x280
180 [ 843.133478] [] wait_on_page_bit+0xc0/0xd0
181 [ 843.133483] [] ? autoremove_wake_function+0x40/0x40
182 [ 843.133486] [] shrink_page_list+0x6d4/0x770
183 [ 843.133489] [] shrink_inactive_list+0x1e9/0x500
184 [ 843.133491] [] shrink_lruvec+0x59e/0x740
185 [ 843.133494] [] shrink_zone+0xdc/0x2c0
186 [ 843.133496] [] do_try_to_free_pages+0x153/0x3e0
187 [ 843.133499] [] ? get_page_from_freelist+0x422/0x920
188 [ 843.133504] [] ? find_next_bit+0x19/0x20
189 [ 843.133506] [] try_to_free_mem_cgroup_pages+0xb7/0x170
190 [ 843.133511] [] try_charge+0x180/0x630
191 [ 843.133514] [] mem_cgroup_try_charge+0x63/0x1b0
192 [ 843.133517] [] handle_mm_fault+0x1295/0x1b60
193 [ 843.133520] [] ? __schedule+0x2a2/0x820
194 [ 843.133522] [] ? __schedule+0x296/0x820
195 [ 843.133524] [] ? __schedule+0x296/0x820
196 [ 843.133529] [] __do_page_fault+0x19e/0x430
197 [ 843.133531] [] do_page_fault+0x22/0x30
198 [ 843.133535] [] page_fault+0x28/0x30

IS fails to compile with MLNX_OFED 4.1 on CentOS

Have you guys tried MLNX_OFED-4.1 with IS?

I was running CentOS 7.2 with vanilla 3.13.0, and some RDMA APIs (e.g. rdma_create_id()) in the MLNX_OFED-4.1 header files actually come from a higher kernel version.

If you are able to compile/run IS with 4.1, please close this issue. If not, I'd suggest you try it out and remove 4.1 from the README.

Also, decent error handling would be really appreciated, especially for kernel modules. No one wants a power cycle just for a BACKUP_DISK typo or rmmod.

Reset mapped chunks when block device is down

When a block device goes down, the daemons should reclaim the memory mapped to that block device so that other block devices can use it. Each daemon probably needs to check whether a block device has been down for some time, so that its memory can be freed and used by other block devices.

Got a soft lockup when I try to use it

I tried to use it on CloudLab, but got a soft lockup. Do you have any idea about it?
The details are as follows:

Environment:
infiniswap: the latest master branch version
Ubuntu 14.04 (3.13.0-168-generic)
MLNX_OFED_LINUX-3.3-1.0.4.0-ubuntu14.04-x86_64
docker-ce 17.06.0~ce-0~ubuntu

Hardware:
Two m510 nodes with ConnectX-3 NICs.

I followed the README, used the script to install, and everything ran successfully. I then used a Docker container to run the application. But when I tried to run the TPC-C benchmark in VoltDB, the container and the node went down. After I changed /proc/sys/kernel/softlockup_panic to 1 to make the node reachable again, I reran it with a simple program that just keeps allocating memory in the container. As soon as it starts to use swap, I get the same error again. Running dmesg, I get the following messages:

[ 680.398599] BUG: soft lockup - CPU#6 stuck for 22s! [docker:11971]
[ 680.426590] Modules linked in: infiniswap(OX) ipt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter ip_tables xt_conntrack x_tables nf_nat nf_conntrack bridge stp llc aufs nfsv3 ipod(OX) ib_iser(OX) iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi gpio_ich x86_pkg_temp_thermal coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd lpc_ich ipmi_si shpchp nfsd knem(OX) mac_hid wmi acpi_power_meter auth_rpcgss nfs_acl lp parport nfs lockd sunrpc fscache rdma_ucm(OX) ib_ucm(OX) rdma_cm(OX) iw_cm(OX) configfs ib_ipoib(OX) ib_cm(OX) ib_uverbs(OX) ib_umad(OX) mlx5_ib(OX) mlx5_core(OX) mlx4_ib(OX) ib_sa(OX) ib_mad(OX) ib_core(OX) ib_addr(OX) ib_netlink(OX) mlx4_en(OX) vxlan ip_tunnel ptp pps_core mlx4_core(OX) nvme mlx_compat(OX)
[ 680.791448] CPU: 6 PID: 11971 Comm: docker Tainted: G D OX 3.13.0-168-generic #218-Ubuntu
[ 680.833909] Hardware name: HP ProLiant m510 Server Cartridge/ProLiant m510 Server Cartridge, BIOS H05 05/09/2016
[ 680.882638] task: ffff880ffe6fc800 ti: ffff880fdab22000 task.ti: ffff880fdab22000
[ 680.918015] RIP: 0010:[] [] smp_call_function_many+0x28e/0x2f0
[ 680.958596] RSP: 0000:ffff880fdab23c30 EFLAGS: 00000202
[ 680.982393] RAX: 000000000000000d RBX: ffff88107fcd46a8 RCX: ffff88107fdb7ab8
[ 681.014626] RDX: 000000000000000d RSI: 0000000000000100 RDI: 0000000000000000
[ 681.046588] RBP: ffff880fdab23c80 R08: ffff88107fcd4688 R09: 0000000000000004
[ 681.078563] R10: ffff88107fcd4688 R11: 0000000000000000 R12: 0000000000000006
[ 681.110655] R13: 0000010000000006 R14: 000000000000faa0 R15: 000000fc00013b80
[ 681.142801] FS: 00007f4f157fa700(0000) GS:ffff88107fcc0000(0000) knlGS:0000000000000000
[ 681.179177] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 681.205209] CR2: 00007f4f1e5676c0 CR3: 0000001024c26000 CR4: 0000000000360770
[ 681.237166] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 681.269155] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

[ 681.301400] Stack:
[ 681.310363] ffff88107fcd46a8 0000000000014640 ffff880fdab23c90 ffffffff81060e20
[ 681.343509] 0000010000000001 ffff8810237c33c0 00007f4f1e5686c0 ffff8810237c3100
[ 681.376719] 00007f4f1e5676c0 ffff880fdab5fb38 ffff880fdab23ca8 ffffffff8106103e
[ 681.410197] Call Trace:
[ 681.421135] [] ? do_kernel_range_flush+0x40/0x40
[ 681.449502] [] native_flush_tlb_others+0x2e/0x30
[ 681.477577] [] flush_tlb_mm_range+0x8a/0x120
[ 681.504522] [] ptep_clear_flush+0x53/0x60
[ 681.529867] [] do_wp_page+0x2a5/0x860
[ 681.554043] [] handle_mm_fault+0x6fb/0xfb0
[ 681.579771] [] ? n_tty_read+0x40c/0xc00
[ 681.604430] [] __do_page_fault+0x183/0x570
[ 681.630373] [] ? wake_up_state+0x20/0x20
[ 681.655321] [] do_page_fault+0x1a/0x70
[ 681.680765] [] page_fault+0x28/0x30
[ 681.703836] Code: 3b 05 cf 52 c3 00 89 c2 0f 8d fd fd ff ff 48 98 49 8b 4d 00 48 03 0c c5 40 7c d1 81 f6 41 20 01 74 cb 0f 1f 00 f3 90 f6 41 20 01 <75> f8 eb be 0f b6 4d d0 48 8b 55 c0 44 89 ef 48 8b 75 c8 e8 ca

Hoping for your help,
thanks.
