Comments (18)

fyc1007261 commented on June 17, 2024

Hi @lastweek!
I have looked into the source code and found that LegoOS checks every port of the IB NIC and panics if any of them is using Ethernet. However, the IB-capable NIC on CloudLab has two ports, one IB and one Ethernet; this is configured permanently and may not be changeable by software. Is it possible to use only the IB port for RDMA? If so, could you please give some instructions for modifying the LegoOS source code?
Thanks a lot!

from legoos.

lastweek commented on June 17, 2024

Hi @fyc1007261,

Sorry for the inconvenience. The current driver does not support RoCE, so it simply panics once RoCE is detected. (To be precise, I'm not sure whether it can work on RoCE; I don't remember whether I omitted some RoCE-related code in mlx4.)

The error message from make install is fine, as long as the kernel image is installed in /boot and you can find it in the GRUB menu.

"Please wait for enough IB MAD (number 7)" means both machines are waiting for the MAD control messages from the InfiniBand switch. For a 1P-1M configuration, you need an IB switch, with both machines connected to it. What configuration are you using?

fyc1007261 commented on June 17, 2024

Hi @lastweek,
Thanks for your reply! I looked into the LegoOS source code and found that the driver panics if any port of the InfiniBand-capable NIC is using Ethernet. However, the NIC on CloudLab has two ports, one using IB and the other using Ethernet. I may try to modify the LegoOS source code so that it sees only one port and uses only that port. Is there anything I should pay attention to?

As for the IB MAD problem, I am using another IB-capable NIC that is not Mellanox but a QLogic QLE. I suspect that it might not be supported by the driver?

Thanks so much for your help!

lastweek commented on June 17, 2024

Hi @fyc1007261,

It is indeed a LegoOS bug. You should try the approach you proposed. Pay attention to the port number and make sure you are using the IB port.
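
One way to confirm which port is the IB one before touching LegoOS is to boot the node into a stock Linux with the Mellanox driver loaded and read each port's link_layer attribute from sysfs. The sketch below assumes that setup. list_ib_ports is a hypothetical helper name, not part of LegoOS, and the port numbers it prints are the Linux ones, which are assumed (not confirmed in this thread) to match what the LegoOS driver sees.

```shell
#!/bin/sh
# list_ib_ports [SYSFS_ROOT]
# Print "<device> <port> <link_layer>" for every RDMA port found under
# SYSFS_ROOT (default: /sys/class/infiniband). The link_layer file holds
# "InfiniBand" or "Ethernet" on a stock Linux kernel.
list_ib_ports() {
    root="${1:-/sys/class/infiniband}"
    for port_dir in "$root"/*/ports/*; do
        # Skip anything without a readable link_layer (also covers the
        # case where the glob matched nothing).
        [ -r "$port_dir/link_layer" ] || continue
        dev="$(basename "$(dirname "$(dirname "$port_dir")")")"
        port="$(basename "$port_dir")"
        layer="$(cat "$port_dir/link_layer")"
        echo "$dev $port $layer"
    done
}
```

Calling list_ib_ports with no argument on such a node prints one line per port; the line whose last field is InfiniBand identifies the port to keep.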

About the QLogic QLE machine: are you running LegoOS on top of that? I don't think the mlx driver can run with it. Anyhow, can you tell me more about your hardware setup? Thanks.

fyc1007261 commented on June 17, 2024

Hi @lastweek,

It is true that the driver does not support QLE NICs. I am now trying a 1P-1M setup with a Mellanox MX354A NIC and SX6036G/U1 IB switches. (The Mellanox IS5035 is not available on CloudLab.)

fyc1007261 commented on June 17, 2024

Hi @lastweek,
I finally succeeded in deploying the 1P-1M configuration by hard-coding all the num_ports variables to 1. Thanks a lot for your help!

lastweek commented on June 17, 2024

Cool!!! Would you mind sharing your solution with us? Being able to run on CloudLab is a big deal!!

fyc1007261 commented on June 17, 2024

Hi @lastweek ,

For the 1P-1M setup, I used the r320 hardware at Apt Utah with the CentOS 7 image and simply connected 3 raw PCs together (though only 2 are currently used). I then modified the code in the drivers/ directory to hard-code all num_ports variables to 1 (because one port of the r320 NIC uses Ethernet). After that, I just followed the instructions you provided on GitHub.
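
The thread does not say which files were changed, so a reasonable first step for anyone repeating this is to list every place num_ports appears under drivers/. The sketch below is only a way to locate candidates; find_num_ports is a hypothetical helper name, and it assumes a grep that supports --include (GNU grep does).

```shell
#!/bin/sh
# find_num_ports [DRIVERS_DIR]
# List every line in C sources and headers under DRIVERS_DIR
# (default: drivers) that mentions num_ports, as "file:line:text".
find_num_ports() {
    drivers_dir="${1:-drivers}"
    grep -rn --include='*.c' --include='*.h' 'num_ports' "$drivers_dir"
}
```

Run from the top of the LegoOS tree, find_num_ports drivers prints the lines to review before deciding which assignments or loop bounds to pin to 1.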

As for the storage node, there might be some problem with the CentOS image on CloudLab: so far I have not been able to install Linux 3.11.1 on it. I will keep trying.

lastweek commented on June 17, 2024

Cool!! Let me know if you have issues installing 3.11.1. A very concise set of instructions:

1) Download 3.11.1 from kernel.org.
2) Copy /boot/config-3.10.xxx (the default config) into linux-3.11.1/ as .config.
3) make oldconfig
4) make && make modules_install && make install
5) Reboot into 3.11.1.
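
The steps above can be sketched as a small script. This is an illustration under assumptions, not a tested recipe for CloudLab: the download URL is an assumed kernel.org mirror path, the old config path is passed in by the caller rather than guessed, the explicit make build step is the one spelled out later in this thread, and run and install_kernel are hypothetical helper names. Setting DRY_RUN=1 prints the commands without executing them.

```shell
#!/bin/sh
KERNEL_VERSION="3.11.1"
# Assumed download location; adjust if the mirror layout differs.
KERNEL_URL="https://cdn.kernel.org/pub/linux/kernel/v3.x/linux-${KERNEL_VERSION}.tar.xz"

# run CMD...: echo the command, then execute it unless DRY_RUN=1.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-0}" != "1" ]; then
        "$@"
    fi
}

# install_kernel OLD_CONFIG
# Build and install the kernel, using OLD_CONFIG (the distribution's
# default config under /boot) as the starting point.
# The two install steps need root privileges.
install_kernel() {
    old_config="$1"
    if [ -z "$old_config" ]; then
        echo "usage: install_kernel <path to old config under /boot>" >&2
        return 1
    fi
    src="linux-${KERNEL_VERSION}"
    run wget "$KERNEL_URL" || return 1              # 1) download
    run tar -xf "${src}.tar.xz" || return 1
    run cp "$old_config" "${src}/.config" || return 1   # 2) copy old config
    run make -C "$src" oldconfig || return 1        # 3) answer new options
    run make -C "$src" || return 1                  # build kernel and modules
    run make -C "$src" modules_install || return 1  # 4) install modules
    run make -C "$src" install || return 1          #    install image to /boot
    echo "Done. Reboot and select ${KERNEL_VERSION} in the boot menu."  # 5)
}
```

A dry run such as DRY_RUN=1 followed by install_kernel with the path to the default config is a cheap way to review the command sequence before running it for real.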

Let me know how it goes!

fyc1007261 commented on June 17, 2024

Hi @lastweek,

There is still something wrong with my CentOS setup or my 3.11.1 kernel, so I have failed to install 3.11.1.

Is it possible to use a newer stable version such as 3.16.70? I found some differences in the kernel code that linux-modules uses, and I plan to modify parts of linux-modules to fit 3.16.70. Did the newer kernel only change some interfaces, or did it change important internals in a way that may break LegoOS's storage node? In other words, is my plan likely to work?

Thanks a lot for your help!

lastweek commented on June 17, 2024

Hi @fyc1007261,

That might work. I've done similar things (porting some old RDMA code to a 4.x kernel); that time I had to change some protection-domain code and some other stuff. However, it can be time-consuming and error-prone. Before you proceed, can you share more details on the installation failure, e.g., panic messages?

fyc1007261 commented on June 17, 2024

Hi @lastweek,

I used the following steps: cp /path/to/oldconfig .config -> make oldconfig (accepting the defaults for new options) -> make -> make modules_install -> make install. The 3.11.1 kernel just didn't show anything after I pressed Enter to select 3.11.1 in the boot loader. I also tried the same steps in VMware and got the same result. The VMware monitor says that the CPU of the guest OS has been disabled, and I cannot figure out where the problem is.

Have you ever run into such a problem, or could you please give some suggestions? Thanks!

lastweek commented on June 17, 2024

About your /path/to/oldconfig, which kernel version is it?

fyc1007261 commented on June 17, 2024

It is 3.10.0-957.12.2.el7.x86_64, which is the default version for CentOS 7 on CloudLab

lastweek commented on June 17, 2024

3.11.1config.txt

Hi @fyc1007261, I uploaded an old config file from our machine. Though the machine is different, do you wanna give it a try?

fyc1007261 commented on June 17, 2024

Thanks so much! I will try it soon and report to you later.

fyc1007261 commented on June 17, 2024

Hi @lastweek ,

Unfortunately, your config still won't work :(
I may try QEMU to find out what's wrong inside the kernel.

By the way, I tried running the storage node with 3.16.70, but the processor monitor panicked with a fatal exception: BUG: unable to handle kernel paging request at ffff880439d28b10. Might that be caused by the differences between the two kernel versions?

Thanks a lot for your support!

fyc1007261 commented on June 17, 2024

Hi @lastweek,

Thanks to your previous help, I have now succeeded in deploying 1P-1M-1S on Ubuntu 14.04 with the 3.11.1 kernel! I am able to run some simple Python scripts. It works quite well for printing messages, using Python's built-in modules (like time, copy, etc.) and using local modules (another Python script in the same folder). But when I import an external module (I tried numpy), the processor monitor panics, saying unable to handle kernel paging request at <some address>. I wonder whether LegoOS supports external modules. I am using Python 2.7 with pip 19.2.1 and numpy 1.16.4; numpy was installed via pip. Could you please give some suggestions?

Thanks again for your patience!
