
Comments (3)

dcrowell77 commented on September 18, 2024
  1. HW414700 describes an early chip bug in which some SUE reporting could be missed. Prior to Nimbus 2.1 there was a chance of missing errors, so applying this setting forces a checkstop in those cases.
  2. If you search the code, you can see that HW414700 appears in a lot of places. It seems to affect more than just regular memory, since I see it in some of the other initfiles too. In general, it will cause more failures to checkstop rather than fail properly with an SUE/machinecheck.
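As a rough illustration of the search mentioned in point 2, here is a minimal sketch that walks a local source checkout and lists every file mentioning the erratum ID. The function name `find_references` and the checkout path in the usage note are hypothetical; this is not a hostboot tool, just a plain text search.

```python
import os


def find_references(root, needle="HW414700"):
    """Return sorted paths (relative to root) of files whose text contains needle."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # errors="ignore" lets us scan binary or oddly encoded files safely
                with open(path, "r", errors="ignore") as handle:
                    if any(needle in line for line in handle):
                        hits.append(os.path.relpath(path, root))
            except OSError:
                # Skip unreadable files (permissions, broken symlinks, ...)
                continue
    return sorted(hits)
```

Calling `find_references("path/to/checkout")` (path is a placeholder) returns the matching file names, which makes it easy to see which initfiles beyond the memory ones carry the workaround.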

Why are you interested in forcing checkstops for these kinds of errors? In general we would want the errors to flow upward into a possibly non-fatal machinecheck/SUE that the OS could handle accordingly. These systems are designed to avoid full system checkstops whenever possible.

from hostboot.

Grubby0624 commented on September 18, 2024

Thank you for your answer

  1. On Nimbus 2.1, the OS gets stuck during the DIMM RAS test, and the system serial port keeps reporting the error "Memory failure: 0x20000000: reserved kernel page still referenced by 1 users" for several hours. I think this is abnormal and unacceptable from a usability perspective.
  2. Nimbus 2.2/2.3 does not show this behavior. After the DIMM RAS test, the checkstop is triggered and the corresponding DIMM is restarted.
    So I suspect that Nimbus 2.1 also has the bug of missing some SUE reporting. That is why I'm interested in "forcing checkstops for these kinds of errors".


dcrowell77 commented on September 18, 2024

It seems unlikely that DD2.1 has the bug and that it went unaddressed. However, that level of part is technically only supported as part of the https://github.com/ibm-op-release/op-build branch, and it looks like you are trying to use our most current code level. If you are using master, all sorts of other settings could be incorrect for DD2.1. It is possible that you are missing some other tangentially related behavior that the OS relies on to handle the error properly. It does seem like the OS knows the memory is bad, which I think means that the initial chip bug was fixed, since that bug was a case of not reporting the error at all (a silent failure).

