With the release of JPype1 version 1.4.1, CI jobs have started to fail, e.g. <a href="

Can you give me a rough deion of your test? For example, somet

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

I spent some further time investigating this, including using <a href="https://pypi.or

Adjust for JPype1 v1.4.1,about iiasa/ixmp

Comments (7)

Thrameos commented on June 12, 2024

There was no intended change of behavior in JPype. But we had to change mechanisms. If the new mechanism is triggering a referencing problem, please try to replicate.

from ixmp.

khaeru commented on June 12, 2024

Totally understood, and thank you for the comment.

As the description says, we can try to condense a MWE from our somewhat complicated code base—only, no bandwidth to do this right now. This issue is just to put a pin in the matter until we manage to find time.

from ixmp.

Thrameos commented on June 12, 2024

Can you give me a rough description of your test?

For example, something like this:

Create a Java object instance
Check the ref count is 1
Set up a reference loop
Delete the original reference
Call garbage collector
Check if hash of item appears in GC list

(which is incidentally one of the patterns that I check).

Most important details....

What is the type of objects that was created (class, object, primitive, proxy, string, exception)?
How was it structurally in relationships to others? (we created a class, then an object, then a proxy, then placed the proxy in a container, etc)
How was it determined if the object was leaking?
Does the leak checker fail on one version or multiple? (We were forced to change patterns for Py 3.11, but the older ones could use a different pattern?
Last, was an exception part of the process? (We changed both the value creation and the exception frame generator. Both are suspects as the gc could be triggering on a Frame creation rather than a JPype object create.)

We have built in leak checkers for objects, strings, and primitives (brute force create a lot and make sure they go away), but that can miss structural defects in which the order, timing, or connections of an object causes an issue.

from ixmp.

glatterf42 commented on June 12, 2024

With the release of JPype 1.5.0, this has become relevant again since all our workflows on Pyhon < 3.11 are breaking. This is because the temporary workaround specifies pip install "JPype1 != 1.4.1", which is also satisfied by installing 1.5.0. Interestingly, 1.5.0 also works for Python 3.11. For the sake of completeness, there are also other other minor version changes from the last working runs to the first failed ones: ipykernel went from 6.27.1 to 6.28.0, jupyter-core from 5.5.1 to 5.6.0, and jupyter-server-terminals from 0.5.0 to 0.5.1. However, they all seem unrelated, especially seeing that the error messages are exactly the same again as described above here.
So a path to mitigation would be clear: instead of temporarily enforcing JPype1 != 1.4.1, enforce JPype1 < 1.4.1 or even JPype1 == 1.4.0, both of which would not automatically test if potential newer versions than 1.5.0 might resolve this issue. So a proper fix would be desirable and since you've taken an interest before, @Thrameos, maybe you could take another look? I'm not familiar with Java, unfortunately, but I'll do my best to break down our test to answer your questions. Any help is much appreciated :)

from ixmp.

glatterf42 commented on June 12, 2024

TL;DR: this has gotten and taken much longer than anticipated. For my money, the issue can likely be explained by understanding the changes you made for Python 3.11. Jpype1 1.5.0 (and 1.4.1, for that matter) work like expected for Python 3.11, it's only failing on older versions on our side.

The first failing test is located in ixmp/tests/backend/test_base.py:

    def test_del_ts(self, test_mp):
        """Test CachingBackend.del_ts()."""
        # Since CachingBackend is an abstract class, test it via JDBCBackend
        backend = test_mp._backend
        cache_size_pre = len(backend._cache)

        # Load data, thereby adding to the cache
        s = make_dantzig(test_mp)
        s.par("d")

        # Cache size has increased
        assert cache_size_pre + 1 == len(backend._cache)

        # Delete the object; associated cache is freed
        del s

        # Objects were invalidated/removed from cache
        assert cache_size_pre == len(backend._cache)

self is an instance of the test class and should not matter, but test_mp is an ixmp.core.platform.Platform object. This is a python class in our modelling framework that connects models to a data-storing backend. The backend, in turn, is defined in the test_mp fixture as an instance of the JDBC backend. This is another python class (inheriting from CachingBackend) that uses JPype/JDBC to connect to an Oracle or HyperSQL database. In our test case, if that's relevant, we are using a local HyperSQL database. backend._cache is just a dictionary, then.
Next, the function make_dantzig creates a scenario containing the data of the Dantzig transportation problem. The line s.par("d") accesses the data of the distance parameter d, which looks (in python) something like this:

i            j           value unit
seattle      new-york    2.5   km
seattle      chicago     1.7   km
seattle      topeka      1.8   km
san-diego    new-york    2.5   km
san-diego    chicago     1.8   km
san-diego    topeka      1.4   km

(Just noticing: these distances are supposed to represent thousands of miles rather than km, so our units are wrong here, but that hardly matters for the error, I think.)
On the python side, this is stored in the backend._cache dict such that one value is added, which is confirmed by the first successful assert line. Then, we are deleting this whole scenario s again and expect this action to free the backend._cache, too, which is not happening for JPype1 >= 1.4.1.
The del function has been overwritten for the TimeSeries class, which the Scenario class inherits from, so del s will call the JDBCBackend del_ts function, which first calls the general CachingBackend.del_ts(), which just calls its own cache_invalidate(). This, finally, should remove all values from the backend._cache dict (but might return None if some key could not be found). Back in JDBCBackend.del_ts(), the last step is to call JDBCBackend.gc(), which shows some explicit mention of jpype:

    def gc(cls):
        if _GC_AGGRESSIVE:
            # log.debug('Collect garbage')
            try:
                java.System.gc()
            except jpype.JVMNotRunning:
                pass
            gc.collect()

Lastly, the overwriting of the del function only calls this whole thing as part of a try clause, excepting (AttributeError, ReferenceError) and moving on since we assume the object has already been garbage-collected. So if one of these is raised without backend._cache being emptied, this would probably result in exactly the error we are seeing.
(Not sure this can interfere here, but platform.__getattr__ has been overwritten and could raise an AttributeError.)

On the java side, I can't tell you much about what's happening, I'm afraid. When the Scenario object is initialized, it also initializes the JDBCBackend, which does things like start_jvm(), and

        try:
            self.jobj = java.Platform("Python", properties)
        except java.NoClassDefFoundError as e:  # pragma: no cover
            raise NameError(
                f"{e}\nCheck that dependencies of ixmp.jar are "
                f"included in {Path(__file__).parents[2] / 'lib'}"
            )
        except jpype.JException as e:  # pragma: no cover
            # Handle Java exceptions
            jclass = e.__class__.__name__
            if jclass.endswith("HikariPool.PoolInitializationException"):
                redacted = copy(kwargs)
                redacted.update(user="(HIDDEN)", password="(HIDDEN)")
                msg = f"unable to connect to database:\n{repr(redacted)}"
            elif jclass.endswith("FlywayException"):
                msg = "when initializing database:"
                if "applied migration" in e.args[0]:
                    msg += (
                        "\n\nThe schema of the database does not match the schema of "
                        "this version of ixmp. To resolve, either install the version "
                        "of ixmp used to create the database, or delete it and retry."
                    )
            else:
                _raise_jexception(e)
            raise RuntimeError(f"{msg}\n(Java: {jclass})")

, so it is checking for some exceptions. The next step, then, finally gets deep into java territory:
s.par("d") calls s._backend("item_get_elements", "par", name, filters) (with name="d" and filters=None), which calls s.platform._backend(s, method, *args, **kwargs) (with method="item_get_elements" and the rest passed along as *args). s.platform._backend is an instance of (through inheritance) the abstract Backend class, which does specify a __call__(self, obj, method, *args, **kwargs) function to return getattr(self, method)(obj, *args, **kwargs). With self=JDBCBackend, this means the function in question is this:

# type="par", name="d"
    def item_get_elements(self, s, type, name, filters=None):  # noqa: C901
        if filters:
            # Convert filter elements to strings
            filters = {dim: as_str_list(ele) for dim, ele in filters.items()}

        try:
            # Retrieve the cached value with this exact set of filters
            return self.cache_get(s, type, name, filters)
        except KeyError:
            pass  # Cache miss

        try:
            # Retrieve a cached, unfiltered value of the same item
            unfiltered = self.cache_get(s, type, name, None)
        except KeyError:
            pass  # Cache miss
        else:
            # Success; filter and return
            return filtered(unfiltered, filters)

        # Failed to load item from cache

        # Retrieve the item
        item = self._get_item(s, type, name, load=True)
        idx_names = list(item.getIdxNames())
        idx_sets = list(item.getIdxSets())

        # Get list of elements, using filters if provided
        if filters is not None:
            jFilter = java.HashMap()

            for idx_name, values in filters.items():
                # Retrieve the elements of the index set as a list
                idx_set = idx_sets[idx_names.index(idx_name)]
                elements = self.item_get_elements(s, "set", idx_set).tolist()

                # Filter for only included values and store
                filtered_elements = filter(lambda e: e in values, elements)
                jFilter.put(idx_name, to_jlist(filtered_elements))

            jList = item.getElements(jFilter)
        else:
            jList = item.getElements()

        if item.getDim() > 0:
            # Mapping set or multi-dimensional equation, parameter, or variable
            columns = copy(idx_names)

            # Prepare dtypes for index columns
            dtypes = {}
            for idx_name, idx_set in zip(columns, idx_sets):
                # NB using categoricals could be more memory-efficient, but requires
                #    adjustment of tests/documentation. See
                #    https://github.com/iiasa/ixmp/issues/228
                # dtypes[idx_name] = CategoricalDtype(
                #     self.item_get_elements(s, 'set', idx_set))
                dtypes[idx_name] = str

            # Prepare dtypes for additional columns
            if type == "par":
                columns.extend(["value", "unit"])
                dtypes.update(value=float, unit=str)
                # Same as above
                # dtypes['unit'] = CategoricalDtype(self.jobj.getUnitList())
            elif type in ("equ", "var"):
                columns.extend(["lvl", "mrg"])
                dtypes.update(lvl=float, mrg=float)

            # Copy vectors from Java into pd.Series to form DataFrame columns
            columns = []

            def _get(method, name, *args):
                columns.append(
                    pd.Series(
                        # NB [:] causes JPype to use a faster code path
                        getattr(item, f"get{method}")(*args, jList)[:],
                        dtype=dtypes[name],
                        name=name,
                    )
                )

            # Index columns
            for i, idx_name in enumerate(idx_names):
                _get("Col", idx_name, i)

            # Data columns
            if type == "par":
                _get("Values", "value")
                _get("Units", "unit")
            elif type in ("equ", "var"):
                _get("Levels", "lvl")
                _get("Marginals", "mrg")

            result = pd.concat(columns, axis=1, copy=False)
        elif type == "set":
            # Index sets
            # dtype=object is to silence a warning in pandas 1.0
            result = pd.Series(item.getCol(0, jList)[:], dtype=object)
        elif type == "par":
            # Scalar parameter
            result = dict(
                value=float(item.getScalarValue().floatValue()),
                unit=str(item.getScalarUnit()),
            )
        elif type in ("equ", "var"):
            # Scalar equation or variable
            result = dict(
                lvl=float(item.getScalarLevel().floatValue()),
                mrg=float(item.getScalarMarginal().floatValue()),
            )

        # Store cache
        self.cache(s, type, name, filters, result)

        return result

From the ixmp java source code (which is private for some reason, I believe), I would say that Parameter is a class: public class Parameter extends Item {, but I can't really tell you without reading up much more on java if we set up a reference loop or some such, sorry.
In the test, we then call the del construction as described above, which does not mention java or jpype a lot.

from ixmp.

khaeru commented on June 12, 2024

@glatterf42 thanks for digging in—I hope that was instructive!

The fault probably lies in code that I originally added in #213, several years ago. The basic purpose of this code is to aggressively free up references to Python objects (like Scenario, TimeSeries, etc.) so that they and the associated ixmp_source Java objects (via JPype) can be garbage-collected, freeing memory. This was deemed necessary for larger applications, like running many instances of the MESSAGEix-GLOBIOM global model in older workflows (using the scenario_runner/runscript_main pattern).

I will open a PR to remove the workaround for the current issue and adjust/update the tests and/or caching/memory-management code. Then we can discuss whatever the solution turns out to be.

from ixmp.

khaeru commented on June 12, 2024

I spent some further time investigating this, including using refcycle. After del s here:

ixmp/ixmp/tests/backend/test_base.py

Lines 148 to 149 in d92cfcd

    
           # Delete the object; associated cache is freed 
        
           del s

…I inserted code like:

import gc
gc.collect()
gc.collect()
gc.collect()

import sys
import traceback

import refcycle

import ixmp

snapshot = refcycle.snapshot()
scenarios = snapshot.find_by(lambda obj: isinstance(obj, ixmp.Scenario))
s2 = scenarios[0]
print(f"{s2 = }")
snapshot.ancestors(s2).export_image("refcycle-main.svg")

tbs = list(
    snapshot.ancestors(s2).find_by(
        lambda obj: type(obj).__name__ == "traceback"
    )
)
print(f"{tbs = }")
del snapshot
gc.collect()

for i, obj in enumerate(tbs):
    print(f"{i} {obj}")
    print(f"{obj.tb_frame = }")
    traceback.print_tb(obj, file=sys.stdout)

This produced output like the following:

In short:

At that point in the test, there are six builtins.traceback objects that exist.
These appear to originate in the Java code, rather than in the Python code (first frame object referenced by each traceback object).
Each of those frame stacks ultimately contain 1 or more references to the s Scenario instance.
These are what make the reference count non-zero, and prevent s from being garbage-collected.
Again, this occurs only with JPype ≥1.4.1 (including 1.5.0) and Python ≤ 3.10. It does not (see the CI runs on #504) occur with Python 3.11 or 3.12. I don't know why this is.
I also spent several hours fiddling with the exception handling. For instance, because tracebacks are associated with exceptions, I tried aggressively deleting and GC'ing jpype.JException instances and returning only Python exceptions in jdbc.py. Nothing seemed to make those traceback objects disappear.

from ixmp.

Adjust for JPype1 v1.4.1 about ixmp HOT 7 CLOSED

Comments (7)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent