So in trying to write tests for the Tags system, I've had some trouble with the design

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

Design of ContainerFile / Tags system about mdsynthesis HOT 6 CLOSED

datreant commented on July 2, 2024

Design of ContainerFile / Tags system

from mdsynthesis.

Comments (6)

dotsdl commented on July 2, 2024

@richardjgowers

A Container has two layers: the interface layer composed of its own methods, properties, and attached aggregators, and the backend layer which handles the details of storing the state of the object in a serialized form (ContainerFile and its subclasses). The reason for this separation is to make it easy to write interface elements like aggregators without having to think too carefully about atomicity of operations to the file, read and write locks (sensitive!), etc. Basically, the interface elements handle the details of how information from the user is obtained and returned, while the containerfile elements handle the details of storage to what amounts to a small database. I think it's important to keep this separation of roles for the sake of maintainability/sanity.

Even for something as simple as Tags, the interface methods do work that the ContainerFile doesn't (and shouldn't) have to. One example is Tags.add(), which can handle tags that are entered as lists of lists cleanly (perhaps rare), then feeds a flattened list of tags to the ContainerFile. The ContainerFile.add_tags() method for adding tags is as thin as it can be while fully completing the operation of adding tags nonredundantly, while Tags.add() includes machinery for making life convenient for a user that has better things to think about than whether their list of strings is flat.

I suppose in short: interface elements implement user convenience, while containerfile elements implement sensitive atomicity of serialization.

I'm not 100% sure on how this would work with all the file decorators, maybe generic Containerfile.read_table and Containerfile.write_table methods which Aggregators could use...

In principle, ContainerFile contains methods that give complete access to its elements in a safe way; that is, use of them won't result in a ContainerFile in a damaged or inconsistent state. Also, the schema for the ContainerFile tables are defined within it, and these have to remain clearly defined, preferably in one place like they are now. Otherwise it becomes difficult to guarantee containers produced with one version of MDS will work with a new version.

from mdsynthesis.

dotsdl commented on July 2, 2024

Because many aggregators can't really function without a Container instance attached, I wouldn't worry about testing them as standalone units. Instead, I would treat them as part of the Container class they are found associated with. They have to function in that context anyway in actual use. :D

As for the ContainerFile and its subclasses, those should work completely independently of any interface components. We should eventually build tests for each method of ContainerFile and its subclasses, even if much of this code is "covered" by testing at the level of the interface (Sim, Group).

from mdsynthesis.

richardjgowers commented on July 2, 2024

I wanted to create a fake container file, which would essentially store everything in memory instead.. that way we could isolate the Aggregators on their own. So you could check that your duplicate management etc is all working, and any errors are occuring as you try and save, but the correct stuff is being sent to be saved?

I understand the line between ContainerFile and Tags, but things like add are confusing because the args are handled in Tags, and then the duplicates are handled in ContainerFile. Wouldn't it be cleaner to have ContainerFile just serving up appropriate handles for users (Aggs)? Looking through, all the Aggregators ever use is .handle

from mdsynthesis.

dotsdl commented on July 2, 2024

I wanted to create a fake container file, which would essentially store everything in memory instead.. that way we could isolate the Aggregators on their own. So you could check that your duplicate management etc is all working, and any errors are occuring as you try and save, but the correct stuff is being sent to be saved?

I think this would be necessary if ContainerFile and Container needed to talk to each other (bidirectional codependency), but because containerfiles work just as well without an attached Container, it's only necessary to write tests for the ContainerFile and then test Container with the real ContainerFile backend. Faking a containerfile would create the need to write tests for the fake: why do that when you can test the production one, which we have to do anyway?

I understand the line between ContainerFile and Tags, but things like add are confusing because the args are handled in Tags, and then the duplicates are handled in ContainerFile.

Here's how I reason the separation: nonredundancy is an inherent feature of a set of tags, because we expect tags to work like a set (we only care about whether something is present). It's therefore containerfile's job to make sure that its stored table of tags is nonredundant. Not including this at the level of the containerfile means that any interface to manipulate tags must also reinvent removal of redundancy, and any accessors to tags must do the same to compensate for lapses on the part of any adders. It's a mess.

As for making things easy for the user by not caring how funky their lists of tags they want to add look, that's not the containerfile's problem. That's a convenience issue that was decided at the interface level, and the flexibility could have been left out. For Tags.add(), we could make it so that it only takes tags presented as individual arguments or only as a single, unnested list, but that's a choice that the underlying concept of what tags are and how they should be stored isn't concerned with.

Wouldn't it be cleaner to have ContainerFile just serving up appropriate handles for users (Aggs)? Looking through, all the Aggregators ever use is .handle

I don't think it would be cleaner, because the current implementation separates interface from serialization. One could completely rewrite the interface of Tags without needing to write a bit of pytables code, which is a huge plus. If instead we exposed ContainerFile.add_tags() as the method for adding tags, changing how flexible this is at the interface level means tampering around the underlying serialization scheme. It also means that what should be a low-level interface for adding tags to the ContainerFile can itself start collecting bloat.

I think for Tags this scheme looks like overengineering, but for something like Universes or Selections I think it's easier to see how this organizational scheme keeps the structure of a Container from becoming a mess. An even better example is Members, which I recently refactored to share an interface with Bundle, an in-memory version of Members. Members uses a ContainerFile's member table as its backend, while Bundle uses a _BundleBackend instance with a structured array instead. That might be of interest to you.

On that note, perhaps I'll change the _containerfile attribute to _backend for all Aggregator classes to reflect the separation between interfaces and the underlying backend.

from mdsynthesis.

richardjgowers commented on July 2, 2024

Ok if you're happy then I'll leave it. I'll look at another way of testing Tags.

from mdsynthesis.

dotsdl commented on July 2, 2024

This is a good discussion to have, since the underlying design determines future possibilities. For something like Tags, I think it's important to make sure it works at the Container level (interface), for all Container subclasses (this is already done in test_containers.py, in lines 62 to 96 at the moment) and also at the storage/backend level, the only backend being ContainerFile at the moment. Those tests are not yet written, but we will need them.

I changed the _containerfile references in all containers and aggregators to _backend in 86e0732. This reflects the agnosticism of Sims/Group/Tags/etc. to the underlying backend they are talking to, which need not even be writing to a file. For example, a backend could be written that stores everything in a MySQL database, a flat-file, in-memory, etc, but a Sim that uses this backend will have exactly the same interface as one that uses the default pytables HDF store.

from mdsynthesis.

Design of ContainerFile / Tags system about mdsynthesis HOT 6 CLOSED

Comments (6)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent