sociomantic-tsunami / dlsproto

Distributed Log Store protocol definition, client, fake node, and tests

License: Boost Software License 1.0

Makefile 0.13% D 99.77% Shell 0.09% Dockerfile 0.01%

dlsproto's People

Contributors

nemanja-boric-sociomantic gavinnorman burgos daniel-zullo mathias-baumann-sociomantic gavin-norman-sociomantic mihails-strasuns-sociomantic jenkins-sociomantic stefan-koch-sociomantic matthew-love-sociomantic scott-gibson-sociomantic bogdan-szabo-sociomantic mihails-strasuns tiyash-basu-sociomantic matthias-wende-sociomantic don-clugston-sociomantic jens-mueller-sociomantic geod24

dlsproto's Issues

Disconnecting node during `GetRange` makes it never end

With two nodes, A and B, disconnecting one node while a GetRange request is running makes the request never end. The same happens if the request uses a single node and that node disconnects: the request never ends.

Since the DLS node never resumes the request, disconnecting the client should simply count towards the finished nodes. In general, I think we should take this into careful consideration when implementing continue-after-disconnect: resuming should not be allowed if the client has requested a stop.
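
The bookkeeping proposed above could look roughly like the following. This is only a sketch: `NodeState`, `RequestTracker` and the node names are hypothetical and not part of dlsproto; the point is that a disconnection is treated as a terminal state, so the request can still end.

```d
import std.stdio : writeln;

/// Per-node state of a multi-node request (hypothetical names).
enum NodeState { running, finished, disconnected }

/// Tracks which nodes still have to report for one GetRange-like request.
struct RequestTracker
{
    NodeState[string] nodes; // keyed by node name or "address:port"

    void add ( string node ) { this.nodes[node] = NodeState.running; }

    void markFinished ( string node ) { this.nodes[node] = NodeState.finished; }

    /// A disconnection is terminal: the node never resumes the request,
    /// so it counts towards the finished nodes.
    void markDisconnected ( string node )
    {
        if (this.nodes[node] == NodeState.running)
            this.nodes[node] = NodeState.disconnected;
    }

    /// The request is over once no node is still running.
    bool done ( ) const
    {
        foreach (state; this.nodes)
            if (state == NodeState.running)
                return false;
        return true;
    }
}

void main ( )
{
    RequestTracker tracker;
    tracker.add("A");
    tracker.add("B");

    tracker.markFinished("A");
    assert(!tracker.done());

    tracker.markDisconnected("B"); // B drops mid-request
    assert(tracker.done());        // the request can now end

    writeln("request done: ", tracker.done());
}
```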

Remove Trusty support

Remove Trusty support from all projects. Trusty is already more than 3 years old and the latest LTS (Xenial) has been out for more than a year, so it should be enough to maintain only the latest LTS from now on.

Abort GetRange if `Stopped` message not received for some amount of seconds

If GetRange is waiting for the Stopped message from the node and it never arrives, yet the node stays connected, the entire request hangs and cannot proceed. Simply registering a timer that resumes the receive fiber with a timeout message should be enough. The client should then continue with stopping the request (perhaps it should also raise node_error).
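
The idea of a bounded wait can be sketched with Phobos message passing. This is not how the client is implemented (it uses swarm fibers and a timer event); `Stopped`, `StopResult` and the two fake node functions are hypothetical stand-ins, and `std.concurrency.receiveTimeout` plays the role of the timer that wakes the receiving side.

```d
import core.time : Duration, msecs;
import std.concurrency : receiveTimeout, send, spawn, thisTid, Tid;
import std.stdio : writeln;

/// Hypothetical message a node sends to acknowledge a stop.
struct Stopped { }

enum StopResult { stopped, timed_out }

/// Waits for `Stopped`, but gives up after `timeout` so that the caller
/// can force the request to end (and perhaps report a node error).
StopResult waitForStopped ( Duration timeout )
{
    bool got = receiveTimeout(timeout, (Stopped s) { });
    return got ? StopResult.stopped : StopResult.timed_out;
}

/// A well-behaved node acknowledges the stop at once.
void goodNode ( Tid client ) { send(client, Stopped()); }

/// A misbehaving node stays "connected" but never answers.
void silentNode ( Tid client ) { }

void main ( )
{
    spawn(&goodNode, thisTid);
    assert(waitForStopped(500.msecs) == StopResult.stopped);

    spawn(&silentNode, thisTid);
    assert(waitForStopped(50.msecs) == StopResult.timed_out);

    writeln("ok");
}
```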

Rework GetRange to use batch-suspendable system from DMQ Consume

sociomantic-tsunami/dmqproto#3 discusses a (hopefully) much simpler and more reliable way of implementing a suspendable request. If the reworking of Consume is successful, GetAll should be reworked in the same way.

Expose client stats of records/bytes transferred

The client should allow inspecting the number of bytes/records transferred for each connection.
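
One possible shape for such counters is sketched below. `ConnStats`, `ClientStats` and the string connection key are assumptions made for illustration; the real client would presumably key the stats by node address/port and update them inside the connection handling code.

```d
import std.stdio : writefln;

/// Transfer counters for one connection (hypothetical layout).
struct ConnStats
{
    ulong records;
    ulong bytes;
}

/// Collects counters per node connection, keyed by "address:port".
struct ClientStats
{
    private ConnStats[string] per_conn;

    /// Called whenever a record arrives on a connection.
    void recordReceived ( string conn, size_t record_bytes )
    {
        auto stats = conn in this.per_conn;
        if (stats is null)
        {
            this.per_conn[conn] = ConnStats(1, record_bytes);
            return;
        }
        stats.records++;
        stats.bytes += record_bytes;
    }

    /// Returns the counters for `conn` (all zero if nothing was seen).
    ConnStats get ( string conn ) const
    {
        auto stats = conn in this.per_conn;
        return stats is null ? ConnStats.init : *stats;
    }
}

void main ( )
{
    ClientStats stats;
    stats.recordReceived("127.0.0.1:10000", 100);
    stats.recordReceived("127.0.0.1:10000", 50);

    auto s = stats.get("127.0.0.1:10000");
    assert(s.records == 2 && s.bytes == 150);
    assert(stats.get("127.0.0.1:10001") == ConnStats.init);

    writefln("%s records, %s bytes", s.records, s.bytes);
}
```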

Check log levels in usage example notifiers

e.g. https://github.com/sociomantic-tsunami/dlsproto/blob/neo/src/dlsproto/client/UsageExamples.d#L323-L330 is logging started, finished, and stopped as errors. This is harmless enough, but makes for a somewhat confusing usage example.

Pass request parameters to notifier as const

See sociomantic-tsunami/swarm#7.

Stop using templates Const, Immut and Inout

The templates Const, Immut and Inout were used for transitioning from D1 to D2. Now that swarm has been converted, they should be replaced by plain D2 keywords.

Provide iteration summary with the Finished/Stopped notification

The node should provide the client with a summary of the iteration, including information such as the total number of records in the range and the number of records in the range matching the given filters.

Add project to dlang's Jenkins

Similarly to what we already have with Ocean, we should add all other open source projects to dlang Jenkins to make sure new DMDs don't break them.

Update to ocean v4.0.0, swarm v5.0.0, and turtle v9.0.0

getRangeSize / getRangeCount to get number of bytes or record count

Sometimes when implementing features of a DLS reader we want to know exactly how much data we're dealing with for a certain range. It guides us in how we end up implementing our algorithms, whether we can keep some X range of data in memory, etc.

The current workaround is that we do these tests manually with getRange. But that of course causes a lot of network traffic for data which is ultimately just discarded.

Ideally it would be supported for getRange and its filter/regex equivalents.

Update usage examples to work with a DaemonApp base

Like https://github.com/sociomantic-tsunami/dhtproto/blob/neo/src/dhtproto/client/UsageExamples.d.

This makes the examples simpler for end-users to follow.

Update to features of swarm v4.4.0

https://github.com/sociomantic-tsunami/swarm/releases/tag/v4.4.0

- Include client addr/port in logging in legacy request handlers, as appropriate. (checked, not relevant)
- Add neo tests for streaming / batch request behaviour on connection disruption.

Timeout for the connection (low)(in)activity in getRange

The legacy clients implement an algorithm for making sure that all nodes are alive and sending traffic. It works by monitoring when the individual nodes finish running the request and observing how long the remaining nodes take to finish. When the client sees that an individual node needs a long time to complete the request, it "times out" and stops the request.

Ideally, the client should have a means of aborting the request on a connection that seems stale, without the client needing to track the behaviour of individual nodes.

D2: Automatically convert tags and push them

To move forward with D2 we need to provide automatically converted libraries for projects that are built only for D2.

Every time a new tag is pushed, we have to convert it to D2, tag the converted version with +d2 build metadata appended (so tag v2.3.4 --- conversion D2 ---> v2.3.4+d2), and push it back (making sure the new tag is not converted again!).
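
The tag-selection rule described above can be sketched as follows. The function names are hypothetical and the sketch only covers choosing and naming tags, not the conversion or the pushing; the `+d2` suffix is the one given in the issue.

```d
import std.algorithm.searching : canFind, endsWith;
import std.stdio : writeln;

enum d2_suffix = "+d2";

/// A tag that already carries the suffix is a conversion result and
/// must not be converted again.
bool needsConversion ( string tag )
{
    return !tag.endsWith(d2_suffix);
}

/// v2.3.4 -> v2.3.4+d2
string convertedTagName ( string tag )
{
    return tag ~ d2_suffix;
}

/// Picks the tags that still lack a converted counterpart.
string[] tagsToConvert ( string[] all_tags )
{
    string[] result;
    foreach (tag; all_tags)
        if (needsConversion(tag) && !all_tags.canFind(convertedTagName(tag)))
            result ~= tag;
    return result;
}

void main ( )
{
    auto tags = ["v2.3.3", "v2.3.3+d2", "v2.3.4"];
    assert(convertedTagName("v2.3.4") == "v2.3.4+d2");
    assert(tagsToConvert(tags) == ["v2.3.4"]);
    writeln(tagsToConvert(tags));
}
```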

Allow choosing the forward or backward iteration during GetRange

Clients can often decide whether they want to inspect old data once they have a sufficient amount of fresh data. If we allow the client to iterate from the end to the start of the time range (in chunks), it can stop the iteration once it decides it can ignore the older data.
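
A minimal sketch of backward iteration in chunks, assuming a half-open time range expressed in seconds. `TimeRange`, `chunksNewestFirst` and the chunk size are illustrative only and say nothing about how the node would actually order its bucket files.

```d
import std.stdio : writeln;

/// Half-open time interval [start, end) in seconds (hypothetical type).
struct TimeRange
{
    long start;
    long end;
}

/// Splits [start, end) into chunks of at most `chunk` seconds,
/// ordered from the newest chunk to the oldest.
TimeRange[] chunksNewestFirst ( long start, long end, long chunk )
{
    assert(chunk > 0 && start <= end);

    TimeRange[] result;
    for (long hi = end; hi > start; hi -= chunk)
    {
        long lo = hi - chunk < start ? start : hi - chunk;
        result ~= TimeRange(lo, hi);
    }
    return result;
}

void main ( )
{
    // 0..250 in chunks of 100, newest data first.
    auto chunks = chunksNewestFirst(0, 250, 100);
    assert(chunks == [TimeRange(150, 250), TimeRange(50, 150),
                      TimeRange(0, 50)]);

    // A client that has seen enough fresh data can stop after any chunk.
    foreach (c; chunks)
        writeln(c.start, " .. ", c.end);
}
```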

Remove de facto deprecated `GetSize` and `GetChannelSize`

These requests are already unsupported by the protocol, but they can't be deprecated since
we can't get rid of the deprecations in D1 due to the call chain analysis bug where the deprecated
struct triggers the deprecation itself. Since they have done nothing for the last two major releases, we can just
remove them.

D2 dlsproto

This issue is automatically created to act as a card for
Sociomantic D2 migration project board.

All other issues and pull requests in this repository that are related to D2
migration should be referenced from this issue. It is highly appreciated
to provide short status reports on any breakthroughs in form of issue comments.

Please adjust this issue column in https://github.com/orgs/sociomantic/projects/4
board to reflect current stage.

If this is not a D project (GitHub language detection is imperfect), please
both close the issue and remove matching card from the linked project.

Allow configuring test client's port

When several tests run in parallel, each with a separate node, we need to make
the clients connect to different ports. This is possible by overriding the DlsTestCase.prepare method, but it should be easier to configure (probably via a static variable, configurable from TestRunners).

See sociomantic-tsunami/dlsnode#7

Request to get data from multiple ranges/channels?

A fairly common usage is for a client to request several chunks of data in different ranges / channels. How exactly they do this varies by application. Some might simply do sequential GetRanges, others might start a certain number of GetRanges in parallel, etc. Existing client code in fact second guesses the DLS, attempting to optimise the throughput by a certain GetRange assignment strategy.

It occurs to me that this is not ideal. The DLS nodes ought to be the ones deciding the most efficient way to serve requests, not the clients. The node implementation can change over time, meaning that clients' attempts at gaming the system might become outmoded or even horribly inefficient.

We should think about adding a request to get data from multiple ranges/channels, allowing the node to decide the best way to serve the data.

Block level percentage progress during GetRange

The node could inform the client about the current iteration progress. The progress doesn't need to be fine-grained; reporting progress in blocks should be sufficient.

Clarify (doc, usage examples) that `node_error` means the request is broken

Generally, trying it again will have the same result.

Abort the request on the connection after the inactivity timeout

If there's a bug in the node's implementation, or a routing error in the network, the node might not
be able to proceed with sending data to the client. Perhaps setting an inactivity timeout could abort the iteration on the given node, without the user having to set up and maintain an inactivity timer in their code (this is different from a total-time timeout, as the timer needs to be reset each time we get data from the node).
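
The difference from a total-time timeout is that the deadline moves every time data arrives. A small sketch of that logic, with a hypothetical `InactivityTimer` type and explicit timestamps so that it does not depend on any event loop:

```d
import core.time : Duration, MonoTime, msecs;
import std.stdio : writeln;

/// Expires when no data has been seen for longer than `limit`.
struct InactivityTimer
{
    Duration limit;
    MonoTime last_activity;

    /// Call whenever data arrives from the node: resets the timer.
    void touch ( MonoTime now ) { this.last_activity = now; }

    /// True if the connection has been silent for too long at `now`.
    bool expired ( MonoTime now ) const
    {
        return now - this.last_activity > this.limit;
    }
}

void main ( )
{
    auto start = MonoTime.currTime;
    auto timer = InactivityTimer(100.msecs, start);

    assert(!timer.expired(start + 60.msecs));

    timer.touch(start + 60.msecs);              // a batch arrived
    assert(!timer.expired(start + 150.msecs));  // only 90 ms of silence
    assert(timer.expired(start + 200.msecs));   // 140 ms of silence: abort

    writeln("ok");
}
```

A total-time timeout would have fired at 100 ms regardless of the batch received at 60 ms; here the request is only aborted once the connection has actually gone quiet.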

Add client README

Like sociomantic-tsunami/dhtproto#50.

Drop D1 support

Use `swarm.neo.util.Batch` and increase batch size

Swarm implements a record batching utility in swarm.neo.util.Batch, and we should adapt dlsproto to use it. Also, the usual batch size (64k) may be too small for the amounts of data the DLS handles: the number of batches received is very high, leading to a lot of bandwidth being used just for signalling.
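
To give a feel for the numbers, the following back-of-the-envelope sketch counts the batches needed for a given transfer. It assumes that "64k" means 64 KiB; the 10 GiB transfer size and the 1 MiB alternative are made-up values for comparison, not proposals from the issue.

```d
import std.stdio : writefln;

/// Number of batches needed to carry `total_bytes` (rounded up).
ulong batchesNeeded ( ulong total_bytes, ulong batch_size )
{
    assert(batch_size > 0);
    return (total_bytes + batch_size - 1) / batch_size;
}

void main ( )
{
    enum ulong KiB = 1024;
    enum ulong MiB = 1024 * KiB;
    enum ulong GiB = 1024 * MiB;

    // Each batch implies at least one round of signalling, so fewer
    // batches means less overhead for the same payload.
    assert(batchesNeeded(10 * GiB, 64 * KiB) == 163_840);
    assert(batchesNeeded(10 * GiB, 1 * MiB) == 10_240);

    writefln("64 KiB batches: %s", batchesNeeded(10 * GiB, 64 * KiB));
    writefln("1 MiB batches:  %s", batchesNeeded(10 * GiB, 1 * MiB));
}
```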

Remove deprecated functions used from swarm in next major

14:59:59 ./submodules/dlsproto/src/dlsproto/client/internal/SuspendableRequest.d(474): Deprecation: function swarm.neo.connection.RequestOnConnBase.RequestOnConnBase.EventDispatcher.receiveAndHandleEvents is deprecated - Use nextEvent instead
14:59:59 ./submodules/dlsproto/src/dlsproto/client/internal/SuspendableRequest.d(639): Deprecation: function swarm.neo.connection.RequestOnConnBase.RequestOnConnBase.EventDispatcher.sendReceive is deprecated - Use nextEvent instead
14:59:59 ./submodules/dlsproto/src/dlsproto/client/internal/SuspendableRequest.d(699): Deprecation: function swarm.neo.connection.RequestOnConnBase.RequestOnConnBase.EventDispatcher.receiveAndHandleEvents is deprecated - Use nextEvent instead

DLS neo GetRange requests never finished

The issue can be reproduced running a GetRange Request with the data below

Problem disconnecting a GetRange request

The scenario is as follows:

- The app disconnects a DLS neo GetRange request due to a long request timeout
  - Dls request failed due to a specific node
  - Expected non-zero errno after failure (connection hung up on read)
- Received request stopped notification on the app side
- App retries the request after 5 seconds
- Request finished successfully
- App processed 4 more requests successfully
- Then the app suddenly crashes due to dlsproto.client.request.internal.GetRange.GetRangeHandler.forceStopRequest()

DoubleBuffering implementation in GetRange sometimes hangs

It looks like #60 introduced a bug where, on rare occasions, not all connections are able to finish the GetRange request. This doesn't seem trivially reproducible; it requires a lot of data.

Fix documentation for dlstest.DlsClient

The documentation in the examples/tests for the method connectionNotifier() in module dlstest.DlsClient is incorrect for the tag neo-alpha-1:

        /***********************************************************************

            Connection notifier used by the client (see the outer class' ctor).

            Params:
                node_address = address/port of node which notification refers to
                e = exception instance indicating an error (null indicates
                    connection success)

        ***********************************************************************/

        private void connectionNotifier ( RawClient.Neo.ConnNotification info)
        {
            with (info.Active) switch (info.active)
            {
            case connected:
                log.trace("Neo connection established (on {}:{})",
                    info.connected.node_addr.address_bytes,
                    info.connected.node_addr.port);

                if (this.connect_task && this.connect_task.suspended())
                    this.connect_task.resume();

                break;
            case error_while_connecting:
                with (info.error_while_connecting)
                {
                    this.connection_error = true;
                    log.error("Neo connection error: {} (on {}:{})",
                            getMsg(e),
                            node_addr.address_bytes, node_addr.port);
                }
                break;
            default:
                assert(false);
            }
        }

Add sanity check limit to all protocol reads

Using https://github.com/sociomantic-tsunami/swarm/blob/v4.x.x/src/swarm/protocol/FiberSelectReader.d#L206-L226.

Missing doc in neo DLS client

The request assignment methods can throw. This is not mentioned.
The doc of the node_disconnected notification of GetRange does not mention that the request will be restarted on that node when the connection is back up.

Is the neo wrapper in the test DlsClient necessary?

The neo client already has a built-in task-blocking API, so could be used directly in test cases. (This is what dhtproto does, for example.)

Compile with `-de`

Deprecate legacy client features

Automatically resume GetRange iteration after reconnection from the breaking point

With each record batch, the node sends the index in the bucket file (either as a file offset or a count index) of the last record included. If an error happens, after re-connection the client can tell the node the last index so it can carry on from that point.
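
The resume logic can be sketched with a fake in-memory node. Everything here is hypothetical (`Batch`, `FakeNode`, `ResumeState`, and the use of a count index rather than a file offset); it only illustrates that remembering the last index lets the client continue without gaps or duplicates.

```d
import std.stdio : writeln;

/// A batch as described above: records plus the index of the last
/// record included (here a count index; names are hypothetical).
struct Batch
{
    string[] records;
    size_t last_index;
}

/// Stand-in for a node's bucket file.
struct FakeNode
{
    string[] bucket;

    /// Serves up to `max` records starting at record index `from`.
    Batch serve ( size_t from, size_t max )
    {
        assert(max > 0 && from < this.bucket.length);
        size_t to = from + max > this.bucket.length
            ? this.bucket.length : from + max;
        return Batch(this.bucket[from .. to], to - 1);
    }
}

/// What the client remembers across reconnections.
struct ResumeState
{
    size_t next; // index of the first record not yet received

    void onBatch ( Batch batch ) { this.next = batch.last_index + 1; }
}

void main ( )
{
    auto node = FakeNode(["a", "b", "c", "d", "e"]);
    ResumeState state;
    string[] received;

    // First connection: one batch arrives, then the connection drops.
    auto batch = node.serve(state.next, 2);
    received ~= batch.records;
    state.onBatch(batch);

    // After reconnecting, the client tells the node where to carry on.
    while (state.next < node.bucket.length)
    {
        batch = node.serve(state.next, 2);
        received ~= batch.records;
        state.onBatch(batch);
    }

    assert(received == ["a", "b", "c", "d", "e"]);
    writeln(received);
}
```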