jube-home / jube

Jube is open-source software designed for monitoring transactions and events. It offers a range of powerful features, including real-time data wrangling, artificial intelligence, decision making, and case management. Jube's exceptional performance is particularly evident in fraud prevention and abuse detection scenarios.

Home Page: https://www.jube.io

License: GNU Affero General Public License v3.0

C# 78.67% HTML 9.37% JavaScript 2.40% CSS 7.20% Less 0.02% Rich Text Format 1.50% Smalltalk 0.85% Dockerfile 0.01%
case-management data-mining data-visualization event-monitoring faas faas-platform fraud fraud-detection fraud-prevention machine-learning

jube's People

Contributors

dependabot[bot], richard-churchman

jube's Issues

Move Jobs to Crystal Quartz

There are several jobs in Jube, characterised as very long running procedures launched periodically. Currently these jobs run inside a perpetual loop based on a long thread sleep. This works well enough, but it does not cluster all that well and complicates the teardown of the process.

It is desirable to move jobs to Quartz.net, and indeed Crystal Quartz, to provide a user interface into job execution.

Implement Crystal Quartz for the existing jobs. Implement it in such a way that additional job binaries can be included, providing some extensibility of the software without the need for external processes.

All values available in the current Dependency Injection container should be passed to the Quartz.net context, such that the new IJob implementations are a near drop-in replacement.
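
By way of illustration, a minimal sketch of what one of the existing jobs might look like as a Quartz.net IJob, assuming constructor injection is wired through a Quartz job factory; the job and service names here are illustrative, not Jube's actual classes:

    using System.Threading.Tasks;
    using Quartz;

    public interface ICountsService
    {
        Task RunAsync(System.Threading.CancellationToken cancellationToken);
    }

    [DisallowConcurrentExecution]
    public class CountsJob : IJob
    {
        private readonly ICountsService countsService; // resolved from Dependency Injection

        public CountsJob(ICountsService countsService) => this.countsService = countsService;

        public async Task Execute(IJobExecutionContext context)
        {
            // The body of the former perpetual loop becomes a scheduled execution;
            // clustering and teardown are handled by Quartz rather than thread sleep.
            await countsService.RunAsync(context.CancellationToken);
        }
    }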

Improve Startup Time by having Assembly Hash Cache in the database too

Jube is written to support scalable cloud operations, meaning that many small instances of Jube, perhaps containers, can be created to achieve scalability. A principal requirement of the strategy, where containers are created dynamically to handle bursts, is that new instances of the software must load very quickly, in under a minute. Support for cloud operations and dynamic scalability via Kubernetes comes up extensively as a requirement (although sometimes with very lightweight VMs rather than containers, which amounts to the same thing).

The software does load quickly; however, new instances will bring back all configurations from the database, for all models, and proceed to lay that out in the instance memory, a process that is very fast (not to mention unavoidable). The issue is that each configuration that contains rule code will be compiled to an assembly and then stored locally in the hash cache (a dictionary of code hash to its assembly), so as not to duplicate the compilation of identical code.

The task is first to refactor the hash cache to be part of the compilation class, making the compilation class an instance in place of the hash cache. This compile class, which will now also include the hash cache, will also use a table in Postgres storing the byte arrays against the code hash.

On a call to compile, as now, the hash cache will first be inspected for the key value combination (hashed code vs assembly); in the absence of that, it will fall back to a table in Postgres for the same (noting that the initial Roslyn compilation output is a byte array that should be trivial to store), and only in the event of unavailability will the code go on to be compiled to an assembly. It is of course the case that newly compiled code must be made available to the hash cache both in Postgres and in the instance.

The approach will remove the need for code to be recompiled as new instances are created in the cluster, which should improve the startup time, making models available soon after instantiation.

It would also be advantageous to include more compile-time data and errors for the purpose of monitoring and production support. At the moment the compile errors only appear in logs.
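
A minimal sketch of the two-tier lookup described above, assuming a hypothetical repository abstraction over the new Postgres table; none of these names are Jube's actual classes:

    using System.Collections.Generic;
    using System.Reflection;

    public interface ICompiledAssemblyRepository
    {
        byte[] Find(string codeHash);               // Postgres table of code hash to byte array
        void Insert(string codeHash, byte[] bytes);
    }

    public class Compile
    {
        private readonly Dictionary<string, Assembly> hashCache = new();
        private readonly ICompiledAssemblyRepository repository;

        public Compile(ICompiledAssemblyRepository repository) => this.repository = repository;

        public Assembly GetOrCompile(string codeHash, string code)
        {
            // 1. Instance hash cache.
            if (hashCache.TryGetValue(codeHash, out var cached)) return cached;

            // 2. Fall back to Postgres; the Roslyn output is a byte array.
            var bytes = repository.Find(codeHash);
            if (bytes == null)
            {
                // 3. Only now compile, and persist for other instances.
                bytes = CompileWithRoslyn(code);
                repository.Insert(codeHash, bytes);
            }

            var assembly = Assembly.Load(bytes);
            hashCache[codeHash] = assembly;
            return assembly;
        }

        private static byte[] CompileWithRoslyn(string code)
            => throw new System.NotImplementedException(); // Roslyn emit elided for brevity
    }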

Model Wizard

Creating a machine learning model in Jube can be a convoluted process involving creating a model, specifying fields to be extracted, specifying tags and then loading data via the HTTP endpoint, before the model is available for training in the embedded Exhaustive machine learning algorithm. These requirements contrast with products which can achieve the same through the application of a CSV file. It follows that, despite having more advanced capabilities, adoption may be lost to other products. While Jube was not designed as an automated machine learning wizard, there appears to be increasing overlap.

It is proposed that a Model Wizard be created to take a CSV file and parse both the metadata and the data itself, automatically creating all configuration elements that are otherwise created manually. The file will be parsed for its data to identify the universe of categorical variables, with these being created as Boolean XPath expressions (a process which is currently done, typically, outside of Jube).

Task: Ensure JSON Path Expression returns a Boolean value

As categorical data pivoting will be done in Jube, JSON Path must be available in the Request XPath Model Configuration to return a Boolean based on an expression, for example $.[?(@.=='Politician')].
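
A minimal sketch of evaluating such an expression to a Boolean, assuming Newtonsoft.Json; the payload and the filter property in the example are illustrative only:

    using System.Linq;
    using Newtonsoft.Json.Linq;

    public static class JsonPathBoolean
    {
        // True when the JSON Path filter matches at least one token, which is how
        // a categorical value can be pivoted into a Boolean field.
        public static bool Evaluate(string json, string jsonPath)
            => JToken.Parse(json).SelectTokens(jsonPath).Any();
    }

    // Example, with a hypothetical Persons array:
    // Evaluate("{\"Persons\":[{\"Occupation\":\"Politician\"}]}",
    //     "$.Persons[?(@.Occupation=='Politician')]") returns true.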

Task: Create a new page to parse the CSV file

The new page, called Model Wizard and existing under the Models menu item, will accept a CSV file as an upload and proceed to parse the headers. For each header the data will be inspected (see the sketch below):

  • If all values are numeric, the header will be treated as Float for the purpose of model configuration.
  • If any string data is present, the header will be treated as String for the purpose of model configuration.

In keeping with the stateless nature of the design, the parsing results will be stored in tables in the database for recall by the user interface. At this stage, the model will not be created.
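
A minimal sketch of the per-header inspection, assuming invariant-culture numeric parsing; the rule is simply that any non-numeric value demotes the column from Float to String:

    using System.Globalization;
    using System.Linq;

    public static class CsvTypeInference
    {
        public static string InferType(string[] columnValues)
            => columnValues.All(v => double.TryParse(
                   v, NumberStyles.Float, CultureInfo.InvariantCulture, out _))
                ? "Float"   // every value parses as a number
                : "String"; // at least one string value present
    }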

Task: Allocate Dependent Variable

With the metadata having been established, the page must accept further configuration parameters, specifically the dependent variable, which will go on to be a tag value with a corresponding Exhaustive model and Activation Rule.

Task: Create Model

Based on metadata and configuration create the model in Jube comprising:

  • Headers will be transposed to Request XPath configuration elements.
  • For each String value in the categorical variables, the header will be transposed to an expression (i.e. Categorical Data Pivoting).
  • For each String value in the categorical variable specified as the Dependent Variable, a Tag element will be created and;
  • An Exhaustive configuration element will be created to target the Tag disposition for machine learning and;
  • For good measure, an Activation Rule element will be created targeting the return value from the Exhaustive model, where > 0.5 will drive activation. The Activation Rule is not strictly necessary, as the Exhaustive values are available in their raw form on recall.

Task: Load Data from CSV into JSON for storage in the Archive

Transpose the CSV file to a JSON representation and store it in the Archive table, which will make the data available for Exhaustive training.
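
A minimal sketch of the transposition, assuming Newtonsoft.Json and a simple comma split (a real implementation would use a CSV parser that handles quoting); the Archive insert itself is elided:

    using System.Collections.Generic;
    using System.IO;
    using System.Linq;
    using Newtonsoft.Json.Linq;

    public static class CsvToJson
    {
        public static IEnumerable<string> Transpose(string path)
        {
            var lines = File.ReadLines(path).ToArray();
            var headers = lines[0].Split(',');
            foreach (var line in lines.Skip(1))
            {
                var values = line.Split(','); // naive split; no quoted-comma handling
                var json = new JObject();
                for (var i = 0; i < headers.Length; i++)
                    json[headers[i]] = values[i]; // one JSON document per CSV row
                yield return json.ToString(Newtonsoft.Json.Formatting.None);
            }
        }
    }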

Task: Synchronise Model

Insert data to cause the model to synchronise and thus start Exhaustive training.

Replace Binary Serializer with Newtonsoft Json Serialiser (if possible) or find work around

In .Net 7 the BinarySerializer (BinaryFormatter) is deprecated. The BinarySerializer is not used extensively, except to save Neural Network model states.

The suggested replacement for binary serialization is Json; however, it is not clear to what extent this serializes an object absolutely, including internal properties and the total state.

This ticket is to replace all transient and persistent serialized objects with Json serialisation if possible, or otherwise find a workaround.

This refactoring is currently the main blocker on the path to .Net 7.
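
As one possible approach, a minimal sketch using Newtonsoft.Json with a contract resolver that serializes all instance fields, public and private, which approximates the total state captured by binary serialization; whether this round-trips the Neural Network models correctly is exactly what the ticket needs to establish:

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Reflection;
    using Newtonsoft.Json;
    using Newtonsoft.Json.Serialization;

    public class AllFieldsContractResolver : DefaultContractResolver
    {
        protected override IList<JsonProperty> CreateProperties(
            Type type, MemberSerialization memberSerialization)
        {
            // Serialize every instance field rather than only public properties.
            return type.GetFields(BindingFlags.Public | BindingFlags.NonPublic | BindingFlags.Instance)
                .Select(field =>
                {
                    var property = CreateProperty(field, memberSerialization);
                    property.Readable = true;
                    property.Writable = true;
                    return property;
                })
                .ToList();
        }
    }

    // Usage:
    // var settings = new JsonSerializerSettings
    // {
    //     TypeNameHandling = TypeNameHandling.All, // preserve concrete types on deserialization
    //     ContractResolver = new AllFieldsContractResolver()
    // };
    // var json = JsonConvert.SerializeObject(model, settings);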

Improve Support for Docker

Jube can be built for Docker, but there is currently no option to do this in the default build; instead, it is compiled as debug, using dotnet run. This works, but it is really only a demonstration and expects the end user to publish for their own release.

Include a Dockerfile and .yaml file as required in the software, compiling the Docker image as release and not debug.

Most users appear to require that the software is architected to support containerisation, which it is, with differing appetites for Docker but universal appetite for Kubernetes or similar cloud provider scaling options.

SQL Statements against cache for abstraction rules selecting only columns required by abstraction rule

The Abstraction Rule process performs a prepared select statement against the cache tables. At some point this will become an index-only lookup, for reasons of improved use of covered indexes.

The select statement is performed only once for each key, with logic processed against that dataset in memory. The data fields brought back need only be those required by the rule. Reducing the select will increase performance through less data coming across the wire and less casting of fields that are never used.

This ticket is to perform analysis on rule parsing to ensure that only the fields that are used are included in the select.
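
A minimal sketch of building such a select from the catalogue of fields a rule references, assuming jsonb cache storage; the table and column names are illustrative:

    using System.Collections.Generic;
    using System.Linq;

    public static class CacheSelectBuilder
    {
        public static string Build(string table, IEnumerable<string> fieldsUsedByRule)
        {
            // Project only the jsonb members the rule references, rather than
            // fetching and casting the whole document. Field names are assumed
            // to come from parsed rule code, not user input.
            var columns = string.Join(", ",
                fieldsUsedByRule.Select(field => $"\"Json\" ->> '{field}' as \"{field}\""));
            return $"select {columns} from \"{table}\" where \"Key\" = @key";
        }
    }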

Replace Kendo ListView control

In Jube's List and Dictionary functionality, each value is presented and managed by a Kendo ListView control. This control used to be used extensively; however, the implementation was always problematic and the user experience poor. The control has already been replaced on the Models page, which has a much more elegant, customised way to maintain lists.

Review the user interface and replace all instances of the Kendo ListView control with either the Kendo Grid (most likely) or the tooling used on the Models page (which would be more elegant but require more work).

Text box to capture notification body text too small

The text box to capture notification body text throughout the system is far too small:

(screenshot: notification body text box)

Style this properly to accept rich text or HTML-style notifications.

Some thought should be given to the dispatch of the body, to ensure that it is HTML.

Create a catalog of fields in use in rules to suggest covered indexes and reduce fetch from database.

Oftentimes the amount of data used in rules is a small subset of the data that is fetched. It follows that forcing the database to go out to Page and Tuple is quite expensive, especially when the index is likely in the buffer cache.

It is not currently possible to see the fields in use by rules that depend on cache data.

Create a function in model sync that will examine the Request XPath elements in existence in rules and create a catalogue of fields in use.

With the catalogue, select back from the database only the data that is required by the rules. Limiting the select statement has a big impact on query performance and jsonb field parsing.

Optionally, build - or at a minimum suggest - indexes which cover the fields to avoid the need to go to Page and Tuple.

Upgrade LINQ2DB to latest version

An observation from other projects worked on recently: the patterns and version of LINQ2DB in use are slightly old and would benefit from being upgraded to the very latest version. As part of this, explore the manner in which the LINQ2DB context is being instantiated.

This task is part of a wider effort to support .Net 7 and bring all Nuget packages current.

Move PostgreSQL Cache Updated and Inserts to a Single Transaction. Move Case Creation and TTL Counter Entry to Bulk Insert.

During online transaction processing, or rather the invoke process, there are several insert \ update interactions with the PostgreSQL database which happen inline. These interactions are expensive but, given that PostgreSQL is being used as a cache, unavoidable.

All inserts and updates should be done inside the same transaction to avoid excessive commits. It follows that it is necessary to batch up the inserts \ updates where it is not possible to bulk insert (as in the case of Time To Live (TTL) Counter Entries and Case Creation) and execute them in a single transaction.

For inserts which are not time sensitive, such as TTL Counter Entries (used to wind back TTL counters) and Case creation, move these to bulk and \ or background processes.
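
A minimal sketch of the single-transaction batching, assuming Npgsql; the table and column names are illustrative:

    using System.Collections.Generic;
    using System.Threading.Tasks;
    using Npgsql;

    public static class CacheWriter
    {
        public static async Task WriteBatchAsync(NpgsqlConnection connection,
            IReadOnlyList<(string Key, string Json)> entries)
        {
            // One transaction and one commit for all inline inserts \ updates,
            // rather than a commit per statement.
            await using var transaction = await connection.BeginTransactionAsync();
            foreach (var (key, json) in entries)
            {
                await using var command = new NpgsqlCommand(
                    "insert into \"Cache\" (\"Key\", \"Json\") values (@key, @json::jsonb)",
                    connection, transaction);
                command.Parameters.AddWithValue("key", key);
                command.Parameters.AddWithValue("json", json);
                await command.ExecuteNonQueryAsync();
            }
            await transaction.CommitAsync();
        }
    }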

Can't turn an Exhaustive model on after training.

If an Exhaustive model is not set to Active at a point before training, it cannot be activated after training. The Update button is hidden by design; however, it would be better if only certain fields were locked for update:

(screenshot: Exhaustive model page with the Update button hidden)

Update the page to lock certain fields rather than remove the Update button altogether.

Implement Redis Cache

PostgreSQL is currently used for caching transaction data, and while it cannot be considered an in-memory database, the shared buffers mean that read performance, although slower than an in-memory cache, is not slower to an extent that materially affects response time when traded off against the durability guarantees provided by PostgreSQL. Read performance aside, in-memory databases are extremely expensive to run, a nightmare to administer and demand a degree of duplication - in Jube at least - given its key value pair access paths (while PostgreSQL queries are indexed on multiple keys, these keys would instead be duplicated, with transaction history being stored in the value HSET).

There is no contest on writes, however, and Jube response times are severely impacted. For example, reading Abstraction keys overall might take 3ms to 6ms, while writing might take 17ms, which is hard to defend in a real-time system. Currently writes to the cache are made in the transaction flow, which is important, as serialisation across requests is required. Ideally all writes would be moved out to background threads performing bulk inserts, but this would not provide the serialisation guarantees from transaction to transaction (consider a velocity rule made up of counts on the PostgreSQL cache). Turning asynchronous commit on provides some relief, but without moving to UNLOGGED tables (which attract their own problems) it still does not come close to the desired write performance.

Redis will be implemented as an in memory database as follows.

In respect to the Abstraction cache:

  • In the event that the Redis cache is enabled, no comparable inserts will be made to the PostgreSQL cache tables. Instead the value will be keyed in Redis on the basis of the search key and value (e.g. IP:123.456.789), with an HSET of a MessagePack serialisation of the payload dictionary for each transaction (see the sketch after this list). In this respect the key serves to index, with the transactions covered in the HSET values.
  • On each transaction, using async methods, a request will be made to Redis on Abstraction key (e.g. IP:123.456.789).
  • A Time To Live (TTL) definition will be created to accompany the specification of the search key. Given the TTL definition, the expiry of the key will be extended on each transaction (otherwise the key will be allowed to expire), removing all HSET values. There is no member expiry supported in Redis at this time, which means that data will not expire until there are no further transactions on that key. It follows that in the real-time flow there should be some online pruning of the values and \ or;
  • A background job that serves to prune the expired HSET values also.
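
A minimal sketch of the Abstraction cache write path, assuming StackExchange.Redis and MessagePack; the key format follows the IP:123.456.789 example above and the member naming is illustrative:

    using System;
    using System.Collections.Generic;
    using System.Threading.Tasks;
    using MessagePack;
    using StackExchange.Redis;

    public static class AbstractionCache
    {
        public static async Task WriteAsync(IDatabase redis, string searchKey,
            string transactionId, Dictionary<string, object> payload, TimeSpan timeToLive)
        {
            // One HSET member per transaction under the search key, the payload
            // dictionary serialised with MessagePack.
            var value = MessagePackSerializer.Serialize(payload,
                MessagePack.Resolvers.ContractlessStandardResolver.Options);
            await redis.HashSetAsync(searchKey, transactionId, value);

            // Extend the key expiry on each transaction; Redis has no per-member
            // expiry, hence the separate pruning of stale HSET values.
            await redis.KeyExpireAsync(searchKey, timeToLive);
        }
    }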

In respect to TTL Counter:

  • The transactional incrementing of TTL Counters will be done in a Redis write.
  • The writing of the TTL Counter entry, which is used to decrement TTL Counters in a background process, will be written to an in-memory asynchronous queue for bulk insert, which will be done in a separate thread across all models and all tenants (as above). It follows that PostgreSQL will continue in use for winding back TTL Counters.
  • The background process will update the Redis cache instead of the same table in PostgreSQL (this is to say not duplicated). The durability guarantees provided by Redis cache of the AOF log \ rewriting will ensure that the Redis cache is unlikely to need to be reconstituted, and the risk of 1 second of incremental counter loss can be conveyed as a risk.

In respect to cached Abstraction values:

  • The background process responsible for calculating counters will also write the values to Redis based on the Abstraction key, with the aggregations being stored in the HSET.
  • The transactional process will instead read from the Redis cache rather than the equivalent in PostgreSQL.
  • There are no proposals to deprecate the writing of aggregations to PostgreSQL as this is useful for tracing the calculations, which is a complex process and benefits from the verbose trace.
  • Same durability considerations in respect to guarantees provided by Redis cache of the AOF log \ rewriting.

The functionality will be optional and in the absence of a Redis cache being available, existing functionality will prevail.

Connection strings to Redis should be contained at the application level and fully support multiple endpoints such that FAILOVER can be invoked to resize Redis instances.

Instruct Stop Training of a Neural Network in Exhaustive Adaptation

There is currently no means to instruct an Exhaustive Adaptation training process to stop. Include a button on the Exhaustive Adaptation training page that will set a flag in the instance to Stop, which will be checked before each new topology exploration. At this stage, it is not proposed to send termination instructions to the thread, as in production this will more than likely be instantiated in a dedicated thread, for a dedicated training instance.
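
A minimal sketch of the cooperative stop flag, checked between topology explorations; the names are illustrative:

    using System;
    using System.Collections.Generic;

    public class ExhaustiveTraining
    {
        private volatile bool stop;

        // Set from the user interface button.
        public void RequestStop() => stop = true;

        // Each Action stands in for one topology exploration.
        public void Train(IEnumerable<Action> topologyExplorations)
        {
            foreach (var explore in topologyExplorations)
            {
                // Checked before each new topology rather than terminating the thread.
                if (stop) break;
                explore();
            }
        }
    }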

Build Error After Merge Duplicate Attributes in Accord.net Projects

The most recent merge to master has introduced a problem relating to duplicate attributes in the build step. It is not clear which resources have been moved or ignored such that this can happen, given that it built and ran in the development branch but not in master, despite no merge conflicts.

5>Accord.Genetic.AssemblyInfo.cs(14,12): Error CS0579 : Duplicate 'System.Reflection.AssemblyConfigurationAttribute' attribute
5>Accord.Genetic.AssemblyInfo.cs(18,12): Error CS0579 : Duplicate 'System.Reflection.AssemblyTitleAttribute' attribute
5>------- Finished building project: Accord.Genetic. Succeeded: False. Errors: 2. Warnings: 0

7>Accord.MachineLearning.AssemblyInfo.cs(14,12): Error CS0579 : Duplicate 'System.Reflection.AssemblyConfigurationAttribute' attribute
7>Accord.MachineLearning.AssemblyInfo.cs(18,12): Error CS0579 : Duplicate 'System.Reflection.AssemblyTitleAttribute' attribute
7>------- Finished building project: Accord.MachineLearning. Succeeded: False. Errors: 2. Warnings: 0

This did not happen in the branch and appears to be owing to the generation of assembly information for the project. Adding the property:

<GenerateAssemblyInfo>false</GenerateAssemblyInfo>

to the project files should resolve the issue.

Make clear Default is Demo Training Dataset on Exhaustive Page

So that the Exhaustive training functionality can be demonstrated in the default installation, Exhaustive training targets a demonstration dataset. The absence of a clear message may lead users to wonder why the platform is not training on data tagged or laid out in the database.

Create a clear splash note on the page making it clear that training targets demonstration data, make the dataset available for download or inspection, and mention the environment variable that needs to be changed for production data to be used.

Export Models

Hi, thank you for such smart software! We just recently started experimenting with it and a question arose: is it possible somehow to export/import a model with all its Request XPaths and rules?

Remove Accord.net Obsolete Dependencies

Following the migration of the Accord.net open-source code to .net 6 and as part of the Jube solution, several obsolete warnings have emerged:

Serializer.cs(118, 21): [SYSLIB0011] 'BinaryFormatter.Serialize(Stream, object)' is obsolete: 'BinaryFormatter serialization is obsolete and should not be used. See https://aka.ms/binaryformatter for more information.'

Serializer.cs(122, 17): [SYSLIB0011] 'BinaryFormatter.Serialize(Stream, object)' is obsolete: 'BinaryFormatter serialization is obsolete and should not be used. See https://aka.ms/binaryformatter for more information.'

Serializer.cs(369, 35): [SYSLIB0011] 'BinaryFormatter.Deserialize(Stream)' is obsolete: 'BinaryFormatter serialization is obsolete and should not be used. See https://aka.ms/binaryformatter for more information.'

Serializer.cs(373, 31): [SYSLIB0011] 'BinaryFormatter.Deserialize(Stream)' is obsolete: 'BinaryFormatter serialization is obsolete and should not be used. See https://aka.ms/binaryformatter for more information.'

ExtensionMethods.cs(672, 29): [SYSLIB0014] 'WebClient.WebClient()' is obsolete: 'WebRequest, HttpWebRequest, ServicePoint, and WebClient are obsolete. Use HttpClient instead.'

In the case of serialisation, this should not be handled by Accord.net at all now. In the case of the web client, there should not be any use at all. All can be removed.

Prune Accord.Net Libraries

It is hard to imagine finding time to work on this ticket, however, creating anyway.

Recently a project was concluded to upgrade from .Net 6 to .Net 8. This was a big project, as some of the machine learning libraries in use were archived. The archived code contained uses of the BinarySerializer that were unsafe and would not build after .Net 6. The Accord.Net libraries, being written in C#, were brought into the solution and built through to .Net 8, removing all of the references to the BinarySerializer and any other obsolete code. Serialisation of Neural Networks was further complicated by there being no drop-in replacement for the BinarySerializer, and modifications needed to be made to make it work with Newtonsoft Json serialization.

The Accord.Net libraries are massive, and Jube's use of them is highly partial. At some point this library code needs to be pruned to remove any methods that are not in use, to reduce the ongoing maintenance cost. Mostly, obsolete methods are being removed and their use swapped with not-supported exceptions, and this does not appear to have caused any breaking changes in Jube.

This ticket is to examine the use of Accord.Net in Jube and remove any code that is not used, then set about refactoring the code that is in use. The code is not all that bad, as it builds and works under .Net 8, hence this ticket is not an immediate priority.

Change Cache Indexing to use a Hash Index and not a B-Tree Index

Currently, the indexing on the cache table uses B-Tree indexes (not covered indexes, though covering may come in a separate project).

The use of a B-Tree index is redundant as queries to the cache table are on equality only. It follows that greater performance could be obtained without any penalty other than adding the index type in the creation process.

This change won't be breaking, but it would mean that indexes would be duplicated for existing users on upgrade, pending a manual drop.
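
A minimal sketch of the index creation, issued here through Npgsql; the index, table and column names are illustrative:

    using System.Threading.Tasks;
    using Npgsql;

    public static class CacheIndexing
    {
        public static async Task CreateHashIndexAsync(NpgsqlConnection connection)
        {
            // "using hash" replaces the default B-Tree, valid because lookups are
            // on equality only; "concurrently" avoids locking out writes and must
            // run outside an explicit transaction.
            await using var command = new NpgsqlCommand(
                "create index concurrently if not exists \"IX_Cache_Key_Hash\" " +
                "on \"Cache\" using hash (\"Key\")", connection);
            await command.ExecuteNonQueryAsync();
        }
    }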

Not returning all values uploaded into a list

Hello,

I uploaded the enclosed file with 659 records.

Unfortunately I can only see the first 18 records in the grid and have no way to see the other records (see screenshot enclosed).

When I query the DB I can see the records are there (see enclosed)

select "EntityAnalysisModelListId", count(*), "Deleted"  from "EntityAnalysisModelListValue"  
where "Deleted" is null
group by  "EntityAnalysisModelListId",  "Deleted"

Can you consider adding:

  • A button to allow a user to progress to see more records / edit them
  • A counter stating how many records there are in total

Thank you for your consideration.
Attachments: malicious user agents.txt; screenshots: list display not showing all records; list db query showing records.

Partition Strategy in Cache and Archive Tables

Data that is used for real-time processing is already logically separated from slower-moving data in the Archive, a partition of sorts. However, Jube greatly underutilises the partitioning capabilities of Postgres.

The task is to modify the Cache and Archive tables to have a partition hierarchy of Tenant Registry ID \ Default >>> Model \ Default >>> CreatedDate \ Default.

Most ideally, some functionality will exist in the system to create and prune partitions automatically. For example, the Cache may have one-day partitions up to a maximum of 7 days. The Archive may have monthly partitions up to a maximum of a year. In a similar manner to the index server, this partition management should exist in a thread inside a Jube instance.
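
A minimal sketch of the hierarchy in PostgreSQL DDL, issued here through Npgsql; all names, and the single tenant \ model \ day shown, are illustrative:

    using System.Threading.Tasks;
    using Npgsql;

    public static class CachePartitioning
    {
        public static async Task CreateAsync(NpgsqlConnection connection)
        {
            const string ddl = @"
                create table ""Cache"" (
                    ""TenantRegistryId"" int not null,
                    ""EntityAnalysisModelId"" int not null,
                    ""CreatedDate"" timestamp not null,
                    ""Json"" jsonb
                ) partition by list (""TenantRegistryId"");

                -- Default partition for the tenant level of the hierarchy.
                create table ""Cache_TenantDefault"" partition of ""Cache"" default;

                -- One tenant partition, itself partitioned by model, then by day.
                create table ""Cache_Tenant1"" partition of ""Cache""
                    for values in (1) partition by list (""EntityAnalysisModelId"");
                create table ""Cache_Tenant1_Model1"" partition of ""Cache_Tenant1""
                    for values in (1) partition by range (""CreatedDate"");
                create table ""Cache_Tenant1_Model1_20240101"" partition of ""Cache_Tenant1_Model1""
                    for values from ('2024-01-01') to ('2024-01-02');";

            await using var command = new NpgsqlCommand(ddl, connection);
            await command.ExecuteNonQueryAsync();
        }
    }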

Upgrade to .Net 8

The software is currently written in .Net 6. It should be trivial to upgrade first to .Net 7, on the way to .Net 8, as part of a general Nuget package upgrade.

There are known issues relating to the use of the BinarySerializer for saving Neural Network models in the database, as the BinarySerializer is deprecated in .Net 7. A separate research and implementation ticket is open to replace the use of the BinarySerializer.

Password Requirements too strict

Hello,

I recently spun up jube.io and on the first login screen it asked me to change my password.

This seems to take a password policy (regex) from a social media platform, i.e. 16 characters, upper and lower case, and limited special characters.

I would appreciate it if the passwords could be made more complex, i.e. the length unlimited or increased to 100, and all special characters usable.

Thank you for your consideration.

Convert Invoke Method to support async fully

The Redis project calls for reads to be performed against the Redis cache in parallel to the PostgreSQL cache. More generally, the async methods provide a much better experience where the thread would otherwise be waiting on IO. The async methods have been proven to perform better by quite a margin than blocking methods and the totally linear processing of the invoke method, putting aside the ForkAstractionKeys setting, which is badly implemented.

Upgrade all PostgreSQL cache calls to fully support async methods.

Deprecate ForkAstractionKeys and make this the only processing method, using task completion joining on wait complete.
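
A minimal sketch of the task-joining pattern, replacing ForkAstractionKeys style thread forking with async fan-out; the fetch delegate stands in for the PostgreSQL \ Redis cache reads:

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Threading.Tasks;

    public static class AbstractionReads
    {
        public static async Task<Dictionary<string, List<string>>> FetchAllAsync(
            IEnumerable<string> abstractionKeys,
            Func<string, Task<List<string>>> fetchForKey)
        {
            // Start every read without blocking, then join on completion.
            var tasks = abstractionKeys
                .Select(async key => (Key: key, Rows: await fetchForKey(key)))
                .ToList();
            var results = await Task.WhenAll(tasks);
            return results.ToDictionary(result => result.Key, result => result.Rows);
        }
    }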

Instruct index creation out of model synchronisation to a background queue. Build indexes concurrently. Drop Duplicate Indexes.

At the moment, indexes on search keys are created inside model synchronisation, which is blocking. Also - and it is a bug - the indexes are built without using the concurrent option, which brings about locking.

Instead of building the index in the model synchronisation routine, send that instruction to a background table queue if the definition of that index does not already exist. A separate thread will poll this table queue and begin the process of concurrent index creation.

Support covered fields in the indexes based upon the fields in use in rules.

Include trace in processed payload when switch is passed

For response time troubleshooting, the method is to enable INFO level logging in the application, which writes out verbose logging for the transaction. This has some production implications.

Create a switch in the HTTP or AMQP headers that will produce this same trace in the processed payload. This will facilitate more rapid tracking of response time problems for a given transaction.
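
A minimal sketch of reading the switch in ASP.NET Core; the X-Jube-Trace header name is hypothetical:

    using System;
    using Microsoft.AspNetCore.Http;

    public static class TraceSwitch
    {
        public static bool IsEnabled(HttpRequest request)
            // When the hypothetical X-Jube-Trace header is "true", the verbose
            // trace is included in the processed payload for that transaction.
            => request.Headers.TryGetValue("X-Jube-Trace", out var value)
               && string.Equals(value.ToString(), "true", StringComparison.OrdinalIgnoreCase);
    }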
