cert-polska / ursadb Goto Github PK

Trigram database written in C++, suited for malware indexing

License: BSD 3-Clause "New" or "Revised" License

C++ 84.78% CMake 2.18% Dockerfile 0.35% Shell 0.13% Python 12.49% C 0.08%

yara malware database security-tools security-automation

ursadb's Introduction

UrsaDB

A 3gram search engine for querying terabytes of data in milliseconds. Optimized for working with binary files (for example, malware dumps).

Created in CERT.PL. Originally by Jarosław Jedynak (tailcall.net), extended and improved by Michał Leszczyński.

This repository is only for UrsaDB project (ngram database). See CERT-Polska/mquery for more user friendly UI.

Installation

See installation instructions

Quickstart

Create new database:

mkdir /opt/ursadb
ursadb_new /opt/ursadb/db.ursa

Run UrsaDB server:

ursadb /opt/ursadb/db.ursa

Connect with UrsaCLI:

$ ursacli
[2020-04-13 18:16:36.511] [info] Connected to UrsaDB v1.3.0 (connection id: 006B8B4571)
ursadb>

Index some files:

ursadb> index "/opt/samples" with [gram3, text4, wide8, hash4];

Now you can perform queries. For example, match all files with three null bytes:

ursadb> select {00 00 00};

Read the syntax documentation to learn more about available commands.

Learn more

More documentation can be found in the docs directory.

You can also read the hosted version here: cert-polska.github.io/ursadb.

Contact

If you have any problems, bugs or feature requests related to UrsaDB, you're encouraged to create a GitHub issue.

Funding acknowledgement

ursadb's People

Contributors

Stargazers

Watchers

ursadb's Issues

Bump the Catch version

Ursadb tests don't build with a new glibc version:

In file included from /nix/store/4pqv2mwdn88h7xvsm7a5zplrd8sxzvw0-glibc-2.35-163-dev/include/signal.h:328,
                 from /build/source/extern/./catch/Catch.h:4811,
                 from /build/source/src/Tests.cpp:8:
/build/source/extern/./catch/Catch.h:7441:45: error: size of array 'altStackMem' is not an integral constant-expression
 7441 |     char FatalConditionHandler::altStackMem[SIGSTKSZ] = {};

Another, similar issue: doctest/doctest#473

We should bump the Catch version and this will hopefully resolve this problem.

How to reproduce:

Certainly building ursadb_test in a modern nix derivation causes the build to fail. But any build with glibc-2.35 (or maybe even a bit older) should cause the build to fail.

Add e2e correctness tests

Create a set of correctness tests, using the current e2etest framework.

Use a real-world Yara rule corpus (maybe the same that mquery uses - https://github.com/CERT-Polska/mquery/tree/master/src/tests/yararules/testdata).
Store a predefined set of test malware files (for example, 1000) somewhere (left to discussion - probably hosted zip file). There should be at least one matching file for every yara rule (if possible).

Index all the downloaded files, and for every yara rule and set of expected results:

run ursadb query and ensure there are no false positives
run yara directly (using https://github.com/VirusTotal/yara-python) on all the files and ensure that there are no false negatives

Add `debug parse` command

We need a way to introspect the running database, or debug some issues without either recompiling the db with debug prints or attaching with gdb.

Especially now that I'll be working on query parsing & optimisation, I have a few ideas in mind. Right now I'd like to have something easy - debug parse command:

debug parse index "hmm" with [gram3];

Will parse the command index "hmm" with [gram3] in this case, pretty print the parsed tree and return it to the user. Expected result is for example:

index
    "hmm"
    index_type_list
        gram3

(or whatever the real query tree looks like).

This can be returned as a structured JSON or as plain text - it's intended for debugging so we don't make any guarantees about the output.

Don't regenerate `namecache` files every time

Right now we ignore the previous namecache files and create new ones every time the db is restarted. This slows down startup significantly. We should save name of the cache file to dataset file and load it when it's already generatedl.

Reminder: remove deprecation suppression once cppzmq is updated

Relevant PR: #61

Keep better track of memory

We should at least pretend that we keep track of memory usage.

Ursadb will happily consume all the RAM it can get. Granted, it tries to be lightweight, but that's sometimes hard when just the filelist is bigger than current RAM.

DB should have an idea how much ram it's allowed to use
Every command should try to estimate how much RAM it'll use, and "lock" this much RAM in the coordinator
Commands that can't execute due to RAM shortage should block (I think?)

Size limitations

Has any testing currently been done as to the limitations of the db? Mainly regarding:

Maximum number of files
Maximum file size

Outdated Dockerbuild

The UrsaDB docker build on Dockerhub is last updated more than 2 years ago. Many of the features that have been added are therefore not widely available via docker and Dockerhub. Could some effort be made into updating the specific image.

I should mention that there exists a more updated version of UrsaDB on Dockerhub. However I'm unsure how much that project is officially facilitated by CERT-PL. Additionatly all online documentation regarding installation still references to cert-pl/ursadb as the official image.

Create `ursacli`, a cli tool for ursadb written in C++

Why do we need this?

Because having the only client written in python is problematic in a dockerised environment - for example, to debug a trivial issue on docker we need to:

download python
install pip
install python dependencies
install git
clone ursadb-cli repo
run the tool

Having a stanard tool will make our life way easier

Query timeout

Kill select queries after certain time threshold is reached.

Automate releases

Right now releases are a tedious and manual process. We should improve it:

At least it's documented: https://github.com/CERT-Polska/ursadb/blob/master/docs/new_release.md
Automate docker image build with a proper version (image is built, but without a proper tag like mqueryci/ursadb:v1.5.1)
Automate drafting the release with proper tar.gz artifacts.

Update contact information in readme

Jarosław Jedynak ([email protected])
Michał Leszczyński ([email protected])

are both outdated

Command Failed: index

Hi,

when we start indexing a location following error occurs:
ERR filesystem error: cannot get file size: No such file or directory [/home/ubuntu/DB/home/ubuntu/DB/gram3.set.63df787f.db.ursa]

I dont understand why DB root folder path is getting prepended?

In this case DB root path is /home/ubuntu/DB

Please help...

Reindexing caused a zmq protocol violation

After recent changes (I assume) we've added or removed a zmq frame somewhere, and reindexing operation ended up with

ursadb_1      | Runtime error: Expected zero-size frame

Probably during taking down lock in the coordinator, I didn't check

In result:

the whole operation failed (I assume)
the worker is now deadlocked, and it'll wait for the confirmation forever

We should:

fix the root cause of the bug
handle zmq errors more gracefully

Edit I think the core cause of the bug might've been reindexing a [gram3] dataset with [gram3, text4] indexes only. Probably didn't expect to drop files?

Computer resources consuming

Hi!

I am using ursadb for searching malware samples with yara rules, and my goal is make ursadb work as faster as possible. However I'm facing with 12-15 seconds queries when make select in gigabyte sized datasets (there are approximately 40-50 datasets). Also I noticed that ursa consumes only 10-15% cpu of a single processor thread, 4-5 MB/sec of disk (measured using iotop).

So my question is: is it possible to boost ursadb searching and make to use ursadb max of computer resources?

We have optane ssd, I assume the searching speed should be higher.

I have try to decrease datasets size
I set config parameters (database_workers) to different values

It didn't help

Improve the command line interface - parameters

Suboptimal error messages and help:

❯ ./ursadb a
[2022-11-24 14:33:43.912] [info] UrsaDB v1.5.0+(unknown commit)
terminate called after throwing an instance of 'std::runtime_error'
  what():  Failed to open database file
fish: Job 1, './ursadb a' terminated by signal SIGABRT (Abort)

❯ ./ursadb --help
[2022-11-24 14:32:58.358] [info] UrsaDB v1.5.0+(unknown commit)
terminate called after throwing an instance of 'std::runtime_error'
  what():  Failed to open database file
fish: Job 1, './ursadb --help' terminated by signal SIGABRT (Abort)

[META] Ursadb performance improvements

The problem

I'm trying to create a public instance of mquery again. After setting up a mid-size instance (a few TBs, on HDD) I've noticed that some ursadb queries run much slower than I would expect. I suspect that I've introduced a few performance regression over the last 1.5 years (I didn't have sufficiently large dataset to test. I would like to find and fix all performance regressions, and hopefully make the performance better than ever before.

The solution

This is a metaissue to track ideas and work being done. I will create separate issues later.

Early tests suggest that the biggest problem (on HDD) is slow disk read and seek times. We should limit the number of read operations in this case. This isn't as big problem on SSD, but disk IO is still at least 50% of query time - improving it will be nice.

Issues related to scientific testing and a benchmark suite

Create a benchmarking utility to test performance improvements/regressions in a scientific way (this is possible thanks to counters introduced in ursadb 1.4) (I already have a tool for this, but I need to clean up the code and publish it somewhere)
Create a benchmarking suite to test various ursadb versions in a scientific way. (I plan to use yara rules from signature-base)
Check how much disk speed influences ursadb speed. Evaluate potential disk speed/precision tradeoffs. Especially evaluate differences between performance on HDD and SSD (ideally including the cloud environments, for example gp2)
Write down and publish the benchmarking results somewhere. We can start with a simple static page. Final findings may be saved in the documentation or some separate pdf file.

Issues related to things that can be fixed (all changes here should be benchmarked before merging to master). I also didn't think this all through yet - some of the ideas probably don't make sense.

Idea: cache primitive (string) queries. It's easy to write a yara rule, where a string is evaluated twice or more. We should cache string query results and reuse them (OnDiskDataset::query)
Idea: rethink querygraphs (QueryGraph.cpp). They are a speed-precision tradeoff, favouring precision. This may not be the best idea - benchmark them on some real data.
Idea: alternative to querygraph is a simple linear query. This mostly is how ursadb works now (because mquery is not aware of some more complex ursadb features). If we go that way, ensure that we don't do read requests unnecessarily for overlapping ranges - for example select "abcde" on indexes gram3/text3 will be split into ("abc" & "bcd" & "cde") & ("abcd" & "bcde") and this is 100% pointless and wasting CPU and disk time. This is a very obvious optimisation, but doesn't play with querygraphs in their current form nicely. #191
Idea: cache raw index n-gram queries. This sounds simple, but doing this naively will eat up ungodly amounts of RAM. For example, we should only cache n-gram queries that will actually be reused. Maybe we can also optimise common subexpressions? For example "abcdefghijklm" | "abcdefghijkln" is, in theory, equivalent to "abcdefghijkl" & ("klm" | "kln") Can we do this elegantly?
Idea: rethink hash4 index. How much does it really help on big datasets? Especially compare this to the index size (it's as big as gram3 index). Write down the results in the documentation.
Idea: rethink index on disk storage format. Smaller storage = cheaper disks, better cache usage, and better bang for buck with SSDs. I don't think I can invent a better storage format myself, but maybe someone smart already published some research on that topic. This is a very speculative and low priority improvement idea.

Caveats

That's all I have for now. Most of the issues here circle around how to optimise the number of reads and disk IO. There may be other areas of improvment (for example, how to cut down the number of "ors" or "ands"), but I didn't think about that yet. I also didn't think about optimising the "constant", aka making the individual and/or//minof operations faster. I actually think they're implemented in a quite performant way, and there's not much we can do to make them much faster.

Implement query graph pruning

Ignore query graph paths that are "too branched".

Intuitively, When we have something like {AA BB CC DD ?? ?? ?? ?? ?? ?? EE FF GG HH} we should just split this into {AA BB CC DD} & {EE FF GG HH}.

Implementation-wise, compute branching level (using ngram density estimation) for every node, and collapse nodes where it's too big.

This is a blocking issue for releasing query graphs, because without it some queries will not be stable enough.

Add a new command: `index with taints`

Right now, if I want to have a new dataset tagged as xyz, I need to run two separate commands:

index "/xxx" with [gram3]
dataset [id of the newly created dataset] taint "xyz";

This is pretty complex, especially since indexing can take a long time and dataset ID is not known upfront. Instead, I'd like to have a combined command (syntax subject to change):

index "/xxx" with [gram3] with taints ["xyz"]

Add custom exceptions for better error handling

eg: remove catching runtime errors in Daemon.cpp

Investigate random test failure

https://github.com/CERT-Polska/ursadb/runs/8068754933

This might have happened because the assert was overly strict, but there is a small chance that this is a legitimate bug - worth investigating. We should fix this, because flaky tests are annoying.

The PR that caused this run didn't change the code, so it's not a problem with the code.

Update the documentation

Similarly to mquery, update the documentation, contributing guidelines, installation instruction, etc

Panic when a query is attempted on dataset removed from disk

I just accidentaly removed a dataset from disk (from my test instance, it's OK). After that I've tried to query, and it hanged indefinitely instead of crashing. Investigate if that was just UB weirdness or weird behaviour in the code. Ideally also change the code so that we panic instead of hanging.

Implement alternatives in the ursadb query syntax

For example, I'd like to do a following query:

select {(41 | 61) (42 | 62) (43 | 63) (44 | 64)}

With graph-based queries (introduced in #56) this shouldn't pose a problem to the optimiser and we should support it.

In this case, the above query is equivalent to yara's

"abcd" nocase

We may consider adding case-insenstive literal to ursadb grammar (`i"abcd")? but that's much lower priority (mquery won't use it anyway).

Ideally, the implementation should be generic enough to support, for example:

select {(41 | 61 6? 63) (42 | 62 (64 | 65) 66) }

But this may be solved in the follow-up PR (this is much harder than just simple list of options)

[RFC] Reorganize directory structure

Currently, there's a lot of files in the root directory of the project which seems a bit messy.
My idea would be to split code into 3 directories:

libursa - after library target in CMake
src - containing source files of executables (with one directory per executable (?))
third-party - with 3rd party libraries

(EDIT) Or maybe move libursa into src? Either way, i'm looking for suggestions 🤔

@msm-code @icedevml

Better support for wildcards

Right now, we basically ignore all wildcards. So for example:

{AA BB ?? DD EE}

We will translate this to:

({AA BB 01} |{ AA BB 02} | {AA BB 03}) & ({BB 01 DD} | {BB 02 DD} | {BB 03 DD}) & ...

Instead of:

({AA BB 01} & {BB 01 DD} & {02 DD EE}) | ({ AA BB 02} & {BB ?? DD} & {?? DD EE}) | ...

These may look similar, but they're not and the precision difference is be huge.

`min of` matches all files indexed in ursadb (1.4.1 regression)

After refactor performed in commit 86a8905 - min of query returns all indexed files when count is equal to the amount of non-trivial strings (all of them Yara condition).

Reduction to AND doesn't work correctly in QueryResult::do_min_of_real, because it calls do_and operation directly on the SortedSet object (SortedSet::do_and instead of QueryResult::do_and) related with QueryResult::everything()
https://github.com/CERT-Polska/ursadb/blob/master/libursa/QueryResult.cpp#L53-L60

In that case, SortedSet::do_and doesn't affect the "everything" state of QueryResult, so it still holds "everything" regardless of the do_and argument.

The result is that all of them Yara condition actually gets all of the files in queried datasets.

What is UrsaDB

Could you maybe add a high-level overview of what exactly a trigram/n-gram database is, what is stored, and how UrsaDB works?

While the readme tells you how to get started, it doesn't actually explain what UrsaDB is :)

Probable memory corruption somewhere in task

No other symptoms, but this screen (happened once):

connection ID and task id doesn't look right (uninitialised memory maybe?)

Doesn't build

FYI...per the instructions here's my results:

-- The C compiler identification is unknown
-- The CXX compiler identification is unknown
CMake Error at CMakeLists.txt:2 (project):
  The CMAKE_C_COMPILER:

    gcc-7

  is not a full path and was not found in the PATH.

  Tell CMake where to find the compiler by setting either the environment
  variable "CC" or the CMake cache entry CMAKE_C_COMPILER to the full path to
  the compiler, or to the compiler name if it is in the PATH.


CMake Error at CMakeLists.txt:2 (project):
  The CMAKE_CXX_COMPILER:

    g++-7

  is not a full path and was not found in the PATH.

  Tell CMake where to find the compiler by setting either the environment
  variable "CXX" or the CMake cache entry CMAKE_CXX_COMPILER to the full path
  to the compiler, or to the compiler name if it is in the PATH.


-- Configuring incomplete, errors occurred!

cmake version is 3.7.2. Thank you.

allocate memory not released after indexing

After indexing a new path, allocated memory for ursadb process is never released.
See below for an example:

Tested branch: fix-4
Indexing a new path with (from ursadb-client): index "/mnt/samples/" with [gram3];

-- Before indexing --
ursadb 3.1MiB

-- After index --
ursadb 2.5GiB
it stays in this state unless restarted

-- After restarting ursadb process --
ursadb 4.3MiB

Add a performance counter for unique ngram reads

Looks like we don't track the most important metric - how many unique ngrams did we read. Benchmarking done in #197 has shown, that reading a ngram for a second time is almost as fast as reading it from RAM, but the first read is very slow (especially on HDD).

We should add a performance counter for that and return that in a response.

Fix CI tests

Ref: #166

[META] Release v1.5: Watch Your Language

Mquery is nearing a new release, and ursadb scored quite a few features in the meantime too.

This release has a lot of improvements in the internals. The focus is on improving support for wildcards and more advanced queries. Support for tags was also added.

Already done:

Still missing:

Invalid comparison in Database::commit_task

if statement on line 231 causes following warning:

[...] Database.cpp:231:27: warning: result of comparison of constant 18446744073709551615 with expression of type 'uint32_t' (aka 'unsigned int') is always false [-Wtautological-constant-out-of-range-compare]
            if (split_loc == std::string::npos) {

Investigate github pages build fails

Code quality: reimplement DbChange as variant

DbChange class is clearly used in ways it was not designed for.

The class definition:

enum class DbChangeType {
    Insert = 1,
    Drop = 2,
    Reload = 3,
    ToggleTaint = 4,
    NewIterator = 5,
    UpdateIterator = 6
};

class DBChange {
   public:
    DbChangeType type;
    std::string obj_name;
    std::string parameter;

    DBChange(const DbChangeType &type, const std::string &obj_name,
             const std::string &parameter = "")
        : type(type), obj_name(obj_name), parameter(parameter) {}
};

Uses parameter (a public field!) as a black-box container for basically everything, and database is expected to dispatch this checking change type.

The most egregious example of this is the UpdateIterator change, which uses string parameter to pass two integers (!).

I think we should rewrite this class using std::variant, similarly to the Command class.

So we'll have use:

using DbChange =
    std::variant<InsertDatasetChange, DropDatasetChange, ......>;

And dispatch using std::visit method.

Add metadata support

This is currently just a rough idea and needs more design.

Find a way to assign key-value metadata fields to samples. This will be necessary to implement CERT-Polska/mquery#230.

The expected result is to be able to add arbitrary metadata to datasets and do queries like

select "1234" and meta(pe.number_of_sections) == 3

(syntax subject to change).

I will update this ticket with more design and/or create subtickets when I have time to work on this.

Improve startup error messages for corrupted state

The startup script creates a database if it doesn't exist:

#!/bin/bash

cd /var/lib/ursadb

if [ ! -f "$1" ]
then
    /usr/bin/ursadb_new "$1"
fi

/usr/bin/dumb-init -- /usr/bin/ursadb $@

This checks if a file named $1 already exists. Notably, it doesn't hande the case when $1 exists but is a directory. This situation is also not handled well by the C++ code.

We should improve error message in this situation and guide users somewhat.

Command failed: corrupted index

Hello,

I was using your mquery project and when I tried to index samples with ursadb I get the following error after indexing a few samples.

ursadb_1 | new dataset
ursadb_1 | Command failed: corrupted index, file is too small
ursadb_1 | worker finished: 3, he was doing task 3

Enable LTO for release builds

This should make the db a bit faster in most workloads

Ensure ursadb is thread safe

Right now there are some edge cases when ursadb can have problems, when there are a lot of threads writing and reading data at the same time. We should ensure that we're hanling all these cases correctly.

Add docker build to CI

Right now we only build in the CI using native cmake. We should also check that the project builds with our own Dockerfile.

Add a build ID to unreleased versions of Debian packages

As discussed in #44:

Add a build ID to unreleased versions. So the commit after 1.3.0 will be build as 1.3.0+202012131 for example.

Consider (optionally?) compacting in the background

Consider (optionally?) compacting in the background, or measure the impact compacting has on database performnace.

In the current version compacting is (almost) always a good idea for performance, and in theory we recommend everyone to compact their database, but ursadb won't use idle time to compact the files. Consider doing that in the background when the system is idle (or move this issue to mquery if that should be done by the clients).

Implement a GC for iterators

(long overdue issue) Automatically remove iterators after some time has passed since they were last used. Document this change.

Add E2E tests for the service

Right now the highest level we test are integration tests for single commands. While this is nice, we also need some e2e sniff tests to avoid critical regressions related to master API.

Implement (optional?) query timeouts

It's relatively easy to run extremely complex query in ursadb (for example by ORing a huge amount of strings). To prevent database getting "stuck" on queries like this, we could make it possible to set a per-query timeout.

Create a benchmark suite

We need a way to test if DB is getting faster/slower with time.

decide on the framework
add benchmarks to test indexing performance (including for large and for small files
add benchmarks to test query performance (including very large queries)
integrate with CI

[Question] Best practices

I download periodically malicious files from different sources and put them on a common storage. What are the best practices that you suggest to index new files, while avoiding to unnecessarily re-index those that have already been scanned?

Thank you