searchkit's Introduction

Searchkit

Python library providing file search tools.

The basic principle of searchkit is that you add one or more files or paths and then register one or more searches against those paths. Searches are executed in parallel, and different types are supported, such as a simple single-line search or a multiline/sequence search. Constraints can optionally be applied to searches.

Search Types

Different types of search are supported. Add one or more search definitions to a FileSearcher object, registering them against a file, directory or glob path. Results are collected and returned as a SearchResultsCollection object, which provides different ways to retrieve them.

Simple Search

The SearchDef class supports matching one or more patterns against each line in a file. Patterns are executed until the first match is found.

When defining a search, you can optionally specify field names so that result values can be retrieved by name rather than by index, e.g. for the following content:

    PID TTY          TIME CMD
 111024 pts/4    00:00:00 bash
 111031 pts/4    00:00:00 ps

You can define a search as follows:

SearchDef(r'\s*(\S+)\s+(\S+)\s+(\S+)\s+(\S+)')

and retrieve results with:

for r in results:
    pid = r.get(1)
    tty = r.get(2)
    time = r.get(3)
    cmd = r.get(4)

or alternatively:

for r in results:
    pid, tty, time, cmd = r

or you can provide field names and types:

fields = ResultFieldInfo({'PID': int, 'TTY': str, 'TIME': str, 'CMD': str})
SearchDef(r'\s*(\S+)\s+(\S+)\s+(\S+)\s+(\S+)', field_info=fields)

and retrieve results with:

for r in results:
    pid = r.PID
    tty = r.TTY
    time = r.TIME
    cmd = r.CMD
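
Putting this together, a minimal runnable sketch could look like the following. The filename and content are illustrative only, and it is assumed here that ResultFieldInfo is importable from the top-level searchkit package:

from searchkit import FileSearcher, ResultFieldInfo, SearchDef

# Illustrative ps-style content to search (hypothetical file).
fname = 'ps.txt'
with open(fname, 'w') as fd:
    fd.write(' 111024 pts/4    00:00:00 bash\n')

# ResultFieldInfo import location is assumed; field names/types as above.
fields = ResultFieldInfo({'PID': int, 'TTY': str, 'TIME': str, 'CMD': str})
fs = FileSearcher()
fs.add(SearchDef(r'\s*(\S+)\s+(\S+)\s+(\S+)\s+(\S+)', field_info=fields), fname)
results = fs.run()
for r in results.find_by_path(fname):
    print(r.PID, r.CMD)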

Sequence Search

The SequenceSearchDef class supports matching string sequences ("sections") over multiple lines by matching a start, end and optional body in between. These section components are each defined with their own SearchDef object.
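
For illustration, a section delimited by explicit start and end markers might be defined as follows. This is a minimal sketch: the markers and filename are hypothetical, and the end keyword argument is assumed to mirror the body argument used in the example further down:

from searchkit import FileSearcher, SearchDef, SequenceSearchDef

# Hypothetical section markers; each component is its own SearchDef.
start = SearchDef(r'section start')
body = SearchDef(r'.+')
end = SearchDef(r'section end')  # end= is an assumption; see Example Usage below

fs = FileSearcher()
fs.add(SequenceSearchDef(start, tag='mysection', body=body, end=end), 'somefile.txt')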

Search Constraints

If you are searching e.g. a log file where each line starts with a timestamp and you only want results from after a specific time, you can use searchkit.constraints.SearchConstraintSearchSince and apply it either to the whole file or to each line in turn. The latter allows constraints to be associated with a SearchDef and therefore only apply within the context of that search.

Installation

searchkit is published on PyPI and can be installed as follows:

sudo apt install python3-pip
pip install searchkit

Example Usage

An example simple search is as follows:

from searchkit import FileSearcher, SearchDef

fname = 'foo.txt'
with open(fname, 'w') as fd:
    fd.write('the quick brown fox')
fs = FileSearcher()
fs.add(SearchDef(r'.+ \S+ (\S+) .+'), fname)
results = fs.run()
for r in results.find_by_path(fname):
    print(r.get(1))

An example sequence search is as follows:

from searchkit import FileSearcher, SequenceSearchDef, SearchDef

content = """
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 10: 'foo'"""

fname = 'my.log'
with open(fname, 'w') as fd:
    fd.write(content)

start = SearchDef(r'Traceback')
body = SearchDef(r'.+')
# The sequence is terminated by the start of the next one or EOF, so no end def is needed.

fs = FileSearcher()
fs.add(SequenceSearchDef(start, tag='myseq', body=body), fname)
results = fs.run()
for seq, seq_results in results.find_sequence_by_tag('myseq').items():
    for r in seq_results:
        if 'body' in r.tag:
            print(r.get(0))

An example search with constraints is as follows:

from searchkit import FileSearcher, SearchDef
from searchkit.constraints import SearchConstraintSearchSince, TimestampMatcherBase

class MyDateTimeMatcher(TimestampMatcherBase):
    @property
    def patterns(self):
        return [r'^(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2}) '
                r'(?P<hours>\d{2}):(?P<minutes>\d{2}):(?P<seconds>\d{2})']

fname = 'foo.txt'
with open(fname, 'w') as fd:
    fd.write('2023-01-01 12:34:24 feeling cold\n')
    fd.write('2023-06-01 12:34:24 feeling hot')

today = '2023-06-02 12:34:24'
constraint = SearchConstraintSearchSince(today, None,
                                         ts_matcher_cls=MyDateTimeMatcher)
fs = FileSearcher(constraint=constraint)
fs.add(SearchDef(r'\S+ \S+ \S+ (\S+)'), fname)
results = fs.run()
for r in results.find_by_path(fname):
    print(r.get(1) == 'hot')
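
Alternatively, the same constraint could be associated with the SearchDef itself so that it only applies within the context of that search. The following is a sketch only, reusing constraint and fname from the example above; the constraints keyword argument is an assumption based on the Search Constraints section:

# NOTE: constraints= is an assumption for illustration.
sd = SearchDef(r'\S+ \S+ \S+ (\S+)', constraints=[constraint])
fs = FileSearcher()
fs.add(sd, fname)
for r in fs.run().find_by_path(fname):
    print(r.get(1) == 'hot')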

searchkit's People

Contributors

dosaboy, mustafakemalgilor, nicolasbock

searchkit's Issues

SearchDef parameter "hint" has a confusing name

The SearchDef parameter "hint" is defined as an optional parameter with the following description:

optional pre-search term. If provided, this is expected to match in order for the main search to be executed

The effect of the parameter is not obvious to a downstream user who has not read its description. The name "hint" suggests that it is optional, that it merely helps the SearchDef do its job better (e.g. as a performance optimization), and that the match result would be the same with or without it, but it is actually doing more than that.

The parameter may be better understood if we rename it to something like "prefilter".
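
For reference, usage looks roughly like this (a minimal sketch with an illustrative pattern and filename):

from searchkit import FileSearcher, SearchDef

# Per the parameter description, 'ERROR' must match before the main
# pattern is attempted, i.e. it effectively acts as a prefilter.
sd = SearchDef(r'.+ ERROR (\S+)', hint='ERROR')

fs = FileSearcher()
fs.add(sd, 'my.log')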

Use different loggers for different log sources

Currently, searchkit uses the default logger for logging, and there is no distinction between log sources. Searchkit should ideally have its own dedicated loggers, preferably at a reasonable granularity to give better control over logging, e.g. "searchkit.searchdef", "searchkit.constraints.datesince".
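
A sketch of what that could look like using the standard library logging module and the names suggested above:

import logging

# Hypothetical dedicated loggers, one per log source.
searchdef_log = logging.getLogger('searchkit.searchdef')
constraints_log = logging.getLogger('searchkit.constraints.datesince')

# Consumers could then tune each source independently, e.g.:
logging.getLogger('searchkit').setLevel(logging.WARNING)
constraints_log.setLevel(logging.DEBUG)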

store each key as file by default in mpcache

Depending on how the cache is used, it may be more efficient to store all keys in a single file vs. using a separate file per key. By default we should use the latter (a separate file per key) going forward.
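
Conceptually, the difference is between serialising the whole key/value map into one file and writing each key to its own file, e.g. (an illustrative sketch only, not the actual mpcache implementation):

import os
import pickle

def store_all_in_one_file(path, data):
    # Single file: any update rewrites the entire cache.
    with open(path, 'wb') as fd:
        pickle.dump(data, fd)

def store_file_per_key(dirpath, key, value):
    # One file per key: an update only touches the affected key.
    os.makedirs(dirpath, exist_ok=True)
    with open(os.path.join(dirpath, key), 'wb') as fd:
        pickle.dump(value, fd)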

Sub file-level parallelization

Currently, searchkit parallelizes work at the file level, so the smallest unit of work is a "file". This approach does not scale well when the file size distribution is uneven. Consider the following log files:

  • a.log (4 KiB)
  • b.log (4 GiB)
  • c.log (17 KiB)

... and assume for simplicity that we have 3 execution units (processes, threads, etc.). The total amount of data is 4 GiB + 21 KiB, but the load distribution will be 4 KiB, 4 GiB and 17 KiB, which is heavily unbalanced. Ideally, each execution unit should process roughly 4/3 GiB of data so that the load is evenly balanced.

This could be achieved without major architectural changes by making a large file appear as multiple smaller files (pseudofiles). The actual splitting would be done at line-feed boundaries so that no access across pseudofile boundaries is needed, as sketched below.

Note that this might impact performance for gzip-compressed files, since gzip requires the file to be decompressed for every seek operation.
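
A sketch of how pseudofile boundaries could be derived so that each chunk starts and ends on a line feed (illustrative only, not part of searchkit):

import os

def pseudofile_offsets(path, num_chunks):
    """Split path into roughly equal byte ranges, extending each range to
    the next line feed so that no line straddles two pseudofiles."""
    size = os.path.getsize(path)
    target = max(1, size // num_chunks)
    offsets = []
    start = 0
    with open(path, 'rb') as fd:
        while start < size:
            fd.seek(min(start + target, size))
            fd.readline()  # advance to the next line feed (or EOF)
            end = fd.tell()
            offsets.append((start, end))
            start = end
    return offsets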

Weird behavior in "SearchConstraintSearchSince"

I noticed that the existing SearchConstraintSearchSince seeks to the end of the matching line instead of the start of the line, e.g. consider the following example:

dummy.log:

  2022-01-01 00:00:00.00 L0
  2022-01-01 01:00:00.00 L1
  2022-01-02 00:00:00.00 L2
  2022-01-02 01:00:00.00 L3
  2022-01-03 00:00:00.00 L4

... and let's have a FileSearcher with the following constraint:

  self.current_date = self.get_date('Tue Jan 01 00:00:00 UTC 2022')
  c = SearchConstraintSearchSince(current_date=self.current_date,
                                  cache_path=self.constraints_cache_path,
                                  ts_matcher_cls=TimestampSimple, days=7)
  s = FileSearcher(constraint=c)
  sd = SearchDef(r"{}\S+ (.+)".format(self.datetime_expr), tag='mysd')
  fname = os.path.join(self.data_root, 'dummy.log')
  s.add(sd, path=fname)
  results = s.run()

Here, constraint c will have 2021-12-25 00:00:00 as its since date, which is older than all of the timestamps in the file, but when the filesearcher s is run, the results only contain lines L1, L2, L3 and L4. Shouldn't they also contain L0, given that it satisfies the constraint (2022-01-01 00:00:00.00 >= 2021-12-25 00:00:00)?
