
This project now requires Python 3.8+ and Django 3.2+. For previous versions please refer to the relevant tag or branch.

Elasticsearch for Django

This is a lightweight Django app for people who are using Elasticsearch with Django, and want to manage their indexes.

Compatibility

The master branch is now based on elasticsearch-py 8. If you are using older versions, please switch to the relevant branch (released on PyPI as 2.x, 5.x, 6.x).

Search Index Lifecycle

The basic lifecycle for a search index is simple:

  1. Create an index
  2. Post documents to the index
  3. Query the index

Relating this to our use of search within a Django project, it looks like this:

  1. Create mapping file for a named index
  2. Add index configuration to Django settings
  3. Map models to document types in the index
  4. Post document representation of objects to the index
  5. Update the index when an object is updated
  6. Remove the document when an object is deleted
  7. Query the index
  8. Convert search results into a QuerySet (preserving relevance)

Django Implementation

This section shows how to set up Django to recognise ES indexes, and the models that should appear in an index. From this setup you should be able to run the management commands that will create and populate each index, and keep the indexes in sync with the database.

Create index mapping file

The prerequisite to configuring Django to work with an index is having the mapping for the index available. This is a bit chicken-and-egg, but the underlying assumption is that you are capable of creating the index mappings outside of Django itself, as raw JSON. (The easiest way to spoof this is to POST a JSON document representing your document type at a URL on your ES instance (POST http://ELASTICSEARCH_URL/{{index_name}}) and then retrieve the auto-magic mapping that ES created via GET http://ELASTICSEARCH_URL/{{index_name}}/_mapping.)
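
A minimal sketch of that bootstrap approach using the elasticsearch-py client (the index name, document fields and URL here are assumptions for illustration):

    from elasticsearch import Elasticsearch

    # assumption: a local dev instance - use your own ELASTICSEARCH_URL
    es = Elasticsearch("http://localhost:9200")

    # POST a representative document; ES infers an "auto-magic" mapping for it
    es.index(index="blog", document={"title": "Hello, world", "published": "2023-01-01"})

    # GET the inferred mapping back, ready to be saved as a JSON file
    print(es.indices.get_mapping(index="blog")["blog"]["mappings"])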

Once you have the JSON mapping, you should save it in the root of the Django project as search/mappings/{{index_name}}.json.
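
For illustration only, a mapping file for a blog index might look something like this (the field names are assumptions - use whatever ES generated for your own documents):

    {
        "mappings": {
            "properties": {
                "title": {"type": "text"},
                "published": {"type": "date"}
            }
        }
    }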

Configure Django settings

The Django settings for search are contained in a dictionary called SEARCH_SETTINGS, which should live in the main django.conf.settings file. The dictionary has three root nodes: connections, indexes and settings. Below is an example:

    SEARCH_SETTINGS = {
        'connections': {
            'default': getenv('ELASTICSEARCH_URL'),
            'backup': {
                # all Elasticsearch init kwargs can be used here
                'cloud_id': '{{ cloud_id }}'
            }
        },
        'indexes': {
            'blog': {
                'models': [
                    'website.BlogPost',
                ]
            }
        },
        'settings': {
            # batch size for ES bulk api operations
            'chunk_size': 500,
            # default page size for search results
            'page_size': 25,
            # set to True to connect post_save/delete signals
            'auto_sync': True,
            # List of models which will never auto_sync even if auto_sync is True
            'never_auto_sync': [],
            # if true, then indexes must have mapping files
            'strict_validation': False
        }
    }

The connections node is (hopefully) self-explanatory - we support multiple connections, but in practice you should only need one - the 'default' connection. This is the URL used to connect to your ES instance. The settings node contains site-wide search settings. The indexes node is where we configure how Django and ES play together, and is where most of the work happens.

Note that prior to v8.2 the connection value had to be a connection string; since v8.2 this can still be a connection string, but can also be a dictionary that contains any kwarg that can be passed to the Elasticsearch init method.

Index settings

Inside the indexes node we have a collection of named indexes - in this case just the single index called blog. Inside each index we have a models key which contains a list of Django models that should appear in the index, denoted in app.ModelName format. You can have multiple models in an index, and a model can appear in multiple indexes. How models and indexes interact is described in the next section.

Configuration Validation

When the app boots up it validates the settings, which involves the following:

  1. Does each of the indexes specified have a mapping file?
  2. Does each of the models implement the required mixins?

Implement search document mixins

So far we have configured Django to know the names of the indexes we want, and the models that we want to index. What it doesn't yet know is which objects to index, and how to convert an object to its search index document. This is done by implementing two separate mixins - SearchDocumentMixin and SearchDocumentManagerMixin. The configuration validation routine will tell you if these are not implemented.

SearchDocumentMixin

This mixin is responsible for the search index document format. We are indexing JSON representations of each object, and we have two methods on the mixin responsible for outputting the correct format - as_search_document and as_search_document_update.

An aside on the mechanics of the auto_sync process: it is hooked up using Django's post_save and post_delete model signals. ES supports partial updates to documents that already exist, and we make a fundamental assumption about indexing models - that if you pass the update_fields kwarg to a model.save method call, then you are performing a partial update, and this will be propagated to ES as a partial update only.

To this end, we have two methods for generating the model's JSON representation - as_search_document, which should return a dict that represents the entire object; and as_search_document_update, which takes the update_fields kwarg. This method handles two partial update 'strategies', defined in the SEARCH_SETTINGS: 'full' and 'partial'. The default 'full' strategy simply proxies the as_search_document method - i.e. partial updates are treated as a full document update. The 'partial' strategy is more intelligent - it will map the update_fields specified to the field names defined in the index mapping files. If a field name is passed into the save method but is not in the mapping file, it is ignored. In addition, if the underlying Django model field is a related object, a ValueError will be raised, as we cannot serialize this automatically. In this scenario, you will need to override the method in your subclass - see the code for more details.

To better understand this, let us say that we have a model (MyModel) that is configured to be included in an index called myindex. If we save an object, without passing update_fields, then this is considered a full document update, which triggers the object's index_search_document method:

    obj = MyModel.objects.first()
    obj.save()
    ...
    # AUTO_SYNC=true will trigger a re-index of the complete object document:
    obj.index_search_document(index='myindex')

However, if we only want to update a single field (say the timestamp), and we pass this in to the save method, then this will trigger the update_search_document method, passing in the names of the fields that we want updated.

    # save a single field on the object
    obj.save(update_fields=['timestamp'])
    ...
    # AUTO_SYNC=true will trigger a partial update of the object document
    obj.update_search_document(index='myindex', update_fields=['timestamp'])

We pass the name of the index being updated as the first arg, as objects may have different representations in different indexes:

    def as_search_document(self, index):
        return {'name': "foo"} if index == 'foo' else {'name': "bar"}

In the case of the second method, the simplest possible implementation would be a dictionary containing the names of the fields being updated and their new values, and this is the default implementation. If the fields passed in are simple fields (numbers, dates, strings, etc.) then a simple {'field_name': getattr(obj, field_name)} is returned. However, if the field name relates to a complex object (e.g. a related object) then this method will raise a ValueError. In this scenario you should override the default implementation with one of your own:

    def as_search_document_update(self, index, update_fields):
        if 'user' in update_fields:
            # remove so that it won't raise a ValueError
            update_fields.remove('user')
            doc = super().as_search_document_update(index, update_fields)
            doc['user'] = self.user.get_full_name()
            return doc
        return super().as_search_document_update(index, update_fields)

The reason we have split out the update from the full-document index comes from a real problem that we ourselves suffered. The full object representation that we were using was quite DB intensive - we were storing properties of the model that required walking the ORM tree. However, because we were also touching the objects (see below) to record activity timestamps, we ended up flooding the database with queries simply to update a single field in the output document. Partial updates solve this issue:

    def touch(self):
        self.timestamp = now()
        self.save(update_fields=['timestamp'])

    def as_search_document_update(self, index, update_fields):
        if list(update_fields) == ['timestamp']:
            # only propagate changes if it's +1hr since the last timestamp change
            if now() - self.timestamp < timedelta(hours=1):
                return {}
            else:
                return {'timestamp': self.timestamp}
    ...

Processing updates async

If you are generating a lot of index updates you may want to run them async (via some kind of queueing mechanism). There is no built-in method to do this, given the range of queueing libraries and patterns available; however, it is possible using the pre_index, pre_update and pre_delete signals. In this case you should also turn off AUTO_SYNC (as this runs the updates synchronously) and process the updates yourself. The signals pass in the kwargs required by the relevant model methods, as well as the instance involved:

    # ensure that SEARCH_AUTO_SYNC=False

    from django.dispatch import receiver
    import django_rq
    from elasticsearch_django.signals import (
        pre_index,
        pre_update,
        pre_delete
    )

    queue = django_rq.get_queue("elasticsearch")


    @receiver(pre_index, dispatch_uid="async_index_document")
    def index_search_document_async(sender, **kwargs):
        """Queue up full search index document updates via RQ."""
        instance = kwargs.pop("instance")
        queue.enqueue(
            instance.index_search_document,
            index=kwargs.pop("index"),
        )


    @receiver(pre_update, dispatch_uid="async_update_document")
    def update_search_document_async(sender, **kwargs):
        """Queue up partial search index document updates via RQ."""
        instance = kwargs.pop("instance")
        queue.enqueue(
            instance.update_search_document,
            index=kwargs.pop("index"),
            update_fields=kwargs.pop("update_fields"),
        )


    @receiver(pre_delete, dispatch_uid="async_delete_document")
    def delete_search_document_async(sender, **kwargs):
        """Queue up search index document deletion via RQ."""
        instance = kwargs.pop("instance")
        queue.enqueue(
            instance.delete_search_document,
            index=kwargs.pop("index"),
        )

SearchDocumentManagerMixin

This mixin must be implemented by the model's default manager (objects). It requires a single method implementation - get_search_queryset() - which returns a queryset of the objects that are to be indexed. This can also use the index kwarg to provide different sets of objects to different indexes.

    def get_search_queryset(self, index='_all'):
        return self.get_queryset().filter(foo='bar')
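
Putting the two mixins together, a minimal model might look like the sketch below. This is illustrative only - the model, manager and field names are assumptions, and the method signatures follow the examples above:

    from django.db import models

    from elasticsearch_django.models import (
        SearchDocumentManagerMixin,
        SearchDocumentMixin,
    )


    class BlogPostManager(SearchDocumentManagerMixin, models.Manager):
        def get_search_queryset(self, index='_all'):
            # only published posts are indexed
            return self.get_queryset().filter(is_published=True)


    class BlogPost(SearchDocumentMixin, models.Model):
        title = models.CharField(max_length=100)
        is_published = models.BooleanField(default=False)

        objects = BlogPostManager()

        def as_search_document(self, index):
            # the full document representation, used for 'index' actions
            return {'title': self.title}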

We now have the bare bones of our search implementation, and can use the included management commands to create and populate our search index:

    # create the index 'foo' from the 'foo.json' mapping file
    $ ./manage.py create_search_index foo

    # populate foo with all the relevant objects
    $ ./manage.py update_search_index foo

The next step is to ensure that our models stay in sync with the index.

Add model signal handlers to update index

If the setting auto_sync is True, then on AppConfig.ready each model configured for use in an index has its post_save and post_delete signals connected. This means that they will be kept in sync across all indexes that they appear in whenever the relevant model method is called. (There is some very basic caching to prevent too many updates - the object document is cached for one minute, and if there is no change in the document the index update is ignored.)

There is a VERY IMPORTANT caveat to the signal handling. It will only pick up on changes to the model itself, and not on related (ForeignKey, ManyToManyField) model changes. If the search document is affected by such a change then you will need to implement additional signal handling yourself.
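
One hedged sketch of such a handler, assuming a hypothetical Author model whose name is embedded in each BlogPost search document (it uses the update_search_index method described in the next paragraph):

    from django.db.models.signals import post_save
    from django.dispatch import receiver

    from website.models import Author  # hypothetical related model


    @receiver(post_save, sender=Author, dispatch_uid="reindex_author_posts")
    def reindex_posts_on_author_change(sender, instance, **kwargs):
        # the author's name appears in each post's search document, so
        # re-index every affected post whenever the author is saved
        for post in instance.blogpost_set.all():
            post.update_search_index("index", index="blog")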

In addition to object.save(), SearchDocumentMixin also provides the update_search_index(self, action, index='_all', update_fields=None, force=False) method. The action should be 'index', 'update' or 'delete'. The difference between 'index' and 'update' is that 'update' is a partial update that only changes the fields specified, rather than re-indexing the entire document. If action is 'update' whilst update_fields is None, the action will be changed to 'index'.
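
For example (a sketch assuming the BlogPost model and 'blog' index from earlier):

    post = BlogPost.objects.first()

    # push the full document to every index the model appears in
    post.update_search_index("index")

    # push a partial update of a single field to the 'blog' index only
    post.update_search_index("update", index="blog", update_fields=["title"])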

We now have documents in our search index, kept up to date with their Django counterparts. We are ready to start querying ES.


Search Queries (How to Search)

Running search queries

SearchQuery

The elasticsearch_django.models.SearchQuery model wraps up the execution of a search query, provides helper properties, and logs the query:

    from elasticsearch_django.models import execute_search

    # run a default match_all query
    sq = execute_search(index="foo", query={"match_all": {}})
    # the raw response is stored on the return object,
    # but is not stored on the object in the database.
    print(sq.response)

Calling the execute_search function will execute the underlying search and log the query JSON, the number of hits, and the list of hit meta information for future analysis. It also accepts these additional kwargs (see the sketch after this list):

  • user - the user who is making the query, useful for logging
  • search_terms - the search query supplied by the user (as opposed to the DSL) - not used by ES, but stored in the logs
  • reference - a free text reference field, used for grouping searches together - this could be a session id, for example.
  • save - by default the SearchQuery created will be saved; pass save=False to prevent this.
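
A hedged sketch of these kwargs in use (the index, query and user lookup are assumptions):

    from django.contrib.auth import get_user_model
    from elasticsearch_django.models import execute_search

    user = get_user_model().objects.first()
    sq = execute_search(
        index="blog",
        query={"match": {"title": "django"}},
        user=user,
        search_terms="django",
        reference="session-abc123",
        save=False,  # don't persist this SearchQuery to the database
    )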

Converting search hits into Django objects

Running a search against an index will return a page of results, each containing the _source attribute which is the search document itself (as created by the SearchDocumentMixin.as_search_document method), together with meta info about the result - most significantly the relevance score, which is the magic value used for ranking (ordering) results. However, the search document probably doesn't contain all of the information that you need to display the result, so what you really need is a standard Django QuerySet containing the objects in the search results, but maintaining the order. This means injecting the ES score into the queryset, and then using it for ordering. There is a method on the SearchDocumentManagerMixin called from_search_query which will do this for you. It uses raw SQL to add the score as an annotation to each object in the queryset. (It also adds the 'rank' - so that even if the score is identical for all hits, the ordering is preserved.)

    from website.models import BlogPost

    # run a default match_all query
    sq = execute_search(index="blog", query={"match_all": {}})
    for obj in BlogPost.objects.from_search_query(sq):
        print(obj.search_score, obj.search_rank)


elasticsearch-django's Issues

Unable to get repr for <class 'django_elasticsearch.query.EsQueryset'>

Hi, I wrote this test code:

    from django.db import models

    from django_elasticsearch.models import EsIndexable

    class UserInfoSE(EsIndexable, models.Model):
        tel = models.CharField(max_length=50, null=True, unique=False, default='')

    UserInfoSE.es.queryset.all()

and settings:

    ELASTICSEARCH_AUTO_INDEX = True
    ELASTICSEARCH_URL = 'http://192.168.99.101:9200'

but I got this error: Unable to get repr for <class 'django_elasticsearch.query.EsQueryset'>

Make `SearchDocumentMixin` transaction aware

If a model utilises the SearchDocumentMixin, the check for whether a record is within the search queryset is run when the signal fires, which in some cases is before the transaction has been committed to the database.

This leads to search documents not being created as they "don't exist" and get ignored in the eventual update.

The easiest way to replicate this is to have a model which is processing search updates synchronously.
If your log levels are set to DEBUG you will see that the model isn't in the QuerySet, and the handler exits without a search update.
Subsequent .save() calls to the created model will result in a 404 for the search document update as it doesn't exist.

Add support for partial updates

We currently update the complete document on sync. It would be nice to be able to automatically do a partial document update (supported by ES) in the save signal handler if update_fields is passed, e.g.

    # save the object, and push a complete document to ES
    obj.save()

    # save the object, and push a partial update to ES containing just `first_name`
    obj.save(update_fields=['first_name'])

Add user search terms to the SearchQuery log

Because of the arbitrary complexity of the query DSL, it can often be difficult to parse out exactly what the user is searching for - which is probably the most interesting thing to store. We should allow client code to specify what the user typed in as well as the DSL blob.

Add setting to set `_source` default for all queries

By default a query will return the source document, unless you set "_source": false in the query itself. For many use cases this is sensible, but when using this library specifically the more common behaviour is to use ES to run the search but to then extract the final data from the ORM using the list of ids returned by ES. In this instance the _source is redundant and yet makes up the majority of the over-the-wire content. It would be nice to be able to turn this off by default.

We already use kwargs.setdefault for the page size.

Elasticsearch Version Support

Is there a location in the docs that specifies what version(s) of Elasticsearch this library actively supports? If not, is it possible to add that?

Thanks for creating this lib btw.

Remove elasticsearch-dsl

elasticsearch-dsl does not support ES8, and it looks like it's abandoned - see elastic/elasticsearch-dsl-py#1569

In order to make an ES8-compatible version of this package we need to remove our dependence on ES-DSL. This is a breaking change, so the v8 version will not be backwards compatible.

NB this will be compatible with elasticsearch-py v8, which is the version that formally supports ES8 itself. However, ES8 is backwards compatible with elasticsearch-py v7 - so you can run a v7 client against a v8 server.

Deletion not reflected in index.

Here is my ElasticManager.

This check fails in the post_save / post_delete signal receiver, hence the document isn't removed from the index.

The only way I see this working is if the get_search_queryset method returns a queryset containing the deleted object as well.

I guess

    if action != 'delete' and not self.__class__.objects.in_search_queryset(self.id, index=index)

should fix it.

I could create a P.R. tomorrow.

Support `retry_on_conflict` ES parameter for updates

To avoid version_conflict_engine_exception 409 errors during heavy processing by multiple workers, or bulk imports.

From ES issues:

The second update request is happening at the same time as another request, so between fetching the document, updating it, and reindexing it, another request made an update, resulting in the 409.

In between the get and indexing phases of the update, it is possible that another process might have already updated the same document. By default, the update will fail with a version conflict exception. The retry_on_conflict parameter controls how many times to retry the update before finally throwing an exception.

So our updates should support being able to set the retry_on_conflict parameter (integer).

Allow greater control over model updates forcing an update

The automated signal-attached model updates have a single ON/OFF control switch. In some situations this results in a high frequency of updates when not strictly necessary - e.g. changing a model field that is not indexed and saving the model will result in an update being sent even if the document hasn't changed (caching aside).

It would be good to be able to intercept this process with a custom function that returns True/False to determine whether to update or not.

Adding wait_for when index is being updated

Hi there,
I have a problem when writing tests for Elasticsearch: every time, I need to add something like time.sleep() in the test code to wait while the ES index is updated.
I'd like to use the wait_for param but cannot find a nice way to use it.
Do you have any solutions?

Make search document _id field editable

The id of the search document was originally fixed as the model id attr, and a recent update (#58) updated this to use the model pk attr, for people using custom PKs. However this has recently (today) highlighted a related issue where the search document itself is distinct from the model that you want to search. i.e. when you have a custom 1:1 model just for managing the search document, just to de-couple it from the underlying model that it represents.

As an example we have a Profile model that we index, but it's very large and complex, and we want to add in some additional data to the search document that is not in the profile, and is calculated from other models. We end up with a combination of a large model that contains methods that only pertain to search, and with a lot of attrs that are not relevant to search, so instead we have a Profile:ProfileSearchDocument (1:1) relationship. We index the ProfileSearchDocument, but when we want to reconstitute the queryset we need the Profile model. The from_search_results model manager method would, in this instance, return Profile objects based on ProfileSearchDocument ids, resulting in the wrong objects being returned.

Having a get_search_document_id method on the SearchDocumentMixin would allow you to customise the document _id field to whatever is relevant - in this case the Profile.pk.

Allow easy access to response results

elasticsearch-django strips almost all of the information returned by ES. This makes it difficult, for example, to implement filtering by facets, like this:

        self.search_facets = {}
        for aggregation in search_response.aggregations:
            self.search_facets[aggregation.field] = aggregation.buckets

SearchQuery doesn't provide access to response.aggregations and other response fields, and that limits its usability in my opinion.

<model> object has no attribute 'as_search_action' during `update_search_index`

I'm trying to implement this package as it looks like the only Django Elasticsearch library with v8 support - thanks for the work. I've created my mappings doc, set up the settings, and implemented the required methods on my models. Now I'm trying to run update_search_index, and I'm getting the following error for my DrugLabel model / index:

    2023-02-28 08:48:35,661 ERROR Unable to create search action for source: FDA, product_name: Mycelex , generic_name: Clotrimazole, version_date: 2006-08-14, source_product_number: 17314-9400-1, marketer: Bayer Pharmaceuticals Corp
    Traceback (most recent call last):
      File "/usr/local/lib/python3.10/site-packages/elasticsearch_django/index.py", line 185, in bulk_actions
        yield obj.as_search_action(index=index, action=action)
    AttributeError: 'DrugLabel' object has no attribute 'as_search_action'

Is as_search_action another method which needs to be implemented on the model? I took a look at the tests but didn't see anything there. A full example project using this package would be really useful, as the README has some pretty big gaps for new users unfamiliar with the package.

Invalid handling of null score in from_search_query

When sorting with ES, the response can come back with a score equal to None.

The from_search_query method can't correctly handle this case.

What it does:

    SELECT CASE foo."id" WHEN '0c77a42e-4978-4da9-a35b-7200f5403f77' THEN None WHEN '0d9c56de-26f6-4b7b-8372-a1620f061614' THEN None WHEN '67814da1-ebfe-4630-8e0d-f83db9a4d543' THEN None WHEN '8b80f82c-5f28-451b-84e0-45c492e9f1da' THEN None WHEN 'a8a90aea-21ad-4976-abd5-b136d9c8cf23' THEN None WHEN 'cd4c261e-029a-4a03-a515-44c84aee2d24' THEN None WHEN '7f69a0e2-21d8-4fed-818d-9bfd029ca960' THEN None WHEN '7a570e74-af36-42d3-88d2-b57a73ef4b3d' THEN None WHEN '8e407d5a-239b-4a46-be76-f01b143d15c2' THEN None ELSE 0 END

Exception:

    django.db.utils.ProgrammingError: column "none" does not exist

Store "hits.fields" if supplied

The current implementation deliberately omits the _source data from a query response as this is of unknown size, and storing it could get out of hand. However, if the query passes in fields explicitly then it can be assumed that the user wants those fields - so we should store them. Or at least give the option to store them.

`rebuild_search_index` UnboundLocalError

    ~ $ python manage.py rebuild_search_index people
    Opbeat is disabled
    System check identified some issues:

    WARNINGS:
    invoicing.TimesheetEntryFinanceState.timesheet_entry: (fields.W342) Setting unique=True on a ForeignKey has the same effect as using a OneToOneField.
        HINT: ForeignKey(unique=True) is usually better served by a OneToOneField.
    WARNING This will permanently delete the index 'people'.
    Are you sure you wish to continue? [y/N] y
    WARNING DELETE https://REDACTED.bonsaisearch.net:443/people [status:404 request:0.044s]
    Traceback (most recent call last):
      File "manage.py", line 10, in <module>
        execute_from_command_line(sys.argv)
      File "/app/.heroku/python/lib/python2.7/site-packages/django/core/management/__init__.py", line 367, in execute_from_command_line
        utility.execute()
      File "/app/.heroku/python/lib/python2.7/site-packages/django/core/management/__init__.py", line 359, in execute
        self.fetch_command(subcommand).run_from_argv(self.argv)
      File "/app/.heroku/python/lib/python2.7/site-packages/django/core/management/base.py", line 294, in run_from_argv
        self.execute(*args, **cmd_options)
      File "/app/.heroku/python/lib/python2.7/site-packages/django/core/management/base.py", line 345, in execute
        output = self.handle(*args, **options)
      File "/app/.heroku/python/lib/python2.7/site-packages/elasticsearch_django/management/commands/__init__.py", line 47, in handle
        data = self.do_index_command(index, **options)
      File "/app/.heroku/python/lib/python2.7/site-packages/elasticsearch_django/management/commands/rebuild_search_index.py", line 40, in do_index_command
        'delete': delete,
    UnboundLocalError: local variable 'delete' referenced before assignment
