Problem We do not currently have any visibility into the number o

Thank you for jumping on this <a class="user-mention notranslate" data-hovercard-type=

Count provider occurrences in executed query results about openverse-api HOT 5 CLOSED

sarayourfriend commented on June 11, 2024

Count provider occurrences in executed query results

from openverse-api.

Comments (5)

AetherUnbound commented on June 11, 2024 2

Thank you for jumping on this @sarayourfriend, and for your thoughts @rwidom! I do think storing the preliminary information suggested in the issue description is a great start, and will at least give us a sense for how frequently iNaturalist shows up before & after the ingestion occurs. For now, I think this is sufficient - we could even avoid adding these statistics beyond page 3 or 4, as you mention @sarayourfriend. I do think having more information about our searches & provider diversity would be helpful, but for ensuring that iNaturalist doesn't flood all search results after we ingest it I believe that storing this information in Redis and checking it before/after ingestion will give us the impression we need!

I think we might want an RFC similar to the frontend event tracking for deciding which metrics and how to store the data related to searches themselves, and I don't think we'd want to block the iNaturalist ingestion for that effort. Looking at #1088, I think we'd want to avoid tracking searches in Postgres at this time so we don't have to worry about a migration for the API deployment.

from openverse-api.

rwidom commented on June 11, 2024

This may be obvious / already assumed, but it would be great to save the search terms and maybe the page number of the results here as well. Yes? And where would the info get stored? I could imagine stashing docs on s3 with that info as a quick solution, or setting up a new postgres table? I'm totally ignorant about the API, so please forgive me if these questions are silly in any way.

from openverse-api.

sarayourfriend commented on June 11, 2024

My plan was to store the tallies naively in Redis as described in the issue description, mostly because it is fast and simple to implement. Both seem important to me to avoid delaying turning on the iNaturalist DAG for very long.

I thought I remembered an issue existing in this repository to store queries and their results in Redis, but I can't find it now. It might have been #19, but I recall one with more details about what to store.

That being said, I'm happy to store whatever the folks with more data analysis skills than me think would be useful. I can definitely see how raw result position and page number would be useful information. If iNaturalist is dominating pages 10 – 200 maybe we care less as long as the first 3 or 4 pages are still as diverse.

@rwidom, can you suggest a data model for storing this in Postgres? Should we store individual searches with queries as a JSON blob and the results as a Postgres array of IDs, along with the page number? Page size and result position can be derived from the result's position in the array plus the length of the array multiplied by the page count. It also occurred to me that we could store each search result as a separate row with the overall position with a foreign key to the query stored in a separate table. Not sure which is more useful or easy to query.

One more complication, even if we store the individual searches, the API is cached pretty heavily in Cloudflare and repeated queries for the same terms will just used the cached response, so we wouldn't necessarily see the total frequency at which people are seeing iNaturalist, just how often it shows in queries overall. i.e., assume that for any unique query, there'd likely only be a single entry in a table storing the executed queries.

My primary goal though is to find a solution that we can implement quickly and easily throw out or iterate on. A table of queries and results would grow pretty quickly and be hard to change in the future, so I'd consider it something worth spending more time on up-front to get right.

If you can share more details about how you envision using S3 for this for a fast solution, I'd be grateful. I hadn't considered how we could use it for this.

from openverse-api.

sarayourfriend commented on June 11, 2024

I like the suggestion to restrict this data gathering to the top few pages. That will probably target the most relevant cases for this particular study.

Agreed to not block on a wider search metrics project.

Thanks for the input!

from openverse-api.

sarayourfriend commented on June 11, 2024

I've reverted the Postgres approach in the PR and switched it back to the Redis approach.

I'll add unit tests and then undraft the PR so we can get this in ASAP and avoid too many delays to iNaturalist getting into the API fully.

from openverse-api.

Count provider occurrences in executed query results about openverse-api HOT 5 CLOSED

Comments (5)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent