
Comments (5)

AetherUnbound commented on June 11, 2024

Thank you for jumping on this @sarayourfriend, and for your thoughts @rwidom! I do think storing the preliminary information suggested in the issue description is a great start, and will at least give us a sense for how frequently iNaturalist shows up before & after the ingestion occurs. For now, I think this is sufficient - we could even avoid adding these statistics beyond page 3 or 4, as you mention @sarayourfriend. I do think having more information about our searches & provider diversity would be helpful, but for ensuring that iNaturalist doesn't flood all search results after we ingest it I believe that storing this information in Redis and checking it before/after ingestion will give us the impression we need!

I think we might want an RFC similar to the frontend event tracking for deciding which metrics and how to store the data related to searches themselves, and I don't think we'd want to block the iNaturalist ingestion for that effort. Looking at #1088, I think we'd want to avoid tracking searches in Postgres at this time so we don't have to worry about a migration for the API deployment.

from openverse-api.

rwidom commented on June 11, 2024

This may be obvious / already assumed, but it would be great to save the search terms and maybe the page number of the results here as well. Yes? And where would the info get stored? I could imagine stashing docs on s3 with that info as a quick solution, or setting up a new postgres table? I'm totally ignorant about the API, so please forgive me if these questions are silly in any way.


sarayourfriend commented on June 11, 2024

My plan was to store the tallies naively in Redis as described in the issue description, mostly because it is fast and simple to implement. Both qualities seem important to me so that we avoid delaying turning on the iNaturalist DAG for very long.
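As a rough illustration of what a naive tally could look like (a sketch only, not the code from the PR): the key layout, the `MAX_TRACKED_PAGE` cutoff, the `tally_providers` helper, and the shape of the result dicts are all assumptions made up for this example. `FakeRedis` is an in-memory stand-in so the snippet runs without a server; it mimics only the `hincrby` and `hgetall` hash methods that redis-py clients provide.

```python
from collections import Counter

# Assumption: only the first few pages matter for this study.
MAX_TRACKED_PAGE = 4


class FakeRedis:
    """In-memory stand-in for the two Redis hash methods used below."""

    def __init__(self):
        self._hashes = {}

    def hincrby(self, name, key, amount=1):
        bucket = self._hashes.setdefault(name, {})
        bucket[key] = bucket.get(key, 0) + amount
        return bucket[key]

    def hgetall(self, name):
        return dict(self._hashes.get(name, {}))


def tally_providers(client, results, page, key_prefix="provider_tally"):
    """Add one page of results to the per-page, per-provider counters.

    `results` is assumed to be a list of dicts with a "provider" key.
    Pages beyond MAX_TRACKED_PAGE are ignored.
    """
    if page > MAX_TRACKED_PAGE:
        return
    counts = Counter(result["provider"] for result in results)
    for provider, count in counts.items():
        # One hash per page number; one field per provider.
        client.hincrby(f"{key_prefix}:page:{page}", provider, count)


client = FakeRedis()
page_one = [
    {"provider": "flickr"},
    {"provider": "inaturalist"},
    {"provider": "flickr"},
]
tally_providers(client, page_one, page=1)
tally_providers(client, page_one, page=10)  # ignored: beyond the cutoff
print(client.hgetall("provider_tally:page:1"))
```

Comparing the hashes before and after ingestion would then show whether one provider's share of the early pages has shifted. Note that a real Redis client returns bytes from `hgetall` unless it is configured to decode responses.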

I thought I remembered an issue existing in this repository to store queries and their results in Redis, but I can't find it now. It might have been #19, but I recall one with more details about what to store.

That being said, I'm happy to store whatever the folks with more data analysis skills than me think would be useful. I can definitely see how raw result position and page number would be useful information. If iNaturalist is dominating pages 10 – 200 maybe we care less as long as the first 3 or 4 pages are still as diverse.

@rwidom, can you suggest a data model for storing this in Postgres? Should we store individual searches with queries as a JSON blob and the results as a Postgres array of IDs, along with the page number? Page size is then just the length of the array, and a result's overall position can be derived from its index in the array plus the page size multiplied by the number of preceding pages. It also occurred to me that we could store each search result as a separate row with its overall position and a foreign key to the query, which would be stored in a separate table. Not sure which is more useful or easier to query.
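To make the position arithmetic concrete, here is a small sketch. It assumes 1-based page numbers, a 0-based index within the stored array, and a fixed page size; `overall_position` is a hypothetical helper, not something that exists in the API.

```python
def overall_position(index_in_page: int, page: int, page_size: int) -> int:
    """Return the 1-based overall rank of a result.

    Assumes `page` is 1-based, `index_in_page` is 0-based, and every
    page before this one held exactly `page_size` results.
    """
    if page < 1:
        raise ValueError("page must be 1 or greater")
    if not 0 <= index_in_page < page_size:
        raise ValueError("index_in_page must fall within the page")
    return (page - 1) * page_size + index_in_page + 1


# The first result on page 1 is ranked 1 overall.
print(overall_position(0, page=1, page_size=20))
# The fifth result on page 3 (20 per page) is ranked 45 overall.
print(overall_position(4, page=3, page_size=20))
```

One caveat: the last page of a search may hold fewer results than the requested page size, so the array length is only a safe stand-in for page size on pages that are full.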

One more complication: even if we store the individual searches, the API is cached pretty heavily in Cloudflare, and repeated queries for the same terms will just use the cached response. So we wouldn't necessarily see the total frequency at which people are seeing iNaturalist, just how often it shows up in queries overall. That is, assume that for any unique query, there'd likely only be a single entry in a table storing the executed queries.

My primary goal though is to find a solution that we can implement quickly and easily throw out or iterate on. A table of queries and results would grow pretty quickly and be hard to change in the future, so I'd consider it something worth spending more time on up-front to get right.

If you can share more details about how you envision using S3 for this for a fast solution, I'd be grateful. I hadn't considered how we could use it for this.


sarayourfriend commented on June 11, 2024

I like the suggestion to restrict this data gathering to the top few pages. That will probably target the most relevant cases for this particular study.

Agreed to not block on a wider search metrics project.

Thanks for the input!


sarayourfriend commented on June 11, 2024

I've reverted the Postgres approach in the PR and switched it back to the Redis approach.

I'll add unit tests and then undraft the PR so we can get this in ASAP and avoid too many delays to iNaturalist getting into the API fully.

