Comments (5)
Thank you for jumping on this @sarayourfriend, and for your thoughts @rwidom! I do think storing the preliminary information suggested in the issue description is a great start, and will at least give us a sense for how frequently iNaturalist shows up before & after the ingestion occurs. For now, I think this is sufficient - we could even avoid adding these statistics beyond page 3 or 4, as you mention @sarayourfriend. I do think having more information about our searches & provider diversity would be helpful, but for ensuring that iNaturalist doesn't flood all search results after we ingest it I believe that storing this information in Redis and checking it before/after ingestion will give us the impression we need!
I think we might want an RFC, similar to the one for frontend event tracking, to decide which metrics to collect and how to store the data related to searches themselves, and I don't think we'd want to block the iNaturalist ingestion on that effort. Looking at #1088, I think we'd want to avoid tracking searches in Postgres at this time so we don't have to worry about a migration for the API deployment.
from openverse-api.
This may be obvious / already assumed, but it would be great to save the search terms and maybe the page number of the results here as well. Yes? And where would the info get stored? I could imagine stashing docs on S3 with that info as a quick solution, or setting up a new Postgres table. I'm totally ignorant about the API, so please forgive me if these questions are silly in any way.
from openverse-api.
My plan was to store the tallies naively in Redis as described in the issue description, mostly because it is fast and simple to implement. Both seem important to me to avoid delaying turning on the iNaturalist DAG for very long.
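For illustration only, here is a minimal sketch of what naive tallying could look like. The key format, the `MAX_TALLIED_PAGE` cutoff, and the `tally_providers` helper are all hypothetical and not taken from the issue or the PR. The in-memory class is a stand-in so the example runs without a server; it mirrors the `hincrby(name, key, amount)` call that a redis-py client also exposes.

```python
from collections import defaultdict


class InMemoryHashStore:
    """Stand-in for a Redis client; only implements hincrby(name, key, amount)."""

    def __init__(self):
        self.hashes = defaultdict(lambda: defaultdict(int))

    def hincrby(self, name, key, amount=1):
        self.hashes[name][key] += amount
        return self.hashes[name][key]


MAX_TALLIED_PAGE = 4  # assumption: only the first few pages are tallied


def tally_providers(store, providers, page):
    """Increment one counter per provider appearance on a page of results."""
    if page > MAX_TALLIED_PAGE:
        return  # deeper pages are ignored in this sketch
    name = f"provider_appearances:page_{page}"  # hypothetical key format
    for provider in providers:
        store.hincrby(name, provider, 1)


store = InMemoryHashStore()
tally_providers(store, ["flickr", "inaturalist", "flickr"], page=1)
tally_providers(store, ["inaturalist"], page=9)  # skipped: beyond the cutoff
```

Comparing the per-provider counts in these hashes before and after ingestion would show whether iNaturalist's share of the first pages changes.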
I thought I remembered an issue existing in this repository to store queries and their results in Redis, but I can't find it now. It might have been #19, but I recall one with more details about what to store.
That being said, I'm happy to store whatever the folks with more data analysis skills than me think would be useful. I can definitely see how raw result position and page number would be useful information. If iNaturalist is dominating pages 10 – 200 maybe we care less as long as the first 3 or 4 pages are still as diverse.
@rwidom, can you suggest a data model for storing this in Postgres? Should we store individual searches with queries as a JSON blob and the results as a Postgres array of IDs, along with the page number? Page size is then the length of the array, and a result's overall position can be derived from its index in the array plus the page size multiplied by the number of preceding pages. It also occurred to me that we could store each search result as a separate row holding its overall position, with a foreign key to the query stored in a separate table. I'm not sure which is more useful or easier to query.
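As a rough sketch of that derivation (the function name and the 0-based index / 1-based page conventions are assumptions for illustration, and it presumes every page before the current one is full):

```python
def overall_position(index_in_page, page, page_size):
    """Return the 1-based rank of a result across all pages.

    index_in_page is 0-based (position within the stored array),
    page is 1-based, and page_size is the number of results per page.
    """
    preceding_results = (page - 1) * page_size
    return preceding_results + index_in_page + 1


# The fifth result (index 4) on page 3 with 20 results per page:
rank = overall_position(4, page=3, page_size=20)  # 45
```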
One more complication: even if we store the individual searches, the API is cached pretty heavily in Cloudflare, and repeated queries for the same terms will just use the cached response. So we wouldn't necessarily see the total frequency at which people are seeing iNaturalist, just how often it shows up in queries overall. In other words, assume that for any unique query there'd likely only be a single entry in a table storing the executed queries.
My primary goal though is to find a solution that we can implement quickly and easily throw out or iterate on. A table of queries and results would grow pretty quickly and be hard to change in the future, so I'd consider it something worth spending more time on up-front to get right.
If you can share more details about how you envision using S3 for this for a fast solution, I'd be grateful. I hadn't considered how we could use it for this.
from openverse-api.
I like the suggestion to restrict this data gathering to the top few pages. That will probably target the most relevant cases for this particular study.
Agreed to not block on a wider search metrics project.
Thanks for the input!
from openverse-api.
I've reverted the Postgres approach in the PR and switched it back to the Redis approach.
I'll add unit tests and then undraft the PR so we can get this in ASAP and avoid too many delays to iNaturalist getting into the API fully.
from openverse-api.