metarank / metarank Goto Github PK

A low code Machine Learning personalized ranking service for articles, listings, search results, recommendations that boosts user engagement. A friendly Learn-to-Rank engine

Home Page: https://metarank.ai

License: Apache License 2.0

Scala 99.24% Shell 0.36% Smarty 0.17% Java 0.23%

ranking scala search personalization machine-learning deep-learning data-engineering feature-engineering feature-extraction kubernetes

metarank's Introduction

Metarank: real time personalization as a service

Docs | Website | Community Slack | Blog | Demo

What is Metarank?

Metarank is an open-source ranking service. It can help you to build a personalized semantic/neural search and recommendations.

If you just want to get started, try:

the quickstart tutorial of implementing Learning-to-Rank on top of your search engine.
a semantic search guide of building an LLM-based neural search.
a collaborative filtering recommendations guide to create a "you may also like" widget as seen on many e-commerce stores.

Why Metarank?

With Metarank, you can make your existing search and recommendations smarter:

Integrate customer signals like clicks and purchases into the ranking - and optimize for maximal CTR!
Track visitor profile and make search results adapt to user actions with real-time personalization.
Use LLMs in bi- and cross-encoder mode to make your search understand the true meaning of search queries.

Metarank is fast:

optimized for reranking latency, it can handle even large result sets within 10-20ms. See benchmarks.
as a stateless cloud-native service (with state managed by Redis), it can scale horizontally and process thousands of RPS. See Kubernetes deployment guide for details.

Save your development time:

Metarank can compute dozens of typical ranking signals out of the box: CTR, referer, User-Agent, time, etc - you don't need to write custom ad-hoc code for most common ranking factors. See the full list of supported ranking signals in our docs.
There are integrations with many possible streaming processing systems to ingest visitor signals: See data sources for details.

What can you build with Metarank?

Metarank helps you build advanced ranking systems for search and recommendations:

Semantic search: use state-of-the-art LLMs to make your Elasticsearch/OpenSearch understand the meaning of your queries
Recommendations: traditional collaborative-filtering and new-age semantic content recommendations.
Learning-to-Rank: optimize your existing search

Content

Blog posts:

Meetups and conference talks:

Building an open-source online Learn-to-rank engine, Haystack EU 23, slides
Overcoming position and presentation biases in search and recommender systems, Data Natives Meetup Berlin, slides
Learning-to-rank: Deep, fast, precise - choose any two, DataTalks meetup, slides

Main features

Semantic neural search: [TODO]
Recommendations: trending and similar-items (MF ALS).
Personalization: secondary reranking (LambdaMART)
AutoML: automatic feature generation and model re-training
A/B testing: multiple model serving

Demo

You can play with Metarank demo on demo.metarank.ai:

The demo itself and the data used are open-source and you can grab a copy of training events and config file in the github repo.

Metarank in One Minute

Let us show how you can start personalizing content with LambdaMART-based reranking in just under a minute:

Prepare the data: we will get the dataset and config file from the demo.metarank.ai
Start Metarank in a standalone mode: it will import the data, train the ML model and start the API.
Send a couple of requests to the API.

Step 1: Prepare data

We will use the ranklens dataset, which is used in our Demo, so just download the data file

curl -O -L https://github.com/metarank/metarank/raw/master/src/test/resources/ranklens/events/events.jsonl.gz

Step 2: Prepare configuration file

We will again use the configuration file from our Demo. It utilizes in-memory store, so no other dependencies are needed.

curl -O -L https://raw.githubusercontent.com/metarank/metarank/master/src/test/resources/ranklens/config.yml

Step 3: Start Metarank!

With the final step we will use Metarank’s standalone mode that combines training and running the API into one command:

docker run -i -t -p 8080:8080 -v $(pwd):/opt/metarank metarank/metarank:latest standalone --config /opt/metarank/config.yml --data /opt/metarank/events.jsonl.gz

You will see some useful output while Metarank is starting and grinding through the data. Once this is done, you can send requests to localhost:8080 to get personalized results.

Here we will interact with several movies by clicking on one of them and observing the results.

First, let's see the initial output provided by Metarank without before we interact with it

# get initial ranking for some items
curl http://localhost:8080/rank/xgboost \
    -d '{
    "event": "ranking",
    "id": "id1",
    "items": [
        {"id":"72998"}, {"id":"67197"}, {"id":"77561"},
        {"id":"68358"}, {"id":"79132"}, {"id":"103228"}, 
        {"id":"72378"}, {"id":"85131"}, {"id":"94864"}, 
        {"id":"68791"}, {"id":"93363"}, {"id":"112623"}
    ],
    "user": "alice",
    "session": "alice1",
    "timestamp": 1661431886711
}'

# {"item":"72998","score":0.9602446652021992},{"item":"79132","score":0.7819134441404151},{"item":"68358","score":0.33377910321385645},{"item":"112623","score":0.32591281190727805},{"item":"103228","score":0.31640256043322723},{"item":"77561","score":0.3040782705414116},{"item":"94864","score":0.17659007036183608},{"item":"72378","score":0.06164568676567339},{"item":"93363","score":0.058120639770243385},{"item":"68791","score":0.026919880032451306},{"item":"85131","score":-0.35794106000271037},{"item":"67197","score":-0.48735167237049154}

# tell Metarank which items were presented to the user and in which order from the previous request
# optionally, we can include the score calculated by Metarank or your internal retrieval system
curl http://localhost:8080/feedback \
 -d '{
  "event": "ranking",
  "fields": [],
  "id": "test-ranking",
  "items": [
    {"id":"72998","score":0.9602446652021992},{"id":"79132","score":0.7819134441404151},{"id":"68358","score":0.33377910321385645},
    {"id":"112623","score":0.32591281190727805},{"id":"103228","score":0.31640256043322723},{"id":"77561","score":0.3040782705414116},
    {"id":"94864","score":0.17659007036183608},{"id":"72378","score":0.06164568676567339},{"id":"93363","score":0.058120639770243385},
    {"id":"68791","score":0.026919880032451306},{"id":"85131","score":-0.35794106000271037},{"id":"67197","score":-0.48735167237049154}
  ],
  "user": "test2",
  "session": "test2",
  "timestamp": 1661431888711
}'

Now, let's intereact with the items 93363

# click on the item with id 93363
curl http://localhost:8080/feedback \
 -d '{
  "event": "interaction",
  "type": "click",
  "fields": [],
  "id": "test-interaction",
  "ranking": "test-ranking",
  "item": "93363",
  "user": "test",
  "session": "test",
  "timestamp": 1661431890711
}'

Now, Metarank will personalize the items, the order of the items in the response will be different

# personalize the same list of items
# they will be returned in a different order by Metarank
curl http://localhost:8080/rank/xgboost \
 -d '{
  "event": "ranking",
  "fields": [],
  "id": "test-personalized",
  "items": [
    {"id":"72998"}, {"id":"67197"}, {"id":"77561"},
    {"id":"68358"}, {"id":"79132"}, {"id":"103228"}, 
    {"id":"72378"}, {"id":"85131"}, {"id":"94864"}, 
    {"id":"68791"}, {"id":"93363"}, {"id":"112623"}
  ],
  "user": "test",
  "session": "test",
  "timestamp": 1661431892711
}'

# {"items":[{"item":"93363","score":2.2013986484185124},{"item":"72998","score":1.1542776301073876},{"item":"68358","score":0.9828904282341605},{"item":"112623","score":0.9521647429731446},{"item":"79132","score":0.9258841742518286},{"item":"77561","score":0.8990921381835769},{"item":"103228","score":0.8990921381835769},{"item":"94864","score":0.7131600718467729},{"item":"68791","score":0.624462038351694},{"item":"72378","score":0.5269765094008626},{"item":"85131","score":0.29198666089255343},{"item":"67197","score":0.16412780810560743}]}

Useful Links

What's next?

Check out a more in-depth Quickstart full Reference.

If you have any questions, don't hesitate to join our Slack!

License

This project is released under the Apache 2.0 license, as specified in the License file.

metarank's People

Contributors

Stargazers

Watchers

Forkers

scala-steward vgoloviznin braginivan andreysaksonov isaka hirajanwin jithinraj devdoshi antennaesdk bobthehands mbrukman stvhanna ivrrimum datastudysquad chinayuan dylanxult whiteorg tiagoboldt davgit admariner lmataa ruthraiahthulasi neurostep linecode saurabharch hadryan apfeff auaan laplacekorea zorrock djnzx cyberflamego keyvec onlyone0001 anujsrc carloswalterbr jaypark-1eco kyroskoh francishero eodenyire maatticlabs nasnl leos-code spread0x jonahharris hadsed thedigitaloctopus snowpedroferreira ravichoudhary33 daoos abdukhashimov nikdon researchoor iabdukhoshimov shibe mullerhai ericnoam sergiipolokhalo dakl kyougyun postpcera nicojuicy edithlliott tomsquest nakamotojp nekonton idaniellaw slaanish truesoni brunoscaglione visioninhope nadirlaskar djangoonewo wlad-jg shahinsharifi adrielc avsolatorio andrei-ghenov deadbrainviv digitalarchivo dealcrane

metarank's Issues

S3 access is broken due to flink JAR hell

To access S3 from flink jobs, one should include a flink-s3-fs-hadoop dependency. The main issue with this artifact is that it:

bundles ancient version of aws-sdk (from 2016)
bundles ancient version of Guava (v11, also from 2016)

While building a release fat-jar, there is a ton of file conflicts for multiple bundled dependencies (so we also depend on guava, but classpath already contains a v11 from this jar). The possible solutions are:

repackage the flink-s3-fs-hadoop in a way so there will be no bundling. But then we need to support this fork forever.
only require the flink-s3-fs-hadoop jar be present in classpath on runtime when there is a s3:// prefix used somewhere. It would be fine for a docker container as we can automate it, but painful for local runs. But anyway who is going to use S3 on local runs?

item metadata events should be validated over schema

so you can only write fields matching the schema

store: batching support

Some events are triggering too many random writes from db. It can be OK if there is proper caching (narrator: it is not yet), but in some cases we need the way to pull multiple values at the same time.

Main use case: the reranking request. We need to pull all product metadata at once, otherwise the latency can be extremely high.

As this feature is not critically important and not affecting customer-facing features, then it can be optional.

migrate to scala 2.13

XGBoost is the problematic dependency due to it's dependency on spark.
Spark 3.1 is going to support 2.13, so we can just wait for early 2021, or make a PR to xgboost to have a pure no-spark xgboost-jvm available.

property test for an ingestion pipeline

As now it seems to be working, but without proper validation of the behavior.

Can I use only click data?

Hi,
Thanks for stepping in to solve one of the major issue in personalized content serving.
I have a doubt in ranking events. I believe it is like impression events where all impression is grouped together for an user.

Can I use only click data? i.e. I want to use only metadata and interaction events for personalization. Please suggest changes in this config if how can I skip ranking event.

interactions:
  - name: click
    weight: 1.0
features:
  - name: popularity
    type: number
    scope: item
    source: metadata.popularity

  - name: vote_avg
    type: number
    scope: item
    source: metadata.vote_avg

  - name: vote_cnt
    type: number
    scope: item
    source: metadata.vote_cnt

  - name: budget
    type: number
    scope: item
    source: metadata.budget

  - name: release_date
    type: number
    scope: item
    source: metadata.release_date

  - name: runtime
    type: number
    scope: item
    source: metadata.runtime

  - name: title_length
    type: word_count
    source: metadata.title
    scope: item

  - name: genre
    type: string
    scope: item
    source: metadata.genres
    values:
      - drama
      - comedy
      - thriller
      - action
      - adventure
      - romance
      - crime
      - science fiction
      - fantasy
      - family
      - horror
      - mystery
      - animation
      - history
      - music

  - name: ctr
    type: rate
    top: click
    bottom: impression
    scope: item
    bucket: 24h
    periods: [7,30]

  - name: liked_genre
    type: interacted_with
    interaction: click
    field: metadata.genres
    scope: session
    count: 10
    duration: 24h

  - name: liked_actors
    type: interacted_with
    interaction: click
    field: metadata.actors
    scope: session
    count: 10
    duration: 24h

  - name: liked_tags
    type: interacted_with
    interaction: click
    field: metadata.tags
    scope: session
    count: 10
    duration: 24h

  - name: liked_director
    type: interacted_with
    interaction: click
    field: metadata.director
    scope: session
    count: 10
    duration: 24h

setup scala steward for dependency updates tracking

Redis support for external storage

As it's the simplest thing to implement for MVP.

Refactor a store API to support different access patterns

As now it's only can serialize a complete values, which is quite costly for large blobs like map. The idea is to have it designed as in Apache Flink:

a state desriptor (value, list, map)
state api for accessing these types
state impl can have either a generic implementation for all access patterns, or implement something more specific to the underlying storage engine (for example, for postgres to to map lists to arrays)

Batch jsonl/json[] support for ingestion API

multi-keyspace support

Currently there is only a single featurespace supported in the config file. It may not be enough for some common use cases:

separate staging and prod
multiple ranking targets

Protobuf support for ingestion API

field-query matching feature

TF/IDF or BM25. Or something custom. The main problem with tfidf is that it depends on term frequency, which is constantly changing for existing documents while you're adding new documents. So maybe smth like BoW intersection over union for a field and query.

vector and categorial feature support

Currently features emit float lists. When exporting the data it quickly becomes a mess:

not clear to which particular feature the number belongs
with windowed features, it would be nice to know which window it is
CSV export will have no way of generating header.

We can have a small of types to encode this info there:

with feature name, e.g. "count"
scope, so it will be count of "click"
window: "last 7 days"

Tutorial for ranklens dataset

Write down the steps done in RankLens test, but for humans.

ERR Protocol error: invalid bulk length

When we rank too many items over too many features, the redis read request becomes too large:

20:20:22.701 ERROR org.http4s.server.service-errors - Error servicing request: POST /rank from 127.0.0.1
redis.clients.jedis.exceptions.JedisDataException: ERR Protocol error: invalid bulk length
        at redis.clients.jedis.Protocol.processError(Protocol.java:96)
        at redis.clients.jedis.Protocol.process(Protocol.java:137)
        at redis.clients.jedis.Protocol.read(Protocol.java:192)
        at redis.clients.jedis.Connection.readProtocolWithCheckingBroken(Connection.java:316)
        at redis.clients.jedis.Connection.getOne(Connection.java:298)
        at redis.clients.jedis.Connection.executeCommand(Connection.java:123)
        at redis.clients.jedis.Jedis.mget(Jedis.java:818)
        at io.findify.featury.connector.redis.RedisStore.$anonfun$read$3(RedisStore.scala:47)
        at apply @ io.findify.featury.connector.redis.RedisStore.$anonfun$read$2(RedisStore.scala:47)
        at flatMap @ io.findify.featury.connector.redis.RedisStore.$anonfun$read$2(RedisStore.scala:47)
        at make @ ai.metarank.mode.inference.FlinkMinicluster$.resource(FlinkMinicluster.scala:13)
        at make @ ai.metarank.mode.inference.FlinkMinicluster$.resource(FlinkMinicluster.scala:13)
        at use @ ai.metarank.mode.inference.Inference$.$anonfun$run$4(Inference.scala:33)
        at flatMap @ ai.metarank.mode.inference.api.RankApi.$anonfun$rerank$2(RankApi.scala:34)
        at apply @ ai.metarank.mode.inference.api.RankApi.rerank(RankApi.scala:33)
        at flatMap @ ai.metarank.mode.inference.api.RankApi.rerank(RankApi.scala:33)
        at flatMap @ ai.metarank.mode.inference.api.RankApi$$anonfun$1.$anonfun$applyOrElse$1(RankApi.scala:25)
        at flatMap @ fs2.Compiler$Target.flatMap(Compiler.scala:162)

So we should split such requests into smaller chunks and send them in parallel instead.

end-to-end ranklens test

It should mimic the docs, but with a full set of user interactions: bootstrap, then train+upload+inference. It should also send a couple of test rank+feedback events to see that the ranking changes.

Proper error reporting on config loading errors

As now it will just spit exception without any user-friendly information on how to fix the issue.

FeatureSchema shoud not have duplicate definitions

So fields/features won't have conflicts

Event processing pipeline

As now we have a stub for it. So we need receive the event, and pipe in thru the aggregator pipeline.

dataset export for booster training

for lightgbm4j

feed contents feature

Currently we ignore all the messages about feed metadata. We need to have:

a tool to validate events against schema defined in config
store all the metadata in db

Shitch to single ingester - single schema config model

As it makes implementation too complicated with almost dynamic schema.

scala.MatchError: config.yml (of class java.lang.String)

Hi,
I am trying to reproduce ranklens tutorial . I can't pass Data Bootstraping section. It is throwing error that events.jsonl.gz is not GZIP file. I downloaded whole file from this repo itself. I am posting small subset of log which explains this error. Btw thanks for making it opensource <3.

00:58:03.840 INFO  o.a.f.c.f.s.a.LocalityAwareSplitAssigner - Assigning split to non-localized request: Optional[FileSourceSplit: file:/home/ranklens/events/events.jsonl.gz [0, 132) (no host info) ID=0000000001 position=null]
00:58:03.849 INFO  o.a.flink.runtime.taskmanager.Task - Source: load -> select-feedback (1/1)#0 (3b6e79acb377db9330027f1d178a19c7) switched from INITIALIZING to RUNNING.
00:58:03.850 INFO  o.a.f.r.e.ExecutionGraph - Source: load -> select-feedback (1/1) (3b6e79acb377db9330027f1d178a19c7) switched from INITIALIZING to RUNNING.
00:58:03.849 INFO  o.a.f.c.f.s.i.StaticFileSplitEnumerator - Assigned split to subtask 0 : FileSourceSplit: file:/home/ranklens/events/events.jsonl.gz [0, 132) (no host info) ID=0000000001 position=null
00:58:03.852 INFO  o.a.f.c.b.s.reader.SourceReaderBase - Adding split(s) to reader: [FileSourceSplit: file:/home/ranklens/events/events.jsonl.gz [0, 132) (no host info) ID=0000000001 position=null]
00:58:03.860 INFO  o.a.f.c.b.s.r.fetcher.SplitFetcher - Starting split fetcher 0
00:58:03.871 ERROR o.a.f.c.b.s.r.f.SplitFetcherManager - Received uncaught exception.
java.lang.RuntimeException: SplitFetcher thread 0 received unexpected exception while polling the records
        at org.apache.flink.connector.base.source.reader.fetcher.SplitFetcher.runOnce(SplitFetcher.java:150)
        at org.apache.flink.connector.base.source.reader.fetcher.SplitFetcher.run(SplitFetcher.java:105)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.util.zip.ZipException: Not in GZIP format
        at java.base/java.util.zip.GZIPInputStream.readHeader(GZIPInputStream.java:166)
        at java.base/java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:80)
        at java.base/java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:92)
        at org.apache.flink.api.common.io.compression.GzipInflaterInputStreamFactory.create(GzipInflaterInputStreamFactory.java:42)
        at org.apache.flink.api.common.io.compression.GzipInflaterInputStreamFactory.create(GzipInflaterInputStreamFactory.java:30)
        at org.apache.flink.connector.file.src.impl.StreamFormatAdapter.lambda$openStream$3(StreamFormatAdapter.java:168)
        at org.apache.flink.connector.file.src.util.Utils.doWithCleanupOnException(Utils.java:45)
        at org.apache.flink.connector.file.src.impl.StreamFormatAdapter.openStream(StreamFormatAdapter.java:162)
        at org.apache.flink.connector.file.src.impl.StreamFormatAdapter.createReader(StreamFormatAdapter.java:65)
        at org.apache.flink.connector.file.src.impl.FileSourceSplitReader.checkSplitOrStartNext(FileSourceSplitReader.java:112)
        at org.apache.flink.connector.file.src.impl.FileSourceSplitReader.fetch(FileSourceSplitReader.java:65)
        at org.apache.flink.connector.base.source.reader.fetcher.FetchTask.run(FetchTask.java:58)
        at org.apache.flink.connector.base.source.reader.fetcher.SplitFetcher.runOnce(SplitFetcher.java:142)
        ... 6 common frames omitted

Edit 1:
I have tried training with both metarank-assembly-0.2.2.jar and metarank-assembly-0.2.1.jar but getting the same error. 0.2.0 isn't working also.

Add user/session to the interaction event

So we don't need to join it with the ranking event all the time

generate release artifacts

So now user needs to build the metarank from sources. It would be great to have a metarank-0.2.0-bin.tar.gz archive available right on github releases page with proper build.

We already have sbt-native-packager in sbt plugins which can do most of the job, but we can optionally go further and automate the release process with sbt github release plugin.

External storage API

Postgres support for external storage

As this one should be more suitable for production usage.

support multiple keyspaces in config file

So we can have multiple schemas running at once.

field_match feature should validate config over item schema

So you should not be able to do matching over non-existent field.

Add caching for sbt in the CI

So we don't pull it all the time on each build.

Tech design specification

As a part of official documentation, not just for internal use.

Document user-defined ranking objectives

For MVP I guess it will be enough to have click-focused ranking. In the future we may have something more user-configurable.

Reranking API

Define the schema and cover it in the docs.

Feature extractor internal API

Essential parts:

dimensionality support (categorial features as one-hot encoding)
save/restore from bytes

RocksDB storage impl

It can be useful to locally play with the tool, when distributes setup is not needed.

add cache to github actions CI

To speed up the build, so we will never re-resolve all the dependencies

cli tool

Now to execute parts of metarank user needs to enter quite a long magical commands like
java -cp metarank-assembly-0.2.0.jar ai.metarank.mode.inference.Inference. It would be nice to have something more human-friendly thing like merarank inference <opts>.

Add scalafmt check to CI

So new PRs should be well-formatted.

YAML support for config file

docker image has no libgomp installed

TrainTest causes JVM to crash

The crash is happening inside the LightGBM CreateBooster method and only when both Flink-based tests were run before the TrainTest. Probably it's happening due to double-loading of native library in two separate classloaders - in this case it's only scoped to tests and should not really affect non-devs.

For now the test is ignored in CI, but it should be fixed in the future.

state codec support

Now for CircularReservoir we have manually coded implementation for codec. It's ok on this scale, but if we need to store richer records (like arbitrary case classes), then we need to have automated way of deriving codecs for custom case classes.

Open questions: