apache / incubator-bluemarlin

Blue Marlin is a critical web infrastructure for advertising based monetization. It is a cloud platform that adds intelligence to a plain Ad System.

Home Page: https://incubator.apache.org/

License: Apache License 2.0

Java 29.36% Python 69.92% Shell 0.72%
apache bluemarlin

incubator-bluemarlin's People

Contributors

faezehvaseghi, jbonofre, jimmylao, radibnia77, spyglass700, wangwang55, xun-hu-at-futurewei-com

incubator-bluemarlin's Issues

[BLUEMARLIN-28] : For DIN-Lookalike model, train.py runs only if line 40 in model.py is commented out.

  1. Training fails at the very start unless line 40 in model.py is commented out. With that line commented out, train.py runs successfully.
    In the code below, taken from model.py, line 40 is the user_emb_w definition:

    hidden_units = 128

    user_emb_w = tf.get_variable("user_emb_w", [user_count, hidden_units])
    item_emb_w = tf.get_variable("item_emb_w", [item_count, hidden_units // 2])
    item_b = tf.get_variable("item_b", [item_count],
                             initializer=tf.constant_initializer(0.0))
    cate_emb_w = tf.get_variable("cate_emb_w", [cate_count, hidden_units // 2])
    cate_list = tf.convert_to_tensor(cate_list, dtype=tf.int64)

    The error displayed is shown below.


2022-02-21 17:42:43.558189: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at random_op.cc:76 : Resource exhausted: OOM when allocating tensor with shape[94315979,128] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Traceback (most recent call last):
File "/usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
return fn(*args)
File "/usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
target_list, run_metadata)
File "/usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[94315979,128] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node user_emb_w/Initializer/random_uniform/RandomUniform}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "lookalike_model/trainer/train.py", line 179, in
sess.run(tf.global_variables_initializer())
File "/usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 956, in run
run_metadata_ptr)
File "/usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1180, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
run_metadata)
File "/usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[94315979,128] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[node user_emb_w/Initializer/random_uniform/RandomUniform (defined at usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

Original stack trace for 'user_emb_w/Initializer/random_uniform/RandomUniform':
File "/algorithm/lookalike_model/trainer/train.py", line 178, in
model = Model(user_count, item_count, cate_count, cate_list, predict_batch_size, predict_ads_num)
File "/algorithm/lookalike_model/trainer/model.py", line 40, in init
user_emb_w = tf.get_variable("user_emb_w", [user_count, hidden_units])
File "usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/ops/variable_scope.py", line 1500, in get_variable
aggregation=aggregation)
File "usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/ops/variable_scope.py", line 1243, in get_variable
aggregation=aggregation)
File "usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/ops/variable_scope.py", line 567, in get_variable
aggregation=aggregation)
File "usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/ops/variable_scope.py", line 519, in _true_getter
aggregation=aggregation)
File "usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/ops/variable_scope.py", line 933, in _get_single_variable
aggregation=aggregation)
File "usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/ops/variables.py", line 258, in call
return cls._variable_v1_call(*args, **kwargs)
File "usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/ops/variables.py", line 219, in _variable_v1_call
shape=shape)
File "usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/ops/variables.py", line 197, in
previous_getter = lambda **kwargs: default_variable_creator(None, **kwargs)
File "usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/ops/variable_scope.py", line 2519, in default_variable_creator
shape=shape)
File "usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/ops/variables.py", line 262, in call
return super(VariableMetaclass, cls).call(*args, **kwargs)
File "usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/ops/variables.py", line 1688, in init
shape=shape)
File "usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/ops/variables.py", line 1818, in _init_from_args
initial_value(), name="initial_value", dtype=dtype)
File "usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/ops/variable_scope.py", line 905, in
partition_info=partition_info)
File "usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/ops/init_ops.py", line 533, in call
shape, -limit, limit, dtype, seed=self.seed)
File "usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/ops/random_ops.py", line 245, in random_uniform
rnd = gen_random_ops.random_uniform(shape, dtype, seed=seed1, seed2=seed2)
File "usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/ops/gen_random_ops.py", line 822, in random_uniform
name=name)
File "usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
op_def=op_def)
File "usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
attrs, op_def, compute_device)
File "usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
op_def=op_def)
File "usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 1748, in init
self._traceback = tf_stack.extract_stack()
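
The shape in the error message is enough to estimate how much memory the failing allocation needs. The following is a back-of-the-envelope sketch, not code from the repository; the helper name `tensor_bytes` is made up for illustration, and it assumes 4 bytes per element because the message reports type float.

```python
def tensor_bytes(shape, bytes_per_element=4):
    """Return the memory needed for a dense tensor of the given shape."""
    total = bytes_per_element
    for dim in shape:
        total *= dim
    return total


# Shape reported in the OOM message: [user_count, hidden_units]
user_emb_bytes = tensor_bytes([94315979, 128])
print(f"user_emb_w needs about {user_emb_bytes / 2**30:.1f} GiB")
```

That is roughly 45 GiB for the user_emb_w weights alone, before any optimizer state or activations, which is consistent with the allocation failing on a GPU that has less memory than that.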

[BLUEMARLIN-25] : Multiple GPU support for DIN lookalike model training

  1. The current DIN Lookalike model training does not support multiple GPUs. We have two GPUs available, but training always uses only one. Training should use all available GPUs.
  2. Alternatively, can the script be migrated to TensorFlow 2.0? That version has APIs for using all available GPUs.

[BLUEMARLIN-26] Contribute to Interest related queries

DLPredictor predicts traffic based on profile and geolocation attributes. We would like to expand the system to accept interest and behavioral attributes as well (TBR project). We would like to know what real-world queries might look like.

DIN Lookalike: Data Sampling understanding

Hello Jimmy,

We would like to understand how you did sampling in the lookalike_build_dataset.py script.

image

It would be great if you can share an informal description of this sampling method, so that we can reproduce it at our end.

Thanks.

Number of Distinct Users in Trainready Table

Hello,

As discussed in previous meetings, the total number of records in the Trainready table is the same as the total number of distinct users (aids). We have confirmed this on our side.
It would be helpful if you could check your data and verify whether the final Trainready table has the same number of records as distinct users.

Thanks and Regards

[BLUEMARLIN-19] test link

Describe the bug
A clear and concise description of what the bug is.

To Reproduce
Steps to reproduce the behavior:

  1. Go to '...'
  2. Click on '....'
  3. Scroll down to '....'
  4. See error

Expected behavior
A clear and concise description of what you expected to happen.

Screenshots
If applicable, add screenshots to help explain your problem.

Desktop (please complete the following information):

  • OS: [e.g. iOS]
  • Browser [e.g. chrome, safari]
  • Version [e.g. 22]

Smartphone (please complete the following information):

  • Device: [e.g. iPhone6]
  • OS: [e.g. iOS8.1]
  • Browser [e.g. stock browser, safari]
  • Version [e.g. 22]

Additional context
Add any other context about the problem here.

dlpredictor model description

This is an illustration of the seq2seq model with attention for DLpredictor.
seq2seq_plus_attention

This is the relationship of functions defined in model.py

Functions_in_model

[BLUEMARLIN-21] Validation of the distribution of similarity scores for Lookalike

process:

  1. build DIN model
  2. generate each user's profile based on his/her keyword scores (interests), then compute the similarity score for every pair of users
  3. analyze the distribution of the resulting similarity scores to see whether they are concentrated in a narrow range or spread between 0 and 1 (cosine similarity)

results:

  1. Here’s an example of the keyword score profiles of the first 20 users.
user_id kw1 kw2 kw3 kw4 kw5 kw6 kw7 kw8 kw9 kw10 kw11 kw12 kw13 kw14 kw15
1 0.000 0.000 0.000 0.130 0.000 0.399 0.000 0.000 0.000 0.612 0.000 0.000 0.301 0.458 0.000
5 0.000 0.000 0.078 0.000 0.000 0.416 0.000 0.000 0.366 0.436 0.384 0.000 0.189 0.000 0.541
8 0.000 0.000 0.000 0.000 0.000 0.563 0.000 0.000 0.649 0.678 0.000 0.000 0.000 0.600 0.000
10 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.279 0.000 0.125 0.000 0.223 0.000
11 0.000 0.000 0.000 0.000 0.000 0.354 0.000 0.000 0.000 0.000 0.000 0.162 0.275 0.000 0.000
15 0.000 0.000 0.099 0.000 0.000 0.000 0.000 0.000 0.000 0.509 0.000 0.000 0.249 0.000 0.000
22 0.000 0.000 0.152 0.000 0.000 0.000 0.000 0.000 0.000 0.515 0.000 0.000 0.000 0.423 0.000
30 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.474 0.000 0.000 0.000 0.000 0.000
34 0.000 0.000 0.000 0.000 0.299 0.000 0.000 0.000 0.000 0.410 0.000 0.149 0.000 0.383 0.000
35 0.000 0.000 0.145 0.000 0.000 0.646 0.000 0.000 0.311 0.000 0.000 0.000 0.000 0.440 0.000
37 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.423 0.000 0.000 0.000 0.000 0.000 0.000
39 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.496 0.000 0.000 0.000 0.000 0.000 0.000
41 0.000 0.000 0.000 0.000 0.000 0.327 0.250 0.000 0.000 0.000 0.307 0.000 0.000 0.382 0.000
43 0.000 0.000 0.000 0.000 0.000 0.349 0.000 0.000 0.000 0.430 0.000 0.000 0.000 0.000 0.000
47 0.000 0.000 0.094 0.000 0.000 0.000 0.000 0.000 0.000 0.424 0.000 0.000 0.000 0.000 0.000
49 0.305 0.509 0.000 0.000 0.000 0.721 0.000 0.000 0.000 0.758 0.000 0.000 0.000 0.740 0.000
51 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.415 0.000 0.000 0.128 0.000 0.000 0.000
52 0.000 0.000 0.134 0.000 0.000 0.336 0.000 0.000 0.000 0.446 0.000 0.090 0.000 0.415 0.000
53 0.106 0.000 0.000 0.000 0.406 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
55 0.000 0.000 0.000 0.122 0.000 0.000 0.000 0.000 0.000 0.371 0.000 0.000 0.000 0.000 0.000
  2. Pairwise user similarity scores were computed from each user’s keyword score profile. Here’s an example of the pairwise similarity scores for the first 20 users above. The scores are well distributed between 0 and 1 rather than concentrated at the low end (0) or the high end (1).
user1 user2 user3 user4 user5 user6 user7 user8 user9 user10 user11 user12 user13 user14 user15 user16 user17 user18 user19 user20
user1 1.000 0.536 0.794 0.782 0.510 0.729 0.807 0.663 0.708 0.583 0.000 0.000 0.517 0.788 0.648 0.837 0.000 0.906 0.000 0.674
user2 0.536 1.000 0.621 0.325 0.422 0.486 0.349 0.441 0.277 0.466 0.370 0.370 0.401 0.607 0.447 0.451 0.354 0.488 0.000 0.419
user3 0.794 0.621 1.000 0.684 0.335 0.481 0.707 0.543 0.623 0.779 0.520 0.520 0.517 0.706 0.530 0.774 0.497 0.831 0.000 0.516
user4 0.782 0.325 0.684 1.000 0.112 0.653 0.920 0.737 0.884 0.304 0.000 0.000 0.352 0.572 0.720 0.704 0.097 0.844 0.000 0.701
user5 0.510 0.422 0.335 0.112 1.000 0.250 0.000 0.000 0.078 0.562 0.000 0.000 0.379 0.468 0.000 0.379 0.100 0.393 0.000 0.000
user6 0.729 0.486 0.481 0.653 0.250 1.000 0.705 0.885 0.556 0.029 0.000 0.000 0.000 0.687 0.901 0.475 0.000 0.585 0.000 0.841
user7 0.807 0.349 0.707 0.920 0.000 0.705 1.000 0.753 0.836 0.357 0.000 0.000 0.369 0.585 0.784 0.729 0.000 0.872 0.000 0.716
user8 0.663 0.441 0.543 0.737 0.000 0.885 0.753 1.000 0.628 0.000 0.000 0.000 0.000 0.776 0.976 0.537 0.000 0.624 0.000 0.950
user9 0.708 0.277 0.623 0.884 0.078 0.556 0.836 0.628 1.000 0.302 0.000 0.000 0.350 0.488 0.613 0.644 0.067 0.761 0.443 0.597
user10 0.583 0.466 0.779 0.304 0.562 0.029 0.357 0.000 0.302 1.000 0.365 0.365 0.694 0.477 0.037 0.656 0.349 0.688 0.000 0.000
user11 0.000 0.370 0.520 0.000 0.000 0.000 0.000 0.000 0.000 0.365 1.000 1.000 0.000 0.000 0.000 0.000 0.956 0.000 0.000 0.000
user12 0.000 0.370 0.520 0.000 0.000 0.000 0.000 0.000 0.000 0.365 1.000 1.000 0.000 0.000 0.000 0.000 0.956 0.000 0.000 0.000
user13 0.517 0.401 0.517 0.352 0.379 0.000 0.369 0.000 0.350 0.694 0.000 0.000 1.000 0.322 0.000 0.573 0.000 0.587 0.000 0.000
user14 0.788 0.607 0.706 0.572 0.468 0.687 0.585 0.776 0.488 0.477 0.000 0.000 0.322 1.000 0.758 0.739 0.000 0.782 0.000 0.738
user15 0.648 0.447 0.530 0.720 0.000 0.901 0.784 0.976 0.613 0.037 0.000 0.000 0.000 0.758 1.000 0.524 0.000 0.650 0.000 0.928
user16 0.837 0.451 0.774 0.704 0.379 0.475 0.729 0.537 0.644 0.656 0.000 0.000 0.573 0.739 0.524 1.000 0.000 0.880 0.055 0.510
user17 0.000 0.354 0.497 0.097 0.100 0.000 0.000 0.000 0.067 0.349 0.956 0.956 0.000 0.000 0.000 0.000 1.000 0.037 0.000 0.000
user18 0.906 0.488 0.831 0.844 0.393 0.585 0.872 0.624 0.761 0.688 0.000 0.000 0.587 0.782 0.650 0.880 0.037 1.000 0.000 0.593
user19 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.443 0.000 0.000 0.000 0.000 0.000 0.000 0.055 0.000 0.000 1.000 0.000
user20 0.674 0.419 0.516 0.701 0.000 0.841 0.716 0.950 0.597 0.000 0.000 0.000 0.000 0.738 0.928 0.510 0.000 0.593 0.000 1.000
  3. Pairwise similarity scores (cosine similarity) were computed among the first 20k users, giving a 20,000 x 20,000 similarity score matrix. The distribution of the values in the matrix is shown below; it is almost a perfect normal distribution.
    image
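
The similarity step can be reproduced by hand for a single pair of users. The sketch below is a plain-Python illustration rather than the project's code; the function name `cosine_similarity` is made up here, and the two vectors are the kw1..kw15 scores of user_id 1 and user_id 5 copied from the profile table.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length score vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for all-zero profiles
    return dot / (norm_a * norm_b)


# kw1..kw15 for user_id 1 and user_id 5, taken from the profile table
user_1 = [0.000, 0.000, 0.000, 0.130, 0.000, 0.399, 0.000, 0.000,
          0.000, 0.612, 0.000, 0.000, 0.301, 0.458, 0.000]
user_5 = [0.000, 0.000, 0.078, 0.000, 0.000, 0.416, 0.000, 0.000,
          0.366, 0.436, 0.384, 0.000, 0.189, 0.000, 0.541]

score = cosine_similarity(user_1, user_5)
print(f"similarity(user 1, user 5) = {score:.3f}")
```

The result is about 0.536, in line with the user1/user2 entry of the similarity matrix. Because keyword scores are non-negative, the cosine similarity stays within 0 and 1.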

Add trained model files

Please add your trained model files for dl-predictor to
Model/predictor-dl-model/experiments

[BLUEMARLIN-24] DL predictor for both ad requests and impressions

Dear all,
We have a requirement to use DL predictor to predict both ad requests and impressions in parallel, and to save the predictions of impressions and ad requests into different output tables and Elasticsearch indexes. Can we discuss the feasibility of this today?
Is it possible to organize a meeting before Thursday's weekly meeting?

Remove 0 imp uckeys before p_n

In the dlpredictor pipeline (main_cluster), uckeys with 0 imp in ts (the training window) are removed before the std and mean are calculated.
This modification results in fewer uckeys, while the overall performance of the model stays largely the same.
The new overall error rate at the slot-id level is 11.1%, compared with 10.7% previously.
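
A minimal sketch of this filtering step is given below. It is not the main_cluster code: the function name `filter_zero_imp`, the sample uckeys and their values are hypothetical, and it assumes the mean and std are taken over all remaining impression values pooled together, which may differ from how the pipeline actually computes them.

```python
import statistics


def filter_zero_imp(ts_by_uckey):
    """Drop uckeys whose impressions sum to 0 over the training window."""
    return {uckey: ts for uckey, ts in ts_by_uckey.items() if sum(ts) > 0}


# Hypothetical impression time series per uckey (training window only)
ts_by_uckey = {
    "uckey_a": [10, 12, 8],
    "uckey_b": [0, 0, 0],   # no impressions at all, will be removed
    "uckey_c": [3, 0, 5],   # has some zeros but is kept
}

kept = filter_zero_imp(ts_by_uckey)

# Assumption: statistics are pooled over every remaining value
values = [v for ts in kept.values() for v in ts]
mean = statistics.mean(values)
std = statistics.pstdev(values)
print(sorted(kept), round(mean, 3), round(std, 3))
```

Only uckeys that are zero for the whole window are dropped; a uckey with occasional zero-impression slots still contributes to the statistics.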

[BLUEMARLIN-29] : For DIN-Lookalike model, training is very slow.

Training scenario:

   The following dataset details cover users with at least one click, with step = 10.
   test_dataset_count = 110755727,
   train_dataset_count = 517801469,
   user_count = 94315979,
   item_count = 19
   
   EPOCH = 250
   train_batch_size = 20480
   test_batch_size = 2048
   
   The current model takes around 12 hours to train 1 epoch on the full datasets. Even with a randomly selected 50% of the data,
   it still takes around 7-8 hours per epoch.
   
   At this rate, training the model for the full 250 epochs on the full datasets would take around 125 days.
   
   We are currently using TensorFlow 1.15. Two GPUs are available for training, but only one is used.

Targets for the model:

   1. The model should not take more than 24 hours to train.
   2. The model should be able to use all available GPUs.
   3. Is it possible to further reduce the size of the datasets without losing insights?
   4. Is it possible to get the DIN-Lookalike model and trainer code in a TensorFlow 2.0 version?
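
The figures reported in this issue can be turned into a quick estimate of what the 24-hour target implies. This is only arithmetic on the numbers quoted above; it assumes the 24-hour target covers all 250 epochs, which the issue does not state explicitly.

```python
import math

# Figures quoted in the issue
train_dataset_count = 517801469
train_batch_size = 20480
epochs = 250
hours_per_epoch = 12
target_hours = 24  # assumed to apply to the whole 250-epoch run

steps_per_epoch = math.ceil(train_dataset_count / train_batch_size)
seconds_per_step = hours_per_epoch * 3600 / steps_per_epoch
total_days = hours_per_epoch * epochs / 24
required_speedup = hours_per_epoch * epochs / target_hours

print(f"steps per epoch:  {steps_per_epoch}")
print(f"seconds per step: {seconds_per_step:.2f}")
print(f"total days:       {total_days:.0f}")
print(f"speedup needed:   {required_speedup:.0f}x")
```

Under that assumption, the run would need to be about 125 times faster, so using the second GPU alone would not be enough; fewer epochs or a smaller dataset would also have to be considered.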

DL Model accuracy differences

As discussed in the weekly meeting, the hyperparameter train_skip_first, set to 12, was the reason for the difference between the predictions of the model you shared and those of the model we trained. Below is the predicted data, for comparing accuracy before and after updating train_skip_first.
If 12 is the value that should be used for train_skip_first, please update the hparams.py file accordingly.

Columns: Slot id; Actual values; Predictions from the model trained using feeder files shared by BM with train_skip_first = 0; Predictions from the model trained after changing train_skip_first = 12
a47eavw7ex 618,874,425 710379690 679778335
66bcd2720e5011e79bc8fa163e05184e 141,771,777 150189391 134106959
x0ej5xhk60kjwq 140,838,705 166090973 150899108
l03493p0r3 111,445,915 95256513 97194166
7b0d7b55ab0c11e68b7900163e3e481d 105,754,734 110975030 89637188
b6le0s4qo8 101,684,589 116538098 108032992
e351de37263311e6af7500163e291137 77,548,938 85192615 69053892
a290af82884e11e5bdec00163e291137 72,593,126 78752741 62835164
68bcd2720e5011e79bc8fa163e05184e 48,699,653 52896931 43842830
f1iprgyl13 38,603,295 40964141 33260309
w3wx3nv9ow5i97 35,212,394 32632932 29544796
w9fmyd5r0i 33,533,582 35375640 31485616
d971z9825e 26,376,462 29203099 20635427
l2d4ec6csv 21,556,143 27453184 27377759
z041bf6g4s 17,440,168 25409583 6177790
71bcd2720e5011e79bc8fa163e05184e 12,459,624 10217224 8926710
5cd1c663263511e6af7500163e291137 7,096,585 6485828 5206453
x2fpfbm8rt 6,390,364 6004435 4104797
d9jucwkpr3 4,475,456 6371902 4366933
k4werqx13k 4,464,357 4717659 3375137
a8syykhszz 886,345 880801 790722
j1430itab9wj3b 864,567 698216 797136
s4z85pd1h8 637,263 714682 562948
d4d7362e879511e5bdec00163e291137 188,922 42751 35520
17dd6d8098bf11e5bdec00163e291137 168,582 60899 92402
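
One way to compare the two prediction columns is a per-slot relative error against the actual value. The sketch below uses the first two rows of the table; the metric (absolute error divided by the actual value) and the function name `relative_error` are assumptions made for illustration, not necessarily the accuracy measure used in the project.

```python
def relative_error(actual, predicted):
    """Absolute prediction error as a fraction of the actual value."""
    return abs(predicted - actual) / actual


# (actual, prediction with train_skip_first = 0, prediction with train_skip_first = 12)
rows = {
    "a47eavw7ex": (618874425, 710379690, 679778335),
    "66bcd2720e5011e79bc8fa163e05184e": (141771777, 150189391, 134106959),
}

errors = {
    slot: (relative_error(actual, p0), relative_error(actual, p12))
    for slot, (actual, p0, p12) in rows.items()
}

for slot, (err0, err12) in errors.items():
    print(f"{slot}: skip_first=0 {err0:.1%}, skip_first=12 {err12:.1%}")
```

For these two slots the error is smaller with train_skip_first = 12; the remaining rows would need the same check before drawing a conclusion for the whole table.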

[BLUEMARLIN-30]: DIN Lookalike potential issue

Hello @jimmylao,
I was wondering whether the negative and positive sampling operations could exclude some device IDs (DIDs) from the training data.
If so, we might get an error at CTR generation time for DIDs that were not present during training.
I would like to know your opinion on this.
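
A simple check for this concern is to compare the set of device IDs that survive sampling with the set that will be scored. The sketch below is hypothetical: the function name `find_unseen_dids` and the sample IDs are invented for illustration and do not come from the lookalike scripts.

```python
def find_unseen_dids(training_dids, scoring_dids):
    """Return device IDs that will be scored but never appeared in training."""
    return set(scoring_dids) - set(training_dids)


# Hypothetical IDs: did_3 was dropped by sampling but still needs a CTR score
training_dids = ["did_1", "did_2", "did_4"]
scoring_dids = ["did_1", "did_2", "did_3", "did_4"]

unseen = find_unseen_dids(training_dids, scoring_dids)
print(f"{len(unseen)} unseen device ID(s): {sorted(unseen)}")
```

An empty result would mean sampling did not remove any device ID that is needed later; a non-empty one lists the IDs that need separate handling at CTR generation.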
