apache / incubator-bluemarlin

Blue Marlin is critical web infrastructure for advertising-based monetization. It is a cloud platform that adds intelligence to a plain ad system.

Home Page: https://incubator.apache.org/

License: Apache License 2.0

Java 29.36% Python 69.92% Shell 0.72%

apache bluemarlin

incubator-bluemarlin's Introduction

Apache BlueMarlin

For previous announcements, please see our Twitter account: https://twitter.com/fw_marlin

For Blue Marlin events, please go to https://github.com/apache/incubator-bluemarlin/wiki/Editing-Blue-Marlin-Events

For Blue Marlin meeting notes, please go to https://github.com/apache/incubator-bluemarlin/wiki/Blue-Marlin-Meeting-Notes

incubator-bluemarlin's People

Contributors

radibnia77 sreev isabella232 satyamswarup rangaswamymr faezehvaseghi

incubator-bluemarlin's Issues

[BLUEMARLIN-20] test-link-2

This is a test.

[BLUEMARLIN-28] : For DIN-Lookalike model, train.py runs only if line 40 in model.py is commented.

Training fails at startup unless line 40 in model.py is commented out. With that line commented out, train.py runs successfully.
The code below is taken from model.py; line 40 is the one that creates user_emb_w.
hidden_units = 128

# Line 40 of model.py: the user embedding table, one row per user.
user_emb_w = tf.get_variable("user_emb_w", [user_count, hidden_units])
item_emb_w = tf.get_variable("item_emb_w", [item_count, hidden_units // 2])
item_b = tf.get_variable("item_b", [item_count],
                         initializer=tf.constant_initializer(0.0))
cate_emb_w = tf.get_variable("cate_emb_w", [cate_count, hidden_units // 2])
cate_list = tf.convert_to_tensor(cate_list, dtype=tf.int64)
Below is the error displayed.

2022-02-21 17:42:43.558189: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at random_op.cc:76 : Resource exhausted: OOM when allocating tensor with shape[94315979,128] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Traceback (most recent call last):
File "/usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
return fn(*args)
File "/usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
target_list, run_metadata)
File "/usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[94315979,128] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node user_emb_w/Initializer/random_uniform/RandomUniform}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "lookalike_model/trainer/train.py", line 179, in <module>
sess.run(tf.global_variables_initializer())
File "/usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 956, in run
run_metadata_ptr)
File "/usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1180, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
run_metadata)
File "/usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[94315979,128] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[node user_emb_w/Initializer/random_uniform/RandomUniform (defined at usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

Original stack trace for 'user_emb_w/Initializer/random_uniform/RandomUniform':
File "/algorithm/lookalike_model/trainer/train.py", line 178, in <module>
model = Model(user_count, item_count, cate_count, cate_list, predict_batch_size, predict_ads_num)
File "/algorithm/lookalike_model/trainer/model.py", line 40, in __init__
user_emb_w = tf.get_variable("user_emb_w", [user_count, hidden_units])
File "usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/ops/variable_scope.py", line 1500, in get_variable
aggregation=aggregation)
File "usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/ops/variable_scope.py", line 1243, in get_variable
aggregation=aggregation)
File "usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/ops/variable_scope.py", line 567, in get_variable
aggregation=aggregation)
File "usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/ops/variable_scope.py", line 519, in _true_getter
aggregation=aggregation)
File "usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/ops/variable_scope.py", line 933, in _get_single_variable
aggregation=aggregation)
File "usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/ops/variables.py", line 258, in __call__
return cls._variable_v1_call(*args, **kwargs)
File "usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/ops/variables.py", line 219, in _variable_v1_call
shape=shape)
File "usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/ops/variables.py", line 197, in <lambda>
previous_getter = lambda **kwargs: default_variable_creator(None, **kwargs)
File "usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/ops/variable_scope.py", line 2519, in default_variable_creator
shape=shape)
File "usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/ops/variables.py", line 262, in __call__
return super(VariableMetaclass, cls).__call__(*args, **kwargs)
File "usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/ops/variables.py", line 1688, in __init__
shape=shape)
File "usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/ops/variables.py", line 1818, in _init_from_args
initial_value(), name="initial_value", dtype=dtype)
File "usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/ops/variable_scope.py", line 905, in <lambda>
partition_info=partition_info)
File "usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/ops/init_ops.py", line 533, in __call__
shape, -limit, limit, dtype, seed=self.seed)
File "usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/ops/random_ops.py", line 245, in random_uniform
rnd = gen_random_ops.random_uniform(shape, dtype, seed=seed1, seed2=seed2)
File "usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/ops/gen_random_ops.py", line 822, in random_uniform
name=name)
File "usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
op_def=op_def)
File "usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
attrs, op_def, compute_device)
File "usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
op_def=op_def)
File "usr/local/python3/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
self._traceback = tf_stack.extract_stack()
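
The tensor shape in the error message goes a long way toward explaining the failure. As a rough sketch, assuming 4 bytes per float32 element and ignoring optimizer state and all other variables, the memory needed by user_emb_w alone can be estimated as follows. The helper name embedding_bytes is made up for this illustration and is not part of the repository.

```python
def embedding_bytes(rows, cols, bytes_per_element=4):
    """Size in bytes of a dense rows x cols table (float32 by default)."""
    return rows * cols * bytes_per_element


user_count = 94315979   # first dimension of the tensor in the OOM message
hidden_units = 128      # value set in model.py

size = embedding_bytes(user_count, hidden_units)
print(f"user_emb_w alone needs about {size / 2**30:.1f} GiB")
```

The result is roughly 45 GiB, so the initializer can only succeed on a device with at least that much free memory. This is consistent with training starting once the variable is commented out.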

Lookalike: spark data pipeline issue

With the latest lookalike code that you released, we are facing a problem in the last step of the data pipeline, i.e. TFRecord generation.
Script: https://github.com/apache/incubator-bluemarlin/blob/main/Model/lookalike-model/lookalike_model/pipeline/main_tfrecord_generator.py

We are currently running the pipeline for 120 million AIDs, and we get the following error on the highlighted line.

Kindly help us in running the pipeline for big data.

[BLUEMARLIN-25] : Multiple GPU support for DIN lookalike model training

The current DIN Lookalike model training does not support multiple GPUs. We have two GPUs available, but training always uses only one. Training should use all available GPUs.
Alternatively, can the script be migrated to TensorFlow 2.0? That version provides APIs for using all available GPUs.

[BLUEMARLIN-26] Contribute to Interest related queries

DLPredictor predicts traffic based on profile and geolocation attributes. We would like to expand the system to accept interest and behavioral attributes as well (TBR project). We would like to know what real-world queries might look like.

DIN Lookalike: Data Sampling understanding

Hello Jimmy,

We would like to understand how you did sampling in the lookalike_build_dataset.py script.

It would be great if you can share an informal description of this sampling method, so that we can reproduce it at our end.

Thanks.

open discussion of potential of request based prediction for dlpredictor

We'll have an open discussion on the potential of request-based prediction for dlpredictor.

time: 9:30am EST 2/8/2022
location: zoom meeting
zoom id: 693 070 5942

[BLUEMARLIN-23] Contribute to Factdata schema to use Request instead of Impression

The current Factdata is based on the Impression table, not the Request table. To get more accurate results, we need to use the Request table, which contains less human-interrupted data.

Number of Distinct Users in Trainready Table

Hello,

As discussed in previous meetings, the total number of records in the Trainready table is the same as the total number of distinct users (AIDs). We have confirmed this on our side.
It would be helpful if you could check your data and verify whether the final Trainready table has the same number of records as distinct users.

Thanks and Regards

[BLUEMARLIN-19] test link

Describe the bug
A clear and concise description of what the bug is.

To Reproduce
Steps to reproduce the behavior:

Go to '...'
Click on '....'
Scroll down to '....'
See error

Expected behavior
A clear and concise description of what you expected to happen.

Screenshots
If applicable, add screenshots to help explain your problem.

Desktop (please complete the following information):

OS: [e.g. iOS]
Browser [e.g. chrome, safari]
Version [e.g. 22]

Smartphone (please complete the following information):

Device: [e.g. iPhone6]
OS: [e.g. iOS8.1]
Browser [e.g. stock browser, safari]
Version [e.g. 22]

Additional context
Add any other context about the problem here.

[BLUEMARLIN-23] Test-3

Testing

dlpredictor model description

This is an illustration of the seq2seq model with attention for DLpredictor.

This is the relationship of functions defined in model.py

[BLUEMARLIN-21] Validation of the distribution of similarity scores for Lookalike

Process:

1. Build the DIN model.
2. Generate each user's profile based on his/her keyword scores (interests), then compute the similarity score for all pairs of users.
3. Analyze the distribution of the resulting similarity scores to see whether they are concentrated in a narrow range or spread between 0 and 1 (cosine similarity).

Results:

Here is an example of the keyword score profiles of the first 20 users.

user_id	kw1	kw2	kw3	kw4	kw5	kw6	kw7	kw9	kw10	kw11	kw12	kw13	kw14	kw15
1	0.000	0.000	0.000	0.130	0.000	0.399	0.000	0.000	0.612	0.000	0.000	0.301	0.458	0.000
5	0.000	0.000	0.078	0.000	0.000	0.416	0.000	0.366	0.436	0.384	0.000	0.189	0.000	0.541
8	0.000	0.000	0.000	0.000	0.000	0.563	0.000	0.649	0.678	0.000	0.000	0.000	0.600	0.000
10	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.279	0.000	0.125	0.000	0.223	0.000
11	0.000	0.000	0.000	0.000	0.000	0.354	0.000	0.000	0.000	0.000	0.162	0.275	0.000	0.000
15	0.000	0.000	0.099	0.000	0.000	0.000	0.000	0.000	0.509	0.000	0.000	0.249	0.000	0.000
22	0.000	0.000	0.152	0.000	0.000	0.000	0.000	0.000	0.515	0.000	0.000	0.000	0.423	0.000
30	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.474	0.000	0.000	0.000	0.000	0.000
34	0.000	0.000	0.000	0.000	0.299	0.000	0.000	0.000	0.410	0.000	0.149	0.000	0.383	0.000
35	0.000	0.000	0.145	0.000	0.000	0.646	0.000	0.311	0.000	0.000	0.000	0.000	0.440	0.000
37	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.423	0.000	0.000	0.000	0.000	0.000	0.000
39	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.496	0.000	0.000	0.000	0.000	0.000	0.000
41	0.000	0.000	0.000	0.000	0.000	0.327	0.250	0.000	0.000	0.307	0.000	0.000	0.382	0.000
43	0.000	0.000	0.000	0.000	0.000	0.349	0.000	0.000	0.430	0.000	0.000	0.000	0.000	0.000
47	0.000	0.000	0.094	0.000	0.000	0.000	0.000	0.000	0.424	0.000	0.000	0.000	0.000	0.000
49	0.305	0.509	0.000	0.000	0.000	0.721	0.000	0.000	0.758	0.000	0.000	0.000	0.740	0.000
51	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.415	0.000	0.000	0.128	0.000	0.000	0.000
52	0.000	0.000	0.134	0.000	0.000	0.336	0.000	0.000	0.446	0.000	0.090	0.000	0.415	0.000
53	0.106	0.000	0.000	0.000	0.406	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000
55	0.000	0.000	0.000	0.122	0.000	0.000	0.000	0.000	0.371	0.000	0.000	0.000	0.000	0.000

The pairwise user similarity score was computed from each user's keyword score profile. Here is an example of the pairwise similarity scores for the first 20 users above. The similarity scores are well distributed between 0 and 1 rather than concentrated at the low end (0) or the high end (1).

	user1	user2	user3	user4	user5	user6	user7	user8	user9	user10	user11	user12	user13	user14	user15	user16	user17	user18	user19	user20
user1	1.000	0.536	0.794	0.782	0.510	0.729	0.807	0.663	0.708	0.583	0.000	0.000	0.517	0.788	0.648	0.837	0.000	0.906	0.000	0.674
user2	0.536	1.000	0.621	0.325	0.422	0.486	0.349	0.441	0.277	0.466	0.370	0.370	0.401	0.607	0.447	0.451	0.354	0.488	0.000	0.419
user3	0.794	0.621	1.000	0.684	0.335	0.481	0.707	0.543	0.623	0.779	0.520	0.520	0.517	0.706	0.530	0.774	0.497	0.831	0.000	0.516
user4	0.782	0.325	0.684	1.000	0.112	0.653	0.920	0.737	0.884	0.304	0.000	0.000	0.352	0.572	0.720	0.704	0.097	0.844	0.000	0.701
user5	0.510	0.422	0.335	0.112	1.000	0.250	0.000	0.000	0.078	0.562	0.000	0.000	0.379	0.468	0.000	0.379	0.100	0.393	0.000	0.000
user6	0.729	0.486	0.481	0.653	0.250	1.000	0.705	0.885	0.556	0.029	0.000	0.000	0.000	0.687	0.901	0.475	0.000	0.585	0.000	0.841
user7	0.807	0.349	0.707	0.920	0.000	0.705	1.000	0.753	0.836	0.357	0.000	0.000	0.369	0.585	0.784	0.729	0.000	0.872	0.000	0.716
user8	0.663	0.441	0.543	0.737	0.000	0.885	0.753	1.000	0.628	0.000	0.000	0.000	0.000	0.776	0.976	0.537	0.000	0.624	0.000	0.950
user9	0.708	0.277	0.623	0.884	0.078	0.556	0.836	0.628	1.000	0.302	0.000	0.000	0.350	0.488	0.613	0.644	0.067	0.761	0.443	0.597
user10	0.583	0.466	0.779	0.304	0.562	0.029	0.357	0.000	0.302	1.000	0.365	0.365	0.694	0.477	0.037	0.656	0.349	0.688	0.000	0.000
user11	0.000	0.370	0.520	0.000	0.000	0.000	0.000	0.000	0.000	0.365	1.000	1.000	0.000	0.000	0.000	0.000	0.956	0.000	0.000	0.000
user12	0.000	0.370	0.520	0.000	0.000	0.000	0.000	0.000	0.000	0.365	1.000	1.000	0.000	0.000	0.000	0.000	0.956	0.000	0.000	0.000
user13	0.517	0.401	0.517	0.352	0.379	0.000	0.369	0.000	0.350	0.694	0.000	0.000	1.000	0.322	0.000	0.573	0.000	0.587	0.000	0.000
user14	0.788	0.607	0.706	0.572	0.468	0.687	0.585	0.776	0.488	0.477	0.000	0.000	0.322	1.000	0.758	0.739	0.000	0.782	0.000	0.738
user15	0.648	0.447	0.530	0.720	0.000	0.901	0.784	0.976	0.613	0.037	0.000	0.000	0.000	0.758	1.000	0.524	0.000	0.650	0.000	0.928
user16	0.837	0.451	0.774	0.704	0.379	0.475	0.729	0.537	0.644	0.656	0.000	0.000	0.573	0.739	0.524	1.000	0.000	0.880	0.055	0.510
user17	0.000	0.354	0.497	0.097	0.100	0.000	0.000	0.000	0.067	0.349	0.956	0.956	0.000	0.000	0.000	0.000	1.000	0.037	0.000	0.000
user18	0.906	0.488	0.831	0.844	0.393	0.585	0.872	0.624	0.761	0.688	0.000	0.000	0.587	0.782	0.650	0.880	0.037	1.000	0.000	0.593
user19	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.443	0.000	0.000	0.000	0.000	0.000	0.000	0.055	0.000	0.000	1.000	0.000
user20	0.674	0.419	0.516	0.701	0.000	0.841	0.716	0.950	0.597	0.000	0.000	0.000	0.000	0.738	0.928	0.510	0.000	0.593	0.000	1.000
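
A minimal sketch of how individual entries of this matrix can be reproduced, assuming plain cosine similarity over the 14 keyword columns shown above and assuming that user1 to user20 correspond to the table rows in order. The function name cosine_similarity and the profiles dictionary are illustrative and not taken from the repository; the vectors are copied from the rows for user_id 30, 47, 37 and 39.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Keyword score rows copied from the profile table (columns kw1..kw7, kw9..kw15).
profiles = {
    30: [0, 0, 0, 0, 0, 0, 0, 0, 0.474, 0, 0, 0, 0, 0],      # 8th row  (user8)
    47: [0, 0, 0.094, 0, 0, 0, 0, 0, 0.424, 0, 0, 0, 0, 0],  # 15th row (user15)
    37: [0, 0, 0, 0, 0, 0, 0, 0.423, 0, 0, 0, 0, 0, 0],      # 11th row (user11)
    39: [0, 0, 0, 0, 0, 0, 0, 0.496, 0, 0, 0, 0, 0, 0],      # 12th row (user12)
}

sim_8_15 = cosine_similarity(profiles[30], profiles[47])
sim_11_12 = cosine_similarity(profiles[37], profiles[39])
sim_8_11 = cosine_similarity(profiles[30], profiles[37])
print(round(sim_8_15, 3), round(sim_11_12, 3), round(sim_8_11, 3))
```

Under these assumptions the three values agree with the user8/user15, user11/user12 and user8/user11 cells of the matrix.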

The pairwise similarity scores were then computed for the first 20k users, resulting in a 20,000 x 20,000 similarity score matrix (cosine similarity was used). The distribution of the values in the matrix is shown below; it is almost a perfect normal distribution.

Add trained model files

Please add your trained model files for dl-predictor to
Model/predictor-dl-model/experiments

[BLUEMARLIN-24] DL predictor for both ad requests and impressions

Dear all,
We have a requirement to use DL predictor to predict both ad requests and impressions in parallel, and to save the predictions for impressions and ad requests into different output tables and Elasticsearch indexes. Can we discuss the feasibility of this today?
Is it possible to organize a meeting before Thursday's weekly meeting?

dlpredictor cannot process huge log files

If you have a different process for dlpredictor to process huge log files, please add it to
/Processes/dlpredictor/experiments/

Remove 0 imp uckeys before p_n

In dlpredictor pipeline/main_cluster, the uckeys with 0 imp in ts (the training window) are removed before the calculation of std and mean.
This modification resulted in fewer uckeys, but the overall performance of the model was largely unchanged.
The new overall error rate at the slot-id level is 11.1%, compared to 10.7% previously.
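
The filtering step can be sketched roughly as follows. This is not the code in pipeline/main_cluster: the dictionary layout, the function name drop_zero_imp, and the choice to compute the statistics over per-uckey totals are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def drop_zero_imp(ts_by_uckey):
    """Keep only uckeys with at least one impression in the training window."""
    return {uckey: ts for uckey, ts in ts_by_uckey.items() if sum(ts) > 0}


# Hypothetical impression counts per uckey over a 3-day training window.
ts_by_uckey = {
    "uckey_a": [10, 0, 5],
    "uckey_b": [0, 0, 0],   # no impressions at all, so it is removed
    "uckey_c": [3, 4, 2],
}

kept = drop_zero_imp(ts_by_uckey)

# Mean and std are computed only after the all-zero uckeys are gone,
# so they are not pulled toward zero by empty series.
totals = [sum(ts) for ts in kept.values()]
mu, sigma = mean(totals), pstdev(totals)
print(sorted(kept), mu, sigma)
```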

Test issue 1

Describe the bug
This is a test.

[BLUEMARLIN-29] : For DIN-Lookalike model, training is very slow.

Training scenario:

   The following dataset details cover users with at least one click, with step = 10.
   test_dataset_count = 110755727,
   train_dataset_count = 517801469,
   user_count = 94315979,
   item_count = 19

   EPOCH = 250
   train_batch_size = 20480
   test_batch_size = 2048

   The current model takes around 12 hours to train 1 epoch on the full dataset. If we use around 50% of the dataset, selected at random,
   the model still takes around 7-8 hours to train 1 epoch.

   At this rate, training the model for the complete 250 epochs on the full dataset would take around 125 days.

   We are currently using TensorFlow 1.15. Two GPUs are available for training, but only one is used.

Targets for the model:

   1. The model should not take more than 24 hours to train.
   2. The model should be able to use all available GPUs.
   3. Is it possible to further reduce the size of the dataset without losing insights?
   4. Is it possible to get the DIN-Lookalike model and trainer code for TensorFlow 2.0?
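
The figures above can be checked with a short calculation. This is only a back-of-the-envelope sketch using the numbers reported in this issue; the variable names other than train_dataset_count and train_batch_size are chosen here for illustration.

```python
train_dataset_count = 517801469
train_batch_size = 20480
epochs = 250
hours_per_epoch = 12      # observed on the full dataset
target_hours = 24         # target 1 above

# Ceiling division: the last batch of an epoch may be partially filled.
steps_per_epoch = -(-train_dataset_count // train_batch_size)

total_hours = epochs * hours_per_epoch
total_days = total_hours / 24
seconds_per_step = hours_per_epoch * 3600 / steps_per_epoch
required_speedup = total_hours / target_hours

print(f"steps per epoch:  {steps_per_epoch}")
print(f"total training:   {total_days:.0f} days")
print(f"time per step:    {seconds_per_step:.2f} s")
print(f"speedup required: {required_speedup:.0f}x")
```

Meeting the 24-hour target with the same number of epochs would require a speedup of about 125x, which is far more than a second GPU alone could provide; fewer epochs or a smaller dataset would also be needed.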

DL Model accuracy differences

As discussed in the weekly meeting, the hyperparameter "train_skip_first" value of 12 was the reason for the difference in predictions between the model you shared and the model we trained. Below are the actual values and the predictions before and after updating train_skip_first.
If this value needs to be used for train_skip_first, please update the hparams.py file.

Slot id	Actual values	Predictions using model trained on feeder files shared by BM with train_skip_first = 0	Predictions using model trained after changing train_skip_first = 12
a47eavw7ex	618,874,425	710379690	679778335
66bcd2720e5011e79bc8fa163e05184e	141,771,777	150189391	134106959
x0ej5xhk60kjwq	140,838,705	166090973	150899108
l03493p0r3	111,445,915	95256513	97194166
7b0d7b55ab0c11e68b7900163e3e481d	105,754,734	110975030	89637188
b6le0s4qo8	101,684,589	116538098	108032992
e351de37263311e6af7500163e291137	77,548,938	85192615	69053892
a290af82884e11e5bdec00163e291137	72,593,126	78752741	62835164
68bcd2720e5011e79bc8fa163e05184e	48,699,653	52896931	43842830
f1iprgyl13	38,603,295	40964141	33260309
w3wx3nv9ow5i97	35,212,394	32632932	29544796
w9fmyd5r0i	33,533,582	35375640	31485616
d971z9825e	26,376,462	29203099	20635427
l2d4ec6csv	21,556,143	27453184	27377759
z041bf6g4s	17,440,168	25409583	6177790
71bcd2720e5011e79bc8fa163e05184e	12,459,624	10217224	8926710
5cd1c663263511e6af7500163e291137	7,096,585	6485828	5206453
x2fpfbm8rt	6,390,364	6004435	4104797
d9jucwkpr3	4,475,456	6371902	4366933
k4werqx13k	4,464,357	4717659	3375137
a8syykhszz	886,345	880801	790722
j1430itab9wj3b	864,567	698216	797136
s4z85pd1h8	637,263	714682	562948
d4d7362e879511e5bdec00163e291137	188,922	42751	35520
17dd6d8098bf11e5bdec00163e291137	168,582	60899	92402
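
The table lists raw counts only. One way to compare the two models per slot is the absolute percentage error, sketched below for the first three rows. The metric choice and the function name abs_pct_error are assumptions made here; the issue does not say how accuracy was measured.

```python
def abs_pct_error(actual, predicted):
    """Absolute error as a fraction of the actual value."""
    return abs(predicted - actual) / actual


# (slot id, actual, prediction with train_skip_first = 0, prediction with train_skip_first = 12)
rows = [
    ("a47eavw7ex", 618874425, 710379690, 679778335),
    ("66bcd2720e5011e79bc8fa163e05184e", 141771777, 150189391, 134106959),
    ("x0ej5xhk60kjwq", 140838705, 166090973, 150899108),
]

for slot, actual, pred_skip0, pred_skip12 in rows:
    err0 = abs_pct_error(actual, pred_skip0)
    err12 = abs_pct_error(actual, pred_skip12)
    print(f"{slot}: skip_first=0 {err0:.1%}, skip_first=12 {err12:.1%}")
```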

[BLUEMARLIN-30]: DIN Lookalike potential issue

Hello @jimmylao,
I was wondering whether the negative and positive sampling operations could exclude some of the device IDs from the training data.
If so, we might get an error at CTR generation time for those DIDs that were not present during training.
I would like to know your opinion on this.

[BLUEMARLIN-22] Lookalike model trainer is slow

For 1.2m user records, which produce 4.3m training and 1m test records, it takes more than 3 days (250 epochs) to train a lookalike model, at about 20 minutes per epoch.