Recommendation system using deep learning
pugantsov / stocktwits.recommender
stocktwits cashtag recommender system for master's project.
Dataset growth: ~800k entries three months ago, ~2.2M now.
Traceback (most recent call last):
  File "model_spotlight.py", line 264, in <module>
    sim.run(default=True)
  File "model_spotlight.py", line 257, in run
    results.save(DEFAULT_PARAMS, evaluation)
  File "model_spotlight.py", line 37, in save
    'hyperparameters': self._hash(hyperparameters),
  File "model_spotlight.py", line 32, in <lambda>
    self._hash = lambda x : hashlib.md5(json.dumps(x, sort_keys=True).encode('utf-8')).hexdigest()
  File "/home/alex/anaconda3/envs/recsys/lib/python3.6/json/__init__.py", line 238, in dumps
    **kw).encode(obj)
  File "/home/alex/anaconda3/envs/recsys/lib/python3.6/json/encoder.py", line 199, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/home/alex/anaconda3/envs/recsys/lib/python3.6/json/encoder.py", line 257, in iterencode
    return _iterencode(o, 0)
  File "/home/alex/anaconda3/envs/recsys/lib/python3.6/json/encoder.py", line 180, in default
    o.__class__.__name__)
TypeError: Object of type 'ndarray' is not JSON serializable
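The TypeError comes from passing NumPy arrays/scalars (hyperparameters or metrics) straight into json.dumps, whose default encoder only handles built-in types. One fix is to convert NumPy values to plain Python before hashing; this is a minimal sketch mirroring the _hash lambda above (the helper names to_jsonable and hash_params are mine):

```python
import hashlib
import json

import numpy as np


def to_jsonable(obj):
    """Recursively convert NumPy containers/scalars to plain Python types."""
    if isinstance(obj, np.ndarray):
        return obj.tolist()
    if isinstance(obj, np.generic):  # np.float32, np.int64, ...
        return obj.item()
    if isinstance(obj, dict):
        return {k: to_jsonable(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [to_jsonable(v) for v in obj]
    return obj


def hash_params(params):
    """Stable MD5 over a params dict, as in the _hash lambda above."""
    return hashlib.md5(
        json.dumps(to_jsonable(params), sort_keys=True).encode("utf-8")
    ).hexdigest()
```

sort_keys=True keeps the hash independent of dict ordering, so the same hyperparameters always map to the same digest.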
Traceback (most recent call last):
  File "model_spotlight.py", line 536, in <module>
    sim.run(defaults=True)
  File "model_spotlight.py", line 528, in run
    evaluation = self.evaluation(model, (train, test))
  File "model_spotlight.py", line 439, in evaluation
    train_prec, train_rec = sequence_precision_recall_score(model, train)
  File "/home/alex/anaconda3/envs/recsys/lib/python3.6/site-packages/spotlight/evaluation.py", line 128, in sequence_precision_recall_score
    predictions = -model.predict(sequences[i])
  File "/home/alex/anaconda3/envs/recsys/lib/python3.6/site-packages/spotlight/sequence/implicit.py", line 318, in predict
    self._check_input(sequences)
  File "/home/alex/anaconda3/envs/recsys/lib/python3.6/site-packages/spotlight/sequence/implicit.py", line 187, in _check_input
    item_id_max = item_ids.max()
  File "/home/alex/anaconda3/envs/recsys/lib/python3.6/site-packages/numpy/core/_methods.py", line 28, in _amax
    return umr_maximum(a, axis, None, out, keepdims, initial)
ValueError: zero-size array to reduction operation maximum which has no identity
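The ValueError fires because at least one test sequence contains no real items, so item_ids.max() is called on an empty array. A minimal guard, assuming the zero-padded 2-D sequence array that Spotlight uses (helper name is mine):

```python
import numpy as np


def drop_empty_sequences(sequences):
    """Filter out all-padding rows from a (num_sequences, max_len) array.

    Spotlight pads item-id sequences with 0, and predict() rejects a row
    with no real items, so all-zero rows should be removed before calling
    sequence_precision_recall_score.
    """
    mask = (sequences != 0).any(axis=1)
    return sequences[mask]
```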
Write a file cleaner to clean rows which do not contain tokens
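A sketch of that cleaner, assuming the text lives in a "body" column (the column name is an assumption):

```python
import pandas as pd


def drop_tokenless(df, col="body"):
    """Remove rows whose text column is empty, NaN, or has no word tokens."""
    # \w matches any word character; rows of pure punctuation are dropped too.
    mask = df[col].fillna("").str.contains(r"\w", regex=True)
    return df[mask].reset_index(drop=True)
```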
2019-02-28 00:28:16,844 [MainThread ] [INFO ] Removed users with less than 160 tweets. Size of DataFrame: 304206 -> 168873
2019-02-28 00:28:17,060 [MainThread ] [INFO ] Beginning NER parsing...
2019-02-28 00:39:00,502 [MainThread ] [INFO ] Parsing complete, recompiling DataFrame...
2019-02-28 00:39:03,815 [MainThread ] [INFO ] Removed users with malformed location information. Size of DataFrame: 167151 -> 0
2019-02-28 00:39:04,862 [MainThread ] [INFO ] Written CSV at 2019-02-28 00:39:04 with 0 entries
iterate_location_data is not writing properly to the DataFrame and wipes all entries. Find out how to debug whilst inside a multiprocessing Pool.
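Pool workers swallow tracebacks until the parent calls .get(), so one way to see where the worker dies is to wrap it so the full traceback is logged at the point of failure (the decorator name is mine):

```python
import logging
import traceback
from functools import wraps

logging.basicConfig(level=logging.INFO)
log = logging.getLogger(__name__)


def log_exceptions(fn):
    """Wrap a Pool worker so its traceback is logged inside the child
    process instead of being silently deferred; the exception is then
    re-raised so the parent still sees it on .get()."""
    @wraps(fn)
    def wrapper(*args, **kwargs):
        try:
            return fn(*args, **kwargs)
        except Exception:
            log.error(traceback.format_exc())
            raise
    return wrapper
```

Define the decorated worker at module level (not inside a method) so it stays picklable for Pool.map.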
Traceback (most recent call last):
  File "model_spotlight.py", line 54, in <module>
    sim.run()
  File "model_spotlight.py", line 50, in run
    implicit_model.predict(user_ids, item_ids=None)
  File "/home/alex/anaconda3/envs/recsys/lib/python3.6/site-packages/spotlight/factorization/implicit.py", line 307, in predict
    self._use_cuda)
  File "/home/alex/anaconda3/envs/recsys/lib/python3.6/site-packages/spotlight/factorization/_components.py", line 20, in _predict_process_ids
    user_ids = user_ids.expand(item_ids.size())
RuntimeError: The expanded size of the tensor (11439) must match the existing size (31548) at non-singleton dimension 0
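The size mismatch suggests predict() was given an array of user ids: with item_ids=None, Spotlight's factorization predict scores a single user against every item. A per-user sketch (recommend_top_k is a hypothetical helper, not part of Spotlight):

```python
import numpy as np


def recommend_top_k(model, user_id, k=10, exclude=None):
    """Score all items for one user and return the top-k item ids.

    model.predict(user_id) with no item_ids returns a score per item;
    `exclude` optionally masks already-seen items out of the ranking.
    """
    scores = model.predict(user_id)  # shape: (num_items,)
    if exclude is not None:
        scores = scores.astype(float)
        scores[np.asarray(exclude)] = -np.inf
    return np.argsort(-scores)[:k]
```

For many users, call it in a loop over user ids rather than passing the whole array at once.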
Find out if it's worth normalising the interaction numbers and consider some form of bot detection to get rid of outliers manipulating the recommendations.
parse.py:160: FutureWarning: set_value is deprecated and will be removed in a future release. Please use .at[] or .iat[] accessors instead
d.set_value(i, 'user_loc_check', True)
Try the df.at[df.index[2], 'ColName'] = 3 format instead of d.set_value(i, 'user_location', '|'.join(location)).
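A small example of the replacement the FutureWarning asks for: label-based .at (or positional .iat) instead of the deprecated set_value:

```python
import pandas as pd

df = pd.DataFrame({"user_loc_check": [False, False, False]})

# Deprecated: df.set_value(i, "user_loc_check", True)
# Label-based scalar assignment:
df.at[df.index[2], "user_loc_check"] = True

# Positional alternative (row position, column position):
df.iat[0, df.columns.get_loc("user_loc_check")] = True
```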
Already written. Write a new, tidier class to do this, then figure out how to use embeddings as an item feature in LightFM.
clean_locations needs docstrings.
Add the location features now present in metadata_clean.csv to HybrideBaselineModel, initially without feature weights, then add weights for states, cities, countries etc.
Use the arrow package to parse timestamps and add the result as an item feature. Might be worth doing this in AttributeCleaner so as to separate parsing from the model.
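A sketch of such a parser using arrow; hour-of-day and weekday are just candidate features here (the feature choice and function name are assumptions):

```python
import arrow


def parse_timestamp(ts):
    """Parse an ISO-8601 timestamp string into simple temporal features.

    arrow.get() handles ISO strings like '2019-02-28T15:05:37Z' directly;
    weekday() follows the datetime convention (Monday == 0).
    """
    t = arrow.get(ts)
    return {"hour": t.hour, "weekday": t.weekday()}
```

Keeping this in AttributeCleaner, as the note suggests, means the model code only ever sees the derived numeric features.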
Wipe all entries where watchlist_count is 0.
Write a function to put item IDs, tags and body into a plain, tab-separated text file for doc2vec input.
SAR is intended to be used on interactions with the following schema: <User ID>, <Item ID>, <Time>, [<Event Type>], [<Event Weight>].
Find out what exactly constitutes a type: is it a datatype or a classification? If the latter, use sectors.
2019-02-28 15:05:37,168 [MainThread ] [INFO ] The dataset has 176 users and 92363 items with 18473 interactions in the test and 73890 interactions in the training set.
2019-02-28 15:05:37,169 [MainThread ] [INFO ] Begin fitting collaborative filtering model...
2019-02-28 15:05:39,443 [MainThread ] [INFO ] Collaborative Filtering training set AUC: 0.95837283
2019-02-28 15:05:40,755 [MainThread ] [INFO ] Collaborative Filtering test set AUC: 0.28463602
2019-02-28 15:05:40,755 [MainThread ] [INFO ] There are 92 distinct user locations, 9 distinct sectors, 215 distinct industries and 3929 distinct cashtags.
2019-02-28 15:05:40,825 [MainThread ] [INFO ] Begin fitting hybrid model...
2019-02-28 15:05:44,184 [MainThread ] [INFO ] Hybrid training set AUC: 0.889686
2019-02-28 15:05:45,276 [MainThread ] [INFO ] Hybrid test set AUC: 0.8127067
2019-02-28 15:05:48,247 [MainThread ] [INFO ] Hybrid training set Precision@10: 0.26931816
2019-02-28 15:05:49,271 [MainThread ] [INFO ] Hybrid test set Precision@10: 0.0
2019-02-28 15:05:52,234 [MainThread ] [INFO ] Hybrid training set Recall@10: 0.01004312342323803
2019-02-28 15:05:53,267 [MainThread ] [INFO ] Hybrid test set Recall@10: 0.0
model_baseline_hybrid.py:371: RuntimeWarning: invalid value encountered in double_scalars
f1_train, f1_test = 2*(train_recall * train_precision) / (train_recall + train_precision), 2*(test_recall * test_precision) / (test_recall + test_precision)
2019-02-28 15:05:53,268 [MainThread ] [INFO ] Hybrid training set F1 Score: 0.01936414014913654
2019-02-28 15:05:53,268 [MainThread ] [INFO ] Hybrid test set F1 Score: nan
2019-02-28 15:05:56,225 [MainThread ] [INFO ] Hybrid training set MRR: 0.3619097
2019-02-28 15:05:57,254 [MainThread ] [INFO ] Hybrid test set MRR: 0.0029159905
Precision/Recall seems to break the model when 0.0 is passed; also check why 0.0 is being passed in the first place. The fact that there is no information at all for the test set is unusual, even if the results are very low.
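The nan F1 above is the 0/0 case of the harmonic mean (precision and recall both 0.0 on the test set). A guarded version of the formula avoids the RuntimeWarning:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall, returning 0.0 instead of
    nan when both inputs are 0 (the degenerate test-set case above)."""
    denom = precision + recall
    return 0.0 if denom == 0 else 2 * precision * recall / denom
```

This only silences the symptom; the separate question of why test precision/recall are exactly 0.0 still needs investigating.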
For the purposes of speed-checking between models, make sure you can run with default settings (i.e. just use_cuda=True).