Recommendation system using deep learning
pugantsov / stocktwits.recommender
stocktwits cashtag recommender system for master's project.
Dataset growth: ~800k entries three months ago, ~2.2M now.
Traceback (most recent call last):
  File "model_spotlight.py", line 264, in <module>
    sim.run(default=True)
  File "model_spotlight.py", line 257, in run
    results.save(DEFAULT_PARAMS, evaluation)
  File "model_spotlight.py", line 37, in save
    'hyperparameters': self._hash(hyperparameters),
  File "model_spotlight.py", line 32, in <lambda>
    self._hash = lambda x : hashlib.md5(json.dumps(x, sort_keys=True).encode('utf-8')).hexdigest()
  File "/home/alex/anaconda3/envs/recsys/lib/python3.6/json/__init__.py", line 238, in dumps
    **kw).encode(obj)
  File "/home/alex/anaconda3/envs/recsys/lib/python3.6/json/encoder.py", line 199, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/home/alex/anaconda3/envs/recsys/lib/python3.6/json/encoder.py", line 257, in iterencode
    return _iterencode(o, 0)
  File "/home/alex/anaconda3/envs/recsys/lib/python3.6/json/encoder.py", line 180, in default
    o.__class__.__name__)
TypeError: Object of type 'ndarray' is not JSON serializable
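The TypeError comes from passing NumPy arrays/scalars (hyperparameters or metrics) straight into json.dumps, whose default encoder only handles built-in types. One fix is to convert NumPy values to plain Python before hashing; this is a minimal sketch mirroring the _hash lambda above (the helper names to_jsonable and hash_params are mine):

```python
import hashlib
import json

import numpy as np


def to_jsonable(obj):
    """Recursively convert NumPy containers/scalars to plain Python types."""
    if isinstance(obj, np.ndarray):
        return obj.tolist()
    if isinstance(obj, np.generic):  # np.float32, np.int64, ...
        return obj.item()
    if isinstance(obj, dict):
        return {k: to_jsonable(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [to_jsonable(v) for v in obj]
    return obj


def hash_params(params):
    """Stable MD5 over a params dict, as in the _hash lambda above."""
    return hashlib.md5(
        json.dumps(to_jsonable(params), sort_keys=True).encode("utf-8")
    ).hexdigest()
```

sort_keys=True keeps the hash independent of dict ordering, so the same hyperparameters always map to the same digest.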
Traceback (most recent call last):
  File "model_spotlight.py", line 536, in <module>
    sim.run(defaults=True)
  File "model_spotlight.py", line 528, in run
    evaluation = self.evaluation(model, (train, test))
  File "model_spotlight.py", line 439, in evaluation
    train_prec, train_rec = sequence_precision_recall_score(model, train)
  File "/home/alex/anaconda3/envs/recsys/lib/python3.6/site-packages/spotlight/evaluation.py", line 128, in sequence_precision_recall_score
    predictions = -model.predict(sequences[i])
  File "/home/alex/anaconda3/envs/recsys/lib/python3.6/site-packages/spotlight/sequence/implicit.py", line 318, in predict
    self._check_input(sequences)
  File "/home/alex/anaconda3/envs/recsys/lib/python3.6/site-packages/spotlight/sequence/implicit.py", line 187, in _check_input
    item_id_max = item_ids.max()
  File "/home/alex/anaconda3/envs/recsys/lib/python3.6/site-packages/numpy/core/_methods.py", line 28, in _amax
    return umr_maximum(a, axis, None, out, keepdims, initial)
ValueError: zero-size array to reduction operation maximum which has no identity
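The ValueError fires because at least one test sequence contains no real items, so item_ids.max() is called on an empty array. A minimal guard, assuming the zero-padded 2-D sequence array that Spotlight uses (helper name is mine):

```python
import numpy as np


def drop_empty_sequences(sequences):
    """Filter out all-padding rows from a (num_sequences, max_len) array.

    Spotlight pads item-id sequences with 0, and predict() rejects a row
    with no real items, so all-zero rows should be removed before calling
    sequence_precision_recall_score.
    """
    mask = (sequences != 0).any(axis=1)
    return sequences[mask]
```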
Write a file cleaner to clean rows which do not contain tokens
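A sketch of that cleaner, assuming the text lives in a "body" column (the column name is an assumption):

```python
import pandas as pd


def drop_tokenless(df, col="body"):
    """Remove rows whose text column is empty, NaN, or has no word tokens."""
    # \w matches any word character; rows of pure punctuation are dropped too.
    mask = df[col].fillna("").str.contains(r"\w", regex=True)
    return df[mask].reset_index(drop=True)
```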
2019-02-28 00:28:16,844 [MainThread ] [INFO ] Removed users with less than 160 tweets. Size of DataFrame: 304206 -> 168873
2019-02-28 00:28:17,060 [MainThread ] [INFO ] Beginning NER parsing...
2019-02-28 00:39:00,502 [MainThread ] [INFO ] Parsing complete, recompiling DataFrame...
2019-02-28 00:39:03,815 [MainThread ] [INFO ] Removed users with malformed location information. Size of DataFrame: 167151 -> 0
2019-02-28 00:39:04,862 [MainThread ] [INFO ] Written CSV at 2019-02-28 00:39:04 with 0 entries
iterate_location_data is not writing properly to the DataFrame and wipes all entries. Find out how to debug whilst inside a multiprocessing Pool.
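Pool workers swallow tracebacks until the parent calls .get(), so one way to see where the worker dies is to wrap it so the full traceback is logged at the point of failure (the decorator name is mine):

```python
import logging
import traceback
from functools import wraps

logging.basicConfig(level=logging.INFO)
log = logging.getLogger(__name__)


def log_exceptions(fn):
    """Wrap a Pool worker so its traceback is logged inside the child
    process instead of being silently deferred; the exception is then
    re-raised so the parent still sees it on .get()."""
    @wraps(fn)
    def wrapper(*args, **kwargs):
        try:
            return fn(*args, **kwargs)
        except Exception:
            log.error(traceback.format_exc())
            raise
    return wrapper
```

Define the decorated worker at module level (not inside a method) so it stays picklable for Pool.map.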
Traceback (most recent call last):
  File "model_spotlight.py", line 54, in <module>
    sim.run()
  File "model_spotlight.py", line 50, in run
    implicit_model.predict(user_ids, item_ids=None)
  File "/home/alex/anaconda3/envs/recsys/lib/python3.6/site-packages/spotlight/factorization/implicit.py", line 307, in predict
    self._use_cuda)
  File "/home/alex/anaconda3/envs/recsys/lib/python3.6/site-packages/spotlight/factorization/_components.py", line 20, in _predict_process_ids
    user_ids = user_ids.expand(item_ids.size())
RuntimeError: The expanded size of the tensor (11439) must match the existing size (31548) at non-singleton dimension 0
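The size mismatch suggests predict() was given an array of user ids: with item_ids=None, Spotlight's factorization predict scores a single user against every item. A per-user sketch (recommend_top_k is a hypothetical helper, not part of Spotlight):

```python
import numpy as np


def recommend_top_k(model, user_id, k=10, exclude=None):
    """Score all items for one user and return the top-k item ids.

    model.predict(user_id) with no item_ids returns a score per item;
    `exclude` optionally masks already-seen items out of the ranking.
    """
    scores = model.predict(user_id)  # shape: (num_items,)
    if exclude is not None:
        scores = scores.astype(float)
        scores[np.asarray(exclude)] = -np.inf
    return np.argsort(-scores)[:k]
```

For many users, call it in a loop over user ids rather than passing the whole array at once.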
Find out if it's worth normalising the interaction numbers and consider some form of bot detection to get rid of outliers manipulating the recommendations.
parse.py:160: FutureWarning: set_value is deprecated and will be removed in a future release. Please use .at[] or .iat[] accessors instead
d.set_value(i, 'user_loc_check', True)
Try the df.at[df.index[2], 'ColName'] = 3 format instead of d.set_value(i, 'user_location', '|'.join(location)).
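A small example of the replacement the FutureWarning asks for: label-based .at (or positional .iat) instead of the deprecated set_value:

```python
import pandas as pd

df = pd.DataFrame({"user_loc_check": [False, False, False]})

# Deprecated: df.set_value(i, "user_loc_check", True)
# Label-based scalar assignment:
df.at[df.index[2], "user_loc_check"] = True

# Positional alternative (row position, column position):
df.iat[0, df.columns.get_loc("user_loc_check")] = True
```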
Already written. Write a new, tidier class to do this, then figure out how to use embeddings as an item feature in LightFM.
clean_locations needs docstrings.
Add the location features now present in metadata_clean.csv to HybrideBaselineModel, initially without feature weights, then add weights for states, cities, countries etc.
Use the arrow package to parse timestamps and add the result as an item feature. Might be worth doing this in AttributeCleaner so as to separate parsing from the model.
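A sketch of such a parser using arrow; hour-of-day and weekday are just candidate features here (the feature choice and function name are assumptions):

```python
import arrow


def parse_timestamp(ts):
    """Parse an ISO-8601 timestamp string into simple temporal features.

    arrow.get() handles ISO strings like '2019-02-28T15:05:37Z' directly;
    weekday() follows the datetime convention (Monday == 0).
    """
    t = arrow.get(ts)
    return {"hour": t.hour, "weekday": t.weekday()}
```

Keeping this in AttributeCleaner, as the note suggests, means the model code only ever sees the derived numeric features.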
Wipe all entries where watchlist_count is 0.
Write a function to put item IDs, tags and body into a plain, tab-separated text file for doc2vec input.
SAR is intended to be used on interactions with the following schema: <User ID>, <Item ID>, <Time>, [<Event Type>], [<Event Weight>].
Find out what exactly constitutes a type: is it a datatype or a classification? If the latter, use sectors.
2019-02-28 15:05:37,168 [MainThread ] [INFO ] The dataset has 176 users and 92363 items with 18473 interactions in the test and 73890 interactions in the training set.
2019-02-28 15:05:37,169 [MainThread ] [INFO ] Begin fitting collaborative filtering model...
2019-02-28 15:05:39,443 [MainThread ] [INFO ] Collaborative Filtering training set AUC: 0.95837283
2019-02-28 15:05:40,755 [MainThread ] [INFO ] Collaborative Filtering test set AUC: 0.28463602
2019-02-28 15:05:40,755 [MainThread ] [INFO ] There are 92 distinct user locations, 9 distinct sectors, 215 distinct industries and 3929 distinct cashtags.
2019-02-28 15:05:40,825 [MainThread ] [INFO ] Begin fitting hybrid model...
2019-02-28 15:05:44,184 [MainThread ] [INFO ] Hybrid training set AUC: 0.889686
2019-02-28 15:05:45,276 [MainThread ] [INFO ] Hybrid test set AUC: 0.8127067
2019-02-28 15:05:48,247 [MainThread ] [INFO ] Hybrid training set Precision@10: 0.26931816
2019-02-28 15:05:49,271 [MainThread ] [INFO ] Hybrid test set Precision@10: 0.0
2019-02-28 15:05:52,234 [MainThread ] [INFO ] Hybrid training set Recall@10: 0.01004312342323803
2019-02-28 15:05:53,267 [MainThread ] [INFO ] Hybrid test set Recall@10: 0.0
model_baseline_hybrid.py:371: RuntimeWarning: invalid value encountered in double_scalars
f1_train, f1_test = 2*(train_recall * train_precision) / (train_recall + train_precision), 2*(test_recall * test_precision) / (test_recall + test_precision)
2019-02-28 15:05:53,268 [MainThread ] [INFO ] Hybrid training set F1 Score: 0.01936414014913654
2019-02-28 15:05:53,268 [MainThread ] [INFO ] Hybrid test set F1 Score: nan
2019-02-28 15:05:56,225 [MainThread ] [INFO ] Hybrid training set MRR: 0.3619097
2019-02-28 15:05:57,254 [MainThread ] [INFO ] Hybrid test set MRR: 0.0029159905
Precision/Recall seems to break the model when 0.0 is passed; also check why 0.0 is being passed in the first place. The fact that there is no information at all for the test set is unusual, even if the results are very low.
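The nan F1 above is the 0/0 case of the harmonic mean (precision and recall both 0.0 on the test set). A guarded version of the formula avoids the RuntimeWarning:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall, returning 0.0 instead of
    nan when both inputs are 0 (the degenerate test-set case above)."""
    denom = precision + recall
    return 0.0 if denom == 0 else 2 * precision * recall / denom
```

This only silences the symptom; the separate question of why test precision/recall are exactly 0.0 still needs investigating.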
For the purposes of speed-checking between models, make sure you can run with default settings (i.e. just use_cuda=True).