bradleypallen / keras-quora-question-pairs

A Keras model that addresses the Quora Question Pairs dyadic prediction task.

License: MIT License

Python 9.24% Jupyter Notebook 90.76%

keras-quora-question-pairs's Introduction

I'm a technology executive and serial entrepreneur who is currently Chief Architect at Merit, a Bay Area startup building a verified identity platform. Previously, I was Chief Architect at Elsevier, and before that, founder/CTO at three startups in the Los Angeles area, achieving successful exits in two of the three. I began my career during the 1980s as one of the very first knowledge engineers of the expert systems era, after earning a BS in Applied Mathematics at Carnegie Mellon University. I am also a Guest Researcher in the INtelligent Data Engineering Lab at the University of Amsterdam. At INDE Lab, I am exploring the evolution of the practice of knowledge engineering and the impact of large language models on that evolution.

Below are some repos addressing a number of topics such as conceptual engineering using large language models (LLMs), using LLMs to evaluate knowledge graphs (KGs) in the context of KG refinement, the detection of hallucinations using LLMs (for the SemEval-2024 Task-6 SHROOM competition), a linked data catalog of my William S. Burroughs collection, the calculation of Texas Hold'Em hand win percentages, and a trainer for John Horton Conway's Doomsday algorithm.


keras-quora-question-pairs's Issues

Does it match the following URL, and which IDE did you use?

Dear Bradley,

I am new to NLP, TensorFlow, and deep learning in Python.

Does the design and implementation match the data at https://www.kaggle.com/c/quora-question-pairs/data?

I couldn't find q1_train.npy, q2_train.npy, label_train.npy, word_embedding_matrix.npy, or nb_words.json in your repository, so please advise.

The errors below seem to be related to those missing files:

Processing quora_duplicate_questions.tsv

KeyError Traceback (most recent call last)
in ()
10 reader = csv.DictReader(csvfile, delimiter='\t')
11 for row in reader:
---> 12 question1.append(row['text1'])
13 question2.append(row['text2'])
14 is_duplicate.append(row['duplicate'])

KeyError: 'text1'

NameError Traceback (most recent call last)
in ()
1 questions = question1 + question2
----> 2 tokenizer = Tokenizer(nb_words=MAX_NB_WORDS)
3 tokenizer.fit_on_texts(questions)
4 question1_word_sequences = tokenizer.texts_to_sequences(question1)
5 question2_word_sequences = tokenizer.texts_to_sequences(question2)

NameError: name 'Tokenizer' is not defined

NameError Traceback (most recent call last)
in ()
1 if not exists(KERAS_DATASETS_DIR + GLOVE_ZIP_FILE):
----> 2 zipfile = ZipFile(get_file(GLOVE_ZIP_FILE, GLOVE_ZIP_FILE_URL))
3 zipfile.extract(GLOVE_FILE, path=KERAS_DATASETS_DIR)
4
5 print("Processing", GLOVE_FILE)

NameError: name 'get_file' is not defined

NameError Traceback (most recent call last)
in ()
----> 1 q1_data = pad_sequences(question1_word_sequences, maxlen=MAX_SEQUENCE_LENGTH)
2 q2_data = pad_sequences(question2_word_sequences, maxlen=MAX_SEQUENCE_LENGTH)
3 labels = np.array(is_duplicate, dtype=int)
4 print('Shape of question1 data tensor:', q1_data.shape)
5 print('Shape of question2 data tensor:', q2_data.shape)

NameError: name 'pad_sequences' is not defined

NameError Traceback (most recent call last)
in ()
----> 1 np.save(open(Q1_TRAINING_DATA_FILE, 'wb'), q1_data)
2 np.save(open(Q2_TRAINING_DATA_FILE, 'wb'), q2_data)
3 np.save(open(LABEL_TRAINING_DATA_FILE, 'wb'), labels)
4 np.save(open(WORD_EMBEDDING_MATRIX_FILE, 'wb'), word_embedding_matrix)
5 with open(NB_WORDS_DATA_FILE, 'w') as f:

NameError: name 'q1_data' is not defined

I tried running it in PyCharm and in Anaconda Navigator after installing the necessary frameworks such as TensorFlow and Keras, but I still get the errors above, as if the packages were not installed correctly.

I really appreciate your kind help and time.

Thanks and best regards
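
One possible cause of the `KeyError: 'text1'` above, offered as an assumption rather than a confirmed diagnosis: the copy of quora_duplicate_questions.tsv being read has different column headers than the notebook expects. The publicly distributed file, like the Kaggle train.csv, names its columns question1, question2 and is_duplicate. The later NameErrors for Tokenizer, get_file and pad_sequences usually mean the cell containing the imports was not run, and the q1_data error follows from the earlier failures. A minimal sketch that accepts either set of headers (the helper names `pick` and `read_question_pairs` are hypothetical, not part of the repository, and the sample row is made up):

```python
import csv
import io

# Candidate header names: first the ones the notebook uses, then the ones
# found in the public Quora/Kaggle release. Assumption: one of each is present.
COLUMN_CANDIDATES = {
    "q1": ("text1", "question1"),
    "q2": ("text2", "question2"),
    "label": ("duplicate", "is_duplicate"),
}


def pick(fieldnames, candidates):
    """Return the first candidate column name that exists in the header."""
    for name in candidates:
        if name in fieldnames:
            return name
    raise KeyError(f"none of {candidates} found in header {fieldnames}")


def read_question_pairs(fileobj, delimiter="\t"):
    """Read question pairs and labels, whichever header naming is used."""
    reader = csv.DictReader(fileobj, delimiter=delimiter)
    fieldnames = reader.fieldnames or []
    cols = {key: pick(fieldnames, names) for key, names in COLUMN_CANDIDATES.items()}
    question1, question2, is_duplicate = [], [], []
    for row in reader:
        question1.append(row[cols["q1"]])
        question2.append(row[cols["q2"]])
        is_duplicate.append(int(row[cols["label"]]))
    return question1, question2, is_duplicate


# Made-up, in-memory sample using the public release's header names.
sample = io.StringIO(
    "id\tqid1\tqid2\tquestion1\tquestion2\tis_duplicate\n"
    "0\t1\t2\tWhat is Keras?\tWhat is the Keras library?\t1\n"
)
q1, q2, labels = read_question_pairs(sample)
print(q1, q2, labels)
```

For the Kaggle train.csv, the same function could be called with `delimiter=","`.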

Did the 1st-place solution use XGB / LGBM as its main model?

https://www.kaggle.com/c/quora-question-pairs/discussion/34355
What role do XGB / LGBM play in this 1st-place solution?
Thank you very much.
@bradleypallen

Threshold to predict new data

Can I include round() in the final layer, or should I just apply a 0.5 threshold to the sigmoid outputs for new predictions?
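
The two options give the same labels except at the boundary. A minimal sketch in plain Python, assuming `probs` holds sigmoid outputs already returned by the model (the values below are made up for illustration, and `to_labels` is a hypothetical helper). Python's built-in round() rounds halves to the nearest even integer, so round(0.5) is 0, whereas an explicit `>= 0.5` comparison gives 1. Applying the threshold after prediction also leaves the trained model's output layer unchanged.

```python
def to_labels(probs, threshold=0.5):
    """Turn sigmoid probabilities into 0/1 labels with an explicit threshold."""
    return [1 if p >= threshold else 0 for p in probs]


# Made-up probabilities standing in for model outputs.
probs = [0.03, 0.5, 0.51, 0.97]

labels = to_labels(probs)            # explicit threshold
rounded = [round(p) for p in probs]  # built-in rounding (half to even)

print(labels)
print(rounded)
```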

Why TimeDistributed right after Embedding Layer?

@bradleypallen Thanks for sharing your solution. In your network architecture, you add a TimeDistributed(Dense) layer after the word embedding layer. Could you please explain your motivation? In most cases, it seems that TimeDistributed is applied after an LSTM and before the softmax.
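
For readers unfamiliar with the layer, what TimeDistributed(Dense) computes can be sketched in plain Python: the same weight matrix and bias are applied independently to the embedding at every position in the sequence. This only illustrates the operation, not the author's motivation; the shapes, weights and ReLU activation below are assumptions chosen for illustration, and `dense_relu` and `time_distributed` are hypothetical helper names.

```python
def dense_relu(vector, kernel, bias):
    """One dense layer with ReLU: kernel has shape (input_dim, output_dim)."""
    out = []
    for j in range(len(bias)):
        total = bias[j] + sum(vector[i] * kernel[i][j] for i in range(len(vector)))
        out.append(max(0.0, total))
    return out


def time_distributed(sequence, kernel, bias):
    """Apply the same dense layer to every timestep of the sequence."""
    return [dense_relu(step, kernel, bias) for step in sequence]


# Toy input: 2 timesteps, each a 3-dimensional "embedding".
sequence = [[1.0, 2.0, 3.0], [0.0, -1.0, 1.0]]
# Toy weights: project 3 dimensions down to 2.
kernel = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]
bias = [0.0, 0.5]

projected = time_distributed(sequence, kernel, bias)
print(projected)
```

The output keeps one vector per timestep, so the sequence length is unchanged and only the per-word dimensionality differs.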

The metrics are not comparable

Thank you for the work and for the reference collection on this interesting dataset. As I understand it, these benchmarks are based on different test sets, so their metrics are not directly comparable. Secondly, the complexity of each solution (e.g. the number of parameters) is also a good indicator to report, as in the Stanford NLI leaderboard here: http://nlp.stanford.edu/projects/snli/

My current solution:

Model params (693K), dataset: dev split: 0.1, test split: 0.1
loss = 0.3608
accuracy = 0.8336
precision = 0.7516
recall = 0.8228
F = 0.7782
CPU times: user 1min 36s, sys: 13.2 s, total: 1min 49s
Wall time: 1min 48s
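
The F figure can be cross-checked against the precision and recall above with the usual harmonic-mean formula. A small sketch (`f_beta` is a hypothetical helper): the value obtained from the rounded figures, about 0.786, is not identical to the reported 0.7782. One possible explanation, which is an assumption since the post does not say, is that the metrics were averaged per batch rather than computed over the whole split.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score from precision and recall; beta=1 gives the F1 score."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall as reported in the post above.
f1 = f_beta(0.7516, 0.8228)
print(round(f1, 4))
```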

How to get similarity scores between a pair of custom test sentences?

Once I have trained the model, how can I use it at inference time to input a pair of custom sentences and obtain their similarity score?
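
A sketch of one way to score a custom pair, under several assumptions: that the trained model takes two padded integer sequences and returns a sigmoid probability, and that the word index built during training is still available, since new text must be encoded with the same index. The helpers `encode` and `score_pair` are hypothetical. `encode` is a simplified stand-in for Keras's Tokenizer.texts_to_sequences followed by pad_sequences (which pads and truncates at the front by default), and `predict_fn` stands in for a call such as model.predict on the two encoded questions.

```python
def encode(text, word_index, maxlen):
    """Map words to training-time ids, then left-pad/left-truncate to maxlen."""
    # Simplification: lowercase and split on whitespace only. The Keras
    # Tokenizer also strips punctuation, which this sketch does not do.
    ids = [word_index[w] for w in text.lower().split() if w in word_index]
    ids = ids[-maxlen:]                      # keep the last maxlen tokens
    return [0] * (maxlen - len(ids)) + ids   # pad with zeros at the front


def score_pair(q1, q2, word_index, maxlen, predict_fn):
    """Encode both questions and pass them to the model's prediction call."""
    return predict_fn(encode(q1, word_index, maxlen), encode(q2, word_index, maxlen))


# Toy word index; in practice this must be the index used during training.
word_index = {"who": 1, "is": 2, "air": 3, "jordan": 4}


def fake_predict(a, b):
    """Placeholder for the trained model: fraction of matching positions."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


score = score_pair("Who is Air Jordan", "Who is Jordan", word_index, 5, fake_predict)
print(score)
```

With a real model, `maxlen` would be the MAX_SEQUENCE_LENGTH used for training, and words missing from the training index are simply dropped.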

Has anything changed within the source file?

I got this error message:
KeyError Traceback (most recent call last)
in ()
10 reader = csv.DictReader(csvfile, delimiter='\t')
11 for row in reader:
---> 12 question1.append(row['text1'])
13 question2.append(row['text2'])
14 is_duplicate.append(row['duplicate'])

KeyError: 'text1'

What kind of model is suitable for this scenario? (dealing with synonyms)

Training data: "Who is Michael Jordan?" and "Who is Air Jordan?" are a correct match.
The test data is "Who is Jumpman Jordan?" and "Who is Air Jordan?"

What features should I add for training? Which model can capture the matching information in this case?

Thank you!!!
@bradleypallen

ValueError: could not convert string to float

@bradleypallen I'm not able to run the model from the command line:

Processing quora_duplicate_questions.tsv
Question pairs: 404290
Words in index: 95596
Processing glove.840B.300d.txt
Traceback (most recent call last):
File "keras-quora-question-pairs.py", line 91, in <module>
embedding = np.asarray(values[1:], dtype='float32')
File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/numpy/core/numeric.py", line 492, in asarray
return array(a, dtype, copy=False, order=order)
ValueError: could not convert string to float:

Thanks.
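
One commonly reported cause of this error with glove.840B.300d.txt is that a few tokens in that file contain spaces, so treating everything after the first field as numbers fails. Whether that is what happened here is an assumption, since the message above is cut off. A defensive sketch (the helper `parse_glove_line` is hypothetical, not part of the repository) takes the last `dim` fields as the vector and everything before them as the word, and skips lines that still do not parse:

```python
def parse_glove_line(line, dim=300):
    """Split one GloVe text line into (word, vector), or None if malformed."""
    parts = line.rstrip("\r\n").split(" ")
    if len(parts) <= dim:
        return None  # not enough fields for a word plus a dim-sized vector
    # The word may itself contain spaces, so take the vector from the end.
    word = " ".join(parts[:-dim])
    try:
        vector = [float(x) for x in parts[-dim:]]
    except ValueError:
        return None  # a non-numeric field inside the vector part
    return word, vector


# Small made-up lines with dim=3 to keep the example readable.
print(parse_glove_line("hello 0.1 0.2 0.3", dim=3))
print(parse_glove_line(". . . 0.1 0.2 0.3", dim=3))
print(parse_glove_line("bad 0.1 x 0.3", dim=3))
```

In the loading loop, lines for which the function returns None would be skipped instead of being passed to np.asarray.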
