aerdem4 / kaggle-quora-dup Goto Github PK

Solution to Kaggle's Quora Duplicate Question Detection Competition

License: MIT License

Python 100.00%

siamese-lstm nlp neural-network regex siamese-network

kaggle-quora-dup's Introduction

Solution to Kaggle's Quora Duplicate Question Detection Competition

The competition can be found via the link: https://www.kaggle.com/c/quora-question-pairs I was ranked 23rd (top 1%) among 3307 teams with this solution. This is a relatively lightweight model considering the other top solutions.

Prerequisites

Download the pre-trained word vectors, namely glove.840B.300d, from https://nlp.stanford.edu/projects/glove/ and put it into the project directory.
Download the train and test data from https://www.kaggle.com/c/quora-question-pairs/data. Create a folder named "data" and put them in.
Install all the packages in requirements.txt.

Pipeline

This code is written in Python 3.5 and tested on a machine with Intel i5-6300HQ processor and Nvidia GeForce GTX 950M. Keras is used with Tensorflow backend and GPU support.
First run nlp_feature_extraction.py and non_nlp_feature extraction.py scripts. They may take an hour to finish.
Then run model.py which may take around 5 hours to make 10 different predictions on the test set.
Finally, ensemble and postprocess the predictions by postprocess.py.

Model Explanation

Questions are preprocessed such that the different forms of writing the same thing are tried to be unified. So, LSTM does not learn different things from these different interpretations.
Words which occur more than 100 times in the train set are collected. The rest is considered as rare words and replaced by the word "memento" which is my favorite movie from C. Nolan. Since "memento" is irrelevant to almost anything, it is absically a placeholder. How many of the rare words are common in the both pairs and how many of them are numeric are used as features. This whole process leads to better generalization in LSTM so that it cannot overfit particular pairs by just using these rare words.
The features mentioned above are merged with NLP and non-NLP features. As a result, 4+15+6=25 features are prepared for the network.
The train data is divided into 10 folds. In every run, one fold is kept as the validation set for early stopping. So, every run uses 1 fold different than the other for training which can contribute to the model variance. Since we are going to ensemble the models, increasing model variance reasonably is something we may want. I also did more 10fold runs with different model parameters for better ensebling during the competition.

Network Architecture

Postprocessing

All the generated models are average ensembled.
Since the class inbalance is told to be different in the test set, predictions are adjusted regarding to the test set class ratio.
Postprocess method I explained in https://www.kaggle.com/divrikwicky/semi-magic-postprocess-0-005-lb-gain is used.

What made my model successful? BETTER GENERALIZATION

All the features are question order independent. When you swap the first and the second question, the feature matrix does not change. For example, instead of using question1_frequency and question2_frequency, I have used min_frequency and max_frequency.
Feature values are bounded when necessary. For example, number of neighbors are set to 5 for everything above 5, because I did not want to overfit on a particular pair with specific number of neighbor 76 etc.
Features generated by LSTM is also question order independent. They share the same LSTM layer. After the LSTM layer, output of question1 and question2 merged with commutative operations which are square of difference and summation.
I think a good preprocessing on the questions also leads to better generalization.
Replacing the rare words with a placeholder before LSTM is another thing that I did for better generalization.
The neural network is not so big and has reasonable amount of dropouts and gaussian noises.
Different NN ppredictions are ensembled at the end.

kaggle-quora-dup's People

Contributors

Stargazers

Watchers

Forkers

canoefzh lampts futureer qiuhuigithub allen840707 shanjgit 0ri0nx wushicanasl lilitom tjacowalvis airxiechao ofrik ashishlal soldni sw1001 zt2191stat kagglesolutions prokopyev pchankh khaledto geunyounglim yxiao1994 michael-wzhu dongheerhie diosguo sangensong aiedward stevewho dfenglei neuron888 rahasayantan sathvisiva dayeren lrpopeyou guihui jainnitk dylanxia2017 arunkumarramanan ibozkurt79 arvind-india tarsbase machineiearning crislanio battlegg aditi-bhole data-science-ai-open-source xxkkevin kurhula aniruthanv santhosh432 rupali-goyal iq-scm

kaggle-quora-dup's Issues

at preds = model.predict([test_data_1, test_data_2, features_test], batch_size=BATCH_SIZE, verbose=1)

preds = model.predict([test_data_1, test_data_2, features_test], batch_size=BATCH_SIZE, verbose=1)
getting error that
ValueError: Error when checking : expected input_3 to have shape (None, 24) but got array with shape (200001, 31)

Cannot properly generate kcore_dict in non-NLP features

Hi, I downloaded your code and tried to play with it. When I run the script non_nlp_feature_extraction.py, I encountered the following error. I am not familliar with the Graph lib you used here. Would you please have a look on it and tell me what's wrong with the code?

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-6-df736cd8be48> in <module>()
     10 print("Calculating kcore features...")
     11 all_df = pd.concat([train_df, test_df])
---> 12 kcore_dict = get_kcore_dict(all_df)
     13 train_df = get_kcore_features(train_df, kcore_dict)
     14 test_df = get_kcore_features(test_df, kcore_dict)

<ipython-input-5-4e9e86a38b2a> in get_kcore_dict(df)
     25     print(type(g.nodes()))
     26     print(g.nodes())
---> 27     df_output = pd.DataFrame(data=g.nodes(), columns=["qid"])
     28     df_output["kcore"] = 0
     29     for k in range(2, NB_CORES + 1):

D:\anaconda3\lib\site-packages\pandas\core\frame.py in __init__(self, data, index, columns, dtype, copy)
    352                                          copy=False)
    353             else:
--> 354                 raise ValueError('DataFrame constructor not properly called!')
    355 
    356         NDFrame.__init__(self, mgr, fastpath=True)

ValueError: DataFrame constructor not properly called!

I think the problem is with this function.

def get_kcore_dict(df):
    g = nx.Graph()
    g.add_nodes_from(df.qid1)
    edges = list(df[["qid1", "qid2"]].to_records(index=False))
    g.add_edges_from(edges)
    g.remove_edges_from(g.selfloop_edges())
    print(type(g.nodes()))
    print(g.nodes())
    df_output = pd.DataFrame(data=g.nodes(), columns=["qid"])   <==== THIS LINE
    df_output["kcore"] = 0
    for k in range(2, NB_CORES + 1):
        ck = nx.k_core(g, k=k).nodes()
        print("kcore", k)
        df_output.ix[df_output.qid.isin(ck), "kcore"] = k

    return df_output.to_dict()["kcore"]

Model reference

May I asked model in your solution,refer to any other reference ,such as paper?Thanks

UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 962: character maps to <undefined>

I am trying to run model.py but i am getting following error:

D:\imad_web\kaggle-quora-dup_24_position>python model.py
C:\ProgramData\Anaconda3\lib\site-packages\h5py_init_.py:36: FutureWarning: Conversion of the second argument of issubdtype from float to np.floating is deprecated. In future, it will be treated as np.float64 == np.dtype(float).type.
from ._conv import register_converters as _register_converters
Using TensorFlow backend.
Creating the vocabulary of words occurred more than 100
Traceback (most recent call last):
File "model.py", line 122, in
embeddings_index = get_embedding()
File "model.py", line 55, in get_embedding
for line in f:
File "C:\ProgramData\Anaconda3\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 962: character maps to
`
def get_embedding():
embeddings_index = {}
f = open(EMBEDDING_FILE)
for line in f: ##########################line 55
values = line.split()
word = values[0]
if len(values) == EMBEDDING_DIM + 1 and word in top_words:
coefs = np.asarray(values[1:], dtype="float32")
embeddings_index[word] = coefs
f.close()
return embeddings_index

`
vectorizer = CountVectorizer(lowercase=False, token_pattern="\S+", min_df=MIN_WORD_OCCURRENCE)
vectorizer.fit(all_questions)
top_words = set(vectorizer.vocabulary_.keys())
top_words.add(REPLACE_WORD)

embeddings_index = get_embedding() ##############line 122
print("Words are not found in the embedding:", top_words - embeddings_index.keys())
top_words = embeddings_index.keys()

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 7908: character maps to <undefined>

How to get out of this ????
can somebody help me??

Possible to include some trained weights?

Hi I was interested in playing about with this model, but I was wondering instead of hiring an AWS server to run the model, is it possible to include some trained weights so we can use the model without training it?

Thanks

get_kcore_dict()'s return value might be wrong

I have a doubt on this line.
https://github.com/aerdem4/kaggle-quora-dup/blob/master/non_nlp_feature_extraction.py#L41

Perhaps, this line should be something like:

    return dict(zip(df_output["qid"], df_output["kcore"]))

so that the returned dict has qids as keys properly.

(Your code is very helpful. Thank you very much!)

what is your offline score

Hello！
First of all, thank you very much for your open source.
I am very interested in your implementation, so I downloaded your code and implemented it on my machine. My GPU is TiTanic Xp.
After I finished running the file "model.py" with setting epochs=15, and I got the result of the validation set loss of about 0.203 (training set loss is about 0.17). The results of the model training don't seem so good!
I find you have a ranking of 23 on kaggle and an online score of 0.12988, so I would like to ask, what is your offline score? How can I use this open source code to achieve the same validation loss as you?

Looking forward to your reply.Thanks again!
@aerdem4

aerdem4 / kaggle-quora-dup Goto Github PK

kaggle-quora-dup's Introduction

Solution to Kaggle's Quora Duplicate Question Detection Competition

Prerequisites

Pipeline

Model Explanation

Network Architecture

Postprocessing

What made my model successful? BETTER GENERALIZATION

kaggle-quora-dup's People

Contributors

Stargazers

Watchers

Forkers

kaggle-quora-dup's Issues

Recommend Projects

Recommend Topics

Recommend Org