Coder Social home page Coder Social logo

docrec's Introduction

KEYWORD EXTRACTION AND DOCUMENT RECOMMENDATION IN CONVERSATIONS (DocRec)

Software from the PhD thesis of Maryam Habibi (EPFL n. 6760)

Maryam Habibi, Idiap Research Institute, October 2015

[email protected]
[email protected] (advisor)


Table of Contents
-------------------
1. Introduction
2. Main components
3. Requirements
4. References


1. Introduction
-------------------
The package contains several pieces of Matlab code.  Taken together, they extract keywords
from a conversation, then use them to build implicit queries, and then consolidate the
sets of retrieved documents to recommend to the conversation participants.

First a list of keywords is extracted from the conversation transcript. Then, the keywords
from the list are topically clustered into several topically-independent subsets. Each
subset represents an implicit query, which is submitted to the Lucene search engine
(available from lucene.apache.org) to retrieve documents from Wikipedia (using a dump of
the pages available from dumps.wikimedia.org) or any other local repository indexed by
Lucene.

Finally the lists of results from each separate query are merged using a merging method
that favors diversity of topics among the recommended documents.


2. Main components
-------------------
1. TestTexts.m : the main Matlab source code which connects different modules 
2. readWord1.m : reads the input transcript, removes stop words and represents the transcript using topical information
3. BeamSearchKeywordExtraction.m : diverse keyword extraction
4. mainRD.m : diverse merging of lists of document results


3. Requirements
-------------------
1. Create the following folders in the main folder of the software. 

AllResults: folder in which the software writes the title, WPID (id of Wikipedia page) and scores of the documents retrieved for each implicit query in three separate files per implicit query (one for titles, one for WPIDs and one for scores).

RSL75: the software writes the final merged list of results in a file in this folder.

transcripts: the software reads the input file from this folder.

Value: the software writes the weights of implicit query in a file in this folder. Each line of the file indicates one weight and the line number indicates the query number.

W: the software writes the index of the keywords (position in the word list extracted by topic modeling) extracted by diverse keyword extraction method in the file "1" and the string corresponding to each index in the file "2" in this folder.

words: the software writes the keywords representing each implicit query in a file of this folder. For example, in the case of two implicit queries, two files will be generated with the names of "1" and "2", each representing the query number.

2. Define the absolute path of the main folder (where results of each module are read or written) in TestTexts.m and the path of the code actually performing the search (see point 5, e.g. search.jar) also in TestTexts.m.

3. Build the topic table and indicate its path in "readWord1.m".  The table is a (T+1)*N matrix which is represented by a .mat file (Matlab file, see provided example "datawiki1-100.mat"). T is the number of topics and N is the words in the dictionary. The first column of the matrix is made of words (strings). The other elements of the matrix are numbers. Each element represents the number of times a word is assigned to a topic over all trained documents. A (2+1)*3 matrix example is as follows:

word	t1	t2
------------------
flower	3	0
shoe	1	7
hand	1	4

4. Compute the probability of topic given a word in the dictionary, p(z|w)=p(w|z)p(z)/p(w).
p(z)=number of words assign to topic z/sum of elements in the table above
p(w)=sum of (p(w|z)p(z)) over all topics.  Store these numbers in a .mat file following the provided example ("twp.mat").

5. Write a piece of Java code to perform search over Wikipedia using a retrieval system such as Lucene.  The input is a set of words, e.g. "w1, w2, ...". The output should be written into the "results" folder under the main folder.


4. References
-------------------
M. Habibi. Modeling Users' Information Needs in a Document Recommender 
for Meetings.  EPFL Thesis n. 6760, Electrical Engineering Doctoral 
School (EDEE), 2015.

M. Habibi and A. Popescu-Belis. Keyword Extraction and Clustering for
Document Recommendation in Conversations. IEEE/ACM Transactions on
Audio, Speech and Language Processing (TASLP), pp. 746-759, Volume 23,
Issue 4, 2015.

M. Habibi and A. Popescu-Belis. Enforcing Topic Diversity in a Document
Recommender for Conversations. In Proceedings of the 25th International
Conference on Computational Linguistics (Coling 2014), pp. 588-599, Dublin,
Ireland, 2014.

M. Habibi and A. Popescu-Belis. Diverse Keyword Extraction from
Conversations. In Proceedings of the 51st Annual Meeting of the Association for
Computational Linguistics (ACL 2013), pp. 651-657, Sofia, Bulgaria, 2013.

docrec's People

Contributors

marhabibi avatar

Watchers

 avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.