alexandrosplessias / ir-cosinesimilarity-vs-freq


A vector-space Information Retrieval model of the research interests of the Faculty members of the Department (a.k.a. DEP or professors), used to suggest possible collaborations between them.

License: MIT License

TeX 90.89% Java 9.11%
java tf-idf vector-space-model frequency-analysis porter-stemmer-algorithm stopwords query hashmap-java treemap arraylist dictionary-data information-retrieval

IR-CosineSimilarity

This project builds a vector-space Information Retrieval model of the research interests of the Faculty members of the Department (a.k.a. DEP or professors) and uses it to suggest possible collaborations between them.

  • Question 1. Finding important terms for each faculty member. In this question you are asked to characterize the research interests of each faculty member with a set of terms. Use the Vector Model with tf-idf weighting to represent each faculty member as a weight vector over the terms contained in the titles of their research articles and of the conferences / journals in which they were published. Your implementation should then create the file "prof-description.txt", which will have 26 lines (one per faculty member). Each line will contain the last name of the faculty member followed by the N most important terms found in the titles of the articles and the respective journals / conferences, together with their weights, in the form (term, weight). N will be a parameter of your program.

  • Question 2. Finding a faculty member based on a question. In this question you are asked to rank the faculty members by their relevance to a question posed by the user. To do this, characterize the research interests of each faculty member with a set of terms as in Question 1, using the Vector Model and tf-idf weighting. The user will pose a question of one or more terms (e.g., truth model driven system with enable the nearest database.), and you will calculate the similarity of each faculty member to the question and rank the members by that similarity. Note that whatever preprocessing you applied to your data must also be applied to the user's questions! Your implementation should then create the file results-"question-words".txt (e.g., results-truth-model-driven-system-with-enable-the-nearest-database.txt for our example), which will have 26 lines (one per faculty member). Each line will contain the surname of the faculty member followed by their similarity to the question, in the form (surname, similarity). The file should be sorted in descending order of similarity to the question.

  • Question 3. Finding faculty members with close research interests. You are asked to find, for each faculty member, the two colleagues with the closest research interests. To do this, characterize the research interests of each faculty member with a set of terms as in Question 1, using the Vector Model and tf-idf weighting. Then compute each member's similarity (use cosine similarity) to every other colleague by exhaustively comparing each faculty member with all the others. Your implementation should then create the file similar_profs.txt, which will have 26 lines (one per faculty member). Each line will contain the last name of the faculty member followed by the M faculty members most similar to them, along with their degree of similarity, in the form (surname, similarity). M will be a parameter of your program. In addition to this file, you should include in your report a 26x26 table with the similarities of all faculty members with each other (see "Results / All profsesors cosine similarities").

  • Question 4. Changing the weight calculation method. Implement the above three questions again, changing the way the term weights are calculated as follows: for tf, simply use the number of occurrences of a term (freq), and do not use idf or normalization in the cosine similarity. Describe in your report whether and how the results are affected compared to before. Was this expected? (Of course, cosine similarity works better than simple frequency.)
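
The four questions share the same core steps, which can be sketched in Java roughly as follows. This is a minimal illustration under stated assumptions, not the repository's actual code: tokens are assumed to be already stemmed and stripped of stop words, idf is assumed to be ln(N / df) because the text above does not fix a particular tf-idf formula, and Question 4 is read here as comparing raw counts with an unnormalized dot product. The class ProfileSketch, its method names, and the toy surnames and tokens are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative sketch only; class and method names are hypothetical, not the repository's. */
final class ProfileSketch {

    /** Raw term counts (freq). Tokens are assumed to be already stemmed and stop-word filtered. */
    static Map<String, Double> termFrequencies(List<String> tokens) {
        Map<String, Double> tf = new HashMap<>();
        for (String token : tokens) {
            tf.merge(token, 1.0, Double::sum);
        }
        return tf;
    }

    /** Multiplies each count by idf = ln(N / df); this idf variant is an assumption. */
    static Map<String, Map<String, Double>> tfIdf(Map<String, Map<String, Double>> counts) {
        Map<String, Integer> df = new HashMap<>();
        for (Map<String, Double> profile : counts.values()) {
            for (String term : profile.keySet()) {
                df.merge(term, 1, Integer::sum); // number of profiles containing the term
            }
        }
        int n = counts.size();
        Map<String, Map<String, Double>> weighted = new TreeMap<>();
        for (Map.Entry<String, Map<String, Double>> prof : counts.entrySet()) {
            Map<String, Double> weights = new HashMap<>();
            for (Map.Entry<String, Double> term : prof.getValue().entrySet()) {
                double idf = Math.log((double) n / df.get(term.getKey()));
                weights.put(term.getKey(), term.getValue() * idf);
            }
            weighted.put(prof.getKey(), weights);
        }
        return weighted;
    }

    /** Unnormalized dot product of two sparse vectors (the Question 4 score as read here). */
    static double dot(Map<String, Double> a, Map<String, Double> b) {
        double sum = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            sum += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
        }
        return sum;
    }

    /** Cosine similarity; defined as 0 when either vector has zero length. */
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double norm = Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b));
        return norm == 0.0 ? 0.0 : dot(a, b) / norm;
    }

    /** The k entries with the largest values, in descending order. */
    static List<Map.Entry<String, Double>> top(Map<String, Double> scores, int k) {
        List<Map.Entry<String, Double>> entries = new ArrayList<>(scores.entrySet());
        entries.sort(Map.Entry.<String, Double>comparingByValue().reversed());
        return entries.subList(0, Math.min(k, entries.size()));
    }

    /** Ranks every profile except `skip` (may be null) by cosine similarity to `target`. */
    static List<Map.Entry<String, Double>> rank(Map<String, Map<String, Double>> profiles,
                                                Map<String, Double> target, String skip, int k) {
        Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, Map<String, Double>> e : profiles.entrySet()) {
            if (!e.getKey().equals(skip)) {
                scores.put(e.getKey(), cosine(e.getValue(), target));
            }
        }
        return top(scores, k);
    }

    public static void main(String[] args) {
        // Toy data with made-up surnames and tokens, not taken from the repository.
        Map<String, Map<String, Double>> counts = new TreeMap<>();
        counts.put("Alpha", termFrequencies(Arrays.asList("database", "query", "database", "index")));
        counts.put("Beta", termFrequencies(Arrays.asList("database", "network", "protocol")));
        counts.put("Gamma", termFrequencies(Arrays.asList("network", "protocol", "routing")));
        Map<String, Map<String, Double>> weighted = tfIdf(counts);

        Map<String, Double> query = termFrequencies(Arrays.asList("network", "routing"));
        System.out.println(top(weighted.get("Alpha"), 2));                     // Question 1, N = 2
        System.out.println(rank(weighted, query, null, weighted.size()));      // Question 2
        System.out.println(rank(weighted, weighted.get("Alpha"), "Alpha", 2)); // Question 3, M = 2
        System.out.println(dot(counts.get("Alpha"), counts.get("Beta")));      // Question 4 score
    }
}
```

Writing the three output files would then amount to formatting these lists as (term, weight) or (surname, similarity) pairs, one faculty member per line. The query vector here uses raw term counts, which is one simple choice among several; weighting the query terms by the collection's idf would be an equally reasonable variant.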

