ADM-HW3

This GitHub repository contains the files written for the third homework of the ADM course.

Team Members #Group-19

Clara Lecce
Giulia Luciani
Luca Mattei
Zeeshan Asghar

Files and scripts description

main.ipynb:

This notebook contains the executed solutions for the following points of the homework:

  1. Data collection
     1.1 Get the list of animes
     1.2 Crawl animes
     1.3 Parse downloaded pages

  2. Search Engine
     2.1 Conjunctive query
        2.1.1 Create your index!
        2.1.2 Execute the query
     2.2 Conjunctive query & Ranking score
        2.2.1 Inverted index
        2.2.2 Execute the query

  3. Define a new score!

  5. Algorithmic question

functions.py:

Python script containing the helper functions used to solve the questions.

  # 1 DATA COLLECTION
     # 1.1 Get the list of animes
     def get_link(url_link, file_txt);
     # 1.2 Crawl animes
     def crawl_html(start_index, stop_index=0);
     # 1.3 Parse downloaded pages
        # 1. Anime Name, String
        def get_title(soup);
        # 2. Anime Type, String
        def get_type(soup);
        # 3. Number of episodes, Integer
        def get_num_ep(soup);
        # 4. Release and End Dates of anime, datetime format
        def get_dates(soup);
        # 5. Number of members, Integer
        def get_memb(soup);
        # 6. Score, Float
        def get_score(soup);
        # 7. Users, Integer
        def get_users(soup);
        # 8. Rank, Integer
        def get_rank(soup);
        # 9. Popularity, Integer
        def get_pop(soup);
        # 10. Synopsis, String
        def get_descr(soup);
        # 11. Related animes, List of strings
        def get_rel_an(soup);
        # 12. Characters, List of strings
        def get_char(soup);
        # 13. Voices, List of strings
        def get_voices(soup);
        # 14. Staff, List of strings
        def get_staff(soup);

  # 2 SEARCH ENGINE
     # 2.1 Conjunctive query
        def download();
        # function to stem the string given
        def text_mining(string);
        # function to create the vocabulary
        def create_vocab();
        
        # 2.1.1 Create your index!
           # function that creates the inverted_index and stores it in a json file
           def invertedIndex();

     # 2.2 Conjunctive query & Ranking score
        # 2.2.1 Inverted index tf*idf
           # function that creates the inverted_index_tfidf and stores it in a json file
           def invertedIndex_tfidf(vocabulary, inverted_index);
        # 2.2.2 Execute the query
           # function to take the first k documents
           def top_k_documents(query, k, inverted_index, inverted_index_tfidf, inverted_doc, vocabulary);
           # function to calculate the cosine similarity
           def search_similarity(query, inverted_index, inverted_index_tfidf, inverted_doc, vocabulary);

  # 3. DEFINE A NEW SCORE!
     # function to stem the string given
     def text_mining_score(string);
     # function to calculate the new score
     def new_score(query);
  
  # 5. ALGORITHMIC QUESTION
     # implementation function
     def MyAlg(seq,query);

The files below are only used to exchange data among the team members.

anime_links.txt:

contains the links of the animes

vocabulary.json:

contains the vocabulary of the words found in all the anime descriptions (after preprocessing them with the nltk library).

inverted_index.json:

contains the inverted index for the Search Engine 2.1
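As a rough illustration of how an inverted index can answer a conjunctive query, here is a minimal sketch. It does not reproduce the code in functions.py: the helper names (`build_inverted_index`, `conjunctive_query`), the toy documents and the layout of the vocabulary (word mapped to an integer term id) are all assumptions made for this example.

```python
from collections import defaultdict


def build_inverted_index(docs, vocabulary):
    """Map each term id to the sorted list of document ids containing that term.

    `docs` maps a document id to its list of already preprocessed tokens
    (hypothetical layout, chosen only for this sketch).
    """
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for token in tokens:
            index[vocabulary[token]].add(doc_id)
    return {term_id: sorted(doc_ids) for term_id, doc_ids in index.items()}


def conjunctive_query(query_tokens, vocabulary, index):
    """Return the ids of the documents that contain every query token."""
    postings = []
    for token in query_tokens:
        if token not in vocabulary:
            return []  # a word outside the vocabulary cannot match any document
        postings.append(set(index.get(vocabulary[token], [])))
    if not postings:
        return []
    return sorted(set.intersection(*postings))


# Toy data: three already stemmed synopses (invented for the example).
docs = {
    "anime_1": ["pirat", "sea", "adventur"],
    "anime_2": ["ninja", "villag", "adventur"],
    "anime_3": ["pirat", "adventur", "treasur"],
}

# Assign an integer id to each distinct word, in order of first appearance.
vocabulary = {}
for tokens in docs.values():
    for token in tokens:
        vocabulary.setdefault(token, len(vocabulary))

index = build_inverted_index(docs, vocabulary)
result = conjunctive_query(["pirat", "adventur"], vocabulary, index)
```

Intersecting the posting sets keeps only the documents in which all query words appear, which is what makes the query conjunctive.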

inverted_index_tfidf.json:

contains the inverted index tfidf for the Search Engine 2.2
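The sketch below shows one common way to attach tf-idf weights to an inverted index. The exact weighting used in functions.py is not documented here, so the formula (term frequency normalised by document length, idf = log(N / df)), the function name `build_tfidf_index` and the toy data are assumptions.

```python
import math
from collections import Counter


def build_tfidf_index(docs):
    """Map each term to a list of (doc_id, tf-idf weight) pairs.

    Assumed weighting: tf = count / document length, idf = log(N / df).
    """
    n_docs = len(docs)
    term_counts = {doc_id: Counter(tokens) for doc_id, tokens in docs.items()}

    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())

    index = {}
    for doc_id, counts in term_counts.items():
        total = sum(counts.values())
        for term, count in counts.items():
            tf = count / total
            idf = math.log(n_docs / doc_freq[term])
            index.setdefault(term, []).append((doc_id, tf * idf))
    return index


# Toy data (invented for the example).
docs = {
    "anime_1": ["pirat", "sea", "adventur"],
    "anime_2": ["ninja", "villag", "adventur"],
    "anime_3": ["pirat", "adventur", "treasur"],
}
tfidf_index = build_tfidf_index(docs)
```

With this weighting a term that appears in every document gets weight 0, so it does not influence the ranking.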

inverted_doc.json:

contains the tf-idf values of every document, used to compute the cosine similarity
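For reference, cosine similarity between a query and a document can be computed on sparse tf-idf vectors as sketched below. Representing each vector as a dictionary from term to weight is an assumption of this example, not necessarily the layout stored in inverted_doc.json.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine similarity between two sparse vectors stored as {term: weight}."""
    dot = sum(weight * vec_b.get(term, 0.0) for term, weight in vec_a.items())
    norm_a = math.sqrt(sum(weight * weight for weight in vec_a.values()))
    norm_b = math.sqrt(sum(weight * weight for weight in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Toy vectors (invented for the example).
query_vec = {"pirat": 0.5, "treasur": 0.5}
doc_vec = {"pirat": 0.2, "sea": 0.4}
similarity = cosine_similarity(query_vec, doc_vec)
```

Only the terms shared by the two vectors contribute to the dot product, so documents that share no weighted term with the query score 0.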

score_dict.json:

contains the new documents for the new score

heap.json:

contains an unordered list holding only the scores
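A heap is a convenient way to extract the k best scores from an unordered list without sorting everything. The sketch below uses Python's standard `heapq` module; the function name `top_k` and the pairing of scores with document ids are assumptions for illustration and may differ from what top_k_documents does.

```python
import heapq


def top_k(scores, k):
    """Return the k (doc_id, score) pairs with the highest score, best first."""
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


# Toy scores (invented for the example).
scores = {"anime_1": 0.2, "anime_2": 0.9, "anime_3": 0.5}
best = top_k(scores, 2)

# The same call also works on a bare list of scores.
best_scores = heapq.nlargest(2, [0.2, 0.9, 0.5])
```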
