
Comments (6)

abhigupta768 commented on June 16, 2024

You first need to build the text_dict, authors_dict, and pvs_dict.

Change lines 300-312 in generate_dataset.py to the following:

dicts['text'] = initVocabulary('text', ['./data/abstract_train.p', './title_train.p'], None, 50000, ' ', False)
dicts['authors'] = initVocabulary('authors', './data/authors_train.p', None, 20000, ' ', False)

saveVocabulary('text', dicts['text'], './data/text_dict')
saveVocabulary('authors', dicts['authors'], './data/authors_dict')

pvs = p.load(open("./pv_train.p", "rb"))
unique_pvs = np.unique(np.array(pvs))
dicts['pvs'] = Dict()
for pv in unique_pvs:
    dicts['pvs'].add(pv)
dicts['pvs'] = dicts['pvs'].prune(300)
saveVocabulary('pvs', dicts['pvs'], './data/pvs_dict')

After you have built these files once, you can revert the code back to the original.
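For reference, once these dictionary files exist on disk, the reverted lines only need to point initVocabulary at the saved files instead of the raw pickles. A minimal sketch of what loading them back could look like (when vocabFile is given, the size argument is not used for building):

# Minimal sketch: load the previously saved dictionaries instead of rebuilding them.
dicts = {}
dicts['text'] = initVocabulary('text', None, './data/text_dict', 50000, ' ', False)
dicts['authors'] = initVocabulary('authors', None, './data/authors_dict', 20000, ' ', False)
dicts['pvs'] = initVocabulary('pvs', None, './data/pvs_dict', 50000, ' ', False)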

Please let me know if you have any questions. Thanks!


GabrielLin commented on June 16, 2024

I followed your instructions, and the script has been running for more than two days with no output. I will try running it again from the very beginning. It might take a little while. Thanks.


GabrielLin commented on June 16, 2024

Sorry for the late reply. I have tried many times, but the modified script just keeps running and never finishes.


abhigupta768 commented on June 16, 2024

Hey, sorry for the delayed reply. Can you provide some sample data that you are giving as input to the scripts?

If you can provide some sample data, I will look into it over the weekend and find the issue.

Thanks!


GabrielLin commented on June 16, 2024

Thank you. Here are the steps I wrote down.

My Readme

This repository contains code for the Modular-Hierarchical Attention Based Scholarly Venue Recommender System using Deep Learning

Tested on Ubuntu 16.04.4 LTS

Ref Repo

Updated to https://github.com/abhigupta768/publication-venue-recommender/tree/530702eb0552aafb8f8517b329579610e1a7aa81

Ref Paper

Pradhan, T., Gupta, A., & Pal, S. (2020). HASVRec: A modularized hierarchical attention-based scholarly venue recommender system. Knowledge-Based Systems, 204, 106181. doi:10.1016/j.knosys.2020.106181

Dependencies

Python Environment

conda create -c conda-forge -n py36gpvr python=3.6
conda activate py36gpvr

conda install -c pytorch pytorch==1.7.1 torchvision==0.8.2 cudatoolkit==10.2.89 cudnn==7.6.5

conda install -c conda-forge nltk==3.6.1
conda install -c conda-forge pandas==1.1.5


Data

Download AMiner-Paper.rar from https://lfs.aminer.cn/lab-datasets/aminerdataset/AMiner-Paper.rar.
I converted the file to zip format and stored it elsewhere.

Place the zip file at the project root and unzip it

unzip AMiner-Paper.zip

Extract

python extract_data.py

Copy all .p files to the ./data folder

cp *.p data/

Generate the PyTorch compatible format

nohup python -u generate_dataset_init.py > generate_dataset_init.log 2>&1 &
tailf generate_dataset_init.log
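Before generate_dataset_init.py is launched, a quick check of the extracted pickles can help rule out malformed input. A minimal sketch (a hypothetical helper, not part of the repository, assuming extract_data.py writes pickled lists):

# check_pickles.py -- hypothetical sanity check, not part of the repository
import pickle as p

for name in ['abstract_train.p', 'title_train.p', 'authors_train.p', 'pv_train.p']:
    data = p.load(open('./data/' + name, 'rb'))
    # Print the record count and a short preview of the first entry.
    print(name, len(data), str(data[0])[:80])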


GabrielLin commented on June 16, 2024

In the above steps, generate_dataset_init.py is the file I modified according to your suggestions. Its content is:

# -*- coding: utf-8 -*-
"""pubrec-generate-dataset.ipynb

Automatically generated by Colaboratory.

Original file is located at
    https://colab.research.google.com/drive/1yowkK19M7YGLZTbjX7g4dTEaQphvp0a_
"""

# Commented out IPython magic to ensure Python compatibility.
# The following two lines are commented by me
# from google.colab import drive
# drive.mount('/gdrive')
# %cd /gdrive

# Commented out IPython magic to ensure Python compatibility.
# %cd My\ Drive/pubrec

import torch
import torch.utils.data as torch_data

import time
import csv
import json as js
import os
import codecs
import pickle as p

import numpy as np

import nltk
nltk.download('punkt')

PAD = 0
UNK = 1
BOS = 2
EOS = 3

PAD_WORD = '<blank>' 
UNK_WORD = 'UNK'
BOS_WORD = '<s>'
EOS_WORD = '</s>'
SPA_WORD = ' '

def flatten(l): 
    for el in l:
        if hasattr(el, "__iter__"):
            for sub in flatten(el):
                yield sub
        else:
            yield el

class Dict(object):
    def __init__(self, data=None, lower=False):
        self.idxToLabel = {}
        self.labelToIdx = {}
        self.frequencies = {}
        self.lower = lower
        # Special entries will not be pruned.
        self.special = [] 

        if data is not None:
            if type(data) == str:
                self.loadFile(data)
            else:
                self.addSpecials(data)

    def size(self):
        return len(self.idxToLabel)

    # Load entries from a file.
    def loadFile(self, filename):
        for line in open(filename):
            fields = line.split()
            label = ' '.join(fields[:-1])
            idx = int(fields[-1])
            self.add(label, idx)

    # Write entries to a file.
    def writeFile(self, filename):
        with open(filename, 'w') as file:
            for i in range(self.size()):
                label = self.idxToLabel[i]
                file.write('%s %d\n' % (label, i))


    def loadDict(self, idxToLabel):
        for i in range(len(idxToLabel)):
            label = idxToLabel[i]
            self.add(label, i)

    def lookup(self, key, default=None):
        key = key.lower() if self.lower else key
        try:
            return self.labelToIdx[key]
        except KeyError:
            return default

    def getLabel(self, idx, default=None):
        try:
            return self.idxToLabel[idx]
        except KeyError:
            return default

    # Mark this `label` and `idx` as special (i.e. will not be pruned).
    def addSpecial(self, label, idx=None):
        idx = self.add(label, idx)
        self.special += [idx]

    # Mark all labels in `labels` as specials (i.e. will not be pruned).
    def addSpecials(self, labels):
        for label in labels:
            self.addSpecial(label)

    # Add `label` in the dictionary. Use `idx` as its index if given.
    def add(self, label, idx=None):
        label = label.lower() if self.lower else label
        if idx is not None:
            self.idxToLabel[idx] = label
            self.labelToIdx[label] = idx
        else:
            if label in self.labelToIdx:
                idx = self.labelToIdx[label]
            else:
                idx = len(self.idxToLabel)
                self.idxToLabel[idx] = label
                self.labelToIdx[label] = idx

        if idx not in self.frequencies:
            self.frequencies[idx] = 1
        else:
            self.frequencies[idx] += 1

        return idx

    # Return a new dictionary with the `size` most frequent entries.
    def prune(self, size):
        if size >= self.size():
            return self

        # Only keep the `size` most frequent entries.
        freq = torch.Tensor(
                [self.frequencies[i] for i in range(len(self.frequencies))])
        _, idx = torch.sort(freq, 0, True)
        newDict = Dict()
        newDict.lower = self.lower

        # Add special entries in all cases.
        for i in self.special:
            newDict.addSpecial(self.idxToLabel[i])

        for i in idx[:size]:
            newDict.add(self.idxToLabel[i.item()])

        return newDict

    # Convert `labels` to indices. Use `unkWord` if not found.
    # Optionally insert `bosWord` at the beginning and `eosWord` at the end.
    def convertToIdx(self, labels, unkWord, bosWord=None, eosWord=None):
        vec = []

        if bosWord is not None:
            vec += [self.lookup(bosWord)]

        unk = self.lookup(unkWord)
        vec += [self.lookup(label, default=unk) for label in labels]

        if eosWord is not None:
            vec += [self.lookup(eosWord)]

        vec = [x for x in flatten(vec)]

        return torch.LongTensor(vec)

    # Convert `idx` to labels. If index `stop` is reached, convert it and return.
    def convertToLabels(self, idx, stop):
        labels = []

        for i in idx:
            if i == stop:
                break
            labels += [self.getLabel(i)]

        return labels

class AttrDict(dict):

    def __init__(self, *args, **kwargs):
        super(AttrDict, self).__init__(*args, **kwargs)
        self.__dict__ = self

def read_config(path):
    return AttrDict(js.load(open(path, 'r')))


def format_time(t):
    return time.strftime("%Y-%m-%d-%H:%M:%S", t)


def logging(file):
    def write_log(s):
        print(s, end='')
        with open(file, 'a') as f:
            f.write(s)
    return write_log


def logging_csv(file):
    def write_csv(s):
        with open(file, 'a', newline='') as f:
            writer = csv.writer(f)
            writer.writerow(s)
    return write_csv

class dataset(torch_data.Dataset):

    def __init__(self, text_data, label_data):
        self.text_data = text_data
        self.label_data = label_data
  
    def __getitem__(self, index):
        return [torch.from_numpy(x[index]).type(torch.FloatTensor) for x in self.text_data],\
               torch.from_numpy(self.label_data[index]).type(torch.FloatTensor)

    def __len__(self):
        return len(self.label_data)
       

def get_loader(dataset, batch_size, shuffle, num_workers):

    data_loader = torch.utils.data.DataLoader(dataset=dataset,
                                              batch_size=batch_size,
                                              shuffle=shuffle,
                                              num_workers=num_workers)
    return data_loader

def makeVocabulary(filename, size, sep=' ', char=False):

    vocab = Dict([PAD_WORD, UNK_WORD], lower=True)
    if char:
        vocab.addSpecial(SPA_WORD)

    lengths = []

    if type(filename) == list:
        for _filename in filename:
            data = p.load(open(_filename,"rb"))
            for sent in data:
                for word in sent.strip().split(sep):
                    lengths.append(len(word))
                    if char:
                        for ch in word.strip():
                            vocab.add(ch)
                    else:
                        vocab.add(word.strip())
    else:
        data = p.load(open(filename,"rb"))
        for sent in data:
            for word in sent.strip().split(sep):
                lengths.append(len(word))
                if char:
                    for ch in word.strip():
                        vocab.add(ch)
                else:
                    vocab.add(word.strip())
    print('max: %d, min: %d, avg: %.2f' % (max(lengths), min(lengths), sum(lengths)/len(lengths)))

    originalSize = vocab.size()
    vocab = vocab.prune(size)  
    print('Created dictionary of size %d (pruned from %d)' %
          (vocab.size(), originalSize))

    return vocab

def initVocabulary(name, dataFile, vocabFile, vocabSize, sep=' ', char=False):

    vocab = None
    if vocabFile is not None:
        # If given, load existing word dictionary.
        print('Reading ' + name + ' vocabulary from \'' + vocabFile + '\'...')
        vocab = Dict()
        vocab.loadFile(vocabFile)  
        print('Loaded ' + str(vocab.size()) + ' ' + name + ' words')

    if vocab is None:
        # If a dictionary is still missing, generate it.
        print('Building ' + name + ' vocabulary...')
        genWordVocab = makeVocabulary(dataFile, vocabSize, sep=sep, char=char)  
        vocab = genWordVocab

    return vocab


def saveVocabulary(name, vocab, file):

    print('Saving ' + name + ' vocabulary to \'' + file + '\'...')
    vocab.writeFile(file)

dicts = {}
dicts['text'] = initVocabulary('text', ['./data/abstract_train.p', './title_train.p'], None, 50000, ' ', False)
dicts['authors'] = initVocabulary('authors', './data/authors_train.p', None, 20000, ' ', False)

saveVocabulary('text', dicts['text'], './data/text_dict')
saveVocabulary('authors', dicts['authors'], './data/authors_dict')

pvs = p.load(open("./pv_train.p", "rb"))
unique_pvs = np.unique(np.array(pvs))
dicts['pvs'] = Dict()
for pv in unique_pvs:
    dicts['pvs'].add(pv)
dicts['pvs'] = dicts['pvs'].prune(300)
saveVocabulary('pvs', dicts['pvs'], './data/pvs_dict')

dicts['pvs'] = initVocabulary('pvs', None, './data/pvs_dict', 50000, ' ', False)

abstract = {'text_file': './abstract_train.p', 'text_dict': dicts['text'], 'doc_len': 8, 'text_len': 20}
title = {'text_file': './title_train.p', 'text_dict': dicts['text'], 'text_len': 20}
authors = {'text_file': './authors_train.p', 'text_dict': dicts['authors'], 'text_len': 7}
pv = {'pv_file': './pv_train.p', 'pv_dict': dicts['pvs']}

abstract_val = {'text_file': './abstract_val.p', 'text_dict': dicts['text'], 'doc_len': 8, 'text_len': 20}
title_val = {'text_file': './title_val.p', 'text_dict': dicts['text'], 'text_len': 20}
authors_val = {'text_file': './authors_val.p', 'text_dict': dicts['authors'], 'text_len': 7}
pv_val = {'pv_file': './pv_val.p', 'pv_dict': dicts['pvs']}

abstract_test = {'text_file': './abstract_test.p', 'text_dict': dicts['text'], 'doc_len': 8, 'text_len': 20}
title_test = {'text_file': './title_test.p', 'text_dict': dicts['text'], 'text_len': 20}
authors_test = {'text_file': './authors_test.p', 'text_dict': dicts['authors'], 'text_len': 7}
pv_test = {'pv_file': './pv_test.p', 'pv_dict': dicts['pvs']}

def make_data(abstract, title, authors, pv):

    text_data = []
    text_data.append(make_abstract_data(abstract['text_file'], abstract['text_dict'], abstract['doc_len'], abstract['text_len']))
    text_data.append(make_title_data(title['text_file'], title['text_dict'], title['text_len']))
    text_data.append(make_author_data(authors['text_file'], authors['text_dict'], authors['text_len'], sep=';'))
    pv_data = make_pv_data(pv['pv_file'], pv['pv_dict'])

    return dataset(text_data, pv_data)

def make_abstract_data(text_file, text_dict, doc_length, text_length, sep=' '):
    result = []
    data=p.load(open(text_file,"rb"))
    for line in data:
        temp = np.zeros((doc_length, text_length))
        sents=nltk.sent_tokenize(line)
        for i in range(len(sents)):
            if i < doc_length:
                words = nltk.word_tokenize(sents[i].strip())
                for j in range(len(words)):
                    if j < text_length:
                        temp[i, j] = text_dict.lookup(words[j].lower(), 1)
        result.append(temp)
    return result

def make_title_data(text_file, text_dict, text_length, sep=' '):
    result = []
    data=p.load(open(text_file,"rb"))
    for line in data:
        temp = np.zeros(text_length)
        words = nltk.word_tokenize(line.strip())
        for i in range(len(words)):
            if i < text_length:
                temp[i] = text_dict.lookup(words[i].lower(), 1)
        result.append(temp)
    return result

def make_author_data(text_file, text_dict, text_length, sep=' '):
    result = []
    data=p.load(open(text_file,"rb"))
    for line in data:
        temp = np.zeros(text_length)
        words = line.strip().split(sep)
        for i in range(len(words)):
            if i < text_length:
                temp[i] = text_dict.lookup(words[i].lower(), 1)
        result.append(temp)
    return result

def make_pv_data(pv_file, pv_dict):
    result = []
    length = len(pv_dict.idxToLabel)
    data=p.load(open(pv_file,"rb"))
    for line in data:
        temp = np.zeros(length)
        temp[pv_dict.lookup(str(line), 1)] = 1
        result.append(temp)
    return result

train = make_data(abstract, title, authors, pv)
val = make_data(abstract_val, title_val, authors_val, pv_val)
test = make_data(abstract_test, title_test, authors_test, pv_test)

data = {'train': train, 'val': val, 'test': test}
torch.save(data, './data/final_data_3')

# added by me
print('DONE.')
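Once the script completes, the saved bundle can be checked and wrapped with the get_loader helper defined above. A minimal sketch, run in the same session so the dataset class defined above is available (batch size and worker count are arbitrary placeholders):

# Minimal sketch: verify the saved splits and build a training loader.
saved = torch.load('./data/final_data_3')
print(len(saved['train']), len(saved['val']), len(saved['test']))

train_loader = get_loader(saved['train'], batch_size=32, shuffle=True, num_workers=2)
batch_text, batch_pv = next(iter(train_loader))
print([x.size() for x in batch_text], batch_pv.size())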


