rkadlec / ubuntu-ranking-dataset-creator

A script that creates train, valid and test datasets for the ranking task from Ubuntu corpus dialogs.

License: Apache License 2.0

Python 13.03% Shell 0.17% Jupyter Notebook 86.80%

ubuntu-ranking-dataset-creator's Issues

Cannot download dataset

Downloading the dataset fails. I have read the previous issues (#9 and #11), but the problem doesn't seem to have been resolved. When I run ./generate.sh, I get:

Downloading http://cs.mcgill.ca/~jpineau/datasets/ubuntu-corpus-1.0/ubuntu_dialogs.tgz to ./ubuntu_dialogs.tgz
Traceback (most recent call last):
  File "create_ubuntu_dataset.py", line 404, in <module>
    prepare_data_maybe_download(args.data_root)
  File "create_ubuntu_dataset.py", line 260, in prepare_data_maybe_download
    filepath, _ = urllib.request.urlretrieve(url, archive_path)
  File "/usr/lib64/python2.7/urllib.py", line 98, in urlretrieve
    return opener.retrieve(url, filename, reporthook, data)
  File "/usr/lib64/python2.7/urllib.py", line 245, in retrieve
    fp = self.open(url, data)
  File "/usr/lib64/python2.7/urllib.py", line 213, in open
    return getattr(self, name)(url)
  File "/usr/lib64/python2.7/urllib.py", line 357, in open_http
    'got a bad status line', None)
IOError: ('http protocol error', 0, 'got a bad status line', None)

The IOError comes from urlretrieve on http://cs.mcgill.ca/~jpineau/datasets/ubuntu-corpus-1.0/ubuntu_dialogs.tgz

Doing wget http://cs.mcgill.ca/~jpineau/datasets/ubuntu-corpus-1.0/ubuntu_dialogs.tgz also fails. Can anybody tell me how else to download the dataset? Thanks a lot in advance!
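If the server is only intermittently failing, one thing to try is wrapping the download in a retry loop. The following is a minimal sketch, not the script's actual code: `download_with_retries` and its `fetch`, `retries`, and `delay` parameters are hypothetical names, and it will not help if the file is permanently gone from that URL.

```python
import http.client
import time
import urllib.request


def download_with_retries(url, dest, fetch=urllib.request.urlretrieve,
                          retries=3, delay=1.0):
    """Call fetch(url, dest) up to `retries` times; re-raise the last error.

    `fetch` defaults to urllib.request.urlretrieve but can be swapped out,
    for example in tests. All names here are illustrative assumptions.
    """
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            fetch(url, dest)
            return dest
        except (OSError, http.client.HTTPException) as err:
            # URLError is an OSError; a bad status line is an HTTPException.
            last_error = err
            print("attempt %d/%d failed: %s" % (attempt, retries, err))
            if attempt < retries:
                time.sleep(delay)
    raise last_error
```

Calling `download_with_retries(url, "./ubuntu_dialogs.tgz")` with the URL from the log above would then either return the destination path or raise the last error after three attempts.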

Splitting dataset fails

Downloading the archive is successful, yet splitting the dataset fails. When running ./generate.sh -t -s -l, I get the following errors:

0
Traceback (most recent call last):
  File "create_ubuntu_dataset.py", line 409, in <module>
    args.func(args)
  File "create_ubuntu_dataset.py", line 328, in train_cmd
    lambda context_dialog, candidates :
  File "create_ubuntu_dataset.py", line 228, in create_examples
    examples.append(creator_function(context_dialog, candidate_dialog_paths))
  File "create_ubuntu_dataset.py", line 330, in <lambda>
    args.p, max_context_length=args.max_context_length))
  File "create_ubuntu_dataset.py", line 152, in create_single_dialog_train_example
    dialog = translate_dialog_to_lists(context_dialog_path)
  File "create_ubuntu_dataset.py", line 36, in translate_dialog_to_lists
    dialog_file = open(dialog_filename, 'r')
IOError: [Errno 2] No such file or directory: './dialogs/278/1.tsv'
0
Traceback (most recent call last):
  File "create_ubuntu_dataset.py", line 409, in <module>
    args.func(args)
  File "create_ubuntu_dataset.py", line 360, in test_cmd
    create_eval_dataset(args, "testfiles.csv")
  File "create_ubuntu_dataset.py", line 290, in create_eval_dataset
    lambda context_dialog, candidates : create_single_dialog_test_example(context_dialog, candidates, rng,
  File "create_ubuntu_dataset.py", line 228, in create_examples
    examples.append(creator_function(context_dialog, candidate_dialog_paths))
  File "create_ubuntu_dataset.py", line 291, in <lambda>
    args.n, args.max_context_length))
  File "create_ubuntu_dataset.py", line 180, in create_single_dialog_test_example
    dialog = translate_dialog_to_lists(context_dialog_path)
  File "create_ubuntu_dataset.py", line 36, in translate_dialog_to_lists
    dialog_file = open(dialog_filename, 'r')
IOError: [Errno 2] No such file or directory: './dialogs/5/41626.tsv'
0
Traceback (most recent call last):
  File "create_ubuntu_dataset.py", line 409, in <module>
    args.func(args)
  File "create_ubuntu_dataset.py", line 357, in valid_cmd
    create_eval_dataset(args, "valfiles.csv")
  File "create_ubuntu_dataset.py", line 290, in create_eval_dataset
    lambda context_dialog, candidates : create_single_dialog_test_example(context_dialog, candidates, rng,
  File "create_ubuntu_dataset.py", line 228, in create_examples
    examples.append(creator_function(context_dialog, candidate_dialog_paths))
  File "create_ubuntu_dataset.py", line 291, in <lambda>
    args.n, args.max_context_length))
  File "create_ubuntu_dataset.py", line 180, in create_single_dialog_test_example
    dialog = translate_dialog_to_lists(context_dialog_path)
  File "create_ubuntu_dataset.py", line 36, in translate_dialog_to_lists
    dialog_file = open(dialog_filename, 'r')
IOError: [Errno 2] No such file or directory: './dialogs/3/4347.tsv'

Could anybody please help out? Thanks in advance!
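A quick way to narrow this down is to check whether the paths from the tracebacks exist relative to the directory the script is run from. The helper below is only a diagnostic sketch; `find_missing` is a hypothetical name and is not part of create_ubuntu_dataset.py.

```python
import os


def find_missing(paths, root="."):
    """Return the entries of `paths` that are not regular files under `root`."""
    return [p for p in paths if not os.path.isfile(os.path.join(root, p))]


if __name__ == "__main__":
    # Paths copied from the tracebacks above.
    reported = [
        "dialogs/278/1.tsv",
        "dialogs/5/41626.tsv",
        "dialogs/3/4347.tsv",
    ]
    for path in find_missing(reported):
        print("missing:", path)
```

If every path is reported missing, the archive was probably unpacked somewhere else (or not at all); if only some are missing, the unpacked corpus may be incomplete. Both readings are guesses until the check is actually run.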

info about raw file

Hi,

Could you provide some info regarding the raw file named ubuntu_dialogs.tgz?

I see it contains multiple directories, each with multiple files. What do the directories and the individual files represent?

Thank you.

IOError: [Errno 2] No such file or directory: './dialogs/10/974.tsv'

I am running the .sh script to download and create the data sets with the suggested flags.

Traceback (most recent call last):
  File "create_ubuntu_dataset.py", line 408, in <module>
    args.func(args)
  File "create_ubuntu_dataset.py", line 356, in valid_cmd
    create_eval_dataset(args, "valfiles.csv")
  File "create_ubuntu_dataset.py", line 288, in create_eval_dataset
    lambda context_dialog, candidates : create_single_dialog_test_example(context_dialog, candidates, rng,
  File "create_ubuntu_dataset.py", line 228, in create_examples
    examples.append(creator_function(context_dialog, candidate_dialog_paths))
  File "create_ubuntu_dataset.py", line 289, in <lambda>
    args.n, args.max_context_length))
  File "create_ubuntu_dataset.py", line 187, in create_single_dialog_test_example
    negative_responses = get_random_utterances_from_corpus(candidate_dialog_paths,rng,distractors_num)
  File "create_ubuntu_dataset.py", line 82, in get_random_utterances_from_corpus
    dialog = translate_dialog_to_lists(dialog_path)
  File "create_ubuntu_dataset.py", line 36, in translate_dialog_to_lists
    dialog_file = open(dialog_filename, 'r')
IOError: [Errno 2] No such file or directory: './dialogs/10/974.tsv'

Python 3 issues

While using this code I ran into a number of small issues caused by Python 3 changes in list and str handling. I have a diff for create_ubuntu_dataset.py that fixes them, but my changes do not keep Python 2.7 backward compatibility (which is why this isn't a pull request). If there is interest, I would be happy to open a pull request.

ubuntu-ranking-dataset-creator-p3-diff.zip

Created Training set is smaller than the old one & error with create_eval_dataset function

Please advise:
1- The code below uses a modified parameter, default=10250000, so that more training examples can be extracted, right? With only 1000000 I get a smaller training set than the one in the original dataset. If I want the same number of examples as in the old training set, what should the default be? In other words, how many examples are in the original training set?

2- The code below, which relates to the test and validation sets, fails at run time with: AttributeError: 'Namespace' object has no attribute 'examples'.
Kindly advise whether you have seen this error as well.

parser_train = subparsers.add_parser('train', help='trainset generator')
parser_train.add_argument('-p', type=float, default=0.5, help='positive example probability')
parser_train.add_argument('-e', '--examples', type=int, default=10250000, help='number of examples to generate')
parser_train.set_defaults(func=train_cmd)

parser_test = subparsers.add_parser('test', help='testset generator')
parser_test.add_argument('-n', type=int, default=9, help='number of distractor examples for each context')
parser_test.set_defaults(func=test_cmd)

parser_valid = subparsers.add_parser('valid', help='validset generator')
parser_valid.add_argument('-n', type=int, default=9, help='number of distractor examples for each context')
parser_valid.set_defaults(func=valid_cmd)
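On the second point, the parser definitions above only add -e/--examples to the train sub-command, so a Namespace produced by test or valid will not carry an examples attribute. The sketch below reproduces that behaviour with a cut-down parser; whether the eval code path actually reads args.examples is an assumption based on the error message alone, and `build_parser` is a hypothetical helper.

```python
import argparse


def build_parser():
    """Cut-down version of the parsers quoted above (illustration only)."""
    parser = argparse.ArgumentParser()
    subparsers = parser.add_subparsers(dest="command")
    train = subparsers.add_parser("train")
    train.add_argument("-e", "--examples", type=int, default=1000000)
    test = subparsers.add_parser("test")
    test.add_argument("-n", type=int, default=9)
    return parser


parser = build_parser()
train_args = parser.parse_args(["train"])
test_args = parser.parse_args(["test"])

print(train_args.examples)                   # 1000000, declared by the train sub-command
print(hasattr(test_args, "examples"))        # False, test never declared -e
print(getattr(test_args, "examples", None))  # None, safe access with a fallback
```

Two possible ways out, both untested against the real script: declare -e on the test and valid sub-commands as well, or have the shared code read the value with getattr and a fallback.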

Error in dataset generation

When trying to run the dataset generation command (python create_ubuntu_dataset.py ./generate.sh -t -s -l), I get the following error:

runfile('/home/janinanu/Desktop/Dialogue System/ubuntu-ranking-dataset-creator/src/create_ubuntu_dataset.py', wdir='/home/janinanu/Desktop/Dialogue System/ubuntu-ranking-dataset-creator/src')
Downloading http://cs.mcgill.ca/~jpineau/datasets/ubuntu-corpus-1.0/ubuntu_dialogs.tgz to ./ubuntu_dialogs.tgz
Successfully downloaded ./ubuntu_dialogs.tgz
Unpacking dialogs ...
Archive unpacked.
Traceback (most recent call last):

  File "", line 1, in <module>
    runfile('/home/janinanu/Desktop/Dialogue System/ubuntu-ranking-dataset-creator/src/create_ubuntu_dataset.py', wdir='/home/janinanu/Desktop/Dialogue System/ubuntu-ranking-dataset-creator/src')

  File "/home/janinanu/anaconda3/lib/python3.5/site-packages/spyder/utils/site/sitecustomize.py", line 710, in runfile
    execfile(filename, namespace)

  File "/home/janinanu/anaconda3/lib/python3.5/site-packages/spyder/utils/site/sitecustomize.py", line 101, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)

  File "/home/janinanu/Desktop/Dialogue System/ubuntu-ranking-dataset-creator/src/create_ubuntu_dataset.py", line 407, in <module>
    args.func(args)

AttributeError: 'Namespace' object has no attribute 'func'

I cannot make any sense of it. Any suggestions how to solve it?
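One plausible explanation: the runfile call in the log passes no command-line arguments, and on Python 3 argparse sub-commands are optional by default. When no sub-command is given, none of the set_defaults(func=...) calls apply, so the Namespace has no func attribute. The sketch below shows the effect and a defensive check; `train_cmd` here is a stand-in and `run` is a hypothetical wrapper, not code from the repository.

```python
import argparse


def train_cmd(args):
    """Stand-in for the script's real train command."""
    return "train"


def build_parser():
    parser = argparse.ArgumentParser()
    subparsers = parser.add_subparsers(dest="command")
    parser_train = subparsers.add_parser("train", help="trainset generator")
    parser_train.set_defaults(func=train_cmd)
    return parser


def run(argv):
    parser = build_parser()
    args = parser.parse_args(argv)
    if not hasattr(args, "func"):
        # No sub-command was given, so no set_defaults(func=...) was applied.
        parser.print_help()
        return None
    return args.func(args)


print(run([]))          # prints the help text, then None
print(run(["train"]))   # train
```

If that is the cause here, running the script with a sub-command (train, test, or valid) should avoid the error. In Spyder the arguments have to be supplied explicitly, since runfile as shown above passes none.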

How are you generating the LSTM and RNN models?

After generating the training, validation, and test datasets, how are you generating the LSTM and RNN models?

BASELINE RESULTS

Dual Encoder LSTM model:

1 in 2:
    recall@1: 0.868730970907
1 in 10:
    recall@1: 0.552213717862
    recall@2: 0.72099120433
    recall@5: 0.924285351827

Dual Encoder RNN model:

1 in 2:
    recall@1: 0.776539210705
1 in 10:
    recall@1: 0.379139142954
    recall@2: 0.560689786585
    recall@5: 0.836350355691

TF-IDF model:

1 in 2:
    recall@1: 0.749260042283
1 in 10:
    recall@1: 0.48810782241
    recall@2: 0.587315010571
    recall@5: 0.763054968288
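
For readers unfamiliar with the metric, "1 in 10, recall@k" is read here as: each context comes with 10 candidate responses (one true response plus 9 distractors, matching the -n default of 9), and an example counts as a hit if the true response is ranked among the top k. The sketch below is one way to compute it, not the code that produced the numbers above; the `recall_at_k` name and the convention that the true response has index 0 are assumptions.

```python
def recall_at_k(rankings, k):
    """Fraction of examples whose true response appears in the top k.

    rankings: one list per example, holding candidate indices sorted
    best-first. Index 0 is assumed to denote the true response.
    """
    hits = sum(1 for ranked in rankings if 0 in ranked[:k])
    return hits / float(len(rankings))


# Three toy examples with three candidates each.
toy_rankings = [
    [0, 2, 1],  # true response ranked first
    [2, 0, 1],  # true response ranked second
    [1, 2, 0],  # true response ranked last
]
for k in (1, 2, 3):
    print("recall@%d: %.3f" % (k, recall_at_k(toy_rankings, k)))
```

With these toy rankings, recall@1 is 1/3, recall@2 is 2/3, and recall@3 is 1.0.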
