
rkadlec / ubuntu-ranking-dataset-creator

665 stars · 202 forks · 3.61 MB

A script that creates train, valid and test datasets for the ranking task from Ubuntu corpus dialogs.

License: Apache License 2.0

Python 13.03% Shell 0.17% Jupyter Notebook 86.80%

ubuntu-ranking-dataset-creator's People

Contributors

andytwigg · howl-anderson · palmerabollo · petrbel · rkadlec · ryan-lowe


ubuntu-ranking-dataset-creator's Issues

info about raw file

Hi,

Could you provide some info regarding the raw file named ubuntu_dialogs.tgz?

I see it contains multiple directories with multiple files each. What does each directory and file represent?

Thank you.
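For orientation, the archive's layout can be inspected without a full extraction. Below is a minimal sketch, assuming the members follow the `dialogs/<folder>/<id>.tsv` layout that the generator scripts expect; `list_dialog_files` is a hypothetical helper, not part of the repo:

```python
import tarfile

def list_dialog_files(archive_path, limit=5):
    """Return (name, size) for the first few members of the dialogs archive.

    Each member is expected to look like 'dialogs/<folder>/<id>.tsv', where
    every .tsv file holds one dialog, one utterance per line (tab-separated
    timestamp, speaker, addressee, text).
    """
    with tarfile.open(archive_path, "r:gz") as archive:
        return [(m.name, m.size) for m in archive.getmembers()[:limit]]
```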

how are you generating the LSTM/RNN model?

After generating the training/validation/test datasets, how are you generating the LSTM/RNN model?
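For reference, the ranking models in the baseline results follow the dual-encoder formulation of Lowe et al. (2015): an RNN or LSTM encodes the context and a candidate response into fixed-size vectors c and r, and the pair is scored as sigmoid(cᵀMr) with a learned matrix M. A minimal scoring sketch in NumPy, with the encoders omitted and all function names hypothetical:

```python
import numpy as np

def dual_encoder_score(context_vec, response_vec, M):
    """Score a (context, response) pair as sigmoid(c^T M r).

    The encoders that produce context_vec and response_vec are omitted;
    the vectors are assumed to be given. M maps the context into a
    'predicted response' space.
    """
    logit = context_vec @ M @ response_vec
    return 1.0 / (1.0 + np.exp(-logit))

def rank_candidates(context_vec, candidate_vecs, M):
    # Higher score = better response; recall@k then checks whether the
    # ground-truth response lands in the top k of this ranking.
    scores = [dual_encoder_score(context_vec, r, M) for r in candidate_vecs]
    return sorted(range(len(scores)), key=lambda i: -scores[i])
```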

## BASELINE RESULTS

| Model             | 1 in 2, recall@1 | 1 in 10, recall@1 | 1 in 10, recall@2 | 1 in 10, recall@5 |
|-------------------|------------------|-------------------|-------------------|-------------------|
| Dual Encoder LSTM | 0.868730970907   | 0.552213717862    | 0.72099120433     | 0.924285351827    |
| Dual Encoder RNN  | 0.776539210705   | 0.379139142954    | 0.560689786585    | 0.836350355691    |
| TF-IDF            | 0.749260042283   | 0.48810782241     | 0.587315010571    | 0.763054968288    |
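These recall@k figures can be reproduced from a model's rankings in a few lines. A hedged sketch, assuming the dataset's convention that candidate 0 is the ground-truth response ("1 in 2" means 1 distractor, "1 in 10" means 9); `recall_at_k` is a hypothetical helper:

```python
def recall_at_k(ranked_lists, k, true_index=0):
    """Fraction of examples whose ground-truth response ranks in the top k.

    `ranked_lists` holds, for each test example, the candidate indices
    sorted best-first by the model's score.
    """
    hits = sum(1 for ranking in ranked_lists if true_index in ranking[:k])
    return hits / float(len(ranked_lists))
```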

IOError: [Errno 2] No such file or directory: './dialogs/10/974.tsv'

I am running the .sh script to download and create the data sets with the suggested flags.

Traceback (most recent call last):
  File "create_ubuntu_dataset.py", line 408, in <module>
    args.func(args)
  File "create_ubuntu_dataset.py", line 356, in valid_cmd
    create_eval_dataset(args, "valfiles.csv")
  File "create_ubuntu_dataset.py", line 288, in create_eval_dataset
    lambda context_dialog, candidates : create_single_dialog_test_example(context_dialog, candidates, rng,
  File "create_ubuntu_dataset.py", line 228, in create_examples
    examples.append(creator_function(context_dialog, candidate_dialog_paths))
  File "create_ubuntu_dataset.py", line 289, in <lambda>
    args.n, args.max_context_length))
  File "create_ubuntu_dataset.py", line 187, in create_single_dialog_test_example
    negative_responses = get_random_utterances_from_corpus(candidate_dialog_paths,rng,distractors_num)
  File "create_ubuntu_dataset.py", line 82, in get_random_utterances_from_corpus
    dialog = translate_dialog_to_lists(dialog_path)
  File "create_ubuntu_dataset.py", line 36, in translate_dialog_to_lists
    dialog_file = open(dialog_filename, 'r')
IOError: [Errno 2] No such file or directory: './dialogs/10/974.tsv'
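One workaround is to verify the meta file's entries against the extracted tree before generation, so that rows pointing at files lost to a partial extraction are dropped instead of raising the IOError. A sketch, assuming the first csv column is the dialog path relative to `./dialogs` (a guess at the layout; `existing_dialogs` is a hypothetical helper, not part of the repo):

```python
import csv
import os

def existing_dialogs(meta_csv, dialogs_root="./dialogs"):
    """Yield only the rows of a meta file whose dialog .tsv actually exists.

    The generator scripts read dialog paths from csv meta files (e.g.
    trainfiles.csv, valfiles.csv); filtering out rows whose files are
    missing avoids crashing partway through dataset creation.
    """
    with open(meta_csv) as f:
        for row in csv.reader(f):
            path = os.path.join(dialogs_root, row[0])
            if os.path.isfile(path):
                yield row
```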

Python 3 issues

Using this code I encountered a number of small issues related to Python 3 changes in list and str handling. I have a diff for create_ubuntu_dataset.py that fixes these issues, though my changes do not preserve Python 2.7 backward compatibility (which is why this isn't a pull request). If there is interest, I would be happy to open a pull request for this.

ubuntu-ranking-dataset-creator-p3-diff.zip
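For anyone patching by hand in the meantime, a common pattern for the file handling that trips up the script under Python 3 is to route I/O through `io.open` with an explicit encoding, which behaves identically on Python 2.7 and 3.x. A sketch of what the repo's `translate_dialog_to_lists` might look like under that pattern (illustrative only, not the diff attached to this issue):

```python
import io

def translate_dialog_to_lists(dialog_filename):
    """Read one dialog .tsv into a list of utterance fields.

    io.open with an explicit encoding works the same on Python 2.7 and
    3.x, avoiding the bytes/str mismatches that plain py2 open() causes.
    """
    with io.open(dialog_filename, "r", encoding="utf-8") as dialog_file:
        return [line.rstrip("\n").split("\t") for line in dialog_file]
```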

Splitting dataset fails

Downloading the archive is successful, yet splitting the dataset fails. When running ./generate.sh -t -s -l, I get the following errors:

0
Traceback (most recent call last):
  File "create_ubuntu_dataset.py", line 409, in <module>
    args.func(args)
  File "create_ubuntu_dataset.py", line 328, in train_cmd
    lambda context_dialog, candidates :
  File "create_ubuntu_dataset.py", line 228, in create_examples
    examples.append(creator_function(context_dialog, candidate_dialog_paths))
  File "create_ubuntu_dataset.py", line 330, in <lambda>
    args.p, max_context_length=args.max_context_length))
  File "create_ubuntu_dataset.py", line 152, in create_single_dialog_train_example
    dialog = translate_dialog_to_lists(context_dialog_path)
  File "create_ubuntu_dataset.py", line 36, in translate_dialog_to_lists
    dialog_file = open(dialog_filename, 'r')
IOError: [Errno 2] No such file or directory: './dialogs/278/1.tsv'
0
Traceback (most recent call last):
  File "create_ubuntu_dataset.py", line 409, in <module>
    args.func(args)
  File "create_ubuntu_dataset.py", line 360, in test_cmd
    create_eval_dataset(args, "testfiles.csv")
  File "create_ubuntu_dataset.py", line 290, in create_eval_dataset
    lambda context_dialog, candidates : create_single_dialog_test_example(context_dialog, candidates, rng,
  File "create_ubuntu_dataset.py", line 228, in create_examples
    examples.append(creator_function(context_dialog, candidate_dialog_paths))
  File "create_ubuntu_dataset.py", line 291, in <lambda>
    args.n, args.max_context_length))
  File "create_ubuntu_dataset.py", line 180, in create_single_dialog_test_example
    dialog = translate_dialog_to_lists(context_dialog_path)
  File "create_ubuntu_dataset.py", line 36, in translate_dialog_to_lists
    dialog_file = open(dialog_filename, 'r')
IOError: [Errno 2] No such file or directory: './dialogs/5/41626.tsv'
0
Traceback (most recent call last):
  File "create_ubuntu_dataset.py", line 409, in <module>
    args.func(args)
  File "create_ubuntu_dataset.py", line 357, in valid_cmd
    create_eval_dataset(args, "valfiles.csv")
  File "create_ubuntu_dataset.py", line 290, in create_eval_dataset
    lambda context_dialog, candidates : create_single_dialog_test_example(context_dialog, candidates, rng,
  File "create_ubuntu_dataset.py", line 228, in create_examples
    examples.append(creator_function(context_dialog, candidate_dialog_paths))
  File "create_ubuntu_dataset.py", line 291, in <lambda>
    args.n, args.max_context_length))
  File "create_ubuntu_dataset.py", line 180, in create_single_dialog_test_example
    dialog = translate_dialog_to_lists(context_dialog_path)
  File "create_ubuntu_dataset.py", line 36, in translate_dialog_to_lists
    dialog_file = open(dialog_filename, 'r')
IOError: [Errno 2] No such file or directory: './dialogs/3/4347.tsv'

Could anybody please help out? Thanks in advance!

Created Training set is smaller than the old one & error with create_eval_dataset function

Please advise:

1. The code below contains a modified parameter, default=10250000, so we can extract more training examples, right? With only 1000000 I get a smaller training set than the one from the original dataset. If I want the same number of examples as the old training set, what should default be set to? In other words, how many examples are in the training set?

2. The code below, which relates to the test and eval sets, gives a runtime error: AttributeError: 'Namespace' object has no attribute 'examples'. Kindly advise whether this is a known issue.

parser_train = subparsers.add_parser('train', help='trainset generator')
parser_train.add_argument('-p', type=float, default=0.5, help='positive example probability')
parser_train.add_argument('-e', '--examples', type=int, default=10250000, help='number of examples to generate')
parser_train.set_defaults(func=train_cmd)

parser_test = subparsers.add_parser('test', help='testset generator')
parser_test.add_argument('-n', type=int, default=9, help='number of distractor examples for each context')
parser_test.set_defaults(func=test_cmd)

parser_valid = subparsers.add_parser('valid', help='validset generator')
parser_valid.add_argument('-n', type=int, default=9, help='number of distractor examples for each context')
parser_valid.set_defaults(func=valid_cmd)
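On point 2, the AttributeError is expected with this snippet: `--examples` is registered only on the train subparser, so the Namespace returned for the test and valid subcommands never carries an `examples` attribute. A self-contained reproduction, with a defensive `getattr` read as one way around it:

```python
import argparse

parser = argparse.ArgumentParser()
subparsers = parser.add_subparsers()

parser_train = subparsers.add_parser('train')
parser_train.add_argument('-e', '--examples', type=int, default=10250000)

parser_test = subparsers.add_parser('test')
parser_test.add_argument('-n', type=int, default=9)

args = parser.parse_args(['test', '-n', '9'])
# 'examples' exists only on the train subparser, so reading
# args.examples here raises AttributeError. Either register the option
# on every subparser, or read it defensively:
num_examples = getattr(args, 'examples', None)  # None when not a train run
assert num_examples is None
```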

Cannot download dataset

Downloading the dataset fails. I have read the previous issues (#9 and #11), but the problem doesn't seem to have been resolved. When I run ./generate.sh, I get:

Downloading http://cs.mcgill.ca/~jpineau/datasets/ubuntu-corpus-1.0/ubuntu_dialogs.tgz to ./ubuntu_dialogs.tgz
Traceback (most recent call last):
  File "create_ubuntu_dataset.py", line 404, in <module>
    prepare_data_maybe_download(args.data_root)
  File "create_ubuntu_dataset.py", line 260, in prepare_data_maybe_download
    filepath, _ = urllib.request.urlretrieve(url, archive_path)
  File "/usr/lib64/python2.7/urllib.py", line 98, in urlretrieve
    return opener.retrieve(url, filename, reporthook, data)
  File "/usr/lib64/python2.7/urllib.py", line 245, in retrieve
    fp = self.open(url, data)
  File "/usr/lib64/python2.7/urllib.py", line 213, in open
    return getattr(self, name)(url)
  File "/usr/lib64/python2.7/urllib.py", line 357, in open_http
    'got a bad status line', None)
IOError: ('http protocol error', 0, 'got a bad status line', None)

The IOError comes from urlretrieve on http://cs.mcgill.ca/~jpineau/datasets/ubuntu-corpus-1.0/ubuntu_dialogs.tgz

Doing wget http://cs.mcgill.ca/~jpineau/datasets/ubuntu-corpus-1.0/ubuntu_dialogs.tgz also fails. Can anybody tell me how else to download the dataset? Thanks a lot in advance!
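If the host is reachable at all, the "bad status line" failure sometimes comes from the plain-HTTP request being rejected; retrying over HTTPS with an explicit User-Agent is worth a try before concluding the file is gone. (If cs.mcgill.ca no longer serves the archive, no client-side change will help and a mirror is needed.) A hedged sketch, where `download` is a hypothetical helper:

```python
import urllib.request

def download(url, dest, chunk_size=1 << 20):
    """Stream `url` to the file `dest`, 1 MiB at a time, with an
    explicit User-Agent (some servers reject urllib's default one)."""
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req) as resp, open(dest, "wb") as out:
        while True:
            chunk = resp.read(chunk_size)
            if not chunk:
                break
            out.write(chunk)

# e.g. download("https://cs.mcgill.ca/~jpineau/datasets/ubuntu-corpus-1.0/"
#               "ubuntu_dialogs.tgz", "./ubuntu_dialogs.tgz")
```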

Error in dataset generation

When trying to run the dataset generation command (python create_ubuntu_dataset.py ./generate.sh -t -s -l), I get the following error:

runfile('/home/janinanu/Desktop/Dialogue System/ubuntu-ranking-dataset-creator/src/create_ubuntu_dataset.py', wdir='/home/janinanu/Desktop/Dialogue System/ubuntu-ranking-dataset-creator/src')
Downloading http://cs.mcgill.ca/~jpineau/datasets/ubuntu-corpus-1.0/ubuntu_dialogs.tgz to ./ubuntu_dialogs.tgz
Successfully downloaded ./ubuntu_dialogs.tgz
Unpacking dialogs ...
Archive unpacked.
Traceback (most recent call last):
  File "", line 1, in
    runfile('/home/janinanu/Desktop/Dialogue System/ubuntu-ranking-dataset-creator/src/create_ubuntu_dataset.py', wdir='/home/janinanu/Desktop/Dialogue System/ubuntu-ranking-dataset-creator/src')
  File "/home/janinanu/anaconda3/lib/python3.5/site-packages/spyder/utils/site/sitecustomize.py", line 710, in runfile
    execfile(filename, namespace)
  File "/home/janinanu/anaconda3/lib/python3.5/site-packages/spyder/utils/site/sitecustomize.py", line 101, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)
  File "/home/janinanu/Desktop/Dialogue System/ubuntu-ranking-dataset-creator/src/create_ubuntu_dataset.py", line 407, in <module>
    args.func(args)
AttributeError: 'Namespace' object has no attribute 'func'

I cannot make any sense of it. Any suggestions on how to solve it?
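A likely cause: Spyder's runfile() executes the script with an empty argument list, so no subcommand is ever selected. Under Python 3, argparse subparsers are optional by default, so parse_args succeeds, set_defaults(func=...) never fires, and the later args.func access raises exactly this AttributeError. Running the script from a shell with a subcommand (train/test/valid) avoids it; marking the subparsers as required produces a usage message instead of a crash. A minimal reproduction, with hypothetical names:

```python
import argparse

parser = argparse.ArgumentParser()
# Subparsers are optional by default in Python 3: parsing an empty
# argument list succeeds, set_defaults(func=...) never fires, and a
# later args.func access raises AttributeError. Marking them required
# fails fast with a usage message instead.
subparsers = parser.add_subparsers(dest='command')
subparsers.required = True

train = subparsers.add_parser('train')
train.set_defaults(func=lambda a: 'training')

args = parser.parse_args(['train'])
assert args.func(args) == 'training'
```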
