
rkadlec / ubuntu-ranking-dataset-creator

665 stars · 202 forks · 3.61 MB

A script that creates train, valid and test datasets for the ranking task from Ubuntu corpus dialogs.

License: Apache License 2.0

Python 13.03% Shell 0.17% Jupyter Notebook 86.80%

ubuntu-ranking-dataset-creator's People

Contributors

andytwigg · howl-anderson · palmerabollo · petrbel · rkadlec · ryan-lowe


ubuntu-ranking-dataset-creator's Issues

info about raw file

Hi,

Could you provide some info regarding the raw file named ubuntu_dialogs.tgz?

I see it contains multiple directories with multiple files each. What does each directory and file represent?

Thank you.
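For orientation, the archive's layout can be inspected without a full extraction. Below is a minimal sketch, assuming the members follow the `dialogs/<folder>/<id>.tsv` layout that the generator scripts expect; `list_dialog_files` is a hypothetical helper, not part of the repo:

```python
import tarfile

def list_dialog_files(archive_path, limit=5):
    """Return (name, size) for the first few members of the dialogs archive.

    Each member is expected to look like 'dialogs/<folder>/<id>.tsv', where
    every .tsv file holds one dialog, one utterance per line (tab-separated
    timestamp, speaker, addressee, text).
    """
    with tarfile.open(archive_path, "r:gz") as archive:
        return [(m.name, m.size) for m in archive.getmembers()[:limit]]
```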

how are you generating the LSTM/RNN model?

After generating the training/validation/test datasets, how are you generating the LSTM/RNN model?
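For reference, the ranking models in the baseline results follow the dual-encoder formulation of Lowe et al. (2015): an RNN or LSTM encodes the context and a candidate response into fixed-size vectors c and r, and the pair is scored as sigmoid(cᵀMr) with a learned matrix M. A minimal scoring sketch in NumPy, with the encoders omitted and all function names hypothetical:

```python
import numpy as np

def dual_encoder_score(context_vec, response_vec, M):
    """Score a (context, response) pair as sigmoid(c^T M r).

    The encoders that produce context_vec and response_vec are omitted;
    the vectors are assumed to be given. M maps the context into a
    'predicted response' space.
    """
    logit = context_vec @ M @ response_vec
    return 1.0 / (1.0 + np.exp(-logit))

def rank_candidates(context_vec, candidate_vecs, M):
    # Higher score = better response; recall@k then checks whether the
    # ground-truth response lands in the top k of this ranking.
    scores = [dual_encoder_score(context_vec, r, M) for r in candidate_vecs]
    return sorted(range(len(scores)), key=lambda i: -scores[i])
```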

## BASELINE RESULTS

| Model             | 1 in 2, recall@1 | 1 in 10, recall@1 | 1 in 10, recall@2 | 1 in 10, recall@5 |
|-------------------|------------------|-------------------|-------------------|-------------------|
| Dual Encoder LSTM | 0.868730970907   | 0.552213717862    | 0.72099120433     | 0.924285351827    |
| Dual Encoder RNN  | 0.776539210705   | 0.379139142954    | 0.560689786585    | 0.836350355691    |
| TF-IDF            | 0.749260042283   | 0.48810782241     | 0.587315010571    | 0.763054968288    |
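These recall@k figures can be reproduced from a model's rankings in a few lines. A hedged sketch, assuming the dataset's convention that candidate 0 is the ground-truth response ("1 in 2" means 1 distractor, "1 in 10" means 9); `recall_at_k` is a hypothetical helper:

```python
def recall_at_k(ranked_lists, k, true_index=0):
    """Fraction of examples whose ground-truth response ranks in the top k.

    `ranked_lists` holds, for each test example, the candidate indices
    sorted best-first by the model's score.
    """
    hits = sum(1 for ranking in ranked_lists if true_index in ranking[:k])
    return hits / float(len(ranked_lists))
```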

IOError: [Errno 2] No such file or directory: './dialogs/10/974.tsv'

I am running the .sh script to download and create the data sets with the suggested flags.

Traceback (most recent call last):
  File "create_ubuntu_dataset.py", line 408, in <module>
    args.func(args)
  File "create_ubuntu_dataset.py", line 356, in valid_cmd
    create_eval_dataset(args, "valfiles.csv")
  File "create_ubuntu_dataset.py", line 288, in create_eval_dataset
    lambda context_dialog, candidates : create_single_dialog_test_example(context_dialog, candidates, rng,
  File "create_ubuntu_dataset.py", line 228, in create_examples
    examples.append(creator_function(context_dialog, candidate_dialog_paths))
  File "create_ubuntu_dataset.py", line 289, in <lambda>
    args.n, args.max_context_length))
  File "create_ubuntu_dataset.py", line 187, in create_single_dialog_test_example
    negative_responses = get_random_utterances_from_corpus(candidate_dialog_paths,rng,distractors_num)
  File "create_ubuntu_dataset.py", line 82, in get_random_utterances_from_corpus
    dialog = translate_dialog_to_lists(dialog_path)
  File "create_ubuntu_dataset.py", line 36, in translate_dialog_to_lists
    dialog_file = open(dialog_filename, 'r')
IOError: [Errno 2] No such file or directory: './dialogs/10/974.tsv'
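One workaround is to verify the meta file's entries against the extracted tree before generation, so that rows pointing at files lost to a partial extraction are dropped instead of raising the IOError. A sketch, assuming the first csv column is the dialog path relative to `./dialogs` (a guess at the layout; `existing_dialogs` is a hypothetical helper, not part of the repo):

```python
import csv
import os

def existing_dialogs(meta_csv, dialogs_root="./dialogs"):
    """Yield only the rows of a meta file whose dialog .tsv actually exists.

    The generator scripts read dialog paths from csv meta files (e.g.
    trainfiles.csv, valfiles.csv); filtering out rows whose files are
    missing avoids crashing partway through dataset creation.
    """
    with open(meta_csv) as f:
        for row in csv.reader(f):
            path = os.path.join(dialogs_root, row[0])
            if os.path.isfile(path):
                yield row
```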

Python 3 issues

Using this code I encountered a number of small issues related to Python 3 changes in list and str handling. I have a diff for create_ubuntu_dataset.py that fixes these issues, though my changes do not preserve Python 2.7 backward compatibility (which is why this isn't a pull request). If there is interest, I would be happy to open a pull request for this.

ubuntu-ranking-dataset-creator-p3-diff.zip
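For anyone patching by hand in the meantime, a common pattern for the file handling that trips up the script under Python 3 is to route I/O through `io.open` with an explicit encoding, which behaves identically on Python 2.7 and 3.x. A sketch of what the repo's `translate_dialog_to_lists` might look like under that pattern (illustrative only, not the diff attached to this issue):

```python
import io

def translate_dialog_to_lists(dialog_filename):
    """Read one dialog .tsv into a list of utterance fields.

    io.open with an explicit encoding works the same on Python 2.7 and
    3.x, avoiding the bytes/str mismatches that plain py2 open() causes.
    """
    with io.open(dialog_filename, "r", encoding="utf-8") as dialog_file:
        return [line.rstrip("\n").split("\t") for line in dialog_file]
```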

Splitting dataset fails

Downloading the archive is successful, yet splitting the dataset fails. When running ./generate.sh -t -s -l, I get the following errors:

0
Traceback (most recent call last):
  File "create_ubuntu_dataset.py", line 409, in <module>
    args.func(args)
  File "create_ubuntu_dataset.py", line 328, in train_cmd
    lambda context_dialog, candidates :
  File "create_ubuntu_dataset.py", line 228, in create_examples
    examples.append(creator_function(context_dialog, candidate_dialog_paths))
  File "create_ubuntu_dataset.py", line 330, in <lambda>
    args.p, max_context_length=args.max_context_length))
  File "create_ubuntu_dataset.py", line 152, in create_single_dialog_train_example
    dialog = translate_dialog_to_lists(context_dialog_path)
  File "create_ubuntu_dataset.py", line 36, in translate_dialog_to_lists
    dialog_file = open(dialog_filename, 'r')
IOError: [Errno 2] No such file or directory: './dialogs/278/1.tsv'
0
Traceback (most recent call last):
  File "create_ubuntu_dataset.py", line 409, in <module>
    args.func(args)
  File "create_ubuntu_dataset.py", line 360, in test_cmd
    create_eval_dataset(args, "testfiles.csv")
  File "create_ubuntu_dataset.py", line 290, in create_eval_dataset
    lambda context_dialog, candidates : create_single_dialog_test_example(context_dialog, candidates, rng,
  File "create_ubuntu_dataset.py", line 228, in create_examples
    examples.append(creator_function(context_dialog, candidate_dialog_paths))
  File "create_ubuntu_dataset.py", line 291, in <lambda>
    args.n, args.max_context_length))
  File "create_ubuntu_dataset.py", line 180, in create_single_dialog_test_example
    dialog = translate_dialog_to_lists(context_dialog_path)
  File "create_ubuntu_dataset.py", line 36, in translate_dialog_to_lists
    dialog_file = open(dialog_filename, 'r')
IOError: [Errno 2] No such file or directory: './dialogs/5/41626.tsv'
0
Traceback (most recent call last):
  File "create_ubuntu_dataset.py", line 409, in <module>
    args.func(args)
  File "create_ubuntu_dataset.py", line 357, in valid_cmd
    create_eval_dataset(args, "valfiles.csv")
  File "create_ubuntu_dataset.py", line 290, in create_eval_dataset
    lambda context_dialog, candidates : create_single_dialog_test_example(context_dialog, candidates, rng,
  File "create_ubuntu_dataset.py", line 228, in create_examples
    examples.append(creator_function(context_dialog, candidate_dialog_paths))
  File "create_ubuntu_dataset.py", line 291, in <lambda>
    args.n, args.max_context_length))
  File "create_ubuntu_dataset.py", line 180, in create_single_dialog_test_example
    dialog = translate_dialog_to_lists(context_dialog_path)
  File "create_ubuntu_dataset.py", line 36, in translate_dialog_to_lists
    dialog_file = open(dialog_filename, 'r')
IOError: [Errno 2] No such file or directory: './dialogs/3/4347.tsv'

Could anybody please help out? Thanks in advance!

Created Training set is smaller than the old one & error with create_eval_dataset function

Please advise:

1. The code below contains a modified parameter, default=10250000, so we can extract more training examples, right? With only 1000000 I get a smaller training set than the one from the original dataset. If I want the same number of examples as the old training set, what should default be set to? In other words, how many examples are in the training set?

2. The code below, which relates to the test and eval sets, gives a runtime error: AttributeError: 'Namespace' object has no attribute 'examples'. Kindly advise whether this is a known issue.

parser_train = subparsers.add_parser('train', help='trainset generator')
parser_train.add_argument('-p', type=float, default=0.5, help='positive example probability')
parser_train.add_argument('-e', '--examples', type=int, default=10250000, help='number of examples to generate')
parser_train.set_defaults(func=train_cmd)

parser_test = subparsers.add_parser('test', help='testset generator')
parser_test.add_argument('-n', type=int, default=9, help='number of distractor examples for each context')
parser_test.set_defaults(func=test_cmd)

parser_valid = subparsers.add_parser('valid', help='validset generator')
parser_valid.add_argument('-n', type=int, default=9, help='number of distractor examples for each context')
parser_valid.set_defaults(func=valid_cmd)
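On point 2, the AttributeError is expected with this snippet: `--examples` is registered only on the train subparser, so the Namespace returned for the test and valid subcommands never carries an `examples` attribute. A self-contained reproduction, with a defensive `getattr` read as one way around it:

```python
import argparse

parser = argparse.ArgumentParser()
subparsers = parser.add_subparsers()

parser_train = subparsers.add_parser('train')
parser_train.add_argument('-e', '--examples', type=int, default=10250000)

parser_test = subparsers.add_parser('test')
parser_test.add_argument('-n', type=int, default=9)

args = parser.parse_args(['test', '-n', '9'])
# 'examples' exists only on the train subparser, so reading
# args.examples here raises AttributeError. Either register the option
# on every subparser, or read it defensively:
num_examples = getattr(args, 'examples', None)  # None when not a train run
assert num_examples is None
```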

Cannot download dataset

Downloading the dataset fails. I have read the previous issues (#9 and #11), but the problem doesn't seem to have been resolved. When I run ./generate.sh, I get:

Downloading http://cs.mcgill.ca/~jpineau/datasets/ubuntu-corpus-1.0/ubuntu_dialogs.tgz to ./ubuntu_dialogs.tgz
Traceback (most recent call last):
  File "create_ubuntu_dataset.py", line 404, in <module>
    prepare_data_maybe_download(args.data_root)
  File "create_ubuntu_dataset.py", line 260, in prepare_data_maybe_download
    filepath, _ = urllib.request.urlretrieve(url, archive_path)
  File "/usr/lib64/python2.7/urllib.py", line 98, in urlretrieve
    return opener.retrieve(url, filename, reporthook, data)
  File "/usr/lib64/python2.7/urllib.py", line 245, in retrieve
    fp = self.open(url, data)
  File "/usr/lib64/python2.7/urllib.py", line 213, in open
    return getattr(self, name)(url)
  File "/usr/lib64/python2.7/urllib.py", line 357, in open_http
    'got a bad status line', None)
IOError: ('http protocol error', 0, 'got a bad status line', None)

The IOError comes from urlretrieve on http://cs.mcgill.ca/~jpineau/datasets/ubuntu-corpus-1.0/ubuntu_dialogs.tgz

Doing wget http://cs.mcgill.ca/~jpineau/datasets/ubuntu-corpus-1.0/ubuntu_dialogs.tgz also fails. Can anybody tell me how else to download the dataset? Thanks a lot in advance!
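If the host is reachable at all, the "bad status line" failure sometimes comes from the plain-HTTP request being rejected; retrying over HTTPS with an explicit User-Agent is worth a try before concluding the file is gone. (If cs.mcgill.ca no longer serves the archive, no client-side change will help and a mirror is needed.) A hedged sketch, where `download` is a hypothetical helper:

```python
import urllib.request

def download(url, dest, chunk_size=1 << 20):
    """Stream `url` to the file `dest`, 1 MiB at a time, with an
    explicit User-Agent (some servers reject urllib's default one)."""
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req) as resp, open(dest, "wb") as out:
        while True:
            chunk = resp.read(chunk_size)
            if not chunk:
                break
            out.write(chunk)

# e.g. download("https://cs.mcgill.ca/~jpineau/datasets/ubuntu-corpus-1.0/"
#               "ubuntu_dialogs.tgz", "./ubuntu_dialogs.tgz")
```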

Error in dataset generation

When trying to run the dataset generation command (python create_ubuntu_dataset.py ./generate.sh -t -s -l), I get the following error:

runfile('/home/janinanu/Desktop/Dialogue System/ubuntu-ranking-dataset-creator/src/create_ubuntu_dataset.py', wdir='/home/janinanu/Desktop/Dialogue System/ubuntu-ranking-dataset-creator/src')
Downloading http://cs.mcgill.ca/~jpineau/datasets/ubuntu-corpus-1.0/ubuntu_dialogs.tgz to ./ubuntu_dialogs.tgz
Successfully downloaded ./ubuntu_dialogs.tgz
Unpacking dialogs ...
Archive unpacked.
Traceback (most recent call last):
  File "", line 1, in
    runfile('/home/janinanu/Desktop/Dialogue System/ubuntu-ranking-dataset-creator/src/create_ubuntu_dataset.py', wdir='/home/janinanu/Desktop/Dialogue System/ubuntu-ranking-dataset-creator/src')
  File "/home/janinanu/anaconda3/lib/python3.5/site-packages/spyder/utils/site/sitecustomize.py", line 710, in runfile
    execfile(filename, namespace)
  File "/home/janinanu/anaconda3/lib/python3.5/site-packages/spyder/utils/site/sitecustomize.py", line 101, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)
  File "/home/janinanu/Desktop/Dialogue System/ubuntu-ranking-dataset-creator/src/create_ubuntu_dataset.py", line 407, in <module>
    args.func(args)
AttributeError: 'Namespace' object has no attribute 'func'

I cannot make any sense of it. Any suggestions on how to solve it?
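A likely cause: Spyder's runfile() executes the script with an empty argument list, so no subcommand is ever selected. Under Python 3, argparse subparsers are optional by default, so parse_args succeeds, set_defaults(func=...) never fires, and the later args.func access raises exactly this AttributeError. Running the script from a shell with a subcommand (train/test/valid) avoids it; marking the subparsers as required produces a usage message instead of a crash. A minimal reproduction, with hypothetical names:

```python
import argparse

parser = argparse.ArgumentParser()
# Subparsers are optional by default in Python 3: parsing an empty
# argument list succeeds, set_defaults(func=...) never fires, and a
# later args.func access raises AttributeError. Marking them required
# fails fast with a usage message instead.
subparsers = parser.add_subparsers(dest='command')
subparsers.required = True

train = subparsers.add_parser('train')
train.set_defaults(func=lambda a: 'training')

args = parser.parse_args(['train'])
assert args.func(args) == 'training'
```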
