rkadlec / ubuntu-ranking-dataset-creator
A script that creates train, valid and test datasets for the ranking task from Ubuntu corpus dialogs.
License: Apache License 2.0
Thank you!
Hi,
What is the license on the dataset itself?
python create_ubuntu_dataset.py --output "train.csv" "train"
is failing with the above error. Anybody else facing this issue?
The download of the ubuntu_dialogs.tgz is slow from the cs.mcgill.ca address. Perhaps they could be mirrored on a GitHub release?
The download always fails: the connection is closed by the remote server.
I don't know if this was intentional, but the print and xrange statements in create_ubuntu_dataset.py raise errors.
Hi,
Could you provide some info regarding the raw file named ubuntu_dialogs.tgz?
I see it contains multiple directories with multiple files each. What does each directory and file represent?
Thank you.
I got an error in create_ubuntu_dataset.py
AttributeError: 'Namespace' object has no attribute 'func'
After generating the training/validation/test datasets, how do you build the LSTM and RNN models?
## BASELINE RESULTS
#### Dual Encoder LSTM model:
1 in 2:
recall@1: 0.868730970907
1 in 10:
recall@1: 0.552213717862
recall@2: 0.72099120433
recall@5: 0.924285351827
#### Dual Encoder RNN model:
1 in 2:
recall@1: 0.776539210705
1 in 10:
recall@1: 0.379139142954
recall@2: 0.560689786585
recall@5: 0.836350355691
#### TF-IDF model:
1 in 2:
recall@1: 0.749260042283
1 in 10:
recall@1: 0.48810782241
recall@2: 0.587315010571
recall@5: 0.763054968288
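For readers unfamiliar with the metric: recall@k in a "1 in n" setting is usually taken to be the fraction of test examples whose true response is ranked among the top k of n candidates. A minimal sketch under that assumption follows. The function name and the convention that index 0 holds the true response are hypothetical and not taken from the repository.

```python
def recall_at_k(score_lists, k):
    """Fraction of examples whose true response ranks in the top k.

    Each entry of score_lists holds the model scores for one context.
    Assumption: index 0 is the ground-truth response and the remaining
    entries are distractors.
    """
    hits = 0
    for scores in score_lists:
        # Rank candidate indices from highest to lowest score.
        # sorted() is stable, so ties are broken in favour of the lower index.
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        if 0 in ranked[:k]:
            hits += 1
    return hits / float(len(score_lists))


# Two toy examples with 1 true response and 3 distractors each.
toy_scores = [
    [0.9, 0.1, 0.3, 0.2],  # true response ranked 1st
    [0.4, 0.8, 0.1, 0.2],  # true response ranked 2nd
]
```

With these toy scores, `recall_at_k(toy_scores, 1)` counts only the first example as a hit, while `recall_at_k(toy_scores, 2)` counts both.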
I am running the .sh script to download and create the data sets with the suggested flags.
Traceback (most recent call last):
File "create_ubuntu_dataset.py", line 408, in <module>
args.func(args)
File "create_ubuntu_dataset.py", line 356, in valid_cmd
create_eval_dataset(args, "valfiles.csv")
File "create_ubuntu_dataset.py", line 288, in create_eval_dataset
lambda context_dialog, candidates : create_single_dialog_test_example(context_dialog, candidates, rng,
File "create_ubuntu_dataset.py", line 228, in create_examples
examples.append(creator_function(context_dialog, candidate_dialog_paths))
File "create_ubuntu_dataset.py", line 289, in <lambda>
args.n, args.max_context_length))
File "create_ubuntu_dataset.py", line 187, in create_single_dialog_test_example
negative_responses = get_random_utterances_from_corpus(candidate_dialog_paths,rng,distractors_num)
File "create_ubuntu_dataset.py", line 82, in get_random_utterances_from_corpus
dialog = translate_dialog_to_lists(dialog_path)
File "create_ubuntu_dataset.py", line 36, in translate_dialog_to_lists
dialog_file = open(dialog_filename, 'r')
IOError: [Errno 2] No such file or directory: './dialogs/10/974.tsv'
Using this code, I encountered a number of small issues related to Python 3 changes in list and str handling. I have a diff for create_ubuntu_dataset.py that fixes these issues, though my changes do not keep Python 2.7 backward compatibility (that's why this isn't a pull request). If there is interest, I would be happy to make a pull request for this.
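For the print and xrange part specifically, a common idiom keeps a single file working on both Python 2.7 and Python 3. The sketch below is only an illustration of that idiom, not the diff mentioned above; `first_n_squares` is a hypothetical helper. An alternative for print is to put `from __future__ import print_function` at the very top of the module.

```python
try:
    xrange  # defined on Python 2 only
except NameError:
    xrange = range  # Python 3: range is already lazy


def first_n_squares(n):
    """Toy helper (hypothetical) that exercises xrange and print()."""
    squares = [i * i for i in xrange(n)]
    # A single parenthesised argument prints the same on Python 2.7 and 3.
    print("generated %d values" % len(squares))
    return squares
```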
Downloading the archive is successful, yet splitting the dataset fails. When running ./generate.sh -t -s -l, I get the following errors:
0
Traceback (most recent call last):
File "create_ubuntu_dataset.py", line 409, in <module>
args.func(args)
File "create_ubuntu_dataset.py", line 328, in train_cmd
lambda context_dialog, candidates :
File "create_ubuntu_dataset.py", line 228, in create_examples
examples.append(creator_function(context_dialog, candidate_dialog_paths))
File "create_ubuntu_dataset.py", line 330, in <lambda>
args.p, max_context_length=args.max_context_length))
File "create_ubuntu_dataset.py", line 152, in create_single_dialog_train_example
dialog = translate_dialog_to_lists(context_dialog_path)
File "create_ubuntu_dataset.py", line 36, in translate_dialog_to_lists
dialog_file = open(dialog_filename, 'r')
IOError: [Errno 2] No such file or directory: './dialogs/278/1.tsv'
0
Traceback (most recent call last):
File "create_ubuntu_dataset.py", line 409, in <module>
args.func(args)
File "create_ubuntu_dataset.py", line 360, in test_cmd
create_eval_dataset(args, "testfiles.csv")
File "create_ubuntu_dataset.py", line 290, in create_eval_dataset
lambda context_dialog, candidates : create_single_dialog_test_example(context_dialog, candidates, rng,
File "create_ubuntu_dataset.py", line 228, in create_examples
examples.append(creator_function(context_dialog, candidate_dialog_paths))
File "create_ubuntu_dataset.py", line 291, in <lambda>
args.n, args.max_context_length))
File "create_ubuntu_dataset.py", line 180, in create_single_dialog_test_example
dialog = translate_dialog_to_lists(context_dialog_path)
File "create_ubuntu_dataset.py", line 36, in translate_dialog_to_lists
dialog_file = open(dialog_filename, 'r')
IOError: [Errno 2] No such file or directory: './dialogs/5/41626.tsv'
0
Traceback (most recent call last):
File "create_ubuntu_dataset.py", line 409, in <module>
args.func(args)
File "create_ubuntu_dataset.py", line 357, in valid_cmd
create_eval_dataset(args, "valfiles.csv")
File "create_ubuntu_dataset.py", line 290, in create_eval_dataset
lambda context_dialog, candidates : create_single_dialog_test_example(context_dialog, candidates, rng,
File "create_ubuntu_dataset.py", line 228, in create_examples
examples.append(creator_function(context_dialog, candidate_dialog_paths))
File "create_ubuntu_dataset.py", line 291, in <lambda>
args.n, args.max_context_length))
File "create_ubuntu_dataset.py", line 180, in create_single_dialog_test_example
dialog = translate_dialog_to_lists(context_dialog_path)
File "create_ubuntu_dataset.py", line 36, in translate_dialog_to_lists
dialog_file = open(dialog_filename, 'r')
IOError: [Errno 2] No such file or directory: './dialogs/3/4347.tsv'
Could anybody please help out? Thanks in advance!
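All three tracebacks end in `translate_dialog_to_lists` trying to open a dialog file under ./dialogs that is not on disk, which suggests the archive was not fully unpacked or was unpacked somewhere else. One way to narrow this down is to check which referenced paths exist before running the generator. The sketch below assumes you already have a list of relative dialog paths; `find_missing_dialogs` is a hypothetical helper and does not parse the repository's own file lists.

```python
import os


def find_missing_dialogs(data_root, relative_paths):
    """Return the relative paths that do not exist under data_root."""
    missing = []
    for rel in relative_paths:
        if not os.path.isfile(os.path.join(data_root, rel)):
            missing.append(rel)
    return missing
```

For example, `find_missing_dialogs("./dialogs", ["278/1.tsv", "5/41626.tsv", "3/4347.tsv"])` would list whichever of the files from the tracebacks are absent.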
I want to download the dataset within a limited storage server; what is the exact size of the data after running the script?
Thank you.
Please advise:
1- The code below uses a modified default (default=10250000) so that more training examples are extracted, right? With only 1000000 I get a smaller training set than the one in the original dataset. If I want the same number of examples as in the old training set, what should the default be? In other words, how many examples are in the original training set?
2- The code below, for the test and eval sets, fails at run time with: AttributeError: 'Namespace' object has no attribute 'examples'.
Please let me know if you have run into this as well.
parser_train = subparsers.add_parser('train', help='trainset generator')
parser_train.add_argument('-p', type=float, default=0.5, help='positive example probability')
parser_train.add_argument('-e', '--examples', type=int, default=10250000, help='number of examples to generate')
parser_train.set_defaults(func=train_cmd)
parser_test = subparsers.add_parser('test', help='testset generator')
parser_test.add_argument('-n', type=int, default=9, help='number of distractor examples for each context')
parser_test.set_defaults(func=test_cmd)
parser_valid = subparsers.add_parser('valid', help='validset generator')
parser_valid.add_argument('-n', type=int, default=9, help='number of distractor examples for each context')
parser_valid.set_defaults(func=valid_cmd)
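In the parser quoted above, `-e/--examples` is registered only on the `train` subparser, so the namespace produced for `test` or `valid` has no `examples` attribute. Any code on those paths that reads `args.examples` would raise this AttributeError. The sketch below reproduces the behaviour with a stripped-down parser; `build_parser`, the placeholder default, and the `getattr` fallback are illustrative and are not the repository's code.

```python
import argparse


def build_parser():
    """Stripped-down parser mirroring two of the subcommands quoted above."""
    parser = argparse.ArgumentParser()
    subparsers = parser.add_subparsers(dest="command")

    parser_train = subparsers.add_parser("train")
    # Placeholder default, chosen only for this illustration.
    parser_train.add_argument("-e", "--examples", type=int, default=1000000)

    parser_test = subparsers.add_parser("test")
    parser_test.add_argument("-n", type=int, default=9)
    return parser


train_args = build_parser().parse_args(["train"])
test_args = build_parser().parse_args(["test"])

# 'examples' exists only on the namespace produced by the 'train' subcommand.
has_examples_train = hasattr(train_args, "examples")
has_examples_test = hasattr(test_args, "examples")

# Defensive read: fall back to None when the option was not defined.
examples_for_test = getattr(test_args, "examples", None)
```

Either adding the same `-e/--examples` argument to the `test` and `valid` subparsers or reading it with a fallback as above avoids the error. Which one is appropriate depends on whether the evaluation sets are meant to be capped at all.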
Downloading the dataset fails. I have read the previous issues (#9 and #11), but the problem doesn't seem to have been resolved. When I run ./generate.sh, I get:
Downloading http://cs.mcgill.ca/~jpineau/datasets/ubuntu-corpus-1.0/ubuntu_dialogs.tgz to ./ubuntu_dialogs.tgz
Traceback (most recent call last):
File "create_ubuntu_dataset.py", line 404, in <module>
prepare_data_maybe_download(args.data_root)
File "create_ubuntu_dataset.py", line 260, in prepare_data_maybe_download
filepath, _ = urllib.request.urlretrieve(url, archive_path)
File "/usr/lib64/python2.7/urllib.py", line 98, in urlretrieve
return opener.retrieve(url, filename, reporthook, data)
File "/usr/lib64/python2.7/urllib.py", line 245, in retrieve
fp = self.open(url, data)
File "/usr/lib64/python2.7/urllib.py", line 213, in open
return getattr(self, name)(url)
File "/usr/lib64/python2.7/urllib.py", line 357, in open_http
'got a bad status line', None)
IOError: ('http protocol error', 0, 'got a bad status line', None)
The IOError comes from urlretrieve on http://cs.mcgill.ca/~jpineau/datasets/ubuntu-corpus-1.0/ubuntu_dialogs.tgz
Doing wget http://cs.mcgill.ca/~jpineau/datasets/ubuntu-corpus-1.0/ubuntu_dialogs.tgz also fails. Can anybody tell me how else to download the dataset? Thanks a lot in advance!
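If the server only drops connections intermittently, retrying the download a few times with a growing pause may be enough. It will not help if the URL is permanently unavailable. The sketch below is a generic retry wrapper, not part of the repository; `download_with_retries` and its parameters are hypothetical, and the default fetcher is Python 3's `urllib.request.urlretrieve`.

```python
import time
import urllib.request


def download_with_retries(url, dest, fetch=None, attempts=3, delay=1.0):
    """Call fetch(url, dest) until it succeeds or attempts run out.

    fetch defaults to urllib.request.urlretrieve (Python 3). IOError is
    caught because that is the exception type reported in the traceback.
    """
    if fetch is None:
        fetch = urllib.request.urlretrieve
    last_error = None
    for attempt in range(attempts):
        try:
            fetch(url, dest)
            return dest
        except IOError as err:
            last_error = err
            if attempt < attempts - 1:
                # Wait a little longer after each failed attempt.
                time.sleep(delay * (attempt + 1))
    raise last_error
```

Passing `fetch` as a parameter keeps the retry logic separate from the transport, so another downloader can be swapped in without changing the loop.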
I can't download the dataset. The link is not accessible.
When trying to run the dataset generation command (python create_ubuntu_dataset.py ./generate.sh -t -s -l), I get the following error:
runfile('/home/janinanu/Desktop/Dialogue System/ubuntu-ranking-dataset-creator/src/create_ubuntu_dataset.py', wdir='/home/janinanu/Desktop/Dialogue System/ubuntu-ranking-dataset-creator/src')
Downloading http://cs.mcgill.ca/~jpineau/datasets/ubuntu-corpus-1.0/ubuntu_dialogs.tgz to ./ubuntu_dialogs.tgz
Successfully downloaded ./ubuntu_dialogs.tgz
Unpacking dialogs ...
Archive unpacked.
Traceback (most recent call last):
File "", line 1, in
runfile('/home/janinanu/Desktop/Dialogue System/ubuntu-ranking-dataset-creator/src/create_ubuntu_dataset.py', wdir='/home/janinanu/Desktop/Dialogue System/ubuntu-ranking-dataset-creator/src')
File "/home/janinanu/anaconda3/lib/python3.5/site-packages/spyder/utils/site/sitecustomize.py", line 710, in runfile
execfile(filename, namespace)
File "/home/janinanu/anaconda3/lib/python3.5/site-packages/spyder/utils/site/sitecustomize.py", line 101, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "/home/janinanu/Desktop/Dialogue System/ubuntu-ranking-dataset-creator/src/create_ubuntu_dataset.py", line 407, in <module>
args.func(args)
AttributeError: 'Namespace' object has no attribute 'func'
I cannot make any sense of it. Any suggestions how to solve it?
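One likely cause, given the Python 3.5 paths in the traceback and that `runfile` was called without a subcommand: on Python 3, argparse subcommands are optional by default, so running the script without `train`, `test`, or `valid` produces a namespace on which `set_defaults(func=...)` was never applied. The sketch below shows the behaviour and a simple guard; `run`, `build_parser`, and the placeholder `train_cmd` are illustrative and not the repository's code.

```python
import argparse


def train_cmd(args):
    """Placeholder standing in for the real train command."""
    return "train"


def build_parser():
    parser = argparse.ArgumentParser()
    subparsers = parser.add_subparsers(dest="command")
    parser_train = subparsers.add_parser("train")
    parser_train.set_defaults(func=train_cmd)
    return parser


def run(argv):
    parser = build_parser()
    args = parser.parse_args(argv)
    if not hasattr(args, "func"):
        # No subcommand was given, so set_defaults(func=...) never applied.
        parser.print_usage()
        return None
    return args.func(args)
```

Here `run(["train"])` dispatches to `train_cmd`, while `run([])` prints the usage line instead of raising. On Python 3.7 and later, `add_subparsers(dest="command", required=True)` makes argparse reject a missing subcommand itself.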