karelvesely84 / kaldi-io-for-python
Python functions for reading kaldi data formats. Useful for rapid prototyping with python.
License: Apache License 2.0
Hi,
I'm working with python 3.5.2, and I am using a virtualenv to run kaldi_io. I'm trying to use this sample:
import kaldi_io

ark_scp_output = 'ark:| copy-feats --compress=true ark:- ark,scp:data/feats2.ark,data/feats2.scp'
with kaldi_io.open_or_fd(ark_scp_output, 'wb') as f:
    for key, mat in dict.iteritems():
        kaldi_io.write_mat(f, mat, key=key)
My dict is a csv file, such that the first column is utt-id, and the 2nd column is the feature. This feature is in a 2-D (1x1) numpy matrix format. I have 2 problems:
a) When using key, string type is not supported. Since it is an optional argument, I just didn't pass it, but I know it will be important.
b) When using open_or_fd, I get the following error:
ERROR (copy-feats[5.4.176~1-be967]:Read():kaldi-matrix.cc:1616) Failed to read matrix from stream. : Expected "[", got "��
[ Stack-Trace: ]
kaldi::MessageLogger::HandleMessage(kaldi::LogMessageEnvelope const&, char const*)
kaldi::MessageLogger::~MessageLogger()
kaldi::Matrix::Read(std::istream&, bool, bool)
kaldi::KaldiObjectHolder<kaldi::Matrix >::Read(std::istream&)
kaldi::SequentialTableReaderArchiveImpl<kaldi::KaldiObjectHolder<kaldi::Matrix > >::Next()
kaldi::SequentialTableReaderArchiveImpl<kaldi::KaldiObjectHolder<kaldi::Matrix > >::Open(std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&)
kaldi::SequentialTableReader<kaldi::KaldiObjectHolder<kaldi::Matrix > >::Open(std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&)
kaldi::SequentialTableReader<kaldi::KaldiObjectHolder<kaldi::Matrix > >::SequentialTableReader(std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&)
main
__libc_start_main
_start
WARNING (copy-feats[5.4.176~1-be967]:Read():util/kaldi-holder-inl.h:84) Exception caught reading Table object.
WARNING (copy-feats[5.4.176~1-be967]:Next():util/kaldi-table-inl.h:574) Object read failed, reading archive standard input
WARNING (copy-feats[5.4.176~1-be967]:Open():util/kaldi-table-inl.h:521) Error beginning to read archive file (wrong filename?): standard input
ERROR (copy-feats[5.4.176~1-be967]:SequentialTableReader():util/kaldi-table-inl.h:860) Error constructing TableReader: rspecifier is ark:-
[ Stack-Trace: ]
kaldi::MessageLogger::HandleMessage(kaldi::LogMessageEnvelope const&, char const*)
kaldi::MessageLogger::~MessageLogger()
kaldi::SequentialTableReader<kaldi::KaldiObjectHolder<kaldi::Matrix > >::SequentialTableReader(std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&)
main
__libc_start_main
_start
Exception in thread Thread-1:
Traceback (most recent call last):
File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
self.run()
File "/usr/lib/python3.5/threading.py", line 862, in run
self._target(*self._args, **self._kwargs)
File "/home/ayushi/Projects_2018/non_native_perception/data/recordings_edited/kaldi-io/kaldi_io/kaldi_io/kaldi_io.py", line 82, in cleanup
raise SubprocessFailed('cmd %s returned %d !' % (cmd,ret))
kaldi_io.kaldi_io.SubprocessFailed: cmd copy-feats --print-args=false ark:- ark,scp:/home/ayushi/Tools/kaldi/egs/nn_perception/data/train/feats.ark,/home/ayushi/Tools/kaldi/egs/nn_perception/data/train/feats.scp returned 255 !
Am I doing something wrong?
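One likely culprit is the shape and dtype of what is being written: write_mat expects a 2-D float numpy array keyed by a plain str. A minimal sketch (assuming a hypothetical CSV like "utt1,0.5" per line) of preparing such data before writing; the commented write step requires Kaldi on PATH and is not run here:

```python
import io
import numpy as np

def csv_to_mats(lines):
    """Turn 'utt-id,value' lines into {str: 2-D float32 ndarray}."""
    mats = {}
    for line in lines:
        utt, value = line.strip().split(',')
        # each feature becomes a 1x1 float32 matrix, keyed by a plain str
        mats[str(utt)] = np.array([[float(value)]], dtype=np.float32)
    return mats

mats = csv_to_mats(io.StringIO("utt1,0.5\nutt2,1.25\n"))
# Writing would then look like (paths hypothetical, kaldi needed on PATH):
# with kaldi_io.open_or_fd('ark:| copy-feats --compress=true ark:- '
#                          'ark,scp:data/feats2.ark,data/feats2.scp', 'wb') as f:
#     for key, mat in mats.items():
#         kaldi_io.write_mat(f, mat, key=key)
```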
UnknownMatrixHeader is undefined due to a typo in the class definition. The diff below contains a fix:
$ git diff
diff --git a/kaldi_io.py b/kaldi_io.py
index e05a60c..49d518c 100755
--- a/kaldi_io.py
+++ b/kaldi_io.py
@@ -21,7 +21,7 @@ os.environ['PATH'] = os.popen('echo $KALDI_ROOT/src/bin:$KALDI_ROOT/tools/openfs
# Define all custom exceptions,
class UnsupportedDataType(Exception): pass
class UnknownVectorHeader(Exception): pass
-class UnkonwnMatrixHeader(Exception): pass
+class UnknownMatrixHeader(Exception): pass
class BadSampleSize(Exception): pass
class BadInputFormat(Exception): pass
I am trying to read an alignment file using the read_ali_ark method. My code looks like this:
src_file = 's5/exp/tri2_ali/ali.1.gz'
abc = kaldi_io.read_ali_ark(src_file)
But this crashes on an assert inside open_or_fd, which read_ali_ark calls. Simply removing this assert fixes the issue and makes it possible to read gzipped ark files.
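An alternative that avoids touching the assert, sketched under the assumption that kaldi_io readers accept extended rspecifiers (as Kaldi tools do): decompress the archive through a pipe rather than passing the .gz path directly. Paths are the ones from the issue above.

```python
# Build a piped rspecifier so gunzip decompresses the archive on the fly.
src_file = 's5/exp/tri2_ali/ali.1.gz'
rspecifier = 'ark:gunzip -c {} |'.format(src_file)
# Reading would then be (requires kaldi_io and the actual file, not run here):
# for key, ali in kaldi_io.read_ali_ark(rspecifier):
#     print(key, len(ali))
```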
Is there an API to read values from wav.scp?
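As far as I can tell, kaldi_io itself covers matrices, vectors, and alignments but has no dedicated wav reader, so a common workaround is to parse wav.scp yourself and load each file with a wav library. A minimal sketch with the stdlib wave module; entries that are pipe commands (ending in '|') would need subprocess instead, and the paths below are hypothetical:

```python
import io
import wave

def parse_wav_scp(lines):
    """'utt-id /path/to.wav' lines -> {utt-id: path}."""
    entries = {}
    for line in lines:
        utt, path = line.strip().split(' ', 1)
        entries[utt] = path
    return entries

wavs = parse_wav_scp(io.StringIO('utt1 /data/utt1.wav\n'))
# Loading the samples would then be (file must exist, not run here):
# with wave.open(wavs['utt1'], 'rb') as w:
#     samples = w.readframes(w.getnframes())   # raw PCM bytes
```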
I am wondering if we need to wait for the thread that writes files to disk in this script. Today I used kaldi_io to write a big file and immediately read it, which led to this error:
ivector-mean ark:data/train/spk2utt scp:exp/train_embed_vectors/embeddings.scp 'ark:| copy-vector ark:- ark,scp:exp/train_embed_vectors/spk_embeddings.ark,exp/train_embed_vectors/spk_embeddings.scp' ark,t:exp/train_embed_vectors/num_utts.ark
WARNING (ivector-mean[5.4.84~1405-c643]:ReadScriptFile():kaldi-table.cc:72) Invalid 148626'th line in script file:"id11251-gFfcgOVmiO0-00006"
WARNING (ivector-mean[5.4.84~1405-c643]:ReadScriptFile():kaldi-table.cc:46) [script file was: exp/train_embed_vectors/embeddings.scp]
ERROR (ivector-mean[5.4.84~1405-c643]:RandomAccessTableReader():util/kaldi-table-inl.h:2512) Error opening RandomAccessTableReader object (rspecifier is: scp:exp/train_embed_vectors/embeddings.scp)
The error will disappear after I manually type this cmd:
ivector-mean ark:data/train/spk2utt scp:exp/train_embed_vectors/embeddings.scp 'ark:| copy-vector ark:- ark,scp:exp/train_embed_vectors/spk_embeddings.ark,exp/train_embed_vectors/spk_embeddings.scp' ark,t:exp/train_embed_vectors/num_utts.ark
So does this cmd:
copy-vector "scp:echo 'id11251-gFfcgOVmiO0-00006 exp/train_embed_vectors/embeddings.ark:614118526' |" ark,t:-|less
I guess the possible reason is that when I read the newly created big file, the earlier thread is still writing it.
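That race is plausible: the 'ark:| ...' writer runs as a background process, so the output files can still be incomplete when the next command starts. A sketch of the usual remedy, waiting for the child process before reading; a harmless stand-in command is used here instead of copy-vector:

```python
import subprocess

# Stand-in for the 'ark:| copy-vector ...' writer pipeline.
proc = subprocess.Popen('cat > /dev/null', shell=True, stdin=subprocess.PIPE)
proc.stdin.write(b'fake archive bytes')
proc.stdin.close()          # flush buffers and signal EOF to the writer
ret = proc.wait()           # block until the writer has fully finished
# Only after ret == 0 is it safe to read the freshly written .ark/.scp files.
```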
Which function is equivalent to copy-matrix?
As of now, only read_mat_scp() supports matrix ranges (as in /path/to/file/foo.ark:5[30:40]). I suggest moving the range parsing into read_mat(), so that ranges are also supported for direct calls of this function.
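A sketch of the kind of parsing this would involve (not the library's actual code, and handling row ranges only; Kaldi also allows column ranges like [30:40,0:10]): split an rxfilename into the file:offset part and a row slice.

```python
import re

def parse_range(rxfilename):
    """Return (file_with_offset, (row_begin, row_end)) or (rxfilename, None)."""
    m = re.match(r'^(.*)\[(\d+):(\d+)\]$', rxfilename)
    if m is None:
        return rxfilename, None
    return m.group(1), (int(m.group(2)), int(m.group(3)))

path, rows = parse_range('/path/to/file/foo.ark:5[30:40]')
# read_mat(path) would then load the matrix and keep mat[rows[0]:rows[1]+1].
```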
Hello,
I'm getting an error when attempting to use copy-vector on the output of 'kaldi_io.write_vec_int'.
Error is: "Failed to read vector from stream. : Expected token FV, got W"
Goal: I have a large text file of kaldi features. The file is in .ark format however the contents are in human-readable form which I converted using 'copy-feats ark:- ark,t:-'. I want to create multiple small files where each file contains a key and mat pair. To do this I am reading in the ark file using kaldi_io and attempting to write a new file using kaldi_io within the kaldi_io.read_vec_int_ark loop. I am able to successfully read key and mat from the file, but an error occurs when attempting to write.
Code:
for key, mat in kaldi_io.read_vec_int_ark(sfile):
    print("{} {}".format(key, mat.shape))
    ## create new file to write to
    new_file_path_txt = os.path.join(sdir, "{}.{}".format(key, file_tail))
    new_file_path = os.path.join(sdir, "{}.ark".format(key))
    # new_file_path_txt = os.path.join(sdir, "{}.txt".format(key))
    # Write new file
    print("type: {}".format(type(mat)))
    print("dtype: {}".format(mat.dtype))
    mat = mat.astype('int32')  # need to cast for writing purposes
    print("dtype2: {}".format(mat.dtype))
    ark_txt_output = 'ark:| copy-vector ark:- ark,t:{}'.format(new_file_path_txt)
    with kaldi_io.open_or_fd(ark_txt_output, 'wb') as w:
        kaldi_io.write_vec_int(w, mat, key=key)
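The "Expected token FV" message suggests a type mismatch rather than a write bug: copy-vector reads float vectors (header token FV), while write_vec_int emits an integer vector. A hedged sketch of matching the Kaldi tool to the data by using copy-int-vector instead; the path is hypothetical and the commented write requires Kaldi on PATH:

```python
# copy-int-vector, not copy-vector, reads the int32 vectors write_vec_int produces.
new_file_path_txt = 'out/utt1.txt'   # hypothetical output path
int_cmd = 'ark:| copy-int-vector ark:- ark,t:{}'.format(new_file_path_txt)
# with kaldi_io.open_or_fd(int_cmd, 'wb') as w:
#     kaldi_io.write_vec_int(w, mat, key=key)
```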
Hi,
I am reading compressed features in parallel in Python and encountered the following problem. Asking for help:
AssertionError: Caught AssertionError in DataLoader worker process 2.
Original Traceback (most recent call last):
File "/home/zhangpeng/anaconda3/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
data = fetcher.fetch(index)
File "/home/zhangpeng/anaconda3/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/zhangpeng/anaconda3/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/zhangpeng/pycharmProjects/FFSVC/datasets/kaldi_dataset.py", line 68, in __getitem__
full_mat = read_mat(self.dataset[aid][1])
File "/home/zhangpeng/pycharmProjects/FFSVC/datasets/kaldi_io.py", line 717, in read_mat
mat = _read_mat_binary(fd)
File "/home/zhangpeng/pycharmProjects/FFSVC/datasets/kaldi_io.py", line 730, in _read_mat_binary
if header.startswith('CM'): return _read_compressed_mat(fd, header)
File "/home/zhangpeng/pycharmProjects/FFSVC/datasets/kaldi_io.py", line 776, in _read_compressed_mat
assert(format == 'CM ') # The formats CM2, CM3 are not supported...
Thank you~
If the matrix type is CM, then the sample_size assertions will fail, because we never set sample_size. Not sure what the sample_size is for CM, or whether it should simply complain? Either way, I'm happy to make a pull request.
Hey Karel,
I'd like to ask if it would be possible to ship this script as a library (e.g. to install that with pip), since I guess most people using this script copy it around their system a lot. It's just a bit more convenient.
First, thank you for your work.
I saw that you added Python 3 compatibility, but it seems the code in kaldi_io.py has not been revised accordingly; in fact, I cannot run kaldi_io.py under Python 3. If it really does run on Python 3, could you give me some advice on how to use it? Before now, I have tried to modify the code to work with Python 3, but it does not work. The primary problem is str vs. bytes, which behave differently than in Python 2.
Thank you, I hope for a response!
Last release on pypi is 0.9.4, but there are no tags in this repo. For packaging things well in conda-forge, we need to know the relationship between the version and the sources, which is what git tags are for. 🙃
Could you please add them, ideally also for the last release? (tags can be pushed for past commits as well)
Hi,
Are the modifications to the PATH variable in lines 17-24 really necessary?
If yes, I would suggest replacing them with an exception when $KALDI_ROOT is not set; if they are not necessary for the script, I would suggest removing them completely!
Thanks for all the work,
Best,
Quentin
I'm trying to access the features that are used for kaldi's dnn model. It looks like these matrices stored as a different type of file (Nnet3Eg, NumIo). I don't see that these are supported. Would it be non-trivial to read these?
Hi,
I noticed that kaldi_io currently does not support ranges in script files. I need this feature so I implemented it here. I guess it is best to generate test cases for that, before I open a pull request. Unfortunately I have no experience with testing in python so far. If you would like to help me with that or could point out a resource, that would be great.
Another point: I did not yet fully understand the Kaldi rx/wxfilenames. I guess you could also add ranges to script-file lines like
utt_id_01002 gunzip -c /usr/data/01002.gz |
but I am not sure how this would be done.
Thanks for your work on kaldi_io!
Hi @vesis84 ,
When writing a feature matrix to a .ark file, it might be helpful to generate the corresponding .scp file to indicate positions.
Like [the guys do here](http://kaldi-to-matlab.gforge.inria.fr/); that would complete this tool's functionality.
BTW, It's a great tool, thank you. Tests are also great.
Hi, is there a way to read a compressed feature matrix? Thanks.
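From the traceback in the earlier issue (_read_compressed_mat), kaldi_io's matrix readers appear to handle Kaldi's 'CM '-compressed format transparently, so compressed archives are read like any others; only the CM2/CM3 variants are rejected. A hedged sketch with a hypothetical path:

```python
# Compressed matrices come back as plain numpy arrays; no special call needed.
rspec = 'ark:data/train/compressed_feats.ark'   # hypothetical archive
# for key, mat in kaldi_io.read_mat_ark(rspec): # requires the actual file
#     print(key, mat.shape)
```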
Hi Karel,
It's nice that Python 3 is supported. I tested the new kaldi_io script; it works fine for directly reading "feats.ark", but it fails when reading from the stream "ark:apply-cmvn-sliding --center=true ark:feats.ark ark:- |". Line 49 throws an error:
fd = os.popen(file[:-1], 'rb')
  File "/cm/shared/apps/python/3.6/lib/python3.6/os.py", line 970, in popen
    raise ValueError("invalid mode %r" % mode)
This may be caused by the difference in stdin/stdout handling between Python 2 and Python 3.
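The direct cause is narrower: in Python 3, os.popen() accepts only text modes, so popen(cmd, 'rb') raises ValueError. A sketch of the usual replacement, a subprocess.Popen whose stdout is already a binary pipe (this appears to be what newer kaldi_io versions do, but that is an assumption here):

```python
import subprocess

# shell=True runs the command string; stdout=PIPE gives a binary file object,
# which is what os.popen(cmd, 'rb') used to provide under Python 2.
proc = subprocess.Popen('printf hello', shell=True, stdout=subprocess.PIPE)
data = proc.stdout.read()   # bytes, not str
proc.wait()
```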
Can you change the hardcoded path in kaldi_io/kaldi_io.py from:
os.environ['KALDI_ROOT']='/mnt/matylda5/iveselyk/Tools/kaldi-trunk'
to something like:
os.environ['KALDI_ROOT']=os.path.join(os.environ['CONDA_PREFIX'], 'bin')
?
Maybe also, instead of printing warnings, using logging would be useful, so that issues like this can be suppressed?
Is it possible to add lattice I/O support in kaldi_io
?
elif mode == "rb":
    err = open(output_folder + '/log.log', "a")
    proc = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, stderr=err)
    threading.Thread(target=cleanup, args=(proc, cmd)).start()  # clean-up thread,
    return proc.stdout
When the program reaches threading.Thread(target=cleanup, args=(proc, cmd)).start(), the clean-up thread runs:
def cleanup(proc, cmd):
    ret = proc.wait()
    if ret > 0:
        raise SubprocessFailed('cmd %s returned %d !' % (cmd, ret))
    return
and it raises:
data_io.SubprocessFailed: cmd gunzip -c /home/sxyl3800/workspace/kaldi/egs/timit/s5/exp/dnn4_pretrain-dbn_dnn_ali_test/ali*.gz | ali-to-pdf /home/sxyl3800/workspace/kaldi/egs/timit/s5/exp/dnn4_pretrain-dbn_dnn_ali_test/final.mdl ark:- ark:- returned 127 !
What is the cleanup function for? Why does it raise an error when ret = proc.wait() > 0?
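The cleanup thread just waits on the piped command and propagates a non-zero exit status as an exception, so failures in the pipe are not silently swallowed. A shell exit status of 127 specifically means "command not found", so the pipeline above most likely failed because ali-to-pdf is not on PATH. A small sketch reproducing that status with a deliberately nonexistent command:

```python
import subprocess

# The shell returns 127 when it cannot find the program to execute,
# mirroring the 'returned 127 !' in the error above.
ret = subprocess.call('no-such-kaldi-binary 2>/dev/null', shell=True)
```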
Hello,
I am working with Kaldi and want to try other features. I use kaldi-io-for-python and it works, but I want to have the same number of scp and ark files as my number of jobs, like the default Kaldi features. However, the open_or_fd function does not have an 'ab' mode, and I want to append to my ark and scp files.
Could anyone give me some suggestions, please?
Regards!
Zhor
:)
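One possible workaround, sketched under the assumption that kaldi_io's write_* functions accept any binary file object: open the .ark yourself in 'ab' mode and keep the .scp offsets yourself. The offset convention used here (pointing just past "key ") mirrors what Kaldi scp lines look like, but is an assumption; io.BytesIO stands in for the real file so the sketch is self-contained.

```python
import io

ark = io.BytesIO()     # stand-in for open('feats.ark', 'ab')
scp_lines = []
for key, payload in [('utt1', b'\x00data1'), ('utt2', b'\x00data2')]:
    ark.write(key.encode() + b' ')
    offset = ark.tell()                 # scp entry points just past 'key '
    scp_lines.append('{} feats.ark:{}'.format(key, offset))
    ark.write(payload)                  # really: kaldi_io.write_mat(ark, mat, key=key)
# scp_lines would then be appended to feats.scp alongside the archive.
```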
Hi, my situation is that I want to load small parts of a big ark file. Of course, it is possible to load the entire ark file and then select certain rows, but it is not memory and time efficient. I wonder if it is possible to read only small parts of the ark file? (like np.load('/tmp/123.npy', mmap_mode='r')) Thanks for your help!
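This is what the .scp file is for: each line already stores a byte offset into the archive, and read_mat() accepts a 'file.ark:offset' rxfilename, so it can seek straight to one matrix without loading the rest. A hedged sketch of looking up one key (the scp content here is made up):

```python
import io

def find_rxfilename(scp_lines, wanted_key):
    """Scan scp lines ('key file.ark:offset') for one key's rxfilename."""
    for line in scp_lines:
        key, rxfilename = line.strip().split(' ', 1)
        if key == wanted_key:
            return rxfilename
    return None

scp = io.StringIO('utt1 big.ark:12\nutt42 big.ark:98765\n')
rx = find_rxfilename(scp, 'utt42')
# mat = kaldi_io.read_mat(rx)  # seeks to the offset, reads only this matrix
```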
Hey Karel,
since the humble beginnings of this script, the read_key function has only supported keys without a / symbol. I was just wondering, is there a reason why it is like that? Kaldi itself supports keys with / in them, e.g. audio/file.wav, but kaldi_io does not.
Otherwise, I just propose to change line 115 to:
assert(re.match('^[\/.a-zA-Z0-9_-]+$',key) != None)
Hi,
I have found this tool very useful for understanding the Kaldi IO mechanisms. I have a small query on extracting samples from streaming speech.
Is it possible to pass a real-time audio signal through the Readhelper and observe the numpy array of samples?
Thanks in advance,
Regards
Pradeep
PhD student
Dept of CSE
IIT Kharagpur
Hi,
I was trying to read MFCC features from a subsegmented directory, i.e. one created using utils/data/subsegment_data_dir.sh. The contents of feats.scp are of the form:
<path_to_ark_file>:xx[0:N]
Currently, this cannot be handled by read_mat_scp. Is there any alternative?
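One hedged workaround: Kaldi's own table code understands the 'file.ark:offset[0:N]' ranges that subsegment_data_dir.sh produces, so the scp can be expanded by copy-feats before kaldi_io ever sees it, reading the result from a pipe. The path below is hypothetical and the commented loop needs Kaldi on PATH:

```python
# copy-feats resolves the ranges; kaldi_io then reads plain matrices.
scp = 'data/train_subseg/feats.scp'   # hypothetical subsegmented scp
rspec = 'ark:copy-feats scp:{} ark:- |'.format(scp)
# for key, mat in kaldi_io.read_mat_ark(rspec):
#     ...
```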
Hi,
I was trying to use kaldi_io to import alignment files, but I could not find out how to do it, and if it's possible.
I ran the TIMIT recipe and ended up with a number of ali.<n>.gz files, for example under the exp/mono_ali/ directory. I can convert those files from transition-model IDs to PDF IDs with the command (for example):
ali-to-pdf exp/mono_ali/final.mdl "ark:gunzip -c exp/mono_ali/ali.1.gz|" ark,t:mono_ali.1.pdf.txt
The resulting file contains a line for each utterance, with the utterance ID (for example faem0_si1392) and then a list of integer identifiers of the states in the model for each frame in that utterance.
Is there a way to run the ali-to-pdf command when opening the ali.1.gz file, so that I don't need to run it separately? Thank you!
Giampiero
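A hedged sketch of one answer, assuming kaldi_io readers accept piped rspecifiers: fold the whole gunzip | ali-to-pdf pipeline into the string passed to read_ali_ark, so no intermediate text file is needed. Paths are the ones from the question; the commented loop requires Kaldi on PATH.

```python
# Build the same pipeline ali-to-pdf ran manually, as one rspecifier.
mdl = 'exp/mono_ali/final.mdl'
ali = 'exp/mono_ali/ali.1.gz'
rspec = 'ark:gunzip -c {} | ali-to-pdf {} ark:- ark:- |'.format(ali, mdl)
# for utt, pdf_ids in kaldi_io.read_ali_ark(rspec):
#     ...   # pdf_ids is one integer per frame
```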
Hi !
First of all, thank you for this great job! However, I had to change every decode() in kaldi-io.py to decode('latin-1') in order to deal with French accents. I also had to comment out an assertion that was checking for only non-accented characters. It would be great if you could add this accent handling for non-English users!
After I have extracted the VAD features, I want to read the scp, but an error is reported. How can I solve it?
ERROR:datasets.kaldi_io.UnknownMatrixHeader: The header contained 'FV '
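The header token gives the answer away: 'FV ' marks a float *vector* in Kaldi binary format, so this scp points at vectors (VAD decisions are per-frame scalars), and the matrix reader rightly refuses them. A hedged sketch of picking the reader by header; the mapping is my assumption based on kaldi_io's naming and Kaldi's header tokens:

```python
# Match the kaldi_io reader to the binary header the data actually carries.
reader_for_header = {
    'FM ': 'read_mat_scp',       # float matrix
    'FV ': 'read_vec_flt_scp',   # float vector -- the VAD case above
}
# for key, vec in kaldi_io.read_vec_flt_scp('data/train/vad.scp'):  # hypothetical path
#     ...
```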