k2-fsa / text_search
Some fast-ish algorithms for batch text search in moderate-sized collections, intended for data cleanup.
Home Page: https://k2-fsa.github.io/text_search/
Is this repo (https://huggingface.co/datasets/pkufool/librilight-text) broken?
Currently, in my head, the whole pipeline will work like this:
We will have many queries (i.e. the texts we transcribe from audio) and many documents, and from the meta info of the audio we can tell which document each audio comes from. A TextSource is constructed from either a query or a document. By adding the meta info, we can convert a TextSource to a SourcedText. A SourcedText can also be constructed from a list of SourcedTexts (the meta info will tell us which TextSource each token comes from). Then we can do some cleaning / normalization on the SourcedText. For the matching part, we will create a suffix_array on the text in the SourcedText, then run a two-stage matching algorithm (i.e. find_close_matches and find_candidate_matches) with the help of the suffix_array to get the matching documents for each query (bear in mind, there are both queries and documents in one SourcedText).
I think TextSource, SourcedText, and the functions to construct and manipulate them will be on the Python side, while the SuffixArray and the matching algorithms (including Levenshtein distance) will be implemented in C++.
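To make the two-stage matching concrete, here is a toy pure-Python sketch of the idea. This is my own illustration, not the project's implementation: the real code builds the suffix array in C++ with a much faster algorithm, and find_close_matches / find_candidate_matches are more involved. The sketch concatenates document and query into one text and treats suffix-array neighbors as candidate "close matches":

```python
def suffix_array(text):
    """Suffix array by direct sorting: O(n^2 log n) worst case, fine for a toy."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_close_matches(combined, query_start):
    """Pair each query position with a document position whose suffix is
    adjacent in suffix-array order (a crude stand-in for find_close_matches)."""
    sa = suffix_array(combined)
    pairs = []
    for rank, pos in enumerate(sa):
        if pos >= query_start:  # a position inside the query
            for nb in sa[max(rank - 1, 0):rank] + sa[rank + 1:rank + 2]:
                if nb < query_start - 1:  # a document position
                    # (query_start - 1 is the separator byte)
                    pairs.append((pos - query_start, nb))
                    break
    return pairs

doc = "the quick brown fox"
query = "quick fox"
combined = doc + "\x00" + query  # one concatenated "SourcedText"-like string
matches = find_close_matches(combined, len(doc) + 1)
```

Each returned pair (query position, document position) is a candidate anchor; the second stage would then score candidate regions with Levenshtein distance.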
Some audios might have multiple versions of the reference (for example, YouTube has automatic and manual subtitles). To associate both of these references with the audio segments, I think we can first align the audio with one of the references, which gives us the "begin_byte" and "end_byte" of the chosen reference for each segment. We can then associate the second reference by doing a Levenshtein alignment between the first reference and the substring of the second reference determined by "begin_byte" and "end_byte" (with some extension on both sides, of course); we assume the two references are very close.
If we do it this way, we don't have to change the core part of our code: we just store the second reference in a custom field in the cut and associate it when writing out the results. The segments are short (from 2 to 30 seconds), so the Levenshtein alignment would be very fast, and we can do it in parallel with multiple CPU cores.
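A minimal stdlib-only sketch of this second-reference association, where difflib stands in for the Levenshtein alignment; the function name, the `extend` slack parameter, and the windowing details are my assumptions, not project code:

```python
from difflib import SequenceMatcher

def associate_second_reference(first_ref_segment, second_ref,
                               begin_byte, end_byte, extend=50):
    """Find the span of second_ref matching a segment of the first reference,
    given the segment's byte range in the (assumed very similar) second ref."""
    # widen the window on both sides, as suggested above
    lo = max(0, begin_byte - extend)
    hi = min(len(second_ref), end_byte + extend)
    window = second_ref[lo:hi]
    # locate the best-matching block inside the window (stand-in for a
    # proper Levenshtein alignment)
    m = SequenceMatcher(None, first_ref_segment, window, autojunk=False)
    match = m.find_longest_match(0, len(first_ref_segment), 0, len(window))
    # shift the window offset so the block aligns with the segment start,
    # then take a span of roughly the segment's length
    start = max(0, match.b - match.a)
    return window[start:start + len(first_ref_segment)]
```

Per segment this is cheap, so running it over all cuts with multiprocessing (as suggested above) should be straightforward.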
@danpovey @npovey If you need help with this, it would be good if you could share a small subset of the data with me, so I can add an example recipe to the project.
The above max_symbol directly affects how much memory is allocated below:
https://github.com/danpovey/text_search/blob/9499196e16b0a5bc32bebaa8ecdfaeefbb094598/textsearch/csrc/suffix_array.cc#L43
where K is max_symbol.
We should renumber the elements so that max_symbol equals the number of unique elements plus 1.
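A sketch of the proposed renumbering in Python (the function name is hypothetical; the real fix would live wherever the tokens are prepared before building the suffix array):

```python
def compact_symbols(tokens):
    """Map the symbols actually present to consecutive ids starting at 1
    (0 reserved, e.g. for a terminator, as suffix-array code often requires),
    so that max_symbol == number of unique symbols + 1."""
    remap = {s: i + 1 for i, s in enumerate(sorted(set(tokens)))}
    compact = [remap[t] for t in tokens]
    max_symbol = len(remap) + 1
    return compact, max_symbol
```

With a BPE vocabulary of a few hundred symbols but sparse usage, this keeps the allocation proportional to the symbols that actually occur.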
The docs location still seems to be my personal GitHub?
Guys,
I was under the impression that we had committed some kind of example of how you would use this end-to-end for aligning data where you have the reference and the approximate recognition, but I don't see anything like that... and the documentation seems to cover only the very first things we added? I think this aspect needs some work at some point. Or @pkufool, maybe you have things that are not committed?
Dan
In matching.py, I set "max_duration": 29, but some segments are longer than 29 seconds, and a few are even up to 200 seconds. Below is my input for matching.py (I only changed has_punctuation and max_duration):
params = AttributeDict(
    {
        # parameters for loading source texts
        # you can find the docs in textsearch/datatypes.py
        "use_utf8": False,
        "is_bpe": True,
        "use_uppercase": True,
        "has_punctuation": False,
        # parameters for aligning queries
        # you can find the docs in textsearch/match.py#align_queries
        "num_close_matches": 2,
        "segment_length": 5000,
        "reference_length_difference": 0.4,
        "min_matched_query_ratio": 0.33,
        # parameters for splitting aligned queries
        # you can find the docs in textsearch/match.py#split_aligned_queries
        "preceding_context_length": 1000,
        "timestamp_position": "current",
        "duration_add_on_left": 0.0,
        "duration_add_on_right": 0.5,
        "silence_length_to_break": 0.45,
        "overlap_ratio": 0.25,
        "min_duration": 2,
        "max_duration": 29,
        "expected_duration": (5, 20),
        "max_error_rate": 0.20,
        # output
        "output_post_texts": False,
    }
)
I used "pip install fasttextsearch", but when running run.sh I get the error: "ImportError: cannot import name 'str2bool' from 'textsearch.utils' (/usr/local/lib/python3.8/dist-packages/textsearch/utils.py)"
However, when I installed from source the way the developers do, everything worked. Something may be missing from the pip package, causing this error.
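Until the PyPI package is fixed, one stop-gap is to define the helper locally instead of importing it from textsearch.utils. The icefall projects use a str2bool along these lines, though you should check the actual source for the exact behavior:

```python
import argparse

def str2bool(v):
    """Parse a boolean-ish command-line string (hypothetical stand-in for
    the missing textsearch.utils.str2bool)."""
    if isinstance(v, bool):
        return v
    if v.lower() in ("yes", "true", "t", "y", "1"):
        return True
    if v.lower() in ("no", "false", "f", "n", "0"):
        return False
    raise argparse.ArgumentTypeError("boolean value expected")
```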
(test-textsearch) np@np-INTEL:/mnt/speech3/text_search/examples/libriheavy$ sudo ./run.sh
2023-07-20 17:48:23 (run.sh:58:main) Stage 1: Prepare LibriLight manifest
2023-07-20 17:48:23,466 INFO [prepare_manifest.py:230] {'corpus_dir': PosixPath('/mnt/speech3/text_search/examples/libriheavy/download/libri-light'), 'books_dir': PosixPath('/mnt/speech3/text_search/examples/libriheavy/download/librilight_text'), 'output_dir': PosixPath('/mnt/speech3/text_search/examples/libriheavy/data/manifests'), 'num_jobs': 10}
2023-07-20 17:48:23,466 INFO [prepare_manifest.py:196] Preparing LibriLight...
Dataset parts: 0%| | 0/3 [00:00<?, ?it/s]2023-07-20 17:48:23,467 INFO [prepare_manifest.py:205] Processing LibriLight subset: small
Distributing tasks: 100%|███████████████████████████████████████| 2588/2588 [00:00<00:00, 88334.91it/s]
Processing: 100%|████████████████████████████████████████████████| 2588/2588 [00:00<00:00, 7874.66it/s]
Dataset parts: 33%|█████████████████▋ | 1/3 [00:00<00:00, 2.08it/s]2023-07-20 17:48:23,948 INFO [prepare_manifest.py:205] Processing LibriLight subset: medium
Distributing tasks: 0it [00:00, ?it/s]
Processing: 0it [00:00, ?it/s], ?it/s]
2023-07-20 17:48:23,956 INFO [prepare_manifest.py:205] Processing LibriLight subset: large
Distributing tasks: 0it [00:00, ?it/s]
Processing: 0it [00:00, ?it/s], ?it/s]
Dataset parts: 100%|█████████████████████████████████████████████████████| 3/3 [00:00<00:00, 4.96it/s]
2023-07-20 17:48:24,072 INFO [prepare_manifest.py:239] Done.
2023-07-20 17:48:24 (run.sh:69:main) Stage 2: Split long audio into chunks
2023-07-20 17:48:24,554 INFO [split_into_chunks.py:68] {'manifest_in': PosixPath('/mnt/speech3/text_search/examples/libriheavy/data/manifests/librilight_raw_cuts_small.jsonl.gz'), 'manifest_out': PosixPath('/mnt/speech3/text_search/examples/libriheavy/data/manifests/librilight_chunk_cuts_small.jsonl.gz'), 'chunk': 30.0, 'extra': 2.0}
2023-07-20 17:48:24,555 INFO [split_into_chunks.py:79] Processing /mnt/speech3/text_search/examples/libriheavy/data/manifests/librilight_raw_cuts_small.jsonl.gz.
2023-07-20 17:48:24,555 INFO [split_into_chunks.py:82] /mnt/speech3/text_search/examples/libriheavy/data/manifests/librilight_chunk_cuts_small.jsonl.gz already exists - skipping.
2023-07-20 17:48:24 (run.sh:87:main) Stage 3: Perform speech recognition on splitted chunks
Traceback (most recent call last):
File "./tools/recognize.py", line 46, in <module>
from asr_datamodule import AsrDataModule
File "/mnt/speech3/text_search/examples/libriheavy/tools/asr_datamodule.py", line 38, in <module>
from textsearch.utils import str2bool
ImportError: cannot import name 'str2bool' from 'textsearch.utils' (/usr/local/lib/python3.8/dist-packages/textsearch/utils.py)
(test-textsearch) np@np-INTEL:/mnt/speech3/text_search/examples/libriheavy$
Guys, for the Levenshtein edit distance part, I am thinking about using this code:
https://github.com/Martinsos/edlib
I am reading about it right now. The license is MIT, so very flexible.
It supports Levenshtein alignment of long sequences, using an algorithm that is quadratic in time but only linear in memory.
(We'd incorporate the C++ code into this project, probably.)
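To illustrate the linear-memory claim: plain Levenshtein distance only needs the previous row of the DP table, so memory is O(min(len(a), len(b))) even though time stays quadratic. A stdlib-only sketch of that idea (edlib's banded bit-vector algorithm is far faster; this only shows the memory trick):

```python
def levenshtein(a, b):
    """Edit distance between strings a and b, keeping only two DP rows."""
    if len(a) < len(b):
        a, b = b, a  # keep the row over the shorter string
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        cur = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution / match
        prev = cur
    return prev[-1]
```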
Error: "RuntimeError: shape '[1, 0, 2]' is invalid for input of size 15319040"
I had an environment that works on libriheavy, which I described here.
But I needed to update torch and torchaudio to newer versions using this command
pip install torch==2.0.1+cu117 torchvision==0.15.2+cu117 torchaudio==2.0.2 --extra-index-url https://download.pytorch.org/whl/cu117
because the lhotse scripts were not recognizing the duration of m4a audio files in stage 1 when creating the manifest. After upgrading those libraries, I started getting the error "RuntimeError: shape '[1, 0, 2]' is invalid for input of size 15319040". The code below in this script started throwing the error:
if [ $stage -le 3 ] && [ $stop_stage -ge 3 ]; then
# This script loads torchscript models, exported by `torch.jit.script()`,
# and uses it to decode waves.
# You can download the jit model from
# https://huggingface.co/Zengwei/icefall-asr-librispeech-zipformer-2023-05-15
# We will get librilight_asr_cuts_{subset}.jsonl.gz
# saved in $output_dir/manifests
log "Stage 3: Perform speech recognition on splitted chunks"
for subset in small; do
#for subset in small medium large; do
./tools/recognize.py \
--world-size $world_size \
--num-workers 8 \
--manifest-in $output_dir/manifests/librilight_chunk_cuts_${subset}.jsonl.gz \
--manifest-out $output_dir/manifests/librilight_asr_cuts_${subset}.jsonl.gz \
--nn-model-filename exp/exp/jit_script.pt \
--tokens exp/data/lang_bpe_500/tokens.txt \
--max-duration 1200 \
--decoding-method greedy_search \
--master 12346
done
fi
Output of the error:
(np_env) np@np-INTEL:/mnt/speech1/anna/text_search/examples/libriheavy$ ./run_small.sh
2023-08-30 21:21:35 (run_small.sh:59:main) Stage 1: Prepare LibriLight manifest
2023-08-30 21:21:36,682 INFO [prepare_manifest.py:230] {'corpus_dir': PosixPath('/mnt/speech1/anna/text_search/examples/libriheavy/download/libri-light'), 'books_dir': PosixPath('/mnt/speech1/anna/text_search/examples/libriheavy/download/librilight_text'), 'output_dir': PosixPath('/mnt/speech1/anna/text_search/examples/libriheavy/data/manifests'), 'num_jobs': 10}
2023-08-30 21:21:36,682 INFO [prepare_manifest.py:196] Preparing LibriLight...
Dataset parts: 0%| | 0/3 [00:00<?, ?it/s]2023-08-30 21:21:36,684 INFO [prepare_manifest.py:205] Processing LibriLight subset: small
Distributing tasks: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2588/2588 [00:00<00:00, 97380.94it/s]
Processing: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2588/2588 [00:15<00:00, 171.84it/s]
Dataset parts: 33%|████████████████████████████████████████████▋
Dataset parts: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:15<00:00, 5.28s/it]
2023-08-30 21:21:52,512 INFO [prepare_manifest.py:239] Done.
2023-08-30 21:21:52 (run_small.sh:70:main) Stage 2: Split long audio into chunks
2023-08-30 21:21:53,409 INFO [split_into_chunks.py:68] {'manifest_in': PosixPath('/mnt/speech1/anna/text_search/examples/libriheavy/data/manifests/librilight_raw_cuts_small.jsonl.gz'), 'manifest_out': PosixPath('/mnt/speech1/anna/text_search/examples/libriheavy/data/manifests/librilight_chunk_cuts_small.jsonl.gz'), 'chunk': 30.0, 'extra': 2.0}
2023-08-30 21:21:53,409 INFO [split_into_chunks.py:79] Processing /mnt/speech1/anna/text_search/examples/libriheavy/data/manifests/librilight_raw_cuts_small.jsonl.gz.
2023-08-30 21:21:58,185 INFO [split_into_chunks.py:93] Cuts saved to /mnt/speech1/anna/text_search/examples/libriheavy/data/manifests/librilight_chunk_cuts_small.jsonl.gz
2023-08-30 21:21:58 (run_small.sh:89:main) Stage 3: Perform speech recognition on splitted chunks
2023-08-30 21:22:04,385 INFO [recognize.py:314] (0/2) Decoding started
2023-08-30 21:22:04,399 INFO [recognize.py:314] (1/2) Decoding started
2023-08-30 21:22:04,405 INFO [recognize.py:324] (1/2) {'subsampling_factor': 4, 'frame_shift_ms': 10, 'world_size': 2, 'master_port': 12346, 'manifest_in': PosixPath('/mnt/speech1/anna/text_search/examples/libriheavy/data/manifests/librilight_chunk_cuts_small.jsonl.gz'), 'manifest_out': PosixPath('/mnt/speech1/anna/text_search/examples/libriheavy/data/manifests/librilight_asr_cuts_small.jsonl.gz'), 'log_dir': PosixPath('logs'), 'nn_model_filename': 'exp/exp/jit_script.pt', 'tokens': 'exp/data/lang_bpe_500/tokens.txt', 'decoding_method': 'greedy_search', 'max_duration': 1200, 'return_cuts': True, 'num_mel_bins': 80, 'num_workers': 8, 'manifest_out_dir': PosixPath('/mnt/speech1/anna/text_search/examples/libriheavy/data/manifests'), 'suffix': '.jsonl.gz', 'cuts_filename': 'librilight_asr_cuts_small', 'blank_id': 0, 'unk_id': 2, 'vocab_size': 500}
2023-08-30 21:22:04,405 INFO [recognize.py:324] (0/2) {'subsampling_factor': 4, 'frame_shift_ms': 10, 'world_size': 2, 'master_port': 12346, 'manifest_in': PosixPath('/mnt/speech1/anna/text_search/examples/libriheavy/data/manifests/librilight_chunk_cuts_small.jsonl.gz'), 'manifest_out': PosixPath('/mnt/speech1/anna/text_search/examples/libriheavy/data/manifests/librilight_asr_cuts_small.jsonl.gz'), 'log_dir': PosixPath('logs'), 'nn_model_filename': 'exp/exp/jit_script.pt', 'tokens': 'exp/data/lang_bpe_500/tokens.txt', 'decoding_method': 'greedy_search', 'max_duration': 1200, 'return_cuts': True, 'num_mel_bins': 80, 'num_workers': 8, 'manifest_out_dir': PosixPath('/mnt/speech1/anna/text_search/examples/libriheavy/data/manifests'), 'suffix': '.jsonl.gz', 'cuts_filename': 'librilight_asr_cuts_small', 'blank_id': 0, 'unk_id': 2, 'vocab_size': 500}
2023-08-30 21:22:05,773 INFO [recognize.py:329] (0/2) device: cuda:0
2023-08-30 21:22:05,773 INFO [recognize.py:331] (0/2) Loading jit model
2023-08-30 21:22:05,773 INFO [recognize.py:329] (1/2) device: cuda:1
2023-08-30 21:22:05,773 INFO [recognize.py:331] (1/2) Loading jit model
2023-08-30 21:22:18,183 INFO [recognize.py:290] (0/2) cuts processed until now is 40
2023-08-30 21:22:23,089 INFO [recognize.py:290] (1/2) cuts processed until now is 40
/home/np/anna/np_env/lib/python3.8/site-packages/torch/nn/modules/module.py:1501: UserWarning: operator() profile_node %626 : int = prim::profile_ivalue(%624)
does not have profile information (Triggered internally at ../third_party/nvfuser/csrc/graph_fuser.cpp:104.)
return forward_call(*args, **kwargs)
/home/np/anna/np_env/lib/python3.8/site-packages/torch/nn/modules/module.py:1501: UserWarning: FALLBACK path has been taken inside: compileCudaFusionGroup. This is an indication that codegen Failed for some reason.
To debug try disable codegen fallback path via setting the env variable `export PYTORCH_NVFUSER_DISABLE=fallback`
To report the issue, try enable logging via setting the envvariable ` export PYTORCH_JIT_LOG_LEVEL=manager.cpp`
(Triggered internally at ../third_party/nvfuser/csrc/manager.cpp:243.)
return forward_call(*args, **kwargs)
/home/np/anna/np_env/lib/python3.8/site-packages/torch/nn/modules/module.py:1501: UserWarning: FALLBACK path has been taken inside: runCudaFusionGroup. This is an indication that codegen Failed for some reason.
To debug try disable codegen fallback path via setting the env variable `export PYTORCH_NVFUSER_DISABLE=fallback`
(Triggered internally at ../third_party/nvfuser/csrc/manager.cpp:335.)
return forward_call(*args, **kwargs)
Traceback (most recent call last):
File "./tools/recognize.py", line 426, in <module>
main()
File "./tools/recognize.py", line 402, in main
mp.spawn(
File "/home/np/anna/np_env/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 239, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/home/np/anna/np_env/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 197, in start_processes
while not context.join():
File "/home/np/anna/np_env/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 160, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/home/np/anna/np_env/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
fn(i, *args)
File "/home/np/anna/np_env/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/mnt/speech1/anna/text_search/examples/libriheavy/tools/recognize.py", line 353, in run
decode_dataset(
File "/mnt/speech1/anna/text_search/examples/libriheavy/tools/recognize.py", line 278, in decode_dataset
hyps, timestamps, scores = decode_one_batch(
File "/mnt/speech1/anna/text_search/examples/libriheavy/tools/recognize.py", line 182, in decode_one_batch
encoder_out, encoder_out_lens = model.encoder(
File "/home/np/anna/np_env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: shape '[1, 0, 2]' is invalid for input of size 15319040
(np_env) np@np-INTEL:/mnt/speech1/anna/text_search/examples/libriheavy$
My environment that is not working for the facebook dataset or my dataset (it always fails at stage 3):
(np_env) np@np-INTEL:/mnt/speech1/anna/text_search/examples/libriheavy$ pip list
Package Version
------------------------ --------------------------------------
absl-py 1.4.0
audioread 3.0.0
cachetools 5.3.1
certifi 2023.7.22
cffi 1.15.1
charset-normalizer 3.2.0
click 8.1.7
cmake 3.27.2
cytoolz 0.12.2
dataclasses 0.6
dill 0.3.7
fasttextsearch 0.6
filelock 3.12.2
google-auth 2.22.0
google-auth-oauthlib 1.0.0
graphviz 0.20.1
grpcio 1.57.0
idna 3.4
importlib-metadata 6.8.0
intervaltree 3.1.0
Jinja2 3.1.2
k2 1.24.3.dev20230718+cuda11.7.torch2.0.1
kaldialign 0.7.1
kaldifeat 1.25.0.dev20230726+cuda11.7.torch2.0.1
kaldifst 1.6
kaldilm 1.15
lhotse 1.16.0.dev0+git.2046b55d.clean
lilcom 1.7
lit 16.0.6
Markdown 3.4.4
MarkupSafe 2.1.3
mpmath 1.3.0
networkx 3.1
numpy 1.24.4
nvidia-cublas-cu11 11.10.3.66
nvidia-cuda-cupti-cu11 11.7.101
nvidia-cuda-nvrtc-cu11 11.7.99
nvidia-cuda-runtime-cu11 11.7.99
nvidia-cudnn-cu11 8.5.0.96
nvidia-cufft-cu11 10.9.0.58
nvidia-curand-cu11 10.2.10.91
nvidia-cusolver-cu11 11.4.0.1
nvidia-cusparse-cu11 11.7.4.91
nvidia-nccl-cu11 2.14.3
nvidia-nvtx-cu11 11.7.91
oauthlib 3.2.2
packaging 23.1
Pillow 10.0.0
pip 20.0.2
pkg-resources 0.0.0
protobuf 4.24.1
pyasn1 0.5.0
pyasn1-modules 0.3.0
pycipher 0.5.2
pycparser 2.21
PyYAML 6.0.1
regex 2023.8.8
requests 2.31.0
requests-oauthlib 1.3.1
rsa 4.9
sentencepiece 0.1.99
setuptools 44.0.0
six 1.16.0
sortedcontainers 2.4.0
soundfile 0.12.1
sympy 1.12
tabulate 0.9.0
tensorboard 2.14.0
tensorboard-data-server 0.7.1
termcolor 2.3.0
toolz 0.12.0
torch 2.0.1+cu117
torchaudio 2.0.2+cu117
torchvision 0.15.2+cu117
tqdm 4.66.1
triton 2.0.0
typeguard 4.1.2
typing-extensions 4.7.1
urllib3 2.0.4
werkzeug 2.3.7
wheel 0.41.1
zipp 3.16.2
(np_env) np@np-INTEL:/mnt/speech1/anna/text_search/examples/libriheavy$
Environment that works, although the duration of m4a audios was wrong in the Lhotse scripts for my dataset; it works well with the facebook dataset:
np@np-INTEL:/mnt/speech3$ source venv_ts/bin/activate
(test-textsearch) np@np-INTEL:/mnt/speech3$ pip list
Package Version
----------------------- ---------------------------------------
absl-py 1.4.0
appdirs 1.4.3
apturl 0.5.2
attrs 19.3.0
audioread 3.0.0
Automat 0.8.0
backcall 0.1.0
bcrypt 3.1.7
beautifulsoup4 4.8.2
bleach 6.0.0
blinker 1.4
boto 2.49.0
Brlapi 0.7.0
cachetools 5.3.1
certifi 2019.11.28
cffi 1.15.1
chardet 3.0.4
click 8.1.6
colorama 0.4.3
command-not-found 0.3
constantly 15.1.0
cryptography 2.8
cssselect 1.1.0
cupshelpers 1.0
cytoolz 0.12.1
dataclasses 0.6
dbus-python 1.2.16
decorator 4.4.2
defer 1.0.6
dill 0.3.7
distlib 0.3.0
distro 1.4.0
distro-info 0.23ubuntu1
duplicity 0.8.12.0
entrypoints 0.3
fasteners 0.14.1
fasttextsearch 0.6
filelock 3.0.12
future 0.18.2
google-auth 2.22.0
google-auth-oauthlib 1.0.0
graphviz 0.20.1
grpcio 1.56.2
html5lib 1.0.1
httplib2 0.14.0
hyperlink 19.0.0
idna 2.8
importlib-metadata 6.8.0
incremental 16.10.1
intervaltree 3.1.0
ipython 7.13.0
ipython_genutils 0.2.0
jedi 0.15.2
k2 1.24.3.dev20230725+cuda11.3.torch1.12.1
kaggle 1.5.16
kaldialign 0.7.1
kaldifeat 1.25.0.dev20230726+cuda11.3.torch1.12.1
kaldifst 1.6
kaldilm 1.15
keyring 18.0.1
language-selector 0.1
launchpadlib 1.10.13
lazr.restfulclient 0.14.2
lazr.uri 1.0.3
lhotse 1.16.0
lilcom 1.7
lockfile 0.12.2
louis 3.12.0
lxml 4.5.0
macaroonbakery 1.3.1
Mako 1.1.0
Markdown 3.4.3
MarkupSafe 2.1.3
monotonic 1.5
more-itertools 4.2.0
mysqlclient 1.4.4
netifaces 0.10.4
numpy 1.24.4
oauthlib 3.1.0
olefile 0.46
packaging 23.1
paramiko 2.6.0
parsel 1.5.2
parso 0.5.2
pexpect 4.6.0
pickleshare 0.7.5
Pillow 7.0.0
pip 23.2
pipenv 11.9.0
prompt-toolkit 2.0.10
protobuf 4.23.4
pyasn1 0.4.2
pyasn1-modules 0.2.1
pycairo 1.16.2
pycparser 2.21
pycups 1.9.73
PyDispatcher 2.0.5
Pygments 2.3.1
PyGObject 3.36.0
PyHamcrest 1.9.0
PyJWT 1.7.1
pymacaroons 0.13.0
PyNaCl 1.3.0
pyOpenSSL 19.0.0
pyRFC3339 1.1
python-apt 2.0.0+ubuntu0.20.4.8
python-dateutil 2.7.3
python-debian 0.1.36ubuntu1
python-slugify 8.0.1
pytz 2019.3
pyxdg 0.26
PyYAML 5.3.1
queuelib 1.5.0
regex 2023.6.3
reportlab 3.5.34
requests 2.22.0
requests-oauthlib 1.3.1
requests-unixsocket 0.2.0
rsa 4.9
Scrapy 1.7.3
SecretStorage 2.3.1
sentencepiece 0.1.99
service-identity 18.1.0
setuptools 45.2.0
simplejson 3.16.0
six 1.14.0
sortedcontainers 2.4.0
soundfile 0.12.1
soupsieve 1.9.5
ssh-import-id 5.10
systemd-python 234
tabulate 0.9.0
tensorboard 2.13.0
tensorboard-data-server 0.7.1
text-unidecode 1.3
toolz 0.12.0
torch 1.12.1+cu113
torchaudio 0.12.1+cu113
torchvision 0.13.1+cu113
tqdm 4.65.0
traitlets 4.3.3
Twisted 18.9.0
typeguard 4.0.0
typing_extensions 4.7.1
ubuntu-advantage-tools 27.9
ubuntu-drivers-common 0.0.0
ufw 0.36
unattended-upgrades 0.1
urllib3 1.25.8
usb-creator 0.3.7
virtualenv 20.0.17
virtualenv-clone 0.3.0
w3lib 1.21.0
wadllib 1.3.3
wcwidth 0.1.8
webencodings 0.5.1
Werkzeug 2.3.6
wheel 0.34.2
xkit 0.0.0
youtube-dl 2021.12.17
zipp 1.0.0
zope.interface 4.7.1
[notice] A new release of pip is available: 23.2 -> 23.2.1
[notice] To update, run: python3 -m pip install --upgrade pip
(test-textsearch) np@np-INTEL:/mnt/speech3$
Guys,
I have an idea for how we could quickly obtain an alignment from the suffix-array stuff.
This version would make the most sense if it was done from one single long sequence to another,
e.g. from one book to an entire chapter.
A previous algorithm, whose name I forget, gave us a set of "close matches" obtained from
a suffix array algorithm.
Let us think of these "close matches" as a set of (i, j) pairs, where i is an index into
one sequence and j is an index into the other.
We can define the task of getting an approximate alignment between the two sequences
as the task of finding the longest chain of pairs:
(i1, j1), (i2, j2), ... (iN, jN)
such that i1 <= i2 <= ... <= iN, and j1 <= j2 <= ... <= jN.
One way that we could do this is as follows:
First sort the list of pairs on (i, then j).
Then, for each pair p == (i, j) in the sorted list, set
count[p] = 1 + max_q(count[q])
where q ranges over all previously processed pairs, limited to those with
q.i <= p.i and q.j <= p.j (treating them like pairs in Python, so [0] and [1]
return the i and j values). We take the "max" expression to be 0 if there is
no q satisfying these constraints.
When we are done processing all pairs, the last element of the longest chain is
the pair with the largest count. We can store "back-pointers" for each processed
pair telling us which previous pair achieved the "max".
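The recurrence above, written out directly in Python with back-pointers; a quadratic reference sketch to pin down the definition (names are mine):

```python
def longest_chain_quadratic(pairs):
    """Return the longest chain of (i, j) pairs with non-decreasing i and j,
    via the O(n^2) count[p] = 1 + max_q(count[q]) recurrence."""
    pairs = sorted(pairs)                # sort on (i, then j)
    count = [1] * len(pairs)
    prev = [-1] * len(pairs)             # back-pointer; -1 marks chain start
    for n, (i, j) in enumerate(pairs):
        for q in range(n):               # all previously processed pairs
            qi, qj = pairs[q]
            if qi <= i and qj <= j and count[q] + 1 > count[n]:
                count[n] = count[q] + 1
                prev[n] = q
    # trace back from the pair with the largest count
    best = max(range(len(pairs)), key=count.__getitem__) if pairs else -1
    chain = []
    while best != -1:
        chain.append(pairs[best])
        best = prev[best]
    return chain[::-1]
```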
Now, the above algorithm is quadratic on the face of it, but I think in C++ we can
make it n log(n) by exploiting the properties of std::set, as follows.
A std::set in C++ is stored in sorted order but supports efficient insertion.
Concretely, the algorithm would be like this. I'll just use int as the type;
you can decide what to use.
// These GoodMatch items, at least the i and j elements, are an input;
// they are sorted on (i, then j). The index into this list is n.
struct GoodMatch {
  int i;       // position of "good match" in 1st sequence
  int j;       // position of "good match" in 2nd sequence
  int prev_n;  // previous n in backtrace, index into sorted_pairs; set below.
  bool operator<(const GoodMatch &other) const {  // sort on i, then j.
    if (i != other.i) return i < other.i;
    return j < other.j;
    // note: could perhaps pack i and j into a single int64 with << and |;
    // not sure if it matters.
  }
};
// Sorted by i, then j; the prev_n values can be undefined for now.
std::vector<GoodMatch> sorted_pairs;  // this is created somehow, then sorted.
struct GoodMatchCount {
  int j;
  int n;
  mutable int c;  // mutable: c is filled in after insertion and is not part
                  // of the ordering, so this does not break std::set.
  bool operator<(const GoodMatchCount &other) const {
    if (j != other.j) return j < other.j;
    return n < other.n;
  }
};
// In addition to being sorted on j, the counts will always have the
// property that the c values are strictly increasing. We ensure this
// by removing dominated elements as necessary.
std::set<GoodMatchCount> cur_counts;
cur_counts.insert({-1, -1, 0});  // sentinel, so every element has a predecessor.
int N = sorted_pairs.size();
for (int n = 0; n < N; ++n) {
  int j = sorted_pairs[n].j;
  auto p = cur_counts.insert({j, n, 0});  // c is filled in below.
  assert(p.second);  // (j, n) is unique, so insertion must succeed.
  auto iter = p.first;
  auto prev_iter = std::prev(iter);  // prev_iter->j <= j; also
  // prev_iter->i <= sorted_pairs[n].i, because sorted_pairs is sorted on i.
  int c = prev_iter->c + 1;
  sorted_pairs[n].prev_n = prev_iter->n;  // backtrace for the new element.
  iter->c = c;
  // Erase all following elements gmc with gmc.c <= c: for those j values
  // the count is no better than the one we just added, so we would never
  // want to backtrace to them. This keeps the c values strictly increasing.
  auto next = std::next(iter);
  while (next != cur_counts.end() && next->c <= c)
    next = cur_counts.erase(next);
}
// After we are done iterating, the "best traceback" starts at the final
// element of cur_counts, i.e. *cur_counts.rbegin(), and we trace back
// using the prev_n members.
The idea is that we can use the traceback through these "good matches" as the backbone for the Levenshtein alignment, limiting the Levenshtein alignment to blocks defined by the (i, j) positions in this traceback. (Whether we use the maximum possible number of blocks or not, I'm not sure.)
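As a quick sanity check of the n log(n) claim: once the pairs are sorted on (i, then j), the chain length equals the length of the longest non-decreasing subsequence of the j values, which bisect computes in O(n log n). This sketch returns only the length, not the backtrace:

```python
import bisect

def longest_chain_length(pairs):
    """Length of the longest chain of (i, j) pairs with non-decreasing i, j."""
    js = [j for _, j in sorted(pairs)]  # sort on (i, then j), keep j values
    tails = []  # tails[k] = smallest possible tail j of a chain of length k+1
    for j in js:
        k = bisect.bisect_right(tails, j)  # bisect_right: ties extend chains
        if k == len(tails):
            tails.append(j)
        else:
            tails[k] = j
    return len(tails)
```

The tails list here plays the same role as cur_counts in the C++ sketch: a frontier of best chain endings, kept sorted with dominated entries overwritten.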