k2-fsa / text_search
Some fast-ish algorithms for batch text search in moderate-sized collections, intended for data cleanup.
Home Page: https://k2-fsa.github.io/text_search/
Is this repo (https://huggingface.co/datasets/pkufool/librilight-text) broken?
Currently, in my head, the whole pipeline will work like this:
We will have many queries (i.e. the texts we transcribe from audio) and many documents, and from the meta info of the audio we can tell which document each audio comes from. A TextSource is constructed from either a query or a document. By adding the meta info, we can convert a TextSource to a SourcedText. A SourcedText can also be constructed from a list of SourcedTexts (the meta info will tell us which TextSource each token comes from). Then we can do some cleaning / normalization on the SourcedText. For the matching part, we will create a suffix_array on the text in the SourcedText, then run a two-stage matching algorithm (i.e. find_close_matches and find_candidate_matches) with the help of the suffix_array to get the matching documents for each query (bear in mind, there are both queries and documents in one SourcedText).
I think TextSource, SourcedText, and the functions to construct and manipulate them will be on the Python side, while the SuffixArray and the matching algorithms (including Levenshtein distance) will be implemented in C++.
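To make the two-stage matching concrete, here is a toy pure-Python sketch of the idea. This is my own illustration, not the project's implementation: the real code builds the suffix array in C++ with a much faster algorithm, and find_close_matches / find_candidate_matches are more involved. The sketch concatenates document and query into one text and treats suffix-array neighbors as candidate "close matches":

```python
def suffix_array(text):
    """Suffix array by direct sorting: O(n^2 log n) worst case, fine for a toy."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_close_matches(combined, query_start):
    """Pair each query position with a document position whose suffix is
    adjacent in suffix-array order (a crude stand-in for find_close_matches)."""
    sa = suffix_array(combined)
    pairs = []
    for rank, pos in enumerate(sa):
        if pos >= query_start:  # a position inside the query
            for nb in sa[max(rank - 1, 0):rank] + sa[rank + 1:rank + 2]:
                if nb < query_start - 1:  # a document position
                    # (query_start - 1 is the separator byte)
                    pairs.append((pos - query_start, nb))
                    break
    return pairs

doc = "the quick brown fox"
query = "quick fox"
combined = doc + "\x00" + query  # one concatenated "SourcedText"-like string
matches = find_close_matches(combined, len(doc) + 1)
```

Each returned pair (query position, document position) is a candidate anchor; the second stage would then score candidate regions with Levenshtein distance.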
Some audios might have multiple versions of the reference (for example, YouTube has automatic and manual subtitles). To associate both of these references with the audio segments, I think we can first align the audio with one of the references, which gives us the "begin_byte" and "end_byte" of the chosen reference for each segment. We can then associate the second reference by doing a Levenshtein alignment between the first reference and the substring of the second reference determined by "begin_byte" and "end_byte" (with some extension on both sides, of course); we assume the two references are very close.
If we do it this way, we don't have to change the core part of our code: we just store the second reference in a custom field in the cut and associate it when writing out the results. The segments are short (from 2 to 30 seconds), so the Levenshtein alignment would be very fast, and we can do it in parallel with multiple CPU cores.
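A minimal stdlib-only sketch of this second-reference association, where difflib stands in for the Levenshtein alignment; the function name, the `extend` slack parameter, and the windowing details are my assumptions, not project code:

```python
from difflib import SequenceMatcher

def associate_second_reference(first_ref_segment, second_ref,
                               begin_byte, end_byte, extend=50):
    """Find the span of second_ref matching a segment of the first reference,
    given the segment's byte range in the (assumed very similar) second ref."""
    # widen the window on both sides, as suggested above
    lo = max(0, begin_byte - extend)
    hi = min(len(second_ref), end_byte + extend)
    window = second_ref[lo:hi]
    # locate the best-matching block inside the window (stand-in for a
    # proper Levenshtein alignment)
    m = SequenceMatcher(None, first_ref_segment, window, autojunk=False)
    match = m.find_longest_match(0, len(first_ref_segment), 0, len(window))
    # shift the window offset so the block aligns with the segment start,
    # then take a span of roughly the segment's length
    start = max(0, match.b - match.a)
    return window[start:start + len(first_ref_segment)]
```

Per segment this is cheap, so running it over all cuts with multiprocessing (as suggested above) should be straightforward.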
@danpovey @npovey If you need help with this, it would be good if you could share a small subset of the data with me, so I can add an example recipe to the project.
The above max_symbol directly affects how much memory is allocated below:
https://github.com/danpovey/text_search/blob/9499196e16b0a5bc32bebaa8ecdfaeefbb094598/textsearch/csrc/suffix_array.cc#L43
where K is max_symbol.
We should renumber the elements so that max_symbol equals the number of unique elements plus 1.
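A sketch of the proposed renumbering in Python (the function name is hypothetical; the real fix would live wherever the tokens are prepared before building the suffix array):

```python
def compact_symbols(tokens):
    """Map the symbols actually present to consecutive ids starting at 1
    (0 reserved, e.g. for a terminator, as suffix-array code often requires),
    so that max_symbol == number of unique symbols + 1."""
    remap = {s: i + 1 for i, s in enumerate(sorted(set(tokens)))}
    compact = [remap[t] for t in tokens]
    max_symbol = len(remap) + 1
    return compact, max_symbol
```

With a BPE vocabulary of a few hundred symbols but sparse usage, this keeps the allocation proportional to the symbols that actually occur.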
The docs location still seems to be my personal GitHub?
Guys,
I was under the impression that we had committed some kind of example of how you would use this end-to-end for aligning data where you have the reference and the approximate recognition, but I don't see anything like that... and the documentation seems to cover only the very first things we added? I think this aspect needs some work at some point. Or @pkufool, maybe you have things that are not committed?
Dan
In matching.py, I set "max_duration": 29, but some segments are longer than 29 seconds, and a few are even up to 200 seconds. Below is my input for matching.py (I only changed has_punctuation and max_duration):
params = AttributeDict(
    {
        # parameters for loading source texts
        # you can find the docs in textsearch/datatypes.py
        "use_utf8": False,
        "is_bpe": True,
        "use_uppercase": True,
        "has_punctuation": False,
        # parameters for aligning queries
        # you can find the docs in textsearch/match.py#align_queries
        "num_close_matches": 2,
        "segment_length": 5000,
        "reference_length_difference": 0.4,
        "min_matched_query_ratio": 0.33,
        # parameters for splitting aligned queries
        # you can find the docs in textsearch/match.py#split_aligned_queries
        "preceding_context_length": 1000,
        "timestamp_position": "current",
        "duration_add_on_left": 0.0,
        "duration_add_on_right": 0.5,
        "silence_length_to_break": 0.45,
        "overlap_ratio": 0.25,
        "min_duration": 2,
        "max_duration": 29,
        "expected_duration": (5, 20),
        "max_error_rate": 0.20,
        # output
        "output_post_texts": False,
    }
)
I used "pip install fasttextsearch", but when running run.sh I get the error: "ImportError: cannot import name 'str2bool' from 'textsearch.utils' (/usr/local/lib/python3.8/dist-packages/textsearch/utils.py)"
However, when I installed from source the way the developers do, everything worked. Something may be missing from the pip package, causing this error.
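Until the PyPI package is fixed, one stop-gap is to define the helper locally instead of importing it from textsearch.utils. The icefall projects use a str2bool along these lines, though you should check the actual source for the exact behavior:

```python
import argparse

def str2bool(v):
    """Parse a boolean-ish command-line string (hypothetical stand-in for
    the missing textsearch.utils.str2bool)."""
    if isinstance(v, bool):
        return v
    if v.lower() in ("yes", "true", "t", "y", "1"):
        return True
    if v.lower() in ("no", "false", "f", "n", "0"):
        return False
    raise argparse.ArgumentTypeError("boolean value expected")
```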
(test-textsearch) np@np-INTEL:/mnt/speech3/text_search/examples/libriheavy$ sudo ./run.sh
2023-07-20 17:48:23 (run.sh:58:main) Stage 1: Prepare LibriLight manifest
2023-07-20 17:48:23,466 INFO [prepare_manifest.py:230] {'corpus_dir': PosixPath('/mnt/speech3/text_search/examples/libriheavy/download/libri-light'), 'books_dir': PosixPath('/mnt/speech3/text_search/examples/libriheavy/download/librilight_text'), 'output_dir': PosixPath('/mnt/speech3/text_search/examples/libriheavy/data/manifests'), 'num_jobs': 10}
2023-07-20 17:48:23,466 INFO [prepare_manifest.py:196] Preparing LibriLight...
Dataset parts: 0%| | 0/3 [00:00<?, ?it/s]2023-07-20 17:48:23,467 INFO [prepare_manifest.py:205] Processing LibriLight subset: small
Distributing tasks: 100%|███████████████████████████████████████| 2588/2588 [00:00<00:00, 88334.91it/s]
Processing: 100%|████████████████████████████████████████████████| 2588/2588 [00:00<00:00, 7874.66it/s]
Dataset parts: 33%|█████████████████▋ | 1/3 [00:00<00:00, 2.08it/s]2023-07-20 17:48:23,948 INFO [prepare_manifest.py:205] Processing LibriLight subset: medium
Distributing tasks: 0it [00:00, ?it/s]
Processing: 0it [00:00, ?it/s], ?it/s]
2023-07-20 17:48:23,956 INFO [prepare_manifest.py:205] Processing LibriLight subset: large
Distributing tasks: 0it [00:00, ?it/s]
Processing: 0it [00:00, ?it/s], ?it/s]
Dataset parts: 100%|█████████████████████████████████████████████████████| 3/3 [00:00<00:00, 4.96it/s]
2023-07-20 17:48:24,072 INFO [prepare_manifest.py:239] Done.
2023-07-20 17:48:24 (run.sh:69:main) Stage 2: Split long audio into chunks
2023-07-20 17:48:24,554 INFO [split_into_chunks.py:68] {'manifest_in': PosixPath('/mnt/speech3/text_search/examples/libriheavy/data/manifests/librilight_raw_cuts_small.jsonl.gz'), 'manifest_out': PosixPath('/mnt/speech3/text_search/examples/libriheavy/data/manifests/librilight_chunk_cuts_small.jsonl.gz'), 'chunk': 30.0, 'extra': 2.0}
2023-07-20 17:48:24,555 INFO [split_into_chunks.py:79] Processing /mnt/speech3/text_search/examples/libriheavy/data/manifests/librilight_raw_cuts_small.jsonl.gz.
2023-07-20 17:48:24,555 INFO [split_into_chunks.py:82] /mnt/speech3/text_search/examples/libriheavy/data/manifests/librilight_chunk_cuts_small.jsonl.gz already exists - skipping.
2023-07-20 17:48:24 (run.sh:87:main) Stage 3: Perform speech recognition on splitted chunks
Traceback (most recent call last):
File "./tools/recognize.py", line 46, in <module>
from asr_datamodule import AsrDataModule
File "/mnt/speech3/text_search/examples/libriheavy/tools/asr_datamodule.py", line 38, in <module>
from textsearch.utils import str2bool
ImportError: cannot import name 'str2bool' from 'textsearch.utils' (/usr/local/lib/python3.8/dist-packages/textsearch/utils.py)
(test-textsearch) np@np-INTEL:/mnt/speech3/text_search/examples/libriheavy$
Guys, for the Levenshtein edit distance part, I am thinking about using this code:
https://github.com/Martinsos/edlib
I am reading about it right now. The license is MIT, so very flexible.
It supports Levenshtein alignment of long sequences, using an algorithm that is quadratic in time but only linear in memory.
(We'd incorporate the C++ code into this project, probably.)
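To illustrate the linear-memory claim: plain Levenshtein distance only needs the previous row of the DP table, so memory is O(min(len(a), len(b))) even though time stays quadratic. A stdlib-only sketch of that idea (edlib's banded bit-vector algorithm is far faster; this only shows the memory trick):

```python
def levenshtein(a, b):
    """Edit distance between strings a and b, keeping only two DP rows."""
    if len(a) < len(b):
        a, b = b, a  # keep the row over the shorter string
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        cur = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution / match
        prev = cur
    return prev[-1]
```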
Error: "RuntimeError: shape '[1, 0, 2]' is invalid for input of size 15319040"
I had an environment that works on libriheavy, which I described here.
But I needed to update torch and torchaudio to newer versions using this command
pip install torch==2.0.1+cu117 torchvision==0.15.2+cu117 torchaudio==2.0.2 --extra-index-url https://download.pytorch.org/whl/cu117
because the lhotse scripts were not recognizing the duration of m4a audio files in stage 1 when creating the manifest. After upgrading those libraries, I started getting the error "RuntimeError: shape '[1, 0, 2]' is invalid for input of size 15319040". The code below in this script started throwing the error:
if [ $stage -le 3 ] && [ $stop_stage -ge 3 ]; then
# This script loads torchscript models, exported by `torch.jit.script()`,
# and uses it to decode waves.
# You can download the jit model from
# https://huggingface.co/Zengwei/icefall-asr-librispeech-zipformer-2023-05-15
# We will get librilight_asr_cuts_{subset}.jsonl.gz
# saved in $output_dir/manifests
log "Stage 3: Perform speech recognition on splitted chunks"
for subset in small; do
#for subset in small medium large; do
./tools/recognize.py \
--world-size $world_size \
--num-workers 8 \
--manifest-in $output_dir/manifests/librilight_chunk_cuts_${subset}.jsonl.gz \
--manifest-out $output_dir/manifests/librilight_asr_cuts_${subset}.jsonl.gz \
--nn-model-filename exp/exp/jit_script.pt \
--tokens exp/data/lang_bpe_500/tokens.txt \
--max-duration 1200 \
--decoding-method greedy_search \
--master 12346
done
fi
Output of the error:
(np_env) np@np-INTEL:/mnt/speech1/anna/text_search/examples/libriheavy$ ./run_small.sh
2023-08-30 21:21:35 (run_small.sh:59:main) Stage 1: Prepare LibriLight manifest
2023-08-30 21:21:36,682 INFO [prepare_manifest.py:230] {'corpus_dir': PosixPath('/mnt/speech1/anna/text_search/examples/libriheavy/download/libri-light'), 'books_dir': PosixPath('/mnt/speech1/anna/text_search/examples/libriheavy/download/librilight_text'), 'output_dir': PosixPath('/mnt/speech1/anna/text_search/examples/libriheavy/data/manifests'), 'num_jobs': 10}
2023-08-30 21:21:36,682 INFO [prepare_manifest.py:196] Preparing LibriLight...
Dataset parts: 0%| | 0/3 [00:00<?, ?it/s]2023-08-30 21:21:36,684 INFO [prepare_manifest.py:205] Processing LibriLight subset: small
Distributing tasks: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2588/2588 [00:00<00:00, 97380.94it/s]
Processing: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2588/2588 [00:15<00:00, 171.84it/s]
Dataset parts: 33%|████████████████████████████████████████████▋
Dataset parts: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:15<00:00, 5.28s/it]
2023-08-30 21:21:52,512 INFO [prepare_manifest.py:239] Done.
2023-08-30 21:21:52 (run_small.sh:70:main) Stage 2: Split long audio into chunks
2023-08-30 21:21:53,409 INFO [split_into_chunks.py:68] {'manifest_in': PosixPath('/mnt/speech1/anna/text_search/examples/libriheavy/data/manifests/librilight_raw_cuts_small.jsonl.gz'), 'manifest_out': PosixPath('/mnt/speech1/anna/text_search/examples/libriheavy/data/manifests/librilight_chunk_cuts_small.jsonl.gz'), 'chunk': 30.0, 'extra': 2.0}
2023-08-30 21:21:53,409 INFO [split_into_chunks.py:79] Processing /mnt/speech1/anna/text_search/examples/libriheavy/data/manifests/librilight_raw_cuts_small.jsonl.gz.
2023-08-30 21:21:58,185 INFO [split_into_chunks.py:93] Cuts saved to /mnt/speech1/anna/text_search/examples/libriheavy/data/manifests/librilight_chunk_cuts_small.jsonl.gz
2023-08-30 21:21:58 (run_small.sh:89:main) Stage 3: Perform speech recognition on splitted chunks
2023-08-30 21:22:04,385 INFO [recognize.py:314] (0/2) Decoding started
2023-08-30 21:22:04,399 INFO [recognize.py:314] (1/2) Decoding started
2023-08-30 21:22:04,405 INFO [recognize.py:324] (1/2) {'subsampling_factor': 4, 'frame_shift_ms': 10, 'world_size': 2, 'master_port': 12346, 'manifest_in': PosixPath('/mnt/speech1/anna/text_search/examples/libriheavy/data/manifests/librilight_chunk_cuts_small.jsonl.gz'), 'manifest_out': PosixPath('/mnt/speech1/anna/text_search/examples/libriheavy/data/manifests/librilight_asr_cuts_small.jsonl.gz'), 'log_dir': PosixPath('logs'), 'nn_model_filename': 'exp/exp/jit_script.pt', 'tokens': 'exp/data/lang_bpe_500/tokens.txt', 'decoding_method': 'greedy_search', 'max_duration': 1200, 'return_cuts': True, 'num_mel_bins': 80, 'num_workers': 8, 'manifest_out_dir': PosixPath('/mnt/speech1/anna/text_search/examples/libriheavy/data/manifests'), 'suffix': '.jsonl.gz', 'cuts_filename': 'librilight_asr_cuts_small', 'blank_id': 0, 'unk_id': 2, 'vocab_size': 500}
2023-08-30 21:22:04,405 INFO [recognize.py:324] (0/2) {'subsampling_factor': 4, 'frame_shift_ms': 10, 'world_size': 2, 'master_port': 12346, 'manifest_in': PosixPath('/mnt/speech1/anna/text_search/examples/libriheavy/data/manifests/librilight_chunk_cuts_small.jsonl.gz'), 'manifest_out': PosixPath('/mnt/speech1/anna/text_search/examples/libriheavy/data/manifests/librilight_asr_cuts_small.jsonl.gz'), 'log_dir': PosixPath('logs'), 'nn_model_filename': 'exp/exp/jit_script.pt', 'tokens': 'exp/data/lang_bpe_500/tokens.txt', 'decoding_method': 'greedy_search', 'max_duration': 1200, 'return_cuts': True, 'num_mel_bins': 80, 'num_workers': 8, 'manifest_out_dir': PosixPath('/mnt/speech1/anna/text_search/examples/libriheavy/data/manifests'), 'suffix': '.jsonl.gz', 'cuts_filename': 'librilight_asr_cuts_small', 'blank_id': 0, 'unk_id': 2, 'vocab_size': 500}
2023-08-30 21:22:05,773 INFO [recognize.py:329] (0/2) device: cuda:0
2023-08-30 21:22:05,773 INFO [recognize.py:331] (0/2) Loading jit model
2023-08-30 21:22:05,773 INFO [recognize.py:329] (1/2) device: cuda:1
2023-08-30 21:22:05,773 INFO [recognize.py:331] (1/2) Loading jit model
2023-08-30 21:22:18,183 INFO [recognize.py:290] (0/2) cuts processed until now is 40
2023-08-30 21:22:23,089 INFO [recognize.py:290] (1/2) cuts processed until now is 40
/home/np/anna/np_env/lib/python3.8/site-packages/torch/nn/modules/module.py:1501: UserWarning: operator() profile_node %626 : int = prim::profile_ivalue(%624)
does not have profile information (Triggered internally at ../third_party/nvfuser/csrc/graph_fuser.cpp:104.)
return forward_call(*args, **kwargs)
/home/np/anna/np_env/lib/python3.8/site-packages/torch/nn/modules/module.py:1501: UserWarning: FALLBACK path has been taken inside: compileCudaFusionGroup. This is an indication that codegen Failed for some reason.
To debug try disable codegen fallback path via setting the env variable `export PYTORCH_NVFUSER_DISABLE=fallback`
To report the issue, try enable logging via setting the envvariable ` export PYTORCH_JIT_LOG_LEVEL=manager.cpp`
(Triggered internally at ../third_party/nvfuser/csrc/manager.cpp:243.)
return forward_call(*args, **kwargs)
/home/np/anna/np_env/lib/python3.8/site-packages/torch/nn/modules/module.py:1501: UserWarning: FALLBACK path has been taken inside: runCudaFusionGroup. This is an indication that codegen Failed for some reason.
To debug try disable codegen fallback path via setting the env variable `export PYTORCH_NVFUSER_DISABLE=fallback`
(Triggered internally at ../third_party/nvfuser/csrc/manager.cpp:335.)
return forward_call(*args, **kwargs)
Traceback (most recent call last):
File "./tools/recognize.py", line 426, in <module>
main()
File "./tools/recognize.py", line 402, in main
mp.spawn(
File "/home/np/anna/np_env/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 239, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/home/np/anna/np_env/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 197, in start_processes
while not context.join():
File "/home/np/anna/np_env/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 160, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/home/np/anna/np_env/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
fn(i, *args)
File "/home/np/anna/np_env/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/mnt/speech1/anna/text_search/examples/libriheavy/tools/recognize.py", line 353, in run
decode_dataset(
File "/mnt/speech1/anna/text_search/examples/libriheavy/tools/recognize.py", line 278, in decode_dataset
hyps, timestamps, scores = decode_one_batch(
File "/mnt/speech1/anna/text_search/examples/libriheavy/tools/recognize.py", line 182, in decode_one_batch
encoder_out, encoder_out_lens = model.encoder(
File "/home/np/anna/np_env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: shape '[1, 0, 2]' is invalid for input of size 15319040
(np_env) np@np-INTEL:/mnt/speech1/anna/text_search/examples/libriheavy$
My environment that is not working for the facebook dataset or my dataset (it always fails at stage 3):
(np_env) np@np-INTEL:/mnt/speech1/anna/text_search/examples/libriheavy$ pip list
Package Version
------------------------ --------------------------------------
absl-py 1.4.0
audioread 3.0.0
cachetools 5.3.1
certifi 2023.7.22
cffi 1.15.1
charset-normalizer 3.2.0
click 8.1.7
cmake 3.27.2
cytoolz 0.12.2
dataclasses 0.6
dill 0.3.7
fasttextsearch 0.6
filelock 3.12.2
google-auth 2.22.0
google-auth-oauthlib 1.0.0
graphviz 0.20.1
grpcio 1.57.0
idna 3.4
importlib-metadata 6.8.0
intervaltree 3.1.0
Jinja2 3.1.2
k2 1.24.3.dev20230718+cuda11.7.torch2.0.1
kaldialign 0.7.1
kaldifeat 1.25.0.dev20230726+cuda11.7.torch2.0.1
kaldifst 1.6
kaldilm 1.15
lhotse 1.16.0.dev0+git.2046b55d.clean
lilcom 1.7
lit 16.0.6
Markdown 3.4.4
MarkupSafe 2.1.3
mpmath 1.3.0
networkx 3.1
numpy 1.24.4
nvidia-cublas-cu11 11.10.3.66
nvidia-cuda-cupti-cu11 11.7.101
nvidia-cuda-nvrtc-cu11 11.7.99
nvidia-cuda-runtime-cu11 11.7.99
nvidia-cudnn-cu11 8.5.0.96
nvidia-cufft-cu11 10.9.0.58
nvidia-curand-cu11 10.2.10.91
nvidia-cusolver-cu11 11.4.0.1
nvidia-cusparse-cu11 11.7.4.91
nvidia-nccl-cu11 2.14.3
nvidia-nvtx-cu11 11.7.91
oauthlib 3.2.2
packaging 23.1
Pillow 10.0.0
pip 20.0.2
pkg-resources 0.0.0
protobuf 4.24.1
pyasn1 0.5.0
pyasn1-modules 0.3.0
pycipher 0.5.2
pycparser 2.21
PyYAML 6.0.1
regex 2023.8.8
requests 2.31.0
requests-oauthlib 1.3.1
rsa 4.9
sentencepiece 0.1.99
setuptools 44.0.0
six 1.16.0
sortedcontainers 2.4.0
soundfile 0.12.1
sympy 1.12
tabulate 0.9.0
tensorboard 2.14.0
tensorboard-data-server 0.7.1
termcolor 2.3.0
toolz 0.12.0
torch 2.0.1+cu117
torchaudio 2.0.2+cu117
torchvision 0.15.2+cu117
tqdm 4.66.1
triton 2.0.0
typeguard 4.1.2
typing-extensions 4.7.1
urllib3 2.0.4
werkzeug 2.3.7
wheel 0.41.1
zipp 3.16.2
(np_env) np@np-INTEL:/mnt/speech1/anna/text_search/examples/libriheavy$
Environment that works, although the duration of m4a audios was wrong in the Lhotse scripts for my dataset; it works well with the facebook dataset:
np@np-INTEL:/mnt/speech3$ source venv_ts/bin/activate
(test-textsearch) np@np-INTEL:/mnt/speech3$ pip list
Package Version
----------------------- ---------------------------------------
absl-py 1.4.0
appdirs 1.4.3
apturl 0.5.2
attrs 19.3.0
audioread 3.0.0
Automat 0.8.0
backcall 0.1.0
bcrypt 3.1.7
beautifulsoup4 4.8.2
bleach 6.0.0
blinker 1.4
boto 2.49.0
Brlapi 0.7.0
cachetools 5.3.1
certifi 2019.11.28
cffi 1.15.1
chardet 3.0.4
click 8.1.6
colorama 0.4.3
command-not-found 0.3
constantly 15.1.0
cryptography 2.8
cssselect 1.1.0
cupshelpers 1.0
cytoolz 0.12.1
dataclasses 0.6
dbus-python 1.2.16
decorator 4.4.2
defer 1.0.6
dill 0.3.7
distlib 0.3.0
distro 1.4.0
distro-info 0.23ubuntu1
duplicity 0.8.12.0
entrypoints 0.3
fasteners 0.14.1
fasttextsearch 0.6
filelock 3.0.12
future 0.18.2
google-auth 2.22.0
google-auth-oauthlib 1.0.0
graphviz 0.20.1
grpcio 1.56.2
html5lib 1.0.1
httplib2 0.14.0
hyperlink 19.0.0
idna 2.8
importlib-metadata 6.8.0
incremental 16.10.1
intervaltree 3.1.0
ipython 7.13.0
ipython_genutils 0.2.0
jedi 0.15.2
k2 1.24.3.dev20230725+cuda11.3.torch1.12.1
kaggle 1.5.16
kaldialign 0.7.1
kaldifeat 1.25.0.dev20230726+cuda11.3.torch1.12.1
kaldifst 1.6
kaldilm 1.15
keyring 18.0.1
language-selector 0.1
launchpadlib 1.10.13
lazr.restfulclient 0.14.2
lazr.uri 1.0.3
lhotse 1.16.0
lilcom 1.7
lockfile 0.12.2
louis 3.12.0
lxml 4.5.0
macaroonbakery 1.3.1
Mako 1.1.0
Markdown 3.4.3
MarkupSafe 2.1.3
monotonic 1.5
more-itertools 4.2.0
mysqlclient 1.4.4
netifaces 0.10.4
numpy 1.24.4
oauthlib 3.1.0
olefile 0.46
packaging 23.1
paramiko 2.6.0
parsel 1.5.2
parso 0.5.2
pexpect 4.6.0
pickleshare 0.7.5
Pillow 7.0.0
pip 23.2
pipenv 11.9.0
prompt-toolkit 2.0.10
protobuf 4.23.4
pyasn1 0.4.2
pyasn1-modules 0.2.1
pycairo 1.16.2
pycparser 2.21
pycups 1.9.73
PyDispatcher 2.0.5
Pygments 2.3.1
PyGObject 3.36.0
PyHamcrest 1.9.0
PyJWT 1.7.1
pymacaroons 0.13.0
PyNaCl 1.3.0
pyOpenSSL 19.0.0
pyRFC3339 1.1
python-apt 2.0.0+ubuntu0.20.4.8
python-dateutil 2.7.3
python-debian 0.1.36ubuntu1
python-slugify 8.0.1
pytz 2019.3
pyxdg 0.26
PyYAML 5.3.1
queuelib 1.5.0
regex 2023.6.3
reportlab 3.5.34
requests 2.22.0
requests-oauthlib 1.3.1
requests-unixsocket 0.2.0
rsa 4.9
Scrapy 1.7.3
SecretStorage 2.3.1
sentencepiece 0.1.99
service-identity 18.1.0
setuptools 45.2.0
simplejson 3.16.0
six 1.14.0
sortedcontainers 2.4.0
soundfile 0.12.1
soupsieve 1.9.5
ssh-import-id 5.10
systemd-python 234
tabulate 0.9.0
tensorboard 2.13.0
tensorboard-data-server 0.7.1
text-unidecode 1.3
toolz 0.12.0
torch 1.12.1+cu113
torchaudio 0.12.1+cu113
torchvision 0.13.1+cu113
tqdm 4.65.0
traitlets 4.3.3
Twisted 18.9.0
typeguard 4.0.0
typing_extensions 4.7.1
ubuntu-advantage-tools 27.9
ubuntu-drivers-common 0.0.0
ufw 0.36
unattended-upgrades 0.1
urllib3 1.25.8
usb-creator 0.3.7
virtualenv 20.0.17
virtualenv-clone 0.3.0
w3lib 1.21.0
wadllib 1.3.3
wcwidth 0.1.8
webencodings 0.5.1
Werkzeug 2.3.6
wheel 0.34.2
xkit 0.0.0
youtube-dl 2021.12.17
zipp 1.0.0
zope.interface 4.7.1
[notice] A new release of pip is available: 23.2 -> 23.2.1
[notice] To update, run: python3 -m pip install --upgrade pip
(test-textsearch) np@np-INTEL:/mnt/speech3$
Guys,
I have an idea for how we could quickly obtain an alignment from the suffix-array stuff.
This version would make the most sense if it was done from one single long sequence to another,
e.g. from one book to an entire chapter.
A previous algorithm, whose name I forget, gave us a set of "close matches" obtained from
a suffix array algorithm.
Let us think of these "close matches" as a set of (i, j) pairs, where i is an index into
one sequence and j is an index into the other.
We can define the task of getting an approximate alignment between the two sequences
as the task of finding the longest chain of pairs:
(i1, j1), (i2, j2), ... (iN, jN)
such that i1 <= i2 <= ... <= iN, and j1 <= j2 <= ... <= jN.
One way that we could do this is as follows:
First sort the list of pairs on (i, then j).
Then, for each pair p == (i, j) in the sorted list, set
count[p] = 1 + max_q(count[q])
where q ranges over all previously processed pairs, limited to those with
q.i <= p.i and q.j <= p.j (treating them like pairs in Python, so [0] and [1]
return the i and j values). We take the "max" expression to be 0 if there is
no q satisfying these constraints.
When we are done processing all pairs, the last element of the longest chain is
the pair with the largest count. We can store "back-pointers" for each processed
pair telling us which previous pair achieved the "max".
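The recurrence above, written out directly in Python with back-pointers; a quadratic reference sketch to pin down the definition (names are mine):

```python
def longest_chain_quadratic(pairs):
    """Return the longest chain of (i, j) pairs with non-decreasing i and j,
    via the O(n^2) count[p] = 1 + max_q(count[q]) recurrence."""
    pairs = sorted(pairs)                # sort on (i, then j)
    count = [1] * len(pairs)
    prev = [-1] * len(pairs)             # back-pointer; -1 marks chain start
    for n, (i, j) in enumerate(pairs):
        for q in range(n):               # all previously processed pairs
            qi, qj = pairs[q]
            if qi <= i and qj <= j and count[q] + 1 > count[n]:
                count[n] = count[q] + 1
                prev[n] = q
    # trace back from the pair with the largest count
    best = max(range(len(pairs)), key=count.__getitem__) if pairs else -1
    chain = []
    while best != -1:
        chain.append(pairs[best])
        best = prev[best]
    return chain[::-1]
```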
Now, the above algorithm is quadratic on the face of it, but I think in C++ we can
make it n log(n) by exploiting the properties of std::set, as follows.
A std::set in C++ is stored in sorted order but supports efficient insertion.
Concretely, the algorithm would be like this. I'll just use int as the type;
you can decide what to use.
// These GoodMatch items, at least the i and j elements, are an input;
// they are sorted on (i, then j). The index into this list is n.
struct GoodMatch {
  int i;       // position of "good match" in 1st sequence
  int j;       // position of "good match" in 2nd sequence
  int prev_n;  // previous n in backtrace, index into sorted_pairs; set below.
  bool operator<(const GoodMatch &other) const {  // sort on i, then j.
    if (i != other.i) return i < other.i;
    return j < other.j;
    // note: could perhaps pack i and j into a single int64 with << and |;
    // not sure if it matters.
  }
};
// Sorted by i, then j; the prev_n values can be undefined for now.
std::vector<GoodMatch> sorted_pairs;  // this is created somehow, then sorted.
struct GoodMatchCount {
  int j;
  int n;
  mutable int c;  // mutable: c is filled in after insertion and is not part
                  // of the ordering, so this does not break std::set.
  bool operator<(const GoodMatchCount &other) const {
    if (j != other.j) return j < other.j;
    return n < other.n;
  }
};
// In addition to being sorted on j, the counts will always have the
// property that the c values are strictly increasing. We ensure this
// by removing dominated elements as necessary.
std::set<GoodMatchCount> cur_counts;
cur_counts.insert({-1, -1, 0});  // sentinel, so every element has a predecessor.
int N = sorted_pairs.size();
for (int n = 0; n < N; ++n) {
  int j = sorted_pairs[n].j;
  auto p = cur_counts.insert({j, n, 0});  // c is filled in below.
  assert(p.second);  // (j, n) is unique, so insertion must succeed.
  auto iter = p.first;
  auto prev_iter = std::prev(iter);  // prev_iter->j <= j; also
  // prev_iter->i <= sorted_pairs[n].i, because sorted_pairs is sorted on i.
  int c = prev_iter->c + 1;
  sorted_pairs[n].prev_n = prev_iter->n;  // backtrace for the new element.
  iter->c = c;
  // Erase all following elements gmc with gmc.c <= c: for those j values
  // the count is no better than the one we just added, so we would never
  // want to backtrace to them. This keeps the c values strictly increasing.
  auto next = std::next(iter);
  while (next != cur_counts.end() && next->c <= c)
    next = cur_counts.erase(next);
}
// After we are done iterating, the "best traceback" starts at the final
// element of cur_counts, i.e. *cur_counts.rbegin(), and we trace back
// using the prev_n members.
The idea is that we can use the traceback through these "good matches" as the backbone for the Levenshtein alignment, limiting the Levenshtein alignment to blocks defined by the (i, j) positions in this traceback. (Whether we use the maximum possible number of blocks or not, I'm not sure.)
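As a quick sanity check of the n log(n) claim: once the pairs are sorted on (i, then j), the chain length equals the length of the longest non-decreasing subsequence of the j values, which bisect computes in O(n log n). This sketch returns only the length, not the backtrace:

```python
import bisect

def longest_chain_length(pairs):
    """Length of the longest chain of (i, j) pairs with non-decreasing i, j."""
    js = [j for _, j in sorted(pairs)]  # sort on (i, then j), keep j values
    tails = []  # tails[k] = smallest possible tail j of a chain of length k+1
    for j in js:
        k = bisect.bisect_right(tails, j)  # bisect_right: ties extend chains
        if k == len(tails):
            tails.append(j)
        else:
            tails[k] = j
    return len(tails)
```

The tails list here plays the same role as cur_counts in the C++ sketch: a frontier of best chain endings, kept sorted with dominated entries overwritten.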