Comments (4)

pkufool commented on May 27, 2024

I think I can reproduce this issue with your envs.

Here are my logs:

python tools/recognize.py --world-size 1 --manifest-in data/manifests/librilight_chunk_cuts_small.jsonl.gz --manifest-out librilight_cuts_test2.jsonl.gz --nn-model-filename exp/exp/jit_script.pt --tokens exp/data/lang_bpe_500/tokens.txt
2023-08-31 15:27:31,194 INFO [recognize.py:323] Decoding started
2023-08-31 15:27:31,197 INFO [recognize.py:336] {'subsampling_factor': 4, 'frame_shift_ms': 10, 'beam_size': 4, 'world_size': 1, 'master_port': 12354, 'manifest_in': PosixPath('data/manifests/librilight_chunk_cuts_small.jsonl.gz'), 'manifest_out': PosixPath('librilight_cuts_test2.jsonl.gz'), 'log_dir': PosixPath('logs'), 'nn_model_filename': 'exp/exp/jit_script.pt', 'tokens': 'exp/data/lang_bpe_500/tokens.txt', 'decoding_method': 'greedy_search', 'max_duration': 600.0, 'return_cuts': True, 'num_mel_bins': 80, 'num_workers': 8, 'manifest_out_dir': PosixPath('.'), 'suffix': '.jsonl.gz', 'cuts_filename': 'librilight_cuts_test2', 'blank_id': 0, 'unk_id': 2, 'vocab_size': 500}
2023-08-31 15:27:31,268 INFO [recognize.py:341] device: cuda:0
2023-08-31 15:27:31,269 INFO [recognize.py:343] Loading jit model
2023-08-31 15:27:44,860 INFO [recognize.py:299] cuts processed until now is 20
/star-kw/kangwei/dev_tools/anaconda/envs/textsearch/lib/python3.10/site-packages/torch/nn/modules/module.py:1501: UserWarning: operator() profile_node %626 : int = prim::profile_ivalue(%624)
 does not have profile information (Triggered internally at ../third_party/nvfuser/csrc/graph_fuser.cpp:104.)
  return forward_call(*args, **kwargs)
/star-kw/kangwei/dev_tools/anaconda/envs/textsearch/lib/python3.10/site-packages/torch/nn/modules/module.py:1501: UserWarning: FALLBACK path has been taken inside: compileCudaFusionGroup. This is an indication that codegen Failed for some reason.
To debug try disable codegen fallback path via setting the env variable `export PYTORCH_NVFUSER_DISABLE=fallback`
To report the issue, try enable logging via setting the envvariable ` export PYTORCH_JIT_LOG_LEVEL=manager.cpp`
 (Triggered internally at ../third_party/nvfuser/csrc/manager.cpp:243.)
  return forward_call(*args, **kwargs)
/star-kw/kangwei/dev_tools/anaconda/envs/textsearch/lib/python3.10/site-packages/torch/nn/modules/module.py:1501: UserWarning: FALLBACK path has been taken inside: runCudaFusionGroup. This is an indication that codegen Failed for some reason.
To debug try disable codegen fallback path via setting the env variable `export PYTORCH_NVFUSER_DISABLE=fallback`
 (Triggered internally at ../third_party/nvfuser/csrc/manager.cpp:335.)
  return forward_call(*args, **kwargs)
Traceback (most recent call last):
  File "/star-kw/kangwei/code/text_search/examples/libriheavy/tools/recognize.py", line 433, in <module>
    main()
  File "/star-kw/kangwei/code/text_search/examples/libriheavy/tools/recognize.py", line 426, in main
    run(rank=0, world_size=world_size, args=args, in_cuts=in_cuts)
  File "/star-kw/kangwei/dev_tools/anaconda/envs/textsearch/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/star-kw/kangwei/code/text_search/examples/libriheavy/tools/recognize.py", line 365, in run
    decode_dataset(
  File "/star-kw/kangwei/code/text_search/examples/libriheavy/tools/recognize.py", line 287, in decode_dataset
    hyps, timestamps, scores = decode_one_batch(
  File "/star-kw/kangwei/code/text_search/examples/libriheavy/tools/recognize.py", line 184, in decode_one_batch
    encoder_out, encoder_out_lens = model.encoder(
  File "/star-kw/kangwei/dev_tools/anaconda/envs/textsearch/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):                               
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):                               
RuntimeError: shape '[1, 0, 2]' is invalid for input of size 7659520
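
For readers unfamiliar with this error: a reshape to [1, 0, 2] describes a tensor with zero elements, while the input has 7659520, so the two can never match. A minimal sketch of that element-count check in plain Python follows. The helper name can_view is hypothetical and is not a PyTorch API; it only illustrates the arithmetic behind the message.

```python
from math import prod


def can_view(numel, shape):
    """Return True if `numel` elements fit exactly into `shape`.

    Hypothetical helper for illustration only; it assumes `shape`
    contains no -1 (inferred) dimension.
    """
    return prod(shape) == numel


if __name__ == "__main__":
    # Values taken from the RuntimeError in the log above.
    print(prod([1, 0, 2]))               # number of elements the target shape holds
    print(can_view(7659520, [1, 0, 2]))  # whether the reshape is possible
```

The zero in the middle dimension is what makes the shape invalid; the check says nothing about which op produced that zero.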

After I ran export PYTORCH_NVFUSER_DISABLE=fallback, the logs became:

python tools/recognize.py --world-size 1 --manifest-in data/manifests/librilight_chunk_cuts_small.jsonl.gz --manifest-out librilight_cuts_test2.jsonl.gz --nn-model-filename exp/exp/jit_script.pt --tokens exp/data/lang_bpe_500/tokens.txt 
2023-08-31 15:33:15,659 INFO [recognize.py:323] Decoding started
2023-08-31 15:33:15,663 INFO [recognize.py:336] {'subsampling_factor': 4, 'frame_shift_ms': 10, 'beam_size': 4, 'world_size': 1, 'master_port': 12354, 'manifest_in': PosixPath('data/manifests/librilight_chunk_cuts_small.jsonl.gz'), 'manifest_out': PosixPath('librilight_cuts_test2.jsonl.gz'), 'log_dir': PosixPath('logs'), 'nn_model_filename': 'exp/exp/jit_script.pt', 'tokens': 'exp/data/lang_bpe_500/tokens.txt', 'decoding_method': 'greedy_search', 'max_duration': 600.0, 'return_cuts': True, 'num_mel_bins': 80, 'num_workers': 8, 'manifest_out_dir': PosixPath('.'), 'suffix': '.jsonl.gz', 'cuts_filename': 'librilight_cuts_test2', 'blank_id': 0, 'unk_id': 2, 'vocab_size': 500}
2023-08-31 15:33:15,737 INFO [recognize.py:341] device: cuda:0
2023-08-31 15:33:15,737 INFO [recognize.py:343] Loading jit model
2023-08-31 15:33:29,322 INFO [recognize.py:299] cuts processed until now is 20
/star-kw/kangwei/dev_tools/anaconda/envs/textsearch/lib/python3.10/site-packages/torch/nn/modules/module.py:1501: UserWarning: operator() profile_node %626 : int = prim::profile_ivalue(%624)
 does not have profile information (Triggered internally at ../third_party/nvfuser/csrc/graph_fuser.cpp:104.)
  return forward_call(*args, **kwargs)
Traceback (most recent call last):
  File "/star-kw/kangwei/code/text_search/examples/libriheavy/tools/recognize.py", line 433, in <module>
    main()
  File "/star-kw/kangwei/code/text_search/examples/libriheavy/tools/recognize.py", line 426, in main
    run(rank=0, world_size=world_size, args=args, in_cuts=in_cuts)
  File "/star-kw/kangwei/dev_tools/anaconda/envs/textsearch/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/star-kw/kangwei/code/text_search/examples/libriheavy/tools/recognize.py", line 365, in run
    decode_dataset(
  File "/star-kw/kangwei/code/text_search/examples/libriheavy/tools/recognize.py", line 287, in decode_dataset
    hyps, timestamps, scores = decode_one_batch(
  File "/star-kw/kangwei/code/text_search/examples/libriheavy/tools/recognize.py", line 184, in decode_one_batch
    encoder_out, encoder_out_lens = model.encoder(
  File "/star-kw/kangwei/dev_tools/anaconda/envs/textsearch/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
RuntimeError: dims.value().size() == self->getMaybeRFactorDomain().size() INTERNAL ASSERT FAILED at "../third_party/nvfuser/csrc/parser.cpp":3399, please report a bug to PyTorch. 

I think it is a bug in the new version of pytorch.
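
For anyone who wants to repeat the second run, here is a rough sketch of setting the two environment variables that the nvFuser warnings mention before launching the same command. The names NVFUSER_DEBUG_VARS, RECOGNIZE_CMD, and build_debug_env are made up for this example; only the variable names, their values, and the command line come from the logs above.

```python
import os
import shlex

# Environment variables quoted in the nvFuser warnings in this thread.
NVFUSER_DEBUG_VARS = {
    "PYTORCH_NVFUSER_DISABLE": "fallback",   # disable the codegen fallback path
    "PYTORCH_JIT_LOG_LEVEL": "manager.cpp",  # logging setting suggested by the warning
}

# The recognize.py command copied from the logs above.
RECOGNIZE_CMD = (
    "python tools/recognize.py --world-size 1 "
    "--manifest-in data/manifests/librilight_chunk_cuts_small.jsonl.gz "
    "--manifest-out librilight_cuts_test2.jsonl.gz "
    "--nn-model-filename exp/exp/jit_script.pt "
    "--tokens exp/data/lang_bpe_500/tokens.txt"
)


def build_debug_env(base_env=None):
    """Return a copy of the environment with the nvFuser debug variables set.

    Hypothetical helper: it copies `base_env` (or os.environ) and never
    modifies the mapping it was given.
    """
    env = dict(os.environ if base_env is None else base_env)
    env.update(NVFUSER_DEBUG_VARS)
    return env


if __name__ == "__main__":
    env = build_debug_env()
    args = shlex.split(RECOGNIZE_CMD)
    # Show what would be launched; to actually run it, pass both to
    # subprocess.run(args, env=env) from the examples/libriheavy directory.
    print(args)
    print({name: env[name] for name in NVFUSER_DEBUG_VARS})
```

Building the environment as a copy keeps the variables scoped to the one child process, so the shell session itself stays unchanged.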

from text_search.

danpovey commented on May 27, 2024

Maybe we can somehow figure out which op it was running, so that we can work around it? It would be a shame if we can't run inference with our models in PyTorch 2.0.1.

npovey commented on May 27, 2024

Not sure if it is relevant, but I use the same env to train icefall models. I was able to train a model for 150 epochs with an icefall recipe, using the PyTorch 2.0.1 env given at the top. I only get an error when running stage 3 of the k2-fsa/text_search script https://github.com/k2-fsa/text_search/blob/master/examples/libriheavy/run.sh, as described above.

pkufool commented on May 27, 2024

Maybe we can somehow figure out which op it was running, so that we can work around it? It would be a shame if we can't run inference with our models in PyTorch 2.0.1.

I can't see any stack frames in the traceback, but I think we can first try exporting the model with PyTorch 2.0.1.
