aalireza / simpleaudioindexer

Searching for the occurrence seconds of words/phrases or arbitrary regex patterns within audio files

License: Apache License 2.0

Python 99.69% Dockerfile 0.31%

audio audio-index regex search search-engine search-in-audio sox watson python-library command-line-tool

simpleaudioindexer's Introduction

WARNING: I no longer have the time to maintain this library. Contact me if you want to become the maintainer.

SimpleAudioIndexer

Description
What can it do?
Documentation
Requirements
Installation
Uninstallation
Demo
Nice to implement in the future
Contributing
Authors
License

Description

This is a Python library and command-line tool that helps you search for a word or a phrase within an audio file (wav format). It also builds upon the initial searching capability and provides some [so-called] advanced searching abilities!

What can it do?

Index audio files (using Watson (Online/Higher-quality) or CMU Pocketsphinx (Offline/Lower-quality)) and save/load the results.
Search within audio files in multiple languages (default is English).
Define a timing error for your queries to handle discrepancies.
Define constraints on your queries, e.g. whether to include (sub/super)sequences, results with missing words etc.
Do full blown regex pattern matching!

Documentation

To read the documentation, visit here.

Requirements

Python (v2.7, 3.3, 3.4, 3.5 or 3.6) with pip installed.
Watson API Credentials and/or CMU Pocketsphinx
sox
ffmpeg (if you choose CMU Pocketsphinx)
py.test and tox (if you want to run the tests)

Installation

Open up a terminal and enter:

pip install SimpleAudioIndexer

Installation details can be found in the documentation here.

There's a Dockerfile included within the repo if you're unable to do a native installation or are on a Windows system.

Uninstallation

Open up a terminal and enter:

pip uninstall SimpleAudioIndexer

Uninstalling sox, however, is dependent upon whether you're on a Linux or Mac system. For more information, visit here.

Demo

Say you have this audio file:

Download it to an empty directory for simplicity. We'll refer to that directory as SRC_DIR and to the audio file as small_audio.wav.

Here's how you can search through it.

Command-line Usage

Open up a terminal and enter:

$ sai --mode "ibm" --username_ibm USERNAME --password_ibm PASSWORD --src_dir SRC_DIR --search "called"

{'called': {'small_audio.wav': [(1.25, 1.71)]}}

Replace USERNAME and PASSWORD with your IBM Watson's credentials and SRC_DIR with the absolute path to the directory you just prepared.

The output, as shown above, is a dictionary that maps the query to the file(s) it appears in, and each file to a list of all the (starting second, ending second) pairs where that query occurs.
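As a small illustration of that structure, the sketch below walks a result dictionary of the shape printed above and flattens it into (query, file, start, end) rows. The helper name flatten_results is hypothetical and not part of SimpleAudioIndexer; the sample data is copied from the command output above.

```python
def flatten_results(results):
    """Yield (query, file_name, start, end) for every hit in a result dict."""
    for query, files in results.items():
        for file_name, spans in files.items():
            for start, end in spans:
                yield query, file_name, start, end


# Sample data copied from the command output above.
results = {'called': {'small_audio.wav': [(1.25, 1.71)]}}
rows = list(flatten_results(results))
for query, file_name, start, end in rows:
    print("%s: %r from %.2fs to %.2fs" % (file_name, query, start, end))
```

Flat rows like these are easier to sort by time or write out to a CSV file than the nested dictionary.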

Note that all commands work uniformly for other engines (i.e. Pocketsphinx). For example, the command above can be entered as:

$ sai --mode "cmu" --src_dir SRC_DIR --search "lives"

{'lives': {'small_audio.wav': [(3.12, 3.88)]}}

This uses Pocketsphinx instead of Watson to get the timestamps. Note that the quality/accuracy of Pocketsphinx is much lower than that of Watson.

Instead of searching for a word, you could also match a regex pattern, for example:

$ sai --mode ibm --src_dir SRC_DIR --username_ibm USERNAME --password_ibm PASSWORD --regexp " [a-z][a-z] "

{u' in ': {'small_audio.wav': [(2.81, 2.93)]},
 u' to ': {'small_audio.wav': [(1.71, 1.81)]}}

That was the result of searching for two-letter words. Note that the pattern can be any arbitrary regular expression.
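To see how a regex match can be mapped back to seconds, here is a minimal sketch of the idea. It is not SimpleAudioIndexer's actual implementation: it joins timestamped words into one string, remembers each word's character offsets, and reports the time span of the words a match overlaps. The function regex_spans and the word timings are hypothetical, and the sketch always rounds a match outward to whole words.

```python
import re


def regex_spans(words, pattern):
    """Return [(matched_text, (start_sec, end_sec)), ...] for each regex match.

    `words` is a list of (word, start_sec, end_sec) tuples in spoken order.
    """
    text = ""
    offsets = []  # (char_start, char_end, start_sec, end_sec) per word
    for word, start, end in words:
        if text:
            text += " "
        offsets.append((len(text), len(text) + len(word), start, end))
        text += word

    hits = []
    for match in re.finditer(pattern, text):
        # Keep the words whose character range overlaps the match.
        touched = [o for o in offsets
                   if o[0] < match.end() and o[1] > match.start()]
        if touched:
            hits.append((match.group(0), (touched[0][2], touched[-1][3])))
    return hits


# Hypothetical timings, made up for illustration only.
words = [("we", 0.0, 0.2), ("go", 0.2, 0.5), ("to", 0.5, 0.6), ("town", 0.6, 1.0)]
hits = regex_spans(words, r"\b[a-z]{2}\b")
print(hits)
```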

You may also save and load the indexed data from the command line script. For more information, visit here.

Library Usage

Say you have the same audio file as above. First, import the library:

>>> from SimpleAudioIndexer import SimpleAudioIndexer as sai

Afterwards, you should create an instance of sai

>>> indexer = sai(mode="ibm", src_dir="SRC_DIR", username_ibm="USERNAME", password_ibm="PASSWORD")

Now you may index all the available audio files by calling the index_audio method:

>>> indexer.index_audio()

You could have a searching generator:

>>> searcher = indexer.search_gen(query="called")
>>> print(next(searcher))
{'Query': 'called', 'File Name': 'small_audio.wav', 'Result': (1.25, 1.71)}

There are quite a few more arguments implemented for search_gen. Say you wanted your search to be case sensitive (by default it's not). Or, say you wanted to look for a phrase but there's a timing gap and the indexer didn't pick it up right; then you could specify timing_error. Or, say some word is completely missed; then you could specify missing_word_tolerance, and so on.
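To make the idea of a timing tolerance concrete, here is a rough sketch of a phrase search that accepts a small pause between consecutive words. It only illustrates the concept; the function find_phrase, its max_gap parameter, and the word timings are hypothetical, and the real semantics of timing_error are those described in the API reference.

```python
def find_phrase(words, phrase, max_gap=0.0):
    """Find `phrase` in timestamped words, allowing small gaps between words.

    `words` is a list of (word, start_sec, end_sec) tuples in spoken order.
    Consecutive words match only if the pause between them is <= max_gap.
    """
    target = phrase.lower().split()
    if not target:
        return []
    hits = []
    for i in range(len(words) - len(target) + 1):
        window = words[i:i + len(target)]
        if [w.lower() for w, _, _ in window] != target:
            continue
        # Pause between the end of one word and the start of the next.
        gaps = [nxt[1] - cur[2] for cur, nxt in zip(window, window[1:])]
        if all(gap <= max_gap for gap in gaps):
            hits.append((window[0][1], window[-1][2]))
    return hits


# Hypothetical timings: a 0.3 s pause separates "are" and "called".
words = [("Americans", 0.2, 0.8), ("are", 0.8, 1.0), ("called", 1.3, 1.7)]
strict = find_phrase(words, "are called")
tolerant = find_phrase(words, "are called", max_gap=0.5)
print(strict, tolerant)
```

With no tolerance the phrase is missed because of the pause; allowing a gap of half a second recovers it.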

For a full list, see the API reference here

Note that you could also call the search_all method to search for a list of queries within all the audio files.

Finally, you could do a regex search!

>>> print(indexer.search_regexp(pattern="[A-Z][^l]* "))
{u'Americans are ca': {'small_audio.wav': [(0.21, 1.71)]}}

There are more functionalities implemented. For detailed explanations, read the documentation here.

Nice to implement in the future

Uploading in parallel
More control structures for searching (Typos, phoneme based approximation of words using CMU_DICT or NLTK etc.)
Searching for unintelligible audio within the audio files, possibly by cross correlation or something similar.

Contributing

Should you want to contribute code or ideas, file a bug report or give feedback, visit the CONTRIBUTING file.

Authors

Alireza Rafiei - aalireza

See also the list of contributors to this project.

License

This project is licensed under the Apache v2.0 license - see the LICENCE file for more details.

simpleaudioindexer's People

Forkers

opyate techscientist pwoosam stevenlol xsongx nyimbi yuhuofei ncouture xcopyco borismolch ucifs markstur

simpleaudioindexer's Issues

TypeError: 'bool' object is not iterable

Hi, fair warning, I'm new to this module and I think I'm understanding it well enough by your examples to get it going. Please correct me if I'm not using it right.

My intention is to analyze the sample file you provided on the docs page (small_audio.wav), but when I run this using python 3.6 I get this error:

Traceback (most recent call last):
  File "times.py", line 5, in <module>
    indexer.index_audio(basename ='small_audio.wav')
  File "/home/mrhobbits/.local/lib/python3.6/site-packages/SimpleAudioIndexer/__init__.py", line 1108, in index_audio
    self._index_audio_ibm(*args, **kwargs)
  File "/home/mrhobbits/.local/lib/python3.6/site-packages/SimpleAudioIndexer/__init__.py", line 953, in _index_audio_ibm
    self._timestamp_regulator()
  File "/home/mrhobbits/.local/lib/python3.6/site-packages/SimpleAudioIndexer/__init__.py", line 1167, in _timestamp_regulator
    timestamp_basename][0]
TypeError: 'bool' object is not iterable

Here is the code I'm using, it doesn't get past the 4th line, so the rest is omitted.

from SimpleAudioIndexer import SimpleAudioIndexer as sai
indexer = sai(mode="ibm", src_dir="/home/mrhobbits/programming/pythonStuff/recordSpeakerAudio",
username_ibm="", password_ibm="")
indexer.index_audio(basename ='small_audio.wav')

Directory information:

mrhobbits@hobbits:~/programming/pythonStuff/recordSpeakerAudio$ ls -1
small_audio.wav
times.py
mrhobbits@hobbits:~/programming/pythonStuff/recordSpeakerAudio$ pwd
/home/mrhobbits/programming/pythonStuff/recordSpeakerAudio

System: Ubuntu 18.04.4 LTS
Python version(s) 3.6 and 3.7
SAI version: 1.0.0

TypeError: 'bool' object is not iterable

Hey. I really like the technology you're implementing here. It's amazing.

I am a simple man using this on Windows. I read the documentation and followed the instructions in a Docker shell/bash window, but I am getting the error above.
Please help me solve this.

Example does not work

When I run the example on a dir with only http://rafiei.net/assets/sai/small_audio.wav in it, it prints {}:

$ sai --mode "cmu" --src_dir data --search "lives"
{}

I'm on ubuntu 18.04 with python 3.7.
Using an absolute path did not make a difference.

Using mode='cmu' causes unconditional reformatting of files?

I have thousands of files that I've converted myself using ffmpeg to the format that is required by SAI, but when I run my script, I notice when running in verbose mode that the files are being run through ffmpeg again anyway. I cannot find a place where SAI tries to do any kind of format check before running it through ffmpeg. Of course, on a very large set of audio files, this can really waste an enormous amount of time.

Have I missed a flag or something that would allow me to bypass this step? I imagine that saving/loading an existing index would bypass it, but that doesn't help on an initial run against a large set of audio files that have already been converted to 16k mono wav files.

Thanks
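For anyone hitting the same problem, a pre-check along these lines could decide whether a file needs converting at all. This is only a sketch using Python's standard-library wave module (Python 3), not something SAI does; the helper names is_16k_mono and write_silence are made up, and it assumes "16k mono" means a 16000 Hz sample rate with a single channel.

```python
import os
import tempfile
import wave


def is_16k_mono(path):
    """Return True if the WAV file at `path` is 16000 Hz with one channel."""
    with wave.open(path, "rb") as wav:
        return wav.getframerate() == 16000 and wav.getnchannels() == 1


def write_silence(path, rate, channels):
    """Write 0.1 s of 16-bit silence, only to have a file to check against."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * channels * (rate // 10))


tmp_dir = tempfile.mkdtemp()
good = os.path.join(tmp_dir, "good.wav")
bad = os.path.join(tmp_dir, "bad.wav")
write_silence(good, 16000, 1)
write_silence(bad, 44100, 2)
checks = (is_16k_mono(good), is_16k_mono(bad))
print(checks)
```

Files that pass such a check could be skipped in a custom preprocessing step, although whether SAI can be told to skip its own conversion is a separate question this sketch does not answer.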

Error: The resulting request from Watson was unintelligible

Hello.
Thank you for making this project.
I'm trying to index a short .wav file with the following python (2.7):

from SimpleAudioIndexer import SimpleAudioIndexer as sai
indexer = sai("...","...",src_dir="//Users/.../temp")
indexer.index_audio()
print(indexer.get_timestamped_audio())

I'm getting the following error message:
{u'code_description': u'Bad Request', u'code': 400, u'error': u'No speech detected for 30s.'} The resulting request from Watson was unintelligible.
I've attached the sound file (though I had to add a .zip to the filename to get it to transfer; it's not actually a .zip file).
Can you recommend a good way to debug this? Is there a verbose option for the Watson routines?
Thanks!
-K

sensorlog_2017-02-08_01-10-45-332_Dev26c5_Loc27_TypeAUDIO.wav.zip

requests.exceptions.ConnectionError: ('Connection aborted.', error("(32, 'EPIPE')",))

Hi @aalireza, thanks for the script, I'm actually using it via Speech Hacker's repo. But I'm getting this issue when I try to run it:

File "/Users/nickpettican/miniconda3/envs/py27/lib/python2.7/site-packages/SimpleAudioIndexer/__init__.py", line 1108, in index_audio
    self._index_audio_ibm(*args, **kwargs)
  File "/Users/nickpettican/miniconda3/envs/py27/lib/python2.7/site-packages/SimpleAudioIndexer/__init__.py", line 944, in _index_audio_ibm
    params=params)
  File "/Users/nickpettican/miniconda3/envs/py27/lib/python2.7/site-packages/requests/api.py", line 112, in post
    return request('post', url, data=data, json=json, **kwargs)
  File "/Users/nickpettican/miniconda3/envs/py27/lib/python2.7/site-packages/requests/api.py", line 58, in request
    return session.request(method=method, url=url, **kwargs)
  File "/Users/nickpettican/miniconda3/envs/py27/lib/python2.7/site-packages/requests/sessions.py", line 508, in request
    resp = self.send(prep, **send_kwargs)
  File "/Users/nickpettican/miniconda3/envs/py27/lib/python2.7/site-packages/requests/sessions.py", line 618, in send
    r = adapter.send(request, **kwargs)
  File "/Users/nickpettican/miniconda3/envs/py27/lib/python2.7/site-packages/requests/adapters.py", line 490, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', error("(32, 'EPIPE')",))

Seems to be happening in the post request:

    response = requests.post(
        url=("https://stream.watsonplatform.net/"
             "speech-to-text/api/v1/recognize"),
        auth=(self.get_username_ibm(), self.get_password_ibm()),
        headers={'content-type': 'audio/wav'},
        data=f.read(),
        params=params)

Could it be the endpoint changed?

Falls over with many WAV files in set

From ParhamP/Speech-Hacker#6

I think the dict here grows too large:
https://github.com/aalireza/SimpleAudioIndexer/blob/master/SimpleAudioIndexer/__init__.py#L240
(I have 5GB of WAV files, but only 2GB of RAM)

I was wondering if this dict could be bypassed, because the hard drive already seems to be used as a cache (filtered, staging, etc).

Please let me know if this is the case, then I can send a PR which uses the staged files instead of this dict.

ValueError: could not convert string to float: 'so'

CMU mode works fine on the example .wav, but on my own 5-minute .wav with several tried keywords, I get the following stacktrace:

$ sai --mode "cmu" --src_dir data/sample --search "security"
Traceback (most recent call last):
  File "/project/venv/bin/sai", line 10, in <module>
    sys.exit(Main())
  File "/project/venv/lib/python3.7/site-packages/SimpleAudioIndexer/__main__.py", line 101, in Main
    cli_script_wrapped(indexer)
  File "/project/venv/lib/python3.7/site-packages/SimpleAudioIndexer/__main__.py", line 73, in cli_script_wrapped
    indexer.index_audio()
  File "/project/venv/lib/python3.7/site-packages/SimpleAudioIndexer/__init__.py", line 1110, in index_audio
    self._index_audio_cmu(*args, **kwargs)
  File "/project/venv/lib/python3.7/site-packages/SimpleAudioIndexer/__init__.py", line 797, in _index_audio_cmu
    str_timestamps_with_sil_conf))]
  File "/project/venv/lib/python3.7/site-packages/SimpleAudioIndexer/__init__.py", line 838, in _timestamp_extractor_cmu
    ) for x in str_timestamps])
  File "/project/venv/lib/python3.7/site-packages/SimpleAudioIndexer/__init__.py", line 838, in <listcomp>
    ) for x in str_timestamps])
ValueError: could not convert string to float: 'so'

NameError: name 'culture' is not defined

This was my initial command: indexer.index_audio(basename=culture.wav)
and this was the output:
NameError: name 'culture' is not defined
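A likely cause is that the file name is not quoted: Python reads culture.wav as the attribute wav of a variable named culture, which does not exist. The sketch below reproduces that error without SAI. It assumes basename expects a string, as in the call indexer.index_audio(basename ='small_audio.wav') shown in an earlier issue; index_audio here is a hypothetical stand-in that only echoes its argument.

```python
def index_audio(basename=None):
    """Hypothetical stand-in for indexer.index_audio; it only echoes its input."""
    return basename


try:
    # Unquoted: Python evaluates `culture.wav` as an expression and first
    # looks up a variable named `culture`, which is not defined.
    eval("culture.wav", {})
    error = None
except NameError as exc:
    error = str(exc)

print(error)

# Quoted: the file name is passed as an ordinary string.
quoted = index_audio(basename="culture.wav")
print(quoted)
```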

Reinstalling Sai

I recently uninstalled it and I'm trying to reinstall it, but I'm getting these errors.

Exception:
Traceback (most recent call last):
  File "/home/adeoluwa/.local/lib/python2.7/site-packages/pip/basecommand.py", line 215, in main
    status = self.run(options, args)
  File "/home/adeoluwa/.local/lib/python2.7/site-packages/pip/commands/install.py", line 342, in run
    prefix=options.prefix_path,
  File "/home/adeoluwa/.local/lib/python2.7/site-packages/pip/req/req_set.py", line 784, in install
    **kwargs
  File "/home/adeoluwa/.local/lib/python2.7/site-packages/pip/req/req_install.py", line 851, in install
    self.move_wheel_files(self.source_dir, root=root, prefix=prefix)
  File "/home/adeoluwa/.local/lib/python2.7/site-packages/pip/req/req_install.py", line 1064, in move_wheel_files
    isolated=self.isolated,
  File "/home/adeoluwa/.local/lib/python2.7/site-packages/pip/wheel.py", line 345, in move_wheel_files
    clobber(source, lib_dir, True)
  File "/home/adeoluwa/.local/lib/python2.7/site-packages/pip/wheel.py", line 316, in clobber
    ensure_dir(destdir)
  File "/home/adeoluwa/.local/lib/python2.7/site-packages/pip/utils/__init__.py", line 83, in ensure_dir
    os.makedirs(path)
  File "/usr/lib/python2.7/os.py", line 157, in makedirs
    mkdir(name, mode)
OSError: [Errno 13] Permission denied: '/usr/local/lib/python2.7/dist-packages/SimpleAudioIndexer'

Periods in filenames

Are periods allowed in filenames? I'm getting an error from SAI:

OSError: [Errno 2] No such file or directory: 
'/Users/foo/filtered/Recording_Start2017-11-10T19-16-31-538Z_Part000_Dur.wav'

The original source filename is:
/Users/foo/Recording_Start.2017-11-10T19-16-31-538Z_Part.000_Dur.0m24s373.wav
Is SAI stripping out characters from the filename?
I will also note that the ~src/filtered directory does not seem to be showing up. Of course, this may be due to the filename issue.
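One possible explanation, which is a guess and not confirmed against SAI's source, is that the file name is being split on periods somewhere along the way. For comparison, this sketch shows how os.path.splitext keeps inner periods intact while a naive split on "." does not. The file name is the one from this report; the naive expression is hypothetical.

```python
import os

name = "Recording_Start.2017-11-10T19-16-31-538Z_Part.000_Dur.0m24s373.wav"

# splitext only separates the final extension, so inner periods survive.
stem, ext = os.path.splitext(name)

# A naive split on periods keeps only the text before the first one.
naive_stem = name.split(".")[0]

print(stem)
print(ext)
print(naive_stem)
```

Until the cause is confirmed, renaming the files so that the only period is the one before wav may be a practical workaround.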

Watson API has changed

The way to communicate with IBM has changed. IBM's new API endpoints should be used.

OSError: [Errno 20] Not a directory: '/home/adeoluwa/Music/culture.wav/filtered'

I entered this command: >>> indexer.index_audio()

and this is what I got