
Comments (15)

benfmiller commented on August 19, 2024

That is pretty much exactly what each of those stands for!

The samples would be more accurately described as FFT frames. The FFT defaults are defined in the fingerprint file: DEFAULT_FS=44100, DEFAULT_WINDOW_SIZE=4096, DEFAULT_OVERLAP_RATIO=0.5.
Every audio file is automatically converted to a 44100 Hz sample rate, mixed down to one channel, and normalized when it is read in. So, the formula to convert FFT samples/frames to seconds is:

Seconds = Samples * WINDOW_SIZE * OVERLAP_RATIO / FS
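
As a quick check of that formula, here is a minimal sketch in Python. The helper names `frames_to_seconds` and `seconds_to_frames` are made up for illustration and are not part of audalign's API. Truncating to whole frames in the reverse direction is an assumption, chosen because it reproduces the 4.96907 that the output reports for a 5 second locality.

```python
# Default FFT settings quoted in this thread.
DEFAULT_FS = 44100
DEFAULT_WINDOW_SIZE = 4096
DEFAULT_OVERLAP_RATIO = 0.5


def frames_to_seconds(frames, fs=DEFAULT_FS, window_size=DEFAULT_WINDOW_SIZE,
                      overlap_ratio=DEFAULT_OVERLAP_RATIO):
    """Convert FFT frames to seconds (hypothetical helper, not audalign API)."""
    return round(frames * window_size * overlap_ratio / fs, 5)


def seconds_to_frames(seconds, fs=DEFAULT_FS, window_size=DEFAULT_WINDOW_SIZE,
                      overlap_ratio=DEFAULT_OVERLAP_RATIO):
    """Convert seconds to whole FFT frames, truncating (assumption)."""
    return int(seconds * fs / (window_size * overlap_ratio))


if __name__ == "__main__":
    print(frames_to_seconds(222))                   # offset of 222 frames
    print(frames_to_seconds(264))                   # offset of 264 frames
    print(frames_to_seconds(seconds_to_frames(5)))  # a 5 second locality in whole frames
```

With the defaults, one frame is 2048 / 44100 of a second, so offsets of 222 and 264 frames come out at about 10.31 and 12.26 seconds.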

confidence = the count of fingerprint matches between the files.
offset_samples = Alignment in FFT samples/frames
locality = center of the matching windows of sound events in each file in FFT samples/frames
locality_setting = width of locality window in seconds after it has been converted to FFT samples/frames
offset_seconds = Alignment in seconds
locality_seconds = center of the matching windows of sound events in each file in seconds

The tuples are of the form (target_file, against_file), where target_file is the input to recognize and against_file is an already fingerprinted file.

Using locality could result in more than one window alignment that has the exact same confidence, so those will be added to the list. e.g. [(1,2),(1,5)] if they both have the same confidence.

Each index of the results is one alignment, with the first index having the highest confidence. So if there are multiple possible alignments, attribute[index] gets the values for that single alignment.

I suppose I should put an explanation in the readme or the wiki. Do you think I should change "samples" to "frames" to be clearer about their values?

Hope this helps!

from audalign.

Johndirr commented on August 19, 2024

Thank you very much for the explanation.
I'm trying to plot an alignment graph where you can see the offset over time of one recording against another (using this to visualize where a tv recording (with advertisement breaks) differs from a dvd video release) and figured I need more information about all the values :D

Can't help you much about the naming. I always think of "fingerprint samples" but frame sounds equally good to me :)

Johndirr commented on August 19, 2024

Ok, I was confident I could use your explanation to build what I described but... I still have trouble interpreting the output. :-(

Let's say I have two files:

  • short_orig_intro.mp3
  • short_new_intro.mp3
    (same as short_orig_intro but the first 12.245 seconds are removed and 2 seconds of silence audio was inserted at the 12th second)

Fingerprinting and recognizing yields the following output:

Finished fingerprinting short_new_intro.mp3
Finished fingerprinting short_orig_intro.mp3
short_new_intro.mp3: Finding Matches... Aligning matches
{
'match_time': 0.005000114440917969,
'match_info':
{
'short_orig_intro.mp3':
{
'confidence': [152, 100],
'offset_samples': [222, 264],
'locality': [[(54, 53), (62, 55), (62, 53), (58, 55), (58, 58), (65, 55), (65, 65), (60, 66), (60, 60), (56, 58), (56, 54), (62, 62), (54, 54), (67, 68), (55, 255), (55, 54), (62, 264), (62, 62), (66, 280), (66, 49), (59, 59), (72, 215), (72, 52), (71, 237), (71, 71), (63, 267), (63, 86), (54, 267), (54, 86), (61, 45), (59, 60), (54, 54), (53, 50)], [(58, 58), (58, 165), (54, 165), (58, 58), (58, 155), (55, 155), (55, 54), (55, 149), (56, 56), (56, 145), (54, 55), (54, 54), (54, 143), (55, 55), (55, 139), (58, 58), (58, 55), (58, 57), (58, 129), (54, 58), (54, 56), (54, 54), (54, 122), (86, 58), (86, 56), (86, 54), (86, 110), (56, 54), (56, 87), (60, 54), (60, 83), (61, 54), (61, 374)]],
'locality_setting': [4.96907],
'offset_seconds': [10.30966, 12.26014],
'locality_seconds': [[(2.50776, 2.46132), (2.87927, 2.5542), (2.87927, 2.46132), (2.69351, 2.5542), (2.69351, 2.69351), (3.01859, 2.5542), (3.01859, 3.01859), (2.78639, 3.06503), (2.78639, 2.78639), (2.60063, 2.69351), (2.60063, 2.50776), (2.87927, 2.87927), (2.50776, 2.50776), (3.11147, 3.15791), (2.5542, 11.84218), (2.5542, 2.50776), (2.87927, 12.26014), (2.87927, 2.87927), (3.06503, 13.00317), (3.06503, 2.27556), (2.73995, 2.73995), (3.34367, 9.98458), (3.34367, 2.41488), (3.29723, 11.00626), (3.29723, 3.29723), (2.92571, 12.39946), (2.92571, 3.99383), (2.50776, 12.39946), (2.50776, 3.99383), (2.83283, 2.0898), (2.73995, 2.78639), (2.50776, 2.50776), (2.46132, 2.322)], [(2.69351, 2.69351), (2.69351, 7.66259), (2.50776, 7.66259), (2.69351, 2.69351), (2.69351, 7.19819), (2.5542, 7.19819), (2.5542, 2.50776), (2.5542, 6.91955), (2.60063, 2.60063), (2.60063, 6.73379), (2.50776, 2.5542), (2.50776, 2.50776), (2.50776, 6.64091), (2.5542, 2.5542), (2.5542, 6.45515), (2.69351, 2.69351), (2.69351, 2.5542), (2.69351, 2.64707), (2.69351, 5.99075), (2.50776, 2.69351), (2.50776, 2.60063), (2.50776, 2.50776), (2.50776, 5.66567), (3.99383, 2.69351), (3.99383, 2.60063), (3.99383, 2.50776), (3.99383, 5.10839), (2.60063, 2.50776), (2.60063, 4.04027), (2.78639, 2.50776), (2.78639, 3.85451), (2.83283, 2.50776), (2.83283, 17.36853)]]
}
}
}

Can you please explain where I can find the mentioned edits in this output? The 12.26014 in offset_seconds seems to be one thing I'm looking for but what does this offset mean exactly?

benfmiller commented on August 19, 2024

That sounds like a very fun project! I hope it's otherwise going well!
Sorry for the late reply, I had family in for the weekend.

The output is saying that the best alignment is 10.309 seconds with a confidence of 152. The offset in seconds says that 'short_orig_intro.mp3' starts 10.309 seconds before the target file, which is 'short_new_intro.mp3'. Likewise, there is a confidence of 100 that 'short_orig_intro.mp3' starts 12.260 seconds before 'short_new_intro.mp3'.

It's hard for fingerprinting to be completely accurate for the offset because mp3 encodings shift the audio a little, and the window size, overlap, and frame rate affect the accuracy of the spectrogram. Each frame corresponds to roughly 0.046 seconds (4096 * 0.5 / 44100), so alignment is only accurate to within about that much for the current settings. I recently added a correlation-based alignment that uses the waveform, so it is much more accurate for alignment time but is more susceptible to noise. I am planning on adding locality to that, too.

I hope this clears the alignment up a little better! It would be interesting to hear what you find for those alignments

Johndirr commented on August 19, 2024

Thanks for always getting back to me.
It seems that the second alignment (12.260) is for the first part of the track and the first alignment (10.309) is for the second part.
Since I removed the first ~12 seconds the track has to be shifted 12 seconds. Then comes the part where I inserted ~2 seconds so the offset is 12-2 =10... I was perplexed because of how this is ordered.
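
The arithmetic above can be written out as a small sketch. The function name `expected_offsets` is hypothetical, the 2.0 stands in for the "~2 seconds" of inserted silence, and the frame length assumes the default FFT settings mentioned earlier in the thread.

```python
# One FFT frame in seconds, assuming the default settings from this thread.
FRAME_SECONDS = 4096 * 0.5 / 44100


def expected_offsets(removed_from_start, inserted_silence):
    """Return (offset before the inserted silence, offset after it)."""
    before = removed_from_start
    after = removed_from_start - inserted_silence
    return before, after


if __name__ == "__main__":
    before, after = expected_offsets(12.245, 2.0)
    measured_before, measured_after = 12.26014, 10.30966  # offset_seconds from the output
    # Express the gap between expectation and measurement in FFT frames.
    print((measured_before - before) / FRAME_SECONDS)
    print((measured_after - after) / FRAME_SECONDS)
```

Both measured offsets land within a frame or two of the expected values, which fits the reading that the 12.260 alignment covers the part before the inserted silence and the 10.309 alignment the part after it.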

I still struggle to pinpoint the information to where the alignment starts or ends. Some of the locality_seconds matching tuples are looking quite promising but some don't make any sense to me.

benfmiller commented on August 19, 2024

You betcha! Happy to help!
The order of the matches is sorted by confidence. If you were to pass the folder to the align function and write them to a folder with the shifts, it uses the first, strongest match for the alignment. For your project, it might work better to do something like

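# `match` is the dictionary returned by recognize()
new_match = {}
for filename, info in match["match_info"].items():
    # Bundle the values of each alignment so they stay together while sorting
    temp_match = zip(info["offset_seconds"], info["confidence"], info["offset_samples"], info["locality"], info["locality_seconds"])
    # Sort the alignments by offset_seconds, the first element of each bundle
    temp_match = sorted(temp_match, key=lambda x: x[0])
    new_match[filename] = temp_match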

to order it by offset_seconds.

Can I ask what version you are using? There was a bug in the locality setting that I fixed in 0.1.5. If your version is after that, then it's not quite making sense to me either.

Johndirr commented on August 19, 2024

Seems like I used an older version, since the output is different after updating to 0.2.0. Both outputs are below (I inserted some line breaks into new_result to make the output easier to read).

import audalign

if __name__ == '__main__':
    ada = audalign.Audalign()
    ada.fingerprint_directory("shortintro")
    ada.save_fingerprinted_files("shortintro.json")
    
    # Only returns matches with total fingerprint matches greater than 50 within 5 second windows
    result = ada.recognize("short_new_intro.mp3", filter_matches=50, locality=5)
    print(result)

    print("---#####---")
    
    new_result={}
    for filename, info in result["match_info"].items():
        temp_match = zip(info["offset_seconds"],info["confidence"],info["offset_frames"],info["locality_frames"],info["locality_seconds"])
        temp_match = sorted(temp_match, key=lambda  x: x[0])
        new_result[filename] = temp_match
    
    print(new_result)

Fingerprinting short_new_intro.mp3
Fingerprinting short_orig_intro.mp3
Finished fingerprinting short_new_intro.mp3
Finished fingerprinting short_orig_intro.mp3
short_new_intro.mp3: Finding Matches... Aligning matches
{'match_time': 0.004990816116333008,
'match_info':
{'short_orig_intro.mp3':
{'confidence':
[138, 99],
'offset_frames':
[222, 264],
'locality_frames':
[[(62, 47), (57, 66), (57, 65), (60, 66), (60, 65), (58, 65), (58, 56), (57, 79), (57, 65), (57, 80), (57, 188), (57, 84), (57, 65), (57, 57), (57, 70), (57, 188), (54, 89), (54, 65), (54, 54), (54, 68), (54, 188), (64, 65), (64, 54), (64, 63), (64, 188), (54, 116), (54, 188), (54, 116), (54, 54), (54, 178), (65, 126), (65, 65), (65, 162), (56, 220), (56, 49), (54, 66), (54, 46), (62, 66), (62, 62), (56, 66), (56, 56), (84, 66), (84, 52), (64, 368), (64, 41), (54, 264), (54, 36), (54, 50), (59, 60), (54, 50), (50, 50)], [(58, 58), (58, 115), (58, 58), (58, 110), (55, 55), (55, 105), (55, 54), (55, 99), (55, 56), (55, 54), (55, 93), (55, 59), (55, 307), (55, 56), (55, 55), (55, 88), (55, 59), (55, 307), (58, 56), (58, 85), (58, 59), (58, 87), (54, 56), (54, 54), (54, 83), (54, 59), (54, 87), (91, 56), (91, 54), (91, 70), (91, 66), (91, 59), (91, 87), (56, 54), (56, 87), (60, 54), (60, 83), (61, 54), (61, 115)]],
'locality_frames_setting':
[4.96907],
'offset_seconds':
[10.30966, 12.26014],
'locality_seconds':
[[(2.87927, 2.18268), (2.64707, 3.06503), (2.64707, 3.01859), (2.78639, 3.06503), (2.78639, 3.01859), (2.69351, 3.01859), (2.69351, 2.60063), (2.64707, 3.66875), (2.64707, 3.01859), (2.64707, 3.71519), (2.64707, 8.7307), (2.64707, 3.90095), (2.64707, 3.01859), (2.64707, 2.64707), (2.64707, 3.25079), (2.64707, 8.7307), (2.50776, 4.13315), (2.50776, 3.01859), (2.50776, 2.50776), (2.50776, 3.15791), (2.50776, 8.7307), (2.97215, 3.01859), (2.97215, 2.50776), (2.97215, 2.92571), (2.97215, 8.7307), (2.50776, 5.38703), (2.50776, 8.7307), (2.50776, 5.38703), (2.50776, 2.50776), (2.50776, 8.2663), (3.01859, 5.85143), (3.01859, 3.01859), (3.01859, 7.52327), (2.60063, 10.21678), (2.60063, 2.27556), (2.50776, 3.06503), (2.50776, 2.13624), (2.87927, 3.06503), (2.87927, 2.87927), (2.60063, 3.06503), (2.60063, 2.60063), (3.90095, 3.06503), (3.90095, 2.41488), (2.97215, 17.08989), (2.97215, 1.90404), (2.50776, 12.26014), (2.50776, 1.67184), (2.50776, 2.322), (2.73995, 2.78639), (2.50776, 2.322), (2.322, 2.322)], [(2.69351, 2.69351), (2.69351, 5.34059), (2.69351, 2.69351), (2.69351, 5.10839), (2.5542, 2.5542), (2.5542, 4.87619), (2.5542, 2.50776), (2.5542, 4.59755), (2.5542, 2.60063), (2.5542, 2.50776), (2.5542, 4.31891), (2.5542, 2.73995), (2.5542, 14.25705), (2.5542, 2.60063), (2.5542, 2.5542), (2.5542, 4.08671), (2.5542, 2.73995), (2.5542, 14.25705), (2.69351, 2.60063), (2.69351, 3.94739), (2.69351, 2.73995), (2.69351, 4.04027), (2.50776, 2.60063), (2.50776, 2.50776), (2.50776, 3.85451), (2.50776, 2.73995), (2.50776, 4.04027), (4.22603, 2.60063), (4.22603, 2.50776), (4.22603, 3.25079), (4.22603, 3.06503), (4.22603, 2.73995), (4.22603, 4.04027), (2.60063, 2.50776), (2.60063, 4.04027), (2.78639, 2.50776), (2.78639, 3.85451), (2.83283, 2.50776), (2.83283, 5.34059)]]}}}

{'short_orig_intro.mp3': [
(10.30966, 138, 222, [(62, 47), (57, 66), (57, 65), (60, 66), (60, 65), (58, 65), (58, 56), (57, 79), (57, 65), (57, 80), (57, 188), (57, 84), (57, 65), (57, 57), (57, 70), (57, 188), (54, 89), (54, 65), (54, 54), (54, 68), (54, 188), (64, 65), (64, 54), (64, 63), (64, 188), (54, 116), (54, 188), (54, 116), (54, 54), (54, 178), (65, 126), (65, 65), (65, 162), (56, 220), (56, 49), (54, 66), (54, 46), (62, 66), (62, 62), (56, 66), (56, 56), (84, 66), (84, 52), (64, 368), (64, 41), (54, 264), (54, 36), (54, 50), (59, 60), (54, 50), (50, 50)],
[(2.87927, 2.18268), (2.64707, 3.06503), (2.64707, 3.01859), (2.78639, 3.06503), (2.78639, 3.01859), (2.69351, 3.01859), (2.69351, 2.60063), (2.64707, 3.66875), (2.64707, 3.01859), (2.64707, 3.71519), (2.64707, 8.7307), (2.64707, 3.90095), (2.64707, 3.01859), (2.64707, 2.64707), (2.64707, 3.25079), (2.64707, 8.7307), (2.50776, 4.13315), (2.50776, 3.01859), (2.50776, 2.50776), (2.50776, 3.15791), (2.50776, 8.7307), (2.97215, 3.01859), (2.97215, 2.50776), (2.97215, 2.92571), (2.97215, 8.7307), (2.50776, 5.38703), (2.50776, 8.7307), (2.50776, 5.38703), (2.50776, 2.50776), (2.50776, 8.2663), (3.01859, 5.85143), (3.01859, 3.01859), (3.01859, 7.52327), (2.60063, 10.21678), (2.60063, 2.27556), (2.50776, 3.06503), (2.50776, 2.13624), (2.87927, 3.06503), (2.87927, 2.87927), (2.60063, 3.06503), (2.60063, 2.60063), (3.90095, 3.06503), (3.90095, 2.41488), (2.97215, 17.08989), (2.97215, 1.90404), (2.50776, 12.26014), (2.50776, 1.67184), (2.50776, 2.322), (2.73995, 2.78639), (2.50776, 2.322), (2.322, 2.322)]),
(12.26014, 99, 264, [(58, 58), (58, 115), (58, 58), (58, 110), (55, 55), (55, 105), (55, 54), (55, 99), (55, 56), (55, 54), (55, 93), (55, 59), (55, 307), (55, 56), (55, 55), (55, 88), (55, 59), (55, 307), (58, 56), (58, 85), (58, 59), (58, 87), (54, 56), (54, 54), (54, 83), (54, 59), (54, 87), (91, 56), (91, 54), (91, 70), (91, 66), (91, 59), (91, 87), (56, 54), (56, 87), (60, 54), (60, 83), (61, 54), (61, 115)],
[(2.69351, 2.69351), (2.69351, 5.34059), (2.69351, 2.69351), (2.69351, 5.10839), (2.5542, 2.5542), (2.5542, 4.87619), (2.5542, 2.50776), (2.5542, 4.59755), (2.5542, 2.60063), (2.5542, 2.50776), (2.5542, 4.31891), (2.5542, 2.73995), (2.5542, 14.25705), (2.5542, 2.60063), (2.5542, 2.5542), (2.5542, 4.08671), (2.5542, 2.73995), (2.5542, 14.25705), (2.69351, 2.60063), (2.69351, 3.94739), (2.69351, 2.73995), (2.69351, 4.04027), (2.50776, 2.60063), (2.50776, 2.50776), (2.50776, 3.85451), (2.50776, 2.73995), (2.50776, 4.04027), (4.22603, 2.60063), (4.22603, 2.50776), (4.22603, 3.25079), (4.22603, 3.06503), (4.22603, 2.73995), (4.22603, 4.04027), (2.60063, 2.50776), (2.60063, 4.04027), (2.78639, 2.50776), (2.78639, 3.85451), (2.83283, 2.50776), (2.83283, 5.34059)])]}

To be honest, the new output is not helping me much :( I still wonder how to read the actual time where the match starts and ends. Do I have to add the locality_frames_setting seconds to the tuples, or do I have to multiply the locality_frames_setting by the length of locality_seconds, or something :D

benfmiller commented on August 19, 2024

I totally getcha. Sorry, there is definitely a bug. I will have it fixed and a new release out by tonight!!

Each alignment has a list of locality tuples where each tuple has the same confidence. The tuples are of the form (target_file, against_file), where target_file is the input to recognize and against_file is an already fingerprinted file. Each number in the tuple is the position in seconds of the center of the locality window for the respective file.

It's not calculating the tuples correctly right now, but I'll fix that up

benfmiller commented on August 19, 2024

new_result = {}
for filename, info in result["match_info"].items():
    temp_match = zip(
        info["offset_seconds"],
        info["confidence"],
        info["offset_frames"],
        info["locality_frames"],
        info["locality_seconds"],
    )
    temp_match = sorted(temp_match, key=lambda x: x[0])
    temp_match = list(zip(*temp_match))
    new_result[filename] = {}
    new_result[filename]["offset_seconds"] = temp_match[0]
    new_result[filename]["confidence"] = temp_match[1]
    new_result[filename]["offset_frames"] = temp_match[2]
    new_result[filename]["locality_frames"] = temp_match[3]
    new_result[filename]["locality_seconds"] = temp_match[4]

If you use this, it puts it back in the same dictionary format, just sorted by offset_seconds

Johndirr commented on August 19, 2024

"I totally getcha. Sorry, there is definitely a bug. I will have it fixed and a new release out by tonight!!"

👍
Wow, that's nice to hear but take your time. There really is no need to rush this.
It's good a bug was found. I'll make sure to verify the fix :)

benfmiller commented on August 19, 2024

It works well for me, now! I uploaded a fix to pypi, v0.2.1, so it should be all fixed.
Thanks! It'd be awesome to know if it works or you find anything fishy.

Johndirr commented on August 19, 2024

Well done. It works for me too :)
Here is the actual output:

short_new_intro.mp3: Finding Matches... Aligning matches
confidence
[97, 137]
offset_frames
[222, 264]
locality_frames
[[(88, 352), (115, 379), (129, 393), (138, 401), (157, 411), (157, 423), (157, 510), (157, 527), (164, 411), (164, 428), (164, 510), (164, 527), (169, 411), (169, 434), (169, 510), (169, 527), (180, 411), (180, 434), (180, 444), (180, 510), (180, 527), (183, 411), (183, 434), (183, 444), (183, 447), (183, 510), (183, 527), (263, 481), (263, 506), (270, 481), (270, 513), (286, 481), (286, 528)], [(620, 842), (643, 842), (653, 842), (653, 875), (671, 842), (671, 895), (683, 842), (683, 905), (692, 842), (692, 910), (692, 969), (702, 842), (702, 924), (702, 969), (708, 842), (708, 928), (708, 931), (708, 969), (750, 976), (755, 977), (773, 995), (804, 1026), (814, 1030), (836, 993), (836, 1041), (856, 993), (856, 1078), (864, 993), (864, 1086), (939, 1171), (966, 1175), (1033, 1259), (1040, 1262), (1063, 1289), (1075, 1296)]]
locality_frames_setting
[4.96907]
offset_seconds
[10.30966, 12.26014]
locality_seconds
[[(4.08671, 16.34685), (5.34059, 17.60073), (5.99075, 18.25088), (6.40871, 18.6224), (7.29107, 19.0868), (7.29107, 19.64408), (7.29107, 23.68435), (7.29107, 24.47383), (7.61615, 19.0868), (7.61615, 19.87628), (7.61615, 23.68435), (7.61615, 24.47383), (7.84834, 19.0868), (7.84834, 20.15492), (7.84834, 23.68435), (7.84834, 24.47383), (8.35918, 19.0868), (8.35918, 20.15492), (8.35918, 20.61932), (8.35918, 23.68435), (8.35918, 24.47383), (8.4985, 19.0868), (8.4985, 20.15492), (8.4985, 20.61932), (8.4985, 20.75864), (8.4985, 23.68435), (8.4985, 24.47383), (12.2137, 22.3376), (12.2137, 23.49859), (12.53878, 22.3376), (12.53878, 23.82367), (13.28181, 22.3376), (13.28181, 24.52027)], [(28.79274, 39.1024), (29.86086, 39.1024), (30.32526, 39.1024), (30.32526, 40.63492), (31.16118, 39.1024), (31.16118, 41.56372), (31.71846, 39.1024), (31.71846, 42.02812), (32.13642, 39.1024), (32.13642, 42.26032), (32.13642, 45.00027), (32.60082, 39.1024), (32.60082, 42.91048), (32.60082, 45.00027), (32.87946, 39.1024), (32.87946, 43.09624), (32.87946, 43.23556), (32.87946, 45.00027), (34.82993, 45.32535), (35.06213, 45.37179), (35.89805, 46.20771), (37.33769, 47.64735), (37.80209, 47.83311), (38.82376, 46.11483), (38.82376, 48.34395), (39.75256, 46.11483), (39.75256, 50.06222), (40.12408, 46.11483), (40.12408, 50.43374), (43.60707, 54.38113), (44.86095, 54.56689), (47.97243, 58.46785), (48.29751, 58.60717), (49.36562, 59.86104), (49.9229, 60.18612)]]

I tried to visualize the output to understand it better. There are two matches, with the offsets 10.30966 and 12.26014. The graphic shows four tracks: the original audio, the new audio (the one I try to recognize), and both matches separately. The offset seconds match very well.
[image]
Where the matches start and end can be read from the locality_seconds output, e.g. (4.08671, 16.34685). This is the middle of one of the locality frames for the first match of the target file (the one I want to recognize). So the match starts at 4.08671 - locality_frames_setting/2 (half, since it's the middle). Is that right?
In blue I also outlined two locality_seconds tuples (the last ones for each match).

Thank you for your support.

PS. the code you posted gives me an index out of range error :)

benfmiller commented on August 19, 2024

I fixed the code I posted above; it was missing a line to unzip it. Thanks for the notice!

The tuple (4.08671, 16.34685) means that the center of the locality window for your target file (short_new_intro) is at 4.08671 seconds and the center of the locality window for the against file(short_orig_intro) is at 16.34685. With the locality setting at 5, the windows are up to 5 seconds wide (2.5 seconds on either side), but they could be smaller if there aren't many fingerprints at that location.
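
To make the window arithmetic concrete, here is a small sketch. `locality_window` is a hypothetical helper, not an audalign function, and the bounds it returns are the widest the window can be, since the actual window may be narrower when there are few fingerprints nearby.

```python
def locality_window(center_seconds, locality_seconds=5.0):
    """Return the widest possible (start, end) of a locality window in seconds."""
    half = locality_seconds / 2
    start = max(0.0, center_seconds - half)  # a window cannot start before the file does
    end = center_seconds + half
    return round(start, 5), round(end, 5)


if __name__ == "__main__":
    target_center, against_center = (4.08671, 16.34685)
    print(locality_window(target_center))   # window in short_new_intro
    print(locality_window(against_center))  # window in short_orig_intro
```

For the tuple above this gives at most roughly 1.59 to 6.59 seconds in the target file and 13.85 to 18.85 seconds in the against file.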

Each tuple is a separate "match" as I've been calling it, and each offset is a separate "alignment."

I realized that the locality tuples were not being added correctly by confidence. I'm so sorry for all the bugs and frustration. I am pretty confident that it is all working correctly now. I just pushed v0.2.2, which ensures that all tuples have the same confidence. I also just pushed v0.3.0, which lets you specify the locality_filter_prop. It filters the tuples by the proportion of tuple confidence over the highest confidence for that offset. This also gives the confidence of each tuple as a third number in the tuple, so you can tell exactly what each tuple's confidence is. Again, so sorry for the bugs.

Thanks for trying it all out!! I hope this version and answer helps you out!

Johndirr commented on August 19, 2024

Hi, I'm sorry for the late answer. But I was finally able to try the new version (0.3.1).
First I thought something was not working right because, when printing the recognize results, I only get two locality frames. The reason seems to be the locality_filter_prop. Not defining a locality filter probability results in a much smaller list of frames than before the feature was introduced:

{
'match_time': 0.005000114440917969,
'match_info':
{
'short_orig_intro.mp3':
{
'confidence':
[137, 97],
'offset_frames':
[222, 264],
'locality_frames':
[[(620, 842, 137)], [(88, 352, 97)]],
'locality_frames_setting':
[4.96907],
'offset_seconds':
[10.30966, 12.26014],
'locality_seconds':
[[(28.79274, 39.1024, 137)], [(4.08671, 16.34685, 97)]]}}}

Now I wonder which locality filter probability would yield the same results as before? And which probability would make the most sense in general?

EDIT
I was able to finish what I wanted to do (visualizing the difference between two audio files by plotting the offset of one to the other over time). This is the result with a locality_filter_prop of 0.1 and with 0.9:
[two images]

So I think a very low locality_filter_prop works best for me 👍
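
For anyone wanting to build a similar plot, one possible way to turn the output into plottable points is sketched below. This is not necessarily how the plots above were made. `offset_points` is a hypothetical helper, and it assumes each locality tuple is (target seconds, against seconds) with an optional confidence as the third value.

```python
def offset_points(info):
    """Flatten locality_seconds into (target_time, offset, confidence) points."""
    points = []
    for tuples in info["locality_seconds"]:
        for target_s, against_s, *rest in tuples:
            confidence = rest[0] if rest else None  # missing before v0.3.0
            points.append((target_s, round(against_s - target_s, 5), confidence))
    # Sort by position in the target file so the points read left to right.
    return sorted(points, key=lambda p: p[0])


if __name__ == "__main__":
    # Values copied from the 0.3.1 output earlier in this comment.
    info = {
        "locality_seconds": [[(28.79274, 39.1024, 137)], [(4.08671, 16.34685, 97)]],
    }
    for target_time, offset, confidence in offset_points(info):
        print(target_time, offset, confidence)
```

Each point's offset is simply the against-file position minus the target-file position, so the two tuples above reproduce the 12.26014 and 10.30966 second offsets, ordered by where they occur in the target file.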

benfmiller commented on August 19, 2024

Yep, that's all working right and as intended. It was supposed to include tuples that all had the same alignment confidence, but that's not how it was working before. There was no way to tell the confidence of each tuple, and most of them were irrelevant. Now it reports the confidence of each tuple as the third value.

There is no probability. When you adjust the locality filter proportion (0 to 1.0), it includes only the tuples whose ratio (tuple confidence / alignment confidence) is higher than the given proportion.

A good value to use might be 0.5, but it really depends on how many tuples you want and how concerned you are about noise.
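
A rough sketch of that filtering rule, for illustration only: `filter_locality_tuples` is a hypothetical function, not audalign's actual implementation, and using `>=` rather than a strict `>` is an assumption. The second and third tuples are made-up values; only the first comes from the output in this thread.

```python
def filter_locality_tuples(tuples, alignment_confidence, locality_filter_prop):
    """Keep tuples whose confidence ratio reaches the given proportion."""
    return [
        t for t in tuples
        if t[2] / alignment_confidence >= locality_filter_prop  # >= is an assumption
    ]


if __name__ == "__main__":
    alignment_confidence = 137
    tuples = [
        (620, 842, 137),  # from the 0.3.1 output above
        (700, 922, 80),   # hypothetical
        (750, 972, 20),   # hypothetical
    ]
    for prop in (0.1, 0.5, 0.9):
        kept = filter_locality_tuples(tuples, alignment_confidence, prop)
        print(prop, len(kept))
```

A low proportion keeps nearly every tuple, which suits plotting offsets over time, while a high proportion keeps only the strongest windows.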
