First, thank you for sharing your experiment and code on brain waves.
Please note that my English is not very good, so some of my sentences may be incorrect.
We ran a replication experiment based on the provided code and achieved results similar to the performance reported in the paper.
We could not obtain Schoffelen's data, so we used only GWilliams.
While analyzing the experimental results, we found that most of the predicted (generated) sentences either match the reference exactly or get every word wrong.
My background is in natural language processing, and in my experience a generation model usually gets only some of the words in a sentence wrong.
However, in our experiments with the provided code, such partially wrong outputs are very rare.
We analyzed the data and found that the training and evaluation sets share the same sentences.
There are 23,339 training examples, but only 661 unique sentences.
Similarly, the evaluation data contains only 651 unique sentences among its 2,918 examples.
Moreover, all 651 unique sentences in the evaluation data also appear in the training data.
Every MEG path is unique and is not shared between the training and test data.
This is probably a consequence of how the dataset was generated: the same sentence was presented to multiple subjects, producing multiple MEG recordings per sentence.
We believe this data split makes an accurate evaluation difficult.
A pre-trained Whisper model can learn patterns in word sequences, so in this setting, once it correctly guesses the first word of a memorized sentence, it can easily predict all subsequent words.
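One way to avoid this leakage would be a sentence-disjoint split, where all recordings of a given sentence (from every subject) go to exactly one split. The sketch below is only an illustration under our assumptions; the `"sentence"` and `"meg"` field names are hypothetical placeholders, not the dataset's actual schema:

```python
import random
from collections import defaultdict

def sentence_disjoint_split(examples, test_frac=0.2, seed=0):
    """Split examples so no sentence appears in both train and test.

    `examples` is a list of dicts with a "sentence" key (field names
    here are illustrative; adapt them to the real jsonl schema).
    """
    # Group all recordings of the same sentence together.
    by_sentence = defaultdict(list)
    for ex in examples:
        by_sentence[ex["sentence"]].append(ex)

    # Shuffle the unique sentences and assign a fraction to the test set.
    sentences = sorted(by_sentence)
    random.Random(seed).shuffle(sentences)
    n_test = max(1, int(len(sentences) * test_frac))
    test_sents = set(sentences[:n_test])

    train = [ex for s, exs in by_sentence.items() if s not in test_sents for ex in exs]
    test = [ex for s, exs in by_sentence.items() if s in test_sents for ex in exs]
    return train, test

# Toy usage: two sentences, each recorded from two "subjects".
data = [{"sentence": s, "meg": f"subj{i}/{s}.npy"}
        for s in ("hello world", "good morning") for i in (1, 2)]
train, test = sentence_disjoint_split(data, test_frac=0.5)
# No sentence overlap between the two splits.
assert not {ex["sentence"] for ex in train} & {ex["sentence"] for ex in test}
```

With such a split, a model can no longer score well by memorizing word sequences; it must actually decode the MEG signal.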
Our simple data analysis code is shown below.
Also, since we were unable to obtain the Schoffelen data, could you tell us where it can be downloaded?
import jsonlines

train_data_path = "{data_path}/preprocess5/split1/train.jsonl"
val_data_path = "{data_path}/preprocess5/split1/val.jsonl"
test_data_path = "{data_path}/preprocess5/split1/test.jsonl"

def load_split(path):
    # Collect the sentence and MEG-file path of every example in a split.
    sentences, meg_paths = [], []
    with jsonlines.open(path, mode='r') as reader:
        for json_obj in reader:
            sentences.append(json_obj["sentence"])
            meg_paths.append(json_obj["eeg"]["path"])
    return sentences, meg_paths

train_data_sent, train_data_meg_path = load_split(train_data_path)
val_data_sent, val_data_meg_path = load_split(val_data_path)
test_data_sent, test_data_meg_path = load_split(test_data_path)

print("counting unique elements")
for name, sents, megs in [
    ("train", train_data_sent, train_data_meg_path),
    ("val", val_data_sent, val_data_meg_path),
    ("test", test_data_sent, test_data_meg_path),
]:
    print(name)
    print("sentence", len(sents))
    print("unique_sentence", len(set(sents)))
    print("meg", len(megs))
    print("unique_meg", len(set(megs)))
    print()

# Count how many test examples reuse a training sentence or MEG file.
# (Sets make the membership checks O(1) instead of scanning a list.)
train_sent_set = set(train_data_sent)
train_meg_set = set(train_data_meg_path)
same_sent = sum(sent in train_sent_set for sent in test_data_sent)
same_meg = sum(meg in train_meg_set for meg in test_data_meg_path)
print("number of test_data", len(test_data_sent))
print("counting of sentence in train-data", same_sent)
print("counting of meg in train-data", same_meg)
Result:
counting unique elements
train
sentence 23339
unique_sentence 661
meg 23339
unique_meg 23339

val
sentence 2917
unique_sentence 647
meg 2917
unique_meg 2917

test
sentence 2918
unique_sentence 651
meg 2918
unique_meg 2918

number of test_data 2918
counting of sentence in train-data 2918
counting of meg in train-data 0