
ottokart / punctuator2

652 stars · 28 watchers · 196 forks · 57 KB

A bidirectional recurrent neural network model with attention mechanism for restoring missing punctuation in unsegmented text

Home Page: http://bark.phon.ioc.ee/punctuator

License: MIT License

Python 98.60% Shell 1.40%
Topics: punctuation, recurrent-neural-networks, theano, attention, demo

punctuator2's People

Contributors

mkhon, ottokart


punctuator2's Issues

pre-trained model improvement

I have used the pre-trained model Demo-Europarl-EN.pcl for punctuation prediction, and the results are as follows:


PUNCTUATION       PRECISION  RECALL  F-SCORE
,COMMA            71.9       75.5    73.7
.PERIOD           74.2       32.9    45.6
?QUESTIONMARK     58.3       11.3    18.9
!EXCLAMATIONMARK  nan        0.0     nan
:COLON            55.2       26.7    36.0
;SEMICOLON        33.3       3.8     6.9
-DASH             40.6       9.7     15.7
Overall           72.0       55.3    62.5

Err: 5.86%

SER: 60.7%

With the following config under Ubuntu 16.04:

Theano Version: 1.0.4+10.g9feed7868

Python 3.6.8 :: Anaconda, Inc.

Can you advise how I can improve the performance to reach the baseline model?

Thanks a lot

successfully applied to Chinese

Train:
Building model...
Number of parameters is 14387208
Training...
PPL: 1.8218; Speed: 159.58 sps
PPL: 1.6782; Speed: 174.15 sps
PPL: 1.5590; Speed: 178.82 sps

How to use

Can someone explain to me how to use it, step by step?
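(For anyone else wondering, here is a rough outline assembled from the README commands quoted in other issues on this page; paths and the "ep" model name are placeholders:

  1. Prepare a data directory with .train.txt, .dev.txt and .test.txt files in which punctuation has been replaced by tags such as ,COMMA and .PERIOD.
  2. Convert the data: python data.py <data_dir>
  3. Train: python main.py ep 256 0.02 (model name, hidden layer size, learning rate); this produces Model_ep_h256_lr0.02.pcl.
  4. Punctuate: cat input.txt | python punctuator.py <model_path> <model_output_path>)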

Adapting from previous model state

print "Building model..."
        net = models.GRUstage2(
            rng=rng,
            x=x,
            minibatch_size=MINIBATCH_SIZE,
            n_hidden=num_hidden,
            x_vocabulary=word_vocabulary,
            y_vocabulary=punctuation_vocabulary,
            stage1_model_file_name=prev_model_file_name
        )
Number of parameters is 13667593
/theano/scan_module/scan.py", line 475, in scan
    actual_slice = seq['input'][k - mintap]
TypeError: 'NoneType' object has no attribute '__getitem__'

On adapting a previously trained model file to a new data set, I get the above error. I don't supply any pause info, so it is null by default.

Training data to support a different language

Hello! I'm trying to add support for a different language here. I have training data with about 100,000 sentences and can increase it to 1M or so. How many sentences do I need to start training, and how do I need to update the ./run.sh file in my case (the input file name is already updated)? I've tried using the total and half the number of lines already and got these errors:

...
Step 1/3
Step 2/3
[nltk_data] Downloading package punkt to /Users/350d/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
Skipped 11383 lines
Step 3/3
head: illegal line count -- -50000
head: illegal line count -- -25000
Cleaning up...
...

Update: OK, OSX head and tail don't accept negative values; fixed by updating to coreutils.
Will let you know about my progress here...

Pre-trained word embeddings

Hi Ottokart,

Thank you for making this work available. I have a couple of questions:

  1. The <hidden_layer_size> parameter: this represents the size of a single BiRNN layer. Have you tried stacking layers on top of each other?

  2. In the paper you have a system that uses pre-trained word vectors. Can you share the code that uses them in the neural network? I see it runs on plain text, but I wonder if you have a script that loads a word2vec binary file (or some other word-vector format) as input to the NN.

  3. Have you tried using an embedding layer as the first layer in the network, before the BiRNN? And do you think it's possible to use pre-trained (general-purpose) vectors as input to an NN that has a word embedding layer (which hopefully will learn task-specific embeddings) as the first layer, followed by your NN (BiRNN and attention)? Can you share your advice on that?
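(On question 3: below is a minimal sketch of initializing an embedding matrix from general-purpose pre-trained vectors so that a trainable embedding layer starts from them; the function and variable names are illustrative, not punctuator2's API.)

import numpy as np

def build_embedding_matrix(word_vocabulary, pretrained_vectors, dim):
    # word_vocabulary: dict mapping word -> integer id
    # pretrained_vectors: dict mapping word -> numpy array of shape (dim,)
    rng = np.random.RandomState(1)
    # words without a pre-trained vector get small random initial vectors
    matrix = rng.uniform(-0.05, 0.05, size=(len(word_vocabulary), dim))
    for word, idx in word_vocabulary.items():
        if word in pretrained_vectors:
            matrix[idx] = pretrained_vectors[word]
    # use as the embedding layer's initial weights and fine-tune during training
    return matrix.astype('float32')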

data.py is generating tiny files from my training data

Hi Ottokart,

I'm playing with punctuator2 in a different setting than punctuation restoration. I'm using it to predict long-distance tokens; e.g., the input is a document (around 400 words) and I'm trying to predict a token (or more) inside that document (not actually punctuation), say a paragraph ending. I don't know if punctuator2 will generalize to such tasks, so please let me know if there are any restrictions that are specific to punctuation restoration. Now, everything is fine when I use it to predict punctuation, and it performs exceptionally well, but when I use it for this unusual task, the data.py script produces tiny files from my training data (remember, most of the time there is only one "punctuation" token in a given ~400-word line representing a document):

$ du -sch data/*
4.0K	dev
4.0K	test
16K	train
152K	vocabulary

Where my text files are:

$ wc *
     180   139630   852376 ep.dev.txt
     180    62584   371250 ep.test.txt
    4588  2007796 12097309 ep.train.txt

This is the error I get when I train:

$ python main.py ep 4 0.02
Using gpu device 0: Tesla K80 (CNMeM is disabled, cuDNN not available)
4 0.02 Model_ep_h4_lr0.02.pcl
Loading previous model state
Number of parameters is 84008
Training...
Total number of training labels: 1029
WARNING: Not enough samples in '../data/dev'. Reduce mini-batch size to 0 or use a dataset with at least 1050 words.
Total number of validation labels: 0
Traceback (most recent call last):
  File "main.py", line 182, in <module>
    ppl = np.exp(total_neg_log_likelihood / total_num_output_samples)
ZeroDivisionError: division by zero

I realize that my data is small, but the tiny data files generated by data.py suggest that text is being omitted. I wonder if there is another reason for this issue, related to the code not being designed for such tasks. Will doing any of the following solve my problem:

  1. Use pre-trained vectors.
  2. Collect more training data.
  3. Modify parts of the code (if it's a quick fix).

If none will work, no problem, I'll build my own architecture.

Thank you.
Wael

Question about late fusion

I'm not sure this is the best place to ask this question since it concerns the model itself rather than the code.

But anyway, here it is: why is there a need for late fusion between the attention model and the unidirectional GRU model?

Why not simply use the encoder-decoder model with attention proposed in Bahdanau et al. (2016) (NMT by jointly learning to align and translate)? In that model, the output at time t is directly conditioned on the weighted sum of the encoder states and the previous hidden state.

Thanks @ottokart !
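(For readers wanting to see the shape of the computation: here is a minimal numpy sketch of one common late-fusion formulation, where a learned gate controls how much of the attention context enters the recurrent state before the output layer. This is a paraphrase of the general technique, not the exact equations from the paper.)

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def late_fusion(h_t, c_t, W_hg, W_cg, b_g, W_c):
    # h_t: unidirectional GRU hidden state; c_t: attention context vector
    g_t = sigmoid(W_hg @ h_t + W_cg @ c_t + b_g)  # per-dimension gate
    return h_t + g_t * (W_c @ c_t)                # gated additive fusion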

Pretrained model in windows

I am trying to use the pretrained model (Demo-Europarl-EN.pcl) on Windows to convert a file "raw.txt" into "test.txt", where raw.txt is missing punctuation and test.txt is the output file with punctuation.

Sorry, I am a little confused about how to do that. Please correct me if I am wrong. Do I do this?

cat raw.txt | python punctuator.py Demo-Europarl-EN.pcl test.txt

cat is a Linux command; how can I run this in the Windows command prompt? Advice appreciated. Thanks.
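(For what it's worth: in cmd.exe the type command plays the role of cat, so an equivalent invocation should be type raw.txt | python punctuator.py Demo-Europarl-EN.pcl test.txt, assuming the model file is in the current directory.)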

Punctuate with GPU

Hello!
I'm trying to use a GPU with CUDA and cuDNN for punctuating on Windows.
My Theano tests show that the CPU (i7 7700) takes 11 sec and the GPU (GTX 1060 6 GB) takes 0.2 sec.

But I noticed an interesting thing: when I punctuate a big text with more than 75,000 symbols, the GPU punctuates faster than the CPU, but when the text has fewer than 50,000 symbols, the CPU punctuates faster than the GPU.

How can this be? And how to make the GPU punctuate texts faster regardless of their size?

Questions / clarifications

Hi Otto,
I have a few questions.
It's unclear to me whether the data.py script requires "pre-processed" data, e.g. with ",COMMA" annotations, or if this is the script that will annotate the data.
Case 1: if it requires annotated data, do you intend to provide the preparation script?
Case 2: if it will annotate the data, what kind of data is required? Tokenized? Normalized for unusual punctuation / number formatting?

Can you give an idea of training time depending on the amount of data and whether a CPU or GPU is used?

Last question: if the data were properly tokenized, why does it require this specific annotation? Couldn't we directly use "," and "."?

Thanks for your insight, and great work by the way.

Sample data files

Hi,

I was trying to run the code but I was stuck on reading the data files (train, dev). Is it possible for you to provide some sample data?
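(For reference, the annotated samples quoted in later issues on this page suggest the expected format: lowercase plain text with punctuation replaced by tags, e.g.

it's a rare condition .PERIOD and rarer still when accompanied ,COMMA as in my case ,COMMA by self-awareness .PERIOD

This is inferred from those samples, not an official specification.)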

RNN output

Hi ottokart,

What is the output of this RNN? The input is a sequence of tokens (words), so what is the specific output of the RNN? Is it a list of numbers, e.g. [0 0 0 0 1], where 0 is no punctuation and 1 is a period?

Thank you.
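(I can't speak for the exact internals, but a tagger like this typically outputs, for each slot between consecutive words, a probability distribution over the punctuation vocabulary, and the argmax gives the predicted tag. A purely hypothetical illustration:)

import numpy as np

punct_vocab = ["SPACE", ",COMMA", ".PERIOD", "?QUESTIONMARK"]  # illustrative subset
# one row of probabilities per inter-word slot, e.g. for "... or not to be":
probs = np.array([[0.96, 0.02, 0.01, 0.01],   # after "or": no punctuation
                  [0.90, 0.07, 0.02, 0.01],   # after "not": no punctuation
                  [0.15, 0.10, 0.70, 0.05]])  # after "to": .PERIOD wins
predicted = [punct_vocab[i] for i in probs.argmax(axis=1)]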

How to pass in pause durations when querying model

Excellent library, thanks for your efforts. I'm attempting to punctuate the speech-to-text output of multi-party conversations. I was excited to see that your library supports pause durations during second-phase training. How would one query the model with this information? I've tried passing in the following, but it doesn't seem to produce better output:

to <sil=0.000> be <sil=1.000> or <sil=0.000> not <sil=0.000> to <sil=0.000> be <sil=1.000> that <sil=0.000> is <sil=0.000> the <sil=0.000> question <sil=1.000>

N.B. I'm using the demo model that you provided on Google Drive.
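(A note from reading the other issues on this page: the two-stage setup described in "Relations between data in first and second stages" suggests that pause annotations only influence a second-stage model trained on pause-annotated data, so a first-stage, text-only demo model would presumably ignore the <sil=...> tags. This is an inference, not a confirmed answer.)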

How to use punctuator2 tools to process/train/test chinese data?

I want to use punctuator2 to punctuate Chinese txt data, and I have no idea how to process it.

Chinese txt data has these features that differ from English txt data:

  1. Chinese txt data has no spaces between characters.
  2. Chinese txt data is in utf-8 format.
  3. Chinese has a different punctuation system from English, BUT it can be mapped to COMMA and PERIOD in English.

I processed my data like English (added spaces between Chinese characters and substituted ',' with ',COMMA' and so on), but I get an error:

256 0.02 Model_./models/hello.mdl_h256_lr0.02.pcl
Building model...
Number of parameters is 2049032
Training...
WARNING: Not enough samples in '../data/train'. Reduce mini-batch size to 0 or use a dataset with at least 6400 words.
Total number of training labels: 0
WARNING: Not enough samples in '../data/dev'. Reduce mini-batch size to 0 or use a dataset with at least 6400 words.
Total number of validation labels: 0
Traceback (most recent call last):
  File "main.py", line 202, in <module>
    ppl = np.exp(total_neg_log_likelihood / total_num_output_samples)
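(For reference, a minimal sketch of the preprocessing described above, i.e. space-separating characters and mapping Chinese punctuation to the English tags; the mapping table is illustrative, not punctuator2 code.)

PUNCT_MAP = {u"，": ",COMMA", u"。": ".PERIOD", u"？": "?QUESTIONMARK"}

def preprocess(line):
    # space-separate every character; map Chinese punctuation to tags
    return " ".join(PUNCT_MAP.get(ch, ch) for ch in line.strip() if not ch.isspace())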

Thanks for reading this issue. I sincerely await your reply.

How to distribute the data to ep.train.txt, ep.dev.txt, and ep.test.txt? What's the purpose of these files?

head -n -400000 step2.txt > ./out/ep.train.txt
tail -n 400000 step2.txt > step3.txt
head -n -200000 step3.txt > ./out/ep.dev.txt
tail -n 200000 step3.txt > ./out/ep.test.txt

Hi ottokart,
Could you elaborate on how to distribute the data from a corpus to these three files? And what's the purpose of these files?
I have a small corpus file, 65k lines and about 3M words, so I need to know how I should distribute the data among these files. Thanks!
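(A common convention, mirroring the head/tail recipe above, is roughly 80/10/10, which for a 65k-line corpus is about 52k/6.5k/6.5k lines: train is what the model fits, dev is used for validation during training, and test is held out for the final evaluation. A sketch, with the proportions being my suggestion rather than an official rule:)

# split step2.txt into ep.train.txt / ep.dev.txt / ep.test.txt (80/10/10)
with open("step2.txt", encoding="utf-8") as f:
    lines = f.read().splitlines()
n_hold = len(lines) // 10  # ~6.5k lines each for dev and test
splits = {"ep.train.txt": lines[:len(lines) - 2 * n_hold],
          "ep.dev.txt": lines[len(lines) - 2 * n_hold:len(lines) - n_hold],
          "ep.test.txt": lines[len(lines) - n_hold:]}
for name, part in splits.items():
    with open(name, "w", encoding="utf-8") as out:
        out.write("\n".join(part) + "\n")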

Training using GPUs

Hi there,

Is there a way to train using GPUs?
And what is "normal" speed when training the example "ep" model?

On my macbook pro:

python main.py ep 256 0.02
256 0.02 Model_ep_h256_lr0.02.pcl
Building model...
Number of parameters is 17912840
Training...
PPL: 1.7275; Speed: 281.17 sps
PPL: 1.5956; Speed: 301.83 sps
PPL: 1.4879; Speed: 304.04 sps
PPL: 1.4188; Speed: 306.10 sps
PPL: 1.3774; Speed: 306.06 sps
PPL: 1.3452; Speed: 307.42 sps
PPL: 1.3209; Speed: 309.06 sps
PPL: 1.3014; Speed: 310.14 sps
PPL: 1.2860; Speed: 310.48 sps
PPL: 1.2729; Speed: 307.93 sps
PPL: 1.2619; Speed: 306.15 sps
PPL: 1.2526; Speed: 307.00 sps

On an AWS instance with GPU:

THEANO_FLAGS=device=gpu python main.py ep 256 0.02
WARNING (theano.sandbox.cuda): The cuda backend is deprecated and will be removed in the next release (v0.10).  Please switch to the gpuarray backend. You can get more information about how to switch at this URL:
 https://github.com/Theano/Theano/wiki/Converting-to-the-new-gpu-back-end%28gpuarray%29

/home/ubuntu/anaconda2/lib/python2.7/site-packages/theano/sandbox/cuda/__init__.py:556: UserWarning: Theano flag device=gpu* (old gpu back-end) only support floatX=float32. You have floatX=float64. Use the new gpu back-end with device=cuda* for that value of floatX.
  warnings.warn(msg)
Using gpu device 0: Tesla K80 (CNMeM is disabled, cuDNN not available)
256 0.02 Model_ep_h256_lr0.02.pcl
Building model...
Number of parameters is 17912840
Training...
PPL: 1.7275; Speed: 250.31 sps

Thanks for your help!
Looking forward to training with my own data, but I want to make sure everything is working as expected.

Cheers
Miguel

increase in memory usage as training progresses

I am training the model on an AWS Server (Ubuntu 16.04, NVIDIA Tesla K80).
The memory (CPU RAM) is slowly used up as the training progresses, and I am not even able to complete 200 iterations before the process is killed due to the server running out of memory.
The training runs fine on my laptop (CPU), but with a speed of ~150 samples per second.

I would like to run this on the GPU server. Could you please help me with what is causing this and how to resolve it? I am training on the Europarl v7 dataset.

Problem with data.py

Trying to run data.py from my OS terminal, I get this error message:

File "data.py", line 80
print "Vocabulary size: %d" % len(vocabulary)
^
SyntaxError: invalid syntax

Can you explain to me how to overcome this?
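(That is Python 2 print-statement syntax being run under Python 3. Either run the script with Python 2, or convert the statements to function calls, e.g.:

print("Vocabulary size: %d" % len(vocabulary))

The same applies to the other print statements in the repository.)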

MemoryError in data.py

Hello,

First of all, congrats on the project! It has been really useful.

My problem is that when I try to execute data.py with long training text files (about 720,000 KB, for example), a MemoryError appears. However, I was able to train the model with training text of about 200,000 KB. Is there any kind of limit? Should I change some parameters? I've monitored my RAM usage and it's about 600 MB when executing the script, so I am not sure what the real problem is. I have 16 GB of RAM.

My output is:

Traceback (most recent call last):
  File "data.py", line 279, in <module>
    create_dev_test_train_split_and_vocabulary(path, True, TRAIN_FILE, DEV_FILE, TEST_FILE, PRETRAINED_EMBEDDINGS_PATH)
  File "data.py", line 228, in create_dev_test_train_split_and_vocabulary
    for line in text:
  File "E:\Python27\lib\codecs.py", line 699, in next
    return self.reader.next()
  File "E:\Python27\lib\codecs.py", line 630, in next
    line = self.readline()
  File "E:\Python27\lib\codecs.py", line 553, in readline
    line += data
MemoryError

Thanks in advance!

Pre Trained model

The pre-trained model is not the same as the demo; the demo shows much higher accuracy.
Could you please give some hints on how to reproduce the exact demo model?

max number of characters?

I noticed that when I pass text with more than 55,000 characters, punctuator2 returns only a certain number of them and the rest is cut off.

Problems with Initialization

Hello,

I downloaded the repository and created a new folder data inside it. There I put my files:

  • data.train.txt
  • data.dev.txt
  • data.test.txt

In the documentation it is written:

the conversion can be initiated with python data.py <data_dir>

When I run python data.py data, I always get "Data already exists", and without the argument, "The path to stage1 source data directory with txt files is missing". Can you please tell me what the preferred file and folder structure for the data files is?

Thanks!

Punctuation based on part-of-speech tags?

This is a great project! I'm working on some automatic transcription software. All the speech recognition engines I've looked at produce a straight stream of words, and I haven't come across anything that can intelligently break up sentences and insert punctuation.

This is the most advanced project I have come across so far.

It seems to me, though, that punctuation rules are based almost entirely on part-of-speech tags, and that the actual words may be unnecessary information.

Before I stumbled across this project, my idea was to put text through a part-of-speech tagger (such as this one) and feed the part-of-speech sequence through an LSTM neural net to predict punctuation.

This may allow you to get more accurate results with much less training data. Have you tried this approach at all?

The other idea I've had is to use a constituency tree. I've played around with Microsoft's tree and it may be possible to achieve this using strict if-then rules with no neural net involved.

Last phrase punctuation

First of all: thank you for putting this project together; it's incredible and incredibly useful.

I've noticed that the last phrase or sentence in a block of text does not get punctuation added. Do you have any advice on how to fix this?

Thank you

Memory Error

Hello,

I have a dataset split into train, test and dev, consisting of about 24M lines for train, 500k for test and 500k for dev. The 24M train lines are split across 49 different files of about 500k lines each, with a newline '\n' character between phrases.

When trying to execute the script data.py with this dataset, the following error appears with the TRAIN_FILES:

File "data.py", line 285, in <module> create_dev_test_train_split_and_vocabulary(path, True, TRAIN_FILE, DEV_FILE, TEST_FILE, PRETRAINED_EMBEDDINGS_PATH) File "data.py", line 257, in create_dev_test_train_split_and_vocabulary write_processed_dataset(train_txt_files, train_output) File "data.py", line 192, in write_processed_dataset data.append(subsequence) MemoryError

However, when calling the function write_processed_dataset with the test and dev datasets, everything works fine. I've already trained the punctuator with smaller datasets.

Can you please suggest how to deal with this error?

Thank you!

word break

Hi, Ottokar

Great product!
I'm not experienced with neural networks, so I have a question. Is it possible to keep words the same as in the input? E.g., a word gets broken in the following situation:

Unpunctuated text:

you're gonna get left behind in twenty sixteen everyone starting a podcast everyone is starting a podcast it's gonna be like bugs it's you you used to just be able to have a blog and it was cool and people read it

Result

You'Re gon na get left behind in twenty sixteen everyone starting a podcast. Everyone is starting a podcast, it's gon na be like bugs it's you. You used to just be able to have a blog, and it was cool and people read it

Thanks
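(For what it's worth: splitting "gonna" into "gon na" is standard Penn-Treebank-style tokenization, which is what NLTK's word tokenizer produces, so it most likely happens during preprocessing rather than in the model itself. A crude post-processing rejoin, with an illustrative word list:)

import re

def rejoin_treebank_splits(text):
    # undo Treebank-style splits such as "gon na" -> "gonna"
    for split, joined in [("gon na", "gonna"), ("wan na", "wanna"), ("got ta", "gotta")]:
        text = re.sub(r"\b%s\b" % split, joined, text, flags=re.IGNORECASE)
    return text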

Relations between data in first and second stages

Hello,

I have a question about how to train a first and second stages.
I have a lot of plain text to train the first stage, but only a small part of it is annotated with word pause durations.

What is the best approach I should take?

  1. Train a first stage with all plain texts, and then train the second one with the part of them with pause annotations.

  2. Train a first stage with all plain texts except the pause annotated ones, and then use them only in the second stage.

Thank you so much!

Demo pre-trained file and demo site are not the same

Hi, I downloaded your pre-trained model Demo-Europarl-EN.pcl.
It works, but not the same as the demo web site.

Here is my text (scam mail data):

http://london.campusanuncios.com/detcommunity972996941X-ZZMonths-old-London.html 11Months old Brenda is seeking for a new Family([email protected]) Brenda and Junior are a survivor, 1year old.growing and thriving against all odds! They are waiting for a forever family.They are intelligent, resourceful, outgoing, and logical.growing strong and healthy under the loving care of me and nannies. They needs an adoptive family who will explain their past to them and help them understand what has happened.As you can see in their pictures, Brenda and Junior are very healthy and well cared for baby. They arrived in a state of advanced malnutrition but is recovering beautifully. They are well attached to their caretakers. Now all they needs is a forever family of their own.Contact us directly my E-mail ([email protected]) for more information e-mail: 11Months old Brenda is seeking for a new Family([email protected]) LondonClick to expand... Received: from [41.202.196.91] CAMEROON From: Jesica Brown <[email protected]> DEAR SIR/MADAM, Its with great pleasure and happiness that i write to you this day,I am very sorry for the late reply because i have been very busy.I am willing to give this baby girl to any loving family...she is one year old and will be getting her first birth day next week ending. Attached to this mail are some pictures of the baby girl.this is such and unfortunate situation for me.,i want you to know that children are a gift from God and they deserve to be happy, This baby is free and shall be going to any family free.there is nothing like adoption fee.all I need is a family to spoil her with much love,care tenderness and lots of affection. most importantly to give her a sound education. I want to believe that you are able to offer all this to this baby... she is so cute and very playful,please I will also like to know where you are presently located. As soon as you see the pictures and you think you like the baby girl please just get back to me ,i would have loved her to spend her first birthday with her new supposed family. note that this baby shall be coming from Victoria Cameroon , i will like you to know that only the good Lord we serve with his infinite mercy will bless you for taking care of this children.they are a gift from God Almighty. FINANCIALLY. i will like you to know that there is no adoption fee as this baby is free to you ok.you shall only take care of the charges for the procedure which shall be carried out here in Victoria Cameroon . i will guide you through every step on the way through the maze of paperwork until the day you are officially declared the parent and the child is issued a new Birth Certificate bearing your surname.Children are a gift from God Children are innocent, regardless of the circumstance they are born. They have every right to have loving parents and a stable home. Adoption provides them hope for a happy beginning and a brighter future. Every child, with proper upbringing has the potential to become a better person. Moses was adopted. He became a great leader. You too can change the world by changing someone's life. You can make a difference, one child at a time. You can live and leave a lasting legacy. Now? Adopt BRENDA now NOTE THAT BRENDA SHALL BE COMING FROM Victoria Cameroon AS I AM PRESENTLY THERE WITH BRENDA.GET BACK TO US WITH YOUR CREDENTIALS OK. THANKS Tel:+(237)94428364Click to expand...

This is Demo-Europarl-EN.pcl with play_with_model.py:

http://london.campusanuncios.com/detcommunity972996941X-ZZMonths-old-London.html 11Months old Brenda is seeking for a new Family([email protected]) Brenda and Junior are a survivor, ,COMMA 1year ,COMMA old.growing and thriving against all odds! They are waiting for a forever .PERIOD family.They are intelligent, resourceful, ,COMMA outgoing, and logical.growing strong and healthy .PERIOD under the loving care of me and nannies. .PERIOD They needs an adoptive family who will explain their past to them and help them understand .PERIOD what has happened.As you can see in their pictures, .PERIOD Brenda and Junior are very healthy and well cared for .PERIOD baby. They arrived in a state of advanced malnutrition ,COMMA but is recovering .PERIOD beautifully. They are well attached to their caretakers. Now .PERIOD all they needs is a forever family of their own.Contact us directly .PERIOD my E-mail ([email protected]) for more information ,COMMA e-mail: 11Months old Brenda ,COMMA is seeking for a new Family([email protected]) LondonClick to expand... Received: from [41.202.196.91] CAMEROON From: Jesica Brown ,COMMA <[email protected]> ,COMMA DEAR ,COMMA SIR/MADAM, Its ,COMMA with great pleasure and happiness that i write to you .PERIOD this day,I am very sorry for the late reply ,COMMA because i have been very busy.I ,COMMA am willing to give this baby girl to any loving family...she is one year old and will be getting her first birth day next week .PERIOD ending. Attached to this mail are some pictures of the baby .PERIOD girl.this is such and unfortunate situation for me.,i want you to know that children are a gift from God and they deserve to be happy, .PERIOD This baby is free and shall be going to any family .PERIOD free.there is nothing like adoption .PERIOD fee.all I need is a family to spoil her with much love,care tenderness and lots of affection. .PERIOD most importantly ,COMMA to give her a sound education. .PERIOD I want to believe that you are able to offer all this to this baby... .PERIOD she is so cute and very playful,please .PERIOD I will also like to know where you are presently located. As soon .PERIOD as you see the pictures -DASH and you think you like the baby girl -DASH please just get back to me -DASH ,i -DASH would have loved her to spend her first birthday with her new supposed family. note that this baby shall be coming from Victoria Cameroon , .PERIOD i will like you to know that only the good Lord we serve with his infinite mercy will bless you for taking care of this children.they are a gift from God Almighty. FINANCIALLY. .PERIOD i will like you to know that there is no adoption fee ,COMMA as this baby is free to you .PERIOD ok.you shall only take care of the charges for the procedure which shall be carried out here in Victoria Cameroon . .PERIOD i will guide you through every step on the way ,COMMA through the maze of paperwork .PERIOD until the day you are officially declared the parent and the child is issued .PERIOD a new Birth Certificate bearing your surname.Children are a gift from God .PERIOD Children are innocent, ,COMMA regardless of the circumstance they are born. .PERIOD They have every right to have loving parents and a stable ,COMMA home. Adoption provides them hope for a happy beginning and a brighter future. Every child, with proper upbringing ,COMMA has the potential to become a better person. .PERIOD Moses was adopted. ,COMMA He became a great leader. .PERIOD You too ,COMMA can change the world by changing someone's life. 
You can make a difference, one child at a time. .PERIOD You can live and leave a lasting legacy. Now? Adopt BRENDA now NOTE THAT BRENDA SHALL BE COMING FROM Victoria Cameroon AS I AM PRESENTLY ,COMMA THERE ,COMMA WITH ,COMMA BRENDA.GET ,COMMA BACK ,COMMA TO ,COMMA US ,COMMA WITH ,COMMA YOUR ,COMMA CREDENTIALS ,COMMA OK. ,COMMA THANKS Tel:+(237)94428364Click to expand...

This is the demo website:

Http //london.campusanuncios.com/detcommunity972996941X-ZZMonths-old-London.html 11Months old Brenda is seeking for a new Family, ( jesicabrown82 @ yahoo.com, ), Brenda and Junior are a survivor, 1year, old.growing and thriving against all odds. They are waiting for a forever family.They are intelligent, resourceful outgoing and logical.growing strong and healthy. Under the loving care of me and nannies., They needs an adoptive family who will explain their past to them and help them understand what has happened.As you can see in their pictures. Brenda and Junior are very healthy and well cared for baby.. They arrived in a state of advanced malnutrition, but is recovering beautifully.. They are well attached to their caretakers.. Now all they needs is a forever family of their own.Contact us directly. My E-mail, (, jesicabrown82 @ yahoo.com, ) for more information, e-mail, 11Months, old Brenda is seeking for a new Family, ( jesicabrown82 @ yahoo.com ) LondonClick to expand ... Received from [ 41.202.196.91 ], CAMEROON From Jesica Brown, <, jesicabrown82 @ yahoo.com, > DEAR SIR/MADAM, Its with great pleasure And happiness that i write to you this day, I am very sorry for the late reply, because i have been very busy.I - am willing to give this baby girl to any loving family .... She is one year old and will be getting her first birth day next week. Ending. Attached to this mail are some pictures of the baby. Girl.This is such and unfortunate situation for me.. I want you to know that children are a gift from God and they deserve to be happy. This baby is free and shall be going to any family. Free.There is nothing like adoption. Fee.All, I need is a family to spoil her with much love care, tenderness and lots of affection.. Most importantly, to give her a sound education., I want to believe that you are able to offer all this to this baby .... She is so cute and very playful. Please I will also like to know where you are presently located. As soon as you see the pictures, and you think you like the baby girl, please just get back to me. I would have loved her to spend her first birthday with her new supposed family. note that this baby shall be coming from Victoria Cameroon. I will like you to know that only the good Lord we serve with his infinite mercy will bless you for taking care of this children.they are a gift from God, Almighty. FINANCIALLY.. I will like you to know that there is no adoption fee, as this baby is free to you. Ok.You shall only take care of the charges for the procedure which shall be carried out here in Victoria Cameroon. I will guide you through every step on the way, through the maze of paperwork. Until the day you are officially declared the parent and the child is issued. A new Birth Certificate bearing your surname.Children are a gift from God. Children are innocent, regardless of the circumstance they are born.. They have every right to have loving parents and a stable home. Adoption provides them hope for a happy beginning and a brighter future.. Every child with proper upbringing has the potential to become a better person.. Moses was adopted., He became a great leader.. You too can change the world by changing someone's life.. You can make a difference. One child at a time.. You can live and leave a lasting legacy. Now Adopt BRENDA now NOTE THAT BRENDA SHALL BE COMING FROM Victoria Cameroon, AS I AM PRESENTLY THERE WITH BRENDA.GET BACK TO US WITH YOUR CREDENTIALS. Ok. THANKS Tel +, ( 237 ) 94428364Click to expand ...

The demo site works better than the local one.
Can you tell me why?
For example, in the data above, the "there is nothing like adoption fee." part is handled differently.

prosodic dataset

Hi,

Is there a dataset that anyone has found to train the model with prosodic (pause) duration included?

Thank you!

cannot execute error_calculator.py

Hello,

I am getting an error when I try to run the error_calculator.py script:

Traceback (most recent call last):
  File "error_calculator.py", line 147, in <module>
    compute_error([target_path], [predicted_path])
  File "error_calculator.py", line 38, in compute_error
    target_stream = target.read().split()
  File "/usr/lib/python2.7/codecs.py", line 686, in read
    return self.reader.read(size)
  File "/usr/lib/python2.7/codecs.py", line 492, in read
    newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte

I tried to change the code from:

with codecs.open(target_path, 'r', encoding='utf-8') as target, codecs.open(predicted_path, 'r', encoding='utf-8') as predicted:

to

with codecs.open(target_path, 'r', encoding='utf-8') as target, codecs.open(predicted_path, 'r', encoding='utf-8', errors='ignore') as predicted:

but this time I get:

    " ".join(predicted_stream[p_i-2:p_i+2]))
AssertionError: <exception str() failed>

Any suggestions?

Automatic capitalization

Hi Ottokar, I have trained the system for German with Europarl data, and the result was great, thank you! I would like the model to predict capitalization as well. What do you think of the following scenario?

I add new labels to the training data:
US Präsident Trump angekündigt:
will be
us =ALLUPPERCASE präsident \TITLE trump \TITLE angekündigt :COLON
Then I assign new labels to punctuation symbols which come after capitalized words, so:
US, Frankreich, Deustchland
will be
us ,ALLUPPERCASE_COMMA frankreich ,TITLE_COMMA deustchland \TITLE

But a data sparsity problem can arise because we use three different labels instead of the single ,COMMA label: ,COMMA, ,ALLUPPERCASE_COMMA and ,TITLE_COMMA.
The second approach could be to train two different models, one for punctuation and one for capitalization, and apply punctuation prediction first, then capitalization. Do you think the first approach is sufficient if there's enough data?

I am also concerned about vocabulary size; do you think 100K is enough for production systems?

post punctuator steps?

Question about steps after this:
cat data.dev.txt | python punctuator.py <model_path> <model_output_path>

We get a text file containing the result with ',COMMA', '.PERIOD', etc. inside. To generate the final result, we assume the following steps:

  1. replace punc with real punc
  2. Capitalize the word following a .PERIOD

Is this the right understanding?
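(A minimal sketch of those two steps; the tag-to-symbol table is assembled from the tags seen elsewhere on this page.)

PUNCT = {",COMMA": ",", ".PERIOD": ".", "?QUESTIONMARK": "?",
         "!EXCLAMATIONMARK": "!", ":COLON": ":", ";SEMICOLON": ";", "-DASH": "-"}
END_OF_SENTENCE = {".PERIOD", "?QUESTIONMARK", "!EXCLAMATIONMARK"}

def finalize(text):
    out, capitalize_next = [], True
    for token in text.split():
        if token in PUNCT:
            if out:
                out[-1] += PUNCT[token]  # attach the symbol to the preceding word
            capitalize_next = token in END_OF_SENTENCE
        else:
            out.append(token[:1].upper() + token[1:] if capitalize_next else token)
            capitalize_next = False
    return " ".join(out)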

Pre-Trained Model

Hi Ottokart,

It would be very useful to be able to use the pre-trained model that you have set up in your demo. Would it be possible to add a link where we can download it? I am interested in running the model locally, but I am relatively new to this field and am having difficulty setting up everything required to train the model in a reasonable amount of time.

We see spaces inserted that break words in the original text

A small example:
When we run the following on http://bark.phon.ioc.ee/punctuator:

how's everybody doing I'm Dalton I'm a partner at Y Combinator in addition I'm the head of Admissions which is our selection process but the companies that get into YC I am here to talk about pivoting yeah let's talk all about pivoting cool all right here are some stuff we're gonna cover what the heck is a pivot why you should pivot when you should pivot and evaluating ideas to pivot to so we're gonna try to cover all the bases here

The result we got is:

How'S everybody doing I'm Dalton, I'm a partner at Y Combinator. In addition, I'm the head of Admissions, which is our selection process, but the companies that get into YC. I am here to talk about pivoting yeah. Let'S talk all about pivoting cool. All right here are some stuff we're gon na cover. What the heck is a pivot, why you should pivot when you should pivot and evaluating ideas to pivot to so we're gon na try to cover all the bases here.

All the words gonna got broken into gon na. Is there a config by which we can prevent this from happening?

Words Splitting Automatically

Hi Ottokar,
I have been running the model, and somehow it is splitting the words "gonna" and "wanna" into "gon na" and "wan na". I am unable to figure out the rationale behind this! Please help me understand.
Thanks!

nan values while training

Hello again,

I have a question about the training process. I was able to train the network successfully until NaN values of PPL appeared. In my case, they start to appear in epoch 47, and after that epoch every value of PPL and validation PPL is NaN.

I'll paste some lines of the training output (I've printed some additional values):

Epoch = 46
Training with learning rate = 0.02
Total neg log lik = 1292.63056993
PPL = exp(0.00103047717629)
PPL: 1.0010; Speed: 12549.65 sps
Total neg log lik = 2790.23252964
PPL = exp(0.00111217814479)
PPL: 1.0011; Speed: 15146.28 sps
Total neg log lik = 4349.67130947
PPL = exp(0.00115584377909)
PPL: 1.0012; Speed: 16240.85 sps
Total neg log lik = 6142.70772409
PPL = exp(0.00122423224731)
PPL: 1.0012; Speed: 16856.70 sps
Total neg log lik = 7894.18972039
PPL = exp(0.00125863994266)
PPL: 1.0013; Speed: 17253.00 sps
Total neg log lik = 9953.48264909
PPL = exp(0.00132247590469)
PPL: 1.0013; Speed: 17521.59 sps
Total neg log lik = 12641.4703844
PPL = exp(0.00143967182766)
PPL: 1.0014; Speed: 17718.59 sps
Total neg log lik = 14958.8259408
PPL = exp(0.00149063555692)
PPL: 1.0015; Speed: 17859.47 sps
Total number of training labels: 10436608
Net saved
Total number of validation labels: 14513408
Validation perplexity is 6.6061

Epoch = 47
Training with learning rate = 0.02
Total neg log lik = 1606.82540393
PPL = exp(0.00128095137431)
PPL: 1.0013; Speed: 12486.81 sps
Total neg log lik = 3401.88553357
PPL = exp(0.00135598115975)
PPL: 1.0014; Speed: 15087.08 sps
Total neg log lik = nan
PPL = exp(nan)
PPL: nan; Speed: 16219.43 sps
Total neg log lik = nan
PPL = exp(nan)
PPL: nan; Speed: 16852.34 sps
Total neg log lik = nan
PPL = exp(nan)
PPL: nan; Speed: 17255.23 sps
Total neg log lik = nan
PPL = exp(nan)
PPL: nan; Speed: 17535.15 sps
Total neg log lik = nan
PPL = exp(nan)
PPL: nan; Speed: 17743.79 sps
Total neg log lik = nan
PPL = exp(nan)
PPL: nan; Speed: 17898.87 sps
Total number of training labels: 10436608
Net saved
Total number of validation labels: 14513408
Validation perplexity is nan

Epoch = 48
Training with learning rate = 0.02
Total neg log lik = nan
PPL = exp(nan)
PPL: nan; Speed: 12501.37 sps
Total neg log lik = nan
PPL = exp(nan)
PPL: nan; Speed: 15115.35 sps
Total neg log lik = nan
PPL = exp(nan)
PPL: nan; Speed: 16232.72 sps
Total neg log lik = nan
PPL = exp(nan)
PPL: nan; Speed: 16846.52 sps
Total neg log lik = nan
PPL = exp(nan)
PPL: nan; Speed: 17248.92 sps
Total neg log lik = nan
PPL = exp(nan)
PPL: nan; Speed: 17529.18 sps
Total neg log lik = nan
PPL = exp(nan)
PPL: nan; Speed: 17734.80 sps
Total neg log lik = nan
PPL = exp(nan)
PPL: nan; Speed: 17893.42 sps
Total number of training labels: 10436608
Net saved
Total number of validation labels: 14513408
Validation perplexity is nan

I don't understand what is happening.
Maybe I am doing something wrong? How can I fix this?

Thank you again!

Changing the language

Hi,
I wanted to know whether another language follows the same pattern.
Here you treated the English file with GloVe, but I'm trying to do something similar to this project using Dutch as the source language.
How can I get a GloVe version of Dutch?
Thank you.

Unclear on which paths to use in the model

Hey! I'm trying to train Punctuator 2 in Icelandic. I preprocessed the raw text with the script:

with open("isl_out.txt", 'w', encoding='utf-8') as out_txt:encoding='utf-8') as text:

    for line in text:
        line = line.replace("\"", "").strip()
        line = line.replace("“", "").strip()

        line = multiple_punct.sub(r"\g<1>", line)


        if skip(line):
            skipped += 1
            continue

        line = process_line(line)
        alllines_en.append(line)
            

print("Skipped %d lines" % skipped)

and saved 80% to isl.train.txt, 10% to isl.dev.txt and 10% to isl.test.txt.

Then I ran:
python data.py datadir

python main.py models.py 256 0.02

After this has finished running, I'm very unsure about what to do next, with:

cat data.dev.txt | python punctuator.py <model_path> <model_output_path>

I tried:
cat datadir/isl.dev.txt | python punctuator.py Model_models.py_h256_lr0.02.pcl model_output.pcl

but get an error:

Loading model parameters...
Traceback (most recent call last):
  File "punctuator.py", line 149, in <module>
    net, _ = models.load(model_file, 1, x)
  File "/Users/starspace/Desktop/LVL/H12/punctuator2-master/models.py", line 54, in load
    from . import models
ValueError: Attempted relative import in non-package

So most probably I was running the wrong command, but otherwise I don't know what to do with this error. Explanations would be very much appreciated, thanks!
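(Judging from the other issues on this page, where python main.py ep 256 0.02 produces Model_ep_h256_lr0.02.pcl, the first argument to main.py is a model name prefix rather than a file, so the training command above was presumably meant to be python main.py isl 256 0.02, with punctuator.py then pointed at the resulting Model_isl_h256_lr0.02.pcl. This is an inference from the page, not a confirmed fix for the relative-import error.)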

Full stop found at the beginning of the csv

I don't know why there is always a full stop on the first word of the csv in punkProse data, such as:
for test_groundtruth\0.txt
word|punctuation_before|pos|pause_before|pause_before_norm|f0_mean|f0_range|f0_birange|f0_sd|i0_mean|i0_range|i0_birange|i0_sd|speech_rate_norm|sent_tag
and|.|CC|0.1098|-0.3633|-3.6178|1.4643|3.8865|0.5531|4.3286|1.572|9.2383|0.774|-0.0917|

for test_samples\0.txt
and i was certain that's what my life would be . and when i went to college at the university of nevada , las vegas when i was eighteen , i was stunned to find that there was not a pop star onezeroone , or even a degree program for that interest .

There is no '.' at the beginning; I don't know why they are not in sync. Thanks for the kind help.
-dick

Cannot replicate paper results

Downloading the pre-trained INTERSPEECH-T-BRNN.pcl model from here and then running it on the TED talk data does not yield the reported overall F1 score of 63.1. Is there a reason for this, or am I doing something wrong with the training data?

Below are the obtained metrics, from error_calculator.py

Punctuation P R F1
COMMA 43.9 56.0 49.2
PERIOD 61.4 70.4 65.6
QUESTIONMARK 50.0 62.2 55.4
OVERALL 52.4 63.5 57.4

Here is a sample of the test data:

i'm a savant ,COMMA or more precisely ,COMMA a high-functioning autistic savant .PERIOD it's a rare condition .PERIOD and rarer still when accompanied ,COMMA as in my case ,COMMA by self-awareness and a mastery of language .PERIOD very often when i meet someone and they learn this about me there's a certain kind of awkwardness .PERIOD i can see it in their eyes .PERIOD they want to ask me something .PERIOD and in the end ,COMMA quite often ,COMMA the urge is stronger than they are and they blurt it out ,COMMA if i give you my date of birth ,COMMA can you tell me what day of the week i was born on ?QUESTIONMARK or they mention cube roots or ask me to recite a long number or long text .PERIOD

And here is a sample of the output:

i'm a savant or more precisely a high-functioning autistic savant it's ,COMMA a rare condition and rarer still when accompanied as in my case by self-awareness and a mastery of language ,COMMA very often when i meet someone and they learn this about me ,COMMA there's a certain kind of awkwardness .PERIOD i can see it in their eyes .PERIOD they want to ask me something .PERIOD and in the end ,COMMA quite often ,COMMA the urge is stronger than they are .PERIOD and they blurt it out .PERIOD if i give you my date of birth ,COMMA can you tell me what day of the week i was born on or they mention cube roots or ask me to recite a long number or long text ?QUESTIONMARK

As a side note, attempting to run INTERSPEECH-T-BRNN-pre.pcl gives a "ValueError: cannot reshape array of size 10000 into shape (200,256)" error, which does not occur when running INTERSPEECH-T-BRNN.pcl or Demo-Europarl-EN.pcl.
