berknology / text-preprocessing

A Python package for text preprocessing tasks in natural language processing.

License: BSD 3-Clause "New" or "Revised" License

Python 95.58% Makefile 4.42%

machine-learning natural-language-processing python text-preprocessing

text-preprocessing's Issues

Error: list

Thanks for the package, it's neat to have everything in one place.

I am bumping into the following problem:

This works, but as soon as I add the punctuation and other functions, there is a problem.

Performance Issues with Lemmatization

I'm encountering a very slow processing time on a Pandas DataFrame of ~40k sentences (generally short to mid-length extracted from TV dialog).

It appears to process about 4k sentences per 10 minutes on a GPU/High-Memory Pro Google Colab notebook. This may be normal or in fact good. It seems to be 2x the speed of the spaCy pipeline specs here:

https://stackoverflow.com/questions/51372724/how-to-speed-up-spacy-lemmatization

Any guidance on speeding up this processing step would be appreciated. Could this be accelerated with multi-core? Are any config changes required? Any guidelines on breakpoints on memory size of Google Cloud instances that may help?
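
One option that does not depend on the hardware is to avoid repeating work: short dialog lines often recur, so each distinct sentence only needs to be processed once. The sketch below is a minimal illustration of that idea, not something the package provides. `preprocess_unique` and `slow_pipeline` are hypothetical names, and `slow_pipeline` is only a stand-in for the real preprocessing call. How much this helps depends entirely on how many duplicate sentences the data contains.

```python
from typing import Callable, Dict, Iterable, List


def preprocess_unique(texts: Iterable[str],
                      preprocess: Callable[[str], str]) -> List[str]:
    """Apply `preprocess` once per distinct string and reuse the result."""
    cache: Dict[str, str] = {}
    results: List[str] = []
    for text in texts:
        if text not in cache:
            # Only the first occurrence pays the processing cost.
            cache[text] = preprocess(text)
        results.append(cache[text])
    return results


def slow_pipeline(text: str) -> str:
    """Hypothetical stand-in for the expensive preprocessing step."""
    return text.lower().strip()


lines = ["Hello there.", "Yes.", "Hello there.", "Yes."]
cleaned = preprocess_unique(lines, slow_pipeline)
```

With a pandas DataFrame, the input could be a column converted with `tolist()` and the returned list assigned back as a new column.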

Great package, long-overdue and very needed. - J

How to cite this GitHub repository

Hello,
Thanks for your contribution. How can I cite your work in my paper?

names-dataset made breaking changes

Hey team,

I saw that you are using the names-dataset library. Sadly, it made some breaking changes, and running the example from the README now gives the following error:
AttributeError: 'NameDataset' object has no attribute 'search_first_name'

On further digging, I found that the search_first_name method has been removed altogether; only the search method remains.

Could you please update the codebase, or specify the library version that still includes the search_first_name method?

Thanks,
Naman
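
Until the package is updated, one way to cope with both library versions is a small shim that prefers the old method when it exists and otherwise falls back to search. This is only a sketch under assumptions: `is_first_name` is a hypothetical helper, and the shape of the value returned by search (a dict whose 'first_name' entry is empty when nothing matches) is assumed rather than confirmed here. The two fake classes exist only to exercise the shim without installing names-dataset.

```python
def is_first_name(searcher, token: str) -> bool:
    """Return True if `token` is reported as a first name by `searcher`."""
    legacy = getattr(searcher, "search_first_name", None)
    if callable(legacy):
        # Older releases: the method named in the error message still exists.
        return bool(legacy(token))
    # Assumption: newer releases return a dict with a 'first_name' entry
    # that is None or empty when the token is not a known first name.
    result = searcher.search(token) or {}
    return bool(result.get("first_name"))


class FakeOldSearcher:
    """Illustrative stand-in for a release that has search_first_name."""

    def search_first_name(self, token):
        return token == "Naman"


class FakeNewSearcher:
    """Illustrative stand-in for a release that only has search."""

    def search(self, token):
        found = {"rank": 1} if token == "Naman" else None
        return {"first_name": found, "last_name": None}
```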

A problem with dash character with check spelling

When I use this method, check_spelling raises an error because remove_special_character doesn't remove the dash '-'.
I also need to use check_spelling.

def pre_process(input_text):
    input_text = remove_pattern(input_text, r"@[\w]")
    input_text = remove_pattern(input_text, r"#[\w]")
    preprocess_functions = [to_lower, remove_email, remove_url, remove_punctuation, remove_special_character, normalize_unicode, remove_number, remove_whitespace, remove_stopword, lemmatize_word, stem_word, check_spelling]
    preprocessed_text = preprocess_text(input_text, preprocess_functions)
    return preprocessed_text

print(pre_process("The method is internal-based."))

Could you please suggest a solution?
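
A possible workaround, sketched here as an assumption rather than as the package's intended fix, is to split hyphenated words before the text reaches the pipeline. `replace_dash` is a hypothetical helper, not part of text-preprocessing; it replaces a hyphen or dash that sits between two word characters with a space, so "internal-based" becomes two ordinary tokens.

```python
import re

# Hyphen-minus plus the Unicode dashes U+2010 through U+2015,
# matched only when surrounded by word characters.
_DASH_BETWEEN_WORDS = re.compile(r"(?<=\w)[-\u2010-\u2015](?=\w)")


def replace_dash(text: str) -> str:
    """Replace dashes that join two words with a single space."""
    return _DASH_BETWEEN_WORDS.sub(" ", text)


print(replace_dash("The method is internal-based."))
# prints: The method is internal based.
```

Calling `input_text = replace_dash(input_text)` at the top of pre_process, before preprocess_text runs, would keep the rest of the function unchanged.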

[nltk_data] Package omw-1.4 is already up-to-date!

[nltk_data] Downloading package omw-1.4 to /home/ozmosys/nltk_data...
[nltk_data] Package omw-1.4 is already up-to-date!

This message is very annoying, and it causes the error "End of script output before headers".

Is it possible to specify quiet=True in venv/lib/python3.9/site-packages/text_preprocessing/text_preprocessing.py?

nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)
nltk.download('punkt', quiet=True)
nltk.download('omw-1.4')
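
In the calls listed above, only the omw-1.4 download lacks quiet=True. A minimal sketch of passing the flag uniformly is shown below. `download_quietly` and its `download` parameter are hypothetical and exist only so the loop can be tried without network access; `nltk.download(name, quiet=True)` is the real call it falls back to.

```python
from typing import Callable, Iterable, Optional

# The four resources named in the issue.
NLTK_PACKAGES = ("stopwords", "wordnet", "punkt", "omw-1.4")


def download_quietly(packages: Iterable[str] = NLTK_PACKAGES,
                     download: Optional[Callable] = None) -> None:
    """Fetch each NLTK resource with quiet=True so nothing is printed."""
    if download is None:
        # Imported lazily so the function can be defined without NLTK installed.
        import nltk
        download = nltk.download
    for name in packages:
        download(name, quiet=True)
```

Keeping the package names in one tuple makes it harder for a single call to miss the flag again.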

Pass Preprocess_functions as a parameter to udf Preprocess_text method

I would like to pass preprocess_functions as a parameter to the preprocess_text method, using the example below:

def preprocess_text_spark(df: SparkDataFrame, target_column: str, preprocessed_column_name: str = 'preprocessed_text') -> SparkDataFrame:
    """ Preprocess text in a column of a PySpark DataFrame by leveraging PySpark UDF to preprocess text in parallel """
    preprocess_functions = [to_lower, remove_email, remove_url, remove_punctuation, remove_special_character, normalize_unicode, remove_number, remove_whitespace, remove_stopword, lemmatize_word, stem_word, check_spelling]
    _preprocess_text = udf(preprocess_text, StringType())
    new_df = df.withColumn(preprocessed_column_name, _preprocess_text(df[target_column], preprocess_functions))
    return new_df

TypeError: Invalid argument, not a string or column: [<function to_lower at 0x7f33f9a865f0>, <function remove_email at 0x7f33f9a93c20>, <function remove_url at 0x7f33f9a933b0>, <function remove_punctuation at 0x7f33f9a934d0>, <function remove_special_character at 0x7f33f9a935f0>, <function normalize_unicode at 0x7f33f9a93a70>, <function remove_number at 0x7f33f9a93170>, <function remove_whitespace at 0x7f33f9a93830>, <function remove_stopword at 0x7f33f9a93b00>, <function lemmatize_word at 0x7f33f9a8d4d0>, <function stem_word at 0x7f33f9a8d3b0>, <function check_spelling at 0x7f33f9a8d170>] of type <class 'list'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.

I tried wrapping preprocess_functions with array and lit, with no results.

How could I resolve this issue?

https://stackoverflow.com/questions/67561202/pass-list-to-udf-method-as-a-parameter
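
The TypeError arises because every argument passed when calling a UDF must be a column, and a list of Python functions is not one. A common way around this is to bind the list in a closure and register the closure as the UDF, so that only the text column is passed at call time. The sketch below assumes this applies to preprocess_text as used in the issue. `bind_functions` and `add_preprocessed_column` are hypothetical names, and whether the individual preprocessing functions serialize cleanly to Spark executors has not been verified here.

```python
from typing import Callable, Sequence


def bind_functions(preprocess: Callable,
                   functions: Sequence[Callable]) -> Callable[[str], str]:
    """Return a one-argument function with `functions` already bound."""
    bound = list(functions)

    def apply(text: str) -> str:
        # The list travels inside the closure, not as a Spark column.
        return preprocess(text, bound)

    return apply


def add_preprocessed_column(df, target_column: str, preprocess: Callable,
                            functions: Sequence[Callable],
                            output_column: str = "preprocessed_text"):
    """Add a column produced by a UDF that wraps the bound function."""
    # PySpark is imported lazily so bind_functions can be tried without Spark.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    apply_udf = udf(bind_functions(preprocess, functions), StringType())
    return df.withColumn(output_column, apply_udf(df[target_column]))
```

In the issue's setting, the call would look like `add_preprocessed_column(df, target_column, preprocess_text, preprocess_functions)`.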

Code breaks when cleaning names

The code breaks when "preprocess_text(string)" is called.
Specifically, the call "name_searcher.search_first_name(token)" fails. It no longer seems to be supported by "names_dataset". For your use case, the "search" function could work instead,
i.e. "name_searcher.search(token)".
I assume the same bug would occur with the last name search.
Thanks in advance.

Apostrophe

Apostrophes are also removed when removing punctuation, which could cause problems in some cases.
