berknology / text-preprocessing
A Python package for text preprocessing tasks in natural language processing.
License: BSD 3-Clause "New" or "Revised" License
I'm seeing very slow processing on a Pandas DataFrame of ~40k sentences (generally short to mid-length, extracted from TV dialog).
It processes about 4k sentences per 10 minutes on a GPU/High-Memory Pro Google Colab notebook. This may be normal, or in fact good: it seems to be about 2x the speed of the spaCy pipeline figures reported here:
https://stackoverflow.com/questions/51372724/how-to-speed-up-spacy-lemmatization
Any guidance on speeding up this processing step would be appreciated. Could it be accelerated with multiple cores? Are any config changes required? Are there guidelines on memory-size thresholds for Google Cloud instances that may help?
Great package, long-overdue and very needed. - J
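One low-risk thing to try before reaching for multiple cores, assuming the dialog contains many repeated lines, is to preprocess each distinct sentence only once and reuse the result. The sketch below is not part of the package; `preprocess_unique` is a hypothetical helper name, and `func` stands in for whatever callable does the actual preprocessing.

```python
from typing import Callable, Dict, Iterable, List


def preprocess_unique(texts: Iterable[str], func: Callable[[str], str]) -> List[str]:
    """Apply func once per distinct text and reuse the result for repeats."""
    texts = list(texts)
    cache: Dict[str, str] = {}
    for text in texts:
        if text not in cache:
            # Only the first occurrence of a sentence pays the processing cost.
            cache[text] = func(text)
    # Rebuild the output in the original order, one result per input row.
    return [cache[text] for text in texts]
```

With a DataFrame, the result could be assigned back as a column, for example `df["clean"] = preprocess_unique(df["text"], preprocess_text)`, where the column names are placeholders. How much this helps depends entirely on how many duplicates the data has; if there are few, splitting the rows into chunks and mapping them over a `multiprocessing.Pool` may be the more useful direction.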
Hello,
Thanks for your contribution. How can I cite your work in my paper?
Hey team,
I saw that you are using the names-dataset library. Sadly, they made some breaking changes, and running the example given in the README now gives the following error:
AttributeError: 'NameDataset' object has no attribute 'search_first_name'
On further digging, I found that they have removed the search_first_name method altogether. Now only the search method is there.
Could you please update the codebase, or specify the library version that still included the search_first_name method?
Thanks,
Naman
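Until the package is updated, a small compatibility wrapper might bridge the two versions. This is only a sketch: `is_first_name` is a hypothetical helper, and it assumes that the newer `search` method returns a dictionary whose `"first_name"` entry is empty or `None` when the token is not a known first name. That return shape should be checked against the installed names-dataset version.

```python
def is_first_name(name_searcher, token: str) -> bool:
    """Check a token against either the old or the new names-dataset API."""
    if hasattr(name_searcher, "search_first_name"):
        # Older releases: dedicated first-name lookup.
        return bool(name_searcher.search_first_name(token))
    # Newer releases (assumed shape): search() returns a dict with a
    # "first_name" entry that is falsy when there is no match.
    result = name_searcher.search(token)
    return bool(result and result.get("first_name"))
```

The same pattern would apply to the last-name lookup by checking the `"last_name"` entry instead, under the same assumption about the result shape.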
When I use the function below, check_spelling raises an error because remove_special_character doesn't remove the dash '-'.
I also need to keep check_spelling in the pipeline.
from text_preprocessing import (preprocess_text, to_lower, remove_email, remove_url, remove_punctuation,
    remove_special_character, normalize_unicode, remove_number, remove_whitespace,
    remove_stopword, lemmatize_word, stem_word, check_spelling)

def pre_process(input_text):
    # remove_pattern is my own helper (not shown); it strips mentions and hashtags
    input_text = remove_pattern(input_text, r"@[\w]")
    input_text = remove_pattern(input_text, r"#[\w]")
    preprocess_functions = [to_lower, remove_email, remove_url, remove_punctuation, remove_special_character, normalize_unicode, remove_number, remove_whitespace, remove_stopword, lemmatize_word, stem_word, check_spelling]
    preprocessed_text = preprocess_text(input_text, preprocess_functions)
    return preprocessed_text

print(pre_process("The method is internal-based."))
Could you please suggest a solution?
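One possible workaround, sketched here under the assumption that the preprocessing functions early in the list receive and return plain strings, is to add a small custom step that splits hyphenated words before the spell checker sees them. `split_hyphenated` is a hypothetical helper, not a function from the package.

```python
import re

# A hyphen or dash (ASCII '-' or U+2010..U+2015) sitting between two word characters.
_INNER_DASH = re.compile(r"(?<=\w)[-\u2010-\u2015](?=\w)")


def split_hyphenated(text: str) -> str:
    """Replace dashes inside words with a space, e.g. 'internal-based' -> 'internal based'."""
    return _INNER_DASH.sub(" ", text)
```

Placed near the start of `preprocess_functions` (for example right after `to_lower`), this would hand "internal based" to the later steps instead of "internal-based". Whether splitting is preferable to joining the two halves depends on the vocabulary being processed.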
[nltk_data] Downloading package omw-1.4 to /home/ozmosys/nltk_data...
[nltk_data] Package omw-1.4 is already up-to-date!
This message is very annoying, and it causes the error "End of script output before headers".
Would it be possible to pass quiet=True to every download call in venv/lib/python3.9/site-packages/text_preprocessing/text_preprocessing.py? At the moment the last call omits it:
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)
nltk.download('punkt', quiet=True)
nltk.download('omw-1.4')
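Until the package passes quiet=True everywhere, a caller-side workaround could be to silence standard output while the module is imported. The sketch below assumes the download messages are written through Python's `sys.stdout`; `call_quietly` is a hypothetical helper name.

```python
import contextlib
import io


def call_quietly(func, *args, **kwargs):
    """Run func with stdout captured; return (result, captured_text)."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        # Anything printed via sys.stdout inside this block goes to the buffer.
        result = func(*args, **kwargs)
    return result, buffer.getvalue()
```

For the import itself, this could be used as `module, _ = call_quietly(importlib.import_module, "text_preprocessing")`. Output written directly to the underlying file descriptor, or to stderr, would not be caught by this approach.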
I would like to pass preprocess_functions as a parameter to the preprocess_text method, using the example below:
def preprocess_text_spark(df: SparkDataFrame,
                          target_column: str,
                          preprocessed_column_name: str = 'preprocessed_text') -> SparkDataFrame:
    """Preprocess text in a column of a PySpark DataFrame by leveraging a PySpark UDF to preprocess text in parallel."""
    preprocess_functions = [to_lower, remove_email, remove_url, remove_punctuation, remove_special_character, normalize_unicode, remove_number, remove_whitespace, remove_stopword, lemmatize_word, stem_word, check_spelling]
    _preprocess_text = udf(preprocess_text, StringType())
    new_df = df.withColumn(preprocessed_column_name, _preprocess_text(df[target_column], preprocess_functions))
    return new_df
This fails with:
TypeError: Invalid argument, not a string or column: [<function to_lower at 0x7f33f9a865f0>, <function remove_email at 0x7f33f9a93c20>, <function remove_url at 0x7f33f9a933b0>, <function remove_punctuation at 0x7f33f9a934d0>, <function remove_special_character at 0x7f33f9a935f0>, <function normalize_unicode at 0x7f33f9a93a70>, <function remove_number at 0x7f33f9a93170>, <function remove_whitespace at 0x7f33f9a93830>, <function remove_stopword at 0x7f33f9a93b00>, <function lemmatize_word at 0x7f33f9a8d4d0>, <function stem_word at 0x7f33f9a8d3b0>, <function check_spelling at 0x7f33f9a8d170>] of type <class 'list'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
I tried wrapping preprocess_functions with array and lit, with no success.
How can I resolve this issue?
https://stackoverflow.com/questions/67561202/pass-list-to-udf-method-as-a-parameter
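The TypeError arises because a Spark UDF is called with columns, and a list of Python functions is not a column. One common way around it is to bind the list in a closure so the UDF only takes the text column. The sketch below shows just the closure part in plain Python; `bind_functions` is a hypothetical helper, and the PySpark lines in the comments are untested illustrations.

```python
from typing import Callable, List, Optional


def bind_functions(preprocess: Callable, functions: List[Callable]) -> Callable:
    """Return a one-argument function with the function list already bound."""
    def _single_column(text: Optional[str]) -> Optional[str]:
        if text is None:
            # Spark passes None for null cells; leave them untouched.
            return None
        return preprocess(text, functions)
    return _single_column


# Intended use inside preprocess_text_spark (illustrative only):
#   _preprocess_text = udf(bind_functions(preprocess_text, preprocess_functions), StringType())
#   new_df = df.withColumn(preprocessed_column_name, _preprocess_text(df[target_column]))
```

Because the function list travels with the closure, it has to be picklable on the workers, which should hold for module-level functions such as the ones imported from the package.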
The code breaks when preprocess_text(string) is called.
Specifically, the call name_searcher.search_first_name(token) fails. It no longer seems to be supported by names_dataset. For your use case, the search function might work instead,
i.e. name_searcher.search(token).
I assume the same bug occurs with the last-name search.
Thanks in advance
The apostrophe is also removed when removing punctuation, which could cause problems in some cases (for example, "don't" becomes "dont").
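A custom step that keeps apostrophes could be swapped in for remove_punctuation where contractions matter. This is a minimal sketch using only the standard library; `remove_punctuation_keep_apostrophe` is a hypothetical name and not part of the package.

```python
import string

# All ASCII punctuation except the straight apostrophe.
_PUNCT_EXCEPT_APOSTROPHE = string.punctuation.replace("'", "")
_TABLE = str.maketrans("", "", _PUNCT_EXCEPT_APOSTROPHE)


def remove_punctuation_keep_apostrophe(text: str) -> str:
    """Delete ASCII punctuation but leave apostrophes in place."""
    return text.translate(_TABLE)
```

Note that this also keeps apostrophes used as single quotation marks, so text with quoted phrases may need an extra cleanup step.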