
berknology / text-preprocessing


A Python package for text preprocessing tasks in natural language processing.

License: BSD 3-Clause "New" or "Revised" License

Python 95.58% Makefile 4.42%
machine-learning natural-language-processing python text-preprocessing

text-preprocessing's Issues

Error: list

Thanks for the package, it's neat to have everything in one place.

I am bumping into the following problem:

This works, but as soon as I add the punctuation and other functions, there is a problem:

image

image

Performance Issues with Lemmatization

I'm encountering a very slow processing time on a Pandas DataFrame of ~40k sentences (generally short to mid-length extracted from TV dialog).

It appears to be processing about 4k sentences per 10 minutes on a GPU/High-Memory Pro Google Colab notebook. This may be normal, or in fact good: it seems to be 2x the speed of the spaCy pipeline figures given here:

https://stackoverflow.com/questions/51372724/how-to-speed-up-spacy-lemmatization

Any guidance on speeding up this processing step would be appreciated. Could this be accelerated with multi-core? Are any config changes required? Any guidelines on breakpoints on memory size of Google Cloud instances that may help?

Great package, long-overdue and very needed. - J
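
One low-effort thing to try before reaching for multiple cores, sketched here under the assumption that short TV dialog contains many repeated lines: run the pipeline once per distinct sentence and reuse the result. The names `preprocess_unique` and `slow_clean` are hypothetical; `slow_clean` only stands in for a call to preprocess_text with your own function list.

```python
from typing import Callable, Dict, List


def preprocess_unique(sentences: List[str], clean: Callable[[str], str]) -> List[str]:
    """Run `clean` once per distinct sentence and reuse the result for repeats."""
    cache: Dict[str, str] = {}
    results = []
    for sentence in sentences:
        if sentence not in cache:
            cache[sentence] = clean(sentence)
        results.append(cache[sentence])
    return results


calls = []  # records every sentence that actually reaches the slow step


def slow_clean(sentence: str) -> str:
    # Hypothetical stand-in for the real (slow) preprocessing pipeline.
    calls.append(sentence)
    return sentence.lower().strip()


dialog = ["Hello there.", "Hello there.", "  Goodbye. "]
cleaned = preprocess_unique(dialog, slow_clean)
```

How much this helps depends entirely on how repetitive the data is. If most sentences are distinct, splitting the list into chunks and mapping them over worker processes is the more likely route.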

names-dataset made breaking changes

Hey team,

I saw that you are using the names-dataset library. Unfortunately, it made some breaking changes, and running the example given in the README now raises the following error:
AttributeError: 'NameDataset' object has no attribute 'search_first_name'

On further digging, I found that the search_first_name method has been removed altogether; only the search method remains.

Could you please update the codebase, or specify a library version that still includes the search_first_name method?

Thanks,
Naman
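
Until the package is updated, a small compatibility shim along these lines might bridge both API styles. This is only a sketch: `is_first_name` and the two stub classes are hypothetical, and the assumption that the newer `search` returns a dict whose "first_name" entry is None for unknown names should be checked against the installed names-dataset version.

```python
def is_first_name(searcher, token: str) -> bool:
    """Return True if `token` is a known first name, on either API style."""
    # Older releases: a dedicated method, as used in the failing example.
    legacy = getattr(searcher, "search_first_name", None)
    if callable(legacy):
        return bool(legacy(token))
    # Newer releases (ASSUMED shape): search() returns a dict whose
    # "first_name" entry is None when the token is not a known first name.
    result = searcher.search(token) or {}
    return result.get("first_name") is not None


class LegacyStub:
    """Hypothetical stand-in for the old API (not the real NameDataset)."""

    def search_first_name(self, token):
        return token.lower() == "naman"


class CurrentStub:
    """Hypothetical stand-in for the new API (not the real NameDataset)."""

    def search(self, token):
        known = token.lower() == "naman"
        return {"first_name": {"gender": {}} if known else None, "last_name": None}


old_style = is_first_name(LegacyStub(), "Naman")
new_style = is_first_name(CurrentStub(), "Naman")
```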

A problem with dash character with check spelling

When I use this method, check_spelling raises an error because remove_special_character does not remove the dash '-'.
I also need to use check_spelling.

    # remove_pattern is not defined in this snippet
    def pre_process(input_text):
        input_text = remove_pattern(input_text, r"@[\w]")
        input_text = remove_pattern(input_text, r"#[\w]")
        preprocess_functions = [to_lower, remove_email, remove_url, remove_punctuation,
                                remove_special_character, normalize_unicode, remove_number,
                                remove_whitespace, remove_stopword, lemmatize_word,
                                stem_word, check_spelling]
        preprocessed_text = preprocess_text(input_text, preprocess_functions)
        return preprocessed_text

    print(pre_process("The method is internal-based."))

Could you please suggest a solution?
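
One possible workaround, offered as a sketch rather than the package's own fix: replace dashes with a space before the rest of the pipeline runs. `replace_dash` is a hypothetical helper, and it assumes that splitting "internal-based" into two words is acceptable for your use.

```python
import re

# ASCII hyphen plus the Unicode hyphen/dash range U+2010..U+2015.
_DASHES = re.compile(r"[-\u2010-\u2015]+")


def replace_dash(input_text: str) -> str:
    """Replace each run of hyphens or dashes with a single space."""
    return _DASHES.sub(" ", input_text)


example = replace_dash("The method is internal-based.")
```

Since it takes a string and returns a string, it should be possible to place `replace_dash` near the front of the preprocess_functions list, before remove_punctuation, assuming preprocess_text accepts any callable of that shape.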

[nltk_data] Package omw-1.4 is already up-to-date!

[nltk_data] Downloading package omw-1.4 to /home/ozmosys/nltk_data...
[nltk_data] Package omw-1.4 is already up-to-date!

This message is very annoying, and it also causes the error "End of script output before headers".

Is it possible to specify quiet=True in venv/lib/python3.9/site-packages/text_preprocessing/text_preprocessing.py?

nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)
nltk.download('punkt', quiet=True)
nltk.download('omw-1.4')

Pass preprocess_functions as a parameter to the preprocess_text UDF

I would like to pass preprocess_functions as a parameter to the preprocess_text method, using the example below:

    def preprocess_text_spark(df: SparkDataFrame,
                              target_column: str,
                              preprocessed_column_name: str = 'preprocessed_text') -> SparkDataFrame:
        """Preprocess text in a column of a PySpark DataFrame by leveraging PySpark UDF to preprocess text in parallel."""
        preprocess_functions = [to_lower, remove_email, remove_url, remove_punctuation,
                                remove_special_character, normalize_unicode, remove_number,
                                remove_whitespace, remove_stopword, lemmatize_word,
                                stem_word, check_spelling]
        _preprocess_text = udf(preprocess_text, StringType())
        new_df = df.withColumn(preprocessed_column_name,
                               _preprocess_text(df[target_column], preprocess_functions))
        return new_df

TypeError: Invalid argument, not a string or column: [<function to_lower at 0x7f33f9a865f0>, <function remove_email at 0x7f33f9a93c20>, <function remove_url at 0x7f33f9a933b0>, <function remove_punctuation at 0x7f33f9a934d0>, <function remove_special_character at 0x7f33f9a935f0>, <function normalize_unicode at 0x7f33f9a93a70>, <function remove_number at 0x7f33f9a93170>, <function remove_whitespace at 0x7f33f9a93830>, <function remove_stopword at 0x7f33f9a93b00>, <function lemmatize_word at 0x7f33f9a8d4d0>, <function stem_word at 0x7f33f9a8d3b0>, <function check_spelling at 0x7f33f9a8d170>] of type <class 'list'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.

I tried converting preprocess_functions with array and lit, without success.

How could I resolve this issue?

https://stackoverflow.com/questions/67561202/pass-list-to-udf-method-as-a-parameter
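
The TypeError above arises because a UDF call only accepts columns (or column literals), and a list of Python functions is neither. A common way around this is to bind the function list in a closure so that the UDF itself takes a single argument. The sketch below shows only the closure, in plain Python; `make_preprocessor` is a hypothetical helper, and its loop is a simplified stand-in for preprocess_text.

```python
from typing import Callable, List


def make_preprocessor(functions: List[Callable]) -> Callable[[str], str]:
    """Bind a list of functions into a single one-argument callable."""

    def _preprocess(text: str) -> str:
        # Simplified stand-in for preprocess_text(text, functions):
        # apply each function in order to the running result.
        for func in functions:
            text = func(text)
        return text

    return _preprocess


# Demo with built-in string methods instead of the package's functions.
demo = make_preprocessor([str.lower, str.strip])
demo_result = demo("  Hello Spark ")
```

In the PySpark function, the same idea would look something like `udf(lambda text: preprocess_text(text, preprocess_functions), StringType())`, with the UDF then called on `df[target_column]` alone. This is untested against the package here and assumes the functions can be serialized to the workers.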

Code breaks when cleaning names

The code breaks when "preprocess_text(string)" is called.
Specifically, the call "name_searcher.search_first_name(token)" fails; it no longer seems to be supported by "names_dataset". For your use case, the "search" function should work instead, i.e. "name_searcher.search(token)".
I assume the same bug would occur with the last name search.
Thanks in advance

Apostrophe

Apostrophe is also removed when removing punctuation, which could cause problems in some cases.
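
For cases where contractions such as "don't" need to survive, a variant that strips punctuation but keeps the apostrophe could look like the sketch below. `remove_punctuation_keep_apostrophe` is a hypothetical name, not part of the package, and it only covers the ASCII punctuation in `string.punctuation`.

```python
import string

# All ASCII punctuation except the apostrophe.
_PUNCTUATION_TO_DROP = string.punctuation.replace("'", "")
_TABLE = str.maketrans("", "", _PUNCTUATION_TO_DROP)


def remove_punctuation_keep_apostrophe(input_text: str) -> str:
    """Delete ASCII punctuation characters but leave apostrophes in place."""
    return input_text.translate(_TABLE)


kept = remove_punctuation_keep_apostrophe("Don't stop, it's fine!")
```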
