This is my NLP project, which includes many sub-projects.
## Text classification and keyword extraction based on abstracts

relative link: Text classification and keyword extraction based on abstracts
This is my first NLP project; not perfect, but interesting.
- `note` is my markdown notes.
- `baseline1` is the traditional baseline of the project. It runs on Baidu AI Studio (relative link), and this is the local version.
- `NLP_baseline` is a series of baselines, trying different classifiers: Logistic Regression, Support Vector Machine, and Random Forest. On top of these classifiers, the parameters are fine-tuned with `parameter_tuning.py` (`baseline_tuning.py`). According to the score given by the platform, the fine-tuned Logistic Regression model (AKA the fine-tuned baseline) performs best so far, reaching 0.99401.
  The official released another dataset, `testB.csv`, on 24 July. This dataset removes the `Keywords` column, so I updated `baseline2` into `baseline3` to handle it.
- `NLP_upper` is the upper project, using the BERT model from `transformers` to solve the classification problem. Regretfully, my local environment could not support it (my poor GTX 1650 4GB). SOLUTION: run the project on Ali Cloud (not successful yet) <--- it's still a good solution.
  However, the project had run for 26 epochs before I stopped the interpreter, and the score was unsatisfactory <--- maybe overfitting. With epochs=10, the model works well, accuracy reaching 0.9850 <--- for task 1.
  The latest version of `NLP_upper` is a complete version. It uses the BERT model to solve two tasks, compared with only one in the last version. The result is quite good, but a bit late :).
- `NLP_chatGLM` is the project using an LLM, leveraging ChatGLM given the stability of its connection. However, using the API may cause a problem: input containing sensitive words stops the program, which highlights the value of training the LLM locally.
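For readers who want a concrete picture of what the fine-tuned Logistic Regression baseline above looks like, here is a minimal sketch in scikit-learn. The toy texts, labels, and parameter grid are my own illustrative assumptions, not the project's actual data or code; only the overall technique (TF-IDF features plus a grid-searched Logistic Regression) follows the description.

```python
# Hypothetical sketch of the fine-tuned Logistic Regression baseline:
# TF-IDF features over abstracts plus a grid search over C.
# The toy data and parameter grid are illustrative assumptions,
# not the project's actual setup.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

texts = [
    "deep learning for medical image segmentation",
    "convolutional networks improve tumor detection",
    "stock market prediction with time series models",
    "forecasting financial returns using regression",
] * 5  # repeat so cross-validation has enough samples per class
labels = [1, 1, 0, 0] * 5  # 1 = medical, 0 = finance (toy labels)

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Fine-tune the regularization strength, mirroring the role of the
# tuning script described above.
search = GridSearchCV(pipe, {"clf__C": [0.1, 1.0, 10.0]}, cv=5)
search.fit(texts, labels)

print(search.best_params_)
print(search.predict(["tumor detection with deep networks"]))
```

The same pipeline object can be swapped to an SVM or Random Forest by replacing the `clf` step, which is presumably how the series of baselines differs.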
## ChatGPT-generated Text Tester

relative link: ChatGPT-generated Text Tester
This is a program that identifies whether the content is generated by GPT.
- `note` is my markdown notes.
- `baseline` is the baseline of this sub-project. It reaches an average level, using Logistic Regression.
- `upper` is the upper project, using TF-IDF to classify the contents.
- `bert` is another solution using the BERT model; it was the best model up to that point.
- `chatGLM_api` is a failed project, but it's not meaningless. For one thing, the LLM performs well at classifying; for another, using the API is not a good idea. From my point of view, the solution is to build a training set and fine-tune the LLM on a GPU.
- `ernie` performs best. It uses the Ernie model and the Paddle environment, and the project runs on AI Studio. Set epochs=100 and run all cells.
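As a rough illustration of the Logistic Regression baseline for this detector, the sketch below classifies text as human-written or GPT-generated. The sample sentences and labels are invented for illustration (the real project trains on a labelled dataset), and the choice of character n-gram features is my own assumption, made because they pick up stylistic cues such as punctuation and casing.

```python
# Hypothetical sketch of a Logistic Regression detector for
# GPT-generated text. The texts and labels are invented; the real
# project trains on a labelled dataset.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

human = [
    "honestly i dunno, the movie was kinda meh but the popcorn slapped",
    "cant believe we lost again... refs were blind tonight",
]
generated = [
    "Certainly! Here is a concise summary of the key points discussed.",
    "As an AI language model, I can provide an overview of the topic.",
]
texts = (human + generated) * 5  # repeat so the toy model has data to fit
labels = ([0] * len(human) + [1] * len(generated)) * 5  # 1 = GPT-generated

# Character n-grams capture stylistic cues (punctuation, casing) that
# word features can miss; this feature choice is an assumption, not
# necessarily what the repo's baseline uses.
detector = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
detector.fit(texts, labels)

print(detector.predict(["Certainly! Here is a detailed explanation."]))
```

The `bert` and `ernie` sub-projects replace these hand-crafted features with pretrained transformer encoders, which is why they score higher.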
To be continued...