This is my NLP project, which includes many sub-projects.
## Text classification and keyword extraction based on abstracts

relative link: Text classification and keyword extraction based on abstracts
This is my first NLP project; not perfect, but interesting.
- `note` is my markdown notes.
- `baseline1` is the traditional baseline of the project. It runs on Baidu AI Studio (relative link), and this is the local version.
- `NLP_baseline` is a series of baselines, trying different classifiers: Logistic Regression, Support Vector Machine, and Random Forest. On top of these classifiers, the parameters are fine-tuned with `parameter_tuning.py` (`baseline_tuning.py`). According to the score given by the platform, the fine-tuned Logistic Regression model (AKA the fine-tuned baseline) performs best so far, reaching 0.99401.
  The official released another dataset, `testB.csv`, on 24 July. This dataset removes the `Keywords` column, so I updated `baseline2` into `baseline3` to handle it.
- `NLP_upper` is the upper project, using the BERT model from `transformers` to solve the classification problem. Regretfully, my local environment could not support it (my poor GTX 1650 4GB). SOLUTION: run the project on Ali Cloud (not successful yet) <--- it's still a good solution.
  However, the project had run for 26 epochs before I stopped the interpreter, and the score was unsatisfactory <--- maybe overfitting. With epochs=10, the model works well, accuracy reaching 0.9850 <--- for task 1.
  The latest version of `NLP_upper` is a complete version. It uses the BERT model to solve two tasks, compared with only one in the last version. The result is quite good, but a bit late :).
- `NLP_chatGLM` is the project using an LLM, leveraging ChatGLM given the stability of its connection. However, using the API may cause a problem: input containing sensitive words stops the program, which highlights the value of training the LLM locally.
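For readers who want a concrete picture of what the fine-tuned Logistic Regression baseline above looks like, here is a minimal sketch in scikit-learn. The toy texts, labels, and parameter grid are my own illustrative assumptions, not the project's actual data or code; only the overall technique (TF-IDF features plus a grid-searched Logistic Regression) follows the description.

```python
# Hypothetical sketch of the fine-tuned Logistic Regression baseline:
# TF-IDF features over abstracts plus a grid search over C.
# The toy data and parameter grid are illustrative assumptions,
# not the project's actual setup.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

texts = [
    "deep learning for medical image segmentation",
    "convolutional networks improve tumor detection",
    "stock market prediction with time series models",
    "forecasting financial returns using regression",
] * 5  # repeat so cross-validation has enough samples per class
labels = [1, 1, 0, 0] * 5  # 1 = medical, 0 = finance (toy labels)

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Fine-tune the regularization strength, mirroring the role of the
# tuning script described above.
search = GridSearchCV(pipe, {"clf__C": [0.1, 1.0, 10.0]}, cv=5)
search.fit(texts, labels)

print(search.best_params_)
print(search.predict(["tumor detection with deep networks"]))
```

The same pipeline object can be swapped to an SVM or Random Forest by replacing the `clf` step, which is presumably how the series of baselines differs.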
## ChatGPT-generated Text Tester

relative link: ChatGPT-generated Text Tester
This is a program that identifies whether the content is generated by GPT.
- `note` is my markdown notes.
- `baseline` is the baseline of this sub-project. It reaches an average level, using Logistic Regression.
- `upper` is the upper project, using TF-IDF to classify the contents.
- `bert` is another solution using the BERT model; it was the best model up to that point.
- `chatGLM_api` is a failed project, but it's not meaningless. For one thing, the LLM performs well at classifying; for another, using the API is not a good idea. From my point of view, the solution is to build a training set and fine-tune the LLM on a GPU.
- `ernie` performs best. It uses the Ernie model and the Paddle environment, and the project runs on AI Studio. Set epochs=100 and run all cells.
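As a rough illustration of the Logistic Regression baseline for this detector, the sketch below classifies text as human-written or GPT-generated. The sample sentences and labels are invented for illustration (the real project trains on a labelled dataset), and the choice of character n-gram features is my own assumption, made because they pick up stylistic cues such as punctuation and casing.

```python
# Hypothetical sketch of a Logistic Regression detector for
# GPT-generated text. The texts and labels are invented; the real
# project trains on a labelled dataset.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

human = [
    "honestly i dunno, the movie was kinda meh but the popcorn slapped",
    "cant believe we lost again... refs were blind tonight",
]
generated = [
    "Certainly! Here is a concise summary of the key points discussed.",
    "As an AI language model, I can provide an overview of the topic.",
]
texts = (human + generated) * 5  # repeat so the toy model has data to fit
labels = ([0] * len(human) + [1] * len(generated)) * 5  # 1 = GPT-generated

# Character n-grams capture stylistic cues (punctuation, casing) that
# word features can miss; this feature choice is an assumption, not
# necessarily what the repo's baseline uses.
detector = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
detector.fit(texts, labels)

print(detector.predict(["Certainly! Here is a detailed explanation."]))
```

The `bert` and `ernie` sub-projects replace these hand-crafted features with pretrained transformer encoders, which is why they score higher.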
To be continued...