burhanultayyab / detectgpt
PyTorch implementation of DetectGPT (https://arxiv.org/pdf/2301.11305v1.pdf)
Home Page: https://gptzero.sg
License: MIT License
@BurhanUlTayyab Thanks for sharing the implementation. When running GPTZero code, I get the following error:
[/content/DetectGPT/model.py](https://localhost:8080/#) in getPPL_1(self, sentence)
374 if end_loc == seq_len:
375 break
--> 376 ppl = int(torch.exp(torch.stack(nlls).sum() / end_loc))
377 return ppl
378
**ValueError: cannot convert float NaN to integer**
The code I use to test GPTZero is:
    import pandas as pd
    from model import GPT2PPLV2
    import torch

    model = GPT2PPLV2()
    res_texts = []
    max_tokens = 512
    filtered_list = [text for text in mylist if len(text.split()) >= 100]  # Remove texts with less than 100 words
    for text in filtered_list:
        input_text = text[:max_tokens]
        result = model(input_text, 300, "v1")
        res_texts.append(result)
I have pre-processed the input text to handle NaN values and empty lines as shown below, but I still get this error when running the GPTZero model.
    df['text'] = df['text'].fillna('')
    df['text'] = df['text'].apply(lambda x: re.sub(r'\n\s*\n', '\n', x.strip()) if isinstance(x, str) else np.nan)
    df['text'] = df['text'].apply(lambda x: x.strip().replace('\n\n', '\n') if isinstance(x, str) else '')
    new_df = df.dropna(subset=['text'])
Can you please change the model.py code to handle NaN or provide a workaround to "skip" any line containing NaN when running the model?
Thanks in advance.
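One possible workaround, without touching model.py, is to wrap the call and skip any text that raises the ValueError shown in the traceback. This is only a sketch: `score_texts` and `fake_detector` are hypothetical names introduced here, and `fake_detector` is a stand-in that mimics the failure so the example runs without the real model.

```python
def score_texts(detector, texts, *args):
    """Call detector(text, *args) per text; skip empty inputs and ValueErrors."""
    results = []   # (index, detector output) for texts that were scored
    skipped = []   # (index, reason) for texts that were skipped
    for index, text in enumerate(texts):
        if not isinstance(text, str) or not text.strip():
            skipped.append((index, "empty or non-string input"))
            continue
        try:
            results.append((index, detector(text, *args)))
        except ValueError as err:
            # e.g. "cannot convert float NaN to integer"
            skipped.append((index, str(err)))
    return results, skipped


def fake_detector(text, *args):
    # Stand-in for the real model (assumption): fails like the traceback
    # on one input, otherwise returns a dummy score.
    if "bad" in text:
        return int(float("nan"))  # raises ValueError
    return len(text.split())


results, skipped = score_texts(
    fake_detector, ["good text here", "", "bad text", None], 300, "v1"
)
```

With the real model, the call would presumably look like `score_texts(model, filtered_list, 300, "v1")`, matching the arguments used in the loop above. This hides the NaN rather than explaining it, so the skipped texts are still worth inspecting.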
When I run the code, I notice that return_text is scrambled. I believe this is due to a bug in the search pattern passed to re.finditer.
Current:
mask_indices = list(re.finditer("[MASK]", mask_text))
Proposed:
mask_indices = list(re.finditer(r"\[MASK\]", mask_text))
The current pattern is read as a character class, so it returns the positions of the individual letters M, A, S and K (each match spans a single character), whereas what is needed is the position of the full string, including the opening and closing brackets.
I also notice there is an issue with the offset that I need to adjust for (removing the -1 when setting the start position and making adjustments to the offset in each loop).
After the corrections the function seems to work as I understand the intention.
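The difference between the two patterns can be checked in isolation. The sketch below uses only the standard `re` module. `fill_masks` is a hypothetical helper, not the repository's function. It replaces matches from right to left, which is one way to avoid tracking an offset as the string length changes.

```python
import re

mask_text = "The [MASK] sat on the [MASK]."

# Unescaped: a character class matching any single one of M, A, S, K.
char_class_spans = [m.span() for m in re.finditer("[MASK]", mask_text)]

# Escaped: matches the literal six-character token "[MASK]".
literal_spans = [m.span() for m in re.finditer(r"\[MASK\]", mask_text)]


def fill_masks(text, fills):
    """Replace each literal [MASK] in text with the corresponding fill."""
    matches = list(re.finditer(r"\[MASK\]", text))
    if len(matches) != len(fills):
        raise ValueError("number of fills must equal number of [MASK] tokens")
    # Work backwards so earlier match positions stay valid after each edit.
    for match, fill in zip(reversed(matches), reversed(fills)):
        text = text[:match.start()] + fill + text[match.end():]
    return text


filled = fill_masks(mask_text, ["cat", "mat"])
```

Here `literal_spans` is `[(4, 10), (22, 28)]`, while `char_class_spans` holds eight one-character spans, and `filled` is `"The cat sat on the mat."`.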
Run python local_infer.py
Please enter your sentence: (Press Enter twice to start processing)
Input: hello word.
Error:
raise PipelineException(
transformers.pipelines.base.PipelineException: No mask_token ([MASK]) found on the input
Hi there,
Thank you for the open-source implementation of DetectGPT, it's pretty useful.
I would like to know how to get the z-score from the detector, rather than just a percentage or probability that the text is AI- or human-generated.
Appreciate any leads on this.
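For reference, the DetectGPT paper defines a normalized perturbation discrepancy: the log-probability of the original text minus the mean log-probability of its perturbed versions, divided by the standard deviation of the perturbed log-probabilities. A minimal sketch of that arithmetic follows. It assumes you can already obtain those log-probabilities; whether and where this repository exposes them is not confirmed here, and `z_score` is a hypothetical name.

```python
import statistics


def z_score(original_ll, perturbed_lls):
    """Normalized perturbation discrepancy from log-likelihoods.

    original_ll: log-probability of the original text under the scoring model.
    perturbed_lls: log-probabilities of the perturbed versions (at least two).
    """
    mean_perturbed = statistics.mean(perturbed_lls)
    std_perturbed = statistics.stdev(perturbed_lls)  # sample std (k - 1)
    if std_perturbed == 0:
        raise ValueError("perturbed log-likelihoods have zero variance")
    return (original_ll - mean_perturbed) / std_perturbed


# Made-up numbers purely to illustrate the calculation.
example = z_score(-2.0, [-3.0, -3.5, -2.5])
```

With these illustrative values the mean is -3.0 and the sample standard deviation is 0.5, so `example` is 2.0. In the paper, a larger value indicates the text is more likely to be machine-generated.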
When inputting the same text (generated by ChatGPT) into both GPTZero.sg's website and the GPT2PPL model on GitHub, the confidence levels were similar at around 50%, but the labeling results were different.
On GPTZero.sg, the label was 0, indicating that "this text is most likely generated by an A.I.", while using the code from the GitHub repository gave a label of 1, indicating that "this text is most likely written by a human".
Example input text:
"Artificial Intelligence (AI) is the use of computers to perform tasks that would normally require human intelligence such as reasoning, perception, prediction, and planning. The ultimate goal of AI research is to create systems with human-like general intelligence, which remains a major challenge in the field. Researchers use a variety of methodologies and techniques such as heuristics, planning, mathematical simplification, and knowledge representation to achieve this goal. The current research focuses on using AI to recognize and respond to human emotions, rather than on creating AI systems that can experience emotions themselves. The backpropagation algorithm is a popular method used to train Artificial Neural Networks (ANNs) which is based on the principle of "backpropagation" of errors and it is found to be effective in training deep neural networks."
Additionally, it would be helpful if the threshold could be set as an input argument of the model so that users can customize it.
Thank you.
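A caller-supplied threshold could look roughly like the sketch below. It is not the repository's code: `label_from_score` is a hypothetical helper, and the direction of the comparison (a higher score means more likely AI) is an assumption that would need to be checked against the actual score the model produces. The label values follow the convention described above (0 for AI, 1 for human).

```python
AI_LABEL = 0      # "most likely generated by an A.I."
HUMAN_LABEL = 1   # "most likely written by a human"


def label_from_score(score, threshold):
    """Map a detector score to a label using a caller-chosen threshold.

    Assumption: higher scores mean the text is more likely AI-generated.
    """
    return AI_LABEL if score >= threshold else HUMAN_LABEL


# The same score can flip label depending on the threshold chosen.
strict = label_from_score(0.5, threshold=0.4)
lenient = label_from_score(0.5, threshold=0.6)
```

Scores near the threshold, such as the roughly 50% confidence reported above, are exactly where two deployments with slightly different cut-offs would disagree.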
I generated 500 pieces of text using Llama 2, but the detector identified only 106 of them (about 21%) as AI-generated.