burhanultayyab / detectgpt Goto Github PK

View Code? Open in Web Editor NEW

173.0 173.0 43.0 37 KB

Pytorch implementation of DetectGPT (https://arxiv.org/pdf/2301.11305v1.pdf)

Home Page: https://gptzero.sg

License: MIT License

Python 98.86% Dockerfile 1.14%

detectgpt's Issues

GPTZero PPL - ValueError: cannot convert float NaN to integer

@BurhanUlTayyab Thanks for sharing the implementation. When running GPTZero code, I get the following error:

[/content/DetectGPT/model.py](https://localhost:8080/#) in getPPL_1(self, sentence)
    374             if end_loc == seq_len:
    375                 break
--> 376         ppl = int(torch.exp(torch.stack(nlls).sum() / end_loc))
    377         return ppl
    378 

**ValueError: cannot convert float NaN to integer**

The code I use to test GPTZero is:

  import pandas as pd
  from model import GPT2PPLV2
  import torch
  
  model = GPT2PPLV2()
  
  res_texts = []
  max_tokens = 512
  
  filtered_list = [text for text in mylist if len(text.split()) >= 100]  # Remove texts with less than 100 words

  for text in filtered_list:
      input_text = text[:max_tokens]
      result = model(input_text, 300, "v1")
      res_texts.append(result)

I have pre-processed the input text to handle NaN values or empty lines as shown below, however I still get this error when trying to run GPTZero model.

df['text'] = df['text'].fillna('')
df['text'] = df['text'].apply(lambda x: re.sub(r'\n\s*\n', '\n', x.strip()) if isinstance(x, str) else np.nan)
df['text'] = df['text'].apply(lambda x: x.strip().replace('\n\n', '\n') if isinstance(x, str) else '')
new_df = df.dropna(subset=['text'])

Can you please change the model.py code to handle NaN or provide a workaround to "skip" any line containing NaN when running the model?

Thanks in advance.

Bug in chooseBestFittingText?

When run the code I notice that the return_text is scrambled. I believe this is due to a bug in the search pattern you use in re.finditer.

Current:
mask_indices = list(re.finditer("[MASK]", mask_text))

Proposed:
mask_indices = list(re.finditer("[MASK]", mask_text))

The current implementation gives me the position of the letters M, A, S and K (so span is always =1), but you want to know the position of the full string (including the opening and closing brackets).

I also notice there is an issue with the offset that I need to adjust for (removing the -1 when setting the start position and making adjustments to the offset in each loop).

After the corrections the function seems to work as I understand the intention.

运行报错

执行 python local_infer.py
Please enter your sentence: (Press Enter twice to start processing)
输入 hello word.
报错
raise PipelineException(
transformers.pipelines.base.PipelineException: No mask_token ([MASK]) found on the input

How to get Z score from DetectGPT?

Hi there,
Thank you for the open-source implementation of DetectGPT, it's pretty useful.
I would like to know how does one get the z-score from the detector? rather than just a percentage/probability if the text is AI/Human-generated.

Appreciate any leads on this.

Incoherent Result Labeling between GPTZero.sg and Github Code

When inputting the same text (generated by ChatGPT) into both GPTZero.sg's website and the GPT2PPL model on Github, the confidence levels were similar at around 50%, but the labeling results were different.

On GPTZero.sg, the label was 0, indicating that "this text is most likely generated by an A.I.", while using the code from the Github repository gave a label of 1, indicating that "this text is most likely written by a human".

Example input text:
"Artificial Intelligence (AI) is the use of computers to perform tasks that would normally require human intelligence such as reasoning, perception, prediction, and planning. The ultimate goal of AI research is to create systems with human-like general intelligence, which remains a major challenge in the field. Researchers use a variety of methodologies and techniques such as heuristics, planning, mathematical simplification, and knowledge representation to achieve this goal. The current research focuses on using AI to recognize and respond to human emotions, rather than on creating AI systems that can experience emotions themselves. The backpropagation algorithm is a popular method used to train Artificial Neural Networks (ANNs) which is based on the principle of "backpropagation" of errors and it is found to be effective in training deep neural networks."

Additionally, it would be helpful if the threshold could be set as an input argument of the model so that users can customize it.

Thank you.

Are there any plans to support Chinese in the future? Can I make modifications myself? If so, please give me some suggestions. Thank you

Low performance

I generated 500 pieces of text using llama2, but could only identify 106 pieces as AI

burhanultayyab / detectgpt Goto Github PK

detectgpt's Issues

GPTZero PPL - ValueError: cannot convert float NaN to integer

Bug in chooseBestFittingText?

运行报错

How to get Z score from DetectGPT?

Incoherent Result Labeling between GPTZero.sg and Github Code

Are there any plans to support Chinese in the future? Can I make modifications myself? If so, please give me some suggestions. Thank you

Low performance

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent