burhanultayyab / detectgpt


PyTorch implementation of DetectGPT (https://arxiv.org/pdf/2301.11305v1.pdf)

Home Page: https://gptzero.sg

License: MIT License

Python 98.86% Dockerfile 1.14%


detectgpt's Issues

Error when running

Running python local_infer.py
Please enter your sentence: (Press Enter twice to start processing)
Input: hello word.
Error:
raise PipelineException(
transformers.pipelines.base.PipelineException: No mask_token ([MASK]) found on the input
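
The fill-mask pipeline in transformers raises this exception when the text it receives contains no mask token. A minimal guard is sketched below. The helper name `has_mask_token` is hypothetical, and the idea that local_infer.py passes unmasked text for very short inputs is an assumption rather than something confirmed in this thread.

```python
def has_mask_token(text, mask_token="[MASK]"):
    """Return True if the text contains at least one mask token."""
    return mask_token in text


# Hypothetical usage: only call the fill-mask step when a mask is present.
candidates = ["hello word.", "hello [MASK]."]
fillable = [text for text in candidates if has_mask_token(text)]
print(fillable)
```

Inputs that fail the check could be skipped or reported to the user instead of being passed to the pipeline.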

Are there any plans to support Chinese in the future? Can I make modifications myself? If so, please give me some suggestions. Thank you

Bug in chooseBestFittingText?

When I run the code, I notice that return_text is scrambled. I believe this is due to a bug in the search pattern passed to re.finditer.

Current:
mask_indices = list(re.finditer("[MASK]", mask_text))

Proposed:
mask_indices = list(re.finditer(r"\[MASK\]", mask_text))

The current pattern is read as a character class, so it returns the positions of the individual letters M, A, S and K (every match has a span of length 1). What is needed is the position of the full string, including the opening and closing brackets.

I also noticed an issue with the offset that needs adjusting: the -1 has to be removed when setting the start position, and the offset has to be updated on each loop iteration.

After the corrections the function seems to work as I understand the intention.
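
A small sketch of the difference between the two patterns, using a made-up sentence. The helper `fill_masks` is hypothetical and is not the repository's chooseBestFittingText; it only illustrates one way to avoid offset bookkeeping, by replacing matches from right to left.

```python
import re

mask_text = "The [MASK] sat on the [MASK]."

# Unescaped: "[MASK]" is a character class matching one of M, A, S, K.
char_class_spans = [m.span() for m in re.finditer("[MASK]", mask_text)]

# Escaped: matches the literal six-character token "[MASK]".
literal_spans = [m.span() for m in re.finditer(r"\[MASK\]", mask_text)]


def fill_masks(text, fills):
    """Replace each [MASK] with the corresponding fill (hypothetical helper)."""
    spans = [m.span() for m in re.finditer(r"\[MASK\]", text)]
    if len(spans) != len(fills):
        raise ValueError("number of fills must match number of masks")
    # Work right to left so earlier spans stay valid after each replacement.
    for (start, end), word in zip(reversed(spans), reversed(fills)):
        text = text[:start] + word + text[end:]
    return text


print(char_class_spans)
print(literal_spans)
print(fill_masks(mask_text, ["cat", "mat"]))
```

Replacing from the last match backwards means no running offset is needed, because edits never shift the positions of masks that are still to be filled.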

Low performance

I generated 500 pieces of text using llama2, but the detector identified only 106 of them (21.2%) as AI-generated.

Incoherent Result Labeling between GPTZero.sg and Github Code

When inputting the same text (generated by ChatGPT) into both GPTZero.sg's website and the GPT2PPL model on Github, the confidence levels were similar at around 50%, but the labeling results were different.

On GPTZero.sg, the label was 0, indicating that "this text is most likely generated by an A.I.", while using the code from the Github repository gave a label of 1, indicating that "this text is most likely written by a human".

Example input text:
"Artificial Intelligence (AI) is the use of computers to perform tasks that would normally require human intelligence such as reasoning, perception, prediction, and planning. The ultimate goal of AI research is to create systems with human-like general intelligence, which remains a major challenge in the field. Researchers use a variety of methodologies and techniques such as heuristics, planning, mathematical simplification, and knowledge representation to achieve this goal. The current research focuses on using AI to recognize and respond to human emotions, rather than on creating AI systems that can experience emotions themselves. The backpropagation algorithm is a popular method used to train Artificial Neural Networks (ANNs) which is based on the principle of "backpropagation" of errors and it is found to be effective in training deep neural networks."

Additionally, it would be helpful if the threshold could be set as an input argument of the model so that users can customize it.

Thank you.
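
A sketch of what a user-configurable threshold could look like. The function `label_text`, its `ai_probability` argument, and the default of 0.5 are hypothetical; only the label convention (0 for AI, 1 for human) comes from the report above, and the actual quantity thresholded in the repository may differ.

```python
def label_text(ai_probability, threshold=0.5):
    """Return 0 (likely AI) if ai_probability >= threshold, else 1 (likely human)."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0 and 1")
    return 0 if ai_probability >= threshold else 1


# Scores near the threshold flip labels easily, which is consistent with
# two implementations disagreeing at around 50% confidence.
print(label_text(0.51))
print(label_text(0.49))
print(label_text(0.49, threshold=0.4))
```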

GPTZero PPL - ValueError: cannot convert float NaN to integer

@BurhanUlTayyab Thanks for sharing the implementation. When running GPTZero code, I get the following error:

/content/DetectGPT/model.py in getPPL_1(self, sentence)
    374             if end_loc == seq_len:
    375                 break
--> 376         ppl = int(torch.exp(torch.stack(nlls).sum() / end_loc))
    377         return ppl
    378 

**ValueError: cannot convert float NaN to integer**

The code I use to test GPTZero is:

  import pandas as pd
  from model import GPT2PPLV2
  import torch
  
  model = GPT2PPLV2()
  
  res_texts = []
  max_tokens = 512
  
  # mylist holds the input strings; keep only texts with at least 100 words
  filtered_list = [text for text in mylist if len(text.split()) >= 100]

  for text in filtered_list:
      # Note: this slice keeps the first 512 characters, not 512 tokens
      input_text = text[:max_tokens]
      result = model(input_text, 300, "v1")
      res_texts.append(result)

I have pre-processed the input text to handle NaN values and empty lines as shown below, but I still get this error when running the GPTZero model.

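import re
import numpy as np

df['text'] = df['text'].fillna('')  # replace missing values with empty strings
df['text'] = df['text'].apply(lambda x: re.sub(r'\n\s*\n', '\n', x.strip()) if isinstance(x, str) else np.nan)  # collapse blank lines
df['text'] = df['text'].apply(lambda x: x.strip().replace('\n\n', '\n') if isinstance(x, str) else '')
new_df = df.dropna(subset=['text'])  # empty strings are not NaN, so they are kept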

Can you please change the model.py code to handle NaN or provide a workaround to "skip" any line containing NaN when running the model?

Thanks in advance.
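
Until model.py handles this case, one possible workaround is to catch the error per text and skip the offending input. The sketch below is an assumption-laden illustration: `score_texts` and `fake_score` are hypothetical names, and `fake_score` only stands in for the real model call so the example runs without the repository. It relies on the fact that Python's `int()` raises ValueError when given a NaN float, which is the error shown in the traceback.

```python
def score_texts(texts, score_fn):
    """Score each text, skipping empty inputs and inputs that raise ValueError."""
    results = []
    skipped = []
    for index, text in enumerate(texts):
        if not isinstance(text, str) or not text.strip():
            skipped.append(index)
            continue
        try:
            results.append((index, score_fn(text)))
        except ValueError:
            # e.g. "cannot convert float NaN to integer"
            skipped.append(index)
    return results, skipped


def fake_score(text):
    # Stand-in for the model call: produces NaN for one input, then
    # converts to int the same way the traceback line does.
    value = float("nan") if text == "bad" else float(len(text))
    return int(value)


results, skipped = score_texts(["hello world", "", "bad", None], fake_score)
print(results)
print(skipped)
```

With the real model, `score_fn` could be something like `lambda t: model(t, 300, "v1")`, and the `skipped` indices show which texts need a closer look.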

How to get Z score from DetectGPT?

Hi there,
Thank you for the open-source implementation of DetectGPT, it's pretty useful.
I would like to know how to get the z-score from the detector, rather than just a percentage or probability that the text is AI- or human-generated.

Appreciate any leads on this.
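
For reference, the DetectGPT paper normalizes the perturbation discrepancy by the standard deviation of the perturbed log-likelihoods, which gives a z-score. The sketch below computes that quantity from log-likelihoods that are assumed to be already available. The function name `detectgpt_z_score`, the use of the population standard deviation, and the example numbers are assumptions; whether this repository exposes these values is not stated in the thread.

```python
import statistics


def detectgpt_z_score(original_logp, perturbed_logps):
    """z = (log p(x) - mean(log p(perturbed))) / std(log p(perturbed))."""
    if len(perturbed_logps) < 2:
        raise ValueError("need at least two perturbed samples")
    mean = statistics.fmean(perturbed_logps)
    std = statistics.pstdev(perturbed_logps)  # population std (assumption)
    if std == 0:
        raise ValueError("perturbed log-likelihoods have zero variance")
    return (original_logp - mean) / std


# Made-up log-likelihoods for the original text and four perturbations.
z = detectgpt_z_score(-2.0, [-3.0, -3.5, -2.5, -3.0])
print(round(z, 3))
```

A larger positive z means the original text is more likely under the model than its perturbed variants, which the paper treats as evidence of machine generation.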
