detectgpt's People

Contributors

burhanultayyab, ncwork

detectgpt's Issues

Runtime error

Running python local_infer.py
Please enter your sentence: (Press Enter twice to start processing)
Input: hello word.
Error:
raise PipelineException(
transformers.pipelines.base.PipelineException: No mask_token ([MASK]) found on the input

Bug in chooseBestFittingText?

When I run the code, I notice that return_text is scrambled. I believe this is due to a bug in the search pattern passed to re.finditer.

Current:
mask_indices = list(re.finditer("[MASK]", mask_text))

Proposed:
mask_indices = list(re.finditer(r"\[MASK\]", mask_text))

The current pattern is read as a character class, so it returns the positions of the individual letters M, A, S and K (the span length is always 1), whereas the intent is to find the position of the full string "[MASK]", including the opening and closing brackets.

I also noticed an issue with the offset that needs adjusting: remove the -1 when setting the start position, and update the offset on each loop iteration.

After these corrections, the function seems to work as I understand it is intended to.
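The difference between the two patterns can be checked in isolation. The following is a minimal sketch, not the repository's chooseBestFittingText: the helper names find_mask_spans and fill_masks are hypothetical, and filling from right to left is one possible way to avoid tracking an offset at all.

```python
import re

MASK_PATTERN = re.compile(r"\[MASK\]")  # escaped brackets match the literal token


def find_mask_spans(mask_text):
    """Return the (start, end) span of every literal "[MASK]" in the text."""
    return [m.span() for m in MASK_PATTERN.finditer(mask_text)]


def fill_masks(mask_text, replacements):
    """Replace each "[MASK]" with the replacement at the same position in the list.

    Hypothetical helper: it fills from right to left so that earlier spans
    stay valid and no running offset is needed.
    """
    spans = find_mask_spans(mask_text)
    if len(spans) != len(replacements):
        raise ValueError("number of replacements must match number of masks")
    out = mask_text
    for (start, end), word in reversed(list(zip(spans, replacements))):
        out = out[:start] + word + out[end:]
    return out


if __name__ == "__main__":
    text = "The [MASK] sat on the [MASK]."
    # Unescaped pattern: a character class matching single letters M, A, S, K
    print([m.span() for m in re.finditer("[MASK]", text)])
    # Escaped pattern: one match per mask token
    print(find_mask_spans(text))
    print(fill_masks(text, ["cat", "mat"]))
```

With the unescaped pattern every match covers a single character, which is consistent with the scrambled output described above.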

Low performance

I generated 500 pieces of text using llama2, but only 106 of them (about 21%) were identified as AI-generated.

Inconsistent Result Labeling between GPTZero.sg and GitHub Code

When the same text (generated by ChatGPT) was entered into both the GPTZero.sg website and the GPT2PPL model from GitHub, the confidence levels were similar at around 50%, but the resulting labels were different.

On GPTZero.sg, the label was 0, indicating that "this text is most likely generated by an A.I.", while the code from the GitHub repository gave a label of 1, indicating that "this text is most likely written by a human".

Example input text:
"Artificial Intelligence (AI) is the use of computers to perform tasks that would normally require human intelligence such as reasoning, perception, prediction, and planning. The ultimate goal of AI research is to create systems with human-like general intelligence, which remains a major challenge in the field. Researchers use a variety of methodologies and techniques such as heuristics, planning, mathematical simplification, and knowledge representation to achieve this goal. The current research focuses on using AI to recognize and respond to human emotions, rather than on creating AI systems that can experience emotions themselves. The backpropagation algorithm is a popular method used to train Artificial Neural Networks (ANNs) which is based on the principle of "backpropagation" of errors and it is found to be effective in training deep neural networks."

Additionally, it would be helpful if the threshold could be set as an input argument of the model so that users can customize it.

Thank you.
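The threshold request above could look roughly like the sketch below. This is not the repository's code: label_text is a hypothetical function, the assumption that lower perplexity points to AI-generated text is the usual premise of perplexity-based detectors rather than something confirmed for this model, and the threshold is left to the caller because the value the repository uses is not given here. The label convention (0 for AI, 1 for human) follows the issue text.

```python
def label_text(perplexity, *, threshold):
    """Map a perplexity score to a (label, message) pair.

    Assumption: scores below the threshold are treated as AI-generated.
    Labels follow the issue: 0 = AI, 1 = human.
    """
    if perplexity < threshold:
        return 0, "this text is most likely generated by an A.I."
    return 1, "this text is most likely written by a human"


if __name__ == "__main__":
    # The threshold here is an arbitrary illustration, not a recommended value.
    print(label_text(30.0, threshold=60.0))
    print(label_text(90.0, threshold=60.0))
```

Exposing the threshold as a keyword argument makes it easy to check whether two deployments disagree only because they use different cut-offs.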

GPTZero PPL - ValueError: cannot convert float NaN to integer

@BurhanUlTayyab Thanks for sharing the implementation. When running GPTZero code, I get the following error:

/content/DetectGPT/model.py in getPPL_1(self, sentence)
    374             if end_loc == seq_len:
    375                 break
--> 376         ppl = int(torch.exp(torch.stack(nlls).sum() / end_loc))
    377         return ppl
    378 

**ValueError: cannot convert float NaN to integer**

The code I use to test GPTZero is:

  import pandas as pd
  from model import GPT2PPLV2
  import torch
  
  model = GPT2PPLV2()
  
  res_texts = []
  max_tokens = 512
  
  # mylist holds the input texts; drop any with fewer than 100 words
  filtered_list = [text for text in mylist if len(text.split()) >= 100]

  for text in filtered_list:
      # Note: this slice keeps the first 512 characters, not 512 tokens
      input_text = text[:max_tokens]
      result = model(input_text, 300, "v1")
      res_texts.append(result)

I have pre-processed the input text to handle NaN values and empty lines as shown below, but I still get this error when running the GPTZero model.

import re
import numpy as np

df['text'] = df['text'].fillna('')
df['text'] = df['text'].apply(lambda x: re.sub(r'\n\s*\n', '\n', x.strip()) if isinstance(x, str) else np.nan)
df['text'] = df['text'].apply(lambda x: x.strip().replace('\n\n', '\n') if isinstance(x, str) else '')
# Note: after the line above every value is a string, so dropna() removes nothing and empty strings remain
new_df = df.dropna(subset=['text'])

Can you please change the model.py code to handle NaN or provide a workaround to "skip" any line containing NaN when running the model?

Thanks in advance.
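Until model.py handles this case itself, one possible workaround is to skip unusable inputs and catch the conversion error around each call. The sketch below is an assumption-laden illustration: score_texts and score_fn are hypothetical names, score_fn stands in for a call such as the model invocation in the report, and the sketch does not explain why the perplexity becomes NaN in the first place.

```python
def score_texts(texts, score_fn):
    """Score each usable text, recording None where scoring fails.

    Non-string and empty inputs are skipped. A ValueError from score_fn
    (for example "cannot convert float NaN to integer") is recorded as None
    so that one bad text does not stop the whole run.
    """
    results = []
    for text in texts:
        if not isinstance(text, str) or not text.strip():
            continue  # skip NaN placeholders, None, and blank strings
        try:
            results.append((text, score_fn(text)))
        except ValueError:
            results.append((text, None))
    return results


if __name__ == "__main__":
    def fake_score(text):
        # Stand-in scorer: reproduces the reported error for one input
        if text == "bad":
            return int(float("nan"))
        return len(text)

    print(score_texts(["good", "", None, "bad"], fake_score))
```

Keeping the text next to its result makes it possible to inspect afterwards which inputs produced NaN.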

How to get Z score from DetectGPT?

Hi there,
Thank you for the open-source implementation of DetectGPT; it's pretty useful.
I would like to know how to get the z-score from the detector, rather than just a percentage/probability that the text is AI- or human-generated.

Appreciate any leads on this.
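For orientation, the z-score in DetectGPT is the normalized perturbation discrepancy: the log-likelihood of the original text minus the mean log-likelihood of its perturbed versions, divided by the standard deviation of the perturbed log-likelihoods. The sketch below computes that quantity from values you already have. It is not this repository's API: z_score is a hypothetical function, the log-likelihoods are assumed to come from whatever scoring model is in use, and the choice of the sample standard deviation is an assumption.

```python
import statistics


def z_score(original_ll, perturbed_lls):
    """Normalized perturbation discrepancy.

    original_ll:   log-likelihood of the original text
    perturbed_lls: log-likelihoods of the perturbed versions of that text
    """
    if len(perturbed_lls) < 2:
        raise ValueError("need at least two perturbed log-likelihoods")
    mean_ll = statistics.fmean(perturbed_lls)
    std_ll = statistics.stdev(perturbed_lls)  # sample standard deviation (assumption)
    if std_ll == 0:
        raise ValueError("perturbed log-likelihoods have zero variance")
    return (original_ll - mean_ll) / std_ll


if __name__ == "__main__":
    # Illustrative numbers only, not real model output
    print(z_score(-2.0, [-3.0, -4.0, -5.0]))
```

A larger positive value means the original text is much more likely under the model than its perturbations.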
