
sobotics / guttenberg


A bot, searching for plagiarism on Stack Overflow.

Home Page: http://stackapps.com/q/7197/43403

License: GNU General Public License v3.0

Java 100.00%
bot plagiarism-prevention stack-overflow

guttenberg's People

Contributors

artofcode-, bhargav-rao, felixsfd, jdd-software, mjpieters, mottykohn, taur1ne

guttenberg's Issues

Mysterious crashes (around midnight?)

The thread that searches for duplicates sometimes crashes without an exception. Sample from last night:

 INFO [Guttenberg:137] 2017-01-31 00:02:40,141 - Add the answers to the PlagFin$
 INFO [Guttenberg:152] 2017-01-31 00:02:40,150 - Find the duplicates...
 INFO [Guttenberg:175] 2017-01-31 00:02:43,288 - Finished at - 2017-01-30T23:02$
 INFO [Guttenberg:104] 2017-01-31 00:03:37,368 - Executing at - 2017-01-30T23:0$
 INFO [NewAnswersFinder:53] 2017-01-31 00:03:37,811 - findRecentAnswers() done $
 INFO [RelatedAnswersFinder:53] 2017-01-31 00:03:37,814 - Fetch the linked/rela$
 INFO [RelatedAnswersFinder:57] 2017-01-31 00:03:38,077 - Related done
 INFO [RelatedAnswersFinder:59] 2017-01-31 00:03:38,323 - linked done
 INFO [RelatedAnswersFinder:87] 2017-01-31 00:03:39,444 - Collected 100 answers
 INFO [Guttenberg:137] 2017-01-31 00:03:39,446 - Add the answers to the PlagFin$
 INFO [Guttenberg:152] 2017-01-31 00:03:39,454 - Find the duplicates...

I think the previous crash also happened around midnight (GMT+1). I'll check whether this happens again.

Fix the room-joining

With #76, I introduced a new way to determine whether the bot runs locally or on the server, but I didn't use it to decide which rooms to join.

Show the search-engine's quota

At the moment, the quota command only shows the quota on SE. If a search engine returns its remaining quota, we should show that as well.

Allow feedback

  • tp: If the post is plagiarism
  • fp: If both posts don't have anything to do with each other

This will need a web-dashboard like Sentinel

Status command

Should return

  • Guttenberg's version
  • the number of checked and caught posts
  • running since xxxxxxx

Guttenberg description should link to the StackApps post

At present Guttenberg reports posts with a description that links to the GitHub repository. This should link to the Stack Apps post instead.

[ [Guttenberg](https://git.io/vMrPa) ] [Possible plagiarism](https://stackoverflow.com/a/41975334) with a score of **0.77**. [Original post](https://stackoverflow.com/a/8549223)

Should be

[ [Guttenberg](//stackapps.com/q/7197/43403) ] [Possible plagiarism](https://stackoverflow.com/a/41975334) with a score of **0.77**. [Original post](https://stackoverflow.com/a/8549223)
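A minimal sketch of how the report line could be built so that the link target lives in one place. The class and method names are hypothetical and not taken from Guttenberg's code base; only the message layout and the Stack Apps URL come from the example above.

```java
import java.util.Locale;

/** Hypothetical helper; Guttenberg's real report code may look different. */
final class ReportFormatter {
    private ReportFormatter() {}

    // Protocol-relative Stack Apps link, as requested in this issue.
    static final String STACKAPPS_LINK = "//stackapps.com/q/7197/43403";

    /** Builds the chat message for one possible plagiarism report. */
    static String format(String copyUrl, String originalUrl, double score) {
        // Locale.ROOT keeps the decimal separator a dot on every machine.
        return String.format(Locale.ROOT,
                "[ [Guttenberg](%s) ] [Possible plagiarism](%s) with a score of **%.2f**. [Original post](%s)",
                STACKAPPS_LINK, copyUrl, score, originalUrl);
    }
}
```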

Crash when request fails

Backoffs and other errors are not handled correctly. Execution stops entirely instead of skipping only the affected cycle.
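One thing worth ruling out, assuming the cycle is scheduled with a ScheduledExecutorService: scheduleAtFixedRate silently suppresses all further executions once a single run throws, which would look like a crash without an exception. The sketch below wraps a cycle so that a failure only skips that run. SafeCycle and FAILED_CYCLES are hypothetical names, and System.err stands in for a real logger.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical helper; not part of Guttenberg's code base. */
final class SafeCycle {
    private SafeCycle() {}

    /** Number of cycles that failed and were skipped. */
    static final AtomicInteger FAILED_CYCLES = new AtomicInteger();

    /**
     * Wraps a task so that nothing it throws escapes to the scheduler.
     * Throwable is caught (not just Exception) so that Errors are at
     * least reported instead of ending the schedule silently.
     */
    static Runnable wrap(Runnable cycle) {
        return () -> {
            try {
                cycle.run();
            } catch (Throwable t) {
                FAILED_CYCLES.incrementAndGet();
                System.err.println("Cycle failed, skipping it: " + t);
            }
        };
    }
}
```

It could be used as executor.scheduleAtFixedRate(SafeCycle.wrap(task), 0, 1, TimeUnit.MINUTES), where task is whatever Runnable performs one search cycle.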

Check date of posts

When Guttenberg checks new answers, the date is not a problem: every post it finds is older than the target. With the check command, however, we could accidentally flag the original post instead of the copy.
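A minimal sketch of such a date check, assuming each post carries its creation date. DatedPost is a hypothetical stand-in for the bot's own post class.

```java
import java.time.Instant;

/** Hypothetical minimal post model: only the ID and the creation date. */
final class DatedPost {
    final long id;
    final Instant createdAt;

    DatedPost(long id, Instant createdAt) {
        this.id = id;
        this.createdAt = createdAt;
    }

    /**
     * Returns true only if the candidate was created strictly before the
     * target, so the target may be a copy of it and not the other way round.
     */
    static boolean canBeOriginalOf(DatedPost candidate, DatedPost target) {
        return candidate.createdAt.isBefore(target.createdAt);
    }
}
```

The check command could then drop (or swap) every pair for which canBeOriginalOf returns false before reporting.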

Load more than 100 related answers

At the moment, only the first page of /answers/{ids} is loaded. This contains 100 results.

Since Guttenberg seems to run stably now, we have to check how much quota it uses (it looks like 4 requests per minute). If we have enough quota left, we could load the next page (or more) as well. This could help us find many more posts.
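A sketch of quota-aware paging. The Stack Exchange API wrapper reports has_more and quota_remaining with every response; everything else here (PagedLoader, Page, PageFetcher, the maxPages and minQuota parameters) is hypothetical and only illustrates the loop.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical paging helper; not Guttenberg's actual API code. */
final class PagedLoader {
    private PagedLoader() {}

    /** One API page, reduced to the fields the loop needs. */
    static final class Page {
        final List<Long> answerIds;
        final boolean hasMore;       // mirrors has_more in the API wrapper
        final int quotaRemaining;    // mirrors quota_remaining

        Page(List<Long> answerIds, boolean hasMore, int quotaRemaining) {
            this.answerIds = answerIds;
            this.hasMore = hasMore;
            this.quotaRemaining = quotaRemaining;
        }
    }

    /** Fetches the page with the given 1-based number. */
    interface PageFetcher {
        Page fetch(int pageNumber);
    }

    /**
     * Loads pages until the API has no more results, maxPages is reached,
     * or the remaining quota drops below minQuota.
     */
    static List<Long> loadAll(PageFetcher fetcher, int maxPages, int minQuota) {
        List<Long> ids = new ArrayList<>();
        for (int page = 1; page <= maxPages; page++) {
            Page result = fetcher.fetch(page);
            ids.addAll(result.answerIds);
            if (!result.hasMore || result.quotaRemaining < minQuota) {
                break;
            }
        }
        return ids;
    }
}
```

With maxPages set to 1 this behaves like the current single-page load, so the limit can be raised gradually while watching the quota.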

Add a check user command

 checkuser <userId>

It can only be run by room owners (RO) and moderators: user.isModerator() || user.isRoomOwner()

  1. Get all posts of the user (see the Detecto Plagio code base).
  2. Develop an algorithm to decide what to search for (e.g. the second complete sentence, comments in code, etc.). This needs testing to find the best search strategy.
  3. Call the Google or Bing API to find posts (restricted to SO only?). We can't compare off-site content, but we could notify.
  4. Run the Guttenberg algorithm on the results (the first 3 SO results?).
  5. Report the result in chat (watch out for the rate limit, hence throttle reports).
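The permission check and argument parsing could be sketched like this. ChatUser is a hypothetical stand-in for the chat library's user type; only the two method names come from the issue text.

```java
import java.util.OptionalLong;

/** Hypothetical command parser; not part of Guttenberg's code base. */
final class CheckUserCommand {
    private CheckUserCommand() {}

    /** Stand-in for the chat library's user type (assumption). */
    interface ChatUser {
        boolean isModerator();
        boolean isRoomOwner();
    }

    /**
     * Returns the user ID to check, or empty if the sender lacks the
     * privilege or the message is not a valid "checkuser <userId>".
     */
    static OptionalLong parse(String message, ChatUser sender) {
        if (!(sender.isModerator() || sender.isRoomOwner())) {
            return OptionalLong.empty();
        }
        String[] parts = message.trim().split("\\s+");
        if (parts.length != 2 || !parts[0].equalsIgnoreCase("checkuser")) {
            return OptionalLong.empty();
        }
        try {
            long userId = Long.parseLong(parts[1]);
            return userId > 0 ? OptionalLong.of(userId) : OptionalLong.empty();
        } catch (NumberFormatException e) {
            return OptionalLong.empty();
        }
    }
}
```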

Replace the use of printStackTrace() with a logger

Instead of using printStackTrace() to log exceptions, it would be preferable to use the standard mechanism of loggers.

Since Chatexchange already brings slf4j-api on the classpath (which is a facade for every other logging library in Java), you could add a dependency on slf4j-log4j12 in the POM. Then start using the logging features like:

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

private static final Logger LOGGER = LoggerFactory.getLogger(MyClass.class);

try {
  LOGGER.info("Doing the miraculous thing with parameters {} and {}", param1, param2);
} catch (SomeException e) {
  LOGGER.error("Ups", e); // log the error with proper stacktrace
}

"quota" command

It should return the remaining API quota. The data could also be included in the "status" command.

"Reboot"-command

This command could be useful if one of the background threads freezes.


reboot <soft|hard>

A soft reboot resets the ScheduledExecutorServices and a hard reboot launches a new instance of the current jar and closes itself.
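A rough sketch of the soft variant, assuming a single scheduled task. SoftReboot is a hypothetical class; note that shutdownNow() only interrupts the worker thread, so a thread frozen in non-interruptible I/O may linger even after the new executor has started.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of a soft reboot; not Guttenberg's actual code. */
final class SoftReboot {
    private final Runnable cycle;
    private final long periodSeconds;
    private ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();

    SoftReboot(Runnable cycle, long periodSeconds) {
        this.cycle = cycle;
        this.periodSeconds = periodSeconds;
        schedule();
    }

    private void schedule() {
        executor.scheduleAtFixedRate(cycle, 0, periodSeconds, TimeUnit.SECONDS);
    }

    /** Discards the old executor (interrupting its thread) and starts a fresh one. */
    synchronized void softReboot() {
        executor.shutdownNow();
        executor = Executors.newSingleThreadScheduledExecutor();
        schedule();
    }

    /** Stops scheduling for good. */
    synchronized void stop() {
        executor.shutdownNow();
    }

    synchronized boolean isRunning() {
        return !executor.isShutdown();
    }
}
```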


@Bhargav-Rao and I didn't manage to get this working properly ;-)

"Update" command

The auto-update can take up to 30 minutes. If an update needs to be deployed faster, an update command would be useful.

We need google or bing

Today we again found a plagiarizing user. Felix has done great work on the algorithms that find possible plagiarism, but the /related search performs poorly.

The solution would be to find a good strategy and use the Google or Bing API; the problem, however, is the API quota.

So this is my idea:

  1. We develop software that can run on client machines (so multiple people can run it).
  2. On SOBotics.org we develop a service where the client machines can register their current IP.
  3. Guttenberg forwards its search requests to SOBotics.org, and the application distributes each call among the registered clients that respond.
  4. SOBotics.org responds with the IDs of possible plagiarism.
  5. Guttenberg proceeds as usual.

Basically, this way we can use multiple IPs and API keys to query Google or Bing (or both). Note that this system could also be used to query the SE API in the future and let us run all bots on SOBotics.org, using our own PCs only as clients to distribute IPs and API keys.

Show more related posts

Currently, only the post with the highest score is displayed. If the score is really high, it would be nice to get more related answers on request. This could be triggered by replying "more".

Add tag to each report

This will require changing the filter used when fetching the target answer.

Only the first tag should be displayed

Launch date is not written correctly

When executing the status command, the "running since" field is null. For some reason, the date no longer seems to be set, maybe due to an exception right after launch.

Check network-wide

Stack Overflow is the biggest site on the SE network, but the other sites have their fair share of plagiarism. Adding these should approximately double the load on the bot, so this might not be feasible at the moment.

False positives when comparing short JavaScript / XML posts

I have noticed that these false positives often point to the same original post. If we could whitelist a post, we could automatically remove all false positives related to it and drastically reduce the problem.

Example: @gut whitelist https://stackoverflow.com/a/43390566/4687348

Alternatively, simply sending fp to the bot could automatically whitelist the original post: it is probably very unlikely that someone will actually plagiarize that specific post, and even if someone does, reducing false positives may be more important than a single true positive.
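A minimal sketch of such a whitelist, assuming posts are identified by the answer ID in a share link like the one in the example. The Whitelist class and its method names are hypothetical, and persistence is left out.

```java
import java.util.HashSet;
import java.util.OptionalLong;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical in-memory whitelist of original posts. */
final class Whitelist {
    // Matches share links such as https://stackoverflow.com/a/43390566/4687348
    // and captures the answer ID (the first number after /a/).
    private static final Pattern ANSWER_LINK =
            Pattern.compile("stackoverflow\\.com/a/(\\d{1,18})");

    private final Set<Long> originals = new HashSet<>();

    /** Extracts the answer ID from a share link, if there is one. */
    static OptionalLong answerId(String url) {
        Matcher matcher = ANSWER_LINK.matcher(url);
        return matcher.find()
                ? OptionalLong.of(Long.parseLong(matcher.group(1)))
                : OptionalLong.empty();
    }

    /** Adds the linked answer; returns false if the link was not understood. */
    boolean add(String url) {
        OptionalLong id = answerId(url);
        if (!id.isPresent()) {
            return false;
        }
        originals.add(id.getAsLong());
        return true;
    }

    /** True if reports pointing at this original should be suppressed. */
    boolean suppresses(long originalAnswerId) {
        return originals.contains(originalAnswerId);
    }
}
```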

Add standby-mode

Guttenberg should check on Redunda whether it should be on standby or running.

Restructure the project

The structure should be similar to Natty. Additionally, an ApiService should be implemented instead of only using ApiUtils

Use API cache

As soon as the required methods are available, we should add a property containing the URL to our API cache.

"opt-in" command

To get notified about possible plagiarism with a certain score, we could implement an opt-in-command like Natty has.

Example:

opt-in 0.8 //get notified about scores >= 0.8 when online

opt-in 0.8 always //try to ping you, even when you are offline
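Parsing the two forms above could look roughly like this. OptIn is a hypothetical class, and the accepted score range of 0 to 1 is an assumption based on the scores shown in the reports.

```java
import java.util.Optional;

/** Hypothetical opt-in setting parsed from a chat message. */
final class OptIn {
    final double minScore;
    final boolean always;

    private OptIn(double minScore, boolean always) {
        this.minScore = minScore;
        this.always = always;
    }

    /** Parses "opt-in <score>" or "opt-in <score> always"; empty if invalid. */
    static Optional<OptIn> parse(String message) {
        String[] parts = message.trim().split("\\s+");
        if (parts.length < 2 || parts.length > 3 || !parts[0].equalsIgnoreCase("opt-in")) {
            return Optional.empty();
        }
        double score;
        try {
            score = Double.parseDouble(parts[1]);
        } catch (NumberFormatException e) {
            return Optional.empty();
        }
        // Assumed valid range; this comparison also rejects NaN.
        if (!(score >= 0.0 && score <= 1.0)) {
            return Optional.empty();
        }
        boolean always = false;
        if (parts.length == 3) {
            if (!parts[2].equalsIgnoreCase("always")) {
                return Optional.empty();
            }
            always = true;
        }
        return Optional.of(new OptIn(score, always));
    }

    /** Decides whether this user should be pinged for a report. */
    boolean shouldNotify(double reportScore, boolean userIsOnline) {
        return reportScore >= minScore && (always || userIsOnline);
    }
}
```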

Deploy updates via Redunda

At the moment, tags are built by Travis and deployed to FelixSFD/server. By tracking the JARs with org.sobotics.redunda.DataService, updates could automatically be deployed to all instances.

Separate post body

The body of each post should be separated into "plaintext", "blockquote" and "code".

If this is done, the Jaro-Winkler distance should not be calculated for two full posts, but for each of those parts. The values could be combined in one final score like this:

(score_plaintext*0.7)+(score_blockquote*0.5)+(score_code*1)
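The formula translates directly into code. ScoreCombiner is a hypothetical name; the weights are the ones proposed above. Because the weights sum to 2.2, the raw result can exceed 1, so the sketch also offers a normalized variant, which is an addition of mine and not part of the proposal.

```java
/** Hypothetical helper implementing the weighted score proposed above. */
final class ScoreCombiner {
    private ScoreCombiner() {}

    static final double WEIGHT_PLAINTEXT = 0.7;
    static final double WEIGHT_BLOCKQUOTE = 0.5;
    static final double WEIGHT_CODE = 1.0;

    /** (score_plaintext*0.7)+(score_blockquote*0.5)+(score_code*1) */
    static double combine(double plaintext, double blockquote, double code) {
        return plaintext * WEIGHT_PLAINTEXT
                + blockquote * WEIGHT_BLOCKQUOTE
                + code * WEIGHT_CODE;
    }

    /** Same score divided by the sum of the weights, so it stays in 0..1. */
    static double combineNormalized(double plaintext, double blockquote, double code) {
        double weightSum = WEIGHT_PLAINTEXT + WEIGHT_BLOCKQUOTE + WEIGHT_CODE;
        return combine(plaintext, blockquote, code) / weightSum;
    }
}
```

A normalized score would keep existing thresholds such as 0.77 or 0.8 comparable with the current single-value score.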

Don't ping on reposts

It's not useful to get pinged on high-scored reposts, since they will be autoflagged.
