The guttenberg from sobotics

Status-command: Print location in first line

When multiple instances are online, they all report their status, which is intended. The output would be easier to read if each instance printed its location on the first line of its status report.

http://chat.stackoverflow.com/transcript/message/36442344#36442344

Set last successful execution in standby

While the bot is in standby, the value checked by SelfCheckService is not updated. This can result in a message like this when waking up from standby:

Decrease the number of API-calls

At the moment, I guess we use about 30 requests per minute. To let the bot run 24/7, we should get this down to fewer than 10 requests.

Mysterious crashes (around midnight?)

The thread that searches for duplicates sometimes crashes without an exception. Sample from last night:

 INFO [Guttenberg:137] 2017-01-31 00:02:40,141 - Add the answers to the PlagFin$
 INFO [Guttenberg:152] 2017-01-31 00:02:40,150 - Find the duplicates...
 INFO [Guttenberg:175] 2017-01-31 00:02:43,288 - Finished at - 2017-01-30T23:02$
 INFO [Guttenberg:104] 2017-01-31 00:03:37,368 - Executing at - 2017-01-30T23:0$
 INFO [NewAnswersFinder:53] 2017-01-31 00:03:37,811 - findRecentAnswers() done $
 INFO [RelatedAnswersFinder:53] 2017-01-31 00:03:37,814 - Fetch the linked/rela$
 INFO [RelatedAnswersFinder:57] 2017-01-31 00:03:38,077 - Related done
 INFO [RelatedAnswersFinder:59] 2017-01-31 00:03:38,323 - linked done
 INFO [RelatedAnswersFinder:87] 2017-01-31 00:03:39,444 - Collected 100 answers
 INFO [Guttenberg:137] 2017-01-31 00:03:39,446 - Add the answers to the PlagFin$
 INFO [Guttenberg:152] 2017-01-31 00:03:39,454 - Find the duplicates...

I think the last crash was around midnight (GMT+1), too. I'll check if this happens again

Fix the room-joining

With #76, I introduced a new way of defining whether the bot runs locally or on the server, but I didn't use it to determine which rooms to join.

Deployment via Travis doesn't work

See this build: https://travis-ci.org/SOBotics/Guttenberg/jobs/295846055

Minimum time-span between target and original

If the time-span between two posts on the same question is very small (< 10 minutes), it's most likely not plagiarism. Guttenberg should ignore these posts.

https://chat.stackoverflow.com/transcript/message/39903430#39903430
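The rule above could be sketched as follows. This is only an illustration, not Guttenberg's actual code: the class name `TimeSpanCheck`, the method `shouldIgnore`, and its parameters are hypothetical, and the 10-minute threshold is taken from the issue text.

```java
import java.time.Duration;
import java.time.Instant;

final class TimeSpanCheck {
    // Threshold from the issue text; posts closer together than this are ignored.
    static final Duration MIN_GAP = Duration.ofMinutes(10);

    /**
     * Returns true when both posts belong to the same question and were
     * created less than MIN_GAP apart, in either order.
     */
    static boolean shouldIgnore(long originalQuestionId, Instant originalCreated,
                                long targetQuestionId, Instant targetCreated) {
        if (originalQuestionId != targetQuestionId) {
            return false; // the rule only applies to posts on the same question
        }
        Duration gap = Duration.between(originalCreated, targetCreated).abs();
        return gap.compareTo(MIN_GAP) < 0;
    }
}
```

A caller would run this check before reporting a match and drop the pair when it returns true.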

Clear old logs automatically

When creating a new logfile, an old one (maybe older than 3 days) should be deleted.
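One possible way to do this with the standard `java.nio.file` API is sketched below. The class name `LogCleaner`, the `.log` suffix, and the idea of passing the log directory as a parameter are assumptions made for the example; the actual log location and naming scheme are not stated in the issue.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class LogCleaner {
    /**
     * Deletes regular files ending in ".log" (assumed suffix) whose last
     * modification is older than maxAge, and returns the deleted paths.
     */
    static List<Path> deleteOldLogs(Path logDir, Duration maxAge, Instant now) throws IOException {
        Instant cutoff = now.minus(maxAge);
        List<Path> candidates;
        try (Stream<Path> files = Files.list(logDir)) { // close the directory handle
            candidates = files.collect(Collectors.toList());
        }
        List<Path> deleted = new ArrayList<>();
        for (Path file : candidates) {
            boolean isLog = Files.isRegularFile(file)
                    && file.getFileName().toString().endsWith(".log");
            if (isLog && Files.getLastModifiedTime(file).toInstant().isBefore(cutoff)) {
                Files.delete(file);
                deleted.add(file);
            }
        }
        return deleted;
    }
}
```

Calling `LogCleaner.deleteOldLogs(dir, Duration.ofDays(3), Instant.now())` whenever a new logfile is created would match the behaviour requested above.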

Show the search-engine's quota

At the moment, the quota-command only shows the quota on SE. If a search engine returns its remaining quota, we should show this as well.

Allow feedback

tp: If the post is plagiarism
fp: If both posts don't have anything to do with each other

This will need a web-dashboard like Sentinel

Status command

Should return

Guttenberg's version
the number of checked and caught posts
running since xxxxxxx

Guttenberg description should link to the StackApps post

At present Guttenberg reports posts with a description that links to the GitHub repository. This should link to the Stack Apps post instead.

[ [Guttenberg](https://git.io/vMrPa) ] [Possible plagiarism](https://stackoverflow.com/a/41975334) with a score of **0.77**. [Original post](https://stackoverflow.com/a/8549223)

Should be

[ [Guttenberg](//stackapps.com/q/7197/43403) ] [Possible plagiarism](https://stackoverflow.com/a/41975334) with a score of **0.77**. [Original post](https://stackoverflow.com/a/8549223)

Crash when request fails

The backoff and other errors are not handled correctly. The execution stops instead of skipping the current cycle.
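A minimal sketch of "skip the cycle instead of stopping", assuming the cycle is run through a `ScheduledExecutorService` (which the reboot issue suggests). Per the Javadoc of `scheduleAtFixedRate`, an exception thrown by one execution suppresses all subsequent executions, so the task is wrapped here. The class `SafeCycle` and the `skipped` counter are hypothetical names for the example.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class SafeCycle {
    /**
     * Wraps a cycle so that a failure (for example a failed API request)
     * is recorded and the cycle is skipped, instead of propagating to the
     * scheduler and cancelling all future executions.
     */
    static Runnable skippingFailures(Runnable cycle, AtomicInteger skipped) {
        return () -> {
            try {
                cycle.run();
            } catch (Exception e) {
                skipped.incrementAndGet();
                // A real implementation would use the project's logger here.
                System.err.println("Cycle skipped: " + e);
            }
        };
    }
}
```

The wrapped `Runnable` would then be handed to the scheduler in place of the original task.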

When Guttenberg checks new answers, the date is not a problem: all posts that are found are older than the target. With the check-command, however, we could accidentally flag the original post instead of the copy.

Load more than 100 related answers

At the moment, only the first page of /answers/{ids} is loaded. This contains 100 results.

Since Guttenberg seems to run stably now, we have to check how much quota it is using (it looks like about 4 per minute). If we have enough quota left, we could load the next page (or more) as well. This could help us find many more posts.
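The paging idea could look roughly like this. The sketch does not call the API itself; `fetchPage` stands in for whatever method requests one page of /answers/{ids}. The `Page` class mirrors the `has_more` and `quota_remaining` fields of the Stack Exchange API response wrapper, while `PagedLoader`, `maxPages`, and `minQuota` are hypothetical names chosen for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntFunction;

final class PagedLoader {
    /** One page of results plus the paging and quota info returned with it. */
    static final class Page {
        final List<Long> answerIds;
        final boolean hasMore;
        final int quotaRemaining;

        Page(List<Long> answerIds, boolean hasMore, int quotaRemaining) {
            this.answerIds = answerIds;
            this.hasMore = hasMore;
            this.quotaRemaining = quotaRemaining;
        }
    }

    /**
     * Loads pages 1..maxPages and stops early when the API reports no more
     * results or the remaining quota has dropped to minQuota or below.
     */
    static List<Long> loadAll(IntFunction<Page> fetchPage, int maxPages, int minQuota) {
        List<Long> ids = new ArrayList<>();
        for (int page = 1; page <= maxPages; page++) {
            Page result = fetchPage.apply(page);
            ids.addAll(result.answerIds);
            if (!result.hasMore || result.quotaRemaining <= minQuota) {
                break;
            }
        }
        return ids;
    }
}
```

Keeping a quota floor means the extra pages are only loaded while enough requests remain for the regular cycles.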

Add a check user command

 checkuser <userId>

This command can only be run by room owners (RO) and moderators: user.isModerator() || user.isRoomOwner()

Get all posts of the user (see the Detecto Plagio code base)
Develop an algorithm to decide what to search for (e.g. the second complete sentence, comments in code, etc.). We need testing to find the best search strategy
Call the Google or Bing API to find posts (restricted to SO only?). We can't compare off-site content, but we could notify
Run the Guttenberg algorithm on the results (the first 3 SO results?)
Report the result in chat (watch out for the rate limit, hence throttle the reports)

Replace the use of printStackTrace() with a logger

Instead of using printStackTrace() to log exceptions, it would be preferable to use the standard mechanism of loggers.

Since Chatexchange already brings slf4j-api on the classpath (which is a facade for every other logging library in Java), you could add a dependency on slf4j-log4j12 in the POM. Then start using the logging features like:

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

private static final Logger LOGGER = LoggerFactory.getLogger(MyClass.class);

try {
  LOGGER.info("Doing the miraculous thing with parameters {} and {}", param1, param2);
} catch (SomeException e) {
  LOGGER.error("Ups", e); // log the error with proper stacktrace
}

"quota" command

It should return the remaining api-quota. The data could also be included in the "status" command

"Reboot"-command

This command could be useful if one of the background threads freezes.

reboot <soft|hard>

A soft reboot resets the ScheduledExecutorServices and a hard reboot launches a new instance of the current jar and closes itself.

@Bhargav-Rao and I didn't manage to get this working properly ;-)
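For reference, a soft reboot in the sense described above could be sketched like this. It is not the attempted implementation, which is not shown in the issue; the class `SoftReboot` and its methods are hypothetical. Note that `shutdownNow()` only interrupts the running task, so a thread that ignores interruption would still stay frozen.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

final class SoftReboot {
    private final Runnable task;
    private final long periodSeconds;
    private ScheduledExecutorService executor;

    SoftReboot(Runnable task, long periodSeconds) {
        this.task = task;
        this.periodSeconds = periodSeconds;
        start();
    }

    private void start() {
        executor = Executors.newSingleThreadScheduledExecutor();
        executor.scheduleAtFixedRate(task, 0, periodSeconds, TimeUnit.SECONDS);
    }

    /** Discards the current executor and schedules the task on a fresh one. */
    synchronized void softReboot() {
        executor.shutdownNow(); // interrupts the running task, cancels pending runs
        start();
    }

    synchronized void stop() {
        executor.shutdownNow();
    }

    synchronized ScheduledExecutorService executor() {
        return executor;
    }
}
```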

"Update" command

The auto-update can take up to 30 minutes. If an update needs to be deployed faster, an update command could be useful

Fetch answers to the same question as target

Yes. I did forget it indeed

We need google or bing

Today we again found a user who plagiarizes. Felix has done great work on the algorithms that find possible plagiarism, but the /related search performs poorly.

The solution would be to find a good search strategy and use the Google or Bing API. The problem, however, is the API quota.

So this is my idea:

We develop software that can run on client machines (hence multiple people can run it)
On SOBotics.org we develop a service where the client machines can register their current IP
Guttenberg forwards its search request to SOBotics.org, and the application distributes the call to the registered clients that respond
SOBotics.org responds with the IDs of possible plagiarism
Guttenberg proceeds as usual

Basically, this way we can use multiple IPs and API keys to query Google or Bing (or both). Note that this system could also be used to query the SE API in the future and let us run all bots on SOBotics.org, using our own PCs only as clients to distribute IPs and API keys.

Checkuser-command: Check high-scoring answers first

As recommended by Martijn here

His experience showed that plagiarized posts often reach high scores, so we should check those first.
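The ordering itself is a plain descending sort by score. A minimal sketch, assuming a simple `Answer` holder with an `id` and a `score` (both names are made up for the example and do not come from the code base):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class AnswerOrdering {
    /** Minimal stand-in for an answer fetched from the API. */
    static final class Answer {
        final long id;
        final int score;

        Answer(long id, int score) {
            this.id = id;
            this.score = score;
        }
    }

    /** Returns a new list with the highest-scoring answers first. */
    static List<Answer> highestScoreFirst(List<Answer> answers) {
        List<Answer> sorted = new ArrayList<>(answers); // leave the input untouched
        sorted.sort(Comparator.comparingInt((Answer a) -> a.score).reversed());
        return sorted;
    }
}
```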

Wrong user-link sent to CopyPastor

If Guttenberg is running outside SO's chat (for example SE), a wrong link to the user will be sent when submitting feedback

https://chat.stackexchange.com/transcript/message/42110738#42110738

Minimum length for ExactParagraphMatch

Remove backticks in the output of the `help` command (around the command `opt-in`)

http://chat.stackoverflow.com/transcript/message/36345390#36345390

Getting code paras excludes the necessary new line at the start

The getCodeParagraphs function in PostUtils is faulty in some edge cases.

In Markdown, four leading spaces create a code block only if a blank line precedes the start of the block.

The getCodeParagraphs function should check this as well.
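A simplified sketch of such a check is shown below. It is not the existing PostUtils code: the class name `CodeParagraphs` is hypothetical, only four-space indentation is handled (no tabs or fenced blocks), and a blank line inside indented code ends the block, which is stricter than real Markdown.

```java
import java.util.ArrayList;
import java.util.List;

final class CodeParagraphs {
    /**
     * Collects runs of lines indented by four spaces, but only when the run
     * starts at the beginning of the post or directly after a blank line.
     */
    static List<String> getCodeParagraphs(String markdown) {
        List<String> blocks = new ArrayList<>();
        StringBuilder current = null;
        boolean previousBlank = true; // the start of the post counts as "after a blank line"
        for (String line : markdown.split("\r?\n", -1)) {
            boolean blank = line.trim().isEmpty();
            boolean indented = line.startsWith("    ") && !blank;
            if (current != null) {
                if (indented) {
                    current.append('\n').append(line.substring(4));
                } else {
                    blocks.add(current.toString());
                    current = null;
                }
            } else if (indented && previousBlank) {
                current = new StringBuilder(line.substring(4));
            }
            previousBlank = blank;
        }
        if (current != null) {
            blocks.add(current.toString());
        }
        return blocks;
    }
}
```

An indented line that directly follows a paragraph line is treated as part of that paragraph and is therefore not returned.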

Show more related posts

Currently, only the post with the highest score is displayed. If the score is really high, it would be nice to get more related answers on request. This could be triggered by replying "more" to the report.

Add tag to each report

This will require changing the filter used when fetching the target answer.

Only the first tag should be displayed

Comparing code snippets results in too high scores

Example: https://chat.stackoverflow.com/transcript/message/39891045#39891045
Solution:
Remove the snippet markup before comparing posts
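One way to strip the markup is a regular expression over the post's Markdown. The sketch assumes the usual Stack Snippets comment lines (`<!-- begin snippet: ... -->`, `<!-- language: ... -->`, `<!-- end snippet -->`); the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

final class SnippetMarkup {
    // Matches a whole line holding one of the snippet marker comments,
    // including its trailing line break.
    private static final Pattern SNIPPET_COMMENT = Pattern.compile(
            "(?m)^[ \\t]*<!--\\s*(?:begin snippet:[^\\n]*?|end snippet|language:[^\\n]*?)\\s*-->[ \\t]*\\r?\\n?");

    /** Returns the Markdown with the snippet marker lines removed; the code itself is kept. */
    static String strip(String markdown) {
        return SNIPPET_COMMENT.matcher(markdown).replaceAll("");
    }
}
```

Other HTML comments in the post are left alone, so only the snippet wrapper is excluded from the comparison.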

Use CopyPastor

All reports should be sent to SOBotics/CopyPastor and a link to the dashboard should be in every report

Launch date is not written correctly

When executing the status-command, the "running since"-field is null. For some reason, the date seems not to be set anymore. Maybe due to an exception right after the launch

Check network-wide

Stack Overflow is the biggest site on the SE network, but the other sites have their fair share of plagiarism. Adding these should approximately double the load on the bot, so this might not be feasible at the moment.

NullPointerException when answerer doesn't exist

It's not a big problem, but all those exceptions in the log are annoying. It should be caught, but not printed

False positives when comparing short JavaScript / XML posts

I have noticed that these false positives often point to the same original post. If we could whitelist a post, we could automatically remove all false positives related to it and drastically reduce the problem.

Example: @gut whitelist https://stackoverflow.com/a/43390566/4687348

The solution could also be to simply send fp to the bot and have it automatically whitelist the original post. It is probably very unlikely that someone will actually plagiarize that specific post, and even if someone does, the reduction in false positives may be more important than a single true positive.

CI not configured for the new server yet

At the moment, CI is just set up for my Raspberry Pi. Travis should push the tags to the new server instead.

Add more reasons for a post to be reported

This could be similar to the filters in Natty. If one of the filters (for example the current string-similarity) wants to report the post, report it.

Filter markup for code-snippets

The markup defining code snippets seems to confuse Guttenberg. We should delete it before comparing the posts.

Respond to a global "alive"-command

To check the status of all bots in SOBotics, it would be nice to have a single alive-command for all bots, such as @bots alive.

See: http://chat.stackoverflow.com/transcript/message/36182446#36182446

Add standby-mode

Guttenberg should check on Redunda, if it should be on standby or running

Display a different text, when the matched posts have the same author

http://chat.stackoverflow.com/transcript/message/35121562#35121562

Allow feedback (Part 2): Store feedback

The feedback needs to be stored somewhere, either in a database or in a text file.

Restructure the project

The structure should be similar to Natty. Additionally, an ApiService should be implemented instead of only using ApiUtils

Use API cache

As soon as the required methods are available, we should add a property containing the URL to our API cache.

"opt-in" command

To get notified about possible plagiarism with a certain score, we could implement an opt-in-command like Natty has.

Example:

opt-in 0.8 //get notified about scores >= 0.8 when online

opt-in 0.8 always //try to ping you, even when you are offline
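The two examples could be handled roughly as follows. This is a sketch of the proposed behaviour, not Natty's or Guttenberg's implementation; the class `OptIn` and its members are hypothetical.

```java
final class OptIn {
    final double minScore;
    final boolean always;

    OptIn(double minScore, boolean always) {
        this.minScore = minScore;
        this.always = always;
    }

    /** Parses "opt-in <score>" or "opt-in <score> always". */
    static OptIn parse(String command) {
        String[] parts = command.trim().split("\\s+");
        if (parts.length < 2 || parts.length > 3 || !parts[0].equals("opt-in")) {
            throw new IllegalArgumentException("Usage: opt-in <score> [always]");
        }
        if (parts.length == 3 && !parts[2].equalsIgnoreCase("always")) {
            throw new IllegalArgumentException("Unknown option: " + parts[2]);
        }
        // Double.parseDouble throws NumberFormatException, a subclass of
        // IllegalArgumentException, when the score is not a number.
        return new OptIn(Double.parseDouble(parts[1]), parts.length == 3);
    }

    /** A user is pinged for scores >= minScore, when online or when opted in with "always". */
    boolean shouldPing(double reportScore, boolean userIsOnline) {
        return reportScore >= minScore && (always || userIsOnline);
    }
}
```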

(score_plaintext*0.7)+(score_blockquote*0.5)+(score_code*1)

Don't ping on reposts

It's not useful to get pinged on high-scored reposts, since they will be autoflagged.

Resource-leak when reading guttenberg.properties

#25 (comment)

Might be somewhere else as well
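A common fix for this kind of leak is try-with-resources, which closes the stream even when loading fails. The sketch below shows the pattern only; the class `PropertyLoader` and the idea of passing the file path as a parameter are assumptions, since the affected code is not quoted in the issue.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

final class PropertyLoader {
    /** Loads a properties file and always closes the underlying stream. */
    static Properties load(Path file) throws IOException {
        Properties properties = new Properties();
        try (InputStream in = Files.newInputStream(file)) {
            properties.load(in);
        } // the stream is closed here, also if load() throws
        return properties;
    }
}
```

The same pattern applies to any other place that opens a stream or reader without closing it.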
