sobotics / guttenberg
A bot, searching for plagiarism on Stack Overflow.
Home Page: http://stackapps.com/q/7197/43403
License: GNU General Public License v3.0
When multiple instances are online, they all report their status (which is intended). The reports would be easier to read if the location were printed in the first line of the status report.
http://chat.stackoverflow.com/transcript/message/36442344#36442344
When in standby, the value checked by SelfCheckService won't be set. This can result in a message like this when waking up from standby.
At the moment, I guess we use 30 requests per minute. To let the bot run 24/7, we should get this down to fewer than 10 requests per minute.
The thread that searches for duplicates sometimes crashes without an exception. Sample from last night:
INFO [Guttenberg:137] 2017-01-31 00:02:40,141 - Add the answers to the PlagFin$
INFO [Guttenberg:152] 2017-01-31 00:02:40,150 - Find the duplicates...
INFO [Guttenberg:175] 2017-01-31 00:02:43,288 - Finished at - 2017-01-30T23:02$
INFO [Guttenberg:104] 2017-01-31 00:03:37,368 - Executing at - 2017-01-30T23:0$
INFO [NewAnswersFinder:53] 2017-01-31 00:03:37,811 - findRecentAnswers() done $
INFO [RelatedAnswersFinder:53] 2017-01-31 00:03:37,814 - Fetch the linked/rela$
INFO [RelatedAnswersFinder:57] 2017-01-31 00:03:38,077 - Related done
INFO [RelatedAnswersFinder:59] 2017-01-31 00:03:38,323 - linked done
INFO [RelatedAnswersFinder:87] 2017-01-31 00:03:39,444 - Collected 100 answers
INFO [Guttenberg:137] 2017-01-31 00:03:39,446 - Add the answers to the PlagFin$
INFO [Guttenberg:152] 2017-01-31 00:03:39,454 - Find the duplicates...
I think the last crash was around midnight (GMT+1), too. I'll check if this happens again
With #76, I introduced a new way of defining whether the bot runs locally or on the server. But I didn't use it to determine which rooms to join.
See this build: https://travis-ci.org/SOBotics/Guttenberg/jobs/295846055
If the time-span between two posts on the same question is very small (< 10 minutes), it's most likely not plagiarism. Guttenberg should ignore these posts.
https://chat.stackoverflow.com/transcript/message/39903430#39903430
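A minimal sketch of such a time-span check, assuming the creation dates of both posts are available as `Instant` values. The class and method names are made up for illustration and are not part of Guttenberg; only the 10-minute threshold comes from the issue text.

```java
import java.time.Duration;
import java.time.Instant;

final class TimeSpanFilter {
    // Threshold from the issue text: posts less than 10 minutes apart are ignored.
    static final Duration MIN_GAP = Duration.ofMinutes(10);

    /** Returns true when the two posts were created less than MIN_GAP apart. */
    static boolean postedTooCloseTogether(Instant first, Instant second) {
        // abs() makes the check independent of the argument order
        Duration gap = Duration.between(first, second).abs();
        return gap.compareTo(MIN_GAP) < 0;
    }

    public static void main(String[] args) {
        Instant original = Instant.parse("2017-11-01T12:00:00Z");
        // 5 minutes apart: would be skipped
        System.out.println(postedTooCloseTogether(original, original.plusSeconds(5 * 60)));
        // 45 minutes apart: would still be checked
        System.out.println(postedTooCloseTogether(original, original.plusSeconds(45 * 60)));
    }
}
```

A report would then only be created when `postedTooCloseTogether` returns false.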
When creating a new logfile, an old one (maybe 3 days old) should be deleted
At the moment, the quota command only shows the quota on SE. If a search engine returns the remaining quota, we should show this as well
This will need a web-dashboard like Sentinel
Should return
At present Guttenberg reports posts with a description that links to the GitHub repository. This should link to the Stack Apps post instead.
[ [Guttenberg](https://git.io/vMrPa) ] [Possible plagiarism](https://stackoverflow.com/a/41975334) with a score of **0.77**. [Original post](https://stackoverflow.com/a/8549223)
Should be
[ [Guttenberg](//stackapps.com/q/7197/43403) ] [Possible plagiarism](https://stackoverflow.com/a/41975334) with a score of **0.77**. [Original post](https://stackoverflow.com/a/8549223)
The backoff and other errors are not handled correctly. Execution stops instead of skipping that one cycle
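One possible cause, assuming the cycle is scheduled with `ScheduledExecutorService.scheduleAtFixedRate`: the JDK suppresses all subsequent executions once a run throws, and it does so silently. The sketch below wraps the task so that a failure only ends the current cycle. `SafeCycle` and `skipOnFailure` are hypothetical names, and the failing task is simulated.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

final class SafeCycle {
    /** Wraps a task so that a failure ends only the current cycle. */
    static Runnable skipOnFailure(Runnable task) {
        return () -> {
            try {
                task.run();
            } catch (Throwable t) {
                // Without this catch, the executor would cancel all future runs.
                // A real bot would log through its logger instead of stderr.
                System.err.println("Cycle skipped: " + t);
            }
        };
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicInteger runs = new AtomicInteger();
        // Simulated task: the second cycle fails, all others succeed
        Runnable flaky = () -> {
            if (runs.incrementAndGet() == 2) {
                throw new IllegalStateException("simulated backoff violation");
            }
        };
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
        executor.scheduleAtFixedRate(skipOnFailure(flaky), 0, 50, TimeUnit.MILLISECONDS);
        Thread.sleep(400);
        executor.shutdownNow();
        // More than 2 cycles means the schedule survived the failure
        System.out.println("Cycles started: " + runs.get());
    }
}
```

Catching `Throwable` is a deliberate choice here so that errors are at least reported; narrowing it to `Exception` would be the more conservative variant.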
When Guttenberg is checking new answers, the date is not a problem: all posts that are found are older than the target. But with the check command, we could accidentally flag the original post instead of the copy.
At the moment, only the first page of /answers/{ids} is loaded. This contains 100 results.
Since Guttenberg seems to run stably now, we have to check how much quota it is using (it looks like 4 requests per minute). If we have enough quota left, we could load the next page (or more) as well. This could help us find many more posts.
Add a check user command
checkuser <userId>
It can only be run by room owners (ROs) and moderators: user.isModerator() || user.isRoomOwner()
Instead of using printStackTrace() to log exceptions, it would be preferable to use the standard mechanism of loggers.
Since Chatexchange already brings slf4j-api on the classpath (which is a facade for every other logging library in Java), you could add a dependency on slf4j-log4j12 in the POM. Then start using the logging features like:
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
// field in the class that does the logging
private static final Logger LOGGER = LoggerFactory.getLogger(MyClass.class);
// inside a method
try {
    LOGGER.info("Doing the miraculous thing with parameters {} and {}", param1, param2);
} catch (SomeException e) {
    LOGGER.error("Ups", e); // log the error with the full stack trace
}
It should return the remaining api-quota. The data could also be included in the "status" command
This command could be useful if one of the background threads freezes.
reboot <soft|hard>
A soft reboot resets the ScheduledExecutorServices, and a hard reboot launches a new instance of the current jar and closes itself.
@Bhargav-Rao and I didn't manage to get this working properly ;-)
The auto-update can take up to 30 minutes. If an update needs to be deployed faster, an update command could be useful
Yes. I did forget it indeed
Today we again found a plagiarizing user. Felix has done great work on the algorithms that find possible plagiarism, but the /related search really sucks.
The solution would be to find a good strategy and use the Google or Bing API; the problem, however, is the API quota.
So this is my idea:
Basically, this way we can use multiple IPs and API keys to query Google or Bing (or both). Note that this system could also be used to query the SE API in the future and let us run all bots on SOBotics.org, using our own PCs only as clients to distribute IPs/API keys.
As recommended by Martijn here
His experience showed that plagiarized posts often reach high scores. We should check them first
If Guttenberg is running outside SO's chat (for example SE), a wrong link to the user will be sent when submitting feedback
https://chat.stackexchange.com/transcript/message/42110738#42110738
The getCodeParagraphs method in PostUtils is a bit faulty in edge cases.
In Markdown, four leading spaces create a code block only if there is a blank line before the start of the code block.
The getCodeParagraphs function should check this as well.
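A rough sketch of that rule, not the actual getCodeParagraphs implementation: an indented line starts a code block only at the start of the post or after a blank line. `IndentedCodeBlocks` and `extract` are hypothetical names. As a simplification, a blank line always closes the current block, and list items and fenced blocks are not handled.

```java
import java.util.ArrayList;
import java.util.List;

final class IndentedCodeBlocks {
    /** Returns the indented code blocks; indented lines directly after a paragraph line are ignored. */
    static List<String> extract(String markdown) {
        List<String> blocks = new ArrayList<>();
        StringBuilder current = null;   // block being collected, or null if outside a block
        boolean previousBlank = true;   // the start of the post counts as a blank line

        for (String line : markdown.split("\n", -1)) {
            boolean blank = line.trim().isEmpty();
            boolean indented = !blank && (line.startsWith("    ") || line.startsWith("\t"));

            if (indented && (current != null || previousBlank)) {
                if (current == null) {
                    current = new StringBuilder();
                } else {
                    current.append('\n');
                }
                // Strip the indentation marker (one tab or four spaces)
                current.append(line.startsWith("\t") ? line.substring(1) : line.substring(4));
            } else if (current != null) {
                // Any other line (blank or unindented) closes the block
                blocks.add(current.toString());
                current = null;
            }
            previousBlank = blank;
        }
        if (current != null) {
            blocks.add(current.toString());
        }
        return blocks;
    }

    public static void main(String[] args) {
        String post = "Some text\n    still part of the paragraph\n\n    int x = 1;\n    int y = 2;\n\nMore text";
        // Only the two lines after the blank line are treated as code
        System.out.println(extract(post));
    }
}
```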
Currently, only the post with the highest score is displayed. If the score is really high, it would be nice to get more related answers on request. This could be triggered by replying more
This will require changing the filter when fetching the target answer.
Only the first tag should be displayed
Example: https://chat.stackoverflow.com/transcript/message/39891045#39891045
Solution:
Remove the snippet markdown before comparing posts
All reports should be sent to SOBotics/CopyPastor and a link to the dashboard should be in every report
When executing the status command, the "running since" field is null. For some reason, the date no longer seems to be set, maybe due to an exception right after launch
Stack Overflow is the biggest site on the SE network, but the other sites have their fair share of plagiarism. Adding these should approximately double the load on the bot, so this might not be feasible at the moment.
It's not a big problem, but all those exceptions in the log are annoying. They should be caught, but not printed
I have noticed that false positives often point to the same original post. If we could whitelist a post, we could automatically remove all false positives related to it and drastically reduce the problem.
Example: @gut whitelist https://stackoverflow.com/a/43390566/4687348
Another solution could be to simply send fp to the bot and have it automatically whitelist the original post. It is very unlikely that someone will actually plagiarize that specific post, and even if someone does, the reduction in false positives may be more important than a single true positive.
At the moment, CI is just set up for my Raspberry Pi. Travis should push the tags to the new server instead.
This could be similar to the filters in Natty. If one of the filters (for example the current string-similarity) wants to report the post, report it.
The markup defining code snippets seems to confuse Guttenberg. We should delete it before comparing the posts
To check the status of all bots in SOBotics, it would be nice to have one alive command for all bots, like @bots alive.
See: http://chat.stackoverflow.com/transcript/message/36182446#36182446
Guttenberg should check on Redunda, if it should be on standby or running
The feedback needs to be stored somewhere. Either in a database or in a textfile
The structure should be similar to Natty. Additionally, an ApiService should be implemented instead of only using ApiUtils
As soon as the required methods are available, we should add a property containing the URL to our API cache.
To get notified about possible plagiarism with a certain score, we could implement an opt-in command like Natty has.
Example:
opt-in 0.8 //get notified about scores >= 0.8 when online
opt-in 0.8 always //try to ping you, even when you are offline
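A possible way to parse this command, as a sketch only. The syntax follows the two examples above; `OptInCommand`, its fields and the score range of 0 to 1 are assumptions and not taken from Natty or Guttenberg.

```java
import java.util.Optional;

final class OptInCommand {
    final double minScore;   // notify for scores >= this value
    final boolean always;    // true: ping even when the user is offline

    OptInCommand(double minScore, boolean always) {
        this.minScore = minScore;
        this.always = always;
    }

    /** Parses "opt-in <score> [always]"; returns empty for anything else. */
    static Optional<OptInCommand> parse(String message) {
        String[] parts = message.trim().split("\\s+");
        if (parts.length < 2 || parts.length > 3 || !parts[0].equalsIgnoreCase("opt-in")) {
            return Optional.empty();
        }
        boolean always = false;
        if (parts.length == 3) {
            if (!parts[2].equalsIgnoreCase("always")) {
                return Optional.empty();
            }
            always = true;
        }
        try {
            double score = Double.parseDouble(parts[1]);
            // Assumption: scores are similarity values between 0 and 1
            if (Double.isNaN(score) || score < 0.0 || score > 1.0) {
                return Optional.empty();
            }
            return Optional.of(new OptInCommand(score, always));
        } catch (NumberFormatException e) {
            return Optional.empty();
        }
    }

    /** Decides whether a report with the given score should ping this user. */
    boolean shouldNotify(double reportScore, boolean userOnline) {
        return reportScore >= minScore && (always || userOnline);
    }

    public static void main(String[] args) {
        OptInCommand command = parse("opt-in 0.8 always").orElseThrow(IllegalArgumentException::new);
        System.out.println(command.shouldNotify(0.85, false));
    }
}
```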
At the moment, tags are built by Travis and deployed to FelixSFD/server. By tracking the JARs with org.sobotics.redunda.DataService, updates could be deployed automatically to all instances.
The body of each post should be separated into "plaintext", "blockquote" and "code".
If this is done, the Jaro-Winkler distance should not be calculated for two full posts, but for each of those parts. The values could be combined in one final score like this:
(score_plaintext*0.7)+(score_blockquote*0.5)+(score_code*1)
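The combination could be sketched like this. The three weights come from the formula above; the per-part scores are assumed to be Jaro-Winkler similarities between 0 and 1 that were computed elsewhere. Since the weights add up to 2.2, the raw value can exceed 1. Dividing by the weight sum is my own assumption to keep the result comparable to the current score, not part of the proposal.

```java
final class CombinedScore {
    // Weights taken from the proposed formula
    static final double WEIGHT_PLAINTEXT = 0.7;
    static final double WEIGHT_BLOCKQUOTE = 0.5;
    static final double WEIGHT_CODE = 1.0;

    /** The formula exactly as proposed; ranges from 0 to 2.2 for inputs in [0, 1]. */
    static double raw(double plaintext, double blockquote, double code) {
        return plaintext * WEIGHT_PLAINTEXT + blockquote * WEIGHT_BLOCKQUOTE + code * WEIGHT_CODE;
    }

    /** Assumption, not in the proposal: divide by the weight sum so the result stays in [0, 1]. */
    static double normalized(double plaintext, double blockquote, double code) {
        double weightSum = WEIGHT_PLAINTEXT + WEIGHT_BLOCKQUOTE + WEIGHT_CODE;
        return raw(plaintext, blockquote, code) / weightSum;
    }

    public static void main(String[] args) {
        // Example values only: similar prose, no blockquote, very similar code
        System.out.println(raw(0.9, 0.0, 0.8));
        System.out.println(normalized(0.9, 0.0, 0.8));
    }
}
```

A post without blockquotes or code contributes 0 for those parts in this sketch, which lowers its score; whether missing parts should instead be left out of the weight sum is an open design question.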
It's not useful to get pinged on high-scored reposts, since they will be autoflagged.
Might be somewhere else as well