cofacts / rumors-api
GraphQL API server for clients like rumors-site and rumors-line-bot
Home Page: https://api.cofacts.tw
License: MIT License
Each crawler should include 2 parts: a Scraper and a Compiler.
Documentation on how to get crawlers up and running is also required.
Because the structured format for rumor-db is subject to change, we may need to re-parse docs into the latest format. It would help to store previously crawled pages in WARC archives and have a parser that converts that data into the latest format.
The crawled website is subject to change as well. The WARC format records the date in the header. Compilers can use that to deal with different versions of the crawled website.
WARC is used by the Internet Archive and Common Crawl. Currently there is a great parsing / generation library for Python. However, it has little support in NodeJS :/
http://www.archiveteam.org/index.php?title=Wget_with_WARC_output
https://github.com/internetarchive/warc
Each crawler should be a docker image on Docker Hub. The crawler and the framework communicate through mounted file systems. The framework will create a directory, put `input.json` with input arguments in it, mount it under `/data` inside the docker container and `docker run` the crawler. The crawler is expected to write `output.warc` and `output.json`, as the output of the Scraper and the Compiler, respectively.
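The framework side of this contract could look roughly like the sketch below. It is only an illustration: the helper names, the `--rm` flag and the example input arguments are assumptions, and only `input.json`, `/data`, `output.warc` and `output.json` come from the description above.

```python
import json
import tempfile
from pathlib import Path


def prepare_data_dir(input_args: dict) -> Path:
    """Create a fresh directory holding input.json for one crawler run."""
    data_dir = Path(tempfile.mkdtemp(prefix="crawler-"))
    (data_dir / "input.json").write_text(json.dumps(input_args), encoding="utf-8")
    return data_dir


def docker_run_command(image: str, data_dir: Path) -> list[str]:
    """Build a `docker run` invocation that mounts data_dir at /data."""
    return ["docker", "run", "--rm", "-v", f"{data_dir}:/data", image]


def collect_outputs(data_dir: Path) -> dict:
    """Report which of the two expected output files the crawler wrote."""
    return {
        name: (data_dir / name).exists()
        for name in ("output.warc", "output.json")
    }
```

The command list can be handed to `subprocess.run` once a crawler image exists; `collect_outputs` then tells the framework whether the Scraper and the Compiler each produced their file.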
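Problem description: https://www.facebook.com/groups/1847232902175197/permalink/1884171838481303/
We need to look into problems with the scoring mechanism.
We may need to implement #1 first in order to tune the search relevance formula more systematically.
Requirements: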
This is an aggregated issue for:
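We need to discuss with everyone which ID would be good~
Besides the LINE ID, where else the English name will be used:
How 「真的假的」 differs from other rumor-debunking sites:
Names we have come up with:
Rumor-themed
RumorsHasIt
SendMeRumors
rumor search / rumearch
findrumors
「真的假的」-themed
Realllly
for real?
usodaro
majide
honnto
Other sites (content-producing / media type):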
http://www.snopes.com/
https://en.wikipedia.org/wiki/TruthOrFiction.com
Implements the API required to fulfill "Controversial point" functionality
Implementation detail: https://hackmd.io/s/SyivqlIrf#%E8%AC%A0%E8%A8%80%E5%9B%9E%E4%B8%8D%E5%AE%8C
TODO:
`CreateArticle` should contain mandatory input field "reason"
`CreateReplyRequest` should contain optional input field "reason"
As discussed on 1122 and 1025, only articles that have 2 or more reply requests are worth replying to.
Since the 1st user sending in the link cannot get the response, if an article has had only one reply request for a long time, a reply to it will never be used in the LINE bot.
The editors should have the option to list only the articles that have more than one reply request.
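A minimal sketch of such a filter, assuming each article carries a reply request counter. The field name `replyRequestCount` and the default threshold parameter are hypothetical and used only for illustration.

```python
def worth_replying(articles, min_reply_requests=2):
    """Keep only articles with at least `min_reply_requests` reply requests.

    `replyRequestCount` is an assumed field name; articles without it are
    treated as having zero reply requests.
    """
    return [
        article
        for article in articles
        if article.get("replyRequestCount", 0) >= min_reply_requests
    ]
```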
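We need to periodically update data from Airtable into Elasticsearch.
Besides the cron job script, the important part is being able to automatically identify similar articles. Alternatively, we could conservatively treat any rumors that differ by more than a little as "different" (but then, under the current search scoring mechanism, the best article would not be found Orz).
[ ] Move the crawled CSV results from the other database (https://rumor-search.g0v.ronny.tw) into rumors-db.
[ ] Organize the crawler code and the list of websites
[ ] Run the crawls on a schedule
[ ] If a crawl fails, notify the administrators (e.g. using Rollbar)
The Mid-Autumn Festival message "the two holes on a Pacific saury are nematodes", and the recent message "Just got word: Mrs. Trump's limousine was burned by people in front of Trump Tower in New York", both had no debunking articles at the time they were being forwarded like crazy.
Although news articles debunking "the two holes on a Pacific saury are nematodes" can now be found, baseless rumors like the latter may well never be debunked by anyone.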
When reviewing #53, @darkbtf mentioned that we can use the hashed article as the article id when indexing the article in `CreateArticle`.
This would greatly simplify the duplication check in #53 -- We can just go ahead to index, and do the fetching only when DB insertion fails. This reduces 1 RTT between the server and the database for normal article insertions.
TODO:
`_id` when indexing article in `CreateArticle`
`CreateArticle`
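The idea can be sketched as follows. This is not the project's code: the in-memory index stands in for the real database, SHA-256 is just one possible hash, and all names are hypothetical. The point it shows is "insert first, fetch only when the insert fails".

```python
import hashlib


class DuplicateId(Exception):
    """Raised when a document with the same id already exists."""


class InMemoryIndex:
    """Stand-in for the database; create() refuses to overwrite an id."""

    def __init__(self):
        self.docs = {}

    def create(self, doc_id, doc):
        if doc_id in self.docs:
            raise DuplicateId(doc_id)
        self.docs[doc_id] = doc

    def get(self, doc_id):
        return self.docs[doc_id]


def article_id(text: str) -> str:
    """Derive the document id from the article text (assumed: SHA-256)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def create_article(index, text):
    """Return (article, created). Fetch the existing article only on conflict."""
    doc_id = article_id(text)
    try:
        index.create(doc_id, {"id": doc_id, "text": text})
        return index.docs[doc_id], True
    except DuplicateId:
        # Only the duplicate path pays for the extra lookup.
        return index.get(doc_id), False
```

With identical text the second call takes the conflict path and returns the article created by the first call, so no separate duplication check is needed before inserting.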
UpdateUser(name: String) -> User
As the meeting note found, it is not necessary to assume that developers don't have a node environment installed.
If developers have their node environment installed, the current install script can be more straightforward. At least, `yarn install` can be carried out before `docker-compose up`, providing a smoother first-time development experience.
Given this message:
https://cofacts.g0v.tw/article/s7m91ju27j4w
We cannot find the related articles:
https://cofacts.g0v.tw/article/1lqxc09h1vbqw
https://cofacts.g0v.tw/article/3ubqbfcz0u49h
Even though all of them contain identical URLs.
From: 2017-12-30 meeting note
A reply list for a specific editor can help the users to determine if an editor is credible, and helps editors to do double-check / proofreading for a specific editor as well.
LINE users do not seem happy with the current "similarity" and tend to create new articles all the time. Showing the exactly matching sentences may help them pick the identical articles.
Snippets / highlights can help Editors find interesting articles in the "related replies / articles" section.
API server should return matched paragraphs in each search result. LINE bot & website should display the search result in a manner similar to the snippets in Google Search.
This is achievable via the Elasticsearch "highlighting" feature.
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-highlighting.html
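As an illustration, a search request body that asks Elasticsearch for highlights might be built like this. The field name `text` and the use of a plain `match` query are assumptions; the API's real query may differ.

```python
def highlight_search_body(text, field="text"):
    """Build a search body whose hits also carry highlighted fragments.

    `field` is an assumed name for the indexed article or reply text.
    """
    return {
        "query": {"match": {field: text}},
        # An empty object requests the default highlighter settings.
        "highlight": {"fields": {field: {}}},
    }
```

Each hit in the response then includes a `highlight` section with the matched fragments, which the API server could pass on as snippets.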
The editors of these replies do not have names.
https://cofacts.g0v.tw/article/5482632593803-rumor
https://cofacts.g0v.tw/article/5503261348541-rumor
Since they are editors from the editors' meetup, they are using the latest production code but still encounter this problem. This should be investigated.
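For searching a crowdsourced database, a good search system should have the following properties:
The verification method is as follows:
When tuning the recommendation formula, if the verification above could be automated and run right after changing the code, we could be more confident about the effect of the formula.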
Add state of own reply for reply user, for rumors-site ReplyFeedback.
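In the future, when editors reply, we plan to have them mark an article as one of three types: "contains false information", "does not contain false information", or "not a forwarded message". When the LINE bot responds, if an article has replies from multiple editors, it will show "This message was marked as 'contains false information' by 3 people and as 'does not contain false information' by 1 person".
Some rumors are actually satire / sarcasm, or jokes.
They may indeed "contain false information", but the intent is not to make people believe them; the falsehood is the punchline. Marking them as "contains false information" feels a bit odd (after all, no "debunking" article would be tactless enough to spoil the joke).
Perhaps we should add a "sarcasm (satirical or sarcastic)" type to mark this kind of message?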
In the schema, `articlereplyfeedback` already has a reserved field, `comment`, for this purpose.
Please add a `comment` argument to the `CreateOrUpdateArticleReplyFeedback` mutation so that we can record the reasons why a user thinks an article-reply is useless.
Reference: https://hackmd.io/s/B1bb-hXhz "When a user indicates that they find this not useful, should they leave a reason?"
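Reply authors can:
Connection authors can:
The original discussion was about whether a reply author can delete reply connections created by others.
But since the content is CC0, reply authors should not be allowed to delete other people's connections.
Allow users to use the following filters:
The filters above should ideally be multi-selectable (with all conditions ANDed together).
The current mechanism for adding rumors is
If there were an API to add data directly to the database, the Airtable step could be skipped.
Conversely, the whole rumor-editing website would then need complete functionality (listing rumors, viewing rumors and answers, editing and submitting, etc.) in order to collect data properly.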
View details in Rollbar: https://rollbar.com/mrorz/rumors-api/items/6/
Error: request entity too large
File "/srv/www/node_modules/raw-body/index.js", line 196, in readStream
return done(createError(413, 'request entity too large', 'entity.too.large', {
File "/srv/www/node_modules/raw-body/index.js", line 110, in executor
readStream(stream, encoding, length, limit, function onRead (err, buf) {
File "/srv/www/node_modules/raw-body/index.js", line 109, in getRawBody
return new Promise(function executor (resolve, reject) {
File "/srv/www/node_modules/co-body/lib/form.js", line 35, in Function.module.exports [as form]
return raw(inflate(req), opts)
File "/srv/www/node_modules/koa-bodyparser/index.js", line 89, in parseBody
return yield parse.form(ctx, formOpts);
File "native", line unknown, in next
File "/srv/www/node_modules/co/index.js", line 65, in onFulfilled
ret = gen.next(res);
File "/srv/www/node_modules/co/index.js", line 54, in <unknown>
onFulfilled();
File "/srv/www/node_modules/co/index.js", line 50, in Object.co
return new Promise(function(resolve, reject) {
File "/srv/www/node_modules/co/index.js", line 118, in Object.toPromise
if (isGeneratorFunction(obj) || isGenerator(obj)) return co.call(this, obj);
Implement these:
https://github.com/MrOrz/rumors-api/tree/master/src/graphql/mutations
The expected behavior is written in the comments. They are not hard requirements and can be changed.
Please also add test cases for `SetRumor` and `SetAnswer`. Please refer to https://github.com/MrOrz/rumors-api/tree/master/src/graphql/queries on how to structure and write test cases in this project, and see the README for running the tests.
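「真的假的」 is a ChatBot system for quickly verifying rumors. It uses crowd collaboration to fact-check shared messages on social media when nobody knows whether they are real or fake. 「真的假的」 now needs an English name, to be used for registering the GitHub organization, the LINE ChatBot account ID, and the domain.
Confucius said: "If names are not correct, speech will not be in order; if speech is not in order, affairs will not succeed."
「真的假的」 needs a concise, catchy and memorable name so that the public can find us!
We are hereby collecting English names for 「真的假的」 from everyone.
If you see a name you like among the names below, click at the top right and choose the 👍 symbol to second it;
if you have an inspiration, please leave the name you came up with in this issue, using the following format.
Each comment may contain only one name, so please record each of your name ideas separately!
English name: a powerful, resonant name
Chinese name (optional): if you think 「真的假的」 does not match your timeless English name, please leave the perfect match you have in mind
Explanation: explain the meaning behind your name
English: cofacts
Explanation: cofacts = collaborative + facts, i.e. facts produced through crowd collaboration
「真的假的」 thanks you for your contribution!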
As discussed in 20180207
It can be used in:
The level is determined by the normal article reply count.
The number of article replies required to get to level `n` is the `n`th number in the Fibonacci number list.
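A sketch of how the level and its progress could be computed. It assumes the cumulative thresholds are 0, 1, 2, 3, 5, 8, ... (level 0 needs no replies, and the list for levels 1 and up starts 1, 2, 3, 5, ...), which is the reading that matches the per-level totals of 1, 1, 1, 2, 3 stated in this issue. The function name is hypothetical.

```python
def level_progress(reply_count: int) -> dict:
    """Map a normal article reply count to level and levelProgress."""
    if reply_count < 0:
        raise ValueError("reply_count must be non-negative")

    level = 0
    floor = 0                 # replies needed to reach the current level
    next_threshold = 1        # replies needed to reach level + 1
    after_next = 2            # replies needed to reach level + 2

    while reply_count >= next_threshold:
        level += 1
        floor = next_threshold
        # Advance along the assumed Fibonacci-like list 1, 2, 3, 5, 8, ...
        next_threshold, after_next = after_next, next_threshold + after_next

    return {
        "level": level,
        "levelProgress": {
            "current": reply_count - floor,
            "total": next_threshold - floor,
        },
    }
```

For example, a user with 4 normal article replies is at level 3 (threshold 3) and has collected 1 of the 2 additional replies needed for level 4.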
The fields include:
`level`: 0~n, current level
`levelProgress.total`: The number of additional normal article replies required to reach the next level. For level 0, 1, and 2, it's 1. For level 3, it's 2. For level 4, it's 3, and so on.
`levelProgress.current`: The current number of additional normal article replies collected within this level. Ranges from 0 to `levelProgress.total`.
http://api.rumors.hacktabl.org/login/facebook?redirect=/
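After being redirected back, it becomes
`appId` and `redirect` must be set before. Did you forget to go to /login/*?
The message above comes from https://github.com/MrOrz/rumors-api/blob/master/src/auth.js#L168
`console.log` shows that `ctx.session === {}`, but in theory the session should already have been set around https://github.com/MrOrz/rumors-api/blob/master/src/auth.js#L141.
Haven't used this in a long time and used it wrong; trying to find a way to kill it.
English name: To yes, or not to yes
Chinese name (optional): 「真的假的」
Explanation:
My English is not good. The organizer asked for "a concise, catchy and memorable name", and my intuition came up with the phrase "To be, or not to be".
I thought of two words to substitute: one is "truth" (which does not seem to rhyme) and the other is "yes".
Those with good English, please carry on; this is as far as my ability goes for now.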
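Adds new type "OPINIONATED" in src/graphql/models/ReplyTypeEnum
"Yes": do nothing
"No, I want to report a rumor": when the results are completely off the mark, let the user create a new rumor (if #2 is not done yet, write it into Airtable)
The developer trial hit its user limit
https://www.facebook.com/groups/1847232902175197/permalink/1884362701795550/
We need to switch to another free plan
The problem to solve:
If we could record each link's
and add them to full-text search as well, it would be much more convenient.
"Edit answer"
"Create answer"
should only be possible after logging in.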
Counts the number of articles the user has added an article reply to.
This should be doable via a nested query, since the author of an article-reply is `articleReplies.userId` in the `articles` index.
This will be used to render a personal progress bar against the total number of articles.
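One possible shape for that query body is sketched below. It assumes `articleReplies` is mapped as a nested field and that the body is sent to a count request on the `articles` index; the function name is hypothetical.

```python
def replied_articles_count_body(user_id):
    """Build a nested query matching articles the user has replied to.

    An article matches once, no matter how many of its article replies
    were written by the same user.
    """
    return {
        "query": {
            "nested": {
                "path": "articleReplies",
                "query": {"term": {"articleReplies.userId": user_id}},
            }
        }
    }
```

Dividing the resulting count by the total number of articles gives the value for the progress bar.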
In order to prevent DDoS attacks from browsers, the API server should only be open to clients that have API keys.
We need a mechanism to allow developers to apply for a key. If the developer's client is web-based, they should specify the domains so that we open up CORS according to the API key.
For clients that send requests through backend HTTP libraries, API keys are still required, but there is no need to specify domains.
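The check described above might be sketched like this. The key registry, the key names and the function are all hypothetical; a real implementation would read keys from storage and set the CORS response headers itself.

```python
# Hypothetical registry: API key -> origins allowed for browser clients.
API_KEYS = {
    "web-client-key": {"domains": {"https://example.com"}},
    "backend-client-key": {"domains": set()},
}


def check_request(api_key, origin=None):
    """Return (allowed, cors_origin) for a request.

    `origin` is the browser's Origin header, or None for requests sent
    from backend HTTP libraries, which do not need registered domains.
    """
    entry = API_KEYS.get(api_key)
    if entry is None:
        return False, None          # unknown or missing key
    if origin is None:
        return True, None           # backend client: key alone is enough
    if origin in entry["domains"]:
        return True, origin         # echo back as the allowed CORS origin
    return False, None              # browser request from unregistered domain
```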
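After the quota is exceeded, maybe ask them to google it themselves?
As mentioned in the 3/18 pre-meeting discussion, there are currently too many unanswered rumors, which leaves editors unmotivated.
If editors could mark an article as "I want to reply", and the article list gained two more filters, "I want to reply" and "I have replied", editors could build their own TODO lists. They would no longer lose track of rumors they saw earlier and wanted to answer, and a clearer TODO list would also increase their motivation to reply.
In addition, at occasions such as editor meetups, an "I want to reply" count can help editors avoid duplicated work.
RFC:
The plan is to store this directly in the `articles` index:
add a `pendingRepliers[]` field that records `{userId, createdAt}`.
In the article list, a small note such as "someone indicated they want to reply 10 minutes ago" can be shown, telling other editors how long ago someone said they would write a reply.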
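A small sketch of the `pendingRepliers[]` idea, using a plain dict in place of an `articles` document. The helper names and the ISO-8601 timestamp format are assumptions; only `pendingRepliers`, `userId` and `createdAt` come from the RFC.

```python
from datetime import datetime, timedelta, timezone


def add_pending_replier(article, user_id, now):
    """Record that user_id wants to reply; each user is listed only once."""
    repliers = article.setdefault("pendingRepliers", [])
    if all(entry["userId"] != user_id for entry in repliers):
        repliers.append({"userId": user_id, "createdAt": now.isoformat()})
    return article


def minutes_since_latest(article, now):
    """Minutes since the most recent "I want to reply", or None if nobody did."""
    repliers = article.get("pendingRepliers", [])
    if not repliers:
        return None
    latest = max(datetime.fromisoformat(entry["createdAt"]) for entry in repliers)
    return int((now - latest).total_seconds() // 60)


# Usage: one editor marks the article, and the list is rendered 10 minutes later.
article = {"id": "example-article"}
noon = datetime(2017, 3, 20, 12, 0, tzinfo=timezone.utc)
add_pending_replier(article, "editor-1", noon)
label_minutes = minutes_since_latest(article, noon + timedelta(minutes=10))
```

`label_minutes` is the number the article list would show in the small note.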
Steps to reproduce:
Root cause: URLs are not encoded, but the redirect location requires them to be.
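To illustrate the fix, here is a sketch that percent-encodes a redirect target before it is used as a `Location` value. It is written in Python for illustration only and is not the server's code; the function name and the example path are hypothetical.

```python
from urllib.parse import quote

# Characters that already have a meaning in a URL, plus "%" so that
# sequences which are already percent-encoded are not encoded twice.
URL_SAFE_CHARS = ":/?#[]@!$&'()*+,;=%"


def encode_redirect_location(url: str) -> str:
    """Percent-encode non-ASCII and other unsafe characters in a URL."""
    return quote(url, safe=URL_SAFE_CHARS)
```

A path such as `/article/中文` becomes `/article/%E4%B8%AD%E6%96%87`, while URLs that are already plain ASCII pass through unchanged.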
We should deal with cofacts/rumors-line-bot#41 in API.
In `CreateArticle`, we first check if there is an article with a 100% match in the database.
If so, we return the existing article and submit a ReplyRequest for the user.
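Quite a few rumors already have two answers.
In Airtable they appear as two duplicated rows with different answer fields.
In this case, only one rumor and two answers should be created.
We may also need to consider the case where one answer has two rumors.
At the 3/18 meetup, both the Wikimedia community and Teacher A-Hsiao (阿孝老師) suggested letting people from different knowledge domains divide the work of replying to articles. Implementing user-generated labels seems like a good approach.
RFC: implement a label in rumors-api, similar to those in hackpad / niconico, that satisfies:
Since there will certainly be few labels at the beginning, I think we can discuss later whether labels should be merged when there are too many. The "issues" in 立委投票指南 (Legislator Voting Guide) seem to be implemented in a similar way.
The implementation is to add a field directly in articles that stores an array of text, without creating a separate index.
Currently, when we take a rumor and search the database of crawled web articles, we often find completely irrelevant articles.
Example: https://g0v-tw.slack.com/files/wildjcrt/F3FBL59UY/screenshot_20161217-130918.png
Needed: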
View details in Rollbar: https://rollbar.com/mrorz/rumors-api/items/5/
SyntaxError: Unexpected token F in JSON at position 0
File "native", line unknown, in Object.parse
File "/srv/www/node_modules/node-fetch/lib/body.js", line 48, in <unknown>
return JSON.parse(buffer.toString());
File "internal/process/next_tick.js", line 129, in process._tickDomainCallback
In `CreateArticle`, store the Traditional Chinese version of the text in an indexing field.
In `ListArticles` and `ListReplies`, convert filter terms to Traditional Chinese and search using the Traditional Chinese indexing field.