cofacts / rumors-api

GraphQL API server for clients like rumors-site and rumors-line-bot

Home Page: https://api.cofacts.tw

License: MIT License

JavaScript 99.77% Dockerfile 0.07% Pug 0.16% TypeScript 0.01%

rumors elasticsearch fact-checking crowdsourcing

rumors-api's Issues

Crawler framework

Each crawler should include 2 parts:

Scraper: Given a timestamp, store all crawled documents that come after the timestamp in a WARC archive.
Compiler: Given a WARC archive filename and a timestamp, parse the crawled documents into the latest structured format.

We also need documentation on how to get crawlers up and running.

Why Scraper-Compiler separation?

Because the structured format for rumors-db is subject to change, we may need to re-parse documents into the latest format. It helps to store previously crawled pages in WARC archives and have a parser that converts them into the latest format.

The crawled website is subject to change as well. Each WARC record includes its capture date in the record header, so Compilers can use that date to deal with different versions of the crawled website.

Why WARC?

It is used by the Internet Archive and Common Crawl. There is currently a great parsing / generation library for Python, but little support in Node.js :/

http://www.archiveteam.org/index.php?title=Wget_with_WARC_output
https://github.com/internetarchive/warc

How are individual crawlers integrated with the framework?

Each crawler should be a Docker image on Docker Hub. The crawler and the framework communicate through a mounted file system. The framework creates a directory, puts input.json with the input arguments in it, mounts the directory at /data inside the container, and runs the crawler with docker run. The crawler is expected to write output.warc and output.json as the output of the Scraper and the Compiler, respectively.
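The hand-off described above could look roughly like the following on the framework side. This is only a sketch: the image name, the shape of the input object, and the helper name `prepareCrawlerRun` are all hypothetical, and the actual `docker run` call is left as a comment.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Create the shared directory, write input.json, and build `docker run` arguments.
// `image` and the fields inside `input` are placeholders, not part of the proposal.
function prepareCrawlerRun(image, input, baseDir = os.tmpdir()) {
  const dataDir = fs.mkdtempSync(path.join(baseDir, 'crawler-'));
  fs.writeFileSync(path.join(dataDir, 'input.json'), JSON.stringify(input));
  return {
    dataDir,
    // Mount dataDir at /data; --rm removes the container after it exits.
    dockerArgs: ['run', '--rm', '-v', `${dataDir}:/data`, image],
    // Files the crawler is expected to write.
    outputs: {
      warc: path.join(dataDir, 'output.warc'),
      json: path.join(dataDir, 'output.json'),
    },
  };
}

const run = prepareCrawlerRun('example/some-crawler', {
  timestamp: '2017-01-01T00:00:00Z',
});
console.log('docker ' + run.dockerArgs.join(' '));
// To actually start the crawler:
// require('child_process').execFileSync('docker', run.dockerArgs);
```

After the container exits, the framework would read `run.outputs.warc` and `run.outputs.json` from the same directory.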

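Improve search accuracy for the crowdsourced database

Problem description: https://www.facebook.com/groups/1847232902175197/permalink/1884171838481303/

We need to look into problems with the scoring mechanism.
We may need to implement #1 first so that we can tune the search relevance formula more systematically.

Requirements:

If the database contains the item, the search should find exactly that item.
If the database does not contain it, and we are not confident that what we found is related, we would rather report that nothing was found.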

DB changes needed for moving out from Airtable

This is an aggregated issue for:

cofacts/rumors-db#2
cofacts/rumors-db#3
cofacts/rumors-db#4

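[line] Purchase a premium ID, apply for a GitHub organization and a g0v domain name

We need to discuss with everyone which ID to use.

Besides the LINE ID, the English name will also be used for:

Domain name
LINE ID
GitHub organization name

How 「真的假的」 differs from other rumor-debunking websites:

It does not generate content; it is a curator / lookup service / portal
Crowd collaboration
In the future, besides rumor verification, we might also let users report "controversial statements" and attach "links to rebuttals". But we should probably first focus on getting rumor-verification search right and build a reputation (a search engine for rumor rebuttals?) before moving in that direction.

Names we have come up with:

Rumor-themed
RumorsHasIt
SendMeRumors
rumor search / rumearch
findrumors
「真的假的」-themed
Realllly
for real?
usodaro
majide
honnto

Other websites (content-generating, media-type):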
http://www.snopes.com/
https://en.wikipedia.org/wiki/TruthOrFiction.com

firstCursor and lastCursor throw an error when the returned list is empty

CreateArticle should require senders to provide reasons

Implements the API required to fulfill "Controversial point" functionality

Implementation detail: https://hackmd.io/s/SyivqlIrf#%E8%AC%A0%E8%A8%80%E5%9B%9E%E4%B8%8D%E5%AE%8C

TODO:

CreateArticle should contain mandatory input field "reason"
CreateReplyRequest should contain optional input field "reason"
Allow editors to vote on the reply requests' reason

Allow article lists to filter by reply request counts

As discussed on 1122 and 1025, only articles that have 2 or more reply requests are worth replying to.

Since the first user who sends in the link cannot get the response, an article that has had only one reply request for a long time means its reply will never be used by the LINE bot.

Editors should have the option to list only the articles that have more reply requests.
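One way to express such a filter is an Elasticsearch range clause. The sketch below only builds the query body; the field name `replyRequestCount` is an assumption about the index mapping, not something confirmed by this issue.

```javascript
// Build a query body that keeps only articles with at least `min` reply requests.
// `replyRequestCount` is an assumed numeric field on each article document.
function buildReplyRequestCountFilter(min) {
  return {
    query: {
      bool: {
        filter: [{ range: { replyRequestCount: { gte: min } } }],
      },
    },
  };
}

console.log(JSON.stringify(buildReplyRequestCountFilter(2)));
```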

Automated script for updating elasticsearch from Airtable

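Data needs to be synced from Airtable into Elasticsearch on a schedule.

Beyond the cron job script, the important part is detecting similar articles automatically. Alternatively, we could conservatively treat any rumors that differ by more than a little as "different" (but then, given the current search scoring mechanism, the best article would not be found, Orz).

Organize the list of crawled websites and record it in the database

[ ] Move the crawled CSV results from the other database (https://rumor-search.g0v.ronny.tw) into rumors-db.
[ ] Organize the crawler code and the website list
[ ] Run the crawl on a schedule
[ ] Notify the administrators when a crawl fails (e.g. via Rollbar)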

Enhance "NOT ARTICLE" functionality

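Rename it to "Out of fact-checking scope" (「不在查證範圍」)
Add a template "This is a promotional campaign; it runs until ..." that prompts editors to fill in the campaign period, though editors may leave it blank.
Add a template "The message contains only dead links"
Add a "fact-checking scope" link to the editing UI (for editors); when a reply is "out of fact-checking scope", give users a link explaining what the fact-checking scope is (for users)

How to handle 0-day rumors that have no debunking article yet

The Mid-Autumn Festival message "the two holes in a Pacific saury are nematodes" and the recent message "Just received word: Mrs. Trump's limousine was burned by people in front of Trump Tower in New York" both had no debunking article while they were being forwarded wildly.

News reports debunking "the two holes in a Pacific saury are nematodes" can now be found, but baseless rumors like the latter may well never be debunked by anyone.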

Enhance duplicate article check

When reviewing #53, @darkbtf mentioned that we can use the hashed article as the article ID when indexing the article in CreateArticle.

This would greatly simplify the duplication check in #53: we can go ahead and index, and fetch only when the DB insertion fails. This saves one round trip (RTT) between the server and the database for normal article insertions.

TODO:

Use the hashed article content as _id when indexing the article in CreateArticle
Simplify the duplication check in CreateArticle
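A minimal sketch of the "index first, fetch only on conflict" idea follows. The choice of SHA-256 is an assumption (the issue does not name a hash), and `store` is a hypothetical wrapper around the database whose `create` rejects with `statusCode` 409 when the ID already exists.

```javascript
const crypto = require('crypto');

// Derive a deterministic document ID from the article text.
function getArticleId(text) {
  return crypto.createHash('sha256').update(text, 'utf8').digest('hex');
}

// `store.create(id, doc)` and `store.get(id)` are hypothetical; `create` is
// assumed to reject with { statusCode: 409 } when the ID is already taken.
async function createArticle(store, text) {
  const id = getArticleId(text);
  try {
    await store.create(id, { text });
    return { id, created: true };
  } catch (err) {
    if (err.statusCode !== 409) throw err;
    // Duplicate: the extra fetch happens only on this slow path.
    return { id, created: false, article: await store.get(id) };
  }
}
```

Normal insertions then cost a single request, and identical text always maps to the same ID.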

Add mutation API for editors to change displayName

UpdateUser(name: String) -> User

README update about yarn environment

As the meeting notes point out, we need not assume that developers lack a Node.js environment.

If developers already have Node.js installed, the current install script can be more straightforward. At the very least, yarn install can be run before docker-compose up, giving a smoother first-time development experience.

Revisit similarity measure on similar documents

Given this message:
https://cofacts.g0v.tw/article/s7m91ju27j4w

We cannot find the related articles:
https://cofacts.g0v.tw/article/1lqxc09h1vbqw
https://cofacts.g0v.tw/article/3ubqbfcz0u49h

even though all of them contain identical URLs.

List all replies for any editors

From: 2017-12-30 meeting note

A reply list for a specific editor can help users determine whether the editor is credible, and also helps other editors double-check / proofread that editor's work.

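Rumor search should not return a rumor that has no answer

In that case, try matching against answers instead; an existing answer may match the rumor but simply not be linked to it.
If there is none, continue down the chain (crawled DB, Google).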

Show related paragraphs in search result, instead of the first paragraphs

Scenario

LINE users seem unhappy with the current "similarity" results and tend to create new articles all the time. Showing the exact sentences that matched may help them choose the identical article.
Snippets / highlights can help editors find interesting articles in the "related replies / articles" section.

Proposed solutions

The API server should return the matched paragraphs in each search result. The LINE bot and the website should display search results in a manner similar to the snippets in Google Search.

This is achievable via Elasticsearch's highlighting feature.

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-highlighting.html
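A search body with a `highlight` section might be built as below. This is a sketch under the assumption that the article body lives in a field named `text`; the fragment settings are illustrative values, and `pickSnippet` is a hypothetical helper for the clients.

```javascript
// Build a search body that asks Elasticsearch to return highlighted fragments.
// The field name `text` is an assumption about the index mapping.
function buildHighlightSearch(queryText) {
  return {
    query: { match: { text: queryText } },
    highlight: {
      fields: { text: { number_of_fragments: 1, fragment_size: 100 } },
    },
  };
}

// Prefer the highlighted fragment; fall back to the start of the text.
function pickSnippet(hit) {
  const fragments = hit.highlight && hit.highlight.text;
  return fragments && fragments.length > 0
    ? fragments[0]
    : hit._source.text.slice(0, 100);
}
```

By default Elasticsearch wraps matched terms in `<em>` tags, so clients that cannot render HTML (such as the LINE bot) would need to strip or replace them.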

Some editor's names are empty

The editors of these replies do not have names.

https://cofacts.g0v.tw/article/5482632593803-rumor
https://cofacts.g0v.tw/article/5503261348541-rumor

Since they were editors at the editors' meetup, they were using the latest production code but still encountered this problem. This should be investigated.

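Validate the search system using the existing database

For searching the crowdsourced database, a good search system should have the following properties:

1. When searching for a rumor that already has an answer, it should find that answer.
2. When searching for a rumor that has no answer yet but whose full text is stored in the database, it should return that rumor and tell the user that others have reported it but nobody has answered.
3. When searching for a rumor whose full text is not in the system, it should not force a result unless it is quite confident.

Validation works as follows:

Properties 1 and 2 can be validated by writing a program that searches for every rumor in the database, one by one, and checks whether the selected result is that same rumor.
Property 3 can be checked by removing an article from the database and then searching for it. Finding nothing is the expected outcome; if something is found, a human must judge whether it is actually related.

When tuning the recommendation formula, automating the checks above would let us run them after every code change and be more confident about the formula's effect.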
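The check for the first two properties could be automated along these lines. This is a sketch only: `search` is a hypothetical async function that takes rumor text and resolves to a ranked array of rumor IDs, and the `{ id, text }` shape of each rumor is assumed.

```javascript
// Search for every stored rumor and check that the top hit is the rumor itself.
// `search(text)` is a hypothetical async function returning ranked rumor IDs.
async function evaluateSelfRetrieval(rumors, search) {
  const misses = [];
  for (const rumor of rumors) {
    const [topId] = await search(rumor.text);
    if (topId !== rumor.id) misses.push({ id: rumor.id, got: topId });
  }
  return {
    total: rumors.length,
    hits: rumors.length - misses.length,
    misses, // inspect these after each change to the relevance formula
  };
}
```

The third property still needs a human in the loop, but the same loop could remove one article at a time and collect whatever the search returns for manual review.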

Add state of own reply for reply user, for rumors-site ReplyFeedback.

Add new article type "Sarcasm"

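The plan is for editors, when replying, to mark an article as one of three types: "contains misinformation", "contains no misinformation", or "not a forwarded message". When the LINE bot responds, if an article has replies from multiple editors, it will show something like: "This message was marked as 'contains misinformation' by 3 people and as 'contains no misinformation' by 1 person."

Some rumors are actually satire / sarcasm, or jokes. They may well "contain misinformation", but the aim is not to make people believe them; the falsehood is the punchline. Marking them as "contains misinformation" feels a bit odd (after all, no "debunking" article would tactlessly spoil the joke).

Perhaps we should add a "sarcasm" type to mark this kind of message?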

Ask user for reason why they think a reply is not useful

In the schema, articlereplyfeedback already has a reserved field, comment, for this purpose.

Please add a comment argument to the CreateOrUpdateArticleReplyFeedback mutation so that we can record why a user thinks an article-reply is not useful.

Reference: https://hackmd.io/s/B1bb-hXhz "When a user indicates that a reply is not useful, should they leave a reason?"

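Edit and delete replies

A reply's author can:

Edit their own reply (this creates a new replyVersion)

A connection's author can:

Delete a connection they created (implemented as a delete flag)

We originally discussed whether a reply's author could delete a reply connection created by someone else.
However, since replies are CC0, a reply's author should not be allowed to delete other people's connections.

Support new filter / sort options in ListArticle or Search

Let users apply the following filters:

Articles I marked as "reply later" ( #34 )
All articles that nobody has marked as "reply later" ( #34 )
Article tags ( #32 )
Articles I have replied to
Articles I have sent a replyRequest for ("I want to know") ( Related: cofacts/rumors-site#13 )
Articles whose existing replies everyone considers not useful / sort by uselessness (ascending by "positive" + "negative")
Sort by the time each article was most recently reported
Replies include "contains true information" or "contains misinformation" or "not an article"
Replies do not include "contains true information" or "contains misinformation" or "not an article"

These filters should be combinable (all conditions ANDed together).

Add rumors through the API

The current mechanism for adding rumors is:

Users chat with the chatbot on LINE, and the messages go into Airtable
A program writes the data in Airtable into the database

If an API could add data to the database directly, the Airtable step could be dropped.
On the other hand, the rumor-editing website would then need complete functionality (listing rumors, viewing rumors and answers, editing and submitting, and so on) in order to collect data properly.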

Needs to enlarge body size limit of koa-bodyparser

View details in Rollbar: https://rollbar.com/mrorz/rumors-api/items/6/


Error: request entity too large
  File "/srv/www/node_modules/raw-body/index.js", line 196, in readStream
        return done(createError(413, 'request entity too large', 'entity.too.large', {
  File "/srv/www/node_modules/raw-body/index.js", line 110, in executor
        readStream(stream, encoding, length, limit, function onRead (err, buf) {
  File "/srv/www/node_modules/raw-body/index.js", line 109, in getRawBody
      return new Promise(function executor (resolve, reject) {
  File "/srv/www/node_modules/co-body/lib/form.js", line 35, in Function.module.exports [as form]
      return raw(inflate(req), opts)
  File "/srv/www/node_modules/koa-bodyparser/index.js", line 89, in parseBody
          return yield parse.form(ctx, formOpts);
  File "native", line unknown, in next
  File "/srv/www/node_modules/co/index.js", line 65, in onFulfilled
            ret = gen.next(res);
  File "/srv/www/node_modules/co/index.js", line 54, in <unknown>
        onFulfilled();
  File "/srv/www/node_modules/co/index.js", line 50, in Object.co
      return new Promise(function(resolve, reject) {
  File "/srv/www/node_modules/co/index.js", line 118, in Object.toPromise
      if (isGeneratorFunction(obj) || isGenerator(obj)) return co.call(this, obj);
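koa-bodyparser accepts per-type size limits as options. Since the trace goes through parse.form, the form limit appears to be the one being hit. The fragment below is a sketch of the options object only; the '1mb' values are placeholders, not a recommendation from this issue.

```javascript
// Options to pass to koa-bodyparser, e.g. app.use(bodyParser(bodyParserOptions)).
// The '1mb' values are placeholders; pick limits that fit the longest
// messages the API is expected to accept.
const bodyParserOptions = {
  formLimit: '1mb', // urlencoded form bodies (the code path in the trace above)
  jsonLimit: '1mb', // JSON bodies
};

console.log(bodyParserOptions);
```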

Implement mutation SetRumor and SetAnswer

Implement these:
https://github.com/MrOrz/rumors-api/tree/master/src/graphql/mutations

The expected behavior is written in the comments. They are not hard requirements and can be changed.

Please also add test cases for SetRumor and SetAnswer. Please refer to https://github.com/MrOrz/rumors-api/tree/master/src/graphql/queries on how to structure and write test cases in this project, and see README for running the test.

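Call for English names for 「真的假的」

「真的假的」 is a chatbot system for quickly verifying rumors. Through crowd collaboration, it checks shared messages on social media whose truth is uncertain ("real or fake?", 真的假的). 「真的假的」 now needs an English name for registering the GitHub organization, the LINE chatbot account ID, and the domain.

Confucius said: "If names are not right, words will not ring true; if words do not ring true, nothing can be accomplished."

「真的假的」 needs a good name that is concise, catchy, and memorable, so that the public can find us!
We are hereby collecting English names for 「真的假的」 from everyone.

If you see a name you like among those below, click the reaction button at its top-right corner and choose 👍 to second it.

If inspiration strikes, please leave the name you came up with on this issue, using the format below.

Each comment may contain only one name; please record each of your name ideas separately!

Format

English name: a powerful, resonant name
Chinese name (optional): if you think 「真的假的」 is no match for your timeless English name, leave the perfectly matched name you have in mind
Explanation: the meaning behind the name you chose

Example

English: cofacts
Explanation: cofacts = collaborative + facts, i.e. facts produced through crowd collaboration

「真的假的」 thanks you for your contribution!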

List submitted articles of a LINE user

As discussed on 2018-02-07.

It can be used for:

When a user is banned, other users can see what kind of articles get someone banned
Providing a list of articles called "I submitted before"

[Gamification] Add "level" field for users

The level is determined by the normal article reply count.
The number of article replies required to reach level n follows the Fibonacci sequence, as listed below.

0 article replies -> lv 0
1 article reply -> lv 1
2 article replies -> lv 2
3 article replies -> lv 3
5 article replies -> lv 4
8 article replies -> lv 5
13 article replies -> lv 6
... etc

The fields include:

level: 0~n, current level
levelProgress.total: The number of additional normal article replies required to go from the current level to the next. For levels 0, 1, and 2, it is 1; for level 3, it is 2; for level 4, it is 3; and so on.
levelProgress.current: The number of normal article replies collected so far within the current level. Ranges from 0 to levelProgress.total.
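The level rules above can be computed as in the following sketch. The function name `getLevel` is hypothetical; the thresholds (0, 1, 2, 3, 5, 8, 13, ...) are taken from the table in this issue.

```javascript
// Compute level and progress from the normal article reply count.
// Thresholds per level: 0, 1, 2, 3, 5, 8, 13, ... (from the table above).
function getLevel(replyCount) {
  let level = 0;
  let current = 0; // replies required to reach `level`
  let next = 1; // replies required to reach `level + 1`
  while (replyCount >= next) {
    // Level 2 needs 2 replies; after that each threshold is the sum of the previous two.
    const afterNext = level === 0 ? 2 : current + next;
    current = next;
    next = afterNext;
    level += 1;
  }
  return {
    level,
    levelProgress: { total: next - current, current: replyCount - current },
  };
}

console.log(getLevel(4));
// { level: 3, levelProgress: { total: 2, current: 1 } }
```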

Reference: https://hackmd.io/s/B1bb-hXhz#%E7%B7%A8%E8%BC%AF%E7%A4%BE%E7%BE%A4%E7%9A%84%E7%87%9F%E9%81%8B%E6%96%B9%E5%BC%8F

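Cannot log in to api.rumors.hacktabl.org

http://api.rumors.hacktabl.org/login/facebook?redirect=/

After being redirected back, the page shows:

`appId` and `redirect` must be set before. Did you forget to go to /login/*?

The message above comes from https://github.com/MrOrz/rumors-api/blob/master/src/auth.js#L168

console.log shows that ctx.session === {}, but in theory the session should already have been set around https://github.com/MrOrz/rumors-api/blob/master/src/auth.js#L141.

To yes, or not to yes

I haven't used this in a long time and did it wrong; I'm trying to figure out how to delete it.

English name: To yes, or not to yes

Chinese name (optional): 「真的假的」

Explanation:
My English is not good. The organizer asked for "a concise, catchy, memorable name", and the phrase "To be, or not to be" (生存還是毀滅) came to mind.

I thought of two words to substitute: one is "truth" (which doesn't seem to rhyme) and the other is "yes".

Those with better English, please carry on; this is as far as my ability goes for now.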

Support new reply type "opinionated"

Adds new type "OPINIONATED" in src/graphql/models/ReplyTypeEnum

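Ask users "Did this message answer your question?"

"Yes": do nothing
"No, I want to report a rumor": when the result is completely off-topic, let the user create a new rumor (if #2 is not done yet, write it into Airtable)

[line] Switch plans from Developer Trial

The Developer Trial has hit its user limit
https://www.facebook.com/groups/1847232902175197/permalink/1884362701795550/

We need to switch to another free plan

Index the title and content of URLs in page

Problems we want to solve:

During retrieval, basically nothing can be done with links; even if the article behind a link is highly relevant, it cannot be found.
When the LINE bot shows the articles it found for the user to choose from, links are displayed in a very unfriendly way
Editors have to click through to read them, which is a hassle, and related articles work poorly for links

If we could record each link's

Title
Canonical URL (after redirect)
Content

and include them in full-text search, things would be much more convenient.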

Record the users who create and edit answers

"Editing an answer"
"Creating an answer"
should both require login.

[gamification] Add number of articles the user has replied in "user" object type

Count the number of articles to which the user has added an article reply.

This should be doable via a nested query, since the author of an article-reply is articleReplies.userId in the articles index.

This will be used to render a personal progress bar against the total number of articles.
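The nested query mentioned above might look like the sketch below. It assumes `articleReplies` is mapped as a nested field in the articles index; the helper name is hypothetical, and the body would be sent to a count or search request.

```javascript
// Query body matching articles with at least one article-reply by `userId`.
// Assumes `articleReplies` is a nested field in the articles index.
function buildRepliedArticleCountQuery(userId) {
  return {
    query: {
      nested: {
        path: 'articleReplies',
        query: { term: { 'articleReplies.userId': userId } },
      },
    },
  };
}

console.log(JSON.stringify(buildRepliedArticleCountQuery('some-user-id')));
```

Because the query matches articles rather than replies, an article the user replied to several times is still counted once.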

API key & CORS domains

In order to prevent ~~illegal immigrants~~ DDoS attacks from browsers, the API server should only be open to clients that have API keys.

We need ~~a wall~~ a mechanism that allows developers to apply for a key. If the developer's client is web-based, they should specify its domains so that we can open up CORS according to the API key.

For clients that send requests through backend HTTP libraries, API keys are still required, but there is no need to specify domains.
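The per-key CORS decision could be sketched as follows. The in-memory registry, the key names, and the function `checkRequest` are all hypothetical; a real implementation would load keys from storage and plug into the server's middleware.

```javascript
// Hypothetical registry: API key -> allowed origins (empty for backend clients).
const apiKeys = new Map([
  ['web-key', { origins: ['https://example.com'] }],
  ['backend-key', { origins: [] }],
]);

// Decide whether to serve a request and which value, if any, to send in the
// Access-Control-Allow-Origin header.
function checkRequest(apiKey, origin) {
  const entry = apiKeys.get(apiKey);
  if (!entry) return { allowed: false, corsOrigin: null };
  // Backend HTTP libraries usually send no Origin header.
  if (!origin) return { allowed: true, corsOrigin: null };
  const ok = entry.origins.includes(origin);
  return { allowed: ok, corsOrigin: ok ? origin : null };
}

console.log(checkRequest('web-key', 'https://example.com'));
// { allowed: true, corsOrigin: 'https://example.com' }
```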

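Handle the Google query quota exceeded problem

After the quota is exceeded, maybe ask the user to search Google themselves?

Article "shopping cart": mark as "I'll reply later"

In the pre-meeting discussion on 3/18, it was mentioned that there are currently too many unanswered rumors, which leaves editors unmotivated.

If editors could mark an article as "I want to answer", and the article list gained two more filters, "I want to answer" and "I have answered", editors could build their own TODO lists. They would no longer lose track of rumors they saw earlier and wanted to answer, and a clearer TODO list would increase their motivation to answer.

In addition, at events such as editor meetups, a count of "I want to answer" marks can help editors avoid duplicated work.

RFC:

The plan is to store this directly in the articles index.
Add a pendingRepliers[] field that records {userId, createdAt}.
The article list can show small text such as "someone indicated 10 minutes ago that they want to answer", telling other editors how long ago someone said they would write an answer.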
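A minimal sketch of maintaining pendingRepliers[] and deriving the "N minutes ago" hint is shown below. Only the `{userId, createdAt}` shape comes from the RFC; the helper names and the choice of ISO 8601 strings for `createdAt` are assumptions.

```javascript
// Add or refresh a user's "I want to answer" mark; returns a new array.
function addPendingReplier(pendingRepliers, userId, now = new Date()) {
  const others = pendingRepliers.filter(r => r.userId !== userId);
  return [...others, { userId, createdAt: now.toISOString() }];
}

// Minutes since the most recent mark, or null when nobody has marked the article.
function minutesSinceLatestMark(pendingRepliers, now = new Date()) {
  if (pendingRepliers.length === 0) return null;
  const latest = Math.max(...pendingRepliers.map(r => Date.parse(r.createdAt)));
  return Math.floor((now.getTime() - latest) / 60000);
}

const marked = addPendingReplier([], 'editor-1', new Date('2017-03-18T00:00:00Z'));
console.log(minutesSinceLatestMark(marked, new Date('2017-03-18T00:10:00Z')));
```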

Internal server error when logging in from search results

Steps to reproduce:

Go to the article list page. If already logged in, log out first.
Type Chinese characters into the search box and perform a search.
Log in using any method.
An internal server error appears.

Root cause: URLs are not encoded, but the redirect location requires them to be.
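One possible fix is to percent-encode the redirect target before it goes into the Location header. The sketch below uses the built-in `encodeURI`; the path and the query parameter name `q` are made up for illustration.

```javascript
// encodeURI percent-encodes non-ASCII characters but keeps URL delimiters
// such as / ? = & intact, so the result is safe to put in a Location header.
function toRedirectLocation(url) {
  return encodeURI(url);
}

console.log(toRedirectLocation('/articles?q=謠言'));
// /articles?q=%E8%AC%A0%E8%A8%80
```

Note that `encodeURI` also encodes `%`, so a URL that is already percent-encoded would be double-encoded; the encoding should be applied exactly once.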

CreateArticle should check for duplicates before submission

We should deal with cofacts/rumors-line-bot#41 in API.

In CreateArticle, we first check whether the database contains an article that matches 100%.

If so, we return the existing article and submit a ReplyRequest for the user.

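[db] The Airtable script should support many-to-many relationships between Answers and Rumors

Quite a few Rumors already have two answers.
In Airtable they appear as two duplicated rows that differ only in the answer column.
In that case we should create just one rumor and two answers.

We may also need to consider the case where one answer has two rumors.

Label articles with keyword categories and allow search by label

At the 3/18 meetup, both the Wikipedia community and 阿孝老師 suggested letting people from different knowledge domains divide up the work of replying to articles. Implementing user-generated labels seems like a good approach.

RFC: implement labels in rumors-api similar to those of hackpad / niconico, such that:

Users can freely attach labels to an article
While a label is being typed, existing labels are used for autocomplete
Articles with a specific label can be listed

Since there will be hardly any labels at first, I think we can discuss later whether labels should be merged once there are too many. The "issues" in 立委投票指南 (the legislator voting guide) seem to be implemented in a similar way.

The implementation adds a field on articles that stores an array of text, without creating a separate index.

Improve search accuracy for crawled web articles

Currently, when we search the database of crawled web articles with a rumor, we often find completely irrelevant articles.

Example: https://g0v-tw.slack.com/files/wildjcrt/F3FBL59UY/screenshot_20161217-130918.png

We need to:

Use an explain query to see which keywords are actually causing the problem.
Collect examples as test data for automated tests (similar to the approach in #1)
Adjust the condition that decides relevance; when unsure, prefer returning no result at all.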

SyntaxError: Unexpected token F in JSON at position 0

View details in Rollbar: https://rollbar.com/mrorz/rumors-api/items/5/


SyntaxError: Unexpected token F in JSON at position 0
  File "native", line unknown, in Object.parse
  File "/srv/www/node_modules/node-fetch/lib/body.js", line 48, in <unknown>
    		return JSON.parse(buffer.toString());
  File "internal/process/next_tick.js", line 129, in process._tickDomainCallback

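Present multiple search results horizontally with a template message

If we cannot be sure the returned result is correct, it may be better to return several results for the user to choose from.

Use a carousel template to let the user pick which one is relevant, then give them more information (and record which one was relevant).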

Convert queries and inserted articles to Traditional Chinese

In CreateArticle, store Traditional Chinese version of the text into an indexing field.
In ListArticles and ListReplies, convert filter terms to Traditional Chinese and search using the Traditional Chinese indexing field.

Tool: https://github.com/BYVoid/OpenCC