Comments (5)
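Execution efficiency is related to cofacts/rumors-db#1. We need to think about how to detect duplicate rumors / answers more efficiently.
(That is, how to improve the results of #13.)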
from rumors-api.
Too open xD
Should we tackle this step by step and start with the update/sync problem?
How is the data updated right now?
For airtable, you can refer to: https://github.com/kytu800/bigplatform.tw/blob/master/cron.js
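The current way of updating data lives in another repository: https://github.com/MrOrz/rumors-db
You manually download the CSV and then run npm run seed.
That's it.
Actually, until a month ago, the data was pulled directly from airtable: https://github.com/MrOrz/rumors-db/blob/36b3e1d4b4d2f3feabece91c7a9ee87264447fe4/airtable/airtableToElasticSearch.js
But other developers also need to be able to populate seed data on their own machines, and passing an API key around for the airtable API is a hassle, so we switched to something everyone can do: download the CSV, then load the CSV file into the database.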
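However, the current problems are:
- The seed script in https://github.com/MrOrz/rumors-db actually "merges similar rumors and answers". "Similar" is currently decided by an off-the-shelf similarity algorithm, but because it has to compare every pair, it is very slow (i.e. cofacts/rumors-db#1).
- When the seed script is unsure whether two pieces of text are similar, it prompts the user to decide. To turn it into a cron job, the program would have to judge by itself. As it stands, sometimes two articles are essentially identical yet the similarity is only 0.4, and sometimes two are completely different yet the similarity is 0.6, so it is hard to draw a clean line.
- How well the seed script detects duplicates affects the retrieval recall rate. If the seed script mistakenly judges two similar docs to be "different" and writes both into the database, a search may end up finding neither of them. This is mainly because the current "best match" rule compares the scores of the first and second results: a "best match" counts as found only when the first-place score is N times the second-place score (currently N = 1.6). For related updates see: https://www.facebook.com/groups/1847232902175197/permalink/1891811947717292/ . We would like everyone to try and see whether there is a more reasonable way to compute "best match" that also gives a better recall rate.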
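To make the "best match" rule above concrete, here is a minimal sketch of how I read it. It is not the actual rumors-api code: the function name `findBestMatch`, the `_score` field, the `>=` comparison, and the handling of a lone hit are all assumptions made for illustration, and the scores are made up.

```javascript
// Hypothetical helper, not taken from rumors-api.
// `hits` is assumed to be search results sorted by descending `_score`.
function findBestMatch(hits, ratio = 1.6) {
  if (hits.length === 0) return null;
  // Assumption: a single hit has no runner-up, so it wins by default.
  if (hits.length === 1) return hits[0];
  const [first, second] = hits;
  // Assumption: "N times" is read as "at least N times".
  return first._score >= ratio * second._score ? first : null;
}

// Made-up scores. A near-duplicate that slipped into the database
// scores almost as high as the original, so neither clears the ratio.
const withDuplicates = [
  { id: "rumor-a", _score: 5.0 },
  { id: "rumor-a-copy", _score: 4.8 },
  { id: "unrelated", _score: 1.2 },
];

// Same query after the duplicate has been merged away.
const deduplicated = [
  { id: "rumor-a", _score: 5.0 },
  { id: "unrelated", _score: 1.2 },
];

console.log(findBestMatch(withDuplicates)); // null: 5.0 < 1.6 * 4.8
console.log(findBestMatch(deduplicated)); // the rumor-a hit: 5.0 >= 1.6 * 1.2
```

With the duplicate present the search reports no best match at all, which is the recall problem described in the last bullet. Any alternative rule would need to be checked against this case.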
It is now confirmed that we are moving off airtable and that data will go directly into elastic search:
https://www.facebook.com/groups/1847232902175197/permalink/1896817880550032/
The editing interface is also being written.
closing this