
wikimit's Introduction

Welcome to my portfolio

This is not the portfolio itself; this is just the README.

Check out the site here: https://davidtorosyan.com

Building and Running

Check out the scripts directory for instructions.


wikimit's Issues

Design doc

Design doc for wikimit

Intro

Wikimit is an in-progress Wikipedia-to-git converter. This document details the approach.

Goals

  1. Input: a Wikipedia article; output: a GitHub repo organized under wikimit-hub
  2. Anonymous access (no account needed)
  3. Requesting a conversion should idempotently create or update a repo
  4. Low costs, via as-needed hosting and free storage

Architecture

Overview

[Architecture diagram: Wikimit Design]

Sequence (a):

  1. The user chooses a Wikipedia page
  2. The user submits the URL on the wikimit site
  3. The site sends an AJAX request
  4. The request handler adds the URL to the queue and returns the expected GitHub URL
  5. The site polls GitHub and shows the link to the user when ready

Sequence (b):

  1. The sync agent pulls a URL off the queue
  2. The sync agent queries Wikipedia for the revision history
  3. The sync agent pushes commits to GitHub

wikimit.org

The wikimit site is a static webpage, hosted on GitHub Pages with a custom domain (wikimit.org). The page has a textbox and a submit button.

Clicking the submit button sends an AJAX request to the backend handler, which responds with a GitHub URL. The client-side JS then polls that URL and notifies the user at two points:

  1. When the repo is created
  2. When the repo is up to date with the latest revision

When (2) is reached, polling stops.
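
The real polling would be client-side JS; here is a rough sketch of the logic in Python, assuming GitHub's public REST API (GET /repos/{owner}/{repo} returns 404 until the repo exists):

import time

import requests

POLL_INTERVAL_SECONDS = 10

def wait_for_repo(owner: str, repo: str) -> None:
    """Poll GitHub until the repo exists (development 1 above)."""
    api_url = f"https://api.github.com/repos/{owner}/{repo}"
    while True:
        if requests.get(api_url).status_code == 200:
            print(f"Repo created: https://github.com/{owner}/{repo}")
            return
        time.sleep(POLL_INTERVAL_SECONDS)

Detecting (2) is less obvious; one option (an assumption, not settled in this doc) is comparing the latest commit's timestamp against Wikipedia's newest revision timestamp.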

Request handler

The request handler is implemented as an AWS Lambda function. It validates the incoming Wikipedia URL, converts it to a GitHub repo URL, and places the pair on the job queue. A sketch follows the list below.

The handler will also:

  1. Check that the URL isn't already on the queue before enqueueing it
  2. Fail if the queue is full
  3. Throttle by IP address
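
A minimal sketch of the handler, assuming a Python Lambda behind API Gateway's proxy integration; the queue and throttle helpers are hypothetical, since their hosting is TBD:

import json
from urllib.parse import unquote, urlparse

# Hypothetical helpers; queue hosting and the throttle store are TBD.
from queue_client import enqueue, is_enqueued, is_full
from throttle import too_many_requests

def handler(event, context):
    # Source IP as surfaced by API Gateway's proxy integration (assumption).
    source_ip = event["requestContext"]["identity"]["sourceIp"]
    if too_many_requests(source_ip):
        return {"statusCode": 429, "body": "Throttled"}

    url = json.loads(event["body"])["url"]
    parsed = urlparse(url)
    if parsed.hostname != "en.wikipedia.org" or not parsed.path.startswith("/wiki/"):
        return {"statusCode": 400, "body": "Not a Wikipedia article URL"}

    title = unquote(parsed.path[len("/wiki/"):])
    repo_url = f"https://github.com/wikimit-hub/{title}"

    if is_full():
        return {"statusCode": 503, "body": "Queue is full"}
    if not is_enqueued(url):
        enqueue(url, repo_url)
    return {"statusCode": 200, "body": json.dumps({"repo": repo_url})}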

Job queue

The job queue is a short-lived list of URLs that need to be processed.

TBD on the hosting.
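
Purely as an illustration of one candidate (an assumption, since hosting is TBD): an SQS FIFO queue with content-based deduplication drops identical messages sent within a five-minute window, which partially covers the handler's duplicate check:

import json

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/wikimit-jobs.fifo"  # hypothetical

def enqueue(article_url: str, repo_url: str) -> None:
    # With ContentBasedDeduplication enabled on the queue, identical bodies
    # within a five-minute window are deduplicated automatically.
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps({"article": article_url, "repo": repo_url}),
        MessageGroupId="wikimit",
    )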

Sync agent

The sync agent monitors the job queue and is only active when there's work to do. Jobs can run in parallel, but each worker must take a lock on the queue when modifying it.

To do a job, the agent first creates a GitHub repo if one doesn't already exist; otherwise, it clones the existing repo to its local filesystem. It then fetches some number of revisions from Wikipedia, commits them, and pushes to GitHub. Finally, it (a) removes the finished job from the queue and (b) pushes a follow-up job to the end of the queue if needed (sketched after the list below).

Follow-up jobs are needed if:

  1. There are more revisions to process.
  2. The job failed. In this case, the follow-up is marked as a retry.
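
A sketch of one job; every helper here is hypothetical, since the queue and repo plumbing are TBD:

import subprocess

# All of these helpers are hypothetical.
from jobs import pop_job, push_followup
from repos import ensure_repo, clone_repo      # clone_repo returns a pathlib.Path
from wiki import fetch_revisions               # see the API sketch further down

BATCH_SIZE = 100
MAX_ATTEMPTS = 3

def run_one_job() -> None:
    job = pop_job()                            # takes the queue lock internally
    try:
        ensure_repo(job.repo_url)              # create the GitHub repo if missing
        workdir = clone_repo(job.repo_url)
        batch = fetch_revisions(job.article_url, start=job.offset, limit=BATCH_SIZE)
        for rev in batch.revisions:
            (workdir / "article.wiki").write_text(rev.content)
            subprocess.run(["git", "add", "article.wiki"], cwd=workdir, check=True)
            subprocess.run(
                ["git", "commit", "-m", rev.comment or rev.timestamp,
                 "--date", rev.timestamp],
                cwd=workdir, check=True)
        subprocess.run(["git", "push"], cwd=workdir, check=True)
        if batch.has_more:
            push_followup(job, offset=batch.next_offset)   # follow-up case 1
    except Exception:
        if job.attempts + 1 < MAX_ATTEMPTS:
            push_followup(job, retry=True)                 # follow-up case 2
        # otherwise the job is dropped, which should fire an alert (see Reliability)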

TBD on the hosting.

Considerations

Security

Only the request handler and sync agent have access to the job queue (TBD on details).

The request handler throttles by IP (TBD on details).

The sync agent has access to a GitHub account with limited permissions, such that it can only push to repos in the wikimit-hub organization.

Reliability

Because the sync agent does a small amount of work per job and creates follow-up jobs for the rest, pages with a huge number of revisions don't block other work from being processed.

Transient failures are handled by retry jobs; if a failure is persistent, the job is dropped.

Alerting on:

  1. Queue is full
  2. Dropped jobs

Legal

Wikipedia text is licensed under CC BY-SA, so it can be redistributed as long as the license terms (attribution and share-alike) are followed.

Wikipedia limits API access to two simultaneous requests per IP address.
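
To stay under that limit, all API calls can be gated behind a shared semaphore; a minimal sketch:

import threading

import requests

# At most two in-flight Wikipedia requests, per the limit above.
WIKI_SEMAPHORE = threading.BoundedSemaphore(2)

def wiki_get(params: dict) -> dict:
    with WIKI_SEMAPHORE:
        resp = requests.get("https://en.wikipedia.org/w/api.php", params=params)
        resp.raise_for_status()
        return resp.json()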

Performance

Initial testing shows that converting a page with 1000 revisions takes about 100 seconds, with roughly 90% of the time spent running git commit. This is pretty slow but not catastrophic.

A small article with 1000 revisions takes about 6 MB of disk space. This should be manageable.

Both speed and size need to be stress tested with larger articles. TBD.
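
For a rough sense of scale (assuming linear scaling, which is likely optimistic for git): an article with on the order of 50,000 revisions would take about 50 × 100 s ≈ 5,000 seconds (roughly 83 minutes) to convert and about 50 × 6 MB ≈ 300 MB of disk.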

Costs

Hosting solutions:

  1. GitHub Pages static site (free)
  2. Request handler ($?)
  3. Job queue ($?)
  4. Sync agent ($?)
  5. Wikipedia API use (free)
  6. GitHub account use (free, but has some limits)
  7. Domain name ($10/yr)

TBD on actual cost estimates.

Open questions

See all the TBD items above.

Proof of concept

I ran the proof of concept for https://en.wikipedia.org/wiki/Finch to get a sense of a) Wikipedia's APIs and b) performance.
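
The revision fetch looks roughly like this. This is a sketch against the MediaWiki query API: rvdir=newer pages from the oldest revision, rvstart accepts a timestamp to resume from an offset, and continuation happens via the returned rvcontinue token. Note that rvlimit caps at 50 per request when content is requested (higher with bot rights):

import requests

API = "https://en.wikipedia.org/w/api.php"

def iter_revisions(title: str):
    """Yield revisions oldest-first, paging via the API's continuation token."""
    params = {
        "action": "query",
        "format": "json",
        "formatversion": 2,
        "prop": "revisions",
        "titles": title,
        "rvdir": "newer",       # oldest first
        "rvlimit": 50,          # max per request when content is included
        "rvslots": "main",
        "rvprop": "ids|timestamp|user|comment|content",
    }
    while True:
        data = requests.get(API, params=params).json()
        for rev in data["query"]["pages"][0].get("revisions", []):
            yield rev
        if "continue" not in data:
            break
        params.update(data["continue"])  # carries rvcontinue forward

for rev in iter_revisions("Finch"):
    print(rev["timestamp"], rev.get("comment", ""))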

The generated repo (https://github.com/wikimit-hub/Finch) has 1090 commits. Here are the logs:

Querying wikipedia...
wiki took 0.47 seconds
Adding 100 commits...
git took 9.71 seconds
Querying wikipedia...
wiki took 0.77 seconds
Adding 100 commits...
git took 9.60 seconds
Querying wikipedia...
wiki took 1.22 seconds
Adding 100 commits...
git took 9.60 seconds
Querying wikipedia...
wiki took 0.76 seconds
Adding 100 commits...
git took 9.94 seconds
Querying wikipedia...
wiki took 0.74 seconds
Adding 100 commits...
git took 9.82 seconds
Querying wikipedia...
wiki took 0.68 seconds
Adding 100 commits...
git took 10.13 seconds
Querying wikipedia...
wiki took 0.82 seconds
Adding 100 commits...
git took 9.85 seconds
Querying wikipedia...
wiki took 0.76 seconds
Adding 100 commits...
git took 10.08 seconds
Querying wikipedia...
wiki took 0.77 seconds
Adding 100 commits...
git took 9.91 seconds
Querying wikipedia...
wiki took 0.83 seconds
Adding 100 commits...
git took 9.93 seconds
Querying wikipedia...
wiki took 0.63 seconds
Adding 89 commits...
git took 8.80 seconds
Done!
wiki took a total of 8.45 seconds
git took a total of 107.37 seconds

Conclusions:

  1. Wikipedia's API supports paging through revisions from the beginning, using a timestamp as the offset
  2. Git is our bottleneck, taking about 100 ms per commit. That's pretty slow; hopefully we can speed things up with more advanced git commands (see the sketch after this list).
  3. The resulting git blame is pretty good for looking back through history, but could be improved (see question 4).
  4. This took a long time for a relatively tiny article.
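
On conclusion 2: one candidate (an assumption, not benchmarked here) is git fast-import, which builds history from a single stream instead of spawning one git commit per revision. A minimal sketch, assuming an empty freshly-initialized repo and a precomputed unix timestamp per revision:

import subprocess

def fast_import(repo_dir: str, revisions) -> None:
    """Stream all revisions into git in one fast-import process."""
    proc = subprocess.Popen(["git", "fast-import"], cwd=repo_dir,
                            stdin=subprocess.PIPE)
    for rev in revisions:
        content = rev["content"].encode()
        message = (rev.get("comment") or rev["timestamp"]).encode()
        # NB: real usernames would need sanitizing for fast-import's committer format.
        proc.stdin.write(b"commit refs/heads/main\n")
        proc.stdin.write(b"committer %s <wikimit@example.com> %d +0000\n"
                         % (rev["user"].encode(), rev["unix_ts"]))
        proc.stdin.write(b"data %d\n%s\n" % (len(message), message))
        proc.stdin.write(b"M 100644 inline article.wiki\n")
        proc.stdin.write(b"data %d\n%s\n" % (len(content), content))
    proc.stdin.close()
    proc.wait()

If the branch already exists (the follow-up job case), the first streamed commit would also need a "from refs/heads/main^0" line.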

Open questions:

  1. I'm assuming English Wikipedia (en.wikipedia.org) for now. Supporting other languages should be fairly easy, but that needs testing.
  2. How are renames handled when looking through revision history?
  3. How should renames be handled for the repos?
  4. Should the wiki lines be split up further? Right now we're using newlines from the source, but nothing stops us from adding extra newlines to help git blame along. Ideally the splits would fall at sentence stops, though those may be hard to track reliably (a naive sketch appears at the end of this section).

Also, the resulting repo was 6.53 MB across 4,139 files.
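
On question 4, a naive version of the extra splitting might just break lines at sentence stops (hypothetical, and easily fooled by abbreviations like "e.g."):

import re

def split_sentences(wikitext: str) -> str:
    """Insert a newline after each sentence stop so git blame tracks sentences."""
    return re.sub(r"(?<=[.!?]) +", "\n", wikitext)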
