This is not the portfolio, this is just a README.
Check out the site here: https://davidtorosyan.com
Check out the scripts directory for instructions.
A Wikipedia-to-git converter.
License: MIT
Wikimit is an in-progress Wikipedia-to-git converter. This document details the approach.
Sequence (a):
Sequence (b):
The wikimit site is a static webpage, hosted on GitHub pages with a custom domain (wikimit.org). The page has a textbox and a submit button.
Clicking the submit button sends an AJAX request to the backend handler, which responds with a GitHub URL. The client-side JS then polls that URL and alerts the user of two developments:
When (2) is reached, polling stops.
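The polling loop can be sketched as follows (shown in Python for brevity; the real client would be browser JS). The status values and the injectable fetcher are assumptions for illustration, not the actual implementation:

```python
import time

def poll_repo(fetch_status, interval_s=5, max_attempts=60):
    """Poll a GitHub URL until the conversion is finished.

    fetch_status: callable returning a status string, e.g. one of
    "pending", "created", "done" (hypothetical states; the real ones
    are TBD in this design).
    Returns the list of status changes observed along the way.
    """
    seen = []
    last = None
    for _ in range(max_attempts):
        status = fetch_status()
        if status != last:
            seen.append(status)   # a "development" to alert the user about
            last = status
        if status == "done":      # development (2): stop polling
            return seen
        time.sleep(interval_s)
    return seen                   # gave up; caller decides what to show
```

Injecting `fetch_status` keeps the sketch testable without a network; in the browser this would be a `fetch()` against the GitHub URL on a timer.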
The request handler is implemented as an AWS Lambda. It validates the incoming Wikipedia URL, converts it to a GitHub repo URL, and places the pair on the job queue.
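A minimal sketch of the validate-and-convert step, assuming `/wiki/<title>` article URLs and the wikimit-hub GitHub org (the exact parsing rules here are illustrative, not the handler's actual code):

```python
from urllib.parse import urlparse, unquote

def wiki_to_repo(url: str) -> str:
    """Convert a Wikipedia article URL to a GitHub repo URL.

    Raises ValueError for anything that isn't a /wiki/<title> page.
    """
    parts = urlparse(url)
    if parts.scheme not in ("http", "https"):
        raise ValueError("not an http(s) URL")
    if not parts.netloc.endswith("wikipedia.org"):
        raise ValueError("not a Wikipedia URL")
    prefix = "/wiki/"
    if not parts.path.startswith(prefix):
        raise ValueError("not an article URL")
    title = unquote(parts.path[len(prefix):])
    if not title:
        raise ValueError("empty article title")
    return f"https://github.com/wikimit-hub/{title}"
```

A real handler would also need to normalize titles that aren't valid GitHub repo names.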
The handler will also:
The job queue is a short-lived list of URLs that need to be processed.
TBD on the hosting.
The sync agent monitors the job queue and is only active when there's work to be done. Jobs can be done in parallel, but must take a lock on the queue when modifying it.
To do a job, the agent first creates a GitHub repo if one doesn't already exist; otherwise, it clones the existing repo to its local filesystem. It then fetches some number of revisions from Wikipedia, commits them, and pushes to GitHub. Finally, it (a) removes the completed job from the queue and (b) pushes a follow-up job to the end of the queue (if needed).
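The batching behavior can be sketched as a pure function: each job carries a cursor into the page's revision history, the agent processes one batch, and a follow-up job is emitted only when revisions remain. Field names and the batch size are hypothetical (though 100 matches the proof-of-concept batches):

```python
from dataclasses import dataclass
from typing import Optional

BATCH_SIZE = 100  # revisions converted per job

@dataclass
class Job:
    wiki_url: str
    repo_url: str
    start_rev: int = 0  # index of the next revision to convert

def step(job: Job, total_revisions: int) -> Optional[Job]:
    """Process one batch and return the follow-up job, or None if done.

    The actual clone/fetch/commit/push work is elided; this captures
    only the "small batch plus follow-up job" queueing behavior.
    """
    end = min(job.start_rev + BATCH_SIZE, total_revisions)
    # ... fetch revisions [job.start_rev, end) from Wikipedia,
    # commit each one, and push to job.repo_url ...
    if end < total_revisions:
        return Job(job.wiki_url, job.repo_url, start_rev=end)
    return None
```

Because each invocation is bounded, a huge page never monopolizes an agent; its remaining work re-enters the queue behind other jobs.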
Follow-up jobs are needed if:
TBD on the hosting.
Only the request handler and sync agent have access to the job queue (TBD on details).
The request handler throttles by IP (TBD on details).
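One common shape for per-IP throttling is a sliding-window counter; a sketch, with placeholder limits since the actual policy is TBD:

```python
import time
from collections import defaultdict

class IpThrottle:
    """Allow at most `limit` requests per IP in any `window_s`-second span."""

    def __init__(self, limit=5, window_s=60, clock=time.monotonic):
        self.limit = limit
        self.window_s = window_s
        self.clock = clock              # injectable for testing
        self.hits = defaultdict(list)   # ip -> timestamps of recent requests

    def allow(self, ip: str) -> bool:
        now = self.clock()
        recent = [t for t in self.hits[ip] if now - t < self.window_s]
        self.hits[ip] = recent
        if len(recent) >= self.limit:
            return False
        recent.append(now)
        return True
```

In-memory state like this doesn't survive across Lambda invocations, so a real deployment would likely back the counters with something external (API Gateway throttling, DynamoDB, etc.).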
The sync agent has access to a GitHub account with limited permissions, such that it can only push to the wikimit-hub project.
Pages with a huge number of revisions don't block other work, because the sync agent only does a small amount of work per job and queues a follow-up job for the rest.
Transient failures are dealt with by retry jobs; if the failures persist, the job is dropped.
Alerting on:
Wikipedia content can be redistributed as long as the redistribution carries the same CC BY-SA license.
Wikipedia limits access to its API to two simultaneous requests per IP.
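To stay under that limit, API calls can be funneled through a two-permit semaphore. A sketch using real MediaWiki revisions-query parameters but a hypothetical injectable `fetch` wrapper (the default HTTP call is omitted to keep it self-contained):

```python
import threading

# Wikipedia allows at most two simultaneous API requests per IP,
# so gate every call through a two-permit semaphore.
_api_permits = threading.Semaphore(2)

def fetch_revisions(title: str, limit: int = 100, fetch=None) -> dict:
    """Fetch one batch of revisions for `title` via the MediaWiki API."""
    params = {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvlimit": limit,
        "rvdir": "newer",          # oldest first, to match commit order
        "rvprop": "ids|timestamp|content",
        "format": "json",
    }
    with _api_permits:             # blocks while two requests are in flight
        return fetch("https://en.wikipedia.org/w/api.php", params)
```

Worker threads sharing `_api_permits` then can't exceed two concurrent requests no matter how many jobs run in parallel.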
Initial testing shows that converting a page with 1000 revisions takes about 100 seconds, with roughly 90% of the time spent running "git commit". This is pretty slow but not catastrophic.
A small article with 1000 revisions takes about 6 MB of disk space. This should be manageable.
Both speed and size need to be stress-tested with larger articles. TBD.
Hosting solutions:
TBD on actual cost estimates.
See all the TBD items above.
I ran the proof of concept on https://en.wikipedia.org/wiki/Finch to get a sense of (a) Wikipedia's APIs and (b) performance.
The generated repo (https://github.com/wikimit-hub/Finch) has 1090 commits. Here are the logs:
Querying wikipedia...
wiki took 0.47 seconds
Adding 100 commits...
git took 9.71 seconds
Querying wikipedia...
wiki took 0.77 seconds
Adding 100 commits...
git took 9.60 seconds
Querying wikipedia...
wiki took 1.22 seconds
Adding 100 commits...
git took 9.60 seconds
Querying wikipedia...
wiki took 0.76 seconds
Adding 100 commits...
git took 9.94 seconds
Querying wikipedia...
wiki took 0.74 seconds
Adding 100 commits...
git took 9.82 seconds
Querying wikipedia...
wiki took 0.68 seconds
Adding 100 commits...
git took 10.13 seconds
Querying wikipedia...
wiki took 0.82 seconds
Adding 100 commits...
git took 9.85 seconds
Querying wikipedia...
wiki took 0.76 seconds
Adding 100 commits...
git took 10.08 seconds
Querying wikipedia...
wiki took 0.77 seconds
Adding 100 commits...
git took 9.91 seconds
Querying wikipedia...
wiki took 0.83 seconds
Adding 100 commits...
git took 9.93 seconds
Querying wikipedia...
wiki took 0.63 seconds
Adding 89 commits...
git took 8.80 seconds
Done!
wiki took a total of 8.45 seconds
git took a total of 107.37 seconds
Conclusions:
Open questions:
Also, the resulting repo was 6.53 MB across 4,139 files.