kowainik / issue-wanted
Web application to help beginners start contributing to Haskell projects
Home Page: https://kowainik.github.io/posts/gsoc2019
License: Mozilla Public License 2.0
Since we have join tables, I was thinking we should implement some join statements in the sql/join.sql file. Is this a good idea? I'm currently doing some research on SQL joins.
We want to store some information in our database about all Haskell repositories. We don't need to store everything about a repo, only the interesting parts. The schema of the repos table should be created using the squeal library.
I noticed something important:
CREATE TABLE IF NOT EXISTS repos
( id SERIAL PRIMARY KEY
, owner TEXT NOT NULL
, name TEXT NOT NULL
, descr TEXT NOT NULL
, categories TEXT ARRAY
);
CREATE TABLE IF NOT EXISTS issues
( id SERIAL PRIMARY KEY
, number INT NOT NULL
, title TEXT NOT NULL
, body TEXT NOT NULL
, repo_owner TEXT NOT NULL
, repo_name TEXT NOT NULL
, url TEXT NOT NULL
, labels TEXT ARRAY
);
For the repos table we don't have a url field because we agreed we could construct the URL on the frontend from the owner and name.
The issues table does have a url field, but we could remove it as we did for repos.
-- | Data type representing a GitHub issue.
data Issue = Issue
{ issueId :: Id Issue
, issueNumber :: Int
, issueTitle :: Text
, issueBody :: Text
, issueRepoOwner :: RepoOwner
, issueRepoName :: RepoName
, issueUrl :: Text
, issueLabels :: SqlArray Text
} deriving stock (Generic, Show, Eq)
deriving anyclass (ToJSON, FromRow, ToRow)
-- | Data type representing a GitHub repository.
data Repo = Repo
{ repoId :: Id Repo
, repoOwner :: RepoOwner
, repoName :: RepoName
, repoDescr :: Text
, repoCategories :: SqlArray Text
} deriving stock (Generic, Show, Eq)
deriving anyclass (ToJSON, FromRow, ToRow)
Notice the Repo type doesn't have url either. Which approach is better: constructing the URL on the frontend, on the backend, or storing it in the database?
We need a roundtrip property test to ensure the ToRow and FromRow instances of Repo are correct.
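A sketch of such a roundtrip test with Hedgehog's `tripping`, using a simplified stand-in Repo and hypothetical row conversion helpers; the real test would exercise the actual ToRow/FromRow instances against the full record:

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Hedgehog
import qualified Hedgehog.Gen as Gen
import qualified Hedgehog.Range as Range
import Data.Text (Text)

-- Simplified stand-in for the real Repo type.
data Repo = Repo { repoOwner :: Text, repoName :: Text }
  deriving (Show, Eq)

-- Hypothetical row conversion helpers standing in for ToRow/FromRow.
toRow' :: Repo -> [Text]
toRow' (Repo owner name) = [owner, name]

fromRow' :: [Text] -> Maybe Repo
fromRow' [owner, name] = Just (Repo owner name)
fromRow' _             = Nothing

genRepo :: Gen Repo
genRepo = Repo
  <$> Gen.text (Range.linear 1 20) Gen.alphaNum
  <*> Gen.text (Range.linear 1 20) Gen.alphaNum

-- tripping checks that fromRow' (toRow' repo) == Just repo.
prop_repoRoundtrip :: Property
prop_repoRoundtrip = property $ do
  repo <- forAll genRepo
  tripping repo toRow' fromRow'

main :: IO ()
main = do
  _ <- check prop_repoRoundtrip
  pure ()
```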
There are several web frameworks to choose from.
I have not tested web frameworks for two years, but the ones I came to like back then were:
- snap, for being quite modular and not too magical
- scotty, for being minimalist
- servant, for being type-driven
The drawbacks I can think of are: snap may require heavier lock-in / investment in the framework (writing snaplets when necessary), scotty may require us to learn web-framework organization the hard way if the site grows beyond its original intent, and servant would, I think, require a web-app-like frontend and give less of a website feel.
Maybe this issue hasn't been posted because there is an implicit answer already?
Possible candidates:
postgresql-simple
squeal
opaleye
esqueleto
@rashadg1030 needs to decide which of these libraries he prefers to work with. But we can share our thoughts on them and discuss the options.
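As one data point for the comparison, a minimal postgresql-simple sketch; the connection string and the repos table columns here are assumptions based on the schema discussed above:

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Database.PostgreSQL.Simple

main :: IO ()
main = do
  -- Connection string is an assumption; adjust to the local setup.
  conn <- connectPostgreSQL "dbname=issue-wanted"
  -- Fetch (owner, name) pairs from the repos table sketched above.
  repos <- query_ conn "SELECT owner, name FROM repos" :: IO [(String, String)]
  mapM_ print repos
```

postgresql-simple keeps the queries as plain SQL strings; squeal and opaleye would instead type-check the query against the schema, at the cost of a steeper learning curve.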
CREATE TABLE IF NOT EXISTS issues
( id SERIAL PRIMARY KEY
, number INT NOT NULL
, title TEXT NOT NULL
, body TEXT
, url TEXT NOT NULL
, owner TEXT NOT NULL
, repo_name TEXT NOT NULL
, labels TEXT ARRAY
);
I was wondering if we should change the name of the owner column to repo_owner?
How will this work?
- bump the base dependency, use cabal-version: 2.4, use common stanzas
- update stack.yaml (bump up the lts and library versions)
- update the .travis.yml file
As a reference, you can see the changes in the three-layer repository.
This issue is in reference to the creation of prototypes or wireframes for the frontend. It would aid in streamlining the backend as per the project's requirements. Do these wireframes exist, or are they in the process of development? If not, would it be desirable for me to start implementing them with the guidance of the respected contributors?
Hi all, I'm looking at trying this for Google Summer of Code. Currently, via the fetch functions in IssueWanted.Search, we can poll GitHub's API for issues regularly. Is this the permanent plan for keeping the cache up to date?
I ask because I've been prodding around the github library, and was wondering about the value of using webhooks to subscribe to particularly active repos (e.g. stack) to get more live updates on issues with valuable tags.
@chshersh Do you have any specific recommendations on how we want to keep items synced?
After #8 is done
We know for sure that we'd like to use some functional language for the frontend. What suits our needs best: PureScript, Elm, or something else? Pros and cons?
Which version of the GitHub API are we going to use?
GitHub provides access both in the form of REST (v3) and GraphQL (v4). Considering the amount of data we need to scrape for issues and cabal files, and the need to keep our database synced with repos, would it be beneficial to integrate GraphQL in the backend as well as the frontend? A single GraphQL call can replace multiple REST calls, thus reducing latency for both the database and the frontend, and keeping the rate limits imposed by GitHub under control.
The issueRepoOwner and issueRepoName fields in the Issue type need to be changed to their corresponding newtypes.
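A sketch of what those newtypes could look like, assuming Text underneath; field names and deriving choices here are assumptions, and the real versions would likely also derive the database field instances:

```haskell
{-# LANGUAGE DerivingStrategies         #-}
{-# LANGUAGE GeneralizedNewtypeDeriving #-}

import Data.Text (Text)

-- Sketch only: accessor names are assumptions mirroring the
-- record style of the Issue and Repo types above.
newtype RepoOwner = RepoOwner { unRepoOwner :: Text }
  deriving stock   (Show)
  deriving newtype (Eq)

newtype RepoName = RepoName { unRepoName :: Text }
  deriving stock   (Show)
  deriving newtype (Eq)
```

Wrapping both fields in distinct newtypes means the compiler rejects code that passes an owner where a repo name is expected, which is easy to do when both are plain Text.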
Maybe we should have an admin page. Let's discuss what could be on it. I can think of the following:
I will describe later what functions to add.
I was wondering if it is a good time to work on the async worker that will populate our DB? Once we get a better idea of the data we need, I guess. I've implemented an endpoint for our server that touches the DB and everything, so I think this would be a good next step. And one thing about file structure: we have a file that holds the GitHub query functions at the path src/IW/Server/Search.hs. Is it fine for it to stay in the Server folder as it is now, or should it go into another folder called Async or Worker? For example, it could be src/IW/Worker/Search.hs. I'm not sure, because technically Search.hs does function on the server, but I think files in src/IW/Server should be related to the issue-wanted API.
I did some research on how to structure the database better for our use case. We need to make some adjustments so it works fast. At the same time, it will make our lives easier.
- The foreign key of the issues table should be not the id but repo.name. It's completely okay to have textual foreign keys. Just add an INDEX on the repos.name column (later we can see what columns we are using, so we can add more indexes for performance).
- The categories, labels, repos_categories and issues_labels join tables won't be needed. Instead, we will store them as separate columns using PostgreSQL arrays. This approach is much, much faster than joining tables, and such arrays support all the operations we need for filtering. This also means that updating labels is now a simple task: just write the new labels in place in that array.

We should decide which database we want to use for issue-wanted. We don't need something strong and secure, but we also don't want 💩
Possible candidates:
acid-state
Something else?
What we need to store (approximation of our database scheme):
Looks like some SQL DB is the way to go... But in that case we also need to choose a library...
See the three-layer repository for an example.
Specifically, we need to bring the following things:
- the Lib/App directory
- the Lib/App/Error.hs module
- an Effects.Log module with logging

Doing stack build, I get the following error:
-- While building package postgresql-libpq-0.9.4.2 using:
/tmp/stack-9b8112fa643e992a/postgresql-libpq-0.9.4.2/.stack-work/dist/x86_64-linux/Cabal-2.4.0.1/setup/setup --builddir=.stack-work/dist/x86_64-linux/Cabal-2.4.0.1 configure --with-ghc=/home/simon/.stack/programs/x86_64-linux/ghc-8.6.5/bin/ghc --with-ghc-pkg=/home/simon/.stack/programs/x86_64-linux/ghc-8.6.5/bin/ghc-pkg --user --package-db=clear --package-db=global --package-db=/home/simon/.stack/snapshots/x86_64-linux/lts-13.26/8.6.5/pkgdb --libdir=/home/simon/.stack/snapshots/x86_64-linux/lts-13.26/8.6.5/lib --bindir=/home/simon/.stack/snapshots/x86_64-linux/lts-13.26/8.6.5/bin --datadir=/home/simon/.stack/snapshots/x86_64-linux/lts-13.26/8.6.5/share --libexecdir=/home/simon/.stack/snapshots/x86_64-linux/lts-13.26/8.6.5/libexec --sysconfdir=/home/simon/.stack/snapshots/x86_64-linux/lts-13.26/8.6.5/etc --docdir=/home/simon/.stack/snapshots/x86_64-linux/lts-13.26/8.6.5/doc/postgresql-libpq-0.9.4.2 --htmldir=/home/simon/.stack/snapshots/x86_64-linux/lts-13.26/8.6.5/doc/postgresql-libpq-0.9.4.2 --haddockdir=/home/simon/.stack/snapshots/x86_64-linux/lts-13.26/8.6.5/doc/postgresql-libpq-0.9.4.2 --dependency=Cabal=Cabal-2.4.1.0-9MZFDeNrcJI10bcroa6pq8 --dependency=base=base-4.12.0.0 --dependency=bytestring=bytestring-0.10.8.2 --dependency=unix=unix-2.7.2.2
Process exited with code: ExitFailure 1
I spent a little time figuring out that on Ubuntu I need to apt install libpq-dev
to compile this.
For running Postgres, I need to
$ sudo apt install postgresql postgresql-contrib
$ sudo service postgres start
$ sudo -u postgres psql
postgres=# create database "issue-wanted";
postgres=# create user simon;
postgres=# grant all privileges on database "issue-wanted" to simon;
I modified pg_hba.conf
with the lines
local all simon trust
host all simon 0.0.0.0/0 trust
(This is a little unsafe, I realize, but for some reason 127.0.0.1/8 didn't cut it.)
I also changed user=simon in config.toml and added listen_addresses = '127.0.0.1' in /etc/postgresql/10/main/postgresql.conf.
I then restarted Postgres and initialized the database manually:
$ sudo service postgres restart
$ psql issue-wanted < sql/schema.sql
$ psql issue-wanted < sql/seed.sql
$ stack exec issue-wanted
At this point the /issues endpoint is responding positively!
Perhaps we should document some of this in README.md?
I would like users to have the ability to see their achievements based on their open-source contributions. This is the biggest motivation for people to do something. So we should have the ability to log in to our application through GitHub (in the perfect case).
Also, I'm not sure that we should synchronise contributors' past activity... Let's track everything starting from server start.
So I propose the following sync scheme:
We should also assign points to every achievement and have a ranking table...
I would like to work on setting up the testing suite before I start making any serious changes to the code. Should I refer to the three-layer example tests? If so, will we need hedgehog for testing, or is hspec enough?
These columns should exist only inside the database. They should be set to NOW() during creation, and updated_at should be updated automatically when we update the row. These columns will be useful later when we want to perform cleanup of our database.
We can use the GitHub API to retrieve the contents of a repo's Cabal file. For example, for pandoc: https://api.github.com/repos/jgm/pandoc/contents/pandoc.cabal
We can then use https://www.haskell.org/cabal/release/cabal-latest/doc/API/Cabal/Distribution-PackageDescription-Parsec.html to parse the file once we retrieve it.
Am I missing anything? Any suggestions?
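A minimal sketch of the parsing step with the Cabal library's parseGenericPackageDescriptionMaybe, assuming the file contents have already been fetched (note the contents API returns the file base64-encoded inside a JSON envelope, so it would need decoding first):

```haskell
import qualified Data.ByteString as BS
import Distribution.PackageDescription.Parsec (parseGenericPackageDescriptionMaybe)

main :: IO ()
main = do
  -- Assumes the .cabal file was already fetched and saved locally.
  contents <- BS.readFile "pandoc.cabal"
  case parseGenericPackageDescriptionMaybe contents of
    Just _pkg -> putStrLn "parsed the package description successfully"
    Nothing   -> putStrLn "failed to parse the .cabal file"
```

From the resulting GenericPackageDescription we could then pull out the fields we care about, such as the package name, synopsis, and category.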
What should the URLs for issue-wanted look like? So far, we have one endpoint:
~/issues/:issueId
Some other ones I think we need:
~/issues/ -- returns all issues
~/issues/:label -- returns all issues with the given label
~/repos/ -- returns all repos
~/repos/:repoId -- returns a repo with the Id
~/repos/:category -- returns all repos with the given category
Any more suggestions?
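If Servant were chosen (still an open question per the framework discussion above), the routes could be sketched as an API type; Issue and Repo are placeholders here:

```haskell
{-# LANGUAGE DataKinds     #-}
{-# LANGUAGE TypeOperators #-}

import Data.Text (Text)
import Servant.API

-- Placeholder types standing in for the real Issue and Repo.
data Issue
data Repo

type API =
       "issues" :> Get '[JSON] [Issue]                             -- ~/issues/
  :<|> "issues" :> Capture "issueId" Int :> Get '[JSON] Issue      -- ~/issues/:issueId
  :<|> "issues" :> QueryParam "label" Text :> Get '[JSON] [Issue]  -- ~/issues?label=...
  :<|> "repos"  :> Get '[JSON] [Repo]                              -- ~/repos/
  :<|> "repos"  :> Capture "repoId" Int :> Get '[JSON] Repo        -- ~/repos/:repoId
```

One thing this sketch surfaces: ~/issues/:issueId and ~/issues/:label are textually ambiguous as path segments (as are ~/repos/:repoId and ~/repos/:category), which is why the label filter is shown as a query parameter here.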
I propose to add a better description in the README using the abstract of my GSoC proposal. Is this a good idea?
All modules currently have the IssueWanted prefix in the library. Let's choose a shorter prefix before it's too late.
We will need to add functions that convert the github library types like Issue and Repo to our own types.
We're going to use the github package for GitHub bindings. This function might be useful:
I propose to put this function under the IssueWanted.Search module.
What needs to be done here:
- a sql/ directory in the project root; it should contain two files for now: schema.sql with the schema and drop.sql for removing the schema (useful for testing)
- an IW.Db.Schema module with helper functions (see three-layer for an example)
Refers to #33
After adding the Makefile, I realized that the project uses Docker to run the PostgreSQL database. Is the Dockerfile going to be the same as the one in the three-layer repo?
Basically we need:
- curl -H 'Accept: application/vnd.github.preview.text-match+json' https://api.github.com/search/repositories?q=language:haskell&order=desc
- labels (help wanted, good first issue)
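A sketch of that request in Haskell using http-conduit; the Accept header value is taken from the curl command above, and the User-Agent value is an assumption (GitHub rejects requests without one):

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Network.HTTP.Simple

main :: IO ()
main = do
  request <- parseRequest
    "https://api.github.com/search/repositories?q=language:haskell&order=desc"
  let request' = setRequestHeaders
        [ ("Accept", "application/vnd.github.preview.text-match+json")
        , ("User-Agent", "issue-wanted")  -- any non-empty value works
        ] request
  response <- httpLBS request'
  -- Raw JSON body; the github package would give typed results instead.
  print (getResponseStatusCode response)
```

Filtering by the help wanted and good first issue labels would then happen either via the search query itself (label: qualifiers on the issues search endpoint) or after decoding the response.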
We need unit tests to make sure our SQL query statements work correctly.
I've made a fork of the project and have been testing the GitHub API query functions located in issue-wanted/src/IssueWanted/Search.hs. None of the functions return errors, which is great, but I'm not sure if they are returning the right results. For example, the function fetchHaskellReposGFI, which returns all Haskell repositories with "good-first-issue" labels, gives a result count of 124. The function fetchGoodFirstIssue, which is supposed to return all issues with the Haskell language and the label "good-first-issue", only gives a result count of 8. This seems odd to me, but I'm not sure. There may be a problem with the query strings passed to searchRepos or searchIssues, but I can't be sure until I look more into the GitHub API documentation. I just wanted to get someone else's opinion on this.
Start setting up the file structure for the async worker code once we've figured out the database.