codeuniversity / smag-mvp

Social Record - Distributed scraping and analysis pipeline for a range of social media platforms

License: GNU Lesser General Public License v3.0

Go 43.95% TSQL 0.41% Dockerfile 4.36% Makefile 0.23% JavaScript 44.67% Shell 0.50% Python 3.30% HTML 0.51% CSS 2.07%
go kafka kubernetes distributed-scraping elasticsearch neo4j postgresql helm aws debezium


Social Record

Distributed scraping and analysis pipeline for a range of social media platforms



About

The goal of this project is to raise awareness about data privacy. The means to this end is a tool that scrapes, combines and analyzes public data from multiple social media sources.
The results will be made available via an API and used for an art exhibition.

Architectural overview

You can find a more detailed overview here.
Open it in draw.io and have a look at the tabs "High level overview", "Distributed Scraper" and "Face Search".

Further reading

Detailed documentation

part         docs                contact
Api          api/README.md       @jo-fr
Frontend     frontend/README.md  @lukas-menzel
Postgres DB  db/README.md        @alexmorten

Wanna contribute?

If you want to join us in raising awareness for data privacy, have a look at CONTRIBUTING.md

List of contributors

Github handle   Real name            Instagram profile  Twitter profile
@1Jo1           Josef Grieb          josef_grieb        josefgrieb
@Urhengulas     Johann Hemmann       Urhengulas         Johann
@alexmorten     Alexander Martin     no profile :(      no profile :(
@jo-fr          Jonathan Freiberger  jonifreiberger     Jonathan
@m-lukas        Lukas Müller         lmglukas           Lukas Müller
@lukas-menzel   Lukas Menzel         lukasmenzel        Lukas Menzel
@SpringHawk     Martin Zaubitzer     /                  /

Deployment

The deployment of this project to Kubernetes happens in codeuniversity/smag-deploy (this is a private repo!)

Getting started

Requirements

dependency      version
go              v1.13 (go modules)
docker          v19.x
docker-compose  v1.24.x

Preparation

If this is your first time running this:

  1. Add 127.0.0.1 my-kafka and 127.0.0.1 minio to your /etc/hosts file
  2. Choose a <user_name> for your platform of choice <instagram|twitter> as a starting point and run
    $ go run cli/main/main.go <instagram|twitter> <user_name>

Scraper

Run the instagram- or twitter-scraper in docker:

$ make run-<platform_name>


smag-mvp's Issues

Unify naming conventions in project

Make sure we stick to the following naming conventions:

  • <platform>_scraper
  • <platform>_<database>-inserter

We should unify this everywhere:

  • file structure
  • variable-, function-names
  • docker, docker-hub
  • kubernetes
  • circle-ci

fix insta_posts-inserter

The user ID that the inserter receives from the scraper does not match the one the insta_follow-scraper created in the DB. That means the posts are currently not matched to the correct user.

One solution would be to add the username of the scraped user posts to the Kafka message so that the posts can be matched via the username.

to dos

  • add the username to InstagramPost Model
  • pass the actual username to findOrCreateUser in posts-inserter.go

Workaround Worker for postgres <> kafka synchronization

When the DB dump was applied to postgres, debezium was not triggered and the usernames were not committed to the associated Kafka topics. Therefore, we need a small worker which adds the incomplete users to Kafka in batches.

Remove unnecessary files from root dir

Description

Remove files:

by either:

  • deleting them, or
  • putting them into a more appropriate (scoped) place

Let's do it on one branch called clean-root.

Checklist

  • pull_request_template.md
  • useragents.json
  • build_and_upload_image.sh

Refactor executor to handle worker lifecycle

Description

Right now the executor exposes lifecycle hooks that can be called when something closes the execution from outside, but you still need to call the hooks in the right order, and it is easy to forget some of them.

Issues

  • easy to forget hook calls
  • panics are not handled gracefully
  • no simple graceful shutdown mechanism for Run

Goal

(Re)write a generalized executor that accepts the Run func and registers a list of shutdown hooks to be called in order after:

  • a close call from outside
  • a graceful shutdown from inside
  • a panic from inside

Refactor Docker-Compose setup and divide into component clusters

Currently the local deployment setup is very cumbersome and can easily lead to errors due to cached images or other dependencies. Separate docker-compose files for different kinds of components (always build vs. never changed -> never build, ...) could make it more fault-tolerant and easier to manage.

Kafka web UI in docker

Deploy Kafka Web UI as Docker container and connect it to our existing infrastructure

Kafka misses required parameter and doesn't start properly

Description

Kafka seems to be missing a required argument and doesn't start completely -> not reachable

zookeeper_1               | 2019-10-18 14:53:18,082 [myid:] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@949] - Client attempting to establish new session at /172.20.0.3:43176
zookeeper_1               | 2019-10-18 14:53:18,084 [myid:] - INFO  [SyncThread:0:ZooKeeperServer@694] - Established session 0x10008e64efb000e with negotiated timeout 30000 for client /172.20.0.3:43176
my-kafka_1                | Missing required argument "[partitions]"
my-kafka_1                | Option                                   Description       
...
zookeeper_1               | 2019-10-18 14:53:18,520 [myid:] - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@376] - Unable to read additional data from client sessionid 0x10008e64efb000e, likely client has closed socket
zookeeper_1               | 2019-10-18 14:53:18,522 [myid:] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1056] - Closed socket connection for client /172.20.0.3:43176 which had sessionid 0x10008e64efb000e
zookeeper_1               | 2019-10-18 14:53:49,657 [myid:] - INFO  [SessionTracker:ZooKeeperServer@355] - Expiring session 0x10008e64efb000e, timeout of 30000ms exceeded
zookeeper_1               | 2019-10-18 14:53:49,658 [myid:] - INFO  [ProcessThread(sid:0 cport:2181)::PrepRequestProcessor@487] - Processed session termination for sessionid: 0x10008e64efb000e

The error occurs when using docker-compose up -> might be an issue with a new Kafka release

Kafka not reachable (using go run cli/main/main.go): panic: dial tcp: lookup my-kafka: no such host

@alexmorten

Simple website in React.js

User Flow:

  1. As a user, I want to type in my Instagram user name to get the data about my social media presence.
  2. As a user, I then want to see my Instagram pictures and tweets.
  3. As a user, I want to see analysed data (number of posts, hashtags, tagged posts).

Checklist:

  • Build user search
  • Show user list on start page after entering a username
  • Show basic Instagram user information (username, real name, bio, author image)
  • Display results on a protected route
  • Display Instagram pictures, tagged pictures, hashtags and analysed data (number of posts, hashtags, tagged posts) on the Instagram result page
  • Display basic information, hashtags, tweets, retweets and analysed data on the Twitter result page

Wireframes: https://projects.invisionapp.com/d/main/?origin=v7#/console/18582016/386444159/preview

Kubernetes Deployment

Deploy all applications, services and storages on the Kubernetes cluster.

  • aws service
  • insta posts scraper
  • insta comments scraper
  • user_names filter
  • post_pics_filter
  • pic downloader
  • api

[twitter] split twitter_scraper_users up

Description

The problem is that the twitter_scraper_users takes too long to scrape a single user. Therefore it might make sense to split it up into separate scrapers for user_info, followers and followings.

Checklist

  • user_info
  • followers
  • followings

Add Insta Profile Information

Create a new Kafka topic message to send the profile information in the insta-posts-scraper, copy the insta-postgres-inserter logic into an insta-profile-inserter, and update/create the profile information.

Tagged User Photos

Services: insta-post-scraper & insta-post-inserter

  • add tagged users to the model and database

Instagram Scraper & Kubernetes

  • Deploy Kafka on Kubernetes
  • Deploy Postgres on Kubernetes
  • Deploy scraper and inserter on Kubernetes
  • Deploy ElasticService (delete and recreate new Elastic IPs)

Update README

Currently, instructions to initialise the debezium connector are missing.

Implement full gorm support for other inserters

Implement gorm for the inserters as it is done in twitter_*-inserter packages.

  • rewrite models to use gorm syntax (struct tags for associations, primary key ...)
  • add auto-migrations
  • rewrite the inserters following the example of the twitter_*-inserter
