codeuniversity / smag-mvp

Social Record - Distributed scraping and analysis pipeline for a range of social media platforms

License: GNU Lesser General Public License v3.0

Go 43.95% TSQL 0.41% Dockerfile 4.36% Makefile 0.23% JavaScript 44.67% Shell 0.50% Python 3.30% HTML 0.51% CSS 2.07%
go kafka kubernetes distributed-scraping elasticsearch neo4j postgresql helm aws debezium


Social Record

Distributed scraping and analysis pipeline for a range of social media platforms



About

The goal of this project is to raise awareness about data privacy. The means to this end is a tool that scrapes, combines and analyzes public data from multiple social media sources.
The results will be made available via an API and used for an art exhibition.

Architectural overview

You can find a more detailed overview here.
Open it in draw.io and have a look at the tabs "High level overview", "Distributed Scraper" and "Face Search".

Further reading

Detailed documentation

part         docs                contact
Api          api/README.md       @jo-fr
Frontend     frontend/README.md  @lukas-menzel
Postgres DB  db/README.md        @alexmorten

Wanna contribute?

If you want to join us in raising awareness for data privacy, have a look at CONTRIBUTING.md

List of contributors

Github handle   Real name            Instagram profile  Twitter profile
@1Jo1           Josef Grieb          josef_grieb        josefgrieb
@Urhengulas     Johann Hemmann       Urhengulas         Johann
@alexmorten     Alexander Martin     no profile :(      no profile :(
@jo-fr          Jonathan Freiberger  jonifreiberger     Jonathan
@m-lukas        Lukas Müller         lmglukas           Lukas Müller
@lukas-menzel   Lukas Menzel         lukasmenzel        Lukas Menzel
@SpringHawk     Martin Zaubitzer     /                  /

Deployment

The deployment of this project to Kubernetes happens in codeuniversity/smag-deploy (this is a private repo!)

Getting started

Requirements

dependency      version
go              v1.13 (go modules)
docker          v19.x
docker-compose  v1.24.x

Preparation

If this is your first time running this:

  1. Add 127.0.0.1 my-kafka and 127.0.0.1 minio to your /etc/hosts file
  2. Choose a <user_name> for your platform of choice <instagram|twitter> as a starting point and run
    $ go run cli/main/main.go <instagram|twitter> <user_name>

Scraper

Run the instagram- or twitter-scraper in docker:

$ make run-<platform_name>


smag-mvp's Issues

Unify naming conventions in project

Make sure we stick to the following naming conventions:

  • <platform>_scraper
  • <platform>_<database>-inserter

We should unify this everywhere:

  • file structure
  • variable-, function-names
  • docker, docker-hub
  • kubernetes
  • circle-ci

fix insta_posts-inserter

The user ID that the inserter receives from the scraper does not match the one the insta_follow-scraper created in the DB. That means the posts are currently not matched to the correct user.

One solution would be to add the username of the scraped user posts to the Kafka message so that the posts can be matched via the username.

to dos

  • add the username to InstagramPost Model
  • pass the actual username to findOrCreateUser in posts-inserter.go

Workaround Worker for postgres <> kafka synchronization

When the DB dump was applied to postgres, debezium was not triggered and the usernames were not committed to the associated Kafka topics. Therefore, we need a small worker which adds the incomplete users to Kafka in batches.

Remove unnecessary files from root dir

Description

Remove files:

by either:

  • deleting them, or
  • putting them into a more appropriate (scoped) place

Let's do it on one branch called clean-root.

Checklist

  • pull_request_template.md
  • useragents.json
  • build_and_upload_image.sh

Refactor executor to handle worker lifecycle

Description

Right now the executor exposes lifecycle hooks that can be called when something closes the execution from outside, but you still need to call the hooks in the right order, and it is easy to forget some of them.

Issues

  • easy to forget hook calls
  • panics are not handled gracefully
  • no simple graceful shutdown mechanism for Run

Goal

(Re)write a generalized executor that accepts the Run func and registers a list of shutdown hooks to be called in order after:

  • a close call from outside
  • a graceful shutdown from inside
  • a panic from inside

Refactor Docker-Compose setup and divide into component clusters

Currently the local deployment setup is very cumbersome and can easily lead to errors due to cached images or other dependencies. Separate docker-compose files for different kinds of components (always build vs. never changed -> never build, ...) could make it more fault-tolerant and easier to manage.

Kafka web UI in docker

Deploy Kafka Web UI as Docker container and connect it to our existing infrastructure

Kafka misses required parameter and doesn't start properly

Description

Kafka seems to be missing a required argument and doesn't start completely -> not reachable

zookeeper_1               | 2019-10-18 14:53:18,082 [myid:] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@949] - Client attempting to establish new session at /172.20.0.3:43176
zookeeper_1               | 2019-10-18 14:53:18,084 [myid:] - INFO  [SyncThread:0:ZooKeeperServer@694] - Established session 0x10008e64efb000e with negotiated timeout 30000 for client /172.20.0.3:43176
my-kafka_1                | Missing required argument "[partitions]"
my-kafka_1                | Option                                   Description       
...
zookeeper_1               | 2019-10-18 14:53:18,520 [myid:] - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@376] - Unable to read additional data from client sessionid 0x10008e64efb000e, likely client has closed socket
zookeeper_1               | 2019-10-18 14:53:18,522 [myid:] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1056] - Closed socket connection for client /172.20.0.3:43176 which had sessionid 0x10008e64efb000e
zookeeper_1               | 2019-10-18 14:53:49,657 [myid:] - INFO  [SessionTracker:ZooKeeperServer@355] - Expiring session 0x10008e64efb000e, timeout of 30000ms exceeded
zookeeper_1               | 2019-10-18 14:53:49,658 [myid:] - INFO  [ProcessThread(sid:0 cport:2181)::PrepRequestProcessor@487] - Processed session termination for sessionid: 0x10008e64efb000e

The error occurs when using docker-compose up -> might be an issue with a new Kafka release

Kafka not reachable (using go run cli/main/main.go): panic: dial tcp: lookup my-kafka: no such host

@alexmorten

Simple website in React.js

User Flow:

  1. As a user, I want to type in my Instagram user name to get the data about my social media presence.
  2. As a user, I then want to see my Instagram pictures and tweets.
  3. As a user, I want to see analysed data (number of posts, hashtags, tagged posts).

Checklist:

  • Build user search
  • Show user list on start page after entering a username
  • Show basic Instagram user information (username, real name, bio, author image)
  • Display results on a protected route
  • Display Instagram pictures, tagged pictures, hashtags and analysed data (number of posts, hashtags, tagged posts) on the Instagram result page
  • Display basic information, hashtags, tweets, retweets and analysed data on the Twitter result page

Wireframes: https://projects.invisionapp.com/d/main/?origin=v7#/console/18582016/386444159/preview

Kubernetes Deployment

Deploy all applications, services and storages on the Kubernetes cluster.

  • aws service
  • insta posts scraper
  • insta comments scraper
  • user_names filter
  • post_pics_filter
  • pic downloader
  • api

[twitter] split twitter_scraper_users up

Description

The problem is that the twitter_scraper_users takes too long to scrape a single user. Therefore it might make sense to split it up into separate scrapers for user_info, followers and followings.

Checklist

  • user_info
  • followers
  • followings

Add Insta Profile Information

Create a new Kafka topic message to send the profile information in the insta-posts-scraper, copy the insta-postgres-inserter logic into an insta-profile-inserter, and update/create the profile information.

Tagged User Photos

Services: insta-post-scraper & insta-post-inserter

  • add tagged users to the model and database

Instagram Scraper & Kubernetes

  • Deploy Kafka on Kubernetes
  • Deploy Postgres on Kubernetes
  • Deploy scraper and inserter on Kubernetes
  • Deploy ElasticService (delete and recreate new Elastic IPs)

Update README

Currently, instructions to initialise the debezium connector are missing.

Implement full gorm support for other inserters

Implement gorm for the inserters as it is done in twitter_*-inserter packages.

  • rewrite models to use gorm syntax (struct tags for associations, primary key ...)
  • add auto-migrations
  • rewrite the inserters following the example of the twitter_*-inserter
