soryu23 / weibo-crawler

License: MIT License

Makefile 2.17% Go 91.49% Python 6.34%

go_crawler

Go_crawler is a crawler project built on the Go colly framework that crawls Weibo and extracts information from it. It extracts web content with regular expressions and XPath selectors, maps keywords into a vector space with a word-vector model, and clusters text content with the HDBSCAN clustering algorithm.
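
As a rough illustration of the regular-expression part of the extraction step, the sketch below pulls #topic# tags and @mentions out of a post's text using only Go's standard `regexp` package. The sample text, both patterns, and the helper `extractAll` are assumptions made for illustration; they are not taken from the project, which combines colly, regular expressions, and XPath selectors.

```go
package main

import (
	"fmt"
	"regexp"
)

// Hypothetical patterns: Weibo topics are written as #topic#, mentions as @name.
var (
	topicRe   = regexp.MustCompile(`#([^#\s]+)#`)
	mentionRe = regexp.MustCompile(`@([\p{Han}\w-]+)`)
)

// extractAll returns the first capture group of every match of re in text.
func extractAll(re *regexp.Regexp, text string) []string {
	var out []string
	for _, m := range re.FindAllStringSubmatch(text, -1) {
		out = append(out, m[1])
	}
	return out
}

func main() {
	// Made-up sample post, not real crawled data.
	post := "#golang# new crawler release, thanks @soryu23 and @gopher_01 #colly#"
	fmt.Println(extractAll(topicRe, post))   // [golang colly]
	fmt.Println(extractAll(mentionRe, post)) // [soryu23 gopher_01]
}
```

Patterns like these are brittle against markup changes, which is presumably why the project pairs them with XPath selectors for structured parts of the page.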

Features

Comprehensive capture of user information
Multi-dimensional collection of weibo content
Timed incremental acquisition
Keyword cluster analysis
Category hotspot sorting
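
The README does not describe how timed incremental acquisition is implemented. One plausible shape, sketched below under that caveat, is a periodic task that keeps only posts whose IDs have not been seen before. The names `seenSet`, `filterNew`, and `runRounds`, the in-memory map, and the fake post IDs are all hypothetical; the project itself lists Redis and Postgres as its stores.

```go
package main

import (
	"fmt"
	"time"
)

// seenSet remembers IDs that were already collected (hypothetical in-memory
// stand-in for a persistent store).
type seenSet map[string]bool

// filterNew returns only the IDs not seen before and marks them as seen.
func (s seenSet) filterNew(ids []string) []string {
	var fresh []string
	for _, id := range ids {
		if !s[id] {
			s[id] = true
			fresh = append(fresh, id)
		}
	}
	return fresh
}

// runRounds calls task once per tick until it has run the given number of rounds.
func runRounds(interval time.Duration, rounds int, task func(round int)) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for i := 0; i < rounds; i++ {
		<-ticker.C
		task(i)
	}
}

func main() {
	// Fake crawl results per round; a real crawler would fetch these.
	pages := [][]string{{"p1", "p2"}, {"p2", "p3"}}
	seen := seenSet{}
	// The README uses a 30-minute period; a short interval keeps the demo fast.
	runRounds(10*time.Millisecond, len(pages), func(round int) {
		fmt.Println("round", round, "new posts:", seen.filterNew(pages[round]))
	})
}
```

In the second round only `p3` is reported, because `p2` was already recorded in the first.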

Go_crawler is based on the following tools

| Name | Description |
| --- | --- |
| Go | An open source programming language that makes it easy to build simple, reliable, and efficient software. |
| Python | An interpreted, high-level, general-purpose programming language. |
| Gin | A web framework written in Go, with flexible middleware, strong data binding, and outstanding performance. |
| Ginkgo | Builds on Go's testing package, allowing expressive Behavior-Driven Development ("BDD") style tests. |
| Colly | Lightning Fast and Elegant Scraping Framework for Gophers. |
| Postgres | The world's most advanced open source relational database. |
| Gorm | The fantastic ORM library for Golang that aims to be developer friendly. |
| Redis | An open source (BSD licensed), in-memory data structure store, used as a database, cache, and message broker. |
| Docker | A tool designed to make it easier to create, deploy, and run applications by using containers. |
| Sklearn | Simple and efficient tools for predictive data analysis. |
| Gensim | The fastest library for training of vector embeddings – Python or otherwise. |
| HDBSCAN | A clustering algorithm that extends DBSCAN by converting it into a hierarchical clustering algorithm and then extracting a flat clustering based on the stability of clusters. |

Why not Python

Python has several mature crawler frameworks, such as Scrapy and pyspider. They have excellent runtime mechanisms and powerful capabilities. But when a site's anti-crawler mechanisms are strong, rewriting the middleware is very difficult, and these frameworks are not flexible enough to be embedded in a larger project system.

Project structure

.
├── application.yml  
├── args
│   ├── args.go
│   └── cmd.go
├── conf  
│   ├── conf_debug.go
│   ├── conf.go
│   └── conf_release.go
├── controller
│   ├── application.go
│   ├── blogger.go
│   ├── category.go
│   ├── error.go
│   ├── query.go
│   ├── tag.go
│   └── task.go
├── corpus  
│   └── corpus.txt
├── db  
│   └── db.go
├── go.mod
├── go.sum
├── jwt  
│   └── jwt.go
├── main.go
├── Makefile  
├── models  
│   ├── base_model.go
│   ├── blog.go
│   ├── blogger.go
│   ├── category.go
│   ├── tag.go
│   └── user.go
├── python  
│   ├── dict.txt
│   ├── keywords.txt
│   ├── keywords_demo.py
│   └── save_cookies.go
├── README.md
├── redis
│   └── redis.go
├── routers  
│   └── router.go
├── tasks
│   ├── tags.go
│   └── tasks.go
├── test
└── util
    ├── agent.go
    ├── cookie.go
    ├── cookies.txt
    └── util.go

How to use

  1. Install the tools and dependencies listed above
  2. Configure application.yml and establish the connection
  3. go run main.go -db create
  4. go run main.go -db migrate
  5. go run main.go
  6. Add bloggers and keywords: POST /add_bloggers, /tags/set_keywords and /tags/cache_keywords
  7. Wait 30 minutes, or call /task (local debug environment)
  8. Let the bullets fly
  9. POST /query_blogs to show the data

If you want to run clustering: POST /tags/keywords, download the corpus, run python keywords.py, adjust the result, and POST /category/set. Then GET /category/query to show hot topics.
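
The HTTP calls in steps 6, 7, and 9 can be sketched as a small Go client that builds the requests in order. This is only an illustration: the base URL `http://localhost:8080`, the empty `{}` JSON body, and the names `call`, `workflow`, and `newRequest` are assumptions, since the README documents neither the listen address nor the request schema. The project tree contains a jwt package, so real requests may also need a token; that is not shown.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
)

// call is one step of the workflow: an HTTP method and a route from the README.
type call struct {
	Method string
	Path   string
}

// workflow lists the calls from steps 6, 7 and 9 in order.
func workflow() []call {
	return []call{
		{http.MethodPost, "/add_bloggers"},
		{http.MethodPost, "/tags/set_keywords"},
		{http.MethodPost, "/tags/cache_keywords"},
		{http.MethodGet, "/task"},
		{http.MethodPost, "/query_blogs"},
	}
}

// newRequest builds the request for one call. The JSON body is a placeholder:
// the README does not document the request schema.
func newRequest(baseURL string, c call, jsonBody string) (*http.Request, error) {
	var body io.Reader
	if c.Method == http.MethodPost {
		body = strings.NewReader(jsonBody)
	}
	req, err := http.NewRequest(c.Method, strings.TrimRight(baseURL, "/")+c.Path, body)
	if err != nil {
		return nil, err
	}
	if c.Method == http.MethodPost {
		req.Header.Set("Content-Type", "application/json")
	}
	return req, nil
}

func main() {
	// Assumed address; use whatever host and port application.yml configures.
	const baseURL = "http://localhost:8080"
	for _, c := range workflow() {
		req, err := newRequest(baseURL, c, "{}")
		if err != nil {
			fmt.Println("bad request:", err)
			continue
		}
		// Send with http.DefaultClient.Do(req) once the service is running.
		fmt.Println(req.Method, req.URL.String())
	}
}
```

As written, the program only prints the planned requests, so it runs without a deployed crawler.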

API list

| API | Method | Route | Function |
| --- | --- | --- | --- |
| Ping | GET | /ping | ping |
| Task | GET | /task (auto run every 30 minutes) | crawler task |
| Query_blogs | POST | /query_blogs | query according to different parameters |
| Add_bloggers | POST | /add_bloggers | add bloggers in task list |
| Set_category | GET | /category/set | set category by clustering result |
| Set_category_name | POST | /category/set_name | rename category |
| Query_category | GET | /category/query | query category |
| Query_tags | GET | /tags/query | query tags |
| Cache_keywords | POST | /tags/cache_keywords | save keywords to redis |
| Get_keywords | POST | /tags/keywords | query keywords and write to txt for clustering |
| Set_keywords | POST | /tags/set_keywords | add keywords as tags |
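
The route list above can be mirrored as a stub server to check methods and paths before wiring a client. The sketch below uses the standard `net/http` mux as a stand-in, not Gin, which the project actually uses. The handlers are placeholders that only enforce the HTTP method, and the names `route`, `newMux`, and `status` are hypothetical.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// route mirrors one row of the API list.
type route struct {
	Method, Path string
}

var routes = []route{
	{"GET", "/ping"},
	{"GET", "/task"},
	{"POST", "/query_blogs"},
	{"POST", "/add_bloggers"},
	{"GET", "/category/set"},
	{"POST", "/category/set_name"},
	{"GET", "/category/query"},
	{"GET", "/tags/query"},
	{"POST", "/tags/cache_keywords"},
	{"POST", "/tags/keywords"},
	{"POST", "/tags/set_keywords"},
}

// newMux registers a stub handler per route that enforces the HTTP method.
func newMux() *http.ServeMux {
	mux := http.NewServeMux()
	for _, r := range routes {
		r := r // capture the current route for the closure
		mux.HandleFunc(r.Path, func(w http.ResponseWriter, req *http.Request) {
			if req.Method != r.Method {
				http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
				return
			}
			fmt.Fprintf(w, "stub for %s %s", r.Method, r.Path)
		})
	}
	return mux
}

// status returns the status code the mux gives for a method and path.
func status(mux *http.ServeMux, method, path string) int {
	rec := httptest.NewRecorder()
	mux.ServeHTTP(rec, httptest.NewRequest(method, path, nil))
	return rec.Code
}

func main() {
	mux := newMux()
	fmt.Println(status(mux, "GET", "/ping"))        // 200
	fmt.Println(status(mux, "GET", "/query_blogs")) // 405: the list says POST
	fmt.Println(status(mux, "GET", "/missing"))     // 404: not in the list
}
```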
