Go_crawler is a crawler project built on the Go Colly framework that crawls Weibo and extracts information from it. It extracts web content with regular expressions and XPath selectors, maps keywords into a vector space using a word-vector model, and clusters text content with the HDBSCAN clustering algorithm.
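As a rough illustration of the regular-expression side of the extraction, the sketch below pulls `#topic#` markers out of a post body using only the standard library. The pattern, the `extractTopics` helper, and the sample text are assumptions made for this example, not code from the project; the pattern also simplifies by not allowing whitespace inside a topic.

```go
package main

import (
	"fmt"
	"regexp"
)

// topicPattern matches Weibo-style topics written as #topic#.
// [^#\s]+ keeps a match from spanning whitespace or another '#'
// (a simplifying assumption for this sketch).
var topicPattern = regexp.MustCompile(`#([^#\s]+)#`)

// extractTopics returns the topic names found in text, without the '#' marks.
// It is a hypothetical helper, not part of Go_crawler.
func extractTopics(text string) []string {
	matches := topicPattern.FindAllStringSubmatch(text, -1)
	topics := make([]string, 0, len(matches))
	for _, m := range matches {
		topics = append(topics, m[1]) // m[1] is the first capture group
	}
	return topics
}

func main() {
	post := "Watching the launch live #SpaceNews# with friends #Weekend#"
	fmt.Println(extractTopics(post)) // prints [SpaceNews Weekend]
}
```

In a Colly-based crawler, a function like this would typically run inside a response or element callback, on text that a selector has already isolated.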
Features:
- Comprehensive capture of user information
- Multi-dimensional collection of Weibo content
- Timed incremental acquisition
- Keyword cluster analysis
- Category hotspot sorting
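One common way to make acquisition incremental is to remember the timestamp of the newest post already stored and keep only posts newer than that on the next run. The sketch below shows the idea; the `post` type, its fields, and the `newerThan` helper are hypothetical and are not the project's actual model or logic.

```go
package main

import "fmt"

// post is a hypothetical, trimmed-down record used only for this sketch.
type post struct {
	ID        string
	CreatedAt int64 // Unix seconds
}

// newerThan returns the posts created after lastSeen, together with the
// updated watermark (the newest CreatedAt seen so far).
func newerThan(posts []post, lastSeen int64) ([]post, int64) {
	fresh := make([]post, 0, len(posts))
	newest := lastSeen
	for _, p := range posts {
		if p.CreatedAt > lastSeen {
			fresh = append(fresh, p)
			if p.CreatedAt > newest {
				newest = p.CreatedAt
			}
		}
	}
	return fresh, newest
}

func main() {
	fetched := []post{{"a", 100}, {"b", 205}, {"c", 150}}
	fresh, watermark := newerThan(fetched, 120)
	fmt.Println(len(fresh), watermark) // prints 2 205
}
```

The returned watermark would be persisted per blogger so that the next scheduled run starts from where the previous one stopped.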
Go_crawler is built on the following tools:
name | description |
---|---|
Go | An open source programming language that makes it easy to build simple, reliable, and efficient software. |
Python | Python is an interpreted, high-level and general-purpose programming language. |
Gin | Web framework written in Go, with flexible middleware, strong data binding, and outstanding performance. |
Ginkgo | Ginkgo builds on Go's testing package, allowing expressive Behavior-Driven Development ("BDD") style tests. |
Colly | Lightning Fast and Elegant Scraping Framework for Gophers |
Postgres | The world's most advanced open source relational database |
Gorm | The fantastic ORM library for Golang aims to be developer friendly. |
Redis | An open source (BSD licensed), in-memory data structure store, used as a database, cache and message broker. |
Docker | Docker is a tool designed to make it easier to create, deploy, and run applications by using containers. |
Sklearn | Simple and efficient tools for predictive data analysis |
Gensim | The fastest library for training of vector embeddings – Python or otherwise. |
HDBSCAN | HDBSCAN is a clustering algorithm that extends DBSCAN by converting it into a hierarchical clustering algorithm, and then extracts a flat clustering based on the stability of clusters. |
Python has several mature crawler frameworks, such as Scrapy and pyspider, with excellent runtime mechanisms and powerful capabilities. But when a site's anti-crawler mechanisms are strong, rewriting the middleware is very difficult, and these frameworks are not flexible enough to be integrated into a larger project system.
.
├── application.yml
├── args
│ ├── args.go
│ └── cmd.go
├── conf
│ ├── conf_debug.go
│ ├── conf.go
│ └── conf_release.go
├── controller
│ ├── application.go
│ ├── blogger.go
│ ├── category.go
│ ├── error.go
│ ├── query.go
│ ├── tag.go
│ └── task.go
├── corpus
│ └── corpus.txt
├── db
│ └── db.go
├── go.mod
├── go.sum
├── jwt
│ └── jwt.go
├── main.go
├── Makefile
├── models
│ ├── base_model.go
│ ├── blog.go
│ ├── blogger.go
│ ├── category.go
│ ├── tag.go
│ └── user.go
├── python
│ ├── dict.txt
│ ├── keywords.txt
│ ├── keywords_demo.py
│ └── save_cookies.go
├── README.md
├── redis
│ └── redis.go
├── routers
│ └── router.go
├── tasks
│ ├── tags.go
│ └── tasks.go
├── test
└── util
  ├── agent.go
  ├── cookie.go
  ├── cookies.txt
  └── util.go
- Install the tools and dependencies listed above.
- Configure `application.yml` and establish the connection.
- Run the following commands in order:
  - `go run main.go -db create`
  - `go run main.go -db migrate`
  - `go run main.go`
- Add bloggers and keywords by posting to `/add_bloggers`, `/tags/set_keywords`, and `/tags/cache_keywords`.
- Wait 30 minutes, or call `/task` (local debug environment).
- Let the bullets fly.
- POST `/query_blogs` to show the data.
- If you want to cluster, POST `/tags/keywords`, download the corpus, run `python keywords.py`, adjust as needed, and call `/category/set`.
- GET `/category/query` to show hot topics.
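The POST steps above can be scripted from Go. The sketch below only builds the request so that it runs without a live server. The base URL `http://localhost:8080`, the `bloggers` payload field, and the `newJSONPost` helper are assumptions for illustration: the real address comes from your `application.yml`, the real request body is defined by the controller code, and any authentication the service requires is not shown.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// newJSONPost builds a POST request with a JSON body for the given route.
// It is a hypothetical helper, not part of Go_crawler.
func newJSONPost(baseURL, route string, payload interface{}) (*http.Request, error) {
	body, err := json.Marshal(payload)
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost, baseURL+route, bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

func main() {
	// Hypothetical payload: the real field names are defined by the project.
	payload := map[string][]string{"bloggers": {"example_blogger"}}

	// Assumed address; use the host and port from your own configuration.
	req, err := newJSONPost("http://localhost:8080", "/add_bloggers", payload)
	if err != nil {
		fmt.Println("build request:", err)
		return
	}
	fmt.Println(req.Method, req.URL.String()) // prints POST http://localhost:8080/add_bloggers

	// To actually send it once the service is running:
	//   resp, err := http.DefaultClient.Do(req)
}
```

The same helper works for `/tags/set_keywords`, `/tags/cache_keywords`, and `/query_blogs` by changing the route and payload.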
API | CALL | ROUTER | FUNCTION |
---|---|---|---|
Ping | GET | /ping | ping |
Task | GET | /task | (auto run every 30 minutes) crawler task |
Query_blogs | POST | /query_blogs | query according to different parameters |
Add_bloggers | POST | /add_bloggers | add bloggers to the task list |
Set_category | GET | /category/set | set category by clustering result |
Set_category_name | POST | /category/set_name | rename category |
Query_category | GET | /category/query | query category |
Query_tags | GET | /tags/query | query tags |
Cache_keywords | POST | /tags/cache_keywords | save keywords to redis |
Get_keywords | POST | /tags/keywords | query keywords and write them to a text file for clustering |
Set_keywords | POST | /tags/set_keywords | add keywords as tags |
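The table notes that the crawler task runs automatically every 30 minutes. How the project schedules it is not shown here; one simple way to get that behavior with the standard library is a `time.Ticker` loop, sketched below. The `runEvery` helper and the stop channel are assumptions for this example, and the demo uses a millisecond interval so that it finishes quickly.

```go
package main

import (
	"fmt"
	"time"
)

// runEvery calls task once per interval until stop is closed.
// It is a hypothetical helper, not the project's scheduler.
func runEvery(stop <-chan struct{}, interval time.Duration, task func()) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-stop:
			return
		case <-ticker.C:
			task()
		}
	}
}

func main() {
	stop := make(chan struct{})
	// Stop the demo after a short while; a real service would run until shutdown.
	time.AfterFunc(110*time.Millisecond, func() { close(stop) })

	runs := 0
	// A short interval keeps the demo quick; the schedule described in the
	// table would be 30 * time.Minute.
	runEvery(stop, 20*time.Millisecond, func() {
		runs++
		fmt.Println("crawl run", runs)
	})
}
```

Calling `/task` by hand, as in the local debug workflow, corresponds to invoking the task function once without waiting for the next tick.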