
textractor's Introduction

textractor

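Extracts the title, body text, images, author, publication time, and other information from HTML text. Suited to news-style web pages.

Installation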

    go get github.com/gloomyzerg/textractor

Usage

package main

import (
	"fmt"
	"io"
	"log"
	"net/http"

	"github.com/gloomyzerg/textractor"
)

func main() {
	url := "http://www.xxx.com/xxx"
	resp, err := http.Get(url)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	source, err := io.ReadAll(resp.Body)
	if err != nil {
		log.Fatal(err)
	}

	// This is only an example.
	// textractor.Extract takes an HTML string;
	// obtain that string however suits your needs.
	// For a paginated article, for instance, fetch every page yourself,
	// concatenate the contents, and pass the result in.
	result, _ := textractor.Extract(string(source))
	fmt.Printf("%+v\n", result)
}

Command-line usage

    go get -u github.com/gloomyzerg/textractor/cmd/...
    textractor [url]

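Notes

textractor implements the method described in the paper 《基于文本及符号密度的网页正文提取方法》 ("A Web Page Body-Text Extraction Method Based on Text and Symbol Density"), which is highly accurate on typical Chinese news pages. The paper reports an accuracy above 99%. Because of limited sample availability, however, the author has not tested enough samples to verify that figure independently.
Web page markup varies so widely that no extraction algorithm can cover every page. If you find a page that is not extracted correctly, please leave its URL in an issue so it can be analyzed case by case. The author will do his best to improve the library and cover more pages.

The textractor command-line tool exists to make testing and debugging convenient. It is simply wget + extract and cannot parse dynamic pages generated by JavaScript. For dynamic pages, choose a suitable rendering approach yourself.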
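
To make the idea of "text and symbol density" more concrete, here is a rough sketch of the two per-node ratios as they are commonly described for this method: text density is the non-link text length divided by the non-link tag count, and symbol density is the non-link text length divided by the punctuation count plus one. The `NodeStats` type, the function names, and the sample numbers are all hypothetical, and textractor's actual scoring may differ in its details.

```go
package main

import "fmt"

// NodeStats holds counts gathered for one DOM node and its subtree.
// It is a hypothetical type for illustration, not textractor's own.
type NodeStats struct {
	TextLen      int // characters of text under the node
	LinkTextLen  int // characters of text inside <a> tags
	TagCount     int // number of tags under the node
	LinkTagCount int // number of <a> tags under the node
	PunctCount   int // punctuation marks in the text
}

// TextDensity is the non-link text per non-link tag.
// Long prose wrapped in few tags scores high; navigation bars score low.
func TextDensity(s NodeStats) float64 {
	denom := s.TagCount - s.LinkTagCount
	if denom <= 0 {
		denom = 1 // avoid dividing by zero for link-only nodes
	}
	return float64(s.TextLen-s.LinkTextLen) / float64(denom)
}

// SymbolDensity is the non-link text per punctuation mark (plus one).
func SymbolDensity(s NodeStats) float64 {
	return float64(s.TextLen-s.LinkTextLen) / float64(s.PunctCount+1)
}

func main() {
	// Made-up numbers for an article body and a navigation bar.
	article := NodeStats{TextLen: 1200, LinkTextLen: 40, TagCount: 12, LinkTagCount: 2, PunctCount: 57}
	nav := NodeStats{TextLen: 90, LinkTextLen: 80, TagCount: 22, LinkTagCount: 10, PunctCount: 0}

	fmt.Printf("article: text density %.2f, symbol density %.2f\n",
		TextDensity(article), SymbolDensity(article))
	fmt.Printf("nav:     text density %.2f, symbol density %.2f\n",
		TextDensity(nav), SymbolDensity(nav))
}
```

An extractor built on these ratios would combine them into a score per node and pick the highest-scoring node as the body text.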

Acknowledgements

This project was inspired by github.com/kingname/GeneralNewsExtractor, and its test cases were used as a reference during development and testing.

textractor's People

Contributors

eryx, kwaziidev

textractor's Issues

panic with: runtime: goroutine stack exceeds 1000000000-byte limit

fatal output:

runtime: goroutine stack exceeds 1000000000-byte limit
runtime: sp=0xc022a823a8 stack=[0xc022a82000, 0xc042a82000]
fatal error: stack overflow

runtime stack:
runtime.throw(0x1132e81, 0xe)
	/usr/local/go/src/runtime/panic.go:1117 +0x72
runtime.newstack()
	/usr/local/go/src/runtime/stack.go:1069 +0x7ed
runtime.morestack()
	/usr/local/go/src/runtime/asm_amd64.s:458 +0x8f

goroutine 1031 [running]:
runtime.concatstrings(0x0, 0xc022a82400, 0x2, 0x2, 0x0, 0x0)
	/usr/local/go/src/runtime/string.go:24 +0x2f4 fp=0xc022a823b8 sp=0xc022a823b0 pc=0x453c14
runtime.concatstring2(0x0, 0x0, 0x0, 0x112f107, 0x2, 0xc022a82460, 0x2)
	/usr/local/go/src/runtime/string.go:59 +0x47 fp=0xc022a823f8 sp=0xc022a823b8 pc=0x453c67
github.com/andybalholm/cascadia.(*parser).parseName(0xc022a827d8, 0x0, 0x0, 0x112f110, 0x2)
	/opt/gopath/pkg/mod/github.com/andybalholm/[email protected]/parser.go:134 +0xc5 fp=0xc022a82458 sp=0xc022a823f8 pc=0x714805
github.com/andybalholm/cascadia.(*parser).parseIdentifier(0xc022a827d8, 0x10, 0x7fc249a3ef18, 0x10, 0xc014e7d2f0)
	/opt/gopath/pkg/mod/github.com/andybalholm/[email protected]/parser.go:114 +0x97 fp=0xc022a824c8 sp=0xc022a82458 pc=0x714577
github.com/andybalholm/cascadia.(*parser).parseTypeSelector(0xc022a827d8, 0x0, 0xc014e7d2f0, 0x0, 0x1)
	/opt/gopath/pkg/mod/github.com/andybalholm/[email protected]/parser.go:306 +0x2f fp=0xc022a82520 sp=0xc022a824c8 pc=0x7152af
github.com/andybalholm/cascadia.(*parser).parseSimpleSelectorSequence(0xc022a827d8, 0x203000, 0xc022a82728, 0x7fc21f5bb280, 0x40)
	/opt/gopath/pkg/mod/github.com/andybalholm/[email protected]/parser.go:720 +0x718 fp=0xc022a82690 sp=0xc022a82520 pc=0x718d58
github.com/andybalholm/cascadia.(*parser).parseSelector(0xc022a827d8, 0x0, 0xc014e6dc00, 0x1298070, 0xc014e7d2e0)
	/opt/gopath/pkg/mod/github.com/andybalholm/[email protected]/parser.go:779 +0x3d fp=0xc022a82708 sp=0xc022a82690 pc=0x718f1d
github.com/andybalholm/cascadia.(*parser).parseSelectorGroup(0xc022a827d8, 0x0, 0x8, 0x8, 0xc022a82828, 0x203005)
	/opt/gopath/pkg/mod/github.com/andybalholm/[email protected]/parser.go:820 +0x2f fp=0xc022a82780 sp=0xc022a82708 pc=0x7191af
github.com/andybalholm/cascadia.ParseGroup(0x112f107, 0xb, 0xc014e84600, 0x30, 0x28, 0x103afe0, 0x0)
	/opt/gopath/pkg/mod/github.com/andybalholm/[email protected]/selector.go:67 +0x72 fp=0xc022a82808 sp=0xc022a82780 pc=0x719472
github.com/andybalholm/cascadia.Compile(...)
	/opt/gopath/pkg/mod/github.com/andybalholm/[email protected]/selector.go:10
github.com/PuerkitoBio/goquery.compileMatcher(0x112f107, 0xb, 0xc022a828c0, 0xc0005ee800)
	/opt/gopath/pkg/mod/github.com/!puerkito!bio/[email protected]/type.go:167 +0x39 fp=0xc022a82868 sp=0xc022a82808 pc=0x72d8b9
github.com/PuerkitoBio/goquery.(*Selection).Find(0xc014e84600, 0x112f107, 0xb, 0x0)
	/opt/gopath/pkg/mod/github.com/!puerkito!bio/[email protected]/traversal.go:24 +0x39 fp=0xc022a828d0 sp=0xc022a82868 pc=0x728b59
github.com/gloomyzerg/textractor.findHtag(0xc014e845a0, 0x112f107, 0xb, 0xc014e845d0)
	/opt/gopath/pkg/mod/github.com/gloomyzerg/[email protected]/title.go:37 +0xe5 fp=0xc022a82930 sp=0xc022a828d0 pc=0xea7c45
github.com/gloomyzerg/textractor.findHtag(0xc014e84540, 0x112f107, 0xb, 0xc014e84570)
	/opt/gopath/pkg/mod/github.com/gloomyzerg/[email protected]/title.go:41 +0x145 fp=0xc022a82990 sp=0xc022a82930 pc=0xea7ca5
github.com/gloomyzerg/textractor.findHtag(0xc014e844e0, 0x112f107, 0xb, 0xc014e84510)
	/opt/gopath/pkg/mod/github.com/gloomyzerg/[email protected]/title.go:41 +0x145 fp=0xc022a829f0 sp=0xc022a82990 pc=0xea7ca5
github.com/gloomyzerg/textractor.findHtag(0xc014e84480, 0x112f107, 0xb, 0xc014e844b0)
	/opt/gopath/pkg/mod/github.com/gloomyzerg/[email protected]/title.go:41 +0x145 fp=0xc022a82a50 sp=0xc022a829f0 pc=0xea7ca5
github.com/gloomyzerg/textractor.findHtag(0xc014e84420, 0x112f107, 0xb, 0xc014e84450)
	/opt/gopath/pkg/mod/github.com/gloomyzerg/[email protected]/title.go:41 +0x145 fp=0xc022a82ab0 sp=0xc022a82a50 pc=0xea7ca5
github.com/gloomyzerg/textractor.findHtag(0xc014e843c0, 0x112f107, 0xb, 0xc014e843f0)
	/opt/gopath/pkg/mod/github.com/gloomyzerg/[email protected]/title.go:41 +0x145 fp=0xc022a82b10 sp=0xc022a82ab0 pc=0xea7ca5
github.com/gloomyzerg/textractor.findHtag(0xc014e84360, 0x112f107, 0xb, 0xc014e84390)
	/opt/gopath/pkg/mod/github.com/gloomyzerg/[email protected]/title.go:41 +0x145 fp=0xc022a82b70 sp=0xc022a82b10 pc=0xea7ca5
github.com/gloomyzerg/textractor.findHtag(0xc014e84300, 0x112f107, 0xb, 0xc014e84330)
	/opt/gopath/pkg/mod/github.com/gloomyzerg/[email protected]/title.go:41 +0x145 fp=0xc022a82bd0 sp=0xc022a82b70 pc=0xea7ca5
...
...