This project forked from dotnetcore/dotnetspider

DotnetSpider is a .NET Standard web crawling library similar to WebMagic and Scrapy. It is a lightweight, efficient, and fast high-level web crawling and scraping framework for .NET.

License: MIT License

Batchfile 0.15% Shell 0.04% C# 77.50% HTML 22.24% CSS 0.06% JavaScript 0.02%

dotnetspider's Introduction

DotnetSpider

DESIGN

DEVELOPMENT ENVIRONMENT

  • Visual Studio 2017 (15.3 or later)
  • .NET Core 2.0 or later
  • MySQL, if you want to store data in MySQL. Download MySQL, then grant privileges to the local root user:
    grant all on *.* to 'root'@'localhost' IDENTIFIED BY '' with grant option;
    flush privileges;

OPTIONAL ENVIRONMENT

MORE DOCUMENTS

https://github.com/dotnetcore/DotnetSpider/wiki

SAMPLES

Please see the DotnetSpider.Sample project in the solution.

BASIC USAGE

Basic usage code

ADDITIONAL USAGE: Configurable Entity Spider

View the complete code

public class EntityModelSpider
{
	public static void Run()
	{
		Spider spider = new Spider();
		spider.Run();
	}

	private class Spider : EntitySpider
	{
		protected override void OnInit(params string[] arguments)
		{
			// Search keywords ("cola|sprite"); the same value is passed along under the "Keyword" key.
			var word = "可乐|雪碧";
			AddRequest(string.Format("http://news.baidu.com/ns?word={0}&tn=news&from=news&cl=2&pn=0&rn=20&ct=1", word), new Dictionary<string, dynamic> { { "Keyword", word } });
			AddEntityType<BaiduSearchEntry>();
			AddPipeline(new ConsoleEntityPipeline());
		}

		[Schema("baidu", "baidu_search_entity_model")]
		[Entity(Expression = ".//div[@class='result']", Type = SelectorType.XPath)]
		class BaiduSearchEntry : BaseEntity
		{
			[Column]
			[Field(Expression = "Keyword", Type = SelectorType.Enviroment)]
			public string Keyword { get; set; }

			[Column]
			[Field(Expression = ".//h3[@class='c-title']/a")]
			[ReplaceFormatter(NewValue = "", OldValue = "<em>")]
			[ReplaceFormatter(NewValue = "", OldValue = "</em>")]
			public string Title { get; set; }

			[Column]
			[Field(Expression = ".//h3[@class='c-title']/a/@href")]
			public string Url { get; set; }

			[Column]
			[Field(Expression = ".//div/p[@class='c-author']/text()")]
			[ReplaceFormatter(NewValue = "-", OldValue = "&nbsp;")]
			public string Website { get; set; }

			[Column]
			[Field(Expression = ".//div/span/a[@class='c-cache']/@href")]
			public string Snapshot { get; set; }

			[Column]
			[Field(Expression = ".//div[@class='c-summary c-row ']", Option = FieldOptions.InnerText)]
			[ReplaceFormatter(NewValue = "", OldValue = "<em>")]
			[ReplaceFormatter(NewValue = "", OldValue = "</em>")]
			[ReplaceFormatter(NewValue = " ", OldValue = "&nbsp;")]
			public string Details { get; set; }

			[Column(Length = 0)]
			[Field(Expression = ".", Option = FieldOptions.InnerText)]
			[ReplaceFormatter(NewValue = "", OldValue = "<em>")]
			[ReplaceFormatter(NewValue = "", OldValue = "</em>")]
			[ReplaceFormatter(NewValue = " ", OldValue = "&nbsp;")]
			public string PlainText { get; set; }
		}
	}
}

public static void Main()
{
	EntityModelSpider.Run();
}
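
The stacked ReplaceFormatter attributes in the entity above can be read as a series of plain string replacements on the extracted value. The following is a minimal sketch of that idea only: `FormatterSketch` and `ApplyReplacements` are hypothetical names, not part of DotnetSpider, and it assumes each formatter simply swaps `OldValue` for `NewValue`. The framework applies its own formatters internally, and the order in which it does so is not stated here (for the rules in this sample the order does not change the result).

```csharp
using System;

// Hypothetical stand-in for a chain of replace rules; this is NOT DotnetSpider code.
public static class FormatterSketch
{
    // Applies each (OldValue, NewValue) pair to the input, one after another.
    public static string ApplyReplacements(string input, params (string OldValue, string NewValue)[] rules)
    {
        var result = input;
        foreach (var rule in rules)
        {
            // Plain, case-sensitive substring replacement.
            result = result.Replace(rule.OldValue, rule.NewValue);
        }
        return result;
    }
}
```

For example, `FormatterSketch.ApplyReplacements("<em>Cola</em>&nbsp;news", ("<em>", ""), ("</em>", ""), ("&nbsp;", " "))` returns `Cola news`, which mirrors the three rules declared on the Details property.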

Run via Startup

Command: -s:[spider type name | TaskName attribute] -i:[identity] -a:[arg1,arg2...] --tid:[taskId] -n:[name] -c:[configuration file path or name]
  1. -s: Type name of the spider, or the name set by its TaskName attribute, for example: DotnetSpider.Sample.BaiduSearchSpider
  2. -i: Set the identity.
  3. -a: Pass comma-separated arguments to the spider's Run method.
  4. --tid: Set the task id.
  5. -n: Set the name.
  6. -c: Set the configuration file path or name. For example, to run with a customized config: -c:app.my.config
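
To make the option format above concrete, here is a minimal sketch of how arguments shaped like `-key:value` could be split into pairs. `StartupArgs`, `Parse`, and `SplitSpiderArguments` are hypothetical names introduced for illustration. This is not DotnetSpider's own Startup parser, whose behavior may differ.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper illustrating the "-key:value" option shape; NOT part of DotnetSpider.
public static class StartupArgs
{
    // Splits arguments such as "-s:MySpider" or "--tid:42" into key/value pairs.
    public static Dictionary<string, string> Parse(string[] args)
    {
        var options = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
        foreach (var arg in args)
        {
            // Skip anything that does not start with a dash.
            if (string.IsNullOrEmpty(arg) || arg[0] != '-') continue;

            // Split on the first colon only, so values may themselves contain colons.
            int colon = arg.IndexOf(':');
            if (colon < 0) continue;

            string key = arg.Substring(0, colon).TrimStart('-');
            string value = arg.Substring(colon + 1);
            options[key] = value;
        }
        return options;
    }

    // The -a option carries a comma-separated list of spider arguments.
    public static string[] SplitSpiderArguments(Dictionary<string, string> options)
    {
        return options.TryGetValue("a", out var raw) && raw.Length > 0
            ? raw.Split(',')
            : new string[0];
    }
}
```

Calling `StartupArgs.Parse(new[] { "-s:DotnetSpider.Sample.BaiduSearchSpider", "--tid:42", "-a:x,y" })` yields the keys `s`, `tid`, and `a`. Splitting on the first colon keeps a value such as a Windows path in `-c:C:\conf\app.my.config` intact.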

WebDriver Support

To crawl a page whose content is loaded by JavaScript, you only need to set the downloader to WebDriverDownloader.

Downloader = new WebDriverDownloader(Browser.Chrome);

See a complete sample

NOTE:

  1. To use Chrome, make sure ChromeDriver.exe is in the bin folder. You can add it to your project through the NuGet package manager: Chromium.ChromeDriver
  2. To use Firefox, make sure you have already added a *.webdriver Firefox profile: https://support.mozilla.org/en-US/kb/profile-manager-create-and-remove-firefox-profiles
  3. To use PhantomJS, make sure PhantomJS.exe is in the bin folder. You can add it to your project through the NuGet package manager: PhantomJS

Store logs and status in a database

DotnetSpider.Hub

https://github.com/zlzforever/DotnetSpider.Hub

  1. Depends on a CI platform; for example, I currently use TeamCity.
  2. Depends on Scheduler.NET: https://github.com/zlzforever/Scheduler.NET
  3. More documentation to come...

NOTICE

When you use the Redis scheduler, please update your Redis config:

timeout 0
tcp-keepalive 60

Buy me a coffee

AREAS FOR IMPROVEMENTS

QQ Group: 477731655 Email: [email protected]

