alexleen / scrape-x

Simple .NET library that provides generic web scraping abilities using XPaths.

License: MIT License

C# 100.00%
chsarp dotnet scraper scraping scraping-websites dotnetstandard

scrape-x

Basic features:

  • Fluent interface
  • Pagination
  • Throttling
  • HttpClient injection
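As an illustration of what HttpClient injection makes possible, here is a minimal sketch of a customized `HttpClient` that spaces out its own requests. It uses only standard .NET types. `ThrottlingHandler` and `ThrottledClient` are hypothetical names that are not part of scrape-x, and how a client is handed to the scraper factory is not shown here because the README does not document it (see the wiki). scrape-x also has its own throttling feature, which this sketch does not use.

```csharp
using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper (not part of scrape-x): enforces a minimum
// interval between the start of consecutive outgoing requests.
public sealed class ThrottlingHandler : DelegatingHandler
{
    private readonly TimeSpan _minInterval;
    private readonly SemaphoreSlim _gate = new SemaphoreSlim(1, 1);
    private DateTime _lastRequestUtc = DateTime.MinValue;

    public ThrottlingHandler(TimeSpan minInterval, HttpMessageHandler inner)
        : base(inner)
    {
        _minInterval = minInterval;
    }

    protected override async Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        // Serialize the timing check so concurrent callers queue up.
        await _gate.WaitAsync(cancellationToken).ConfigureAwait(false);
        try
        {
            TimeSpan wait = _lastRequestUtc + _minInterval - DateTime.UtcNow;
            if (wait > TimeSpan.Zero)
            {
                await Task.Delay(wait, cancellationToken).ConfigureAwait(false);
            }
            _lastRequestUtc = DateTime.UtcNow;
        }
        finally
        {
            _gate.Release();
        }

        return await base.SendAsync(request, cancellationToken).ConfigureAwait(false);
    }
}

public static class ThrottledClient
{
    // Builds an HttpClient whose requests pass through the throttling handler.
    public static HttpClient Create(TimeSpan minInterval)
    {
        return new HttpClient(new ThrottlingHandler(minInterval, new HttpClientHandler()));
    }
}
```

A client built with `ThrottledClient.Create(TimeSpan.FromSeconds(2))` would wait at least two seconds between requests, regardless of which component issues them.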

Wiki

For how-tos, examples, and documentation, please see the wiki.

Example Usage

private static void Main(string[] args)
{
    IScraperFactory scraperFactory = new ScraperFactory();

    //Set up a new scraper to scrape Austin's craigslist
    IPaginatingScraper scraper = scraperFactory.CreatePaginatingScraper("https://austin.craigslist.org");

    //Set the URL for the results page. In this case, "apts/housing for rent".
    scraper.SetResultsStartPage("/search/apa")
    
           //Set the XPath for search result nodes
           .SetIndividualResultNodeXPath("//*[@id=\"sortable-results\"]/ul/li")
           
           //Sets the XPath for search result links relative to result node
           .SetIndividualResultLinkXPath("a/@href")
           
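           //Sets a predicate that decides whether an individual result should be visited.
           //The second argument is the XPath, relative to the result node, whose text is passed to the predicate.
           //In this case, results are only visited if their "housing" span contains "1br".
           //Skipping the other results saves considerable bandwidth.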
           .SetResultVisitPredicate(housing => housing.Contains("1br"), "p/span[2]/span[2]")
           
           //Sets "Next" button link XPath
           .SetNextLinkXPath("//*[@id=\"searchform\"]/div[3]/div[3]/span[2]/a[3]/@href")
           
           //Sets XPaths used for retrieving data from the target page.
           //Keys are used to identify the data in the callback to the Go method.
           .SetTargetPageXPaths(new Dictionary<string, string>
           {
               { "latitude", "//*[@id=\"map\"]/@data-latitude" },
               { "longitude", "//*[@id=\"map\"]/@data-longitude" },
               { "price", "/html/body/section/section/h2/span[2]/span[1]" },
               { "br", "/html/body/section/section/section/div[1]/p[1]/span[1]/b[1]" },
               { "sqft", "/html/body/section/section/section/div[1]/p[1]/span[2]/b" }
           })
           
           //Go!
           //Every time a target page is scraped, this callback is called.
           .Go(OnResultRetrieved);
}

private static void OnResultRetrieved(string link, IDictionary<string, string> results)
{
    //Do something with the results...
    Console.WriteLine(results["br"]);
}
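The callback receives every value as a string, so typed data has to be parsed by the caller. The sketch below shows one way to do that for the "price", "latitude", and "longitude" keys from the example. `ListingParser` is a hypothetical helper, not part of scrape-x. It assumes US-style price text such as "$1,200", and it treats a missing key and an unparseable value the same way because the README does not say which one the library produces when an XPath matches nothing.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical helper (not part of scrape-x) for turning scraped strings into typed values.
public static class ListingParser
{
    // Returns null when "price" is absent or is not a US-style currency string.
    public static decimal? ParsePrice(IDictionary<string, string> results)
    {
        if (results.TryGetValue("price", out string raw)
            && decimal.TryParse(raw, NumberStyles.Currency,
                                CultureInfo.GetCultureInfo("en-US"), out decimal price))
        {
            return price;
        }
        return null;
    }

    // Returns null when the coordinate is absent or not a plain decimal number.
    public static double? ParseCoordinate(IDictionary<string, string> results, string key)
    {
        if (results.TryGetValue(key, out string raw)
            && double.TryParse(raw, NumberStyles.Float,
                               CultureInfo.InvariantCulture, out double value))
        {
            return value;
        }
        return null;
    }
}
```

Inside `OnResultRetrieved`, a call such as `ListingParser.ParsePrice(results)` could then replace direct indexing into the dictionary.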

Thanks!

JetBrains Rider
AppVeyor
HtmlAgilityPack
SonarCloud

scrape-x's Issues

Scrape Result Page Itself

Some result pages contain all needed information without having to visit a target link. This is currently not supported (as of v1.0.0). Let's support it.
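To make the request concrete, here is a rough sketch of the idea: evaluate a set of keyed XPaths relative to each result node instead of following its link. This is not scrape-x code and not a proposed API. `ResultPageSketch.Extract` is a hypothetical name, and the sketch uses `System.Xml.XPath` on well-formed markup to stay dependency-free, whereas real-world HTML would need a tolerant parser. It only handles XPaths that select elements, not attributes.

```csharp
using System.Collections.Generic;
using System.Xml.Linq;
using System.Xml.XPath;

// Hypothetical sketch (not part of scrape-x): pull data straight from result nodes.
public static class ResultPageSketch
{
    public static List<IDictionary<string, string>> Extract(
        string xhtml,
        string resultNodeXPath,
        IDictionary<string, string> relativeXPaths)
    {
        var rows = new List<IDictionary<string, string>>();
        XDocument doc = XDocument.Parse(xhtml); // requires well-formed markup

        foreach (XElement node in doc.XPathSelectElements(resultNodeXPath))
        {
            var row = new Dictionary<string, string>();
            foreach (KeyValuePair<string, string> pair in relativeXPaths)
            {
                // Each XPath is evaluated relative to the current result node.
                XElement match = node.XPathSelectElement(pair.Value);
                if (match != null)
                {
                    row[pair.Key] = match.Value.Trim();
                }
            }
            rows.Add(row);
        }
        return rows;
    }
}
```

Keys whose XPath matches nothing are simply left out of that row, so callers would check for presence before reading a value.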

Async Support

It would be nice if the Go method returned a Task and supported cancellation.
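Until something like that exists, a caller could approximate it by running the blocking call on a worker thread. The sketch below is one hedged way to do so. `AsyncScrape.GoAsync` is a hypothetical helper, and its `run` parameter stands in for a blocking call such as `callback => scraper.Go(callback)`. It assumes exceptions thrown inside the callback propagate out of the blocking call, which the README does not confirm. Cancellation is only observed between results, so a request already in flight is not interrupted.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical wrapper (not part of scrape-x) around a blocking, callback-based scrape.
public static class AsyncScrape
{
    public static Task GoAsync(
        Action<Action<string, IDictionary<string, string>>> run,
        Action<string, IDictionary<string, string>> onResult,
        CancellationToken token)
    {
        return Task.Run(() => run((link, results) =>
        {
            // Checked once per scraped result; throwing here aborts the run.
            token.ThrowIfCancellationRequested();
            onResult(link, results);
        }), token);
    }
}
```

Awaiting the returned task throws an `OperationCanceledException` if the token is cancelled while the scrape is running. This ties up a thread-pool thread for the duration of the scrape, so it is a stopgap rather than true asynchronous I/O.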
