
# Web Crawler Learning

## Login

Use a web crawler to log into <http://quotes.toscrape.com/login>:

```shell
cd login_spider
scrapy crawl login
```

After running the command above, a browser window should appear. In the top right corner, the Login button has become a Logout button, i.e. the spider is now logged in.
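
The spider's source isn't reproduced in this README; the standard Scrapy pattern for this login page looks roughly like the sketch below, where `open_in_browser` is what makes the browser window appear (the actual spider in `login_spider` may differ):

```python
import scrapy
from scrapy.http import FormRequest
from scrapy.utils.response import open_in_browser

class LoginSpider(scrapy.Spider):
    # a sketch; the actual spider in login_spider/ may differ
    name = 'login'
    start_urls = ['http://quotes.toscrape.com/login']

    def parse(self, response):
        # from_response copies the form's hidden fields (e.g. the CSRF
        # token) automatically; this demo site accepts any username/password
        return FormRequest.from_response(
            response,
            formdata={'username': 'user', 'password': 'pass'},
            callback=self.after_login,
        )

    def after_login(self, response):
        # pop the logged-in page open in a browser to check the Logout button
        open_in_browser(response)
```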

## Search Google Automatically with Selenium

* Make sure Selenium is installed for the current Python interpreter.
* Make sure ChromeDriver is downloaded and placed in a directory on your PATH.
```shell
cd seleniumDemo
python searchGoogle.py
```
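
The script itself isn't reproduced here, but the core of a Google search with Selenium 4 looks roughly like this (the query string is just an example):

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

# assumes chromedriver is on your PATH (Selenium 4 syntax)
driver = webdriver.Chrome()
driver.get('https://www.google.com')

# type a query into the search box (named "q") and press Enter
search_box = driver.find_element(By.NAME, 'q')
search_box.send_keys('web crawler')
search_box.send_keys(Keys.RETURN)

# ... inspect driver.page_source or the result elements here ...
driver.quit()
```
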
## Sign into U of T Acorn and Enroll in Courses Automatically with Selenium

```shell
cd acorn_login
# make sure a password.py file is added that includes your Acorn username and password
python loginAcorn.py
```


## Download 2000 Chapters of a Novel

* Navigate to the `download_novel` directory. There are two versions of the code that do the same job:
  one uses iteration and the other uses recursion. Personally, I believe the iterative version saves
  memory, since recursion keeps one stack frame alive per chapter; see the sketch below.
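
A language-level sketch of the two styles, with hypothetical `fetch_chapter` and `save_chapter` helpers rather than the repo's actual code:

```python
# hypothetical helpers: fetch_chapter(url) returns (text, next_url or None),
# save_chapter(text) writes the chapter to disk

def download_iterative(url):
    # constant stack depth: one loop frame no matter how many chapters
    while url is not None:
        text, url = fetch_chapter(url)
        save_chapter(text)

def download_recursive(url):
    # one stack frame per chapter: 2000 chapters would exceed Python's
    # default recursion limit of 1000 unless sys.setrecursionlimit is raised
    if url is None:
        return
    text, next_url = fetch_chapter(url)
    save_chapter(text)
    download_recursive(next_url)
```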


## Retrieve Data From <http://books.toscrape.com/>, Part 1

There are 1000 books on this website, divided across 50 pages. The crawler goes to each page, takes the URL of each book, follows it to that book's page to retrieve its detailed information, and then moves on to the next book on the page.

After a page is explored, the crawler gets the URL behind the `next page` button, goes to the next page, and repeats the scraping process until all books are retrieved.

Add the `-o` flag to specify where to export the retrieved data.

```shell
cd books_crawler2
scrapy crawl booksData -o data.csv
```

Eventually a `data.csv` file is generated, holding the information for every book. The CSV file can be opened with Excel for a better view of the data.
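
The crawl pattern described above maps onto two callbacks. A minimal sketch, with selectors that should match books.toscrape.com (the actual spider in `books_crawler2` may differ):

```python
import scrapy

class BooksDataSpider(scrapy.Spider):
    name = 'booksData'
    start_urls = ['http://books.toscrape.com/']

    def parse(self, response):
        # follow every book link on the current listing page
        for href in response.css('article.product_pod h3 a::attr(href)').getall():
            yield response.follow(href, callback=self.parse_book)
        # then follow the "next page" button until there is no next page
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

    def parse_book(self, response):
        # scrape the detail page of a single book
        yield {
            'title': response.css('div.product_main h1::text').get(),
            'price': response.css('p.price_color::text').get(),
        }
```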

## Retrieve Data From <http://books.toscrape.com/>, Part 2

```shell
cd books_crawler2
scrapy crawl booksData2 -a category="http://books.toscrape.com/catalogue/category/books/philosophy_7/index.html"
# the link is taken from books.toscrape.com by clicking one of the categories
```

When we only want to scrape books of one specific category, we can set `start_urls` manually in an OOP way:

```python
from scrapy import Spider

class Booksdata2Spider(Spider):
    name = 'booksData2'
    allowed_domains = ['books.toscrape.com']

    def __init__(self, category, **kwargs):
        super().__init__(**kwargs)
        # the category URL is passed from the command line via -a category=...
        self.start_urls = [category]
```

## Retrieve Data From <http://books.toscrape.com/>, Part 3: Close Function

* `close` function: a function that is executed when the scraping process finishes.

```shell
cd books_crawler2
scrapy crawl booksData2 -a category="http://books.toscrape.com/catalogue/category/books/philosophy_7/index.html" -o items.csv
```

Normally, the exported file is saved as `items.csv`.

With the `close` function, it can be renamed:

```python
import os, glob

def close(self, reason):
    # pick the most recently created CSV file in the working directory
    csv_file = max(glob.iglob('*.csv'), key=os.path.getctime)
    # rename it to something more descriptive
    os.rename(csv_file, 'newBooksItems.csv')
```

## Using items.py

```shell
cd sample_items_spider
scrapy crawl sample_items_spider
```

The code above retrieves authors' names from quotes.toscrape.com.

Instead of yielding plain dictionaries as in the previous spiders, this one declares its fields in `items.py` and yields Item objects.
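
A minimal sketch of the `items.py` approach, with a hypothetical `AuthorItem` (the real field names in `sample_items_spider` may differ):

```python
# items.py: declare the fields once, instead of yielding ad-hoc dicts
import scrapy

class AuthorItem(scrapy.Item):
    name = scrapy.Field()

# in the spider's parse method:
#     item = AuthorItem()
#     item['name'] = quote.css('small.author::text').get()
#     yield item
```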

## Export Excel

For details, see ./Note/Lec7_Export

```shell
cd excel_export_demo
scrapy crawl export_excel_demo -o items.csv
```

Save the data as CSV first, then convert the CSV to XLSX (Excel).
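
The conversion step isn't shown in this README; one common way is pandas with openpyxl as the Excel writer, assuming the export above produced `items.csv`:

```python
import pandas as pd

# read the Scrapy CSV export and re-save it as an Excel workbook
# (openpyxl must be installed for pandas to write .xlsx)
pd.read_csv('items.csv').to_excel('items.xlsx', index=False)
```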

## Download Images

See ./download_image/README.md for details

```shell
cd download_image
sudo scrapy crawl books > ./log.txt
```

Images will be stored in `./download_image/books_crawler/books_crawler/downloaded_images`.

Check `log.txt` for the exact location.
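
The download itself is typically handled by Scrapy's built-in `ImagesPipeline`. A minimal `settings.py` sketch, assuming the standard pipeline (the repo's actual configuration is described in its own README):

```python
# settings.py: enable Scrapy's built-in image pipeline (requires Pillow)
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}
# directory where downloaded images are written
IMAGES_STORE = 'downloaded_images'
```

Items then carry the URLs to fetch in an `image_urls` field, and the pipeline fills in a matching `images` field with the results.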

## Store Data in Database

For details, see Lec9 in Notes.

* SQL:

  ```shell
  cd books_crawler2
  scrapy crawl booksData4_SQL -o items.csv > log.txt
  ```

  * Make sure mysql-server is installed locally and the service is started.
  * Data will be stored in a MySQL database.

* MongoDB:

  ```shell
  cd booksCrawler_MongoDB
  scrapy crawl booksData4_SQL -o items.csv > log.txt
  ```

  * Make sure the mongod service is started locally.
  * Data will be stored in MongoDB.
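
Neither pipeline is reproduced here, but the standard Scrapy pattern for the MongoDB case is an item pipeline built on pymongo. A minimal sketch, with hypothetical database and collection names (`books_db`, `books`):

```python
# pipelines.py: write each scraped item into MongoDB via pymongo
import pymongo

class MongoPipeline:
    def open_spider(self, spider):
        # connect when the spider starts; names here are hypothetical
        self.client = pymongo.MongoClient('mongodb://localhost:27017')
        self.collection = self.client['books_db']['books']

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # store the item as a plain document and pass it along
        self.collection.insert_one(dict(item))
        return item
```

As with the image pipeline above, it must be registered in `ITEM_PIPELINES` in `settings.py`.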

## Scrapy User Agent

```shell
scrapy shell 'https://www.amazon.ca/gp/profile/amzn1.account.AERSRZ2IKWWLCTLHRZKEW4SXX23Q/ref=cm_cr_arp_d_gw_rgt?ie=UTF8'
# the above won't give a valid response; inside the shell, try
view(response)  # this page will be blank
```

Go to <https://www.whatismybrowser.com/detect/what-is-my-user-agent> to find out your user agent.

```shell
scrapy shell 'https://www.amazon.ca/gp/profile/amzn1.account.AERSRZ2IKWWLCTLHRZKEW4SXX23Q/ref=cm_cr_arp_d_gw_rgt?ie=UTF8' -s USER_AGENT="paste user agent here"  # -s stands for setting
```

## Scrape Tables

Scrape a table from <https://en.wikipedia.org/wiki/List_of_United_States_cities_by_population>:

```shell
cd scrapeTable
scrapy crawl wiki -o output.json
```
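
The spider itself isn't shown here; a minimal sketch of a `parse` callback that walks such a table, assuming the standard `wikitable` class (the real `wiki` spider likely extracts named columns):

```python
def parse(self, response):
    # walk the rows of the first "wikitable" on the page, skipping the header
    for row in response.xpath('(//table[contains(@class, "wikitable")])[1]//tr')[1:]:
        # collect the visible text of each cell in the row
        cells = [c.strip() for c in row.xpath('.//td//text()').getall() if c.strip()]
        if cells:
            yield {'cells': cells}
```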

## Scrape JSON

```shell
cd scrapeJSON
scrapy crawl tweets -o output.csv
# EASY
```
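
When an endpoint returns JSON instead of HTML, the response body can be parsed directly. A minimal sketch of a `parse` callback, with a hypothetical `text` field to be matched against the actual payload:

```python
import json

def parse(self, response):
    # the response body is JSON, not HTML, so parse it directly
    data = json.loads(response.text)
    for entry in data:
        # 'text' is a hypothetical field; match it to the real API payload
        yield {'text': entry.get('text')}
```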
