- What is Web Scraping:
- Also known as: Screen Scraping, Web Harvesting, Web Crawling, Spiders, Bots.
- An automated program that requests an HTML webpage or DOM meant for humans and parses the displayed data.
- A program that requests and parses any data on the web, especially in an unexpected way.
- Types of Web Scraping:
- A crawler that scans medical patient message boards looking for experiences with drug combinations.
- Automated UI testing of a company's app.
- A bot that interacts with an airline flight search app, monitoring price changes.
- Would a program that monitors price changes using a public API be considered web scraping?
- Fields in Web Scraping:
- Application security.
- Networking.
- Data science.
- Natural language processing.
- Law.
- Data architecture.
    - Being able to look at a website and model its data in your database.
- How the internet works:
- Web Scraping Error Causes:
- You made a programming mistake.
- Your computer's Ethernet is unplugged.
- Your router cannot reach the web server.
- The web server is blocking your IP address.
- There is some programming error on the web server.
- And more.
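Since any of these failures surfaces as an exception, it helps to wrap requests defensively. A minimal sketch using Python's standard-library urllib (the helper name and URL are illustrative, not from the course):

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen

def fetch(url, timeout=10):
    """Return the page body, or None when the request fails for any
    of the reasons above (bad URL, unreachable server, HTTP error)."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.read()
    except HTTPError:
        # The server answered, but with an error status (e.g. 403 if your IP is blocked).
        return None
    except (URLError, ValueError, OSError):
        # DNS failure, unplugged network, refused connection, malformed URL...
        return None

# A malformed URL is caught rather than crashing the scraper:
print(fetch("not a valid url"))  # -> None
```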
- Real-world addresses have layers: Ryan Mitchell, 123 Main Street, Apt 456, Medford, MA 02155.
- Internet Layers:
- Physical layer: Actual electrons on a wire - High/Low voltage.
- Data Link Layer: Frames, MAC addresses, and physical machines on a local network.
- Network Layer: Router to router, Creates network IP addresses.
- Transport Layer: Persistent communication channels - TCP, UDP, ports.
- Session Layer: Open, close, manage sessions - AppleTalk, SCP.
- Presentation Layer: String encoding, encryption/decryption - Object serialization, files, compression.
- Application Layer: HTTP, POST and GET requests, REST APIs
- Think about the Internet as:
- Each request goes through many layers of wrapping and unwrapping to get to its destination and back.
- These requests do not require a web browser.
- Requests can be examined, replicated, and saved.
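At the application layer a request is just text, which is why no browser is required to examine or replicate one. The sketch below assembles the same GET request a browser would send (the host, path, and User-Agent string are illustrative):

```python
# The raw text a client sends when requesting a page; any program that
# can open a TCP socket to port 80 could transmit exactly this.
host = "pythonscraping.com"
path = "/"
raw_request = (
    f"GET {path} HTTP/1.1\r\n"
    f"Host: {host}\r\n"
    "User-Agent: my-scraper/0.1\r\n"  # identifies the client; no browser needed
    "Connection: close\r\n"
    "\r\n"                            # blank line ends the headers
)
print(raw_request)
```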
- Hello World With Scrapy:
- Install scrapy using
pip install scrapy
- run
scrapy startproject [project name]
- Navigate to the spiders folder.
- run
scrapy genspider ietf pythonscraping.com
- run
scrapy runspider ietf.py
- Challenge:
- CSS selectors or xpath selectors.
- //h1
- //div/h1
- //span[@class = "title"]
- /text()
- /@id
- @content: to get the value.
- Crawling a website:
Wikipedia crawling.
\PythonWebScraping_Linkedin\article_scraper\article_scraper\spiders\wikipedia.py
- Recording Data.
- Go to
items.py
- An Item is the type of content you are scraping.
- Wikipedia is a source.
- To run the file:
scrapy runspider wikipedia.py -o articles.csv -t csv -s CLOSESPIDER_PAGECOUNT=10
- The -s flag stands for settings.
- To list all the scraped records run
cat articles.csv
- You can also export the file as .json or .xml.
- Scrapy Settings File:
- You can add any custom settings to the
settings.py
- Structure your scrapers for reusability:
pipelines.py
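A pipeline is a plain class with a `process_item` method that every scraped item passes through, which is what makes the cleanup reusable across spiders. A minimal sketch (the whitespace-stripping logic is illustrative):

```python
class StripWhitespacePipeline:
    """Illustrative pipeline: trims stray whitespace from every string
    field so the same cleanup is shared by all spiders in the project."""

    def process_item(self, item, spider):
        for key, value in item.items():
            if isinstance(value, str):
                item[key] = value.strip()
        return item

# Pipelines receive item-like mappings, so a plain dict works for a quick check:
pipeline = StripWhitespacePipeline()
cleaned = pipeline.process_item({'title': '  Web scraping \n'}, spider=None)
print(cleaned)  # -> {'title': 'Web scraping'}
```

Pipelines are activated by adding the class to the ITEM_PIPELINES dict in settings.py.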
- Challenge:
- scrape different data from different websites.
- Submitting a form:
- HTTP requests: GET, POST.
from scrapy.http import FormRequest
- Finding and Using hidden APIs.
- Sitemaps and robots.txt:
- A text file telling any scraper what it should and should not scrape; nearly every website has one.
- To follow the robots.txt rules automatically, activate
ROBOTSTXT_OBEY = True
in the project's settings.py.
- Challenge:
- Use CNN's sitemap and scrape data into a database.
- Sitemap =
index.html
- Quick scraping.
- Logging in:
login.html
- Browser automation with Selenium:
- First install scrapy-selenium library =>
pip install scrapy-selenium
- Second, download the browser driver file: https://chromedriver.chromium.org/downloads
- Then move this file somewhere memorable
- Interacting with a page:
- Use selenium functions.
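Once installed, scrapy-selenium is wired up through settings.py; per the library's README the configuration looks roughly like this (the driver path lookup is machine-specific, so it is an assumption here):

```python
# settings.py additions for scrapy-selenium.
# Assumption: chromedriver is on your PATH; otherwise point
# SELENIUM_DRIVER_EXECUTABLE_PATH at wherever you moved the downloaded file.
from shutil import which

SELENIUM_DRIVER_NAME = 'chrome'
SELENIUM_DRIVER_EXECUTABLE_PATH = which('chromedriver')
SELENIUM_DRIVER_ARGUMENTS = ['--headless']  # run the browser without a window

DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': 800,
}
```

Spiders then yield `SeleniumRequest` (from scrapy_selenium) instead of a plain `Request`, so pages are rendered in the browser before parsing.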
- Check out:
- Python Automation and Testing for more information about selenium.
- MySQL Essential Training
- Python Data Analysis
- Web Scraping with Python book by Ryan Mitchell.
- Web Scraping With Python