Coder Social home page Coder Social logo

scrapeth's Introduction

Scraping Tsinghua News

This is the companion repo for a short guest lecture/workshop I did on web scraping for Professor MENG Tianguang's class at Tsinghua University. The goal was to get the students started on scraping information from web pages (Tsinghua University's news site in this case) using Python and the requests and lxml packages, with a focus on generating XPATH expressions. Some of the students were somewhat new to Python and perhaps programming languages in general, so I included a short bit on working in Python with the basics required to follow along with the workshop. The Jupyter notebooks were used to demonstrate basic Python usage and some basic web scraping techniques. The companion slides are found here. I walked through with students how to get article titles, URLs, summaries, dates, and associated images from the Tsinghua news site into a CSV file.

Getting Started

The main parts of this repo are the Jupyter notebooks.

Prerequisites

  • Python 3 (was not tested in Python 2)
  • requests
  • lxml
  • selenium
  • Firefox (hard-coded in but can be modified)
  • Jupyter (for modifying the notebooks)

Installing

After cloning the repo, you should be good to go.

Usage

The main goal of this project was to get beginner students started on scraping using Tsinghua's news site as the example. Therefore, functionality is fairly limited to just getting the first page of articles from Tsinghua's news site.

First, import the module.

import scrape_thnews

To use requests to get the page:

requests_results = scrape_thnews.basic_scrape()

The alternative is to use selenium and Firefox to get the page

selenium_results = scrape_thnews.basic_scrape(use_selenium=True)

Note that there was originally functionality to count the number of views each news item received. However, when I checked on October 29 2017, the views appear to be gone. An original page from Fall 2016 is included in the repo as thNews.html.

There is also a function to parse a news article itself, given its URL.

from scrape_thnews import parse_th_article

an_article_url = 'http://news.tsinghua.edu.cn/publish/thunews/9648/2016/20161125111440926399642/20161125111440926399642_.html'
an_article_data = parse_th_article(an_article_url)

Ostensibly, it will find the full article text, links to any papers and other materials referenced in the news (the news site generally announces research produced by its faculty), the editors of the news item, the news "provider" (generally what department the finding is from), and images inside the article. I did not test this function rigorously, so there is no guarantee that this function will be able to parse every Tsinghua news article.

scrapeth's People

Contributors

brtsay avatar

Watchers

James Cloos avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.