TickerScrape
TickerScrape is a package for scraping financial security ticker data. It is built on Scrapy.
Every publicly traded security in every asset class is scraped to a SQL database using SQLAlchemy. The ORM is configured to create database tables mapping securities to asset classes, countries, industries, and exchanges. It also creates relationships between countries and currencies, as well as between industries and sectors (based on NAICS codes). The securities table has columns for fundamental data, metadata, accounting ratios, and analyst estimates. Country metadata such as ISO 3166 code, continent, territory status, region, economic grouping, and geopolitical grouping is pulled from a local CSV file. The country table also has empty columns for economic data such as GDP. Currency metadata such as symbol, ISO 4217 code, ticker, and minor unit is pulled from a local CSV file. The currency table also has empty columns for economic data such as interest rates.
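The schema described above can be sketched with SQLAlchemy's declarative ORM. The class, table, and column names below are illustrative assumptions for the purpose of the sketch, not necessarily TickerScrape's actual model definitions:

```python
from sqlalchemy import Column, Integer, String, Float, ForeignKey, create_engine
from sqlalchemy.orm import declarative_base, relationship, sessionmaker

Base = declarative_base()

class Currency(Base):
    __tablename__ = 'currencies'
    id = Column(Integer, primary_key=True)
    iso_code = Column(String(3))           # ISO 4217 code, e.g. 'USD'
    symbol = Column(String)
    minor_unit = Column(Integer)
    interest_rate = Column(Float)          # empty until economic data is added
    countries = relationship('Country', back_populates='currency')

class Country(Base):
    __tablename__ = 'countries'
    id = Column(Integer, primary_key=True)
    name = Column(String, unique=True)
    iso_code = Column(String(2))           # ISO 3166 alpha-2 code
    continent = Column(String)
    gdp = Column(Float)                    # empty until economic data is added
    currency_id = Column(Integer, ForeignKey('currencies.id'))
    currency = relationship('Currency', back_populates='countries')

class Security(Base):
    __tablename__ = 'securities'
    id = Column(Integer, primary_key=True)
    ticker = Column(String, index=True)
    asset_class = Column(String)           # e.g. 'stock', 'bond'
    pe_ratio = Column(Float)               # example accounting-ratio column
    country_id = Column(Integer, ForeignKey('countries.id'))
    country = relationship('Country')

# Create the tables (the real project targets a local SQLite file by default)
engine = create_engine('sqlite:///:memory:')
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)
```

Following a foreign key from a security through its country to that country's currency is then a plain attribute access on the mapped objects.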
The repository can be found at: Github-TickerScrape
$ pip install git+https://github.com/Saran33/TickerScrape.git
or
$ git clone https://github.com/Saran33/TickerScrape.git
TickerScrape requires Docker, Splash, and this fork of Aquarium to scrape some websites that render content with JavaScript.
- After pip installing TickerScrape, download Docker at the above link.
- As per the Splash installation docs above, pull the Splash image:
$ sudo docker pull scrapinghub/splash
(Linux)
or
$ docker pull scrapinghub/splash
(OS X)
- Start the container:
$ sudo docker run -it -p 8050:8050 --rm scrapinghub/splash
(Linux)
or
$ docker run -it -p 8050:8050 --rm scrapinghub/splash
(OS X)
(Splash is now available at 0.0.0.0 on port 8050 (http).)
- Alternatively, use the Docker Desktop app. Splash can be found under the 'Images' tab. Hover over it and click 'Run'. In the additional settings, name the container 'splash' and select a port such as 8050. Click 'Run' and switch on the container before running Scrapy; switch it off afterwards.
- In a browser, enter localhost:8050 (or whatever port you chose) and you should see that Splash is working.
The other dependencies will be automatically installed and you can run TickerScrape as normal.
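With the container running, a Scrapy project typically routes requests through Splash via the scrapy-splash plugin, enabled in settings.py. The snippet below is a sketch of the standard scrapy-splash configuration (as documented in that plugin's README); TickerScrape's actual settings.py may differ:

```python
# settings.py -- standard scrapy-splash wiring (illustrative sketch)
SPLASH_URL = 'http://localhost:8050'  # the port the Splash container is mapped to

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
```

If you named the container differently or mapped a different port in Docker Desktop, adjust SPLASH_URL accordingly.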
- Aquarium creates multiple Splash instances behind an HAProxy load balancer, in order to distribute parallel Scrapy requests across a Splash Docker cluster. The instances collaborate to render a given website. This may be necessary to prevent 504 (timeout) errors on some sites. It also speeds up the scraping of JavaScript pages and can facilitate Tor proxies. To install Aquarium, navigate to your home directory and run:
$ cookiecutter gh:Saran33/aquarium
Choose the default settings (or whatever suits), set splash_version to latest, set a user and password, and set Tor to 0.
- a. To start the container (without Aquarium):
$ sudo docker run -it --restart always -p 8050:8050 scrapinghub/splash
(Linux)
or
$ docker run -it --restart always -p 8050:8050 scrapinghub/splash
(OS X)
(Splash is now available at 0.0.0.0 on port 8050 (http).)
- Alternatively, use the Docker Desktop app as described above: run the Splash image from the 'Images' tab, name the container 'splash', and select a port such as 8050.
- In a browser, enter localhost:8050 (or whatever port you chose) and you should see Splash.
- The other dependencies will be installed automatically and you can run TickerScrape as normal.
- b. Or, to start the Splash cluster with Aquarium, go to the new aquarium folder and start the cluster:
$ cd ./aquarium
$ docker-compose up
In a browser, visit http://localhost:8050/ to verify that Splash is working. To see the stats of the cluster, visit http://localhost:8036/
- Navigate to the outer directory of TickerScrape.
- Open a terminal and run:
$ python3 TickerScrape.py
or run a single spider, e.g.:
$ scrapy crawl mw_stocks -a country=us
or launch the GUI:
$ python3 TickerScrape_gui.py
- The default settings save the tickers to a local SQLite database (which can be changed in settings.py). The DB can be read via SQL queries such as:
$ sqlite3 TickerScrape.db
sqlite> .tables
sqlite> .schema stocks
sqlite> .schema bonds
sqlite> select * from stocks limit 3;
sqlite> .quit
Alternatively, the DB can be opened in DB Browser for SQLite, a convenient GUI.
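The same queries can also be run from Python with the standard-library sqlite3 module. The helper below is an illustrative sketch; the database and table names follow the defaults shown above:

```python
import sqlite3

def fetch_head(db_path, table, n=3):
    """Return the first n rows of a table as a list of dicts."""
    con = sqlite3.connect(db_path)
    con.row_factory = sqlite3.Row  # rows become accessible by column name
    try:
        # Table names cannot be bound as parameters, so only interpolate
        # a table name you trust; the row limit is bound normally.
        rows = con.execute(f'SELECT * FROM "{table}" LIMIT ?', (n,)).fetchall()
        return [dict(r) for r in rows]
    finally:
        con.close()

# Usage against the default database produced by the scrape:
# fetch_head('TickerScrape.db', 'stocks')
```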
To save the scraped data to a CSV as well as the DB, run:
$ scrapy crawl marketwatch -o output.csv -t csv