strykez / fastscrape

A simple web scraper built with Python and BeautifulSoup.

License: MIT License

Python 100.00%
beautifulsoup python scraper scraping scraping-python scraping-web scraping-websites web-scraping python-scraping python-web-scraper

FastScrape

A simple web scraper written in Python and BeautifulSoup.


It is meant to be run in the terminal as a command rather than a standalone script.


⚠️ Note that I am not responsible for any misuse of this script. ⚠️


Installation

If you don't have git:

Click the Code button, download the zip, then extract it.


If you have git:

Use the following command in the terminal/command line:

git clone https://github.com/Strykez/fastscrape.git 

Usage

Now you can run the script with this command:

./main.py

To make it run as a command, do the following:


For Linux users:

Open the terminal in the folder which contains the main.py script and type the following:

cp ./main.py fastscrape
chmod +x fastscrape
# Writing to /bin usually requires root privileges
sudo mv fastscrape /bin
fastscrape

You can replace fastscrape with any name you want for the command.


For Windows users:

  • Make a bin folder inside your user folder
  • Copy the main.py script into it and remove its extension
  • Rename the main file to the name you want the command to have (in this case, fastscrape)
  • Type path in the Windows search bar and hit Enter
  • Add the bin folder to your PATH


Commands

███████╗ █████╗ ███████╗████████╗███████╗ ██████╗██████╗  █████╗ ██████╗ ███████╗
██╔════╝██╔══██╗██╔════╝╚══██╔══╝██╔════╝██╔════╝██╔══██╗██╔══██╗██╔══██╗██╔════╝
█████╗  ███████║███████╗   ██║   ███████╗██║     ██████╔╝███████║██████╔╝█████╗  
██╔══╝  ██╔══██║╚════██║   ██║   ╚════██║██║     ██╔══██╗██╔══██║██╔═══╝ ██╔══╝  
██║     ██║  ██║███████║   ██║   ███████║╚██████╗██║  ██║██║  ██║██║     ███████╗
╚═╝     ╚═╝  ╚═╝╚══════╝   ╚═╝   ╚══════╝ ╚═════╝╚═╝  ╚═╝╚═╝  ╚═╝╚═╝     ╚══════╝
                                                                                    V 0.7
                                                                                Made by Strykez
options:
  -h, --help,          show this help message and exits
  -m, -man, --manual

  -u, --url             sets the script's URL

  -s, --selector        the selector string used in the script
                        
                        selector format --> Column_name:selector.class/another_selector.another_class

                        If column name is empty, it will append to the current column, else it will create a new column and append
                        the data to it

                        Examples: Titles:div.card/div.first_half/p.title --> Gets all the instances of p.title in the specified path
                                  Titles:p.title --> Gets all the instances of p.title in the page
                                  p.title --> If you do not want a column name

  -o, --output          the path where you want the results to be saved in .csv format (creates the directory/ies if necessary)
                        if left blank it will print the selected elements to the terminal
  
  -v, --verbose         displays more information about the steps performed in the script
                        NOTE: Put the verbose argument as the last argument because putting it ahead can make the script crash

Notes:

  • The script requires a valid URL and a valid selector to work.
  • The verbose argument must be put last in order for the command to work.
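
The selector format described in the help text can be illustrated with a small sketch. This is not FastScrape's actual code: `parse_selector` and `to_css` are hypothetical helper names, and treating `/` as a CSS descendant step is an assumption based on the "specified path" wording above.

```python
def parse_selector(spec):
    """Split 'Column_name:tag.class/tag.class' into (column, steps).

    Hypothetical helper: the real script may parse this differently.
    """
    column, sep, path = spec.partition(":")
    if not sep:
        # No colon at all, so the whole string is the selector path
        column, path = None, spec
    steps = []
    for part in path.split("/"):
        tag, _, css_class = part.partition(".")
        steps.append((tag, css_class or None))
    return column or None, steps


def to_css(steps):
    """Join steps into a CSS descendant selector (assumed meaning of '/')."""
    return " ".join(tag + ("." + cls if cls else "") for tag, cls in steps)


column, steps = parse_selector("Titles:div.card/div.first_half/p.title")
css = to_css(steps)
```

A string built this way could be passed to BeautifulSoup's `select()` method, which accepts CSS selectors.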

Features

  • If no output argument is given, it will print the selected elements to the console
  • You can give a specific path as an output argument, such as: Desktop/myfolder/results.csv
  • You can give a specific path as a selector argument. For example: -s div.product_container/div.desc/p
  • You can add columns to the .csv file to make it more easily readable in Excel. Example: -s Price:div.product_info/p.price
  • If the path does not exist, the program will create it
  • If no Column Name is detected in the selector, it will append the result to the last column created
  • If a column name is given and the .csv file already contains other columns, it will write the result to a new column
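
The column behaviour listed above can be sketched with the standard library alone. This is a minimal illustration, not the script's implementation: `append_to_csv` is a hypothetical name, and padding shorter columns with empty cells and using an empty header for an unnamed first column are both assumptions.

```python
import csv
import os
from itertools import zip_longest


def append_to_csv(path, values, column=None):
    """Append values to the last column, or add a new named column."""
    columns = []  # [header, cells] pairs in file order
    if os.path.exists(path):
        with open(path, newline="", encoding="utf-8") as fh:
            rows = list(csv.reader(fh))
        if rows:
            header, body = rows[0], rows[1:]
            for i, name in enumerate(header):
                cells = [row[i] if i < len(row) else "" for row in body]
                # Drop the padding added when columns differ in length
                while cells and cells[-1] == "":
                    cells.pop()
                columns.append([name, cells])

    if column is not None or not columns:
        # A column name (or an empty file) starts a new column
        columns.append([column or "", list(values)])
    else:
        # No column name: extend the most recently created column
        columns[-1][1].extend(values)

    folder = os.path.dirname(path)
    if folder:
        os.makedirs(folder, exist_ok=True)  # create missing directories
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow([name for name, _ in columns])
        writer.writerows(
            zip_longest(*(cells for _, cells in columns), fillvalue="")
        )
```

One limitation of this sketch: because trailing empty cells are treated as padding, a genuinely empty value at the end of a column is lost on the next write.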

Example usage

These examples use the Quotes to Scrape website (quotes.toscrape.com) as a dummy target.


Extracting all the elements with a specific selector and class (in this example all quotes) and outputting into a folder:

./main.py --url https://quotes.toscrape.com/ --selector span.text -o Desktop/some_folder/quotes.csv

Extracting all the elements from a specific path:

./main.py --url https://quotes.toscrape.com/ --selector div.col-md-8/div.quote/span.text -o Desktop/some_folder/quotes.csv

Outputting the quotes in an Excel-friendly column format:

./main.py --url https://quotes.toscrape.com/ --selector Quotes:div.col-md-8/div.quote/span.text -o Desktop/some_folder/quotes.csv

Appending more quotes to the Quotes column:

./main.py --url https://quotes.toscrape.com/page/2/ --selector div.col-md-8/div.quote/span.text -o Desktop/some_folder/quotes.csv

Creating a new Other_Quotes column in the .csv file with the quotes from page 3:

./main.py --url https://quotes.toscrape.com/page/3/ --selector Other_Quotes:div.col-md-8/div.quote/span.text -o Desktop/some_folder/quotes.csv
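
When scraping several pages in a row, the commands above could be assembled from Python instead of typed by hand. The following is only a sketch: `build_command` is a hypothetical helper, the flags are the ones listed in the help text, and the `./main.py` path assumes you are in the repository folder.

```python
def build_command(url, selector, output=None, verbose=False, script="./main.py"):
    """Assemble the argument list for one FastScrape run (hypothetical helper)."""
    args = [script, "--url", url, "--selector", selector]
    if output:
        args += ["-o", output]
    if verbose:
        # The help text says the verbose flag must come last
        args.append("-v")
    return args


output = "Desktop/some_folder/quotes.csv"
path = "div.col-md-8/div.quote/span.text"
urls = [
    "https://quotes.toscrape.com/",
    "https://quotes.toscrape.com/page/2/",
    "https://quotes.toscrape.com/page/3/",
]

# First run names the column; later runs append to it
commands = [
    build_command(url, ("Quotes:" + path) if i == 0 else path, output)
    for i, url in enumerate(urls)
]
```

Each list in `commands` can then be passed to `subprocess.run(cmd, check=True)` to execute the runs in order.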

Issues

Feel free to submit issues with bugs that need fixing or with new features that you wish to be added.


You can also reach me through my socials:


License

This repo is released under the MIT License.
