Instamancer

Scrape Instagram's API with Puppeteer.

Instamancer is a new type of scraping tool that leverages Puppeteer's ability to intercept requests made by a webpage to an API.

Read more about how Instamancer works here.
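
As a rough illustration of request interception, the sketch below filters responses by URL and keeps the JSON body of the ones that look like API calls. It is not Instamancer's actual code: the `API_PATH` fragment, the `ResponseLike` interface, and `makeInterceptor` are assumptions made for the example, and stand-in responses replace a live browser so the block runs on its own.

```typescript
// Minimal shape of the response object the handler relies on. Puppeteer's
// HTTPResponse provides url() and json(), so it fits this interface.
interface ResponseLike {
    url(): string;
    json(): Promise<unknown>;
}

// ASSUMPTION: path fragment used to recognise Instagram's web API calls.
const API_PATH = "/graphql/query";

// Build a handler that stores the JSON body of every matching response.
function makeInterceptor(sink: unknown[]) {
    return async (response: ResponseLike): Promise<void> => {
        if (!response.url().includes(API_PATH)) {
            return; // ignore images, scripts, and other page traffic
        }
        try {
            sink.push(await response.json());
        } catch {
            // body was not valid JSON; skip it
        }
    };
}

// Demonstration with stand-in responses instead of a live browser.
const captured: unknown[] = [];
const intercept = makeInterceptor(captured);

const fakeApi: ResponseLike = {
    url: () => "https://www.instagram.com/graphql/query/?query_hash=abc",
    json: async () => ({ data: { posts: [] } }),
};
const fakeImage: ResponseLike = {
    url: () => "https://www.instagram.com/static/logo.png",
    json: async () => { throw new Error("not JSON"); },
};

const done = Promise.all([intercept(fakeApi), intercept(fakeImage)]);
```

With a real Puppeteer page, a handler like this would be attached with `page.on("response", intercept)`.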

Features

  • Scrape hashtags, locations and users
  • Output JSON, CSV
  • Download images, albums, and videos
  • Batch scraping

Data

Metadata that Instamancer is able to gather from posts:

  • Text
  • Timestamps
  • Tagged users
  • Accessibility captions
  • Like counts
  • Comment counts
  • Images (Thumbnails, Dimensions, URLs)
  • Videos (URL, View count, Duration)
  • Comments (Timestamp, Text, Like count, User)
  • User (Username, Full name, Profile picture, Profile privacy)
  • Location (Name, Street, Zip code, City, Region, Country)

Install

Linux

See Puppeteer troubleshooting

Enable user namespace cloning:

sysctl -w kernel.unprivileged_userns_clone=1

Or run without a sandbox:

# WARNING: unsafe
export NO_SANDBOX=true

Without downloading Chromium

If you wish to install Instamancer without downloading Chromium, set the PUPPETEER_SKIP_CHROMIUM_DOWNLOAD environment variable before installation:

export PUPPETEER_SKIP_CHROMIUM_DOWNLOAD=true

From this repository

Requires TypeScript

git clone https://github.com/ScriptSmith/instamancer.git
cd instamancer
npm install
npm install -g

From NPM

npm install -g instamancer

If you install globally as root, use the following command so that the Puppeteer dependency is installed:

sudo npm install -g instamancer --unsafe-perm=true

From NPX

npx instamancer

Usage

Command Line

$ instamancer
Usage: instamancer <command> [options]

Commands:
  instamancer hashtag [id]       Scrape a hashtag
  instamancer location [id]      Scrape a location
  instamancer user [id]          Scrape a user
  instamancer post [ids]         Scrape a comma-separated list of posts
  instamancer batch [batchfile]  Read newline-separated arguments from a file

Options:
  --help                  Show help                                    [boolean]
  --version               Show version number                          [boolean]
  --count, -c             Number of posts to download. 0 to download all
                                                                    [default: 0]
  --visible               Show browser on the screen            [default: false]
  --download, -d          Save images and videos from posts
                                                      [boolean] [default: false]
  --graft, -g             Enable grafting              [boolean] [default: true]
  --full                  Get the full list of posts and their details from the
                          API and web page            [boolean] [default: false]
  --video                 Download videos. Only works in full mode
                                                      [boolean] [default: false]
  --silent                Disable progress output     [boolean] [default: false]
  --sync                  Synchronously download files between API requests
                                                      [boolean] [default: false]
  --threads, -k           The number of parallel download threads
                                                           [number] [default: 4]
  --waitDownload, -w      When true, media will only download once scraping is
                          finished                    [boolean] [default: false]
  --filename, --file, -f  Name of the output file              [default: "[id]"]
  --filetype, --type, -t  Type of output file
                              [choices: "csv", "json", "both"] [default: "json"]
  --downdir               Directory to save media
                                          [default: "downloads/[endpoint]/[id]"]
  --logging               Level of logger
                   [choices: "error", "none", "info", "debug"] [default: "none"]
  --logfile               Name of the log file      [default: "instamancer.log"]
  --browser               Location of the browser. Defaults to the copy
                          downloaded at installation

Examples:
  instamancer hashtag instagood -d          Download all the available posts,
                                            and their thumbnails from #instagood
  instamancer location 644269022 --count    Download 200 posts tagged as being
  200                                       at the Arc Du Triomphe
  instamancer user arianagrande             Download Ariana Grande's posts to a
  --filetype=csv --logging=info --visible   CSV file with a non-headless
                                            browser, and log all events

Source code available at https://github.com/ScriptSmith/instamancer
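
The batch command reads newline-separated arguments from a file, so a batch file can be generated from a list of jobs. The sketch below assumes each line holds the same arguments you would otherwise type after `instamancer`, which the help text suggests but does not spell out. `BatchJob` and `buildBatchFile` are hypothetical names used only for this example.

```typescript
// One scraping job: an endpoint from the command list, its id, and any flags.
interface BatchJob {
    endpoint: "hashtag" | "location" | "user" | "post";
    id: string;
    flags?: string[];
}

// ASSUMPTION: each batch line mirrors the arguments typed after `instamancer`.
function buildBatchFile(jobs: BatchJob[]): string {
    return jobs
        .map((job) => [job.endpoint, job.id, ...(job.flags ?? [])].join(" "))
        .join("\n");
}

const batch = buildBatchFile([
    { endpoint: "hashtag", id: "instagood", flags: ["-d"] },
    { endpoint: "location", id: "644269022", flags: ["--count", "200"] },
    { endpoint: "user", id: "arianagrande", flags: ["--filetype=csv"] },
]);

console.log(batch);
```

Saving the resulting text to a file and passing that file to `instamancer batch` should then run the three jobs in sequence, provided the assumption about the line format holds.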

Module

ES2018 TypeScript example:

import * as Instamancer from "instamancer";

const options: Instamancer.IOptions = {
    total: 10
};

const hashtag = Instamancer.hashtag("beach", options);
(async () => {
    for await (const post of hashtag) {
        console.log(post);
    }
})();

Generator functions

Instamancer.hashtag(id, options);
Instamancer.location(id, options);
Instamancer.user(id, options);
Instamancer.post(ids, options);
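
The module example consumes a generator with `for await`, which suggests that each function above returns an async iterable of posts. Under that assumption, a small helper can gather a fixed number of posts into an array. `take` and `fakePosts` are hypothetical names; `fakePosts` stands in for a real scraper so the sketch runs without a browser.

```typescript
// Collect up to `limit` items from any async iterable, then stop iterating.
async function take<T>(source: AsyncIterable<T>, limit: number): Promise<T[]> {
    const items: T[] = [];
    if (limit <= 0) {
        return items;
    }
    for await (const item of source) {
        items.push(item);
        if (items.length >= limit) {
            break; // leaving the loop lets the generator finish and clean up
        }
    }
    return items;
}

// Stand-in for a scraper generator: yields an endless stream of fake posts.
async function* fakePosts(): AsyncGenerator<{ id: number }> {
    for (let id = 1; ; id++) {
        yield { id };
    }
}

const firstThree = take(fakePosts(), 3);
firstThree.then((posts) => console.log(posts.map((p) => p.id)));
```

With the real package installed, the same helper could presumably be called as `take(Instamancer.hashtag("beach", { total: 10 }), 5)`.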

Options

const options: Instamancer.IOptions = {
    // Total posts to download. 0 for unlimited
    total: number,
    
    // Run Chrome in headless mode
    headless: boolean,
    
    // Logging events
    logger: winston.Logger,
    
    // Run without output to stdout
    silent: boolean,
    
    // Time to sleep between interactions with the page
    sleepTime: number,
    
    // Time to sleep when rate-limited
    hibernationTime: number,
    
    // Enable the grafting process
    enableGrafting: boolean,
    
    // Extract the full amount of information from the API
    fullAPI: boolean,
    
    // Use a proxy in Chrome to connect to Instagram
    proxyURL: string,
    
    // Location of the chromium / chrome binary executable
    executablePath: string,
}
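
Several of these options appear to correspond to command-line flags. The sketch below shows one possible mapping, using the defaults printed in the help output. The pairing of flags to options is inferred from their descriptions and is not confirmed by Instamancer's source; `ScrapeOptions`, `CliFlags`, and `optionsFromFlags` are local stand-ins so that the block compiles without the package.

```typescript
// Local stand-in for a subset of the documented module options.
interface ScrapeOptions {
    total?: number;
    headless?: boolean;
    silent?: boolean;
    enableGrafting?: boolean;
    fullAPI?: boolean;
    executablePath?: string;
}

// Subset of the command-line flags listed in the help output.
interface CliFlags {
    count?: number;
    visible?: boolean;
    silent?: boolean;
    graft?: boolean;
    full?: boolean;
    browser?: string;
}

// ASSUMPTION: this flag-to-option pairing is inferred from the descriptions.
function optionsFromFlags(flags: CliFlags): ScrapeOptions {
    const options: ScrapeOptions = {
        total: flags.count ?? 0,             // 0 means "download all"
        headless: !(flags.visible ?? false), // --visible shows the browser
        silent: flags.silent ?? false,
        enableGrafting: flags.graft ?? true,
        fullAPI: flags.full ?? false,
    };
    if (flags.browser !== undefined) {
        options.executablePath = flags.browser;
    }
    return options;
}

const opts = optionsFromFlags({ count: 200, visible: true });
console.log(opts);
```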

Comparison

A comparison of Instagram scraping tools. Please suggest more tools and criteria through a pull request.

To see a speed comparison, visit this page.

Tool | Hashtags | Users | Locations | Posts | Login not required | Private feeds | Batch mode | Command-line | Library/Module | Download media | Download metadata | Scraping method | Daily builds | Main language | Speed | License | Last commit | Open Issues | Closed Issues | Build status | Test coverage | Code quality
Instamancer ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ Web API request interception ✔️ Typescript
Instaphyte ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ Web API simulation ✔️ Python
Instaloader ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ Web API simulation Python
Instalooter ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ Web API simulation Python
Instagram crawler ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ Web DOM reading Python
Instagram Scraper ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ Web API simulation Python
Instagram Private API ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ App and Web API simulation Python
Instagram PHP Scraper ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ Web API simulation PHP
