imdb-review-scraper's Introduction

🚦 🎄

Checks                                                | Badge
Composer validation (composer.json)                   | Composer checks
PHP Cs Fixer (@Symfony [...])                         | PHP cs fixer check
Infection Mutation tests (Min MSI>95%, Min C MSI>95%) | Infection tests
PHPinsights (Q>=95,C>=70,A>=95,S>=95)                 | PHPinsights
PHPStan (L9)                                          | PHPStan
PHPUnit (phpunit.xml)                                 | PHPUnit

What is this thing?

Hi, I'm Lukasz. I code. This is a thing I coded.
This is an experiment/exercise in building a composer package, and setting it up with a full deployment lifecycle.

It is a scraper that looks up an IMDB user's reviews, scrapes them, transforms them and spits them out as objects.
You have to choose your own PSR-17 request factory and PSR-18 client.

You should be using the official IMDB API.

Usage

Quick and dirty:

<?php

require 'vendor/autoload.php';

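// Any PSR-18 client and PSR-17 request factory will do; Guzzle is only an example.
$movies = new \Meltir\ImdbRatingsScraper\Scraper(
    new \GuzzleHttp\Client(),
    new \GuzzleHttp\Psr7\HttpFactory(),
    'ur20552756' // id of the IMDB user whose reviews are scraped
);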
var_dump($movies->getMovies());

Licence

Short version:

This is mine and nobody has my permission to use it or republish it in parts or whole anywhere ever.

Long and snarky version:

You can look, but you cannot touch, run, analyse, lick or compile.
I don't care enough to chase down bots (or anyone else for that matter), but for the record:

BAD BOT

This scrapes movie IDs and review values from publicly available pages, and does not use descriptions, posters or other metadata from IMDB.
Please don't sue me.

If you want to do this or anything like it for any purpose, get the official, commercial IMDB API.
Consider this an unholy closed half-MIT licence.

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

imdb-review-scraper's People

Contributors

meltir, dependabot[bot]

imdb-review-scraper's Issues

drop dependency on guzzle

Switch to a generic PSR client interface, rather than relying on this concrete implementation.
I don't care what the HTTP client is, as long as it works.

Dig into whether this is doable and how: look up other libs that don't care who provides HTTP, like Symfony and Laravel.
Maybe only require Guzzle in dev, as a concrete implementation to test against?

Add badges

All of the shiny badges: build, tests, infection, security, whatever else.
Let's make a collection of things I could do, and tick them off.

Going to depend on the build pipeline working (#3)

Add interfaces SPIKE

Implement generator, cacheable and iterator interfaces, so callers don't always have to rely on the getall.
See what else can be done (lazy loading, fibers?).

Shoehorn as much in as I can :P

Implement some sort of token to be able to resume a scrape after one is interrupted (times out etc).
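
A rough sketch of how the iterator and resume-token ideas might fit together. Everything here is hypothetical and not part of the package: the `ResumableRatings` class, the `$fetchPage` closure (a stand-in for the HTTP and parsing work) and the page shape `['items' => ..., 'next' => ...]` are all assumptions made for illustration.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical lazy iterator: pages are only fetched as the caller loops.
 */
final class ResumableRatings implements IteratorAggregate
{
    /** Token of the page currently being read; null means the first page. */
    private ?string $token;

    /**
     * $fetchPage is an assumed callable: it takes a page token (or null)
     * and returns ['items' => array, 'next' => ?string].
     */
    public function __construct(private Closure $fetchPage, ?string $resumeFrom = null)
    {
        $this->token = $resumeFrom;
    }

    public function getIterator(): Generator
    {
        do {
            $page = ($this->fetchPage)($this->token);
            foreach ($page['items'] as $item) {
                yield $item;
            }
            $this->token = $page['next'];
        } while ($this->token !== null);
    }

    /**
     * Token to pass as $resumeFrom after an interruption. Resuming restarts
     * at the beginning of the interrupted page, so items may repeat.
     * After a complete run this is null again.
     */
    public function resumeToken(): ?string
    {
        return $this->token;
    }
}

// Fake two-page source standing in for real requests.
$fakeFetch = static function (?string $token): array {
    $pages = [
        'start' => ['items' => ['first', 'second'], 'next' => 'page-2'],
        'page-2' => ['items' => ['third'], 'next' => null],
    ];

    return $pages[$token ?? 'start'];
};

foreach (new ResumableRatings($fakeFetch) as $rating) {
    echo $rating, PHP_EOL;
}
```

If a scrape dies halfway, storing `resumeToken()` somewhere and passing it back into the constructor would pick up from the interrupted page. Whether IMDB pagination exposes anything usable as such a token is still an open question for the spike.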

Workflow refactor

Follow https://github.com/doctrine/DoctrineBundle/blob/2.11.x/.github/workflows/continuous-integration.yml and try to replicate it: specific LTS versions in the test matrix, stable/dev revisions of core required libs, and a matrix with those versions of libs.
Merge all these into one or a few workflows? I don't like how bitty they have become.
Share phpunit coverage cache with infection, see if I can share other caches.
Run automated workflows on a schedule to check if there are dependabot PRs and merge them if there are no errors (there should be another ticket for this).

drop usage of 'set-output' from pipeline, improve caching

1 The warning below is shown in the pipeline, at the step where the date key for the composer cache is stored.

Warning: The `set-output` command is deprecated and will be disabled soon.   
Please upgrade to using Environment Files. 
For more information see: https://github.blog/changelog/2022-10-11-github-actions-deprecating-save-state-and-set-output-commands/

2 Update the key to also use an md5 or equivalent (sha256?) hash of composer.json, so the cache is updated when libs and versions change on the same day

3 Add caching of phpunit coverage to the infection pipeline so it's not run twice

4 Don't run the build pipeline on anything other than a PR (I think it also runs on a merge for no reason, eating up my free minutes - check this)
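
Points 1 and 2 could be handled in one step. A minimal sketch, assuming the key is built by a small PHP script in the workflow (the real pipeline may well do this in shell instead). The function names `composerCacheKey` and `writeOutput`, the output name `key` and the `composer-<date>-<hash>` key format are made up for illustration. The part that is not an assumption is the replacement for `set-output` described in the linked changelog: append `name=value` lines to the file named by the `GITHUB_OUTPUT` environment variable.

```php
<?php

declare(strict_types=1);

/**
 * Builds a cache key from the date plus a sha256 of composer.json, so the
 * cache is refreshed when dependencies change on the same day.
 */
function composerCacheKey(string $composerJsonPath, DateTimeInterface $now): string
{
    $hash = is_file($composerJsonPath) ? hash_file('sha256', $composerJsonPath) : false;
    if ($hash === false) {
        throw new RuntimeException("Cannot read {$composerJsonPath}");
    }

    return sprintf('composer-%s-%s', $now->format('Y-m-d'), $hash);
}

/**
 * Appends "name=value" to the given output file. In a workflow step the
 * file is the one named by GITHUB_OUTPUT, which replaces `set-output`.
 */
function writeOutput(string $name, string $value, string $outputFile): void
{
    file_put_contents($outputFile, "{$name}={$value}\n", FILE_APPEND | LOCK_EX);
}

// Only does anything when run inside a GitHub Actions step.
$outputFile = getenv('GITHUB_OUTPUT');
if (is_string($outputFile) && $outputFile !== '' && is_file('composer.json')) {
    writeOutput('key', composerCacheKey('composer.json', new DateTimeImmutable()), $outputFile);
}
```

A later step could then read the value as `steps.<step id>.outputs.key`. If the key is only ever needed inside the cache step itself, the built-in `hashFiles()` workflow expression would avoid the script entirely.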

move userid from constructor

What if someone wants a couple of different users?
Do you have to create many instances?
Silly. Might as well take the user when getting all movies or somewhere else.

This smells bad.
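
One possible shape for this, as a sketch only. `PerCallScraper` and its `$fetchRatings` closure are hypothetical stand-ins, not the package's real class; the closure replaces the HTTP and parsing work so the example runs on its own. The `ur` + digits check is an assumption based on the id used in the README example.

```php
<?php

declare(strict_types=1);

final class PerCallScraper
{
    /**
     * $fetchRatings is an assumed callable: it takes a user id and
     * returns that user's ratings as an array.
     */
    public function __construct(private Closure $fetchRatings)
    {
    }

    /**
     * The user is chosen per call, so one instance serves any number of users.
     */
    public function getMovies(string $userId): array
    {
        // Assumed id format ("ur" followed by digits), taken from the README example.
        if (preg_match('/^ur\d+$/', $userId) !== 1) {
            throw new InvalidArgumentException("Unexpected user id: {$userId}");
        }

        return ($this->fetchRatings)($userId);
    }
}

// Fake fetcher for illustration; the second id is made up.
$scraper = new PerCallScraper(
    static fn (string $userId): array => ["ratings for {$userId}"]
);
$first = $scraper->getMovies('ur20552756');
$second = $scraper->getMovies('ur00000001');
```

The constructor keeps only the dependencies that stay the same between calls, which is the part that felt wrong about passing the user in there.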

Clean up namespace

use Meltir\ImdbRatingsScraper\ImdbRatingsScraper;

This is redundant redundant.

Create tags and releases for packages

Implement semver in the pipelines after (or maybe prior to?) a merge into master, and tag a GitHub release automatically.
Confirm composer.json has all the fields required for a lib, author, etc.

phpinsights scores

Add phpinsights checks to composer.
Go over findings and figure out what the passing grades should be for the pipeline.
Implement checks on the pipeline.

Azure pipeline test/build agent

Set up a remote Azure Pipelines (Alpine-based?) build agent and use it to run build/tests on GitHub PR and merge.
Deploy pipeline should also tag a release etc. Full monty and all that.

Spike - investigate an automated backport

Will Rector allow me to generate a PHP 7.4-compatible version that perhaps fails some strict PHPStan checks and others, but works fine without them?
Is Rector free? I haven't tried it yet - just read about it.
What are the alternatives ?

Add docs on how to build/test

Add docker run commands to the readme/docker docs showing how to download the lib, build it and test it using only docker run commands.
Add composer setup/commands to that as well if running on a full system.
