zembrodt / pymdb

Python package to both parse datasets provided by IMDb and scrape information from imdb.com

Home Page: https://pymdb.com

License: MIT License

Python 100.00%

imdb imdb-dataset imdb-api imdb-movies movies movies-api movie-database moviedb-api tvdb pymdb

pymdb's Introduction

PyMDb

PyMDb is a package for both parsing the datasets provided by IMDb and scraping information from their web pages.

This package is able to gather information on people, titles, and companies provided by IMDb and is split into two separate modules: one for parsing the IMDb datasets, and one for scraping webpages on imdb.com.

Installation

The latest release of PyMDb can be installed from PyPI with:

pip install py-mdb

If downloading the source from GitHub, PyMDb requires the following packages:

Usage

>>> import pymdb
>>> from collections import defaultdict
>>>
>>> parser = pymdb.PyMDbParser(gunzip_files=True)
>>> genre_count = defaultdict(int)
>>> for title in parser.get_title_basics("path/to/files"):
...     for genre in title.genres:
...         genre_count[genre] += 1
...
>>> for genre in genre_count:
...     print(f"{genre}: {genre_count[genre]}")
...
Documentary: 600184
Short: 837912
Animation: 312227
    ...
Talk-Show: 584252
Reality-TV: 307037
Adult: 178493
>>>
>>> scraper = pymdb.PyMDbScraper(rate_limit=500)
>>> title = scraper.get_title("tt0076759")
>>> print(f"{title.display_title} came out in {title.release_date.year}!")
Star Wars: Episode IV - A New Hope came out in 1977!

Documentation

Full documentation can be found at the PyMDb Read the Docs page.

Disclaimer

PyMDb is still in a pre-release state and has only been tested against a small amount of data from imdb.com. The web scraper has a customizable rate limiter value; please be kind to IMDb. If you find any bugs or issues, please do not hesitate to create an issue or open a pull request on GitHub. Suggestions for features to add to PyMDb in future releases are also welcome!

License

This project is licensed under the MIT License. Please see the LICENSE file for details.

pymdb's Issues

Upgrade for Python 3.9

how to print generator object to string?

Hello, I really like your project, but I'm a little confused: how do I print the actors from a generator object as strings?
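A generator has to be iterated (or passed to `list()`) before there is anything to print. The sketch below uses a stand-in generator rather than a real scraper call, so `fake_cast` and the names it yields are hypothetical. The same pattern should apply to any generator-returning method.

```python
def fake_cast():
    """Hypothetical stand-in for a method that returns a generator."""
    yield "Actor One"
    yield "Actor Two"
    yield "Actor Three"

# Printing the generator directly only shows the generator object.
print(fake_cast())

# Consume the generator to get the items, then join them into one string.
names = list(fake_cast())
cast_string = ", ".join(str(name) for name in names)
print(cast_string)
```

If the generator yields objects rather than strings, `str(name)` can be replaced with whichever attribute holds the text to display.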

Optimize CSS selectors for PyMDbScraper

The CSS selectors are currently fully functional for the given tests, but there is potentially work that needs to be done to optimize them.

Restructure classes to use slots

Increase memory efficiency with custom objects using slots
https://www.datadependence.com/2016/07/pythonic-code-video-series-slots/
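As a rough illustration of the idea, the sketch below compares a regular class with one that declares `__slots__`. The class and attribute names are hypothetical and are not taken from the package's models.

```python
class CreditPlain:
    """Regular class: each instance carries its own __dict__."""

    def __init__(self, name_id, job_title):
        self.name_id = name_id
        self.job_title = job_title


class CreditSlots:
    """Slotted class: attributes are stored in fixed slots, with no per-instance __dict__."""

    __slots__ = ("name_id", "job_title")

    def __init__(self, name_id, job_title):
        self.name_id = name_id
        self.job_title = job_title


plain = CreditPlain("nm0000001", "director")      # made-up ID
slotted = CreditSlots("nm0000001", "director")    # made-up ID

print(hasattr(plain, "__dict__"))
print(hasattr(slotted, "__dict__"))

# A side effect of slots: attributes not listed in __slots__ cannot be added.
try:
    slotted.unknown_field = 1
    rejected = False
except AttributeError:
    rejected = True
print(rejected)
```

The memory saving comes from dropping the per-instance dictionary, which matters most when many thousands of objects are created, as when parsing a full dataset file.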

Import issues within the pymdb package

Stack trace:

  File "C:\Users\Administrator\Documents\python_projects\django_test\update\views.py", line 5, in <module>
    from pymdb import PyMDbScraper
  File "C:\Python\Python37\lib\site-packages\pymdb\__init__.py", line 1, in <module>
    from .parser import PyMDbParser
  File "C:\Python\Python37\lib\site-packages\pymdb\parser\__init__.py", line 1, in <module>
    from .pymdb_parser import PyMDbParser
  File "C:\Python\Python37\lib\site-packages\pymdb\parser\pymdb_parser.py", line 7, in <module>
    from pymdb.models.name import (
  File "C:\Python\Python37\lib\site-packages\pymdb\models\__init__.py", line 1, in <module>
    from .company import *
  File "C:\Python\Python37\lib\site-packages\pymdb\models\company.py", line 6, in <module>
    from pymdb.utils import is_int
ImportError: cannot import name 'is_int' from 'pymdb.utils' (C:\Python\Python37\lib\site-packages\pymdb\utils\__init__.py)

Add Support for Python 3.8

Add support for Python 3.8 with new selectolax builds

Add IMDb Search Requests

Add the ability to search IMDb with keywords.

IMDb's search works by sending a GET request to the server (for example: https://v2.sg.media-imdb.com/suggestion/r/robert.json) and receiving a JSON response containing the results to display.

Use this JSON response to create a SearchResult Python object containing the ID, and potentially other details, of each result. May need to exclude certain results such as videos or trailers.

Results we care about should only be titles (tt) and people (nm).
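A minimal sketch of the parsing step, working on a hard-coded sample instead of a live request. The `SearchResult` fields, the `parse_search_results` helper, and the JSON keys (`"d"` for the result list, `"id"`, `"l"` for the label) are assumptions about the response shape and should be checked against a real response. All IDs in the sample are made up.

```python
import json
from dataclasses import dataclass


@dataclass
class SearchResult:
    """Hypothetical container for one search hit."""
    imdb_id: str
    label: str


# Made-up payload; the key names are an assumption about the real response.
SAMPLE_RESPONSE = json.dumps({
    "q": "robert",
    "d": [
        {"id": "nm0000001", "l": "Example Person"},
        {"id": "tt0000001", "l": "Example Title"},
        {"id": "vi0000001", "l": "Example Trailer"},
    ],
})


def parse_search_results(raw_json):
    """Keep only title (tt) and person (nm) results."""
    payload = json.loads(raw_json)
    results = []
    for entry in payload.get("d", []):
        imdb_id = entry.get("id", "")
        if imdb_id.startswith(("tt", "nm")):
            results.append(SearchResult(imdb_id, entry.get("l", "")))
    return results


results = parse_search_results(SAMPLE_RESPONSE)
for result in results:
    print(result)
```

Filtering on the ID prefix drops videos and trailers without needing to know every other result type in advance.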

Map credit job categories into key values

Map a credit job type, such as "Series Directed by" into an actual key such as "director". This will help align with the already hard coded "actor" category in get_full_cast.
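One possible shape for this mapping is a lookup table with a slug fallback, sketched below. The table entries other than "Directed by" are illustrative guesses at section headings, and `job_category_to_key` is a hypothetical helper name.

```python
import re

# Illustrative table; real headings would need to be collected from the pages.
JOB_KEYS = {
    "directed by": "director",
    "writing credits": "writer",
    "produced by": "producer",
}


def job_category_to_key(category):
    """Map a credit heading such as 'Series Directed by' to a short key."""
    lowered = category.strip().lower()
    # TV pages prefix headings with "Series"; drop it before the lookup.
    if lowered.startswith("series "):
        lowered = lowered[len("series "):]
    if lowered in JOB_KEYS:
        return JOB_KEYS[lowered]
    # Fallback for unknown headings: a lowercase, underscore-separated slug.
    return re.sub(r"[^a-z0-9]+", "_", lowered).strip("_")


print(job_category_to_key("Series Directed by"))
print(job_category_to_key("Directed by"))
print(job_category_to_key("Music by"))
```

The fallback keeps unknown categories usable as keys, so new headings do not break the scraper before the table is updated.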

Combine PyMDbScraper's get_full_cast and get_full_credits

Combine the two methods into a single helper method to call both and yield all results as a single generator.
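A sketch of how such a helper could look, using `itertools.chain` to yield from both sources lazily. The two stub generators stand in for the real methods, and their return values are made up.

```python
from itertools import chain


def get_full_cast_stub(title_id):
    """Hypothetical stand-in for get_full_cast."""
    yield {"title_id": title_id, "job": "actor", "name_id": "nm0000001"}


def get_full_credits_stub(title_id):
    """Hypothetical stand-in for get_full_credits."""
    yield {"title_id": title_id, "job": "director", "name_id": "nm0000002"}
    yield {"title_id": title_id, "job": "writer", "name_id": "nm0000003"}


def get_all_credits(title_id):
    """Yield cast first, then the remaining credits, as one generator."""
    yield from chain(get_full_cast_stub(title_id), get_full_credits_stub(title_id))


all_credits = list(get_all_credits("tt0000001"))
for credit in all_credits:
    print(credit)
```

Because `chain` is lazy, the second source is not touched until the first is exhausted, so a caller that stops early avoids the extra request.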

Add rate limiter

Add rate limiter for each scrape request of IMDb's website
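A minimal sketch of a limiter that enforces a minimum gap between requests. It assumes the limit is given in milliseconds, which this page does not state. The `RateLimiter` class and its parameters are hypothetical, and the clock and sleep functions are injectable so the logic can be tested without real delays.

```python
import time


class RateLimiter:
    """Block so that consecutive wait() calls are at least min_interval_ms apart."""

    def __init__(self, min_interval_ms, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval_ms / 1000.0  # seconds
        self._clock = clock
        self._sleep = sleep
        self._last_request = None  # time of the previous request, if any

    def wait(self):
        now = self._clock()
        if self._last_request is not None:
            remaining = self.min_interval - (now - self._last_request)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_request = now


limiter = RateLimiter(min_interval_ms=50)
for i in range(3):
    limiter.wait()  # call this immediately before each scrape request
    print(f"request {i}")
```

Using a monotonic clock avoids surprises if the system clock is adjusted while the scraper is running.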

Change top_cast in TitleScrape to list of CreditScrapes

Change the top_cast list within a TitleScrape to return a list of CreditScrape objects instead of a list of name IDs.

Modify Generator Methods to Also Return Dictionaries

Add dictionary-returning variants of certain generator methods, potentially for both PyMDbParser and PyMDbScraper. For example, get_full_credits could return a dictionary where each key is a unique job title and each value is a list of Credit objects.
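The grouping itself can be sketched with a `defaultdict`. The stub generator and its `(job_title, name_id)` tuples are hypothetical placeholders for real Credit objects.

```python
from collections import defaultdict


def credits_stub():
    """Hypothetical stand-in yielding (job_title, name_id) pairs."""
    yield ("director", "nm0000001")
    yield ("writer", "nm0000002")
    yield ("director", "nm0000003")


def group_by_job(credits):
    """Collect credits into a dict keyed by job title, preserving order."""
    grouped = defaultdict(list)
    for job_title, name_id in credits:
        grouped[job_title].append(name_id)
    return dict(grouped)


credits_by_job = group_by_job(credits_stub())
print(credits_by_job)
```

Unlike the generator, the dictionary holds every credit in memory at once, which is the trade-off for direct lookup by job title.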

get_title returns null display_title

Within unit tests, the get_title method occasionally returns a Title with a null display_title, while other runs return the correct title.