Coder Social home page Coder Social logo

thegraydot / gcd_docker Goto Github PK

View Code? Open in Web Editor NEW
0.0 1.0 0.0 715 KB

A simple Docker environment for the Grand Comic Database (GCD) with some Python scripts for interacting with the database

License: Other

Dockerfile 2.03% Python 91.54% Makefile 2.91% Shell 3.52%
comic-books comicbooks gcd grandcomicdatabase

gcd_docker's Introduction

gcd_docker

A simple Dockerised environment for working with the Grand Comic Database™ (GCD), and Python scripts for processing data into a format I work with.

License

This project uses data courtesy of the Grand Comics Database™ under a Creative Commons Attribution license. For more information, please refer to the GCD wiki page on Data Distribution and the GCD board decision on licensing.

NOTE: At present no data from the GCD, or data from the GCD that has been modified, is currently provided in this repository. However, it is planned that derived data from the GCD will be distributed in this repository in the future. Additionally, since the project uses the GCD database (while not currently distributing any data), it is both logical and ethical to adopt the same license.

gcd

A simple Docker Compose environment for the Grand Comic Database (GCD) with a Python environment and scripts for interacting with the database.

  • Uses Docker to provision containers to load and process GCD data
  • MySQL container to auto load the GCD data, and keep database persistent in a volume
  • Python container to run simple programs to query database, also mounted as a volume

Project Requirements

  • Docker
  • Docker Compose plugin
  • GCD DB dump

QuickStart

  • Make sure you have the following installed:
    • Docker
    • Docker Compose plugin
  • Clone the repo
    • git clone https://github.com/TheGrayDot/gcd_docker.git && cd ./gcd_docker
  • Copy a GCD MySQL dump to the data/gcd folder
    • cd ~/Downloads && unzip ~/Downloads/current.zip && cp ~/Downloads/2022-01-25.sql ./data/gcd/
  • Start Docker environment
    • make run
    • OR
    • docker compose up --file docker-compose_gcd.yml --build
  • Run a Python script in the Docker container:
    • docker exec gcd-python python gcd_test_load_all_into_models.py

GCD DB Dump

You need to add the GCD DB .sql file to load data into the MySQL database. You can download from the following URL (GCD account needed):

Unzip the download, and you will get an .sql file in the following format:

YYYY-MM-DD.sql

For example:

2022-01-25.sql

To auto-load data in the Docker environment, you will need to put the downloaded and extracted .sql file in the following location:

./gcd_docker/data/gcd/

Python Scripts

There are a collection of scripts in the python directory - mainly for examples. The cbdb folder contains a few classes for connecting to and retrieving data from the GCD database. Below is an example of how to run some of the scripts from the gcd-python container. Have a look at the source code to further understand the scripts and how to use/modify them.

docker exec gcd-python python gcd_multi_barcode_lookup.py
docker exec gcd-python python gcd_multi_issue_id_lookup.py
docker exec gcd-python python gcd_series_issues_lookup.py
docker exec gcd-python python gcd_dump_publisher_data.py
docker exec gcd-python python gcd_test_load_all_into_models.py

gcd_dump_publisher_data.py

  • Query database for publishers
  • Output JSON file mapping ID to name

gcd_multi_barcode_lookup.py

  • Read in example_barcodes.txt file
  • For each barcode, lookup the barcode and return all matches
  • Print output in nice format

gcd_multi_issue_id_lookup.py

  • Read in example_issues.txt file
  • For each issue ID, lookup and return the single matching issue
  • Print output in nice format

gcd_series_issues_lookup.py

  • Read in example_series_ids.txt file
  • For each series, lookup all Cover A issues
  • Print output in nice format

gcd_models.py

Note that these were manually generated using the following method:

# View table in MySQL and copy to file
DESCRIBE gcd_issue;
# Split the file and print needed columns
cat issue.txt | awk -F"|" '{print $2":"$3"="$6}'
# Columns are name, type, default
# Use find/replace to make changes to Python types/syntax
# Update Optional (nullable) manually

Migration

Migration is a set of database models and scripts to transform the GCD data into something that I find more useful. The general steps are:

  • Download new GCD database dump file
  • Move un-compressed database dump file to ./data/gcd directory
  • Update GCD_DUMP_DATE_LAST and GCD_DUMP_DATE_CURR to new values
  • Build GCD Docker environment using make gcd_build
  • Run migration script as documented below

To perform a full migration:

docker exec gcd-python python gcd_migrate_full.py > data/tgd/migration_2023-06-01.sql

To perform a partial migration:

docker exec gcd-python python gcd_migrate_partial.py > data/tgd/migration_2023-07-01.sql

Check for any errors:

grep "^Error" data/tgd/migration_2023-06-01.sql

Remove error line from file:

awk '!/^Error/' data/tgd/migration_2023-06-01.sql

Performance

The current implementation is not tweaked for performance. It has been designed to be flexible, rather than performant. Unless you are crunching lots of data, this shouldn't be a problem. Initial load of the entire GCD takes about 45GB storage, almost 1GB RAM for the dump import, and takes approximately 45 minutes (depending on your system).

The initial database load time is very slow. The GCD MySQL dump is ~2GB, and takes about 45 minutes to import on my laptop. This is a one-time import if you keep the Docker volume. However, if you want faster load times, consider extracting the gcd_issues, gcd_series and gcd_publishers tables, and only importing them. Depending on what you are doing, these three tables are usually sufficient - lookup issue, series and publisher metadata. These three tables are about 500MB, so the load time is much faster. To load in just these files, use the following commands to extract the tables, then only put those sql files into the gcd_data folder.

echo -e 'SET FOREIGN_KEY_CHECKS=0;\n' > data/gcd/issues.sql
cat ~/Downloads/2023-06-01.sql | sed -n -e '/DROP TABLE.*`gcd_issue`/,/UNLOCK TABLES/p' >> data/gcd/issues.sql
echo -e 'SET FOREIGN_KEY_CHECKS=0;\n' > data/gcd/series.sql
cat ~/Downloads/2023-06-01.sql | sed -n -e '/DROP TABLE.*`gcd_series`/,/UNLOCK TABLES/p' >> data/gcd/series.sql
echo -e 'SET FOREIGN_KEY_CHECKS=0;\n' > data/gcd/publishers.sql
cat ~/Downloads/2023-06-01.sql | sed -n -e '/DROP TABLE.*`gcd_publisher`/,/UNLOCK TABLES/p' >> data/gcd/publishers.sql

GCD DB Table Structure

Some notes on the SQL database structure in the GCD project.

All Tables

mysql> show tables;
+-----------------------------------+
| Tables_in_gcd                     |
+-----------------------------------+
| django_content_type               |
| gcd_award                         |
| gcd_biblio_entry                  |
| gcd_brand                         |
| gcd_brand_emblem_group            |
| gcd_brand_group                   |
| gcd_brand_use                     |
| gcd_creator                       |
| gcd_creator_art_influence         |
| gcd_creator_degree                |
| gcd_creator_membership            |
| gcd_creator_name_detail           |
| gcd_creator_non_comic_work        |
| gcd_creator_relation              |
| gcd_creator_relation_creator_name |
| gcd_creator_school                |
| gcd_creator_signature             |
| gcd_credit_type                   |
| gcd_degree                        |
| gcd_feature                       |
| gcd_feature_logo                  |
| gcd_feature_logo_2_feature        |
| gcd_feature_relation              |
| gcd_feature_relation_type         |
| gcd_feature_type                  |
| gcd_indicia_printer               |
| gcd_indicia_publisher             |
| gcd_issue                         |
| gcd_issue_credit                  |
| gcd_issue_indicia_printer         |
| gcd_membership_type               |
| gcd_name_type                     |
| gcd_non_comic_work_role           |
| gcd_non_comic_work_type           |
| gcd_non_comic_work_year           |
| gcd_printer                       |
| gcd_publisher                     |
| gcd_received_award                |
| gcd_relation_type                 |
| gcd_reprint                       |
| gcd_school                        |
| gcd_series                        |
| gcd_series_bond                   |
| gcd_series_bond_type              |
| gcd_series_publication_type       |
| gcd_story                         |
| gcd_story_credit                  |
| gcd_story_feature_logo            |
| gcd_story_feature_object          |
| gcd_story_type                    |
| stddata_country                   |
| stddata_date                      |
| stddata_language                  |
| stddata_script                    |
| taggit_tag                        |
| taggit_taggeditem                 |
+-----------------------------------+

Issue Table

Represents a single comic book (1 issue). For example, The Amazing Spider-man issue 300.

mysql> DESCRIBE gcd_issue;
+----------------------------+---------------+------+-----+---------------------+----------------+
| Field                      | Type          | Null | Key | Default             | Extra          |
+----------------------------+---------------+------+-----+---------------------+----------------+
| id                         | int           | NO   | PRI | NULL                | auto_increment |
| number                     | varchar(50)   | NO   | MUL | NULL                |                |
| volume                     | varchar(50)   | NO   | MUL |                     |                |
| no_volume                  | tinyint(1)    | NO   | MUL | 0                   |                |
| display_volume_with_number | tinyint(1)    | NO   | MUL | 0                   |                |
| series_id                  | int           | NO   | MUL | NULL                |                |
| indicia_publisher_id       | int           | YES  | MUL | NULL                |                |
| indicia_pub_not_printed    | tinyint(1)    | NO   |     | NULL                |                |
| brand_id                   | int           | YES  | MUL | NULL                |                |
| no_brand                   | tinyint(1)    | NO   | MUL | NULL                |                |
| publication_date           | varchar(255)  | NO   |     | NULL                |                |
| key_date                   | varchar(10)   | NO   | MUL | NULL                |                |
| sort_code                  | int           | NO   | MUL | NULL                |                |
| price                      | varchar(255)  | NO   |     | NULL                |                |
| page_count                 | decimal(10,3) | YES  |     | NULL                |                |
| page_count_uncertain       | tinyint(1)    | NO   |     | 0                   |                |
| indicia_frequency          | varchar(255)  | NO   |     |                     |                |
| no_indicia_frequency       | tinyint(1)    | NO   | MUL | 0                   |                |
| editing                    | longtext      | NO   |     | NULL                |                |
| no_editing                 | tinyint(1)    | NO   | MUL | 0                   |                |
| notes                      | longtext      | NO   |     | NULL                |                |
| created                    | datetime      | NO   |     | 1901-01-01 00:00:00 |                |
| modified                   | datetime      | NO   | MUL | 1901-01-01 00:00:00 |                |
| deleted                    | tinyint(1)    | NO   | MUL | 0                   |                |
| is_indexed                 | tinyint(1)    | NO   | MUL | 0                   |                |
| isbn                       | varchar(32)   | NO   | MUL |                     |                |
| valid_isbn                 | varchar(13)   | NO   | MUL |                     |                |
| no_isbn                    | tinyint(1)    | NO   | MUL | 0                   |                |
| variant_of_id              | int           | YES  | MUL | NULL                |                |
| variant_name               | varchar(255)  | NO   |     |                     |                |
| barcode                    | varchar(38)   | NO   | MUL |                     |                |
| no_barcode                 | tinyint(1)    | NO   |     | 0                   |                |
| title                      | varchar(255)  | NO   | MUL |                     |                |
| no_title                   | tinyint(1)    | NO   | MUL | 0                   |                |
| on_sale_date               | varchar(10)   | NO   | MUL | NULL                |                |
| on_sale_date_uncertain     | tinyint(1)    | NO   |     | 0                   |                |
| rating                     | varchar(255)  | NO   | MUL | NULL                |                |
| no_rating                  | tinyint(1)    | NO   | MUL | NULL                |                |
| volume_not_printed         | tinyint(1)    | NO   |     | NULL                |                |
| no_indicia_printer         | tinyint(1)    | NO   |     | NULL                |                |
+----------------------------+---------------+------+-----+---------------------+----------------+\

Series Tables

Represents a comic book series (multiple issues). For example, The Amazing Spider-man.

mysql> DESCRIBE gcd_series;
+---------------------------+--------------+------+-----+---------------------+----------------+
| Field                     | Type         | Null | Key | Default             | Extra          |
+---------------------------+--------------+------+-----+---------------------+----------------+
| id                        | int          | NO   | PRI | NULL                | auto_increment |
| name                      | varchar(255) | NO   | MUL | NULL                |                |
| sort_name                 | varchar(255) | NO   | MUL | NULL                |                |
| format                    | varchar(255) | NO   |     |                     |                |
| year_began                | int          | NO   | MUL | NULL                |                |
| year_began_uncertain      | tinyint(1)   | NO   |     | 0                   |                |
| year_ended                | int          | YES  |     | NULL                |                |
| year_ended_uncertain      | tinyint(1)   | NO   |     | 0                   |                |
| publication_dates         | varchar(255) | NO   |     |                     |                |
| first_issue_id            | int          | YES  | MUL | NULL                |                |
| last_issue_id             | int          | YES  | MUL | NULL                |                |
| is_current                | tinyint(1)   | NO   | MUL | 0                   |                |
| publisher_id              | int          | NO   | MUL | NULL                |                |
| country_id                | int          | NO   | MUL | NULL                |                |
| language_id               | int          | NO   | MUL | NULL                |                |
| tracking_notes            | longtext     | NO   |     | NULL                |                |
| notes                     | longtext     | NO   |     | NULL                |                |
| has_gallery               | tinyint(1)   | NO   | MUL | 0                   |                |
| issue_count               | int          | NO   |     | NULL                |                |
| created                   | datetime     | NO   |     | 1901-01-01 00:00:00 |                |
| modified                  | datetime     | NO   | MUL | 1901-01-01 00:00:00 |                |
| deleted                   | tinyint(1)   | NO   | MUL | 0                   |                |
| has_indicia_frequency     | tinyint(1)   | NO   |     | 1                   |                |
| has_isbn                  | tinyint(1)   | NO   |     | 1                   |                |
| has_barcode               | tinyint(1)   | NO   |     | 1                   |                |
| has_issue_title           | tinyint(1)   | NO   |     | 0                   |                |
| has_volume                | tinyint(1)   | NO   |     | 1                   |                |
| is_comics_publication     | tinyint(1)   | NO   |     | 1                   |                |
| color                     | varchar(255) | NO   |     | NULL                |                |
| dimensions                | varchar(255) | NO   |     | NULL                |                |
| paper_stock               | varchar(255) | NO   |     | NULL                |                |
| binding                   | varchar(255) | NO   |     | NULL                |                |
| publishing_format         | varchar(255) | NO   |     | NULL                |                |
| has_rating                | tinyint(1)   | NO   |     | NULL                |                |
| publication_type_id       | int          | YES  | MUL | NULL                |                |
| is_singleton              | tinyint(1)   | NO   |     | NULL                |                |
| has_about_comics          | tinyint(1)   | NO   |     | NULL                |                |
| has_indicia_printer       | tinyint(1)   | NO   |     | NULL                |                |
| has_publisher_code_number | tinyint(1)   | NO   |     | NULL                |                |
+---------------------------+--------------+------+-----+---------------------+----------------+

Publisher Tables

Represents a comic book publisher. For example, Marvel.

mysql> describe gcd_publisher;
+------------------------------+--------------+------+-----+---------------------+----------------+
| Field                        | Type         | Null | Key | Default             | Extra          |
+------------------------------+--------------+------+-----+---------------------+----------------+
| id                           | int          | NO   | PRI | NULL                | auto_increment |
| name                         | varchar(255) | NO   | MUL | NULL                |                |
| country_id                   | int          | NO   | MUL | NULL                |                |
| year_began                   | int          | YES  | MUL | NULL                |                |
| year_ended                   | int          | YES  |     | NULL                |                |
| notes                        | longtext     | NO   |     | NULL                |                |
| url                          | varchar(255) | NO   |     | NULL                |                |
| brand_count                  | int          | NO   | MUL | 0                   |                |
| indicia_publisher_count      | int          | NO   | MUL | 0                   |                |
| series_count                 | int          | NO   |     | 0                   |                |
| created                      | datetime     | NO   |     | 1901-01-01 00:00:00 |                |
| modified                     | datetime     | NO   | MUL | 1901-01-01 00:00:00 |                |
| issue_count                  | int          | NO   |     | 0                   |                |
| deleted                      | tinyint(1)   | NO   | MUL | 0                   |                |
| year_began_uncertain         | tinyint(1)   | NO   | MUL | 0                   |                |
| year_ended_uncertain         | tinyint(1)   | NO   | MUL | 0                   |                |
| year_overall_began           | int          | YES  | MUL | NULL                |                |
| year_overall_began_uncertain | tinyint(1)   | NO   | MUL | NULL                |                |
| year_overall_ended           | int          | YES  |     | NULL                |                |
| year_overall_ended_uncertain | tinyint(1)   | NO   | MUL | NULL                |                |
+------------------------------+--------------+------+-----+---------------------+----------------+

Example Queries

Lookup a comic issue with a specific barcode (including supplementary 2-5 digit code):

SELECT * FROM gcd_issue WHERE barcode = '75960608629004011';

Lookup a comic series using a specific series_id (found in the previous example):

SELECT * FROM gcd_series WHERE id = '110055';

gcd_docker's People

Contributors

thomaslaurenson avatar

Watchers

 avatar

gcd_docker's Issues

Change data processing script to output sql file

When using the published_date conversion script (varchar yo datetime) the processing time is huge. A potential solution is to dump an sql file of sql statements. Future updates to this file could use the modified datetime value to only process specific issues, thus saving time.

Goal:

  • implement first pass at sql output
  • include table generation or column addition
  • include new datetime value for published date

Move gcd > python > app to root project

It will be useful if some of the code is shared between gcd and tgd databases. So moving the python app code to a shared folder and mounting seems sensible.

Goal:

  • move app to project root
  • also move model into this folder
  • come up with a suitable folder/app name

Change project name

Potential ideas:

  • cbdb (see bee dee bee)

Where to change:

  • repo name
  • python project name

Update migration scripts

Full:

  • remove date check
  • remove print statements

Partial:

  • insert for newly created
  • patch for recently modified

Add tgd environment

Goal:

  • update docker compose file
  • create python container
  • create mysql container
  • load data into mysql
  • update makefile
  • write python test script

This should be very simple dev stack, to mess around with. Prod stack to be implemented in other project.

Split project into GCD and TGD

Goal:

  • two folders in root directory for each DB
  • shared comics py module and classes
  • update makefile where needed

Start scaffolding new TGD DB with:

  • docker compose solution for DB and app

Python script to extract publisher ID to name JSON

I don't really use the publisher table apart from mapping the ID to name. To save a select query, dump a JSON file with ID to name mapping.

Goal:

  • script to dump ID to name JSON file
  • save json file in repo
  • remove publisher select queries from scripts
  • replace with ID lookup

Update migration workflow

  • auto create publishers dump with correct name (e.g., _DATE)
  • Auto determine TGD output dump name (e.g., _DATE)

Potential automation:

  • Load up environment
  • Import dump
  • Dump publishers
  • Perform migration

Add license to match GCD

To make modifications to the original data, it makes sense to store the modified data in this repo. However, the project needs to be updated to ensure that the original GCD license is adhered to.

Goal:

  • add cc4 license, which matches GCD license
  • add info about contributing to GCD
  • add link to GCD where needed

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.