Coder Social home page Coder Social logo

Comments (16)

gkrasin avatar gkrasin commented on July 22, 2024 1

Thank you, Matthias! I think this information is enough to close the issue with the resolution: "use MariaDB or PostgreSQL".

from dataset.

gkrasin avatar gkrasin commented on July 22, 2024

Acknowledged, will take a look tomorrow.

Thank you for providing the command line that fails. I will make sure it works.

from dataset.

tzatter avatar tzatter commented on July 22, 2024

Hello! gkrasin
Found the cause
Title or author name contains emoji
I removed emoji by following the python script

def removeEmoji(text):
    emoji_pattern = re.compile("["
                               u"\U0001F600-\U0001F64F"  # emoticons
                               u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                               u"\U0001F680-\U0001F6FF"  # transport & map symbols
                               u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                               "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)  # no emoji

Thank you <3

from dataset.

gkrasin avatar gkrasin commented on July 22, 2024

Thanks for the hint! I will remove emojis from the titles and upload a new tarball (tomorrow). While they are legitimate unicode symbols, it's not surprising that a lot of existing software fails to process them.

from dataset.

gkrasin avatar gkrasin commented on July 22, 2024

Hm... after filtering out the symbols from the range proposed above, mysql fails on Chinese characters:
ERROR 1300 (HY000): Invalid utf8 character string: '"2014**會'

Have you encountered that as well, or the problem is specific to my environment?

from dataset.

gkrasin avatar gkrasin commented on July 22, 2024

I have found this wonderful answer: http://stackoverflow.com/a/10959780

MySQL's utf8 permits only the Unicode characters that can be represented with 3 bytes in UTF-8. Here you have a character that needs 4 bytes: \xF0\x90\x8D\x83 (U+10343 GOTHIC LETTER SAUIL).

If you have MySQL 5.5 or later you can change the column encoding from utf8 to utf8mb4. This encoding allows storage of characters that occupy 4 bytes in UTF-8.

You may also have to set the server property character_encoding_server to utf8mb4 in the MySQL configuration file.

I am currently figuring out the correct options to /etc/mysql/my.cnf to get it working.

from dataset.

tzatter avatar tzatter commented on July 22, 2024

I changed to utf8mb4 using following the command

sudo bash -c 'printf "[client]\ndefault-character-set=utf8mb4\n[mysql]\ndefault-character-set=utf8mb4\n[mysqld]\ncharacter-set-server=utf8mb4\ncollation-server=utf8mb4_unicode_ci\n" >> /etc/mysql/my.cnf'

but I met same issue

ERROR 1300 (HY000): Invalid utf8mb4 character string: '"2014**會'

strange but successful I imported the csv to mysql using python ORM

#! -*- coding: utf-8 -*-
"""
Requirement Package
pip install peewee
"""

from peewee import *
import os
import csv
import re

imagesPath = "/path/to/the/images.csv"
db = MySQLDatabase('openimage', user='root', passwd="hogehoge", port=3306, host="localhost")

class Images(Model):
    image_id = CharField(primary_key=True)
    original_url = TextField()
    original_landing_url = TextField()
    license = CharField()
    author_profile_url = TextField()
    author = CharField(null = True)
    title = TextField(null = True)

    class Meta:
        database = db

db.connect()
db.create_tables([Images], True)

def removeEmoji(text):
    emoji_pattern = re.compile("["
                               u"\U0001F600-\U0001F64F"  # emoticons
                               u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                               u"\U0001F680-\U0001F6FF"  # transport & map symbols
                               u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                               "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)  # no emoji

def importImages():
    with open(imagesPath) as csvfile:
        reader = csv.reader(csvfile, delimiter=',', quotechar='"')
        header = next(reader)
        dataSource = []
        for row in reader:
            dict = {}
            dict["image_id"] = row[0]
            dict["original_url"] = row[1]
            dict["original_landing_url"] = row[2]
            dict["license"] = row[3]
            dict["author_profile_url"] = row[4]
            dict["author"] = removeEmoji(row[5])
            dict["title"] = removeEmoji(row[6])
            dataSource.append(dict)

        with db.atomic():
            for idx in range(0, len(dataSource), 100):
                Images.insert_many(dataSource[idx:idx + 100]).execute()

if __name__ == "__main__":
    importImages()
    pass

from dataset.

gkrasin avatar gkrasin commented on July 22, 2024

FWIW, I have also verified that the contents of images.csv are valid UTF-8.

from dataset.

gkrasin avatar gkrasin commented on July 22, 2024

With the following it's still not working:

mysql> show variables like '%char%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | utf8mb4                    |
| character_set_connection | utf8mb4                    |
| character_set_database   | utf8mb4                    |
| character_set_filesystem | binary                     |
| character_set_results    | utf8mb4                    |
| character_set_server     | utf8mb4                    |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
8 rows in set (0.01 sec)

mysql> show variables like '%collation%';
+----------------------+--------------------+
| Variable_name        | Value              |
+----------------------+--------------------+
| collation_connection | utf8mb4_unicode_ci |
| collation_database   | utf8mb4_unicode_ci |
| collation_server     | utf8mb4_unicode_ci |
+----------------------+--------------------+

I added the following to /etc/mysql/my.cnt:

[mysqld]
init_connect = 'SET NAMES utf8mb4'
init_connect = 'SET collation_connection = utf8mb4_unicode_ci; SET NAMES utf8mb4;'

character-set-client-handshake = FALSE
character-set-server = utf8mb4
collation-server = utf8mb4_unicode_ci

#character_set_server=utf8mb4
#character_set_client=utf8mb4
#character_set_connection=utf8mb4
#character_set_results=utf8mb4

[client]
default-character-set = utf8mb4

[mysql]
default-character-set = utf8mb4

I also tried various other combination. It seems to be a standard issue with MySQL, and not with the CSVs provided here. Suggestions from MySQL experts are welcome!

from dataset.

tzatter avatar tzatter commented on July 22, 2024

Thank you for working hard!

from dataset.

gkrasin avatar gkrasin commented on July 22, 2024

Postgres works out of the box:

$ pqs postgres
# create database openimages;
# CREATE TABLE Images (ImageID VARCHAR,
OriginalURL VARCHAR,
OriginalLandingURL VARCHAR,
License VARCHAR,
AuthorProfileURL VARCHAR,
Author VARCHAR,
Title VARCHAR);
# COPY Images FROM '/home/krasin/v2/images_2016_08/train/images.csv' DELIMITER ',' CSV HEADER;
COPY 9011219

Note: I have picked lame values for columns types, because it was not the focus for my exercise here.

I think I will add a new column to CSV with Subset equals "train" or "validation" to allow easy importing of both sets into one SQL table.

from dataset.

tzatter avatar tzatter commented on July 22, 2024

I really wanted to know what database you are using
I must change to postgres

from dataset.

gkrasin avatar gkrasin commented on July 22, 2024

Internally, I use Spanner. Out of the options available outside of Google, Postgres seems to be the best.

from dataset.

tzatter avatar tzatter commented on July 22, 2024

wow It designed for scale of the earth
I am reading postgres tutorial
Thank you!

from dataset.

gkrasin avatar gkrasin commented on July 22, 2024

I have started a doc about importing into Postgres: https://github.com/openimages/dataset/wiki/Importing-into-PostgreSQL

If someone figures out a proper way to setup MySQL to accept all these unicode strings, writing a similar doc is welcome.

from dataset.

MatthiasWinkelmann avatar MatthiasWinkelmann commented on July 22, 2024

Not quite MySQL, but it does work flawlessly in MariaDB which is a fork & drop-in replacement for MySQL:

MariaDB> show variables like '%collation%';
+----------------------+-----------------+
| Variable_name        | Value           |
+----------------------+-----------------+
| collation_connection | utf8_general_ci |
| collation_database   | utf8_general_ci |
| collation_server     | utf8_general_ci |
+----------------------+-----------------+
3 rows in set (0.00 sec)

MariaDB>  show variables like '%char%';
+--------------------------+---------------------------------------------------------+
| Variable_name            | Value                                                   |
+--------------------------+---------------------------------------------------------+
| character_set_client     | utf8                                                    |
| character_set_connection | utf8                                                    |
| character_set_database   | utf8                                                    |
| character_set_filesystem | binary                                                  |
| character_set_results    | utf8                                                    |
| character_set_server     | utf8                                                    |
| character_set_system     | utf8                                                    |
| character_sets_dir       | /usr/local/Cellar/mariadb/10.1.17/share/mysql/charsets/ |
+--------------------------+---------------------------------------------------------+
8 rows in set (0.00 sec)

MariaDB> insert into bla values ("2014**會");
Query OK, 1 row affected (0.00 sec)

This is on MariaDB 10.1.17 installed with homebrew on macOS, with no changes to the default configuration.

from dataset.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.