Comments (16)
Thank you, Matthias! I think this information is enough to close the issue with the resolution: "use MariaDB or PostgreSQL".
from dataset.
Acknowledged, will take a look tomorrow.
Thank you for providing the command line that fails. I will make sure it works.
from dataset.
Hello, gkrasin!
I found the cause: a title or author name contains emoji.
I removed the emoji with the following Python script:
import re

def removeEmoji(text):
    emoji_pattern = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)  # no emoji
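As a quick sanity check (the function is repeated here so the snippet runs standalone), the pattern strips an emoji while leaving ordinary text, including CJK characters, untouched:

```python
import re

def removeEmoji(text):
    emoji_pattern = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)

# The emoji is stripped; ordinary text, including CJK characters, is untouched.
print(removeEmoji(u"Sunset \U0001F600 over the bay"))  # -> "Sunset  over the bay"
print(removeEmoji(u"2014 \u6703"))                     # -> "2014 會"
```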
Thank you <3
from dataset.
Thanks for the hint! I will remove emojis from the titles and upload a new tarball (tomorrow). While they are legitimate unicode symbols, it's not surprising that a lot of existing software fails to process them.
from dataset.
Hm... after filtering out the symbols in the ranges proposed above, MySQL fails on Chinese characters:
ERROR 1300 (HY000): Invalid utf8 character string: '"2014**會'
Have you encountered that as well, or is the problem specific to my environment?
from dataset.
I have found this wonderful answer: http://stackoverflow.com/a/10959780
MySQL's utf8 permits only the Unicode characters that can be represented with 3 bytes in UTF-8. Here you have a character that needs 4 bytes: \xF0\x90\x8D\x83 (U+10343 GOTHIC LETTER SAUIL).
If you have MySQL 5.5 or later you can change the column encoding from utf8 to utf8mb4. This encoding allows storage of characters that occupy 4 bytes in UTF-8.
You may also have to set the server property character_set_server to utf8mb4 in the MySQL configuration file.
I am currently figuring out the correct options to /etc/mysql/my.cnf to get it working.
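The 3-byte limit is easy to check from Python: characters on the Basic Multilingual Plane, such as 會, encode to at most 3 bytes in UTF-8, while emoji and the Gothic letter quoted above need 4, which legacy utf8 (utf8mb3) rejects:

```python
# -*- coding: utf-8 -*-
# MySQL's legacy "utf8" (really utf8mb3) stores at most 3 bytes per
# character, so any code point above U+FFFF cannot be stored in it.
samples = [
    (u"\u6703", "CJK character (BMP, fits in utf8mb3)"),
    (u"\U0001F600", "emoji (supplementary plane, needs utf8mb4)"),
    (u"\U00010343", "GOTHIC LETTER SAUIL (needs utf8mb4)"),
]
for ch, label in samples:
    print("%s -> %d bytes in UTF-8" % (label, len(ch.encode("utf-8"))))
```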
from dataset.
I changed to utf8mb4 using the following command:
sudo bash -c 'printf "[client]\ndefault-character-set=utf8mb4\n[mysql]\ndefault-character-set=utf8mb4\n[mysqld]\ncharacter-set-server=utf8mb4\ncollation-server=utf8mb4_unicode_ci\n" >> /etc/mysql/my.cnf'
but I hit the same issue:
ERROR 1300 (HY000): Invalid utf8mb4 character string: '"2014**會'
Strangely enough, I successfully imported the CSV into MySQL using a Python ORM:
#! -*- coding: utf-8 -*-
"""
Required package:
pip install peewee
"""
from peewee import *
import csv
import re

imagesPath = "/path/to/the/images.csv"
db = MySQLDatabase('openimage', user='root', passwd="hogehoge", port=3306, host="localhost")

class Images(Model):
    image_id = CharField(primary_key=True)
    original_url = TextField()
    original_landing_url = TextField()
    license = CharField()
    author_profile_url = TextField()
    author = CharField(null=True)
    title = TextField(null=True)

    class Meta:
        database = db

db.connect()
db.create_tables([Images], True)

def removeEmoji(text):
    emoji_pattern = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)  # no emoji

def importImages():
    with open(imagesPath) as csvfile:
        reader = csv.reader(csvfile, delimiter=',', quotechar='"')
        header = next(reader)
        dataSource = []
        for row in reader:
            record = {}
            record["image_id"] = row[0]
            record["original_url"] = row[1]
            record["original_landing_url"] = row[2]
            record["license"] = row[3]
            record["author_profile_url"] = row[4]
            record["author"] = removeEmoji(row[5])
            record["title"] = removeEmoji(row[6])
            dataSource.append(record)
        # Insert in batches of 100 rows inside one transaction.
        with db.atomic():
            for idx in range(0, len(dataSource), 100):
                Images.insert_many(dataSource[idx:idx + 100]).execute()

if __name__ == "__main__":
    importImages()
from dataset.
FWIW, I have also verified that the contents of images.csv are valid UTF-8.
from dataset.
With the following settings, it's still not working:
mysql> show variables like '%char%';
+--------------------------+----------------------------+
| Variable_name | Value |
+--------------------------+----------------------------+
| character_set_client | utf8mb4 |
| character_set_connection | utf8mb4 |
| character_set_database | utf8mb4 |
| character_set_filesystem | binary |
| character_set_results | utf8mb4 |
| character_set_server | utf8mb4 |
| character_set_system | utf8 |
| character_sets_dir | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
8 rows in set (0.01 sec)
mysql> show variables like '%collation%';
+----------------------+--------------------+
| Variable_name | Value |
+----------------------+--------------------+
| collation_connection | utf8mb4_unicode_ci |
| collation_database | utf8mb4_unicode_ci |
| collation_server | utf8mb4_unicode_ci |
+----------------------+--------------------+
I added the following to /etc/mysql/my.cnf:
[mysqld]
init_connect = 'SET NAMES utf8mb4'
init_connect = 'SET collation_connection = utf8mb4_unicode_ci; SET NAMES utf8mb4;'
character-set-client-handshake = FALSE
character-set-server = utf8mb4
collation-server = utf8mb4_unicode_ci
#character_set_server=utf8mb4
#character_set_client=utf8mb4
#character_set_connection=utf8mb4
#character_set_results=utf8mb4
[client]
default-character-set = utf8mb4
[mysql]
default-character-set = utf8mb4
I also tried various other combinations. It seems to be a general MySQL issue rather than a problem with the CSVs provided here. Suggestions from MySQL experts are welcome!
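One more avenue, in case the failing command was a LOAD DATA import (an assumption on my side, since the exact command isn't shown above): LOAD DATA accepts its own CHARACTER SET clause, which overrides the session defaults. A sketch of such a statement, with hypothetical file and table names:

```python
# Assumption: the failing import used LOAD DATA INFILE. The statement
# accepts its own CHARACTER SET clause, overriding the session defaults.
# The file path and table name below are placeholders.
load_sql = (
    "LOAD DATA LOCAL INFILE '/path/to/images.csv' "
    "INTO TABLE Images "
    "CHARACTER SET utf8mb4 "
    "FIELDS TERMINATED BY ',' ENCLOSED BY '\"' "
    "IGNORE 1 LINES;"
)
print(load_sql)
```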
from dataset.
Thank you for your hard work!
from dataset.
Postgres works out of the box:
$ psql postgres
# create database openimages;
# CREATE TABLE Images (ImageID VARCHAR,
OriginalURL VARCHAR,
OriginalLandingURL VARCHAR,
License VARCHAR,
AuthorProfileURL VARCHAR,
Author VARCHAR,
Title VARCHAR);
# COPY Images FROM '/home/krasin/v2/images_2016_08/train/images.csv' DELIMITER ',' CSV HEADER;
COPY 9011219
Note: I have picked lame column types, because that was not the focus of my exercise here.
I think I will add a new Subset column to the CSV, equal to "train" or "validation", to allow easy importing of both sets into one SQL table.
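That Subset column could also be added client-side with a short script; a minimal sketch (the file paths and the column name Subset are placeholders):

```python
import csv

def add_subset_column(src_path, dst_path, subset):
    # Append a Subset column ("train" or "validation") to every row,
    # so both CSVs can be imported into one SQL table and told apart.
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        header = next(reader)
        writer.writerow(header + ["Subset"])
        for row in reader:
            writer.writerow(row + [subset])
```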
from dataset.
I really wanted to know what database you are using.
I will have to switch to Postgres.
from dataset.
Internally, I use Spanner. Out of the options available outside of Google, Postgres seems to be the best.
from dataset.
Wow, it is designed for the scale of the earth!
I am reading the Postgres tutorial.
Thank you!
from dataset.
I have started a doc about importing into Postgres: https://github.com/openimages/dataset/wiki/Importing-into-PostgreSQL
If someone figures out a proper way to set up MySQL to accept all these Unicode strings, a similar doc would be welcome.
from dataset.
Not quite MySQL, but it does work flawlessly in MariaDB which is a fork & drop-in replacement for MySQL:
MariaDB> show variables like '%collation%';
+----------------------+-----------------+
| Variable_name | Value |
+----------------------+-----------------+
| collation_connection | utf8_general_ci |
| collation_database | utf8_general_ci |
| collation_server | utf8_general_ci |
+----------------------+-----------------+
3 rows in set (0.00 sec)
MariaDB> show variables like '%char%';
+--------------------------+---------------------------------------------------------+
| Variable_name | Value |
+--------------------------+---------------------------------------------------------+
| character_set_client | utf8 |
| character_set_connection | utf8 |
| character_set_database | utf8 |
| character_set_filesystem | binary |
| character_set_results | utf8 |
| character_set_server | utf8 |
| character_set_system | utf8 |
| character_sets_dir | /usr/local/Cellar/mariadb/10.1.17/share/mysql/charsets/ |
+--------------------------+---------------------------------------------------------+
8 rows in set (0.00 sec)
MariaDB> insert into bla values ("2014**會");
Query OK, 1 row affected (0.00 sec)
This is on MariaDB 10.1.17 installed with homebrew on macOS, with no changes to the default configuration.
from dataset.