
Comments (3)

hermanlee commented on April 26, 2024

Comment by yoshinorim
Monday Feb 02, 2015 at 12:11 GMT


I like the space optimization for cs collations. That is, store:

  • (mem_comparable_form) for a cs collation, with no duplicate copy of the value;
  • (mem_comparable_form, restore_data) for a ci collation, with restore_data stored in the value part of the key-value pair.

from mysql-5.6.

hermanlee commented on April 26, 2024

Comment by spetrunia
Monday Feb 02, 2015 at 16:40 GMT


Exploring how to make a bi-directional mapping

 value <=> (mem_comparable_form, restore_data).

latin1_swedish_ci, latin1_general_ci, (and other collations) have these properties:

  • strxfrm(char c) returns a character (a value between 0x00-0xFF).
  • The problem is that there are sets of characters X,Y,Z, ... where
    X!=Y, X!=Z, but strxfrm(X)==strxfrm(Y)==strxfrm(Z). In charset
    terminology, these characters have "the same weight" (let's call them a "weight group")

In latin1_swedish_ci, 114 characters have weight conflicts. They form 31 weight groups whose size ranges from two members (19 groups) to ten members (4 groups).

in latin1_general_ci: 56 characters have weight conflicts. They form 28 groups of two members each.

latin1_general_cs has no weight conflicts (we can just enable index_only for it).
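
Counts like the ones above can be obtained by bucketing all 256 byte values by weight. A minimal sketch, assuming a per-byte weight function is available: `toy_weight` below is a hypothetical stand-in that only folds ASCII case, not the real latin1_swedish_ci sort table, so its numbers differ from those quoted.

```python
from collections import defaultdict

def toy_weight(byte: int) -> int:
    """Hypothetical stand-in for a collation's per-byte weight:
    folds ASCII lowercase letters onto their uppercase counterparts."""
    if ord("a") <= byte <= ord("z"):
        return byte - 0x20
    return byte

def weight_groups(weight, alphabet=range(256)):
    """Return {weight: [bytes sharing it]} for weights shared by 2+ bytes."""
    by_weight = defaultdict(list)
    for b in alphabet:
        by_weight[weight(b)].append(b)
    return {w: members for w, members in by_weight.items() if len(members) > 1}

groups = weight_groups(toy_weight)
conflicting = sum(len(members) for members in groups.values())
print(len(groups), conflicting)  # prints "26 52" for the toy weight
```

A collation with no weight conflicts, such as latin1_general_cs, would yield an empty `groups` dictionary here.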

Constructing restore_data

value <=> (mem_comparable_form, restore_data).

Let's take one character of the "value".
If its weight is unique, it can be restored from its mem-comparable form.

If its weight is shared with a set of characters $WEIGHT_GROUP, we can assign (statically) a number to each member of the set. The number can be stored in restore_data.

  • $WEIGHT_GROUP is usually small, so the number will only occupy a few bits.
  • by looking at the mem-comparable form, we can find the character's $WEIGHT_GROUP, and so know how many
    bits (if any) of restore_data hold its number within the group.
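
The scheme can be sketched as a round trip. This is only an illustration under stated assumptions: `toy_weight` is a hypothetical ASCII case-folding weight, restore_data is modelled as a Python integer used as a bit accumulator, and the function names are made up for this example rather than taken from the storage engine.

```python
from collections import defaultdict

def toy_weight(byte: int) -> int:
    """Hypothetical per-byte weight: ASCII lowercase folds to uppercase."""
    return byte - 0x20 if ord("a") <= byte <= ord("z") else byte

def build_groups(weight):
    """Map every weight to the statically ordered list of bytes sharing it."""
    groups = defaultdict(list)
    for b in range(256):
        groups[weight(b)].append(b)
    return dict(groups)

def bits_needed(group_size: int) -> int:
    """Bits required to number the members of a group (0 for a unique weight)."""
    return (group_size - 1).bit_length()

def encode(value: bytes, weight, groups):
    """Split value into (mem_comparable_form, restore_data bits, bit count)."""
    key, restore, nbits = bytearray(), 0, 0
    for b in value:
        w = weight(b)
        members = groups[w]
        width = bits_needed(len(members))
        key.append(w)
        restore = (restore << width) | members.index(b)
        nbits += width
    return bytes(key), restore, nbits

def decode(key: bytes, restore: int, nbits: int, groups) -> bytes:
    """Rebuild the value: each weight tells us how many bits to consume."""
    out = bytearray()
    for w in key:
        members = groups[w]
        width = bits_needed(len(members))
        nbits -= width
        index = (restore >> nbits) & ((1 << width) - 1)
        out.append(members[index])
    return bytes(out)

groups = build_groups(toy_weight)
key, restore, nbits = encode(b"Hello, World", toy_weight, groups)
restored = decode(key, restore, nbits, groups)
print(key, nbits, restored)
```

Characters with a unique weight contribute zero bits, so in this toy case only the ten letters of the input consume space in restore_data.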

Other 1-byte charsets

Some collations of 1-byte charsets, such as latin1_german2_ci, may map a single character to two bytes: most characters have a 1-byte mem_comparable_form, but some have a 2-byte one. It looks like our approach could be extended to handle those, too.

from mysql-5.6.

hermanlee commented on April 26, 2024

Comment by spetrunia
Monday Feb 02, 2015 at 18:15 GMT


Unicode

// our charsets guru is currently not available, but we've had a discussion about this before and I've taken another look now.

Most important

  • utf8_bin - already handled
  • utf8_general_ci - the "simpler" collation, provides basic sorting
  • utf8_unicode_ci - more complex
  • utf8mb4 - ??

utf8_general_ci

Sorts about 64K characters; non-trivial sorting is provided for 2816 of them
(AFAIU, "alphabets").

Basically, it extends the 1-byte charset approach to multiple "pages".
There are 11 pages of 256 elements each that have non-trivial case conversions or sorting rules, such that weight(X)!=X.
For the other pages, weight(X) = two_byte_form(X).

Pages with non-trivial case conversions can be handled in the same way as was proposed for 1-byte charsets.
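
The page lookup can be sketched as below. The page contents here are hypothetical toy tables (ASCII and basic Cyrillic case folding only), not MySQL's real utf8_general_ci data, and `make_page`, `PAGES`, `weight` and `mem_comparable` are names invented for this illustration.

```python
def make_page(page_no: int, folds: dict) -> list:
    """256-entry weight table for one page; identity except for `folds`."""
    table = [(page_no << 8) | low for low in range(256)]
    for src, dst in folds.items():
        table[src & 0xFF] = dst
    return table

# Toy non-trivial pages (hypothetical contents): lowercase folds to uppercase.
PAGES = {
    0x00: make_page(0x00, {c: c - 0x20 for c in range(ord("a"), ord("z") + 1)}),
    0x04: make_page(0x04, {c: c - 0x20 for c in range(0x0430, 0x0450)}),
}

def weight(code_point: int) -> int:
    """Weight of a BMP code point: table lookup on non-trivial pages only."""
    page = PAGES.get(code_point >> 8)
    if page is None:
        return code_point  # trivial page: weight(X) is X itself
    return page[code_point & 0xFF]

def mem_comparable(text: str) -> bytes:
    """Two big-endian bytes per character; assumes BMP-only input."""
    return b"".join(weight(ord(ch)).to_bytes(2, "big") for ch in text)

print(mem_comparable("ab") == mem_comparable("AB"))
print(hex(weight(0x0436)), hex(weight(0x4E2D)))
```

Each non-trivial page can then get its own weight groups, exactly as in the 1-byte case, while characters on trivial pages need no restore_data at all.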

utf8_unicode_ci

This is a more complex collation, as it does things like ß = 'ss' for German, etc.
We may have difficulties with fully supporting it, because some languages have rather complex rules.
OTOH, these languages are not that common, so for those we could just store the source character in restore_data. This will double the space used.

from mysql-5.6.
