Comments (3)
Comment by yoshinorim
Monday Feb 02, 2015 at 12:11 GMT
I like the space optimization for cs collations, that is:
- (mem_comparable_form) only for cs collations, so nothing is stored twice
- (mem_comparable_form, restore_data) for ci collations, with restore_data stored in the value part of the key-value pair.
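A minimal sketch of the two layouts, to make the split concrete. Everything here is an assumption for illustration: `toy_weight` is a made-up ci weight function (it only folds ASCII a-z onto A-Z), the cs case is taken to be the identity mapping, and restore_data is a naive one-flag-byte-per-character encoding rather than anything compact.

```python
def toy_weight(byte: int) -> int:
    """Hypothetical ci weight: fold ASCII a-z onto A-Z, leave other bytes alone."""
    return byte - 32 if 0x61 <= byte <= 0x7A else byte


def encode_index_entry(value: bytes, case_sensitive: bool) -> tuple:
    """Return (key_part, value_part) for one indexed column value."""
    if case_sensitive:
        # cs (toy assumption): the mem-comparable form is one-to-one with the
        # value, so the key alone is enough and the value part stays empty.
        return value, b""
    # ci: the key holds the weights, the value part holds restore_data.
    key = bytes(toy_weight(b) for b in value)
    # Naive restore_data: one flag byte per character (1 = was lowercase).
    restore_data = bytes(1 if 0x61 <= b <= 0x7A else 0 for b in value)
    return key, restore_data


print(encode_index_entry(b"Abc", case_sensitive=True))
print(encode_index_entry(b"Abc", case_sensitive=False))
```

In the ci case both `b"Abc"` and `b"ABC"` get the same key part, and only the value part tells them apart.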
from mysql-5.6.
Comment by spetrunia
Monday Feb 02, 2015 at 16:40 GMT
Exploring how to make a bi-directional mapping
value <=> (mem_comparable_form, restore_data).
latin1_swedish_ci, latin1_general_ci, (and other collations) have these properties:
- strxfrm(char c) returns a character (a value between 0x00-0xFF).
- The problem is that there are sets of characters X,Y,Z, ... where
X!=Y, X!=Z, but strxfrm(X)==strxfrm(Y)==strxfrm(Z). In charset
terminology, these characters have "the same weight" (let's call them a "weight group")
In latin1_swedish_ci, 114 characters have weight conflicts. They form 31 weight groups; the group size varies from two members (19 groups) to ten members (4 groups).
In latin1_general_ci, 56 characters have weight conflicts. They form 28 groups of two members each.
latin1_general_cs has no weight conflicts (we can just enable index_only for it).
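Counts like the ones above can be obtained by grouping all 256 byte values by their weight. A rough sketch follows; `toy_weight` is a hypothetical stand-in for a collation's sort-order table (it only folds ASCII a-z onto A-Z), so the numbers it produces are for the toy table, not for any real latin1 collation.

```python
from collections import defaultdict


def toy_weight(byte: int) -> int:
    """Hypothetical weight table: fold ASCII a-z onto A-Z."""
    return byte - 32 if 0x61 <= byte <= 0x7A else byte


def weight_groups(weight, charset_size: int = 256) -> dict:
    """Map each shared weight to the sorted list of characters that have it."""
    by_weight = defaultdict(list)
    for c in range(charset_size):
        by_weight[weight(c)].append(c)
    # Keep only weights shared by two or more characters.
    return {w: members for w, members in by_weight.items() if len(members) > 1}


groups = weight_groups(toy_weight)
conflicting = sum(len(members) for members in groups.values())
print(len(groups), "weight groups,", conflicting, "characters with conflicts")
```

A collation with no weight conflicts gives an empty dict here, which is the case where index-only access needs no restore_data at all.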
Constructing restore_data
value <=> (mem_comparable_form, restore_data).
Let's take one character of the "value".
If its weight is unique, it can be restored from its mem-comparable form.
If its weight is shared with a set of characters $WEIGHT_GROUP, we can assign (statically) a number to each member of the set. The number can be stored in restore_data.
- $WEIGHT_GROUP is usually small, so the number will only occupy a few bits.
- by looking at the character's mem_comparable_form, we can find its $WEIGHT_GROUP, and know how many bits (if any) are used in restore_data to store the number of the value in the group.
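The scheme can be sketched as a pair of encode/decode functions. This is an illustration under assumptions, not the actual implementation: `toy_weight` is a hypothetical weight table (ASCII case folding only), restore_data is kept as an integer plus a bit count instead of packed bytes, and the decoder finds each character's weight group from its byte in the mem-comparable form.

```python
from collections import defaultdict


def toy_weight(byte: int) -> int:
    """Hypothetical weight table: fold ASCII a-z onto A-Z."""
    return byte - 32 if 0x61 <= byte <= 0x7A else byte


# Static table: weight -> members of its weight group, in a fixed order.
GROUPS = defaultdict(list)
for c in range(256):
    GROUPS[toy_weight(c)].append(c)


def bits_for(group_size: int) -> int:
    """Bits needed to number the members: 1 -> 0, 2 -> 1, 10 -> 4."""
    return (group_size - 1).bit_length()


def encode(value: bytes):
    """value -> (mem_comparable_form, restore_bits, restore_bit_count)."""
    key = bytearray()
    restore, nbits = 0, 0
    for c in value:
        w = toy_weight(c)
        members = GROUPS[w]
        n = bits_for(len(members))
        key.append(w)
        # Append the member's number within its group (zero bits if unique).
        restore = (restore << n) | members.index(c)
        nbits += n
    return bytes(key), restore, nbits


def decode(key: bytes, restore: int, nbits: int) -> bytes:
    """(mem_comparable_form, restore_data) -> original value."""
    out = bytearray()
    remaining = nbits
    for w in key:
        members = GROUPS[w]
        n = bits_for(len(members))
        remaining -= n
        index = (restore >> remaining) & ((1 << n) - 1)
        out.append(members[index])
    return bytes(out)


key, restore, nbits = encode(b"aB1")
print(key, bin(restore), nbits)
print(decode(key, restore, nbits))
```

Characters with a unique weight contribute zero bits, so a value made only of such characters has empty restore_data.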
Other 1-byte charsets
Some 1-byte charsets with collations like latin1_german2_ci may map a single character to two bytes (most characters have a 1-byte mem_comparable_form, but some have a 2-byte one). It looks like our approach could be extended to handle those, too.
Comment by spetrunia
Monday Feb 02, 2015 at 18:15 GMT
Unicode
// our charsets guru is currently not available, but we've had a discussion about this before and I've taken another look now.
Most important
- utf8_bin - Already handled
- utf8_general_ci - the "simpler" collation, provides basic sorting
- utf8_unicode_ci - more complex
- utf8mb4 - ??
utf8_general_ci
Sorts about 64K characters, non-trivial sorting provided for 2816 characters
(AFAIU, "alphabets").
Basically, it extends the 1-byte charset approach to multiple "pages".
There are 11 pages of 256 elements each (11 * 256 = 2816) that have non-trivial case conversions or sorting rules such that weight(X)!=X.
For other pages, weight(X) = two_byte_form(X).
Pages with non-trivial case conversions can be handled in the same way as was proposed for 1-byte charsets.
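A small sketch of the paged lookup described above, under assumptions: `PAGES` is a hypothetical page table with a single made-up non-trivial page (page 0 with ASCII case folding), not the real utf8_general_ci data, and only code points up to U+FFFF are handled.

```python
def make_toy_page0() -> list:
    """Hypothetical non-trivial page 0x00: fold ASCII a-z onto A-Z."""
    page = list(range(256))
    for c in range(0x61, 0x7B):
        page[c] = c - 32
    return page


# Page number (high byte of the code point) -> 256-entry table of 16-bit weights.
# Pages missing from this dict are "trivial": weight(X) == X.
PAGES = {0x00: make_toy_page0()}


def weight(code_point: int) -> int:
    if code_point > 0xFFFF:
        raise ValueError("this sketch only covers code points up to U+FFFF")
    page = PAGES.get(code_point >> 8)
    if page is None:
        return code_point          # trivial page: weight is the code point itself
    return page[code_point & 0xFF]


def mem_comparable(text: str) -> bytes:
    """Two bytes per character, big-endian, so bytewise order follows weight order."""
    return b"".join(weight(ord(ch)).to_bytes(2, "big") for ch in text)


print(mem_comparable("aZ"))
print(hex(weight(0x4E2D)))
```

Only characters on the non-trivial pages can belong to a weight group, so only those would ever need bits in restore_data.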
utf8_unicode_ci
This is a more complex collation, as it does things like ß = 'ss' for German, etc.
We may have difficulties with fully supporting it, because some languages have rather complex rules.
OTOH, these languages are not that common, so for those we could just store the source character in the restore_data. This will double the used space.