
gh-ost's Introduction

gh-ost


GitHub's online schema migration for MySQL

gh-ost is a triggerless online schema migration solution for MySQL. It is testable and provides pausability, dynamic control/reconfiguration, auditing, and many operational perks.

gh-ost produces a light workload on the master throughout the migration, decoupled from the existing workload on the migrated table.

It has been designed based on years of experience with existing solutions, and changes the paradigm of table migrations.

How?

All existing online-schema-change tools operate in a similar manner: they create a ghost table in the likeness of your original table, migrate that table while empty, slowly and incrementally copy data from your original table to the ghost table, meanwhile propagating ongoing changes (any INSERT, DELETE, UPDATE applied to your table) to the ghost table. Finally, at the right time, they replace your original table with the ghost table.

gh-ost uses the same pattern. However, it differs from all existing tools by not using triggers. We have recognized triggers to be the source of many limitations and risks.

Instead, gh-ost uses the binary log stream to capture table changes, and asynchronously applies them onto the ghost table. gh-ost takes upon itself some tasks that other tools leave for the database to perform. As a result, gh-ost has greater control over the migration process; it can truly suspend it; it can truly decouple the migration's write load from the master's workload.

In addition, it offers many operational perks that make it safer, trustworthy and fun to use.

gh-ost general flow

Highlights

  • Build your trust in gh-ost by testing it on replicas. gh-ost will issue the same flow as it would on the master in order to migrate a table on a replica, without actually replacing the original table, leaving the replica with two tables you can then compare and satisfy yourself that the tool operates correctly. This is how we continuously test gh-ost in production.
  • True pause: when gh-ost throttles, it truly ceases writes on the master: no row copies and no ongoing event processing. By throttling, you return your master to its original workload.
  • Dynamic control: you can interactively reconfigure gh-ost, even while the migration is still running. You may forcibly initiate throttling.
  • Auditing: you may query gh-ost for status. gh-ost listens on a unix socket or TCP.
  • Control over the cut-over phase: gh-ost can be instructed to postpone what is probably the most critical step, the swap of tables, until such time that you're comfortably available. No need to worry about the ETA being outside office hours.
  • External hooks can couple gh-ost with your particular environment.

Please refer to the docs for more information. No, really, read the docs.

Usage

The cheatsheet has it all. You may be interested in invoking gh-ost in various modes:

  • a noop migration (merely testing that the migration is valid and good to go)
  • a real migration, utilizing a replica (the migration runs on the master; gh-ost figures out identities of servers involved. Required mode if your master uses Statement Based Replication)
  • a real migration, run directly on the master (but gh-ost prefers the former)
  • a real migration on a replica (master untouched)
  • a test migration on a replica, the way for you to build trust with gh-ost's operation.

Our tips:

  • Testing above all: try out --test-on-replica the first few times. Better yet, make it continuous. We have multiple replicas where we iterate through our entire fleet of production tables, migrating them one by one, checksumming the results, and verifying the migration is good.
  • For each master migration, first issue a noop.
  • Then issue the real thing via --execute.

More tips:

  • Use --exact-rowcount for accurate progress indication
  • Use --postpone-cut-over-flag-file to gain control over cut-over timing
  • Get familiar with the interactive commands

Also see:

What's in a name?

Originally this was named gh-osc: GitHub Online Schema Change, in the style of Facebook online schema change and pt-online-schema-change.

But then a rare genetic mutation happened, and the c transformed into t. And that sent us down the path of trying to figure out a new acronym. gh-ost (pronounce: Ghost), stands for GitHub's Online Schema Transmogrifier/Translator/Transformer/Transfigurator

License

gh-ost is licensed under the MIT license

gh-ost uses 3rd party libraries, each with their own license. These are found here.

Community

gh-ost is released at a stable state, but with mileage to go. We are open to pull requests. Please first discuss your intentions via Issues.

We develop gh-ost at GitHub and for the community. We may have different priorities than others. From time to time we may suggest a contribution that is not on our immediate roadmap but which may appeal to others.

Please see Coding gh-ost for a guide to getting started developing with gh-ost.

Download/binaries/source

gh-ost is now GA and stable.

gh-ost is available in binary format for Linux and Mac OS/X

Download latest release here

gh-ost is a Go project; it is built with Go 1.15 and above. To build on your own, use either:

  • script/build - this is the same build script used by CI and hence the authoritative one; the artifact is the ./bin/gh-ost binary.
  • build.sh for building tar.gz artifacts in /tmp/gh-ost

Generally speaking, master branch is stable, but only releases are to be used in production.

Authors

gh-ost is designed, authored, reviewed and tested by the database infrastructure team at GitHub:


gh-ost's Issues

Describing safe, blocking, pure-mysql cut-over phase

UPDATE

#82 overrides this. This is no longer in use.


Finally, here's a blocking cut-over phase that will, in the worst case (connections dying throughout cut-over), create a table outage (easily reversed).

Here are the steps to a safe solution:

We note different connections as C1, C2, ..., Cn.
We assume the original table is tbl and the ghost table is ghost.

In the below we note C1, C18, C19 as our own, controlling connections. We first assume no error in the below flow (a consolidated SQL sketch follows the list):

  • C1: lock tables tbl write
  • C2, C3, ..., C17: normal app connections, issuing insert, delete, update on tbl. Because of the lock, they are naturally blocked.
  • We apply those last events we need to apply onto ghost. No new events are coming our way because tbl is blocked.
  • C18: checking that C1 is still alive, then rename table tbl to tbl_old. This gets blocked.
  • C19: checking to see that C18's rename is in place (via show processlist), and that C1 is still alive; then issues: rename table ghost to tbl. This gets blocked.
  • (meanwhile more queries approach tbl, it doesn't matter, they all get deprioritized, same as C3...C17)
  • C1: unlock tables
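
A minimal SQL sketch of the controlling connections' statements per the flow above; each labeled connection is a separate session, and the table names are those assumed earlier:

-- C1: block all DML on the original table
lock tables tbl write;

-- (apply the last backlog of binlog events onto ghost; no new events arrive while tbl is locked)

-- C18: after checking that C1 is still alive; this blocks behind C1's lock
rename table tbl to tbl_old;

-- C19: after seeing C18's blocked rename in show processlist, and that C1 is still alive;
--      this queues up behind C18
rename table ghost to tbl;

-- C1: releasing the lock lets C18 execute, then C19, then the queued DML
unlock tables;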

What just happened? Let's first explain some stuff:

  • C18's rename gets prioritized over the DMLs, even though it came later. That is how MySQL prioritizes queries on metadata-locked tables.
  • C18 checks C1 is still alive, but as before, there's always the chance C1 will die just at the wrong time -- we're going to address that.
  • C19 is interested in seeing that C18 began execution, but potentially C18 will crash by the time C19 actually issues its own rename -- we're going to address that.
  • C19's query sounds weird. At that time tbl still exists. You'd expect it to fail immediately -- but it does not. It's valid. This is because tbl's metadata lock is in use.
  • C19 gets prioritized over all the DMLs, but is known to be behind C18. The two stay in the same order of arrival. So, C18 is known to execute before C19.
  • When C1 unlocks, C18 executes first.
  • Metadata lock is still in place on tbl even though it doesn't actually exist, because of C19.
  • C19 operates next.
  • Finally all the DMLs execute.

What happens on failures?

  • If C1 dies just as C18 is about to issue the rename, we get an outage: tbl is renamed to tbl_old, and the queries get released and complain the table is just not there.
    • In such case C19 will not initiate because it is executed after C18 and checks that C1 is alive -- which turns out to be untrue. So no C19.
    • So we know we have outage, and we quickly rename tbl_old to tbl; and go drink coffee, then give the entire process another try.
    • The outage is unfortunate, but does not put our data in danger.
  • If C1 happens to die just as C19 is about to issue its rename, there's no data integrity issue: at this point we've already asserted the tables are in sync. As C1 dies, C18 will immediately rename tbl to tbl_old. An outage will occur, but not for long, because C19 will next issue rename ghost to tbl, and close the gap. We suffered a minor outage, but no rollback. We roll forward.
  • If C18 happens to die just as C19 is about to issue its rename, nothing bad happens: C19 is still blocking for as long as C1 is running. We find out C18 died, and release C1. C19 attempts to rename ghost to tbl, but tbl exists (we assume C18 failed) and the query fails. The metadata lock is released and all the queries resume operation on the original tbl. The queries suffered a short block, but resume operation automatically. The operation failed but without error. We will need to try the entire cycle again.
  • If both C1 and C18 fail at the time C19 is about to begin its rename, same as above.
  • If C18 fails as C19 is already in place, same as above.
  • If C1 fails as C19 is already in place, it's as good as having it issue the unlock tables. We're happy.
  • If C19 fails at any given point, we suffer outage. We revert by rename tbl_old to tbl.

I'm grateful for reviews of this logic.

Dynamic lock timeout idea with the final table swap

There was discussion in a chat room about the lock wait timeout and retry settings for gh-ost: can these be set dynamically, rather than hardcoded?

Here's an idea:

  1. Have a lowish lock_wait_timeout value; currently it's 3 seconds in the code.
  2. Start the lock and swap sequence, where it's blocked waiting for the rest of the log to get parsed and changes run on the ghost table.
  3. Have the processlist on the master polled regularly; if it sees a certain number of threads in the "Waiting on metadata lock" state, time out the lock/swap process and retry after a period (see the sketch below).
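
A sketch of the poll in step 3, assuming we sample information_schema.processlist on the master; the exact state string may differ across MySQL versions:

-- how many sessions are queued behind the swap's metadata lock?
SELECT COUNT(*) AS waiting_threads
  FROM information_schema.processlist
 WHERE state = 'Waiting for table metadata lock';
-- if waiting_threads exceeds the configured threshold, abort the lock/swap and retry later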

A few thoughts/concerns:

  1. We might want configurable settings for
  • lock-wait-timeout (if not already there)
  • max-waiting-threads-at-swap
  • wait-before-retry
  • retry-count?
  2. Are there any other waiting process states that could occur from the lock and swap scenario? (lock tables and DDL will cause that, but mentioning the possibility anyways).

[Potential Bug] log_slave_updates needed for master only migration

Using gh-ost 1.0.8 on a 5.7 master, set up with log-bin, binlog_format=ROW, sync_binlog=1.

When I try to run using the --allow-on-master and --allow-master-master flags, it exits with:

FATAL myserver:3306 must have log_slave_updates enabled

I have read the docs but could have missed something. It's my understanding that when doing a master-only migration it should be reading from the binlog, and log_slave_updates would be unneeded in this situation.

support `--sql-log-bin` (bool)

--sql-log-bin would default to true. When false, writes to the ghost table will not be logged to the binary logs.

This may only work on:

  • --test-on-replica
  • --migrate-on-replica
  • --allow-on-master

This cannot work on normal connect-to-replica-and-write-on-master.

When GTID is enabled such writes would cause errant transactions.

Note: the changelog table also gets writes. In the event you're using GTID this still poses a problem if you're migrating on a replica. See #146 for more discussion.
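
A minimal sketch of what the flag would boil down to on the applier connection, assuming per-session control (requires the SUPER privilege):

-- issued once on the connection that writes to the ghost/changelog tables
SET SESSION sql_log_bin = 0;
-- subsequent INSERT/UPDATE/DELETE issued by this connection are not written to the binary log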

Support hooks

  • On startup
  • On end
  • Before beginning rowcopy
  • When prepared to cut-over
  • Right after cut-over
  • When throttling begins (spammy)
  • When receiving interactive command

Streamer crashes on high lag

When gh-ost runs with --test-on-replica and the replica happens to be lagging, the streamer crashes, thereby not reading further events.

After some digging and logging, we get this:

2016-05-18 07:28:09 ERROR connection was bad
2016-05-18 07:28:09 DEBUG Done streaming

Looking into it.

behavior of `--test-on-replica`

--test-on-replica is one of the strongest features of this tool. Existing tools can only work on the master. They simply cannot work on replicas. It's not that they cannot simulate the master's concurrency; it's that triggers just don't run on the replica.
Edit: thanks to @arthurnn's 👀 -- triggers don't run on replicas in Row Based Replication, whereas in Statement Based Replication (which we do use today) they do run on replicas.
Unrelated to the above, we are strongly considering moving to RBR.

We, however, are not limited to execution on master. The algorithm behaves exactly the same when running on master or replica. The binary log we read is agnostic to the host on which we execute the writes, as long as it's an ancestor (or identity).

Why would we want to run on replica?

For gaining trust. We are introducing this new tool into production. How can we trust it to migrate the data correctly?
Well, we can just try it out on a replica, and it will work in the following way:

  • We execute with --test-on-replica
  • Tool confirms we execute on a replica and panics if we run on master
  • Tool panics if said replica is an intermediate master (ie serves yet other replicas).
    • Unless --ok-to-test-on-replica-with-replicas is given
  • Tool treats the replica as the writer host (the host on which the applier operates):
    • Ghost table, changelog table are created on said replica
    • Chunk iteration, row copy are executed on said replica
    • Binlog events are applied on said replica
    • LOCK TABLES is executed on replica
  • But RENAME never happens. Instead, when gh-osc is satisfied the tables are in sync, it just issues STOP SLAVE
  • And exits

At this time the original table is actually untouched (unmodified) on the replica, and the ghost table is hopefully in good shape to replace it.

Build your trust: now's a good time to compare the tables! Are they identical? Woohoo! gh-osc has done a good job

  • Letting it execute on the master has the same effect in terms of integrity, albeit a different load, as the row-copy takes place even as concurrent connections are writing to the master (which does not happen on a replica).

Those concurrent writes nonetheless do not make for an "apples vs. oranges" comparison. It is, and has been, established that those INSERT INTO ghost_table ... SELECT FROM original_table statements are safe to use even in the face of concurrent writes. It is established to be so by the nature of transactions; it has been established to be so by the operation of existing trigger-based online schema change tools.
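
For illustration only, the kind of chunked copy statement referred to here; the column names and chunk range are hypothetical:

-- copy one chunk of rows from the original table into the ghost table;
-- transactional semantics make this safe under concurrent writes
INSERT IGNORE INTO ghost_table (id, col1, col2)
SELECT id, col1, col2
  FROM original_table
 WHERE id BETWEEN 1 AND 1000;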

It is then up to you to DROP the _New table on the replica, and hopefully trust is gained.

Describing safe, blocking, atomic, pure-mysql cut-over phase

Final-finally-finalizationally, here's an asynchronous, safe, atomic cut-over phase.
This solution doesn't cause "table outage" (as in #65).

Here are the steps for a safe, atomic cut-over:

The solution we offer is now based on two connections only (as opposed to three, in the optimistic approach). "Our" connections will be C10, C20. The "normal" app connections are C1..C9, C11..C19, C21..C29. A consolidated SQL sketch of the two controlling connections appears after the steps below.

  • Connections C1..C9 operate on tbl with normal DML: INSERT, UPDATE, DELETE

  • Connection C10: CREATE TABLE tbl_old (id int primary key) COMMENT='magic-be-here'

  • Connection C10: LOCK TABLES tbl WRITE, tbl_old WRITE

  • Connections C11..C19, newly incoming, issue queries on tbl but are blocked due to the LOCK

  • Connection C20: RENAME TABLE tbl TO tbl_old, ghost TO tbl

    • This is blocked due to the LOCK, but gets prioritized on top of connections C11..C19 and on top of C1..C9 or any other connection that attempts DML on tbl
  • Connections C21..C29, newly incoming, issue queries on tbl but are blocked due to the LOCK and due to the RENAME, waiting in queue

  • Connection C10: checks that C20's RENAME is applied (looks for the blocked RENAME in show processlist)

  • Connection C10: DROP TABLE tbl_old
    Nothing happens yet; tbl is still locked. All other connections still blocked.

  • Connection C10: UNLOCK TABLES

    BAM! The RENAME is first to execute, ghost table is swapped in place of tbl, then C1..C9, C11..C19, C21..C29 all get to operate on the new and shiny tbl
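
The two controlling connections, consolidated as a SQL sketch per the steps above:

-- C10: create the sentinel table, then lock both tables
CREATE TABLE tbl_old (id int primary key) COMMENT='magic-be-here';
LOCK TABLES tbl WRITE, tbl_old WRITE;

-- (apply the final backlog of binlog events onto ghost)

-- C20: queues behind the metadata lock, ahead of all pending DML
RENAME TABLE tbl TO tbl_old, ghost TO tbl;

-- C10: once C20's blocked RENAME shows up in SHOW PROCESSLIST
DROP TABLE tbl_old;   -- allowed while holding the WRITE lock; the RENAME stays queued
UNLOCK TABLES;        -- the RENAME executes first, then the queued DML lands on the new tbl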

Some notes

  • We create tbl_old as a blocker for a premature swap
  • It is allowed for a connection to DROP a table it has under a WRITE LOCK
  • A blocked RENAME is always prioritized over a blocked INSERT/UPDATE/DELETE, no matter who came first

What happens on failures?

Much fun. Just works; no rollback required.

  • If C10 errors on the CREATE we do not proceed.
  • If C10 errors on the LOCK statement, we do not proceed. The table is not locked. App continues to operate as normal.
  • If C10 dies just as C20 is about to issue the RENAME:
    • The lock is released, the queries C1..C9, C11..C19 immediately operate on tbl.
    • C20's RENAME immediately fails because tbl_old exists.
    • The entire operation fails, but nothing terrible happens; some queries were blocked for some time, is all. We will need to retry everything.
  • If C10 dies while C20 is blocked on RENAME: mostly similar to the above. The lock is released, then C20 fails the RENAME (because tbl_old exists), then all queries resume normal operation.
  • If C20 dies before C10 drops the table, we catch the error and let C10 proceed as planned: DROP, UNLOCK. Nothing terrible happens; some queries were blocked for some time. We will need to retry.
  • If C20 dies just after C10 DROPs the table but before the unlock, same as above.
  • If both C10 and C20 die, no problem: LOCK is cleared; RENAME lock is cleared. C1..C9, C11..C19, C21..C29 are free to operate on tbl.

No matter what happens, at the end of the operation we look for the ghost table. Is it still there? Then we know the operation failed, "atomically". Is it not there? Then it has been renamed to tbl, and the operation worked atomically.
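
That check is a simple existence test; a sketch, using the table names above:

-- after the dust settles: is the ghost table still there?
SELECT COUNT(*) AS ghost_still_exists
  FROM information_schema.tables
 WHERE table_schema = DATABASE() AND table_name = 'ghost';
-- 1: the cut-over failed, atomically; 0: ghost was renamed to tbl and the cut-over succeeded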

A side note on failure is the matter of cleaning up the magic tbl_old. This is a matter of taste: maybe just let it live and avoid recreating it, or drop it if you like.

Impact on app

App connections are guaranteed to be blocked, either until ghost is swapped in, or until operation fails. In the former, they proceed to operate on the new table. In the latter, they proceed to operate on the original table.

Impact on replication

Replication only sees the RENAME. There is no LOCK in the binary logs. Thus, replication sees an atomic two-table swap. There is no table-outage.

Displaying contents of root directory rather than `/*`

Noticed the output of a run from last night showed what looks to be the directory listing of / instead of /*.
What I observed:

2016-07-07 04:15:14 INFO Issuing and expecting this to block: rename /bin /boot /data /dev /etc /home /initrd.img /initrd.img.old /lib /lib32 /lib64 /lost+found /media /mnt /opt /proc /root /run /sbin /selinux /srv /sys /tmp /usr /var /vmlinuz /vmlinuz.old gh-ost */ table `<schema>`.`<table>` to `<schema>`.`_<table>_del`, `<schema>`.`_<table>_gho`

What I think should happen:

2016-07-07 04:15:14 INFO Issuing and expecting this to block: rename /* gh-ost */ table `<schema>`.`<table>` to `<schema>`.`_<table>_del`, `<schema>`.`_<table>_gho`

mysqlbinlog reader proof of concept

As an initial proof of concept, we need to be able to read & parse binary logs in real time and without accumulating lag.
If we are not able to read binary logs in real time, this whole project cannot work.
Current options are:

  • Invoking mysqlbinlog --verbose --base64-output=DECODE-ROWS, parsing textual output.

    While this will be the slowest option, for sure, it is also the safest, mysqlbinlog being the authoritative tool that it is. If we can make it happen with this, we win.

  • Vitess

    I would trust that Vitess is reliable & stable.

  • myreplication

    unfamiliar with it

  • go-mysql

    Played with this toolkit a couple times, seems nice; actively maintained

All three programmatic approaches (Vitess, myreplication, go-mysql) connect to a MySQL server and act as a slave. I would prefer working directly on the binary logs via file system, but still good to go.

Minor feature requests regarding output

  1. Can the help info (that lists the postpone/panic/throttle specs, output at regular intervals) also point out whether any of the flags are set?
...
# Migration started at Tue Jul 26 04:45:34 -0700 2016
# chunk-size: 100; max-lag-millis: 2100ms; max-load: Threads_running=25; critical-load: Threads_running=1000; nice-ratio: 0
# Throttle additional flag file: /tmp/gh-ost.throttle
# Postpone cut-over flag file: /tmp/ghost.postpone.flag
# Panic flag file: /tmp/ghost.panic.flag
# Serving on unix socket: /tmp/gh-ost.github_production.issue_comments.sock

I added a postpone flag to a currently running migration and you can't tell. Could be something to add to the status info above.

  2. Does the rotation message need to be posted twice, or is this some artifact of our slack tools?
2016-07-26 14:33:09 INFO rotate to next log name: mysql-bin.006906
2016-07-26 14:33:09 INFO rotate to next log name: mysql-bin.006906

Various TODOs

At this point in development there are just a gazillion things to sift through. I'm writing them down; some of these are just next steps for development; others are concepts.

conf file credentials don't work properly

The reason is that we use the OS user as the default user, and it's never empty. Thus, the cnf file info does not override it.

We should really think hard about what should override what. Basically, command line arguments should override anything else.

gh-ost doesn't clean up its socket file

On a second run:
2016-08-03 10:51:34 FATAL listen unix /tmp/gh-ost.sakila.rental.sock: bind: address already in use

Check with ss -x src /tmp/gh-ost.sakila.rental.sock to see that nobody is listening on the socket.

Support triggers [community contribution welcome]

Triggers can be supported by gh-ost. This is not on our immediate roadmap, and so it's an ideal opportunity for community contribution.

Some notes:

  • as far as I can see, all sorts of triggers could be supported (before|after, insert|delete|update)
  • throughout the migration process, nothing is really done; the ghost table should NOT have triggers.
  • during the cut-over phase we can copy+paste the triggers from the original table to the ghost table (see the sketch after this list)
  • Preferably the triggers are read at migration start, not at the cut-over phase
  • triggers can be applied on the ghost table after the backlog is drained and the tables are in sync (after this.waitForEventsUpToLock())
  • triggers must be removed in any case of rollback, and in such case, before gh-ost resumes writes onto the ghost tables (no problem if original table resumes writes)
  • Note that at this time gh-ost supports two types of cut-over, and both must be supported.
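
A hypothetical sketch of that cut-over step, assuming a single BEFORE INSERT trigger named tbl_bi that maintains an updated_at column; names and trigger body are illustrative only:

-- read the trigger definitions at migration start
SHOW TRIGGERS LIKE 'tbl';

-- at cut-over, after the backlog is drained, recreate each trigger on the ghost table
CREATE TRIGGER tbl_bi_gho BEFORE INSERT ON _tbl_gho
FOR EACH ROW SET NEW.updated_at = NOW();

-- on rollback, drop it again before gh-ost resumes writes onto the ghost table
DROP TRIGGER IF EXISTS tbl_bi_gho;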

Anyone who is interested in jumping on this, please let me know beforehand. Thank you!

Describing cut-over phase via UDF wait condition

I wrote these UDF functions (will cross-link once in a proper repo):

  • create_ghost_wait_condition()
  • destroy_ghost_wait_condition()
  • ghost_wait_on_condition()

For now, they use a singular, global wait condition (maybe in the future we will support multiple).
The wait condition is a lock which is not bound by a connection. So if I:

select create_ghost_wait_condition()

Then the lock is taken, and is kept taken even if my connection dies. Anyone can, at any time:

select destroy_ghost_wait_condition()

And release the lock.

The function ghost_wait_on_condition() returns immediately if condition is free, or blocks if condition is taken. Multiple ghost_wait_on_condition() can run concurrently and they will all wait. Once destroy_ghost_wait_condition() is called they all get released.

Cut-over via wait condition

We have these tables:

  • tbl - original table
  • _tbl_gst - ghost table

The sequence of events is as follows (a consolidated SQL sketch follows the list):

  1. create view _tbl_gst_v as select * from _tbl_gst where ghost_wait_on_condition() is not null with check option

    Breakdown:

    • ghost_wait_on_condition() is always not null
    • with check option means every insert, delete, or update on existing rows will validate that the view definition is met, i.e. the where clause is satisfied, i.e. we wait on the lock.
    • the view uses the MERGE algorithm; all queries are implicitly pushed down to the table (no temporary table buffer)
  2. select create_ghost_wait_condition()

    • Nothing happens yet. We got our lock but no one is attempting to use it.
  3. rename table tbl to _tbl_old, _tbl_gst_v to tbl

    • we push the view instead of the original table. The view reads/writes to _tbl_gst
    • but it is blocked. insert|delete|update queries operating on _tbl_gst_v are blocked
      • note: except for update or delete that operate on non-existent rows (hence make no change, hence not visible in RBR anyhow, hence irrelevant)
  4. Working on backlog, applying all those last changes read from the binary log onto _tbl_gst

    • noteworthy: _tbl_gst itself is not blocked in any way.
  5. select destroy_ghost_wait_condition()

    • writes are now enabled on tbl (our view). It still reads/writes to _tbl_gst.
    • all blocked queries are suddenly released and operating on the view
    • new incoming queries operate on the view
  6. rename table tbl to _tbl_gst_v_old, _tbl_gst to tbl

    • we now move our view aside, putting our ghost table in its place.
  7. drop view _tbl_gst_v_old

  8. 🍕
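
The same sequence, consolidated as a SQL sketch (the ghost_wait_* functions are the UDFs described above, not built-in MySQL functions):

-- 1. updatable MERGE view over the ghost table, gated on the wait condition
create view _tbl_gst_v as
  select * from _tbl_gst where ghost_wait_on_condition() is not null
  with check option;

-- 2. take the global wait condition
select create_ghost_wait_condition();

-- 3. swap the view into the original table's place; DML on it now blocks
rename table tbl to _tbl_old, _tbl_gst_v to tbl;

-- 4. (apply the remaining binlog backlog directly onto _tbl_gst, which is not blocked)

-- 5. release: blocked and newly incoming queries proceed through the view onto _tbl_gst
select destroy_ghost_wait_condition();

-- 6. move the view aside and put the ghost table in its place, then clean up
rename table tbl to _tbl_gst_v_old, _tbl_gst to tbl;
drop view _tbl_gst_v_old;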

Will you please run this through your virtual interpreter in your brains? Let's assume the UDFs work perfectly well (and they're simple enough to support that assumption).

Mix of --<option> and -<option>

The output of gh-ost --help shows short options like -database in Go style.
However, gh-ost without options shows 2016-08-03 09:16:34 FATAL --database must be provided and database name must not be empty, with --database instead.

Let's settle on the Go-style one-dash options instead of the GNU-style long options.
This prevents the later use of Unix-style short options, but that should be fine.

gh-ost should display a warning if it's run against a table with triggers defined

From the docs:

Triggers are not supported. They may be supported in the future.

The output of gh-ost should probably mention this as at least a warning if it detects any triggers on the table to be altered. As it stands now, it silently leaves them attached to the old table, and if --ok-to-drop-table is defined, silently deletes them.

parsing mysqlbinlog output

As per #1, I can see or think of the following:

  • Generally, a binlog entry starts with # at 123456 and the following entry has an end_log_pos
  • But sometimes an entry looks like this: ### Row event for unknown table ... at 123456
  • For now both are handled, but are there further such cases?
    • For now this matters, because I'm using the end_log_pos from the previous entry to validate that the current entry has the same number. It's just a self-test
    • But if we cannot use this self test -- that's also OK, as long as we don't miss out on anything
  • At this time we don't even bother checking for rolledback transactions or halfway-through transactions
    • This may be an issue, making parsing of mysqlbinlog less ideal than the already less-than-ideal state. We'll see.
  • How do we tail a binlog file? The size of the binlog file does not necessarily indicate an end-pos, because it might be only halfway-written throughout the current transaction.
    • So we would need to connect to the replica and issue SHOW MASTER STATUS or SHOW BINARY LOGS, as the output of these commands does indicate a true position at which a statement/entry is complete.

Further working towards an atomic table swap

Basically what we're looking for is a wait_for_condition function that blocks until some condition actively turns true. GET_LOCK is a reverse implementation, because it blocks on an active operation (another process holding the lock), as opposed to unblocking by an active operation.

Some other things I was looking at are MASTER_POS_WAIT, which is difficult to work with, especially with multiple concurrent migrations, and LOAD_FILE(), which places a read-lock on the file itself (requires mount options on the file system).

There are no further blocking functions in mysql, to the best of my knowledge.

However, it is easy for us to write one. I'm able to produce a UDF function that blocks on a condition, such that said condition is not connection/session dependent and such that it takes an explicit activation of that condition (someone saying "yes, release!").

We do not need to fork MySQL for that; we just need to have that function as a shared library, compatible with the versions we have deployed.

  • That's an issue: we would typically need to have this function compiled per MySQL version we're using. Not a big deal, and we have puppet to watch over our back anyhow. Some minor versions would not require recompiling, but we would err on the safe side.

An example of a UDF which I wrote in the past is https://github.com/outbrain/audit_login ; it worked & works pretty well, with the hassle of recompiling per version.

This really solves our problems. It's not pure 100% MySQL, but we can open source this plugin, and support a choice of swap algorithms:

  • bumpy (the one FB uses today, table ceases to exist for a short duration)
  • Locky, pure MySQL (the GET_LOCK() implementation). Death of connections will cause premature rename
  • UDF. No death risk. We need to develop this. We need to make sure we don't crash our servers (I haven't crashed anything with the audit_login)

Question: why are master/master configs unsupported?

The tool bails out if a master-master pair is detected anywhere in the replication hierarchy (even if it's in a different branch), but I'm not actually sure why? I can see why it might cause issues with --test-on-replica (if the replica were a member of the master/master pair), but are there any other issues to be aware of? Given a topology like the following:

[replication topology diagram]

Is there any danger running gh-ost pointed to one of R1/R2/R3?

Disable binlogging when running with --test-on-replica?

What would you think about running SET SESSION sql_log_bin=0 before applying changes when running in --test-on-replica mode? Right now, after testing on a replica, we need to wipe it and restore it from scratch (to prevent any invalid GTIDs from creeping in the event of a later promotion of that host to master). This is kind of irritating, and if we just turn off binlogging for that session, we wouldn't necessarily have to reset the box afterwards.

Explicitly WARN about pausing/throttling and binlog expiration

<captaineyesight> Might I suggest putting some kind of bolded note about only being able to throttle as long as binary logs hang around. I know it should be 100% obvious for everything else written. I just don’t want someone crying that they paused their schema change for 10 days, they only saved 5 days of binary logs, and that somehow is the tool’s fault.

<rlowe> I'd go one step further and warn interactively when they pause where possible.

ETA to use PID calculation

The current ETA heuristic is oversimplistic: it uses (runtime so far) * (total number of rows) / (copied number of rows).

This is susceptible to momentary pauses, spikes, etc. We need to be able to better predict the ETA. A PID controller is a classic solution and easy to implement.

Too many colons in address when using IPv6 address literals for replication

If replication is set up using IPv6 address literals for the master host name, gh-ost dies with "2016-08-05 23:25:51 ERROR dial tcp: too many colons in address fd00:aaaa:bbbb::1:3306".

As far as I can tell, the mysql driver seems to expect IPv6 addresses to be written as [a:b:c::d]:3306, so this may just be a matter of detecting when Master_Host is an IPv6 literal and adding some square brackets?

Support --cleanup flag

The --cleanup flag is mutually exclusive with --alter. It indicates we wish to clean up the ghost and changelog tables.

Like --alter, it would default to noop (in which case it should print the drop statements), and would support --execute.

gh-ost tries to update generated columns

If the old table has a generated column, then gh-ost tries to insert into the generated column on the new table.
It does not detect the error and tries to continue.

The solution is:

  1. detect the failure and stop
  2. skip generated columns on insert to avoid the error (see the sketch after the example table)
  3. Get Oracle to add an option to MySQL to blackhole inserts on generated columns (This is wrong..)

Example table:

CREATE TABLE `salaries` (
  `emp_no` int(11) NOT NULL,
  `salary` int(11) NOT NULL,
  `from_date` date NOT NULL,
  `to_date` date NOT NULL,
  `yearly_per_year` int(11) GENERATED ALWAYS AS ((`salary` * 12)) VIRTUAL,
  PRIMARY KEY (`emp_no`,`from_date`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4
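
A sketch of option 2 against the table above: enumerate only the real columns, so that the server computes the generated one; the chunk range is illustrative:

-- the generated column yearly_per_year is omitted from the column list;
-- MySQL fills it in from its GENERATED ALWAYS expression on the ghost table
INSERT /* gh-ost */ IGNORE INTO `employees`.`_salaries_gho` (`emp_no`, `salary`, `from_date`, `to_date`)
SELECT `emp_no`, `salary`, `from_date`, `to_date`
  FROM `employees`.`salaries`
 WHERE `emp_no` >= 10001 AND `emp_no` <= 11000;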

Example output

$ sudo rm -rf /tmp/gh-ost.employees.salaries.sock; /usr/lib/gh-ost/gh-ost/gh-ost -table salaries -throttle-control-replicas replica1,replica2 -alter "ADD COLUMN test1 tinyint unsigned DEFAULT NULL" -conf /home/dvaneeden/.my_gh-ost.cnf -database employees -host $(hostname -f) -initially-drop-old-table -initially-drop-ghost-table -execute -test-on-replica
# Migrating `employees`.`salaries`; Ghost table is `employees`.`_salaries_gho`
# Migrating dvaneeden-test-2002.example.com:3306; inspecting dvaneeden-test-2002.example.com:3306; executing on dvaneeden-test-2002.example.com
# Migration started at Fri Aug 05 13:32:35 +0000 2016
# chunk-size: 1000; max-lag-millis: 1500ms; max-load: ; critical-load: ; nice-ratio: 0.000000
# throttle-additional-flag-file: /tmp/gh-ost.throttle 
# Serving on unix socket: /tmp/gh-ost.employees.salaries.sock
Copy: 0/2838426 0.0%; Applied: 0; Backlog: 0/100; Time: 0s(total), 0s(copy); streamer: binlog.000017:100461102; ETA: N/A
2016-08-05 13:32:36 ERROR Error 3105: The value specified for generated column 'yearly_per_year' in table '_salaries_gho' is not allowed.
Copy: 0/2838426 0.0%; Applied: 0; Backlog: 0/100; Time: 1s(total), 1s(copy); streamer: binlog.000017:100463211; ETA: N/A
2016-08-05 13:32:36 ERROR Error 3105: The value specified for generated column 'yearly_per_year' in table '_salaries_gho' is not allowed.
Copy: 0/2838426 0.0%; Applied: 0; Backlog: 0/100; Time: 1s(total), 1s(copy); streamer: binlog.000017:100463211; ETA: N/A
Copy: 0/2838426 0.0%; Applied: 0; Backlog: 1/100; Time: 3s(total), 1s(copy); streamer: binlog.000017:100465895; ETA: N/A
Copy: 0/2838426 0.0%; Applied: 0; Backlog: 1/100; Time: 4s(total), 1s(copy); streamer: binlog.000017:100466739; ETA: N/A
Copy: 0/2838426 0.0%; Applied: 0; Backlog: 1/100; Time: 5s(total), 1s(copy); streamer: binlog.000017:100467583; ETA: N/A
Copy: 0/2838426 0.0%; Applied: 0; Backlog: 1/100; Time: 6s(total), 1s(copy); streamer: binlog.000017:100468427; ETA: N/A
Copy: 0/2838426 0.0%; Applied: 0; Backlog: 1/100; Time: 7s(total), 1s(copy); streamer: binlog.000017:100469271; ETA: N/A
Copy: 0/2838426 0.0%; Applied: 0; Backlog: 1/100; Time: 8s(total), 1s(copy); streamer: binlog.000017:100470115; ETA: N/A
Copy: 0/2838426 0.0%; Applied: 0; Backlog: 1/100; Time: 9s(total), 1s(copy); streamer: binlog.000017:100470959; ETA: N/A
Copy: 0/2838426 0.0%; Applied: 0; Backlog: 1/100; Time: 10s(total), 1s(copy); streamer: binlog.000017:100471803; ETA: N/A
^CCopy: 0/2838426 0.0%; Applied: 0; Backlog: 1/100; Time: 11s(total), 1s(copy); streamer: binlog.000017:100472648; ETA: N/A
Copy: 0/2838426 0.0%; Applied: 0; Backlog: 1/100; Time: 12s(total), 1s(copy); streamer: binlog.000017:100473493; ETA: N/A
Copy: 0/2838426 0.0%; Applied: 0; Backlog: 1/100; Time: 13s(total), 1s(copy); streamer: binlog.000017:100474338; ETA: N/A
^C

Working towards an atomic table swap

As per #26, the solution of atomically swapping the tables is unsafe in the event of death of connections.

I've been considering some other solutions. I've reached a dead end with all of them, but hopefully someone else can pick up where I stopped and find a solution.

An idea based on views, derived from my earlier work on http://code.openark.org/blog/mysql/auto-caching-tables is as follows:

Views & SLEEP(), dead-end on the SLEEP() part

  • We assume we want to alter table tbl
  • Ghost table is called ghost
  • We issue: create or replace view ghost_view as select * from ghost where sleep(600)>=0
    (this view is updatable, and can receive INSERT, UPDATE, DELETE)
  • To begin the swap, we issue: rename table tbl to tbl_old, ghost_view to tbl
    RENAME works just as well on views as it does on tables
  • Queries are now executing on the view, but are stalling for a long enough period
    We now have time to apply changes onto ghost
  • Problem/Dead-end: we would like to rename table tbl to tbl_old_view, ghost to tbl but the rename is blocked until all those hanging queries are complete. There's no way to migrate those already running queries onto the renamed ghost table.
    • I tried two levels of views (a view calling a view calling the table) and then swapping the mid-layered view. Still does not work. The rename hangs until the queries complete, which is not what we want.

Any ideas?

Views & GET_LOCK(), spaghetti on GET_LOCK()

Before I realized the spaghetti was there, I found this solution to be appealing:

  • True, it uses GET_LOCK, so the same "what if the connection dies" problem is still there
  • However I reduce the number of risky connections from 2 to 1
  • And then make it possible to have n connections that will share the risk: it would take the death of all n connections to cause a premature rename (as opposed to the death of any one connection)

And then I found the spaghetti.

The solution is similar to the above, but:

  • Connection #1 issues GET_LOCK('ding', 0) and succeeds
  • We create or replace view ghost_view as select * from ghost where get_lock('ding',600) >= 0
  • We rename table tbl to tbl_old, ghost_view to tbl
    queries are blocked on tbl (which is now a view)
  • We complete writing events onto ghost
  • We rename table tbl to tbl_old_view, ghost to tbl
    • this blocks due to the already existing queries
  • We RELEASE_LOCK('ding') in connection #1
  • Problem: queries are released, but are now blocking each other!!!
  • rename unblocks and we have our migrated table in place.

Before discussing the Problem, note that we can make a variant as follows:

  • Have connections #1 .. #n issue a SELECT GET_LOCK('ding#1', 600) to GET_LOCK('ding#n', 600) (we take n locks)
  • Vary the view: create or replace view ghost_view as select * from ghost where get_lock('ding#1',600)+get_lock('ding#2',600)+...+get_lock('ding#n',600) >= 0

Such that it would take the death of all n connections to make a premature rename.

So, back to the problem. Each query will issue get_lock and queries will block each other, leading to a really scary workload. We can hack around this by doing crazy stuff like:
create view... select where release_lock(concat('ding', if(get_lock('ding',600) >= 0, '', ''))) >= 0. This will make sure to release any acquired lock.

  • This still scares me because of crazy locking contention.

Ideas?

Out-of-order application of binlog events

In the attempt to make binlog event application non-blocking to the operation (to throttling, to status events), an out-of-order scenario was introduced.
This was found when experimenting with intensively written production tables (hey, I was testing with --test-on-replica so no risk to data!).

The source of problem is https://github.com/github/gh-osc/blob/master/go/logic/migrator.go#L615-L626 . Note the true as the async parameter.
We need to find a solution where binlog events are async relative to other events, but remain streamlined in order among themselves.

Getting by without SUPER privilege

Hi,
We have a case in which we would like to use gh-ost with AWS RDS. However, the problem with this managed database service is that it does not grant us the SUPER privilege.
Is there any way of using gh-ost without having SUPER on the server?
Thanks

Consider renaming to gh-ost

Because ghost
Acronym-wise, it's easy: online schema translate/transform/transfer/tamper/thing

If we do stick with gh-osc (compatible with existing tool names), consider this official acronym:
online-schema-corrupt

expected features

Throwing around all ideas, in no particular order, lest we forget.

Potential too-early-rename via killed connections

Paraphrasing a lot due to my inability to understand Go: at the point where the tool is ready to swap tables, there are these connections:

  1. GET_LOCK('something', 0);
  2. SELECT RELEASE_LOCK('something') FROM original_table WHERE GET_LOCK('something',999)>=0 LIMIT 1; # or similar
  3. RENAME TABLE original_table to original_old, original_new to original;

And then there's a check for the parsed binlog writes to have all gone to original_new, and then locks are released and the rename happens.

However, if either the GET_LOCK() connection or the SELECT RELEASE_LOCK() connection gets killed, the RENAME happens. So there are two places where a killed connection could result in a too-early-swap and some binlog events being lost.

I like the idea of confidence in being able to say that gh-ost wouldn't result in unintentional lost rows/writes. Maybe in this case it might involve locking the now-original-table after the rename and writing the missed events then (e.g. we'd know because the REPLACE/DELETE to original_new would have failed). Just brainstorming.

Document "normal" recovery after panic file usage...

I was looking at #71 and wasn't sure if it made sense to have the error message include the normal recovery procedures. Or if, in a panic, there is no normal recovery. Or if recovery is really so easy that basically it is just a matter of deleting the old file. If it is just a matter of, "In a normal recovery I would have deleted the _gst table and gone to have a beer," then this issue isn't needed. If there is more cleanup, then I think we should consider having something more than the "peace out ✌️ ❤️" message. (I may be paraphrasing your infinitely more accurate error message in the PR.)

Foreign key error is not informative

After dropping 3 foreign keys on sakila.rental gh-ost complains:

2016-08-03 09:33:39 ERROR Found 1 foreign keys on `sakila`.`rental`. Foreign keys are not supported. Bailing out
2016-08-03 09:33:39 FATAL 2016-08-03 09:33:39 ERROR Found 1 foreign keys on `sakila`.`rental`. Foreign keys are not supported. Bailing out

There is a foreign key on sakila.payment which points to rental.

Error message about privileges is not helpful

Example:

2016-08-05 08:57:13 ERROR User has insufficient privileges for migration.
2016-08-05 08:57:13 FATAL 2016-08-05 08:57:13 ERROR User has insufficient privileges for migration.

It should tell the user:

  • what is needed
  • what it found

Known limitations

This will be a listing of known limitations for this tool. We may edit the comments.

How to recover from "ETA: postponing cut-over" ?

First, this tool is very wonderful and full of imagination, thank you a lot.
I tested it in my environment: my table PlayerInventory has 3 million rows and is 30G in size.
While gh-ost was running, I executed an UPDATE statement, and gh-ost output the following:

Copy: 11241000/299550386 3.8%; Applied: 0; Backlog: 0/100; Time: 5m0s(total), 3m41s(copy); streamer: binlog20000.000052:21082625; ETA: 1h34m28s
2016-08-11 15:49:30 ERROR Error 1205: Lock wait timeout exceeded; try restarting transaction
2016-08-11 15:49:30 ERROR Error 1205: Lock wait timeout exceeded; try restarting transaction
2016-08-11 15:49:30 INFO Row copy complete
Copy: 11241000/299550386 3.8%; Applied: 0; Backlog: 0/100; Time: 5m26s(total), 4m7s(copy); streamer: binlog20000.000052:21091001; ETA: 1h45m49s
Copy: 11241000/299550386 3.8%; Applied: 0; Backlog: 0/100; Time: 5m30s(total), 4m7s(copy); streamer: binlog20000.000052:21092614; ETA: postponing cut-over
Copy: 11241000/299550386 3.8%; Applied: 0; Backlog: 0/100; Time: 5m35s(total), 4m7s(copy); streamer: binlog20000.000052:21094542; ETA: postponing cut-over

Since then it has stayed in that state, that is, "ETA: postponing cut-over", and never heals, but the UPDATE is already over.
How do I get gh-ost to continue and complete rather than terminate?
My command is as follows:

./gh-ost \
--max-load=Threads_running=25 \
--critical-load=Threads_running=1000 \
--chunk-size=1000 \
--throttle-control-replicas="replip" \
--max-lag-millis=1500 \
--user="gh" \
--password="gh123" \
--host="replip" \
--port=20000 \
--database="test" \
--table="PlayerInventory" \
--verbose \
--alter="change orig_source orig_source bigint" \
--switch-to-rbr \
--allow-master-master \
--cut-over=default \
--exact-rowcount \
--default-retries=120 \
--postpone-cut-over-flag-file=/data/dbbak/gh-ost/ghost.postpone.flag \
--execute

Can anyone help me? Thank you in advance :)
