Friday, January 20, 2017

How upgrading MariaDB Server failed because 50M warnings were ignored

This post is part of the series "please do not ignore warnings in MySQL/MariaDB".  The previous post of the series can be found here.

In this post, I will present why ignoring warnings made me lose time in upgrading MariaDB Server.  I think this war story is entertaining to read and it is also worth presenting to people claiming that ignoring warnings is no big deal.

A few months ago, I was in the process of testing a new MariaDB Server version on some slaves (I usually start by installing a newer version on slaves for testing replication before upgrading the master).  Then one day, all newer slaves stop replicating with the following error:
Last_Error: Error 'Invalid utf8 character string: ''' on query.
Default database: 'my_database'.
Query: 'LOAD DATA INFILE '../tmp/SQL_LOAD-a-b-c.data' IGNORE INTO TABLE `my_table` CHARACTER SET utf8 FIELDS TERMINATED BY '\t' ENCLOSED BY '' ESCAPED BY '\\' LINES TERMINATED BY '\n' (my_field_list)'
OK, so this statement executes correctly on the master (old version) but fails on the slaves (new version).  The error message is not very clear, so I tried to run the LOAD DATA locally on the slave.  I got the same unclear error message:
> LOAD DATA INFILE '../tmp/SQL_LOAD-a-b-c.data' IGNORE INTO TABLE `test_jfg`
   CHARACTER SET utf8 FIELDS TERMINATED BY '\t' ENCLOSED BY ''
   ESCAPED BY '\\' LINES TERMINATED BY '\n' (my_field_list);
ERROR 1300 (HY000): Invalid utf8 character string: ''
At this point, I am curious of the behavior on the master, so I copied the file on a test server with the same MariaDB version as the master, created a test table mirroring the structure from the master, and tried the LOAD DATA.  I got the nightmare result below (look at the warning count):
> LOAD DATA INFILE '../tmp/SQL_LOAD-a-b-c.data' IGNORE INTO TABLE `test_jfg`
   CHARACTER SET utf8 FIELDS TERMINATED BY '\t' ENCLOSED BY ''
   ESCAPED BY '\\' LINES TERMINATED BY '\n' (my_field_list);
Query OK, 50000000 rows affected, 65535 warnings (35 min 43.44 sec)
Records: 50000000  Deleted: 0  Skipped: 0  Warnings: 50000012
So this file has 50 million rows and produces 50 million and twelve warnings.  At this point, I can guess that each row generates a warning.  Which warning is that:
> SHOW WARNINSG LIMIT 1;
+---------+------+------------------------------------------------------+
| Level   | Code | Message                                              |
+---------+------+------------------------------------------------------+
| Warning | 1366 | Incorrect integer value: '' for column 'id' at row 1 |
+---------+------+------------------------------------------------------+
1 row in set (0.00 sec)
This warning means that an empty value has been put in a NOT NULL column (AUTO_INCREMENT in this case).  It is very easy to get rid of this warning (not the subject of this post), but as the end result was OK(-ish), someone did not bother...
50 million warnings were hiding 12 warnings that should have been looked into a long time ago.
So this is why ignoring warnings is bad: 50 million warnings were hiding 12 warnings that should have been looked into a long time ago.  But at this point, I have no way to know what those 12 warnings are.  So I went back on the slave and I searched for a row that makes the LOAD DATA fail.  For that, I used a modified divide-and-conquer binary search trying to load half the file:
  • if it succeeds, I try again with the other half
  • if it fails, I divide that sub-file in two and try again
I end up finding a single row that makes the LOAD DATA fail with this error:
> LOAD DATA INFILE '/tmp/test_jfg.txt' IGNORE INTO TABLE `test_jfg`
   CHARACTER SET utf8 FIELDS TERMINATED BY '\t' ENCLOSED BY ''
   ESCAPED BY '\\' LINES TERMINATED BY '\n' (my_field_list);
ERROR 1300 (HY000): Invalid utf8 character string: 'a string'
Then I tried to load that row on the master, and I got the following:
> LOAD DATA INFILE '/tmp/test_jfg.txt' IGNORE INTO TABLE `test_jfg`
   CHARACTER SET utf8 FIELDS TERMINATED BY '\t' ENCLOSED BY ''
   ESCAPED BY '\\' LINES TERMINATED BY '\n' (my_field_list);
Query OK, 1 row affected, 1 warning (0.00 sec)
Records: 1 Deleted: 0 Skipped: 0 Warnings: 1

> SHOW WARNINSG LIMIT 1;
+---------+------+----------------------------------------------------------------------------+
| Level   | Code | Message                                                                    |
+---------+------+----------------------------------------------------------------------------+
| Warning | 1366 | Incorrect string value: '\xZZ\xZZ\xZZ\xZZ \xZZ...' for column 'b' at row 1 |
+---------+------+----------------------------------------------------------------------------+
1 row in set (0.00 sec)
Note: I replaced valid UTF8MB4 bytes by \xZZ above.

So now it becomes clear how things failed: a UTF8MB4 string is put in a UTF8 (UTF8MB3) field, the data is truncated at the first 4-bytes code point on the master, but this fails on the slave because of the different MariaDB version.

Argh ! we have been truncating data for a long time, MariaDB was warning us, but we never saw it because we were ignoring warning.  Moral of this story: always pay attention to MySQL/MariaDB warnings !
Moral of this story:
always pay attention to MySQL/MariaDB warnings !
Final note: some bug reports related to this:
And a feature request to keep some information when more than max_error_count warnings are produced: Bug#84578 and MDEV-11852 - Add a SHOW WARNINGS SUMMARY function.

2 comments:

  1. This is truly a horrifying story.

    These 5M "warnings" should actually be no-way-am-I-going-to-let-you-do-that errors. These "warnings" tell you "oh, I hope you don't mind I've just corrupted your data, but I'm going to keep doing that".

    The fact that a database (read: DATA-base) knowingly lets you corrupt your data is unthinkable. Anyone who bashes MySQL/MariaDB on this account gets complete agreement from me.

    The unsung hero is `sql_mode=STRICT_ALL_TABLES`. Ideally we should have it everywhere.

    Unfortunately we all carry database apps legacy that are used to having sql_mode='', and making the switch is untrivial.

    As a recent anecdote, gh-ost is known to have caught data corruption queries by attempting to migrate a table with corrupted data. gh-ost itself enforces STRICT_ALL_TABLES. Users would report "hey, gh-ost is breaking" while the reality it "hey, gh-ost is letting you know you've been corrupting your data all along".

    My best suggestion: use STRICT_ALL_TABLES if possible.

    ReplyDelete
  2. Related:
    http://mysqlserverteam.com/mysql-8-0-performance-schema-instrumentation-of-server-errors/

    ReplyDelete