
Uses UTF8MB4 everywhere #8425

Merged
Sesquipedalian merged 42 commits into SimpleMachines:release-3.0 from Sesquipedalian:3.0/utf8mb4
Mar 24, 2025

Conversation

@Sesquipedalian
Member

@Sesquipedalian Sesquipedalian commented Jan 30, 2025

Fixes #7938
Fixes #7173
Closes #6409
Closes #6406

Signed-off-by: Jon Stovell <jonstovell@gmail.com>
@jdarwood007
Member

Looks like 90% of this is just removing and hardcoding UTF-8 on everything. InnoDB is a fairly safe conversion. Just may take longer for larger forums on certain tables, but nothing can be avoided in timeout protections for that. It looks good from what I see.

Signed-off-by: Jon Stovell <jonstovell@gmail.com>
@sbulen
Contributor

sbulen commented Feb 3, 2025

I'll run some test upgrades. Merge conflicts need to be resolved first, though.

Or is there a prerequisite PR?

Signed-off-by: Jon Stovell <jonstovell@gmail.com>

# Conflicts:
#	Sources/Db/APIs/MySQL.php
@Sesquipedalian
Member Author

> I'll run some test upgrades. Merge conflicts need to be resolved first, though.

Great! Thank you. 🙂 Merge conflict has been resolved.

> Or is there a prerequisite PR?

Nope.

@sbulen
Contributor

sbulen commented Feb 3, 2025

First test was an upgrade of a new, vanilla 2.1.4 forum to 3.0, via CLI.

I followed the old 2.1.x protocol, where I would copy the upgrade files over from the /other folder, then run upgrade.php.

DB: MySQL, version 8.4.0
php: 8.4.2

Had a few errors, here is the complete output:

  • Updating Settings.php... Successful.
    Error: Undefined variable $file_substrsteps File: D:\wamp64\www\van2130\upgrade.php Line: 2403 * +++ Updating old values done.
    +++ Changing default values done.
    . Successful.
  • +++ Removing all karma data, if selected done.
    Successful.
  • +++ Emptying error log, if selected done.
    Successful.
  • +++ Adding login history... done.
    +++ Copying the current package backup setting... done.
    +++ Copying the current "allow users to disable word censor" setting...Error: Trying to access array offset on null File: D:\wamp64\www\van2130\upgrade.php(2521) : eval()'d code Line: 7 done.
    +++ Converting collapsed categories... done.
    +++ Parsing board descriptions and names done.
    +++ Dropping "collapsed_categories" done.
    +++ Adding new "topic_move_any" setting done.
    +++ Adding new "enable_ajax_alerts" setting done.
    +++ Adding new "alerts_auto_purge" setting done.
    +++ Adding new "minimize_files" setting done.
    +++ Collapse object done.
    +++ Adding new "DEFAULTMaxListItems" setting done.
    +++ Adding new "loginHistoryDays" setting done.
    +++ Enable some settings we ripped from Theme settings done.
    +++ Adding new "httponlyCookies" setting done.
    +++ Adding new "samesiteCookies" setting done.
    +++ Calculate appropriate hash costError: Passing E_USER_ERROR to trigger_error() is deprecated since 8.4, throw an exception or call exit with a string message instead File: D:\wamp64\www\van2130\Sources\Db\APIs\MySQL.php Line: 2527Error: Invalid data structure sent to the database.(upgrade.php(2521) : eval()'d code-1) File: D:\wamp64\www\van2130\Sources\Db\APIs\MySQL.php Line: 2527
    Fatal error: Uncaught TypeError: array_combine(): Argument 2 ($values) must be of type array, string given in D:\wamp64\www\van2130\Sources\Db\APIs\MySQL.php on line 430

TypeError: array_combine(): Argument 2 ($values) must be of type array, string given in D:\wamp64\www\van2130\Sources\Db\APIs\MySQL.php on line 430

Call Stack:
0.0079 1402080 1. {main}() D:\wamp64\www\van2130\upgrade.php:0
0.4792 5966336 2. DatabaseChanges() D:\wamp64\www\van2130\upgrade.php:371
0.4833 5967968 3. parse_sql($filename = 'D:\wamp64\www\van2130/upgrade_2-1_MySQL.sql') D:\wamp64\www\van2130\upgrade.php:1869
2.3360 6283880 4. eval('D:\wamp64\www\van2130\upgrade.php(2521) : eval()'d code') D:\wamp64\www\van2130\upgrade.php:2521
3.0980 6781696 5. SMF\Db\APIs\MySQL->insert($method = 'replace', $table = '{db_prefix}settings', $columns = ['variable' => 'string', 'value' => 'string'], $data = [0 => 'bcrypt_hash_cost', 1 => 14], $keys = [0 => 'variable'], $returnmode = ???, $connection = ???) D:\wamp64\www\van2130\upgrade.php(2521) : eval()'d code:1
3.0982 6782016 6. array_combine($keys = [0 => 'variable', 1 => 'value'], $values = 'bcrypt_hash_cost') D:\wamp64\www\van2130\Sources\Db\APIs\MySQL.php:430
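
The call stack above suggests that `insert()` was given a single flat row (`['bcrypt_hash_cost', 14]`) and then paired the column names with each element of `$data` in turn, so `array_combine()` received the string `'bcrypt_hash_cost'` instead of a row. The following is a minimal Python sketch of that failure mode and one possible guard. `normalize_rows` and `pair_columns` are hypothetical names used for illustration only; they are not SMF functions, and this is not necessarily how the issue was fixed.

```python
from typing import Any, Sequence


def normalize_rows(data: Sequence[Any]) -> list[list[Any]]:
    """Wrap a single flat row in a list so callers can always iterate over rows."""
    if data and not isinstance(data[0], (list, tuple)):
        return [list(data)]
    return [list(row) for row in data]


def pair_columns(columns: Sequence[str], data: Sequence[Any]) -> list[dict[str, Any]]:
    """Pair column names with each row's values (similar to array_combine per row)."""
    paired = []
    for row in normalize_rows(data):
        if len(row) != len(columns):
            raise ValueError(f"expected {len(columns)} values, got {len(row)}")
        paired.append(dict(zip(columns, row)))
    return paired


# The flat row from the stack trace is treated as one row, not as two rows.
print(pair_columns(["variable", "value"], ["bcrypt_hash_cost", 14]))
```

With the guard in place, a flat row and a list of rows both produce one dictionary per row.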

@sbulen
Contributor

sbulen commented Feb 3, 2025

I repeated the above test via browser, & get this error. Note the 2.1 environment being upgraded is using the default, English.

This may help understand the CLI errors:

[Screenshot: pr8425_upgr_msg]

@Sesquipedalian
Member Author

Okay, so it looks like we have some unrelated upgrader bugs to fix before you can even get to the point of testing the new ConvertUtf8() logic in this PR.

Oh, the joys of the upgrader never cease. 😒

@sbulen
Contributor

sbulen commented Feb 3, 2025

Looks like 2 things here...

@sbulen
Contributor

sbulen commented Feb 3, 2025

Probably the same issues in a different form... But I attempted an install in the same environment.

This comes up after entering the DB credentials, etc. Same issues in php 8.3 & 8.4:

Warning: Undefined array key 0 in D:\wamp64\www\van2130\Sources\Db\APIs\MySQL.php on line 915

Warning: Trying to access array offset on null in D:\wamp64\www\van2130\Sources\Db\APIs\MySQL.php on line 915

Fatal error: Uncaught TypeError: SMF\Db\APIs\MySQL::detect_charset(): Return value must be of type string, null returned in D:\wamp64\www\van2130\Sources\Db\APIs\MySQL.php:934
Stack trace:
#0 D:\wamp64\www\van2130\Sources\Db\APIs\MySQL.php(2195): SMF\Db\APIs\MySQL->detect_charset('messages', 'body')
#1 D:\wamp64\www\van2130\Sources\Db\DatabaseApi.php(331): SMF\Db\APIs\MySQL->__construct(Array)
#2 D:\wamp64\www\van2130\install.php(904): SMF\Db\DatabaseApi::load(Array)
#3 D:\wamp64\www\van2130\install.php(172): DatabaseSettings()
#4 {main}
  thrown in D:\wamp64\www\van2130\Sources\Db\APIs\MySQL.php on line 934

@sbulen
Contributor

sbulen commented Feb 3, 2025

Same errors occur in unix.

Signed-off-by: Jon Stovell <jonstovell@gmail.com>
@Sesquipedalian
Member Author

Sesquipedalian commented Feb 18, 2025

> • The old 2.1 logic did the check & the conversion column by column, in case things choked & the user had to restart. Keep reruns/restarts in mind.

I don't think that's an issue. Since database transactions are always atomic, changing the whole table at once whenever possible is actually better and safer. When we do one column at a time using the method where we change the column to a binary encoding and then back, an interruption at an inopportune moment can leave the column sitting there in a binary encoding. If anything, we should probably build more safety checks around the column-based method in the new upgrader code.
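
As a rough illustration of the kind of safety check mentioned here, the Python sketch below flags text columns that were left in a binary type after an interrupted conversion. It is only a sketch under stated assumptions: the `Column` class, `find_stranded_columns`, and the example table `smf_example_cache` are hypothetical, and the caller is assumed to supply both the current column types (for example from `information_schema.COLUMNS`) and the list of columns expected to hold text.

```python
from dataclasses import dataclass

# Binary types a text column may be switched to temporarily during the bounce.
BINARY_TYPES = {"binary", "varbinary", "tinyblob", "blob", "mediumblob", "longblob"}


@dataclass
class Column:
    table: str
    name: str
    data_type: str  # e.g. the DATA_TYPE value from information_schema.COLUMNS


def find_stranded_columns(columns, expected_text):
    """Return columns that should hold text but currently have a binary type.

    expected_text is a set of (table, column) pairs known to be text columns.
    Assumption: the caller knows this from the schema definition.
    """
    return [
        col for col in columns
        if (col.table, col.name) in expected_text
        and col.data_type.lower() in BINARY_TYPES
    ]


current = [
    Column("smf_messages", "body", "mediumblob"),    # interrupted mid-bounce
    Column("smf_messages", "subject", "varchar"),    # already converted back
    Column("smf_example_cache", "payload", "blob"),  # hypothetical, meant to be binary
]
expected = {("smf_messages", "body"), ("smf_messages", "subject")}

for col in find_stranded_columns(current, expected):
    print(f"{col.table}.{col.name} is still {col.data_type}")
```

Running a check like this at the start of the conversion step would let a restarted upgrade notice a half-converted column before touching it again.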

> • The old 2.1 logic, I believe, avoided using mb_convert_encoding because it is prone to double-encoding (do a quick search on mb_convert_encoding() & double encoding...). Doing this 100% within MySQL helped, because MySQL is simply better at conversion than php.

I don't think that's accurate. The old logic was inherited from SMF 2.0, when the mbstring extension was not required by SMF and therefore the upgrader couldn't rely on it.

Regarding double encoding, all that I found on the matter was a single StackOverflow discussion in which the original poster was trying to do something different than we are (and didn't seem to understand what they were doing very well). Perhaps your searches turned up something mine didn't, though, so if there's something more, please share the link. I am often wrong, after all, and always glad to discover a better understanding. 🙂

Signed-off-by: Jon Stovell <jonstovell@gmail.com>
@Sesquipedalian
Member Author

Sesquipedalian commented Feb 18, 2025

> For DB functions, I don't think we should ever use 'utf8', only 'utf8mb3' and 'utf8mb4'.

I was just looking over the code again and noticed the spot that you were probably referring to; 14ea6ae should fix it.

Again, ready for testing whenever you are. 🙂

@sbulen
Contributor

sbulen commented Feb 26, 2025

Back in town & will get back to testing. Juggling a lot, so it may take a while.

A couple notes on the above... First, this logic does go column by column when bouncing off of binary... See lines 3758+.

Also, mb_convert_encoding pretty much only does what you tell it to, and sometimes that's a bad thing... The whole point of bouncing off of binary is to make use of MySQL's charset detection, which is far better than PHP's. The real issue is a lot of legacy latin1, win1251, etc., with various other encodings stuffed into it, from back in the day when php & mysql were kinda awful at it...

Run this:

<?php
$text = "this is Middle English: He wes Leovenaðes sone -- liðe him be Drihten. \n";
echo "before... " . $text;
echo "after.... " . mb_convert_encoding($text, 'ISO-8859-1', 'UTF-8');

Food for thought.
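
The same effect can be seen at the byte level. The sketch below uses Python rather than PHP, purely for illustration. It shows what a UTF-8 to ISO-8859-1 conversion does to the bytes of "ð", and how double encoding arises when bytes that are already UTF-8 are treated as ISO-8859-1 and converted again.

```python
text = "liðe him be Drihten"

utf8_bytes = text.encode("utf-8")
latin1_bytes = text.encode("latin-1")  # what a UTF-8 -> ISO-8859-1 conversion yields

# "ð" is two bytes in UTF-8 but a single byte (0xF0) in ISO-8859-1.
print(utf8_bytes[:5])    # b'li\xc3\xb0e'
print(latin1_bytes[:4])  # b'li\xf0e'

# ISO-8859-1 bytes are not valid UTF-8, so a UTF-8 reader shows a replacement mark.
print(latin1_bytes.decode("utf-8", errors="replace"))

# Double encoding: UTF-8 bytes mislabeled as ISO-8859-1 and converted again.
double_encoded = utf8_bytes.decode("latin-1").encode("utf-8")
print(double_encoded.decode("utf-8"))

# The damage can be undone only if it is known to have happened exactly once.
repaired = double_encoded.decode("utf-8").encode("latin-1").decode("utf-8")
assert repaired == text
```

The conversion function does exactly what it is told in each case. The outcome depends entirely on whether the declared source encoding matches the bytes actually stored.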

@Sesquipedalian
Member Author

Sesquipedalian commented Feb 27, 2025

I've made a change in order to let MySQL handle character set detection internally whenever and wherever possible. We now only do it manually for character sets that MySQL does not have native support for. I believe this will address your concerns, @sbulen.
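
A minimal sketch of the split described above, written in Python for illustration: tables whose current character set MySQL can convert natively get a single `ALTER TABLE ... CONVERT TO CHARACTER SET` statement, and anything else is flagged for manual handling. The `MYSQL_NATIVE` set is an illustrative subset, `plan_conversion` is a hypothetical helper, and none of this is taken from the PR's actual code.

```python
# Illustrative subset of character sets MySQL can convert natively.
# Assumption: a real implementation would ask the server (SHOW CHARACTER SET).
MYSQL_NATIVE = {"latin1", "latin2", "cp1250", "cp1251", "utf8mb3", "utf8mb4"}


def plan_conversion(table: str, current_charset: str) -> list[str]:
    """Return the statements (or a manual-work marker) needed to reach utf8mb4."""
    charset = current_charset.lower()
    if charset == "utf8mb4":
        return []  # nothing to do
    if charset in MYSQL_NATIVE:
        return [f"ALTER TABLE `{table}` CONVERT TO CHARACTER SET utf8mb4"]
    # Not natively supported: leave a marker for the manual conversion path.
    return [f"-- {table}: {charset} needs manual conversion"]


for table, charset in [
    ("smf_messages", "latin1"),
    ("smf_topics", "utf8mb4"),
    ("smf_members", "some-legacy-charset"),  # hypothetical unsupported charset
]:
    print(table, plan_conversion(table, charset))
```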

Signed-off-by: Jon Stovell <jonstovell@gmail.com>
Signed-off-by: Jon Stovell <jonstovell@gmail.com>
@Sesquipedalian
Member Author

@sbulen, have you had a chance to test the latest changes yet?

@sbulen
Contributor

sbulen commented Mar 3, 2025

Latest round. This time I'll include a list of all the tests I hope to power thru, including my technologically sophisticated result conveyance system:

3.0 Install 🍻 OK
3.0 => 3.0 FAIL 😥
2.1 => 3.0 🍻 OK
2.0 Latin1 => 3.0 FAIL 😥
2.0 Utf8 => 3.0 🍻 OK
2.0 Latin2 (Slovenian) Not Yet Run 😐
1.1 => 3.0 Not Yet Run 😐
1.0 => 3.0 Not Yet Run 😐
Yabbse => 3.0 Not Yet Run 😐
Inst/Upgr DB Compare Not Yet Run 😐
Confirm encoding 🍻 OK
Confirm InnoDB 🍻 OK
Confirm Dynamic 🍻 OK
pg 2.1 => 3.0 Not Yet Run 😐

The 3.0 => 3.0 result - it just stares at me & does not proceed:
[Screenshot]

The 2.0 latin1 => 3.0 result (Partial log... Note it's not unusual for display fields to fail to unserialize here - not sure why actually, it's been a while... The real problem is the incorrect string value):

  • +++ Adding a new column "spoofdetector_name" to members table done.
    . +++ Adding new "spoofdetector_censor" setting done.
    Successful.
  • +++ Adding a new column "smf_version" to log_packages table done.
    Successful.
  • +++ Updating primary key for log_search_results table done.
    Successful.
  • +++ Update mail_type Successful.
    Converting data from serialize() to json_encode().
    Fixing some settings...
  • Failed to unserialize the 'displayFields' setting. Skipping. done.
    Successful.
    Converting table rul_admin_info_files to utf8mb4...Incorrect string value: '\x967.2.'...' for column 'data' at row 3
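
The byte `\x96` in the error message is not a valid UTF-8 sequence on its own, although it is a printable character (an en dash) in Windows-1252. The Python sketch below illustrates only that byte-level fact; it does not establish why MySQL rejected the value in this particular conversion.

```python
raw = b"\x967.2."  # the leading bytes quoted in the MySQL error message

# Interpreted as UTF-8, 0x96 is a stray continuation byte and cannot be decoded.
try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError as err:
    valid_utf8 = False
    print("not valid UTF-8:", err.reason)

# Interpreted as Windows-1252, the same byte is U+2013 (en dash).
as_cp1252 = raw.decode("cp1252")
print(as_cp1252)
print(as_cp1252.encode("utf-8"))
```

An error of this shape typically means the stored bytes were treated as if they were already UTF-8 instead of being converted from their real encoding.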

@sbulen
Contributor

sbulen commented Mar 3, 2025

Note the 2.0 file is just the standard admin info files, e.g.:

[Screenshot]

It's not liking the BLOBs...

@sbulen
Contributor

sbulen commented Mar 3, 2025

OK, I snuck in a few more tests (yabb, 1.0 & 1.1), this is all I can do for today:

3.0 Install OK 🍻
3.0 => 3.0 FAIL 😥
2.1 => 3.0 OK 🍻
2.0 Latin1 => 3.0 FAIL 😥
2.0 Utf8 => 3.0 OK 🍻
2.0 Latin2 (Slovenian) Not Yet Run 😐
1.1 => 3.0 FAIL 😥
1.0 => 3.0 FAIL 😥
Yabbse => 3.0 FAIL 😥
Inst/Upgr DB Compare Not Yet Run 😐
Confirm encoding OK 🍻
Confirm InnoDB OK 🍻
Confirm Dynamic OK 🍻
pg 2.1 => 3.0 Not Yet Run 😐

1.1 issue:

  • +++ Updating primary key for log_search_results table done.
    . Successful.
  • +++ Update mail_type Successful.
    Converting data from serialize() to json_encode().
    Fixing some settings...
  • Failed to unserialize the 'displayFields' setting. Skipping. done.
    Successful.
    Converting table smf_attachments to utf8mb4... done.
    .Converting table smf_ban_groups to utf8mb4... done.
    Converting table smf_ban_items to utf8mb4... done.
    .Converting table smf_board_permissions to utf8mb4... done.
    Converting table smf_boards to utf8mb4... done.
    .Converting table smf_calendar to utf8mb4... done.
    .Converting table smf_categories to utf8mb4... done.
    .Converting table smf_log_actions to utf8mb4... done.
    .Converting table smf_log_banned to utf8mb4... done.
    Converting table smf_log_errors to utf8mb4...Invalid default value for 'session'

1.0 issue, very similar to 1.1:

  • +++ Update mail_type Successful.
    Converting data from serialize() to json_encode().
    Fixing some settings...
  • Failed to unserialize the 'cal_today_birthday' setting. Skipping.
  • Failed to unserialize the 'cal_today_event' setting. Skipping.
  • Failed to unserialize the 'cal_today_holiday' setting. Skipping.
  • Failed to unserialize the 'displayFields' setting. Skipping. done.
    Successful.
    Converting table smf_attachments to utf8mb4... done.
    .Converting table smf_board_permissions to utf8mb4... done.
    .Converting table smf_boards to utf8mb4... done.
    .Converting table smf_calendar to utf8mb4... done.
    .Converting table smf_categories to utf8mb4... done.
    Converting table smf_log_actions to utf8mb4... done.
    .Converting table smf_log_banned to utf8mb4... done.
    Converting table smf_log_errors to utf8mb4...Invalid default value for 'session'

Yabbse issue - didn't get very far at all... (Yabbse is so different, I wonder if it's time to punt on this support...):

  • Updating Settings.php... FAILURE.

@sbulen
Contributor

sbulen commented Mar 3, 2025

Uh... The 1.0 & 1.1 default for session is... uh... odd:

In DB:
[Screenshot]

In Installer:

#
# Table structure for table `log_errors`
#

CREATE TABLE {$db_prefix}log_errors (
  ID_ERROR mediumint(8) unsigned NOT NULL auto_increment,
  logTime int(10) unsigned NOT NULL default '0',
  ID_MEMBER mediumint(8) unsigned NOT NULL default '0',
  ip char(16) NOT NULL default '                ',
  url text NOT NULL,
  message text NOT NULL,
  session char(32) NOT NULL default '                                ',
  PRIMARY KEY (ID_ERROR),
  KEY logTime (logTime),
  KEY ID_MEMBER (ID_MEMBER),
  KEY ip (ip(16))
) ENGINE=MyISAM;
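
The `ip` and `session` columns above both default to a run of spaces. As a rough illustration of how such defaults could be found and normalized before a character set conversion, here is a Python sketch. The helper name `whitespace_default_fixes` and the choice of an empty string as the replacement default are assumptions made for the example, not a description of what the upgrader does.

```python
def whitespace_default_fixes(table, columns):
    """Return ALTER statements for columns whose default consists only of spaces.

    columns maps column name -> default value (None means no default).
    """
    statements = []
    for name, default in columns.items():
        if default is not None and default != "" and default.strip() == "":
            statements.append(
                f"ALTER TABLE `{table}` ALTER COLUMN `{name}` SET DEFAULT ''"
            )
    return statements


# Defaults copied from the CREATE TABLE statement above; url has no default.
log_errors = {"ip": " " * 16, "session": " " * 32, "url": None}

for sql in whitespace_default_fixes("smf_log_errors", log_errors):
    print(sql)
```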

@sbulen
Contributor

sbulen commented Mar 4, 2025

All tests run in php 8.4.2 & mysql 8.4.0.

@Oldiesmann
Contributor

We can look into the issue with YaBB SE, but definitely not something we need to spend a lot of time on. It's been 21 years since YaBB SE 1.5.5 was released and a little over 20 since SMF 1.0 was released (YaBB SE 1.5.5 was released in January of 2004, SMF 1.0 final was released in late December of 2004). If someone is still running YaBB SE at this point, it's not likely they're ever going to upgrade.

@Sesquipedalian
Member Author

> OK, I snuck in a few more tests (yabb, 1.0 & 1.1), this is all I can do for today:
>
> 3.0 Install OK 🍻
> 3.0 => 3.0 FAIL 😥
> 2.1 => 3.0 OK 🍻
> ...

For this PR we only care about 2.1 → 3.0 and 3.0 → 3.0, and specifically about the conversion to utf8mb4 for MySQL. There's no need to run tests on anything else at this point.

Regarding the 3.0 → 3.0, what went wrong?

@sbulen
Contributor

sbulen commented Mar 19, 2025

> > OK, I snuck in a few more tests (yabb, 1.0 & 1.1), this is all I can do for today:
> >
> > 3.0 Install OK 🍻
> > 3.0 => 3.0 FAIL 😥
> > 2.1 => 3.0 OK 🍻
> > ...
>
> For this PR we only care about 2.1 → 3.0 and 3.0 → 3.0, and specifically about the conversion to utf8mb4 for MySQL. There's no need to run tests on anything else at this point.

I would hope we plan on doing the same thing? These upgrades used to work, & I think this is a clue the changes & new approach aren't working. I suspect returning to the single command would work, rather than spreading it out over several steps. Haven't tested that - I've been pretty busy lately with multiple RL challenges.

> Regarding the 3.0 → 3.0, what went wrong?

See the very first screenshot above for the browser result.

CLI just returns & does nothing.

Signed-off-by: Jon Stovell <jonstovell@gmail.com>
Signed-off-by: Jon Stovell <jonstovell@gmail.com>
@Sesquipedalian
Member Author

Sesquipedalian commented Mar 20, 2025

> > Regarding the 3.0 → 3.0, what went wrong?
>
> See the very first screenshot above for the browser result.
>
> CLI just returns & does nothing.

Well, I can't reproduce that. 3.0 → 3.0 completes successfully for me, including the ConvertToUtf8() step, which is the only step we care about in this PR. I don't know what is causing 3.0 → 3.0 to fail for you right now, but since (a) it is happening at the first step and (b) you have run into the same problem previously, I don't think it is related to the changes in this PR.

> Uh... The 1.0 & 1.1 default for session is... uh... odd:

Although 1.0 and 1.1 issues are out of scope for this PR, I've added a fix for that in 9f3409d.

> 2.0 Latin1 => 3.0 FAIL 😥

Does it fail during ConvertToUtf8(), or somewhere else?

> Yabbse issue - didn't get very far at all... (Yabbse is so different, I wonder if it's time to punt on this support...):
>
>   • Updating Settings.php... FAILURE.

That'll be a problem to deal with in #8093

@sbulen
Contributor

sbulen commented Mar 20, 2025

The 2.0 Latin 1 failure was highlighted up here - at first glance, it appears to be having issues with BLOBs?
#8425 (comment)

And yes, that's in the utf8mb4 conversion step. I have not attempted a retest recently yet.

Note my 3.0 => 3.0 failure is due to $upcontext['language'] being blank on this line:

if (!file_exists($lang_dir . '/' . $upcontext['language'] . '/General.php')) {

My steps:

  • I literally started from an empty directory & an empty DB
  • I installed 3.0, & did a quick test by making a post
  • I upgraded 3.0 => 3.0
  • PHP 8.4.2, MySQL 8.4.0, wamp

I haven't looked too closely, not a lot of time today... My initial suspicion is there is a confusion between 'lang' and 'language', both are used at different points around there??? Or maybe Config language isn't set anywhere?

Note also that I don't think the throw error logic is working there, as the thrown error is not visible anywhere. Just a 'try again' link via browser, or a clean exit via CLI.
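
To make the suspected failure mode concrete, here is a minimal Python sketch of a language lookup that skips blank values and raises a visible error when nothing usable is found. It mirrors only the shape of the `file_exists()` check quoted above. `resolve_language`, the `en_US` directory name, and the fallback order are assumptions made for illustration, not the upgrader's actual logic.

```python
import tempfile
from pathlib import Path


def resolve_language(lang_dir: Path, candidates: list[str]) -> str:
    """Return the first non-empty candidate whose General.php exists."""
    for lang in candidates:
        if lang and (lang_dir / lang / "General.php").is_file():
            return lang
    # Fail loudly so the problem is visible instead of a silent exit.
    raise FileNotFoundError(
        f"No usable language found in {lang_dir}; tried {candidates!r}"
    )


with tempfile.TemporaryDirectory() as tmp:
    lang_dir = Path(tmp)
    (lang_dir / "en_US").mkdir()
    (lang_dir / "en_US" / "General.php").write_text("<?php\n")

    # A blank value (as reported for $upcontext['language']) is skipped.
    print(resolve_language(lang_dir, ["", "en_US"]))
```

Trying several candidates in order, and reporting which ones were tried, would cover both the blank-value case and the missing error message described above.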

@Sesquipedalian
Member Author

Sesquipedalian commented Mar 24, 2025

> The 2.0 Latin 1 failure was highlighted up here - at first glance, it appears to be having issues with BLOBs? #8425 (comment)
>
> And yes, that's in the utf8mb4 conversion step. I have not attempted a retest recently yet.

Thanks, I guess I overlooked that. 🙂

Unfortunately, I cannot reproduce that either. However, my test data for the 2.0 → 3.0 upgrade is just content generated by Populate.php, so perhaps that generated content isn't creating the right conditions to trigger the problem.

Could we arrange a way for me to get copies of the databases you are using, @sbulen? I would like to run tests using the same data as you are using so that I can try to figure out what the cause is.

In the meantime, though, I think I am going to go ahead and merge this for now. Even if there are still kinks to work out with the upgrader, waiting for the rest of the pending changes in this PR is holding up everything else in the development pipeline. Once I can get ahold of a copy of your test data and figure out the cause of the issues you are seeing, we will be able to fix the upgrade problems in the dedicated PR for upgrader changes.

@Sesquipedalian Sesquipedalian merged commit 51a67bc into SimpleMachines:release-3.0 Mar 24, 2025
6 checks passed
@Sesquipedalian Sesquipedalian deleted the 3.0/utf8mb4 branch March 24, 2025 01:38
@sbulen
Contributor

sbulen commented Mar 24, 2025

OK. I'll attempt to reproduce these two issues with current GH. If I can reproduce, I'll write them up.

Note that for both issues, I started with a simple fresh install. No other content.

@Sesquipedalian
Member Author

Sesquipedalian commented Mar 25, 2025

> Note that for both issues, I started with a simple fresh install. No other content.

Hm. So, just to make sure that I understand correctly, the database you were using for the 2.0 → 3.0 upgrade was a fresh, empty install of 2.0 that you immediately upgraded to 3.0?

If I am indeed understanding correctly, then the fact that I cannot reproduce the problem is weird.

@sbulen
Contributor

sbulen commented Mar 25, 2025

> Hm. So, just to make sure that I understand correctly, the database you were using for the 2.0 → 3.0 upgrade was a fresh, empty install of 2.0 that you immediately upgraded to 3.0?

Yep. latin1 db.
