gh-119182: Decode PyUnicode_FromFormat() format string from UTF-8#120248
gh-119182: Decode PyUnicode_FromFormat() format string from UTF-8#120248vstinner wants to merge 7 commits intopython:mainfrom
Conversation
PyUnicode_FromFormat() now decodes the format string from UTF-8 with the "replace" error handler, instead of decoding it from ASCII. Remove unused 'consumed' parameter of unicode_decode_utf8_writer().
|
I chose the "replace" error handler since it's hard to debug decoding errors (UnicodeDecodeError) at the C level in a function creating a string. For example, does the decoding error comes from the format string or an argument? If it's an argument, which one? Well, change my mind :-) I'm open to use the "strict" error handler for the format string and for @serhiy-storchaka @methane: Would you mind to review this change? |
|
|
|
|
But why? If you want to include a non-ASCII string, you can pass it as a separate argument with the PyUnicode_Format("%s", "\xe2\x82\xac") |
I prefer "strict" because "hard to notice" is also hard to debug. |
I would like to accept UTF-8 format string to make functions consistent: use UTF-8 basically everywhere. It's also to use the UTF-8 decoder (with |
|
@serhiy-storchaka @methane: Would you mind to review the updated PR?
I modified the change to use the strict error handler. I also modified the implementation to still raise UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 21: invalid start byte
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/vstinner/python/main/Lib/test/test_capi/test_unicode.py", line 391, in test_from_format
PyUnicode_FromFormat(b'invalid format string\xff: %s', b'abc')
~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/vstinner/python/main/Lib/test/test_capi/test_unicode.py", line 377, in PyUnicode_FromFormat
return _PyUnicode_FromFormat(format, *cargs)
ValueError: PyUnicode_FromFormatV() expects a valid UTF-8-encoded format string, got an invalid UTF-8 string |
Replace PyErr_Format() with PyErr_SetString()
serhiy-storchaka
left a comment
There was a problem hiding this comment.
I do not think that this change is necessary, but I do not strongly oppose it.
| if (unicode_decode_utf8_writer(&writer, f, len, | ||
| _Py_ERROR_STRICT, "strict") < 0) { | ||
| PyObject *exc = PyErr_GetRaisedException(); | ||
| PyErr_SetString(PyExc_ValueError, |
There was a problem hiding this comment.
Why raise ValueError explicitly? If you want a ValueError for compatibility, UnicodeDecode is a subclass of ValueError, so this is a backward compatible change. Other functions which take const char * do not raise ValueError explicitly.
There was a problem hiding this comment.
The error message helps debugging such issue: it points directly to the format string.
Well, my first motivation for this change was to reuse the more efficient ASCII and UTF-8 decoders and Benchmark: diff --git a/Modules/_testcapimodule.c b/Modules/_testcapimodule.c
index b139b46c826..4efef31ef4c 100644
--- a/Modules/_testcapimodule.c
+++ b/Modules/_testcapimodule.c
@@ -3305,6 +3305,22 @@ function_set_warning(PyObject *Py_UNUSED(module), PyObject *Py_UNUSED(args))
Py_RETURN_NONE;
}
+static PyObject *
+bench(PyObject *Py_UNUSED(module), PyObject *args)
+{
+ const char *format;
+ if (!PyArg_ParseTuple(args, "y", &format)) {
+ return NULL;
+ }
+
+
+ PyObject *str = PyUnicode_FromFormat(format, 123);
+ assert(str != NULL);
+ Py_DECREF(str);
+
+ Py_RETURN_NONE;
+}
+
static PyMethodDef TestMethods[] = {
{"set_errno", set_errno, METH_VARARGS},
{"test_config", test_config, METH_NOARGS},
@@ -3446,6 +3462,7 @@ static PyMethodDef TestMethods[] = {
{"check_pyimport_addmodule", check_pyimport_addmodule, METH_VARARGS},
{"test_weakref_capi", test_weakref_capi, METH_NOARGS},
{"function_set_warning", function_set_warning, METH_NOARGS},
+ {"bench", bench, METH_VARARGS},
{NULL, NULL} /* sentinel */
};
Script: import pyperf
import _testcapi
runner = pyperf.Runner()
runner.bench_func('bench 3', _testcapi.bench, b'x' * 3 + b'%i')
runner.bench_func('bench 30', _testcapi.bench, b'x' * 30 + b'%i')
runner.bench_func('bench 100', _testcapi.bench, b'x' * 100 + b'%i')Result: |
|
@erlend-aasland @corona10: Do you have an opinion on this change? |
|
Since switching to UTF-8 seems to be controversial and my main motivation was to optimize the code, I wrote PR gh-120796 which keeps ASCII but optimizes the code using similar code paths: strchr() + ucs1lib_find_max_char(). There is a similar speedup. I close this PR. |
|
I am not so strongly against this idea, I only asked about the reason. In any case, errors in the format string should not be ignored. |
|
Well, I'm not convinced myself anymore, so I prefer to abandon this PR. |
PyUnicode_FromFormat() now decodes the format string from UTF-8 with the "replace" error handler, instead of decoding it from ASCII.
Remove unused 'consumed' parameter of unicode_decode_utf8_writer().
📚 Documentation preview 📚: https://cpython-previews--120248.org.readthedocs.build/