[Solved] UTF8 encoding of an Edit control's content

GMH · #1

Hello,

I retrieve the content of an Edit control in my AutoIt program, which is encoded in UTF-8, in order to save it to a file opened as UTF8 without BOM and pass that file to an external editor.

Although the file passed on is indeed encoded in UTF8, the French accented characters do not come through.

Here is the sequence of functions I use:

- Retrieving the Edit content: $contenu = _GUICtrlEdit_GetText($monEdit)
- Opening a UTF8 file: FileOpen($nom_fichier, 256)
- Saving the Edit content to the UTF8 file: FileWrite($nom_fichier, $contenu)
- Sending the file to the external editor: Run('C:\Program Files (x86)\dossier\editeur_externe.exe "' & $nom_fichier & '"')

I suppose the Edit content needs some processing. And that is where I am stuck.

Thank you for any help and advice you can give me.

GMH · #2

A few hours of head-scratching later, I realize that the problem comes from the external editor. Other users have run into this problem which, judging by the answers given on forums, has no solution other than switching software.
Thank you for the attention you gave my message, and my apologies for the false trail.

#3

Either use an editor that correctly detects UTF8 without BOM, or try UTF8 with BOM with the current external editor (which one is it?), or transcode the string you read into the codepage that the current external editor expects (with the risks that entails).
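
The second option (UTF8 with BOM) could look like the sketch below. It is only an illustration, not code from the thread: the helper name _SaveAsUTF8, the test path and the sample text are made up. It relies on the FileOpen mode constants from FileConstants.au3 and, unlike the sequence in the first post, it opens the file for writing and writes through the returned handle rather than through the file name.

```autoit
#include <FileConstants.au3>

; Hypothetical helper (not from the thread): saves $sText to $sPath as UTF8.
; $bWithBOM = True writes a BOM, which helps editors that do not auto-detect UTF8.
Func _SaveAsUTF8($sPath, $sText, $bWithBOM = True)
    Local $iMode = $FO_OVERWRITE + ($bWithBOM ? $FO_UTF8 : $FO_UTF8_NOBOM)
    Local $hFile = FileOpen($sPath, $iMode)
    If $hFile = -1 Then Return SetError(1, 0, False)
    Local $iOk = FileWrite($hFile, $sText) ; write through the handle, not the file name
    FileClose($hFile)
    Return $iOk = 1
EndFunc   ;==>_SaveAsUTF8

; Example call with a placeholder path; "déjà" is built with ChrW so the
; encoding of the script file itself does not matter.
_SaveAsUTF8(@TempDir & "\utf8_test.txt", "d" & ChrW(0xE9) & "j" & ChrW(0xE0))
```

Passing False as the third argument gives the no-BOM variant, for comparison in the external editor.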

GMH · #4

The external editor is the music notation software Frescobaldi. In the end, I managed to get it to accept French accented letters by inserting a few lines into the code. A rough day, but... a successful one!
Thank you jchd for your message. Which function can force display in UTF-8?

#5

Which function can force display in UTF-8?

The question doesn't really make sense as asked.
UTF8 is just one encoding of a Unicode string; you can pass it to a program, which will display it correctly only if it expects a string in that form.

It's a bit like an image viewer: if it only knows the JPG and PNG formats, it won't be able to display a TIFF image correctly if you rename the file to .JPG. In fact each graphics format imposes a structure that must be consistent, and in all likelihood an inconsistency will be detected when you attempt the operation above.

A character string, by contrast, has no precise structure a priori, so this consistency check generally does not take place.
You cannot "force" a program to know what UTF8 encoding is if it was not programmed for it.

Instead, you need to determine what the program can handle and send it the data in that form.
Having had a very quick look at this software's page, it almost certainly reads and interprets UTF8 correctly.

To convert a native AutoIt string to UTF8:

Local $sNative = "J'ai trouvé ça à 8€"
Local $sUTF8 = BinaryToString(StringToBinary($sNative, 4), 1)
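
What that one-liner actually produces may be easier to see step by step. The sketch below is an illustration only: it assumes a single-byte ANSI codepage such as Windows-1252, and it builds the test characters with ChrW so the result does not depend on how the script file is saved. The resulting "UTF8 string" is an AutoIt string that carries one UTF8 byte per character, which is why it looks garbled if displayed directly.

```autoit
; "é" (U+00E9) followed by "€" (U+20AC)
Local $sNative = ChrW(0xE9) & ChrW(0x20AC)

; Step 1: encode the native string as UTF8 bytes (flag 4 = UTF8)
Local $dUTF8 = StringToBinary($sNative, 4)
ConsoleWrite(String($dUTF8) & @LF)      ; 0xC3A9E282AC
ConsoleWrite(BinaryLen($dUTF8) & @LF)   ; 5 bytes for 2 characters

; Step 2: re-read those bytes as ANSI (flag 1), one byte per AutoIt character
Local $sUTF8 = BinaryToString($dUTF8, 1)
ConsoleWrite(StringLen($sUTF8) & @LF)   ; 5 on a single-byte ANSI codepage

; Reverse: back to bytes (ANSI), then decode those bytes as UTF8
Local $sBack = BinaryToString(StringToBinary($sUTF8, 1), 4)
ConsoleWrite(($sBack == $sNative) & @LF)
```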

GMH · #6

Thanks again jchd for all these details. I realize there are plenty of terms we use believing we understand what lies behind them. But once you dig a little...! I have taken good note of the function you mention... It will come in handy one day...

#7

I had put together the following write-up to be included in the help file of the latest version, but it has been left out for now, or at least only excerpts were included in the beta's help.
I have too little time to translate; use Google if needed.

AutoIt and string encodings

Foreword

The problem of internal representation of characters has been plaguing the computer industry since IT became widespread.
Initially every company used its own conventions and tables to represent text and symbols, making interoperability a nightmare. The growing demand for support of more symbols, control characters and non-Latin scripts made the situation even worse.
Character sets and their possible encodings resemble playing cards: tarot and poker don't use the same set of cards. Then, once a set is chosen, one must create a design (a representation or encoding) for each card so that every player recognizes it instantly.

Character sets

Today, all character sets fall into 2 families: Unicode and byte-codepages.
• Unicode
Also known as ISO/IEC 10646, this huge and complex character set has the formidable advantage of being universal: it aims to include every character or symbol ever used by humans, including cuneiform, hieroglyphs, emoji and much more. Unicode version 13.0 (03/2020) defines 143,859 characters among the 1,112,064 valid codepoints defined by the standard.
Unicode is more than just a character set and defines ways to handle directional text (left to right or right to left), combine glyphs in complex ways (graphemes), procedures to collate (compare) strings, compose/decompose characters and much more.
• Byte-codepages
A simple codepage uses a table of N entries (typically 256) to map N characters or symbols to a numeric value. Every character in a string uses an 8-bit byte in most codepages, making good use of available memory when it was a scarce resource. The limitation to N characters, however, was an obstacle to portability, and a large number of incompatible codepages were created. As a result one had to know or guess the encoding used in a received text file to process it correctly.
To overcome the limitation introduced by short codepages making representation of large alphabets impossible (e.g. Chinese), more extensive codepages were created using a variable-width convention, the MBCS encodings (Multi-Byte Character Set).

The question of the representation of strings in memory or files using a given character set arose when IT started to use non-simple codepages.

AutoIt string encodings

Native AutoIt strings use the UCS-2 character set and encoding. It is the subset of Unicode limited to the BMP (Basic Multilingual Plane), the first 64k Unicode codepoints. This encoding uses 16-bit encoding units (each character is represented by an unsigned short value) where codepoints in range U+D800..U+DFFF (surrogates in UTF16) are not special and simply reserved for private use.
Note that Windows has been handling Unicode for a very long time: Win 3.x, Win95, NT added a DLL to handle UCS-2, XP and up handled UTF16-LE.
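
A practical consequence of the 16-bit units can be checked in a few lines. This is a small sketch, not part of the original write-up; the emoji U+1F600 is an arbitrary example of a codepoint outside the BMP, written as its UTF16 surrogate pair.

```autoit
; U+1F600 lies outside the BMP; in UTF16 it is the surrogate pair D83D DE00.
Local $sEmoji = ChrW(0xD83D) & ChrW(0xDE00)

ConsoleWrite(StringLen($sEmoji) & @LF)                   ; 2: two 16-bit units, not one character
ConsoleWrite(Hex(AscW(StringLeft($sEmoji, 1)), 4) & @LF) ; D83D: AscW only sees the first unit
ConsoleWrite(StringLen(ChrW(0x20AC)) & @LF)              ; 1: the euro sign (U+20AC) is inside the BMP
```

String functions therefore count and cut 16-bit units, so a split in the middle of such a pair leaves two halves that mean nothing on their own.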

However, some applications need to process strings using other encodings.
• UTF8
UTF8 is an MBCS (variable-length) representation of a Unicode string using a series of 1 to 4 8-bit bytes per codepoint. It is the ubiquitous encoding used in web pages, XML, ... being both bandwidth-efficient and able to encode the full Unicode character set.
• Codepages
Depending on the Windows language settings and on the external files or streams you need to process or produce, you may need to use or convert strings to/from a non-Unicode codepage.
• OEM codepages
These are the traditional "DOS" codepages, one for each possible language setting in effect. The Windows console starts in the OEM codepage unless it is changed with the CHCP command.
• Windows codepages
Also named "ANSI" codepages, they are the codepages used by non-Unicode Windows applications and are defined by the language setting in effect.
Note that the Windows console can also accept/display UTF8 after issuing the "CHCP 65001" command.
• Special codepages
Among those are the MBCSs used in many Asian locales and EBCDIC used by most IBM mainframes.
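
To make the variable-width nature of UTF8 from the list above concrete, the following sketch counts the bytes produced for a few characters. The helper name _UTF8ByteCount is hypothetical; it only wraps the built-in StringToBinary with flag 4 (UTF8).

```autoit
; Hypothetical helper: number of bytes $sStr occupies once encoded as UTF8 (flag 4).
Func _UTF8ByteCount($sStr)
    Return BinaryLen(StringToBinary($sStr, 4))
EndFunc   ;==>_UTF8ByteCount

ConsoleWrite(_UTF8ByteCount("A") & @LF)          ; 1 (U+0041)
ConsoleWrite(_UTF8ByteCount(ChrW(0xE9)) & @LF)   ; 2 (U+00E9, e with acute accent)
ConsoleWrite(_UTF8ByteCount(ChrW(0x20AC)) & @LF) ; 3 (U+20AC, euro sign)

; The same three characters take 2 bytes each in UTF16 LE (flag 2)
ConsoleWrite(BinaryLen(StringToBinary("A" & ChrW(0xE9) & ChrW(0x20AC), 2)) & @LF) ; 6
```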

Converting between a codepage and native UCS-2 AutoIt strings

You can use these functions to perform the desired conversion. Codepage identifier 65001 means UTF8 but you can pass any identifier supported by Windows.
A list of codepages supported by Windows can be found here: https://docs.microsoft.com/en-us/window ... dentifiers

; To convert a native AutoIt string (UCS-2) to some codepage (by default UTF8):
Func _StringToCodepage($sStr, $iCodepage = Default)
    If $iCodepage = Default Then $iCodepage = 65001
    If StringLen($sStr) = 0 Then Return "" ; nothing to convert, and a zero-length struct cannot be created
    ; First call: no output buffer, returns the number of bytes required
    Local $aResult = DllCall("kernel32.dll", "int", "WideCharToMultiByte", "uint", $iCodepage, "dword", 0, "wstr", $sStr, "int", StringLen($sStr), _
            "ptr", 0, "int", 0, "ptr", 0, "ptr", 0)
    Local $tCP = DllStructCreate("char[" & $aResult[0] & "]")
    ; Second call: performs the conversion into the buffer
    $aResult = DllCall("kernel32.dll", "int", "WideCharToMultiByte", "uint", $iCodepage, "dword", 0, "wstr", $sStr, "int", StringLen($sStr), _
            "struct*", $tCP, "int", $aResult[0], "ptr", 0, "ptr", 0)
    Return DllStructGetData($tCP, 1)
EndFunc   ;==>_StringToCodepage

; To convert a string from a given codepage (by default UTF8) to a native AutoIt string (UCS-2):
Func _CodepageToString($sCP, $iCodepage = Default)
    If $iCodepage = Default Then $iCodepage = 65001
    If StringLen($sCP) = 0 Then Return "" ; nothing to convert, and a zero-length struct cannot be created
    ; Copy the input into a byte buffer, one byte per character
    Local $tText = DllStructCreate("byte[" & StringLen($sCP) & "]")
    DllStructSetData($tText, 1, $sCP)
    ; First call: no output buffer, returns the number of 16-bit units required
    Local $aResult = DllCall("kernel32.dll", "int", "MultiByteToWideChar", "uint", $iCodepage, "dword", 0, "struct*", $tText, "int", StringLen($sCP), _
            "ptr", 0, "int", 0)
    Local $tWstr = DllStructCreate("wchar[" & $aResult[0] & "]")
    ; Second call: performs the conversion into the buffer
    $aResult = DllCall("kernel32.dll", "int", "MultiByteToWideChar", "uint", $iCodepage, "dword", 0, "struct*", $tText, "int", StringLen($sCP), _
            "struct*", $tWstr, "int", $aResult[0])
    Return DllStructGetData($tWstr, 1)
EndFunc   ;==>_CodepageToString

If you only need to convert native AutoIt strings to/from UTF8 (a very common use) you can use this:

$sMyString = "Hello Χαίρετε こんにちは Привет xin chào हैलो مرحبا 你好 שלום வணக்கம்"
$sUTF8String = BinaryToString(StringToBinary($sMyString, 4), 1)

; reverse conversion:
$sMyStringBack = BinaryToString(StringToBinary($sUTF8String, 1), 4)

Tips

It is a good idea to use the default UTF8 encoding for your source files: your strings will display verbatim in both your source code and in Windows controls.
It is also a good idea to set the SciTE4AutoIt3 console to UTF8 if ever you need to display characters or symbols not found in your default Windows codepage.

To send UTF8 strings to the SciTE console, you can use this function:

; Unicode-aware ConsoleWrite for UTF8 SciTE console
Func _ConsoleWrite($s)
    ConsoleWrite(BinaryToString(StringToBinary($s & @LF, 4), 1))
EndFunc   ;==>_ConsoleWrite

In addition, if your program may use the compiled CUI interface *or* the uncompiled SciTE console (e.g. for debugging), you can use this:

; Indirect Unicode-aware function for UTF8 SciTE or CUI consolewrite
Func __ConsoleWrite($s)
    (@Compiled ? _CUI_ConsoleWrite : _ConsoleWrite)($s)
EndFunc   ;==>__ConsoleWrite

; Function for UTF16 CUI consolewrite
Func _CUI_ConsoleWrite(ByRef $s)
    ; The console handle is obtained once, on the first call
    Local Static $hCon = _CUI_ConsoleInit()
    DllCall("kernel32.dll", "bool", "WriteConsoleW", "handle", $hCon, "wstr", $s & @LF, "dword", StringLen($s) + 1, "dword*", 0, "ptr", 0)
    Return
EndFunc   ;==>_CUI_ConsoleWrite

; Helper function for CUI consolewrite
Func _CUI_ConsoleInit()
    DllCall("kernel32.dll", "bool", "AllocConsole")
    ; -11 = STD_OUTPUT_HANDLE
    Return DllCall("kernel32.dll", "handle", "GetStdHandle", "int", -11)[0]
EndFunc   ;==>_CUI_ConsoleInit

For instance, run this code sample using the above functions; "Hello" in the different languages should display correctly and identically in the MsgBox and in the console (SciTE or CUI):

$sMyString = "Hello Χαίρετε こんにちは Привет xin chào हैलो مرحبا 你好 שלום வணக்கம்"
__ConsoleWrite($sMyString)
MsgBox(0, "", $sMyString)
