BBC BASIC for Windows
http://bb4w.conforums.com/index.cgi?board=database&action=display&num=1425500472

Matching UTF8 / ANSI accented text
Post by hellomike on Mar 4th, 2015, 7:21pm

Hi fellow programmers,

For a project I need to match strings from 2 different sources.

The first source is a simple ANSI text file where strings can have accented characters. The second source is a (SQLite3) database where strings come out as UTF8.

So for example a string from source 1 is

"COMÉDIE À LA FRANÇAISE, LA"

and from the database it is

"Comédie Ã_€ La Française, La" (_ actually is unprintable block character)

This should be a match but I find no way of making this happen.
The perfect solution would be if both strings can be converted to upper case without accented characters.
Then "COMEDIE A LA FRANCAISE, LA" would match with "COMEDIE A LA FRANCAISE, LA".

Converting the UTF8 text to "Comédie À La Française, La" would also be sufficient, because that can easily be made uppercase (accented characters included).

To make this conversion, I experimented with
- SYS "MultiByteToWideChar"
- SYS "CompareString"
- SYS "NormalizeString"

All with minimal, undesirable or no effect.

What code/technique/calls would I need to normalize both source inputs to comparable versions?

Thanks for all the help in advance.

Regards,

Mike

Re: Matching UTF8 / ANSI accented text
Post by rtr2 on Mar 4th, 2015, 8:36pm

on Mar 4th, 2015, 7:21pm, hellomike wrote:
What code/technique/calls would I need to normalize both source inputs to comparable versions?

As you have discovered, Windows does not provide an API to perform a direct conversion between UTF-8 and ANSI encodings. However it does provide a routine MultiByteToWideChar to convert either of those to UTF-16 encoding. So the most straightforward approach is probably to convert both your strings to UTF-16 and then compare the two outputs (a regular string comparison will do that job perfectly well).

The MultiByteToWideChar API is not particularly difficult to use, but as always you must abide by the requirements of the MSDN description in every detail. A useful hint to be aware of is that when an API function returns a string you must add an explicit NUL termination to the supplied parameter string (which must be long enough to receive the output, i.e. 2*N+1, including the NUL, where N is the number of UTF-16 characters expected).
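
A minimal sketch of the approach Richard describes, assuming the text file is in the PC's default ANSI code page (CP_ACP). FNwide is a hypothetical helper name, not part of BB4W or of the original post; it uses a pre-filled string as the output buffer and trims it to the length that MultiByteToWideChar reports. The comparison is exact, so strings that differ only in case will not match.

```bbcbasic
      REM Sketch only: FNwide is a hypothetical helper, not a BB4W library routine
      CP_ACP = 0 : CP_UTF8 = 65001

      REM "Comédie" as it would arrive from each source
      ansi$ = "Com" + CHR$(&E9) + "die"               : REM ANSI, Windows-1252 assumed
      utf8$ = "Com" + CHR$(&C3) + CHR$(&A9) + "die"   : REM UTF-8

      w1$ = FNwide(ansi$, CP_ACP)
      w2$ = FNwide(utf8$, CP_UTF8)
      IF w1$ = w2$ THEN
        PRINT "Match"
      ELSE
        PRINT "No match"
      ENDIF
      END

      REM Convert s$, encoded in code page cp%, to a UTF-16 string
      DEF FNwide(s$, cp%)
      LOCAL w$, n%
      REM Output buffer: room for LEN(s$)+1 UTF-16 code units, pre-filled with NULs
      w$ = STRING$(2 * LEN(s$) + 2, CHR$(0))
      SYS "MultiByteToWideChar", cp%, 0, s$, LEN(s$), !^w$, LEN(s$) + 1 TO n%
      REM n% is the number of UTF-16 code units written (0 on failure)
      = LEFT$(w$, 2 * n%)
```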

Richard.

Re: Matching UTF8 / ANSI accented text
Post by hellomike on Mar 5th, 2015, 7:45pm

on Mar 4th, 2015, 8:36pm, g4bau wrote:
So the most straightforward approach is probably to convert both your strings to UTF-16 and then compare the two outputs (a regular string comparison will do that job perfectly well).


Thanks for the answer Richard. At least I now know that I didn't miss something.
Non-matching strings might have to be displayed/printed, so at some point I must convert them to something readable. I'm thinking of either:
- using "MultiByteToWideChar" and then gluing every other byte of the 2-byte-per-character output into an ANSI string, or
- scanning the UTF8 string for the 'Ã' byte, removing it and replacing the next byte with the appropriate accented character (via a table)

The sources might hold up to around 50,000 strings to compare, and BBC BASIC for Windows is fast enough for such a small number (no need for machine code).

Regards,

Mike

Re: Matching UTF8 / ANSI accented text
Post by rtr2 on Mar 5th, 2015, 8:58pm

on Mar 5th, 2015, 7:45pm, hellomike wrote:
Non-matching strings might have to be displayed/printed, so at some point I must convert them to something readable.

I don't quite understand why. BB4W can display and print both ANSI and UTF-8 strings so no conversion ought to be necessary. It's true that ordinarily one wouldn't be switching dynamically between the two encodings, but I can't see why it wouldn't work correctly if you do (bit 7 of the VDU flags byte @vdu%?74 if memory serves me correctly).

So what I would envisage is converting both the ANSI and UTF-8 strings to UTF-16, comparing those UTF-16 strings, and if they are different displaying/printing the source ANSI (@vdu%?74 AND= &7F) and UTF-8 (@vdu%?74 OR= &80) strings directly using PRINT.

Richard.

Re: Matching UTF8 / ANSI accented text
Post by rtr2 on Mar 6th, 2015, 10:35am

on Mar 4th, 2015, 8:36pm, g4bau wrote:
A useful hint to be aware of is that when an API function returns a string you must add an explicit NUL termination to the supplied parameter string

I have written a short Wiki article with a couple of examples:

http://bb4w.wikispaces.com/Calling+DLL+functions+that+return+strings

Richard.

Re: Matching UTF8 / ANSI accented text
Post by hellomike on Mar 6th, 2015, 6:38pm

Hi,

Currently I implemented the following:
Code:
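        DIM wchar% 257
        ansi$=""
        REM Convert the NUL-terminated UTF-8 string to UTF-16; n% counts the terminator too
        SYS "MultiByteToWideChar",CP_UTF8,0,UTF8%,-1,wchar%,128 TO n%
        REM Keep only the low byte of each UTF-16 character, skipping the terminator
        FOR i%=wchar% TO wchar%+n%*2-3 STEP 2
          ansi$+=CHR$(?i%)
        NEXT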
 

Where UTF8% is a pointer to a null terminated UTF8 string.

I then compare the 2 ansi strings using:
Code:
      REM NORM_IGNORECASE = 0x00000001;
      REM NORM_IGNORENONSPACE = 0x00000002;
      REM NORM_IGNORESYMBOLS = 0x00000004;
      REM LINGUISTIC_IGNORECASE = 0x00000010;
      REM LINGUISTIC_IGNOREDIACRITIC = 0x00000020;
      REM NORM_IGNOREKANATYPE = 0x00010000;
      REM NORM_IGNOREWIDTH = 0x00020000;
      REM NORM_LINGUISTIC_CASING = 0x08000000;
      REM SORT_STRINGSORT = 0x00001000;
      REM SORT_DIGITSASNUMBERS = 0x00000008;

      SYS "CompareString",0,&1021,str1$,LENstr1$,str2$,LENstr1$ TO result%
 

Passing LENstr1$ for both lengths is intentional, since the string from the text file is always the same length as or shorter than the string from the SQLite DB.

This works as I intended; however, I have to admit that I don't know why I had to use &1021 in "CompareString" and not simply &1 or &10.
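
As plain arithmetic, &1021 = &1000 + &20 + &1, which in terms of the constants listed above is SORT_STRINGSORT, LINGUISTIC_IGNOREDIACRITIC and NORM_IGNORECASE. A small sketch that decodes a flag value against that list; the DATA table simply repeats the constants from the post, and the variable names are illustrative.

```bbcbasic
      REM Sketch only: list which CompareString flags are set in a given value
      flags% = &1021
      FOR i% = 1 TO 10
        READ name$, value%
        IF flags% AND value% THEN PRINT name$; " = &"; STR$~value%
      NEXT
      END

      DATA NORM_IGNORECASE, &1
      DATA NORM_IGNORENONSPACE, &2
      DATA NORM_IGNORESYMBOLS, &4
      DATA SORT_DIGITSASNUMBERS, &8
      DATA LINGUISTIC_IGNORECASE, &10
      DATA LINGUISTIC_IGNOREDIACRITIC, &20
      DATA SORT_STRINGSORT, &1000
      DATA NORM_IGNOREKANATYPE, &10000
      DATA NORM_IGNOREWIDTH, &20000
      DATA NORM_LINGUISTIC_CASING, &08000000
```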

Thanks for mentioning @vdu%?74. I will check it out to see if I can benefit from it.
The new Wiki article is a great help in understanding this better. Indeed, I don't always understand why for a parameter to an API call, sometimes '!^str$' is used and other times just 'str$'. I thought the interpreter would make sure the pointer is passed anyhow.

Regards,

Mike

Re: Matching UTF8 / ANSI accented text
Post by rtr2 on Mar 6th, 2015, 9:30pm

on Mar 6th, 2015, 6:38pm, hellomike wrote:
I have to admit that I don't know why I had to use &1021 in "CompareString" and not simply &1 or &10.

Your method of converting UTF-16 to ANSI is non-standard and I'm not even sure it is guaranteed to work. Why did you use that method rather than the 'official' WideCharToMultiByte API? Indeed, why did you choose not to take my advice and convert both source strings to UTF-16 and compare those (e.g. using the CompareStringW API)?
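
A rough sketch of the route Richard is pointing to: both sources converted to UTF-16, compared with CompareStringW, and the database string converted to ANSI with WideCharToMultiByte rather than by picking out bytes. The buffer size, the sample strings, the variable names and the choice of NORM_IGNORECASE alone are illustrative assumptions, not taken from the thread.

```bbcbasic
      REM Sketch only: buffer sizes and sample strings are illustrative assumptions
      CP_ACP = 0 : CP_UTF8 = 65001
      NORM_IGNORECASE = 1
      CSTR_EQUAL = 2
      lcid% = &400 : REM the user's default locale
      maxchars% = 256

      DIM wide1% 2*maxchars%+1, wide2% 2*maxchars%+1, ansi% 2*maxchars%+1

      utf8$ = "Com" + CHR$(&C3) + CHR$(&A9) + "die"   : REM "Comédie" in UTF-8
      text$ = "COM" + CHR$(&C9) + "DIE"               : REM "COMÉDIE" in Windows-1252

      REM Convert both sources to NUL-terminated UTF-16 (a length of -1 means NUL-terminated input)
      SYS "MultiByteToWideChar", CP_UTF8, 0, utf8$, -1, wide1%, maxchars% TO n1%
      SYS "MultiByteToWideChar", CP_ACP, 0, text$, -1, wide2%, maxchars% TO n2%
      IF n1% = 0 OR n2% = 0 THEN ERROR 100, "Conversion to UTF-16 failed"

      REM Compare the two UTF-16 strings, ignoring case
      SYS "CompareStringW", lcid%, NORM_IGNORECASE, wide1%, -1, wide2%, -1 TO result%
      IF result% = CSTR_EQUAL THEN PRINT "Match" ELSE PRINT "No match"

      REM Convert the UTF-16 form of the database string to ANSI for display
      SYS "WideCharToMultiByte", CP_ACP, 0, wide1%, -1, ansi%, 2*maxchars%+2, 0, 0 TO n%
      IF n% = 0 THEN ERROR 100, "Conversion to ANSI failed"
      PRINT $$ansi%
```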

Another question raised by your code is what encoding your 'ANSI' string is actually using. The Windows constant CP_ACP does not necessarily mean the ANSI code page; instead it refers to the (default) code page for which the PC is configured. If it's set up for use in the UK or US it may well be the ANSI code page, but in almost any other country it probably won't be, and that could cause your code to fail.
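
One way to see which code page CP_ACP stands for on a given PC is the GetACP function. If the text file is known to be Windows-1252 (an assumption about Mike's data, not something stated in the thread), the code page number 1252 can be passed to MultiByteToWideChar explicitly instead of CP_ACP. A minimal sketch, with illustrative variable names:

```bbcbasic
      REM Sketch only: assumes the text file is known to be Windows-1252
      CP_1252 = 1252
      SYS "GetACP" TO acp%
      PRINT "Default code page (CP_ACP) on this PC: "; acp%
      IF acp% <> CP_1252 THEN PRINT "CP_ACP is not Windows-1252 here; pass 1252 explicitly"

      REM Decode the byte &C9 as Windows-1252 regardless of the PC's configuration
      DIM wide% 3
      SYS "MultiByteToWideChar", CP_1252, 0, CHR$(&C9), 1, wide%, 1 TO n%
      IF n% = 1 THEN PRINT "UTF-16 code unit: &"; STR$~(!wide% AND &FFFF)
```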

Quote:
I don't always understand why for a parameter to an API call, sometimes '!^str$' is used and other times just 'str$'.

If (and only if) the last character of string str$ is a NUL there is not really any difference between them; they both pass the same address to the API function. However if str$ does not end with a NUL they have quite different effects (passing str$ causes the string to be copied to a temporary place and a NUL terminator added).
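
The difference can be illustrated with a small sketch using lstrlen, which counts bytes up to the first NUL. The variable names are illustrative, and the choice of lstrlen is only for demonstration.

```bbcbasic
      REM Sketch only: two safe ways to hand a string to an API function
      s$ = "Hello"
      SYS "lstrlen", s$ TO a%        : REM BB4W passes a temporary NUL-terminated copy of s$

      t$ = "Hello" + CHR$(0)
      SYS "lstrlen", !^t$ TO b%      : REM passes the address of t$ itself, which ends with a NUL

      PRINT a%, b%

      REM SYS "lstrlen", !^s$ would be unsafe: s$ has no terminator,
      REM so the API function could read past the end of the string
```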

Richard.


Re: Matching UTF8 / ANSI accented text
Post by hellomike on Mar 10th, 2015, 8:25pm

Richard,

I'm still optimizing my code with the advice received. All the information is really useful, don't worry.

Quote:
(passing str$ causes the string to be copied to a temporary place and a NUL terminator added)

Clear now.

Regards,

Mike