Topic: Unicode in Folder names
hellomike
Re: Unicode in Folder names
« Reply #4 on: Apr 1st, 2015, 11:41am »
Ah, OK. Yes that clarifies it.
Thanks.
Still, I don't think using the "FindFirstFileW" API will be the solution to my problem, since the folder names are ANSI strings with the odd Unicode character in them.
I'm thinking of substituting each colon, dot, slash, etc. character (byte) with its 2-byte equivalent and using that new name$ variable for subsequent "FindNextFile" calls.
Back to testing....
Regards,
Mike
rtr2
Re: Unicode in Folder names
« Reply #5 on: Apr 1st, 2015, 3:43pm »
on Apr 1st, 2015, 11:41am, hellomike wrote: the folder names are ANSI strings with the odd Unicode character in it.
I don't understand what you mean by that. You cannot 'mix' ANSI and Unicode encodings in the same string; that has no meaning. Obviously a lot of characters have the same numeric code in all common encodings (for example the 'ASCII' characters occupy code points 32 to 126 in ANSI, UTF-8, UTF-16 and UTF-32), but that doesn't alter the fact that any given string must have a specific encoding.
I worry that you are turning something which is intrinsically very simple into something difficult. As I understand it you cannot use ANSI file functions because some filenames contain non-ANSI characters. So that means you must use Unicode file functions throughout. Your only remaining choice is what encoding to use for representing those Unicode file and directory names within your program, and the only sensible alternatives are UTF-8 and UTF-16.
If you ever want to display the names in the 'mainwin' or send them to the printer, then UTF-8 is probably the better choice because that is supported natively in BB4W whereas UTF-16 is not. Also, if you want to search the names using INSTR, or do other string manipulations, you may find UTF-8 easier, because the native BBC BASIC string functions assume a string of bytes. But if neither of those applies you might as well stick with UTF-16, since that is the Unicode encoding the Windows 'W' functions use and it would minimise conversions.
Hopefully that helps clarify things.
Richard.
hellomike
Re: Unicode in Folder names
« Reply #6 on: Apr 2nd, 2015, 4:22pm »
As before, just getting confirmation of how things work (and don't) helps a great deal!
Quote: I worry that you are turning something which is intrinsically very simple into something difficult.
Yep, my approach was needlessly difficult. The theory behind it all isn't complex, but it isn't really "very simple" either; then again, once understood, everything is simple.
It was confusing for me that the "FindFirstFileW" API wasn't really documented on MSDN, and it took me a while to realize that the call works with wide strings for both input and output and that a wide string is terminated by 0x0000 rather than by a single zero byte.
So there is progress and the following code now lists the folder-names correctly after making a function to make a Wide string version for the initial rootdir (D:\X30Share).
Code:
CP_UTF8 = &FDE9
VDU 23,22,640;512;8,16,16,128+8 : REM Select UTF-8 mode
*font Courier New
rootpath$="D:\X30Share"
N%=FNscandir(FNANSItoWide(rootpath$))
PRINT '"There were ";N%;" files in root path"
END
DEF FNscandir(path$)
LOCAL dir%,sh%,res%,n%,utf8%
DIM dir% LOCAL 317,utf8% LOCAL 260
SYS "FindFirstFileW",path$+"\"+CHR$0+"*"+CHR$0+CHR$0+CHR$0,dir% TO sh%
IF sh%<>-1 THEN
  REPEAT
    IF dir%!44<>&0000002E AND dir%!44<>&002E002E THEN
      IF !dir% AND &10 THEN
        SYS "WideCharToMultiByte",CP_UTF8,0,dir%+44,-1,utf8%,260,0,0
        PRINT $$utf8%
        REM Now I have to somehow append the double byte string at dir%+44
        REM to path$ in order to do the recurse call to this function
        REM path$+=$$(dir%+44) won't work...
      ELSE
        n%+=1
      ENDIF
    ENDIF
    SYS "FindNextFileW",sh%,dir% TO res%
  UNTIL res%=0
  SYS "FindClose",sh%
ENDIF
=n%:REM Return number of files in the folder
REM --------------------------------------------------------------------
DEF FNANSItoWide(a$)
LOCAL wide$,i%
FOR i%=1 TO LENa$
  wide$+=MID$(a$,i%,1)+CHR$0
NEXT
=wide$

Also testing for "." and ".." had to change.
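For the step marked by the REM lines in FNscandir, one possible approach is to copy the 0x0000-terminated name at dir%+44 into an ordinary BB4W string, two bytes at a time, and append that to path$. This is only a sketch: FNwidestring is a hypothetical helper name (not a built-in), and it assumes the buffer really does contain a terminated UTF-16 string.
Code:
REM Hypothetical helper: return the UTF-16 string stored at ptr% as a
REM BB4W string of raw bytes, without its 0x0000 terminator
DEF FNwidestring(ptr%)
LOCAL i%,w$
WHILE ptr%?i%<>0 OR ptr%?(i%+1)<>0
  w$+=CHR$(ptr%?i%)+CHR$(ptr%?(i%+1))
  i%+=2
ENDWHILE
=w$

Inside the directory branch of FNscandir the recursive call might then look like n%+=FNscandir(path$+"\"+CHR$0+FNwidestring(dir%+44)), which would also add the files found in subfolders to the count. Because path$ is already a wide string with no terminator, only a wide backslash needs to go between the two parts.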
I will manage appending the double-byte string to path$, but I hit a strange error. I gathered that the names returned at dir%+44 now occupy twice as many bytes, so I thought I would enlarge the memory area for dir% to be on the safe side, and changed only that line:
Code:
DIM dir% LOCAL 511,utf8% LOCAL 260

After listing the folder names, the program errors out with
Not in a function
and highlights the "=n%" line.
I'm using BB4W V5.95a.
Regards,
Mike
rtr2
Re: Unicode in Folder names
« Reply #7 on: Apr 2nd, 2015, 4:44pm »
on Apr 2nd, 2015, 4:22pm, hellomike wrote: I thought to enlarge the memory area for dir% to be on the safe side and changed only that line
Code:
DIM dir% LOCAL 511,utf8% LOCAL 260

You were quite right in thinking that it was necessary to increase the amount of memory allocated to dir%, but rather than erring on the "safe side" in fact you didn't increase it enough! If you look at the definition of WIN32_FIND_DATA at MSDN you'll find that the wide-character version occupies 592 bytes so in your program you require as a minimum:
Code:
DIM dir% LOCAL 591,utf8% LOCAL 260

Strictly speaking the 260 should be increased as well, because the theoretical maximum path length when encoded as UTF-8 is longer than MAX_PATH bytes, but in practice you would be unlikely to exceed that.
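The 592-byte figure can be checked by adding up the fields of WIN32_FIND_DATA as documented on MSDN. The sketch below is just that arithmetic; the variable name fixed% is made up for the illustration.
Code:
REM Bytes before cFileName: attributes (4), three FILETIMEs (3*8),
REM file size high and low (4+4), two reserved DWORDs (4+4)
fixed%=4+3*8+4+4+4+4
PRINT "Offset of cFileName: ";fixed%
REM cFileName holds 260 characters, cAlternateFileName holds 14
PRINT "ANSI structure size: ";fixed%+260+14
PRINT "Wide structure size: ";fixed%+2*260+2*14

This gives 44, 318 and 592. Since DIM with a value of n reserves n+1 bytes, 317 was right for the ANSI structure and 591 is the minimum for the wide one. A value of 511 reserves only 512 bytes, so the API can write up to 80 bytes past the end of the buffer, which would be consistent with the odd "Not in a function" error.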
Quote: I would not advise that if you are working with UTF-16 strings. Windows (particularly 64-bit Windows) sometimes requires such strings to be WORD-aligned, i.e. at an even memory address. BB4W v6.00a guarantees that when using DIM ... LOCAL, but v5.95a does not. Try this program on both v5.95a and v6.00a to see what I mean:
Code:
FOR N% = 1 TO 10
  PROC1(N%)
NEXT
END
DEF PROC1(S%)
DIM dir% LOCAL S%
PRINT dir%
ENDPROC

Richard.
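If v5.95a has to be used, one possible workaround is to reserve a spare byte and round the address up to an even value by hand. This is a sketch under that assumption, not something either version requires; PROC2 is a made-up name mirroring PROC1 above.
Code:
DEF PROC2(S%)
LOCAL dir%
DIM dir% LOCAL S%+1 : REM one spare byte so that rounding up stays inside the block
dir%=(dir%+1) AND -2 : REM round up to the next even (WORD-aligned) address
PRINT dir%
ENDPROC

Calling PROC2 in place of PROC1 in the loop above should then print only even addresses on either version.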
« Last Edit: Apr 2nd, 2015, 5:07pm by rtr2 »
hellomike
Re: Unicode in Folder names
« Reply #8 on: Apr 4th, 2015, 2:55pm »
Yes, I see the difference between v5 and v6 when running the code snippet.
I will continue development using BB4W v6.x.
Thanks for all the help and tips.
Mike