Topic: Unicode in Folder names
hellomike
Unicode in Folder names
« Thread started on: Mar 30th, 2015, 3:47pm »
Hi,
Since some characters can't be used in file or folder names (colons, backslashes and forward slashes, dots, etc.), an application substituted them with Unicode equivalents. Now I need to tree-search such folders (Windows 7), but once a folder whose name has such Unicode characters has been read, my program fails to open the folder to search deeper.
With this code
Code:
rootpath$="D:\X30Share"
@%=2
N%=FNscandir(rootpath$)
PRINT '"There were ";N%;" files in root path"
END
DEF FNscandir(path$)
LOCAL dir%,sh%,res%,m%,n%,name$
DIM dir% LOCAL 317
SYS "FindFirstFile",path$+"\*",dir% TO sh%
IF sh%<>-1 THEN
  REPEAT
    name$=$$(dir%+44)
    IF name$<>"." AND name$<>".." THEN
      IF !dir% AND &10 THEN
        m%=FNscandir(path$+"\"+name$):REM Recursive
        PRINT "Folder """;name$;"""" TAB(62) " contains ";
        IF m%=0 THEN PRINT "NO"; ELSE PRINT m%;
        PRINT " files"
      ELSE
        n%+=1
      ENDIF
    ENDIF
    SYS "FindNextFile",sh%,dir% TO res%
  UNTIL res% = 0
  SYS "FindClose", sh%
ENDIF
=n%:REM Return number of files in the folder
the folder names read correctly. That is, it seems as if the Windows API automatically converts the (2-byte?) Unicode characters to their 1-byte ANSI equivalents. $$(dir%+44) seems to contain only ANSI characters now.
This is great for printing, but when I use the string as a folder name again, for a recursive call to search deeper, the ANSI-named folder of course does not exist.
In a test (root) folder I made some objects. DOS dir gives:
--------------------------------------
D:\X30Share>dir /b
A couple of accented characters (ًٍإدْ) but all ANSI
A name with a Unicode forward/slash
just a file.idb
just another file.idb
Quite a normal folder name
This name has a Unicode: colon
This name has two Unicode dots, here.and here.
This name is very simple

D:\X30Share>
--------------------------------------
So, "D:\X30Share" contains 2 files and 6 subfolders, some with ordinary names and some with a Unicode character in them.
The program's output is:
Code:
Folder "A couple of accented characters (ًٍإدْ) but all ANSI" contains 15 files
Folder "A name with a Unicode forward/slash" contains NO files
Folder "Quite a normal folder name" contains 20 files
Folder "This name has a Unicode: colon" contains NO files
Folder "This name has two Unicode dots, here.and here." contains NO files
Folder "This name is very simple" contains 36 files
There were 2 files in root path
>
And there is the problem: it doesn't see into some subfolders and reports NO files in them.
Of course I also tried the program that Richard advised in another thread and on the Wiki.
Code:
CP_UTF8 = &FDE9
VDU 23,22,640;512;8,16,16,128+8 : REM Select UTF-8 mode
*font Courier New
rootpath$="D:\X30Share"
@%=2
N%=FNscandir(rootpath$)
PRINT '"There were ";N%;" files in root path"
END
DEF FNscandir(path$)
LOCAL dir%,sh%,res%,m%,n%,name$,utf8%
DIM dir% LOCAL 317,utf8% LOCAL 260
SYS "FindFirstFileW",path$+"\*"+CHR$0+CHR$0,dir% TO sh%
IF sh%<>-1 THEN
  REPEAT
    SYS "WideCharToMultiByte",CP_UTF8,0,dir%+44,-1,utf8%,260,0,0
    name$=$$utf8%:REM or =$$(dir%+44) ??
    IF name$<>"." AND name$<>".." THEN
      IF !dir% AND &10 THEN
        m%=FNscandir(path$+"\"+name$):REM Recursive
        PRINT "Folder """;name$;"""" TAB(62) " contains ";
        IF m%=0 THEN PRINT "NO"; ELSE PRINT m%;
        PRINT " files"
      ELSE
        n%+=1
      ENDIF
    ENDIF
    SYS "FindNextFileW",sh%,dir% TO res%
  UNTIL res% = 0
  SYS "FindClose", sh%
ENDIF
=n%:REM Return number of files in the folder
But here it finds nothing at all. Code:
There were 0 files in root path
>
Any help/hints are welcome to tackle this issue.
Thanks in advance.
Mike
rtr2
Re: Unicode in Folder names
« Reply #1 on: Mar 30th, 2015, 8:41pm »
on Mar 30th, 2015, 3:47pm, hellomike wrote: Of course I also tried the program that Richard advised in another thread and on the Wiki.
The code you listed is broken, not least because it passes an ANSI string to a Wide API function:
Code:
SYS "FindFirstFileW",path$+"\*"+CHR$0+CHR$0,dir% TO sh%
path$ here is an ANSI string, so it cannot be passed to FindFirstFileW! I'm concerned that you say I "advised" the use of this code; I'm quite sure I didn't, as it could never have worked.
Richard.
hellomike
Re: Unicode in Folder names
« Reply #2 on: Mar 31st, 2015, 6:26pm »
Richard,
Of course I should have said 'program that Richard advised and I modified'. By no means did I mean that you gave bad advice, so please don't be concerned. Actually, I'm always amazed that you answer all kinds of Windows OS/API issues, since these are not BB4W related. Anyhow, you are of course correct that "FindFirstFileW" expects a wide string as input.
In the Wiki article you state
Quote: If a file or directory contains a non-ANSI character (e.g. from a different alphabet) it will be returned as a question mark
Note however that these Unicode characters (/, \, :, *, ?, etc.) are not returned as question marks by "FindFirstFile" but, as it seems, as the ANSI version of the Unicode character. Some Windows Easter egg? Again, suitable for printing but unsuitable for appending to the current path to form a new path to dig into.
In the article the following is used:
Code:
SYS "FindFirstFileW", "*"+CHR$0+CHR$0, dir% TO sh%
But isn't that 3 characters (ANSI asterisk and two null-bytes), so also not Wide-characters?
Sorry if I don't understand; that is the reason for starting this thread.
Thanks
Mike
rtr2
Re: Unicode in Folder names
« Reply #3 on: Mar 31st, 2015, 8:29pm »
on Mar 31st, 2015, 6:26pm, hellomike wrote:
Code:
SYS "FindFirstFileW", "*"+CHR$0+CHR$0, dir% TO sh%
But isn't that 3 characters (ANSI asterisk and two null-bytes), so also not Wide-characters?
The intention was that it would be four bytes in all (taking into account the automatically-added termination) but it would be safer to add the extra NUL explicitly:
Code: which corresponds to:
Code: or expressed as wide characters:
Code: or in C:
Code: Richard.
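The four-byte layout described above can be checked with a short sketch. This is an illustration written for this discussion, not code taken from the thread; it assumes, as stated above, that SYS appends one NUL to a string parameter, and the names pat$ and i% are illustrative only.

```bbcbasic
REM Sketch only: show the bytes of the wide-character pattern "*".
REM pat$ holds '*' as a 16-bit character (2A 00) followed by an
REM explicit 16-bit NUL terminator (00 00). SYS would append one
REM further NUL byte of its own when the string is passed.
pat$ = "*" + CHR$0 + CHR$0 + CHR$0
FOR i% = 1 TO LEN(pat$)
  PRINT STR$~(ASC(MID$(pat$, i%, 1))); " ";
NEXT
PRINT
```

The first two bytes printed are the character U+002A in little-endian order; the last two are the wide terminator.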
hellomike
Re: Unicode in Folder names
« Reply #4 on: Apr 1st, 2015, 11:41am »
Ah, OK. Yes that clarifies it.
Thanks.
Still, I don't think using the "FindFirstFileW" API will be the solution for my problem since the folder names are ANSI strings with the odd Unicode character in it.
I'm thinking of substituting each colon, dot, slash, etc. character (byte) with its 2-byte equivalent and using that new name$ variable for subsequent "FindNextFile" calls.
Back to testing....
Regards,
Mike
rtr2
Re: Unicode in Folder names
« Reply #5 on: Apr 1st, 2015, 3:43pm »
on Apr 1st, 2015, 11:41am, hellomike wrote: the folder names are ANSI strings with the odd Unicode character in it.
I don't understand what you mean by that. You cannot 'mix' ANSI and Unicode encodings in the same string, that has no meaning. Obviously a lot of characters have the same numeric code in all common encodings - for example the 'ASCII' characters occupy code points 32 to 126 in ANSI, UTF-8, UTF-16 and UTF-32 - but that doesn't alter the fact that any given string must have a specific encoding.
I worry that you are turning something which is intrinsically very simple into something difficult. As I understand it you cannot use ANSI file functions because some filenames contain non-ANSI characters. So that means you must use Unicode file functions throughout. Your only remaining choice is what encoding to use for representing those Unicode file and directory names within your program, and the only sensible alternatives are UTF-8 and UTF-16.
If you ever want to display the names in the 'mainwin' or send them to the printer, then UTF-8 is probably the better choice because that is supported natively in BB4W whereas UTF-16 is not. Also, if you want to search the names using INSTR, or do other string manipulations, you may find that using UTF-8 is easier, because the native BBC BASIC string functions assume a string of bytes. But if neither of those apply you might as well stick with UTF-16 since that is the only Unicode encoding Windows understands and it would minimise conversions.
Hopefully that helps clarify things.
Richard.
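If a program keeps its own strings in UTF-8 and hands UTF-16 to Windows, a conversion is needed at the boundary. The following is a minimal sketch of one way to do that with the Windows MultiByteToWideChar function; FNutf8towide is a hypothetical helper name, not code from the thread, and the one-byte over-allocation with rounding is an assumption made here to obtain an even buffer address.

```bbcbasic
REM Hypothetical helper: convert a UTF-8 (or plain ASCII) string to a
REM UTF-16 string held in a BBC BASIC string, two bytes per code unit,
REM without a terminator. &FDE9 is CP_UTF8 (65001).
DEF FNutf8towide(u$)
LOCAL raw%, buf%, n%, i%, w$
IF u$ = "" THEN = ""
REM UTF-16 never needs more code units than UTF-8 needs bytes
DIM raw% LOCAL 2*LEN(u$)+2
buf% = (raw% + 1) AND -2 : REM round up to an even address
SYS "MultiByteToWideChar", &FDE9, 0, u$, LEN(u$), buf%, LEN(u$) TO n%
IF n% = 0 THEN = "" : REM conversion failed
FOR i% = 0 TO 2*n% - 1
  w$ += CHR$(buf%?i%)
NEXT
= w$
```

The result could be passed wherever a wide path without its terminator is expected, with the terminating CHR$0 bytes added at the point of the SYS call.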
hellomike
Re: Unicode in Folder names
« Reply #6 on: Apr 2nd, 2015, 4:22pm »
As before, just getting confirmed how stuff works (and doesn't) helps a great deal!
Quote: I worry that you are turning something which is intrinsically very simple into something difficult.
Yep, my approach was needlessly difficult. The theory behind it all isn't complex but also not really "very simple" and then again, once understood, everything is simple.
It was confusing for me that the API "FindFirstFileW" wasn't really documented on MSDN, and it took me a while to realize that the call works with wide strings for input and output and that the wide-string terminator is now 0x0000.
So there is progress, and the following code now lists the folder names correctly, after I wrote a function to make a wide-string version of the initial root directory (D:\X30Share).
Code:
CP_UTF8 = &FDE9
VDU 23,22,640;512;8,16,16,128+8 : REM Select UTF-8 mode
*font Courier New
rootpath$="D:\X30Share"
N%=FNscandir(FNANSItoWide(rootpath$))
PRINT '"There were ";N%;" files in root path"
END
DEF FNscandir(path$)
LOCAL dir%,sh%,res%,n%,utf8%
DIM dir% LOCAL 317,utf8% LOCAL 260
SYS "FindFirstFileW",path$+"\"+CHR$0+"*"+CHR$0+CHR$0+CHR$0,dir% TO sh%
IF sh%<>-1 THEN
  REPEAT
    IF dir%!44<>&0000002E AND dir%!44<>&002E002E THEN
      IF !dir% AND &10 THEN
        SYS "WideCharToMultiByte",CP_UTF8,0,dir%+44,-1,utf8%,260,0,0
        PRINT $$utf8%
        REM Now I have to somehow append the double byte string at dir%+44
        REM to path$ in order to do the recurse call to this function
        REM path$+=$$(dir%+44) won't work...
      ELSE
        n%+=1
      ENDIF
    ENDIF
    SYS "FindNextFileW",sh%,dir% TO res%
  UNTIL res%=0
  SYS "FindClose",sh%
ENDIF
=n%:REM Return number of files in the folder
REM --------------------------------------------------------------------
DEF FNANSItoWide(a$)
LOCAL wide$,i%
FOR i%=1 TO LENa$
  wide$+=MID$(a$,i%,1)+CHR$0
NEXT
=wide$
Also testing for "." and ".." had to change.
I will manage appending the double-byte string to path$, but I had a strange error. I gathered that the returned names at dir%+44 now occupy twice as many bytes, so I thought to enlarge the memory area for dir% to be on the safe side and changed only that line
Code:
DIM dir% LOCAL 511,utf8% LOCAL 260
After listing the folder names, the program errors out with
Not in a function
and highlights the "=n%" line.
I'm using BB4W V5.95a.
Regards,
Mike
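One possible way to do the append step mentioned in the REM comments is to copy the UTF-16 name out of the buffer two bytes at a time. The sketch below is an assumption about how this might be done, not code from the thread; FNwidestr is a hypothetical helper name, and it relies on BBC BASIC strings being able to hold CHR$0 bytes, as the pattern strings in the listings already do.

```bbcbasic
REM Hypothetical helper: return the NUL-terminated UTF-16 string at
REM address addr% as a BBC BASIC string (two bytes per character,
REM without the terminating 16-bit NUL).
DEF FNwidestr(addr%)
LOCAL w$, i%
REM Stop at the first 16-bit unit whose two bytes are both zero
WHILE addr%?i% <> 0 OR addr%?(i%+1) <> 0
  w$ += CHR$(addr%?i%) + CHR$(addr%?(i%+1))
  i% += 2
ENDWHILE
= w$
```

Inside FNscandir the recursive call could then take the form m%=FNscandir(path$+"\"+CHR$0+FNwidestr(dir%+44)), with m% added to the LOCAL list, since path$ is already a wide string without its terminator.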
rtr2
Re: Unicode in Folder names
« Reply #7 on: Apr 2nd, 2015, 4:44pm »
on Apr 2nd, 2015, 4:22pm, hellomike wrote: I thought to enlarge the memory area for dir% to be on the safe side and changed only that line
Code:
DIM dir% LOCAL 511,utf8% LOCAL 260
You were quite right in thinking that it was necessary to increase the amount of memory allocated to dir%, but rather than erring on the "safe side" in fact you didn't increase it enough! If you look at the definition of WIN32_FIND_DATA at MSDN you'll find that the wide-character version occupies 592 bytes so in your program you require as a minimum:
Code:
DIM dir% LOCAL 591,utf8% LOCAL 260
Strictly speaking the 260 should be increased as well, because the theoretical maximum path length when encoded as UTF-8 is longer than MAX_PATH bytes, but in practice you would be unlikely to exceed that.
Quote:
I would not advise that if you are working with UTF-16 strings. Windows (particularly 64-bit Windows) sometimes requires that such strings are WORD-aligned, i.e. at an even memory address, and BB4W v6.00a guarantees that when using DIM ... LOCAL. However v5.95a does not. Try this program on both v5.95a and v6.00a to see what I mean:
Code:
FOR N% = 1 TO 10
  PROC1(N%)
NEXT
END
DEF PROC1(S%)
DIM dir% LOCAL S%
PRINT dir%
ENDPROC
Richard.
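The 317 and 591 in the DIM statements follow from the layout of WIN32_FIND_DATA: 44 bytes of fixed fields (which is why the name starts at dir%+44), then cFileName of 260 characters and cAlternateFileName of 14 characters, at one byte per character in the ANSI version and two in the wide version. The sketch below works through that arithmetic and also shows one possible way to force an even address on v5.95a by over-allocating one byte and rounding up. That workaround is an assumption made here, not something stated in the thread, and the names fixed%, ansi%, wide%, raw% and PROCdemo are illustrative.

```bbcbasic
REM Sketch only: sizes of the ANSI and wide WIN32_FIND_DATA structures.
fixed% = 4 + 3*8 + 4*4 : REM attributes, three FILETIMEs, four DWORDs = 44
ansi% = fixed% + 260 + 14 : REM 318 bytes, hence DIM dir% LOCAL 317
wide% = fixed% + 2*260 + 2*14 : REM 592 bytes, hence DIM dir% LOCAL 591
PRINT ansi%, wide%
PROCdemo(wide%)
END

DEF PROCdemo(size%)
LOCAL raw%, dir%
REM DIM ... LOCAL n reserves n+1 bytes, so this leaves one spare byte
DIM raw% LOCAL size%
dir% = (raw% + 1) AND -2 : REM round up to an even (WORD-aligned) address
PRINT dir% AND 1 : REM 0 means the address is even
ENDPROC
```

Because the rounding moves the pointer by at most one byte, the full 592 bytes remain available at dir%.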
« Last Edit: Apr 2nd, 2015, 5:07pm by rtr2 »
hellomike
Re: Unicode in Folder names
« Reply #8 on: Apr 4th, 2015, 2:55pm »
Yes I see the difference between v5 and v6 using the code snippet.
I will continue development using BB4W v6.x.
Thanks for all the help and tips.
Mike