BBC BASIC for Windows - Unicode in Folder names

hellomike

Unicode in Folder names
« Thread started on: Mar 30th, 2015, 3:47pm »

Hi,

Since some characters can't be used in file or folder names (colons, backslashes, forward slashes, dots, etc.), an application substituted them with Unicode equivalents.
Now I need to tree-search such folders (Windows 7), but once a folder with such Unicode characters has been read, my program fails to open the folder to search deeper.

With this code
Code:

      rootpath$="D:\X30Share"
      @%=2

      N%=FNscandir(rootpath$)
      PRINT '"There were ";N%;" files in root path"
      END

      DEF FNscandir(path$)
      LOCAL dir%,sh%,res%,m%,n%,name$
      DIM dir% LOCAL 317
      SYS "FindFirstFile",path$+"\*",dir% TO sh%
      IF sh%<>-1 THEN
        REPEAT
          name$=$$(dir%+44)
          IF name$<>"." AND name$<>".." THEN
            IF !dir% AND &10 THEN
              m%=FNscandir(path$+"\"+name$):REM Recursive
        
              PRINT "Folder """;name$;"""" TAB(62) " contains ";
              IF m%=0 THEN PRINT "NO"; ELSE PRINT m%;
              PRINT " files"
            ELSE
              n%+=1
            ENDIF
          ENDIF
          SYS "FindNextFile",sh%,dir% TO res%
        UNTIL res% = 0
        SYS "FindClose", sh%
      ENDIF
      =n%:REM Return number of files in the folder

the folder names read correctly.
That is, it seems as if the Windows API automatically converts the (2-byte?) Unicode characters to their 1-byte ANSI equivalents. $$(dir%+44) now seems to contain only ANSI characters.

This is great for printing, but when I use the string as a folder name again for a recursive call to search deeper, of course the folder with the ANSI name does not exist.

In a test (root) folder I made some objects. DOS dir gives:

--------------------------------------
D:\X30Share>dir /b
A couple of accented characters (ëíÅÏò) but all ANSI
A name with a Unicode forward/slash
just a file.idb
just another file.idb
Quite a normal folder name
This name has a Unicode: colon
This name has two Unicode dots, here.and here.
This name is very simple

D:\X30Share>
--------------------------------------

So, "D:\X30Share" contains 2 files and 6 subfolders, some with ordinary names and some with a Unicode character in it.

The program's output is:
Code:

Folder "A couple of accented characters (ëíÅÏò) but all ANSI"  contains 15 files
Folder "A name with a Unicode forward/slash"                   contains NO files
Folder "Quite a normal folder name"                            contains 20 files
Folder "This name has a Unicode: colon"                        contains NO files
Folder "This name has two Unicode dots, here.and here."        contains NO files
Folder "This name is very simple"                              contains 36 files

There were 2 files in root path
>

And there is the problem: it can't look inside some subfolders and reports NO files in them.

Of course I also tried the program that Richard advised in another thread and on the Wiki.

Code:

      CP_UTF8 = &FDE9
      VDU 23,22,640;512;8,16,16,128+8 : REM Select UTF-8 mode
      *font Courier New
      rootpath$="D:\X30Share"
      @%=2

      N%=FNscandir(rootpath$)
      PRINT '"There were ";N%;" files in root path"
      END

      DEF FNscandir(path$)
      LOCAL dir%,sh%,res%,m%,n%,name$,utf8%
      DIM dir% LOCAL 317,utf8% LOCAL 260
      SYS "FindFirstFileW",path$+"\*"+CHR$0+CHR$0,dir% TO sh%
      IF sh%<>-1 THEN
        REPEAT
          SYS "WideCharToMultiByte",CP_UTF8,0,dir%+44,-1,utf8%,260,0,0
          name$=$$utf8%:REM or =$$(dir%+44) ??
          IF name$<>"." AND name$<>".." THEN
            IF !dir% AND &10 THEN
              m%=FNscandir(path$+"\"+name$):REM Recursive
        
              PRINT "Folder """;name$;"""" TAB(62) " contains ";
              IF m%=0 THEN PRINT "NO"; ELSE PRINT m%;
              PRINT " files"
            ELSE
              n%+=1
            ENDIF
          ENDIF
          SYS "FindNextFileW",sh%,dir% TO res%
        UNTIL res% = 0
        SYS "FindClose", sh%
      ENDIF
      =n%:REM Return number of files in the folder

But here it finds nothing at all.
Code:

There were 0 files in root path
>

Any help/hints are welcome to tackle this issue.

Thanks in advance.

Mike

rtr2

Re: Unicode in Folder names
« Reply #1 on: Mar 30th, 2015, 8:41pm »

on Mar 30th, 2015, 3:47pm, hellomike wrote:

Of course I also tried the program that Richard advised in another thread and on the Wiki.

The code you listed is broken, not least because it passes an ANSI string to a Wide API function:

Code:

       SYS "FindFirstFileW",path$+"\*"+CHR$0+CHR$0,dir% TO sh%

path$ here is an ANSI string, so it cannot be passed to FindFirstFileW! I'm concerned that you say I "advised" the use of this code; I'm quite sure I didn't as it could never have worked.

Richard.

hellomike

Re: Unicode in Folder names
« Reply #2 on: Mar 31st, 2015, 6:26pm »

Richard,

Of course I should have said 'program that Richard advised and I modified'. By no means did I mean that you ill-advised me, so please don't be concerned.
Actually I'm always amazed that you answer all kinds of Windows OS/API questions, since these are not BB4W related.
Anyhow, you are of course correct that "FindFirstFileW" expects a wide string as input.

In the Wiki article you state
Quote:

If a file or directory contains a non-ANSI character (e.g. from a different alphabet) it will be returned as a question mark

Note however that these Unicode characters (/,\,:,*,?, etc.) are not returned as question marks by "FindFirstFile" but, as it seems, as the ANSI version of the Unicode character. Some Windows easter egg?
Again, suitable for printing but unsuitable for appending to the current path to form a new path to dig into.

In the article the following is used:
Code:

SYS "FindFirstFileW", "*"+CHR$0+CHR$0, dir% TO sh%

But isn't that 3 characters (ANSI asterisk and two null-bytes), so also not Wide-characters?

Sorry if I don't understand; that is the reason for starting this thread.

Thanks

Mike

rtr2

Re: Unicode in Folder names
« Reply #3 on: Mar 31st, 2015, 8:29pm »

on Mar 31st, 2015, 6:26pm, hellomike wrote:

Code:

SYS "FindFirstFileW", "*"+CHR$0+CHR$0, dir% TO sh%

But isn't that 3 characters (ANSI asterisk and two null-bytes), so also not Wide-characters?

The intention was that it would be four bytes in all (taking into account the automatically-added termination) but it would be safer to add the extra NUL explicitly:

Code:

"*"+CHR$0+CHR$0+CHR$0

which corresponds to:

Code:

0x2A 0x00
0x00 0x00

or expressed as wide characters:

Code:

0x002A
0x0000

or in C:

Code:

L"*"

Richard.

hellomike

Re: Unicode in Folder names
« Reply #4 on: Apr 1st, 2015, 11:41am »

Ah, OK. Yes that clarifies it.

Thanks.

Still, I don't think using the "FindFirstFileW" API will be the solution for my problem since the folder names are ANSI strings with the odd Unicode character in it.

I'm thinking of substituting the colon, dot, slash, etc character (byte) by the 2-byte equivalent and using that new name$ variable for subsequent "FindNextFile" calls.

Back to testing....

Regards,

Mike

rtr2

Re: Unicode in Folder names
« Reply #5 on: Apr 1st, 2015, 3:43pm »

on Apr 1st, 2015, 11:41am, hellomike wrote:

the folder names are ANSI strings with the odd Unicode character in it.

I don't understand what you mean by that. You cannot 'mix' ANSI and Unicode encodings in the same string, that has no meaning. Obviously a lot of characters have the same numeric code in all common encodings - for example the 'ASCII' characters occupy code points 32 to 126 in ANSI, UTF-8, UTF-16 and UTF-32 - but that doesn't alter the fact that any given string must have a specific encoding.

I worry that you are turning something which is intrinsically very simple into something difficult. As I understand it you cannot use ANSI file functions because some filenames contain non-ANSI characters. So that means you must use Unicode file functions throughout. Your only remaining choice is what encoding to use for representing those Unicode file and directory names within your program, and the only sensible alternatives are UTF-8 and UTF-16.

If you ever want to display the names in the 'mainwin' or send them to the printer, then UTF-8 is probably the better choice because that is supported natively in BB4W whereas UTF-16 is not. Also, if you want to search the names using INSTR, or do other string manipulations, you may find that using UTF-8 is easier, because the native BBC BASIC string functions assume a string of bytes. But if neither of those apply you might as well stick with UTF-16 since that is the only Unicode encoding Windows understands and it would minimise conversions.
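As a rough sketch of the UTF-8 option, assuming names are held as UTF-8 inside the program and converted to UTF-16 only when calling the API, a helper along these lines could wrap MultiByteToWideChar. FNwide is a hypothetical name chosen for illustration, and the !^w$ idiom for obtaining the address of a string's buffer is assumed to work in the BB4W version in use:

Code:

      CP_UTF8 = &FDE9

      REM Hypothetical helper: UTF-8 string -> UTF-16 string, including
      REM the terminating wide NUL (two zero bytes).
      DEF FNwide(u$)
      LOCAL w$,n%
      REM A UTF-8 string of n bytes never needs more than n wide characters
      w$=STRING$(2*LEN(u$)+2,CHR$0)
      SYS "MultiByteToWideChar",CP_UTF8,0,u$,-1,!^w$,LEN(u$)+1 TO n%
      REM n% is the number of wide characters written, terminator included
      =LEFT$(w$,2*n%)

With that in place, a call such as SYS "FindFirstFileW",FNwide(path$+"\*"),dir% TO sh% would pass a properly terminated wide string, with path$ itself remaining UTF-8 so that it can still be printed and searched with the ordinary string functions.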

Hopefully that helps clarify things.

Richard.

hellomike

Re: Unicode in Folder names
« Reply #6 on: Apr 2nd, 2015, 4:22pm »

As before, just getting confirmed how stuff works (and doesn't) helps a great deal!

Quote:

I worry that you are turning something which is intrinsically very simple into something difficult.

Yep, my approach was needlessly difficult. The theory behind it all isn't complex but also not really "very simple" and then again, once understood, everything is simple.

It was confusing for me that the API "FindFirstFileW" wasn't really documented on MSDN and it took me a while to realize that the call works with wide strings for input and output and that a wide string delimiter is now 0x0000.

So there is progress and the following code now lists the folder-names correctly after making a function to make a Wide string version for the initial rootdir (D:\X30Share).
Code:

      CP_UTF8 = &FDE9
      VDU 23,22,640;512;8,16,16,128+8 : REM Select UTF-8 mode
      *font Courier New
      rootpath$="D:\X30Share"

      N%=FNscandir(FNANSItoWide(rootpath$))
      PRINT '"There were ";N%;" files in root path"
      END

      DEF FNscandir(path$)
      LOCAL dir%,sh%,res%,n%,utf8%
      DIM dir% LOCAL 317,utf8% LOCAL 260
      SYS "FindFirstFileW",path$+"\"+CHR$0+"*"+CHR$0+CHR$0+CHR$0,dir% TO sh%
      IF sh%<>-1 THEN
        REPEAT
          IF dir%!44<>&0000002E AND dir%!44<>&002E002E THEN
            IF !dir% AND &10 THEN
              SYS "WideCharToMultiByte",CP_UTF8,0,dir%+44,-1,utf8%,260,0,0
              PRINT $$utf8%
              REM Now I have to somehow append the double byte string at dir%+44
              REM to path$ in order to do the recurse call to this function
              REM path$+=$$(dir%+44) won't work...
            ELSE
              n%+=1
            ENDIF
          ENDIF
          SYS "FindNextFileW",sh%,dir% TO res%
        UNTIL res%=0
        SYS "FindClose",sh%
      ENDIF
      =n%:REM Return number of files in the folder

      REM --------------------------------------------------------------------
      DEF FNANSItoWide(a$)
      LOCAL wide$,i%

      FOR i%=1 TO LENa$
        wide$+=MID$(a$,i%,1)+CHR$0
      NEXT
      =wide$

Also testing for "." and ".." had to change.
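The step left as a REM in the listing above, appending the UTF-16 name at dir%+44 to the already-wide path$, might be sketched as follows. FNwidename is a hypothetical helper, not something from the Wiki; it copies byte pairs into a BASIC string until it meets the 0x0000 terminator:

Code:

      REM Hypothetical helper: copy the UTF-16 string at address p% into a
      REM BASIC string, excluding the two-byte terminator.
      DEF FNwidename(p%)
      LOCAL w$,i%
      WHILE p%?i%<>0 OR p%?(i%+1)<>0
        w$+=CHR$(p%?i%)+CHR$(p%?(i%+1))
        i%+=2
      ENDWHILE
      =w$

The recursive call would then be something like m%=FNscandir(path$+"\"+CHR$0+FNwidename(dir%+44)), with m% added to the LOCAL list. This relies on path$ carrying no terminator of its own, which holds here because FNANSItoWide does not add one and FNscandir appends the wide "\*" and terminator itself.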

I will manage appending the double-byte string to path$, but I had a strange error.
I gathered that the returned names at dir%+44 now occupy twice as many bytes, so I thought to enlarge the memory area for dir% to be on the safe side and changed only that line
Code:

      DIM dir% LOCAL 511,utf8% LOCAL 260

After listing the folder names, the program errors out with

Not in a function

and emphasized the "=n%" line.

I'm using BB4W V5.95a.

Regards,

Mike

rtr2

Re: Unicode in Folder names
« Reply #7 on: Apr 2nd, 2015, 4:44pm »

on Apr 2nd, 2015, 4:22pm, hellomike wrote:

I thought to enlarge the memory area for dir% to be on the safe side and changed only that line
Code:

      DIM dir% LOCAL 511,utf8% LOCAL 260

You were quite right in thinking that it was necessary to increase the amount of memory allocated to dir%, but rather than erring on the "safe side" in fact you didn't increase it enough! If you look at the definition of WIN32_FIND_DATA at MSDN you'll find that the wide-character version occupies 592 bytes so in your program you require as a minimum:

Code:

      DIM dir% LOCAL 591,utf8% LOCAL 260

Strictly speaking the 260 should be increased as well, because the theoretical maximum path length when encoded as UTF-8 is longer than MAX_PATH bytes, but in practice you would be unlikely to exceed that.
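The 592 figure can be checked by adding up the fields of WIN32_FIND_DATA as defined at MSDN; the variable names below are only for illustration. The same sum with one-byte characters gives 318, which matches the original DIM dir% LOCAL 317 (DIM n reserves n+1 bytes), and the 44 bytes of fixed fields is why the name is found at dir%+44:

Code:

      REM Illustrative arithmetic only; these names are not from any library
      REM Attributes (4), three FILETIMEs (8 each), size high/low (4+4),
      REM two reserved DWORDs (4+4)
      fixed%=4+3*8+4+4+4+4
      MAX_PATH%=260
      REM cFileName is MAX_PATH characters, cAlternateFileName is 14
      PRINT "Fixed fields:   ";fixed%
      PRINT "ANSI structure: ";fixed%+(MAX_PATH%+14)
      PRINT "Wide structure: ";fixed%+2*(MAX_PATH%+14)
      REM Each UTF-16 code unit needs at most 3 bytes in UTF-8
      PRINT "Worst-case UTF-8 name plus NUL: ";3*(MAX_PATH%-1)+1

The last line is the basis for the remark about the 260: a 259-character name could in theory need 778 bytes once converted to UTF-8, so a more generous utf8% buffer would be the cautious choice.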

Quote:

I'm using BB4W V5.95a.

I would not advise that if you are working with UTF-16 strings. Windows (particularly 64-bit Windows) sometimes requires that such strings are WORD-aligned, i.e. at an even memory address, and BB4W v6.00a guarantees that when using DIM ... LOCAL. However v5.95a does not. Try this program on both v5.95a and v6.00a to see what I mean:

Code:

      FOR N% = 1 TO 10
        PROC1(N%)
      NEXT
      END

      DEF PROC1(S%)
      DIM dir% LOCAL S%
      PRINT dir%
      ENDPROC

Richard.

« Last Edit: Apr 2nd, 2015, 5:07pm by rtr2 »

hellomike

Re: Unicode in Folder names
« Reply #8 on: Apr 4th, 2015, 2:55pm »

Yes I see the difference between v5 and v6 using the code snippet.

I will continue development using BB4W v6.x.

Thanks for all the help and tips.

Mike
