Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
menu search
person
Welcome To Ask or Share your Answers For Others

Categories

In the following Java program running in Linux using OpenJDK 1.6.0_22 I simply list the contents of the directory taken in as parameter at the command line. The directory contains the files which have file names in UTF-8 (e.g. Hindi, Mandarin, German etc.).

import java.io.*;

class ListDir {

    public static void main(String[] args) throws Exception {
    //System.setProperty("file.encoding", "en_US.UTF-8");
        System.out.println(System.getProperty("file.encoding"));
    File f = new File(args[0]);
    for(String c : f.list()) {
        String absPath = args[0] + "" + c;
        File cf = new File(args[0] + "/" + c);
        System.out.println(cf.getAbsolutePath() + " --> " + cf.exists());
    }
    }
}

If I set the LC_ALL variable to en_US.UTF-8 the results are printed fine. But if I set the LC_ALL variable to POSIX and supply the file.encoding and sun.jnu.encoding properties as UTF-8 from command line I get the garbage output and cf.exists() returns false.

Can you please explain this behavior. As I read on so many websites file.encoding is said to be sufficient to read file names and use them for operations. Here it looks like that property has no effect at all.

Update 1: If I set file.encoding to something like GBK (Chinese) and LC_ALL variable to en_US.UTF-8 then cf.exists() returns true. only the '?' appears instead of file name. Surprise o_O.

Update 2: More investigation and it looks like its not a Java issue. It looks like libc on Linux used locale settings to translate file name encodings and those settings will cause file not found error/exception. "file.encoding" is for how Java interprets file names.

Update 3 Now it looks problem is how Java interprets file names. The following simple C code works on Linux regardless of file encoding and value of LC_ALL environment variable (I am happy that this proves for answer given here: https://unix.stackexchange.com/questions/39175/understanding-unix-file-name-encoding). But still I am not clear how Java interprets on LC_ALL variable. Now looking into OpenJDK code for that.

Sample C code:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <dirent.h>

int main(int argc, char *argv[])
{
    char *argdir = argv[1];
    DIR *dp = opendir(argdir);
    struct dirent *de;
    while(de = readdir(dp)) {
        char *abspath = (char *) malloc(strlen(argdir)  + 1 + strlen(de->d_name) + 1);
        strcpy(abspath, argdir);
        abspath[strlen(argdir)] = '/';
        strcpy(abspath + strlen(argdir) + 1, de->d_name);
        printf("%d %s ", de->d_type, abspath);
        FILE *fp = fopen(abspath, "r");
        if (fp) {
            printf("Success");
        }
        fclose(fp);
        putchar('
');
    }
}
See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
thumb_up_alt 0 like thumb_down_alt 0 dislike
344 views
Welcome To Ask or Share your Answers For Others

1 Answer

Note: So finally I think that I have nailed it down. I am not confirming that it is right. But with some code reading and tests this is what I found out and I don't have additional time to look into it. If anyone is interested they can check it out and tell if this answer is right or wrong - I would be glad :)

The reference I used was from this tarball available at OpenJDK's site: openjdk-6-src-b25-01_may_2012.tar.gz

  1. Java natively translates all string to platform's local encoding in this method: jdk/src/share/native/common/jni_util.c - JNU_GetStringPlatformChars() . System property sun.jnu.encoding is used to determine the platform's encoding.

  2. The value of sun.jnu.encoding is set at jdk/src/solaris/native/java/lang/java_props_md.c - GetJavaProperties() using setlocale() method of libc. Environment variable LC_ALL is used to set the value of sun.jnu.encoding. Value given at the command prompt using -Dsun.jnu.encoding option to Java is ignored.

  3. Call to File.exists() has been coded in file jdk/src/share/classes/java/io/File.java and it returns as

    return ((fs.getBooleanAttributes(this) & FileSystem.BA_EXISTS) != 0);

  4. getBooleanAttributes() is natively coded (and I am skipping steps in code browsing through many files) in jdk/src/share/native/java/io/UnixFileSystem_md.c in function : Java_java_io_UnixFileSystem_getBooleanAttributes0(). Here the macro WITH_FIELD_PLATFORM_STRING(env, file, ids.path, path) converts path string to platform's encoding.

  5. So conversion to wrong encoding will actually send a wrong C string (char array) to subsequent call to stat() method. And it will return with result that file cannot be found.

LESSON: LC_ALL is very important


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
thumb_up_alt 0 like thumb_down_alt 0 dislike
Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
...