Package lumis.util

Class CharsetUtil


  • public class CharsetUtil
    extends Object

    Utility class to guess the encoding of a given text file.

    Unicode files encoded in UTF-16 (little or big endian), and UTF-8 files with a Byte Order Mark (BOM), are detected reliably. For UTF-8 files with no BOM, the charset should also be detected if the buffer is large enough.

    A 4 KB byte buffer is usually sufficient to guess the encoding.
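    The algorithm this class uses for BOM-less UTF-8 is not shown on this page. As a rough illustration of why a larger buffer helps, the sketch below swaps in a different technique: strict decoding with the JDK's CharsetDecoder. The class name Utf8GuessSketch and the method looksLikeUtf8 are hypothetical and are not part of lumis.util.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of CharsetUtil: guesses whether a BOM-less
// buffer holds UTF-8 by decoding it strictly.
final class Utf8GuessSketch {
    private Utf8GuessSketch() {}

    static boolean looksLikeUtf8(byte[] buf) {
        // Pure 7-bit content is valid UTF-8 but carries no UTF-8-specific evidence.
        boolean hasHighByte = false;
        for (byte b : buf) {
            if ((b & 0x80) != 0) {
                hasHighByte = true;
                break;
            }
        }
        if (!hasHighByte) {
            return false;
        }
        try {
            // REPORT makes the decoder throw instead of silently substituting U+FFFD.
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(buf));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

    A short buffer may contain no multi-byte sequence at all, in which case nothing distinguishes it from plain ASCII. Note also that this sketch rejects a buffer that is cut off in the middle of a multi-byte sequence, so a real implementation would need to tolerate a truncated tail.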

    Usage:

     // guess the encoding
     Charset guessedCharset = CharsetToolkit.guessEncoding(file, 4096);
    
     // create a reader with the correct charset
     CharsetToolkit toolkit = new CharsetToolkit(file);
     BufferedReader reader = toolkit.getReader();
    
     // read the file content
     String line;
     while ((line = reader.readLine()) != null)
     {
         System.out.println(line);
     }
     
    Author:
    Guillaume Laforge
    Since:
    4.0.0
    Version:
    $Revision: 10458 $ $Date: 2009-06-02 15:49:09 -0300 (Tue, 02 Jun 2009) $
    • Constructor Detail

      • CharsetUtil

        public CharsetUtil(File file)
                    throws IOException
        Parameters:
        file - the file whose encoding is to be determined.
        Throws:
        IOException
    • Method Detail

      • setDefaultCharset

        public void setDefaultCharset(Charset defaultCharset)
        Defines the default Charset used in case the buffer represents an 8-bit Charset.
        Parameters:
        defaultCharset - the default Charset to be returned by guessEncoding() if an 8-bit Charset is encountered.
      • getCharset

        public Charset getCharset()
      • setEnforce8Bit

        public void setEnforce8Bit(boolean enforce)
        If the content is recognized as US-ASCII, forces the default encoding to be returned instead of US-ASCII. A file may currently contain no characters in the range 128-255 and still be, or later become, a file encoded with the default charset rather than US-ASCII.
        Parameters:
        enforce - true to return the default charset in place of US-ASCII; false to allow US-ASCII to be returned.
      • getEnforce8Bit

        public boolean getEnforce8Bit()
        Gets the enforce8Bit flag, which is set when a US-ASCII result should never be returned.
        Returns:
        true if the default charset is returned in place of US-ASCII.
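        The interplay between the default charset and the enforce8Bit flag can be sketched as follows. This is only an illustration of the behavior described above, under the assumption that any byte above 127 means 8-bit content. It deliberately ignores BOM and UTF-8 detection, and the class Enforce8BitSketch and its resolve method are hypothetical names, not part of CharsetUtil.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of the documented enforce8Bit behavior.
final class Enforce8BitSketch {
    private Enforce8BitSketch() {}

    static Charset resolve(byte[] buf, Charset defaultCharset, boolean enforce8Bit) {
        for (byte b : buf) {
            if ((b & 0x80) != 0) {
                // A byte in the range 128-255: treat the content as 8-bit text.
                return defaultCharset;
            }
        }
        // Only 7-bit bytes were seen: report US-ASCII unless the flag forbids it.
        return enforce8Bit ? defaultCharset : StandardCharsets.US_ASCII;
    }
}
```

        With the flag set, a file that happens to contain only 7-bit characters today is still reported with the default charset, so later edits that add accented characters do not change the detected encoding.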
      • getDefaultCharset

        public Charset getDefaultCharset()
        Retrieves the default Charset.
        Returns:
        the default Charset.
      • hasUTF8Bom

        public boolean hasUTF8Bom()
        Checks whether the buffer starts with a Byte Order Mark for UTF-8 (written by Microsoft's Notepad and other editors).
        Returns:
        true if the buffer has a BOM for UTF-8.
      • hasUTF16LEBom

        public boolean hasUTF16LEBom()
        Checks whether the buffer starts with a Byte Order Mark for UTF-16 Little Endian (ucs-2le, ucs-4le, and ucs-16le).
        Returns:
        true if the buffer has a BOM for UTF-16 Little Endian.
      • hasUTF16BEBom

        public boolean hasUTF16BEBom()
        Checks whether the buffer starts with a Byte Order Mark for UTF-16 Big Endian (utf-16 and ucs-2).
        Returns:
        true if the buffer has a BOM for UTF-16 Big Endian.
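        For reference, the three BOM checks above correspond to well-known byte signatures: EF BB BF for UTF-8, FF FE for UTF-16 Little Endian, and FE FF for UTF-16 Big Endian. The sketch below shows one way such checks could be written. The class BomSketch and its static methods are hypothetical and take the buffer as a parameter, unlike the instance methods documented here.

```java
// Hypothetical stand-alone BOM checks; CharsetUtil's own implementation is not shown here.
final class BomSketch {
    private BomSketch() {}

    /** True if the buffer starts with the UTF-8 BOM: EF BB BF. */
    static boolean hasUtf8Bom(byte[] buf) {
        return buf.length >= 3
                && (buf[0] & 0xFF) == 0xEF
                && (buf[1] & 0xFF) == 0xBB
                && (buf[2] & 0xFF) == 0xBF;
    }

    /** True if the buffer starts with the UTF-16 Little Endian BOM: FF FE. */
    static boolean hasUtf16LeBom(byte[] buf) {
        return buf.length >= 2
                && (buf[0] & 0xFF) == 0xFF
                && (buf[1] & 0xFF) == 0xFE;
    }

    /** True if the buffer starts with the UTF-16 Big Endian BOM: FE FF. */
    static boolean hasUtf16BeBom(byte[] buf) {
        return buf.length >= 2
                && (buf[0] & 0xFF) == 0xFE
                && (buf[1] & 0xFF) == 0xFF;
    }
}
```

        The masking with 0xFF is needed because Java bytes are signed, so a raw comparison such as buf[0] == 0xEF would never be true.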