Tcl_GetEncoding
Section: Tcl Library Procedures (3)
Updated: 8.1

NAME
Tcl_GetEncoding, Tcl_FreeEncoding, Tcl_ExternalToUtfDString, Tcl_ExternalToUtf, Tcl_UtfToExternalDString, Tcl_UtfToExternal, Tcl_WinTCharToUtf, Tcl_WinUtfToTChar, Tcl_GetEncodingName, Tcl_SetSystemEncoding, Tcl_GetEncodingNames, Tcl_CreateEncoding, Tcl_GetDefaultEncodingDir, Tcl_SetDefaultEncodingDir - procedures for creating and using encodings.

SYNOPSIS
#include <tcl.h>

Tcl_Encoding
Tcl_GetEncoding(interp, name)

void
Tcl_FreeEncoding(encoding)

char *
Tcl_ExternalToUtfDString(encoding, src, srcLen, dstPtr)

int
Tcl_ExternalToUtf(interp, encoding, src, srcLen, flags, statePtr, dst, dstLen, srcReadPtr, dstWrotePtr, dstCharsPtr)

char *
Tcl_UtfToExternalDString(encoding, src, srcLen, dstPtr)

int
Tcl_UtfToExternal(interp, encoding, src, srcLen, flags, statePtr, dst, dstLen, srcReadPtr, dstWrotePtr, dstCharsPtr)

char *
Tcl_WinTCharToUtf(tsrc, srcLen, dstPtr)

TCHAR *
Tcl_WinUtfToTChar(src, srcLen, dstPtr)

char *
Tcl_GetEncodingName(encoding)

int
Tcl_SetSystemEncoding(interp, name)

void
Tcl_GetEncodingNames(interp)

Tcl_Encoding
Tcl_CreateEncoding(typePtr)

char *
Tcl_GetDefaultEncodingDir(void)

void
Tcl_SetDefaultEncodingDir(path)

ARGUMENTS
INTRODUCTION
These routines convert between Tcl's internal character representation, UTF-8, and character representations used by various operating systems or file systems, such as Unicode, ASCII, or Shift-JIS. When operating on strings, such as obtaining the names of files or displaying characters using international fonts, the strings must be translated into one or possibly multiple formats that the various system calls can use. For instance, on a Japanese Unix workstation, a user might obtain a filename represented in the EUC-JP file encoding and then translate the characters to the jisx0208 font encoding in order to display the filename in a Tk widget. The purpose of the encoding package is to help bridge the translation gap. UTF-8 provides an intermediate staging ground for all the various encodings. In the example above, text would be translated into UTF-8 from whatever file encoding the operating system is using. Then it would be translated from UTF-8 into whatever font encoding the display routines require.

Some basic encodings are compiled into Tcl. Others can be defined by the user or dynamically loaded from encoding files in a platform-independent manner.

DESCRIPTION
Tcl_GetEncoding finds an encoding given its name. The name may refer to a builtin Tcl encoding, a user-defined encoding registered by calling Tcl_CreateEncoding, or a dynamically-loadable encoding file. The return value is a token that represents the encoding and can be used in subsequent calls to procedures such as Tcl_GetEncodingName, Tcl_FreeEncoding, and Tcl_UtfToExternal. If the name did not refer to any known or loadable encoding, NULL is returned and an error message is left in interp.

The encoding package maintains a database of all encodings currently in use. The first time name is seen, Tcl_GetEncoding returns an encoding with a reference count of 1.
If the same name is requested again, the reference count for that encoding is incremented without the overhead of allocating a new encoding and all its associated data structures.

When an encoding is no longer needed, Tcl_FreeEncoding should be called to release it. When an encoding is no longer in use anywhere (i.e., it has been freed as many times as it has been gotten), Tcl_FreeEncoding will release all storage the encoding was using and delete it from the database.

Tcl_ExternalToUtfDString converts a source buffer src from the specified encoding into UTF-8. The converted bytes are stored in dstPtr, which is then null-terminated. The caller should eventually call Tcl_DStringFree to free any information stored in dstPtr. When converting, if any of the characters in the source buffer cannot be represented in the target encoding, a default fallback character will be used. The return value is a pointer to the value stored in the DString.

Tcl_ExternalToUtf converts a source buffer src from the specified encoding into UTF-8. Up to srcLen bytes are converted from the source buffer and up to dstLen converted bytes are stored in dst. In all cases, *srcReadPtr is filled with the number of bytes that were successfully converted from src and *dstWrotePtr is filled with the corresponding number of bytes that were stored in dst. The return value is one of the following:
Tcl_UtfToExternalDString converts a source buffer src from UTF-8 into the specified encoding. The converted bytes are stored in dstPtr, which is then terminated with the appropriate encoding-specific null. The caller should eventually call Tcl_DStringFree to free any information stored in dstPtr. When converting, if any of the characters in the source buffer cannot be represented in the target encoding, a default fallback character will be used. The return value is a pointer to the value stored in the DString.

Tcl_UtfToExternal converts a source buffer src from UTF-8 into the specified encoding. Up to srcLen bytes are converted from the source buffer and up to dstLen converted bytes are stored in dst. In all cases, *srcReadPtr is filled with the number of bytes that were successfully converted from src and *dstWrotePtr is filled with the corresponding number of bytes that were stored in dst. The return values are the same as the return values for Tcl_ExternalToUtf.

Tcl_WinUtfToTChar and Tcl_WinTCharToUtf are Windows-only convenience functions for converting between UTF-8 and Windows strings. On Windows 95 (as with the Macintosh and Unix operating systems), all strings exchanged between Tcl and the operating system are "char" based. On Windows NT, some strings exchanged between Tcl and the operating system are "char" oriented while others are in Unicode. By convention, in Windows a TCHAR is a character in the ANSI code page on Windows 95 and a Unicode character on Windows NT.

If you planned to use the same "char" based interfaces on both Windows 95 and Windows NT, you could use Tcl_UtfToExternal and Tcl_ExternalToUtf (or their Tcl_DString equivalents) with an encoding of NULL (the current system encoding). On the other hand, if you planned to use the Unicode interface when running on Windows NT and the "char" interfaces when running on Windows 95, you would have to perform the following type of test over and over in your program (as represented in pseudo-code):
Tcl_GetEncodingName is roughly the inverse of Tcl_GetEncoding. Given an encoding, the return value is the name argument that was used to create the encoding. The string returned by Tcl_GetEncodingName is only guaranteed to persist until the encoding is deleted. The caller must not modify this string.

Tcl_SetSystemEncoding sets the default encoding that should be used whenever the user passes a NULL value for the encoding argument to any of the other encoding functions. If name is NULL, the system encoding is reset to the default system encoding, binary. If the name did not refer to any known or loadable encoding, TCL_ERROR is returned and an error message is left in interp. Otherwise, this procedure increments the reference count of the new system encoding, decrements the reference count of the old system encoding, and returns TCL_OK.

Tcl_GetEncodingNames sets the interp result to a list consisting of the names of all the encodings that are currently defined or can be dynamically loaded, searching the encoding path specified by Tcl_SetDefaultEncodingDir. This procedure does not ensure that the dynamically-loadable encoding files contain valid data, but merely that they exist.

Tcl_CreateEncoding defines a new encoding and registers the C procedures that are called back to convert between the encoding and UTF-8. Encodings created by Tcl_CreateEncoding are thereafter visible in the database used by Tcl_GetEncoding. Just as with the Tcl_GetEncoding procedure, the return value is a token that represents the encoding and can be used in subsequent calls to other encoding functions. Tcl_CreateEncoding returns an encoding with a reference count of 1. If an encoding with the specified name already exists, then its entry in the database is replaced with the new encoding; the token for the old encoding will remain valid and continue to behave as before, but users of the new token will now call the new encoding procedures.
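The reference-counting and replacement rules above can be made concrete with a toy model. This is only a sketch of the described semantics in plain C and is not Tcl's actual data structure: ToyEncoding, toyTable, toy_create, toy_get, and toy_free are all hypothetical names, and the fixed-size table stands in for whatever database Tcl really uses.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define TOY_SLOTS 8

/* Toy stand-in for an encoding token; NOT Tcl's real structure. */
typedef struct ToyEncoding {
    char name[32];
    int refCount;
    int version;        /* lets a caller tell an old token from its replacement */
} ToyEncoding;

static ToyEncoding *toyTable[TOY_SLOTS];    /* the "database": one entry per name */

/* Return the table slot holding name, or -1 if the name is unknown. */
static int toy_slot(const char *name)
{
    int i;
    for (i = 0; i < TOY_SLOTS; i++) {
        if (toyTable[i] != NULL && strcmp(toyTable[i]->name, name) == 0) {
            return i;
        }
    }
    return -1;
}

/* Models Tcl_CreateEncoding: new token with refCount 1; replaces any
 * same-named entry in the table without invalidating the old token. */
static ToyEncoding *toy_create(const char *name, int version)
{
    int i = toy_slot(name);
    ToyEncoding *e;

    if (i < 0) {
        for (i = 0; i < TOY_SLOTS && toyTable[i] != NULL; i++) {
            /* find the first empty slot */
        }
        if (i == TOY_SLOTS) {
            return NULL;
        }
    }
    e = calloc(1, sizeof *e);
    if (e == NULL) {
        return NULL;
    }
    strncpy(e->name, name, sizeof e->name - 1);
    e->refCount = 1;
    e->version = version;
    toyTable[i] = e;
    return e;
}

/* Models Tcl_GetEncoding for an already-known name: bump the count. */
static ToyEncoding *toy_get(const char *name)
{
    int i = toy_slot(name);
    if (i < 0) {
        return NULL;
    }
    toyTable[i]->refCount++;
    return toyTable[i];
}

/* Models Tcl_FreeEncoding: storage is released only when the count hits 0. */
static void toy_free(ToyEncoding *e)
{
    int i;
    assert(e != NULL && e->refCount > 0);
    if (--e->refCount > 0) {
        return;
    }
    for (i = 0; i < TOY_SLOTS; i++) {
        if (toyTable[i] == e) {
            toyTable[i] = NULL;     /* only unlink if still the current entry */
        }
    }
    free(e);
}
```

In this model, creating "demo" twice leaves holders of the first token untouched, while later lookups by name see the second one, which is the behaviour the paragraph above describes for Tcl_CreateEncoding.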
The typePtr argument to Tcl_CreateEncoding contains information about the name of the encoding and the procedures that will be called to convert between this encoding and UTF-8. It is defined as follows:
The encodingName provides a string name for the encoding, by which it can be referred to in other procedures such as Tcl_GetEncoding. The toUtfProc refers to a callback procedure to invoke to convert text from this encoding into UTF-8. The fromUtfProc refers to a callback procedure to invoke to convert text from UTF-8 into this encoding. The freeProc refers to a callback procedure to invoke when this encoding is deleted. The freeProc field may be NULL.

The clientData contains an arbitrary one-word value passed to toUtfProc, fromUtfProc, and freeProc whenever they are called. Typically, this is a pointer to a data structure containing encoding-specific information that can be used by the callback procedures. For instance, two very similar encodings such as ascii and macRoman may use the same callback procedure, but use different values of clientData to control its behavior.

The nullSize specifies the number of zero bytes that signify end-of-string in this encoding. It must be 1 (for single-byte or multi-byte encodings like ASCII or Shift-JIS) or 2 (for double-byte encodings like Unicode). Constant-sized encodings with 3 or more bytes per character (such as CNS11643) are not accepted.

The callback procedures toUtfProc and fromUtfProc should match the type Tcl_EncodingConvertProc:
The toUtfProc and fromUtfProc procedures are called by the Tcl_ExternalToUtf or Tcl_UtfToExternal family of functions to perform the actual conversion. The clientData parameter to these procedures is the same as the clientData field specified to Tcl_CreateEncoding when the encoding was created. The remaining arguments to the callback procedures are the same as the arguments, documented at the top, to Tcl_ExternalToUtf or Tcl_UtfToExternal, with the following exceptions. If the srcLen argument to one of those high-level functions is negative, the value passed to the callback procedure will be the appropriate encoding-specific string length of src. If any of the srcReadPtr, dstWrotePtr, or dstCharsPtr arguments to one of the high-level functions is NULL, the corresponding value passed to the callback procedure will be a non-NULL location.

The callback procedure freeProc, if non-NULL, should match the type Tcl_EncodingFreeProc:
This freeProc function is called when the encoding is deleted. The clientData parameter is the same as the clientData field specified to Tcl_CreateEncoding when the encoding was created.
Tcl_GetDefaultEncodingDir and Tcl_SetDefaultEncodingDir access and set the directory to use when locating the default encoding files. If this value is not NULL, the TclpInitLibraryPath routine places the path at the head of the search path, and uses this path as the first place to look when trying to locate the encoding file.

ENCODING FILES
Space would prohibit precompiling into Tcl every possible encoding algorithm, so many encodings are stored on disk as dynamically-loadable encoding files. This behavior also allows the user to create additional encoding files that can be loaded using the same mechanism. These encoding files contain information about the tables and/or escape sequences used to map between an external encoding and Unicode. The external encoding may consist of single-byte, multi-byte, or double-byte characters.

Each dynamically-loadable encoding is represented as a text file. The initial line of the file, beginning with a ``#'' symbol, is a comment that provides a human-readable description of the file. The next line identifies the type of encoding file. It can be one of the following letters:
The rest of the lines in the file depend on the type. Cases [1], [2], and [3] are collectively referred to as table-based encoding files. The lines in a table-based encoding file are in the same format as this example taken from the shiftjis encoding (this is not the complete file):
The third line of the file is three numbers. The first number is the fallback character (in base 16) to use when converting from UTF-8 to this encoding. The second number is a 1 if this file represents the encoding for a symbol font, or 0 otherwise. The last number (in base 10) is how many pages of data follow.

Subsequent lines in the example above are pages that describe how to map from the encoding into 2-byte Unicode. The first line in a page identifies the page number. Following it are 256 double-byte numbers, arranged as 16 rows of 16 numbers. Given a character in the encoding, the high byte of that character is used to select which page, and the low byte of that character is used as an index to select one of the double-byte numbers in that page; the value obtained is the corresponding Unicode character. By examination of the example above, one can see that the characters 0x7E and 0x8163 in shiftjis map to 203E and 2026 in Unicode, respectively.

Following the first page will be all the other pages, each in the same format as the first: one number identifying the page followed by 256 double-byte Unicode characters. If a character in the encoding maps to the Unicode character 0000, it means that the character doesn't actually exist. If all characters on a page would map to 0000, that page can be omitted.

Case [4] is the escape-sequence encoding file. The lines in this type of file are in the same format as this example taken from the iso2022-jp encoding:
In the file, the first column represents an option and the second column is the associated value. init is a string to emit or expect before the first character is converted, while final is a string to emit or expect after the last character. All other options are names of table-based encodings; the associated value is the escape-sequence that marks that encoding. Tcl syntax is used for the values; in the above example, for instance, ``{}'' represents the empty string and ``\x1b'' represents character 27.

When Tcl_GetEncoding encounters an encoding name that has not been loaded, it attempts to load an encoding file called name.enc from the encoding subdirectory of each directory specified in the library path $tcl_libPath. If the encoding file exists, but is malformed, an error message will be left in interp.

KEYWORDS
utf, encoding, convert
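EXAMPLE
The table-based lookup described under ENCODING FILES (three-number header line, then pages of 256 four-digit hexadecimal values selected by the high byte and indexed by the low byte) can be sketched in plain C. This is an illustrative model only and is not Tcl's loader: ToyPage, toy_parse_header, toy_parse_page, and toy_lookup are hypothetical names, and the sketch assumes each page body is exactly 256 values of four hex digits with optional whitespace between rows.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* One page of a table-based encoding: 256 Unicode values. Toy model only. */
typedef struct ToyPage {
    int present;                /* 0 if the page was omitted from the file */
    unsigned short map[256];
} ToyPage;

/* Parse the header line: fallback (base 16), symbol flag, page count. */
static int
toy_parse_header(const char *line, unsigned *fallback, int *symbol,
        int *numPages)
{
    return sscanf(line, "%x %d %d", fallback, symbol, numPages) == 3 ? 0 : -1;
}

/* Value of one hexadecimal digit, or -1 if ch is not a hex digit. */
static int
toy_hex(int ch)
{
    if (ch >= '0' && ch <= '9') return ch - '0';
    if (ch >= 'A' && ch <= 'F') return ch - 'A' + 10;
    if (ch >= 'a' && ch <= 'f') return ch - 'a' + 10;
    return -1;
}

/* Parse a page body of 256 four-digit hex numbers; returns 0 on success. */
static int
toy_parse_page(const char *text, unsigned short map[256])
{
    int i, j;

    for (i = 0; i < 256; i++) {
        unsigned value = 0;

        while (*text == ' ' || *text == '\n' || *text == '\r') {
            text++;             /* skip row breaks */
        }
        for (j = 0; j < 4; j++) {
            int digit = toy_hex((unsigned char) *text);
            if (digit < 0) {
                return -1;      /* truncated or malformed page */
            }
            value = value * 16 + (unsigned) digit;
            text++;
        }
        map[i] = (unsigned short) value;
    }
    return 0;
}

/*
 * Map a character code to Unicode: the high byte selects the page, the
 * low byte indexes into it. Returns 0 when the character does not exist.
 */
static unsigned
toy_lookup(const ToyPage pages[256], unsigned code)
{
    unsigned hi = (code >> 8) & 0xFF;
    unsigned lo = code & 0xFF;

    if (!pages[hi].present) {
        return 0;
    }
    return pages[hi].map[lo];
}
```

With page 00 loaded so that entry 7E holds 203E, toy_lookup(pages, 0x7E) yields 0x203E, mirroring the shiftjis mapping mentioned in the text. A lookup that lands on an omitted page returns 0, the value the file format reserves for characters that do not exist.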