GNU Info

Info Node: (libc.info)Converting Strings

(libc.info)Converting Strings


Next: Multibyte Conversion Example Prev: Converting a Character Up: Restartable multibyte conversion
Enter node , (file) or (file)node

Converting Multibyte and Wide Character Strings
-----------------------------------------------

   The functions described in the previous section only convert a single
character at a time.  Most operations to be performed in real-world
programs include strings and therefore the ISO C standard also defines
conversions on entire strings.  However, the defined set of functions
is quite limited; therefore, the GNU C library contains a few
extensions that can help in some important situations.

 - Function: size_t mbsrtowcs (wchar_t *restrict DST, const char
          **restrict SRC, size_t LEN, mbstate_t *restrict PS)
     The `mbsrtowcs' function ("multibyte string restartable to wide
     character string") converts an NUL-terminated multibyte character
     string at `*SRC' into an equivalent wide character string,
     including the NUL wide character at the end.  The conversion is
     started using the state information from the object pointed to by
     PS or from an internal object of `mbsrtowcs' if PS is a null
     pointer.  Before returning, the state object is updated to match
     the state after the last converted character.  The state is the
     initial state if the terminating NUL byte is reached and converted.

     If DST is not a null pointer, the result is stored in the array
     pointed to by DST; otherwise, the conversion result is not
     available since it is stored in an internal buffer.

     If LEN wide characters are stored in the array DST before reaching
     the end of the input string, the conversion stops and LEN is
     returned.  If DST is a null pointer, LEN is never checked.

     Another reason for a premature return from the function call is if
     the input string contains an invalid multibyte sequence.  In this
     case the global variable `errno' is set to `EILSEQ' and the
     function returns `(size_t) -1'.

     In all other cases the function returns the number of wide
     characters converted during this call.  If DST is not null,
     `mbsrtowcs' stores in the pointer pointed to by SRC either a null
     pointer (if the NUL byte in the input string was reached) or the
     address of the byte following the last converted multibyte
     character.

     `mbsrtowcs' was introduced in Amendment 1 to ISO C90 and is
     declared in `wchar.h'.

   The definition of the `mbsrtowcs' function has one important
limitation.  The requirement that DST has to be a NUL-terminated string
provides problems if one wants to convert buffers with text.  A buffer
is normally no collection of NUL-terminated strings but instead a
continuous collection of lines, separated by newline characters.  Now
assume that a function to convert one line from a buffer is needed.
Since the line is not NUL-terminated, the source pointer cannot
directly point into the unmodified text buffer.  This means, either one
inserts the NUL byte at the appropriate place for the time of the
`mbsrtowcs' function call (which is not doable for a read-only buffer
or in a multi-threaded application) or one copies the line in an extra
buffer where it can be terminated by a NUL byte.  Note that it is not
in general possible to limit the number of characters to convert by
setting the parameter LEN to any specific value.  Since it is not known
how many bytes each multibyte character sequence is in length, one can
only guess.

   There is still a problem with the method of NUL-terminating a line
right after the newline character, which could lead to very strange
results.  As said in the description of the `mbsrtowcs' function above
the conversion state is guaranteed to be in the initial shift state
after processing the NUL byte at the end of the input string.  But this
NUL byte is not really part of the text (i.e., the conversion state
after the newline in the original text could be something different
than the initial shift state and therefore the first character of the
next line is encoded using this state).  But the state in question is
never accessible to the user since the conversion stops after the NUL
byte (which resets the state).  Most stateful character sets in use
today require that the shift state after a newline be the initial
state-but this is not a strict guarantee.  Therefore, simply
NUL-terminating a piece of a running text is not always an adequate
solution and, therefore, should never be used in generally used code.

   The generic conversion interface (Note: Generic Charset Conversion)
does not have this limitation (it simply works on buffers, not
strings), and the GNU C library contains a set of functions that take
additional parameters specifying the maximal number of bytes that are
consumed from the input string.  This way the problem of `mbsrtowcs''s
example above could be solved by determining the line length and
passing this length to the function.

 - Function: size_t wcsrtombs (char *restrict DST, const wchar_t
          **restrict SRC, size_t LEN, mbstate_t *restrict PS)
     The `wcsrtombs' function ("wide character string restartable to
     multibyte string") converts the NUL-terminated wide character
     string at `*SRC' into an equivalent multibyte character string and
     stores the result in the array pointed to by DST.  The NUL wide
     character is also converted.  The conversion starts in the state
     described in the object pointed to by PS or by a state object
     locally to `wcsrtombs' in case PS is a null pointer.  If DST is a
     null pointer, the conversion is performed as usual but the result
     is not available.  If all characters of the input string were
     successfully converted and if DST is not a null pointer, the
     pointer pointed to by SRC gets assigned a null pointer.

     If one of the wide characters in the input string has no valid
     multibyte character equivalent, the conversion stops early, sets
     the global variable `errno' to `EILSEQ', and returns `(size_t) -1'.

     Another reason for a premature stop is if DST is not a null
     pointer and the next converted character would require more than
     LEN bytes in total to the array DST.  In this case (and if DEST is
     not a null pointer) the pointer pointed to by SRC is assigned a
     value pointing to the wide character right after the last one
     successfully converted.

     Except in the case of an encoding error the return value of the
     `wcsrtombs' function is the number of bytes in all the multibyte
     character sequences stored in DST.  Before returning the state in
     the object pointed to by PS (or the internal object in case PS is
     a null pointer) is updated to reflect the state after the last
     conversion.  The state is the initial shift state in case the
     terminating NUL wide character was converted.

     The `wcsrtombs' function was introduced in Amendment 1 to ISO C90
     and is declared in `wchar.h'.

   The restriction mentioned above for the `mbsrtowcs' function applies
here also.  There is no possibility of directly controlling the number
of input characters.  One has to place the NUL wide character at the
correct place or control the consumed input indirectly via the
available output array size (the LEN parameter).

 - Function: size_t mbsnrtowcs (wchar_t *restrict DST, const char
          **restrict SRC, size_t NMC, size_t LEN, mbstate_t *restrict
          PS)
     The `mbsnrtowcs' function is very similar to the `mbsrtowcs'
     function.  All the parameters are the same except for NMC, which is
     new.  The return value is the same as for `mbsrtowcs'.

     This new parameter specifies how many bytes at most can be used
     from the multibyte character string.  In other words, the
     multibyte character string `*SRC' need not be NUL-terminated.  But
     if a NUL byte is found within the NMC first bytes of the string,
     the conversion stops here.

     This function is a GNU extension.  It is meant to work around the
     problems mentioned above.  Now it is possible to convert a buffer
     with multibyte character text piece for piece without having to
     care about inserting NUL bytes and the effect of NUL bytes on the
     conversion state.

   A function to convert a multibyte string into a wide character string
and display it could be written like this (this is not a really useful
example):

     void
     showmbs (const char *src, FILE *fp)
     {
       mbstate_t state;
       int cnt = 0;
       memset (&state, '\0', sizeof (state));
       while (1)
         {
           wchar_t linebuf[100];
           const char *endp = strchr (src, '\n');
           size_t n;
     
           /* Exit if there is no more line.  */
           if (endp == NULL)
             break;
     
           n = mbsnrtowcs (linebuf, &src, endp - src, 99, &state);
           linebuf[n] = L'\0';
           fprintf (fp, "line %d: \"%S\"\n", linebuf);
         }
     }

   There is no problem with the state after a call to `mbsnrtowcs'.
Since we don't insert characters in the strings that were not in there
right from the beginning and we use STATE only for the conversion of
the given buffer, there is no problem with altering the state.

 - Function: size_t wcsnrtombs (char *restrict DST, const wchar_t
          **restrict SRC, size_t NWC, size_t LEN, mbstate_t *restrict
          PS)
     The `wcsnrtombs' function implements the conversion from wide
     character strings to multibyte character strings.  It is similar to
     `wcsrtombs' but, just like `mbsnrtowcs', it takes an extra
     parameter, which specifies the length of the input string.

     No more than NWC wide characters from the input string `*SRC' are
     converted.  If the input string contains a NUL wide character in
     the first NWC characters, the conversion stops at this place.

     The `wcsnrtombs' function is a GNU extension and just like
     `mbsnrtowcs' helps in situations where no NUL-terminated input
     strings are available.


automatically generated by info2www version 1.2.2.9