How to truncate a string that mixes Chinese and English characters in C

Time: 2016-10-13 06:16:04

Tags: c linux utf-8


char *str  = "你a好测b试";




strncpy(buf, str, 4);
strcat(buf, "...");


strncpy(buf, str, 13);
strcat(buf, "...");



5 answers:

Answer 0 (score: 2)



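You can also use a function library that already handles UTF-8, for example http://www.cprogramming.com/tutorial/utf8.c and http://www.cprogramming.com/tutorial/utf8.h

In particular, the function int u8_toucs(u_int32_t *dest, int sz, char *src, int srcsz); may be very useful: it creates an array of integers in which each integer is one character. You can then modify the array as needed and convert it back to a string with int u8_toutf8(char *dest, int sz, u_int32_t *src, int srcsz);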
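The decode, edit, re-encode round trip described above can be sketched without the library. The helpers below (utf8_decode, utf8_encode, truncate_chars) are hypothetical stand-ins written for this illustration, not the u8_toucs / u8_toutf8 functions from utf8.c, and they assume well-formed UTF-8 input without validating it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a library decoder: converts well-formed UTF-8
 * into an array of code points. Returns the number of code points written. */
size_t utf8_decode(uint32_t *dest, size_t max, const char *src)
{
    const unsigned char *p = (const unsigned char *)src;
    size_t n = 0;
    while (*p && n < max) {
        uint32_t cp;
        int extra;                          /* continuation bytes to read */
        if (*p < 0x80)      { cp = *p;        extra = 0; }
        else if (*p < 0xE0) { cp = *p & 0x1F; extra = 1; }
        else if (*p < 0xF0) { cp = *p & 0x0F; extra = 2; }
        else                { cp = *p & 0x07; extra = 3; }
        p++;
        while (extra-- > 0 && (*p & 0xC0) == 0x80)
            cp = (cp << 6) | (uint32_t)(*p++ & 0x3F);
        dest[n++] = cp;
    }
    return n;
}

/* Hypothetical stand-in for a library encoder: writes count code points as
 * 0-terminated UTF-8. dest must hold at least 4 * count + 1 bytes.
 * Returns the number of bytes written, not counting the terminator. */
size_t utf8_encode(char *dest, const uint32_t *src, size_t count)
{
    char *out = dest;
    for (size_t i = 0; i < count; i++) {
        uint32_t cp = src[i];
        if (cp < 0x80) {
            *out++ = (char)cp;
        } else if (cp < 0x800) {
            *out++ = (char)(0xC0 | (cp >> 6));
            *out++ = (char)(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            *out++ = (char)(0xE0 | (cp >> 12));
            *out++ = (char)(0x80 | ((cp >> 6) & 0x3F));
            *out++ = (char)(0x80 | (cp & 0x3F));
        } else {
            *out++ = (char)(0xF0 | (cp >> 18));
            *out++ = (char)(0x80 | ((cp >> 12) & 0x3F));
            *out++ = (char)(0x80 | ((cp >> 6) & 0x3F));
            *out++ = (char)(0x80 | (cp & 0x3F));
        }
    }
    *out = '\0';
    return (size_t)(out - dest);
}

/* Keeps the first max_chars characters of src and appends "..." if anything
 * was cut. dest must hold at least 4 * max_chars + 4 bytes. */
void truncate_chars(char *dest, const char *src, size_t max_chars)
{
    uint32_t cps[64];
    assert(max_chars < 64);
    /* decode one extra code point to learn whether src is longer */
    size_t n = utf8_decode(cps, max_chars + 1, src);
    int cut = n > max_chars;
    size_t len = utf8_encode(dest, cps, cut ? max_chars : n);
    if (cut)
        strcpy(dest + len, "...");
}
```

Calling truncate_chars(buf, str, 4) on the string from the question keeps the first four characters and adds the ellipsis. Working on an array of code points costs a second buffer, but cutting then becomes plain array indexing.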


Answer 1 (score: 1)

The Basic Multilingual Plane is designed to contain characters for almost all modern languages. In particular, it does contain Chinese.


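Since C++11, the <codecvt> header declares a dedicated converter, std::codecvt_utf8, for converting UTF-8 narrow strings to wide Unicode strings (the header has been deprecated since C++17). I must admit it is not very easy to use, but it should be enough here. The code could be: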

#include <codecvt>
#include <cwchar>
#include <locale>

int main() {
    char str[]  = "你a好测b试";
    std::codecvt_utf8<wchar_t> cvt;
    std::mbstate_t state = std::mbstate_t();

    wchar_t wstr[sizeof(str)] = {0}; // there will be unused space at the end
    const char *end;
    wchar_t *wend;

    // convert the UTF-8 bytes, terminating 0 included, to wide characters
    auto cr = cvt.in(state, str, str+sizeof(str), end,
            wstr, wstr+sizeof(str), wend);
    *wend = 0;
}

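Once you have the wide string wstr, you can convert it to a std::wstring and use all the C++ library tools, or, if you prefer C strings, use the wcs... counterparts of the str... functions.

Answer 2 (score: 1)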


#include <stddef.h>

size_t count_bytes_for_chars(const char *s, int n)
{
    const char *p = s;
    n += 1;  /* we're counting up to the start of the subsequent character */

    /* only bytes that are not continuation bytes (10xxxxxx) start a character */
    while (*p && (n -= (*p & 0xc0) != 0x80))
        ++p;
    return (size_t)(p-s);
}

#include <string.h>
#include <stdio.h>
int main(void)
{
    const char *str = "你a好测b试";
    char buf[50];
    int truncate_at = 4;

    size_t bytes = count_bytes_for_chars(str, truncate_at);
    strncpy(buf, str, bytes);
    strcpy(buf+bytes, "...");

    printf("'%s' truncated to %d characters is '%s'\n", str, truncate_at, buf);
    return 0;
}

'你a好测b试' truncated to 4 characters is '你a好测...'
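
The demo above relies on buf being large enough and always appends the ellipsis. A minimal sketch of a bounded variant follows; truncate_utf8 is a hypothetical name, it repeats the same lead-byte counting inline, and it assumes the input is well-formed UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copies at most max_chars characters of src into dest (capacity dest_size
 * bytes), appending "..." only when src was actually cut.
 * Returns 0 on success, -1 if dest is too small for the result. */
int truncate_utf8(char *dest, size_t dest_size, const char *src, size_t max_chars)
{
    assert(dest != NULL && src != NULL);
    const char *p = src;
    size_t chars = 0;

    /* Stop at the first byte of character number max_chars + 1. */
    while (*p) {
        if ((*p & 0xC0) != 0x80) {      /* not 10xxxxxx: a new character starts */
            if (chars == max_chars)
                break;
            chars++;
        }
        p++;
    }

    size_t bytes = (size_t)(p - src);
    const char *suffix = *p ? "..." : "";   /* input remains only if we cut */
    if (bytes + strlen(suffix) + 1 > dest_size)
        return -1;

    memcpy(dest, src, bytes);
    strcpy(dest + bytes, suffix);
    return 0;
}
```

With a 50-byte buffer and max_chars set to 4, the string from the question is cut after the fourth character, while a string of four characters or fewer is copied unchanged.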

Answer 3 (score: 0)

Pure C solution:

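All UTF-8 multibyte characters are made of chars with the most significant bit set to 1, and the leading bits of the first char indicate how many chars make up the code point.

There are two possible criteria for where to cut:

  1. A fixed number of code points followed by three dots, which requires a variable-size output buffer

  2. A fixed-size output buffer, which imposes "whatever you can fit inside"

Both solutions need a helper function that tells how many chars make up the next code point: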

    #include <stddef.h>

    // Note: the function does NOT fully validate a
    // UTF8 sequence, only looks at the first char in it
    int codePointLen(const char* c) {
      if(NULL==c) return -1;
      if( (*c & 0xF8)==0xF0 ) return 4; // 4 ones and one 0
      if( (*c & 0xF0)==0xE0 ) return 3; // 3 ones and one 0
      if( (*c & 0xE0)==0xC0 ) return 2; // 2 ones and one 0
      if( (*c & 0x7F)==*c   ) return 1; // no ones on msb
      return -2; // invalid UTF8 starting character
    }

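So, the solution for criterion 1 (fixed number of code points, variable output buffer size) does not append ... to the destination, but you can ask "how many chars do I need" ahead of time and, if that is more than you can afford, reserve the extra space yourself.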

    #include <string.h>

    // returns the number of chars used from the output
    // If not enough space or the dest is null, does nothing
    // and returns the length required for the output buffer
    // Returns negative val if the source is not a valid UTF8
    int copyFirstCodepoints(
       int codepointsCount, const char* src,
       char* dest, int destSize
    ) {
      if(NULL==src) {
        return -1;
      }
      // do a cold run to see if size of the output buffer can fit
      // as many codepoints as required; stop early at the end of src
      const char* walker=src;
      for(int cnvCount=0; *walker && cnvCount<codepointsCount; cnvCount++) {
        int chCount=codePointLen(walker);
        if(chCount<0) {
          return chCount; // err
        }
        walker+=chCount;
      }
      if(walker-src < destSize && NULL!=dest) {
        // enough space at destination, terminating 0 included
        strncpy(dest, src, walker-src);
        dest[walker-src]='\0';
      }
      // else do nothing
      return (int)(walker-src);
    }


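    // returns how many whole codepoints of src fit in an output buffer
    // of maxBufflen chars, leaving room for the terminating 0
    // return negative if UTF encoding error: -(error position)-1
    int howManyCodepointICanFitInOutputBufferOfLen(const char* src, int maxBufflen) {
      if(NULL==src) {
        return -1;
      }
      int ret=0;
      const char* walker=src;
      while(*walker) {
         int advance=codePointLen(walker);
         if(advance<0) {
           return (int)(src-walker)-1; // err because negative, but indicating the err pos
         }
         // look on all the chars between walker and walker+advance
         // if any is not a continuation char (10xxxxxx), which includes 0,
         // we have a broken sequence or a premature end of the source
         for(int i=1; i<advance; i++) {
           if((walker[i] & 0xC0)!=0x80) {
             return (int)(src-(walker+i))-1; // err because negative, but indicating the err pos
           }
         }
         if((walker-src)+advance >= maxBufflen) {
           break; // this codepoint no longer fits
         }
         walker+=advance; // walker is set on the correct position for the next attempt
         ret++;
      }
      return ret;
    }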

Answer 4 (score: 0)

#include <stdio.h>
#include <string.h>

/* Returns lpszData cut to at most nMaxLen bytes without splitting a UTF-8
   character. A shortened result lives in a static buffer that is
   overwritten by the next call. */
static char *CutStringLength(char *lpszData, int nMaxLen)
{
    static char strTemp[1024];

    if (NULL == lpszData || 0 >= nMaxLen)
        return "";
    size_t len = strlen(lpszData);
    if (len <= (size_t)nMaxLen)
        return lpszData;
    if ((size_t)nMaxLen >= sizeof strTemp)
        nMaxLen = (int)sizeof strTemp - 1;

    /* bytes of the form 10xxxxxx continue a character, so move the cut
       back until it lands on the first byte of a character */
    int cut = nMaxLen;
    while (cut > 0 && ((unsigned char)lpszData[cut] & 0xC0) == 0x80)
        cut--;

    memcpy(strTemp, lpszData, (size_t)cut);
    strTemp[cut] = '\0';
    printf("str = %s\n", strTemp);
    return strTemp;
}