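How do I get the base pointer of a Unicode character?

Time: 2017-08-29 05:14:17

Tags: java unicode

Currently I have codePointAt, which returns the code point of a character in a string. Is there any API or other way to get the base pointer (the first code point of the block) for the current character?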

public class Testclass {

    public static void main(String[] args) {

        String unicodeString = "कागज़";
        int currentPoint = unicodeString.codePointAt(0);

        // Now currentPoint = 0x0915
        // I need currentPoint = 0x0900
    }
}

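Note: I can't derive the base pointer by simple addition/subtraction, because the bases of different languages start at different place values (tens, hundreds). For example:

Armenian - 0530-058F - base pointer 0x0530 (tens place value)
Devanagari - 0900-097F - base pointer 0x0900 (hundreds place value)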
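
To make the alignment problem concrete, here is a minimal sketch of what a fixed bit mask does to the two examples above. The class name MaskDemo and the helper roundDown are made up for illustration. The sample code points are U+0531 (an Armenian letter) and U+0915 (the first character of the string in the question).

```java
class MaskDemo {
    // Rounds a code point down to a multiple of alignment (must be a power of two).
    static int roundDown(int codePoint, int alignment) {
        return codePoint & ~(alignment - 1);
    }

    public static void main(String[] args) {
        int armenian = 0x0531;   // block 0530-058F
        int devanagari = 0x0915; // block 0900-097F

        // A 0x100 alignment fits Devanagari but undershoots the Armenian base.
        System.out.printf("0x%04X 0x%04X%n",
                roundDown(devanagari, 0x100), roundDown(armenian, 0x100)); // 0x0900 0x0500
        // A 0x10 alignment fits this Armenian letter but not the Devanagari one.
        System.out.printf("0x%04X 0x%04X%n",
                roundDown(devanagari, 0x10), roundDown(armenian, 0x10));   // 0x0910 0x0530
    }
}
```

No single alignment gives the right base for both blocks, which is why some kind of per-block table is needed.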

Currently I'm using an if-else block to get the base pointer, which is not dynamic and is also a lengthy approach. :-(

int basePointer = -1; // -1 means no known block matched
if (currentPoint >= 0x0600 && currentPoint <= 0x06FF) { // Arabic
    basePointer = 0x0600;
} else if (currentPoint >= 0x0900 && currentPoint <= 0x097F) { // Devanagari
    basePointer = 0x0900;
}

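4 answers:

Answer 0 (score: 2)

OK, after thinking about it a bit more, here is an approach that uses only the Java API. It consists of three parts:

  1. Regenerate the block base table, which is held in the inaccessible blockStarts field of Character.UnicodeBlock, as a Map
  2. Use Character.UnicodeBlock.of(int) to look up the block for a given code point
  3. Use the Map to look up the base of that Unicode block

Note that regenerating the block base table is relatively slow, at roughly 10-15 ms on my machine, so it is best to generate it once and reuse it. I have left the basic timing code in.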

    // Needs: import java.util.HashMap; import java.util.Map;
    private static final int SUPPLEMENTARY_PRIVATE_USE_AREA_A_BASE = 0x0F0000;
    private static final int SUPPLEMENTARY_PRIVATE_USE_AREA_B_BASE = 0x100000;
    
    private static final Character.UnicodeBlock SUPPLEMENTARY_PRIVATE_USE_AREA_A =
        Character.UnicodeBlock.of(SUPPLEMENTARY_PRIVATE_USE_AREA_A_BASE);
    private static final Character.UnicodeBlock SUPPLEMENTARY_PRIVATE_USE_AREA_B =
        Character.UnicodeBlock.of(SUPPLEMENTARY_PRIVATE_USE_AREA_B_BASE);
    
    public static Map<Character.UnicodeBlock, Integer> makeUnicodeBlockBaseMap() {
      long startNanos = System.nanoTime();
      Map<Character.UnicodeBlock, Integer> unicodeBases = new HashMap<>();
      // Unicode blocks start on 16 (0x10) code point boundaries.
      for (int cp = 0x00000; cp < SUPPLEMENTARY_PRIVATE_USE_AREA_A_BASE; cp += 0x10) {
        Character.UnicodeBlock ucb = Character.UnicodeBlock.of(cp);
        if (ucb != null) {
          unicodeBases.putIfAbsent(ucb, cp);
        }
      }
      // These blocks are huge, so add them manually.
      unicodeBases.put(SUPPLEMENTARY_PRIVATE_USE_AREA_A, SUPPLEMENTARY_PRIVATE_USE_AREA_A_BASE);
      unicodeBases.put(SUPPLEMENTARY_PRIVATE_USE_AREA_B, SUPPLEMENTARY_PRIVATE_USE_AREA_B_BASE);
      long endNanos = System.nanoTime();
      System.out.format("Total time = %.3f s%n", (endNanos - startNanos) / 1e9);
      return unicodeBases;
    }
    
    public static void main(String[] args) {
      Map<Character.UnicodeBlock, Integer> unicodeBlockBases = makeUnicodeBlockBaseMap();
    
      String unicodeString = "कागज़";
      int currentPoint = unicodeString.codePointAt(0);
    
      Character.UnicodeBlock ucb = Character.UnicodeBlock.of(currentPoint);
      System.out.println(ucb);                                   // DEVANAGARI
      System.out.format("0x%04X%n", unicodeBlockBases.get(ucb)); // 0x0900
    }
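
Since the answer recommends generating the table once and reusing it, one way to package that is a small holder class that builds the map in a static initializer. This is only a sketch: the names UnicodeBlockBases and baseOf are made up for illustration, and the -1 return value for unassigned code points is an assumption. For simplicity it scans the whole code point range, rather than adding the two private use areas by hand.

```java
import java.util.HashMap;
import java.util.Map;

final class UnicodeBlockBases {
    // Built once when the class is loaded, then shared by every lookup.
    private static final Map<Character.UnicodeBlock, Integer> BASES = buildBases();

    private UnicodeBlockBases() {}

    private static Map<Character.UnicodeBlock, Integer> buildBases() {
        Map<Character.UnicodeBlock, Integer> bases = new HashMap<>();
        // Blocks start on 0x10 boundaries, so probing every 16th code point is enough.
        for (int cp = Character.MIN_CODE_POINT; cp <= Character.MAX_CODE_POINT; cp += 0x10) {
            Character.UnicodeBlock block = Character.UnicodeBlock.of(cp);
            if (block != null) {
                bases.putIfAbsent(block, cp); // the first hit is the block's base
            }
        }
        return bases;
    }

    /** Returns the first code point of the block containing codePoint, or -1 if it is in no block. */
    static int baseOf(int codePoint) {
        Character.UnicodeBlock block = Character.UnicodeBlock.of(codePoint);
        Integer base = (block == null) ? null : BASES.get(block);
        return (base == null) ? -1 : base;
    }

    public static void main(String[] args) {
        System.out.format("0x%04X%n", baseOf("कागज़".codePointAt(0))); // 0x0900
        System.out.format("0x%04X%n", baseOf(0x0531));                 // 0x0530
    }
}
```

Character.UnicodeBlock.of(int) throws IllegalArgumentException for values outside the valid code point range, so baseOf does too.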
    

Answer 1 (score: 0)

You can put the start/end positions of each language into SortedMaps and check code points against them:

 // Needs: import java.util.SortedMap; import java.util.TreeMap;
 private static final SortedMap<Integer, Integer> startToBase = new TreeMap<>();
 private static final SortedMap<Integer, Integer> endToBase = new TreeMap<>();
 static {
   // Fill the SortedMaps:
   // latin
   startToBase.put(0, 0);
   endToBase.put(0x00ff, 0);
   // ...
 }
 // Or load this from a web service, table or anything you find comfortable

 public static final int baseCodePoint(int codePoint) {
   // Ranges that start at or before codePoint, and ranges that end at or after it.
   SortedMap<Integer, Integer> starts = startToBase.headMap(codePoint + 1);
   SortedMap<Integer, Integer> ends = endToBase.tailMap(codePoint);
   if (!starts.isEmpty() && !ends.isEmpty()) {
     int baseFromStart = starts.get(starts.lastKey());
     int baseFromEnd = ends.get(ends.firstKey());
     // Both lookups must name the same range; otherwise codePoint lies in a gap.
     if (baseFromStart == baseFromEnd) {
       return baseFromStart;
     }
   }
   throw new IllegalArgumentException(codePoint + " is unknown.");
 }

Answer 2 (score: 0)

This is what I ended up doing, thanks to Gábor Bakos for the inspiration:

TreeMap<Integer, Integer> languageCodePoints = new TreeMap<>();
languageCodePoints.put(0x0020, 0x007E);
languageCodePoints.put(0x00A0, 0x00FF);
languageCodePoints.put(0x0100, 0x017F);
languageCodePoints.put(0x0900, 0x097F); // Devanagari

// And so on for all other languages; see ISO/IEC 10646:2010
// for the code point ranges of current languages

In the function where I use it:

String unicodeString = "कागज़";
int currentPoint = unicodeString.codePointAt(0);
int startCodePoint = languageCodePoints.floorKey(currentPoint);

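Now I get startCodePoint = 0x900, which is what I need. Quite simple, I think. :-P
The only catch is that I have to maintain the languageCodePoints TreeMap for each new language entry, but that is far better than switch / if-else.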
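
One thing to watch with floorKey is that it returns null when the code point is below every key, and it never looks at the stored end value, so a code point in a gap between ranges is still mapped to the preceding range. A hedged sketch of a stricter lookup is shown below. The names RangeLookup and startOf are made up for illustration, and returning OptionalInt is just one possible way to signal "no range".

```java
import java.util.Map;
import java.util.OptionalInt;
import java.util.TreeMap;

final class RangeLookup {
    // Range start -> range end (inclusive); entries copied from the answer above.
    private static final TreeMap<Integer, Integer> RANGES = new TreeMap<>();
    static {
        RANGES.put(0x0020, 0x007E);
        RANGES.put(0x00A0, 0x00FF);
        RANGES.put(0x0100, 0x017F);
        RANGES.put(0x0900, 0x097F); // Devanagari
    }

    /** Returns the start of the range containing codePoint, or empty if no range contains it. */
    static OptionalInt startOf(int codePoint) {
        // Greatest range start that is <= codePoint, or null if there is none.
        Map.Entry<Integer, Integer> entry = RANGES.floorEntry(codePoint);
        if (entry == null || codePoint > entry.getValue()) {
            return OptionalInt.empty();
        }
        return OptionalInt.of(entry.getKey());
    }

    public static void main(String[] args) {
        System.out.println(startOf(0x0915).isPresent()); // true: inside 0900-097F
        System.out.println(startOf(0x0085).isPresent()); // false: between 007E and 00A0
        System.out.println(startOf(0x0010).isPresent()); // false: below the first range
    }
}
```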

Thanks to everyone for the support. :-)

Answer 3 (score: -1)

You can use bit manipulation to find the base pointer, like this:

 switch (codePoint & 0xffffff00) {
   case 0x0600: // Arabic
   case 0x0900: // Devanagari, though you might need to check it is at most 0x97F
   case 0x0000: // Latin
   default:     // Something else
 }

Ah, sorry, I think Armenian needs further handling, but hopefully the general idea applies to most languages.

public static int baseCodePoint(int codePoint) {
  switch (codePoint & 0xffffff00) {
    case 0x0900:
      // Devanagari is 0x0900-0x097F; 0x0980-0x09FF is the Bengali block.
      return codePoint < 0x0980 ? 0x0900 : 0x0980;
    case 0x0500:
      // Armenian is 0x0530-0x058F; other 0x05xx code points use the default below.
      if (codePoint >= 0x0530 && codePoint <= 0x058F) return 0x0530;
      break;
    // case ...: other bases where it is not the real base
  }
  // Handling regular base pointers
  return codePoint & 0xffffff00;
}