Class SpoofChecker


  • public class SpoofChecker
    extends Object

    This class, based on Unicode Technical Report #36 and Unicode Technical Standard #39, has two main functions:

    1. Checking whether two strings are visually confusable with each other, such as "desparejado" and "ԁеѕрагејаԁо".
    2. Checking whether an individual string is likely to be an attempt at confusing the reader (spoof detection), such as "pаypаl" spelled with Cyrillic 'а' characters.

    Although originally designed as a method for flagging suspicious identifier strings such as URLs, SpoofChecker has a number of other practical use cases, such as preventing attempts to evade bad-word content filters.

    Confusables

    The following example shows how to use SpoofChecker to check for confusability between two strings:

     
     SpoofChecker sc = new SpoofChecker.Builder().setChecks(SpoofChecker.CONFUSABLE).build();
     int result = sc.areConfusable("desparejado", "ԁеѕрагејаԁо");
     System.out.println(result != 0);  // true
     
     

    SpoofChecker uses a builder paradigm: options are specified within the context of a lightweight SpoofChecker.Builder object, and upon calling SpoofChecker.Builder.build(), expensive data loading operations are performed, and an immutable SpoofChecker is returned.

    The first line of the example creates a SpoofChecker object with confusable-checking enabled; the second line performs the confusability test. For best performance, create the instance once (e.g., upon application startup) so that only the comparatively cheap areConfusable(java.lang.String, java.lang.String) call runs at runtime.

    UTS 39 defines two strings to be confusable if they map to the same skeleton. A skeleton is a sequence of families of confusable characters, where each family has a single exemplar character. getSkeleton(java.lang.CharSequence) computes the skeleton for a particular string, so the following snippet is equivalent to the example above:

     
     SpoofChecker sc = new SpoofChecker.Builder().setChecks(SpoofChecker.CONFUSABLE).build();
     boolean result = sc.getSkeleton("desparejado").equals(sc.getSkeleton("ԁеѕрагејаԁо"));
     System.out.println(result);  // true
     
     

    If you need to check if a string is confusable with any string in a dictionary of many strings, rather than calling areConfusable(java.lang.String, java.lang.String) many times in a loop, getSkeleton(java.lang.CharSequence) can be used instead, as shown below:

     // Setup (assumes: import java.util.HashSet; import com.ibm.icu.text.SpoofChecker;):
     String[] DICTIONARY = new String[]{ "lorem", "ipsum" }; // example
     SpoofChecker sc = new SpoofChecker.Builder().setChecks(SpoofChecker.CONFUSABLE).build();
     HashSet<String> skeletons = new HashSet<String>();
     for (String word : DICTIONARY) {
       skeletons.add(sc.getSkeleton(word));
     }
    
     // Live Check:
     boolean result = skeletons.contains(sc.getSkeleton("1orern"));
     System.out.println(result);  // true
     

    Note: Since the Unicode confusables mapping table is frequently updated, confusable skeletons are not guaranteed to be the same between ICU releases. We therefore recommend that you always compute confusable skeletons at runtime and do not rely on creating a permanent, or difficult to update, database of skeletons.
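
    A set of skeletons only answers whether a match exists. The sketch below keeps a map from each skeleton to the dictionary words that produced it, and rebuilds that map in memory at startup rather than persisting it, in line with the note above. It is a minimal illustration, not ICU code: SkeletonIndex and toySkeleton are hypothetical names, and toySkeleton is a two-rule stand-in that does not reflect the real UTS 39 mapping. In real code, the function passed to the constructor would presumably be the getSkeleton method of a SpoofChecker instance.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical helper: maps each skeleton to the dictionary words that produce it. */
class SkeletonIndex {
    private final Function<String, String> skeletonOf;
    private final Map<String, List<String>> wordsBySkeleton = new HashMap<>();

    /** Builds the index in memory; intended to be rebuilt at every startup, never stored. */
    SkeletonIndex(Function<String, String> skeletonOf, Iterable<String> dictionary) {
        this.skeletonOf = skeletonOf;
        for (String word : dictionary) {
            wordsBySkeleton
                .computeIfAbsent(skeletonOf.apply(word), k -> new ArrayList<>())
                .add(word);
        }
    }

    /** Returns the dictionary words whose skeleton equals the candidate's (empty if none). */
    List<String> confusableWith(String candidate) {
        return wordsBySkeleton.getOrDefault(skeletonOf.apply(candidate), List.of());
    }

    /** Toy stand-in for a real skeleton function; NOT the UTS 39 confusables data. */
    static String toySkeleton(String s) {
        return s.replace('1', 'l').replace("m", "rn");
    }

    public static void main(String[] args) {
        SkeletonIndex index =
            new SkeletonIndex(SkeletonIndex::toySkeleton, List.of("lorem", "ipsum"));
        System.out.println(index.confusableWith("1orern")); // [lorem]
        System.out.println(index.confusableWith("dolor"));  // []
    }
}
```

    Returning the matching words, rather than a boolean, makes it possible to tell the user which existing identifier the candidate collides with.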

    Spoof Detection

    The following snippet shows a minimal example of using SpoofChecker to perform spoof detection on a string:

     SpoofChecker sc = new SpoofChecker.Builder()
         .setAllowedChars(SpoofChecker.RECOMMENDED.cloneAsThawed().addAll(SpoofChecker.INCLUSION))
         .setRestrictionLevel(SpoofChecker.RestrictionLevel.MODERATELY_RESTRICTIVE)
         .setChecks(SpoofChecker.ALL_CHECKS &~ SpoofChecker.CONFUSABLE)
         .build();
     boolean result = sc.failsChecks("pаypаl");  // with Cyrillic 'а' characters
     System.out.println(result);  // true
     

    As in the case for confusability checking, it is good practice to create one SpoofChecker instance at startup, and call the cheaper failsChecks(java.lang.String, com.ibm.icu.text.SpoofChecker.CheckResult) online. The setAllowedChars call on the second line restricts the set of allowed characters to those with type RECOMMENDED or INCLUSION, according to the recommendation in UTS 39. The setChecks call on the fourth line disables the CONFUSABLE checks. It is good practice to disable them if you won't be using the instance to perform confusability checking.

    To get more details on why a string failed the checks, use a SpoofChecker.CheckResult:

     
     SpoofChecker sc = new SpoofChecker.Builder()
         .setAllowedChars(SpoofChecker.RECOMMENDED.cloneAsThawed().addAll(SpoofChecker.INCLUSION))
         .setRestrictionLevel(SpoofChecker.RestrictionLevel.MODERATELY_RESTRICTIVE)
         .setChecks(SpoofChecker.ALL_CHECKS &~ SpoofChecker.CONFUSABLE)
         .build();
     SpoofChecker.CheckResult checkResult = new SpoofChecker.CheckResult();
     boolean result = sc.failsChecks("pаypаl", checkResult);
     System.out.println(checkResult.checks);  // 16
     
     

    The checks field of the CheckResult is a bitmask of the checks that failed. In this case, one check failed: RESTRICTION_LEVEL, corresponding to the fifth bit (16). Other checks referenced in this documentation include INVISIBLE, MIXED_NUMBERS, and CHAR_LIMIT.
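
    A bitmask like this is easier to log when it is decoded into names. The sketch below shows one way to do that with plain bit tests. FailedChecks and its table are hypothetical: only the value 16 for RESTRICTION_LEVEL is stated in the text above, and the other numeric values are assumptions that should be replaced with the corresponding SpoofChecker constants in real code.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper that turns a failed-checks bitmask into readable names. */
class FailedChecks {
    // Local stand-ins. Only RESTRICTION_LEVEL = 16 comes from the text above;
    // the remaining values are assumptions. Prefer the SpoofChecker constants.
    static final Map<String, Integer> CHECK_BITS = new LinkedHashMap<>();
    static {
        CHECK_BITS.put("RESTRICTION_LEVEL", 16);
        CHECK_BITS.put("INVISIBLE", 32);      // assumed value
        CHECK_BITS.put("CHAR_LIMIT", 64);     // assumed value
        CHECK_BITS.put("MIXED_NUMBERS", 128); // assumed value
    }

    /** Returns the names of all table entries whose bit is set in the mask. */
    static List<String> describe(int checks) {
        List<String> names = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : CHECK_BITS.entrySet()) {
            if ((checks & entry.getValue()) != 0) {
                names.add(entry.getKey());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(describe(16)); // [RESTRICTION_LEVEL]
    }
}
```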

    These checks can be enabled independently of each other. For example, if you were interested in checking for only the INVISIBLE and MIXED_NUMBERS conditions, you could do:

     
     SpoofChecker sc = new SpoofChecker.Builder()
         .setChecks(SpoofChecker.INVISIBLE | SpoofChecker.MIXED_NUMBERS)
         .build();
     boolean result = sc.failsChecks("৪8");
     System.out.println(result);  // true
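
    To show why a string such as the one above trips the MIXED_NUMBERS condition, the following sketch approximates the idea using only the Java standard library: each decimal digit is mapped to the zero digit of its numbering system, and the string is flagged when more than one zero digit appears. This is an illustration of the concept as described in UTS 39, not ICU's implementation; MixedNumbers is a hypothetical class name.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch of mixed-number detection using java.lang.Character. */
class MixedNumbers {
    /** Returns true if the text contains decimal digits from more than one numbering system. */
    static boolean hasMixedNumbers(String text) {
        Set<Integer> zeroDigits = new HashSet<>();
        text.codePoints().forEach(cp -> {
            if (Character.getType(cp) == Character.DECIMAL_DIGIT_NUMBER) {
                // The ten digits of a system are contiguous, so subtracting the
                // digit's value yields the code point of that system's zero.
                zeroDigits.add(cp - Character.digit(cp, 10));
            }
        });
        return zeroDigits.size() > 1;
    }

    public static void main(String[] args) {
        // Bengali digit four (U+09EA) followed by ASCII '8'.
        System.out.println(hasMixedNumbers("\u09EA8")); // true
        System.out.println(hasMixedNumbers("2024"));    // false
    }
}
```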
     
     

    Note: The Restriction Level is the most powerful of the checks. The full logic is documented in UTS 39, but the basic idea is that strings are restricted to contain characters from only a single script, except that most scripts are allowed to have Latin characters interspersed. Although the default restriction level is HIGHLY_RESTRICTIVE, it is recommended that users set their restriction level to MODERATELY_RESTRICTIVE, which allows Latin mixed with all other scripts except Cyrillic, Greek, and Cherokee, with which it is often confusable. For more details on the levels, see UTS 39 or SpoofChecker.RestrictionLevel. The Restriction Level test is aware of the set of allowed characters set in SpoofChecker.Builder.setAllowedChars(com.ibm.icu.text.UnicodeSet). Note that characters which have script code COMMON or INHERITED, such as numbers and punctuation, are ignored when computing whether a string has multiple scripts.
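
    The rule that COMMON and INHERITED characters are ignored can be illustrated with the standard java.lang.Character.UnicodeScript API. The sketch below only collects the scripts present in a string; it is a simplified assumption-laden illustration, not the Restriction Level algorithm itself, which also considers the allowed character set and the per-level script combinations defined in UTS 39. ScriptMix is a hypothetical class name.

```java
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical sketch: which scripts does a string use, ignoring COMMON and INHERITED? */
class ScriptMix {
    static Set<Character.UnicodeScript> scriptsOf(String text) {
        Set<Character.UnicodeScript> scripts = EnumSet.noneOf(Character.UnicodeScript.class);
        text.codePoints().forEach(cp -> {
            Character.UnicodeScript script = Character.UnicodeScript.of(cp);
            // Digits, punctuation, and combining marks carry no script of their own.
            if (script != Character.UnicodeScript.COMMON
                    && script != Character.UnicodeScript.INHERITED) {
                scripts.add(script);
            }
        });
        return scripts;
    }

    public static void main(String[] args) {
        // "paypal" with Cyrillic 'а' (U+0430) in place of Latin 'a'.
        System.out.println(scriptsOf("p\u0430yp\u0430l"));
        // Digits and the hyphen are COMMON, so only one script is reported.
        System.out.println(scriptsOf("paypal-2024"));
    }
}
```

    A string whose result contains both LATIN and CYRILLIC is the kind of mixture that the MODERATELY_RESTRICTIVE level described above is meant to reject.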

    Additional Information

    A SpoofChecker instance may be used repeatedly to perform checks on any number of identifiers.

    Thread Safety: The methods on SpoofChecker objects are thread safe. The test functions for checking a single identifier, or for testing whether two identifiers are potentially confusable, may be called concurrently from multiple threads using the same SpoofChecker instance.
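
    The build-once, share-everywhere pattern that follows from this guarantee can be sketched as below. To keep the example runnable without ICU on the classpath, CHECKER is a hypothetical stand-in (a Predicate that flags any non-ASCII character); in real code it would presumably be a static final SpoofChecker built once, with failsChecks called in place of test.

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

/** Hypothetical sketch: one immutable checker shared by all worker threads. */
class SharedChecker {
    // Stand-in for a SpoofChecker built once at startup. This toy rule
    // (any non-ASCII character fails) is an assumption for illustration only.
    static final Predicate<String> CHECKER =
        s -> !s.chars().allMatch(c -> c < 128);

    /** Checks all identifiers in parallel; every thread reuses the same CHECKER. */
    static List<Boolean> checkAll(List<String> identifiers) {
        return identifiers.parallelStream()
            .map(CHECKER::test)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Second entry uses Cyrillic 'а' (U+0430).
        System.out.println(checkAll(List.of("paypal", "p\u0430yp\u0430l"))); // [false, true]
    }
}
```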

    • Method Detail

      • getRestrictionLevel

        @Deprecated
        public SpoofChecker.RestrictionLevel getRestrictionLevel()
        Deprecated.
        This API is ICU internal only.
        Get the Restriction Level that is being tested.
        Returns:
        The restriction level
      • getChecks

        public int getChecks()
        Get the set of checks that this Spoof Checker has been configured to perform.
        Returns:
        The set of checks that this spoof checker will perform.
      • getAllowedLocales

        public Set<ULocale> getAllowedLocales()
        Get a read-only set of locales for the scripts that are acceptable in strings to be checked. If no limitations on scripts have been specified, an empty set will be returned. setAllowedChars() will reset the list of allowed locales to be empty. The returned set may not be identical to the originally specified set that is supplied to setAllowedLocales(); the information other than languages from the originally specified locales may be omitted.
        Returns:
        A set of locales corresponding to the acceptable scripts.
      • getAllowedJavaLocales

        public Set<Locale> getAllowedJavaLocales()
        Get a set of Locale instances for the scripts that are acceptable in strings to be checked. If no limitations on scripts have been specified, an empty set will be returned.
        Returns:
        A set of locales corresponding to the acceptable scripts.
      • getAllowedChars

        public UnicodeSet getAllowedChars()
        Get a UnicodeSet for the characters permitted in an identifier. This corresponds to the limits imposed by the Set Allowed Characters functions. Limitations imposed by other checks will not be reflected in the set returned by this function. The returned set will be frozen, meaning that it cannot be modified by the caller.
        Returns:
        A UnicodeSet containing the characters that are permitted by the CHAR_LIMIT test.
      • failsChecks

        public boolean failsChecks​(String text,
                                   SpoofChecker.CheckResult checkResult)
        Check the specified string for possible security issues. The text to be checked will typically be an identifier of some sort. The set of checks to be performed was specified when building the SpoofChecker.
        Parameters:
        text - A String to be checked for possible security issues.
        checkResult - Output parameter, indicates which specific tests failed. May be null if the information is not wanted.
        Returns:
        True if any issue is found with the input string.
      • failsChecks

        public boolean failsChecks​(String text)
        Check the specified string for possible security issues. The text to be checked will typically be an identifier of some sort. The set of checks to be performed was specified when building the SpoofChecker.
        Parameters:
        text - A String to be checked for possible security issues.
        Returns:
        True if any issue is found with the input string.
      • areConfusable

        public int areConfusable​(String s1,
                                 String s2)
        Check whether two specified strings are visually confusable. The types of confusability to be tested (single script, mixed script, or whole script) are determined by the check options set for the SpoofChecker. The tests to be performed are controlled by the flags SINGLE_SCRIPT_CONFUSABLE, MIXED_SCRIPT_CONFUSABLE, and WHOLE_SCRIPT_CONFUSABLE. At least one of these tests must be selected. ANY_CASE is a modifier for the tests. Select it if the identifiers may be of mixed case. If identifiers are case folded for comparison and display to the user, do not select the ANY_CASE option.
        Parameters:
        s1 - The first of the two strings to be compared for confusability.
        s2 - The second of the two strings to be compared for confusability.
        Returns:
        Non-zero if s1 and s2 are confusable. If not 0, the value will indicate the type(s) of confusability found, as defined by spoof check test constants.
      • getSkeleton

        public String getSkeleton​(CharSequence str)
        Get the "skeleton" for an identifier string. Skeletons are a transformation of the input string; two strings are confusable if their skeletons are identical. See Unicode UTS 39 for additional information. Using skeletons directly makes it possible to quickly check whether an identifier is confusable with any of some large set of existing identifiers, by creating an efficiently searchable collection of the skeletons. Skeletons are computed using the algorithm and data described in Unicode UTS 39.
        Parameters:
        str - The input string whose skeleton will be generated.
        Returns:
        The output skeleton string.
      • getSkeleton

        @Deprecated
        public String getSkeleton​(int type,
                                  String id)
        Deprecated.
        ICU 58
        Calls getSkeleton(CharSequence id). Starting with ICU 55, the "type" parameter has been ignored, and starting with ICU 58, this function has been deprecated.
        Parameters:
        type - No longer supported. Prior to ICU 55, was used to specify the mapping table SL, SA, ML, or MA.
        id - The input identifier whose skeleton will be generated.
        Returns:
        The output skeleton string.
      • equals

        public boolean equals​(Object other)
        Equality function. Return true if the two SpoofChecker objects incorporate the same confusable data and have enabled the same set of checks.
        Overrides:
        equals in class Object
        Parameters:
        other - the SpoofChecker being compared with.
        Returns:
        true if the two SpoofCheckers are equal.