String Searching

Abstract

This document describes string searching operations on the Web in order to allow greater interoperability. String searching refers to natural language string matching such as the "find" command in a Web browser. This document builds upon the concepts found in Character Model for the World Wide Web 1.0: Fundamentals [CHARMOD] and Character Model for the World Wide Web 1.0: String Matching [CHARMOD-NORM] to provide authors of specifications, software developers, and content developers the information they need to describe and implement search features suitable for global audiences.

Users of the Web often want to search for specific text in a document or collection of documents without having to read line-by-line. Specifications sometimes seek to support this desire by exposing text searching in the Web platform.

There are different types of document searching. One type, called a full text search, is the sort of searching most often found in applications such as a search engine. This type of searching is complex, can be resource intensive, and often depends on processes outside the scope of a given search request.

A more limited form of text search (and the topic of this document) is sub-string matching. One familiar form of sub-string matching is the find feature of browsers and other types of user agent. For user agents with physical keyboards, this functionality is often accessed via a key combination such as Cmd+F or Ctrl+F. Such a feature might be exposed on the Web via the API window.find, which is currently not fully standardized, or via capabilities such as the proposed [SCROLL-TO-TEXT-FRAGMENT].

Note

Find operations can provide optional mechanisms for improving or tailoring the matching behavior. Examples include the ability to add (or remove) case sensitivity, support for different aspects of a regular expression language such as wildcard characters, or the option to limit matches to whole words.

One way that sub-string matching usually differs from full-text search is that, while it might use various algorithms in an attempt to suppress or ignore textual variations, it usually does not produce matches that contain additional or unspecified character sequences, words, or phrases, such as would result from stemming or other NLP processes.

When attempting to standardize sub-string matching, specification authors often struggle with the complexity that is inherent in the encoding of natural language in computer systems, including the different mechanisms employed to encode characters in the [Unicode] standard.

Quite often, the user's input doesn't consist of exactly the same sequence of code points as that used in the document being searched, while the user still expects a match to occur. This can happen for a variety of reasons. Sometimes it is because the text being searched varies in ways the user could not have predicted. In other cases it is because the user's keyboard or input method does not provide ready access to the textual variations needed. It can even be because the user cannot be bothered to input the text accurately.

In this section, we examine various common cases known to us which specification authors need to take into consideration when specifying a sub-string match API or mechanism.

User expectations about whether their search term matches a given part of a document or corpus sometimes depend on the user's language, the language of the document, or both. They might also depend on other factors, such as which keyboards or input methods are available on a given device. This might be because various operations that are part of searching, such as case folding, are locale-affected, or because, given the complexity of human language and culture, expectations about matching or about the use and interpretation of various character sequences differ, even within a given script. Similarly, the handling of accents, alternate scripts, or character encoding (such as variations in the formation of grapheme clusters) is linked to the specific language of the text in question.

It is important to emphasize that we mean language here, and not script. Many different languages that share a script apply different processing or imply different expectations.

Implementations of a "find" feature often have to guess what language the user intended based solely on the user's input or on various "hints" in the runtime environment, such as the operating environment locale, the user agent's localization, or the language of the active keyboard. These hints are, at best, a proxy for the user's intent, particularly when the user is searching a document that doesn't match any of these or when the searched document contains more than one language.

Example 1: User language interaction with user expectations

Different languages treat the letter combinations a, ae, and ä differently. English speakers expect ae to be different from a and ä. Since ä is a foreign letter, they usually expect it to match the unmarked a. German speakers expect ae and ä to be equivalent (and different from a). Finnish speakers expect all three to be separate.

Now suppose you have a sentence in Finnish: Haen Han Solon. Hän on salakuljettaja.

(For the curious, this translates to: I’ll go get Han Solo. He is a smuggler.)

The above sentence is tagged as Finnish (lang="fi"). Notice that the letter "n" attached to the end of Han Solo's name (Han Solon) is a part of Finnish grammar.

Here are some spelling variations that speakers of English, German, and Finnish might enter when performing a "find" operation on the text. (Hint: Try them in the "find" command for your browser when viewing this page.)

Han
Hän
Haen
han
hän
haen

Finnish speakers expect that each of the above examples is a different word. They might expect the case variation between Hän and hän to be ignored. German speakers might expect that Hän and Haen are equivalent. English speakers might expect Han to match Hän (but perhaps not the reverse, since ä is not native to English). However, the language tagging of the document doesn't seem to affect most find operations. Neither is there usually a way for the user to affect which language is applied to the search term.
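The language-dependent expectations described above can be sketched as a folding step applied to both the search term and the text before comparison. The following is a minimal illustration only, not a description of what any browser does. The function name `fold_for_language` and its three rules are assumptions made for this example, and they cover only the letter ä.

```python
import unicodedata

def strip_marks(text: str) -> str:
    """Remove combining marks after canonical decomposition (ä -> a)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def fold_for_language(text: str, lang: str) -> str:
    """Hypothetical language-tailored fold for the a / ae / ä example."""
    text = unicodedata.normalize("NFC", text).lower()
    if lang == "de":
        # German: treat ä and ae as equivalent, both distinct from a.
        return text.replace("\u00e4", "ae")
    if lang == "en":
        # English: ä is unfamiliar, so fold it to the unmarked a.
        return strip_marks(text)
    # Finnish (and the fallback): a, ae, and ä stay distinct.
    return text

def matches(term: str, text: str, lang: str) -> bool:
    """Case-insensitive sub-string test under the given language's fold."""
    return fold_for_language(term, lang) in fold_for_language(text, lang)

sentence = "Haen Han Solon. Hän on salakuljettaja."
for lang in ("en", "de", "fi"):
    hits = [t for t in ("Han", "Hän", "Haen") if matches(t, "Hän", lang)]
    print(lang, "terms matching 'Hän':", hits)
```

Note that this sketch is symmetric, so under the English rule Hän also matches Han. Modelling the "perhaps not the reverse" expectation would require folding the text based on what the user actually typed.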

Here is a phrase that we believe means warm marrow in Turkish: ılık ilik.

Here are some spelling variations that English and Turkish speakers might enter:

Search Term	Code Points
ILIK	U+0049 U+004C U+0049 U+004B
İLİK	U+0130 U+004C U+0130 U+004B
ilik	U+0069 U+006C U+0069 U+006B
ılık	U+0131 U+006C U+0131 U+006B

Depending on your browser and runtime locale, you can get anomalous matching with these terms. In some browsers, the first three terms above consistently match ilik (with an ASCII dotted-i) but not the word ılık with ı [U+0131 LATIN SMALL LETTER DOTLESS I].

This is not what Turkish users would expect, since they expect "I"/"ı" and "İ"/"i" to be caseless pairs. A side-effect of this is that the search term "ılık" only matches its lowercase equivalent—and that the uppercase variations do not match that word, even when they match the lowercase version with dotted letter i ("ilik"). Such variation means that both English and Turkish users will notice that the search misses words.
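The difference between language-neutral and Turkish-tailored case folding can be illustrated with a short sketch. Python's built-in string methods are not locale-sensitive, so the Turkish pairing has to be applied explicitly. The function `turkish_fold` is a hypothetical helper written for this example, not a standard library API.

```python
def default_fold(text: str) -> str:
    """Language-neutral case folding as provided by str.casefold()."""
    return text.casefold()

def turkish_fold(text: str) -> str:
    """Lowercase using the Turkish pairs I/ı and İ/i (illustrative only)."""
    # Map the two capital letters explicitly before the generic lowercase
    # step, which would otherwise turn "I" into dotted "i".
    text = text.replace("\u0130", "i")   # İ -> i
    text = text.replace("I", "\u0131")   # I -> ı
    return text.lower()

terms = ["ILIK", "\u0130L\u0130K", "ilik", "\u0131l\u0131k"]
for term in terms:
    print(term, "| default:", default_fold(term), "| Turkish:", turkish_fold(term))
```

Under the default fold, ILIK becomes ilik and can never match ılık. Under the Turkish fold, ILIK becomes ılık and İLİK becomes ilik, which is the pairing Turkish users expect.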

A user might expect a term entered in lowercase to match uppercase equivalents (and perhaps vice-versa). Sub-string matching features, such as the browser "find" command, often offer a user-selectable option for matching (or not) the case of the input to that of the text.

For a survey of case folding, see the discussion in [CHARMOD-NORM].

Unicode defines canonical and compatibility relationships between characters which can impact user perceptions of string searching. For a detailed discussion of Unicode Normalization forms see Section 2.2 of [CHARMOD-NORM] as well as the definitions found in Unicode Normalization Forms [UAX15].

Example 2

For example, consider the letter "K". The characters with a normalization including U+004B LATIN CAPITAL LETTER K include the following, many of which might be expected to match a letter "K" in a sub-string search request by a user because they appear to contain a logical "letter K":

Ķ U+0136 LATIN CAPITAL LETTER K WITH CEDILLA
Ǩ U+01E8 LATIN CAPITAL LETTER K WITH CARON
ᴷ U+1D37 MODIFIER LETTER CAPITAL K
Ḱ U+1E30 LATIN CAPITAL LETTER K WITH ACUTE
Ḳ U+1E32 LATIN CAPITAL LETTER K WITH DOT BELOW
Ḵ U+1E34 LATIN CAPITAL LETTER K WITH LINE BELOW
K U+212A KELVIN SIGN
Ⓚ U+24C0 CIRCLED LATIN CAPITAL LETTER K
㎅ U+3385 SQUARE KB
㏍ U+33CD SQUARE KK
㏎ U+33CE SQUARE KM CAPITAL
Ｋ U+FF2B FULLWIDTH LATIN CAPITAL LETTER K
𝐊 U+1D40A MATHEMATICAL BOLD CAPITAL K
𝐾 U+1D43E MATHEMATICAL ITALIC CAPITAL K
𝑲 U+1D472 MATHEMATICAL BOLD ITALIC CAPITAL K
𝒦 U+1D4A6 MATHEMATICAL SCRIPT CAPITAL K
𝓚 U+1D4DA MATHEMATICAL BOLD SCRIPT CAPITAL K
𝔎 U+1D50E MATHEMATICAL FRAKTUR CAPITAL K
𝕂 U+1D542 MATHEMATICAL DOUBLE-STRUCK CAPITAL K
𝕶 U+1D576 MATHEMATICAL BOLD FRAKTUR CAPITAL K
𝖪 U+1D5AA MATHEMATICAL SANS-SERIF CAPITAL K
𝗞 U+1D5DE MATHEMATICAL SANS-SERIF BOLD CAPITAL K
𝘒 U+1D612 MATHEMATICAL SANS-SERIF ITALIC CAPITAL K
𝙆 U+1D646 MATHEMATICAL SANS-SERIF BOLD ITALIC CAPITAL K
𝙺 U+1D67A MATHEMATICAL MONOSPACE CAPITAL K
🄚 U+1F11A PARENTHESIZED LATIN CAPITAL LETTER K
🄺 U+1F13A SQUARED LATIN CAPITAL LETTER K
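As a rough illustration of how such relationships can be inspected, the sketch below applies compatibility decomposition (NFKD) to a few of the characters listed above and drops any combining marks. The helper name `base_letters` is an assumption made for this example. Whether a find feature should actually treat these characters as matching "K" is a design decision that this sketch does not settle.

```python
import unicodedata

def base_letters(ch: str) -> str:
    """Compatibility-decompose a character and drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

samples = ["\u0136", "\u212A", "\u24C0", "\u33CD", "\uFF2B", "\U0001D40A"]
for ch in samples:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), "->", base_letters(ch))
```

Some of these characters (such as U+212A KELVIN SIGN) are canonically equivalent to K, while others (such as U+33CD SQUARE KK) only have a compatibility decomposition, which can expand to more than one letter.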

In many complex scripts it is possible to encode letters or vowel-signs in more than one way, but the alternatives are canonically equivalent.

Some languages are written in more than one script. A user searching a document might type in text in one script, but wish to find equivalent text in both scripts.

Example 3

Japanese uses two syllabic scripts, hiragana and katakana. These scripts encode the same phonemes; thus the user might expect that typing in a search term in hiragana would find the exact same word spelled out in katakana.

In the example shown here, the word nihongo (Japanese for "Japanese") is shown in both hiragana and katakana. Note that this word is usually represented by kanji (Han ideograph) characters: 日本語.

Description	Example	Code points
Hiragana	にほんご	U+306B U+307B U+3093 U+3054
Katakana	ニホンゴ	U+30CB U+30DB U+30F3 U+30B4
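One way to make a search insensitive to the choice of kana is to fold katakana to hiragana before comparing. The sketch below relies on the fact that the main katakana letters (U+30A1 to U+30F6) sit exactly 0x60 code points above their hiragana counterparts. The function name is hypothetical, and the sketch deliberately ignores half-width katakana, the prolonged sound mark, and kanji spellings such as 日本語.

```python
KATAKANA_START = 0x30A1  # KATAKANA LETTER SMALL A
KATAKANA_END = 0x30F6    # KATAKANA LETTER SMALL KE
KANA_OFFSET = 0x60       # distance between the katakana and hiragana blocks

def katakana_to_hiragana(text: str) -> str:
    """Fold katakana letters to hiragana; leave everything else unchanged."""
    out = []
    for ch in text:
        cp = ord(ch)
        if KATAKANA_START <= cp <= KATAKANA_END:
            out.append(chr(cp - KANA_OFFSET))
        else:
            out.append(ch)
    return "".join(out)

print(katakana_to_hiragana("\u30CB\u30DB\u30F3\u30B4"))
```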

Some compatibility characters were encoded into Unicode to account for single- or multibyte representation in legacy character encodings or for compatibility with certain layout behaviors in East Asian languages.

Example 4: Examples of East Asian width variations

Description	Example	Code points
full-width katakana	ニホンゴ	U+30CB U+30DB U+30F3 U+30B4
half-width katakana (these are compatibility characters)	ﾆﾎﾝｺﾞ	U+FF86 U+FF8E U+FF9D U+FF7A U+FF9E
half-width Latin letters (these are ASCII letters)	abcXYZ	U+0061 U+0062 U+0063 U+0058 U+0059 U+005A
full-width Latin letters (these are compatibility characters)	ａｂｃＸＹＺ	U+FF41 U+FF42 U+FF43 U+FF38 U+FF39 U+FF3A
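Width variants are related by compatibility decomposition, so one possible approach is to apply NFKC normalization to both the search term and the text. The sketch below shows the effect on the examples above; the function name `fold_width` is an assumption for this example. NFKC also folds many other compatibility characters (ligatures, superscripts, and so on), which may or may not be desirable for a given search feature.

```python
import unicodedata

def fold_width(text: str) -> str:
    """Fold East Asian width variants (and other compatibility forms) via NFKC."""
    return unicodedata.normalize("NFKC", text)

half_width_kana = "\uFF86\uFF8E\uFF9D\uFF7A\uFF9E"        # half-width ni-ho-n-ko + voiced mark
full_width_latin = "\uFF41\uFF42\uFF43\uFF38\uFF39\uFF3A"  # full-width abcXYZ

print(fold_width(half_width_kana))
print(fold_width(full_width_latin))
```

Note that the half-width form uses five code points (the voiced sound mark is a separate character), while the normalized full-width form uses four.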

Many scripts have their own digit characters for the numbers from 0 to 9. In some Web applications, the familiar ASCII digits are replaced for display purposes with the local digit shapes. In other cases, the text actually might contain the Unicode characters for the local digits. Users attempting to search a document might expect that typing one form of digit will find the equivalent digits.

Example 5: Examples of digit shapes in four scripts

Here are some selected examples of different digit shapes, from zero to nine, in four scripts. Many scripts have equivalent sets of digits with distinct shapes.

Script	0	1	2	3	4	5	6	7	8	9
Latin	0	1	2	3	4	5	6	7	8	9
Gujarati	૦	૧	૨	૩	૪	૫	૬	૭	૮	૯
Thai	๐	๑	๒	๓	๔	๕	๖	๗	๘	๙
Arabic	٠	١	٢	٣	٤	٥	٦	٧	٨	٩
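A search feature that wants digits in different scripts to match could fold every decimal digit to its ASCII equivalent before comparison. The sketch below uses the decimal digit value that Unicode assigns to each such character; the function name `fold_digits` is hypothetical. It applies only when the text actually contains the local digit characters, not when ASCII digits are merely displayed with local shapes.

```python
import unicodedata

def fold_digits(text: str) -> str:
    """Replace any Unicode decimal digit with the matching ASCII digit."""
    out = []
    for ch in text:
        value = unicodedata.decimal(ch, None)  # None for non-digit characters
        out.append(str(value) if value is not None else ch)
    return "".join(out)

print(fold_digits("\u0AE7\u0AE8\u0AE9"))  # Gujarati digits
print(fold_digits("\u0E54\u0E55"))        # Thai digits
print(fold_digits("\u0660\u0661\u0662"))  # Arabic-Indic digits
```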

Some languages have different orthographic traditions that vary by region or dialect or allow different spellings of the same word. Searches and spell-checking may need to know about these variations.

Indic script languages have many instances of this kind of problem. Sometimes these are spelling errors, but in other cases multiple spellings are acceptable.

For example, the Bengali language (language tag bn) is notorious for having a wide range of spelling variations permitted by the language: nearly 80% of Bengali words have at least two spellings. Many words have 3, 4, or more variations—with at least one word having 16 different valid spellings.

Example 7

One example is the word which transliterates to the Latin script as rani, but which users may spell with different letters and vowel marks. In modern Bengali ণ [U+09A3 BENGALI LETTER NNA] and ন [U+09A8 BENGALI LETTER NA] are pronounced /n/, and ি [U+09BF BENGALI VOWEL SIGN I] and ী [U+09C0 BENGALI VOWEL SIGN II] are both pronounced /i/. Therefore different users might choose any of the following alternative code point sequences for the same word:

	U+09A8 BENGALI LETTER NA	U+09A3 BENGALI LETTER NNA
U+09BF BENGALI VOWEL SIGN I	রানি	রাণি
U+09BF BENGALI VOWEL SIGN I	U+09B0 U+09BE U+09A8 U+09BF	U+09B0 U+09BE U+09A3 U+09BF
U+09C0 BENGALI VOWEL SIGN II	রানী	রাণী
U+09C0 BENGALI VOWEL SIGN II	U+09B0 U+09BE U+09A8 U+09C0	U+09B0 U+09BE U+09A3 U+09C0
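One conceivable way to make the four spellings above find each other is to fold the interchangeable letters to a single representative before comparing. The sketch below is an assumption-laden illustration: the mapping table `BENGALI_SEARCH_FOLD` covers only the two pairs mentioned in this example, and applying such a fold broadly could conflate words that readers consider distinct.

```python
# Hypothetical search fold covering only the pairs discussed above.
BENGALI_SEARCH_FOLD = str.maketrans({
    "\u09A3": "\u09A8",  # LETTER NNA -> LETTER NA
    "\u09C0": "\u09BF",  # VOWEL SIGN II -> VOWEL SIGN I
})

def fold_bengali(text: str) -> str:
    """Fold the listed spelling alternatives to one representative form."""
    return text.translate(BENGALI_SEARCH_FOLD)

spellings = [
    "\u09B0\u09BE\u09A8\u09BF",
    "\u09B0\u09BE\u09A3\u09BF",
    "\u09B0\u09BE\u09A8\u09C0",
    "\u09B0\u09BE\u09A3\u09C0",
]
print({fold_bengali(s) for s in spellings})  # a single folded form
```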

Other Indic scripts provide alternative mechanisms for representing particular sounds, and in most cases either representation is considered equally valid. The most common instance of this involves representation of syllable-final nasals.

For example, the /n/ sound in the word for snake in Hindi can be written using either ँ [U+0901 DEVANAGARI SIGN CANDRABINDU] or ं [U+0902 DEVANAGARI SIGN ANUSVARA]. Both of the following are valid spellings:

Example 8

Description	Example
With ँ [U+0901 DEVANAGARI SIGN CANDRABINDU]	साँप
With ँ [U+0901 DEVANAGARI SIGN CANDRABINDU]	U+0938 U+093E U+0901 U+092A
With ं [U+0902 DEVANAGARI SIGN ANUSVARA]	सांप
With ं [U+0902 DEVANAGARI SIGN ANUSVARA]	U+0938 U+093E U+0902 U+092A

In an additional twist to this story, two diacritics with different code points could be used here. In our previous example we used ं [U+0902 DEVANAGARI SIGN ANUSVARA] to represent the nasal sound because the accompanying vowel-sign rises above the hanging baseline. If the vowel-sign were one that didn't rise above the hanging baseline, we would normally use ँ [U+0901 DEVANAGARI SIGN CANDRABINDU] instead. The function of both of these diacritics is the same, but their code points are different.

The alternative use of either a letter or a diacritic for syllable-final nasals is common to several other Indian languages. In addition to Devanagari, used to write languages such as Hindi (language tag hi) or Marathi (language tag mr), scripts such as Malayalam, Gujarati, Odia, and others provide similar spelling options.

Example 9: Example of another Indic script spelling variation

Here is an example from Malayalam (ml) showing alternative spellings of the same word.

Description	Example
with U+0D03 MALAYALAM SIGN VISARGA	ദുഃഖം
with U+0D03 MALAYALAM SIGN VISARGA	U+0D26 U+0D41 U+0D03 U+0D16 U+0D02
without U+0D03 MALAYALAM SIGN VISARGA	ദുഖം
without U+0D03 MALAYALAM SIGN VISARGA	U+0D26 U+0D41 U+0D16 U+0D02

Some languages use whitespace to separate words, sentences, or paragraphs while others do not. When performing sub-string matching, different forms of whitespace found in [Unicode] must be normalized so that the match succeeds.
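A minimal sketch of whitespace normalization follows. It collapses any run of Unicode whitespace into a single U+0020 SPACE, relying on the fact that `\s` in Python's `re` module matches Unicode whitespace for `str` patterns. The function name is an assumption for this example, and characters such as U+200B ZERO WIDTH SPACE, which Unicode does not classify as whitespace, are not affected.

```python
import re

_WHITESPACE_RUN = re.compile(r"\s+")

def normalize_whitespace(text: str) -> str:
    """Collapse runs of Unicode whitespace to one ASCII space and trim the ends."""
    return _WHITESPACE_RUN.sub(" ", text).strip()

# NO-BREAK SPACE, IDEOGRAPHIC SPACE, and EM SPACE all become U+0020.
print(normalize_whitespace("a\u00A0b\u3000\u2003c"))
```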

Users will sometimes vary their input when dealing with letters that contain accents or diacritic marks when entering search terms in scripts (such as the Latin script) that use various diacritics, even though the text they are searching includes the additional marks. This is particularly true on mobile keyboards, where input of these characters can require additional effort. In these cases, users generally expect the search operation to be more "promiscuous" to make up for their failure to make the additional effort needed.

Example 11

German uses several letters that have an umlaut accent, such as ö [U+00F6 LATIN SMALL LETTER O WITH DIAERESIS] or ü [U+00FC LATIN SMALL LETTER U WITH DIAERESIS]. Users sometimes will enter these accents when searching, but sometimes they replace the umlauts with the letter e. For example, instead of entering Dürst they might enter Duerst. Either spelling is recognizable and has the same meaning. The umlauts are probably "better" than the e spelling, but German speakers are not confused by the difference.

Note

Other languages use these same characters for a different purpose than German does. The formal name of the "umlaut" diacritic in Unicode is diaeresis, which means approximately "break" or "pause". Languages such as French, Spanish, and English occasionally use the diaeresis to indicate the need to pronounce a specific letter, such as the word "ambigüedad" in Spanish or a name like "Zoë" in English.

This effect might vary depending on context as well. For example, a person using a physical keyboard may have direct access to accented letters, while a virtual or on-screen keyboard may require extra effort to access and select the same letters.
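A common way to make matching tolerant of omitted accents is to decompose the text and discard combining marks. The sketch below shows this under the assumption that all marks should be ignored; the function names are hypothetical. It does not handle the German "ue" spelling, and letters that have no canonical decomposition, such as ø, pass through unchanged.

```python
import unicodedata

def strip_marks(text: str) -> str:
    """Canonically decompose, then drop all combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def accent_insensitive_contains(term: str, text: str) -> bool:
    """Sub-string test that ignores accents on both sides."""
    return strip_marks(term) in strip_marks(text)

print(strip_marks("D\u00FCrst"))                              # accents removed
print(accent_insensitive_contains("Durst", "D\u00FCrst"))     # tolerant match
print(accent_insensitive_contains("Duerst", "D\u00FCrst"))    # "ue" is not covered
```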

In some orthographies it is necessary to match strings with different numbers of characters.

A prime example of this involves vowel diacritics in abjads. For example, some languages that use the Arabic and Hebrew scripts do not require (but optionally allow) the user to input short vowels. (For some other languages in these scripts, the inclusion of the short vowels is not optional.) The presence or absence of vowels in the text being input or searched might impede a match if the user doesn't enter or know to enter them.
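For languages where the short vowels are optional, one possible approach is to remove the vowel marks from both the search term and the text before comparing. The sketch below is illustrative only: the two code point ranges are assumptions chosen for this example and are not exhaustive, and the approach is inappropriate for languages in which these marks are required.

```python
# Assumed, non-exhaustive ranges of optional vowel marks.
ARABIC_TASHKIL = range(0x064B, 0x0653)  # ARABIC FATHATAN .. ARABIC SUKUN
HEBREW_POINTS = range(0x05B0, 0x05BD)   # HEBREW POINT SHEVA .. DAGESH OR MAPIQ

def strip_optional_vowels(text: str) -> str:
    """Drop Arabic and Hebrew vowel marks that fall in the assumed ranges."""
    return "".join(
        ch for ch in text
        if ord(ch) not in ARABIC_TASHKIL and ord(ch) not in HEBREW_POINTS
    )

vocalized = "\u0643\u064E\u062A\u064E\u0628\u064E"  # kaf, teh, beh, each with fatha
print(len(vocalized), len(strip_optional_vowels(vocalized)))
```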

In some cases, visually similar or identical glyph patterns can be made from different sequences of code points. Sometimes this is intentional and variations can be removed via Unicode normalization. But there are other cases in which similar-appearing graphemes are not made the same by normalization, and they are not semantically equivalent.

Example 13

For example, here are a number of character sequences that produce the same or similar textual appearance in the Malayalam script. The inappropriate sequences should be avoided because they will cause the meaning of the text to change: searches, matching and other aspects of the text will fail to be understood by the application or the font. In some cases, fonts will indicate that there is a problem by forcing the appearance of a dotted circle or otherwise failing to render the text correctly, but this may not always be the case.

Use	Do not use
ൈ	െെ
[U+0D48 MALAYALAM VOWEL SIGN AI]	[U+0D46 MALAYALAM VOWEL SIGN E + U+0D46 VOWEL SIGN E]
ഈ	ഇൗ
[U+0D08 MALAYALAM LETTER II]	[U+0D07 MALAYALAM LETTER I + U+0D57 AU LENGTH MARK]
ഊ	ഉൗ
[U+0D0A MALAYALAM LETTER UU]	[U+0D09 MALAYALAM LETTER U + U+0D57 AU LENGTH MARK]
ഓ	ഒാ
[U+0D13 MALAYALAM LETTER OO]	[U+0D12 MALAYALAM LETTER O + U+0D3E VOWEL SIGN AA]
ഐ	എെ
[U+0D10 MALAYALAM LETTER AI]	[U+0D0E MALAYALAM LETTER E + U+0D46 VOWEL SIGN E]
ഔ	ഒൗ
[U+0D14 MALAYALAM LETTER AU]	[U+0D12 MALAYALAM LETTER O + U+0D57 MALAYALAM AU LENGTH MARK]

Some languages which use the Arabic script also have graphemes which can be encoded in more than one way. In some cases, these variations are handled by Unicode Normalization, but in other cases they are not considered equivalent by Unicode, even if they appear visually to be identical. Sometimes these variations are considered to be valid spelling variations. In other cases they are the result of a user's mistaken perception.

Example 14

A number of languages are written in the Arabic script but are unrelated to the Arabic language. Some of these languages therefore require character sequences to represent sounds not present in Arabic. A significant problem for some of these languages is that these specially-encoded character sequences can be visually similar (or identical) to character sequences encoded for other uses, and users may experience difficulty entering or knowing how to enter the correct sequence, such as when inputting a search term.

One such language is Kashmiri (language tag ks). Here are some selected examples one might find in Kashmiri:

Description	Examples
Canonically equivalent alternatives (differences resolved by Unicode Normalization)	إ	`U+0625 ARABIC LETTER ALEF WITH HAMZA BELOW`	إ	`U+0627 ARABIC LETTER ALEF` + `U+0655 ARABIC HAMZA BELOW`
Not canonically equivalent (differences that remain after Unicode Normalization). Many of these are linked to user perception of whether the vowel is part of the base letter (ijam) vs. separable (tashkil).	ێ	`U+06CE ARABIC LETTER YEH WITH SMALL V`	یٚ	`U+06CC ARABIC LETTER FARSI YEH` + `U+065A ARABIC VOWEL SIGN SMALL V ABOVE`
Confusables or spelling errors. These can be common in certain kinds of text due to gaps in keyboard support or due to a similarity in appearance.	ئ	`U+0626 ARABIC LETTER YEH WITH HAMZA ABOVE`	یٔ	`U+06CC ARABIC LETTER FARSI YEH` + `U+0654 ARABIC HAMZA ABOVE`

(For more information, see Richard Ishida's doc here.)
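Whether two sequences are canonically equivalent can be checked by comparing their normalized forms. The sketch below applies such a check to sequences from the Kashmiri and Malayalam examples above; the function name `canonically_equivalent` is an assumption for this example. It shows that normalization resolves only the first Kashmiri case.

```python
import unicodedata

def canonically_equivalent(a: str, b: str) -> bool:
    """True if the two strings have the same canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

pairs = [
    ("\u0625", "\u0627\u0655"),  # ALEF WITH HAMZA BELOW vs. ALEF + HAMZA BELOW
    ("\u06CE", "\u06CC\u065A"),  # YEH WITH SMALL V vs. FARSI YEH + SMALL V ABOVE
    ("\u0626", "\u06CC\u0654"),  # YEH WITH HAMZA ABOVE vs. FARSI YEH + HAMZA ABOVE
    ("\u0D48", "\u0D46\u0D46"),  # Malayalam VOWEL SIGN AI vs. two VOWEL SIGN E
]
for a, b in pairs:
    print([f"U+{ord(c):04X}" for c in a], [f"U+{ord(c):04X}" for c in b],
          canonically_equivalent(a, b))
```

A search feature that wants the remaining pairs to match would need additional, language-specific folding beyond Unicode normalization.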

Some languages, such as English or Arabic, use spaces between words. Other languages, such as Chinese, Japanese, or Thai, do not. Some languages use spaces to separate other text units, such as phrases. In those languages that do not use spaces between words, computing "whole word" matching often depends on the ability to determine word boundaries when the boundaries are not themselves encoded into the text.
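The limitation can be seen with a simple regular-expression approach to "whole word" matching. The sketch below uses the `\b` word-boundary assertion from Python's `re` module; the function name is hypothetical. It works for space-separated text, but because `\b` only recognizes transitions between word and non-word characters, it finds no boundary inside an unspaced run of Japanese text.

```python
import re

def whole_word_find(term: str, text: str) -> list[int]:
    """Return start offsets where term occurs bounded by regex word boundaries."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b")
    return [m.start() for m in pattern.finditer(text)]

print(whole_word_find("Han", "Haen Han Solon"))    # found as a separate word
print(whole_word_find("Solo", "Haen Han Solon"))   # only part of "Solon"
# The word 日本語 is followed directly by の, so no boundary is detected.
print(whole_word_find("\u65E5\u672C\u8A9E", "\u65E5\u672C\u8A9E\u306E\u30C6\u30AD\u30B9\u30C8"))
```

Handling such text generally requires a word segmenter, often dictionary-based, rather than a character-class test.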

User Input	Matched Strings
e (lowercase 'e')	"re-resume", "RE-RESUME", "re-résumé", and "RE-RÉSUMÉ"
E (uppercase 'E')	"RE-RESUME" and "RE-RÉSUMÉ"
é (lowercase 'e' with acute accent)	"re-résumé" and "RE-RÉSUMÉ"
É (uppercase 'E' with acute accent)	"RE-RÉSUMÉ"
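The table above describes a matching behavior in which the user's input determines how strict the comparison is: lowercase input ignores case, and unaccented input ignores accents. A minimal sketch of that behavior follows. The function names are assumptions for this example, and the sketch compares strings one code point at a time after NFC normalization, which is a simplification.

```python
import unicodedata

def _strip_marks(ch: str) -> str:
    decomposed = unicodedata.normalize("NFD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def char_matches(query_ch: str, text_ch: str) -> bool:
    """Compare one text character against one query character."""
    candidate = text_ch
    if _strip_marks(query_ch) == query_ch:
        # Unaccented input: ignore accents in the text.
        candidate = _strip_marks(candidate)
    if query_ch == query_ch.lower():
        # Lowercase input: ignore case in the text.
        candidate = candidate.lower()
    return candidate == query_ch

def asymmetric_find(query: str, text: str) -> bool:
    """True if query matches some sub-string of text under the rules above."""
    q = unicodedata.normalize("NFC", query)
    t = unicodedata.normalize("NFC", text)
    n = len(q)
    return any(
        all(char_matches(qc, tc) for qc, tc in zip(q, t[i:i + n]))
        for i in range(len(t) - n + 1)
    )

words = ["re-resume", "RE-RESUME", "re-r\u00E9sum\u00E9", "RE-R\u00C9SUM\u00C9"]
for query in ("e", "E", "\u00E9", "\u00C9"):
    print(query, [w for w in words if asymmetric_find(query, w)])
```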
