String Manipulation
Submitted by smoverton on Thu, 09/27/2007 - 18:46.
Although JavaScript strings are specified as sequences of UTF-16 code units, the language does not provide enough functions for manipulating Unicode text. Developing a fully Unicode-compatible JavaScript library like ICU is not practical, because some Unicode algorithms, such as normalization and collation, are too complicated to run in a scripting language. This chapter shows what you can do and what you should not do with strings in JavaScript. If you need full support for Unicode-compatible string manipulation, you must use the ICU library (ICU4C or ICU4J) on the server side instead of JavaScript.
window.escape()
window.unescape()
window.encodeURI()
window.decodeURI()
window.encodeURIComponent()
window.decodeURIComponent()
I18N Guidelines for String Manipulation
- You should use the Js2Xlf tool to convert JSON files into XLIFF files for translation.
- You should deal with free text using the ICU library on the server side.
- You should use casing functions only in locale-neutral situations.
- You should not use locale sensitive casing functions provided by JavaScript.
- You should always escape a string as a whole rather than character by character.
- You should not use any comparing, searching, or replacing functions for strings that might contain combining character sequences.
- You should not use inserting, removing, or splitting functions for strings that might contain special characters.
- You should not use trimming functions for strings that might contain special characters.
- You should not use counting functions for strings that might contain special characters.
- You must check the return value of String.charAt() when the string contains surrogates.
Special Kinds of Characters
Two kinds of special characters in Unicode can cause problems when a string is manipulated:
- Combining Character Sequence: a combining character sequence starts with a normal character followed by one or more combining marks. For example, "A" followed by U+0300 (a combining grave accent) is displayed as "À". A combining character sequence should be treated as one character.
- Surrogate Character: surrogate characters must appear in pairs, a high surrogate (U+D800 to U+DBFF) followed by a low surrogate (U+DC00 to U+DFFF). For example, the pair "U+D840 U+DC00" represents U+20000, a rare Chinese character. A surrogate pair must be treated as one character.
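To make the two cases concrete, the following sketch lists the UTF-16 code units that JavaScript stores for each kind of character. The helper name codeUnits is made up for this illustration, and the sample strings assume "A" followed by the combining grave accent U+0300 and the surrogate pair U+D840 U+DC00.

```javascript
// Hypothetical helper: list the UTF-16 code units of a string as "U+XXXX".
function codeUnits(s) {
  var units = [];
  for (var i = 0; i < s.length; i++) {
    var hex = s.charCodeAt(i).toString(16).toUpperCase();
    while (hex.length < 4) {
      hex = "0" + hex;
    }
    units.push("U+" + hex);
  }
  return units;
}

var combining = "A\u0300";      // "A" followed by a combining grave accent
var surrogate = "\uD840\uDC00"; // one character stored as a surrogate pair

console.log(codeUnits(combining).join(" ")); // U+0041 U+0300
console.log(codeUnits(surrogate).join(" ")); // U+D840 U+DC00
```

Each sample is one character to the reader, yet JavaScript stores and indexes it as two separate 16-bit units.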
Basic String Manipulations
There are several basic string manipulations. Some of them are locale-sensitive, which means that they might produce different results in different locales:
- Transformation (also called conversion): to transform a string into a specified form, for example, casing and escaping. Casing is a locale-sensitive operation, because some characters have different casing results in different locales. For example, the Latin letter "i" is uppercased to "İ" (a dotted "I") in the Turkish language.
- Recognition: to recognize special characters in strings, for example, white-space characters.
- Join: to concatenate two strings into one.
- Division: to separate one string into two substrings from a certain position.
- Enumeration: to count characters in a string, for example, getting the string's length.
- Comparison: to compare one string with another, for example, searching and sorting. Comparison is locale-sensitive.
Possible Problems
The following table shows the possible problems that might occur when you perform a specific manipulation on strings that contain special characters:
| Manipulation | Ordinary Character | Combining Character Sequence | Surrogate Character |
| --- | --- | --- | --- |
| Conversion | Locale Sensitive (Unsolvable) | Locale Sensitive (Unsolvable) | Locale Sensitive (Unsolvable) |
| Recognition | Unicode Specific (Unsolvable) | Unicode Specific (Unsolvable) | Unicode Specific (Unsolvable) |
| Join | (No Problem) | (No Problem) | (No Problem) |
| Division | (No Problem) | Broken Combining Character (Unsolvable) | Invalid Code Point (Solvable) |
| Enumeration | (No Problem) | Wrong Character Number (Unsolvable) | Wrong Character Number (Solvable) |
| Comparison | (No Problem) | No Canonical Comparison (Unsolvable) | (No Problem) |
- Locale Sensitive: JavaScript has a pair of locale-sensitive casing functions, String.toLocaleUpperCase and String.toLocaleLowerCase, but they only honor the default locale of the client's operating system and cannot be controlled by Web applications.
- Unicode Specific: JavaScript might not recognize all Unicode characters that have a specific property. For example, some white-space characters (such as the no-break space, U+00A0) are not removed in some browsers when the string is trimmed in JavaScript:
alert("\u00A0".replace(/\s/g, "").length); // 0 (in Firefox), 1 (in IE)
- Broken Combining Character: a combining character sequence might be broken into meaningless characters. For example, when you replace every "A" in a string with "B", every "À" stored as a combining character sequence becomes "B̀", which is obviously not what you want. This kind of problem cannot be solved in JavaScript.
- Invalid Code Point: a pair of surrogates can be separated into unpaired surrogates, which are invalid code points in Unicode. This can be a very serious problem, because invalid code points might crash some browsers (Safari 2.0.3 stops responding when you try to select such invalid characters). Fortunately, unlike the other kinds of problems, this one can be solved in JavaScript, because the surrogate range is simple to check (from U+D800 to U+DFFF). You must make sure that no unpaired surrogates appear in the result string.
- Wrong Character Number: because of combining character sequences and surrogate characters, one meaningful character can consist of several JavaScript characters, each of which is a 16-bit integer. You want the number of meaningful characters, but you get only the number of 16-bit integers. This kind of problem cannot be solved in JavaScript, unless it is caused by surrogate characters.
- No Canonical Comparison: JavaScript cannot perform canonical comparison on strings that are in different normalization forms. Although the ECMAScript Specification v3 recommends that canonical comparison be supported by the String.localeCompare(str) function, no browser actually implements this feature.
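The four character-related problems can be reproduced with a few lines of script. This is only an illustrative sketch; the variable names are arbitrary, and the sample strings assume "À" in its precomposed form (U+00C0), the same letter as "A" plus the combining grave accent U+0300, and the surrogate pair U+D840 U+DC00.

```javascript
var precomposed = "\u00C0";  // "À" as a single code unit
var decomposed = "A\u0300";  // "A" followed by a combining grave accent
var pair = "\uD840\uDC00";   // one character stored as a surrogate pair

// Broken Combining Character: the base letter is replaced, the accent stays.
var broken = decomposed.replace(/A/g, "B");
console.log(broken === "B\u0300"); // true

// Invalid Code Point: cutting inside the pair leaves an unpaired surrogate.
var half = pair.substring(0, 1);
console.log(half === "\uD840"); // true

// Wrong Character Number: each string is one meaningful character.
console.log(decomposed.length); // 2
console.log(pair.length);       // 2

// No Canonical Comparison: canonically equivalent strings compare as different.
console.log(precomposed === decomposed); // false
```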
You should deal with free text using the ICU library on the server side.
The only operation that is safe for all strings is joining. Before you manipulate a string, you need to know what kinds of characters are allowed in it and whether the manipulation is locale-sensitive. If the string accepts combining character sequences or surrogate characters, or if the manipulation is locale-sensitive (like casing), send the string to the server and process it there with the ICU library. For example, your application might need to handle the following kinds of strings:
- Restricted Identifier: most identifiers, such as the user's login name, are restricted to letters and numbers, so you can manipulate them in JavaScript freely and safely.
- Program Information: program information is a string that is used only by your application code, for example, enumeration item names, program commands, and statements. Make sure that no combining character sequence or surrogate character is included in your program information. Then you can manipulate it without potential problems.
- URL or E-mail Address: URLs and e-mail addresses might contain almost any Unicode character. You should deal with them on the server side.
- Free Input Text: text entered freely by users should be manipulated on the server side.
String Manipulating Functions in JavaScript and Dojo
You should use casing functions only in locale-neutral situations.
Some casing functions perform only locale-neutral casing conversion. You should use them only for program information that is not displayed to end users directly. These functions include:
- String.toLowerCase()
- String.toUpperCase()
You should not use locale sensitive casing functions provided by JavaScript.
Other casing functions perform locale-sensitive casing conversion using the default locale of the client's operating system. You should never use them, for two reasons. First, browsers are not guaranteed to implement these functions exactly as defined in the latest Unicode Standard. Second, your Web application cannot control the output: an end user would have to change the default locale of the operating system to change the locale of your Web application, which is almost never acceptable. These functions include:
- String.toLocaleLowerCase()
- String.toLocaleUpperCase()
You should always escape a string as a whole rather than character by character.
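The following sketch shows why this matters for strings that contain surrogate pairs. It relies on the standard behavior of encodeURIComponent(), which throws a URIError when it meets an unpaired surrogate. The helper escapeByCharacter is hypothetical and exists only to demonstrate the mistake.

```javascript
var text = "a\uD840\uDC00b"; // "a", one surrogate pair, "b"

// Escaping the whole string: the pair is encoded as one UTF-8 sequence.
var whole = encodeURIComponent(text);
console.log(whole); // a%F0%A0%80%80b

// Hypothetical anti-pattern: escaping one UTF-16 code unit at a time.
function escapeByCharacter(s) {
  var out = "";
  for (var i = 0; i < s.length; i++) {
    out += encodeURIComponent(s.charAt(i));
  }
  return out;
}

// Each half of the pair is an unpaired surrogate, so the call fails.
var failed = false;
try {
  escapeByCharacter(text);
} catch (e) {
  failed = e instanceof URIError;
}
console.log(failed); // true
```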
You should not use any comparing, searching, or replacing functions for strings that might contain combining character sequences.
The functions in this section are subject to the Comparison and Division problems when they are applied to free text. They might replace the wrong characters or return invalid strings or characters, and they cannot perform canonical comparison. Use them only for program information without special characters. These functions include:
- String.indexOf()
- String.lastIndexOf()
- String.localeCompare()
- String.match()
- String.replace()
- String.search()
- dojo.string.substitute()
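One way to enforce the "program information only" rule is to check a string before passing it to these functions. The sketch below is an assumption-laden illustration, not part of Dojo: the names isPlainProgramString and replaceInProgramString are hypothetical, and the check covers only surrogate code units and the Combining Diacritical Marks block (U+0300 to U+036F), not every combining mark in Unicode.

```javascript
// Hypothetical guard: true when s contains no surrogate code units and no
// characters from the Combining Diacritical Marks block (U+0300-U+036F).
// Assumption: combining marks in other Unicode blocks are NOT detected.
function isPlainProgramString(s) {
  return !/[\u0300-\u036F\uD800-\uDFFF]/.test(s);
}

// Hypothetical wrapper: replace every occurrence of a literal substring,
// but refuse strings that should be handled on the server instead.
function replaceInProgramString(s, from, to) {
  if (!isPlainProgramString(s)) {
    throw new Error("String contains special characters; process it on the server.");
  }
  return s.split(from).join(to);
}

console.log(replaceInProgramString("CMD_OPEN,CMD_CLOSE", "CMD_", "")); // OPEN,CLOSE
console.log(isPlainProgramString("A\u0300"));      // false
console.log(isPlainProgramString("\uD840\uDC00")); // false
```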
You should not use inserting, removing, or splitting functions for strings that might contain special characters.
The functions in this section are subject to the Division problem when they are applied to free text. They might return invalid strings or characters. Use them only for program information without special characters. These functions include:
- String.slice()
- String.split()
- String.substring()
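For the surrogate half of the Division problem, a boundary check before cutting is enough. The following is a minimal sketch under stated assumptions: the helper names are hypothetical, the caller passes 0 <= start <= end <= s.length, and a boundary that falls inside a pair is moved outward so that the whole pair is kept. It does nothing for combining character sequences, which can still be broken.

```javascript
function isHighSurrogate(code) {
  return code >= 0xD800 && code <= 0xDBFF;
}

function isLowSurrogate(code) {
  return code >= 0xDC00 && code <= 0xDFFF;
}

// Hypothetical helper: true if cutting s at index pos would split a pair.
function splitsSurrogatePair(s, pos) {
  return pos > 0 && pos < s.length &&
    isHighSurrogate(s.charCodeAt(pos - 1)) &&
    isLowSurrogate(s.charCodeAt(pos));
}

// Like s.substring(start, end), but a boundary that falls inside a
// surrogate pair is moved outward so the pair stays intact.
function safeSubstring(s, start, end) {
  if (splitsSurrogatePair(s, start)) {
    start -= 1;
  }
  if (splitsSurrogatePair(s, end)) {
    end += 1;
  }
  return s.substring(start, end);
}

var text = "a\uD840\uDC00b";
console.log(safeSubstring(text, 0, 2) === "a\uD840\uDC00"); // true
console.log(safeSubstring(text, 2, 4) === "\uD840\uDC00b"); // true
```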
You should not use trimming functions for strings that might contain special characters.
Trimming functions are subject to the Recognition problem: they might not recognize all white-space characters defined by Unicode. Use them only for program information without special characters. These functions include:
- dojo.string.trim()
- dojo.trim()
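If a restricted string must be trimmed identically in every browser, one option is to spell out the white-space characters instead of relying on the browser's \s class. The sketch below is not part of Dojo. The name trimExplicit is hypothetical, and the list of code points is an assumption that should be checked against the Unicode Character Database before use. It narrows the browser differences but does not solve the general Recognition problem.

```javascript
// Assumed list of Unicode white-space code points (verify before relying on it).
var WHITE_SPACE = "[\\u0009-\\u000D\\u0020\\u0085\\u00A0\\u1680" +
  "\\u2000-\\u200A\\u2028\\u2029\\u202F\\u205F\\u3000]";

// Matches a run of listed characters at the start or at the end of a string.
var TRIM_PATTERN = new RegExp("^" + WHITE_SPACE + "+|" + WHITE_SPACE + "+$", "g");

// Hypothetical helper: trim using the explicit list above instead of \s.
function trimExplicit(s) {
  return s.replace(TRIM_PATTERN, "");
}

console.log(trimExplicit("\u00A0 id42\u3000") === "id42"); // true
console.log(trimExplicit("a b") === "a b");                // true
```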
You should not use counting functions for strings that might contain special characters.
Functions for counting characters return the number of UTF-16 code units rather than the number of meaningful characters. These functions include:
- String.length
- dojo.string.pad()
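The surrogate part of the Wrong Character Number problem is solvable by counting a well-formed pair as one character. The following is a minimal sketch; the function name countCodePoints is hypothetical, and an unpaired surrogate is counted as one character by assumption. Combining character sequences are still counted mark by mark.

```javascript
// Hypothetical helper: count code points, treating a high surrogate that is
// followed by a low surrogate as a single character.
function countCodePoints(s) {
  var count = 0;
  for (var i = 0; i < s.length; i++) {
    var code = s.charCodeAt(i);
    if (code >= 0xD800 && code <= 0xDBFF && i + 1 < s.length) {
      var next = s.charCodeAt(i + 1);
      if (next >= 0xDC00 && next <= 0xDFFF) {
        i++; // skip the low surrogate of this pair
      }
    }
    count++;
  }
  return count;
}

console.log("a\uD840\uDC00b".length);           // 4
console.log(countCodePoints("a\uD840\uDC00b")); // 3
console.log(countCodePoints("A\u0300"));        // 2 (the combining mark still counts)
```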
You must check the return value of String.charAt() when the string contains surrogates.
The String.charAt() function might return an unpaired surrogate, which is invalid. You should always check the returned value. For example, the following function returns the complete character at a position (both halves of a surrogate pair if necessary), or an empty string if the position is out of range or the surrogate is unpaired:
function codePointAt(s, pos) {
  var c = s.charAt(pos);
  if (c == "") {
    return c; // pos is out of range
  }
  var code = c.charCodeAt(0);
  if (code >= 0xD800 && code <= 0xDBFF) {
    // High (lead) surrogate detected: the low surrogate must follow it.
    if (pos < s.length - 1) {
      code = s.charCodeAt(pos + 1);
      return code < 0xDC00 || code > 0xDFFF ? "" : s.substring(pos, pos + 2);
    } else {
      return ""; // error in the string
    }
  } else if (code >= 0xDC00 && code <= 0xDFFF) {
    // Low (trail) surrogate detected: the high surrogate must precede it.
    if (pos > 0) {
      code = s.charCodeAt(pos - 1);
      return code < 0xD800 || code > 0xDBFF ? "" : s.substring(pos - 1, pos + 1);
    } else {
      return ""; // error in the string
    }
  }
  return c; // ordinary character
}
Need to define terms
Perhaps using hyperlinks to external content.
ICU, ICU4C, ICU4J, XLIFF, Js2Xlf (not yet available in Dojo; we could point at the trac ticket, I suppose?)
JS2XLF (Java application) link
JS2XLF (java tool) is included in trac for ticket #4155.
http://trac.dojotoolkit.org/ticket/4155