Understanding How to Work with Bidirectional (BiDi) Text

I’m a native Hebrew speaker. Hebrew, like Arabic and several other languages, is a Right-to-Left (abbreviated ‘RTL’) language, meaning that it is written from right to left. Translators and other content creators working with bidirectional (abbreviated ‘BiDi’) content, i.e. text containing both Left-to-Right (abbreviated ‘LTR’) and Right-to-Left characters and segments, face some unique challenges when it comes to formatting the text correctly. The development of the Unicode Bidirectional Algorithm and the growing support that it has enjoyed in the last decade or so have significantly improved how software handles BiDi content. Today all modern operating systems, desktop and mobile alike, support BiDi virtually out of the box. Usually there is still a setting or two to tweak, but long gone are the days of installing awkward language packs and/or workarounds to overcome or mask some of the technical limitations of BiDi support.

Despite these advancements, some issues still remain. The algorithm is not perfect and although it makes life much easier, the difference in directionality is still a concern and something to keep in mind when working with BiDi content. The algorithm should be perceived as a mechanism that lays down the technical foundations upon which the BiDi content is built. It is largely the responsibility of the content creators to understand how the BiDi algorithm works and use best practices when preparing the content.

In this article I will attempt to explain how the BiDi algorithm parses and handles the directionality of text, point out its shortcomings, and describe some of the most common BiDi issues and how to solve them.

This article focuses on BiDi issues in word processor and Translation Environment Tool (abbreviated ‘TEnT’) environments, but the principles and solutions described here apply globally. For details about the corresponding terminology in a plain text or (X)HTML environment, please see the Directionality of Paragraphs and documents section below.
If you are interested in the tl;dr version of this article, please jump to the Takeaway section.

The Directionality property

The term ‘Directionality’ describes the direction in which a string of text flows from one character to the next. In general, the directionality property can be classified into three groups:

  • LTR (Left-to-Right);
  • RTL (Right-to-Left); and
  • Neutral

Directionality is a property of characters, paragraphs and documents. While the directionality of characters is an intrinsic property, that of paragraphs and documents can be assigned.

Directionality of Characters

The directionality property of characters is intrinsic to the specific character. Each character can be classified into one of the above three directionality groups: LTR (generally a Latin character), RTL (a Hebrew or Arabic character, for example), or neutral (punctuation, numbers, special characters). Strictly speaking, the Unicode algorithm classifies digits as ‘weak’ rather than neutral, but for the purposes of this article they are grouped with the neutral characters. When two LTR characters are typed consecutively the second character is placed to the right of the first character; conversely, when two RTL characters are typed consecutively the second character is placed to the left of the first character, thus creating the natural directionality of the language: Left-to-Right or Right-to-Left, respectively. Characters with a neutral directionality property usually inherit the directionality of the characters that precede and follow them. The exceptions are when the surrounding characters are of opposite directionality, or when there is no following character present (e.g. at the end of a sentence or paragraph); in these cases the neutral characters inherit the directionality of the parent paragraph or document.

Directionality | LTR | RTL
Letters | AB_ | אב_
Neutral characters | .;!?_ | .;!?_

Table 1. Example of how the text flows in RTL and LTR languages. The underscore (_) represents the placement of the caret for typing the next character. The table also shows how neutral characters, such as punctuation marks, flow under LTR and RTL paragraph or document directionality.
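
For readers who like to verify things programmatically, the intrinsic directionality of a character can be inspected from code. The following is a minimal sketch using Python's standard unicodedata module; the helper name directionality_group and its mapping of the Unicode classes onto the three groups used in this article are my own simplification, not part of any standard API.

```python
import unicodedata


def directionality_group(ch):
    """Map a character's Unicode bidi class to the article's three groups.

    Simplification (an assumption of this sketch): everything that is not a
    strong LTR or RTL character, including digits, is reported as neutral.
    """
    bidi_class = unicodedata.bidirectional(ch)
    if bidi_class == "L":
        return "LTR"
    if bidi_class in ("R", "AL"):  # Hebrew letters are R, Arabic letters are AL
        return "RTL"
    return "neutral"


# Latin A, Hebrew Alef, Arabic Alef, exclamation mark, space, digit one
for ch in ["A", "\u05d0", "\u0627", "!", " ", "1"]:
    print(repr(ch), unicodedata.bidirectional(ch), directionality_group(ch))
```

The second column printed is the raw Unicode class. Note that the digit is reported as EN (a ‘weak’ class), which this sketch lumps together with the neutral characters.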

Directionality of Paragraphs and documents

Unlike characters, the directionality property of paragraphs and documents is not intrinsic; it is assigned to them. The directionality of a paragraph controls how the text flows within that paragraph, and the directionality property of a document defines the default directionality of all its text segments, unless it is specifically defined otherwise. Paragraphs and documents can have LTR or RTL directionality, but not neutral directionality.

Note about the terminology used in this article and the corresponding terms in Plain Text and (X)HTML environments
In this article I focus on working with BiDi text in a word processor and/or TEnT, which are the two main work environments for a translator. Therefore I use the terms ‘document’, ‘paragraph’, and ‘sentence’ to describe the structure of the content. However, the principles and solutions described in this article apply globally, so I think that it is important to take a minute and note the similarities to two other text-oriented environments. Plain Text documents (including email messages composed as Plain Text) don’t have paragraphs, and the directionality of the entire content is inherited from that of the document, which is usually LTR. Some text editors allow changing the reading direction, but this applies only visually to that session and doesn’t affect the directionality of the document. In (X)HTML the term ‘document’ corresponds to ‘webpage’, and the term ‘paragraph’ corresponds to the block elements ‘div’ and ‘p’ and to the inline element ‘span’.

Directionality of Sentences

The most common BiDi problems occur with mixed sentences that have LTR, RTL and neutral characters in them, with a close second being LTR or RTL standalone text segments under a paragraph or document with the opposite directionality. The directionality of a sentence is a little more complicated because it is made up of words (text units) which have an intrinsic directionality property, while at the same time it is also part of a paragraph or document that has its own directionality setting. When the directionality of the characters matches that of the paragraph or document everything flows as expected, but what happens when a directionality mismatch occurs?
Each sentence can be broken down to its text units, think of them as the words, punctuation marks and white space that make up the sentence. Let’s take the following sentence as an example:

Hi, this is an example!

Under LTR paragraph directionality everything flows as expected. But under RTL paragraph directionality the same sentence is rendered as “!Hi, this is an example”, with the punctuation mark placed on the wrong side of the sentence. To understand why this has happened we first need to break down the sentence to its ten text units:

  1. One LTR unit consisting of the characters “H-i”; then
  2. A neutral unit consisting of a comma and a space; then
  3. A LTR unit consisting of the characters “t-h-i-s”; then
  4. A neutral unit consisting of the space character; then
  5. A LTR unit consisting of the characters “i-s”; then
  6. A neutral unit consisting of the space character; then
  7. A LTR text unit consisting of the characters “a-n”; then
  8. A neutral unit consisting of the space character; followed by
  9. A LTR text unit consisting of the characters “e-x-a-m-p-l-e”; and lastly
  10. A neutral unit consisting of the exclamation point (!) character
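
The breakdown above can be reproduced with a few lines of code. This is only a sketch under the same simplification used in this article (characters are LTR, RTL or neutral); the function names directionality_group and split_into_units are hypothetical helpers of mine, built on Python's standard unicodedata and itertools modules.

```python
import unicodedata
from itertools import groupby


def directionality_group(ch):
    """Classify a character as 'LTR', 'RTL' or 'neutral' (simplified)."""
    bidi_class = unicodedata.bidirectional(ch)
    if bidi_class == "L":
        return "LTR"
    if bidi_class in ("R", "AL"):
        return "RTL"
    return "neutral"


def split_into_units(text):
    """Group consecutive characters of the same directionality into units."""
    return [
        (group, "".join(chars))
        for group, chars in groupby(text, key=directionality_group)
    ]


units = split_into_units("Hi, this is an example!")
for number, (group, unit) in enumerate(units, start=1):
    print(number, group, repr(unit))
```

Running it lists the same ten units, alternating between LTR words and neutral runs of punctuation and white space.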

Now that we have the sentence broken down to its text units let’s review each one and determine its directionality: The word Hi flows from Left-to-Right because its directionality is an intrinsic property of the characters that make it up. The following text unit is made up of a comma and a space and therefore has a neutral directionality. Because it is preceded and followed by LTR characters it inherits their directionality and flows from Left-to-Right, meaning that it will be positioned to the right of the preceding LTR text unit. The same goes for the rest of the text units in this sentence until the last one: the exclamation point (!). This text unit has a neutral directionality but it is not followed by a LTR character (the same would be true if it were followed by a RTL character), and therefore it inherits by default the directionality of the paragraph or document. In this example the paragraph directionality is set to RTL, which means that the text units are ordered from Right-to-Left, and therefore the exclamation point is placed at the left side of the sentence instead of the grammatically correct right side.
The same is true for RTL text under LTR paragraph or document directionality, as demonstrated in the following table.

A visual representation of text units order:

Under LTR paragraph directionality: 1->2->3->4->5->6->7->8->9->10

Under RTL paragraph directionality: 10<-(1->2->3->4->5->6->7->8->9). The brackets indicate that text units 1 to 9 all have LTR directionality and flow as one cohesive LTR unit, while unit 10, with the RTL directionality that it has inherited from the paragraph, has broken off and is placed to the left of the cohesive LTR segment.

Language | LTR or RTL text under the same (LTR or RTL, respectively) paragraph direction | LTR or RTL text under the opposite (RTL or LTR, respectively) paragraph direction
English | Hi, this is an example! | Hi, this is an example!
Hebrew | שלום, זאת דוגמה! | שלום, זאת דוגמה!

Table 2. Examples of LTR and RTL sentences (which are made up of text units, i.e. the words and the punctuation and space between them) under the same and the opposite paragraph directionality. Note that when the directionality properties of the paragraph and the characters don’t match, the punctuation mark is placed at the wrong side of the sentence.

Understanding how to Control the directionality property

Equipped with the above information I’m sure that experienced BiDi users have already identified the source of some of the most common BiDi issues. One of the almost instinctive reactions to BiDi issues is the attempt to solve them by visually manipulating the on-screen text. For example, to solve the punctuation mark being placed at the wrong side of the sentence, some users choose to reverse the word and punctuation order so that the text segment appears correctly on the screen. Although these workarounds might seem straightforward, I strongly recommend avoiding them. Some software is simply better at displaying BiDi content than other software, but a flawed display does not mean that the content is stored incorrectly. Manipulating the text to solve the visual problem is generally a mistake because instead of fixing the (sometimes non-existent) problem, one might actually introduce an error. Another workaround is re-phrasing the sentence to avoid the visual directionality mismatch. While this is sometimes possible, giving linguistic fluency and readability a secondary priority because of a technical limitation should generally not be considered a best practice.

It is strongly recommended to avoid any solution that involves manipulating the word and symbol order on the screen, or any unnecessary re-phrasing just for the sake of masking the BiDi issue. Instead, the directionality of the BiDi text should be controlled by using invisible control characters. An invisible character (also called a non-printing character) is a character that does not appear on-screen (or in print) but affects the layout or formatting of the visible characters. A good example of an invisible character is the paragraph mark that in a word processor marks the end of one paragraph and the beginning of the next. In Microsoft Word, for example, you can show or hide the formatting marks by clicking the ‘Show/Hide’ button (the one indicated by the Paragraph [¶] icon) or by pressing the Ctrl+* (Ctrl+Shift+8) shortcut.

Left-to-Right (LRM) and Right-to-Left (RLM) Marks

The Left-to-Right (LRM) and Right-to-Left (RLM) marks are two invisible, zero-width characters that control the directionality of the text by simulating the presence of a character with LTR or RTL directionality whenever a visible character cannot be used. They might be easier to understand by imagining them as an invisible Latin letter (LRM – strong LTR directionality) or Hebrew/Arabic letter (RLM – strong RTL directionality). The most common use of the LRM and RLM characters is to control the directionality of LTR and RTL text units when in directionality mismatch with their parent paragraph or document. For example, placing the LRM character at the end of the “Hi, this is an example!” example sentence under RTL paragraph directionality solves the BiDi issue of the exclamation point being placed at the wrong side.

Language | LTR or RTL text under the opposite (RTL or LTR, respectively) paragraph direction | The same sentence under the opposite paragraph direction with a LRM or RLM character placed after the punctuation mark to solve the BiDi issue
English | Hi, this is an example! | Hi, this is an example!<LRM>
Hebrew | שלום, זאת דוגמה! | שלום, זאת דוגמה!<RLM>

Table 3. Using the LRM and RLM directionality control characters to correct the BiDi issue by simulating the presence of a LTR or RTL character after the exclamation mark.

Before the directionality control characters were added, the neutral-directionality exclamation point had inherited the directionality of the parent paragraph because it was preceded by a strong directionality character but not followed by a character with the same directionality, and as a result it ended up at the grammatically wrong side of the sentence. The addition of the appropriate invisible, zero-width control character after the punctuation mark fixed this common BiDi issue by simulating the presence of a strong directionality character. It is exactly like typing a letter after the exclamation point, only in places where a visible character with the appropriate directionality cannot be used.
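
To make the mechanism more tangible, here is a deliberately simplified simulation of how the neutral characters are resolved and how the text is then reordered for display. It is a sketch and not an implementation of the Unicode BiDi algorithm: it only knows the three groups used in this article, treats digits as neutral, and ignores embeddings, overrides and mirroring. The function names (strong_class, resolve_directions, visual_order) are hypothetical helpers of mine.

```python
import unicodedata

LRM = "\u200e"  # LEFT-TO-RIGHT MARK, an invisible strong LTR character


def strong_class(ch):
    """Return 'LTR', 'RTL' or 'neutral' (digits are lumped with neutrals)."""
    bidi_class = unicodedata.bidirectional(ch)
    if bidi_class == "L":
        return "LTR"
    if bidi_class in ("R", "AL"):
        return "RTL"
    return "neutral"


def resolve_directions(text, paragraph_dir):
    """Give each run of neutrals the direction of its neighbours, or the
    paragraph direction when the neighbours disagree or are missing."""
    classes = [strong_class(ch) for ch in text]
    resolved = list(classes)
    i = 0
    while i < len(classes):
        if classes[i] != "neutral":
            i += 1
            continue
        j = i
        while j < len(classes) and classes[j] == "neutral":
            j += 1
        before = classes[i - 1] if i > 0 else paragraph_dir
        after = classes[j] if j < len(classes) else paragraph_dir
        direction = before if before == after else paragraph_dir
        resolved[i:j] = [direction] * (j - i)
        i = j
    return resolved


def visual_order(text, paragraph_dir):
    """Return the characters in the order they appear on screen, left to right."""
    if not text:
        return ""
    base = 0 if paragraph_dir == "LTR" else 1
    levels = []
    for direction in resolve_directions(text, paragraph_dir):
        if base == 0:
            levels.append(0 if direction == "LTR" else 1)
        else:
            levels.append(1 if direction == "RTL" else 2)
    chars = list(text)
    # From the highest level down to 1, reverse every run at that level or higher.
    for level in range(max(levels), 0, -1):
        i = 0
        while i < len(chars):
            if levels[i] < level:
                i += 1
                continue
            j = i
            while j < len(chars) and levels[j] >= level:
                j += 1
            chars[i:j] = chars[i:j][::-1]
            levels[i:j] = levels[i:j][::-1]
            i = j
    return "".join(chars)


sentence = "Hi, this is an example!"
print(repr(visual_order(sentence, "LTR")))
print(repr(visual_order(sentence, "RTL")))
print(repr(visual_order(sentence + LRM, "RTL")))
```

Under the simulated RTL paragraph the plain sentence comes out as '!Hi, this is an example', while the version with the trailing LRM keeps the exclamation point at the right side, which is the behaviour described above.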

The Left-to-Right Embedding (LRE), Right-to-Left Embedding (RLE), and Pop Directional Formatting (PDF) Marks

The LRM and RLM characters allow for very granular control over the directionality of the text, and are actually enough to solve any BiDi issue that can be solved. However, sometimes there is a need to embed a LTR or RTL section in a paragraph with a different directionality, and controlling everything with the LRM and RLM control characters might not be the most efficient solution in terms of time and effort. This is where the Left-to-Right Embedding (LRE), Right-to-Left Embedding (RLE), and the Pop Directional Formatting (PDF, not to be confused with the file format) marks come into play. The LRE or RLE mark is placed at the start of the embedded section and the PDF mark is placed at the end of the embedded section to signal the end of the embedding. If the PDF mark is missing, the embedding ends at the end of the paragraph. Note that a LRE or RLE mark placed inside an embedded section does not end it; it starts a nested embedding that needs its own PDF mark. Their function is very similar to the LTR and RTL run commands in Microsoft Word.
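
As an illustration only, wrapping a section in an embedding can be sketched as a tiny helper. The function name embed is hypothetical; the three code points are the ones listed in Table 4.

```python
LRE = "\u202a"  # LEFT-TO-RIGHT EMBEDDING
RLE = "\u202b"  # RIGHT-TO-LEFT EMBEDDING
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING


def embed(text, direction):
    """Wrap text in an embedding: opening mark at the start, PDF at the end."""
    if direction == "LTR":
        opener = LRE
    elif direction == "RTL":
        opener = RLE
    else:
        raise ValueError("direction must be 'LTR' or 'RTL'")
    return opener + text + PDF


wrapped = embed("Hi, this is an example!", "LTR")
# The wrapped string looks identical on screen but is two characters longer.
print(len("Hi, this is an example!"), len(wrapped))
```

Always closing the section with an explicit PDF mark, as the helper does, avoids relying on the end of the paragraph to terminate the embedding.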

I consider these control characters to be of less importance because their use-case scenario is not very common in most modern text editing environments (they are mostly useful for working with plain text documents in which the paragraph directionality cannot be controlled otherwise), and because I believe that it is more important to first master the use of the LRM and RLM control characters.

Left-to-Right Override (LRO) and Right-to-Left Override (RLO) marks

As their name suggests, these control characters are used to mark the start of an override section, meaning that the text following the override mark will be treated as LTR or RTL regardless of the intrinsic directionality property of the characters. Their use-case scenarios are extremely limited and uncommon, and due to the data integrity and security concerns associated with their use, it is highly recommended to avoid them whenever possible, and most times it is possible.
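
Because all of these control characters are invisible, it can be useful to check whether a piece of text contains any of them, and overrides in particular. The following sketch lists the directionality control characters found in a string; the function name find_bidi_controls and the sample string are hypothetical and serve only as an illustration.

```python
import unicodedata

# The marks discussed in this article: LRM, RLM, LRE, RLE, PDF, LRO, RLO
BIDI_CONTROLS = {
    "\u200e", "\u200f", "\u202a", "\u202b", "\u202c", "\u202d", "\u202e",
}


def find_bidi_controls(text):
    """Return (index, code point, Unicode name) for each control character."""
    return [
        (index, f"U+{ord(ch):04X}", unicodedata.name(ch))
        for index, ch in enumerate(text)
        if ch in BIDI_CONTROLS
    ]


# A made-up string with a hidden Right-to-Left Override inside it
suspicious = "report\u202etxt.exe"
for hit in find_bidi_controls(suspicious):
    print(hit)
```

An override hidden in the middle of a string like this one changes how the following characters are displayed without changing what is stored, which is the kind of integrity concern mentioned above.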

Inserting the Directionality Control Characters

There are several methods for inserting the directionality control characters into a document. Some notable methods are:

1. Using TEnTs
Some TEnTs allow inserting special and control characters using a menu, toolbar and/or shortcut. In SDL Studio the directionality control characters are available from the Quick Insert menu. Starting with SDL Studio 2011 SP2 they are also available as keyboard shortcuts: Ctrl+Alt+Shift+L (for LRM) and Ctrl+Alt+Shift+R (for RLM) by default. The key assignment can be customized from the Option > Keyboard Shortcuts menu.

2. The Character Map in Windows and Microsoft Word
In Windows and Microsoft Word the direction control characters can be selected and inserted from the character map. In Windows the easiest way to access the character map is by typing ‘character map’ in the search field of the Start menu or screen. In Microsoft Word the character map is accessible by selecting the Symbol option from the Insert tab of the Ribbon, under the Symbols group.

In Microsoft Word the direction control characters can be selected and inserted from the Character Map. In this screenshot the Right-to-Left Mark was selected by inserting its Unicode code: U+200F

3. Using Notepad (or other text editors)

Some text editors allow inserting the control characters from a menu. I’m using Notepad as an example because it is available in all Windows versions. To insert a control character: right click the location into which you want to insert the control character, choose the Insert Unicode Control Character option from the context menu, and then select the required control character from the list.

In Notepad the direction control characters can be inserted from the Insert Unicode control character menu

You can then copy and paste the control character to another program that doesn’t have a built-in feature for adding it, use a clipboard manager to paste the character without having to use Notepad each time, use a text expansion program to replace a customized text string with the control character, or use virtually any other method that works for you.

4. Using the Keyboard
Another option is to use keyboard shortcuts by typing the Alt Code of the control character. The advantage of this method is that it works system-wide and doesn’t require any additional software; its shortcomings are that the codes are usually specific to a keyboard layout and that the whole process of typing Alt Codes is quite cumbersome. To insert a character by its Alt Code, verify that the Num Lock indicator is turned on, press and hold the Alt key, type the code using the numeric keypad, and release the Alt key when done.

Table 4. Summary of Unicode control character values

Directionality mark | Unicode value | Alt Code in Windows (Hebrew keyboard layout) | In (X)HTML
Left-to-Right Mark (LRM) | U+200E | Alt+0253 | &lrm;
Right-to-Left Mark (RLM) | U+200F | Alt+0254 | &rlm;
Left-to-Right Embedding (LRE) | U+202A | (none) | &#8234;
Right-to-Left Embedding (RLE) | U+202B | (none) | &#8235;
Pop Directional Formatting (PDF) | U+202C | (none) | &#8236;
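
The values in Table 4 can be cross-checked from code. The sketch below derives the (X)HTML numeric references from the Unicode values and decodes the two Alt Codes using the Windows Hebrew code page (cp1255). That the Alt+0nnn codes are looked up in this code page when the Hebrew keyboard layout is active is my understanding of how Windows behaves, so treat it as an assumption. The MARKS dictionary and the describe helper are hypothetical names.

```python
import html
import unicodedata

MARKS = {
    "LRM": 0x200E,
    "RLM": 0x200F,
    "LRE": 0x202A,
    "RLE": 0x202B,
    "PDF": 0x202C,
}


def describe(code_point):
    """Return the Unicode notation, name and HTML numeric reference."""
    return {
        "unicode": f"U+{code_point:04X}",
        "name": unicodedata.name(chr(code_point)),
        "html_numeric": f"&#{code_point};",
    }


for abbreviation, code_point in MARKS.items():
    print(abbreviation, describe(code_point))

# The named references decode to the same characters as the Unicode values.
print(html.unescape("&lrm;") == chr(MARKS["LRM"]))
print(html.unescape("&rlm;") == chr(MARKS["RLM"]))

# Bytes 253 and 254 in the Hebrew code page are the LRM and RLM characters.
print(bytes([253, 254]).decode("cp1255") == "\u200e\u200f")
```

The code page lookup also suggests why the table lists no Alt Codes for the embedding marks: they have no single-byte value in that code page.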

5. Left-to-Right and Right-to-Left Run in Microsoft Word
Although this method is not directly associated with inserting directionality control characters, and is specific to Microsoft Word, I still think that it deserves an honorable mention.
Microsoft Word handles BiDi text pretty well and sometimes no extra editing is required. However, it is a good idea to be familiar with two built-in commands called Left-to-Right run and Right-to-Left run that allow the user to quickly define the directionality of a text segment. These commands work very much like the Left-to-Right or Right-to-Left embedding control characters.
These commands are not available from the Ribbon by default and need to be added to the Ribbon or Quick Access Toolbar. First, open the customization window in one of the following ways:

  • Click the small down arrow in the Quick Access Toolbar and select the More Commands option; alternatively
  • Right click the Quick Access Toolbar and select the Customize Quick Access Toolbar… option to add the commands to the Quick Access Toolbar or the Customize the Ribbon… option to add them to the Ribbon; alternatively
  • Click File (or the Office Orb) > Options and select the Customize Ribbon or Quick Access Toolbar sub-menus to add the commands to the Ribbon or Quick Access Toolbar, respectively

Selecting the More Commands option from the Customize Quick Access Toolbar menu in Microsoft Word

Then add the commands:

  • From the Choose commands from dropdown list select the All Commands option
  • From the All Commands list navigate to the Ltr run command and click the Add button to add it to the Quick Access Toolbar/Ribbon
  • Repeat the above process to add the Rtl run command
  • Click OK to confirm and close the Options window

The Customize Quick Access Toolbar sub-menu in Microsoft Word

The LTR and RTL run commands in the Quick Access Toolbar. Because it is difficult to tell them apart by their icons, I recommend placing them in a logical order

To use the LTR or RTL run commands, select the text segment and then click the LTR or RTL run command to assign the required directionality property.

Examples of common BiDi issues and how to solve them using the LRM and RLM control characters

This section gives a short overview of some common BiDi issues and how to solve them using the directionality control characters. The first image shows the common BiDi issues of mixed or LTR text under RTL paragraph directionality; the second image shows the text after the BiDi issues have been solved using the LRM and RLM control characters; and the table that follows them describes the BiDi issues and their solutions, by order of appearance.
The screenshots are taken from SDL Studio because it made it easier to show the otherwise invisible and zero-width control characters in a visible and identifiable way.

The right column shows some common BiDi issues of LTR or mixed directionality sentences under RTL paragraph directionality
Using the LRM (depicted by the arrow pointing to the Right) and RLM (depicted by the arrow pointing to the Left) Unicode direction control characters to solve the BiDi issues
BiDi issue: A special character or punctuation at the wrong end of the word or sentence
Description: Because the Trademark symbol is a special character with a neutral directionality property, when it is not enclosed between two LTR characters under a RTL paragraph directionality, it inherits the directionality of the paragraph and as a result ends up to the left of the company name.
Solution: Adding the LRM control character after the ™ symbol solves this issue.

BiDi issue: Numeric values out of order
Description: The numeric values are separated by the X character. Numbers have neutral directionality property and because X is a LTR character the numbers get out of flow under RTL paragraph or document directionality.
Solution: Adding the LRM control character after the last numeric value reorders the values correctly.
Description: The vertical and horizontal pixel values of the resolution are separated by the X character. Numbers have neutral directionality property and because the last number (1200) is not followed by a LTR character it inherits the RTL directionality of the paragraph and ends up out of place.
Solution: Adding the LRM control character after the last numeric value reorders the values correctly.
Description: In Hebrew, and possibly other RTL languages, the lower value in a range of values is written on the right and the larger value on the left. Moreover, the degree sign should be placed to the right of the value, exactly like in English, and to make things a little more complicated, the numbers and the degree sign have neutral directionality whereas the C character has LTR directionality property.
Solution: Adding the RLM control character after each measurement unit reorders everything correctly. In some cases this could be a good use-case scenario for RTL embedding (or the RTL run command in Word).
Description: The BiDi algorithm’s minor imperfections show up again when the minus symbol is not identified as part of the temperature value and ends up at the wrong side, where it seems to connect with the dash that precedes it (Hebrew grammar requires a dash here).
Solution: Adding the LRM control character after the minus symbol solves this issue.

BiDi issue: The order of items written in Latin characters under RTL paragraph directionality
Description: In RTL languages the comma is placed to the left of the name, followed by a space and the next name. However, because the comma and space have neutral directionality property and they are enclosed between two LTR characters, they are wrongly ordered as if they were part of a LTR sentence.
Solution: Adding the RLM control character after each comma gives the items the correct directionality. This could be a good use-case scenario for RTL embedding (or the RTL run command in Word).

BiDi issue: A mixed sentence
Description: Sentences consisting of joined parts: the first needs to be translated and the second does not. To make things more complicated, the part that needs translation ends with LTR character(s).
Solution: Adding the RLM control character after the last square bracket re-flows the sentence correctly, with the [X] button now being placed at the end of the RTL part that was translated.

Table 5. Description of some common BiDi issues and how to solve them (in order of appearance)
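
Several of the fixes in Table 5 follow the same pattern: a segment ends with neutral characters, so a mark matching the segment's own direction is added at the end. As a rough sketch of that pattern only, the hypothetical helper below appends LRM or RLM according to the last strong character in the segment. It is not a substitute for reviewing the text, because whether a mark is needed, and which one, depends on the paragraph direction and on the intended reading order.

```python
import unicodedata

LRM = "\u200e"  # LEFT-TO-RIGHT MARK
RLM = "\u200f"  # RIGHT-TO-LEFT MARK


def strong_direction(ch):
    """Return 'LTR' or 'RTL' for strong characters, None for everything else."""
    bidi_class = unicodedata.bidirectional(ch)
    if bidi_class == "L":
        return "LTR"
    if bidi_class in ("R", "AL"):
        return "RTL"
    return None


def add_trailing_mark(segment):
    """Append the mark matching the last strong character, but only when the
    segment ends with characters that have no strong direction of their own."""
    if not segment or strong_direction(segment[-1]) is not None:
        return segment
    for ch in reversed(segment):
        direction = strong_direction(ch)
        if direction == "LTR":
            return segment + LRM
        if direction == "RTL":
            return segment + RLM
    return segment  # no strong character at all, nothing to match


# Illustrative inputs (ACME is a made-up company name)
for sample in ["ACME\u2122", "1600 X 1200", "\u05e9\u05dc\u05d5\u05dd!", "Hello"]:
    print(repr(add_trailing_mark(sample)))
```

Because the marks themselves are strong characters, running the helper twice on the same segment does not add a second mark.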

You can experiment with all the examples in this article by typing or pasting them into Notepad or another text editor. To simulate the required LTR or RTL paragraph directionality in Notepad, press Left Ctrl+Shift or Right Ctrl+Shift (or right click an empty space and select the required reading direction from the context menu) to change the reading order to LTR or RTL, respectively.
If you want to experiment with the common BiDi issues in your TEnT of choice (or any other tool), I made them available as a Docx file that you can download.

Takeaway

BiDi support has improved significantly over the last decade or so with the increased software support for the Unicode BiDi algorithm. However, the BiDi algorithm has its flaws and should be viewed as a technical framework or infrastructure and not as a fully automated, type-and-forget kind of solution. In this article I’ve attempted to explain what the directionality property is, how it is parsed and handled by the Unicode BiDi algorithm, and how it can be controlled. Some of the common workarounds for tackling BiDi issues involve visually re-ordering the words and symbols that are displayed on the screen, while others involve artificially re-phrasing the sentence just to mask the BiDi issue. Neither can be considered a solution and both should be avoided. The best practice for controlling the directionality property is using the invisible, zero-width directionality control characters, mainly the Left-to-Right (LRM) and Right-to-Left (RLM) marks.

A thought about quoting and charging for the BiDi-specific effort
In his excellent article Format Surcharging in-Translation, Kevin Lossner addresses the issue of charging for the effort involved in working with heavily formatted documents. Many take this effort for granted and think nothing of it, although inserting or working around the tags and making sure that no tag was accidentally deleted is by no means a negligible task. I wholeheartedly agree with his arguments and think that the rationale behind them and the solution offered generally apply to BiDi control characters as well. The directionality control characters should be considered as tags, although unlike tags there is currently no way to count them in advance, and it is often hard to accurately estimate the required BiDi effort from a quick glance. BiDi issues could be as simple as fixing that odd Trademark symbol in a company name, but they could also be quite complex and strenuous, as in the case of many number ranges or mixed sentences with joined LTR and RTL parts, which makes it all the more difficult to get an accurate estimate. Regretfully, I don’t have any recommendation about how to accurately estimate the BiDi effort at this point, except for some trial and error.

Noticed a mistake? Want to add something, share a tip, ask a question or just discuss BiDi issues in general? Please feel free to leave a comment.

If you are interested in translating the above examples into Arabic, please contact me here or via social media.


2 Comments

  1. Hello Shai, and thanks for this detailed explanation. I just wanted to point out that even though the Notepad tool is available in all versions of Windows, only the most recent versions include the Unicode tools you mention. This is because, until Win7, the Notepad was a non-Unicode text editor.

  2. Hello Jose,
    Thank you for pointing this out.
    If I recall correctly, I think that Notepad supported Unicode already in Windows Vista, but I’m not fully sure at the moment.
    What I meant to say, and did a poor job at, is that it is part of all Windows editions (Home premium, professional, etc.) so every (modern) Windows user should have it installed.
    Thank you for the clarification, I will revise the text accordingly.
