Removing Tags from a Translation Memory

Earlier this week a colleague contacted me with an issue. One of his agency clients sent him a Translation Memory, but much to his dismay, the TM was cluttered with unnecessary formatting tags that rendered it pretty much useless. This clutter is usually being referred to as a “Tag soup”, and it is primarily the result of a PDF or image file conversion into formatted text. A Tag Soup could also form out of style mismatch when copying and pasting formatted text between applications, for example, copying an HTML email message and pasting it into a Word Processor. The conversion process or style mismatch introduces all kinds of unnecessary formatting tags that need to be removed in a post-processing stage before the converted document can be worked on. Emma Goldsmith wrote an excellent and exhaustive article about how to get rid of a tag soup in documents before starting to work on them, but what if the tag soup is already in the Translation Memory? The good news is that this issue can be resolved fairly easily. It is worth mentioning that the methods described here for removing the Tag Soup can also be used to remove legitimate formatting tags if such need ever arises.
For the purpose of this demonstration I’ve prepared a ‘Tag Souped TM’. As evident from the screenshots below, this is quite an extreme case of a Tag Soup (although similar ones do exist in reality) because I wanted to show that the removal methods discussed here are effective even with a severe case of a Tag Soup. Furthermore, for ease of presentation the source and target languages in the Translation Memory are the same.

Although these methods are considered generally safe, before attempting to use any of them I recommend creating a backup copy of the TM, just in case something goes wrong.

Using SDL Studio

All versions of SDL Studio to date have a nifty little feature for removing these formatting tags from the TM.

  • Switch to the Translation Memories view in SDL Studio (at the bottom of the left pane).
  • Open the relevant TM from the File menu in Studio 2009 and 2011, from the Home tab and Tasks group of the Ribbon in Studio 2014, or by right clicking the Translations Memories folder in the Translation Memories pane and selecting the Open Translation Memory option in all versions.
  • With the TM open, select the Batch Edit option from the File menu in Studio 2009 and 2011, the Home tab and Tasks group of the Ribbon in Studio 2014, or by right clicking the TM in the Translation Memories pane in all versions.
  • In the Batch Edit window, click Add and select the Delete Tags options.
  • The default setting deletes only the formatting tags. However, if for some reason placeholder tags should also be deleted (not a common scenario), select the Delete text placeholder tags checkbox. If unsure, leave this checkbox unchecked.
TM with Tag Soup
The TM in all of its Tag Soup glory
Studio's Batch Edit window with the Delete Tags option selected
The Batch Edit window in SDL Studio with the Delete Tags option selected
The result: the TM with all the formatting tags removed
The result: The Translation Memory without all the formatting clutter

Exporting to TMX and using Olifant

The Tag Soup could be removed even if your Translation Environment Tool of choice doesn’t have a similar function to the one described above.

  • Download and Install Olifant
  • Export the TM to the Translation Memory eXchange (TMX) file format
  • Start Olifant and open the TMX export file
  • With the TMX open, go to the Entries menu and select the Remove Codes option
  • From the dialog box, select the Remove both TMX tags and their content option
  • Save (or export to a new TMX file if you want to keep the original version) the clean TM and import it to your Translation Environment Tool of choice.

Before

Tag Soup TMX in Olifant
The TMX export of the cluttered TM in Olifant.

After

The TMX in Olifant after the removal of the cluttering tags
The same TMX after the removal of the uncessary tags

Please note:

  • The screenshots and instructions are for the .NET version of Olifant. There is a also a newer Java based version of Olifant, but at the time of this writing it is only in Alpha stage and not yet ready for production.
  • I’ve tested this method on Windows 8 and it didn’t work. It is therefore possible that the .NET version of Olifant is not fully compatible with Windows 8.

Using Regex

Regular expression (abbreviated Regex) is a very powerful method for manipulating text. Regex is a topic that goes well beyond the scope of this article, but it can be harnessed to search and delete these tags. To do that: Open the TM in your favorite Translation Environment Tool; the only prerequisite is that tool has to support Regex and allow direct (or batch) TM editing. Alternatively, export the TM inoto TMX and open it in a TMX editor such as the aforementioned Olifant. The Regex syntax for capturing the tags may vary depending on the way Regex is supported by the tool, but as a general guideline the syntax <[^>]*> is a good reference point that can be adapted as needed (for more information about Regex support, please consult the tool documentation). This syntax searches for everything enclosed withing the < and > characters, or in other words the tags. Now these tags can be deleted using a simple search and replace operation. I tested this syntax in Olifant and it seems to work.

Regex in Olifant to search for tags
Olifant’s Find and Replace window with the <[^>]*> regex syntax in the search field. Note the blue highlighted match that corresponds to a tag
A word of caution: Although it might seem as a good idea, I don’t recommend preforming the search and replace operation in a text editor such as Notepad(++) unless you know what you are doing. Otherwise, some structural tags that have nothing to do with the Tag Soup might get changed or deleted, resulting in a corrupt file.

Although Tag Soup in a Translation Memory is less common than its document counterpart, knowing how to clean it might come in handy. In most cases either of the above methods, or in some extreme cases a combination of them, will be enough to remove the clutter and produce a much more usable TM.

Tweet about this on TwitterShare on LinkedInShare on Google+Share on FacebookShare on RedditEmail this to someone

4 Comments

  1. Many thanks for mentioning my article on tag soups, Shai.
    This is a very useful explanation of stripping tags in a TM. I wouldn’t have thought of using Regex for doing it – but it comes in useful for so many things.
    Keep these interesting posts coming!
    Emma

  2. Thank you Emma for your kind words and article.
    Good content is a good content and I will always be happy to link and share it.

  3. Hi Shai,

    Just thought I’d mention that the new Heartsome TMX Editor (8.0) has a great tag removal feature too, as well as a ‘Clean up TM’ feature. It looks like it might finally be time to retire Olifant. I’ve been waiting and watching the Olifant alpha now for quite some time, but it looks like it is very low on the Okapi team’s list of priorities, which really is a shame as I think a lot of translators would prefer it to be finished first, rather than all their other (highly complicated) stuff (Rainbow, Ratel, etc.).

    Michael

    1. Hi Michael,

      Thank you for taking the time to comment and point out Heartsome TMX Editor (8.0).
      Olifant’s development does seem to stall, and that’s too bad.
      A TMX Editor is a good tool to have considering the limited TM editing capabilities of the popular TEnTs (although this seems to be
      gradually improving), and the fact that currently TMX is pretty much the only data exchange format that can be reliably used to transfer or
      archive data independent of any proprietary format.
      between different tools.

Leave a Reply