Shuttleworth-Lagoudaki TM Profesional
net/publication/268348851
1 author:
Mark Shuttleworth
Hong Kong Baptist University
All content following this page was uploaded by Mark Shuttleworth on 29 June 2015.
Introduction
Translation Memory (TM) technology has been with us for a good fifteen years now.
Heralded in many respects as the answer to the translator’s dreams, it has enjoyed
considerable success in real terms, limited mainly by the level of take-up among
professional translators and translation companies.
The basic concept behind the technology is simple: as users translate, the application
‘remembers’ their translation sentence by sentence, and then ‘reminds’ them of precisely
what they wrote whenever a sentence recurs. Thus, more formally stated, TM is a
computer application that allows users to store previous translations along with their
originals in a database (or an index) and to re-use them in new translation projects,
whenever similar source text is encountered.
How does it work? Initially, the program splits the pair of texts (original + translation) into
‘segments’ (i.e. sentences, phrases or words) and then aligns them as the text is
translated segment by segment. A pair of aligned segments – i.e. the source segment
and its translation – is what we often call a ‘translation unit’, or TU. Each translation unit
is then stored and indexed in a database/index (or ‘translation memory’) in an organised
way with a variety of information – such as the date and time of translation, the identity
of the translator, and so on – attached. When the user starts a new translation – typically
of an updated version of the original text – which has some source segments identical or
similar to the ones existing in the database, the system recognises them, retrieves the
past translations for those segments and suggests them to the user.
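The workflow just described can be illustrated with a minimal sketch. It assumes a naive sentence splitter and an in-memory dictionary in place of a real database; the names `TranslationUnit`, `TranslationMemory` and `split_into_segments` are hypothetical and do not come from any actual TM product.

```python
import re
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class TranslationUnit:
    """One aligned source/target pair plus the metadata stored with it."""
    source: str
    target: str
    translator: str
    created: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


def split_into_segments(text: str) -> list[str]:
    """Naive segmentation: split after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


class TranslationMemory:
    """Toy in-memory TM, indexed by the exact source segment."""

    def __init__(self) -> None:
        self._units: dict[str, TranslationUnit] = {}

    def store(self, source: str, target: str, translator: str) -> None:
        # Each confirmed translation becomes a translation unit (TU).
        self._units[source] = TranslationUnit(source, target, translator)

    def lookup(self, source: str) -> Optional[TranslationUnit]:
        # Only identical source segments are found here; similar ones are not.
        return self._units.get(source)


tm = TranslationMemory()
tm.store("Press the red button.", "Appuyez sur le bouton rouge.", translator="anna")

# A new text is segmented and each segment is checked against the memory.
new_text = "Press the red button. Wait for the light to turn green."
suggestions = [(seg, tm.lookup(seg)) for seg in split_into_segments(new_text)]
```

In this sketch the first segment of the new text retrieves the stored translation, while the second has no counterpart in the memory and must be translated from scratch.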
Terminology is typically handled along similar lines, except that the TM
tool connects to a separate terminology database for automatic term look-up.
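Automatic term look-up might work roughly as in the following sketch. The termbase is a hypothetical English-to-French dictionary held in memory, whereas a real tool would query a separate terminology database; `find_terms` is likewise an illustrative name.

```python
import re

# Hypothetical termbase; the entries are invented for illustration only.
TERMBASE = {
    "error message": "message d'erreur",
    "menu item": "élément de menu",
    "toolbar": "barre d'outils",
}


def find_terms(segment: str, termbase: dict[str, str]) -> list[tuple[str, str]]:
    """Return (source term, target term) pairs for each termbase entry found in the segment."""
    hits = []
    lowered = segment.lower()
    for term, translation in termbase.items():
        # Word boundaries stop a term from matching inside a longer word.
        if re.search(r"\b" + re.escape(term) + r"\b", lowered):
            hits.append((term, translation))
    return hits


term_hits = find_terms("Click the menu item to clear the error message.", TERMBASE)
```

Here both "error message" and "menu item" are recognised in the source segment and their stored equivalents are offered to the translator, while "toolbar" is not, since it does not occur.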
Thus TM cannot by any stretch of the imagination be likened to machine translation, its
better-known but less widely-used cousin, as the translator remains in complete control
of the translation, rather than simply correcting and polishing a rough version that the
system has produced. Also unlike machine translation, TM systems are not limited to a
specific language pair but can be used to translate between any pair of languages. In
essence, TM technology offers the user a database tool for accessing past translations;
the database is empty upon its first use, but expands rapidly in size as users fill it with
their translations; the bigger the TM database gets the more valuable it becomes.
Intrinsic to the technology is the ability to distinguish between matches that are identical
(‘exact’ or ‘100%’) and those that are only similar (‘fuzzy’). An example of the former
might be as follows:
Some tools ignore the question of formatting for the purposes of determining whether or
not a particular match is exact, while others permit the user to require a complete
replication of the formatting of the original in order to qualify as 100%.
A fuzzy match, then, is a match which is anything less than exact, however that is
defined. For example:
Typically, for the convenience of the user a TM tool will highlight the changes that have
occurred, as can be seen in the example.
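The distinction between exact and fuzzy matches, and the highlighting of changes, can be sketched as follows. This is only an approximation: it uses the generic string similarity measure from Python's standard `difflib` module, not the scoring formula of any commercial tool, and the 70% threshold is an assumed value.

```python
import difflib


def match_score(new_segment: str, stored_segment: str) -> int:
    """Similarity as a whole percentage, rounded down so that only identical strings reach 100."""
    ratio = difflib.SequenceMatcher(None, new_segment, stored_segment).ratio()
    return int(ratio * 100)


def classify(new_segment: str, stored_segment: str, threshold: int = 70) -> str:
    """Label a candidate as 'exact', 'fuzzy' or 'no match' (threshold is an assumption)."""
    score = match_score(new_segment, stored_segment)
    if score == 100:
        return "exact"
    if score >= threshold:
        return "fuzzy"
    return "no match"


def highlight_changes(stored_segment: str, new_segment: str) -> str:
    """Mark words dropped from the stored segment as [-word-] and words added as {+word+}."""
    old_words = stored_segment.split()
    new_words = new_segment.split()
    out = []
    matcher = difflib.SequenceMatcher(None, old_words, new_words)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            out.extend(old_words[i1:i2])
        else:
            out.extend(f"[-{w}-]" for w in old_words[i1:i2])
            out.extend(f"{{+{w}+}}" for w in new_words[j1:j2])
    return " ".join(out)


stored = "Press the red button to start."
new = "Press the green button to start."
label = classify(new, stored)
marked = highlight_changes(stored, new)
```

With these two sentences the candidate is classified as fuzzy, and `marked` shows that "red" has been replaced by "green". A tool that takes formatting into account would compare the formatting codes as well as the text before declaring a match exact.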
Use
There are a number of scenarios in which TM technology has a particularly clear
application, although its use is not completely restricted to these areas. These scenarios
are as follows:
Bearing these four scenarios in mind, one can say that certain text types are ideal
candidates for TM use:
• Repetitive texts such as technical (e.g. manuals, technical documentation), financial and
legal documents (MS Word documents)
• E-content (HTML and XML files)
• Software (e.g. menu items, error messages, etc.: Java properties, Windows resource
files, etc.)
• Text contained in complex formats (such as DTP files: FrameMaker, Illustrator,
Interleaf, Pagemaker, etc.)
By contrast, the following are generally poor candidates for TM use:
• Literary texts
• Short texts (one paragraph document, slogans, etc.)
• One-off projects or, more generally, small volumes of translation work
It would be misleading to claim that all reactions have been totally positive, of course, as
there are critics who point to a number of limitations in the technology as it is
implemented in most currently available commercial TM systems: its unsuitability for
non-repetitive texts, the inflexibility of only having matches on the sentence level, the
difficulty of retrieving contextual information and the time it takes to produce useful TMs.
The best-known TM packages currently available are Déjà Vu, WordFast, SDL Trados,
STAR Transit, MultiTrans and Omega-T. Besides the most obvious difference of price
(WordFast is cheap and Omega-T free, while the others sell at full commercial prices),
the packages differ, for example, in terms of
• the text editing environment: some tools use MS Word while others offer their own
text editor (usually in a tabular format)
• the granularity of segmentation (which can occur at sentence, phrase or word level)
• the indexing method (indexing of segments vs. full-text indexing)
• the structure of the resources repository (physical TM database vs. virtual database
vs. index)
• the match retrieval techniques used (character-string-based matching vs.
linguistically enhanced matching)
• the level of automation offered
• the ease with which a particular tool can be integrated with a machine translation
engine
However, since all of them perform essentially the same core tasks, it is largely a matter of
personal preference which one a particular user will select.
Apart from TM system developers, other research bodies (including many
universities and some European institutions) have recognised the importance of these
systems and their potential benefits to translation activity, and have joined forces
with the industry to extend and accelerate research into TM technology. Current
research is moving in several directions, depending on the interests of each
research team. However, the most important areas, as indicated by reports on the weaker
aspects of modern TM systems, have been:
Recently, the efforts of certain research teams have brought to light very
significant developments in TM technology. Those developments have addressed some
of the inherent limitations and problems encountered by traditional TM systems and
have opened up the way to a new generation of systems.
A common request expressed regularly by translators has been the ability of the TM
system to show some context for the match that is suggested to the user. This is
considered an important functionality as translators rely heavily on the context of use of
words and phrases before they decide on the correct translation. Until recently, no
commercial TM system could offer this functionality: having split the original source
and target texts into segments, all it held in its database was a set of disconnected
segments rather than running text. Translators working on traditional TM systems normally resort to
concordance tools in order to get some additional information on the suggested matches
that will help them choose the right translation. Things changed with a new approach
adopted by some tools (such as MultiTrans and LogiTrans), called the full-text approach.
Instead of segmenting the texts at the beginning, they store them as full bitexts and
index them in the TM database using the character-string-in-bitext (CSB) technique.
Once the bitexts are in the database, they are aligned at paragraph level. This approach
has the advantage of retaining and displaying the context (the full paragraph in which the
match is found) for any match retrieved and suggested to the user.
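The benefit of keeping whole texts can be shown with a small sketch. It assumes that each bitext is stored as two lists of paragraphs aligned one-to-one by position, and it uses a plain case-insensitive substring search as a stand-in for the CSB technique, whose details are not described here. `Bitext` and `find_in_context` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Bitext:
    """A whole document pair, kept as paragraphs aligned by position (assumed 1:1)."""
    source_paragraphs: list[str]
    target_paragraphs: list[str]


def find_in_context(query: str, bitexts: list[Bitext]) -> list[tuple[str, str]]:
    """Return every (source paragraph, target paragraph) pair whose source contains the query."""
    hits = []
    for bitext in bitexts:
        for src, tgt in zip(bitext.source_paragraphs, bitext.target_paragraphs):
            if query.lower() in src.lower():
                # The full paragraph pair is returned, so the context is preserved.
                hits.append((src, tgt))
    return hits


manual = Bitext(
    source_paragraphs=[
        "Open the valve slowly. Check the pressure gauge.",
        "Close the valve after use.",
    ],
    target_paragraphs=[
        "Ouvrez la vanne lentement. Vérifiez le manomètre.",
        "Fermez la vanne après utilisation.",
    ],
)
context_hits = find_in_context("pressure gauge", [manual])
```

The search for "pressure gauge" returns the entire first paragraph together with its translation, rather than an isolated segment.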
Another challenge that is still under the spotlight of TM research is the improvement of
the match retrieval techniques so that the system offers better match recall and
precision. In order to enable the system to find all matches available in one’s TM
database for a queried source segment (match recall), some developers (such as Atril,
the company that produces Déjà Vu) have adopted a character-string-based technique
which uses character string matching algorithms to look for a match not only in
segments but also in sub-parts of the segments. In this way, the possibility of finding
more matches increases considerably. In terms of enabling the system to find the correct
matches for the queried source segment (match precision), a few linguistically enhanced
matching techniques have been developed to address that challenge. Traditionally, TM
systems have treated segments of any length as sequences of characters, so the
match retrieval algorithms compare only the ‘surface’
appearance of source segments with that of the segments available in the TM
repository. A new generation of TM systems, in order to offer better match precision,
has added linguistic information to the segments, so that the system can look for
matches based not only on the appearance of segments but also on the linguistic
information they contain. In the case of Masterin® for example, each segment in the TM
database is annotated with grammatical information and constitutes a ‘translation
pattern’. So, during the match search, matches are sought by a deep-structure pattern
recognition method in addition to character string recognition techniques.
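Matching on sub-parts of segments can be approximated as in the sketch below. As a simplification it compares word tokens rather than raw character strings, and it reports the longest run of words shared by the query and each stored segment. It is not the algorithm used by Déjà Vu or any other product, and the minimum length of three words is an assumed setting.

```python
import difflib


def longest_shared_chunk(query: str, stored: str) -> str:
    """Return the longest run of consecutive words found in both segments."""
    q_words = query.split()
    s_words = stored.split()
    matcher = difflib.SequenceMatcher(None, q_words, s_words, autojunk=False)
    m = matcher.find_longest_match(0, len(q_words), 0, len(s_words))
    return " ".join(q_words[m.a:m.a + m.size])


def sub_segment_matches(query: str, stored_segments: list[str], min_words: int = 3) -> list[tuple[str, str]]:
    """List (shared chunk, stored segment) pairs whose shared chunk has at least min_words words."""
    hits = []
    for stored in stored_segments:
        chunk = longest_shared_chunk(query, stored)
        if len(chunk.split()) >= min_words:
            hits.append((chunk, stored))
    return hits


memory = [
    "Switch off the unit before you replace the filter.",
    "Store the unit in a dry place.",
]
chunk_hits = sub_segment_matches("Remove the cover before you replace the filter.", memory)
```

Although neither stored segment is close to the query as a whole, the first shares the chunk "before you replace the filter." with it, so its translation can still be offered as a partial aid. The second shares only a single word and is discarded.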
A greater weight is expected to be given to language resources in terms of both access
via the TM system and maximum deployment. Language resources such as glossaries
and dictionaries will perhaps be sold as add-ins to the TM application, so that translators
will not have to wait long before their termbase has reached a level where it can offer
them valuable help on translation problems. Furthermore, future TM tools will probably
be able to integrate language resources (such as bilingual corpora – parallel or
aligned, glossaries and dictionaries – online or on CD-ROMs) in their TM database
efficiently, easily, quickly and on a large scale. The Web is also expected to contribute to
the improved utility of TM tools. In particular, thanks to the evolution of the Semantic
Web, researchers and developers will be looking into ways to exploit the Web as a vast
resource of bilingual (or monolingual) corpora by developing capabilities to extract texts
from the Web and store them in a TM database so that they can be used by translators
as reference material.
In terms of the acquired resources, future research will concentrate on improving the
algorithms for search and retrieval of matches, so that the system offers more relevant
results more quickly and with more useful linguistic and contextual information.
Finally, there will be a higher degree of convergence between TM systems and Machine
Translation systems in the future, as both types of systems share common problems, for
which the solutions lie in the combination of the two technologies. A TM system with
carefully implemented Machine Translation capabilities seems to be the obvious way
towards the ideal translation support tool.
Conclusion
TM technology was developed with a view to serving the needs of translation
professionals within a constantly changing and demanding global environment.
Translators have a choice whether or not to use TM systems, depending on the nature
of their work and how much weight they attribute to potential benefits deriving from TM
use. However, due to the ignorance and misinformation which have triggered many of
the misconceptions around TM systems, it is natural that some translators have
developed a fear of this technology. The simplest way to combat this fear is to keep an
open mind and seek to be informed about these systems. There may well turn out to be
solutions out there that can take care of what each translator considers grunt work
and thus let them focus on the creative part of translation.