Copyright ©2000 Sun Microsystems, Inc.
Sun, Sun Microsystems, Inc., Java and all Java-based marks
and logos are trademarks or registered trademarks of
Sun Microsystems, Inc. in the United States and other countries.
The JSpeech Markup Language (JSML) is a text format used by applications to annotate text input to speech synthesizers. JSML elements provide a speech synthesizer with detailed information on how to speak text and thus enable improvements in the quality, naturalness and understandability of synthesized speech output. JSML defines elements that describe the structure of a document, provide pronunciations of words and phrases, indicate phrasing, emphasis, pitch and speaking rate, and control other important speech characteristics. JSML is designed to be simple to learn and use, to be portable across different synthesizers and computing platforms, and to be applicable to a wide range of languages.
This document is derived from the Java™ Speech API Markup Language (Version 0.6, October 1999), which is available from Sun Microsystems' web site: http://java.sun.com/products/java-media/speech/.
Sun Microsystems wishes to submit this document for consideration by the W3C Voice Browser Working Group towards the development of internet standards for speech technology. We expect the resulting W3C recommendations to be of great importance to the developer community.
Please refer to Sun's submission for statements on IP rights.
This document is a submission to the World Wide Web Consortium from Sun Microsystems, Inc. (see Submission Request, W3C Staff Comment). For a full list of all acknowledged Submissions, please see Acknowledged Submissions to W3C.
This document is a Note made available by W3C for discussion only. This work does not imply endorsement by, or the consensus of the W3C membership, nor that W3C has, is, or will be allocating any resources to the issues addressed by the Note. This document is a work in progress and may be updated, replaced, or rendered obsolete by other documents at any time.
This document is derived from an existing specification published by Sun and developed together with companies that collectively wrote the Java™ Speech API. That specification is known as the Java Speech API Markup Language, and is available for reference from http://java.sun.com/products/java-media/speech/. Except for the change in the specification's name and the corresponding references in the specification, this document is technically identical to the previously published document. We have changed the name simply to protect Sun trademarks. In any case, we expect that any derived specification produced by the Consortium will have a different name.
Should any changes be required to the document, we would expect future versions to be produced by the W3C process. Sun maintains ownership of the JSML specification and reserves the right to maintain and evolve the JSML specification independently and such independent maintenance and evolution shall be owned by Sun.
A list of current W3C technical documents can be found at the Technical Reports page.
Sun Microsystems, Inc. received contributions to this specification from Apple Computer, Inc., AT&T, Dragon Systems, Inc., IBM Corporation, Novell, Inc., Philips Speech Processing and Texas Instruments Incorporated as well as from many internet reviewers.
JSML has benefited from previous initiatives to mark-up speech output, in particular those that use an SGML or XML syntax:
Figure 1: Text from an application is converted to audio output
Speech synthesizers are developed to produce natural-sounding speech output. However, producing natural human speech is a complex process, and the ability of speech synthesizers to mimic human speech is limited in many ways. For example, speech synthesizers do not "understand" what they say, so they do not always use the right style or phrasing and do not provide the same nuances as people.
The JSpeech Markup Language (JSML) allows applications to annotate text that is to be spoken with additional information that can improve the quality and naturalness of synthesized speech.
JSML is an XML application (eXtensible Markup Language). XML is an Internet standard for representing structure and meaning in documents (Section 2 reviews XML in more detail). JSML defines a specific set of elements to mark up text to be spoken, and defines the interpretation of those elements so that there is a common understanding between synthesizers and document producers of how marked-up text will be spoken.
The JSML element set includes several types of element. First, JSML documents can include structural elements that mark paragraphs and sentences. Second, there are JSML elements to control the production of synthesized speech, including the pronunciation of words and phrases, the emphasis of words (stressing or accenting), the placement of boundaries and pauses, and the control of pitch and speaking rate. Finally, JSML includes elements that represent markers embedded in text and that enable synthesizer-specific controls.
For example, for the text in Figure 1, we could use JSML to indicate sentence structure by placing tags at the start and end of the text and emphasize the word "can" by surrounding it in an emphasis element:
<div type="sentence">Computers <emphasis>can</emphasis>
speak!</div>
JSML documents may be provided to a speech synthesizer from various sources and, as stated in the goals, JSML is intended to be effective for producing speech output for a wide range of text types in differing application domains. For instance, JSML could be used for reading books, technical documents, email, sports scores, web pages, airline flight information and much more.
However, a speech synthesizer cannot possibly understand how to clearly read plain text from such diverse sources: email often contains smileys :) and other idiosyncratic text forms, airline flight information might be extracted from a database into a software object, and an HTML page may carry formatting intended to look good in a visual browser.
The role of JSML is as a consistent markup for text obtained from such diverse sources. Thus, it is the responsibility of the application or user that generates the JSML document to mark up the text in a way that provides the speech synthesizer with the structural and production information required to speak the text clearly and appropriately. Figure 2 illustrates the basic steps in this process.
Figure 2: JSML Processing
Consider an example of reading a web page. The source data for a web page is usually an HTML page (Hypertext Markup Language), possibly with Cascading Style Sheets (CSS) or Audio Cascading Style Sheets (ACSS) providing additional data on how to render the page visually or audibly. The application processing the web page is the web browser - an application designed specifically to process and then render HTML documents. To render an HTML document visually the browser controls a graphical display to write characters and images. To render an HTML document aurally (i.e. to speak it), the browser controls a speech synthesizer and provides the synthesizer with JSML documents to be read.
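For example, a browser might render a short HTML heading and sentence as JSML along the following lines (a hypothetical rendering; the actual markup produced is entirely up to the browser):
<div type="paragraph"><emphasis>Weather Report</emphasis>
<break size="medium"/> Sunny skies are expected today.</div>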
Another common example is reading email. When an email reader converts an email message to spoken text it can choose to include email header information (sender, subject, date, etc.) and can mark up special content such as times, dates and email addresses so that they are spoken clearly. The email application might also perform special processing of text in the body of the message to handle attachments, indented text, common email abbreviations and so on. Here is a sample of an email message converted to JSML:
<div type="paragraph">Message from
<emphasis>Alan Schwarz</emphasis> about new synthesis
technology. Arrived at <sayas class="time">2pm</sayas>
today.</div>
<div type="paragraph">I've attached a diagram showing the
new way we do speech synthesis.</div>
With the rapid, widespread adoption of XML on the internet, developers now have access to many books, online guides and courses on XML. Some places to start looking for XML material include:
http://www.w3.org/XML/
http://www.xml.org/
http://www.ucc.ie/xml/
http://www.w3.org/TR/REC-xml
In practical terms, a well-formed document requires that all elements, entities and other items in the document be syntactically correct. For example, a container element must have matching start and end tags and elements must be correctly nested.
What is not required is that the document be valid. Being a valid document imposes the additional constraint that the elements, attributes and values of the document match the Document Type Declaration (DTD) for JSML that is provided in Appendix A. In XML terminology, a speech synthesizer uses a non-validating XML parser.
The practical implication is that if a JSML document contains an element or other item not defined in the JSML specification, the speech synthesizer is required to ignore it. An advantage of this is that applications may retain structural or other information within a JSML document that is useful to the application but is ignored by the synthesizer. A disadvantage of non-validation is that misspelled tag names do not generate errors, which can make mistakes more difficult to detect. Thus, for development purposes, we include in this specification a Document Type Declaration (DTD) which can be used with XML tools during development to check JSML documents for such errors.
When the XML declaration is included, its '<' character must be the very first character in the document (not even preceded by whitespace). The declaration may optionally define the character encoding of the document. This is most useful when authoring JSML documents for non-ASCII languages. For example, a JSML document in Japanese may have the following declaration indicating that it uses a Japanese character set:
<?xml version="1.0" encoding="SJIS" ?>
Elements are either container elements or empty elements. A container element is marked by a balanced pair of start and end tags (e.g., <emphasis> to open paired with </emphasis> to close). The start and end tags must have exactly the same name, and that name defines the type of the element. The text appearing between the start and end tags is the contained text, as shown in Figure 3, and may include other elements. The start tag may contain zero or more attributes. Each attribute has an attribute name and an attribute value. The attribute value is always in quotes.
Figure 3: Container Element and Attributes
An empty element has a start tag but no end tag and no contained text. The tag for an empty element may have zero or more attributes. XML introduces a new syntax for empty elements, as shown in Figure 4, by requiring a closing slash in the tag.
Figure 4: Empty Element and Attributes
An XML comment begins with a '<!--' character sequence, ends with a '-->' character sequence, and may contain any text except the two-character sequence '--'. For example,
How now brown <!-- This is an example comment --> cow.
A CDATA section can be used in XML documents to escape blocks of text that contain characters that would otherwise be considered as markup. For example, to avoid '<' and '>' characters being interpreted as the start and end of a tag we could place them within a CDATA section:
Email from <![CDATA[ <joe@acme.com> ]]>
Entities are useful as a short-hand for defining common chunks of content. All entities have two parts. The entity declaration must occur first in the document and is of the form
<!ENTITY jsml "JSpeech Markup Language">
The entity reference may occur any number of times following the declaration, and is of the form
&jsml;
The effect of the reference is for the replacement text in the declaration ("JSpeech Markup Language") to be inserted at the reference point.
Character entities serve two functions. First, they enable a document to use characters in the Unicode character set when they are not available from the keyboard. For example, the Greek small letter beta ('β') can be written as either of the following:
&#x3B2; <!-- hexadecimal code -->
&#946; <!-- decimal code -->
Second, XML provides character entities that escape characters that might otherwise be considered as markup. This symbol set includes:
Entity | Symbol | Name
&lt;   | <      | less than
&gt;   | >      | greater than
&amp;  | &      | ampersand
&quot; | "      | quote
&apos; | '      | apostrophe
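For example, a literal '<' character in text that is to be spoken can be written with its entity so that it is not mistaken for the start of a tag (a simple illustration):
<div type="sentence">The expression x &lt; 10 is true.</div>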
<emphasis> legal </emphasis>
<break/>
<a> <b> legal </b> </a>
<a> legal </a> <b> legal </b>
<a value="legal">
<emphasis> legal </emphasis>
<EMPHASIS> illegal </EMPHASIS>
A JSML document consists of a root element containing structural, production, and miscellaneous elements. All JSML elements are designed to provide a speech synthesizer with information on how to speak text contained within those elements. The following table presents an overview of JSML's elements. These elements are defined in detail in the following sections.
The body of a JSML document is contained within the root "jsml" element. For example:
<?xml version="1.0"?>
<jsml>
... the body ...
</jsml>
The body should represent one complete body of text to be spoken. It would not be appropriate, for example, to break a single sentence across two JSML documents.
The root "jsml" element may contain any sequence of the remaining JSML elements, entities, CDATA sections and unmarked text.
The optional "lang" attribute allows a document to be marked as containing text of a particular language. The format of the language attribute follows the internet standard defined by RFC 1766. In summary, the language is given as a primary tag followed by zero or more subtags, each separated by "-". White space is not allowed and all tags are case insensitive. The two-letter primary tag is an ISO 639 language abbreviation: for example, "de" for German, "en" for English, "ja" for Japanese, or "es" for Spanish. The subtag may be an ISO 3166 country code: for example, "US" for the United States, "br" for Brazil, "cn" for China.
Examples of complete language attributes are:
en, en-US, en-uk, de-ch, zh-cn
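For example, a document containing US English text could declare its language on the root element (a minimal sketch using the forms above):
<?xml version="1.0"?>
<jsml lang="en-US">
<div type="sentence">Computers can speak!</div>
</jsml>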
The "div
" element declares a span of text to be of a specific text structure type.
The current specification allows the "div" element to mark paragraphs and
sentences. For example:
<div type="paragraph">This a short paragraph.</div>
<div type="para"><div type="sent">The subject has changed,
so this is a new paragraph.</div><div type="sent">This
paragraph contains two sentences.</div></div>
The "type
" attribute has defined values of "paragraph
" and "para
" to mark
paragraphs, and values of "sentence
" and "sent
" to mark sentences. The
abbreviated forms have identical interpretation to the full form. It is typical that
paragraphs contain sentences. It is not typical for paragraphs to be contained
within other paragraphs or within sentences.
Future releases of JSML may add additional structural types. For example, types of conversational interactions may be useful for dialog systems and grammatical constructs within sentences might also be marked.
For text not contained within an explicit "div" element for a paragraph, synthesizers will typically apply heuristics to determine paragraph boundaries. For text not contained within an explicit "div" element for a sentence, synthesizers will typically apply heuristics to determine sentence boundaries. Developers should be aware that heuristics may be less reliable than explicitly marked structures.
The "voice
" element is a container element that is used to mark text to be spoken
in a specified voice. Voices are defined using the "gender
", "age
" and "variant
"
attributes, or in certain cases, using the "name
" attribute. For example, the
following requests that the text be spoken in a 30 year-old female voice:
<voice gender="female" age="30"> Some text. </voice>
The voice element is a request for a specific speaking voice, but it will not always be possible for a particular synthesizer to produce that voice. This is because most speech synthesizers have a specific set of installed voices with specific characteristics. If the specified voice is not available then the synthesizer is responsible for selecting the closest approximation. In the example above, if a 30-year-old female voice were not available the synthesizer might select another female voice with a different perceived age.
The descriptive gender value "neutral" is intended for voices that are not obviously male or female, for example, robotic voices, other non-human voices, and some children's voices.

The descriptive values for age are intended to cover broad categories of perceived age in a voice: "child" is up to 12 years old, "teenager" is roughly 13 to 19 years old, "younger_adult" is roughly 20 to 40 years old, "middle_adult" is roughly 40 to 60 years old, and "older_adult" is roughly 60 years and older. The "adult" value represents any adult voice (i.e. younger, middle or older adult) and thus indicates 20 years or older.
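For example, a story-telling application might request a child's voice using the descriptive values (an illustrative request; the synthesizer substitutes its closest available voice if necessary):
<voice gender="neutral" age="child"> Once upon a time... </voice>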
In many documents that use multiple voices it is important to be able to request different voices, for example, two different 20-year-old male voices. The "variant" attribute allows such requests to be made and is defined as a variant within the other specified attributes. For example, the following tags request the first and second teenaged male voices:
<voice gender="male" age="teenager" variant="1"> ...
<voice gender="male" age="teenager" variant="2"> ...
If the synthesizer has 3 built-in teenaged male voices then the variants will eventually cycle so that variants 1, 2 and 3 will repeat as variants 4, 5 and 6 and so on. If the age were not specified, then variants would cycle through all the available male voices. A synthesizer will guarantee that a voice defined by gender, age and variant will be the same whenever referred to in the same JSML document.
The "variant
" attribute may have the special value of "+". With this value, the
synthesizer will attempt to assign a different voice from the current speaking
voice within the constraints of the age and gender parameters.
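For example, the following requests that the quoted speech be spoken in some male voice other than the current speaking voice (an illustrative use of the "+" value):
<voice gender="male" variant="+"> I'll be there at noon. </voice>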
Because different speech synthesizers have different sets of available voices, there is not a guarantee that JSML documents will be produced identically on different systems. However, with consistent use of the three attributes described so far, reasonable behavior is supported.
The fourth voice selection attribute is the "name" attribute. Most synthesizers assign names to each of their voices (sometimes also called "voice fonts"). In many operating environments the names of these voices are available to the application or person writing a JSML document and can be used in the voice tag. If specified, the name attribute takes precedence over the other voice attributes and the synthesizer will attempt to use the named voice. If the name is unknown, the synthesizer then attempts to apply the other parameters. When the name parameter is included for a specific synthesizer, it is good practice to also include the age and gender parameters of the voice so that the document is spoken reasonably on other synthesizers. For example:
<voice gender="female" age="teenager" name="Yuriko"> ...
A change in voice will usually have an effect upon the prosodic attributes of the contained text, in particular upon the pitch, pitch range and speaking rate values. The natural speaking pitch of a voice is one of its intrinsic characteristics. For example, male voices are typically lower than female and child voices. The preferred speaking rate and range of acceptable rates are also intrinsic to a voice.
When changing voices, the synthesizer may make some effort to preserve the current setting of the prosodic parameters. For instance, if the speaking rate is high when a voice is changed, the synthesizer should attempt to maintain a similar speaking rate.
Human readers are usually able to resolve such issues because they can apply an understanding of the context (e.g. a memo about a meeting date), an understanding of the text content (e.g. the preceding and following words indicate the text construct's meaning), or an understanding of the communication medium (e.g. email often contains text forms not found elsewhere).
Whenever it is practical, such information should be incorporated into a JSML document using the "sayas" element. The "class" attribute is defined with a range of common text structures that can be interpreted by a speech synthesizer. The format of the "class" attribute is an identifier optionally followed by a colon (":") and a format. For example, class="date:mdy" indicates a date that is formatted in US style as month-day-year.
The following is a table of the currently defined list of class values and the optional formats. Values not included in this list will be ignored by speech synthesizers. The way in which the text is converted to a spoken form is determined by the speech synthesizer and not all forms will always be converted in the same way by all synthesizers. For example, "5/15" (or "15/5") may reasonably be spoken in English as "May fifteenth" or as "the fifteenth of May".
The following are examples of how "sayas" elements may be spoken:
<sayas class="literal">JSML</sayas>
<!-- spoken as "J. S. M. L." -->
<sayas class="literal">12</sayas>
<!-- spoken as "one two" -->
<sayas class="number">31.14</sayas>
<!-- spoken as "thirty one point one four" -->
<sayas class="currency">$49.50</sayas>
<!-- spoken as "forty nine dollars, fifty cents" -->
The defined list of classes and formats does not cover all possible formats that appear in text - it would be impossible to produce a list that covers all possible forms in a large number of languages. When a text form occurs that is not included in the list, an alternative markup is to convert the written form to the spoken form by hand. For example, if the date class did not exist, the spoken form of the text could be substituted so instead of:
The program starts in <sayas class="date:my">7/99</sayas>.
the document could contain:
The program starts in July nineteen ninety nine.
One advantage of the "sayas
" element, when it can be applied, is that the
sometimes difficult task of converting text to a speakable form is delegated to the
speech synthesizer. More importantly, when processing documents of different
languages, you do not have to consider the text constructs of multiple languages.
In many cases, an application will be unable to identify or determine the class of all the text sequences that might be marked with the "sayas" element. In such cases the application can leave the text forms as is and let the synthesizer attempt to determine how to speak them. Since most speech synthesizers have some ability to detect conventional text forms this approach will usually succeed, but there is a greater risk of misinterpretation or mispronunciation.
The "sayas
" element is a container element. It typically contains only plain text or
CDATA sections. It should not contain "div
" elements. It may contain other
production elements but it is reasonable for the speech synthesizer to ignore them
as it interprets the text.
The "phoneme" element marks a sequence of text as being a phonemic string. Phonemic strings are defined using the International Phonetic Alphabet (IPA).
Phoneme sequences may be used where words are difficult to pronounce (e.g. words of foreign origin and many proper names) or where pronunciation is ambiguous (e.g. "I will read a book" pronounced "reed", compared to "I have read a book" pronounced "red").
Where a pronunciation is repeated many times in a document it is often convenient to define an entity for that pronunciation. For example:
<!ENTITY boat "boʊt">
... the <phoneme>&boat;</phoneme> is on the water...
The International Phonetic Alphabet character set is a subset of Unicode. The IPA characters are represented by codes from "&#x250;" ("ɐ") to "&#x2AF;" ("ʯ"), by modifiers from "&#x2B0;" ("ʰ") to "&#x2FF;" ("˿"), by diacritic characters from "&#x300;" to "&#x36F;", and by certain Latin, Greek and symbol characters in the range up to "&#x17F;" ("ſ"). Character entities are often useful in representing phonemic strings because most of these IPA characters do not appear on keyboards. Details of the Unicode IPA support are provided in The Unicode Standard, Version 2.0 (The Unicode Consortium, Addison-Wesley Developers Press, 1996).
Unfortunately, IPA is difficult to learn and use and there is not yet standardization on the use of subsets of IPA for particular languages and dialects. Nevertheless, speech technology is converging on IPA as the only available system to represent the sounds of a wide range of languages and dialects. There is some hope that this convergence will lead to developer tools and increased standardization that will make IPA more practical.
The "phoneme
" element is a container element. It is not legal to nest any other
JSML elements within the "phoneme
" element.
The emphasis element marks a range of text that should be spoken with emphasis, or what is also referred to as stress or prominence. Depending upon the language and many other factors, emphasized text may be spoken more loudly, at a different speed, or at a different pitch.
The "level
" attribute can be used to indicate the degree of emphasis to be applied
to the contained text. Defined values are "strong
" (for strong emphasis),
"moderate
" (for some emphasis) and "none
" (for no emphasis). The default level
is "moderate
".
The car is <emphasis>red</emphasis>, not blue.
Buy <emphasis level="strong">4</emphasis> burgers and fries.
The "break
" element is an empty element that is used to mark phrases and
boundaries in the speech output, what are often though of as pauses. To indicate
what type of break is desired, the element can include a "size
" attribute or a
"time
" attribute. (If both attributes are included, the "size
" attribute takes
precedence.)
A "size
" attribute indicates a break that is relative to the characteristics of the
current speech. A "time
" attribute requests a pause for an absolute amount of time
in either seconds or milliseconds. Where possible, the break should be defined by
a "size
" attribute rather than "time
". This is because, in most languages, the
perception of phrasing is speech is produced by complex interactions of pitch,
timing changes, and sometimes pauses. Those factors are significantly affected by
speaking context. For example, a 300 millisecond break in fast speech sounds
more significant than it does in slow speech.
Take a deep breath<break/> then continue.
1 <break size="small"/> 2 <break size="small"/> 3 ...
Press 1 or wait for the tone <break time="3s"/>.
The "
prosody
"
element provides prosodic control for text segments. Prosody is a
collection of features of speech that includes its timing, intonation and phrasing.
Proper control of prosody can improve the understandability and naturalness of
speech. For example, in English, important new information is often spoken more
slowly and with greater pitch range to add emphasis.
The "
prosody
"
element provides broad parameters to a speech synthesizer. For
example, setting the rate to 120 words per minute does not mean that every word
is spoken in half a second, but instead suggests an approximate average rate over a
longer sequence of words.
The four prosodic attributes - "rate", "volume", "pitch" and "range" - are all numeric values with descriptive equivalents. The legal absolute and relative numeric values are shown in the table. The legal numeric forms are integers and simple floating point values (e.g. "150", "+8.5", "-10.8%"). The reasonable numeric ranges for these values depend upon a number of factors including language, speaking voice and the speech synthesizer design. As a general rule, it is best to use the descriptive values as a first choice, relative values next, and absolute values as a last resort.
The descriptive values for rate are "fast", "medium", "slow" and "default". Numeric values for rate are difficult to define because typical word lengths differ across languages. In English, normal speaking rates may be 150 to 200 words per minute; 300 words per minute is very fast. Some users, particularly users with disabilities who listen regularly to speech synthesizers, may use speaking rates up to 500 words per minute. For example,
<prosody rate="150">Text at 150 words per minute</prosody>
The descriptive values for volume are "loud", "medium", "quiet" and "default". Numeric values for volume lie in the range 0.0, for silence, to 1.0 for maximum volume. For example,
I can speak <prosody volume="quiet"> softly </prosody>
The descriptive values for both pitch and pitch range are "high", "medium", "low" and "default". The reasonable range of numeric values will depend upon factors including the language and the voice. Female and child voices are typically higher than male voices. Different male or female voices may have different natural pitch ranges and therefore different defaults. Some languages have different cultural conventions for pitch (e.g. polite voices are sometimes higher). As a broad rule of thumb, male voices will usually have a baseline pitch between 80 Hertz and 180 Hertz. Female voices often lie between 150 Hertz and 300 Hertz. Pitch range is often between 20% and 60% of the baseline pitch, with smaller ranges producing more monotone, or flat, speech.
Value | Description
Nst   | Sets the pitch value to N semitones.
+Nst  | Increases the pitch value by N semitones.
-Nst  | Decreases the pitch value by N semitones.
The pitch and pitch range values support semitone values for absolute and relative settings. A semitone is the difference in pitch between adjacent notes on a piano and many other musical instruments, and a semitone value of "60.0" corresponds to "middle C" on a conventional piano, or to a frequency of 261.6Hz. Legal relative and absolute semitone attribute values are shown in the table above.
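For example, the following requests that the contained text be spoken two semitones below the current pitch (a sketch using the relative semitone form from the table):
<prosody pitch="-2st">This is spoken at a slightly lower pitch.</prosody>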
While speaking a sentence, pitch moves up and down in natural speech to convey extra information about what is being said. The baseline pitch represents the normal minimum pitch of a sentence. The pitch range represents the amount of variation in pitch above the baseline. Setting the baseline pitch and pitch range can affect whether speech sounds monotonous (small range) or dynamic (large range).
Figure 5: Baseline Pitch and Pitch Range
Note that in all cases, relative values for pitch, rate and volume increase the portability of JSML across speaking voices and synthesizers. Relative settings allow users to apply the same JSML to different voices (e.g., male and female voices with very different pitch ranges) and to set a local preference for speaking rate. For example, some users set the speaking rate very high (300 words per minute or faster) so they can listen to a lot of text very quickly.
Finally, it is quite common for more than one prosodic attribute to be changed in a single prosody element. For example, in English, when speaking parenthetical text (such as this), the pitch, pitch range and volume are usually lowered together. For example:
<div type="sent">He drove his new car, <prosody pitch="-10%"
range="-20%" volume="-20%">not his ugly old car</prosody>,
because he wanted to seem more impressive.</div>
The "marker" element requests a notification from the speech synthesizer to the application when the element is reached during speech output. The "marker" element has the same effect as the "mark" attribute that is optionally available for all JSML elements, but has no other side-effects. For example:
Answer <marker mark="yes_no_prompt"/> yes or no.
The mechanisms for providing notifications to an application are left to the environment in which the JSML text is being produced. In some environments there may be no such mechanism available.
This "engine
" element allows applications to utilize a speech synthesizer's
proprietary capabilities by substituting engine-specific control data for the
contained text. The non-proprietary data is the contained text of the element and
will be spoken by any synthesizer except one that matches the identifier provided
in the "name
" attribute. For a synthesizer that matches the "name
" attribute, the text
value of the "data
" attribute is spoken instead of the contained text. For example,
take the following JSML text:
I am <engine name="Acme Voice" data="an Acme"> another
</engine> speech synthesizer.
An "Acme Voice" synthesizer will say "I am an Acme speech synthesizer.
".
All other speech synthesizers will say "I am another speech synthesizer.
"
A JSML document may contain "engine
" elements for any number of speech
synthesizers. Nesting "engine
" elements is a useful way of providing variants of
the same span of text for multiple engines.
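For example, nested "engine" elements could provide alternative data for two synthesizers (the synthesizer names here are purely illustrative):
I am <engine name="Acme Voice" data="an Acme"><engine
name="Best Voice" data="a Best Voice">another</engine></engine>
speech synthesizer.
An "Acme Voice" synthesizer would say "I am an Acme speech synthesizer.", a "Best Voice" synthesizer would say "I am a Best Voice speech synthesizer.", and all other synthesizers would say "I am another speech synthesizer."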
<?xml version="1.0" encoding="utf-8"?> <!-- **************************************************** --> <!-- DTD: JSpeech Markup Language - v0.6 --> <!-- --> <!-- Note: JSML is interpreted by speech synthesizers --> <!-- with a non-validating parser, so strictly speaking --> <!-- a DTD is not required. This DTD is intended --> <!-- to be used by development tools such as format --> <!-- checkers to verify JSML documents. --> <!-- **************************************************** --> <!-- **************************************************** --> <!-- Revision history: --> <!-- created 1 December 1998 by William Walker --> <!-- v0.5 specification --> <!-- revised 12 October 1999 by Andrew Hunt --> <!-- v0.6 specification --> <!-- **************************************************** --> <!-- **************************************************** --> <!-- Define common entities --> <!-- **************************************************** --> <!-- The set of production elements --> <!ENTITY % production 'voice|sayas|phoneme|emphasis|break|prosody'> <!-- The set of miscellaneous elements --> <!ENTITY % miscellaneous 'marker|engine'> <!-- The mark attribute present on all elements --> <!ENTITY % att-mark 'mark CDATA #IMPLIED'> <!-- **************************************************** --> <!-- JSML structural elements and attributes --> <!-- **************************************************** --> <!-- Root JSML element --> <!ELEMENT jsml (#PCDATA | div | %production; | %miscellaneous;)*> <!ATTLIST jsml lang CDATA #IMPLIED %att-mark; > <!-- preserve white space - it is significant in JSML --> <!ATTLIST jsml xml:space (default|preserve) "preserve"> <!-- div: text structure element --> <!ELEMENT div (#PCDATA | div | %production; | %miscellaneous;)*> <!ATTLIST div type (para|paragraph|sent|sentence) #REQUIRED %att-mark;> <!-- **************************************************** --> <!-- JSML production elements and attributes --> <!-- **************************************************** --> <!-- "voice" requests a change in speaking voice --> <!ELEMENT voice (#PCDATA | div | %production; |%miscellaneous;)*> <!ATTLIST voice gender (male | female | neutral) #IMPLIED age CDATA #IMPLIED variant CDATA #IMPLIED name CDATA #IMPLIED %att-mark;> <!-- "sayas" indicates the type of the contained text --> <!ELEMENT sayas (#PCDATA)> <!-- The set of sayas classes --> <!-- We do not enumerate all possible formats here --> <!ENTITY % sayastypes '(literal|date|time|name|phone|net|address| currency|measure|number)'> <!ATTLIST sayas class (%sayastypes;|CDATA) #REQUIRED %att-mark;> <!-- "phoneme": contained text is an IPA phoneme string --> <!ELEMENT phoneme (#PCDATA)> <!ATTLIST phoneme original CDATA #IMPLIED %att-mark;> <!-- "emphasis": specify stress for contained text --> <!ELEMENT emphasis (#PCDATA | %production; | %miscellaneous;)*> <!ATTLIST emphasis level (none|moderate|strong) "moderate" %att-mark;> <!-- "break": insert a pause or other boundary --> <!ELEMENT break EMPTY> <!ATTLIST break size (none|small|medium|large) "medium" time CDATA #IMPLIED %att-mark;> <!-- "prosody": set acoustic properties for contained text --> <!ELEMENT prosody (#PCDATA |div|%production;|%miscellaneous;)*> <!ATTLIST prosody rate CDATA #IMPLIED volume CDATA #IMPLIED pitch CDATA #IMPLIED range CDATA #IMPLIED %att-mark;> <!-- "marker": insert a callback request --> <!ELEMENT marker EMPTY> <!ATTLIST marker %att-mark;> <!-- "engine": insert synthesizer-specific data --> <!ELEMENT engine (#PCDATA | div | %production;|%miscellaneous;)*> 
<!ATTLIST engine name CDATA #IMPLIED data CDATA #REQUIRED %att-mark; >