UD for Hindi
Tokenization and Word Segmentation
- In general, words are delimited by whitespace characters. Exceptions are described below.
- According to typographical rules, some punctuation marks (e.g., the comma) are attached to a neighboring word, while others (e.g., the sentence-terminating danda) are not. In either case, each punctuation mark is tokenized as a separate token (word).
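The two rules above can be illustrated with a minimal sketch. It assumes that every character in a Unicode punctuation category (`P*`) becomes its own token, which is a simplification: it would also split hyphens and abbreviation periods, and it does not reflect the actual tooling used to build the treebanks. The function name `tokenize` is hypothetical.

```python
import unicodedata


def tokenize(text):
    """Split on whitespace, then detach every punctuation character.

    Assumption (not from the UD guidelines): any character whose Unicode
    general category starts with "P" is treated as a separate token.
    """
    tokens = []
    for chunk in text.split():
        current = ""
        for ch in chunk:
            if unicodedata.category(ch).startswith("P"):
                # Flush the word collected so far, then emit the punctuation mark.
                if current:
                    tokens.append(current)
                    current = ""
                tokens.append(ch)
            else:
                # Letters, vowel signs, virama and nukta stay inside the word.
                current += ch
        if current:
            tokens.append(current)
    return tokens


if __name__ == "__main__":
    # The comma is attached to the preceding word; the danda is not.
    print(tokenize("राम घर गया, और सो गया ।"))
```

Both the attached comma and the free-standing danda (U+0964) come out as separate tokens, while Devanagari combining marks remain part of their word because they belong to the mark categories rather than to punctuation.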
Morphology
Tags
- Hindi uses all 17 universal POS categories, including particles (PART).
- Hindi has the following auxiliary verbs (AUX):
- है hai and था thā are present and past equivalents of “to be”. They are used as copulas and in periphrastic tenses.
- रह raha (“to stay”) for the progressive aspect (with the stem of the main verb and the auxiliary है/था).
- कर kara (“to do”) for the habitual aspect (with the perfective participle of the main verb).
- जा jā (“to go”) for the passive (with the perfective participle of the main verb).
- Modal auxiliaries:
- सक saka (“be able, can”)
- पा pā (“to manage”)
- चाहिए cāhie (“needed, should, ought to”)
- हो ho (“to have to”)
- पड़ paṛa (“must”)
- There are other verbs that are not auxiliaries under the UD definition, although some authors would call them auxiliaries outside the UD context. Some of them regularly appear in compound verbs as the semantically less salient element; others are control and raising verbs. Some examples follow:
- लग laga (“to start”)
- चुक cuka (“to finish”)
- जा jā (“to go”) (note that this verb can also be used as a true auxiliary in passives)
- ले le (“to take”)
- दे de (“to give”)
- डाल ḍāla (“to throw”)
- पड़ paṛa (“to fall”) (note that this verb can also be used as the modal “must”)
- बैठ baiṭha (“to sit”)
- उठ uṭha (“to rise”)
- रख rakha (“to keep”)
- आ ā (“to come”)
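As a rough illustration of how the two lists above interact, the following sketch stores them as lemma sets and reports which list or lists a lemma appears on. The names `AUX_LEMMAS`, `NON_AUX_LEMMAS` and `list_membership` are hypothetical, and the lookup only reflects the lists on this page. The actual choice between AUX and VERB depends on the construction, so a lemma found only on the auxiliary list (e.g., कर) can still occur as a main verb.

```python
import unicodedata


def _nfc(s):
    # Normalize so that precomposed and decomposed nukta letters compare equal.
    return unicodedata.normalize("NFC", s)


# Lemmas listed on this page as auxiliaries (AUX), including the modals.
AUX_LEMMAS = frozenset(_nfc(x) for x in [
    "है", "था", "रह", "कर", "जा", "सक", "पा", "चाहिए", "हो", "पड़",
])

# Example lemmas listed on this page as not being auxiliaries under UD.
NON_AUX_LEMMAS = frozenset(_nfc(x) for x in [
    "लग", "चुक", "जा", "ले", "दे", "डाल", "पड़", "बैठ", "उठ", "रख", "आ",
])


def list_membership(lemma):
    """Return 'aux', 'non-aux', 'ambiguous' or 'unlisted' for a verb lemma."""
    lemma = _nfc(lemma)
    in_aux = lemma in AUX_LEMMAS
    in_non_aux = lemma in NON_AUX_LEMMAS
    if in_aux and in_non_aux:
        return "ambiguous"  # on both lists; context must decide
    if in_aux:
        return "aux"
    if in_non_aux:
        return "non-aux"
    return "unlisted"


if __name__ == "__main__":
    for lemma in ["सक", "ले", "जा", "पड़", "खा"]:
        print(lemma, list_membership(lemma))
```

The lemmas जा and पड़ come out as ambiguous, matching the notes in the list: जा is an auxiliary only in passives, and पड़ only in its modal reading.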
Features
- Instruction: Describe inherent and inflectional features for major word classes (at least NOUN and VERB). Describe other noteworthy features. Include links to language-specific feature definitions if any.
Syntax
- Instruction: Give criteria for identifying core arguments (subjects and objects), and describe the range of copula constructions in nonverbal clauses. List all subtype relations used. Include links to language-specific relations definitions if any.
Treebanks
There are 2 Hindi UD treebanks: