Abstract

We present the SinOCR and SinFUND datasets, two comprehensive resources designed to advance Optical Character Recognition (OCR) and form understanding for the Sinhala language. SinOCR, the first publicly available and the most extensive dataset for Sinhala OCR to date, includes 100,000 images featuring printed text in 200 different Sinhala fonts and 1,135 images of handwritten text, capturing a wide spectrum of writing styles. SinFUND, the first fully annotated dataset of its kind, comprises 100 diverse, manually filled Sinhala forms, offering a robust foundation for developing template-free form understanding models. These datasets are crucial for addressing the challenges posed by paper-based documentation in low-resource languages, enhancing accuracy and efficiency in digital document processing. Both datasets aim to stimulate further research and innovation, providing valuable benchmarks for the OCR and form understanding communities. Access to these datasets will facilitate the development of more sophisticated models, promoting digital transformation and improved administrative processes in Sri Lanka and potentially other regions with similar linguistic challenges. The benchmarks will be published in a research article with the same title.

Instructions:

The dataset contains the following three subfolders:

1. SinFUND: Sinhala forms dataset

2. SinOCR-handwritten

3. SinOCR-printed
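
The folder layout above can be checked after download with a short script. This is a minimal sketch, not an official loader: it assumes the archives extract into directories named exactly as listed here, and that images use common extensions such as .png or .jpg (the page does not state the file formats). The function names extract_archive and count_images are illustrative only.

```python
from pathlib import Path
import zipfile

# Subfolder names as listed above; the exact on-disk spelling is an assumption.
SUBFOLDERS = ["SinFUND", "SinOCR-handwritten", "SinOCR-printed"]

# Assumed image extensions; the dataset page does not specify file formats.
IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def extract_archive(zip_path, dest_dir):
    """Extract a downloaded archive into dest_dir and return that directory."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(dest)
    return dest


def count_images(root):
    """Count image files under each expected subfolder.

    A value of None means the subfolder was not found under root.
    """
    root = Path(root)
    counts = {}
    for name in SUBFOLDERS:
        folder = root / name
        if not folder.is_dir():
            counts[name] = None
            continue
        # Search recursively, since images may be nested (for example per font).
        counts[name] = sum(
            1
            for p in folder.rglob("*")
            if p.is_file() and p.suffix.lower() in IMAGE_EXTS
        )
    return counts
```

Calling count_images on the extraction directory returns a dictionary keyed by subfolder name, which gives a quick way to compare the extracted contents against the image counts stated in the abstract.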

Comments

test

Submitted by Shalitha Thilak... on Tue, 06/11/2024 - 06:18

Hi, is it possible to get this dataset?

Submitted by Dejan Pecevski on Thu, 09/19/2024 - 05:24

Dataset Files

SinOCR - handwritten.zip (4.18 MB)

Standard Dataset

SinOCR and SinFUND - Sinhala OCR and Form Understanding Datasets
