Ancient Chinese classics preserved and digitised

Editorial Type: News Date: 2021-06-08 Tags: Document, OCR, AI, Capture, Recognition, DAMO, Alibaba
Alibaba's research institute digitises ancient Chinese books using advanced AI

The digitisation of Chinese classics is challenging because ancient Chinese characters are complex. Throughout history, a single Chinese character might have had several variants and written forms. Digitising ancient Chinese books through optical character recognition (OCR) not only facilitates machine reading but also gives new life to numerous ancient books by opening them to public perusal.

Alibaba DAMO Academy (DAMO), the global research institute of Alibaba, has started a new project to digitise Chinese classics together with the Alibaba Foundation, the Library of the University of California, Berkeley, Sichuan University, the National Library of China, and Zhejiang Library. The project aims to digitise and aggregate ancient Chinese books and convert scanned images into text for open access. This way, libraries in China and abroad can work together to make their ancient Chinese books freely available to the world.

The first batch of Chinese classics in this joint effort comes from the C.V. Starr East Asian Library of the University of California, Berkeley, one of the largest academic libraries with rich holdings of ancient Chinese books. Now on display are 200,000 digital pages of ancient books, including woodblock-printed books and manuscripts from the Song Dynasty and Yuan Dynasty, a period in ancient China dating back more than 1,000 years.

UC Berkeley Library provided scanned pages and metadata, while DAMO used OCR to turn the scanned images into text. Furthermore, DAMO teamed up with scholars at Sichuan University to develop an AI model that combines single-character indexing and automatic character grouping with machine learning techniques such as self-supervised learning and few-shot learning. This model achieves an accuracy rate of 97.5% in recognising ancient characters. It can now recognise 30,000 ancient Chinese characters and works at thirty times the speed of human reading.

Jeff Zhang, Head of Alibaba DAMO Academy, said: "Alibaba will continue to invest in resources and cutting-edge technology to support such projects. Making ancient books available to the public is in line with our values and belief in 'Tech for Change'. We believe that technology can play a critical role in preserving precious cultural relics and heritage, and we look forward to working with libraries in China and abroad to make this happen."

damo.alibaba.com