Carnegie Mellon-led project writes new, digital chapter for out-of-print texts

The Carnegie Mellon-led Million Book Project announced Wednesday that it has successfully put over 1.5 million books online.

Founded and directed by School of Computer Science professor Raj Reddy, the Million Book Project has purportedly surpassed similar scanning projects such as Google Book Search and The Internet Archive. The project is supported by two grants totaling $3.6 million and by partners at China’s Zhejiang University and the Chinese Academy of Science, the Indian Institute of Science, and Egypt’s Bibliotheca Alexandrina. These partners contributed scanning facilities and personnel, who were trained by Carnegie Mellon University Libraries. The project supplies scans in more than 20 languages, in subjects ranging from politics and religion to science and engineering.

Carnegie Mellon has been involved in all aspects of the project, said Gloriana St. Clair, dean of University Libraries. "Carnegie Mellon's contributions include ... a 1,000-book pilot to develop workflows and select scanners, software that runs the site, and the personnel who gathered the coalition of partners (namely, Dr. Raj Reddy)."

The archive’s website states that the principal benefit of the venture will be the even distribution of library materials across different levels of education.

The website notes that university libraries often hold the majority of volumes available in America, while public and secondary school libraries have much smaller collections. The project makes a university-scale library available to anyone with an Internet connection, at a nominal cost to educational budgets.

With the additional servers, the site is now able to accommodate more users than ever, said Vamshi Ambadi, a graduate student at the Language Technologies Institute. "We have accommodated 50,000 unique visitors [on the archive website] today."

Although the project’s main goal of scanning all books on Earth was believed impossible, it has brought 1.5 million selected books to the Internet in a little less than two years.

While this is still less than 1 percent of all available works in the world, directors agreed at a November 2007 conference to continue and expand the venture, an expansion that could greatly shorten the time required to complete the project.

The archive holds over 970,000 volumes in Chinese, at least two and a half times the number in English, and both languages dwarf all others, which range from Hindi to Russian. Project partner Zhenkun Zhou, a visiting scholar at Carnegie Mellon’s School of Computer Science, explained that more people work on this project in China than in the other participating countries.

“In order to get the Chinese government’s support, we have to digitize many Chinese books,” Zhou said. At least one-fifth of the project’s more than 1,000 workers are based at Zhejiang University alone.

Zhou added that copyright issues, a major barrier for the Million Book Project, have contributed to this disparity.

Copyrights stay in effect for varying lengths of time in different countries, and because of China’s long history, the project has significantly more Chinese works that are exempt from copyright protection.

Far fewer books are available in English than in Chinese because of tighter copyright constraints. About half of the current collection remains under copyright. The recent addition of Egypt’s Bibliotheca Alexandrina may help even the distribution. Project directors hope that digitizing the volumes of the Alexandria library, which has suffered destruction several times over the course of its history, will make them immune to further harm.

Goals for the next year, according to the archive website, include restructuring the hosting solution for scanned volumes.

The new plan is to have every country host the material it scans, while storing metadata such as authors, publishers, and publishing dates on one centralized server for easier searching. However, the press release states that most of the work will focus on correcting “inaccuracies and non-standard cataloging practices.”

The archive currently grows by over 7,000 volumes per day.