Arab American Newspapers Project
The Khayrallah Center is delighted to announce the release of the fully searchable database of Arab American newspapers. This online database gives the general public and researchers full and open access to this rich source for the history of Arabs in the Americas and beyond.
Background and Information
- The Khayrallah Center has been digitizing Arabic-language newspapers published in the US between 1894 and 1960, rendering print and microfilm into image and PDF files. At this point our holdings span 39 collections, with nearly 300,000 pages. We continue to add to the collection daily from newspapers around the world, including Arabic-language newspapers from the US, Argentina, and Brazil. We expect to collect over one million pages over the next three years.
- Unlike material in Latin-based scripts, Arabic archival material has, with rare exceptions, not been readily searchable because sufficiently accurate OCR (Optical Character Recognition) has been lacking. In the past few years, a few universities have been working on developing this capability. Several years ago, the Khayrallah Center also began developing a program to provide a fully searchable database of digitized Arabic-language newspapers. We initially published a beta-test version of the site, and with your comments and feedback we are now pleased to announce that our official program is ready!
- During the beta stage in the development of the project we achieved the following:
- 75% to 93% accuracy in rendering Arabic text from images (the figure varies with the quality of the original image).
- We developed a system that remaps the recognized text onto a PDF file, making that file searchable in Arabic.
- We developed a search engine to allow users to search in Arabic throughout this database.
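The announcement does not say how the accuracy figure was measured. As an illustration only, one common convention is character-level accuracy: compare the recognized text against a hand-checked transcription and count the single-character edits needed to turn one into the other. The sketch below assumes that convention; the function names `edit_distance` and `character_accuracy` are hypothetical and are not taken from the Center's program.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions that turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def character_accuracy(reference: str, recognized: str) -> float:
    """Share of reference characters recovered correctly (assumed metric)."""
    if not reference:
        return 1.0 if not recognized else 0.0
    errors = edit_distance(reference, recognized)
    return max(0.0, 1.0 - errors / len(reference))


# Usage: a hand-checked phrase versus OCR output that lost one space.
reference = "الحفلة السورية"
recognized = "الحفلةالسورية"
print(f"{character_accuracy(reference, recognized):.1%}")
```

Under this assumed metric, a page with poor image quality simply accumulates more edits per reference character, which is one way a range such as 75% to 93% could arise.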
We initially released a public beta version of this program with a limited dataset of 1,100 pages. Thanks to your comments, user experience, and suggestions, we have been able to create an official version of the database with thousands more uploads and detailed features.
These features include:
- More advanced search functions that allow the user to specify a date range, place of publication, and specific publication, as well as Boolean features (“and” and “or” searches).
- Pagination: early in the project, all results, even when they ran to hundreds of individual pages, were displayed on a single search result page. Now we display 10 results per page and allow users to move on to “Next page.”
- Initially, clicking “View” displayed only the single page. Now we display the page itself along with a ribbon at the bottom of the screen showing all the pages from the same issue.
- Natural Language Processing: To enhance OCR accuracy, we added another stage that processes the output text to “spell check” it and correct misrecognized characters.
- Better mapping of text onto PDF: When searching inside the page in the PDF viewer, at times the word(s) you searched for were not highlighted but rather the ones next to them. We corrected that by improving the mapping accuracy.
- Previously, the rendered text was dropping the spaces between many words when remapped onto the PDF file, so الحفلة السورية was sometimes rendered as الحفلةالسورية. This affected searching within the PDF file. We corrected that problem.
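The search features listed above (date range, place of publication, specific publication, “and”/“or” matching, and 10 results per page) can be sketched in a few lines. This is an illustration under stated assumptions, not the Center's implementation: the `Page` record, the `search` and `paginate` functions, and the sample publication names are all hypothetical, and matching is plain substring matching on the recognized text.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional, Sequence


@dataclass
class Page:
    """Hypothetical record for one digitized newspaper page."""
    publication: str
    place: str
    issued: date
    text: str  # recognized (OCR) text of the page


def search(pages: Sequence[Page], terms: Sequence[str], mode: str = "and",
           start: Optional[date] = None, end: Optional[date] = None,
           place: Optional[str] = None,
           publication: Optional[str] = None) -> List[Page]:
    """Return pages that pass the filters and contain all ("and")
    or at least one ("or") of the search terms."""
    if mode not in ("and", "or"):
        raise ValueError("mode must be 'and' or 'or'")
    if not terms:
        raise ValueError("at least one search term is required")
    combine = all if mode == "and" else any
    results = []
    for page in pages:
        if start is not None and page.issued < start:
            continue
        if end is not None and page.issued > end:
            continue
        if place is not None and page.place != place:
            continue
        if publication is not None and page.publication != publication:
            continue
        if combine(term in page.text for term in terms):
            results.append(page)
    return results


def paginate(results: Sequence[Page], page_number: int,
             per_page: int = 10) -> List[Page]:
    """Return one page of results; page_number starts at 1."""
    if page_number < 1 or per_page < 1:
        raise ValueError("page_number and per_page must be positive")
    first = (page_number - 1) * per_page
    return list(results[first:first + per_page])


# Usage with made-up sample data.
sample = [Page("Sample Paper A", "New York", date(1900 + i, 1, 1),
               "الحفلة السورية") for i in range(25)]
hits = search(sample, ["الحفلة", "السورية"], mode="and",
              start=date(1900, 1, 1), end=date(1960, 12, 31))
for page in paginate(hits, page_number=1):
    print(page.publication, page.issued.isoformat())
```

Substring matching also shows why the dropped-space problem mattered: a query for الحفلة السورية, space included, would not match text stored as الحفلةالسورية.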
We hope that you will find this tool useful in your work, and we look forward to your feedback. Please keep in mind that this is still a work in progress and will greatly benefit from your experience and advice.