Arab American Newspapers Project
The Khayrallah Center is delighted to announce the release of the beta version of its fully searchable database of Arab American newspapers. This online database gives researchers and the general public full and open access to a rich source for the history of Arabs in the Americas and beyond.
Background and Information:
- The Khayrallah Center has been digitizing Arabic-language newspapers published in the US between 1894 and 1960 (rendering print and microfilm into image and PDF files). At this point our collection spans 17 newspapers and nearly 300,000 pages. We continue to add to our collection daily from newspapers around the world, including Arabic-language newspapers from the US, Argentina and Brazil. We expect to collect over one million pages over the next three years.
- Unlike material in Latin-based scripts, Arabic archival material has, with rare exceptions, not been readily searchable because sufficiently accurate OCR (Optical Character Recognition) has been lacking. In the past few years a few universities have been working on developing this capability. Three years ago, the Khayrallah Center also began developing a program to provide a fully searchable database of digitized Arabic-language newspapers. We are pleased to announce that our program is now ready for the beta-test stage!
At this stage in the development of the project we have achieved the following:
- 75% to 93% accuracy in rendering Arabic text from images (accuracy varies with the quality of the original image).
- We have developed a system that remaps the recognized text onto a PDF file, making that file searchable in Arabic.
- We have developed a search engine to allow users to search in Arabic throughout this database.
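For readers curious how an OCR accuracy figure of this kind can be measured, here is a minimal sketch. It assumes character-level accuracy computed from the edit distance between the recognized text and a hand-corrected reference transcription. This is a common convention, not necessarily the measure behind the figures above, and the function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum insertions, deletions, and substitutions to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_accuracy(reference: str, recognized: str) -> float:
    """Share of reference characters recovered correctly (assumed metric, 0.0 to 1.0)."""
    if not reference:
        return 1.0 if not recognized else 0.0
    distance = levenshtein(reference, recognized)
    return max(0.0, 1.0 - distance / len(reference))


# A single dropped space counts as one character error.
reference = "الحفلة السورية"
recognized = "الحفلةالسورية"
print(f"{character_accuracy(reference, recognized):.2%}")
```

Under this assumed metric, a page with heavy damage or poor microfilm contrast would score lower simply because more characters differ from the reference.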
We are releasing the public beta version of this program with a limited dataset of 1,100 pages. We invite you to use this database and provide us with your feedback. Your comments, user experience, and suggestions will be of great value as we move to the final stages of development prior to release.
Features that are still being developed, and issues we are aware of:
- More advanced search functionalities that allow the user to specify date range, place of publication, specific publication, as well as Boolean features ("and" "or" searches).
- Pagination: at this point all results, even if they number in the hundreds of individual pages, are displayed on the same search result page. In the future, we will display 10 results per page and allow users to move to the "Next page."
- Currently, when you click on "View," only the single page is displayed. In the future we will display the page itself along with a ribbon at the bottom of the screen showing all the pages from the same issue.
- Natural Language Processing: To enhance OCR accuracy we will add another stage that processes the output text to "spell check" it and correct mis-recognized characters.
- Better mapping of text onto PDF: When searching inside the page in the PDF viewer you will notice at times that the word(s) you searched for are not highlighted but rather the ones next to them. We will work to correct that by improving the mapping accuracy.
- At this point the rendered text appears to be dropping spaces between many words when remapped onto the PDF file. So, الحفلة السورية is sometimes rendered الحفلةالسورية. This obviously affects searching within the PDF file. We are working to correct that problem.
- Using any PDF viewer other than the one we provide jumbles the Arabic text, and thus you cannot search in Arabic properly inside the PDF file (using Command+F for Mac, or Control+F for PC). This is most evident in Adobe Acrobat. We are working with Adobe to figure out the problem. That is why we have disabled, for now, the download feature on our search result page. This does NOT have any effect on your ability to search in Arabic using our search functions.
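The planned pagination described in the list above (10 results per page, with a "Next page" control) can be sketched as follows. This is an illustration only, not the database's actual code: the names `paginate`, `page_count`, and `PAGE_SIZE` are hypothetical, and results are assumed to be held in an in-memory list.

```python
from math import ceil
from typing import Sequence, TypeVar

T = TypeVar("T")

PAGE_SIZE = 10  # results per page, as planned in the list above


def page_count(total_results: int, page_size: int = PAGE_SIZE) -> int:
    """Number of pages needed to show every result."""
    return ceil(total_results / page_size)


def paginate(results: Sequence[T], page: int, page_size: int = PAGE_SIZE) -> list[T]:
    """Return the slice of results for a 1-based page number."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    start = (page - 1) * page_size
    return list(results[start:start + page_size])


# Example: 25 hits spread over three pages; the last page holds the remainder.
hits = [f"page-{n}" for n in range(1, 26)]
for page in range(1, page_count(len(hits)) + 1):
    print(page, paginate(hits, page))
```

A "Next page" link would simply request `page + 1` as long as `page` is below `page_count(len(hits))`.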
To use this beta version of the searchable database we ask you to register. After you log in you will be taken to the search page which looks like this.
The red box in the middle of the page is where you enter the Arabic text using either your keyboard, or the small virtual keyboard under the search box in case you do not have Arabic script capability on your computer. In addition, there is a feedback form that appears on the results page for you to provide us with comments about your experience.
We hope that you will find this tool useful in your work, and we look forward to your feedback. Please keep in mind that this is a work in progress, and will greatly benefit from your experience and advice.