Importing the Quran into Drupal using Feeds & friends

The Quran is a significant work of literature. It has been studied and analyzed for centuries, and it is considered a reference for all followers of Islam. Many online sources offer access to the Quran, including the Quranic Arabic Corpus, a full-featured linguistic browser created by the Language Research Group at the University of Leeds.

As a native Arabic speaker, I always thought it would be interesting to own an electronic copy of the Quran which I could subject to my amateurish linguistic explorations. Through the site above, I found a link to an XML file containing the full Quran text in Arabic. My mind immediately went: XML + Feeds = Drupal!! So I set out to import the text, and a few hours and a few bug fixes later, here I am to present the result.

For the impatient, I've attached a feature that sets up the Data tables and the Feeds importer needed to import the file above. You will need the following modules:

  • Feeds (6.x-1.0-beta10)
  • Data (6.x-1.0-alpha14)
  • Feeds XPath Parser (6.x-1.x-dev as of 2010/11/02)
  • Feeds MultiTable in my own module Feeds Hacks (6.x-1.x-dev as of 2010/11/02)
  • Features (6.x-1.0)

Data model

The Quran has an exceedingly simple data model: it consists of chapters and verses, respectively named sûra and âya. Each sûra has a name and each âya has text. The Data tables coran_sura(id, name) and coran_aya(id, sura_id, text) reflect this structure.
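As a rough sketch, the two tables could be described in Drupal Schema API array form as below. The column types, lengths, and the compound primary key on coran_aya are assumptions made for illustration, and the function name dcoran_example_schema() is hypothetical; the attached feature remains the reference for the actual table definitions.

```php
<?php
/**
 * Hypothetical helper: describes the two tables in Schema API array form.
 * Types and keys are assumptions, not copied from the attached feature.
 */
function dcoran_example_schema() {
  return array(
    'coran_sura' => array(
      'fields' => array(
        'id'   => array('type' => 'int', 'unsigned' => TRUE, 'not null' => TRUE),
        'name' => array('type' => 'varchar', 'length' => 255, 'not null' => TRUE),
      ),
      'primary key' => array('id'),
    ),
    'coran_aya' => array(
      'fields' => array(
        'id'      => array('type' => 'int', 'unsigned' => TRUE, 'not null' => TRUE),
        'sura_id' => array('type' => 'int', 'unsigned' => TRUE, 'not null' => TRUE),
        'text'    => array('type' => 'text', 'not null' => TRUE),
      ),
      // In the XML, âya indexes restart at 1 in every sûra, so an âya is
      // only identified by the pair (sura_id, id).
      'primary key' => array('sura_id', 'id'),
    ),
  );
}
```

The compound key matters because the âya index alone is not unique across the whole text.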

Parsing the XML

The XML file reflects the structure above. Here's an excerpt:

<?xml version="1.0" encoding="utf-8" ?>
<quran>
  <sura index="1" name="الفاتحة">
    <aya index="1" text="بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ" />
    <aya index="2" text="ٱلْحَمْدُ لِلَّهِ رَبِّ ٱلْعَٰلَمِينَ" />
    <aya index="3" text="ٱلرَّحْمَٰنِ ٱلرَّحِيمِ" />
    <aya index="4" text="مَٰلِكِ يَوْمِ ٱلدِّينِ" />
    <aya index="5" text="إِيَّاكَ نَعْبُدُ وَإِيَّاكَ نَسْتَعِينُ" />
    <aya index="6" text="ٱهْدِنَا ٱلصِّرَٰطَ ٱلْمُسْتَقِيمَ" />
    <aya index="7" text="صِرَٰطَ ٱلَّذِينَ أَنْعَمْتَ عَلَيْهِمْ غَيْرِ ٱلْمَغْضُوبِ عَلَيْهِمْ وَلَا ٱلضَّآلِّينَ" />
  </sura>
  <sura index="2" name="البقرة">
    <aya index="1" text="المٓ" />
    <aya index="2" text="ذَٰلِكَ ٱلْكِتَٰبُ لَا رَيْبَ فِيهِ هُدًى لِّلْمُتَّقِينَ" />
    ....
  </sura>
</quran>

To parse the XML, I used the excellent Feeds XPath Parser, which lets you extract any number of fields from an XML file using XPath expressions. You give it one expression for the root path (the repeating element, the âya in our case) and one expression per field, evaluated relative to that element. To get the parent sûra's index and name from the âya row, I used the XPath expressions ../@index and ../@name respectively.
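To see how the root expression and the per-field expressions fit together, here is a minimal standalone sketch using PHP's DOMXPath rather than the Feeds XPath Parser itself. The //aya root expression, the @index and @text field expressions, and the array keys (aya_id, aya_text, sura_id, sura_name) are assumptions chosen for illustration; only ../@index and ../@name come from the importer described above.

```php
<?php
// A small document shaped like the source file, trimmed to three verses.
$xml = <<<'XML'
<?xml version="1.0" encoding="utf-8" ?>
<quran>
  <sura index="1" name="الفاتحة">
    <aya index="1" text="بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ" />
    <aya index="2" text="ٱلْحَمْدُ لِلَّهِ رَبِّ ٱلْعَٰلَمِينَ" />
  </sura>
  <sura index="2" name="البقرة">
    <aya index="1" text="المٓ" />
  </sura>
</quran>
XML;

$doc = new DOMDocument();
$doc->loadXML($xml);
$xpath = new DOMXPath($doc);

$rows = array();
// Root expression: one result node per âya.
foreach ($xpath->query('//aya') as $aya) {
  // Field expressions, each evaluated relative to the current âya node.
  $rows[] = array(
    'aya_id'    => (int) $xpath->evaluate('string(@index)', $aya),
    'aya_text'  => $xpath->evaluate('string(@text)', $aya),
    // ".." climbs to the parent <sura> element.
    'sura_id'   => (int) $xpath->evaluate('string(../@index)', $aya),
    'sura_name' => $xpath->evaluate('string(../@name)', $aya),
  );
}

print_r($rows);
```

Each âya becomes one flat record that also carries its parent sûra's index and name, which is the shape the processor needs.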

Processing the parsed information

My module feeds_multitable provides a Data processor called FeedsMultiTableDataProcessor that allows each input record to be written to more than one table. This is a feature I've needed before to retrieve Twitter tweets and normalize them into tweet/author tables, and it came in handy here as well. To set it up, you just select the data tables in the processor's configuration form, and you then provide the mappings from each XPath expression you wrote earlier to the corresponding field in the desired table. The processor is smart enough to handle compound primary keys and can detect duplicates, much like the original FeedsDataProcessor.
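The idea of writing one input record to several tables can be sketched in plain PHP. This is not the code of FeedsMultiTableDataProcessor: the function example_split_records(), the mapping format, and the in-memory arrays standing in for the Data tables are all hypothetical, and duplicate handling is reduced to "rows with the same primary key are stored once".

```php
<?php
/**
 * Hypothetical illustration: split flat records into several "tables".
 *
 * $mappings maps a table name to:
 *   'fields'      => array(target column => source key in the record),
 *   'primary key' => array of target columns.
 */
function example_split_records(array $records, array $mappings) {
  $tables = array();
  foreach ($mappings as $table => $mapping) {
    $tables[$table] = array();
    foreach ($records as $record) {
      // Copy only the fields mapped to this table.
      $row = array();
      foreach ($mapping['fields'] as $target => $source) {
        $row[$target] = $record[$source];
      }
      // Build a lookup key from the (possibly compound) primary key, so a
      // row that was already seen is overwritten instead of duplicated.
      $key_parts = array();
      foreach ($mapping['primary key'] as $field) {
        $key_parts[] = $row[$field];
      }
      $tables[$table][implode(':', $key_parts)] = $row;
    }
  }
  return $tables;
}

// Three flat records, as an XPath-style parser might produce them.
$records = array(
  array('aya_id' => 1, 'aya_text' => 'بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ', 'sura_id' => 1, 'sura_name' => 'الفاتحة'),
  array('aya_id' => 2, 'aya_text' => 'ٱلْحَمْدُ لِلَّهِ رَبِّ ٱلْعَٰلَمِينَ', 'sura_id' => 1, 'sura_name' => 'الفاتحة'),
  array('aya_id' => 1, 'aya_text' => 'المٓ', 'sura_id' => 2, 'sura_name' => 'البقرة'),
);

$mappings = array(
  'coran_sura' => array(
    'fields' => array('id' => 'sura_id', 'name' => 'sura_name'),
    'primary key' => array('id'),
  ),
  'coran_aya' => array(
    'fields' => array('id' => 'aya_id', 'sura_id' => 'sura_id', 'text' => 'aya_text'),
    'primary key' => array('sura_id', 'id'),
  ),
);

$tables = example_split_records($records, $mappings);
```

With these three records, the sûra that appears twice is stored only once, while every âya gets its own row.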

That's about it! If all goes well, you should end up with 114 chapters and 6236 verses.

PS. Limitations: I found that some layer of my software stack (Ubuntu 9.10 and Firefox 3.6.12) doesn't correctly display the Arabic character U+0671 (ARABIC LETTER ALEF WASLA), which is used in the XML file. Unable to find a fix, I resorted to replacing that character with the more generic U+0627 (ARABIC LETTER ALEF), using the hook_feeds_after_parse that Feeds provides:

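<?php
// @file: dcoran.module

/**
 * Implementation of hook_feeds_after_parse().
 *
 * Replaces U+0671 (ARABIC LETTER ALEF WASLA) with U+0627 (ARABIC LETTER ALEF)
 * in one parsed field of every item before it reaches the processor.
 */
function dcoran_feeds_after_parse(&$importer, &$source) {
  $items =& $source->batch->items;
  mb_internal_encoding("UTF-8");
  mb_regex_encoding("UTF-8");
  foreach ($items as $key => $item) {
    // First argument is U+0671, second is U+0627.
    $items[$key]['xpathparser:2'] = mb_ereg_replace('ٱ', 'ا', $item['xpathparser:2']);
  }
}
?>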

If anyone has encountered this issue before and found a fix, I'd be grateful if you could share it.

Attachment: dcoran_configuration-6.x-1.0.tar (10.5 KB)