Identify Arabic Text Segments

Arabic Segments Identifier:

This method identifies Arabic text in a UTF-8 multilingual document and returns an array of start and end positions for the Arabic text segments. Knowing the language and encoding of a document is an essential first step in working with unstructured multilingual text. Without it, applications such as information retrieval and text mining cannot process the data accurately, and important information may be missed or misrouted.
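To make the shape of the result concrete, here is a minimal sketch of a segment finder that returns a flat array of start and end byte offsets. It is not the library's implementation: `findArabicSegments()` is a hypothetical name, and the approach (a PCRE match on the Unicode `Arabic` script property) is an assumption chosen for brevity.

```php
<?php
/**
 * Hypothetical stand-in for illustration only; this is NOT ArPHP's arIdentify().
 * Returns a flat array of byte offsets: [start0, end0, start1, end1, ...].
 */
function findArabicSegments(string $text): array
{
    $positions = [];

    // Runs of Arabic-script characters, allowing whitespace between Arabic words.
    $pattern = '/\p{Arabic}+(?:\s+\p{Arabic}+)*/u';

    // PREG_OFFSET_CAPTURE reports offsets in bytes, even with the /u modifier.
    if (preg_match_all($pattern, $text, $matches, PREG_OFFSET_CAPTURE)) {
        foreach ($matches[0] as [$segment, $offset]) {
            $positions[] = $offset;                    // start (inclusive)
            $positions[] = $offset + strlen($segment); // end (exclusive)
        }
    }

    return $positions;
}

// "Peace " is 6 bytes and the Arabic word is 8 bytes in UTF-8.
print_r(findArabicSegments('Peace سلام Shalom')); // [6, 14]
```

Byte offsets are used here because they can be passed straight to `substr()` and `substr_replace()`, which also count in bytes.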

Any application that works with Arabic in multilingual documents can benefit from this functionality. It allows a fully automated approach to processing Arabic text by quickly and accurately locating the Arabic segments within a multilingual document.

Example Output 1:

Peace سلام שלום Hasîtî शान्ति Barış 和平 Мир

English: Say Peace in all languages! The people of the world prefer peace to war and they deserve to have it. Bombs are not needed to solve international problems when they can be solved just as well with respect and communication. The Internet Internationalization (I18N) community, which values diversity and human life everywhere, offers "Peace" in many languages as a small step in this direction.
Arabic: نص عربي: أنطقوا سلام بكل اللغات! كل شعوب العالم تفضل السلام علي الحرب وكلها تستحق أن تنعم به. إن القنابل لا تحل مشاكل العالم ويتم تحقيق ذلك فقط بالاحترام والتواصل. مجموعة تدويل الإنترنت (I18N) ، والتي تأخذ بعين التقدير الاختلافات الثقافية والعادات الحياتية بين الشعوب، فإنها تقدم "السلام" بلغات كثيرة، كخطوة متواضعة في هذا الاتجاه.
Hebrew: אמרו "שלום" בכל השפות! אנשי העולם מעדיפים את השלום על-פני המלחמה והם ראויים לו. אין צורך בפצצות כדי לפתור בעיות בין-לאומיות, רק בכבוד ובהידברות. קהילת בינאום האינטרנט (I18N), אשר מוקירה רב-גוניות וחיי אדם בכל מקום, מושיטה יד ל"שלום" בשפות רבות כצעד קטן בכיוון זה.

Some Authors:

Frank da Cruz, New York City (USA)
Marco Cimarosti, Milano (Italy)
Michael Everson, Dublin (Ireland)
فريد عدلي / Farid Adly,
Editor in Chief, Italian-Arab News Agency ANBAMED
(Notizie dal Mediterraneo - أنباء البحر المتوسط), Acquedolci (Italy)

Example Code 1:


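<?php
    // Load the ArPHP library and create the main object.
    require '../src/arabic.php';
    $Arabic = new \ArPHP\I18N\Arabic();

    // $html must hold the UTF-8 document to scan; a short sample is used here.
    $html = 'Peace سلام שלום Hasîtî शान्ति Barış 和平 Мир';

    // $p is a flat array of positions: start, end, start, end, ...
    $p = $Arabic->arIdentify($html);

    // Walk the pairs from last to first so that inserting <mark> tags
    // does not shift the offsets of segments still to be processed.
    for ($i = count($p) - 1; $i > 0; $i -= 2) {
        $start   = $p[$i - 1];
        $length  = $p[$i] - $start;
        $arStr   = substr($html, $start, $length);
        $replace = '<mark>' . $arStr . '</mark>';
        $html    = substr_replace($html, $replace, $start, $length);
    }

    echo $html;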
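
When the segments themselves are needed rather than a highlighted copy of the document, the same position array can be turned into a list of substrings. The sketch below assumes the positions are byte offsets in start/end pairs, as the `substr()` calls in the example imply. `extractSegments()` is a hypothetical helper, not part of the library, and the offsets in the usage line are written by hand instead of coming from `arIdentify()`.

```php
<?php
/**
 * Hypothetical helper: turn a flat [start0, end0, start1, end1, ...] array
 * of byte offsets into the substrings those pairs delimit.
 */
function extractSegments(string $text, array $positions): array
{
    $segments = [];

    foreach (array_chunk($positions, 2) as $pair) {
        if (count($pair) < 2) {
            continue; // ignore a dangling start offset with no matching end
        }
        [$start, $end] = $pair;
        $segments[] = substr($text, $start, $end - $start);
    }

    return $segments;
}

// Hand-written offsets for this sample; in practice they would come from arIdentify().
$text = 'Peace سلام Shalom';
print_r(extractSegments($text, [6, 14]));
```

Because nothing is inserted into the text here, the pairs can be read front to back. The reverse loop in the example is only needed when the document is modified in place.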

Related Documentation: arIdentify