Details

    • Type: New Feature
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 0.9.8
    • Fix Version/s: 2.0
    • Component/s: o.c.jsword.index
    • Labels:
      None

      Description

      Change the indexing of accented text so that accents are stripped from the index.

        Attachments

          Activity

          dmsmith DM Smith added a comment -

          This may require icu4j. We've avoided adding icu4j because of its size. It has not become any smaller (if anything, it is larger), but size is less of an issue these days.

          chrisburrell Chris Burrell added a comment -

          Actually, STEP has an un-accenter, at least for Hebrew and Greek, and Java has good regular-expression support for removing diacritics.
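
To illustrate the regular-expression approach mentioned here, the following is a minimal sketch using only the JDK. It is not STEP's un-accenter and not JSword code; the class name `AccentStripper` and its method are hypothetical. It assumes that dropping every Unicode combining mark is acceptable, which for Greek also removes breathings, the diaeresis, and the iota subscript, and for Hebrew removes vowel points and cantillation marks.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of JSword or STEP. */
final class AccentStripper {
    // \p{M} matches any Unicode combining mark (categories Mn, Mc, Me).
    private static final Pattern MARKS = Pattern.compile("\\p{M}+");

    private AccentStripper() { }

    /** Returns the text with all combining marks removed. */
    static String strip(String text) {
        // NFD splits precomposed characters into a base letter
        // followed by its combining marks, e.g. U+00E9 -> 'e' + U+0301.
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        // Deleting the marks leaves only the base letters.
        return MARKS.matcher(decomposed).replaceAll("");
    }
}
```

For instance, `AccentStripper.strip("périsse")` gives `perisse`, and `AccentStripper.strip("λόγος")` gives `λογος`. This approach needs no extra library, which matters given the size concern about icu4j raised above.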

          macjo sijo cherian added a comment -

          For the ~19 languages for which we use a natural-language Analyzer, most accents get removed. So this should not be an issue from a search perspective, right?
          We can include more languages if we upgrade to a newer Lucene.
          Is there any other purpose for removing accents that I am missing?

          For example, I took John 3:16 from a French Bible:
          Car Dieu a tant aimé le monde qu'il a donné son Fils unique, afin que quiconque croit en lui ne périsse point, mais qu'il ait la vie éternelle.

          Our current (stemmed) index-data / parsed-query looks like:
          "car dieu a tant aim le mond qu il a don son fil uniqu afin que quiconqu croit en lui ne per point mais qu il ait la vi éternel"

          dmsmith DM Smith added a comment -

          Is there any other purpose of removing accents, that I am missing?

          This issue was mainly for Greek, where many texts have no accents but some do. Having no accents would allow for searches across all Greek texts.
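
As a rough sketch of why folding helps here: if the same accent-stripping step is applied to both the indexed text and the query, an accented and an unaccented Greek text match the same search. The class `FoldedSearch`, its methods, and the in-memory map standing in for an index are all hypothetical, for illustration only; a real implementation would do the folding inside the index/query analyzer.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical illustration; a plain map stands in for the real index. */
final class FoldedSearch {
    private FoldedSearch() { }

    /** Strips combining marks and lowercases, for both index and query text. */
    static String fold(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "").toLowerCase(Locale.ROOT);
    }

    /** Returns the names of all texts whose folded content contains the folded query. */
    static List<String> search(Map<String, String> texts, String query) {
        String foldedQuery = fold(query);
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, String> entry : texts.entrySet()) {
            if (fold(entry.getValue()).contains(foldedQuery)) {
                hits.add(entry.getKey());
            }
        }
        return hits;
    }
}
```

With one text containing `Ἐν ἀρχῇ ἦν ὁ λόγος` and another containing `εν αρχη ην ο λογος`, both the query `λόγος` and the query `λογος` return both texts.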

          It is also for those who are unable to enter accents for lack of knowledge (e.g. a limited understanding of the language, or not knowing how to type them on a keyboard). For many it is far easier not to compose accents, because of the multiple keystrokes needed.

          My recent comment was made because I needed to test the Jira upgrade and see that mail still worked. I looked for an issue for which a comment seemed appropriate.

          We can include more languages if we upgrade to newer lucene.

          We should upgrade to a more recent version. The issue is that a newer version of Lucene would require a friendly mechanism to notice that prior indexes need to be rebuilt. The foundation of that is present, but needs a bit more work.
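
One possible shape for such a mechanism, sketched under assumptions: write a small stamp file next to the index recording the version it was built with, and treat a missing or different stamp as "rebuild needed". This is not JSword's existing foundation; the class `IndexStamp`, the file name, and the property key are all hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

/** Hypothetical sketch; not JSword's actual index-version mechanism. */
final class IndexStamp {
    // Both names are assumptions made for this example.
    static final String FILE_NAME = "index.version.properties";
    static final String KEY = "index.version";

    private IndexStamp() { }

    /** True if the stamp is missing or differs from the expected version. */
    static boolean needsRebuild(Path indexDir, String expectedVersion) throws IOException {
        Path stamp = indexDir.resolve(FILE_NAME);
        if (!Files.isRegularFile(stamp)) {
            return true; // no stamp: built by an older release, or never built
        }
        Properties props = new Properties();
        try (InputStream in = Files.newInputStream(stamp)) {
            props.load(in);
        }
        return !expectedVersion.equals(props.getProperty(KEY));
    }

    /** Records the version after a successful (re)build. */
    static void write(Path indexDir, String version) throws IOException {
        Files.createDirectories(indexDir);
        Properties props = new Properties();
        props.setProperty(KEY, version);
        try (OutputStream out = Files.newOutputStream(indexDir.resolve(FILE_NAME))) {
            props.store(out, "Index stamp");
        }
    }
}
```

The version string would change whenever the Lucene version or the analyzer behaviour (such as accent stripping) changes, so that old indexes are noticed on the next search rather than silently returning mismatched results.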

          macjo sijo cherian added a comment -

          Chris,
          Can you point me to the source for STEP's un-accenter for Hebrew and Greek? I will look into integrating it as an index/query analyzer.

          Does anyone have input on the pros and cons of an un-accented query/index for:
          1. Hebrew Bibles
          2. Ancient Greek (grc)
          3. Modern Greek

          thanks
          sijo


            People

            • Assignee:
              dmsmith DM Smith
            • Reporter:
              joe Joe Walker
            • Votes:
              0
            • Watchers:
              3

              Dates

              • Created:
                Updated: