Re: [Tracker] nie:plainTextContent, Unicode normalization and Word breaks

From: Jamie McCracken <jamie mccrack googlemail com>
To: Martyn Russell <martyn lanedo com>
Cc: "Tracker (devel)" <tracker-list gnome org>, Jamie McCracken <jamie mccrack googlemail com>
Subject: Re: [Tracker] nie:plainTextContent, Unicode normalization and Word breaks
Date: Mon, 26 Apr 2010 09:47:10 -0400

On Mon, 2010-04-26 at 09:54 +0100, Martyn Russell wrote:

On 25/04/10 21:59, Jamie McCracken wrote:

On Sun, 2010-04-25 at 22:34 +0200, Aleksander Morgado wrote:

Hi Jamie,

I think it makes sense to fix this. Just to be clear, does this mean we
don't need Pango in libtracker-fts/tracker-parser.c to determine word
breaks for CJK?


That's not broken, so I would not recommend trying to "fix" that.


Well, given the details Aleksander demonstrated previously in this 
thread, word breaking for Chinese symbols is broken and yes that should 
be fixed.


It's not broken in the parser AFAIK - the parser is heavily optimised for
word breaking and works well with CJK (via Pango).


I think it is silly to use 2 different libraries to do the same thing 
and if one does things better than another...


It's way too slow to use CJK breaking on non-CJK text - the parser
checks the language before using the appropriate algorithm. The
extractor lacks the intelligence to do that efficiently.
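
To illustrate the "check first, then pick the algorithm" idea, here is a minimal sketch. It is not the code in tracker-parser.c: the names is_cjk(), choose_break_algo() and the break_algo enum are hypothetical, the input is assumed to be already decoded into code points, and the handful of Unicode blocks tested is an illustrative subset rather than whatever the parser really checks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names throughout; this is a sketch, not Tracker's parser. */
typedef enum { BREAK_SIMPLE, BREAK_CJK } break_algo;

/* True if the code point lies in one of a few well-known CJK blocks.
 * The list is deliberately incomplete (an assumption for illustration). */
bool is_cjk(uint32_t cp)
{
    return (cp >= 0x3040 && cp <= 0x309F)    /* Hiragana */
        || (cp >= 0x30A0 && cp <= 0x30FF)    /* Katakana */
        || (cp >= 0x4E00 && cp <= 0x9FFF)    /* CJK Unified Ideographs */
        || (cp >= 0xAC00 && cp <= 0xD7AF);   /* Hangul Syllables */
}

/* Scan the text once with a cheap range test and only select the
 * expensive breaker when a CJK code point is actually present. */
break_algo choose_break_algo(const uint32_t *cps, size_t n)
{
    assert(cps != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (is_cjk(cps[i]))
            return BREAK_CJK;
    }
    return BREAK_SIMPLE;
}
```

With something like this, plain Latin text never reaches the slow path, which is the point being made about the parser.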

IMHO, tracker_text_normalize() in the extractor should just do UTF-8
validation. It should not attempt word breaking, as that's CPU-expensive
and is already being done by the parser.
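
A validation-only pass could look roughly like the sketch below. It is dependency-free so it can be read on its own; in GLib-based code g_utf8_validate() would be the natural call instead. The function name utf8_valid_prefix() is hypothetical and this is not the current body of tracker_text_normalize().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns the length in bytes of the longest valid
 * UTF-8 prefix of buf[0..len). No word breaking, just one linear scan. */
size_t utf8_valid_prefix(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    size_t i = 0;
    while (i < len) {
        unsigned char c = buf[i];
        size_t need;      /* number of continuation bytes expected */
        uint32_t cp, min; /* decoded value and smallest non-overlong value */

        if (c < 0x80) { i++; continue; }                  /* ASCII */
        else if ((c & 0xE0) == 0xC0) { need = 1; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { need = 2; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { need = 3; cp = c & 0x07; min = 0x10000; }
        else break;                                       /* invalid lead byte */

        if (len - i <= need) break;                       /* truncated sequence */
        for (size_t k = 1; k <= need; k++) {
            if ((buf[i + k] & 0xC0) != 0x80) return i;    /* bad continuation */
            cp = (cp << 6) | (buf[i + k] & 0x3F);
        }
        /* Reject overlong forms, surrogates and values above U+10FFFF. */
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) break;
        i += need + 1;
    }
    return i;
}
```

The extractor would then keep only the first utf8_valid_prefix() bytes and leave everything word-related to the parser.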


Well, extraction already is pretty expensive. I see your point there but 
also, it doesn't make sense to send n bytes over d-bus that won't be 
used either. So really it is the lesser of two evils. Currently we do 
push a lot of data over d-bus.


Sure, it's a trade-off.

I just think word limits should be estimated or ignored in the
extractors (we have a byte limit as well as a word limit in any event).
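
As a rough sketch of what "estimated" could mean here: count whitespace-separated runs as words, stop at whichever of the two limits is hit first, and never cut in the middle of a multi-byte UTF-8 sequence. The function name estimate_keep_bytes() and its parameters are hypothetical, and only ASCII whitespace is treated as a separator, so CJK text (which has no spaces between words) is undercounted and ends up bounded by the byte limit alone.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: returns how many bytes of buf[0..len) to keep so
 * that neither max_bytes nor the estimated max_words is exceeded. */
size_t estimate_keep_bytes(const char *buf, size_t len,
                           size_t max_bytes, size_t max_words)
{
    assert(buf != NULL || len == 0);
    size_t limit = len < max_bytes ? len : max_bytes;
    size_t words = 0;
    size_t keep = 0;
    bool in_word = false;

    for (keep = 0; keep < limit; keep++) {
        unsigned char c = (unsigned char)buf[keep];
        bool space = (c == ' ' || c == '\t' || c == '\n' || c == '\r');
        if (!space && !in_word) {
            if (words == max_words)
                break;            /* starting another word would exceed the limit */
            words++;
        }
        in_word = !space;
    }

    /* If the cut landed on a UTF-8 continuation byte, back off to the
     * start of that character so the output stays valid. */
    while (keep > 0 && keep < len &&
           ((unsigned char)buf[keep] & 0xC0) == 0x80)
        keep--;

    return keep;
}
```

This is a single pass over bytes with no Unicode tables, which is the kind of cost that seems acceptable in an extractor; exact word counting stays with the parser.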

