[tracker] FTS Parsers: new README file explaining benefits of each one
- From: Aleksander Morgado <aleksm src gnome org>
- To: commits-list gnome org
- Cc:
- Subject: [tracker] FTS Parsers: new README file explaining benefits of each one
- Date: Wed, 26 May 2010 16:30:21 +0000 (UTC)
commit f4cd0b1cd0886ddfc5ea0d3637f76cf590febe33
Author: Aleksander Morgado <aleksander lanedo com>
Date: Wed May 26 18:29:40 2010 +0200
FTS Parsers: new README file explaining benefits of each one
src/libtracker-fts/README.parsers | 51 +++++++++++++++++++++++++++++++++++++
1 files changed, 51 insertions(+), 0 deletions(-)
---
diff --git a/src/libtracker-fts/README.parsers b/src/libtracker-fts/README.parsers
new file mode 100644
index 0000000..54b4ede
--- /dev/null
+++ b/src/libtracker-fts/README.parsers
@@ -0,0 +1,51 @@
+
+This file contains information about the different parser implementations
+ available in Tracker, each of them based on a different Unicode support library
+ (GNU libunistring, libicu, glib/pango).
+
+A specific parser implementation can be selected with the following option at
+ configure time: --with-unicode-support=[libunistring|libicu|glib]
+
+
+Parser based on GNU libunistring (http://www.gnu.org/software/libunistring):
+ * Performs word-breaking as defined by UAX#29 [1], but doesn't yet allow
+ 'next-word' searches (as of v0.9.3; the feature is on the roadmap).
+ * Performs full-word casefolding [2] in non-ASCII strings.
+ * Performs lowercasing in ASCII strings.
+ * Performs NFKD normalization in non-ASCII strings.
+ * Library API is UTF-8 friendly.
+ * Up to 50% faster than the glib/pango parser for ASCII words.
+ * Up to 60% faster than the libicu parser for ASCII words.
+
+Parser based on ICU libicu (http://icu-project.org):
+ * Performs word-breaking as defined by UAX#29 [1], and allows 'next-word'
+ searches, which suits Tracker's use case well.
+ * Performs full-word casefolding [2] in non-ASCII strings.
+ * Performs lowercasing in ASCII strings.
+ * Performs NFKD normalization in non-ASCII strings.
+ * Library API is not UTF-8 friendly, strongly based on a custom data type
+ (UChar), which is based on UTF-16 (so great for Windows systems, where
+ Unicode strings are encoded in UTF-16).
+ * Up to 37% faster than the libunistring parser for non-ASCII words.
+
+Parser based on glib/pango:
+ * Custom word breaking for non-CJK strings (fails if the input string is in
+ a decomposed normalization form, NFD or NFKD).
+ * Pango-based word breaking (not fully compliant with UAX#29 [1]) for CJK
+ strings.
+ * Doesn't work properly with strings containing mixed CJK and non-CJK text
+ (for the same file with mixed CJK and non-CJK text, the libunistring and
+ libicu parsers each took around 1 second, while the glib/pango parser
+ needed several minutes).
+ * Performs single-character lowercasing in non-CJK strings (so fails with
+ special casefolding cases where a single character is casefolded into more
+ than one character).
+ * Performs NFC normalization in non-CJK strings.
+
+
+References:
+ [1] UAX#29, Unicode Standard Annex #29: TEXT BOUNDARIES
+ http://unicode.org/reports/tr29
+ [2] Section 5.18 of Unicode 5 standard: CASE MAPPINGS
+ http://www.unicode.org/versions/latest/ch05.pdf
+
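The difference between full-word casefolding with NFKD normalization and single-character lowercasing with NFC normalization, as described in the README above, can be illustrated with a minimal Python sketch. It uses only the standard library rather than any of the three libraries discussed. The function names `fold_word` and `lower_per_char` are hypothetical, and the order of operations (casefold first, then normalize) is an assumption, not necessarily what the Tracker parsers do.

```python
import unicodedata


def fold_word(word: str) -> str:
    """Full casefolding followed by NFKD normalization.

    Rough analogue of the behaviour listed for the libunistring and
    libicu parsers (assumed order: casefold, then normalize).
    """
    return unicodedata.normalize("NFKD", word.casefold())


def lower_per_char(word: str) -> str:
    """Single-character lowercasing followed by NFC normalization.

    Rough analogue of the behaviour listed for the glib/pango parser:
    a character is replaced only when its lowercase form is exactly
    one character long, so one-to-many mappings are never applied.
    """
    lowered = []
    for ch in word:
        low = ch.lower()
        lowered.append(low if len(low) == 1 else ch)
    return unicodedata.normalize("NFC", "".join(lowered))


if __name__ == "__main__":
    # U+00DF (sharp s) casefolds to "ss", a one-to-many mapping.
    for word in ("Stra\u00dfe", "STRASSE", "\ufb01le"):
        print(ascii(word), ascii(fold_word(word)), ascii(lower_per_char(word)))
```

With full casefolding, "Straße" and "STRASSE" reduce to the same term, whereas per-character lowercasing keeps them distinct. Likewise, NFKD decomposes the compatibility ligature U+FB01 into "fi", while NFC leaves it untouched.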