Re: The URL regex.

From: Albrecht Dreß <albrecht dress arcormail de>
To: Pawel Salek <pawsa TheoChem kth se>
Cc: Carlos Morgado <chbm chbm nu>, balsa-list gnome org, Brian Stafford <brian stafford uklinux net>
Subject: Re: The URL regex.
Date: Mon, 21 May 2001 19:22:14 +0200

> I have noticed that the URL regex
> const char *url_str = "((ht|f)tps?://[^[:blank:]\r\n]+)";
> has been replaced by
> const char *url_str = "\\<((ht|f)tp[s]?://[^[:blank:]]+)\\>";  

I think the second one was my original, not sure anymore. I don't remember who
introduced the first version (it *might* actually be me, but it's of course
not correct;-))...

> IMO, the <>-brackets around the URL are rare and I think including them

I *thought* that "\<" and "\>" match word boundaries in regular expressions,
not the literal "<" and ">". `man 7 regex' says that it should be "[[:<:]]" and
"[[:>:]]" in this case, though. But the "\<"/"\>" form seems to work anyway.
Hmmm....
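
Which spelling the local regex library accepts can be probed directly. A
minimal sketch, assuming POSIX regcomp()/regexec() with REG_EXTENDED;
first_match() and probe_word_boundaries() are hypothetical helpers, not Balsa
code, and example.org stands in for a real URL. As far as I know, "\<" and "\>"
are GNU extensions and "[[:<:]]"/"[[:>:]]" come from the regex(7) page, so
either form may be rejected or behave differently depending on the C library.

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: compile `pattern` as a POSIX extended regex and
 * copy the first match in `text` into `out`.
 * Returns 0 on a match, 1 on no match, -1 if regcomp() rejects the pattern. */
int first_match(const char *pattern, const char *text,
                char *out, size_t outlen)
{
    regex_t re;
    regmatch_t m[1];
    int rc;

    assert(outlen > 0);
    out[0] = '\0';
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    rc = regexec(&re, text, 1, m, 0);
    if (rc == 0) {
        size_t len = (size_t)(m[0].rm_eo - m[0].rm_so);
        if (len >= outlen)
            len = outlen - 1;       /* truncate rather than overflow */
        memcpy(out, text + m[0].rm_so, len);
        out[len] = '\0';
    }
    regfree(&re);
    return rc == 0 ? 0 : 1;
}

/* Try the three spellings on one sample line and print what happens. */
void probe_word_boundaries(void)
{
    static const char *patterns[] = {
        "((ht|f)tps?://[^[:blank:]]+)",                /* no boundaries  */
        "\\<((ht|f)tps?://[^[:blank:]]+)\\>",          /* GNU-style      */
        "[[:<:]]((ht|f)tps?://[^[:blank:]]+)[[:>:]]",  /* regex(7)-style */
    };
    const char *text = "see <http://www.example.org/index.html>, please";
    char buf[256];
    size_t i;

    for (i = 0; i < sizeof patterns / sizeof patterns[0]; i++) {
        int rc = first_match(patterns[i], text, buf, sizeof buf);
        printf("%-45s -> %s%s\n", patterns[i],
               rc < 0 ? "rejected by regcomp" : rc ? "no match" : "match: ",
               buf);
    }
}
```

On the sample line the boundary-free pattern returns
"http://www.example.org/index.html>,", so the closing bracket and the comma
come along as trailing garbage; what the other two spellings do is exactly
what the probe is meant to find out.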

> in the regex is not needed. Most often, the URLs are just quoted in the
> text and it is better to be able to click on them. (BTW, \r\n characters
> are included in [:blank:] and can be removed). What do you think?

That's right!
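
One caveat on the \r\n remark, which a short test can settle: POSIX defines
[:blank:] as just space and tab, while [:space:] also covers \r and \n. A
minimal sketch, assuming POSIX regcomp()/regexec() without REG_NEWLINE and the
C locale; match_len() is a hypothetical helper.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: length of the first match of `pattern` in `text`,
 * or -1 if the pattern does not compile or does not match. */
long match_len(const char *pattern, const char *text)
{
    regex_t re;
    regmatch_t m[1];
    long len = -1;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    if (regexec(&re, text, 1, m, 0) == 0)
        len = (long)(m[0].rm_eo - m[0].rm_so);
    regfree(&re);
    return len;
}
```

With the text "http://a.example/x\r\nnext line", the [^[:blank:]]+ variant
should run across the line break and stop only at the space after "next",
whereas [^[:blank:]\r\n]+ and [^[:space:]]+ stop at the end of the URL. So the
explicit \r\n (or [:space:]) looks worth keeping.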

On 21.05.2001 11:14:45, Brian Stafford wrote:
> I would also suggest that "(ht|f)tp[s]?" is rewritten for clarity
> e.g. "(http|ftp)s?". 

Agree... 

> Also the [^[:blank:]]+ pattern doesn't exclude
> characters like "()<>" etc. that would normally be %-quoted in a
> URL.  So it's likely that the pattern picks up some trailing garbage.
> 
> Taking the legal characters from RFC 2396, I suggest the following
> for the trailing portion of the pattern
> 
> (%[[:digit:]A-Fa-f][[:digit:]A-Fa-f]|[-_.!~*'();/?:@&=+$,[:alnum:]])+
> 
> This pattern includes the parentheses characters but these could be
> problematic so it might be best to omit them.
> 
> The complete RE, omitting the () characters, would be
> 
> (http|ftp)s?://(%[[:digit:]A-Fa-f][[:digit:]A-Fa-f]|[-_.!~*';/?:@&=+$,[:alnum:]])+
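
For reference, the suggested pattern can be tried out as follows. A minimal
sketch, assuming POSIX regcomp()/regexec() with REG_EXTENDED; find_url() is a
hypothetical helper, example.org is a placeholder, and the bracket expression
is written with its closing "]" after [:alnum:], which regcomp() needs.

```c
#include <assert.h>
#include <regex.h>
#include <string.h>

/* The complete RE from the mail, parentheses left out of the allowed set.
 * Note the "]]": one closes [:alnum:], the other the bracket expression. */
static const char *url_str =
    "(http|ftp)s?://"
    "(%[[:digit:]A-Fa-f][[:digit:]A-Fa-f]|[-_.!~*';/?:@&=+$,[:alnum:]])+";

/* Hypothetical helper: copy the first URL found in `text` into `out`.
 * Returns 0 if a URL was found, 1 if not, -1 if the pattern fails to compile. */
int find_url(const char *text, char *out, size_t outlen)
{
    regex_t re;
    regmatch_t m[1];
    int rc;

    assert(outlen > 0);
    out[0] = '\0';
    if (regcomp(&re, url_str, REG_EXTENDED) != 0)
        return -1;
    rc = regexec(&re, text, 1, m, 0);
    if (rc == 0) {
        size_t len = (size_t)(m[0].rm_eo - m[0].rm_so);
        if (len >= outlen)
            len = outlen - 1;       /* truncate rather than overflow */
        memcpy(out, text + m[0].rm_so, len);
        out[len] = '\0';
    }
    regfree(&re);
    return rc == 0 ? 0 : 1;
}
```

For "(see http://example.org/a%20b?x=1) for details" this yields
"http://example.org/a%20b?x=1" without the closing parenthesis. A
sentence-final period is still picked up, though, since "." is a legal URL
character: "Visit http://example.org." gives "http://example.org.".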

IMHO, it is sufficient to check for a string without blanks, delimited by
word boundaries (space, beginning/end of line, ")", ".", ..., *if* we find the
correct coding for that, see above;-)). If the user gets a mail with a strange
URL in it, the browser might fail, and some manual intervention is needed. On
the other hand, creating bullet-proof regexes for both http and ftp is more
complicated, as the syntax differs a little bit (ftp allows a login string,
e.g. ftp://user:secret@some.host.com:42/some/file, see RFC 1738; http
doesn't). So I think we have two options:

* keep the current solution and rest in peace or

* make one *separate* regex for each of the following: https?, ftp, mailto,
nntp, news, telnet.
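
The second option could look roughly like the table below. This is only a
sketch, assuming POSIX regcomp()/regexec() with REG_EXTENDED: struct url_type,
url_types[] and classify_url() are hypothetical names, and the patterns are
deliberately loose placeholders rather than RFC-exact grammars. Only the ftp
entry spells out the optional login and port.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>
#include <string.h>

struct url_type {
    const char *name;      /* label for the kind of URL */
    const char *pattern;   /* POSIX extended regex for that kind */
};

/* One separate pattern per scheme; all of them are illustrative only. */
static const struct url_type url_types[] = {
    { "http",   "https?://[^[:space:]]+" },
    { "ftp",    "ftp://([^[:space:]:@/]+(:[^[:space:]@/]*)?@)?"
                "[^[:space:]/:@]+(:[[:digit:]]+)?(/[^[:space:]]*)?" },
    { "mailto", "mailto:[^[:space:]@]+@[^[:space:]@]+" },
    { "nntp",   "nntp://[^[:space:]]+" },
    { "news",   "news:[^[:space:]]+" },
    { "telnet", "telnet://[^[:space:]]+" },
};

/* Try each pattern in table order; on the first hit copy the matched text
 * into `out` and return the type name. Returns NULL if nothing matches.
 * Compiling on every call keeps the sketch short; real code would compile
 * each pattern once and reuse it. */
const char *classify_url(const char *text, char *out, size_t outlen)
{
    size_t i;

    assert(outlen > 0);
    out[0] = '\0';
    for (i = 0; i < sizeof url_types / sizeof url_types[0]; i++) {
        regex_t re;
        regmatch_t m[1];
        int found;

        if (regcomp(&re, url_types[i].pattern, REG_EXTENDED) != 0)
            continue;               /* skip a pattern the library rejects */
        found = regexec(&re, text, 1, m, 0) == 0;
        if (found) {
            size_t len = (size_t)(m[0].rm_eo - m[0].rm_so);
            if (len >= outlen)
                len = outlen - 1;   /* truncate rather than overflow */
            memcpy(out, text + m[0].rm_so, len);
            out[len] = '\0';
        }
        regfree(&re);
        if (found)
            return url_types[i].name;
    }
    return NULL;
}
```

With this table, "get ftp://user:secret@some.host.com:42/some/file now" is
reported as type "ftp" with the whole URL including login and port, and each
scheme can later get its own handler (browser, mail composer, ...).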

I am currently working on a solution to make everything in this list
clickable, and what I have read from you makes me believe that the second
solution is the better one. What do you think about that?

Pawel, if you just want to put together a new release: I am afraid that I will
need some more days to get this fixed. But for the time being, it might be ok
to use my second patch (the one which changes the cursor shape)?

Thanks, Albrecht.

-- 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    Albrecht Dreß  -  Monschauer Straße 22  -  D-53121 Bonn (Germany)
      Phone (+49) 228 6199571  -  E-Mail albrecht.dress@arcormail.de
_________________________________________________________________________

