Discussion:
How can I extract all URLs in a string using re.findall()?
could ildg
2005-04-07 02:05:08 UTC
I want to retrieve all URLs in a string. When I use re.findall, I get
a list of tuples.
My code is below:

[code]
import re

# html is assumed to be a string defined earlier
url = unicode(r"((http|ftp)://)?(((([\d]+\.)+){3}[\d]+(/[\w./]+)?)|([a-z]\w*((\.\w+)+){2,})([/][\w.~]*)*)")
m = re.findall(url, html)
for i in m:
    print i
[/code]

html is a string variable that contains many URLs.
The code prints many tuples, and each tuple does not seem to represent
a URL. For example, one of them is:

(u'http://', u'http', u'image.zhongsou.com/image/netchina.gif', u'',
u'', u'', u'', u'image.zhongsou.com', u'.com', u'.com',
u'/netchina.gif')

Why are there two "http" strings in it, and why are there so many empty strings
in the tuple above? It's obviously not a URL. How can I get the URLs
correctly?

Thanks in advance.
--
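Parrots are extremely clever and utterly hilarious; they are humanity's good friends.
Until one day, I realized that I am a parrot.
I am a parrot that climbs over the wall.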
Sidharth Kuruvila
2005-04-07 03:29:54 UTC
Reading the documentation on re might be helpful here :-P

When the pattern contains more than one group, findall returns a list of
tuples, one per match, holding all the groups in that match. A group that
did not take part in the match shows up as an empty string.

You might find finditer useful.

for m in re.finditer(url, html):
    print m.group()

Or you could replace all your parentheses with the non-grouping
version. That is, replace all brackets (...) with (?:...)
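
To make the difference concrete, here is a minimal sketch in Python 3 syntax. The pattern is a deliberately simplified stand-in, not the original poster's regex, and the sample text and host names are made up for illustration.

```python
import re

# Hypothetical sample text; the hosts are made up for illustration.
text = "a http://example.com/logo.gif b ftp://files.example.org c"

# Simplified pattern with three capturing groups: scheme, host, optional path.
grouped = r"(http|ftp)://([\w.-]+)(/[\w./~-]*)?"

# With capturing groups, findall returns one tuple of groups per match.
# A group that did not take part in the match becomes an empty string.
tuples = re.findall(grouped, text)
print(tuples)

# finditer yields match objects; group() with no argument is the whole match.
whole_matches = [m.group() for m in re.finditer(grouped, text)]
print(whole_matches)

# Same pattern with every group made non-capturing, so findall returns
# the full matched strings instead of tuples.
non_grouping = r"(?:http|ftp)://[\w.-]+(?:/[\w./~-]*)?"
flat = re.findall(non_grouping, text)
print(flat)
```

The second URL has no path, so its tuple ends in an empty string, which is the same effect that produces the empty strings in the tuple quoted in the question. Either finditer with group() or the non-grouping pattern gives the plain list of URLs.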
--
http://mail.python.org/mailman/listinfo/python-list
--
http://blogs.applibase.net/sidharth
Cappy2112
2005-04-07 22:13:04 UTC
Post by Sidharth Kuruvila
Reading the documentation on re might be helpful here :-P
Many times, the Python docs can make the problem more complicated,
especially with regexes.
could ildg
2005-04-08 01:32:17 UTC
I agree with Cappy2112.
I got more puzzled after I read the docs.
could ildg
2005-04-07 07:26:12 UTC
That's it! Thank you~~