[ Wikipedia API: getting articles with Unicode titles ]
I am trying to retrieve Wikipedia pages using the titles returned by the API. I am calling the Wikipedia API with Python's urllib2 library.
I am stuck on handling titles that contain non-ASCII characters when I try to automate the process. Here is an example of the problem:
Let's assume we want to retrieve the page about Escola Superior de Ciências Empresariais, which has the following link: http://en.wikipedia.org/wiki/Escola_Superior_de_Ci%C3%AAncias_Empresariais_(Set%C3%BAbal)
Calling the API with that percent-encoded title gives me what I want: http://en.wikipedia.org/w/api.php?action=query&prop=revisions&format=xml&titles=Escola_Superior_de_Ci%C3%AAncias_Empresariais_(Set%C3%BAbal)&rvprop=timestamp|user|comment|content
Now, going back to the start of the problem: if I also use the Wikipedia API to search for the school, like this: http://en.wikipedia.org/w/api.php?action=query&list=search&srprop=''&format=xml&srsearch=Escola_Superior_de_Empresariais
I get the following XML string:
<api>
<warnings>
<search xml:space="preserve">Unrecognized value for parameter \'srprop\': \'\'</search>
</warnings>
<query>
<searchinfo totalhits="7"/>
<search>
<p ns="0" title="Escola Superior de Ci\xc3\xaancias Empresariais (Set\xc3\xbabal)"/>
<p ns="0" title="List of universities in Cape Verde"/>
<p ns="0" title="Polytechnic Institute of Viana do Castelo"/>
<p ns="0" title="S\xc3\xa3o Vicente, Cape Verde"/>
<p ns="0" title="Economy of Portugal"/>
<p ns="0" title="Higher education in Portugal"/>
<p ns="0" title="Grupo Opaia SA"/>
</search>
</query>
</api>
My question is: how can I convert the retrieved string
Escola Superior de Ci\xc3\xaancias Empresariais (Set\xc3\xbabal)
into a form that can be passed directly to the Wikipedia API? That is to say, how can I build a link like:
http://en.wikipedia.org/wiki/Escola_Superior_de_Ci%C3%AAncias_Empresariais_(Set%C3%BAbal)
from the retrieved results?
Answer 1
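Use urllib2.quote (the same function as urllib.quote), which does this for you: it percent-encodes the UTF-8 bytes of the title. Note that it encodes spaces as %20 and parentheses as %28 and %29, so to get the exact link form above, replace spaces with underscores first and pass safe='()'.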
Essentially:
import urllib2
# UTF-8 encoded title, as it appears in the search result's title attribute
utf8_url = 'Escola Superior de Ci\xc3\xaancias Empresariais (Set\xc3\xbabal)'
percent_url = urllib2.quote(utf8_url)
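For readers on Python 3, where urllib2 no longer exists, the same idea can be sketched with urllib.parse. This is only an illustration under stated assumptions: the helper names wiki_page_url and revisions_api_url are made up for this example, and the base URLs are simply copied from the question.

```python
from urllib.parse import quote, urlencode


def wiki_page_url(title, base="http://en.wikipedia.org/wiki/"):
    """Build a page link from a title as returned by the search API.

    Hypothetical helper, not part of any library.
    """
    # Accept the raw UTF-8 bytes shown in the question as well as str.
    if isinstance(title, bytes):
        title = title.decode("utf-8")
    # Wikipedia page links use underscores in place of spaces.
    slug = title.replace(" ", "_")
    # quote() percent-encodes the UTF-8 bytes of each non-ASCII character;
    # safe="()" leaves the parentheses unescaped, matching the link form.
    return base + quote(slug, safe="()")


def revisions_api_url(title, endpoint="http://en.wikipedia.org/w/api.php"):
    """Build the revisions query from the question for a given title.

    Hypothetical helper; the parameters are the ones used in the question.
    """
    if isinstance(title, bytes):
        title = title.decode("utf-8")
    params = {
        "action": "query",
        "prop": "revisions",
        "format": "xml",
        "titles": title,
        "rvprop": "timestamp|user|comment|content",
    }
    # urlencode() escapes every value, so no manual quoting is needed here.
    return endpoint + "?" + urlencode(params)


if __name__ == "__main__":
    raw = b"Escola Superior de Ci\xc3\xaancias Empresariais (Set\xc3\xbabal)"
    print(wiki_page_url(raw))
    print(revisions_api_url(raw))
```

When building query strings for the API, letting urlencode escape the parameters avoids quoting each value by hand. It encodes spaces as + and parentheses as %28/%29, which differs from the underscore form used in page links.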