[ Wikipedia API: getting articles with Unicode titles ]

I am trying to retrieve Wikipedia pages using the titles returned by the Wikipedia API. I am calling the API with Python's urllib2 library.

I am stuck on handling titles that contain non-ASCII (Unicode) characters when I try to automate the process. Here is an example of the problem:

Let's assume we want to retrieve the page about Escola Superior de Ciências Empresariais, which has the following link: http://en.wikipedia.org/wiki/Escola_Superior_de_Ci%C3%AAncias_Empresariais_(Set%C3%BAbal)

Calling the API with this percent-encoded title, I do get what I want: http://en.wikipedia.org/w/api.php?action=query&prop=revisions&format=xml&titles=Escola_Superior_de_Ci%C3%AAncias_Empresariais_(Set%C3%BAbal)&rvprop=timestamp|user|comment|content

Going back to the start of the problem: if I also use the Wikipedia API to search for the school, like this: http://en.wikipedia.org/w/api.php?action=query&list=search&srprop=''&format=xml&srsearch=Escola_Superior_de_Empresariais

I get the following xml string:

<api>
  <warnings>
    <search xml:space="preserve">Unrecognized value for parameter \'srprop\': \'\'</search>
  </warnings>
  <query>
    <searchinfo totalhits="7"/>
    <search>
      <p ns="0" title="Escola Superior de Ci\xc3\xaancias Empresariais (Set\xc3\xbabal)"/>
      <p ns="0" title="List of universities in Cape Verde"/>
      <p ns="0" title="Polytechnic Institute of Viana do Castelo"/>
      <p ns="0" title="S\xc3\xa3o Vicente, Cape Verde"/>
      <p ns="0" title="Economy of Portugal"/>
      <p ns="0" title="Higher education in Portugal"/>
      <p ns="0" title="Grupo Opaia SA"/>
    </search>
  </query>
</api>

My question now is: how can I convert the retrieved string

Escola Superior de Ci\xc3\xaancias Empresariais (Set\xc3\xbabal)

to a form that can be called directly from the Wikipedia API? That is to say, how can I create a link like:

http://en.wikipedia.org/wiki/Escola_Superior_de_Ci%C3%AAncias_Empresariais_(Set%C3%BAbal)

from the retrieved results?

Answer 1


Use urllib2.quote, which does this for you: it percent-encodes the UTF-8 bytes of the title.

Essentially:

import urllib2
utf8_url = 'Escola Superior de Ci\xc3\xaancias Empresariais (Set\xc3\xbabal)'
# Percent-encode the UTF-8 bytes (\xc3\xaa becomes %C3%AA); spaces become %20
percent_url = urllib2.quote(utf8_url)
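
To get exactly the link form shown in the question (underscores instead of spaces, parentheses left as-is), something like the following might work. This is only a sketch, and it assumes Python 3, where the quoting function is urllib.parse.quote rather than urllib2.quote. The function name wikipedia_url is made up for illustration, and the base URL is the one from the question.

```python
from urllib.parse import quote

def wikipedia_url(title_bytes):
    """Build an en.wikipedia.org article URL from a UTF-8 encoded title.

    Hypothetical helper for illustration; not part of any library.
    """
    # Decode the UTF-8 bytes found in the API response into text.
    title = title_bytes.decode("utf-8")
    # Wikipedia page URLs use underscores in place of spaces.
    title = title.replace(" ", "_")
    # quote() encodes the text as UTF-8 and percent-escapes it;
    # listing "()" in `safe` leaves the parentheses unescaped.
    return "http://en.wikipedia.org/wiki/" + quote(title, safe="()")

raw = b"Escola Superior de Ci\xc3\xaancias Empresariais (Set\xc3\xbabal)"
print(wikipedia_url(raw))
```

The same percent-encoded title can be passed as the titles parameter of the API query. Leaving the parentheses unescaped is a cosmetic choice; the escaped forms %28 and %29 refer to the same page.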