Wednesday, June 29, 2016

Pythonic Wrapper for Wikipedia API Redux

I mentioned a Pythonic wrapper for the Wikipedia API in an earlier post on this blog, the one about the so-called "Refcards-Project". Here I just want to "reflect" on this Wikipedia library/module for a few moments.

As stated before, it is easy enough to install; just type:

$ pip install wikipedia

So let's do something really simple: take a Wikipedia page and extract a list of all the links found in that page, i.e. the "titles" of the other Wikipedia pages linked to WITHIN the given page we start with. It could go something like this:

>>> import wikipedia
>>> refcard = wikipedia.page("Reference card")
>>> refcard_links = refcard.links
>>> type(refcard_links)
<type 'list'>
>>> len(refcard_links)
18
>>> for i in refcard_links:
...     print i
Application program
Card stock
Cheat sheet
Computer platform
Data mining
Formal language
HTML
Idiom
Index card
LaTeX
Markup language
Memory
PDF
Printing
Programming language
R (programming language)
Syntax
User guide

That's it as far as Python code goes for now, but as I said, I want to "reflect" on this, and in two fundamental ways. One way to reflect on Python code is to do it "philosophically", if you will: you just "think" about it. That's one form of reflection/introspection. The other way, though, is via the interpreter.

If you go back, you will see that I "asked" IDLE this question:

>>> type(refcard_links)
And what did it say, what was the answer?

>>> type(refcard_links)
<type 'list'>

Okay. So here's the point: I think it's important to "interrogate" the data that you are working with. It's important to know what the "types" of things are, and there are many other "questions" you can ask as well. If you noticed, I also "asked" what the "length" of the list was:

>>> len(refcard_links)
18

Now, I'm not a computer scientist. I'm just an artist and researcher, a philosopher if you will. So for me "reflection" just means "asking questions about something". You see, I "expected" that refcard_links would be of type list, but I had to make sure. I also wanted to know what the length was, just in case it contained 13,000 different items. That would screw things up if I were writing a kind of web crawler that crawled every Wikipedia "page" in such a list of Wikipedia hyperlinks. Since there were only 18 items in the list, I knew I was in business.
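
To make that concrete, here are a few more such "questions" you could put to the interpreter. This is just a minimal sketch, reusing the refcard page object from the session above (the print() calls are Python 3 style, though each happens to run under Python 2 as well):

import wikipedia

refcard = wikipedia.page("Reference card")

# What kind of object is the page itself?
print(type(refcard))

# What attributes and methods does it expose?
print(sorted(name for name in dir(refcard) if not name.startswith("_")))

# Interrogate the page's full text: what type is it, and how long is it?
print(type(refcard.content))
print(len(refcard.content))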

So what I could do, if I wanted, is write a simple script that takes any given Wikipedia page (provided it exists) and makes a list of, say, the first 5 links to other Wikipedia pages within that article. Then it could take, say, the first 2 of those links, fetch those pages, make lists of the first 5 links within THOSE pages, and so on and so forth.
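
Just as a rough sketch of that idea (the helper names first_links and shallow_crawl are my own inventions, not part of the library):

import wikipedia

def first_links(title, n=5):
    # Return up to the first n link titles on a page; swallow the
    # library's "missing page" and "ambiguous title" errors.
    try:
        return wikipedia.page(title).links[:n]
    except (wikipedia.exceptions.PageError, wikipedia.exceptions.DisambiguationError):
        return []

def shallow_crawl(title, fanout=2, per_page=5):
    # List a page's first per_page links, then do the same for the
    # first fanout of those links (one level deep, not a full crawl).
    tree = {title: first_links(title, per_page)}
    for child in tree[title][:fanout]:
        tree[child] = first_links(child, per_page)
    return tree

from pprint import pprint
pprint(shallow_crawl("Reference card"))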

I'm not trying to build a Wikipedia crawler, though. I'm actually much more interested in using Wikipedia to do things closer to what we might call Natural Language Processing. That is, as an author of fiction, I'd like to use content on Wikipedia to generate "cut-up novels", let's say, in the style of one William S. Burroughs.

Say I want to build a so-called "cut-up engine". I need text to "seed" the engine. So I take a bunch of pseudo-random Wikipedia pages, or Wikipedia pages that are "interlinked", and I fetch the content and then I can start building Markov chains or whatnot, using the content and other methods to generate new, creative texts algorithmically.
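
As a minimal sketch of such an engine (a word-level, order-1 Markov chain; the two seed titles are just examples, and build_chain/generate are names I made up):

import random
import wikipedia

def build_chain(titles):
    # Map each word to the list of words observed to follow it
    # across the content of all the given pages.
    chain = {}
    for title in titles:
        words = wikipedia.page(title).content.split()
        for current, following in zip(words, words[1:]):
            chain.setdefault(current, []).append(following)
    return chain

def generate(chain, length=50):
    # Random-walk the chain to emit a pseudo-random "cut-up" passage.
    word = random.choice(list(chain))
    output = [word]
    for _ in range(length - 1):
        followers = chain.get(word)
        # At a dead end, jump to a fresh word instead of stopping.
        word = random.choice(followers) if followers else random.choice(list(chain))
        output.append(word)
    return " ".join(output)

chain = build_chain(["Cut-up technique", "William S. Burroughs"])
print(generate(chain))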

The possibilities are endless. Once you have a "source" of textual content, there are no limits to what you can do. I could build a conversational agent that uses Wikipedia content. Say someone says something... I can quickly scan the textual data of what they said, find keywords (or try to), then look them up... then I can have a dozen Wikipedia pages about the different "objects" or events or people or places WITHIN THE TEXT I am analyzing.
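
A deliberately naive sketch of that lookup step (spotting keywords by capitalization is a crude stand-in for real NLP, and lookup_keywords is a hypothetical helper of mine):

import wikipedia

def lookup_keywords(utterance, max_topics=3):
    # Crude keyword spotting: treat capitalized words as candidate topics.
    keywords = [w.strip(".,!?") for w in utterance.split() if w[:1].isupper()]
    facts = {}
    for word in keywords[:max_topics]:
        hits = wikipedia.search(word)  # a list of matching page titles
        if hits:
            try:
                facts[word] = wikipedia.summary(hits[0], sentences=1)
            except wikipedia.exceptions.DisambiguationError:
                pass  # ambiguous topic; a real agent would ask back
    return facts

print(lookup_keywords("i just read a novel by Burroughs set in Tangier"))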

Granted, I will need much more complex "machinery" under the hood, but to me this simple Wikipedia library/module already offers a vast FIELD OF OPPORTUNITIES for testing my Pythonic skills. That's all for now. Just a reflection on how to simply get something going, how to start thinking about the possibilities of what can be built with just a quick "skim" over the basics of a given Python library or module. The rest comes down to two things: a) how much work you're willing to put into it, including further research and so forth, and b) how fertile your imagination is, or how creative you are. Everything else is just details.

I could also do simple data analysis. I might want to see how many pages link back to some given page. I could make graphs or networks of hyperlinks between Wikipedia pages. I could do visualizations. Hey look, this many pages link back to the "Philosophy" page. The links that appear in the first paragraph of any given Wikipedia page seem to have this or that property or attribute or characteristic. That's all data science is: data analysis. You need to start SOMEWHERE, though. And as an amateur/beginner in Python, I want to make things as fun as possible. So I explore libraries and modules like this to see what is the SIMPLEST thing I could build that does something INTERESTING.
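
As a starting point, a sketch like the following could collect the raw material for such a hyperlink network (link_graph is again a made-up helper; a real analysis might hand this dict to a proper graph or visualization library):

import wikipedia

def link_graph(seed, depth=1, per_page=10):
    # Build a small directed graph of Wikipedia links as a plain dict:
    # page title -> a truncated list of titles it links to.
    graph = {}
    frontier = [seed]
    for _ in range(depth):
        next_frontier = []
        for title in frontier:
            if title in graph:
                continue  # already visited
            try:
                graph[title] = wikipedia.page(title).links[:per_page]
            except (wikipedia.exceptions.PageError,
                    wikipedia.exceptions.DisambiguationError):
                graph[title] = []
            next_frontier.extend(graph[title])
        frontier = next_frontier
    return graph

graph = link_graph("Reference card", depth=2)
for page, links in sorted(graph.items()):
    print(page + " -> " + str(len(links)) + " links sampled")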

