Wednesday, June 29, 2016

Pythonic Wrapper for Wikipedia API Redux

I mentioned a Pythonic wrapper for the Wikipedia API in a post on this blog a while back, about the so-called "Refcards-Project". Here I want to just "reflect" on this Wikipedia library/module for a few moments.

As stated before, it is easy enough to install, just type:

$ pip install wikipedia
So let's do something really simple. Let's take a Wikipedia page and extract a list of all the links found in that page (Links to Wikipedia pages, i.e. a list of the "titles" of the Wikipedia pages linked to WITHIN the given Wikipedia page we start with). It could go something like this:

>>> import wikipedia
>>> refcard = wikipedia.page("Reference card")
>>> refcard_links = refcard.links
>>> type(refcard_links)
<type 'list'>
>>> len(refcard_links)
18
>>> for i in refcard_links:
    print i
Application program
Card stock
Cheat sheet
Computer platform
Data mining
Formal language
HTML
Idiom
Index card
LaTeX
Markup language
Memory
PDF
Printing
Programming language
R (programming language)
Syntax
User guide

That's it for the code for now, but as I said I want to "reflect" on this, and I want to reflect on it in two fundamental ways. One way to reflect on Python code is to do it "philosophically", if you will: you just "think" about it. That's one form of reflection/introspection. The other way, though, is via the interpreter itself.

If you go back, you will see that I "asked" IDLE this question:

>>> type(refcard_links)
And what did it say, what was the answer?

>>> type(refcard_links)
<type 'list'>

Okay. So here's the point: I think it's important to "interrogate" the data that you are working with. It's important to know what the "types" of things are, and there are many other "questions" that can be asked as well. If you noticed, I also "asked" what the "length" of the list was:

>>> len(refcard_links)
18

Now, I'm not a computer scientist. I'm just an artist and researcher, a philosopher if you will. So for me "Reflection" just means "asking questions about something". You see, I "expected" that refcard_links would be of type list, but I had to make sure. I also wanted to know what the length was, just in case it contained 13,000 different items. That would screw things up if I were writing a kind of web crawler that crawled every Wikipedia "page" in such a list of Wikipedia hyperlinks. Since there were only 18 items in the list, I knew I was in business.

So what I could do, if I wanted, is write a simple script that takes any given Wikipedia page, given that it exists, and makes a list of say the first 5 links to other Wikipedia pages within that article. Then it could take say the first 2 links, and get/fetch those pages, and make lists of the first 5 links to THOSE pages, and so on and so forth.
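Here is a minimal sketch of that idea, using the same wikipedia module. The function name, the depth limit, and the numbers of links to keep and to follow are just illustrative choices of mine, nothing prescribed by the library:

import wikipedia
from wikipedia.exceptions import DisambiguationError, PageError

def crawl_links(title, depth=0, max_depth=2, links_per_page=5, pages_to_follow=2):
    # Fetch the page; skip titles that turn out to be missing or ambiguous.
    try:
        page = wikipedia.page(title)
    except (DisambiguationError, PageError):
        return
    # Keep the first few links, print them, then recurse into a couple of them.
    first_links = page.links[:links_per_page]
    print("  " * depth + page.title + " -> " + ", ".join(first_links))
    if depth < max_depth:
        for next_title in first_links[:pages_to_follow]:
            crawl_links(next_title, depth + 1, max_depth, links_per_page, pages_to_follow)

crawl_links("Reference card")

Note that .links comes back in alphabetical order, as you can see in the output above, so "the first 5 links" here really means the first 5 alphabetically rather than the first 5 in the article body.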

I'm not trying to build a Wikipedia crawler, though. I'm actually much more interested in using Wikipedia to do things closer to what we might call Natural Language Processing. That is, as an author of fiction, I'd like to use content on Wikipedia to generate "cut-up novels", let's say, in the style of one William S. Burroughs.

Say I want to build a so-called "cut-up engine". I need text to "seed" the engine. So I take a bunch of pseudo-random Wikipedia pages, or Wikipedia pages that are "interlinked", and I fetch the content and then I can start building Markov chains or whatnot, using the content and other methods to generate new, creative texts algorithmically.
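To make that concrete, here is a minimal Markov-chain sketch seeded with the content of two interlinked pages. The helper names and the choice of seed pages are mine, and a real run would also want to handle disambiguation errors, which I leave out to keep the sketch short:

import random
import wikipedia

def build_chain(text):
    # Map each word to the list of words that follow it somewhere in the text.
    words = text.split()
    chain = {}
    for current_word, next_word in zip(words, words[1:]):
        chain.setdefault(current_word, []).append(next_word)
    return chain

def generate(chain, length=30):
    # Walk the chain, picking a random successor at each step.
    word = random.choice(list(chain.keys()))
    output = [word]
    for _ in range(length - 1):
        followers = chain.get(word)
        if not followers:
            break
        word = random.choice(followers)
        output.append(word)
    return " ".join(output)

# Seed the chain with the content of two interlinked pages.
seed_text = ""
for title in ("Reference card", "Cheat sheet"):
    seed_text += wikipedia.page(title).content + " "

print(generate(build_chain(seed_text)))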

The possibilities are endless. Once you have a "source" of textual content, there are no limits to what you can do. I could build a conversational agent that uses Wikipedia content. Say someone says something: I can quickly scan the text of what they said, find keywords (or try to), look them up, and then have a dozen Wikipedia pages about the different "objects" or events or people or places WITHIN THE TEXT I am analyzing.
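A crude sketch of that lookup step might look like this. Treating capitalized words as "keywords" is obviously a placeholder for real NLP, and the function and variable names are just my own illustrations:

import wikipedia
from wikipedia.exceptions import DisambiguationError, PageError

def lookup_topics(utterance, max_topics=3):
    # Crude keyword pass: treat capitalized words as candidate topics.
    candidates = [w.strip(".,!?") for w in utterance.split()
                  if w[:1].isupper() and len(w) > 1]
    summaries = {}
    for word in candidates[:max_topics]:
        results = wikipedia.search(word)
        if not results:
            continue
        try:
            # One-sentence summary of the top search result for each keyword.
            summaries[word] = wikipedia.summary(results[0], sentences=1)
        except (DisambiguationError, PageError):
            continue
    return summaries

print(lookup_topics("I was reading about Riemann surfaces in Helsinki."))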

Granted, I will need much more complex "machinery" under the hood, but to me this simple Wikipedia library/module already offers a vast FIELD OF OPPORTUNITIES for testing my Pythonic skills. That's all for now: just a reflection on how to simply get something going, how to start thinking about the possibilities of what can be built with just a quick "skim" over the basics of a given Python library or module. The rest comes down to two things: a) how much work you're willing to put into it, including further research and so forth, and b) how fertile your imagination is, or how creative you are. The rest is just details.

I could also do simple data analysis. I might want to see how many pages link back to some given page. I could make graphs or networks of hyperlinks between Wikipedia pages. I could do visualizations. Hey look, this many pages link back to the "Philosophy" page; the links that appear in the first paragraph of any given Wikipedia page seem to have this or that property or attribute or characteristic. That's all data science is: data analysis. You need to start SOMEWHERE, though. And as an amateur/beginner in Python, I want to make things as fun as possible. So I explore libraries and modules like this to see what is the SIMPLEST thing I could build to do something INTERESTING.
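As far as I know the wikipedia module doesn't expose "what links here" (backlinks) directly, but even the outgoing links are enough to start sketching a graph. Something like this, where the function name and the page titles are again just illustrative:

import wikipedia
from wikipedia.exceptions import DisambiguationError, PageError

def outgoing_link_graph(titles):
    # Simple adjacency dict: page title -> list of titles it links out to.
    graph = {}
    for title in titles:
        try:
            graph[title] = wikipedia.page(title).links
        except (DisambiguationError, PageError):
            graph[title] = []
    return graph

graph = outgoing_link_graph(["Reference card", "Cheat sheet"])
for title, links in graph.items():
    print(title + " links out to " + str(len(links)) + " pages")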


Sunday, June 12, 2016

[FIELDS_MEDALISTS]

On the Wikipedia page for the "Fields Medal", there is a table of Fields Medalists who were awarded the very high honor in the field of Mathematics. For fun, I took the data from this table and started playing around with the content in the "citation" column.

That is to say, in the table on the Fields Medal Wikipedia page, one has "fields" of "metadata", if you will, for Year, ICM Location, Medalists, Affiliation (when awarded), Affiliation (current/last), and one for Citation. An example of one of the "records" in the table, in a comma-separated-values style of representation, is:

1936, "Oslo, Norway", "Lars Ahlfors", "University of Helsinki, Finland", "Harvard University, US", "Awarded medal for research on covering surfaces related to Riemann surfaces of inverse functions of entire and meromorphic functions. Opened up new fields of analysis."

What I wanted to do, in Python, was make a "list" (a list of lists, one per record) of all of the data from the table. I would then play with random selections of the "citation" data in the table.

I did this and I had a lot of fun. Without going into too many details, let's say that I created a list with just the citations, if that's all I wanted. I would make a statement starting with:

FIELDS_MEDALISTS = [...]

...with a "nested list" of the citations, within the FIELDS_MEDALISTS list, or whatever I want to call it, i.e. "CITATIONS", etc.

Once I had access to the citation data, that is, the actual text of the citations, I could randomly choose citations. Ideally, I'd like to do something slightly more sophisticated with that text, like make a word cloud of the most common words, minus stop-words.
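Reusing the CITATIONS list from the sketch above, picking a random citation and tallying the most common non-stop-words takes only a few lines. A word cloud proper would need something like the wordcloud package, but the counting part might go like this (the stop-word list here is just a stand-in):

import random
from collections import Counter

# CITATIONS here is the list of citation strings built in the sketch above.
print(random.choice(CITATIONS))

# Tally the most common words across all citations, minus a small stop-word list.
STOP_WORDS = {"the", "of", "and", "to", "in", "for", "on", "a", "an", "with"}
words = []
for citation in CITATIONS:
    words.extend(word.strip(".,").lower() for word in citation.split())
counts = Counter(word for word in words if word not in STOP_WORDS)
print(counts.most_common(10))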

I think you can see where I am trying to go with this: CREATE A RANDOM FIELDS MEDALIST CITATION (probably using Markov chains, not sure yet). The idea, then, would be to write a kind of "FIELDS MEDALIST PREDICTOR"... one that, optimally, would be able to predict future winners of the award. Honestly, I doubt I can do a very good job, since I'm not yet all that masterful when it comes to machine learning and so forth. But the dataset itself, the "table" I called it, is pretty small. There haven't been all that many Fields Medalists in human history.

[more to come]...