Interview: Talking Wikidata with Rosie Stephenson-Goodknight, Visiting Wikipedia Scholar

Rosie Stephenson-Goodknight is a Wikipedia editor whom we have had the pleasure of hosting as a Wikipedia Visiting Scholar this year, supported by scholars in the Women Writers Project and Northeastern reference librarians. Rosie has worked to increase and improve the presence of Wikipedia articles on women and writing.

I recently spoke to Rosie about her latest work, her knowledge of the Wikimedia universe, and the potential for digital scholarship projects involving Wikidata and other linked open data initiatives. We are looking forward to continuing to learn from her expertise!

This interview has been edited for length and clarity.

What are you working on right now?

Mostly I work on Wikipedia, but I am a big champion of Wikidata. In fact, I believe Wikidata will be the “mothership” of the Wikimedia movement. Everything else (Wikipedia, WikiSource, WikiCommons and so forth) is a spoke that comes to a center, which is Wikidata, the structured data hub of the Wiki movement.

Mostly what I use Wikidata for is to create lists; it’s excellent for that. I am a cofounder of a group called Women in Red, which seeks to increase the percentage of women’s biographies across the various language Wikipedias, and we create lists of articles that are missing on English Wikipedia but exist on another language Wikipedia. In other words, if a woman has a biography on some language Wikipedia (be it Hindi, Swahili, French or German) but does not have one on English Wikipedia, the Wikidata-generated redlists indicate which articles we might want to translate. Because many of our editors are polyglots, including me, these Wikidata redlists are helpful, as they make us aware of missing articles on English Wikipedia, and I specifically do a lot of that translation work.
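
[Ed. note: The redlist idea described above can be sketched in a few lines of Python. This is a minimal illustration, not the tooling Women in Red actually uses. The sample records, the field names, and the function name build_redlist are all hypothetical; the only assumption taken from Wikidata itself is that each item records which language Wikipedias link to it.]

```python
# Hypothetical sample records: each one lists the language editions of
# Wikipedia that already have an article about the person.
SAMPLE_ITEMS = [
    {"label": "Writer A", "sitelinks": {"dewiki", "frwiki"}},
    {"label": "Writer B", "sitelinks": {"enwiki", "hiwiki"}},
    {"label": "Writer C", "sitelinks": {"swwiki"}},
    {"label": "Writer D", "sitelinks": set()},
]


def build_redlist(items, target="enwiki"):
    """Return labels of people who have an article on some Wikipedia
    other than `target`, but none on `target` itself."""
    redlist = []
    for item in items:
        has_other = any(site != target for site in item["sitelinks"])
        if has_other and target not in item["sitelinks"]:
            redlist.append(item["label"])
    return sorted(redlist)


# Writers A and C are candidates for translation into English.
print(build_redlist(SAMPLE_ITEMS))
```

In this sketch, Writer B is skipped because an English article already exists, and Writer D is skipped because there is no article in any language to translate from.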

How does Wikidata tie into that?

Not all the Wikipedias use the same kind of categorization structure. Let’s say on German Wikipedia there is no category of “American woman novelist.” There is a category for man or woman, and there’s a category for novelist. There are 300 language versions of Wikipedia, and each language Wikipedia chooses its own category structure. But Wikidata is a project focused on structured data. On Wikidata, occupation statements (which are similar to categories) about a novelist should include both “writer” and “novelist”. If the subject does not have an occupation statement of “writer”, she won’t show up on a redlist of missing writers simply because she has an occupation statement of “novelist”.
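
[Ed. note: A small Python sketch of the point about occupation statements, assuming a list is built by looking for an exact match on one occupation. The records and the function name with_occupation are hypothetical and stand in for real Wikidata items.]

```python
# Hypothetical items, each with a set of occupation statements.
PEOPLE = [
    {"label": "Author A", "occupations": {"novelist"}},
    {"label": "Author B", "occupations": {"novelist", "writer"}},
    {"label": "Author C", "occupations": {"poet", "writer"}},
]


def with_occupation(people, occupation):
    """Return labels of people whose statements include `occupation` exactly."""
    return [p["label"] for p in people if occupation in p["occupations"]]


# Author A is a novelist but has no "writer" statement,
# so she is missing from the list of writers.
print(with_occupation(PEOPLE, "writer"))
print(with_occupation(PEOPLE, "novelist"))
```

Under this assumption, adding “writer” to Author A’s statements is what brings her onto the writers list.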

Another way that Wikidata is important to me is Wikidata-generated person infoboxes. On some language Wikipedias, for example Catalan Wikipedia, there are many Wikidata-generated infoboxes; on other language Wikipedias, there is some resistance to this development. Now that there is more focused attention on incorporating structured data on Commons, I think we’ll see a steady increase in the number of Wikidata infoboxes used on that sister project.

As you might be able to tell, I am a very strong supporter of Wikidata. For some, it requires a strong leap of faith, as Wikidata is only five and a half years old and is still going through growing pains. I take a very long view of the Wiki movement. I think of it in terms of 50 years, 150 years, 550 years from now, and I think, eventually, everyone will see structured data as the center of the wiki movement.

What can Wikidata do to improve information on Wikipedia? You’ve given me some examples already, but what is the biggest impact that Wikidata could have – and what would you be most excited about seeing happen as a result of Wikidata?

I think that it goes so far beyond what is contained in any Wikipedia.

We have something on Wikidata, a project called the Sum of All Paintings (SOAP). If we have an entry on Wikidata for every painting that exists, then eventually we would probably write articles on most of those, or at least, we would have them accounted for on Wikidata. And the same would be true for so many other things. If we bring all of those together, there is such a huge benefit to having this information in a linked, structured data hub. This isn’t just about [a writer’s] name, Jane Doe. There will be Wikidata properties for the books that Jane Doe wrote, and there will be statements regarding the reviews about these books. Well, you can see the richness of information that we will accumulate.

I want to make it clear that Wikidata is not the sum of all data. All data would include my cat’s name, and my cat’s name doesn’t belong in Wikidata or Wikipedia. There might be a photo of my cat on WikiCommons, as an example of a Calico tabby, but it won’t have a presence in Wikidata. So I just want to make sure I’m clear that Wikidata is not the sum of all knowledge, but rather the sum of all notable structured data.

That’s really an important distinction. What impact could Wikidata have on mapping or visualizations, or in translating those lists into visual format?

The fact is, data isn’t just limited to the Wikimedia universe. If the Northeastern University Women Writers Project uploaded its data into Wikidata, you could use the information from there to generate different kinds of displays, first of all, but could also use it to look at its holes, the missing information that it might want to fill. We would be able to see more clearly that it has plenty of information about women of a particular era or geographic region, but not so much about women of earlier decades, or from different countries. It could be that there’s a plethora of information within your holdings about American women and British women, but not so much about Canadian women and Australian women. And maybe that becomes a new direction; maybe the university decides it wants to expand its work in the area where there are these gaps.

Alternatively, you can review the publications held by Northeastern University against the data sets of other organizations, such as the Orlando Project, and see if that influences the acquisition of other publications by pre-Victorian women writers.

That’s fascinating. Are there any concerns with that? Are there any ways in which, you know, that type of structured data can reproduce some of the inequalities that already exist in traditional archives or repositories? Certainly, it can counteract some of that too.

Yes, it could certainly also enforce them.

If you think about how someone gets an authority control, they have to be published in the United States, for example.

[Ed. note: Authority control is the process of creating and choosing one unique number or name for each concept or entity, for example using “Boston, Massachusetts” as an authorized phrase, and cross-referencing instances of “Boston”, “Boston, MA”, “Boston, Mass.”, “Beantown”, or “The Hub” to “Boston, Massachusetts.”]
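
[Ed. note: The cross-referencing described in the note above can be sketched as a simple lookup table in Python. This is an illustration only; real authority files are far larger and are maintained by libraries and other institutions. The table name CROSS_REFERENCES and the function name authorize are hypothetical.]

```python
AUTHORIZED = "Boston, Massachusetts"

# Hypothetical cross-reference table: variant form -> authorized form.
# Keys are stored in lowercase so that lookups ignore capitalization.
CROSS_REFERENCES = {
    "boston": AUTHORIZED,
    "boston, ma": AUTHORIZED,
    "boston, mass.": AUTHORIZED,
    "beantown": AUTHORIZED,
    "the hub": AUTHORIZED,
    AUTHORIZED.lower(): AUTHORIZED,
}


def authorize(name):
    """Return the authorized form for `name`, or None if it is unknown."""
    key = name.strip().lower()
    return CROSS_REFERENCES.get(key)


print(authorize("Beantown"))
print(authorize("Springfield"))
```

A name with no entry in the table comes back as None, which is the situation discussed in this answer: a writer with no authority record cannot be matched to anything.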

How do you get published in the United States? If you were a pre-20th century, English-language woman writer, something very important and unique had to have happened so that you were able to get published. Certainly, there’s some bias associated with that.

There are also very bright women writers who wrote before the 20th century in the English language, whose works weren’t published. If we think that Western peer-reviewed journal reviews are the key to acceptable sources, then we are, in our own way, supporting bias. And, if a review of her work was done in a Western style, in a newspaper, magazine, book, or journal that mostly white Western men contributed to, then we’re not supporting other ways of establishing the notability of women’s works.

So it’s not perfect. It does, in this way, support bias, but we have to start somewhere. I am a big supporter of including the information that we have, and I am very much an inclusionist versus a deletionist. As we move forward and become hopefully more enlightened, we will find ways to include other information that historically has not been available.

What are some of the other “gaps” you see frequently?

I do a lot of international speaking on a topic that I have entitled, “He Who Writes the History Books Wins.” I talk about Nunavut in Canada, how a thousand years ago there were people who lived in what is now Nunavut, but until the European whalers and explorers arrived, there wasn’t a written language that would document the notable people of that time and place. Although a material culture was left behind, we lack information about who, 2,000 years ago, was a notable woman leader, or a woman writer.

We don’t have anything like that about Nunavut, Canada, nor do we have that about gladiators. Gladiators were typically slaves and as such, they didn’t write memoirs. Most of what we know about gladiators is what others wrote about them. And so there again, there’s all this missing knowledge. At some point, you have to accept that there are these missing pieces of knowledge and then just assure that what we do have access to is documented in a culturally-appropriate way. We can say, “This is what we have and there’s a recognition that there are pieces that are missing.”

Is there a way in Wikidata that we can better show those gaps and reveal what is missing?

I think the short answer would be that we [Wikimedians] deal with this on “talk pages” where we discuss whether or not to approve adding something. For example, through discussions and consensus, we decide if we’ll add a database. We know that there are lots of missing databases. For example, the Northeastern University Women Writers Project has a collection of writings and reviews of them, but I don’t believe the collection is accounted for on Wikidata.

We know that there are these holdings, and that these are not a part of Wikidata, [because we are familiar with the work.] But there are all these other universities with other holdings, there are all these GLAM institutions — galleries, libraries, archives, museums — who have holdings, and I can say with certainty that some of them are not accounted for on Wikidata and at least some of them should be.

It seems to have so much potential, and I’m really glad people like you are thinking about these issues as the work is getting done.

Where I have a shortcoming is that I’m not a data scientist, so I lack the deeper understanding of how to deal with data that others have. Wikidata requires a lot of different hands, lots of different people bringing different ideas and expertise together in order to move the ball forward a little bit at a time.

That definitely seems to be the most effective way forward. It’s going to take a team. Is there anything else that you just think people really need to know about the work you’ve done on Wikipedia?

I really love the work I’m doing for Northeastern University. I’ve created a total of 202 articles [as Wikipedia Visiting Scholar] and edited an additional 90. What intrigues me as I’m looking at this dashboard is that of the articles that I’ve worked on, there have subsequently been 451,000 article views.

For me especially, that’s both interesting and feels promising. When I started writing articles on Wikipedia in 2007, I thought I was writing in a vacuum.

I thought, you know, I write on obscure subjects – for example, pre-20th century English language women writers, or locations such as peninsulas, and peaks, lakes and rivers, or old demolished buildings in New York City, and so on. And I felt like I’m writing these articles and probably no one is looking at them. If people read them or not does not matter to me in the least bit, as I couldn’t not write these articles. So to see that there are 451,000 views of the articles I’ve created or improved within the scope of Northeastern University’s Women Writers Project, well, that is heartening. It means the work on these articles has an impact on society. That’s kind of like a wow moment.

That’s really incredible to think about, and truly shows the impact of the work.

I’ve added 511,000 words and I have made 647,000 total edits and added 626 images to WikiCommons associated with these articles. This is something I do because I love doing it. I’ve never kept track of the number of words, the number of edits, the number of images, but my Wikipedia Dashboard does show the activity and so I’m grateful for that. It means we can put some data around what I’ve been doing.