Developing and Deploying a Wikipedia Network

Vitor dos Santos
8 min read · Aug 30, 2021

Hello, everyone!

Today, I’m going to explain how to develop and deploy a network built from Wikipedia data using NetworkX and Gephi, an interesting tool for visualizing graphs and networks.

In this article, we will build a large network, with more than 80 thousand nodes, which can be very difficult to work with if you have to construct it manually or through a CSV file. Therefore, we will also present an easy way to construct a large network by automatically collecting node and edge data from the Internet.

The other goal of this article is to examine whether complex network analysis fits into the context of other subjects and disciplines. As a case study, we will use the Wikipedia page about Tourism in Brazil. So, let’s dive in!

Building the network

A Wikipedia page body has external links and links to other Wikipedia pages. Those other pages are presumably somehow related to the content of our case study, Tourism in Brazil. To build a network of this initial page and the other relevant pages, we will treat the pages (and their respective Wikipedia subjects) as the network nodes and the links between the pages as the network edges. In addition, we will use snowball sampling (a breadth-first search, or BFS, algorithm) to discover all the nodes and edges of interest. As a result, we will have a network of all pages related to Tourism in Brazil and, hopefully, we will be able to draw some conclusions about it.

Initially, we have to install NetworkX (we used version 2.6.2) and a library called wikipedia. Then, we set the seed to be Tourism in Brazil and defined some stop pages, to avoid unnecessary pages that are not related to the case study and that can easily be cited by any other page.
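A minimal sketch of this setup is shown below. The stop pages listed are illustrative placeholders, not the original notebook’s exact list; replace them with whatever generic pages pollute your own results:

# Install first: pip install networkx==2.6.2 wikipedia
import networkx as nx
import wikipedia

SEED = "Tourism in Brazil"

# Illustrative stop pages: generic pages that almost any article links to.
STOPS = {
    "International Standard Book Number",
    "International Standard Serial Number",
    "Digital Object Identifier",
    "Wayback Machine",
    "Geographic coordinate system",
}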

Next, we will perform a breadth-first search to recursively find all pages related to the seed page. In other words, we start from the seed page (layer 0) and look for all the links within the seed page (layer 1). Then, for each page in layer 1, we also look for all the cited pages (layer 2) and store them as nodes. By applying this recursive method, we increase the number of nodes in the network exponentially. We could keep going to deeper layers, but we decided to limit the search to layer 2.
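Below is one way to implement this layer-limited BFS with the wikipedia library. It is a sketch of the idea rather than the exact notebook code: the filter on “List of” pages is an assumption, and errors from disambiguation or missing pages are simply skipped.

todo = [(0, SEED)]  # (layer, page title) pairs still to visit
done = set()        # titles whose links have already been fetched
G = nx.DiGraph()

while todo:
    layer, title = todo.pop(0)
    if title in done:
        continue
    done.add(title)
    try:
        page = wikipedia.page(title, auto_suggest=False)
    except Exception:  # skip disambiguation pages, missing pages, etc.
        continue
    for link in page.links:
        if link in STOPS or link.startswith("List of"):
            continue
        G.add_edge(title, link)
        if layer + 1 < 2:  # only pages up to layer 1 are expanded further
            todo.append((layer + 1, link))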

When doing this in our network, we found 86,652 nodes and 221,562 edges. Quite a lot, isn’t it? Moreover, it took 7 minutes and 12 seconds to process.

Many Wikipedia pages exist under two or more names. For example, Dune and Dunes, Friend and Friends, or Great-Britain and Great Britain, which have essentially the same content. Therefore, we decided to merge these nodes. To do this, we looked for nodes with the same name but with an additional “s” at the end (denoting a plural), or for pairs of equal names where a hyphen separates the words instead of a space. After merging, our network had 85,906 nodes and 221,021 edges. This processing took 13 minutes.
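A sketch of this merging step, using nx.contracted_nodes (which keeps the first node and redirects the second node’s edges to it):

# Merge plural/singular duplicates ("Dune"/"Dunes").
for node in list(G.nodes()):
    plural = node + "s"
    if plural in G:
        G = nx.contracted_nodes(G, node, plural, self_loops=False)

# Merge hyphen/space duplicates ("Great-Britain"/"Great Britain").
for node in list(G.nodes()):
    spaced = node.replace("-", " ")
    if spaced != node and spaced in G:
        G = nx.contracted_nodes(G, spaced, node, self_loops=False)

# contracted_nodes() records merge details in a "contraction" dict
# attribute, which write_graphml() cannot serialize, so we drop it.
for _, data in G.nodes(data=True):
    data.pop("contraction", None)
for _, _, data in G.edges(data=True):
    data.pop("contraction", None)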

So, before continuing to process the network, let’s think about what information we can extract from it besides seeing the connections. As previously said, an edge from node x to node y means that node x cited node y. However, node y may or may not cite node x back. Thus, we have a directed graph. We can consider a page relevant if it is cited many times, i.e., if a page has many links pointing to it, its topic must be significant. The measure that counts the number of edges arriving at a specific node is the node’s in-degree, which in our application equals the number of HTML links pointing to the respective page. Thus, we can conclude that a page with a high in-degree is cited many times and is therefore relevant; conversely, a page with a low in-degree is not relevant.

Now, to get a sense of how many irrelevant pages our current network has, we made a histogram counting the number of nodes with a degree from 1 to 10, as shown in the Figure below.

Figure 1: Number of nodes with respect to the degree.
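For reference, a histogram like Figure 1 can be produced with a few lines of matplotlib (assuming it is installed):

import matplotlib.pyplot as plt
from collections import Counter

# Count how many nodes have each (total) degree from 1 to 10.
degree_counts = Counter(d for _, d in G.degree() if 1 <= d <= 10)

plt.bar(degree_counts.keys(), degree_counts.values())
plt.xlabel("Degree")
plt.ylabel("Number of nodes")
plt.show()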

As can be seen, there are around 60 thousand nodes with a degree equal to 1, which corresponds to about 70% of our current network! There are also around 10 thousand nodes with a degree equal to 2.

Since there is much irrelevant information in our network, we decided to eliminate all the nodes with a degree of less than 3. By doing this, we removed 81.83% of the nodes and 37.16% of the edges. Now, our network has 15,610 nodes and 138,883 edges, which will be easier to visualize and process.
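In NetworkX, this filtering step can be done with a subgraph, for example:

# Keep only the nodes whose total degree (in + out) is at least 3.
core = [node for node, deg in G.degree() if deg >= 3]
G = G.subgraph(core).copy()  # copy() makes the subgraph independent
print(G.number_of_nodes(), G.number_of_edges())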

Finally, to get an initial sense of the most important nodes in the network, the Figure below shows the top 20 nodes with the highest in-degree, i.e., the most cited pages within our network.

Figure 2: Most relevant nodes in the network.
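The ranking in Figure 2 can be reproduced by sorting the nodes by in-degree:

# The 20 most cited pages: sort by in-degree, highest first.
top20 = sorted(G.in_degree(), key=lambda pair: pair[1], reverse=True)[:20]
for title, indegree in top20:
    print(f"{title}: {indegree}")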

The most important node of this network is Brazil, which makes sense since we are investigating Tourism in Brazil. We also have many pages related to other aspects of the country, such as States, Economy, Industry, Agriculture, Corruption (sadly), and so on. It also makes sense for these pages to appear in the top 20, since many tourism pages can also address these other aspects of the country.

In terms of tourism, we highlight the Pantanal, the world’s largest flooded grassland, an area of high biological diversity and a great place to visit.

Visualizing the Graph in Gephi

Now, let’s create a visual representation of the network using Gephi. First of all, we have to export our graph into a file that Gephi can read. We do this with the command:

nx.write_graphml(graph, "brazil_tourism.graphml")

where graph is the NetworkX graph that we have constructed so far and brazil_tourism.graphml is the file name under which it will be saved. After that, we can open this graph in Gephi.

In this article, we will focus on what we did to construct the final visual representation of the graph, rather than on how to perform each step in Gephi.

So, first of all, we computed some additional statistics of the network using Gephi:

  • Diameter = 7
  • Average Degree = 8.923
  • Modularity = 0.576
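These values come from Gephi’s Statistics panel, but comparable numbers can be computed in NetworkX as a sanity check. Note that Gephi’s Modularity uses the Louvain method, while the sketch below uses NetworkX’s greedy modularity maximisation, so the scores will not match exactly:

from networkx.algorithms import community

# Average degree of a directed graph: each edge adds one in-degree and
# one out-degree, so edges / nodes matches Gephi's "Average Degree".
avg_degree = G.number_of_edges() / G.number_of_nodes()

# Diameter requires a connected graph; take the largest weakly connected
# component, viewed as undirected (this can be slow on large graphs).
largest = max(nx.weakly_connected_components(G), key=len)
diameter = nx.diameter(G.subgraph(largest).to_undirected())

# Greedy modularity maximisation as a rough substitute for Louvain.
U = G.to_undirected()
partition = community.greedy_modularity_communities(U)
mod = community.modularity(U, partition)

print(avg_degree, diameter, mod)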

Then, we used the ForceAtlas2 layout to rearrange the nodes visually, colored the nodes according to their modularity class, and sized them by in-degree, so bigger nodes have a higher in-degree. We then applied a filter to show a node’s name only if its in-degree is higher than 189.

As shown in Figure 2, the most relevant nodes have content related to the country of Brazil itself. So, in order to take a look at the states and places most relevant to Brazilian tourism, I ordered the nodes by their in-degree values and manually selected the option to show the names of the ones with the highest values. Thus, we had the following nodes:

  • Pantanal: 174
  • Amazon: 166
  • Minas Gerais: 166
  • Rio de Janeiro: 153
  • São Paulo: 133
  • Rio Grande do Sul: 126
  • Espírito Santo: 123
  • Bahia: 121
  • Pernambuco: 113
  • Rio Grande do Norte: 90

Then, I also manually changed the position of these nodes to facilitate visualizing their connections and groups. Finally, we obtained the following network:

Figure 3: Network connection related to Tourism in Brazil.

Deploying the Network

Even though the Figure above contains much information, it is just a static image. In order to get even more information from the network, such as node metrics, connections, and so on, we are going to make it interactive by deploying the network.

To do that, we have to install another plugin in Gephi. Go to Tools and click on Plugins. Then, look for the SigmaExporter plugin and install it.

After restarting Gephi, go to File, Export, and then Sigma.js template. A window will pop up, as shown in the Figure below.

Figure 4: Exporting Gephi Project

You must fill in all the information, such as what the nodes and edges represent, which metric was used to set the colors, the title, the author, and some other details that will depend on your project. After clicking OK, a web project will be created in the chosen directory.

Then, you can create a repository on your GitHub account and upload the entire folder that was just created. Once you have uploaded it, go to your repository’s page, click on Settings, then Pages, and under Source select the main branch and click Save. The figure below shows this last configuration:

Figure 5: Setting a URL for your network.

By doing this, you can access your network through your GitHub Pages URL. For this article, you can access it using the following link:

https://vitorgaboardi.github.io/network_analysis/network/

Since there are many nodes and edges, it may take some time to load the network, so be patient.

You can access the notebook developed for this article in my repository.

Conclusion

In this article, I have shown how to build a large network using the wikipedia library in Python, perform basic pre-processing, and filter out irrelevant information. For this application, the most important nodes were the ones with the highest in-degree values, since they have more citations from other articles.

Afterwards, we built a visual representation of the network using Gephi and, finally, by installing some plugins, we were able to deploy the network, making it easier to analyze and visualize any parameter or connection within it, besides creating a URL for it.

So, that’s all folks! I hope that this article helped you somehow!

See you!


Vitor dos Santos

PhD student in Computer Science at Dublin City University. Interested in Computer Vision, Deep Learning, and Data Science.