Politician Map in English Wikipedia

[ Parsing DBpedia Data ]

The main goal is to retrieve a list of all politicians from DBpedia as a starting point for the Wiki parsing team. Additional infobox data for each politician, such as age, date of birth, and nationality, is also extracted.

Process

  • Starting point: DBpedia (a data set representing Wikipedia in a structured way), which has its own article system to store the Wikipedia data (http://dbpedia.org/resource/Johnny_Depp vs http://en.wikipedia.org/wiki/Johnny_Depp). The two are obviously connected, but DBpedia uses the first form of link to reference its data.
  • Used datasets:
    • Instance_types_transitive: contains the type assignment for each DBpedia article, based on 3 different type systems
    • Mapping-based_literals: contains infobox properties whose values are plain text (no links)
    • Mapping-based_objects: contains infobox properties whose values link to other DBpedia articles
    • Genders: contains the gender of DBpedia articles (unfortunately very incomplete)
    • Wikipedia_links: contains a mapping from each DBpedia article to the actual Wikipedia page
    • Article_categories: contains all the categories a Wikipedia page is assigned to
  • Used instance_types_transitive to get a list of pages describing people, i.e. all DBpedia articles that are assigned to the type "Person" in at least one of the 3 type systems
  • Based on the list of all people pages, information was extracted from the infobox properties:
    • If a property appeared in the mapping-based_literals dataset, it could be processed immediately
    • If a property appeared in the mapping-based_objects dataset, it needed to be transformed into a literal
    • If the link had the "usual" form, it was transformed into a literal by taking the last segment of the URL (e.g. http://dbpedia.org/resource/Johnny_Depp -> "Johnny_Depp")
  • A filter was set on the following properties: gender, name, birthDate, nationality, occupation (politicians). The actual dataset was built based on this property assignment; a minimal sketch of the extraction step follows below.
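
The following is a minimal sketch of the extraction described above: reading triples from an N-Triples dump and turning linked resources back into plain literals. The file name, the regular expression, and the simplified literal handling are illustrative assumptions, not the project's actual code.

```python
import re

# one N-Triples line: <subject> <predicate> object .
TRIPLE_RE = re.compile(r'<([^>]+)> <([^>]+)> (.+) \.$')

def uri_to_literal(uri):
    """Turn e.g. http://dbpedia.org/resource/Johnny_Depp into 'Johnny_Depp'."""
    return uri.rstrip('/').rsplit('/', 1)[-1]

def parse_triples(path):
    """Yield (subject, predicate, object) from an N-Triples dump."""
    with open(path, encoding='utf-8') as f:
        for line in f:
            m = TRIPLE_RE.match(line.strip())
            if m:
                yield m.groups()

# Collect infobox properties per article from the objects dump,
# converting linked resources into literals as described above.
properties = {}
for subj, pred, obj in parse_triples('mappingbased_objects_en.ttl'):
    if obj.startswith('<') and obj.endswith('>'):
        value = uri_to_literal(obj[1:-1])   # linked resource -> literal
    else:
        # simplified literal handling: take the text between the first
        # pair of quotes, ignoring @lang / ^^datatype suffixes
        value = obj.split('"')[1] if '"' in obj else obj
    properties.setdefault(subj, {})[pred] = value
```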

Parsing Wikipedia

The main goal of the Wikipedia parsing team is to retrieve the internal links between articles about politicians, one revision per month for all years from 2001 to 2016. A set of outgoing edge lists is then generated, representing the network of connected politicians.

Constraints and issues faced

  • Memory leak – a memory leak was detected during the initial attempt at running the code: 2 GB of RAM was already consumed after parsing approximately 3,500 articles. Code profiling traced the root of the problem to one of the external libraries. To work around it, the code was rewritten so that the article list is parsed in chunks of 1,000 articles, called from a bash script in a loop (see the sketch after this list).
  • Python libraries – some of the libraries were deemed unfit for use due to the aforementioned memory leak.
  • Time – parsing all politician Wikipedia pages was rather time-consuming; it took 4 days on virtual servers to reach completion.
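
A minimal sketch of the chunking workaround, assuming the chunk offset is passed as a command-line argument (the argument handling and the parse_article() helper are hypothetical). Each process invocation parses one chunk and then exits, so any memory leaked by the faulty library is reclaimed by the operating system.

```python
import sys

CHUNK_SIZE = 1000

def parse_article(title):
    ...  # per-article parsing work (see the Wikipedia parsing process below)

def main():
    start = int(sys.argv[1])  # chunk offset passed in by the bash loop
    with open('Politician_data_tracker.csv', encoding='utf-8') as f:
        titles = [line.strip().split(',')[0] for line in f]
    for title in titles[start:start + CHUNK_SIZE]:
        parse_article(title)

if __name__ == '__main__':
    main()

# Driven from a bash loop along these lines:
#   for i in $(seq 0 1000 70000); do python parse_chunk.py "$i"; done
```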

Process

  • “Politician_data_tracker.csv” is created for the list of politicians (wiki address and ID) extracted from DBpedia.
  • “parsed_data.csv” is created to record the politicians that have already been parsed, so that the code does not parse the same articles again.
  • “profile-data” folder is created to store final deliverables.
  • In addition to the Python standard library (which provides the pickle serialization module), the third-party libraries mwclient and wikitextparser were used.
  • mwclient was used to retrieve revision metadata (revision ID and timestamp) for each article.
  • The retrieved data was filtered to keep only one revision per month, for which the text of the particular Wiki page was subsequently retrieved.
  • wikitextparser was used to parse the text of the Wiki page and to extract the internal Wikipedia links.
  • The array of extracted links was saved to the “profile-data” folder using the Python serialization module pickle. The corresponding entries were then added to the “Politician_data_tracker.csv” and “parsed_data.csv” files.
  • “Politician_data_tracker.csv” and “parsed_data.csv” are dynamically updated during parsing. A separate script was used to create the edge lists. A sketch of the per-article pipeline follows this list.
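
The sketch below combines these steps into one per-article pipeline. It assumes that requested revision content is exposed under the '*' key of each revision dict (true for the classic mwclient response format, but an assumption about the exact version used); the helper name, file layout, and example title are illustrative.

```python
import pickle
import mwclient
import wikitextparser as wtp

site = mwclient.Site('en.wikipedia.org')

def parse_politician(title):
    """Extract internal links from one revision per month and pickle them."""
    page = site.pages[title]
    monthly = {}
    # Revisions arrive newest-first; keep the first revision seen
    # for each (year, month) pair.
    for rev in page.revisions(prop='ids|timestamp|content'):
        ts = rev['timestamp']                    # a time.struct_time
        month = (ts.tm_year, ts.tm_mon)
        if month in monthly or '*' not in rev:
            continue
        wikitext = rev['*']
        monthly[month] = [wl.target for wl in wtp.parse(wikitext).wikilinks]
    # Serialize the extracted link lists into the profile-data folder.
    with open('profile-data/%s.pickle' % title, 'wb') as f:
        pickle.dump(monthly, f)

parse_politician('Angela_Merkel')   # illustrative example call
```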

Network Analysis

Number of Female and Male Politicians

The plot shows that the number of male politician Wiki pages stays significantly higher than the number of female politician Wiki pages over the whole observation period. As the total number of nodes increases over the years, so does the absolute size of the gap.

However, the ratio is slowly declining: as of December 2016, the number of male politician Wiki pages is more than five times the number of female politician Wiki pages, whereas in 2006 there were more than seven times as many male politician Wiki pages as female ones.

Gender Homophily

The first plot compares the share of links that point to females: females link to females at a higher rate than males link to females. The percentage of female-to-female links is the number of links from females to females divided by the number of links from females to males and females. The percentage of male-to-female links is the number of links from males to females divided by the number of links from males to males and females.

The second plot compares the share of links that point to males: males link to males at a higher rate than females link to males. The percentage of male-to-male links is the number of links from males to males divided by the number of links from males to males and females. The percentage of female-to-male links is the number of links from females to males divided by the number of links from females to males and females.
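
Writing E_{X→Y} for the number of links from gender X to gender Y (notation introduced here only to state the definitions above compactly), the four percentages are:

```latex
P_{F \to F} = \frac{E_{F \to F}}{E_{F \to F} + E_{F \to M}}, \qquad
P_{M \to F} = \frac{E_{M \to F}}{E_{M \to F} + E_{M \to M}}, \\
P_{M \to M} = \frac{E_{M \to M}}{E_{M \to M} + E_{M \to F}}, \qquad
P_{F \to M} = \frac{E_{F \to M}}{E_{F \to M} + E_{F \to F}}.
```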

In short, the homophily among female nodes is stronger than the homophily among male nodes.

Average in-degree

The plot shows that, according to the in-degree centrality measure, men are significantly more central.
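
A minimal sketch of how such per-gender averages can be computed with networkx, assuming a monthly edge list and a node-to-gender mapping (function and variable names are hypothetical):

```python
import networkx as nx

def average_in_degree_by_gender(edges, gender):
    """edges: iterable of (source, target); gender: dict node -> 'male'/'female'."""
    G = nx.DiGraph(edges)
    degrees = {'male': [], 'female': []}
    for node, deg in G.in_degree():
        g = gender.get(node)
        if g in degrees:
            degrees[g].append(deg)
    # Mean in-degree per gender for this snapshot of the network.
    return {g: sum(v) / len(v) for g, v in degrees.items() if v}

# Example with a toy edge list:
print(average_in_degree_by_gender(
    [('a', 'b'), ('c', 'b'), ('b', 'd')],
    {'a': 'male', 'b': 'female', 'c': 'male', 'd': 'male'}))
```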

The second plot shows error bars: instead of a single data point representing the mean, the error bars represent the overall distribution of the in-degrees. Based on the error bars, the variance of the in-degrees increases over the time period, and the variance of the male in-degrees increases more than that of the female ones.

The third plot shows a bar chart of the average in-degrees over the period, with error bars to show how the variance changes. In this plot, only one revision per year is considered.

The error bars show the standard error, which is calculated by dividing the standard deviation by the square root of the number of measurements that make up the mean (the number of revisions).
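
In formula form (s is the standard deviation, n the number of revisions contributing to each mean):

```latex
\mathrm{SE} = \frac{s}{\sqrt{n}}
```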

Male politician Wiki pages are more likely to be linked to than female ones, which means male politicians are on average more popular or more noticeable. Another reason is that there are more male nodes than female nodes, which, combined with the measurable level of homophily in the network (as pointed out earlier), means that male nodes get more inlinks.

This pattern holds over the full observation period; the relative difference is decreasing very slowly over the years.

Average k-core

According to the k-core measure, men are more likely than women to be part of well-connected subnetworks.
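
A sketch of the corresponding computation with networkx, under the same hypothetical inputs as the in-degree example above; note that core_number() rejects graphs with self-loops, so these are removed first:

```python
import networkx as nx

def average_core_by_gender(edges, gender):
    """Average k-core number per gender, on the undirected version of the network."""
    G = nx.Graph(edges)
    G.remove_edges_from(nx.selfloop_edges(G))
    cores = nx.core_number(G)     # node -> k of the deepest k-core containing it
    by_gender = {'male': [], 'female': []}
    for node, k in cores.items():
        if gender.get(node) in by_gender:
            by_gender[gender[node]].append(k)
    return {g: sum(v) / len(v) for g, v in by_gender.items() if v}
```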

The second plot shows error bars: instead of a single data point representing the mean, the error bars represent the overall distribution of the k-core values. Based on the error bars, the variance of the k-core values increases over the time period, and the variance of the male k-core values increases more than that of the female ones.

As in the in-degree analysis, the error bars show the standard error: the standard deviation divided by the square root of the number of revisions that make up the mean.

UI/UX Documentation

Core responsibilities

The UI/UX team is in charge of designing and developing a web interface that presents the network graph and statistics to end users. The team is also responsible for integrating the inputs from both the Wikipedia parsing and network analysis teams.

Constraints and issues faced

  • Hairball visualization – attempts to produce a meaningful network graph including all politicians (approximately 70,000) were unsuccessful, because the sheer number of nodes and edges filled the entire screen. Such a large network also made the visualization and the slider laggy and unresponsive.
  • High dependence on other teams – no significant progress could be made until the other teams provided the input data. Once the input files were obtained, it was much easier to determine the limitations and possible areas of improvement for the user interface.
  • Database – there were back-and-forth debates within the team on whether implementing a database system was necessary.

Process

The team was sub-divided into two, one to design the website framework and interface, and the other to develop the visualization of the network graph.

Website framework and interface

Visualization

  • Selection of a visualization library: after taking into account the visualization requirements and ease of implementation, D3.js was selected.
  • Search for an existing D3.js implementation that could read from an input file and display a basic network of nodes and edges.
  • Enhancement of the implementation, such as colouring and labeling of nodes, mouse-over events, and the addition of a slider.
  • Continuous testing and performance optimization of the network graph based on actual input data from the CSV files.