Eventnet tutorial (analyzing large event networks)
Eventnet is now hosted on GitHub at https://github.com/juergenlerner/eventnet.
In this tutorial we illustrate how large networks of relational events, comprising millions of nodes and hundreds of millions of relational events, can be analyzed with eventnet. Analyzing such large networks requires two sampling techniques: (1) case-control sampling, where non-events (controls) are sampled from the risk set (already described in the basic tutorial), and (2) sampling from the observed events, that is, from the sequence of input events. This tutorial assumes that you are familiar with the first steps tutorial and with the basic tutorial.
Study overview
In this study we analyze network effects that drive the attention of contributing Wikipedia users to Wikipedia articles. In this respect, the study is similar to the one described in the tutorial on the case study activity and attention in the collective production of migration-related Wikipedia articles. However, on this page we describe the analysis of all edit events in the English-language edition of Wikipedia, giving rise to an event network comprising more than 6 million Wikipedia users and more than 5 million Wikipedia articles, connected by more than 360 million relational events. We further describe how the reliability of estimates under sampling can be experimentally assessed by repeated sampling. The results also yield guidelines for choosing the sample size.
Replication (data, eventnet configuration, and analysis)
All edit events are extracted from the public database dumps provided by the Wikimedia Foundation. We extracted all events in which any registered user uploaded a new revision of any article in the English-language edition of Wikipedia in the time frame from January 15th, 2001 to January 1st, 2018. Users are identified by their user names, articles by their titles, and edit times are given in milliseconds.
This preprocessed data is available at Zenodo as Wikipedia Edit Event Data 2018 (WikiEvent.2018) (DOI: 10.5281/zenodo.1626323). If you are interested in replicating or modifying the analysis described on this page, download the ZIP file WikiEvent.2018.csv.zip provided under this link and unzip it on your computer.
The eventnet configuration for replicating the analysis is provided in the file config.test.sampling.reliability.xml. Note that this configuration does not analyze the event network just once. Rather it repeats the analysis 380 times, varying the sample parameters, to experimentally assess the variability of parameter estimates that is caused by sampling. Thus, the configuration contains 380 observations (see the basic tutorial) that are almost identical but vary in the sample parameters as described below.
The computation of explanatory variables can be started with the command java -Xmx52g -jar eventnet-x.y.jar config.test.sampling.reliability.xml (where x.y is to be replaced by the version number; the option -Xmx52g sets the maximum Java heap size to 52 GB). This works only if the eventnet JAR file and the event input file WikiEvent.2018.csv are in the directory from which you execute the command. (If not, update the input directory or simply move these files to the current directory.) Execution creates a directory named output in the current directory (you can change this) and writes 380 CSV files to it (one for each observation), containing the computed statistics for all sampled events and controls. Each of these output files can be analyzed, for instance, with the coxph function of the R package survival, as described in the basic tutorial.
Updating directories or filenames might also be done by editing the configuration XML file directly, without starting the eventnet GUI, as indicated below.
...
<input.files accept=".csv" has.header="true" delimiter="SEMICOLON" quote.char="DOUBLEQUOTE">
  <input.directory name="."/>
  <file name="WikiEvent.2018.csv"/>
</input.files>
<output.directory name="./output"/>
...
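If many directories have to be changed, the same edit can be scripted. The following is a minimal sketch using Python's standard xml.etree.ElementTree module. Only the element and attribute names shown in the fragment above are taken from the tutorial; the <config> wrapper element, the function name set_directories, and the example paths are assumptions made so that the snippet runs on its own.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the configuration file: only the elements shown
# in the fragment above come from the tutorial; the <config> wrapper is an
# assumption so that the snippet parses on its own.
SAMPLE_CONFIG = """<config>
  <input.files accept=".csv" has.header="true" delimiter="SEMICOLON" quote.char="DOUBLEQUOTE">
    <input.directory name="."/>
    <file name="WikiEvent.2018.csv"/>
  </input.files>
  <output.directory name="./output"/>
</config>"""


def set_directories(root, input_dir, output_dir):
    """Overwrite the name attribute of every input and output directory element."""
    for element in root.iter("input.directory"):
        element.set("name", input_dir)
    for element in root.iter("output.directory"):
        element.set("name", output_dir)
    return root


# Example paths are placeholders; replace them with your own directories.
root = set_directories(ET.fromstring(SAMPLE_CONFIG), "/data/wiki", "/data/wiki/output")
print(ET.tostring(root, encoding="unicode"))
```

For a real configuration file, ET.parse(path) and tree.write(path) would replace the string handling. Note that ElementTree drops XML comments by default and may change whitespace, so it is safer to work on a copy of the file.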
Network model
Many of the settings provided in the configuration file are similar to those described in the basic tutorial. Differences include that in this study we have no event types (all events are edit events) and that we apply a decay with a half-life of 30 days to all attributes. The network effects are given by the statistics repetition, popularity, activity, four-cycle, and the interaction effect of popularity with activity (assortativity). The last effect does not have to be computed explicitly by eventnet; it can be specified as an interaction term when the model is estimated.
A crucial difference is in the sampling strategy, defined in all observations. In this study, where we analyze 360 million events, we also sample from the observed events, in addition to case-control sampling. This is described in more detail in the next section.
An example R script for analyzing each of the output files is given in the following.
# attach the library
library(survival)

# set the working directory; change if needed
setwd("./output")

# read explanatory variables from any of the output files, for instance
edit.events <- read.csv2("WikiEvent.2018.csv_EDIT.FIX.00.csv", dec = '.')

# specify and estimate a Cox proportional hazard model
edit.surv <- Surv(time = rep(0,dim(edit.events)[1]), event = edit.events$IS_OBSERVED)
edit.model <- coxph(edit.surv ~ repetition + article_popularity * user_activity
                    + four_cycle + strata(EVENT),
                    data = edit.events)

# print model parameters
print(summary(edit.model))
Sampling
In each of the 380 observations (see an example below) we sample from the observed events and apply case-control sampling.
...
<observation name="EDIT.FIX.00" type="DEFAULT_DYADIC_OBSERVATION" description="edit events"
    apply.case.control.sampling="true" number.of.non.events="5"
    apply.sampling.from.observed.events="true" prob.to.sample.observed.events="1.0E-4"
    source.node.set="users" target.node.set="articles"/>
...
Sampling from the observed events means that statistics are not computed for all input events but only for a random sample of them. In the observation above, the probability to include any given input event in the sample is p=0.0001, that is, we include on average one event out of 10,000 input events in each of the resulting CSV files in the output directory. Case-control sampling means that for each sampled observed event we include a fixed number of randomly selected controls, that is, dyads from the risk set not experiencing the event at that time. In the example above we select m=5 controls per event. The risk set is the full Cartesian product consisting of all user-article pairs. (The size of this risk set is larger than 30 trillion at the end of the observation period.)
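The two sampling steps described above can be illustrated with a toy sketch in Python. This is not eventnet's implementation: the function names, the uniform sampling of controls with replacement, and the tiny node sets are all assumptions chosen for illustration. The second function only reproduces the arithmetic of expected sample sizes for given p and m.

```python
import random


def sample_events_with_controls(events, users, articles, p, m, seed=0):
    """Return rows (event_id, user, article, is_observed) for a toy event list.

    Illustration only: each observed event is kept with probability p, and
    each kept event receives m controls drawn uniformly (with replacement)
    from all user-article pairs other than the observed one.
    Assumes the risk set contains more than one dyad.
    """
    rng = random.Random(seed)
    rows = []
    for event_id, (user, article) in enumerate(events):
        if rng.random() >= p:
            continue  # event not sampled: no statistics are computed for it
        rows.append((event_id, user, article, 1))
        controls = 0
        while controls < m:
            candidate = (rng.choice(users), rng.choice(articles))
            if candidate == (user, article):
                continue  # the observed dyad cannot serve as its own control
            rows.append((event_id, candidate[0], candidate[1], 0))
            controls += 1
    return rows


def expected_sample_sizes(n_events, p, m):
    """Expected number of sampled events and of dyads (events plus controls)."""
    expected_events = n_events * p
    return expected_events, expected_events * (1 + m)


# Hypothetical toy data: three users, two articles, four edit events.
users = ["u1", "u2", "u3"]
articles = ["a1", "a2"]
events = [("u1", "a1"), ("u2", "a1"), ("u3", "a2"), ("u1", "a2")]

# With p=1.0 every event is kept, so each event forms a stratum of 1 + m rows.
rows = sample_events_with_controls(events, users, articles, p=1.0, m=5)
print(len(rows))

# Expected sizes for roughly 360 million input events with p=0.0001 and m=5.
print(expected_sample_sizes(360_000_000, 1e-4, 5))
```

With about 360 million input events, p=0.0001 and m=5 correspond to roughly 36,000 sampled events and 216,000 dyads per output file. The event identifier plays the role of the stratum in the subsequent Cox regression.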
The given configuration file defines observations for four different types of experiments.
- (fixed p and m) We define 100 identical observations with the sample parameters p=0.0001 and m=5. This allows us to assess the variability of model parameter estimates caused by sampling for these given sample parameters.
- (varying p for fixed m) For m=5 and ten different values of p, starting from p=0.00005 and halving in each step, we define ten observations for each value of p. This allows us to assess which sample size is just sufficient to reliably estimate the model parameters.
- (varying m for fixed p) For p=0.00001 and eight different values of m, starting from m=1 and doubling in each step, we define ten observations for each of these combinations of sample parameters. This allows us to assess the benefit of sampling more controls per event.
- (varying m and p for a given total budget of dyads) Keeping the number of sampled dyads (events plus controls) constant at the value determined by p=0.0001 and m=5, we let m vary from 2 to 256 and decrease p accordingly, and define ten observations for each of these combinations of values. This allows us to assess whether more events at the expense of fewer controls, or the other way round, gives more reliable estimates.
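The parameter grids of the last three experiments can be written down in a few lines of Python. This is a sketch under stated assumptions, not a copy of the configuration file: the function names are hypothetical, m is assumed to double in each step of the fixed-budget experiment, and p is assumed to be chosen so that p(1+m) stays equal to its baseline value 0.0001 x (1+5). The values actually used in the configuration file may be rounded differently.

```python
# Baseline sampling parameters taken from the first experiment.
P0, M0 = 1e-4, 5


def halving_p_grid(start=5e-5, steps=10):
    """Values of p that start at `start` and are halved in each step."""
    return [start / 2**k for k in range(steps)]


def doubling_m_grid(start=1, steps=8):
    """Values of m that start at `start` and double in each step."""
    return [start * 2**k for k in range(steps)]


def fixed_budget_grid(m_values, p0=P0, m0=M0):
    """Pairs (p, m) whose expected share of dyads p * (1 + m) matches the baseline.

    Assumption: the budget is the expected number of sampled dyads per
    input event, that is, p0 * (1 + m0).
    """
    budget = p0 * (1 + m0)
    return [(budget / (1 + m), m) for m in m_values]


p_grid = halving_p_grid()                      # varying p for fixed m=5
m_grid = doubling_m_grid()                     # varying m for fixed p=0.00001
budget_grid = fixed_budget_grid(doubling_m_grid(start=2, steps=8))

for p, m in budget_grid:
    print(f"m={m:3d}  p={p:.3e}")
```

Under these assumptions, larger values of m leave proportionally fewer sampled events, which is the trade-off the fourth experiment is designed to examine.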
Results
In general, some tens of thousands of events (sampled from the 360 million input events) and a small number of controls per event seem to be enough to reliably estimate model parameters. Some of the effects, most notably the degree effects popularity and activity, can even be estimated with much smaller numbers of events, such as a few hundred. The parameter of the repetition effect turned out to have the highest variability caused by sampling. This is probably due to the very skewed distribution of the repetition variable among the controls: most controls have a value of zero, so controls with non-zero repetition values are rarely sampled.
In general, it seems to be preferable to sample more events and fewer controls if the total budget of sampled dyads is limited. Thus, the number of controls per event (that is, the parameter number.of.non.events in the observation definitions) should be set to a small number, for instance, in the interval from two to ten.
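One simple way to quantify the variability caused by sampling is to collect the estimate of a given parameter from each repeated observation and summarize its spread. The following Python sketch is an assumption about how this could be done, not the procedure used in the study; the function name is hypothetical and the input numbers are made up.

```python
import statistics


def summarize_estimates(estimates):
    """Mean, sample standard deviation, and coefficient of variation."""
    mean = statistics.mean(estimates)
    sd = statistics.stdev(estimates)
    cv = sd / abs(mean) if mean != 0 else float("inf")
    return {"mean": mean, "sd": sd, "cv": cv}


# Made-up numbers for illustration only; they are NOT results from this study.
toy_estimates = [0.52, 0.48, 0.50, 0.51, 0.49]
summary = summarize_estimates(toy_estimates)
print(summary)
```

In practice, the input list would hold one coefficient per output file, for example the repetition coefficient from each of the 100 identical observations. A small standard deviation relative to the mean suggests that the chosen sample parameters are sufficient for that effect.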
These results will be discussed in more detail in future versions of this tutorial.