Influence Function Learning in Information Diffusion Networks

The rising role of online social networks in information diffusion has led to increasing interest in viral marketing and online advertising campaigns. An essential task in such a campaign is to evaluate the potential influence of a given set of users, that is, the expected number of follow-ups who will be influenced by this set of selected users before some given time T.
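
In symbols (our notation, following the description above): if $t_u(S)$ denotes the random time at which user $u$ gets influenced when the campaign starts from the source set $S$, then the influence function is

$\sigma(S, T) = \mathbb{E}\big[\,|\{u : t_u(S) \le T\}|\,\big],$

the expected number of users influenced by $S$ before time $T$.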

In practice, however, the influence function is unknown to us. Even the specific diffusion paths and the true diffusion process may be latent and hard to observe. Instead, we can normally collect the temporal information induced by the underlying diffusion process. For instance, after people get sick they go to see the doctor, so we can observe approximately when they got infected. Based on this temporal information, existing approaches first fit an information diffusion model and then evaluate the influence function based on the learned model. However, this two-stage paradigm still faces many challenges. Real-world information diffusion is complicated, and it is nontrivial to determine the most suitable diffusion model in practice; the true diffusion process may never be known. A particular choice of diffusion model may therefore be misspecified relative to the real-world data and lead to large bias. Moreover, because the diffusion network structure is often hidden from us, we need to infer both the parameters of the diffusion model and the diffusion network structure. With a limited amount of training data, the recovered diffusion structure may be inaccurate, which introduces another source of error into the influence estimation. Finally, even given a specific diffusion model, exactly computing the influence function is hard and often requires additional approximations.

If the purpose is only to estimate the influence, can we learn the influence value of a given source set directly from the cascade data? In this ICML 2014 paper, we provide a positive answer to this question. Without making any prior assumption about the latent diffusion model, we directly learn the influence function itself by explicitly exploiting the property that the influence functions of many specific diffusion models are coverage functions. Theoretically, we show that the proposed algorithm, InfluLearner, can learn the influence with low sample complexity, and our empirical studies on real-world datasets confirm that our method outperforms traditional two-stage approaches.
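
For intuition (again in our notation), the coverage-function property can be seen from the live-edge view of independent cascade models: each random realization $r$ of the diffusion process fixes a set $R_r(s)$ of nodes reachable from source $s$ within time $T$, so the influence becomes an expected set-cover size,

$\sigma(S, T) = \mathbb{E}_r\big[\,|\bigcup_{s \in S} R_r(s)|\,\big],$

which is a (weighted combination of) coverage function(s) of the source set $S$. InfluLearner exploits this structure by parameterizing the influence with the random features that appear in the implementation documented below.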

About InfluLearner

InfluLearner is a scalable algorithm for learning the influence function in information diffusion networks. Its current implementation is in MATLAB, but because the learning procedure is independent for each node, it can be trivially parallelized, e.g., using MPI. In the Download section, we also provide a simulator for generating synthetic networks and cascades based on the continuous-time independent cascade model.

Papers

Influence Function Learning in Information Diffusion Networks. Nan Du, Yingyu Liang, Maria-Florina Balcan, and Le Song. International Conference on Machine Learning (ICML). June 21 - June 26, 2014, Beijing, China. [PAPER] [Bibtex]

Download

Terms of agreement. We appreciate your interest in our research. The source code is provided and maintained for our research projects conducted in the School of Computational Science and Engineering at Georgia Institute of Technology. All rights to the source code implementations are reserved. Comments, bugs, and suggestions, if any, can be directed to Nan Du (dunan AT gatech.edu). Redistribution and use in source or binary forms, with or without modification, are permitted for academic purposes only. If you use this code, or any part of it, we would appreciate it if you cite our paper as

@inproceedings{DuLiaBalSon14A, title={Influence Function Learning in Information Diffusion Networks}, author={Nan Du and Yingyu Liang and Maria-Florina Balcan and Le Song}, booktitle={International Conference on Machine Learning (ICML)}, year={2014}}

This code is provided free for non-commercial use under the terms of the GNU General Public License. The current version is 1.0, released on December 28, 2014. THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. By downloading any item from this web page or the pages linked to by this page, you acknowledge that you have read, understood, and agreed to all of the terms.

Download InfluLearner!

Download Simulator!

Documentation of InfluLearner

Input network format of synthetic data. We include a small sample network (1,024 nodes) in the implementation. The input file "weibull_kronecker-n1024-e1126-0-network" consists of two blocks separated by a blank line. Node IDs range from 0 to 1,023. In the first block, each line has the format

node ID, node Name

In the second block, each line represents a directed edge from node ID1 to node ID2 in the following format.

node ID1, node ID2, shape, scale

Our method allows the most general form of pairwise transmission function; for the simplicity of the demo code, we take the Weibull distribution as the transmission function, with the two parameters shape and scale. When $shape = 1.0$, it reduces to the exponential distribution; when $shape = 2.0$, it is equivalent to the Rayleigh distribution. A valid network file can look like the following (the Weibull density is recalled right after the example).

0,0
1,1
2,2

0,1,0.0959956,0.0888947
0,2,3.95348,9.42268
1,2,1.61835,8.38794
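
For reference, the Weibull transmission density with shape $k$ and scale $\lambda$ is

$f(t; k, \lambda) = \frac{k}{\lambda}\left(\frac{t}{\lambda}\right)^{k-1} e^{-(t/\lambda)^{k}}, \quad t \ge 0,$

so, for instance, the edge $0 \to 1$ above transmits with shape $0.0959956$ and scale $0.0888947$. Setting $k = 1$ recovers the exponential distribution, and $k = 2$ the Rayleigh distribution.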

Input cascade format of synthetic data. A synthetic cascade is a sequence of triplets of the following form:

source ID; node ID; time;

which indicates that the node with ID node ID was infected by the source with ID source ID at the given time. For the sources, we have node ID = source ID and time = 0. For instance, in the following cascade

46;46;0;184;184;0;740;740;0;184;790;3.68969;46;851;3.91851;

46, 184, and 740 are the sources. 184 infects 790 at time 3.68969, and 46 infects 851 at time 3.91851.
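
As a quick illustration (not part of the released code; the variable names are ours), such a line can be loaded in MATLAB as

% Parse one synthetic cascade line into rows of [source ID, node ID, infection time].
line = '46;46;0;184;184;0;740;740;0;184;790;3.68969;46;851;3.91851;';
vals = sscanf(line, '%f;');                  % all semicolon-separated numbers
triplets = reshape(vals, 3, [])';            % one triplet per row
sources = triplets(triplets(:, 3) == 0, 2);  % sources are the entries with time = 0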

Input cascade format of real data. A real cascade is a sequence of triplets of the following form:

node ID1, post ID, time1, node ID2, post ID, time2, ...

which indicates that the node with ID node ID1 was infected at time1. The post ID field is reserved to indicate the group and is unused here. Since real-world cascades have only a single source, the first node is always the source. For instance, in the following cascade for post 0

0,0,0,6,0,0.061667,942,0,8.05

Node 0 is the source. Nodes 6 and 942 are infected at times 0.061667 and 8.05, respectively.
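
Analogously (again purely illustrative, not part of the released code), a real-data cascade line can be loaded as

% Parse one real cascade line into rows of [node ID, post ID, infection time].
line = '0,0,0,6,0,0.061667,942,0,8.05';
vals = sscanf(line, '%f,');          % all comma-separated numbers
triplets = reshape(vals, 3, [])';    % one triplet per row
source = triplets(1, 1);             % the first node is the single source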

The implementation consists of the following functions:

function InfluLearner(network_filename, source_filename, num_samples, beginning, ending, N)

Main function to learn the influence function on the synthetic data.

  • network_filename : input network.
  • source_filename : input file of selected sources for training.
  • num_samples : number of cascades generated from each source.
  • beginning and ending: because the learning procedure is independent for each node, these parameters indicate which nodes are involved in each task. You can start from 1 and finish at 1,024, which means you run the code sequentially in a single task. Alternatively, you can split the work into two tasks: the first covers nodes 1 to 512 by calling InfluLearner(network_filename, source_filename, num_samples, 1, 512, N), and the second covers nodes 513 to 1,024 by calling InfluLearner(network_filename, source_filename, num_samples, 513, 1024, N).
  • N : number of nodes.
function InfluLearnerReal(cascade_filename_train, N, T, beginning, ending)

Main function to learn the influence function on the real world data.

  • cascade_filename_train : input cascades.
  • N : number of nodes.
  • T : observation window.
  • beginning and ending: because the learning procedure is independent for each node, these parameters indicate which nodes are involved in each task.
function [A, shape, scale] = readNetwork(network_filename,N)

Read the input graph file (an illustrative sketch of this format follows the parameter list below).

  • network_filename : input network file for training.
  • N : number of nodes.
  • A, shape and scale : matrices storing the adjacency structure and the per-edge Weibull shape and scale parameters, respectively.
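
For concreteness, the following is a minimal sketch of what reading this format can look like (an illustration of the file format only, not the released readNetwork; it assumes the two-block layout and 0-based node IDs described above):

% Illustrative sketch: load the two-block network file into N-by-N matrices.
function [A, shape, scale] = readNetworkSketch(network_filename, N)
    A = sparse(N, N); shape = sparse(N, N); scale = sparse(N, N);
    fid = fopen(network_filename, 'r');
    line = fgetl(fid);
    while ischar(line) && ~isempty(line)    % skip the first block (node ID, node name)
        line = fgetl(fid);
    end
    line = fgetl(fid);
    while ischar(line)                      % second block: node ID1, node ID2, shape, scale
        v = sscanf(line, '%f,');
        i = v(1) + 1; j = v(2) + 1;         % shift the 0-based IDs to MATLAB's 1-based indexing
        A(i, j) = 1; shape(i, j) = v(3); scale(i, j) = v(4);
        line = fgetl(fid);
    end
    fclose(fid);
end
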
function test(num_samples, network_filename, source_filename)

Test the model performance on the testing dataset of synthetic cascades.

  • num_samples : number of cascades generated from each source.
  • network_filename : input network.
  • source_filename : input file of selected sources for testing.
function testReal(filenumber)

Test the model performance on the testing dataset of real cascades.

  • filenumber : the real-world testing cascades have filenames of the form cascade-1000-filenumber-test, so this parameter indicates which testing file to use.
function [obj,gradient] = objFunction(w, phi, y_j)

Implementation of the objective function and its gradient (an illustrative sketch of one possible form is given after the parameter list below).

  • w : model parameters to learn.
  • phi : input random features.
  • y_j : input labels indicating whether each node was infected or not.
  • obj : returned objective function value.
  • gradient : returned gradient.
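
Purely as an illustration of this interface (the exact likelihood used in the released code may differ), an objective of this shape could be a Bernoulli log-likelihood with infection probability phi * w:

% Illustrative sketch: negative Bernoulli log-likelihood and its gradient,
% where p = phi * w is the modeled probability of infection per cascade.
function [obj, gradient] = objFunctionSketch(w, phi, y_j)
    p = phi * w;
    p = min(max(p, 1e-12), 1 - 1e-12);   % clip for numerical stability
    obj = -sum(y_j .* log(p) + (1 - y_j) .* log(1 - p));
    gradient = -phi' * (y_j ./ p - (1 - y_j) ./ (1 - p));
end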

Compile on Linux/MacOS

To compile the sources, simply run the following function in MATLAB

mexAll

Run the demo on Linux/MacOS

To run the demo code within the directory InfluLearner on synthetic data, simply type

InfluLearner('weibull_kronecker-n1024-e1126-0-network','synthetic_sources_0', 8, 1, 1024, 1024)

This indicates that the network structure file is 'weibull_kronecker-n1024-e1126-0-network', the directory 'synthetic_sources_0' contains all the training data, we use 8 cascades per source set, the computation runs from node 1 to the last node 1024, and there are 1024 nodes in total. This function will store the results containing the random features in the directory

./synthetic/synthetic_sources_0/result-train_cascade_C8_weibull_kronecker-n1024-e1126-0-network

Next, to test the performance of the algorithm, run the following function

test(8, 'weibull_kronecker-n1024-e1126-0-network', 'synthetic_sources_0')

This tests the model on the 'weibull_kronecker-n1024-e1126-0-network' network with the 'synthetic_sources_0' source sets, using 8 sample cascades per source set.

To run the demo code within the directory InfluLearner on real data, simply type

InfluLearnerReal('cascade-1000-0-train', 1000, 15, 1, 1000)

This indicates that we learn the influence function on the input cascade file 'cascade-1000-0-train', which has 1,000 nodes, with an observation window of T = 15, running from node 1 to node 1,000. This function will store the results containing the random features in the directory

./real/result_cascade-1000-0-train

Next, to test the performance of the algorithm, run the following function

testReal(0)

This means we test our model on the respective testing cascade file 'cascade-1000-0-test'.

Documentation of Simulator

Compile on Linux/MacOS

tar -zxvf KroneckerNetworkAndCascadeGenerator.tgz

cd KroneckerNetworkAndCascadeGenerator

make

This will generate the following two executables

  • GenerateNetwork for producing Kronecker synthetic networks.
  • GenerateCascades for simulating cascades.

Run GenerateNetwork

./GenerateNetwork

Generate synthetic kronecker networks. build: 00:28:46, Dec 29 2014. Time: 00:34:10 [Dec 29 2014]
==================================================================================================
usage: GenerateNetwork
-g:Parameters for the network (default:0.9 0.5; 0.5 0.9)
-n:Number of nodes (default:1024)
-e:Number of edges (default:1126)
-a:Minimum and maximum value for weibull shape (default:1;10)
-b:Minimum and maximum value for weibull scale (default:1;10)
-f:Name of the network (default:weibull_kronecker_synthetic)
RMat Kronecker: 1024 nodes, 1126 edges, Directed...
collisions: 2 (0.0018)
run time: 0.00s (00:34:10)

This will generate an example file with the default filename weibull_kronecker_synthetic under the current directory. You can then change the respective parameters to meet your requirements.
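
For example, assuming the -flag:value syntax shown in the usage message above, a larger network could be generated with a command along the lines of

./GenerateNetwork -n:2048 -e:2252 -f:my_weibull_network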

Run GenerateCascades

./GenerateCascades

Generate synthetic cascades. build: 00:45:13, Dec 29 2014. Time: 00:45:21 [Dec 29 2014]
==================================================================================================
usage: GenerateCascades
-n:Number of nodes (default:1024)
-c:Number of cascades per source (default:128)
-fg:Name of the network (default:weibull_kronecker_synthetic)
-fc:Name of the cascades (default:weibull_kronecker_synthetic_cascade)

This will generate another example file with the default filename weibull_kronecker_synthetic_cascade under the current directory based on the default network file weibull_kronecker_synthetic. You can then change the respective parameters to meet your requirements.
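
For example, assuming the same -flag:value syntax, cascades can be simulated from the default network with a command along the lines of

./GenerateCascades -n:1024 -c:8 -fg:weibull_kronecker_synthetic -fc:my_cascades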