The rising role of on-line social networks in the information diffusion leads to increasing interest in viral marketing and on-line advertisement campaign. An essential task of a campaign is to evaluate the potential influence of a given set of users, that is, the expected number of follow-ups who will be influenced by this set of selected users before some given time T.
In practice, however, the influence function is unknown to us. Even the specific diffusion paths and the true diffusion processes may be latent and hard to observe. Instead, we normally can collect the temporal information induced from the underlying diffusion process. For instance, after people get sick, they go to see the doctor, so we can observe approximately when they got infected. Based on these temporal information, existing approaches first fit an information diffusion model and then evaluate the influence function based on the learned model accordingly. However, there still remain many challenges in this two-step paradigm. Real world information diffusion is complicated, and it is nontrivial to determine the most suitable diffusion model in practice. The true diffusion process might even be never known. Thus, a particularly chosen diffusion model may be wrong compared to real world data and lead to large bias. Moreover, because the diffusion network structure is often hidden from us, we then need to infer both the parameters in the diffusion model and the diffusion network structure. Due to the limited amount of training data, the recovered diffusion structure may be inaccurate which will lead to another source of errors to the influence estimation. Finally, given a specific diffusion model, the exact calculation of the influence function is also hard which often requires additional approximation approaches.
If the purpose is only to estimate the influence, can we learn the influence value of a given source set directly from the cascade data ? In this ICML 2014 paper , we provide a positive answer to this question. Without making any prior assumptions about the latent diffusion models, we seek to directly formulate the influence function itself by explicitly exploiting the property that the influence functions in many specific diffusion models are coverage functions. Theoretically, we show that the proposed algorithm InfluLearner can learn the influence with low sample complexity, and our empirical studies on real world datasets also confirm our method outperforms traditional two-stage approaches.
InfluLearner is a scalable algorithm for learning the influence function in information diffusion networks. Its current implementation is in MATLAB but can be trivially parallelized by using MPI. In the Download section, we also provide a simulator to generate synthetic networks and cascades based on the continuous-time independent cascade model.
Terms of agreement. We appreciate your interest in our research. The C++ implementation is provided and maintained for our research projects conducted in the School of Computational Science and Engineering at Georgia Institute of Technology. All rights of the source code implementations are reserved. Comments, bugs, and suggestions, if any, can be directed to Nan Du (dunan AT gatech.edu). Redistributions and use of them in source or binary forms, with or without modification, are permitted only to academic purposes. If you use this code, or any part of it, it will be appreciated to cite our paper as
@inproceedings{DuLiaBalSon14A, title={Influence Function Learning in Information Diffusion Networks}, author={Nan Du and Yingyu Liang and Maria-Florina Balcan and Le Song}, booktitle={International Conference on Machine Learning (ICML)}, year={2014}}
This code is provided free for non-commercial use under the terms of the GNU General Public License. The current version is the 1.0 and has been released on December 28, 2014. THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. It is understood that by downloading any item from this Web page and the pages linked to by this page, one has entirely understood and fully acknowledged to agree all of the terms.
Input network format of synthetic data. We include a small sample network (1,024 nodes) in the implementation. The input file "weibull_kronecker-n1024-e1126-0-network" consists of two blocks separated by a blank line. The node ID starts from 0 to 1,023. In the 1st block, each line has the format
node ID, node Name
In the 2nd block, each line represents a directed edge from node ID1 to node ID2 with the following format.
node ID1, node ID2, shape, scale
Because we assume the most general form for the pairwise transmission function, for the simplicity of the demo code, we take the Weibull Distribution as the transmission function with the two parameters shape and scale, respectively. When $shape = 1.0$, it is reduced to the exponential distribution; When $shape = 2.0$, it is equivalent to the Rayleigh Distribution. A valid network format can be the following.
0,0
1,1
2,2
0,1,0.0959956,0.0888947
0,2,3.95348,9.42268
1,2,1.61835,8.38794
Input cascade format of synthetic data. A synthetic cascade is a sequence of the following triplet :
source ID; node ID; time;
which indicates the node with ID node ID was infected by the source with ID source ID at this time. For the sources, we have node ID = source ID and time = 0. For instance, in the following cascade
46;46;0;184;184;0;740;740;0;184;790;3.68969;46;851;3.91851;
46, 184, and 740 are the sources. 184 infects 790 at time 3.68969, and 46 infects 851 at time 3.91851.
Input cascade format of real data. A real cascade is a sequence of the following triplet :
node ID1; post ID; time1; node ID2; post ID; time2; ...
which indicates the node with ID node ID1 was infected at time1. post ID is reserved to indicate the group, which is unused here. Since real world cascades only have single source, the first node is always the source. For instance, in the following cascade for post 0
0,0,0,6,0,0.061667,942,0,8.05
node 0 is the source. node 6 and 942 are infected at 0.061667 and 8.05, respectively.
The implementation consists of the following functions :
Main function to learn the influence function on the synthetic data.
Main function to learn the influence function on the real world data.
Read input graph file.
Test the model performance on the testing dataset of synthetic cascades.
Test the model performance on the testing dataset of real cascades.
Objective function implementation.
Compile on Linux/MacOS
To compile the sources, simply run the following function in MATLAB
mexAll
Run the demo on Linux/MacOS
To run the demo code within the directory InfluLearner on synthetic data, simply type
InfluLearner('weibull_kronecker-n1024-e1126-0-network','synthetic_sources_0', 8, 1, 1024, 1024)
This indicates that the network structure file is 'weibull_kronecker-n1024-e1126-0-network', the directory 'synthetic_sources_0' contains all the training data, we use 8 cascades per source set, we start the computation from node 1 to the last node 1024, and there are 1024 nodes in total. This function will store the results containing the random features in the directory
Next, to test the performance of the algorithm, run the following function
This means we test our model on the 'weibull_kronecker-n1024-e1126-0-network' file with the 'synthetic_sources_0' source set file with 8 sample cascades per source set.
To run the demo code within the directory InfluLearner on real data, simply type
InfluLearnerReal('cascade-1000-0-train', 1000, 15, 1, 1000)
This indicates that on the input cascade file 'cascade-1000-0-train' with 1,000 nodes, we learn the influence function from node 1 to 1,000. This function will store the results containing the random features in the directory
Next, to test the performance of the algorithm, run the following function
This means we test our model on the respective testing cascade file 'cascade-1000-0-test'.
Compile on Linux/MacOS
tar -zxvf KroneckerNetworkAndCascadeGenerator.tgz
cd KroneckerNetworkAndCascadeGenerator
make
This will generate two the following two executables
Run GenerateNetwork
./GenerateNetwork
Generate synthetic kronecker networks. build: 00:28:46, Dec 29 2014. Time: 00:34:10 [Dec 29 2014]This will generate an example file with the default filename weibull_kronecker_synthetic under the current directory. You can then change the respective parameters to meet your requirements.
Run GenerateCascades
./GenerateCascades
Generate synthetic cascades. build: 00:45:13, Dec 29 2014. Time: 00:45:21 [Dec 29 2014]This will generate another example file with the default filename weibull_kronecker_synthetic_cascade under the current directory based on the default network file weibull_kronecker_synthetic. You can then change the respective parameters to meet your requirements.