Wednesday, December 15, 2010

Epidemiological Modeling of Computer Virus Propagation - Recommendations and Conclusions

Note. The text below is the last part of the paper titled "Epidemiological Modeling of Computer Virus Propagation."   Special thanks to Dr. Azucena A. Puertollano for her guidance and teachings.

This paper developed two fairly simple mathematical models of computer virus propagation. The simplicity was achieved by assuming many parameters constant. The applicability of the analytic expressions obtained from the model is therefore limited.

To come up with a more correct and reliable model, this paper recommends that the number of assumptions be reduced. Among those that should be eliminated are the assumptions made about the infection and disinfection rates. These rates should not be held constant because as more computers get infected, both the infection and disinfection rate increases with time.

Better models can also be achieved by including more parameters. This paper recommends the inclusion of such parameters as (a) the type of computer network topology (b) the level vigilance of the computer user in keeping the antivirus updated, (c) the effectiveness of the metrics used by computerized organization in combating viral attacks, etc.
  
The application of SI and SIR epidemiological models to the propagation of the computer virus provided the analytic expressions that describe the behavior of infection as well as the disinfection. It was observed that in the absence of an antivirus, the SI model is telling us that at some determinable time, all computers shall become infected with a virus.

The SIR model is telling us that in the presence of an antivirus, the increasing trend of infection will be reversed at some future time. Later from such time, there will be a steady state for the infection at which time the epidemic can be considered to have ended.

This paper concludes by asserting the mathematical modeling is useful not only in the study of computer viruses but in any fields where the behavior is associated with either growth or decay. The usefulness of the model is in describing what happens to the subject being studied at a particular time in the future. This description of the future will guide us what action to take and decision to make now for a brighter tomorrow.

The SIR Model of the Antivirus-controlled Propagation

Note. The text below is part 8 of the paper titled "Epidemiological Modeling of Computer Virus Propagation." 
 
When the antivirus exists, the number of infected I computers (aside from being augmented by those coming in from susceptible S population) is being reduced by disinfecting action of the antivirus. As previously noted, the SIR model can be used to for this scenario.

With infected I computers now being decremented as these become removed R computers, new equations for s(t) and i(t) must be derived. This is in addition to the new variable r(t) – the number of computers being disinfected. The equation for r(t) is derived first.

           In the derivation of the r(t) equation, it is first assumed that the removal rate at time t is proportional to the number of infected computers at that time. Thus, the equation


                   dr/dt= α • i(t)

where α, like λ, is an appropriate nonnegative constant determined by the physical parameters of the situation.
 
If n represents the number of computers at time t = 0, the total S population should consists of n + 1 computers. One is added because at t = 0, one computer must be infected to initiate the epidemic.
Since the total population always consists of n +1 computers, the assumptions involving ds/dt and dr/dt also determines di/dt. Thus, the following system of first-order nonlinear differential equations:

                   ds/dt= -λ • s(t)• i(t),               λ>0     Equation 7
                   di/dt= λ • s(t)• i(t)  – α •i(t),   α>0     Equation 8
                   dr/dt= α • i(t),                                  Equation

These equations can be solved exactly for s, i, and r as functions of time. In pursuing these equations, the constant ρ is introduced. It represents the relative rate of removal of infected computer, or for short, the relative removal rate. This ρ should not be confused with the constant α which represents the removal rate of infected computers from the I population. This ρ is the factor derived by dividing α with α. To wit,


                           ρ= α/λ          Equation 10
 
Again, α is removal (or disinfection) rate I and λ as infection rate. By dividing the two rates, the constant ρ is actually an indicator the speed of disinfection versus infection. With this notation, the above equation for di/dt can be written as
                                   di/dt= λ • i(t)  [s(t)-ρ]

From the above equation, an epidemic develops when di/dt is positive for some values of t.

Expressing Equation 10 in terms of λ, and substituting this to Equation 9 and then re-arranging the resulting equation such λ•i(t) is on the left side, the equation below is obtained.
                               λ•i(t)= α/ρ  dr/dt
 
Substituting the above to Equation 7, the equation below is obtained.
                                    ds/dt=-α/ρ s(t)dr/dt
or
                                    1/s  ds/dt=-α/ρ  dr/dt

The above equation can be solved by a straightforward integration. Taking into account the initial condition r(0) = 0, an equation relating the functions s and r is obtained. To wit,
                                        s(t)=s(0)e^(-r(t)/ρ)
 
Since s(t) + i(t) + r(t) = n +1 for all values of t, the equation below is obtained.
                               i(t)=n+1-r(t)-s(0)e^(-r(t)/ρ)
 
Substituting the equation for i(t) into Equation 8, the differential equation involving only the function r is obtained.
        

           dr/dt= α [n+1-r(t)-s(0)e^(-r(t)/ρ)]        Equation 11
 
Deriving an r equation18 in terms of time is made complex by the occurrence of the exponential term e^(-r/ρ). One method of coping with this difficulty is to expand the exponential term in power series of r. Expanding the exponential, the equation below is obtained.
 

             e^(-r/ρ)=1- r/ρ+1/2 (r/ρ)^2- 1/3! (r/ρ)^3+⋯
 
Substituting this expression for e^(-r/ρ) in Equation 11, the equation below is obtained.
        

           dr/dt= α [n+1-s(0)+(s(0)/ρ-1)r- s(0)/2 (r/ρ)^2+ …]

          Note that by considering only fewer values of e^(-r/ρ), the equation for the r will just be an approximation of the real r(t). Thus we represent this variable as ᵲ. For this study, we consider only up to the second power of e^(-r/ρ). Now the above equation can be re-written as

  dᵲ/dt= α [n+1-s(0)+(s(0)/ρ-1)ᵲ- s(0)/(2ρ^2 ) ᵲ^2]  Equation 12

Although the algebra is somewhat involved, the above Equation 12 presents no unusual problems. It can be solved more easily by assigning the letters a, b, and c to the following group of terms.
                         a = α[n+1-s(0)]
                                 b= α (s(0)/ρ-1)
                                 c= α s(0)/(2ρ^2)                    


Substituting these letters, the Equation 12 becomes


                             dᵲ/dt= a+bᵲ-cᵲ^2
or
                            dᵲ/(a+bᵲ-cᵲ^2 )=  dt
 
The left side is in quadratic equation form which, when integrated, yields the inverse hyperbolic tangent equation. Thus, 

                     -2/q  tanh^(-1) [(-2cȓ+b)/q]= t+ c1

Here q is used for the quantity (b^2+ 4ac)^(1/2), and the constant c1  can be derived by setting r(0) = 0.
 
Using the definition of the inverse hyperbolic tangent, the equation for ȓ(t) even if it is only an approximation of the exact r(t) is shown below.

                     ȓ(t)= 1/2c  [b-q tanh⁡((-q/q2 x t
)+ c2 )]

In terms of the exponential functions, the equation below will approximate exact r(t).


          ȓ(t)= 1/(2c) [b-q (1-e^(-qt+ c_2))/(1+e^(-qt+ c3)))]  Equation 13

The constants c2 and c3 are to be chosen so that ȓ(0) = 0.
 
Differentiating Equation 13 gives dȓ/dt  which can be substituted to Equation 9 to derive the equation for i(t). Then, differentiating this equation for i(t) gives di/dt which can be substituted to Equation 8 to derive the equation for s(t). These mathematical manipulations are no longer presented here.
 
The equations for the s(t), i(t) and r(t) give an approximate information on the behavior of the functions s, i and r.  To see how these functions behave over time, the analytic behavior of these functions needs to be plotted. 


Unfortunately, the graph cannot be imported into the blog at the moment.  Anyway, the such graph would show that the curve of the infected I computers reaches a certain peak after which it starts to decrease. After some time, all susceptible S members eventually become removed R members. At this time, the virus is already contained indicating that the epidemic has ended. 
---------------

18Daniel P. Maki, Mathematical Models and Applications (New Jersey: Prentice Hall, 1973), 367.

The SI Model of a Freely Propagating Computer Virus

Note. The text below is part 7 of the paper titled "Epidemiological Modeling of Computer Virus Propagation."

         In epidemiological models, the infection spreads via contact between members of a population. In computer terms, contact between computers happen whenever one sends a file to another. A computer virus infection can therefore be expected from every file sharing activity between members of a computer population.


       If the rate of growth of the infected computers is proportional to the number of contacts per unit of time between members of susceptible S group and infected I group, then rate at which S is changing is expressed as 

                                 ds/dt= -λ • s(t)• i(t)       Equation 1

       The negative sign indicates decay (a decreasing S) which is logically correct because they are migrating to I. Here, λ is a nonnegative constant determined by the physical parameters. The obvious generalization is to admit the possibility that λ is actually dependent on time and is better expressed as λ(t). In this study, it shall be assumed that λ is not time dependent.



       It takes only one infected computer to spread the virus into n numbers of computers. This notion is expressed mathematically as
 

                                    i(t)= n+1- s(t)      Equation 2
 

Substituting Equation 2 to Equation 1, the equation below is obtained.
 

                        ds/dt= -λ • s(t)•( n+1- s(t))     Equation 3
 

       This last equation results from the basic assumption of the SI model. That is, the entire population (of n +1 members) is contained only in the two groups S and I.
       
        With the assumption that at time t = 0, there is exactly one infected computer, then the condition to be satisfied by the solution s = s(t) of Equation 2 at time t = 0 is
 

                             s(0)=n.     Equation 4
 

These equations are now evaluated using mathematical tools to see what information can be obtained. First, Equation 3 was re-written as

                          (ds/dt)/(s (n+1-s))= -λ.

 

The left hand side was expanded using partial fractions to obtain
 

                 (ds/dt)/( (n+1)s)+(ds/dt)/((n+1)(n+1-s)) = -λ.
 

Integration was used to obtain
 

               1/( (n+1) ) [log⁡〖s-log⁡(n+1-s) 〗 ]=-λt+c
 

Working further into the equation, the equation below is obtained.
 

                   s/( (n+1-s) )= c1 e^(-λ(n+1)t)
 

By algebraic manipulations, the equation below is obtained.
 

                   s(t)= (n+1)/(1+c2e^k(n+1)t )
       

        This is now a useful form of a solution for Equation 1. Using the condition in Equation 2 (i.e. s(0)=n), the  c2 is found as c_2=n^(-1).  Thus the unique solution of Equation 1 for t > 0 which satisfies the condition s(0)=n for t > 0 is
 

                s(t)= (n(n+1))/(n+e^λ(n+1)t )       t>0 Equation 5
 
       Substituting this equation for s(t) into Equation 2 (i.e. i(t)= n+1- s(t)) and manipulating the resulting equation, the equation below is obtained.
 

              i(t)= ((n+1))/(1+ne^(-λ(n+1)t) )   t>0  Equation 6
        

These Equations 5 and 6 give precise and total information on the behavior of the functions s and i.  To see how these functions behave over time, the analytic behavior of these functions needs to be plotted. Unfortunately, the graph cannot be imported into the blog at the moment.  Anyway, the such graph would show that over time, the number of the susceptible S computers is decreasing as they are becoming members of infected I computers. Unless an antivirus is launched in time, all susceptible computers will eventually become infected.

Varieties of epidemiological models


Note. The text below is part 6 of the paper titled "Epidemiological Modeling of Computer Virus Propagation."

This paper discusses a few simple epidemiological models that exist. The first is the Susceptible-Infected (or SI) model. Noticeably, there are two classes of individuals in this model: the susceptible class S (who are able to catch the disease) and infected class I (who aside from carrying the disease are also infectious) 13.

The SI model describes the behavior of an infective disease in a constant population of individuals which for simplicity ignores spatial variations14. The flow of the members is unidirectional from S to I. That is, an S member gets and remains infected. This scenario parallels the spread of either a new virus for which there is no antivirus yet.  In this situation, all susceptible S members shall become infected I members in a matter of time.
  
The second is the Susceptible-Infected-Removed (or SIR) model which was developed by Kermack and others in 1927. With the addition of removed R class, it is evidently an extension of the SI model. In other texts, the R is referred to as the Recovered/dead class.  This addition allows for the model to include deaths from infection which the SI model ignores15.

The flow of the members is again unidirectional from S to I and then to R. That is, an S member gets infected thus becoming an I member. If he/she gets cured or die, he/she then becomes an R member. Being unidirectional, an R member do not go back to being an S or I member again and I members do not go back to being S member again.

This scenario parallels the spread of computer virus where there is already an antivirus. By the contacts between I and S class computers, an S computer becomes infected. By the corrective action of the antivirus, the infected I computer becomes removed (or recovered) R computer. By the preventive action of the antivirus, the recovered R computer cannot become susceptible or infected again to the same virus.

The next is the Susceptible-Infected-Removed-Susceptible (or SIRS) model. The addition of the Susceptible S class after the removed R class is for situations where the computer becomes susceptible again to the same virus16. This happens when the antivirus that is supposed to immune a computer from the same virus was deleted or is not always kept running.

Last in this paper’s list of epidemic models applicable to spread of computer virus is the Susceptible-Exposed-Infected-Removed (or SEIR) model. In this model, the infected but not yet contagious E class members are added. This additional E class is for those where the contracted virus is still under incubation17.

In the computer virus epidemic, the concept of incubation is commonly found in such security threats as bombs. These are either event or time-activated threats which later will cause damage to programs and files. Although the activation is set in the future, these computers are nevertheless infectious. Unless destructive effects of virus are included in the study, the SEIR is not readily applicable. The presumption is that infected computers are immediately infectious.

        There are more sophisticated epidemiological models applicable to computer virus propagation.  Inasmuch as this paper aims to apply a few simple ones, the discussion of these models was not included here. This paper developed an SI and SIR models of the computer virus propagation. 
         -----------------
13Christopher May, “Modelling a Computer Virus Epidemic” downloaded from maths.dur.ac.uk (April 30, 2009), 10.
14Christopher May,  11
15Daniel P. Maki, Mathematical Models and Applications (New Jersey: Prentice Hall, 1973), 359.
16Christopher May,
17Daniel P. Maki, 359.

Applicability of epidemiological modeling to computer virus propagation

Note. The text below is part 5 of the paper titled "Epidemiological Modeling of Computer Virus Propagation."  

The computer virus has some similarities with a biological disease. The former, like the latter, is also capable of self-replication, self-construction, and self-evolution. The computer virus is viewed as a possible form of artificial life11. On its own, it attaches itself to other files and programs. As an infected program, it also runs automatically.

Biological diseases and computer virus share many characteristics. For one, both their spread depends on the frequency of contact between those infected and those susceptible. This is why biological analogies were established since the first trials on studying how to combat computer viruses12.

In combating the spread of both the biological disease and computer virus, the approach is also similar. Whereas the infected person is treated with medicines, the infected computer is treated by the curative capability of an antivirus. Whereas the uninfected person is made immune with a vaccine, the uninfected computer is protected by the immunizing capability antivirus.

These are just some of the similarities between the characteristics of the biological disease and the computer virus. Nevertheless, these justify the application of epidemiological models to the spread of computer virus.

               ----------------
11Yu Zhang, Tao Li, and Renchao Qin, “Computer Virus Evolution Model Inspired by Biological DNA,” College of Computer Science, Sichuan University (China, 2008), 943.
12Jose Roberto et al, “Epidemiological Models Applied to Viruses in Computer Networks,” Telecommunications and Control Engineering Department, Sao Paulo University (Brazil, 2005), 31.

Epidemiological Modeling

Note. The text below is part 4 of the paper titled "Epidemiological Modeling of Computer Virus Propagation." This is now Chapter II - Epidemiological Modelling.

Growth and decay are two subjects that can be mathematically modeled. These two are natural phenomena that occur in both the living and the non-living things. Examples of living things are human population and those biological diseases like HIV and malaria. Meanwhile, examples of non-living things where these occur are the spread of rumors and the simple but not so obvious subject of waiting-in-line10. The computer virus propagation is viewed to have growth and decay periods. As such, this paper adopted the aforementioned growth and decay phenomena in modeling this propagation.

The paper patterned the model for computer virus propagation with the spread of biological a disease. The widespread infection of people from a biological disease is often called an epidemic. Being an epidemic, the model used to describe the spread of a biological disease is often called epidemiological model.

The spread of a computer virus (although non-biological) is typically viewed as similar to a biological one (the justification on this is discussed in the next section). Hence in the spread of a computer virus, epidemiological modeling is conventionally applied.
           -----------------
10Daniel P. Maki, Mathematical Models and Applications (New Jersey: Prentice Hall, 1973), 356.

Importance of modeling the spread of computer viruses

Note. The text below is part 3 of the paper titled "Epidemiological Modeling of Computer Virus Propagation."

It is usually difficult to prevent the launch of a new computer virus.  A new virus is usually not remedied by existing anti-viruses. The new program of this virus must be studied first before an antivirus program can be developed. Once launched, the propagation of the virus must least be controlled. Before effective control strategies can be presented, it is important to understand clearly how the computer virus propagates in the network7.  This can be achieved by a creating mathematical model of the propagation. The model helps in predicting future threats and in developing new measures to contain the virus.

The mathematical model needs to be well-designed and reliable. It must be accurate in its predictions and must be as general as possible, while remaining as simple and as low-cost as possible. The best model is one that is capable of detecting unknown viruses and their mutation8.

Objectives in developing a mathematical model

This paper’s objective is to develop a simple mathematical model that will describe the spread of computer virus under two scenarios. The first scenario is where a new virus gets launched to a group of interconnected computers. Being a new virus, the computers are assumed to have no antivirus to prevent the infection. The virus is said to be propagating freely.

The second scenario is where the virus gets launched to a group of interconnected computers at the time when the computers already have the necessary antivirus to cure the infection and prevent re-infection. The virus here is no longer propagating freely. 
--------------
7Giuseppe Serazzi and Stefano Zanero, “Computer Virus Propagation Models,” Dipartimento di Eletrronica e Informazione, Politecnico Milano (Milan, Italy, 2005), 1.
8Giuseppe Serazzi and Stefano Zanero, 2.