
January 24, 2005

Java Samba Implementation

Since we are using DOGMA as the cluster of client machines to generate test traffic for our test benches, the distributed client code is written in Java. Nathan already has a Java client that makes HTTP requests. OIT said that one of the services they provide, and which we need to take into account in this project, is file sharing, in particular Windows file sharing.


JCIFS is an open source client library that implements the CIFS/SMB networking protocol in 100% Java.

Using this library, we could also distribute a Samba test using DOGMA.

Posted by Devlin at 04:54 PM | Comments (0)

Overhead of Virtualization

Dr. Windley suggested a methodology for determining the overhead due to virtualization, using a load balancer and a cluster of clients, and varying the number of virtual machines. I do think it is a good way to begin tackling the problem, although I had a few reservations about whether the setup of the test would lend itself to meaningful, statistically interpretable results.

I am by no means a statistician, although my undergraduate statistics course gave me some rudimentary basis for recognizing the necessity and applicability of statistics in our problem.

What I had hoped a statistical approach could give us is a way to model the different performance characteristics of a system using probability distributions. The main characteristics I came up with are CPU performance, disk I/O, virtualization overhead and network performance. Mathematically, we could then combine these distributions to form an overall performance distribution.

Distributions seem fitting since performance is typically non-linear, and difficult to quantify at that. We can construct distributions from measured data, and those distributions can then be operated on mathematically, for example by combining them with one another.
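As a rough illustration of what "combining" distributions could look like, here is a minimal sketch. It rests on an assumption that is mine and not something we have verified: that the total time for a request is the sum of independent component delays (CPU, disk, virtualization, network). Under that assumption, the overall distribution can be approximated by repeatedly drawing one value from each component's measured samples and adding them. The class name, method names and sample numbers are all hypothetical.

```java
import java.util.Arrays;
import java.util.Random;

public class CombinedDistribution {

    /** Draws one value uniformly from the observed samples (the empirical distribution). */
    static double draw(double[] samples, Random rng) {
        return samples[rng.nextInt(samples.length)];
    }

    /**
     * Approximates the distribution of the sum of the components by drawing one
     * value from each component and adding them, n times.
     * ASSUMPTION: components are independent and additive.
     * Returns the simulated totals in ascending order.
     */
    static double[] combine(double[][] components, int n, Random rng) {
        double[] total = new double[n];
        for (int i = 0; i < n; i++) {
            for (double[] component : components) {
                total[i] += draw(component, rng);
            }
        }
        Arrays.sort(total);
        return total;
    }

    /** Returns the q-th quantile (q between 0 and 1) of an already sorted array. */
    static double quantile(double[] sorted, double q) {
        int idx = (int) Math.min(sorted.length - 1, Math.floor(q * sorted.length));
        return sorted[idx];
    }

    public static void main(String[] args) {
        // Made-up sample values in milliseconds, for illustration only.
        double[] cpu = {2.0, 2.1, 2.4, 3.0};
        double[] disk = {5.0, 5.5, 9.0, 14.0};
        double[] virtualization = {0.5, 0.6, 0.8, 1.5};
        double[] network = {1.0, 1.2, 4.0, 7.5};

        double[] overall = combine(
                new double[][] {cpu, disk, virtualization, network},
                10000, new Random(42));

        System.out.println("median total (ms): " + quantile(overall, 0.5));
        System.out.println("95th percentile (ms): " + quantile(overall, 0.95));
    }
}
```

If the components turn out not to be independent (which seems likely, for example disk and virtualization overhead), the samples would have to be drawn jointly rather than one component at a time.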

My main concern about the overhead test has to do with lurking variables. Not only are the computers in the cluster not synchronized, they also reach the virtual machines over the network, and the network can contribute a significant amount of randomness to the traffic before it even reaches the load balancer. Left unaccounted for, that randomness could have a significant impact on the measured results of the VMs, without our knowledge.

Not knowing for sure, I had a very enlightening conversation with Dr. Dell Scott of the statistics department. He is their most experienced statistician with applications in computing. He confirmed my concern: variability in the traffic before the load balancer would indeed manifest itself in the measured results of the VMs. He also said that this variability, if understood, could be "subtracted" from the measured results, leaving just the measurements we intended to collect. I had thought that we would have to model the incoming traffic according to some type of distribution, and then do a lot of statistical voodoo to account for it. Dr. Scott said that it could be accomplished using just conditional probabilities, based directly on the actual traffic for a particular test. So we would just need to record or log the test traffic, which we could then use to subtract out its variability from our final results. Nathan has included a logging mechanism in the client software (which will be creating the traffic for the tests), which I think will be useful for a great many things. I think a simpler and more representative log of the actual traffic could be obtained from just another computer on the network capturing traffic with Ethereal or similar.
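To make the conditional idea a little more concrete, here is a minimal sketch of one way it might be applied. This is my own reading, not a procedure Dr. Scott spelled out. It assumes (hypothetically) that the traffic log can be reduced to a count of incoming requests per time interval, and that we have a measured mean response time for each of those same intervals. Grouping the measurements by incoming load gives an estimate of the response time conditioned on the load, so that results from runs with different traffic can be compared at equal load.

```java
import java.util.Map;
import java.util.TreeMap;

public class ConditionalResponse {

    /**
     * Groups response times by the logged arrival count of their interval and
     * returns the mean response time per load bucket, i.e. an estimate of
     * E[response time | incoming load].
     *
     * arrivalsPerInterval[i] and responseMs[i] must describe the same interval.
     * Buckets are keyed by their lower bound (0, bucketWidth, 2*bucketWidth, ...).
     */
    static Map<Integer, Double> meanByLoad(int[] arrivalsPerInterval,
                                           double[] responseMs,
                                           int bucketWidth) {
        // bucket lower bound -> {sum of response times, number of intervals}
        Map<Integer, double[]> acc = new TreeMap<>();
        for (int i = 0; i < arrivalsPerInterval.length; i++) {
            int bucket = (arrivalsPerInterval[i] / bucketWidth) * bucketWidth;
            double[] a = acc.computeIfAbsent(bucket, k -> new double[2]);
            a[0] += responseMs[i];
            a[1] += 1;
        }
        Map<Integer, Double> means = new TreeMap<>();
        for (Map.Entry<Integer, double[]> e : acc.entrySet()) {
            means.put(e.getKey(), e.getValue()[0] / e.getValue()[1]);
        }
        return means;
    }

    public static void main(String[] args) {
        // Made-up numbers, for illustration only.
        int[] arrivals = {5, 7, 12, 18};
        double[] responseMs = {10.0, 20.0, 30.0, 50.0};
        // Prints the mean response time for loads 0-9 and 10-19 requests per interval.
        System.out.println(meanByLoad(arrivals, responseMs, 10));
    }
}
```

The bucket width is a free parameter here; too narrow and each bucket has too few intervals to average over, too wide and the load variation we are trying to remove stays in the result.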

In an academic paper about generating representative workloads for server performance, the authors talked about many issues, one of them being the self-similar characteristic of network traffic. Among their points: self-similarity has a detrimental effect on network performance, real-world network traffic is self-similar, and self-similarity cannot be modeled with a Poisson distribution. Since our cluster of clients is not generating self-similar traffic, this would mean that the measured results on the VMs would be misleading. Dr. Scott pointed out to me that this very "problem" actually provides useful knowledge and insights. In particular, since self-similarity of traffic degrades performance, random traffic (or traffic following some other statistically understood distribution) would yield performance greater than the real world, effectively defining an upper bound. We would know that real performance could be no better than this benchmark (although of course we will need to verify this claim). So, with the logging of incoming traffic I mentioned above, this test scenario could give us an upper bound for performance (although nothing close to an upper bound on virtualization overhead).
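One way we might check whether logged traffic looks Poisson-like or self-similar is the aggregated variance (variance-time) method: average the per-interval request counts over non-overlapping blocks of size m and see how quickly the variance of those block averages falls as m grows. For independent Poisson counts it falls roughly as 1/m; for self-similar traffic it falls more slowly. The sketch below is only an illustration under assumptions of my own: the class and method names are hypothetical, and the traffic is simulated Poisson arrivals rather than anything from our logs.

```java
import java.util.Random;

public class VarianceTime {

    /**
     * Simulates Poisson arrivals at the given rate (arrivals per unit interval)
     * using exponential inter-arrival gaps, and returns the count per interval.
     */
    static int[] poissonCounts(double rate, int intervals, Random rng) {
        int[] counts = new int[intervals];
        double t = 0.0;
        while (true) {
            // 1 - nextDouble() lies in (0, 1], so the logarithm is always defined.
            t += -Math.log(1.0 - rng.nextDouble()) / rate;
            if (t >= intervals) {
                break;
            }
            counts[(int) t]++;
        }
        return counts;
    }

    /**
     * Variance of the series after averaging over non-overlapping blocks of
     * size m. Trailing values that do not fill a whole block are ignored.
     */
    static double aggregatedVariance(int[] counts, int m) {
        int blocks = counts.length / m;
        double[] means = new double[blocks];
        double grand = 0.0;
        for (int b = 0; b < blocks; b++) {
            double sum = 0.0;
            for (int j = 0; j < m; j++) {
                sum += counts[b * m + j];
            }
            means[b] = sum / m;
            grand += means[b];
        }
        grand /= blocks;
        double var = 0.0;
        for (double x : means) {
            var += (x - grand) * (x - grand);
        }
        return var / blocks;
    }

    public static void main(String[] args) {
        int[] counts = poissonCounts(50.0, 100000, new Random(7));
        for (int m : new int[] {1, 10, 100}) {
            System.out.println("m=" + m + " variance=" + aggregatedVariance(counts, m));
        }
    }
}
```

Running the same `aggregatedVariance` on counts taken from a real capture, and comparing the rate of decay against the simulated Poisson case, would give a first indication of how far our generated traffic is from real-world traffic.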

Posted by Devlin at 03:20 PM | Comments (0)