That summer we all saw a rush of press on Gnutella, and the rumour mill started churning. Most stories covering Gnutella were grossly and inappropriately evangelistic, praising the not-yet-analyzed Gnutella as a technology capable of delivering on wildly fantastic promises of fully distributed, undeterrable, unstoppable, larger-than-life file sharing on the grandest scale. Many folks were convinced that Gnutella was the next generation Napster. Gene Kan, the first to spearhead the Gnutella evangelistic movement, claimed in one early interview: "Gnutella is going to kick Napster in the pants." Later Kan admitted "Gnutella isn't perfect", but still went on to say that "there's no huge glaring thing missing". Well, something just wasn't right, and though we couldn't see it, it did seem pretty glaring.
We all understood the excitement. Herein was a technology that could potentially prove the true magnitude of Metcalfe's Law. That realization evoked nothing short of the phrase "holy shit!". But what I couldn't understand was why no one was questioning the legitimacy of these claims. For several months the only analyses anyone heard of practical implementations were generalizations and speculative comments, without much scientific or mathematical basis.
So I quickly got fed up, and resolved to write a research paper. Sometime in late March, I had begun analyzing the network structure of the Gnutella system, trying to find a way to gauge the capacity of a GnutellaNet in generalized terms, and to predict its realistic limits. What later resulted was a set of mathematical equations that could describe reachability, capacity, and bandwidth throughput. I then fed those equations into Mathematica to produce 3-D plots depicting, much to my own satisfaction, visual realizations of exactly what didn't make sense.
At about the same time, a fellow colleague in the security industry wrote a short paper detailing the various and flagrant insecurities inherent in this particular implementation of a distributed system. Seth McGann's security advisory titled Self-Replication Using Gnutella centered on the characteristics an Internet Worm inside a GnutellaNet could thrive from, and also touched on a few other flaws that would be useful to an attacker. His advisory posted in May of 2000, and unfortunately went mostly unnoticed (or misunderstood, because of its technical nature).
Later in August, Xerox PARC published a research paper on the social characteristics of a GnutellaNet, proving through empirical observation that transience hurts this type of fully distributed network considerably, and that Gnutella was not such an invincible proposition after all.
These days the Internet doesn't lack for useful papers on Gnutella. Research papers by the folks at Distributed Search Solutions are fairly high in quality and remain objective, if not optimistic about the future of Gnutella. Other informative articles persist on O'Reilly's P2P Website, and elsewhere.
So where's my paper, and why haven't you seen it? Well, in case you didn't know, I'm one of the founding developers of Napster, and for several good reasons, including the sobering fact that I was one of the leaders of the main competitor, I did not release my material to the public. Several times I resigned myself to re-writing my paper to accommodate the release of new information and analyses, but I never finished. Now I regret having sat on this for so long, for every paper on Gnutella that has come out in the last year has served as nothing but vindication of my conclusion from so early on: Gnutella will never scale.
Following is what remains of my paper, hacked up, sliced, diced and re-written. The information and analyses are still useful, but as I just said, the conclusions are the same. This paper simply proves those conclusions through mathematics.
This paper assumes a working knowledge of Gnutella networks and internals, and therefore uses terminology and phraseage specific to Gnutella. If the wording seems somewhat strange or foreign to you, please stop reading this paper and seek other documentation before proceeding. Furthermore, explanation of the accompanying math is intentionally terse. Every effort has been made to verify the accuracy of the equations herein, but this discussion is intentionally limited to that which is solely relevant to Gnutella in order to keep at a minimum any distraction from an already complex topic.
Scaling Gnutella will require more than just better resource management tools -- in its current incarnation Gnutella is mathematically and technologically unable to scale to a network of any reasonably large size. Following herein is a discussion focused on mathematically describing the metrics of a GnutellaNet topology, and using derived equations to interpret and visualize realistic limits of the technology. In order to keep the math as simple as possible, let's assume we're examining a relatively quiet GnutellaNet network, and dissect the flow of information one step at a time.
|Variables and Equations|
|P||The number of users connected to the GnutellaNet.|
|N||The number of connections held open to other servents in the network. In the default configuration of the original Gnutella client, this is 4.|
|T||Our TTL, or Time To Live, on packets. TTL's are used to age a packet and ensure that it is relayed a finite number of times before being discarded.|
|B||The amount of available bandwidth, or alternatively, the maximum capacity of the network transport.|
|f(n, x, y)||
A function describing the maximum number of reachable users that are
at least x hops away, but no more than y
f(n, x, y) = Sum[((n-1)^(t-1))*n, t = x->y]
A function describing the maximum number of reachable users for any
given n and t.|
g(n, t) = f(n, 1, t)
|h(n, t, s)||
A function describing the maximum amount of bandwidth generated
by relaying a transmission of s bytes given any
n and t. Generation is defined as
the formulation and outbound delivery of data.|
h(n, t, s) = n*s + f(n, 1, t-1)*(n-1)*s
|i(n, t, s)||
A function describing the maximum amount of bandwidth incurred
by relaying transmission of s bytes given any
n and t. Incurrence is defined as
the reception or transmission of data across a unique connection to a
i(n, t, s) = (1 + f(n, 1, t-1))*n*s + f(n, t, t)*s
It benefits the casual reader to first explain in terms of a balanced, equally distributed GnutellaNet, so for this exercise assume that everyone has the same N and T. In the initial release of Gnutella, N = 4 and T = 5. Further, let P = 2000, arbitrarily. Finally, let us assume no other interfering factors exist (for now).
Early reports of Gnutella's usage claimed upwards of 2000 to 4000 users on the GnutellaNet. This is significant because these reports inaccurately implied that all 4,000 users on the GnutellaNet were reachable and searchable. The reality is that even in an ideally balanced GnutellaNet, P is never relevant to your potential reach; N and T are the only limiting factors.
Raising N (number of connections open) and T (number of hops) extend the number of reachable users geometrically.
Keep in mind, the above illustrates potential reach given two assumptions: the network is fully balanced, and everyone shares the same N and T.
So, the next obvious step for an intrepid and now better-informed Gnutella user is to increase N and T, so as to extend their potential reach into the GnutellaNet web. Not so fast! As your reach increases geometrically, so does the amount of bandwidth generated and incurred. Let's now move the discussion towards B.
Delving Deeper into B
Before proceeding, it is very important to understand that many assumptions must be made in order to carry out these computations. Observed characteristics of GnutellaNet topologies are simply too varying to accurately generalize. That said, I still believe that there exists a statistical mean of each characteristic in a GnutellaNet, which is to say that if I were to take a snapshot of the current topology of a public GnutellaNet I could derive an average N, T, and so forth. While potentially inaccurate as a realistic representation, these means can still produce a useful generalization.
In our discussion of B, there are really two different perspectives on how to measure the amount of bandwidth: the amount generated, and the amount incurred. This is a very important distinction to make, because knowing the amount of raw data generated is statistically useful, but understanding the bandwidth cost incurred by individual events in the network is much more important since it more realistically signifies the impact on an Internet connection. As previously stated, h(n, t, s) represents the amount of bandwidth generated by relaying a packet through the network, counting only data that is outbound to another destination. i(n ,t, s), on the other hand, counts all outbound and inbound transmissions, yielding a more accurate perspective on bandwidth usage. Let's introduce an example.
Joe Smith likes classic rock, and is desperately searching for any live recordings of The Grateful Dead. Joe loads up his Gnutella client, connects to the GnutellaNet, and executes his search, "grateful dead live". What actually happens?
|Search Query Packet Makeup|
|IP header||20 bytes|
|TCP header||20 bytes|
|Gnutella header||23 bytes|
|Minimum Speed||1 byte|
|Search string||18 bytes + 1 byte|
It isn't useful to account for Data Link Layer transmissions since
they vary widely and don't significantly impact these calculations, so
they have been intentionally ommitted.
IP and TCP header calculations assume simplest case scenario.
Joe's search request results in an 83 byte data packet. Initially, everyone would agree that it looks like a tiny, unnoticeable amount of data. Let's take a look at the bandwidth cost of simply relaying the search request. h(n, t, s) is comprised of the data Joe transmits across his connections to other Gnutella users (n*s), plus transmissions of all tiers between Joe and the last tier, which is only receiving.
|Bandwidth Generated in Bytes (S=83)|
From above, given a concurrent demographic comparable to Napster (assuming equally balanced), searching for a simple 18 byte string "grateful dead live" unleashes 90 megabytes worth of data to be transmitted.
Even so, I don't consider h(n, t, s) to be the best measure. Let's now look at i(n, t, s), which is comprised of the originating transmission, 1 reception and N-1 transmission for tiers 1 through T-1, and 1 reception for the last tier.
|Bandwidth Incurred in Bytes (S=83)|
i(n, t, s) has the unique property of representing double h(n, t, s).
From above, a whopping 1.2 gigabytes of aggregate data could potentially cross everyone's networks, just to relay an 18 byte search query. This is of course where Gnutella suffers greatly from being fully distributed.
Also, let's not forget that there is no consideration of time in this set of calculations. In the average case, 1.2 gigabytes worth of data takes a very long time to generate and propagate through the Internet. However, even in more realistic cases, propagating a few megabytes worth of data through several hundred thousand nodes across the Internet still takes a considerable amount of time.
At this point, though, our exercise is still incomplete. What percentage of Gnutella clients share content? Of them, what percentage are likely to respond to Joe's query? And of those, what would be the mean number of responses, and their mean length?
This is where we'll begin to see generalizations diverging from reality. Still though, let's take a quick gander at what evangelists thought Gnutella would be capable of. For this, we'll need to introduce a few more variables and equations.
|More Variables and Equations|
|a||Mean percentage of users who typically share content.|
|b||Mean percentage of users who typically have responses to search queries.|
|r||Mean number of search responses the typical respondent offers.|
|l||Mean length of search responses the typical respondent offers.|
A function representing the Response Factor, a constant value
that describes the product of the percentage of users responding and
the amount of data generated by each user.|
R = (a*b) * (88 + r*(10 + l))
|j(n, T, R)||
A function describing the amount of data generated in response to a
search query by tier T, given any n and
Response Factor R.|
j(n, T, R) = f(n, T, T) * R
|k(n, t, R)||
A function decsribing the maximum amount of bandwidth generated
in response to a search query, including relayed data, given any
n and t and Response Factor
k(n, t, R) = Sum[ j(n, T, R) * T, T = 1->t ]
Assuming that a mean exists for the characteristics of our measurement makes these calculations much simpler. That said, recall that I don't believe this assumption to be false; that at any given moment there does exist some measurable a, b, r and l. Let's assume conservative estimates for now, and apply observed behaviour from other reports later.
The difficulty in gauging the sheer amount of data coming back to us stems from our inability to realistically discern where in the partial mesh of connections the data is coming from. By design, the only thing we will know about about the packets received is the (hopefully) unique message ID. If the message ID correlates to the message ID of one of our pending queries, the response is ours. Otherwise, the response is someone else's traffic, and if it correlates to an known ID in our routing table, it is simply passed along.
|Search Response Packet Makeup|
|IP header||20 bytes|
|TCP header||20 bytes|
|Gnutella header||23 bytes|
|Number of hits||1 byte|
|IP Address||4 bytes|
|Result Set||r * (8 + l + 2) bytes|
|Servent Identifier||16 bytes|
|Total:||88 + r*(10 + l) bytes|
Let's take a look now at what the variation of N and T yields in terms of bandwidth costs. For our first case, let's choose some reasonable values: a = 30%, b = 50%, r = 5 and l = 40, or R = 50.7.
|Bandwidth Generated in Bytes (R=50.7)|
Precision is limited to 6 or less digits; sorry, I don't know how to make mathematica behave differently in this case.
With 30% of Gnutella users sharing, and only half of them responding, the standard client settings yield over 14MB of return responses. I believe this particular R value to be near reality as far as percentages are concerned, but r and l are probably conservative, given recent reports by Clip2 DSS and others. Let's raise R a bit, here's R = 72.
|Bandwidth Generated in Bytes (R=72)|
These different values don't appear to have much of an impact on the overall bottom line; just over 13MB of traffic generated in response with standard client settings. Let's take one more look and adjust some of the values: a = 30%, b = 40%, r = 10 and l = 60, or R = 94.56. I believe this R to be the most realistic.
|Bandwidth Generated in Bytes (R=94.56)|
Standard client settings yield a whopping 17MB generated in response to Joe's search query.
In order to better understand the results above, one must understand the Response Factor, R, and the reasoning behind it. Recent analyses of Gnutella networks show a small percentage of participants actually sharing content, and a disproportionately small percentage of those sharing actually having most of the content. It is highly improbable that a means to statistically describe the widely varying response characteristics of participants in a GnutellaNet exists. R is a compromise for this difficult task, representing a gross mean across an ideal GnutellaNet of responses we can expect the average query to generate. The key word here is ideal; we know these gross means to exist, but they are as yet unmeasurable, or at least at this point unverifiable, given the quickly changing network topology.
Bringing it all together
So, now that we have all the pieces to the puzzle, let's fit them together. How much aggregate data, including request and response, is generated by Joe's search for "grateful dead live"? Let's intersect h(n, t, s) with k(n, t, R) to get The Big Picture.
|Bandwidth Generated in Bytes (S=83, R=94.56)|
The Big Picture, h(n, t, s) and k(n, t, R) combined.
What's really stunning about the above table is the stark realization that in supporting numbers of users comparable to Napster, Gnutella would generate more than an unbelievably significant 800MB worth of data for just one of those users to search the entire network for "grateful dead live" and receive responses.
Our job is still not finished yet, though. What remains is to apply these statistics to observed query rates to gain an understanding of the real-time impact of a GnutellaNet on a network.
Behold, The Firestorm
When Napster, Inc. was served with an injunction designed to halt all file-sharing service through the Napster network, Gnutella and similar services experienced what is now commonly referred to as the "Napster Flood". While an inordinate number of users perceived the injunction as their personal charge to download from Napster as much as possible before the service was brought down, still a great many flocked to other file-sharing services such as Gnutella.
During this period of time, Clip2 DSS observed query rates peaking at 10 queries per second, double the normal 3-5 per second. The possibility of exceeding 10 qps during periods of heavy usage these days is not unlikely.
The final item of interest in this paper is the extrapolation of bandwidth rates (per second) from the bandwidth costs calculated above and observed rates. For thoroughness, query rates for a quiet (3qps), normal (5 qps), and burdened (10 qps) GnutellaNet are examined. For each test case, the main assumption is that Joe Smith's behaviour satisfies the typical user demographic.
|Bandwidth rates for 3 qps (S=83, R=94.56)|
|Bandwidth rates for 5 qps (S=83, R=94.56)|
|Bandwidth rates for 10 qps (S=83, R=94.56)|
Keeping things in Perspective
From the charts above, it becomes mind-numbingly clear that the Gnutella distributed architecture is fundamentally flawed and can have a horrific impact on any network. On a slow day, a GnutellaNet would have to move 2.4 gigabytes per second in order to support numbers of users comparable to Napster. On a heavy day, 8 gigabytes per second.
A lot of potentially obscure assumptions are made here, though, and they should be carefully examined and understood before making conclusions:
So why should the above charts be taken with a grain of salt? Well, the real GnutellaNet that exists today is certainly not ideal, and has been occasionally observed persisting as several smaller, fractured GnutellaNets. Also, there's a great deal of transience in the GnutellaNet; observations yield only roughly 30-40% of participants remain for 24 hours or more. And it should be obvious to even the most casual observer that query rates are not constant, and are more likely to burst and lull as the topology shifts and usage varies.
One important factor in evaluating the usefulness of the above is to consider the usage demographic. Current usage may show 3-5 queries per second with anywhere between 4,000 and 8,000 users, but if Gnutella were to ever grow in size, both by users and consequentially by files, search rates would likely increase dramatically. This would be for at least two reasons: more users equates to more people interested in locating content equates to more aggregate queries per second, and more content equates to wider variance in type of material equates to, quite simply, more to search for. So, applying query rates involving only thousands of users to GnutellaNet populations orders of magnitude greater in size is probably inaccurate; instead, at greater sizes, the above computed bandwidth rates are probably much too small. Indeed, one can extrapolate from the above, using the test case of 1,000,000 users:
Most important of all, though, the above numbers assume a capable network connection exists for all participants. If networks weren't capable of relaying the amounts of traffic discussed above, traffic jams would occur and query rates would drop, query response rates would drop, and overall traffic rates, as a result, would drop. And we know they aren't capable; we know that a significant percentage of participants are dialup users, and their low bandwidth capabilities cause significant traffic congestion and topology fragmentation when improperly configured.
Even though many assumptions were made throughout the course of these calculations, some of which are provably unrealistic, these exercises still yield a useful perspective. In an ideal world, Gnutella is truly a "broadband killer app" in the most literal of senses -- it can easily bring the Internet infrastructure to its knees. And it should also be noted that only search query and response traffic was accounted for, omitting various other types of Gnutella traffic such as PING, PONG, and most importantly, the bandwidth costs incurred by actual file transfers. 2.4GBps is just search and response traffic, but what about the obnoxiously large amount of bandwidth necessary to transfer files between clients?
Those reading this paper should be careful to note that non-intended uses of the GnutellaNet also incur noticeable bandwidth hits: using search queries to chat with other participants, SPAM placed inside search queries and results to advertise various things, and gibberish, typically resulting from misbehaving users or clients. Futhermore, with individuals writing their own clients and protocol extensions, we may begin to see loop detection being rendered useless. Depending on how individual clients implement loop detection (comparing message ID's versus comparing message ID's + a checksum of the packet's payload), protocol extensions may interfere with legacy clients and result in more traffic than necessary being generated and relayed.
The main argument against this paper is that GnutellaNets are never ideal, and as adoption and usage grows, are statistically less likely to be ideal, given the increase in complexity of the topology as the number of participants increase. I would agree with this principle, but I believe it only serves as better proof of the premise: if an ideally distributed and fully capable network generates 2.4GBps to accomodate 1M users (and we already know this figure to be unrealistic in terms of what the modern Internet is capable of), then a poorly distributed network with insufficient bandwidth will certainly not be able to support the same number of participants or the traffic they generate. In other words, again, Gnutella can't scale.
Another key argument against these computations is that they are all focused on the center of an ideal GnutellaNet, and applying this generalization to all configurations of nodes is misleading and inaccurate. Traffic is measured and generalized from a maximizable point; this is to say that the "center" node will always generate the most amount of traffic given the same configuration throughout, whereas a leaf node in an ideal GnutellaNet generates only a fraction of that bandwidth. However, empirical analysis yields the observation that, in practice, leaf nodes don't generally have only one connection into the GnutellaNet. As a matter of fact, leaf nodes don't tend to occur naturally at all, since it is rarely in a participant's best interest to limit themselves to one connection, in maximizing bandwidth capacity versus search depth. To date I've only observed this happening on a large scale with Reflectors, or strategically placed Gnutella "proxies" at high bandwidth locations on the Internet aimed at serving dialup and other small capacity clients. So, the inaccuracy of these numbers likely lies in their being, again, much too small. Also, regardless of how intertwined and convoluted the connection paths are, the data path is effectively rendered semi-ideal through loop detection, so the methodology turns out to be more realistic than first thought.
Yet another valid question to raise against the premise is, What is a reasonable size? Is it 100 users? Is it 1,000? Or 100,000? Or 1,000,000? Nothing short of global domination? Discerning what's reasonable is assuredly a subjective comparison, however, I use the phrasage interchangably with original statements like "Gnutella will kick Napster in the pants." Common sense dictates that in order to accomplish that, Gnutella would have to perform more efficiently, scale higher, and be more capable. These exercises prove that, on a perfect level, Gnutella just can't rise to meet the challenge. Consequentially, they prove that on an imperfect level Gnutella has no hope of performing on the same level.
In the final assessment, it's painfully obvious that Gnutella needs a complete overhaul. Major architectural flaws are fundamental in nature and cannot be mitigated effectively without redesign at the most basic level. Some intelligent caching could likely benefit the Gnutella architecture, since observations yield that many searches and responses result in repetitive, duplicate transmissions. However, given the transience of GnutellaNet participants, and the wide variety of participating clients, it would be difficult to predict with any amount of accuracy how effective technology like this would be.
Various efforts claim to be underway to redesign the protocol; among them, gPulp stands out as the farthest along, with message boards and mailing lists set up for those wanting to get involved. But, with its mission of consentual changes implemented through a working group, I harbor significant doubt as to whether they will ever be timely and effective at producing an alternative. GnutellaWorld, another revamp effort recently publicized by CNet's news.com, takes the lead on the initiative for developing Gnutella2. J.C. Nicholas, apparently representing GnutellaWorld, claimed in an interview with CNet that Gnutella2 technology would be out "soon". Characterized as an "Internet Earthquake" and promised to be "the greatest revolution since Linux", Gnutella2 sounds more like the same old hype than anything else. And with only 8-9 months under their collective belt as an organization, I personally wonder how far along efforts could be. If the fact that this open-source project's CVS repository remains quite empty, or that its mailing lists appear dormant presents any indication of progress, the Internet probably has some time to go before experiencing the next internet cataclysm. Considering GnutellaWorld's intentions of supporting 20 million people or more, I can only hope that it's nothing like the original Gnutella.