[[ Check out my Wordpress blog Context/Earth for environmental and energy topics tied together in a semantic web framework ]]

Monday, January 17, 2011

The Oil ConunDRUM

I synthesized the last several years of blog content and placed it into The Oil ConunDRUM. This document turned into a treatise of topics relating to the role of disorder and entropy in the applied sciences. Volume 1 is mainly on the analysis of the decline in global oil production, while Volume 2 uses often related analysis in studying renewable sources of energy and how entropy plays a role in our environment and everyday life.



 TOC essentially draws a line in the sand and a virtual stake in the ground. Everything I have written about and all the original analyses I have worked out on the blog has not fundamentally changed as I aggregated the information. As far as I can tell, no one else has picked up on the direction that I have taken, and nothing has come out of the research literature that comes close to unifying the set of topics as well as this does.

A couple of commenters have said I should publish the research work through peer-reviewed channels. That won't happen because the project covers too much territory and compiling a massive tome such as this represented the best option I could think of. I invite all with an interest in the natural world to take a whack at digesting it.

This is a list of the novel areas of research, listed in what I consider a ranked order of originality:
  1. The Oil Shock Model.
    A data flow model of oil extraction and production which allows for perturbations.

  2. The Dispersive Discovery Model.
    A probabilistic model of resource discovery which accounts for technological advancement and a finite search volume.

  3. The Reservoir Size Dispersive Aggregation Model.
    A first-principles model that explains and describes the size distribution of oil reservoirs and fields around the world.

  4. Solving the Reserve Growth "enigma".
    An application of dispersive discovery on a localized level which models the hyperbolic reserve growth characteristics observed.

  5. Shocklets.
    A kernel approach to characterizing production from individual fields.

  6. Reserve Growth, Creaming Curve, and Size Distribution Linearization.
    An obvious linearization of this family of curves, related to HL but more useful since it stems from first principles.

  7. The Hubbert Peak Logistic Curve explained.
    The Logistic curve is trivially explained by dispersive discovery with exponential technology advancement.

  8. Laplace Transform Analysis of Dispersive Discovery.
    Dispersion curves are solved by looking up the Laplace transform of the spatial uncertainty profile.

  9. The Maximum Entropy Principle and the Entropic Dispersion Framework.
    The generalized math framework applied to many models of disorder, natural or man-made. Explains the origin of the entroplet.

  10. Gompertz Decline Model.
    Exponentially increasing extraction rates lead to steep production decline.

  11. Anomalous Behavior in Dispersive Transport explained.
    Photovoltaic (PV) material made from disordered and amorphous semiconductor material shows poor photoresponse characteristics. Solution to simple entropic dispersion relations or the more general Fokker-Planck leads to good agreement with the data over orders of magnitude in current and response times.

  12. Framework for understanding Breakthrough Curves and Solute Transport in Porous Materials.
    The same disordered Fokker-Planck construction explains the dispersive transport of solute in groundwater or liquids flowing in porous materials.

  13. The Dynamics of Atmospheric CO2 buildup and extrapolation.
    Used the oil shock model to convolve a fat-tailed CO2 residence time impulse response function with a fossil-fuel stimulus. This shows the long latency of CO2 buildup very straightforwardly.

  14. Terrain Slope Distribution Analysis.
    Explanation and derivation of the topographic slope distribution across the USA. This uses mean energy and maximum entropy principle.

  15. Reliability Analysis and understanding the "bathtub curve".
    Using a dispersion in failure rates to generate the characteristic bathtub curves of failure occurrences in parts and components.

  16. Wind Energy Analysis.
    Universality of wind energy probability distribution by applying maximum entropy to the mean energy observed. Data from Canada and Germany.

  17. Dispersion Analysis of Human Transportation Statistics.
    Alternate take on the empirical distribution of travel times between geographical points. This uses a maximum entropy approximation to the mean speed and mean distance across all the data points.

  18. The Overshoot Point (TOP) and the Oil Production Plateau.
    How increases in extraction rate can maintain production levels.

  19. Analysis of Relative Species Abundance.
    Dispersive evolution of species according to Maximum Entropy Principle leads to characteristic distribution of species abundance.

  20. Lake Size Distribution.
    Analogous to explaining reservoir size distribution, uses similar arguments to derive the distribution of freshwater lake sizes. This provides a good feel for how often super-giant reservoirs and Great Lakes occur (by comparison)

  21. Labor Productivity Learning Curve Model.
    A simple relative productivity model based on uncertainty of a diminishing return learning curve gradient over a large labor pool (in this case Japan).

  22. Project Scheduling and Bottlenecking.
    Explanation of how uncertainty in meeting project deadlines or task durations caused by a spread of productivity rates leads to probabilistic schedule slips with fat-tails. Answers why projects don't complete on time.

  23. The Stochastic Model of Popcorn Popping.
    The novel explanation of why popcorn popping follows the same bell-shaped curve of the Hubbert Peak in oil production.

  24. The Quandary of Infinite Reserves due to Fat-Tail Statistics.
    Demonstrated that even infinite reserves can lead to limited resource production in the face of maximum extraction constraints.

  25. Oil Recovery Factor Model.
    A model of oil recovery which takes into account reservoir size.

  26. Network Transit Time Statistics.
    Dispersion in TCP/IP transport rates leads to the measured fat-tails in round-trip time statistics on loaded networks.

  27. Language Evolution Model.
    Model for relative language adoption which depends on critical mass of acceptance.

  28. Web Link Growth Model.
    Model for relative popularity of web sites which follows a diminishing return learning curve model.

  29. Scientific Citation Growth Model.
    Same model used for explaining scientific citation indexing growth.

  30. Particle and Crystal Growth Statistics.
    Detailed model of ice crystal size distribution in high-altitude cirrus clouds.

  31. Rainfall Amount Dispersion.
    Explanation of rainfall variation based on dispersion in rate of cloud build-up along with dispersion in critical size.

  32. Earthquake Magnitude Distribution.
    Distribution of earthquake magnitudes based on dispersion of energy buildup and critical threshold.

  33. Income Disparity Distribution.
    Relative income distribution which includes inflection point to to compounding interest growth on investments.

  34. Insurance Payout Analysis, and Hyperbolic Discounting.
    Fat-tail analysis of risk and estimation.

  35. Thermal Entropic Dispersion Analysis.
    Solving the Fokker-Planck equation or Fourier's Law for thermal diffusion in a disordered environment. A subtle effect.

  36. GPS Acquisition Time Analysis.
    Engineering analysis of GPS cold-start acquisition times.
You can refer back to details in the blog, but The Oil ConunDRUM cleans everything up. It features quality mathematical markup, references to scholarly work, a full subject index, hypertext table of contents, several hundred figures with captions, footnotes and sidebars with editorial commentary, embedded historical documents, source code appendices, and tables of nomenclature and glossary.



EDIT (1/21/11): Here is a critique from TOD. I can only assume the commenter doesn't understand the concept of convolution or doesn't realize that such a useful technique exists:
Your methods are fundamentally flawed you cannot aggregate across producing basins like you do. Its simply wrong.
To add multiple producing basins together you must adjust the time variable such that all of them start production at the same time or if they have peaked all the peaks are aligned.
The time that a basin was discovered and put into production is an irrelevant random variable and has no influence on the ultimate URR.
If you don't correctly normalize the time variable across basins your work is simply garbage. There is no coupling between basins and no reason to average them based on real time. Its junk math. No simple function exists in real time to describe the aggregate production profile.

The US simply happened to have its larger basins developed about the same time in real time. Hubbert's original analysis worked simply because the error in the normalized time and real time was small.

One of the mysteries of science and mathematics is the role of entropy. The mathematician Gian-Carlo Rota from MIT had this to say just a few years ago:
The take on this is that as Rota says about the Maximum Entropy Principle "Among all mathematical recipes, this is to the best of my knowledge the one that has found the most striking applications in engineering practice", yet it retains this sense of mystery in that no one can really prove it -- entropy just IS and by its existence, you have to deal with it the best you can.

EDIT (1/31/11): In the book, the last prediction of global crude production I made was a while ago. Here is an update:


The chart above is the best guess model from 2007 using the combined Dispersive Discovery+Oil Shock Model for crude. Apart from a conversion from barrels/year to barrels/day, this is the same model as I used in a 2007 TOD post and documented in The Oil ConunDRUM. The recent data from EIA is shown as the green dots back to 1980. I always find it interesting to take the 10,000 foot view. What may look like a plateau up close, may actually be part of the curve at a distance.

EDIT (2/22/2011): An additional USA Shock Model not included in the book. I included Alaska in this model.

Discovery data transcribed from this figure; the discoveries seem to end in 1985, so I extended the data with a dispersive discovery model. I added in Alaska North Slope at 22 billion barrels in 1968 and a small 300 million barrel starter discovery in 1858.
.
The blue line in the Dispersive Discovery Model is this equation, which is essentially a scaled version of the world model:
DD(t)=(1-exp(-URR/(B*((t-t')^6))))*B*((t-t')^6), URR=240,000 million barrels, B=2E-7, t'=1835.

I did not include any perturbation shocks to keep it simple. Apart from the data, the following is the entirety of the Ruby code; the discovery.txt file is yearly discovery data, which is from the first graph. The second graph shows reserve.out and production.out.


cat discovery.txt | ruby exp.rb 0.07 | ruby exp.rb 0.07 | ruby exp.rb 0.07 > reserve.out
cat reserve.out | ruby exp.rb 0.08 >production.out

$ cat exp.rb

def exp(a, b)
rate = b
length = a.length
temp = 0.0
for i in 0..length do
output = (a[i].to_f + temp) * rate
temp = (a[i].to_f + temp) * (1.0 - rate)
puts output
end
end
exp(STDIN.readlines, ARGV[0].to_f)

Saturday, January 08, 2011

Terrain Slopes

Entropy makes its mark everywhere. Take the case of modeling topography. How can we model and thus characterize disorder in the earth's terrain? Can we actually understand the extreme variability we see?

If we consider that immense forces cause upheaval in the crust then we can reason that the energy can also vary all over the map, so to speak. The process that transfers potential energy into kinetic energy to first order has to contain elements of randomness. To the huge internal forces within the earth, generating relief textures equates to a kind of brownian motion in relative terms -- over geological time, the terrain amounts to nothing more than inconsequential particles to the earth's powerful internal engine.

In a related sense the process also resembles the pressure distribution in the earth's atmosphere, a classic application of maximum entropy that we can re-apply in the case of modeling terrain slope distributions.

Premise. We take the terrain slope S as our random variable (defined as rise/run). The higher the slope, the more energetic the terrain. Applying Maximum Entropy to a section of terrain, we can approximate the local variations as a MaxEnt conditional probability density function:
p(S|E) = (1/cE) * exp(-S/cE)
where E is the local mean energy and c is a constant of proportionality. But we also assume that the mean E varies over a larger area that we are interested in, as in the superstatistical sense of applying a prior distribution.
p(E) = k*exp(-k*E)
where k is another MaxEnt measure of our uncertainty in the energy spread over a larger area.

The final probability is an integral over the marginal distribution consisting of the conditional multiplied by the prior:
p(S) = integral p(S|E) *p(E) dE from E=0 to infinity
This integrates as a BesselK function of the zero order, K0, available on any spreadsheet program (see here for a similar derivation in an unrelated field).
p(S) = 2/S0 * K0(2*sqrt(S/S0))
The average value of the terrain slope for this distribution is simply the value S0.

Now we can try it on a large set of data. I downloaded all the DEM data for the 1 degree quadrangles (aka blocks/tiles) in the USA from the USGS web site. http://dds.cr.usgs.gov/pub/data/DEM/250/

This consists of post data at approximately 92 meter intervals (i.e. a fixed value of run) at 1:250,000 scale for the entire USA. I concentrated on the lower 48 and some spillover into Canada. I used curl to iteratively download each of the nearly 1000 quadrangle files on the server.

I then wrote a program to read the data from individual DEM files and calculate the slopes between adjacent posts and came up with an average slope (rise/run) of 0.039, approximately a 4% grade or 2.2 degrees pitch. I take the absolute values of all slopes so that the average is not zero.

The cumulative plot of terrain slopes for all 5 billion calculated slope points appears on the following chart (Figure 1). I also added the cumulative probability distribution of the BesselK model with the calculated average slope as the single adjustable parameter.

Figure 1: CDF of USA DEM data and the BesselK model with a small variation in S0 (+/-4% about the average 0.037 rise/run) demonstrating sensitivity to the fit.

This kind of agreement does not just happen because of coincidence. It occurs because random forces contribute to maximizing the entropy of the topography. Enough variability exists for the terrain to reach an ergodic limit in filling the energy-constrained state space.

As supporting evidence, it turns out that we can generate a distribution that maps well to the prior by estimating the average slope from the conditional PDF of each of the 922 quadrangle blocks and then plotting this aggregate data set as another histogram (see Figure 2).

Figure 2: Generation of the prior distribution by taking the average slope of each of the nearly 1000 quadrangles . The best fit generates a value of S0 (1/27=0.037) close to that used in Figure 1.

Practically speaking, we see the variability in slopes expressed at the two different levels. The entire USA at the integrated (BesselK model) level and the aggregated regions at the localized (exponential prior) level. These remain consistent as they agree on the single adjustable parameter S0 .

The modeled distribution has many practical uses for analysis, including transportation studies and planning. Obviously, vehicles traveling up slopes use a significant amount of energy and you might like to have a model to base an analysis on without having to rely on the data by itself. (As a caveat, I did not include any of the spatial correlations that must also exist and might prove useful as well)

Perusing the recent research, I couldn't find anyone that had previously discovered this simple model. Not that they haven't tried, coming up with a good slope distribution model seems to amount to a mini Holy Grail among geophysicists. I went as far as dropping $10 to downloading the first paper, which turned out to be a bust.
  1. Probabilistic description of topographic slope and aspect.
    G. Vico and A. Porporato, JOURNAL OF GEOPHYSICAL RESEARCH, VOL. 114, F01011, doi:10.1029/2008JF001038, 2009
    http://www.agu.org/journals/jf/jf0901/2008JF001038/2008JF001038.pdf
  2. Nonlinear Processes in Geophysics Multifractal earth topography.
    J.-S. Gagnon, S. Lovejoy, and D. Schertzer, Nonlin. Processes Geophys., 13, 541–570, 2006
    http://hal.archives-ouvertes.fr/docs/00/33/10/93/PDF/npg-13-541-2006.pdf
  3. PROPAGATION OF DEM UNCERTAINTY: AN INTERVAL ARITHMETIC APPROACH.
    G Gonçalves, XXII International Cartographic Conference, 2005
    http://www.cartesia.org/geodoc/icc2005/pdf/oral/TEMA7/Session%202/GIL%20GON%C7ALVES.pdf
  4. SAR interferometry and statistical topography.
    Guarnieri, A.M. IEEE Transactions on Geoscience and Remote Sensing, Dec 2002
    http://home.dei.polimi.it/monti/papers/montiguarnieri02.pdf
If someone wants to generate Monte Carlo statistics for the BesselK model without having to do the probability inversion, the algorithm turns out surprisingly simple. Draw two independent random samples from a uniform [0.0 .. 1.0] interval, apply the natural log to each, multiply them together, and then multiply by the S0 scaling constant. That will give the following cumulative if done 5 billion times, which is the same size as my USA DEM data sample.

Figure 3: Generation of the BesselK model via Monte Carlo.

The only statistical noise is at the 1e-9 level, same as in the DEM data.

Examples of some random-walk realizations drawing from a two-level model follow. The flatter regions occur more often reflecting the regional data.

Saturday, October 23, 2010

Understanding Recovery Factors

A recent TOD post on reserve growth by Rembrandt Kopelaar motivated this analysis.

The recovery factor indicates how much oil that one can recover from the original estimate. This has important implications for the the ultimately recovery resources, and increases in recovery rate has implications for reserve growth.

First of all, we should acknowledge that we still have uncertainty as to the amount of original oil in place, so that the recovery factor has two factors of uncertainty.

The cumulative distribution of reservoir recovery factor typically looks like the following S-shaped curve. The fastest upslope indicates the region closest to the average recovery factor.

Figure 1: Recovery Factor cumulative distribution function (from)

To understand the spread in the recovery factors, one has to first realize that all reservoirs have different characteristics. Some are more difficult to extract from and others have easier recovery factors. One of the principle first-order effects has to do with the size of the reservoir: bigger reservoirs typically have better recovery factors and as one reservoir engineer mentioned on TOD
"Reserve growth tends to happen in bigger fields because thats where you get the most bang for your buck"
So if we make the simple assumption that cumulative recovery factors (RF) have Maximum Entropy uncertainty or dispersion for a given Size:
P(RF) = 1-exp (-k*RF/Size)
this makes sense as the recovery factor will extend for larger fields.

Add to the mix that reservoir Sizes go approximately as (see here):
Pr(Size)= 1/(1+Median/Size)
Then a simple reduction in these sets of equations (with the key insight that RF ranges between 0 and 1, i.e. between 0 and 100%) gives us
P(RF) = 1 - exp(-k*RF*RF/(1-RF)/Median)
the ratio Median/k indicates the fractional average recovery factor relative to the median field size.

A set of curves for various k/Median values below:


Figure 2: Recovery Factor distribution functions assuming maximum entropy

Rembrandt provided some recovery factor curves originally supplied by Laherrere, and I fit these to the Median/k fractions below.

Figure 3: Recovery factor curves from Rembrandt's TOD post,
alongside the recovery factor model described here.

Laherrere also provided curves for natural gas, where recovery factors turn out much higher.

Figure 4: Recovery Factor distribution functions for natural gas.
Note that the recovery factor is much higher than for oil.
(Note: I had to fix the typo in the graph x-axis naming)

It looks like this derivation has strong universality underlying it. This remains a very simple and parsimonious model as it has only one sliding parameter. The parameter Median/k works in a scale-free fashion because both numerator and denominator have dimensions of size. This means that one can't muck with it that much -- as recovery factors increase, the underlying uncertainty will remain and the curves in Figure 2 will simply slide to the right over time while adjusting their shape. This will essentially describe the future reserve growth we can expect; the uncertainty in the underlying recovery factors will remain and thus we should see the limitations in the smearing of the cumulative distributions.

To reverse the entropic dispersion of nature and thus to overcome the recovery factor inefficiency, we will certainly have to expend extra energy.


Wednesday, October 20, 2010

Bird Surveys

This post either points out something pretty obvious or else it reveals something of practical benefit. You can judge for now.

I briefly made a reference to bird survey statistics when I wrote this post on econophysics and income modeling. I took a typical rank histogram of bird species abundance and fit it the best I could to a dispersive growth model, further described here. The generally observed trend follows that many species exist in the middle of abundance and relatively small numbers of species exist at each end of the spectrum -- few species exceedingly common (i.e. starling) and few species exceedingly rare (i.e. whooping crane). Since the bird data comes from a large area in North America, the best fit followed a meta-community growth model. The meta-community adjustment impacts the knee of the histogram curve and broadens the Preston plot, effectively smearing over geological ages that different species have had to adapt.

Figure 1: Preston plot (top) and
rank histogram (bottom) of relative bird species abundance

If we assume that the relative species abundance has a underlying model related to steady-state growth according to p(rate), where rate is the relative advantage for species reproduction and survival, then this should transitively might apply to disturbances to growth as well. Recently, I ran into a paper that actually tried to discern some universality in diverse growth papers, and it coincidentally used the bird survey data along with two economic measures of firm size and mutual fund size.
I did the best I could with the figures in the paper but eventually went to the source, ftp://ftpext.usgs.gov/pub/er/md/laurel/BBS/DataFiles/, and used data from 1997 to 2009.

I applied the same abundance distribution as before and came up with the fit below (see blue and red curves below, data and model respectively). That provided a sanity check, but Schwarzkopf and Farmer indicated that the year-to-year relative growth fluctuations should also obey some fundamental behavior through the distribution of this metric:
RelativeGrowth(Year) = n(Year+1) / n(Year)
Sure enough, and for whatever reason, the "growth" in the surveyed data does show as much richness as the steady state averaged abundance distribution. The relative growth in terms of a fractional yearly change sits alongside the relative abundance curve below (in green). Notice right off the bat that the distribution of fractional changes drops off much more rapidly.

Figure 2 : The red meta-model curve smears the median from 200 to 60000

I believe that this has a simple explanation having to do with Poisson counting statistics. When estimating fractional yearly growth, we consider that the rarer bird species having the lowest abundance will contribute most strongly to fluctuation noise on year-to-year survey data. Values flipping from 1 to 2 will lead to 100% growth rates for example. (We have to ignore movements from 1 to 0 and 0 to 1 as these growths become infinite.

I devised a simple algorithm that takes two extreme values (R greater than 1 and R less than 1 ) and the steady state abundance N for each species. The lower bound of:
R1 = R * (1-sqrt(2/N))/(1+sqrt(2/N))
and the upper bound becomes:
R2 = R * (1+sqrt(2/N))/(1-sqrt(2/N))
The term 1.4/sqrt(N) derives from Poisson counting statistics in that the relative changes become inversely related to the size of the sample. We double count in this case because we don't know whether the direction will go up or down, relative to R, a number close to unity.

(This has much similarity to the model I just used in understanding language adoption. Small numbers of adopters experience suppressing fluctuations as 1/sqrt(N))

Expanding on the scale, the results of this algorithm are shown in Figure 3.

Figure 3 : Model of yearly growth fluctuation in terms of a cumulative distribution function

Placing it in terms of a binned probability density function, the results look like the following plot. Note the high counts match closely the data simply because the 1/sqrt(N) is relatively small. Away from these points, you can see the general trend develop even though the data is (understandably) obscured by the same counting noise.

Figure 4 : The probability density function of yearly growth fluctuations.

As an essential argument to take home, consider that a counting statistics argument probably accounts for the yearly growth fluctuations observed. Before you make any other assertions, you likely have to remove this source of noise. Looking at Figure 3 & 4, you can potentially see a slight bias toward positive growth for certain lower abundance species. This comes at the expense of lower decline elsewhere, except for some strong declines in several other low abundance species. This may indicate the natural ebb and flow of attrition and recovery in species populations, with some of these undergoing strong declines. I haven't done this but it makes sense to identify the species or sets of species associated with these fluctuations.

Two puzzling points also stick out. For one, I don't understand why Schwarzkopf and Farmer didn't immediately discern this effect. Their underlying rationale may have some of the same elements but it gets obscured by their complicated explanation. They do use a resampling technique (on 40+ years worth of data) but I didn't see much of a reference to conventional counting statistics, only the usual hand-wavy Levy flight arguments. They did find a power law of around-0.3 instead of the -0.5 we used for Poisson, so they may generate something equivalent to Poisson by drawing from a similar Levy distribution. Overall I find this violates Occam's razor, at least for this set of bird data .

Secondly, it seems that these differential growth curves have real significance in financial applications. More of the automated transactions look for short duration movements and I would think that ignoring counting statistics could lead the computers astray.



Epilogue

As an aside, when I first pulled the data off the USGS server, I didn't look closely at the data sets. It turns out that the years 1994,1995,1996 were included in the data but appeared to have much poorer sampling statistics. From 1994 to 1996, the samples got progressively larger but I didn't realize this when I first collected and processed the data.

Figure 6 : CDF of larger data sample.
Note the strange hitch in the data growth fluctuation curve.

At the time, I figured that the slope had a simple explanation related to uncertainties in the surveying practice. I also saw some similarities to the uncertainties in stock market returns that I blogged about recently in an econophysics posting.

Say the survey delta time has a probability distribution with average time -- the T most likely related to the time between surveys:
pt(time) = (1/T)*exp(-time/T)
then we also assume that a surveyor tries to collect a certain amount of data, x, during the duration of the survey. We could characterize this as a mean, X, or some uniform interval. We don't have any knowledge of higher order moments to we just apply the Maximum Entropy Principle
px(x) = (1/X)*exp(-x/X)
The ratio between these two establishes the relative rate of growth, rate = X/T. We can derive the following cumulative quite easily:
P(rate) = T*rate/(T*rate +X)
The yearly growth rate fluctuations of course turn out as the second derivative of this function. We take one derivative to convert :
dp(rate)/drate = 2*T/X/(1+rate*T/X)^3
On a cumulative plot as in Figure 6, this shows a power-law of order 2 (see the orange curve). Near the knee of the curve, it looks a bit sharper. If we use a uniform distribution of px(x) up to some maximum sample interval, then it matches the knee better (see the dashed curve).

So the simple theory says that much of the observed yearly fluctuation may arise simply due to sampling variations during the surveying interval. Plotting as a binned probability density function, the contrast shows up more clearly in Figure 7. In both cases is fit to X/T = 60. This number is bigger than unity because it looks like every year, the number of samples increases (I also did not divide by 15, the number of years in the survey).

But of course, the reason this maximum entropy model works as well as it does came about from real variation in the sampling techniques. Those years from 1994 to 1996 placed enough uncertainty and thus variance in the growth rates to completely smear the yearly growth fluctuation distribution.

Figure 7 : PDF of larger sample which had sampling variations.
Note that this has a much higher width than Figure 4.

Only in retrospect when I was trying to rationalize why a sampling variation this large would occur in a seemingly standardized yearly survey, did I find the real source of this variation. Clearly, the use of the Maximum Entropy Principle explains a lot, but you still may have to dig out the sources of the uncertainty.

Can we understand the statistics of something as straightforward as a bird survey? Probably, but as you can see, we have to go at it from a different angle than that typically recommended. I will keep an eye out if it has more widespread applicability; for now it obviously requires countable discrete entities.

Saturday, October 16, 2010

Tower of Babel, How languages diversify

One pattern that has evaded linguists and cognitive scientists for some time relates to the quantitative distribution in human language diversity. Much like how plant and animal species diversify in a specific pattern, with very few species dominating within an ecosystem and relatively few species exceedingly rare, the same thing happens with natural languages. You find a few languages spoken by many people, and very few spoken seldomly,with the largest number occupying the middle.

Consider a simple model of language growth whereby adoption of languages occur over time by dispersion. The cumulative probability distribution for the number of languages is
P(n) = 1/(1+1/g(n))
This form derives from the application of the maximum entropy principle to any random variate where one only knows the mean in the growth rate and an assumed mean in the saturation level. I refer to this as entropic dispersion and have used this many applications before so I no longer feel a need to rederive this term every time I bring it up.

The key to applying entropic dispersion is in understanding the growth term g(n). In many cases n will grow linearly with time so the result will assume a hyperbolic shape. In another case, an exponential growth brought up by technology advances will result in a logistic sigmoid distribution. Neither of these likely explains the language adoption growth curve.

Intuitively one imagines that language adoption occurs in fits and starts. Initially a small group of people (at least two for arguments sake) have to convince other people on the utility of the language. But a natural fluctuation arises with small numbers as key proponents of the language will leave the picture and the growth of the language will only sustain itself when enough adopters come along and the law of large numbers starts to take hold. A real driving force to adoption doesn't exist, as ordinary people have no real clue as to what constitutes a "good" language, so that this random walk or Brownian motion has to play an important role in the early stages of adoption.

So with that as a premise, we have to determine how to model this effect mathematically. Incrementally we wish to show that the growth term gets suppressed by the potential for fluctuation in the early number of adopters. A weaker steady growth term will take over once a sufficiently large crowd joins the bandwagon.
dn = dt / (C/sqrt(n) + K)
In this differential formulation, you can see how the fluctuation term which goes as 1/sqrt(n) suppresses the initial growth until it reaches a steady state as the K term becomes more important. Integrating this term once and we get the implicit equation:
2*C*sqrt(n) + K*n = t
Plotting this for C=0.007 and K=0.000004, we get the following growth function.

Figure 1 : Growth function assuming suppression during early fluctuations

This makes a lot of sense as you can see that growth occurs very slowly until an accumulated time at which the linear term takes over. That becomes the saturation level for an expanding population base as the language has taken root.

To put this in stochastic terms assuming that the actual growth terms disperse across boundaries, we get the following cumulative dispersion (plugging the last equation into the first equation to simulate an ergodic steady state):
P(n) = 1/(1+1/g(n)) = 1/(1+1/(2*C*sqrt(n) + K*n))
I took two sets of the distribution of population sizes of languages (DPL) of the Earth’s actually spoken languages from the references below and plotted the entropic dispersion alongside the data. The first reference provides the DPL in terms of a probability density function (i.e. the first derivative of P(n)) and the second as a cumulative distribution function. The values for C and K were as used above. The fit works parsimoniously well and it makes much more sense than the complicated explanations offered up previously for language distribution.


Figure 2 : Language diversity (top) probability density function (below) cumulative. The entropic dispersion model in green.

In summary, the two pieces to the puzzle are assuming dispersion according to the maximum entropy principle, and a suppressed growth rate due to fluctuations during the early adoption. This gives two power law slopes in the cumulative; 1/2 in the lower part of the curve and 1 in the higher part of the curve.

References
  1. Scaling Relations for Diversity of Languages (2008)
  2. Competition and fragmentation: a simple model generating
    lognormal-like distributions
    (2009)
  3. Scaling laws of human interaction activity (2009)
    Discussions on the fluctuation term.






NY Math Teacher Howard A. Stern Uses Ingenuity To Overcome Failure Statistics

The public school teacher highlighted in the linked article has this to say:

"So much of math is about noticing patterns," says Stern, who should know. Before becoming a teacher, he was a finance analyst and a quality engineer.

I always try to seek interesting patterns in the data, but more to the point, I try to actually understand the behavior from a fundamental perspective.

One way Stern uses technology is by helping his students visualize his lessons through the use of graphing calculators.

Stern has it exactly right, if we treat knowledge seeking as a game, like a suduko puzzle, we can attract more people to science in general.

I think that the pattern in language distribution has similarities to that of innovation adoption as well, similar to what Rogers describes in his book "Diffusions of Innovations". I will try to look into this further as I think the dispersive arguments holds some promise as an analytical approach.




Tuesday, October 12, 2010

Stock Market as Econophysics Toy Problem

Consider a typical stock market. It consists of a number of stocks that show various rates of growth, R. Say that these have an average growth rate, r. Then by the Maximum Entropy Principle, the probability distribution function is:
pr(R) = 1/r*exp(-R/r)
We can solve this for an expected valuation, x, of some arbitrary stock after time, t.
n(x|t) = ∫ pr(R) δ(x-Rt) dR
This reduces to the marginal distribution:
n(x|t) = 1/(rt) * exp(-x/(rt))
In general, the growth of a stock only occurs over some average time, τ, which has its own Maximum Entropy probability distribution:
p(t) = 1/τ *exp(-t/τ)
So when the expected growth is averaged over expected times we get this integral:
n(x) = ∫ n(x|t) p(t) dt
We have almost solved our problem, but this integration reduces to an ugly transcendental function K0 otherwise known as a modified Bessel function of the second kind and order 0.
n(x) = 2/(rτ) * K0(2*sqrt(x/(rτ) ))
Fortunately, the K0 function is available on any spreadsheet program (Excel, OpenOffice, etc) as the function BESSELK(X;0).

Let us try it out. I took 3500 stocks over the last decade (since 1999), and plotted the histogram of all rates of return below.


The red line is the Maximum Entropy model for the expected rate of return, n(x) where x is the rate of return. This has only a single adjustable parameter, the aggregate value rτ. We line this up with the peak which also happens to coincide with the mean return value. For the 10 year period, rτ = 2, essentially indicating an average doubling in the valuation of the average stock. This doesn't say anything about the stock market as a whole, which turned out pretty flat over the decade, only that certain high-rate-of-return stocks upped the average (much like the story of Bill Gates entering a room of average wage earners).

The following figure shows a Monte Carlo simulation where I draw 3500 samples from a rτ value of 1. This gives an idea of the amount of counting noise we might see.

I should point out that the MaxEnt model shows very little by way of excessively fat tails at high returns. A stock has to both survive a long time and grow at a rapid enough rate to get too far out in the tail. You see that in the data as only a couple of the stocks have returns greater than 100x. I don't rule out the possibility of high-return tails but we would need to put even more disorder in the pr(R) distribution than the MaxEnt provides for a mean return rate. The actual data seems a bit sharper and has more outliers than the Monte Carlo simulation, indicating some subtlety that I probably have missed. Yet, this demonstrates how to use the Maximum Entropy Principle most effectively -- you should only include the parameters that you can defend. From this minimal set of constraints you observe how far this can take you. In this case, I could only defend some concept of mean in rτ and then you get a distribution that reflects the uncertainty you have in the rest of the parameter space.

The stock market with its myriad of players follows an entropic model to first-order. All the agents seem to fill up the state space so that we can get a parsimonious fit to the data with an almost laughably simple econophysics model. For this model, the distribution curve on a log-log plot will always take on exactly that skewed shape (excepting for statistical noise of course) -- it will only shift laterally depending the general direction of the market.

The stock market becomes essentially a toy problem, no different than the explanation of statistical mechanics you may encounter in a physics course.

Has anyone else figured this out?

[EDIT]
Besides the slight fat-tail, which may be due to potential compounding growth similar to that found in individual incomes, the sharper peak may also have a second-order basis. This could result from a behavior called implied correlation which measures the synchronized behavior among stocks in the market. According to recent measurements, the correlation has hit all-time highs (the last around October 5). Qualitatively a high correlation would imply that the average growth rate r would show much less dispersion in that variate, and the dispersion would only apply to the length of time, t, that a stock rides the crest. Correlation essentially removes one of the parameters of variability from the model and the distribution sharpens up. The stock distribution then becomes the following simple damped exponential instead of the Bessel.
n(x) = 1/(rτ) * exp(-x/(rτ))
The figure below shows what happens when about 40% of the stocks would show this correlation (in green). The other 60% show independent variability or dispersion in the rates as per the original model.

I don't think this makes the collective stock behavior and more complex. I think it makes it simpler in fact. Implied correlation actually points to the future in the stock market. Dispersion in stock returns will narrow as all stocks move in unison. It makes it even more of a toy, with computers potentially dictating all movements.

Implied correlation has risen in the last few years (from here)




References
I personally don't deal with the stock market, preferring to watch it from afar. I found a few papers that try to understand this effect, but most just try to brute force fit it to various distributions.
  1. Analysis of same data from Seeking Alpha
  2. This paper is close but no cigar. It looks like they "detrend" the data to get of the skew, which I think misses the point :
    "Microscopic origin of non-Gaussian distributions of financial returns" (2007)
  3. This book has info on the Bessel distribution:
    "Return distributions in finance", J. Knight and S. Satchell
  4. Interesting from an econophysics perspective.
  5. This book appears worthless:
    "Fat-Tailed and Skewed Asset Return Distributions", S.T. Rachev, F.J. Fabozzi, C Menn

Thursday, October 07, 2010

Black-Scholes

Games for suits. This post has no relevance in the greater scheme of things.

As a premise, consider that the financial industry needs instruments of wealth creation that work opposite to that of stocks. For example, when stock prices remain low, then something else else should take up the slack -- otherwise important people won't make money. Wall Street invented derivatives, options, and other hedging methods to serve as an investment vehicle under these conditions.

We can try to show how this works.

If S is the stock price, then V ~ 1/S is an example "derivative" that works as a reciprocal to price. This becomes the normative description and defines the basic objective as to what the investment class wants to achieve -- an alternate form of income that balances swings in stock price, potentially reducing risk.

Further, we make the assumption that the derivative will grow or decline over time.

So we get:
V(S,t) = K/S * exp(a*t)

If a > 0 then the derivative will grow and if a is less than zero than the derivative will damp out over time. The term K is a constant of proportionality.

The infamous Black-Scholes equation supposedly governs the behaviour of derivatives with respect to stock prices (and time) according to this invariant:


The particulars may change but this formulation describes THE equation that Merton, Black, and Scholes devised to aid investors in making hedged investments using options and other derivatives. The way to read this equation is to note that derivatives will drift or diffuse into the space of the stock price, and proportional to the stock price itself. The drift term occurs due to the interest rate r providing a kind of forcing function. The derivative, V, can also grow due to pure interest rate compounding, as seen in the last term. Whether this actually holds or not, I don't really care as I don't participate in these schemes.

So if you look at it from a very neutral perspective you come up with some interesting observations. For one, you can trivially solve this partial differential equation for a generally disordered set of initial conditions. And the solution appears exactly the same as my first expression above:
V(S,t) = K/S * exp(a*t)

To verify this assertion, we test the expression in the B-S equation, substituting the partial derivatives as we go along.

a*K/S* exp(a*t) + 1/2(σS)2*2*K/S2*exp(a*t) - rS*K/S2*exp(a*t) - r*K/S * exp(a*t) = 0

Cancelling out all common factors:

a/S + 1/2(σS)2*2/S2- rS/S2 - r/S = 0

Reducing the value of S

a/S + 1/2(σ)2*2/S- r/S - r/S = 0

a + 1/2(σ)2*2- r - r = 0

gets us to:

a = 2*r - σ2

The term r is proportional to interest, and σ is volatility or variance in stock price.

So this simple expression that I just cooked up will obey Black-Scholes as long as we choose the constant a term to correspond to the interest and volatility as shown above, and we get:
V(S,t) = K/S * exp((2*r - σ2)*t)

Note that if the volatility (i.e. diffusion) stays high relative to interest, the exponential will damp out with time. If interest (i.e. drift) goes higher than volatility, the exponential will accelerate, creating a huge amount of paper gains.

At this point someone will argue that this solution does not reflect reality. I beg to differ. When you make your bed of mathematical box-springs, you have to lie in it. This solution to Black-Scholes is perfectly fine as it gives a steady-state picture of the partial differential equation. The diffusional and drift components cancel with the right mix of production vs destruction in derivative wealth. If you don't like it, then come up with something different than that specific B-S equation.

I have a feeling that all the seeming complexity of financial quantitative analysis with its Ito calculus and Wiener processes acts as a shiny facade to a simple reality. The math exists to model the inverse relationship of stocks to derivatives. If this didn't happen -- and the lords of high finance absolutely require this relationship to make money -- the math as formulated would vanish from their toolbox. In other words, the math only exists to justify what the financial operatives want to see happen. Everyone appears to implicitly buy this mathematical artifice hook, line, and sinker.

Quantitative analysis and the "quants" who work it have created a fantasy land, where they do not want you to know how easily their quaint ornate universe reduces to a simple function. If they admitted to the charade, the mystery would all disappear and they would no longer have jobs.

Economics and finance does not constitute a science. In science you may need to use partial differential equations. For example, the Fokker-Planck equation shows up quite often -- which incidentally, the Black-Scholes equation shows some similarity to and the quant proponents of B-S certainly like to play up -- but it typically applies to real, physical systems where you use it to try to understand nature, not trying to model some artificial game-like behavior.

I can edit my solution into the Wikipedia page for Black-Scholes and I will bet that someone will immediately remove it. I harbor no illusions. The financial industry depends on the absence of real knowledge to achieve their objectives.

That explains why economics and finance do not classify as sciences; absolute truth does not matter to economists and financiers, only the art of deconstructing profit and the craft of phantom wealth creation does.

Please address editorial comments to: Postings, Main Incinerator, Department of Sanitation, North River Piers, New York, N.Y. 10019.