Archive for category Basic stats & math

Models responsible for whacky weather

Watching Brazilian supermodel Gisele Bundchen sashay across the Olympic stadium in Rio reminded me that, while these fashion plates are really dishy to view, they can be very dippy when it comes to forecasting.  Every time one of our local weather gurus says that their models are disagreeing, I wonder why they would ask someone like Gisele.  What do she and her ilk know about meteorology?

There really is a connection between fashion and statistical models—the random walk.  However, this movement is more like that of a drunken man than a fashionably calculated stroll down the catwalk.  For example, see this video by an MIT professor showing 7 willy-nilly paths from a single point.

Anyways, I am wandering all over the place with this blog.  Mainly I wanted to draw your attention to the Monte Carlo method for forecasting.  I used this for my MBA thesis in 1980, burning up many minutes of very expensive mainframe computer time in the late ‘70s.  What got me going on this whole Monte Carlo meander is this article from yesterday’s Wall Street Journal.  Check out how the European models did better than the American ones at predicting the path of Hurricane Sandy.  Evidently the Euros are on to something, as detailed in this Scientific American report at the end of last year’s hurricane season.
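
To make the random-walk and Monte Carlo connection concrete, here is a minimal Python sketch (not tied to any real weather model, with the step count picked arbitrarily) that traces a handful of willy-nilly paths from a single starting point and then looks at the spread of their endpoints:

    import random

    def random_walk(steps, start=0.0, step_size=1.0):
        """One random-walk path: each step moves up or down by step_size with equal probability."""
        path = [start]
        for _ in range(steps):
            path.append(path[-1] + random.choice([-step_size, step_size]))
        return path

    # Monte Carlo flavor: simulate many paths and summarize where they end up.
    paths = [random_walk(steps=50) for _ in range(7)]   # 7 paths, as in the MIT video
    endpoints = sorted(path[-1] for path in paths)
    print("Endpoints after 50 steps:", endpoints)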

I have a random thought for improving the American models—ask Cindy Crawford.  She graduated as valedictorian of her high school in Illinois and earned a scholarship for chemical engineering at Northwestern University.  Cindy has all the talents to create a convergence of fashion and statistical models.  That would be really sweet.

No Comments

Big data puts an end to the reign of statistics

Michael S. Malone of the Wall Street Journal proclaimed last month* that

One of the most extraordinary features of big data is that it signals the end of the reign of statistics.  For 400 years, we’ve been forced to sample complex systems and extrapolate.  Now, with big data, it is possible to measure everything…

Based on what I’ve gathered (admittedly only a small and probably unrepresentative sample), I think this is very unlikely.  Nonetheless, if I were a statistician, I would reposition myself as a “Big Data Scientist”.

*”The Big-Data Future Has Arrived”, 2/22/16.

1 Comment

Fisher-Yates shuffle for music streaming is perfectly random—too much so for some

The headline “When random is too random” caught my eye when the April issue of Significance, published by The Royal Statistical Society, crossed my desk the other day.  It makes no statistical sense, but the music-streaming service Spotify has abandoned the truly random Fisher-Yates shuffle.  The problem is that true randomization naturally produces clusters: the same tracks can turn up two or even three days in a row, occasionally back-to-back.  Although this happens purely by chance, Spotify consumers complained.

Along similar lines, I have been aggravated by screen savers that randomly show family photos.  It really seems that some get repeated too often, even though it’s only by chance.  For details on how Spotify’s software engineer Lukáš Poláček tweaked the Fisher-Yates shuffle to spread songs out more evenly, see this blog post.

“I think Fisher-Yates shuffle is one of the most beautiful random algorithms and it’s amazing that such a complicated problem can be solved in 3 lines of code in some programming languages.  And this is accomplished using the optimal number of operations and optimal amount of randomness.”

– Lukáš Poláček (who nevertheless, due to the fickleness of music listeners, tweaked the algorithm to introduce a degree of unrandomness and reduce the natural clustering)
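
For the curious, here is a minimal Python sketch of the classic, untweaked Fisher-Yates shuffle (not quite the three-liner Poláček alludes to, and the playlist names are made up).  Every permutation of the list comes out equally likely, which is precisely what listeners found objectionable:

    import random

    def fisher_yates_shuffle(items):
        """Shuffle a list in place so that every permutation is equally likely."""
        for i in range(len(items) - 1, 0, -1):
            j = random.randint(0, i)                 # pick from the not-yet-fixed portion
            items[i], items[j] = items[j], items[i]  # swap it into the final position
        return items

    playlist = [f"track_{n}" for n in range(1, 11)]
    print(fisher_yates_shuffle(playlist))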

No Comments

“naked statistics” not very revealing

One of my daughters gave me a very readable book by economist Charles Wheelan titled “naked statistics, Stripping the Dread from the Data”.  She knew this would be too simple for me, but figured I might pick up some ways to explain statistics better, which I really appreciate.  However, although I very much liked the way Wheelan keeps things simple and makes it fun, his book never did deliver any nuggets that could be mined for my teachings.  Nevertheless, I do recommend “naked statistics” for anyone who is challenged by this subject.  It helps that the author is not a statistician. ; )

By the way, there is very little said in this book about experiment design.  Wheelan mentions in his chapter on “Program Evaluation” the idea of a ‘natural experiment’, that is, a situation where “random circumstances somehow create something approximating a randomized, controlled experiment.”  So far as I am concerned, “natural” (happenstance) data and results from an experiment cannot be mixed, so “natural experiment” is an oxymoron; still, I get the point of exploiting an unusually clean contrast that is ripe for the picking.  I only advise continued skepticism toward any results that come from uncontrolled variables.*

*Wheelan cites this study in which the author, economist Adriana Lleras-Muney, made use of a ‘quasi-natural experiment’ (her term) to conclude that “life expectancy of those adults who reached age thirty-five was extended by an extra one and a half years just by their attending one additional year of school” (quote from Wheelan).  Really!?

No Comments

Educational fun with Galton’s Bean Machine

This blog on Central limit theorem animation by Nathan Yau brought back fond memories of a quincunx (better known as a bean machine) that I built to show operators how results can vary simply by chance.  It comprised push-pins laid out in the form of Pascal’s triangle on a board overlaid with clear acrylic.  I’d pour several hundred copper-coated BBs in through a funnel and they would fall into the bins at the bottom in the shape of a nearly normal curve.

Follow the link above to a virtual quincunx that you can experiment with by changing the number of bins.  To see how varying ball diameters affect the results, check out this surprising video posted by David Bulger, Senior Lecturer, Department of Statistics, Macquarie University, Sydney, Australia.
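
For anyone without push-pins and BBs handy, a few lines of Python reproduce the effect; this is a rough sketch, with the ball count and number of pin rows picked arbitrarily:

    import random
    from collections import Counter

    def quincunx(balls=500, rows=12):
        """Simulate a bean machine: each ball bounces left or right at every row of pins."""
        bins = Counter()
        for _ in range(balls):
            rightward_bounces = sum(random.choice([0, 1]) for _ in range(rows))
            bins[rightward_bounces] += 1
        return bins

    # Crude text histogram (one star per five balls) -- the bars trace a nearly normal curve.
    for slot, count in sorted(quincunx().items()):
        print(f"{slot:2d} {'*' * (count // 5)}")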

No Comments

Random thoughts

The latest issue of Wired magazine provides a great heads-up on random numbers by Jonathan Keats.  Scrambling the order of runs is a key to good design of experiments (DOE)—this counteracts the influence of lurking variables, such as changing ambient conditions.

Designing an experiment is like gambling with the devil: only a random strategy can defeat all his betting systems.

— R.A. Fisher

Along those lines, I watched with interest when weather forecasts put Tampa at the bulls-eye of the projected track for Hurricane Isaac.  My perverse thought was that this might be the best place to be, at least early on when the cone of uncertainty is widest.

In any case, one does best by expecting the unexpected.  That gets me back to the topic of randomization, which turns out to be surprisingly hard to do properly, considering the natural capriciousness of weather and life in general.  When I first got going on DOE, I pulled numbered slips of paper out of my hard hat.  Then a statistician suggested I go to a phone book and cull the last four digits of phone numbers from whatever page fell open haphazardly.  Later I graduated to a table of random numbers (an oxymoron?).  Nowadays I let my DOE software lay out the run order.
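
For what it is worth, the software-age version of pulling slips out of a hard hat takes only a few lines of code.  Here is a rough sketch (a hypothetical three-factor, two-level factorial, not output from any particular DOE package) that randomizes the run order:

    import random
    from itertools import product

    # All 8 combinations of three factors at coded low/high (-1/+1) levels.
    design = list(product([-1, 1], repeat=3))

    # Assign run numbers at random to counteract lurking variables such as ambient drift.
    run_order = list(range(1, len(design) + 1))
    random.shuffle(run_order)

    for run, levels in sorted(zip(run_order, design)):
        print(f"Run {run}: factor levels {levels}")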

Check out how Conjuring Truly Random Numbers Just Got Easier, including the background by Keats on pioneering work in this field by British (1927) and American (1947) statisticians.  Now the Australians have leap-frogged (kangarooed?) everyone, evidently, with a method that produces 5.7 billion “truly random” (how do they know?) values per second.  Rad mon!


No Comments

Statisticians no more—now “data scientists”

I spent a week earlier this month at the Joint Statistical Meetings (JSM)—an annual convocation of “data scientists”, as some of these number crunchers now deem themselves.  But most statisticians remain ‘old school’ as evidenced by this quote:

“Some time during the past couple of years, statistics became data science’s older, more boring sibling that always played by the rules.”

— Nathan Yau*

I tend to agree—being suspicious of changes in titles as a cover for shenanigans.  It seems to me that “data science” provides a smoke screen for taking unwarranted leaps from shaky numbers.  As the shirt sold at JSM by the American Statistical Association (ASA) says, “friends don’t let friends extrapolate.”

*Incorrectly attributed initially (my mistake) to Carnegie Mellon statistics professor Cosma Shalizi, who was credited by Yau for speaking up on this subject.

2 Comments

Where the radix point becomes a comma

Prompted by an ever-growing flow of statistical questions from overseas, Stat-Ease Consultant Wayne Adams recently circulated this Wikipedia link that provides a breakdown of which countries use a decimal point versus a comma for the radix point—the separator between the integer part and the fractional part of a number.

For more background on decimal styles over time and place see this Science Editor article by Amelia Williamson.  It credits Scottish mathematician John Napier* with being the first to use a period.  However, it seems that he later wavered and used a comma, thus setting the stage for the comma as an alternative.  Given that commas are also used to separate thousands from millions, millions from billions, and so on, numbers can easily be misinterpreted by several orders of magnitude if you do not keep a sharp eye on the source.
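
As a quick illustration of how easily this trips up software as well, here is a minimal Python sketch; the parse_number helper is hypothetical, not a call from any real library, and it shows the same string shifting by three orders of magnitude depending on which character is treated as the radix point:

    def parse_number(text, decimal_sep=".", group_sep=","):
        """Convert a number string to float, given the radix point and grouping separator."""
        return float(text.replace(group_sep, "").replace(decimal_sep, "."))

    value = "1.234"
    print(parse_number(value, decimal_sep=".", group_sep=","))  # point-as-radix reading: 1.234
    print(parse_number(value, decimal_sep=",", group_sep="."))  # comma-as-radix reading: 1234.0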

So, all you math & stats boffins—watch it!

*As detailed in this 2009 blog, I first learned of this fellow from seeing his bones on display at IBM’s Watson Research Center in New York.

No Comments

Obscurity does not equal profundity

“GOOD with numbers? Fascinated by data? The sound you hear is opportunity knocking.” This is how Steve Lohr of the New York Times leads off his article in today’s Sunday paper on The Age of Big Data. Certainly the abundance of data has created a big demand for people who can crunch numbers. However, I am not sure the end result will be nearly as profitable as employers may hope.

“Many bits of straw look like needles.”

– Trevor Hastie, Professor of Statistics, Stanford University, co-author of The Elements of Statistical Learning (2nd edition).

I take issue with extremely tortuous paths to complicated models based on happenstance data.  This can be every bit as bad as oversimplifications such as relying on linear trend lines (re Why you should be very leery of forecasts). As I once heard DOE guru George Box say (in regard to overly complex Taguchi methodologies): Obscurity does not equal profundity.

For example, Lohr touts the replacement of earned run average (ERA) with “SIERA”—the Skill-Interactive Earned Run Average. Get all the deadly details here from the inventors of this new pitching-performance metric. In my opinion, baseball itself is already complicated enough (try explaining it to someone who only follows soccer) without going to such statistical extremes to assess players.

The movie “Moneyball” being up for Academy Awards is stoking the fever for “big data.” I am afraid that, after all is said and done, the call may be for “money back.”

3 Comments

Extracting Sunbeams from Cucumbers

With this intriguing title, Richard Feinberg and Howard Wainer draw readers of Volume 20, Number 4 into what might have been a dry discourse: how contributors to The Journal of Computational and Graphical Statistics rely mainly on tables to display data.  Given that “Graphical” is in the title of this publication, it raises the question of whether this method for presenting statistics really works.

When working on the committee that developed the ASTM 1169-07 Standard Practice for Conducting Ruggedness Tests, I introduced the half-normal plot for selecting effects from two-level factorial experiments.  Most of the committee favored this, but one individual – a professor emeritus from a top school of statistics – resisted the introduction of this graphical tool.  He believed that only numerical methods, specifically analysis of variance (ANOVA) tables, could support objective decisions for model selection.  My comeback was to dodge the issue by simply using both graphs and tables – this need not be an either/or choice.  Why not do both, or merge them by putting numbers onto graphs – the best of both worlds?
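
For readers who have never seen one, here is a rough Python sketch of a half-normal plot of effects, using made-up estimates from a hypothetical two-level factorial; the few large effects that pull away from the near-zero cluster are the candidates to keep in the model:

    import numpy as np
    from scipy import stats
    import matplotlib.pyplot as plt

    # Hypothetical effect estimates from a two-level factorial experiment (made-up numbers).
    effects = {"A": 11.2, "B": -0.8, "C": 9.5, "AB": 0.6, "AC": -1.1, "BC": 0.3, "ABC": -0.4}

    # Order the absolute effects from smallest to largest.
    labels, values = zip(*sorted(effects.items(), key=lambda kv: abs(kv[1])))
    abs_effects = np.abs(values)
    n = len(abs_effects)

    # Matching half-normal quantiles for the ordered absolute effects.
    quantiles = stats.halfnorm.ppf((np.arange(1, n + 1) - 0.5) / n)

    plt.scatter(quantiles, abs_effects)
    for q, e, name in zip(quantiles, abs_effects, labels):
        plt.annotate(name, (q, e))
    plt.xlabel("Half-normal quantile")
    plt.ylabel("|Effect|")
    plt.title("Half-normal plot of effects (hypothetical data)")
    plt.show()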

“A heavy bank of figures is grievously wearisome to the eye, and the popular mind is as incapable of drawing any useful lessons from it as of extracting sunbeams from cucumbers.”

— Economists (brothers) Farquhar and Farquhar (1891)

In their article, which can be seen here, Feinberg and Wainer take a different tack (the path of least resistance?): make tables look more like graphs.  Here are some of their suggestions for doing so (a small code sketch applying a few of them follows the list):

  • Round data to 3 digits or less.
  • Line up comparable numbers by column, not row.
  • Provide summary statistics, in particular medians.
  • Don’t default to alphabetical or some other arbitrary order: Stratify by size or some other meaningful attribute.
  • Call out data that demands attention by making it bold and/or bigger and/or boxing it.
  • Insert extra space between rows or columns of data where they change greatly (gap).
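
Here is a rough sketch applying a few of those suggestions with pandas; the country names and export figures are made up purely for illustration:

    import pandas as pd

    # Made-up arms-export figures, just to exercise a few of the suggestions above.
    df = pd.DataFrame({
        "country": ["Aland", "Borduria", "Carpathia", "Dorne"],
        "exports": [1234.5678, 98.7654, 4567.8912, 321.0987],
    })

    df["exports"] = df["exports"].round(1)            # fewer digits to wade through
    df = df.sort_values("exports", ascending=False)   # stratify by size, not by alphabet
    print(df.to_string(index=False))
    print("Median exports:", df["exports"].median())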

Check out the remodeled table on arms transfers, which makes it clear that, unlike the uptight USA, the laissez-faire French will sell to anyone.  It would be hard to dig that nugget out of the original data compilation.

No Comments