October 16, 2009

Dense-Subset Break-the-Bank Challenge

I’m preparing for my first global travel for global health, but the net is paying attention to a paper that I think I’ll like, and I want to mention it briefly before I fly.

Computational Complexity and Information Asymmetry in Financial Products is 27 pages of serious TCS, but it is so obviously applicable that people outside of our particular ivory tower, and even outside of academia entirely are blogging and twittering about it, and even reading it!

Freedom to Tinker has a nice summary of this paper, if you want to know what it’s about in a hurry.

Mike Trick makes the salient observation that NP-hard doesn’t mean computers can’t do it. But the assumption that this paper is based on is not about worst-case complexity; it is, as it should be, based on an assumption about the average-case complexity of a particular optimization problem over a particular distribution.

As it turns out, this is an average-case combinatorial optimization problem that I know and love, the densest subgraph problem. My plan is to repeat the problem here, and share some Python code for generating instances of it. Then, you, me, and everyone, can have a handy instance to try optimizing. I think that this problem is pretty hard, on average, but there is a lot more chance of making progress on an algorithm for it than for cracking the P versus NP nut. Keep reading →

October 10, 2009

The Two Rules of Program Optimization

Wow, where does the day go? I spent all my non-meeting time debugging something. At least I fixed it before 5 PM.

The details of the problem are boring, but the whole ordeal could have been avoided if I had just followed the two rules of optimizing software in my Generic Disease Modeling System. What are they?

      First Rule of Program Optimization: Don’t do it
      Second Rule of Program Optimization (for experts only!): Don’t do it yet

Maybe next week I’ll get a second to write about the good kind of optimization; my statistical physics friends have posted an article on the arxiv which I am a co-author on, about an application of bounded-depth minimum spanning trees, Clustering with Shallow Trees.

October 5, 2009

Conference you should know about

This weekend marks the submission of my first “Global Health” paper. Congratulations to me! And many, many thanks to all the people who have worked with me to make it happen. I’ll go into details sometime in the future, first let me see how things go in the refereeing process.

While I was over-working on that business, I got an interesting Call-for-Papers forwarded from global health/AI researcher Emma Brunskill. The AAAI Spring Symposium on Artificial Intelligence for Development (AI-D) is an effort to build a community of people applying computer science and artificial intelligence in less-developed settings.

TCS people, don’t let the “AI” in their title turn you off. Eric Horvitz says that this is for all of us. Keep reading →

September 17, 2009

Top

I don’t feel like having that post about how big things are brewing in US health care reform on the top of my blog anymore, so here is a quick replacement: a ranking paper that caught my eye recently on arxiv, where computer scientists is applied to politics: On Ranking Senators By Their Votes, by my fellow CMU alum, Mugizi Rwebangira (@rweba on twitter).

September 5, 2009

Holiday reading

Whoops, I got busy again and didn’t have time to make new pictures of TFR vs HDI for Rif and Tanja, let alone fix the Bayes factor estimation code or implement the nested sampling version (which I think will be the cool way to estimate evidence). But coming soon: How MCMC is tying my new work in Health Metrics to my education in Operations Research. That will be in two weeks, at best.

Until then, here is some light reading to get ready for a big week of US healthcare reform debate: Get Sick, Get Out, a survey conducted by lawyers interested in catastrophic medical payments and their connection to housing forclosures. It’s 40 pages long, but it’s in legal-journal format, where they have like 10 words per page if you skip the footnotes. From the abstract:

Half of all respondents (49%) indicated that their foreclosure was caused in part by a medical problem, including illness or injuries (32%), unmanageable medical bills (23%), lost work due to a medical problem (27%), or caring for sick family members (14%).

I’m excited for the next week of healthcare reform debates. When my most jaded friends are forwarding me Moveon.org videos (and I’m listening to 4 minutes of recent REM), I know something unusual is going on.

Happy labor day weekend!

August 25, 2009

MCMC in Python: PyMC for Bayesian Model Selection

(Updated 9/2/2009)

I never took a statistics class, so I only know the kind of statistics you learn on the street. But now that I’m in global health research, I’ve been doing a lot of on-the-job learning. This post is about something I’ve been reading about recently, how to decide if a simple statistical model is sufficient or if the data demands a more complicated one. To keep the matter concrete (and controversial) I’ll focus on a claim from a recent paper in Nature that my colleague, Haidong Wang, choose for our IHME journal club last week: Advances in development reverse fertility declines. The title of this short letter boldly claims a causal link between total fertility rate (an instantaneous measure of how many babies a population is making) and the human development index (a composite measure of how “developed” a country is, on a scale of 0 to 1). Exhibit A in their case is the following figure:

An astute observer of this chart might ask, “what’s up with the scales on those axes?” But this post is not about the visual display of quantitative information. It is about deciding if the data has a piecewise linear relationship that Myrskyla et al claim, and doing it in a Bayesian framework with Python and PyMC. But let’s start with a figure where the axes have a familiar linear scale! Keep reading →

August 14, 2009

August is Too-Many-Projects Month

(Tap… tap… tap… is this thing on? Good.)

July was vacation month, where I went on a glorious bike tour of the Oregon/California coast, and learned definitively that I don’t like biking on the side of a highway all day. Don’t worry, I escaped in Coos Bay and took trains and buses between Eugene, Santa Cruz, Berkeley, and SF for a vacation more my speed.

But now that I’m back, August is turning out to be project month. I have 3 great TCS applications to global health in the pipeline, and I have big plans to tell you about them soon. But one mixed blessing about these applications is that people actually want to see the results, like, yesterday! So first I have to deal with the results, and then I can write papers and blogs about the techniques.

Since Project Month is a little over-booked with projects, I’m going to have to triage one today. You’ve heard of the NetFlix Challenge, right? Well, github.com is running a smaller scale recommendation contest, and I was messing around with personal page rank, which seems like a fine approach for recommending code repositories to hackers. I haven’t got it working very well (best results, 15% of holdout set recovered), but I was having fun with it. Maybe someone else will take it up, let me know if you get it to work; networkx + data = good times.

    f = open('download/data.txt')
    for l in f:
        u_id, r_id = l.strip().split(':')
        G.add_edge(user(u_id), repo(r_id))

[get the code]

June 25, 2009

ID Modeling Summer School

I’ve been spending the week at the Infectious Disease Modeling Summer School here at UW. It’s very interesting, and good for me to learn more about how people in my new field think (especially people in my new field, outside of my little institute…)

I’ve discovered a pet peeve during this week of presentations, though. I’ve seen a lot of numerical examples where the numbers work out perfectly… a little too perfectly. If you split 1000 people into an experimental and control group by choosing a random subset of 500, fine. But if you look within that group to see how many have a trait that occurs independently with probability 0.2, you do not often find exactly 100 in group A and 100 in group B. I think a little more complexity in the numbers makes the example easier to understand.

I’m sure that you, my loyal reader, can generate random numbers from a multitude of distributions, if you wanted to spend the time. But if you’re busy, busy, busy, then you can have wolfram alpha do all the work. It actually comes through for that one: “sample Binomial(500, .2)“.

June 24, 2009

US Health Care Costs, cont.

I wrote two months ago about the mysterious differences in health care costs that I found so intriguing in a talk by Jonathan Skinner. (That was two months ago? Really?) Since then, the surgeon/author Atul Gawande has brought the mystery to the national stage. In a long story for the New Yorker, he gave the non-technical version of Skinner’s talk, and today he addressed some of the feedback that this article has received over the last month.

His short answer to the mystery is this:

Analysis of Medicare data by the Dartmouth Atlas project shows the difference is due to marked differences in the amount of care ordered for patients—patients in McAllen receive vastly more diagnostic tests, hospital admissions, operations, specialist visits, and home nursing care than in El Paso.

But that is not the end of the story. It only takes a sentence to explain the “proximal” cause of these cost differences, but it takes the whole article for Gawande to do justice to his theory on the underlying cause, and his is certainly not the only theory.

Since his theory of the root cause of this inequality is centered on physicians putting profit over patients, it has made some doctors uneasy. Greg Roth, a physician that I work with hadn’t had time to read the article when we last chatted, but he did attend Skinner’s talk with me two months ago. Greg told me about a detail that has emerged as doctors put Gawande’s article under their microscopes: we might be making a mountain out of molehill-sized mystery.

Look at this plot, which shows the complementary cumulative distribution function for the primary quantity in Gawande’s article, Total Medicare reimbursements per enrollee for 2006.

Investigative reporter have to get the story, and raking the muck way out in the tail of this distribution turned out to be a good bet this time. But McAllen is 6 standard deviations above the mean (not to imply that this distribution is normal… should it be?) How much impact would it have, for the whole population, if the outliers were greatly improved?

If through anti-fraud policing, better culture, and general hard work, the top 10% of hospitals reduced their cost per patient to the national average, that would reduce the average cost by 3.6%. Outliers show what is possible, but making a big change involves more than outliers.

June 18, 2009

Population Health in Iran

The political situation in Iran has been in the news and on the nets a lot this week. I hope that the friends and families of all my Iranian colleagues are safe. I’m thinking of you.

Keep reading →