Blowing that Whistle: My Kafkaesque Journey in Causal Machine Learning
When you witness fraud so flagrant you begin to doubt your sanity
This whistle-blow describes my history with a European research project, and my interactions with its local PI, which were thoroughly Kafkaesque in the sense of being a mix of farce and horror. The decisions made about which methodology to pursue, and about my designated role in the process, were maddeningly bizarre and accompanied by a constant stream of animosity. The important point of this blog is to show what can happen in a large research project, and how the incentives governing organization and interaction are likely to persist across many such efforts, explaining some of the contemporary failures of the scientific establishment that have eroded its public trust.
Background – the Study and the Data
The project concerns the impact of nutrition on noncommunicable disease (NCD). It is notable for 1) the wide variety of data measured: genomics, metabolomics, lipidomics, and so on, and 2) a camera that records what a person is actually eating, rather than relying on (often faulty) self-reports. It is a 12-million-euro project involving more than fifteen institutes across greater Europe. Each institute has its own local Principal Investigator (PI). There is also a main PI for the entire project, who will not appear in the rest of this article: she is an expert in food science who, logically, entrusted the statistical modeling to the machine learning team, and her coordination of the large and diverse consortium has been, from my best impressions, excellent.
There is an initial clinical trial, conducted in the first year, to be used for modeling, and a second one two years later for validation. Each trial monitors a few hundred subjects for a couple of weeks. Time-independent background data (demographic and biochemical) is recorded, together with biomarkers of various modalities, measured at the beginning and end of the trial and meant to capture the overall state of the person's health.
During the trial itself, the data is the nutrition of the individual. People naturally eat in rhythms, but not regular ones, so this can be considered an irregularly sampled time series.
Thus we have an outcome measurement at the initial time indicating a person's health, then a couple of weeks of input measurements, and finally a closing outcome measurement.
Many of the measurements involve signal processing to perform feature extraction. Beyond that, structured data can be clustered so as to obtain lower-dimensional features. Even so, the decision variables, that is, the nutrition (however codified), together with the background covariates and the outcome variables, are of very high dimension. Even a charitable estimate puts them in the tens of thousands of features.
The objective of the project is to perform causal learning of the effect of nutrition on NCD from the clinical trial data, then use this to develop a personalized nutrition recommendation engine. If you're a statistician who just spilled your coffee, I promise it gets much, much worse.
The Obvious Things to Do and Not Do
Of course, this is an awfully ambitious project to begin with, if feasible at all. Still, with the project funded, one must make a good-faith attempt to apply the most appropriate statistical methods to do something resembling causal learning. To this end, there are a number of considerations I would call obvious, in the sense that they clearly must be addressed for any hope of modeling the data with accuracy and statistical meaning, let alone of developing a credible personalized nutrition recommendation engine. That they are obvious should be clear to anyone with basic knowledge and experience in statistical modeling and inference.
First, for the dietary time series, it is clear that someone on the team should be, or become, an expert on irregularly sampled time series and develop methods to extract features (totals and frequencies of certain macronutrient consumption, and so on). A model would then fit the transition of the health markers from before to after the trial using these features as covariates. For extra robustness, you can add latent variables defining a low-dimensional structure for the individual's state of health, with the input variables feeding into the latent variables, which in turn feed the biomarkers.
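To make the feature-extraction step concrete, here is a minimal sketch. It is entirely my own illustration, and the column names (`time`, `carbs_g`, `fat_g`, `protein_g`) are hypothetical, not the project's schema:

```python
import pandas as pd

def diet_features(meals: pd.DataFrame) -> pd.Series:
    """Summarize an irregularly sampled meal log: one row per eating
    event, with a 'time' timestamp and macronutrient columns."""
    meals = meals.sort_values("time")
    # Gaps between eating events, in hours -- the irregular rhythm itself
    gaps_h = meals["time"].diff().dt.total_seconds().div(3600).dropna()
    days = max((meals["time"].max() - meals["time"].min()).days, 1)
    feats = {
        "meals_per_day": len(meals) / days,
        "mean_gap_h": gaps_h.mean(),   # average spacing of meals
        "gap_std_h": gaps_h.std(),     # irregularity of the rhythm
    }
    for m in ("carbs_g", "fat_g", "protein_g"):
        feats[f"{m}_per_day"] = meals[m].sum() / days  # consumption rates
    return pd.Series(feats)
```

Real methods would go well beyond such summary statistics, but even this illustrates the point: the raw log must be turned into covariates before any before/after model can be fit.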
In this data regime, with orders of magnitude more features than samples, data augmentation is a must. External data is available, from biobanks and the like, but it will typically cover only a smaller subset of the features extracted in our project. Thus some systematic means of combining datasets, without excessive missing-value heuristics, is required.
Even with, say, ten times the samples via external data, the regime is still highly unfavorable to causal learning, with orders of magnitude more features than samples. Here, aiming for statistically significant causal graphs is impossible, and the only thing you can really do is Bayesian modeling: represent the full uncertainty, in the hope of identifying some high-probability interventions under certain conditions. With Bayesian reinforcement learning, an area one of the other institutes in the consortium is expert in, this would feed a recommendation engine serving the highest-probability suggestions.
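As a toy sketch of why the Bayesian route still yields something usable when features vastly outnumber samples (my own illustration, not the project's model: a conjugate Gaussian prior with known noise), the posterior over effects is computable by inverting only an n-by-n matrix, and candidate interventions can then be ranked by posterior probability rather than by hopeless significance tests:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n, p = 200, 5000                      # far more features than samples
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.5, 1.0]      # a few real effects
y = X @ beta_true + rng.normal(size=n)

tau2, sigma2 = 0.5, 1.0               # prior variance, noise variance
# Woodbury trick: work with the n x n kernel matrix, never a p x p one.
K = tau2 * (X @ X.T) + sigma2 * np.eye(n)
alpha = np.linalg.solve(K, y)
post_mean = tau2 * (X.T @ alpha)                         # posterior means
V = np.linalg.solve(K, X)                                # K^{-1} X
post_var = tau2 - tau2**2 * np.einsum("ij,ij->j", X, V)  # posterior variances
prob_positive = norm.cdf(post_mean / np.sqrt(post_var))  # P(effect > 0)
print("top-ranked features:", np.argsort(-np.abs(post_mean))[:5])
```

No edge of a causal graph becomes "significant" this way; the output is a ranked, uncertainty-quantified shortlist, which is exactly what a recommendation engine downstream can consume.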
One can see that this requires a complex integration of multiple difficult methodological developments. For one thing, it is impossible for one person to do all of it; for any hope of success, everyone involved in the statistical modeling would need to work on some component. As said, however, this is all very obvious, so I figured it is simply what we would all do.
If you are familiar with causal learning, there are two standard textbook approaches, but it is clear that applying either naively, out of the box, would be colossally inappropriate:
1) Causal discovery: finding a ground-truth causal network in the form of one central directed acyclic graph. The sample complexity of causal discovery is exponential in the number of variables, and recall that the data regime is such that even linear sample complexity would be a roadblock for this study. If you managed (the managing is another issue, discussed later) to solve some maximum-likelihood system and obtain a loss-minimizing graph, it would be statistically meaningless: none of the edges would be statistically significant, and no "causal discovery" could be claimed in good faith (see the counting sketch after this list).
2) Structural equation time series models (SEMs): treating the entire two weeks as one time series, combining the nutrition and the biomarkers together. Even ignoring the irregular sampling of the nutrition, the fact is that the outcome variables are measured only at the beginning and the end, so there is no outcome time series to fit. You can add latent variables for the outcomes at all intermediate times, but then you are just adding extra computation for no additional information in the model structure. Moreover, it is simply the wrong representational ontology, since an individual's state of health changes far more slowly than nutrition in the bloodstream.
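To put a number on the first point (this illustration is mine, not from the project): the number of labeled DAGs on $d$ nodes satisfies Robinson's recursion,

$$
a(d) \;=\; \sum_{k=1}^{d} (-1)^{k+1} \binom{d}{k}\, 2^{k(d-k)}\, a(d-k), \qquad a(0) = 1,
$$

which already gives $a(4) = 543$ and $a(10) \approx 4.2 \times 10^{18}$. With tens of thousands of variables and a few hundred subjects, any score optimized over this space cannot distinguish its "discovered" graph from astronomically many rivals at any sensible significance level.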
This is, or should be, entirely obvious. So obvious that I figured everyone in the project knew this, and we would proceed with performing the research along the lines described, and most certainly not along the lines of what not to do.
When something is obvious, we can recall Quine's web of knowledge: unlike claims at the periphery, which one holds with only marginal confidence, obvious knowledge sits in the dense center. For evidence to introduce doubt in your belief in the obvious, a very troublesome and disturbing questioning of one's entire reality has to take place. Consider understanding what I wrote above to be like knowing that 1+1=2 (in the infinite abelian group of integers, for any smart-ass mathematicians who know the exceptions). If someone or something suggests doubt as to whether 1+1=2, your whole world starts to tremble: was all of primary school a PsyOp? What are numbers? Am I reading the right numerical script?
Kafka Begins
One more element to this story: the local PI has a favored class of methods; call them M. Now, M is his personal brand of methods, that is, the methods he is a specialist in and is keen on pushing as far as is beneficial to his career. However, this class of methods, while mathematically elegant, does not scale well. Handling more than a two-digit number of variables becomes, essentially, computationally impossible, and even more than a dozen begins to break most machines. Maybe sparsity can push it a bit higher, but hardly by enough orders of magnitude to matter here. Recall that we have tens of thousands of features, with the number of possible edges scaling quadratically in that, and the number of candidate graphs, which is what maximum-likelihood and other score-optimization problems range over, scaling exponentially beyond it. The PI knows both that his methods do not scale and what the scale of the data in the project is.
At the beginning of the project, a new PhD student arrived to work on it, whom the PI and I would co-supervise. Great: doing the obvious thing, I recommended that she study irregularly sampled time series, in order to help with the necessary feature extraction for the nutrition variables.
Some time later, the PI, notably irate at my guidance, has her study... M. Moreover, I later find out that she is working on developing methods for applying M to SEM time series, that is, item 2 above of what is obviously not to be done.
My heart sank when I found this out, and my enthusiasm for the project dipped. I concluded that the PI was not interested in actually trying to solve the difficult problem of fitting the project's data, but was just using it as an opportunity to write and publish papers on whatever interests him that can be vaguely related to the general themes of the topic.
Looking into the mathematical statistics of causal inference in the biological sciences, I find some interesting problems in survival analysis and start developing the mathematics there.
Eventually, at a meeting, the PI finds out and is irate about this, stating that survival analysis is not relevant to the project, as we are not doing long-term follow-ups with potential morbidity or mortality censoring.
And so here Kafka begins. What is going on??
1) Is he flagrantly pursuing his own interests while ignoring the needs of the project, all while demanding I not do the same?
2) Or is it not the case that 1+1=2?
3) Or is it textbook totalitarian power tactics of hamfisting absurdities?
As it was, I decided not to think about it too much and broadly deprioritized it. Out of sight, out of mind: there was nothing to deal with.
Kafka Turns into a Climax
Unfortunately, there came a point at which there was something to deal with: in particular, a Deliverable for a Software Package I had promised, intended, in the writing of the proposal, to be the foundation for modeling the project data.
The PI harshly scolded me for the minimal updates and progress towards this Deliverable. Considering it, and the discussion around it, I became deeply concerned: as I understood it, this software should be capable of at least some basic modeling of the clinical trial data, which would be available around the time of the Deliverable's release. And if it wasn't capable of that, the whole consortium would be furious with me for failing to develop what was necessary for the project.
Recall the long list of difficult things that are absolutely necessary for any hope of a meaningful statistical model in this data regime. I now started eight different research papers addressing this vast set of technical requirements: from the initial basic graphical modeling, to creating synthetic data by merging external datasets, to additional methods with sparsity considerations, to a couple of Bayesian approaches, to scientific computing to scale those Bayesian approaches.
This was, of course, insane, with just months before the due date set in the proposal. My communication with the PI becomes increasingly hostile and absurd (remember, Kafka). About a month before the deadline, I share the ongoing drafts of 7 papers. He says to submit three, and says that #s 3, 6, and 7 are useless for the project (though they are, like every one of the papers, among the obviously mandatory things to do, remember). Then, two weeks out from the deadline, I share the ongoing drafts of the 7 papers, all showing progress, but insufficient for the deadline. He again says to submit three, and says that #6 should definitely be among them.
I agree to continue #6 after the deadline, but point out three others that have a chance of making it in time. Meanwhile, I am going insane with the round-the-clock workload and pressure, and begin discussing the issue with a friend who has been a spectacular therapist for me over many years. She says that childhood trauma triggered my mania of extreme effort to meet impossible expectations; that he is clearly aware that 1+1=2; that there is no way he could expect this software to actually model the data; and that he just wanted me to make something related in order to tick the box of the Deliverable. I write the PI of my relief and... no response.
Eventually the delivery happened, and I had merely been neurotically paranoid in my fear that everyone in the project would be furious with me; nothing happened on that front.
Still bewildered by his lack of response, I am starting to doubt my sanity: is it the case that 1+1=2?
I write to a couple of colleagues with my statistical concerns regarding the project, but get no response…
Finally, I talk to someone who has worked as a biostatistician on clinical trials for the Harvard School of Public Health for twenty years. After I explain what the PI is trying to do with the project and what happened, he calls it "colossally insane stupidity" and proceeds to rant for a good twenty minutes on how ridiculous it all is.
I talk to a friend of mine who is the Head of Data Science at a Fortune 500 company. When I tell him the story, he proceeds to knee-smacking belly laughter for at least a minute straight. Seriously, I sat there thinking "he... keeps... going..."
I’m not insane, thank fuck.
Escape
Meanwhile: an obvious thing to do for such a problem would be to take subsets of the features small enough that a graph learning and estimation problem can be solved on each, and then sample a large mixture of these (a toy sketch follows below). This was #7 in the list above, the one he had remarked was useless for the project. I had asked him to fund an intern from IIT-Delhi to come work on it for me. No reply. But at the time he said it was useless for the project, so...
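Here is a toy version of that idea; it is my own illustration, with thresholded partial correlations standing in for the real (expensive) graph learner:

```python
import numpy as np

def subset_mixture_edges(X, k=10, n_subsets=500, thresh=0.2, seed=0):
    """Draw small random feature subsets, fit a cheap dependency graph
    on each, and aggregate edge selections into stability scores."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    votes = np.zeros((p, p))   # times an edge was selected
    seen = np.zeros((p, p))    # times a pair was co-sampled
    for _ in range(n_subsets):
        idx = rng.choice(p, size=k, replace=False)
        S = np.cov(X[:, idx], rowvar=False) + 1e-6 * np.eye(k)
        P = np.linalg.inv(S)                  # small precision matrix
        d = np.sqrt(np.diag(P))
        pcorr = -P / np.outer(d, d)           # partial correlations
        for a in range(k):
            for b in range(a + 1, k):
                i, j = min(idx[a], idx[b]), max(idx[a], idx[b])
                seen[i, j] += 1
                if abs(pcorr[a, b]) > thresh:
                    votes[i, j] += 1
    return np.where(seen > 0, votes / seen, 0.0)  # edge stability scores
```

Edges that keep being selected across the subsets in which they appear accumulate stability, in the spirit of stability selection; the real version of #7 would swap the inner estimator for the actual graph learning and estimation method.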
After the deliverable, I just asked him directly: "so do you want me to just work on methods for when there are many more data samples than features, or should I continue with this Bayesian method for more credible high-dimensional statistics?" Finally he said, "yes, we have the second situation." A few IIT-Delhi students work on it remotely, but they were students just finishing their Bachelor's, and the methods involved Empirical Bayes using Generalized Variational Inference, which is fairly advanced (roughly sketched below).
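For readers unfamiliar with the term, and speaking roughly (my gloss, in the sense of Knoblauch et al.): Generalized Variational Inference casts the posterior as the solution of an optimization problem,

$$
q^{*} \;\in\; \operatorname*{arg\,min}_{q \in \mathcal{Q}} \; \mathbb{E}_{\theta \sim q}\!\left[\sum_{i=1}^{n} \ell(\theta, x_i)\right] \;+\; D(q \,\|\, \pi),
$$

where $\ell$ is a loss, $\pi$ a prior, $D$ a divergence, and $\mathcal{Q}$ a tractable family of distributions. Taking $\ell$ to be the negative log-likelihood and $D$ the KL divergence recovers standard variational inference; the generality is precisely what makes it a heavy lift for students fresh out of a Bachelor's.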
So months go by, and the progress is slow. The PI continues to work on methods in the themes of M, along the lines of item 1 of what is obviously not to be done. Eventually, after some final polishing of the papers associated with the deliverable, I submit them to journals, decide to jump off this sinking ship, and leave the project.
Months later, I find out the consortium has completely abandoned trying to model the influence of nutrition on the biomarkers and is just focusing on understanding adherence to diet plans. And so millions of euros of taxpayer money, spent on state-of-the-art technology for measuring all sorts of biochemical detail, are simply discarded. "I told you so" is a hollow victory: I'm not some visionary genius; I was simply under the expectation that we would proceed under the understanding that 1+1=2. But this was apparently not only not the case, it was even considered offensive by the PI.
Retrospective
So there you have it.
Those of you who may be wondering (or who are unaware, in which case you should check these links):
What’s up with the replication crisis being so bad that it can be said that most published research is wrong?
Why is there now essentially zero marginal gain in innovation and economic growth from research spending?
Why is public trust in Science eroding?
Well here you go, ladies and gentlemen. The smoking gun.
What this man did, deliberately using research funding not for the project but for his personal research brand, is not only horribly unethical, but, to my understanding, may even constitute the serious criminal offense of subsidy fraud.
But I’m not going to be a martyr, and I know how this guy works, and how the system works. So I kept quiet until I got tenure this year.
One could claim in his defense that instead of nefarious intent, he was just very stupid, that is, unaware that 1+1=2. But:
1) that would constitute gross negligence for a PI in charge of the statistical modeling
2) he knows that his methods M do not scale, and he knows the dimensionality of the features, so stupidity is no defense for centering his entire focus on his preferred class of methods.
Now, as much as I abhorred what I went through personally, I want the message to be general:
1) The guy got away with all this with no consequences to him
2) Observe that academic culture actively discourages speaking out. Notice that fellow academics didn't even respond, while people dealing with statistics in the real world had the natural, appropriate reactions to the situation.
If this happened here, it happens in many places across many different projects, oftentimes with nobody saying anything even after the fact. Consider that countless millions in taxpayer money are being burned because of such systemic failures.
So if you’re in the European academic research community, unless you want the Afd and National Front to come to power leading the rising populist Right in Europe and do to European research funding what Trump is doing to the NSF and NIH, I suggest that you look in the mirror, and I suggest that you look around. And you’re going to think about what systemic changes need to made, and the kind of culture you want to see academia and academic research to have.
In order to make this article appeal to as general an audience as possible, and to keep it from being gossipy, I did not include any names, and I avoided potentially identifying technical details beyond what was necessary to explain the story. However, I am happy to share any such information in private.
There will be a more technical Part 2 to this blog: in the meantime, I did a deep dive into causal machine learning itself and found serious fundamental problems with the entire research paradigm. I will discuss the details of those findings, which are summarized in my first Philosophy journal submission, in that next article.