I work in the ICU monitoring field, on the R&D team of a company with live systems at dozens of hospitals and multiple FDA approvals. We use extended Kalman filters (i.e. non-blackbox "ML") to estimate certain lab values of patients that are highly indicative of them crashing, based on live data from whatever set of monitors they're hooked up to - and it's highly robust.
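For readers unfamiliar with the technique, the predict/update loop looks roughly like this. This is a minimal scalar sketch with a made-up sensor model (h(x) = sqrt(x)) and made-up noise parameters, not our production system:

```python
import numpy as np

# Minimal scalar EKF sketch (illustrative only).
# State x: a latent lab value; measurement z = h(x) + noise from a monitor,
# with an invented nonlinear sensor response h(x) = sqrt(x).

def ekf_step(x, P, z, q=0.01, r=0.1):
    # Predict: random-walk state model, so f(x) = x and F = 1.
    x_pred = x
    P_pred = P + q
    # Linearize the measurement model around the prediction.
    H = 0.5 / np.sqrt(x_pred)          # dh/dx evaluated at x_pred
    # Standard Kalman gain and update.
    S = H * P_pred * H + r
    K = P_pred * H / S
    x_new = x_pred + K * (z - np.sqrt(x_pred))
    P_new = (1 - K * H) * P_pred
    return x_new, P_new

x, P = 4.0, 1.0                        # initial estimate and variance
for z in [2.1, 1.9, 2.05, 2.0]:        # noisy sqrt-readings of a true value ~4
    x, P = ekf_step(x, P, z)
# The estimate stays near 4 and the variance P shrinks as readings arrive.
```

The point about robustness: every quantity in that loop (gain, innovation, covariance) is inspectable, which is exactly what you want when the output drives clinical alarms.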
What the authors of this paper are doing is throwing stuff at the wall to see if it works, and publishing results. That's not necessarily a bad thing at all, but I say this to underline that their results are not at all reflective of SOTA capabilities, and they're not doing much exploration of prior art.
Calling EKFs "ML" is certainly a choice.
It is a reasonable choice, and especially with the quotes around it, completely understandable.
The distinction between statistical inference and machine learning is too blurry to police Kalman filters onto one side.
It's machine learning until you understand how it works, then it's just control theory and filters again.
Diffusion models are a happy middle ground. :-)
Is it less ML than linear regression?
If you want to draw the line between ML and not ML, I think you’ll have to put Kalman filters and linear regression on the non-ML side. You can put support vector machines and neural networks on the ML side.
In some sense the exact place you draw the distinction is arbitrary. You could try to characterize where the distinction is by saying that models with fewer parameters and lower complexity tend to be called “not ML”, and models with more parameters and higher complexity tend to be called “ML”.
Linear regression is literally the second lecture of the Stanford ML class. https://cs229.stanford.edu/
If you want to say "not neural networks" or not dnn or not llm, sure. But it's obviously machine learning
AI professor here.
Anything that can separate data points can rightly be seen as a "supervised machine learning classifier".
To demystify the area, I literally open my intro to ML lecture by drawing a line on the blackboard, giving its equation y = 0.5 x, reminding students that they already know this, and then explaining how to use it as a spam filter by interpreting the points on either side of the line as good emails versus spam ones.
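That lecture example fits in a few lines of code. The feature names and points below are invented purely for illustration:

```python
# The lecture example as code: classify a point by which side of the
# line y = 0.5 x it falls on.

def is_spam(x, y):
    # Points above the line y = 0.5 x are labeled spam (an arbitrary choice).
    return y > 0.5 * x

# Hypothetical feature vectors, e.g. (links_per_word, exclamation_rate).
emails = [(2.0, 3.0), (4.0, 1.0)]
labels = [is_spam(x, y) for x, y in emails]
# (2.0, 3.0): 3.0 > 1.0 -> spam; (4.0, 1.0): 1.0 < 2.0 -> not spam
```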
Linear regression is machine learning. At their core neural networks are just repeated linear regression + a non-linearity arranged in interesting ways. The key is that they can be trained to fit data using some optimization protocol (e.g. gradient descent). Just because linear regression has a closed form solution and is conceptually simple doesn't mean anything here.
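A quick sketch of that point: fitting the same linear model by closed-form least squares and by gradient descent on the MSE lands on (approximately) the same weights. The data and hyperparameters here are arbitrary toy choices:

```python
import numpy as np

# Linear regression fit two ways: closed form, and as a one-unit "network"
# trained by gradient descent on squared loss. Toy data: y = 3x + 1 + noise.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(100, 1))
y = 3 * X[:, 0] + 1 + 0.01 * rng.normal(size=100)
Xb = np.hstack([X, np.ones((100, 1))])          # append a bias column

# Closed-form least squares.
w_closed = np.linalg.lstsq(Xb, y, rcond=None)[0]

# Gradient descent on the same MSE objective.
w = np.zeros(2)
for _ in range(2000):
    grad = 2 * Xb.T @ (Xb @ w - y) / len(y)
    w -= 0.1 * grad

# Both converge to approximately the same weights, close to [3, 1].
```

Swap the gradient-descent loop's model for a stack of such layers with nonlinearities in between and you have a neural net; the training protocol is the same idea.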
EKFs work by 'learning' the covariance matrix on the fly, so I don't see why not?
Hence the quotes ;).
As an intuition on why many people see this as different.
PAC Learning is about compression; KF/EKF is more like Taylor expansion.
The specific types of PAC Learning that this paper covers have problems with a simplicity bias and fairly low sensitivity.
While based on UHATs, this paper may provide some insights.
https://arxiv.org/abs/2502.02393
Obviously LLMs and LRMs are the most studied, but even the recent posts on here from Anthropic show that without a few high-probability entries in the top-k results, confabulations are difficult for transformers.
Obviously there are PAC Learning methods that target anomaly detection, but they are very different from even EKF + Mc
You will note in this paper that even highly weighted features exhibited low sensitivity.
While the industry may find some pathological cases that make the approach usable, autograd and the need for parallelism make applying this paper's methods to tiny variations in multivariate problems ambitious.
They also only trained on medical data. Part of the reason the foundation models do so well is that they encode verifiers from a huge corpus, which invalidates the traditional bias-variance tradeoffs from the early-'90s papers.
But they are still selecting from the needles and don't have access to the hay in the haystack.
The following paper is really not related except it shows how compression exacerbates that problem.
https://arxiv.org/abs/2205.06977
Chaitin's constant, which encodes the halting problem and is both normal and uncomputable, sits at the extreme top end of computability, but it relates to the compression idea.
EKFs have access to the computable reals, and while the underlying models are non-linear, KFs and EKFs can be thought of through the lens of linearized approximations.
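To make the linearization point concrete, here is a minimal sketch, with an arbitrary toy function h, of the first-order Taylor expansion an EKF applies to a nonlinear measurement model at each step:

```python
import numpy as np

# The "linearization as a lens" view: at each step an EKF replaces a
# nonlinear model h with its first-order Taylor expansion around the
# current estimate. h here is chosen only for illustration.

def h(x):
    return np.sin(x)

def jacobian(x, eps=1e-6):
    # Central-difference derivative of h at x (scalar case).
    return (h(x + eps) - h(x - eps)) / (2 * eps)

x0 = 0.3                               # current state estimate
H = jacobian(x0)
approx = lambda x: h(x0) + H * (x - x0)  # local linear model

# Near x0 the linear model tracks h closely; the standard KF update
# equations are then applied to this local-linear model.
err = abs(h(0.31) - approx(0.31))
```

This is exactly a truncated Taylor expansion, which is why the compression framing of PAC Learning feels like a different animal.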
If the diagnostic indicators were both ergodic and Markovian, this paper's approach would probably be fairly reliable.
But these efforts are really about finding a many-to-one reduction that works.
I am skeptical about it in this case for PAC ML, but perhaps they will find a pathological case.
But the tradeoffs between statistical learning and expansive methods are quite different.
Obviously hype cycles drive efforts; I encourage you to look at this year's AAAI conference report and see that you are not alone in your frustration with the single-minded approach.
IMHO this paper is a net positive, showing that we are moving from a broad exploration to targeted applications.
But that is just my opinion.
Parameter estimation is ML now?
I think ML is in quotes for a reason: the usage is not typical.
Why not? LLMs, vision models, and Kalman filters all learn parameters based on data.
A linear regression model can be written and trained as a neural net, has a loss function, all of that. Most if not all ML problems can be formulated as modelling a probability distribution.
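As a concrete sketch of the probability-distribution framing (toy data, an assumed fixed noise scale): least-squares regression is maximum likelihood under Gaussian noise, so the MSE minimizer and the negative-log-likelihood minimizer coincide:

```python
import numpy as np

# Least squares as Gaussian maximum likelihood: over a grid of candidate
# slopes, the MSE and the Gaussian negative log-likelihood pick the same
# winner, because the NLL is a monotone transform of the MSE.
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 50)
y = 2 * x + 0.1 * rng.normal(size=50)   # toy data, true slope 2

ws = np.linspace(1.5, 2.5, 1001)
mse = [np.mean((y - w * x) ** 2) for w in ws]
# Gaussian NLL with a fixed noise scale sigma = 0.1 (constants dropped).
nll = [np.sum(0.5 * ((y - w * x) / 0.1) ** 2) for w in ws]

w_mse = ws[np.argmin(mse)]
w_nll = ws[np.argmin(nll)]
# w_mse and w_nll are the same grid point, near the true slope 2.
```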
Neural networks are not ML now?
EKF is a neural network!?