Best practices for reproducibility
Working with half a dozen Turtlebots and 14 students, we have a real problem with reproducibility. That is to say, things work and then stop working without us knowing what might have changed. I can think of many culprits:
- some package that was inadvertently or automatically updated (changed)
- slight hardware differences we might not even know about
- timing or ordering of launching different nodes
- draining batteries
- and so on and so forth
I am sure we're not the only ones, by far. I suspect it's a question of software and hardware "hygiene". We've thought of some techniques to solve this but haven't implemented them yet:
- Have one linux image that is authorized and install it on bare metal (ok)
- Followed by a shell script that installs very specific versions of everything (hard, but may be doable)
- Prohibit (via passwords?) anyone from installing or deinstalling anything (not sure how to do this)
- Turn off all automatic update mechanisms (not sure how to do this)
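One way to at least detect the "package was inadvertently updated" case is to record the installed package versions on the authorized image and compare each robot against that record. Below is a minimal sketch, assuming a Debian/Ubuntu system where `dpkg-query` is available. The function names (`read_installed_packages`, `parse_manifest`, `diff_manifests`) and the idea of a saved baseline manifest are illustrative assumptions, not part of any ROS or Turtlebot tooling.

```python
import subprocess


def parse_manifest(text):
    """Parse lines of '<package> <version>' into a {package: version} dict."""
    packages = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip blank lines and entries without a version (not fully installed).
        if len(parts) == 2:
            packages[parts[0]] = parts[1]
    return packages


def read_installed_packages():
    """Ask dpkg for every installed package and its version (Debian/Ubuntu only)."""
    result = subprocess.run(
        ["dpkg-query", "-W", "-f=${binary:Package} ${Version}\\n"],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_manifest(result.stdout)


def diff_manifests(baseline, current):
    """Return (package, baseline_version, current_version) for every difference.

    A version of None means the package is absent on that side.
    """
    changes = []
    for name in sorted(set(baseline) | set(current)):
        old = baseline.get(name)
        new = current.get(name)
        if old != new:
            changes.append((name, old, new))
    return changes
```

A possible workflow: save the output of `read_installed_packages()` once on the reference image, then run `diff_manifests(baseline, read_installed_packages())` on each robot before a lab session and investigate any non-empty result. This only reports drift; it does not prevent it. For prevention on apt-based systems, `apt-mark hold <package>` can pin individual packages, though whether that fits this setup is a judgment call.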
My question is, how do you avoid this problem? What are your best practices? What are your tools?
It would be good if you could give some examples of what you feel are "problems with reproducibility".
Right now you only list potential causes (as you have identified them) and potential solutions, but you don't really describe the problems you are actually running into.
Working/not working is too vague, and rather binary.
Turtlebots are real systems with closed-loop-ish control. If, for instance, you'd like each and every one of them to reach exactly the same spot in a map, that is not going to happen without some serious tweaking and calibration.
Hoi! When I say "not working" I mean something pretty fundamental. As an example, a student tells me they got something to work, but then they run it again to show it to me and it doesn't work at all. The lidar stops spinning for no apparent reason; the new navigation destination in rviz doesn't do anything; some weird error that I've never seen before shows up in the log. Sometimes rebooting the robot and re-running roscore etc. does it; sometimes power cycling the whole robot makes the problem go away.
That may be, and it may be perfectly clear to you, but if you don't write these things down, we can't know, and so can't help you.
I know what you’re saying. The very nature of irreproducibility is that it’s different every time, and that makes it hard for me to pin down what exactly went wrong. Let me refer back to my original question, which was not about solving a particular problem, but about asking experts like you for their best practices. The previous response I received actually had some specific and actionable practices. So, what is your advice for “good hygiene” in a scenario like mine? (e.g. always brush your teeth at least once a day, don’t drink coffee after 9pm, always look twice before crossing the road)