07:45
juancorr joined #dataverse
08:01
jri joined #dataverse
08:56
jri_ joined #dataverse
13:24
pdurbin joined #dataverse
14:48
donsizemore joined #dataverse
14:57
poikilotherm joined #dataverse
15:00
poikilotherm
Morning pdurbin :-)
15:00
pdurbin
mornin
15:02
poikilotherm
Getting unit tests in place is hard work...
15:02
poikilotherm
Hacking on some for DatasetServiceBean
15:02
poikilotherm
(while working on the scheduled stuff)
15:02
poikilotherm
Are these tests missing due to lack of time?
15:05
pdurbin
Not everyone has the testing religion.
15:06
poikilotherm
Yeah...
15:07
poikilotherm
BTW - http://shop.oreilly.com/product/0636920078777.do arrived today. I'll read it during the holidays ;-)
15:10
pdurbin
nice, I have https://www.oreilly.com/library/view/containerizing-continuous-delivery/9781491986851/ in my hands. ~50 pages. Free book from JavaOne.
15:11
poikilotherm
Yeah, seems like a chapter of the book ;-) The author is the same :-)
15:13
pdurbin
Did you make a decision about running a poll or not? For your meeting.
15:13
poikilotherm
Oh, I forgot about that. Since you wrote that you run the show, let's stick with that for now.
15:14
pdurbin
Uh, I'm not sure what I wrote. I thought I was asking a question.
15:16
poikilotherm
"This morning @djbrooke and I spoke about next week's community call on December 18th and we do plan to have it. "
15:17
pdurbin
Oh. Yes, the normal call tomorrow is on. I'm asking if you've made a decision about your call.
15:18
poikilotherm
???
15:18
poikilotherm
"We like your idea of having those interested in discussing this issue to stay a little longer on the call."
15:19
pdurbin
Ok, so for you it's clear what the plan is? I don't think it's so clear. How do you feel about leaving a new comment on the issue?
15:19
pameyer joined #dataverse
15:19
poikilotherm
Just did that :-D
15:20
pdurbin
perfect, thanks!
15:35
pameyer
dockerhub decided they needed a redesign :(
15:40
pdurbin
I can't tell a huge difference but I'm not on Docker Hub much.
15:41
pameyer
that suggests that you normally browse with javascript enabled
15:41
pdurbin
like most people, yes :)
15:41
pameyer
I only noticed in passing that dockerhub links are now empty pages
16:21
jonas42 joined #dataverse
16:23
jonas42
hey everyone! anybody interested in chatting about "Harvesting from non-OAI-PMH sources" #5402 ? I think just a little input on how others do this would help a lot... No pressure though, I'll go engage the community mailing list otherwise
16:24
donsizemore
@jonas42 i'd volunteer thu-mai and mandy if you want to join us in slack
16:27
jonas42
i was just a visitor in the dataverse slack
16:27
donsizemore
i'm not cool enough. i meant odum's slack
16:28
jonas42
sure! i'm jonas.kahle wzb.eu
16:30
pameyer
@jonas42 I'm interested; but there may be a lot of lag on my end
16:34
jonas42
@pameyer don't worry!
16:37
donsizemore
@jonas42 mandy leaves for japan tomorrow, so her stock response is "2019 mandy will worry about this" (but we're definitely interested)
16:39
jonas42
it's not the time of year to bring up new questions :D i totally get that
16:40
jonas42
i was just wondering if i'm totally off-track when thinking about using dataverse as a place for external references (outside of "default" harvesting)
16:41
donsizemore
@jonas42 we found in converting IPUMS XML to JSON that a) each dataset claims the same DOI, which b) directs to the IPUMS homepage.
16:42
donsizemore
plunking out the python to generate the JSON was kind of fun, but we're going to want to reference external data more and more (particularly with the Trusted Resource Storage Agent that Akio is developing)
16:43
donsizemore
heading for lunch but defo interested
16:45
pdurbin
jonas42: hi! In 5 minutes we're walking over to our team holiday lunch.
16:45
pdurbin
Do jonas42 and poikilotherm know each other?
16:47
pdurbin
pameyer: good discussion about https://github.com/IQSS/dataverse/issues/5406 and friends after standup. Thanks for looking into this.
16:47
pameyer
pdurbin: no problem - glad to hear it
16:47
pameyer
and glad that wiser heads than me are looking at how to fix it
16:48
pdurbin
and wiser than me
16:48
pdurbin
Friday was very confusing for me.
16:50
pameyer
some days have higher proportions of unexpected things breaking than others
17:10
jonas42
who is poikilotherm? (according to wikipedia, (s)he is no human)
17:11
pameyer
jonas42: from some of the various comments, it seemed to me like you were trying to do something like create a dataset where the data was links to other datasets, and metadata was additional annotations about them. is that anywhere close?
17:12
pameyer
https://github.com/poikilotherm
17:12
pameyer
... but I haven't looked at what wikipedia said
17:12
jonas42
poikilotherm /ˈpɔɪkɪlə(ʊ)ˌθəːm/ noun (Zoology), plural poikilotherms: an organism that cannot regulate its body temperature except by behavioural means such as basking or burrowing.
17:13
jonas42
@pameyer i try to describe it in #5402
17:14
pameyer
I evaluate technical issues without regard to the body temperature mechanism of the source :)
17:14
pameyer
will re-read 5402
17:22
pameyer
jonas42: it seems to me like 5402 might be better off as at least 2 issues
17:24
pameyer
the way I understand things, harvested datasets are only editable at the source - so "enriching" / adding metadata to a harvested dataset would be one
17:25
pameyer
additional import format(s) would be another (datacite, crossref, etc)
17:26
pameyer
harvesting set membership is also defined at the source. this could potentially be worked around by defining the set in a static file somewhere; but I'm not sure how that would interact with source metadata updates
17:27
jonas42
well i was trying to avoid opening a new issue at all....
17:27
pameyer
the granularity one there might be another
17:27
jonas42
but okay, so i'm not totally off-track with this
17:28
pameyer
issues are free :)
17:28
pameyer
it's definitely good to have the high-level goal in an issue (at least in my opinion)
17:29
pameyer
over time, I've picked up the habit of trying to break stuff down into the smallest component chunks though
17:31
pameyer
one question that's not clear from the issue - will the users be getting files from your installation, or the original source?
17:34
jonas42
the files would be at the original source
17:44
jonas42
(going back to the discussion if a non-harvested dataset can exist without data)
17:45
pameyer
you can have a dataset without files easily enough
17:46
pameyer
whether or not that counts as without data is more a question of definitions
17:47
jonas42
ok, good. i added that info "This is only about metadata! The data would reside at its original source." to the ticket anyways.
17:48
pameyer
you probably checked this before me, but datacite's content negotiation doesn't list oai as a supported output format
17:49
pameyer
5402 mentions that using oai-pmh might limit the amount of metadata available; do you know if there's a format that would fit?
17:53
jonas42
i think it doesn't exist for data.datacite.org but oai_dc and oai_datacite are defined at https://oai.datacite.org/oai?verb=ListMetadataFormats
17:53
pameyer
right - I'd been thinking about the "query datacite / eventually crossref" bit
17:54
pameyer
could be on the wrong track though
18:21
donsizemore joined #dataverse
18:28
jri joined #dataverse
18:33
jonas42
thank you for your input - i'm gonna call it a day now (7:30p here)
18:35
jonas42
@pameyer the querying solution is just a workaround (which i would be happy/satisfied to use for now)
18:35
jonas42
have a nice day everyone! :D
18:49
pdurbin
jonas42: back. Thanks!
18:49
pdurbin
Enjoy your evening. Get out of here. :)
19:09
poikilotherm joined #dataverse
19:51
poikilotherm joined #dataverse
19:52
poikilotherm
Hey pdurbin, I am about to refactor some exporting stuff to make it unit testable. Is this chunk too big for the current issue?
19:57
pdurbin
poikilotherm: sigh. Probably. Hey, I just booked flights for http://osd.mpdl.mpg.de . Are you coming? I hear jonas42 might come. :)
19:58
poikilotherm
Oh cool!
19:58
poikilotherm
I will try to make it; it really depends on construction work in the house and the institute's purse. Hopefully I can negotiate about the latter tomorrow...
19:59
pdurbin
cool
19:59
poikilotherm
:-)
19:59
pdurbin
jonas42: I booked a hotel in Mitte.
20:00
poikilotherm
Which one? Could try to get a room nearby
20:00
poikilotherm
Or was that a quote from jonas42?
20:01
pdurbin
well, he drew me a circle suggesting generally where to stay
20:01
poikilotherm
pdurbin: do you think I should add unit tests for stuff that I do or should I just skip that?
20:01
poikilotherm
Adding those requires refactoring to make the code testable
20:02
poikilotherm
(Need mocks for the logging in the export functions)
20:02
pdurbin
poikilotherm: maybe for now you could add /* TODO: Refactor this code to make it testable.*/ . Then we could talk about it and send it back to you if we want the refactoring now. Does that make sense?
20:02
poikilotherm
Hmm ok
20:02
poikilotherm
Sounds fair
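(Editor's note: a minimal, self-contained sketch of the kind of unit test being discussed here, assuming JUnit 4 and Mockito; the Exporter class and its method are illustrative stand-ins, not actual Dataverse code. The point is the refactoring poikilotherm mentions: pass the Logger in instead of using a static field, so the logging in the export path becomes mockable.)

```java
import static org.mockito.Mockito.any;
import static org.mockito.Mockito.eq;
import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.verify;

import java.util.logging.Level;
import java.util.logging.Logger;
import org.junit.Test;

public class ExporterLoggingTest {

    // Illustrative stand-in for an export method; not Dataverse code.
    static class Exporter {
        private final Logger logger;

        Exporter(Logger logger) {
            this.logger = logger;
        }

        void export(String datasetId) {
            try {
                if (datasetId == null) {
                    throw new IllegalArgumentException("no dataset given");
                }
                // ... write the export output here ...
            } catch (RuntimeException ex) {
                // Export failures are logged rather than rethrown.
                logger.log(Level.WARNING, "export failed", ex);
            }
        }
    }

    @Test
    public void failedExportIsLoggedNotThrown() {
        Logger mockLogger = mock(Logger.class);
        new Exporter(mockLogger).export(null);
        verify(mockLogger).log(eq(Level.WARNING), eq("export failed"), any(Throwable.class));
    }
}
```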
20:10
pdurbin
poikilotherm: Jacoco is reporting that those lines aren't covered at all?
20:10
poikilotherm
Yeah
20:11
poikilotherm
There is not a single unit test for DatasetServiceBean
20:11
pdurbin
bleh
20:11
poikilotherm
Yeah...
20:12
pdurbin
18% coverage according to https://coveralls.io/github/IQSS/dataverse
20:12
pdurbin
better than DVN 3.x which was 0%
20:12
pameyer
yeah, but we know that 18% number is wrong
20:13
pdurbin
well, it's measuring something but yeah
20:14
pameyer
I didn't have enough momentum to get the integration test coverage reports into anything sane
20:15
poikilotherm
Actually, coverage reports mostly just produce these numbers, and they are mostly just that: a number. Unless you do proper test engineering, set up some real business-logic testing, and properly configure what to count and what not, the number is useless. And even if you do all of that, you most likely cannot reach 100%, because you will never ever test every single line, but only
20:15
poikilotherm
those where it makes sense.
20:16
pdurbin
yeah, you're both right
20:17
pameyer
poikilotherm: it's like you saw sqlite in the news over the weekend :)
20:19
pdurbin
poikilotherm: at lunch I invited someone to your Kubernetes meeting. He thinks it's cool you're interested in running Dataverse on Kubernetes but asked why. I didn't have a good answer for him beyond "his devops guy wants to run Dataverse on Kubernetes." Is that it? Is there more of a reason?
20:19
pdurbin
What should I have told him?
20:20
poikilotherm
For my work, the Kubernetes part of this is just embellishment
20:20
pameyer
and for the curious, last I checked the numbers were 29% instruction, 15% branch for DatasetServiceBean
20:21
poikilotherm
It's cool to have Kubernetes for devs, too, but this is more for the UI/UX and stakeholder people
20:22
poikilotherm
My dev work is aimed at other things, like the PID stuff, etc.
20:22
poikilotherm
So my primary goal is the Docker stuff
20:22
poikilotherm
Make things testable
20:23
pdurbin
Testing storage drivers, for example.
20:23
poikilotherm
Yeah
20:23
pdurbin
That didn't come to mind at lunch. Thanks.
20:23
poikilotherm
If all this also leads to running Dataverse on Kubernetes easily, I am very happy someone else benefits ;-)
20:24
pdurbin
Sure. Me too.
20:24
poikilotherm
As Kubernetes is "just" an automation framework around containers (not necessarily Docker), this is kind of "a level above"
20:24
pdurbin
Yeah. Orchestration.
20:26
poikilotherm
Of course Kubernetes makes things easier. Just like AWS does. Or other cloud tools.
20:26
pameyer
I'm surprised to hear that k8s was a dev thing. in my hands, getting a semi-functional dev setup was significantly non-trivial
20:27
poikilotherm
Yeah. Minikube is ok, but Docker alone is waaaaay easier.
20:27
pdurbin
I've only ever used Minishift. Not Minikube.
20:27
poikilotherm
Kubernetes hasn't been around as long as Docker has. It's a maturity thing, I suppose.
20:27
pameyer
even with minikube, there are DNS/routing snarls that need to be unsnarled :(
20:28
poikilotherm
docker-compose for the win ;-)
20:28
pdurbin
poikilotherm: so in dev would you use Minikube? Or just vanilla Docker?
20:28
poikilotherm
Just vanilla Docker ;-)
20:28
poikilotherm
And maybe docker-compose
20:29
pdurbin
ok
20:29
poikilotherm
That's more or less trivial as an addendum to Docker
20:29
poikilotherm
In contrast to Kubernetes ;-)
20:29
poikilotherm
Oh, I have a good reason to go for Kubernetes and Dataverse
20:30
poikilotherm
One of our supercomputer guys and one of the bioinformatics people asked about running their code in GitLab CI a few days ago.
20:30
poikilotherm
They need test data for this, as they need to verify the code
20:31
poikilotherm
It would be really cool to have the code running next to Dataverse within the same Kubernetes cluster, so the data transfer is quick
20:31
poikilotherm
GitLab CI has a runner for Kubernetes and orchestrates tests on such a cluster with Docker images.
20:32
poikilotherm
And I thought about using things like the R integrations, WholeTale, etc. all in the same Kubernetes cluster next to Dataverse
20:32
poikilotherm
We have quite a bunch of people that need about 300-400 megs transferred for tests
20:33
poikilotherm
If you do a test every few minutes, having those next to Dataverse should speed up things ;-)
20:34
pdurbin
Sure, reminds me of NDS Labs Workbench which runs on Kubernetes: http://www.nationaldataservice.org/projects/labs.html
20:35
poikilotherm
Sounds fancy
20:36
poikilotherm
Oh pdurbin: are you coming to Berlin alone or is somebody else from IQSS coming with you?
20:36
pameyer
I'm in the vast minority, but I tend to think in-place computing on data is a better approach than trying to have fast transfers
20:37
pdurbin
poikilotherm: just me
20:37
poikilotherm
There are a lot of different definitions of "in place computing"... Could you elaborate a little?
20:38
poikilotherm
pdurbin: ok. :-)
20:38
poikilotherm
pdurbin: About the TODO comments: like that: https://github.com/IQSS/dataverse/pull/5371/commits/81fbffe736a0c3070ca24fec5e444b583a109385
20:38
poikilotherm
?
20:38
pameyer
poikilotherm: repository software and compute pipelines using the same storage
20:38
pdurbin
my wife has 100% German ancestry and would love to come some day :)
20:39
pdurbin
poikilotherm: perfect TODO comments. Thanks!
20:39
poikilotherm
pameyer: "same storage" => S3 same enough? Or real posix share?
20:41
pameyer
poikilotherm: posix. that's what the researchers in question write their software to read from
20:41
pameyer
and dataverse on s3 doesn't support what I consider to be direct compute access
20:41
poikilotherm
Yeah.
20:42
poikilotherm
Posix has its own downsides. No loose coupling. Locking issues. And the like
20:42
pameyer
computing on published data means you get to make the source read only :)
20:43
pameyer
and lets you decouple repository, storage and compute infrastructure
20:43
poikilotherm
What about a hybrid? Use local caches?
20:44
pameyer
it's a possibility - but if you're thinking "big data", you don't want more copies than you need
20:44
poikilotherm
Yeah, that's true
20:44
poikilotherm
For "real big data", the POSIX is more or less inevitable
20:44
pameyer
and the work on having the repo orchestrate local caching got pushed back for other stuff
20:44
poikilotherm
S3 is not fast enough
20:44
pameyer
there's always trade offs
20:52
poikilotherm
pdurbin: /me crossing fingers @landreev will look into this: https://github.com/IQSS/dataverse/issues/5345#issuecomment-447994649
20:54
pdurbin
I dragged it to code review.
20:54
pdurbin
still WIP?
20:55
poikilotherm
YES!!!!
20:55
poikilotherm
This is not ready, as stated in https://github.com/IQSS/dataverse/issues/5345#issuecomment-447994649
20:55
pdurbin
ok :)
20:55
poikilotherm
The harvester timers are still present
20:55
poikilotherm
Need to refactor that first
20:56
pdurbin
ok, but you're basically blocked, right?
20:56
poikilotherm
A bit. I can continue with those, but before the PR is merged, we really should talk about this.
20:56
pdurbin
sure
20:57
pdurbin
Are you going to change the installer?
20:57
poikilotherm
IMHO every time that a refactoring takes place, testing should be added. Otherwise you will never get over 20% ;-)
20:57
poikilotherm
Although that often leads to bigger refactoring
20:58
poikilotherm
(Need to make it testable)
20:59
poikilotherm
pdurbin: do you know off the top of your head where to look for the schedule time settings an admin has for the harvesters?
20:59
pdurbin
Does the "Set up the data source for the timers" stuff need to change at https://github.com/IQSS/dataverse/blob/v4.9.4/scripts/installer/glassfish-setup.sh#L118 ?
21:00
poikilotherm
Yes.
21:00
poikilotherm
That line can be pruned.
21:00
poikilotherm
(deleted)
21:00
poikilotherm
err.. sry. not L118, but L120
21:02
pdurbin
poikilotherm: ok, can you please add a TODO there too?
21:02
poikilotherm
Sure.
21:02
pdurbin
Thanks. Otherwise it's hard to keep track of all the places.
21:03
poikilotherm
Err, I'll just add a commit removing the line. That's easier and keeps us moving forward
21:05
poikilotherm
Or would you prefer to comment it out and add a comment for future reference?
21:07
pdurbin
If we don't need that line any more, removing it is fine.
21:09
poikilotherm
I just made up my mind. For code review and QA it's easier to comment it out. One day the installer will be refactored and then it's alright to remove anything commented out. Until that day, you can see in git blame/bisect what happened and why.
21:09
pdurbin
Ok, sounds fine.
21:11
poikilotherm
https://github.com/IQSS/dataverse/pull/5371/commits/db73d5b88f9742f9e515bec323479023c3de5068
21:14
pdurbin
looks good. also in that script is some timer=true stuff. Can that be removed as well?
21:14
poikilotherm
Nope, that is used for indicating who is the "master of puppets".
21:14
poikilotherm
See timer docs :-D
21:15
poikilotherm
https://github.com/IQSS/dataverse/blob/db73d5b88f9742f9e515bec323479023c3de5068/doc/sphinx-guides/source/admin/timers.rst
21:15
pdurbin
ok, want to remove that from the script too?
21:16
poikilotherm
Yeah, kcondon is industrious :-)
21:16
poikilotherm
Seems like my PR for the AWS stuff has a chance to be merged before the holidays :-)
21:17
poikilotherm
pdurbin: nope. You need it in new installs
21:17
pdurbin
he's on a roll
21:17
poikilotherm
It should stay there, as it would break stuff otherwise
21:17
pdurbin
ok
21:18
poikilotherm
It *should* be replaced by some automatic handling
21:18
poikilotherm
See my " (Might get addressed for automation in a later Dataverse version using cluster support from the application server.)" in the timer docs... ;-)
21:23
poikilotherm
Payara has Hazelcast baked in... That could solve stuff like this.
21:24
pdurbin
ok
21:25
pdurbin
"It is much easier for container approaches to use application scoped JDBC connections, but those seem not to be reusable for EJB timers."
21:26
poikilotherm
Yeah
21:27
pdurbin
I think you and Leonid have got this. :)
21:27
pdurbin
Wake me up when it's over. :)
21:30
poikilotherm
LOL
21:33
pameyer
I still don't have a great idea why EJB timers were used instead of cron jobs.
21:34
pameyer
I'll defer to the folks that actually understand EJB though
21:38
poikilotherm
Most certainly this was just easier.
21:38
pdurbin
Yeah, I've always gone for the cron approach.
21:38
poikilotherm
Cron jobs need manual setup
21:39
pdurbin
yeah
21:39
poikilotherm
EJB timers come for free
21:39
pdurbin
free as in puppy
21:40
poikilotherm
Why "as in puppy"? Ok, the current code is a bit bloated, but the @Schedule annotation is fairly easy
21:40
pdurbin
I do hope I eventually understand the cron equivalent for Java EE, which I guess is these EJB timers. But you can pry cron from my cold, dead hands.
21:41
pdurbin
I'll be happy to have an easy @Schedule example to look at once this gets merged.
21:42
poikilotherm
Here you go
21:42
poikilotherm
https://github.com/IQSS/dataverse/blob/db73d5b88f9742f9e515bec323479023c3de5068/src/main/java/edu/harvard/iq/dataverse/DatasetServiceBean.java#L581
21:42
poikilotherm
Executed every day at 2am local time
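(Editor's note: for readers wanting the "easy @Schedule example" pdurbin asked for, here is a minimal, hypothetical sketch of a Java EE timer along the lines of the code poikilotherm links above; the class and method names are illustrative, not the actual Dataverse implementation.)

```java
import javax.ejb.Schedule;
import javax.ejb.Singleton;

// Illustrative names only; not the actual Dataverse code.
@Singleton
public class NightlyExportScheduler {

    // Fires every day at 02:00 local server time, like the export job
    // discussed here. persistent = false keeps the timer out of the
    // EJB timer table in the database.
    @Schedule(hour = "2", minute = "0", persistent = false)
    public void runNightlyExport() {
        // call the export service bean here
    }
}
```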
21:44
pdurbin
doesn't look too bad
21:44
poikilotherm
You were asking about the master timer stuff above... Show me how to get that in place within a multi-node cluster setup with vixie-cron and the like...
21:44
poikilotherm
With stuff like Hazelcast or others (ZooKeeper, ...) you can coordinate these timers across a cluster.
21:45
poikilotherm
But of course: you could create an API call that is triggered from curl
21:45
poikilotherm
Which is inside a cronjob or webcron
21:45
poikilotherm
I think using EJB is easier ;-)
21:46
pdurbin
I'm fine with whatever works. I'm still learning EJB.
21:46
poikilotherm
:-)
21:47
poikilotherm
I really like the concept of CDI
21:47
poikilotherm
Which is IMHO the greatest benefit and use case of EJB
21:48
poikilotherm
http://www.adam-bien.com/roller/abien/entry/cdi_with_or_without_ejb
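(Editor's note: a tiny, hypothetical CDI sketch for readers unfamiliar with the concept poikilotherm is praising; the class names are made up and not Dataverse code. The container instantiates and wires the dependency, so there is no explicit "new".)

```java
import javax.enterprise.context.ApplicationScoped;
import javax.inject.Inject;

// Illustrative classes only; not Dataverse code.
@ApplicationScoped
public class GreetingService {
    public String greet(String name) {
        return "Hello, " + name;
    }
}

@ApplicationScoped
class GreetingClient {
    @Inject
    GreetingService greetingService; // wired by the CDI container

    public String hello() {
        return greetingService.greet("Dataverse");
    }
}
```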
21:48
pameyer
poikilotherm: is "complexity budget" something that the RCE-ish folks think about?
21:48
poikilotherm
RCE = RSE?
21:49
pameyer
yup - I typo things :(
21:49
poikilotherm
;-)
21:49
poikilotherm
No worries
21:49
poikilotherm
And: yeah. Anything that makes stuff easier for a research is good. Less complexity is good.
21:50
poikilotherm
research = researcher
21:51
pameyer
yup
21:51
pameyer
applies to infrastructure too
21:54
pameyer
but pdurbin and I will learn to be happy with EJB timers instead of cron ;)
21:54
poikilotherm
It is less complex ;-)
21:55
poikilotherm
At least if you don't ask me to reschedule the OAI exports.
21:55
poikilotherm
Those are hard-wired to 2am as before
21:55
poikilotherm
But now you cannot change those via database hacking
21:56
pameyer
I'm pretty sure database hacking isn't a documented way to change them anyway
21:59
poikilotherm
It's in the docs...
21:59
poikilotherm
http://guides.dataverse.org/en/latest/admin/timers.html#id3
22:00
poikilotherm
"This job is automatically scheduled to run at 2AM local time every night. If really necessary, it is possible (for an advanced user) to change that time by directly editing the EJB timer application table in the database."
22:03
pameyer
good to know - obviously not something I've worried about reconfiguring
22:04
poikilotherm
That's not a good thing to consider doing anyway. And it won't be possible anymore once the non-persistent EJB timers are in place.
22:04
poikilotherm
If this really needs to be configurable, a clean implementation should be done.
22:07
pameyer
how do you monitor non-persistent EJB timers?
22:10
poikilotherm
I have not yet looked into this. The current approach for monitoring is not very sophisticated: http://guides.dataverse.org/en/latest/admin/monitoring.html#id9
22:10
poikilotherm
I haven't tried to use this with non-persistent timers yet
22:11
poikilotherm
pameyer: being the API guy, is there a health API present?
22:11
pameyer
not in dataverse that I know of
22:11
pameyer
I think there's a glassfish one that's off by default
22:11
poikilotherm
What *should* be done is creating something like this
22:11
poikilotherm
Not only for timers
22:12
poikilotherm
And here we go: https://github.com/eclipse/microprofile-health
22:12
poikilotherm
REST API for health
22:12
poikilotherm
(you could use JMX of course)
22:12
poikilotherm
But I think REST could be easier for common monitoring systems
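(Editor's note: a minimal sketch of what a MicroProfile Health check could look like, assuming MicroProfile Health 1.x as it existed at the time and an application server that implements the spec, e.g. Payara. Dataverse does not ship anything like this; the class name and the "ejb-timers" check are hypothetical. The server exposes the result under a REST endpoint such as /health.)

```java
import javax.enterprise.context.ApplicationScoped;
import org.eclipse.microprofile.health.Health;
import org.eclipse.microprofile.health.HealthCheck;
import org.eclipse.microprofile.health.HealthCheckResponse;

// Hypothetical check; not part of Dataverse.
@Health
@ApplicationScoped
public class TimerHealthCheck implements HealthCheck {

    @Override
    public HealthCheckResponse call() {
        // A real check would inspect the TimerService (or another signal)
        // to confirm the scheduled jobs are registered and firing.
        boolean timersLookHealthy = true; // placeholder condition

        return HealthCheckResponse.named("ejb-timers")
                .state(timersLookHealthy)
                .build();
    }
}
```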
22:13
pameyer
asadmin set server.monitoring-service.module-monitoring-levels.jvm=LOW
22:13
poikilotherm
Oh yeah there is more where that came from
22:13
poikilotherm
This is the JMX stuff
22:13
poikilotherm
Pretty bloated
22:13
pameyer
I used glassfish "health" api to see if glassfish was alive enough to try and deploy dataverse too
22:14
poikilotherm
;-)
22:14
pameyer
re: timer monitoring; timer oddness has been reported enough that it was worth adding a section about how to make sure the timers were happy
22:15
pameyer
I don't know if that approach is one that's in wide use, but it might be worth checking at some point
22:17
pameyer
health api would be helpful, but might be more implementation complexity than it's worth
22:17
pameyer
or it might not be ;)
22:18
poikilotherm
Health API just for timers is overkill, but it might be a first step.
22:18
poikilotherm
There is lots of stuff that could be added to this
22:18
pameyer
there might be other stuff under `api/info`; but `api/info/version` is the only one I've used
22:28
poikilotherm
Alright guys, it's now 23:28 over here. I will get some sleep now. Read and hear you tomorrow. :-)
22:28
pameyer
have a good night
22:28
poikilotherm
Have a nice evening :-)