
IRC log for #dataverse, 2018-12-17

Connect via chat.dataverse.org to discuss Dataverse (dataverse.org, an open source web application for sharing, citing, analyzing, and preserving research data) with users and developers.


All times shown according to UTC.

Time Nick Message
07:45 juancorr joined #dataverse
08:01 jri joined #dataverse
08:56 jri_ joined #dataverse
13:24 pdurbin joined #dataverse
14:48 donsizemore joined #dataverse
14:57 poikilotherm joined #dataverse
15:00 poikilotherm Morning pdurbin :-)
15:00 pdurbin mornin
15:02 poikilotherm Getting unit tests in place is hard work...
15:02 poikilotherm Hacking on some for DatasetServiceBean
15:02 poikilotherm (while working on the scheduled stuff)
15:02 poikilotherm Are these tests missing due to lack of time?
15:05 pdurbin Not everyone has the testing religion.
15:06 poikilotherm Yeah...
15:07 poikilotherm BTW - http://shop.oreilly.com/product/0636920078777.do arrived today. Will read it during the holidays ;-)
15:10 pdurbin nice, I have https://www.oreilly.com/library/view/containerizing-continuous-delivery/9781491986851/ in my hands. ~50 pages. Free book from JavaOne.
15:11 poikilotherm Yeah, seems like a chapter of the book ;-) The author is the same :-)
15:13 pdurbin Did you make a decision about running a poll or not? For your meeting.
15:13 poikilotherm Oh, forgot about that. As you wrote that you run the show, let's stick with that for now.
15:14 pdurbin Uh, I'm not sure what I wrote. I thought I was asking a question.
15:16 poikilotherm "This morning @djbrooke and I spoke about next week's community call on December 18th and we do plan to have it. "
15:17 pdurbin Oh. Yes, the normal call tomorrow is on. I'm asking if you've made a decision about your call.
15:18 poikilotherm ???
15:18 poikilotherm "We like your idea of having those interested in discussing this issue to stay a little longer on the call."
15:19 pdurbin Ok, so for you it's clear what the plan is? I don't think it's so clear. How do you feel about leaving a new comment on the issue?
15:19 pameyer joined #dataverse
15:19 poikilotherm Just did that :-D
15:20 pdurbin perfect, thanks!
15:35 pameyer dockerhub decided they needed a redesign :(
15:40 pdurbin I can't tell a huge difference but I'm not on Docker Hub much.
15:41 pameyer that suggests that you normally browse with javascript enabled
15:41 pdurbin like most people, yes :)
15:41 pameyer I only noticed in passing that dockerhub links are now empty pages
16:21 jonas42 joined #dataverse
16:23 jonas42 hey everyone! anybody interested in chatting about "Harvesting from non-OAI-PMH sources" #5402? I think just a small input on how others do this would help a lot... No pressure though, I'll go engage the community mailing list otherwise
16:24 donsizemore @jonas42 i'd volunteer thu-mai and mandy if you want to join us in slack
16:27 jonas42 i was just a visitor in the dataverse slack
16:27 donsizemore i'm not cool enough. i meant odum's slack
16:28 jonas42 sure! i'm jonas.kahle@wzb.eu
16:30 pameyer @jonas42 I'm interested; but there may be a lot of lag on my end
16:34 jonas42 @pameyer don't worry!
16:37 donsizemore @jonas42 mandy leaves for japan tomorrow, so her stock response is "2019 mandy will worry about this" (but we're definitely interested)
16:39 jonas42 it's not the time of year to bring up new questions :D i totally get that
16:40 jonas42 i was just wondering if i'm totally off-track when thinking about using dataverse as a place for external references (outside of "default" harvesting)
16:41 donsizemore @jonas42 we found in converting IPUMS XML to JSON that a) each dataset claims the same DOI, which b) directs to the IPUMS homepage.
16:42 donsizemore plunking out the python to generate the JSON was kind of fun, but we're going to want to reference external data more and more (particularly with the Trusted Resource Storage Agent that Akio is developing)
16:43 donsizemore heading for lunch but defo interested
16:45 pdurbin jonas42: hi! In 5 minutes we're walking over to our team holiday lunch.
16:45 pdurbin Do jonas42 and poikilotherm know each other?
16:47 pdurbin pameyer: good discussion about https://github.com/IQSS/dataverse/issues/5406 and friends after standup. Thanks for looking into this.
16:47 pameyer pdurbin: no problem - glad to hear it
16:47 pameyer and glad that wiser heads than me are looking at how to fix it
16:48 pdurbin and wiser than me
16:48 pdurbin Friday was very confusing for me.
16:50 pameyer some days have higher proportions of unexpected things breaking than others
17:10 jonas42 who is poikilotherm? (according to wikipedia, (s)he is not human)
17:11 pameyer jonas42: from some of the various comments, it seemed to me like you were trying to do something like create a dataset where the data was links to other datasets, and metadata was additional annotations about them.  is that anywhere close?
17:12 pameyer https://github.com/poikilotherm
17:12 pameyer ... but I haven't looked at what wikipedia said
17:12 jonas42 poikilotherm /ˈpɔɪkɪlə(ʊ)ˌθəːm/ noun (Zoology; plural: poikilotherms): an organism that cannot regulate its body temperature except by behavioural means such as basking or burrowing.
17:13 jonas42 @pameyer i try to describe it in #5402
17:14 pameyer I evaluate technical issues without consideration to the body temperature mechanism of the source :)
17:14 pameyer will re-read 5402
17:22 pameyer jonas42: it seems to me like 5402 might be better off as at least 2 issues
17:24 pameyer the way I understand things, harvested datasets are only editable at the source - so "enriching" / adding metadata to a harvested dataset would be one
17:25 pameyer additional import format(s) would be another (datacite, crossref, etc)
17:26 pameyer harvesting set membership is also defined at the source.  this could potentially be worked-around by defining the set in a static file somewhere; but I'm not sure how that would interact w\ source metadata updates
17:27 jonas42 well i was trying to avoid opening a new issue at all....
17:27 pameyer the granularity one there might be another
17:27 jonas42 but okay, so i'm not totally off-track with this
17:28 pameyer issues are free :)
17:28 pameyer it's definitely good to have the high-level goal in an issue (at least in my opinion)
17:29 pameyer over time, I've picked up the habit of trying to break stuff down into the smallest component chunks though
17:31 pameyer one question that's not clear from the issue - will the users be getting files from your installation, or the original source?
17:34 jonas42 the files would be at the original source
17:44 jonas42 (going back to the discussion of whether a non-harvested dataset can exist without data)
17:45 pameyer you can have a dataset without files easily enough
17:46 pameyer whether or not that counts as without data is more a question of definitions
17:47 jonas42 ok, good. i added that info "This is only about metadata! The data would reside at its original source." to the ticket anyways.
17:48 pameyer you probably checked this before me, but datacite's content negotiation doesn't list oai as a supported output format
17:49 pameyer 5402 mentions that using oai-pmh might limit the amount of metadata available; do you know if there's a format that would fit?
17:53 jonas42 i think it doesn't exist for data.datacite.org but oai_dc and oai_datacite are defined at https://oai.datacite.org/oai?verb=ListMetadataFormats
17:53 pameyer right - I'd been thinking about the "query datacite / eventually crossref" bit
17:54 pameyer could be on the wrong track though
18:21 donsizemore joined #dataverse
18:28 jri joined #dataverse
18:33 jonas42 thank you for your input - i'm gonna call it a day now (7:30pm here)
18:35 jonas42 @pameyer the querying solution is just a workaround (which i would be happy/satisfied to use for now)
18:35 jonas42 have a nice day everyone! :D
18:49 pdurbin jonas42: back. Thanks!
18:49 pdurbin Enjoy your evening. Get out of here. :)
19:09 poikilotherm joined #dataverse
19:51 poikilotherm joined #dataverse
19:52 poikilotherm Hey pdurbin, I am about to refactor some exporting stuff to make it unit testable. Is this chunk too big for the current issue?
19:57 pdurbin poikilotherm: sigh. Probably. Hey, I just booked flights for http://osd.mpdl.mpg.de . Are you coming? I hear jonas42 might come. :)
19:58 poikilotherm Oh cool!
19:58 poikilotherm I will try to make it, really depends on construction work in the house and the institute's purse. Hopefully I can negotiate tomorrow about the latter...
19:59 pdurbin cool
19:59 poikilotherm :-)
19:59 pdurbin jonas42: I booked a hotel in Mitte.
20:00 poikilotherm Which one? Could try to get a room nearby
20:00 poikilotherm Or was that a quote from jonas42?
20:01 pdurbin well, he drew me a circle suggesting generally where to stay
20:01 poikilotherm pdurbin: do you think I should add unit tests for stuff that I do or should I just skip that?
20:01 poikilotherm Adding those needs refactoring to be testable
20:02 poikilotherm (Need mocks for the logging in the export functions)
20:02 pdurbin poikilotherm: maybe for now you could add /* TODO: Refactor this code to make it testable.*/ . Then we could talk about it and send it back to you if we want the refactoring now. Does that make sense?
20:02 poikilotherm Hmm ok
20:02 poikilotherm Sounds fair
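(A minimal sketch of the refactoring being discussed, assuming a hypothetical helper class; the names below are illustrative and not the actual DatasetServiceBean code. If the logger is injected instead of being referenced as a static field, a plain JUnit test with a Mockito mock can cover the export path without a running application server.)

    // ExportHelper.java - hypothetical production code: the Logger is passed in
    // so a unit test can replace it with a mock.
    import java.util.logging.Level;
    import java.util.logging.Logger;

    public class ExportHelper {
        private final Logger logger;

        public ExportHelper(Logger logger) {
            this.logger = logger;
        }

        public boolean exportDataset(String identifier) {
            if (identifier == null || identifier.isEmpty()) {
                logger.log(Level.WARNING, "No identifier given, skipping export");
                return false;
            }
            logger.log(Level.FINE, "Exporting {0}", identifier);
            return true;
        }
    }

    // ExportHelperTest.java - hypothetical unit test, assuming JUnit 4 and
    // Mockito are available on the test classpath.
    import static org.junit.Assert.assertFalse;
    import static org.mockito.Mockito.*;

    import java.util.logging.Level;
    import java.util.logging.Logger;
    import org.junit.Test;

    public class ExportHelperTest {

        @Test
        public void exportWithoutIdentifierLogsWarning() {
            Logger logger = mock(Logger.class);
            ExportHelper helper = new ExportHelper(logger);

            assertFalse(helper.exportDataset(null));
            // verifies the warning path was taken without parsing log output
            verify(logger).log(eq(Level.WARNING), anyString());
        }
    }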
20:10 pdurbin poikilotherm: Jacoco is reporting that those lines aren't covered at all?
20:10 poikilotherm Yeah
20:11 poikilotherm There is not a single unit test for DatasetServiceBean
20:11 pdurbin bleh
20:11 poikilotherm Yeah...
20:12 pdurbin 18% coverage according to https://coveralls.io/github/IQSS/dataverse
20:12 pdurbin better than DVN 3.x which was 0%
20:12 pameyer yeah, but we know that 18% number is wrong
20:13 pdurbin well, it's measuring something but yeah
20:14 pameyer I didn't have enough momentum to get the integration test coverage reports into anything sane
20:15 poikilotherm Actually, coverage reports most of the time produce these numbers and they are mostly just that: a number. As long as you don't do proper test engineering, set up some real business logic testing, and properly configure what to count and what not, this number is useless. And if you do all of this, you most likely still cannot reach 100%, because you will never ever test every single line, but only those where it makes sense.
20:16 pdurbin yeah, you're both right
20:17 pameyer poikilotherm: it's like you saw sqlite in the news over the weekend :)
20:19 pdurbin poikilotherm: at lunch I invited someone to your Kubernetes meeting. He thinks it's cool you're interested in running Dataverse on Kubernetes but asked why. I didn't have a good answer for him beyond "his devops guy wants to run Dataverse on Kubernetes." Is that it? Is there more of a reason?
20:19 pdurbin What should I have told him?
20:20 poikilotherm For my work, the Kubernetes part of this is just embellishment
20:20 pameyer and for the curious, last I checked the numbers were 29% instruction, 15% branch for DatasetServiceBean
20:21 poikilotherm It's cool to have Kubernetes for devs, too, but this is more for the UI/UX and stakeholder people
20:22 poikilotherm My dev stuff is going for other things like the PID things etc
20:22 poikilotherm So my primary goal is the Docker stuff
20:22 poikilotherm Make things testable
20:23 pdurbin Testing storage drivers, for example.
20:23 poikilotherm Yeah
20:23 pdurbin That didn't come to mind at lunch. Thanks.
20:23 poikilotherm If all this leads to running this on Kubernetes easily too, I am very happy that someone else gets a benefit ;-)
20:24 pdurbin Sure. Me too.
20:24 poikilotherm As Kubernetes is "just" an automation framework around containers (not necessarily Docker), this is kind of "a level above"
20:24 pdurbin Yeah. Orchestration.
20:26 poikilotherm Of course Kubernetes makes things easier. Just like AWS does. Or other cloud tools.
20:26 pameyer I'm surprised to hear that k8s was a dev thing. in my hands, getting a semi-functional dev setup was significantly non-trivial
20:27 poikilotherm Yeah. Minikube is ok, but Docker only is waaaaay easier.
20:27 pdurbin I've only ever used Minishift. Not Minikube.
20:27 poikilotherm Kubernetes hasn't been around as long as Docker has. It's a maturity thing, I suppose.
20:27 pameyer even with minikube, there's DNS/routing snarls that need to be unsnarled :(
20:28 poikilotherm docker-compose for the win ;-)
20:28 pdurbin poikilotherm: so in dev would you use Minikube? Or just vanilla Docker?
20:28 poikilotherm Just vanilla Docker ;-)
20:28 poikilotherm And maybe docker-compose
20:29 pdurbin ok
20:29 poikilotherm That's more or less trivial as an addendum to Docker
20:29 poikilotherm In contrast to Kubernetes ;-)
20:29 poikilotherm Oh I have a good reason why to go for Kubernetes and Dataverse
20:30 poikilotherm One of our supercomputer guys and one from the bioinformatics people asked about running their code in GitLab CI a few days ago.
20:30 poikilotherm They need test data for this, as they need to verify the code
20:31 poikilotherm It would be really cool to have the code running next to Dataverse within the same Kubernetes cluster, so the data transfer is quick
20:31 poikilotherm Gitlab CI has a runner for Kubernetes and orchestrates tests on such a cluster with Docker images.
20:32 poikilotherm And I thought about using things like the R integrations, WholeTale etc all in the same Kubernetes cluster next to Dataverse
20:32 poikilotherm We have quite a bunch of people that need about 300-400 megs transferred for tests
20:33 poikilotherm If you do a test every few minutes, having those next to Dataverse should speed up things ;-)
20:34 pdurbin Sure, reminds me of NDS Labs Workbench which runs on Kubernetes: http://www.nationaldataservice.org/projects/labs.html
20:35 poikilotherm Sounds fancy
20:36 poikilotherm Oh pdurbin: are you coming alone to Berlin or is somebody else from IQSS with you?
20:36 pameyer I'm in the vast minority, but I tend to think in-place computing on data is a better approach than trying to have fast transfers
20:37 pdurbin poikilotherm: just me
20:37 poikilotherm There are a lot of different definitions of "in place computing"... Could you elaborate a little?
20:38 poikilotherm pdurbin: ok. :-)
20:38 poikilotherm pdurbin: About the TODO comments: like that: https://github.com/IQSS/dataverse/pull/5371/commits/81fbffe736a0c3070ca24fec5e444b583a109385
20:38 poikilotherm ?
20:38 pameyer poikilotherm: repository software and compute pipelines using same storage
20:38 pdurbin my wife has 100% German ancestry and would love to come some day :)
20:39 pdurbin poikilotherm: perfect TODO comments. Thanks!
20:39 poikilotherm pameyer: "same storage" => S3 same enough? Or real posix share?
20:41 pameyer poikilotherm: posix.  that's what the researchers in question write their software to read from
20:41 pameyer and dataverse on s3 doesn't support what I consider to be direct compute access
20:41 poikilotherm Yeah.
20:42 poikilotherm Posix has its own downsides. No loose coupling. Locking issues. And the like
20:42 pameyer computing on published data means you get to make the source read-only :)
20:43 pameyer and lets you decouple repository, storage, and compute infrastructure
20:43 poikilotherm What about a hybrid? Use local caches?
20:44 pameyer it's a possibility - but if you're thinking "big data", you don't want more copies than you need
20:44 poikilotherm Yeah, that's true
20:44 poikilotherm For "real big data", POSIX is more or less inevitable
20:44 pameyer and the work on having the repo orchestrate local caching got pushed back for other stuff
20:44 poikilotherm S3 is not fast enough
20:44 pameyer there's always trade offs
20:52 poikilotherm pdurbin: /me crossing fingers @landreev will look into this: https://github.com/IQSS/dataverse/issues/5345#issuecomment-447994649
20:54 pdurbin I dragged it to code review.
20:54 pdurbin still WIP?
20:55 poikilotherm YES!!!!
20:55 poikilotherm This is not ready, as stated in https://github.com/IQSS/dataverse/issues/5345#issuecomment-447994649
20:55 pdurbin ok :)
20:55 poikilotherm The harvester timers are still present
20:55 poikilotherm Need to refactor that first
20:56 pdurbin ok, but you're basically blocked, right?
20:56 poikilotherm A bit. I can continue with those, but before the PR is merged, we really should talk about this.
20:56 pdurbin sure
20:57 pdurbin Are you going to change the installer?
20:57 poikilotherm IMHO every time a refactoring takes place, tests should be added. Otherwise you will never get over 20% ;-)
20:57 poikilotherm Although that often leads to bigger refactorings
20:58 poikilotherm (Need to make it testable)
20:59 poikilotherm pdurbin: do you know by heart where to look for the schedule time settings an admin has for the harvesters?
20:59 pdurbin Does the "Set up the data source for the timers" stuff need to change at https://github.com/IQSS/dataverse/blob/v4.9.4/scripts/installer/glassfish-setup.sh#L118 ?
21:00 poikilotherm Yes.
21:00 poikilotherm That line can be pruned.
21:00 poikilotherm (deleted)
21:00 poikilotherm err.. sry. not L118, but L120
21:02 pdurbin poikilotherm: ok, can you please add a TODO there too?
21:02 poikilotherm Sure.
21:02 pdurbin Thanks. Otherwise it's hard to keep track of all the places.
21:03 poikilotherm Err. I'll just add a commit removing the line. That's easier and keeps us moving forward
21:05 poikilotherm Or would you prefer to comment it out and add a comment for future reference?
21:07 pdurbin If we don't need that line any more, removing it is fine.
21:09 poikilotherm I just made up my mind. For code review and QA it's easier to comment it. One day the installer will be refactored and then it's alright to remove anything commented. Until that day, you can see in git blame/bisect what happened and why.
21:09 pdurbin Ok, sounds fine.
21:11 poikilotherm https://github.com/IQSS/dataverse/pull/5371/commits/db73d5b88f9742f9e515bec323479023c3de5068
21:14 pdurbin looks good. also in that script is some timer=true stuff. Can that be removed as well?
21:14 poikilotherm Nope, that is used for indicating who is the "master of puppets".
21:14 poikilotherm See timer docs :-D
21:15 poikilotherm https://github.com/IQSS/dataverse/blob/db73d5b88f9742f9e515bec323479023c3de5068/doc/sphinx-guides/source/admin/timers.rst
21:15 pdurbin ok, want to remove that from the script too?
21:16 poikilotherm Yeah, kcondon is industrious :-)
21:16 poikilotherm Seems like my PR for the AWS stuff has a chance to be merged before the holidays :-)
21:17 poikilotherm pdurbin: nope. You need it in new installs
21:17 pdurbin he's on a roll
21:17 poikilotherm It should stay there, as it would break stuff otherwise
21:17 pdurbin ok
21:18 poikilotherm It *should* be replaced by some automatic handling
21:18 poikilotherm See my " (Might get addressed for automation in a later Dataverse version using cluster support from the application server.)" in the timer docs... ;-)
21:23 poikilotherm Payara has Hazelcast baked in... That could solve stuff like this.
21:24 pdurbin ok
21:25 pdurbin "It is much easier for container approaches to use application scoped JDBC connections, but those seem not to be reusable for EJB timers."
21:26 poikilotherm Yeah
21:27 pdurbin I think you and Leonid have got this. :)
21:27 pdurbin Wake me up when it's over. :)
21:30 poikilotherm LOL
21:33 pameyer I still don't have a great idea why EJB timers were used instead of cron jobs.
21:34 pameyer I'll defer to the folks that actually understand EJB though
21:38 poikilotherm Most certainly this was just easier.
21:38 pdurbin Yeah, I've always gone for the cron approach.
21:38 poikilotherm Cron jobs need manual setup
21:39 pdurbin yeah
21:39 poikilotherm EJB timers are for free
21:39 pdurbin free as in puppy
21:40 poikilotherm Why "as in puppy"? Ok, the current code is a bit bloated, but the @Schedule annotation is fairly easy
21:40 pdurbin I do hope I eventually understand the cron equivalent for Java EE, which I guess is these EJB timers. But you can pry cron from my cold, dead hands.
21:41 pdurbin I'll be happy to have an easy @Schedule example to look at once this gets merged.
21:42 poikilotherm Here you go
21:42 poikilotherm https://github.com/IQSS/dataverse/blob/db73d5b88f9742f9e515bec323479023c3de5068/src/main/java/edu/harvard/iq/dataverse/DatasetServiceBean.java#L581
21:42 poikilotherm Executed every day at 2am local time
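(For readers following along, a generic sketch of such a timer; the bean and method names here are made up, not copied from DatasetServiceBean. The @Schedule attributes give cron-like control, and persistent = false is what the non-persistent timers discussed later look like.)

    // Illustrative @Schedule sketch, not the exact Dataverse code.
    import javax.ejb.Schedule;
    import javax.ejb.Singleton;
    import javax.ejb.Startup;

    @Singleton
    @Startup
    public class NightlyExportTimer {

        // Fires every day at 02:00 local server time.
        // persistent = false keeps the timer in memory and recreates it on
        // deployment instead of storing it in the EJB timer database table.
        @Schedule(hour = "2", minute = "0", persistent = false)
        public void runNightlyExport() {
            // call the export service here
        }
    }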
21:44 pdurbin doesn't look too bad
21:44 poikilotherm You were asking about the master timer stuff above... Show me how to get that in place within a multinode cluster setup with vixie-cron and the like...
21:44 poikilotherm With stuff like Hazelcast or others (ZooKeeper, ...) you can coordinate this stuff in clusters.
21:45 poikilotherm But of course: you could create an API call that is triggered from curl
21:45 poikilotherm Which is inside a cronjob or webcron
21:45 poikilotherm I think using EJB is easier ;-)
21:46 pdurbin I'm fine with whatever works. I'm still learning EJB.
21:46 poikilotherm :-)
21:47 poikilotherm I really like the concept of CDI
21:47 poikilotherm Which is IMHO the greatest benefit and usecase of EJB
21:48 poikilotherm http://www.adam-bien.com/roller/abien/entry/cdi_with_or_without_ejb
21:48 pameyer poikilotherm: is "complexity budget" something that the RCE-ish folks think about?
21:48 poikilotherm RCE = RSE?
21:49 pameyer yup - I typo things :(
21:49 poikilotherm ;-)
21:49 poikilotherm No worries
21:49 poikilotherm And: yeah. Anything that makes stuff easier for a research is good. Less complexity is good.
21:50 poikilotherm research = researcher
21:51 pameyer yup
21:51 pameyer applies to infrastructure too
21:54 pameyer but pdurbin and I will learn to be happy w\ EJB timers instead of cron ;)
21:54 poikilotherm It is less complex ;-)
21:55 poikilotherm At least if you don't ask me to reschedule the OAI exports.
21:55 poikilotherm Those are hard-wired to 2am as before
21:55 poikilotherm But now you cannot change those via database hacking
21:56 pameyer I'm pretty sure database hacking isn't a documented way to change them anyway
21:59 poikilotherm It's in the docs...
21:59 poikilotherm http://guides.dataverse.org/en/latest/admin/timers.html#id3
22:00 poikilotherm "This job is automatically scheduled to run at 2AM local time every night. If really necessary, it is possible (for an advanced user) to change that time by directly editing the EJB timer application table in the database."
22:03 pameyer good to know - obviously not something I've worried about reconfiguring
22:04 poikilotherm This is not a good thing to consider doing. And it will not be possible anymore once the non-persistent EJB timers are in place.
22:04 poikilotherm If this really needs to be configurable, a clean implementation should be done.
22:07 pameyer how do you monitor non-persistent EJB timers?
22:10 poikilotherm I haven't looked into this yet. The current approach for monitoring is not very sophisticated: http://guides.dataverse.org/en/latest/admin/monitoring.html#id9
22:10 poikilotherm I haven't tried to use this with non-persistent timers yet
22:11 poikilotherm pameyer: being the API guy, is there a health API present?
22:11 pameyer not in dataverse that I know of
22:11 pameyer I think there's a glassfish one that's off by default
22:11 poikilotherm What *should* be done is creating something like this
22:11 poikilotherm Not only for timers
22:12 poikilotherm And here we go: https://github.com/eclipse/microprofile-health
22:12 poikilotherm REST API for health
22:12 poikilotherm (you could use JMX of course)
22:12 poikilotherm But I think REST could be easier for common monitoring systems
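(A rough sketch of what such a check could look like, assuming the MicroProfile Health API were added as a dependency; the class name and what it reports are purely illustrative.)

    // Illustrative only: Dataverse does not ship such a check today.
    import javax.enterprise.context.ApplicationScoped;
    import org.eclipse.microprofile.health.Health;
    import org.eclipse.microprofile.health.HealthCheck;
    import org.eclipse.microprofile.health.HealthCheckResponse;

    @Health
    @ApplicationScoped
    public class TimerHealthCheck implements HealthCheck {

        @Override
        public HealthCheckResponse call() {
            // A real check would ask the EJB TimerService whether the expected
            // timers exist; this stub simply reports "up".
            return HealthCheckResponse.named("ejb-timers").up().build();
        }
    }

    // A MicroProfile-capable server then aggregates all checks as JSON under /health.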
22:13 pameyer asadmin set server.monitoring-service.module-monitoring-levels.jvm=LOW
22:13 poikilotherm Oh yeah there is more where that came from
22:13 poikilotherm This is the JMX stuff
22:13 poikilotherm Pretty bloated
22:13 pameyer I used glassfish "health" api to see if glassfish was alive enough to try and deploy dataverse too
22:14 poikilotherm ;-)
22:14 pameyer re: timer monitoring; timer oddness has been reported enough that it was worth adding a section about how to make sure the timers were happy
22:15 pameyer I don't know if that approach is one that's in wide use, but it might be worth checking at some point
22:17 pameyer health api would be helpful, but might be more implementation complexity than it's worth
22:17 pameyer might not be too ;)
22:18 poikilotherm Health API just for timers is overkill, but it might be a first step.
22:18 poikilotherm There is lots of stuff that could be added to this
22:18 pameyer there might be other stuff under `api/info`; but `api/info/version` is the only one I've used
22:28 poikilotherm Alright guys, it's now 23:28 over here. I will get some sleep now. Read and hear you tomorrow. :-)
22:28 pameyer have a good night
22:28 poikilotherm Have a nice evening :-)
