IRC log for #dataverse, 2018-10-09

Connect via chat.dataverse.org to discuss Dataverse (dataverse.org, an open source web application for sharing, citing, analyzing, and preserving research data) with users and developers.

All times shown according to UTC.

Time S Nick Message
02:47 jri joined #dataverse
05:47 jri joined #dataverse
06:49 poikilotherm joined #dataverse
07:11 jri joined #dataverse
09:21 tcoupin joined #dataverse
10:36 pdurbin joined #dataverse
11:25 donsizemore joined #dataverse
12:10 poikilotherm Morning guys... :-)
12:19 pdurbin poikilotherm: mornin! I'm about to bike to work but talk to you all soon.
12:51 donsizemore joined #dataverse
13:40 pdurbin andrewSC bjonnh bricas candy` cdsp-rmo donsizemore dzho jri poikilotherm: the community call will start in a couple hours and if you have any ideas for topics, please reply to https://groups.google.com/d/msg/dataverse-community/71kuJ6TdUIg/tEszGls2AgAJ . 2 hours and 20 minutes from now (noon Boston time).
13:50 poikilotherm I'm sorry - can't make it today... Kids and stuff to do...
13:51 poikilotherm pdurbin: any news about the merging of my S3 code? Seems to be stuck in QA?
13:52 pdurbin Well, how many cards are in QA?
13:52 * pdurbin looks
13:53 pdurbin 8 cards in QA. I'm looking at https://waffle.io/IQSS/dataverse . So I wouldn't say it's stuck. I'd say there are a lot of cars on the highway.
13:54 poikilotherm You're right... Didn't think about the Waffle Board. Is kcondon a lone tiger in QA? All cards are assigned to him.
13:54 pdurbin Yes. Lone tiger. Lone wolf. :)
13:55 poikilotherm Oh dear... Poor guy.
13:55 poikilotherm I would not trade places with him even if you offered me a ton of gold...
13:56 poikilotherm Ok, I will just keep quiet and keep fingers crossed kcondon has the time to look into it soon...
13:56 pdurbin I was going to say... honking your horn may not get you off the highway faster. :)
13:58 pdurbin poikilotherm: oh! One thing you should do is merge the latest from develop into your branch. The pom still says 4.9.2 so it won't deploy to the test servers.
13:58 poikilotherm <ironic>Oh, I thought that would be a good idea :-D</ironic>
13:58 poikilotherm Yes sir!
13:58 poikilotherm Will do so
13:58 pdurbin Awesome. Thanks!
13:58 poikilotherm Didn't touch the code for a week as it was in QA... ;-)
13:59 pdurbin Sure. Makes sense.
13:59 pdurbin It would be nice if the people who opened this issue would comment.
13:59 pdurbin I assume the solution will work for them.
14:02 poikilotherm Maybe they switched from dataverse to something else?
14:03 poikilotherm While talking about the "responsiveness" of issues...
14:03 pdurbin could be, maybe they were only evaluating Dataverse and other solutions
14:03 poikilotherm I finally made contact with the DANS and GESIS people about the PID stuff
14:04 pdurbin Oh! Great! Are they going to comment on the issue?
14:04 poikilotherm Don't think so. Will ask them to do so, but I dunno if they will.
14:05 pdurbin thanks
14:05 poikilotherm Anyway, the approach by @fbgesis will be dropped in favor of a proxy approach.
14:05 pdurbin proxy approach?
14:05 pdurbin microservice? rest api?
14:06 poikilotherm If I got them correctly, they want a lightweight service that talks "DataCite API" language, but will make requests to da|ra instead.
14:06 pdurbin ok, sounds fine
14:06 poikilotherm So they configure Dataverse to talk to DataCite but actually will talk via their proxy with da|ra.
14:07 pdurbin interesting
14:07 poikilotherm I'm currently thinking about using this approach in a similar way
14:07 poikilotherm Actually, their approach might not be sufficient for us, as we want more than one provider at the same time and at different points in time.
14:08 pdurbin Would I be able to configure a PID provider that doesn't actually reach out to anything? Just for testing? That always returns success. That I could use when I'm off the network?
14:08 poikilotherm But the idea is not too bad. I like the approach of letting a lightweight proxy do the heavy lifting of provider integrations. This would take load off you guys, since you would no longer be responsible for the software sustainability of these provider integrations.
14:09 poikilotherm Yeah, that is the way I am currently heading. Exactly these scenarios came to my mind... ;-)
14:10 poikilotherm With this, Dataverse would need a provider class that actually offloads the work to the "external" service.
14:10 poikilotherm But this seems to be lightweight and can be kept stable with some well defined protocol / API, e.g. based on DataCite
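The two ideas discussed above (a PID provider that always reports success for offline testing, and a provider class that offloads the work to an external proxy) can be sketched roughly as follows. This is only an illustration: the `PidProvider` interface, the class names, the `/pids/` path, and the injected `send` callable are all hypothetical and are not Dataverse's actual provider API or the DataCite API.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Callable


class PidProvider(ABC):
    """Hypothetical provider interface (not Dataverse's real one)."""

    @abstractmethod
    def register(self, identifier: str, metadata: dict) -> bool:
        """Return True if the identifier was registered."""


@dataclass
class AlwaysSuccessProvider(PidProvider):
    """Offline stub: never touches the network and always reports success."""

    registered: dict = field(default_factory=dict)

    def register(self, identifier: str, metadata: dict) -> bool:
        # Remember what was "registered" so tests can inspect it later.
        self.registered[identifier] = dict(metadata)
        return True


@dataclass
class ProxyProvider(PidProvider):
    """Offloads registration to an external service.

    `send` is an injected transport (URL, body) -> HTTP status code, so the
    class itself stays free of provider-specific logic. The URL layout is an
    assumption made up for this sketch.
    """

    base_url: str
    send: Callable[[str, dict], int]

    def register(self, identifier: str, metadata: dict) -> bool:
        status = self.send(f"{self.base_url}/pids/{identifier}", metadata)
        return 200 <= status < 300
```

Swapping `AlwaysSuccessProvider` in for `ProxyProvider` would cover the "off the network" case without any other code having to change.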
14:10 pdurbin poikilotherm: this is related. A proxy to DataCite used in Australia: https://github.com/IQSS/dataverse/pull/3843/files
14:12 poikilotherm Thx!
14:13 poikilotherm I'm not sure yet how to properly depict the aspect that we want PIDs early in the game (when a dataset is created) and not just when someone hits "publish now". The current providers don't seem ready for such a scenario.
14:14 poikilotherm IMHO and IIRC and AFAIK
14:14 poikilotherm :-D
14:17 pdurbin It sounds like you want the concept of a "reserved" PID. EZID supports this. DataCite is working on support for this, from what I hear.
14:17 pdurbin Here. "Reserved": https://github.com/IQSS/dataverse/issues/5093
14:20 poikilotherm Nope, actually we want to use ePICs on creation and DOIs on publishing
14:20 poikilotherm Basically because DataCite DOIs are quite expensive compared to ePIC
14:21 pdurbin oh, interesting
14:22 pdurbin Won't your researchers want to know the DOI of the dataset before the dataset is published so they can put the DOI for the dataset in their journal article?
14:22 poikilotherm Most certainly yes :-D
14:23 poikilotherm The thing is: we want to use Dataverse for "the long tail"...
14:23 poikilotherm Automatic uploads right from the experiment etc
14:23 poikilotherm And most of this data will never be published
14:24 poikilotherm It would be a huge waste of money to use DataCite DOIs on those.
14:24 poikilotherm ePICs are 0,00129€/PID
14:24 poikilotherm (For up to 1e6 PIDs per year)
14:24 pdurbin Ok, makes sense.
14:25 pdurbin I wonder if other institutions have a similar use case, a similar story.
14:27 poikilotherm And for comparison: a DataCite DOI is about 0,1€/PID, not including membership fees etc
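To put the two prices quoted above side by side: at 0,00129 € per ePIC and roughly 0,1 € per DataCite DOI, one million PIDs cost about 1,290 € versus about 100,000 €. The sketch below compares "DOIs for everything" with "ePICs on creation, DOIs only on publishing". The function name and the 5% published fraction in the usage line are assumptions for illustration, and membership fees are left out, as in the chat.

```python
# Per-PID prices as quoted in the chat (decimal comma converted to a point).
EPIC_EUR_PER_PID = 0.00129      # quoted for up to 1e6 PIDs per year
DATACITE_EUR_PER_PID = 0.10     # approximate, excluding membership fees


def yearly_pid_cost(datasets: int, published_fraction: float) -> tuple[float, float]:
    """Return (cost with DOIs for every dataset, cost with ePIC + DOI on publish)."""
    doi_for_all = datasets * DATACITE_EUR_PER_PID
    hybrid = (datasets * EPIC_EUR_PER_PID
              + datasets * published_fraction * DATACITE_EUR_PER_PID)
    return doi_for_all, hybrid


# Hypothetical workload: one million datasets, 5% of which get published.
doi_for_all, hybrid = yearly_pid_cost(1_000_000, 0.05)
print(f"DOIs for all: {doi_for_all:.2f} EUR, ePIC + DOI on publish: {hybrid:.2f} EUR")
```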
14:27 pameyer joined #dataverse
14:28 poikilotherm I don't know if others actually think about using the repositories for things other than "just" the published data.
14:29 poikilotherm From our point of view it would be a huge benefit if other data were also put into the repositories.
14:29 poikilotherm Actually we went with Dataverse because it has the hierarchy of verses and sets together with the metadata schemas.
14:30 pdurbin poikilotherm: well, andrewSC opened an issue about making PIDs optional: https://github.com/IQSS/dataverse/issues/3652
14:30 poikilotherm Aye. But we need these PIDs... :-D
14:30 poikilotherm We want people to use persistent links
14:30 poikilotherm And we have some plans to build apps upon Dataverse
14:31 pameyer it does seem like there's some interest in more flexibility w/ identifiers than the current "single public PID"
14:31 poikilotherm And these also will need to have proper PIDs on every dataset
14:31 poikilotherm Yes :-D
14:31 pdurbin So you're not anti-PID. You're just trying to control costs. Support for Handles rather than DOIs was contributed by the community because the cost of DOIs is high compared to Handles. That's my understanding anyway.
14:31 pdurbin mornin pameyer
14:32 poikilotherm Totally pro-PID here :-D
14:32 poikilotherm As you said: it is a matter of cost control.
14:32 pdurbin poikilotherm: globus/gridftp discussion is springing up in #dv-design on Slack
14:32 pameyer pdurbin: morning
14:33 poikilotherm Oh interesting
14:35 poikilotherm I can't talk about our plans in public, but we have some stuff in the pipeline about distributed systems.
14:35 poikilotherm GridFTP seems also interesting :-)
14:38 pdurbin Ian Foster from Globus gave a keynote at the Dataverse Community Meeting back in June.
14:41 pameyer pdurbin: it seems like the dv-design discussion is focusing on ux, so I won't do another re-hash of the technical issues that would need to get sorted
14:42 pdurbin lots to get sorted :)
14:42 pdurbin pameyer: oh there was a recent comment by Martin about datasets being available for download from various places.
14:43 pameyer pdurbin: very interesting.  any pointers to where?
14:43 pdurbin "My goal is a contentURL referencing a single file, ideally a bagit archive that also includes metadata. I have run into a use case where I need to support multiple contentURLs - the same content in multiple cloud locations (AWS, Google Cloud), but that is an edge case."
14:43 pdurbin https://github.com/datacite/freya/issues/2#issuecomment-427433681
14:44 pameyer thanks
14:45 pameyer ... it sometimes seems like most of the things I think you need for good system design for a data repository are "edge cases"
14:45 pameyer :(
14:46 pdurbin well, LOCKSS is or was a thing
14:46 pdurbin I'd say you're on the right track.
14:47 pameyer and PDB's been doing distributed access for long enough that I'd have to check the literature to see when they started
14:47 pameyer also not assuming that the world runs on http
14:49 poikilotherm Ok, that's it for me for today... Maybe you can talk to Slava (@4thikonov) during the community meeting? He is the one actually behind the da|ra stuff
14:50 pdurbin We'll try. Thanks!
15:11 pameyer pdurbin: edff192275df861739bc56f002adc9bf8cd77c51 looks unhappy to me.  would you mind pushing the jenkins button when you've got a chance?
15:12 pdurbin Sure. Just did. Thanks for the heads up.
15:13 pameyer no problem.  I'm leaning towards a glitch on my end, from the commit messages between 0c89260a482428e07f0b206dde2bf73ea8ff5487 and edff192275df861739bc56f002adc9bf8cd77c51
15:18 donsizemore @pameyer can i get a second pair of eyes, before i put my foot through my thunderbolt display?
15:29 pdurbin pameyer: I'm seeing "Regression" "expected:<[Darwin's Finches - dva6e0453b]> but was:<[500 Internal Server Error]>" on DatasetsIT.testPrivateUrl https://build.hmdc.harvard.edu:8443/job/phoenix.dataverse.org-apitest-develop/edu.harvard.iq$dataverse/259/testReport/junit/edu.harvard.iq.dataverse.api/DatasetsIT/testPrivateUrl/
15:29 pdurbin I just kicked off another build to see if I get the same result.
15:30 pdurbin donsizemore: ansible stuff?
15:30 donsizemore @pdurbin yis. and i'm too ashamed to post it publicly.
15:33 pdurbin heh
15:33 * pdurbin has no shame
15:33 pameyer @donsizemore: where do you want a 2nd pair of eyes
15:34 pameyer @pdurbin: yeah, that's what I'm seeing
15:34 pameyer showed up 2x for me
15:34 pdurbin pameyer: are you willing to create an issue?
15:34 donsizemore actually, gimme a minute to try method 4 of coaxing ansible to edit one stupid file.
15:34 pameyer ah - pdurbin, I was wrong.  nothing on private url, but still search API
15:34 pameyer sure - issue incoming
15:49 pameyer pdurbin: also, it turned out that glassfish is more robust to abusing the file access API than I was expecting. lost quite a bit of responsiveness, but didn't crash
15:51 pdurbin a pleasant surprise for both of us :)
15:55 pdurbin pameyer: and in other pleasant surprise news... on the second run (job 260) all the integration tests passed: https://build.hmdc.harvard.edu:8443/job/phoenix.dataverse.org-apitest-develop/
15:59 pameyer pdurbin: huh
15:59 pameyer thanks for checking it; does suggest a problem on my end
16:05 poiki-at-home joined #dataverse
17:25 donsizemore joined #dataverse
17:31 Jim__ joined #dataverse
17:33 Jim__ Hey pdurbin - PR maintenance question - with your 5030 branch replacing my PR, what's the best way for me to maintain it? (There's a new tika version, I've found a dependency issue with commons-io, and it's behind dev again...).
17:33 Jim__ Should I make a PR against that branch?
17:34 pdurbin Jim__: yep, a PR against the new branch would be great. Thanks.
17:34 pdurbin Also, can I ask you about highlighting?
17:38 Jim__ highlighting? OK...
17:39 pdurbin That's what we call the snippets of text that match the search term. Not sure if you've seen this in Dataverse or not.
17:40 pdurbin The matching text and a little context on either side. Before and after.
17:41 Jim__ Ah - ok. In the search results...
17:41 pdurbin Yeah, like how "Wright" is in bold in the screenshot at https://github.com/IQSS/dataverse/issues/1589
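The highlighting idea described above (the matching text plus a little context before and after, with the match in bold) can be illustrated with a small standalone function. This is only a sketch of the concept: the `snippet` helper and its `context` parameter are made up here, and nothing in it reflects how Dataverse or its search index actually produces highlights.

```python
import re
from typing import Optional


def snippet(text: str, term: str, context: int = 20) -> Optional[str]:
    """Return the first match of `term` in `text`, wrapped in <b> tags,
    with up to `context` characters on either side; None if no match."""
    match = re.search(re.escape(term), text, flags=re.IGNORECASE)
    if match is None:
        return None
    start = max(match.start() - context, 0)
    end = min(match.end() + context, len(text))
    before = text[start:match.start()]
    after = text[match.end():end]
    # Keep the original casing of the matched text inside the bold tags.
    return f"{before}<b>{match.group(0)}</b>{after}"
```

For full-text search of PDFs, the same function could in principle be applied to the extracted text, at the cost of keeping that text around for display as well as for indexing.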
17:41 pdurbin The question is if you've thought about this for full text search. Of PDFs or whatever.
17:42 pdurbin If you haven't, that's fine. I just thought I'd see if it's on your radar.
17:45 Jim__ I haven't. I guess there isn't a highlight from the full-text search now. I'll look when I'm going back through that code. So far I just add the full-text from tika to the index and have done nothing to the search itself or the results
17:46 pdurbin That's totally fine. I'm super excited to have this feature in any form. I only mention it because when it hits QA highlighting might get asked about.
18:19 pdurbin Jim__: merged. Thanks.
18:29 donsizemore @pameyer okay, i'm ready to cave
18:34 pameyer @donsizemore - ansible sadness?
18:39 Jim__ Phil - thanks & sorry - one more small pr to update poi...
18:39 pdurbin no problem. merged that one too
18:43 pameyer that seems like a worthwhile update
18:46 pdurbin pameyer: no "boolean" in this list: https://github.com/IQSS/dataverse/blob/v4.9.4/src/main/java/edu/harvard/iq/dataverse/DatasetFieldType.java#L35
18:46 pdurbin nor http://guides.dataverse.org/en/4.9.3/admin/metadatacustomization.html#fieldtype-definitions
18:46 donsizemore @pameyer ansible uses python regular expressions (fine) and it prints all output in JSON-encoded form (also fine) but i need to pass it a shell command containing a variable which contains quotes and a dash, and i'm toying with becoming angriful towards ansible
18:46 pameyer pdurbin: thanks, that's what I'd *thought*
18:47 pameyer donsizemore: the variable has quotes and a dash, or the command does?
18:48 donsizemore @pameyer the command (asadmin, to add jvm-options). the XML module is under heavy development, and the lineinfile module reports that it does what i ask it to do (except it doesn't) so i punted and thought i'd pass shell commands. i'm setting them as facts because otherwise ansible will double-escape special characters
18:49 pameyer asadmin / jvm-options commands :<
18:50 pameyer @donsizemore do you have an opinion on using ansible to deploy and call utility scripts?
18:51 donsizemore @pameyer yeah, i think that was coming next
18:51 pameyer it's somewhat of a hack, but it may be the less frustration-inducing approach
18:51 donsizemore @pameyer but i'd still face the same regexp funsies in asking it to write out the script
18:51 donsizemore p.s. did my DMs make it through?
18:51 pdurbin donsizemore: is this work that I tried to pawn off on you or some itch you're scratching? :)
18:52 pameyer haven't seen any DMs - but that might be because my irc-fu is weak
18:52 pameyer utility script would let you reduce it to a solved problem though
18:52 donsizemore @pdurbin the former =) but it's turned into a mild obsession-of-irritation
18:52 pameyer https://github.com/IQSS/dataverse/blob/develop/conf/docker-aio/configure_doi.bash
18:53 pdurbin Yikes. Did I create an issue at least?
18:53 pameyer or at least semi-solved; it appears that it's interacting with integration test setup in a way that it wasn't when that PR was merged
18:53 donsizemore yeah, i can make a template of the script and try my luck there. i just need to set three jvm-options.
18:54 pameyer if I hadn't known, I could make a pretty strong inference that glassfish predated most of the provisioning tools I know of
18:55 donsizemore @pameyer also, our backups decided to exhibit unexpected behavior overnight, which isn't brightening my mood ring
18:56 pameyer @donsizemore backups deciding to exhibit their creativity is not a great thing
18:57 donsizemore @pameyer yeah, a 24-hour incremental consolidation job seems to be blocking further runs. but it's at 88% and i'm confident my going to the gym will fix things
18:58 pameyer I can't rule out the possibility that I fixed some intermittent network issues by getting off a bus this morning ;)
19:03 pdurbin Regarding PMs on IRC, freenode recently added +R (block unidentified) to all user accounts because of all the spam. "Ignores private messages from users who are not identified with services." https://freenode.net/kb/answer/usermodes
19:05 donsizemore @pameyer also, your script uses some of the exact regexps i was trying earlier today!
19:07 pameyer @donsizemore you could distinguish them from line noise? ;)
21:15 jri joined #dataverse
21:26 pameyer @donsizemore - just occurred to me that we might've been talking past each other on ansible utility scripts.
21:26 pameyer I'd been thinking write a script that reads from environment variables, use one task to copy it to the system and another task to execute it
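A utility script along the lines described above might look like the following sketch: it reads its settings from environment variables and builds one `asadmin create-jvm-options` call per setting, so that Ansible only has to copy the file and execute it. The environment variable names are invented for this example, the three `doi.*` property names are an assumption about which options are meant, and the backslash-escaping of colons is an assumption about what `asadmin` expects that should be checked against a real installation.

```python
import os
import subprocess
import sys

# Hypothetical mapping: environment variable -> JVM system property.
OPTIONS = {
    "DOI_USERNAME": "doi.username",
    "DOI_PASSWORD": "doi.password",
    "DOI_BASEURL": "doi.baseurlstring",
}


def build_commands(env, asadmin="asadmin"):
    """Build one argv list per JVM option from the given environment mapping."""
    commands = []
    for var, prop in OPTIONS.items():
        if var not in env:
            raise KeyError(f"missing environment variable {var}")
        # Assumption: asadmin treats ':' as a separator, so escape it.
        value = env[var].replace(":", "\\:")
        commands.append([asadmin, "create-jvm-options", f"-D{prop}={value}"])
    return commands


# Only touch the server when explicitly asked to with --apply.
if __name__ == "__main__" and len(sys.argv) > 1 and sys.argv[1] == "--apply":
    for argv in build_commands(os.environ):
        # Passing a list (no shell) means quotes and dashes in the values
        # need no extra shell escaping.
        subprocess.run(argv, check=True)
```

Because the commands are passed as argument lists rather than shell strings, quoting in the values never goes through a shell, which sidesteps one layer of the escaping trouble discussed earlier.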
21:42 donsizemore @pdurbin i think i've got the datacite stuff in dataverse-ansible. i'll let you test it ;0
23:24 pdurbin thanks!
