
IRC log for #dataverse, 2020-08-25

Connect via chat.dataverse.org to discuss Dataverse (dataverse.org, an open source web application for sharing, citing, analyzing, and preserving research data) with users and developers.


All times shown according to UTC.

Time Nick Message
07:36 jri joined #dataverse
12:11 donsizemore joined #dataverse
13:20 jri joined #dataverse
13:49 jri joined #dataverse
14:02 jri joined #dataverse
14:07 pdurbin joined #dataverse
14:08 jri_ joined #dataverse
14:08 jri joined #dataverse
14:21 pameyer joined #dataverse
14:25 jri joined #dataverse
14:52 pdurbin bjonnh bricas donsizemore pameyer poikilotherm the community call starts in 10 minutes: https://dataverse.org/community-calls
14:53 poikilotherm Sry not for me today... Outta here in 5
14:54 pdurbin These calls used to be an hour earlier. Then UTC daylight savings happened or something.
14:54 jri joined #dataverse
16:39 pameyer I got interrupted by other stuff; had been thinking to jump on
17:09 pdurbin It was a pretty quick call. You didn't miss anything.
17:13 pameyer I could've rambled about impedance mismatches between deaccessioning datasets and obsoleting software to fill time :)
17:45 pdurbin And while you rambled, maybe I could have played some blues chords.
17:46 pameyer :)
18:39 pdurbin pameyer: somehow I think you'll appreciate this: "I get the sense that there are built-in tools for generating metadata for social science data that perhaps do not apply to, say, atmospheric chemistry data."
18:53 pameyer that does sound a little familiar
18:54 pdurbin I invited my friend who wrote that to join us here. We'll see what happens. :)
19:30 ntallen joined #dataverse
19:31 ntallen Good afternoon!
19:32 ntallen I saw a quote go by in the recent log
19:34 pdurbin ntallen: hi! Remember IRC? :)
19:34 ntallen barely!
19:34 pdurbin heh
19:35 ntallen So I'm trying to bring my group up to speed on better data management -- in my spare time
19:35 pdurbin 😁 now with emojis
19:35 ntallen Cool!
19:36 pdurbin glad to hear it
19:36 ntallen pdurbin: How far did you travel?
19:37 pdurbin Not too far. Please tell us your data management woes.
19:38 ntallen I'm sure you get this all the time with new folks: Dataverse is awesome for data you are ready to publish and keep available in perpetuity...
19:39 pdurbin yeah
19:39 ntallen Is there anything like it to facilitate management of new data that isn't ready for prime time?
19:39 ntallen Data that might reasonably expire, does not need a DOI, but needs to be stored safely, backed up, etc.
19:39 pdurbin Well, sometimes we point people to integrations. Put your data in OSF, for example, and when you're ready, click the "publish in Dataverse" button.
19:40 donsizemore @ntallen a group here at UNC hosts I-forget-how-many TB of data in Google Drive, then publishes the metadata in UNC Dataverse
19:40 pdurbin OSF is listed here: http://guides.dataverse.org/en/5.0/admin/integrations.html#getting-data-in
19:41 ntallen OK, interesting.
19:41 pdurbin donsizemore: do you have an example dataset you can link to?
19:48 pdurbin ntallen: another thought I have is that Dataverse has APIs for deposit. So you could keep your not-sure-I-want-to-publish-this data wherever and then selectively call into APIs to publish some of it. These APIs are what make integrations possible, of course.
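(For reference, a scripted deposit along those lines might look something like this minimal sketch against the native API; SERVER_URL, API_TOKEN, DOI, the "mylab" collection alias, and the dataset.json metadata file are all placeholders, with the JSON prepared per the API guide:)

    # create a draft dataset in a collection from a JSON metadata file
    curl -H "X-Dataverse-key: $API_TOKEN" -X POST \
      "$SERVER_URL/api/dataverses/mylab/datasets" --upload-file dataset.json

    # add a data file to the draft, then publish when ready
    curl -H "X-Dataverse-key: $API_TOKEN" -X POST -F "file=@run-042.csv" \
      "$SERVER_URL/api/datasets/:persistentId/add?persistentId=$DOI"
    curl -H "X-Dataverse-key: $API_TOKEN" -X POST \
      "$SERVER_URL/api/datasets/:persistentId/actions/:publish?persistentId=$DOI&type=major"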
19:50 ntallen Right, that still leaves open the question of how to securely manage the data you just collected last night, and presumably every lab has to work out their own solutions independently.
19:50 pameyer ntallen: one factor is how much data you're talking about
19:52 pameyer it seems like the various electronic lab notebooks try to solve that (sometimes with institutional storage)
19:52 ntallen pameyer: It varies from a few KB per run up to a couple GB per run, with runs on the order of hours.
19:53 ntallen Yeah, I should take a closer look at those.
19:53 pdurbin Where do they store the data after a run now? Dropbox? S3? Somebody's laptop?
19:54 ntallen Lab computers for starters. When we're in the field, we backup to multiple external drives and transfer data back to Harvard
19:54 ntallen We have used Dropbox for transfer, but have not really embraced it for storage
19:54 pameyer we've got a setup where a data collection facility puts metadata (including storage info) into a dataverse instance automatically.  doesn't solve the storage/backup issue directly
19:58 ntallen We currently have a home-grown method of validating a data set so we can verify copies separately. I would like to have a tool for tracking multiple copies and facilitating periodic verification, for example.
20:01 pdurbin Dataverse verifies checksums (MD5, SHA, etc.) but that's about it.
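(A lab could get the "tracking multiple copies" part with little more than a checksum manifest; a minimal sketch using GNU coreutils, with hypothetical paths:)

    # generate a manifest for one run's data directory
    find /data/2020-08-24-run -type f -exec md5sum {} + > 2020-08-24-run.md5
    # on any other copy with the same layout, verify every file against the manifest
    md5sum -c 2020-08-24-run.md5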
20:02 pdurbin Speaking of electronic lab notebooks, RSpace also integrates with Dataverse, but I'm not very familiar with its features or if it's a good fit.
20:04 ntallen Our checks are not much more than that, but our datasets generally consist of lots of files. Probably repackaging the data into something like NetCDF would be a good start.
20:06 pdurbin You can upload any type of file to Dataverse so that should be fine.
20:06 pameyer ntallen: if you've got checksums and datasets on >1 storage media, you're probably ahead of a lot of researchers (and it makes me feel like I wasn't the only paranoid one)
20:08 ntallen Well we take data on NASA aircraft along with a large group of other researchers, and each flight is a one-off, so we cannot afford to lose it!
20:09 pameyer :) I completely agree!
20:10 pameyer sometimes you don't get a chance to re-collect a dataset, and it's good when folks treat their data that way
20:14 pdurbin sounds like valuable data
20:16 ntallen Thanks for the info -- I've got some reading to do, and also a call I've got to jump into!
20:16 ntallen left #dataverse
20:16 donsizemore joined #dataverse
20:16 pameyer every so often at meetings, a speaker talks about the replacement cost or total cost of a dataset
20:16 donsizemore @pdurbin dumb question?
20:18 pdurbin donsizemore: hit me 😁
20:19 donsizemore thu-mai wants me to kill that ingest job. stopping dataverse didn't clear the job status, as it resumed once i restarted glassfish
20:19 pdurbin Yeah, there's a queue.
20:19 pdurbin We should probably document it.
20:20 donsizemore in the DB, or admin console, or?
20:21 pdurbin donsizemore: this might help: http://wiki.greptilian.com/java/glassfish/howto/purge-jms-queue/
20:22 donsizemore is the default password the same as the one in your example?
20:23 pdurbin I think so.
20:23 donsizemore Error [A3161]: Failed to read password in passfile: java.io.FileNotFoundException: /dev/fd/63 (No such file or directory)
20:28 donsizemore but if i query it directly it responds, just doesn't return the information in your example
20:28 pdurbin Huh. Is that my bash trick not working? The <() thing?
20:28 donsizemore glassfish4/mq/bin/imqcmd -u admin -passfile <(echo imq.imqcmd.password=admin) query dst -t q -n DataverseIngest
20:29 donsizemore reports the queue and the broker, then hangs
20:29 donsizemore (that's on the query command, i haven't tried to purge just yet)
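(The purge variant of that same invocation would presumably be the following, with -f added to skip the confirmation prompt:)

    glassfish4/mq/bin/imqcmd -u admin -passfile <(echo imq.imqcmd.password=admin) \
      purge dst -t q -n DataverseIngest -f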
20:29 pdurbin do it, do it
20:30 donsizemore you're telling me it won't wipe out my entire production dataverse?
20:30 pdurbin !
20:30 pdurbin try it on a test system
20:31 donsizemore idle test system returns the query zippily
20:31 pdurbin that's good
20:31 donsizemore but the commands and output look good
20:33 donsizemore so this will purge the queue, but will it stop running jobs?
20:33 pdurbin I don't think so. You'll probably need to restart Glassfish again.
20:33 donsizemore i had to kill -9 the child processes to get it to exit before =(
20:33 pdurbin boo
20:34 donsizemore the purge command hasn't returned
20:34 pdurbin Maybe we should add a feature to Dataverse to kill a running ingest job.
20:34 donsizemore I told them several times Harvard imposed an ingest filesize limit of 150MB
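(That limit is an instance-level database setting; a superuser can presumably adjust it through the admin settings API, e.g. on localhost, where 157286400 bytes is 150 MB:)

    # set the tabular ingest file-size limit (in bytes)
    curl -X PUT -d 157286400 http://localhost:8080/api/admin/settings/:TabularIngestSizeLimit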
20:34 pdurbin but did they listen?
20:34 donsizemore it returned! and it stopped the jobs!
20:35 donsizemore and we're back!
20:35 pdurbin Nice. Take the rest of the day off.
20:35 donsizemore I will gratefully submit a doc issue and PR based off your commands (tomorrow)
20:35 pdurbin you da man
20:36 donsizemore hee hee Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
20:36 donsizemore (the VM has 64GB of RAM and a heap of 36GB)
20:37 pdurbin makes sense, sadly
20:39 pameyer now the fact that I generated 30GB of error logs seems less significant
20:41 pdurbin I'll read them when I can't sleep.
21:05 pdurbin left #dataverse
22:20 Bala joined #dataverse
22:21 Bala Hi all, I have a quick question. Between subscribing to a DOI service and running a handle.net service, which is more cost-effective?
