IRC log for #dataverse, 2021-02-10

Connect via chat.dataverse.org to discuss Dataverse (dataverse.org, an open source web application for sharing, citing, analyzing, and preserving research data) with users and developers.

All times shown according to UTC.

Time	Nick	Message
05:18		poikilotherm2 joined #dataverse
07:16		Virgile joined #dataverse
09:14		Virgile joined #dataverse
11:23		Virgile joined #dataverse
13:41	poikilotherm2	Rejoice, fellow Dataversians
13:41	poikilotherm2	Payara released their new version https://github.com/payara/Payara/releases/tag/payara-server-5.2021.1
13:41	poikilotherm2	Including my enhanced DirConfigSource :-)
13:42	poikilotherm2	donsizemore pdurbin we should talk about the production domain removal since 5.2020.7
13:43	poikilotherm2	See https://github.com/gdcc/dataverse-kubernetes/issues/218
13:45	poikilotherm2	And pdurbin donsizemore if you have a minute later, I'd like to know your opinion on using the Payara Docker images or going for custom made.
14:21		donsizemore joined #dataverse
15:04		pdurbin joined #dataverse
15:15	pdurbin	poikilotherm2: which do you want to talk about first?
15:29	poikilotherm2	All :-D
15:29	pdurbin	first in, all out
15:31	poikilotherm2	What sounds most interesting to you?
15:34	pdurbin	Well, a quick one should be the "production" domain. We never used it. We stuck with "domain1". So I don't think "production" being removed affects us.
15:34	poikilotherm2	Certainly.
15:35	poikilotherm2	Although the production ready domain has some nice stuff out of the box
15:35	pdurbin	ah
15:35	poikilotherm2	So I was wondering if we should create something reusable in all types of installs
15:36	poikilotherm2	I copied the differences from the Payara docs in https://github.com/gdcc/dataverse-kubernetes/issues/218#issuecomment-776635621
15:37	poikilotherm2	Pool sizes and others sound pretty useful for any non-dev installation
15:48	pdurbin	Well, the installer already has a concept of a "dev install". Maybe we could build on that.
15:48	poikilotherm2	I created a diff of the two domain.xml
15:48	poikilotherm2	I'll add it to the issue
15:49	pdurbin	sounds good
15:50	poikilotherm2	https://github.com/gdcc/dataverse-kubernetes/issues/218#issuecomment-776807042
15:51	poikilotherm2	So maybe this is connected to my other question
15:51	poikilotherm2	Currently, my idea was to rely on the Payara Image builds
15:51	poikilotherm2	(For container images)
15:52	poikilotherm2	As we are on JDK 11, we might choose to do otherwise
15:52	poikilotherm2	We can tweak a bit more to our use cases if we don't rely on them...
15:53	poikilotherm2	And we could either go with the openjdk11 image (used by solr, so reducing image pulls)
15:53	poikilotherm2	Or we could go with the Redhat cloud images
15:54	pdurbin	Sorry, I missed something. The Payara builds use JDK 8?
15:54	poikilotherm2	Both.
15:55	poikilotherm2	Switched to JDK11 recently
15:55	poikilotherm2	Just a matter of switching tags
15:55	poikilotherm2	I also have to admit that Payara does not do daily builds
15:55	poikilotherm2	So their images might contain more security issues
15:56	poikilotherm2	OpenJDK pushes daily
15:56	poikilotherm2	And has ARM based images
15:57	pdurbin	Well, it's smart to not rely on images you don't trust (for security reasons or whatever).
16:05	pdurbin	We're not really in the Docker world enough for this to be a concern.
16:44	poikilotherm2	So you think we should go for building our own images with daily security updates?
16:47	pdurbin	Do I have to do anything? :)
16:47	poikilotherm2	Tell me if I should go with Redhat Containers or with Debian (openjdk images use it)
16:48	pdurbin	!
16:48	pdurbin	That sounds like a question for donsizemore
16:48	pdurbin	poikilotherm2: meanwhile, please check out this German I can't read: https://github.com/IQSS/dataverse/issues/7598#issuecomment-776840786
16:48	pdurbin	this: https://www.izus.uni-stuttgart.de/fokus/fdm-projekte/xsample/
16:51	poikilotherm2	What would you need?
16:52	poikilotherm2	Need translation?
16:52	pdurbin	Meh, that's ok. I just thought you might be interested.
16:55	pdurbin	I don't have any opinion for the images question. I've heard of Debian. I haven't heard of Redhat Containers. :)
16:55	poikilotherm2	The XSamples website is pretty much saying nothing...
16:55	poikilotherm2	Bla bla bla
16:56	pdurbin	heh, like my talks!
16:56	poikilotherm2	All "we want to do this" but no "that's what we do"
16:56	pdurbin	dreams
16:57	poikilotherm2	Aye
17:04		dataverse-user joined #dataverse
18:34		nightowl313 joined #dataverse
18:39	nightowl313	hi all ... wonder if I could ask a quick question (maybe quick?) ... I am analyzing exactly what happens to files in s3 when they are deleted in dataverse as we are trying to complete our replication/DR workflow; we have a bucket in prod and are replicating that to a bucket in our DR account and changing to glacier storage; I've noticed that if a file is uploaded and deleted in draft mode in DV, the file is completely deleted in DV, bu
18:39	nightowl313	does this sound correct?
18:41	nightowl313	and, when the file is uploaded, it also creates "cached" copies of all of the various download versions available in S3?
18:43	pdurbin	nightowl313: sorry, your first line got a little cut off. What's after "but"? http://irclog.iq.harvard.edu/dataverse/2021-02-10#i_134701
18:44	nightowl313	haha i need to stop typing long posts!
18:44	nightowl313	but kept in AWS s3 with a delete marker; if the file is uploaded and the dataset published, and then deleted, the file is still saved in DV and accessible in the version history, and is still an active file in aws s3 (no delete markers)
18:45	nightowl313	file uploaded - not published - deleted - completely deleted from dataverse version history, still shows in versions in aws with delete marker
18:45	nightowl313	file uploaded - published - deleted - file still appears and is accessed in dv by selecting previous version - version shows in aws as active file - not deleted
18:45	pdurbin	Unfortunately, I'm not very familiar with the S3 code. Let me see if I can summon someone who is.
18:46	nightowl313	that's what it appears to be doing .. just wanted to verify ... we are just trying to figure out what really needs to go to glacier
18:46	nightowl313	if cached copies of all of the download formats are created, we probably want to exclude those?
18:47	nightowl313	just wondering what other folks are doing here with backing up and data retention
18:48	pdurbin	I think you have the right idea that the originals are the most important to back up.
18:48		Jim46 joined #dataverse
18:49	nightowl313	does it just create those copies for quicker access?
18:49	nightowl313	sorry I always have weird questions!
18:49	Jim46	The Dataverse code just deletes the file in S3. I think s3 can be configured with versioning, in which case deleted files are just marked as deleted.
18:49	nightowl313	yea, we have to turn on versioning in order to replicate the bucket to another account ... so I think that is what is happening
18:50	nightowl313	we are probably making this much more complicated than it has to be!
18:51	Jim46	I haven't fully thought things through, but I suspect s3 versioning isn't needed as Dataverse manages its own versions and never changes a stored file.
18:52	Jim46	For backup - it's a general open question as to whether you could/should store things that, if the data were re-entered into Dataverse, would be recreated - ingested tab files, DDI metadata extracted in ingest, metadata exports, thumbnails, etc.
18:53	nightowl313	yea, we are overcomplicating it because of audit requirements (3 copies in 3 different storage locations, etc.. ) ... okay ... so in general, when a file is uploaded, does dataverse create cached versions of the file formats?
18:53	nightowl313	oh yes, that makes sense .. that we may actually need those cached versions?
18:53	nightowl313	that helps a lot .. something we need to decide
18:54	nightowl313	cached versions of the metadata exports I think is what those are
18:56	pdurbin	donsizemore: Mandy's up! https://reusableresearch.com
18:58	pdurbin	nightowl313: Dataverse only creates archival versions of tabular files. The idea is to take a proprietary Stata file and create a plain text TSV from it.
18:58	Jim46	the cached metadata exports duplicate what's in the database, so if you are backing that up, you don't need them/they'll be reproduced on demand (there's an api call to recreate them too).
19:00	nightowl313	oh perfect! Those two comments sum it up (I was testing with a tabular file) ... that helps a lot! Thank you so much!
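The delete-marker behaviour discussed above (a versioned bucket keeps a deleted object and only marks it as deleted) can be checked with a small sketch like the one below. It assumes a listing shaped like the S3 ListObjectVersions response, with "Versions" and "DeleteMarkers" lists whose entries carry "Key" and "IsLatest". The function name classify_keys and the sample keys are hypothetical and only illustrate the two cases nightowl313 describes; they are not taken from Dataverse or from the conversation.

```python
from typing import Dict, List


def classify_keys(listing: Dict[str, List[dict]]) -> Dict[str, str]:
    """Map each object key to 'active' or 'delete-marked'.

    `listing` is assumed to look like an S3 ListObjectVersions response:
    a dict with optional "Versions" and "DeleteMarkers" lists.
    """
    status: Dict[str, str] = {}
    # A key whose latest entry is a real version is still an active object.
    for version in listing.get("Versions", []):
        if version.get("IsLatest"):
            status[version["Key"]] = "active"
    # A key whose latest entry is a delete marker was deleted, but its
    # older versions are still stored in the versioned bucket.
    for marker in listing.get("DeleteMarkers", []):
        if marker.get("IsLatest"):
            status[marker["Key"]] = "delete-marked"
    return status


# Hypothetical sample data: one file deleted while still a draft,
# one file deleted after the dataset was published.
sample_listing = {
    "Versions": [
        {"Key": "dataset-a/draft-file", "VersionId": "v1", "IsLatest": False},
        {"Key": "dataset-a/published-file", "VersionId": "v1", "IsLatest": True},
    ],
    "DeleteMarkers": [
        {"Key": "dataset-a/draft-file", "VersionId": "v2", "IsLatest": True},
    ],
}

result = classify_keys(sample_listing)
for key, state in sorted(result.items()):
    print(f"{key}: {state}")
```

With boto3, a listing of this shape would typically come from the S3 client's list_object_versions call (paginated for large buckets). The keys marked "delete-marked" are candidates to exclude from, or expire in, the Glacier copy.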
19:02	pdurbin	donsizemore: she claims to have coined "co ray ray"!
19:03	donsizemore	she also said Matthew _had_ to use Django ;)
19:03	pdurbin	damn
19:11	pdurbin	"Data Curation Result: Major Issues"
19:52	donsizemore	Mandy always holds back
19:54	pdurbin	Heh. She's a live wire!
20:02	pdurbin	"Software has continued to weasel its way into the very fabric of society" -- Titus Winters
21:45		nightowl313 joined #dataverse
22:01		pdurbin left #dataverse
