07:00
jri joined #dataverse
07:41
ppeter joined #dataverse
07:43
ppeter
We are currently considering installing Dataverse at our research institute, but some questions came up which I could not find answers to in the documentation.
07:43
ppeter
If I use multiple filestores, what is the placement strategy?
07:44
ppeter
Is it possible to move files among filestores, either automatically or manually (through the API)?
07:45
ppeter
Can a dataset or dataverse place some files on one filestore, and some other files on another?
07:47
ppeter
The two main focuses are performance tiering and removing filestores. Moving files is necessary for the former; moving dataverses is sufficient for the latter.
07:48
ppeter18 joined #dataverse
07:48
ppeter56 joined #dataverse
07:51
ppeter56 left #dataverse
07:52
ppeter joined #dataverse
09:33
jri_ joined #dataverse
09:52
Benjamin_Peuch joined #dataverse
11:42
donsizemore joined #dataverse
11:45
donsizemore
@ppeter hello! Dataverse support for multiple stores is currently under development. The default is a local filesystem.
11:46
donsizemore
@ppeter it is possible to direct where new datasets and files are created via the files.type JVM option, and I believe legacy file reads (when the location isn't updated in the dvobject table) should continue to work for either store
11:47
donsizemore
@ppeter but to my understanding it's still an "either/or" choice at present. I know of at least two institutions that are actively working on support for multiple datastores, though.
11:48
donsizemore
@ppeter moving files, again to my knowledge, would be handled "under the hood" both with the filesystem itself and the location entry in the dvobject table
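A minimal sketch of the configuration being described, based on the multistore support that surfaces later in this log (PR 6789, Dataverse 4.20); the store ids "file1" and "s3east", the directory, and the bucket name are illustrative assumptions, not values from this conversation:

    # each store is declared through dataverse.files.<id>.* JVM options
    ./asadmin create-jvm-options "-Ddataverse.files.file1.type=file"
    ./asadmin create-jvm-options "-Ddataverse.files.file1.label=file1"
    ./asadmin create-jvm-options "-Ddataverse.files.file1.directory=/usr/local/dvn/data"
    ./asadmin create-jvm-options "-Ddataverse.files.s3east.type=s3"
    ./asadmin create-jvm-options "-Ddataverse.files.s3east.label=s3east"
    ./asadmin create-jvm-options "-Ddataverse.files.s3east.bucket-name=my-bucket"
    # the default store for new files is selected by id
    ./asadmin create-jvm-options "-Ddataverse.files.storage-driver-id=file1"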
11:52
ppeter
In the documentation, it says "File Storage: Using a Local Filesystem and/or Swift and/or S3 object stores", and it proceeds to list examples of how to add multiple file storages.
11:53
ppeter
Do I understand it right that this is experimental currently?
11:54
ppeter
And what does under-the-hood mean? Does it mean that the admin has no control over what files are stored where?
12:01
donsizemore
@ppeter object storage support is beyond experimental, but there's no baked-in support for choosing a datastore at dataset creation just yet (to my knowledge)
12:01
donsizemore
@ppeter an admin of the server could copy/move the files and update the database storage location, but an admin of the service (say, a lead archivist) wouldn't be able to do that via API or web interface
12:32
ppeter
Thank you for the information!
12:40
donsizemore joined #dataverse
12:41
ppeter44 joined #dataverse
12:42
donsizemore
@ppeter44 I'm delighted you're considering Dataverse, but don't quote me on the above. I haven't done a ton of work with S3 configuration so that's all to my understanding. Jim Myers with GDCC/QDR would be the person to ask.
12:42
donsizemore
@ppeter44 or Leonid with IQSS
12:43
ppeter44
Thank you, I may try to contact them later.
12:45
ppeter44
There is definitely nothing documented about moving datasets/files among file stores; there seems to be some way to choose a file store at upload time.
12:48
donsizemore
@ppeter44 correct. this PR hasn't made it to the public guide just yet: https://github.com/IQSS/dataverse/pull/6789/files
13:08
poikilotherm
donsizemore this might be something valuable to add to dvcli
13:08
poikilotherm
Instead of fiddling with manual moving and updating, make it usable from the command line with a simple call
13:14
donsizemore
i like this
13:51
pdurbin joined #dataverse
13:53
dataverse-user joined #dataverse
13:53
Jim12 joined #dataverse
13:55
pdurbin
ppeter44: hi! Did you get answers to all your questions?
13:55
Jim12
@ppeter - Stores can be chosen per Dataverse by an admin, so you can set up different Dataverses to send files to different places.
13:57
Jim12
If you change either the global default store or the store for a Dataverse, new files go to the newly selected place but old ones aren't moved. It would be nice to have an api to move existing files around, but that doesn't yet exist.
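A hedged sketch of the per-Dataverse store assignment Jim describes, assuming the admin API endpoint introduced by PR 6789 (linked above); the collection alias "mycollection" and the store id "s3east" are placeholders:

    # send new files in this collection to the store with id "s3east"
    curl -X PUT -d s3east "http://localhost:8080/api/admin/dataverse/mycollection/storageDriver"
    # check which store the collection currently uses
    curl "http://localhost:8080/api/admin/dataverse/mycollection/storageDriver"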
13:58
Jim12
You can achieve a move by changing the database entry and moving the file physically (right now, all stores use the same path structure, so bulk moving is fairly straightforward: move the whole file tree and just change the store id as recorded for that file).
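A rough sketch of that manual move, on the assumption that the store id is recorded as a prefix of dvobject.storageidentifier; the store ids, paths, and database name are illustrative, this is not a supported procedure, and the database should be backed up first:

    # 1. copy the physical file tree to the new store's location
    rsync -av /usr/local/dvn/data/ /mnt/newstore/
    # 2. repoint the store id recorded for the moved files
    psql dvndb -c "UPDATE dvobject SET storageidentifier = replace(storageidentifier, 'file1://', 's3east://') WHERE storageidentifier LIKE 'file1://%';"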
14:03
Jim12
The two main use cases I'm aware of are people trying to separate large files, and independent accounting: one Dataverse with storage resources for different groups separated into different stores that can allocate and/or charge as they wish.
14:04
Jim12
Harvard is also looking at another case - two stores that point at the same S3 storage, but one configured for direct uploads (useful for larger files).
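If that setup matches the 4.20-era direct upload feature, the configuration would look roughly like this; the upload-redirect option and both store ids are assumptions on my part:

    # two stores backed by the same bucket; only one redirects uploads straight to S3
    ./asadmin create-jvm-options "-Ddataverse.files.s3std.type=s3"
    ./asadmin create-jvm-options "-Ddataverse.files.s3std.bucket-name=shared-bucket"
    ./asadmin create-jvm-options "-Ddataverse.files.s3direct.type=s3"
    ./asadmin create-jvm-options "-Ddataverse.files.s3direct.bucket-name=shared-bucket"
    ./asadmin create-jvm-options "-Ddataverse.files.s3direct.upload-redirect=true"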
14:06
Jim12
The choice of managing which store is used by Dataverse was seen as a flexible/powerful first option, but how a store is assigned is fairly independent of the store mechanism itself, so it wouldn't be hard for someone to implement a way to choose a store based on file size, for example (slightly tricky if one considers that Dataverse will unzip zip files).
14:09
ppeter44
@pdurbin I think so, thank you. I think I now see how it works, and what may be available in the future.
14:18
poikilotherm
Morning pdurbin :-) Have you seen the latest addition to https://github.com/poikilotherm/dvcli ? ;-)
14:48
pdurbin
ppeter44: I think multistore is already available. Yeah, in Dataverse 4.20, the latest: https://github.com/IQSS/dataverse/releases/tag/v4.20
14:50
pdurbin
poikilotherm: I see k8s stuff. And a logo! Nice!
15:35
uclouvain joined #dataverse
15:35
uclouvain
Hi. I have a problem of storage on my dataverse instance
15:36
uclouvain
It is configured to store files on S3, and it does. However, /usr/local/dvn/data/temp fills my HDD.
15:36
uclouvain
Why? What's the solution?
15:40
pdurbin
uclouvain: hi, I believe files are staged locally in some cases. Jim12 might have a better idea.
15:40
pdurbin
uclouvain: I'm aware of this issue: Documentation: Document various temp directories used by Dataverse #2848 - https://github.com/IQSS/dataverse/issues/2848
15:41
pdurbin
donsizemore: to quote you (from that issue), "Odum has ~7,300 files piled up in $files.dir/temp"
15:45
uclouvain
It seems that those files have not yet been pushed to the S3 storage.
15:46
poikilotherm
uclouvain please see also https://github.com/IQSS/dataverse/issues/6656
15:46
pdurbin
uclouvain: that's what I'm hearing in Slack too.
15:46
uclouvain
Any solution ?
15:47
poikilotherm
I opened that recently and it analyses how things work internally and where things can start to pile up
15:51
pdurbin
poikilotherm: nice issue. Thorough.
15:52
pdurbin
uclouvain: I think we have a script that cleans up old files in that "temp" directory.
15:53
Jim12
When Dataverse is done processing and stores the file, it deletes the temporary copy. So any temp files that exist represent current uploads or past failures. Recent versions of Dataverse do a better job of cleaning up after cancellations and failures but, if a user uploads files and then leaves the browser without clicking save or cancel, the temp files remain. Deleting temp files on some periodic basis (e.g. ones older than 1 day) should be fine.
15:53
pdurbin
Ghosts of failed uploads past.
15:54
Jim12
If there are other cases where v4.20 is not removing temp files, it would be good to report them.
15:55
pdurbin
uclouvain: is this helping? :)
15:55
Jim12
@poikilotherm's issue points out there's another directory, managed by PrimeFaces, where you may also find temp files and may need to ensure sufficient space to handle active uploads.
15:56
uclouvain
There's no risk in deleting files older than 1 day?
15:57
pdurbin
uclouvain: 1 day old is probably fine. If you want to be even more careful you could delete files that are 3 days old.
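A minimal cleanup sketch, assuming the temp path mentioned earlier in this log and the more careful 3-day cutoff:

    # delete temp files not modified within the last 3 days
    find /usr/local/dvn/data/temp -type f -mtime +3 -delete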
15:59
Jim12
Dataverse only tracks those temp files as part of active uploads, so there's no way Dataverse can do anything with them after the upload session is gone. That said - if a user says 'I started uploading my only copy of a file a week ago and it didn't work - can I get my file back?' then the temp file is your only copy on the server.
16:01
uclouvain
server.log is full of lines like these ones:
16:01
uclouvain
[#|2020-04-22T13:56:21.472+0000|WARNING|glassfish 4.1|edu.harvard.iq.dataverse.ingest.IngestServiceBean|_ThreadID=51;_ThreadName=jk-connector(5);_TimeMillis=1587563781472;_LevelValue=900;| Failed to save the file, storage id s3://dataverse:171a21dba64-8634e7583d95 (Unable to calculate MD5 hash: /usr/local/dvn/data/temp/s3:/dataverse:171a21dba64-8634e7583d95 (No such file or directory))|#]
[#|2020-04-22T13:56:21.474+0000|WARNING|glassfi
16:02
pdurbin
uclouvain: which version of Dataverse are you running?
16:03
uclouvain
4.14 (I know... I know...)
16:03
pdurbin
:)
16:07
uclouvain
BTW what's the default maximum number of files per dataset? My user already has over 900 files!
16:07
pdurbin
There are some "Datafile Integrity" checks you could do: http://guides.dataverse.org/en/4.20/api/native-api.html#id17 . But it sounds like uploads might be failing.
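A sketch of running one of those checks, assuming the validateDataFileHashValue endpoint named in the 4.20 guide linked above; the file id 42 is a placeholder:

    # recompute the checksum for datafile 42 and compare it to the stored value
    curl "http://localhost:8080/api/admin/validateDataFileHashValue/42"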
16:08
pdurbin
You can set limits for how many files can be uploaded at a time but I'm not sure if there's a way to set a limit on the total number of files in a dataset.
16:09
uclouvain
OK, so he will not be blocked
16:10
pdurbin
I don't believe so.
16:11
pdurbin
If you want to limit the number of files per dataset, please open a GitHub issue.
16:11
Jim12
re: number of files - QDR has 1800+; 1000 per single upload is the only limit I know of. Beyond that, things just get slow to display as the number of files goes up.
16:12
pdurbin
Jim12: is that still true, even after the file listing on the dataset page was switched to Solr? It definitely was true in the past. :)
16:14
Jim12
I'm not sure what the error could be, especially back in 4.14. I'd suspect something like an S3 failure or a misconfiguration, but I'm not sure. FWIW - one thing we've seen lately in S3 from AWS - the us-east-1 region doesn't guarantee that a call to get a new file or its metadata will succeed immediately after its creation. We have some evidence that that has gone from <1 second to more than a minute at times lately (eventual consistency).
16:14
uclouvain
So I deleted files older than one day. The user can go on with uploads. Thanks for your help!
16:14
Jim12
Our testing was around the new direct upload feature, but it could affect the normal uploads as well.
16:14
pdurbin
uclouvain: you're welcome. It's probably getting late for you. :)
16:15
uclouvain
We use S3 with CEPH at UCLouvain.
16:15
pdurbin
fancy!
16:15
Jim12
Things getting slower? Yes - I think I started around 4.8.6, when it was already on Solr. There have been improvements since, and maybe not so much in display as in making any changes (e.g. changing metadata).
16:16
Jim12
Hopefully eventual consistency isn't your issue then.
16:16
pdurbin
Huh, I thought the switch to Solr was more recent.
16:17
pdurbin
This is the change I meant, from Dataverse 4.15: https://github.com/IQSS/dataverse/pull/5820
16:18
pdurbin
Anyway, I have something like 115 files in my dataset and I can page through them fairly quickly.
16:44
jri joined #dataverse
18:03
pdurbin
donsizemore: https://github.com/IQSS/dataverse-ansible/pull/167 looks good to me. Thanks!
18:43
donsizemore
@pdurbin i pestered leonid about whether we should log the output of as-setup.sh — he's not worried about it for now but if he does more work on the installer he may add it
18:45
pdurbin
Sounds good. So are you happy to switch to the installer? I thought you said you'd be able to delete some Ansible code but the pull request is a net gain in code.
18:58
donsizemore
i already switched it. the nice thing is ansible isn't tracking and wedging in bits that would have been provided by the perl installer
18:58
donsizemore
net gain in code... hey, it's python!
18:58
pdurbin
:)
18:59
pdurbin
Well, it's nice that the new installer will be tested regularly.
19:00
donsizemore
i'm re-running in ec2 now with the jenkins group_vars
19:01
donsizemore
hmm. sun must've moved to github? "Request failed: <urlopen error timed out>", "url": "http://dlc-cdn.sun.com/glassfish/4.1/release/glassfish-4.1.zip "
19:01
donsizemore
maybe we should go ahead and switch that to payara
19:06
pdurbin
That link works for me. I was wondering when we're going to switch Jenkins to Payara. What are we waiting for?
19:08
donsizemore
gustavo to say so (and, if we get CentOS 8 AMIs any time soon, Glassfish to patch this morning's OpenJDK breakage)
19:10
pdurbin
Yeah that was weird. Thanks for opening https://github.com/IQSS/dataverse/issues/6853 . I haven't grokked it completely.
19:11
donsizemore
a change in openjdk-1.8.0.252+ broke payara-5.201's packaged grizzly
19:11
donsizemore
you tell payara to use a different grizzly jar for openjdk _252 and up, and things are fine
19:12
pdurbin
Gotcha. And you're always getting new JDK releases, I suppose. I don't update my JDK on my laptop very often.
19:12
donsizemore
i like security patches
19:13
pdurbin
I remember having a vague sense of horror and unrest when I had to install Java on my laptop for the first time. I probably should keep it more up to date than I do.
19:17
donsizemore
if i want to set an ansible group_var to skip only that sitemap test, do i call it testUpdateSiteMap or SiteMapUtilTest.testUpdateSiteMap?
19:18
donsizemore
ooh ooh -Dit.test=\!SiteMapUtilTest.*
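Spelled out, that would be passed to Maven roughly as below, assuming the test runs under failsafe (if surefire runs it, -Dtest='!SiteMapUtilTest' would be the analogous flag):

    # run the build but skip every test method in SiteMapUtilTest;
    # quoting keeps the shell from expanding the ! and *
    mvn verify '-Dit.test=!SiteMapUtilTest.*'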
19:18
pdurbin
buh, I'm not sure
19:19
donsizemore
i'm trying ^^
19:19
donsizemore
so Jenkins can carry on
19:33
pdurbin
like a wayward son