07:00
jri joined #dataverse
07:41
ppeter joined #dataverse
07:43
ppeter
We are currently considering installing Dataverse at our research institute, but some questions came up which I could not find answers to in the documentation.
07:43
ppeter
If I use multiple filestores, what is the placement strategy?
07:44
ppeter
Is it possible to move files among filestores, either automatically or manually (through the API)?
07:45
ppeter
Can a dataset or dataverse place some files on one filestore, and some other files on another?
07:47
ppeter
The two main focuses are performance tiering and removing filestores. Moving files is necessary for the former; moving dataverses is sufficient for the latter.
07:48
ppeter18 joined #dataverse
07:48
ppeter56 joined #dataverse
07:51
ppeter56 left #dataverse
07:52
ppeter joined #dataverse
09:33
jri_ joined #dataverse
09:52
Benjamin_Peuch joined #dataverse
11:42
donsizemore joined #dataverse
11:45
donsizemore
@ppeter hello! Dataverse support for multiple stores is currently under development. The default is a local filesystem.
11:46
donsizemore
@ppeter it is possible to direct where new datasets and files are created via the files.type JVM option, and I believe legacy file reads (when the location isn't updated in the dvobject table) should continue to work for either store
11:47
donsizemore
@ppeter but to my understanding it's still an "either/or" choice at present. I know of at least two institutions that are actively working on support for multiple datastores, though.
11:48
donsizemore
@ppeter moving files, again to my knowledge, would be handled "under the hood" both with the filesystem itself and the location entry in the dvobject table
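A minimal sketch of the configuration being described, based on the multistore support that surfaces later in this log (PR 6789, Dataverse 4.20); the store ids "file1" and "s3east", the directory, and the bucket name are illustrative assumptions, not values from this conversation:

    # each store is declared through dataverse.files.<id>.* JVM options
    ./asadmin create-jvm-options "-Ddataverse.files.file1.type=file"
    ./asadmin create-jvm-options "-Ddataverse.files.file1.label=file1"
    ./asadmin create-jvm-options "-Ddataverse.files.file1.directory=/usr/local/dvn/data"
    ./asadmin create-jvm-options "-Ddataverse.files.s3east.type=s3"
    ./asadmin create-jvm-options "-Ddataverse.files.s3east.label=s3east"
    ./asadmin create-jvm-options "-Ddataverse.files.s3east.bucket-name=my-bucket"
    # the default store for new files is selected by id
    ./asadmin create-jvm-options "-Ddataverse.files.storage-driver-id=file1"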
11:52
ppeter
In the documentation, it says "File Storage: Using a Local Filesystem and/or Swift and/or S3 object stores", and it proceeds to list examples of how to add multiple file storages.
11:53
ppeter
Do I understand it right that this is experimental currently?
11:54
ppeter
And what does under-the-hood mean? Does it mean that the admin has no control over what files are stored where?
12:01
donsizemore
@ppeter object storage support is beyond experimental, but there's no baked-in support for choosing a datastore at dataset creation just yet (to my knowledge)
12:01
donsizemore
@ppeter an admin of the server could copy/move the files and update the database storage location, but an admin of the service (say, a lead archivist) wouldn't be able to do that via API or web interface
12:32
ppeter
Thank you for the information!
12:40
donsizemore joined #dataverse
12:41
ppeter44 joined #dataverse
12:42
donsizemore
@ppeter44 I'm delighted you're considering Dataverse, but don't quote me on the above. I haven't done a ton of work with S3 configuration so that's all to my understanding. Jim Myers with GDCC/QDR would be the person to ask.
12:42
donsizemore
@ppeter44 or Leonid with IQSS
12:43
ppeter44
Thank you, I may try to contact them later.
12:45
ppeter44
There is definitely nothing documented about moving datasets/files among file stores; there seems to be some way to choose a file store at upload time.
12:48
donsizemore
@ppeter44 correct. this PR hasn't made it to the public guide just yet: https://github.com/IQSS/dataverse/pull/6789/files
13:08
poikilotherm
donsizemore this might be something valuable to add to dvcli
13:08
poikilotherm
Instead of fiddling with manual moving and updating, make it usable from the command line with a simple call
13:14
donsizemore
i like this
13:51
pdurbin joined #dataverse
13:53
dataverse-user joined #dataverse
13:53
Jim12 joined #dataverse
13:55
pdurbin
ppeter44: hi! Did you get answers to all your questions?
13:55
Jim12
@ppeter - Stores can be chosen per Dataverse by an admin, so you can set up different Dataverses to send files to different places.
13:57
Jim12
If you change either the global default store or the store for a Dataverse, new files go to the newly selected place but old ones aren't moved. It would be nice to have an api to move existing files around, but that doesn't yet exist.
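A hedged sketch of the per-Dataverse store assignment Jim describes, assuming the admin API endpoint introduced by PR 6789 (linked above); the collection alias "mycollection" and the store id "s3east" are placeholders:

    # send new files in this collection to the store with id "s3east"
    curl -X PUT -d s3east "http://localhost:8080/api/admin/dataverse/mycollection/storageDriver"
    # check which store the collection currently uses
    curl "http://localhost:8080/api/admin/dataverse/mycollection/storageDriver"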
13:58
Jim12
You can achieve a move by changing the database entry and moving the file physically (right now, all stores use the same path structure, so bulk moving is fairly straightforward: move the whole file tree and just change the store id as recorded for that file).
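A rough sketch of that manual move, on the assumption that the store id is recorded as a prefix of dvobject.storageidentifier; the store ids, paths, and database name are illustrative, this is not a supported procedure, and the database should be backed up first:

    # 1. copy the physical file tree to the new store's location
    rsync -av /usr/local/dvn/data/ /mnt/newstore/
    # 2. repoint the store id recorded for the moved files
    psql dvndb -c "UPDATE dvobject SET storageidentifier = replace(storageidentifier, 'file1://', 's3east://') WHERE storageidentifier LIKE 'file1://%';"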
14:03
Jim12
The two main use cases I'm aware of are people trying to separate large files, and independent accounting: one Dataverse with storage resources for different groups separated into different stores that can allocate and/or charge as they wish.
14:04
Jim12
Harvard is also looking at another case - two stores that point at the same S3 storage, but one configured for direct uploads (useful for larger files).
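If that setup matches the 4.20-era direct upload feature, the configuration would look roughly like this; the upload-redirect option and both store ids are assumptions on my part:

    # two stores backed by the same bucket; only one redirects uploads straight to S3
    ./asadmin create-jvm-options "-Ddataverse.files.s3std.type=s3"
    ./asadmin create-jvm-options "-Ddataverse.files.s3std.bucket-name=shared-bucket"
    ./asadmin create-jvm-options "-Ddataverse.files.s3direct.type=s3"
    ./asadmin create-jvm-options "-Ddataverse.files.s3direct.bucket-name=shared-bucket"
    ./asadmin create-jvm-options "-Ddataverse.files.s3direct.upload-redirect=true"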
14:06
Jim12
The choice of managing which store is used by Dataverse was seen as a flexible/powerful first option, but how a store is assigned is fairly independent of the store mechanism itself, so it wouldn't be hard for someone to implement a way to choose a store based on file size, for example (slightly tricky if one considers that Dataverse will unzip zip files).
14:09
ppeter44
@pdurbin I think so, thank you. I think I now see how it works, and what may be available in the future.
14:18
poikilotherm
Morning pdurbin :-) Have you seen the latest addition to https://github.com/poikilotherm/dvcli ? ;-)
14:48
pdurbin
ppeter44: I think multistore is already available. Yeah, in Dataverse 4.20, the latest: https://github.com/IQSS/dataverse/releases/tag/v4.20
14:50
pdurbin
poikilotherm: I see k8s stuff. And a logo! Nice!
15:35
uclouvain joined #dataverse
15:35
uclouvain
Hi. I have a problem of storage on my dataverse instance
15:36
uclouvain
It is configured to store files on S3, and it does. However, /usr/local/dvn/data/temp fills my HDD.
15:36
uclouvain
Why? What's the solution?
15:40
pdurbin
uclouvain: hi, I believe files are staged locally in some cases. Jim12 might have a better idea.
15:40
pdurbin
uclouvain: I'm aware of this issue: Documentation: Document various temp directories used by Dataverse #2848 - https://github.com/IQSS/dataverse/issues/2848
15:41
pdurbin
donsizemore: to quote you (from that issue), "Odum has ~7,300 files piled up in $files.dir/temp"
15:45
uclouvain
It seems that those files have not yet been pushed to the S3 storage.
15:46
poikilotherm
uclouvain please see also https://github.com/IQSS/dataverse/issues/6656
15:46
pdurbin
uclouvain: that's what I'm hearing in Slack too.
15:46
uclouvain
Any solution ?
15:47
poikilotherm
I opened that recently and it analyses how things work internally and where things can start to pile up
15:51
pdurbin
poikilotherm: nice issue. Thorough.
15:52
pdurbin
uclouvain: I think we have a script that cleans up old files in that "temp" directory.
15:53
Jim12
When Dataverse is done processing and stores the file, it deletes the temporary copy. So any temp files that exist represent current uploads or past failures. Recent versions of Dataverse do a better job of cleaning up after cancellations and failures but, if a user uploads files and then leaves the browser without clicking save or cancel, the temp files remain. Deleting temp files on some periodic basis (e.g. ones older than 1 day) should be fine.
15:53
pdurbin
Ghosts of failed uploads past.
15:54
Jim12
If there are other cases where v4.20 is not removing temp files, it would be good to report them.
15:55
pdurbin
uclouvain: is this helping? :)
15:55
Jim12
@poikilotherm's issue points out there's another directory, managed by PrimeFaces, where you may also find temp files and may need to ensure sufficient space to handle active uploads.
15:56
uclouvain
There's no risk in deleting files older than 1 day?
15:57
pdurbin
uclouvain: 1 day old is probably fine. If you want to be even more careful you could delete files that are 3 days old.
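A minimal cleanup sketch, assuming the temp path mentioned earlier in this log and the more careful 3-day cutoff:

    # delete temp files not modified within the last 3 days
    find /usr/local/dvn/data/temp -type f -mtime +3 -delete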
15:59
Jim12
Dataverse only tracks those temp files as part of active uploads, so there's no way Dataverse can do anything with them after the upload session is gone. That said - if a user says 'I started uploading my only copy of a file a week ago and it didn't work - can I get my file back?' then the temp file is your only copy on the server.
16:01
uclouvain
server.log is full of lines like these ones:
16:01
uclouvain
[#|2020-04-22T13:56:21.472+0000|WARNING|glassfish 4.1|edu.harvard.iq.dataverse.ingest.IngestServiceBean|_ThreadID=51;_ThreadName=jk-connector(5);_TimeMillis=1587563781472;_LevelValue=900;| Failed to save the file, storage id s3://dataverse:171a21dba64-8634e7583d95 (Unable to calculate MD5 hash: /usr/local/dvn/data/temp/s3:/dataverse:171a21dba64-8634e7583d95 (No such file or directory))|#]
[#|2020-04-22T13:56:21.474+0000|WARNING|glassfi
16:02
pdurbin
uclouvain: which version of Dataverse are you running?
16:03
uclouvain
4.14 (I know... I know...)
16:03
pdurbin
:)
16:07
uclouvain
BTW what's the default maximum number of files per dataset? My user already has over 900 files!
16:07
pdurbin
There are some "Datafile Integrity" checks you could do: http://guides.dataverse.org/en/4.20/api/native-api.html#id17 . But it sounds like uploads might be failing.
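A sketch of running one of those checks, assuming the validateDataFileHashValue endpoint named in the 4.20 guide linked above; the file id 42 is a placeholder:

    # recompute the checksum for datafile 42 and compare it to the stored value
    curl "http://localhost:8080/api/admin/validateDataFileHashValue/42"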
16:08
pdurbin
You can set limits for how many files can be uploaded at a time but I'm not sure if there's a way to set a limit on the total number of files in a dataset.
16:09
uclouvain
OK, so he will not be blocked
16:10
pdurbin
I don't believe so.
16:11
pdurbin
If you want to limit the number of files per dataset, please open a GitHub issue.
16:11
Jim12
re: number of files - QDR has 1800+; 1000 per single upload is the only limit I know of. Beyond that, things just get slow to display as the number of files goes up.
16:12
pdurbin
Jim12: is that still true, even after the file listing on the dataset page was switched to Solr? It definitely was true in the past. :)
16:14
Jim12
I'm not sure what the error could be, especially back in 4.14. I'd suspect something like an S3 failure or a misconfiguration, but I'm not sure. FWIW - one thing we've seen lately in S3 from AWS - the us-east-1 region doesn't guarantee that a call to get a new file or its metadata will succeed immediately after its creation. We have some evidence that that has gone from <1 second to more than a minute at times lately (eventual consistency).
16:14
uclouvain
So I deleted files older than one day. The user can go on with uploads. Thanks for your help!
16:14
Jim12
Our testing was around the new direct upload feature, but it could affect the normal uploads as well.
16:14
pdurbin
uclouvain: you're welcome. It's probably getting late for you. :)
16:15
uclouvain
We use S3 with CEPH at UCLouvain.
16:15
pdurbin
fancy!
16:15
Jim12
Things getting slower? Yes - I think I started around 4.8.6, when it was already on Solr. There have been improvements since, and maybe not so much in display as in making any changes (e.g. changing metadata).
16:16
Jim12
Hopefully eventual consistency isn't your issue then.
16:16
pdurbin
Huh, I thought the switch to Solr was more recent.
16:17
pdurbin
This is the change I meant, from Dataverse 4.15: https://github.com/IQSS/dataverse/pull/5820
16:18
pdurbin
Anyway, I have something like 115 files in my dataset and I can page through them fairly quickly.
16:44
jri joined #dataverse
18:03
pdurbin
donsizemore: https://github.com/IQSS/dataverse-ansible/pull/167 looks good to me. Thanks!
18:43
donsizemore
@pdurbin i pestered leonid about whether we should log the output of as-setup.sh — he's not worried about it for now but if he does more work on the installer he may add it
18:45
pdurbin
Sounds good. So are you happy to switch to the installer? I thought you said you'd be able to delete some Ansible code but the pull request is a net gain in code.
18:58
donsizemore
i already switched it. the nice thing is ansible isn't tracking and wedging in bits that would have been provided by the perl installer
18:58
donsizemore
net gain in code... hey, it's python!
18:58
pdurbin
:)
18:59
pdurbin
Well, it's nice that the new installer will be tested regularly.
19:00
donsizemore
i'm re-running in ec2 now with the jenkins group_vars
19:01
donsizemore
hmm. sun must've moved to github? "Request failed: <urlopen error timed out>", "url": "http://dlc-cdn.sun.com/glassfish/4.1/release/glassfish-4.1.zip "
19:01
donsizemore
maybe we should go ahead and switch that to payara
19:06
pdurbin
That link works for me. I was wondering when we're going to switch Jenkins to Payara. What are we waiting for?
19:08
donsizemore
gustavo to say so (and, if we get CentOS 8 AMIs any time soon, Glassfish to patch this morning's OpenJDK breakage)
19:10
pdurbin
Yeah that was weird. Thanks for opening https://github.com/IQSS/dataverse/issues/6853 . I haven't grokked it completely.
19:11
donsizemore
a change in openjdk-1.8.0.252+ broke payara-5.201's packaged grizzly
19:11
donsizemore
you tell payara to use a different grizzly jar for openjdk _252 and up, and things are fine
19:12
pdurbin
Gotcha. And you're always getting new JDK releases, I suppose. I don't update my JDK on my laptop very often.
19:12
donsizemore
i like security patches
19:13
pdurbin
I remember having a vague sense of horror and unrest when I had to install Java on my laptop for the first time. I probably should keep it more up to date than I do.
19:17
donsizemore
if i want to set an ansible group_var to skip only that sitemap test, do i call it testUpdateSiteMap or SiteMapUtilTest.testUpdateSiteMap?
19:18
donsizemore
ooh ooh -Dit.test=\!SiteMapUtilTest.*
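Spelled out, that would be passed to Maven roughly as below, assuming the test runs under failsafe (if surefire runs it, -Dtest='!SiteMapUtilTest' would be the analogous flag):

    # run the build but skip every test method in SiteMapUtilTest;
    # quoting keeps the shell from expanding the ! and *
    mvn verify '-Dit.test=!SiteMapUtilTest.*'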
19:18
pdurbin
buh, I'm not sure
19:19
donsizemore
i'm trying ^^
19:19
donsizemore
so Jenkins can carry on
19:33
pdurbin
like a wayward son