
IRC log for #dataverse, 2020-04-23

Connect via chat.dataverse.org to discuss Dataverse (dataverse.org, an open source web application for sharing, citing, analyzing, and preserving research data) with users and developers.

| Channels | #dataverse index | Today | | Search | Google Search | Plain-Text | plain, newest first | summary

All times shown according to UTC.

Time Nick Message
07:00 jri joined #dataverse
07:41 ppeter joined #dataverse
07:43 ppeter We are currently considering installing Dataverse at our research institute, but some questions came up whose answers I could not find in the documentation.
07:43 ppeter If I use multiple filestores, what is the placement strategy?
07:44 ppeter Is it possible to move files among filestores, either automatically or manually (through API)?
07:45 ppeter Can a dataset or dataverse place some files on one filestore, and some other files on another?
07:47 ppeter The two main focuses are performance tiering and removing filestores. Moving files is necessary for the former, moving dataverses is sufficient for the latter.
07:48 ppeter18 joined #dataverse
07:48 ppeter56 joined #dataverse
07:51 ppeter56 left #dataverse
07:52 ppeter joined #dataverse
09:33 jri_ joined #dataverse
09:52 Benjamin_Peuch joined #dataverse
11:42 donsizemore joined #dataverse
11:45 donsizemore @ppeter hello! Dataverse support for multiple stores is currently under development. the default is a local filesystem
11:46 donsizemore @ppeter it is possible to direct where new datasets and files are created via the files.type jvm-option, and I believe legacy file reads (when location isn't updated in the dvobject table) should continue to work for either store
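The jvm-option mentioned above is set with asadmin; a minimal sketch, assuming the per-store option names from the multistore work that landed around Dataverse 4.20 (the store ids, label values, path, and bucket name below are made up for illustration):

```shell
# Define two hypothetical stores ("localfs" and "s3big") and pick the
# default store for new files; option names follow the Dataverse 4.20
# multistore scheme, ids and values here are illustrative only.
asadmin create-jvm-options '-Ddataverse.files.localfs.type=file'
asadmin create-jvm-options '-Ddataverse.files.localfs.label=Local'
asadmin create-jvm-options '-Ddataverse.files.localfs.directory=/usr/local/dvn/data'
asadmin create-jvm-options '-Ddataverse.files.s3big.type=s3'
asadmin create-jvm-options '-Ddataverse.files.s3big.label=S3-Large'
asadmin create-jvm-options '-Ddataverse.files.s3big.bucket-name=my-dataverse-bucket'
asadmin create-jvm-options '-Ddataverse.files.storage-driver-id=localfs'
```

A restart of the application server is needed before the new options take effect.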
11:47 donsizemore @ppeter but to my understanding at present it's still an "either/or" choice. I know of at least two institutions who are actively working on support for multiple datastores, though.
11:48 donsizemore @ppeter moving files, again to my knowledge, would be handled "under the hood" both with the filesystem itself and the location entry in the dvobject table
11:52 ppeter In the documentation, it says "File Storage: Using a Local Filesystem and/or Swift and/or S3 object stores", and proceeds to list examples of how to add multiple file stores.
11:53 ppeter Do I understand it right that this is experimental currently?
11:54 ppeter And what does under-the-hood mean? Does it mean that the admin has no control over what files are stored where?
12:01 donsizemore @ppeter object storage support is beyond experimental, but there's no baked-in support for choosing a datastore at dataset creation just yet (to my knowledge)
12:01 donsizemore @ppeter an admin of the server could copy/move the files and update the database storage location, but an admin of the service (say, a lead archivist) wouldn't be able to do that via API or web interface
12:32 ppeter Thank you for the information!
12:40 donsizemore joined #dataverse
12:41 ppeter44 joined #dataverse
12:42 donsizemore @ppeter44 I'm delighted you're considering Dataverse, but don't quote me on the above. I haven't done a ton of work with S3 configuration so that's all to my understanding. Jim Myers with GDCC/QDR would be the person to ask.
12:42 donsizemore @ppeter44 or Leonid with IQSS
12:43 ppeter44 Thank you, I may try to contact them later.
12:45 ppeter44 There is definitely nothing documented about moving datasets/files among file stores; there seems to be some way to choose a file store at upload time.
12:48 donsizemore @ppeter44 correct. this PR hasn't made it to the public guide just yet: https://github.com/IQSS/dataverse/pull/6789/files
13:08 poikilotherm donsizemore this might be sth valuable to add to dvcli
13:08 poikilotherm Instead of fiddling with manual moving and updating make it usable from cmdline with a simple call
13:14 donsizemore i like this
13:51 pdurbin joined #dataverse
13:53 dataverse-user joined #dataverse
13:53 Jim12 joined #dataverse
13:55 pdurbin ppeter44: hi! Did you get answers to all your questions?
13:55 Jim12 @ppeter - Stores can be chosen per Dataverse by an admin, so you can set up different Dataverses to send files to different places.
13:57 Jim12 If you change either the global default store or the store for a Dataverse, new files go to the newly selected place but old ones aren't moved. It would be nice to have an api to move existing files around, but that doesn't yet exist.
13:58 Jim12 You can achieve a move by changing the database entry and moving the file physically. (Right now, all stores use the same path structure, so bulk moving is fairly straightforward: move the whole file tree and change the store id as recorded for each file.)
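A rough sketch of the manual bulk move Jim describes, assuming the dvobject table's storageidentifier column carries a store-id prefix; the store ids, paths, and database name below are placeholders, so inspect your own rows and take a backup before trying anything like this:

```shell
# 1. Copy the file tree to the new store's location (paths are placeholders).
rsync -a /srv/oldstore/ /srv/newstore/

# 2. Rewrite the store-id prefix recorded for the moved files.
#    The column name and prefix format are assumptions; check your schema.
psql dvndb -c "UPDATE dvobject \
  SET storageidentifier = replace(storageidentifier, 'oldstore://', 'newstore://') \
  WHERE storageidentifier LIKE 'oldstore://%';"
```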
14:03 Jim12 The two main use cases I'm aware of are for people trying to separate large files or to allow independent accounting - one Dataverse with storage resources for different groups separated into different stores that can allocate and/or charge as they wish.
14:04 Jim12 Harvard is also looking at another case - two stores that point at the same S3 storage, but one configured for direct uploads (useful for larger files).
14:06 Jim12 The choice of managing which store is used by Dataverse was seen as a flexible/powerful first option, but how a store is assigned is fairly independent from the store mechanism itself, so it wouldn't be hard for someone to implement a way to choose a store based on file size for example (slightly tricky if one considers that Dataverse will unzip zip files).
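The per-Dataverse store assignment Jim mentions can be driven from the admin API; a sketch assuming the endpoint shape from the pull request linked earlier in this conversation (collection alias and store id are placeholders, and the endpoint should be verified against the guides for your version):

```shell
# See which store a Dataverse (collection) currently uses...
curl http://localhost:8080/api/admin/dataverse/mycollection/storageDriver

# ...and point it at a different configured store. Existing files stay
# where they are; only newly uploaded files go to the new store.
curl -X PUT -d 's3big' \
  http://localhost:8080/api/admin/dataverse/mycollection/storageDriver
```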
14:09 ppeter44 @pdurbin I think so, thank you. I think I now see how it works, and what may be available in the future.
14:18 poikilotherm Morning pdurbin :-) Have you seen the latest addition to https://github.com/poikilotherm/dvcli ? ;-)
14:48 pdurbin ppeter44: I think multistore is already available. Yeah, in Dataverse 4.20, the latest: https://github.com/IQSS/dataverse/releases/tag/v4.20
14:50 pdurbin poikilotherm: I see k8s stuff. And a logo! Nice!
15:35 uclouvain joined #dataverse
15:35 uclouvain Hi. I have a problem of storage on my dataverse instance
15:36 uclouvain It is configured to store files on S3, and it does. However, /usr/local/dvn/data/temp fills my hdd.
15:36 uclouvain Why? What's the solution?
15:40 pdurbin uclouvain: hi, I believe files are staged locally in some cases. Jim12 might have a better idea.
15:40 pdurbin uclouvain: I'm aware of this issue: Documentation: Document various temp directories used by Dataverse #2848 - https://github.com/IQSS/dataverse/issues/2848
15:41 pdurbin donsizemore: to quote you (from that issue), "Odum has ~7,300 files piled up in $files.dir/temp"
15:45 uclouvain it seems that those files are not yet pushed to the S3 storage.
15:46 poikilotherm uclouvain please see also  https://github.com/IQSS/dataverse/issues/6656
15:46 pdurbin uclouvain: that's what I'm hearing in Slack too.
15:46 uclouvain Any solution ?
15:47 poikilotherm I opened that recently and it analyses how things work internally and where things can start to pile up
15:51 pdurbin poikilotherm: nice issue. Thorough.
15:52 pdurbin uclouvain: I think we have a script that cleans up old files in that "temp" directory.
15:53 Jim12 When Dataverse is done processing and stores the file, it deletes the temporary copy. So any temp files that exist represent current uploads or failures in the past. Recent versions of Dataverse do a better job of cleaning up after cancellations and failures but, if a user uploads files and then leaves the browser without clicking save or cancel, the temp files remain. Deleting temp files on some periodic basis (e.g. ones older than a day) should be fine.
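Jim's periodic-deletion suggestion can be sketched as a cron-able script; the path comes from this conversation and the one-day cutoff is a judgment call, so adjust both for your installation:

```shell
#!/bin/sh
# Sweep Dataverse's upload staging directory. Files here are either
# in-flight uploads or leftovers from failed/abandoned ones.
TEMP_DIR="${TEMP_DIR:-/usr/local/dvn/data/temp}"
# -mtime +1 matches files last modified more than one full 24-hour
# period ago, so anything touched within the last day is spared.
if [ -d "$TEMP_DIR" ]; then
  find "$TEMP_DIR" -type f -mtime +1 -delete
fi
```

Remember Jim's caveat: a stale temp file may be the only server-side copy of an upload a user never completed.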
15:53 pdurbin Ghosts of failed uploads past.
15:54 Jim12 If there are other cases where v4.20 is not removing temp files, it would be good to report them.
15:55 pdurbin uclouvain: is this helping? :)
15:55 Jim12 @poikilotherm's issue points out there's another directory, managed by PrimeFaces, where you may also find temp files and may need to ensure sufficient space to handle active uploads.
15:56 uclouvain There's no risk in deleting files older than 1 day?
15:57 pdurbin uclouvain: 1 day old is probably fine. If you want to be even more careful you could delete files that are 3 days old.
15:59 Jim12 Dataverse only tracks those temp files as part of active uploads, so there's no way Dataverse can do anything with them after the upload session is gone. That said - if a user says 'I started uploading my only copy of a file a week ago and it didn't work - can I get my file back?' then the temp file is your only copy on the server.
16:01 uclouvain server.log is full of lines like these ones:
16:01 uclouvain [#|2020-04-22T13:56:21.472+0000|WARNING|glassfish 4.1|edu.harvard.iq.dataverse.ingest.IngestServiceBean|_ThreadID=51;_ThreadName=jk-connector(5);_TimeMillis=1587563781472;_LevelValue=900;| Failed to save the file, storage id s3://dataverse:171a21dba64-8634e7583d95 (Unable to calculate MD5 hash: /usr/local/dvn/data/temp/s3:/dataverse:171a21dba64-8634e7583d95 (No such file or directory))|#]  [#|2020-04-22T13:56:21.474+0000|WARNING|glassfi
16:02 pdurbin uclouvain: which version of Dataverse are you running?
16:03 uclouvain 4.14 (I know... I know...)
16:03 pdurbin :)
16:07 uclouvain BTW what's the default maximum number of files per dataset? My user already has over 900 files!
16:07 pdurbin There are some "Datafile Integrity" checks you could do: http://guides.dataverse.org/en/4.20/api/native-api.html#id17 . But it sounds like uploads might be failing.
16:08 pdurbin You can set limits for how many files can be uploaded at a time but I'm not sure if there's a way to set a limit on the total number of files in a dataset.
16:09 uclouvain OK So he will not be blocked
16:10 pdurbin I don't believe so.
16:11 pdurbin If you want to limit the number of files per dataset, please open a GitHub issue.
16:11 Jim12 re: number of files - QDR has 1800+, 1000 per single upload is the only limit I know of. Beyond that, things just get slow to display as the number of files goes up.
16:12 pdurbin Jim12: is that still true, even after the file listing on the dataset page was switched to Solr? It definitely was true in the past. :)
16:14 Jim12 I'm not sure what the error could be, especially back in 4.14. I'd suspect something like an S3 failure or a misconfiguration, but I'm not sure. FWIW - one thing we've seen lately in S3 from AWS - the us-east-1 region doesn't guarantee that a call to get a new file or its metadata will succeed immediately after its creation. We have some evidence that that has gone from <1 second to more than a minute at times lately (eventual consistency).
16:14 uclouvain So I deleted files older than one day. The user can go on with uploads. Thanks for your help !
16:14 Jim12 Our testing was around the new direct upload feature, but it could affect the normal uploads as well.
16:14 pdurbin uclouvain: you're welcome. It's probably getting late for you. :)
16:15 uclouvain We use S3 with CEPH at UCLouvain.
16:15 pdurbin fancy!
16:15 Jim12 Things getting slower? Yes - I think I started around 4.8.6 when it was already on solr. There have been improvements since, and maybe not so much in display as in making any changes (e.g. change metadata).
16:16 Jim12 Hopefully eventual consistency isn't your issue then.
16:16 pdurbin Huh, I thought the switch to Solr was more recent.
16:17 pdurbin This is the change I meant, from Dataverse 4.15: https://github.com/IQSS/dataverse/pull/5820
16:18 pdurbin Anyway, I have something like 115 files in my dataset and I can page through them fairly quickly.
16:44 jri joined #dataverse
18:03 pdurbin donsizemore: https://github.com/IQSS/dataverse-ansible/pull/167 looks good to me. Thanks!
18:43 donsizemore @pdurbin i pestered leonid about whether we should log the output of as-setup.sh — he's not worried about it for now but if he does more work on the installer he may add it
18:45 pdurbin Sounds good. So are you happy to switch to the installer? I thought you said you'd be able to delete some Ansible code but the pull request is a net gain in code.
18:58 donsizemore i already switched it. the nice thing is ansible isn't tracking and wedging in bits that would have been provided by the perl installer
18:58 donsizemore net gain in code... hey, it's python!
18:58 pdurbin :)
18:59 pdurbin Well, it's nice that the new installer will be tested regularly.
19:00 donsizemore i'm re-running in ec2 now with the jenkins group_vars
19:01 donsizemore hmm. sun must've moved to github? "Request failed: <urlopen error timed out>", "url": "http://dlc-cdn.sun.com/glassfish/4.1/release/glassfish-4.1.zip"
19:01 donsizemore maybe we should go ahead and switch that to payara
19:06 pdurbin That link works for me. I was wondering when we're going to switch Jenkins to Payara. What are we waiting for?
19:08 donsizemore gustavo to say so (and, if we get CentOS 8 AMIs any time soon, Glassfish to patch this morning's OpenJDK breakage)
19:10 pdurbin Yeah that was weird. Thanks for opening https://github.com/IQSS/dataverse/issues/6853 . I haven't grokked it completely.
19:11 donsizemore a change in openjdk-1.8.0.252+ broke payara-5.201's packaged grizzly
19:11 donsizemore you tell payara to use a different grizzly jar for openjdk _252 and up, and things are fine
19:12 pdurbin Gotcha. And you're always getting new JDK releases, I suppose. I don't update my JDK on my laptop very often.
19:12 donsizemore i like security patches
19:13 pdurbin I remember having a vague sense of horror and unrest when I had to install Java on my laptop for the first time. I probably should keep it more up to date than I do.
19:17 donsizemore if i want to set an ansible group_var to skip only that sitemap test, do i call it testUpdateSiteMap or SiteMapUtilTest.testUpdateSiteMap
19:18 donsizemore ooh ooh -Dit.test=\!SiteMapUtilTest.*
19:18 pdurbin buh, I'm not sure
19:19 donsizemore i'm trying ^^
19:19 donsizemore so Jenkins can carry on
19:33 pdurbin like a wayward son
