
IRC log for #dataverse, 2020-10-16

Connect via chat.dataverse.org to discuss Dataverse (dataverse.org, an open source web application for sharing, citing, analyzing, and preserving research data) with users and developers.


All times shown according to UTC.

Time Nick Message
06:23 nightowl313 joined #dataverse
07:07 Virgile joined #dataverse
07:58 Virgile joined #dataverse
09:13 Virgile joined #dataverse
11:51 Virgile joined #dataverse
11:52 donsizemore joined #dataverse
11:57 Virgile joined #dataverse
12:06 Virgile joined #dataverse
12:29 yoh joined #dataverse
13:04 donsizemore "This release includes tech preview functionality to run Jakarta EE 9 applications on Payara Server and Payara Micro." https://github.com/payara/Payara/releases/tag/payara-server-5.2020.5
14:04 pdurbin joined #dataverse
14:04 pdurbin donsizemore: nice
14:05 pdurbin poikilotherm: ^^
14:05 donsizemore @pdurbin i'm trying 5.2020.5 in the trsa-ansible role i'm writing for akio
14:05 pdurbin cool
14:05 pdurbin I'm curious if it "just works" or if we'll need to change javax stuff to jakarta in imports. The namespace change.
14:54 pdurbin donsizemore: man, everybody is asking about moving datasets from one Dataverse installation to another: https://groups.google.com/g/dataverse-community/c/PfKIZFxFZhE/m/_itkuvz8BAAJ
14:58 donsizemore yes
15:29 pameyer joined #dataverse
15:33 pameyer think the "move" things are coming up often enough that it's worth an API for "this is the round-trip'able version"?
15:34 pameyer ... and whenever I look at google groups, I'm reminded that I'm turning into that user who dislikes every UI redesign
16:09 pdurbin who moved my cheese
16:09 pameyer and why is everything animated
16:10 pdurbin heh
16:40 donsizemore I absolutely think it's worth it.
16:40 donsizemore and be careful with that cheese. i'm vegetarian and my doctor is on me about my cholesterol
16:41 pdurbin Yes to an API but it needs some thought/design.
16:42 pdurbin What about using BagIt as an interchange format?
16:42 donsizemore that's definitely on Jon and Jim's radar(S)
16:42 pdurbin Yeah, Jim brought it up on the call.
16:44 pameyer I'd wonder if BagIt might be too standardized for an interchange format - custom metadata, that kind of thing
16:44 * donsizemore ducks
16:44 pdurbin Good point. Dunno.
16:44 pameyer ... did I miss an argument about it? ;)
16:45 donsizemore I try to stay out of metadata-land
16:45 pdurbin It's too perilous.
16:46 pameyer APIs having thought/design is always good :) I'd mainly been thinking something along the lines of GET dataverse $x (modified dataset api), POST dataverse $y (existing dataset api)
16:46 donsizemore speaking of, in my crappy lil' trsa-ansible role i'm writing, i've added a "drop_db_if_exists" group_var
16:46 donsizemore do you think a drop_db_if_exists would be helpful for testing and safe enough for dataverse-ansible?
16:47 donsizemore to my knowledge, that super-user toggle called by the install script is the last non-idempotent part of the role
16:48 pdurbin Nice. It wouldn't be hard to fix that (make it enable/disable). And i drop my database all the time on my laptop but I don't use ansible there.
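A minimal sketch of how such a guarded drop might look as an Ansible task (the variable names `drop_db_if_exists` and `dataverse_db_name` are illustrative assumptions, not actual trsa-ansible or dataverse-ansible vars):

```yaml
# Hypothetical sketch, not the actual role's task. Destructive, so it is
# gated behind a group_var that defaults to false.
- name: Drop existing database (test systems only)
  become: yes
  become_user: postgres
  command: dropdb --if-exists "{{ dataverse_db_name }}"
  when: drop_db_if_exists | default(false) | bool
```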
16:49 pdurbin pameyer: well the GET is already in place to get a BagIt representation of a dataset. But I don't know much about it. Jim coded it up.
16:49 pameyer I've gotten the impression folks use dataverse-ansible in prod - any ideas if that's an install thing, or ongoing thing?
16:49 pameyer pdurbin: my knowledge of BagIt is pretty minimal too
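A rough sketch of the GET/POST pair pameyer describes, using the native API's JSON export and import endpoints (`$SOURCE`, `$DEST`, `$PID`, `$API_TOKEN`, and the target collection alias are placeholders; this moves metadata only, and the exported JSON needs massaging before import, which is exactly the round-trip gap under discussion):

```shell
# Sketch only: the shape of the round trip, not a working migration.
# Export the dataset metadata from the source installation:
curl -s "https://$SOURCE/api/datasets/export?exporter=dataverse_json&persistentId=$PID" \
  > dataset.json

# Import it (after any needed massaging) into a collection on the target:
curl -s -X POST -H "X-Dataverse-key: $API_TOKEN" \
  --upload-file dataset.json \
  "https://$DEST/api/dataverses/root/datasets/:import?pid=$PID&release=no"
```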
18:34 yoh joined #dataverse
19:01 yoh joined #dataverse
19:11 donsizemore joined #dataverse
19:18 nightowl313 joined #dataverse
19:20 donsizemore @nightowl313 any luck with the uploads?
19:22 nightowl313 still working on it .. wondering if there is anything else that needs to be done with the s3 bucket ... we have all s3 permissions set for the user that dataverse uses, and the cors ... I literally copied the ones in the guide (so it is wide open .. test system) .. but no other permissions on it
19:22 nightowl313 i suspect it is the cors, like you said, but i'm pretty unfamiliar with cors (as with everything else!)
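For comparison, a wide-open CORS configuration for direct upload looks roughly like this (a sketch to check the bucket against, not a verified copy of the guide's cors.json):

```json
{
  "CORSRules": [
    {
      "AllowedOrigins": ["*"],
      "AllowedHeaders": ["*"],
      "AllowedMethods": ["GET", "PUT"],
      "ExposeHeaders": ["ETag"]
    }
  ]
}
```

It would be applied with something like `aws s3api put-bucket-cors --bucket "$BUCKET" --cors-configuration file://cors.json`.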
19:23 nightowl313 I'm going to look for errors in the log ... I did look and didn't see anything , but will run it and try again
19:23 nightowl313 doing that now
19:23 Jim58 joined #dataverse
19:24 nightowl313 i got on to ask about a comment on here yesterday ... I think someone said that harvard's dataverse stores about 43T worth of data ... my team thought that seemed very small ... is that just the harvard data (ie: not the other orgs in there)
19:24 pdurbin Other orgs too. Free hosting for the world, within limits. :)
19:25 nightowl313 wow ... i think we assumed with all of those orgs that you would have 100's of TB of data ... how do you limit it?
19:25 nightowl313 our very first "client" is wanting to host 10-100TB of data!
19:26 pdurbin Well, I think we've always said people can't upload more than 1 TB total. And the file size limit was 2.5 GB for a long time. Not sure what it is now.
19:26 Jim58 Hi all - re: s3 - any clues in the browser console? It should have info about CORS if that's the problem.
19:27 nightowl313 testing it now ... had an appointment this morning and just getting back to it ... thanks for the responses ... i didn't see anything in the logs but testing that and the console again
19:29 donsizemore @nightowl313 historically Social Science data was quite small in size
19:30 pdurbin before Twitter came along
19:36 nightowl313 anyone know if there are any orgs using dataverse that host more than, say 50TB of data? 100's of TBs? I'm concerned that our expectations of what we can provide (eventually) may be somewhat inflated :-)
19:37 nightowl313 but I guess that would really be up to our ability to host that much data ... which we are working on with the provost and funding sources
19:37 nightowl313 but, can the dataverse application support that if we have the storage capacity?
19:38 pdurbin I think the key thing to having Dataverse support that much data is to NOT push it all through Glassfish/Payara. That is to say, using the direct upload/download to S3 is a win.
19:38 Jim58 if you're doing s3 with direct up and down, the scaling issue for dataverse is # of files rather than size.
19:39 pdurbin True. Dataverse doesn't do so well with thousands of files in a dataset. It works but it's kinda meh in my opinion.
19:40 donsizemore @nightowl313 can you upload a file as the glassfish user with 'aws s3 cp file s3://bucket/path' ?
19:40 nightowl313 direct upload for initial upload right? so there are issues with having that many files per dataset even after it is all there?
19:40 donsizemore @nightowl313 sorry, the payara user.
19:41 nightowl313 so, I'm getting a 403 forbidden error in the console when trying to upload ... will try to copy a file with aws cli per DS suggestion
19:41 nightowl313 upload error: undefined upid=0, Error 403: Forbidden
19:41 Jim58 is Dataverse running as the user with the aws credentials? (A test with normal upload would show whether that's ok)
19:43 nightowl313 it should be ... the .aws config file is in that account, and file uploads work fine when I don't have upload/download enabled
19:43 nightowl313 it just gets this error when I enable direct upload
19:44 nightowl313 but will try direct copy while logged in as the dataverse user
19:44 Jim58 @pameyer - FWIW: The Bag mechanism handles custom metadatablocks. The import part, which I just got ~working, only handles going to a new instance with the same metadatablocks at this point.
19:45 nightowl313 i mean payara user =)
19:45 Jim58 In the console - does the network tab show which call is getting the 403 - presumably the call to s3? And is there any info in the response tab there?
19:45 pdurbin Jim58: that's awesome about the bagit stuff
19:46 nightowl313 the put is getting the 403 ... there is a post, an options, and then the put (error) and then post
19:47 nightowl313 it says "no-referrer-when-downgrade"
19:48 Jim58 are Dataverse and s3 both https / both http? Or mixed?
19:49 nightowl313 dataverse is https ... is there something specific that needs to be done to the s3 bucket to make it https? if so I probably didn't do it =)
19:50 Jim58 not if it's aws - and the network tab should show what the full URL was
19:51 nightowl313 the put command on the network tab is using https://<my bucket name> + file location bunch of characters
19:51 nightowl313 does my aws user need any other permissions other than s3 all?
19:52 Jim58 I don't think so - and it should be the same for normal and direct.
19:53 nightowl313 actually it has PutObject, GetObjectAcl, GetObject, ListBucket, GetBucketAcl, DeleteObject, HeadBucket, GetBucketLocation, GetBucketPolicy
19:53 pameyer @Jim58 - cool, thanks
19:53 nightowl313 i forgot we limited it
19:55 Jim58 any aws:Referer policies set up? (Default should be OK but if you limited those...)
19:56 nightowl313 i don't think so .. didn't specifically do anything that I know of
19:57 pameyer is this a standard/standard-ish apache/ajp/payara setup?
19:57 Jim58 I think the one PUT call that fails should be OK if you just have PutObject, so I think the permissions are OK. (Dataverse needs more than PutObject internally)
19:57 pameyer I'm wondering about a possible external https -> web server -> internal http -> app server
19:58 nightowl313 i used dataverse-ansible! which is magical and we owe our entire dataverse to it
19:59 nightowl313 i configured the s3 part manually though .. and some other things, but the core was set up with that
20:02 Jim58 is it possible to access your test Dataverse from out here?
20:02 pameyer dataverse-ansible is apache ajp, so one thing that's not the problem
20:04 nightowl313 verified that I could copy a file directly from the payara user account to the s3 bucket
20:04 nightowl313 our test dv is currently public ... https://dataverse-test.lib.asu.edu
20:05 nightowl313 but we have it configured for sso (shib) ... I can create a local account
20:06 nightowl313 there was discussion about having it available for some of our research teams to have a "sandbox"
20:06 Jim58 So direct download works, which would suggest the basic creds/bucket are all OK - I guess that points more towards permission issues.
20:06 dataverse-user joined #dataverse
20:07 nightowl313 i can try changing the permissions to "*"  .. that is how we had it before
20:08 dataverse-user hi
20:09 Jim58 might be worthwhile - perhaps putObjectAcl or some other permission is also needed for a PUT to work.
20:10 Jim58 hi dataverse-user
20:10 nightowl313 doing that now ... sorry to take over the whole chat! I'll try that and check back ... thank you all!!!!!
20:11 Jim58 if that doesn't work, if I can get an account, I can see if I can spot anything else in the browser - good luck!
20:15 pameyer would awselb be a possible problem?
20:15 nightowl313 that was it!!!!
20:15 pdurbin * fixed it?
20:16 pameyer given the timestamps, I'm guessing that was Jim58's permission suggestion :)
20:16 nightowl313 yes giving * permissions to the user ... it works! (this is directly from the dataverse file upload interface)
20:17 nightowl313 the aws user in iam that is
20:18 nightowl313 you all are the greatest! thanks so much for working through another thing with me ... i'm going to lurk on here every day and see if I can help with anything
20:18 Jim58 Yay! - I'm not sure what else might be needed besides PutObjectAcl (and not sure it's that), but I'd think you should be able to cut it down from * . The only other things I can think might relate to that PUT would be handling signatures or adding tags/metadata. If there are perms for those you may need them.
20:19 pdurbin nightowl313: maybe you could open an issue about how we should document which permissions are needed.
20:23 nightowl313 i will do that! I think aws may have an analyzer tool to help identify permissions needed as well .. I may see if I can find that too
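Until the minimal set is documented, a scoped-down IAM policy along these lines is a plausible starting point for cutting back from `*` (the bucket name is a placeholder and the action list is an assumption, not a verified minimum):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:PutObjectAcl",
        "s3:DeleteObject",
        "s3:ListBucket",
        "s3:GetBucketLocation"
      ],
      "Resource": [
        "arn:aws:s3:::my-dataverse-bucket",
        "arn:aws:s3:::my-dataverse-bucket/*"
      ]
    }
  ]
}
```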
20:25 pameyer I'm not sure where, but there should be logs somewhere for what calls were made to the bucket
20:26 nightowl313 oh right! I will look at the cloudwatch logs .. still learning all of those aws services, too!
20:26 nightowl313 And now, to tackle uploads outside of dataverse! Thanks so much all!
20:28 nightowl313 so much to learn ... so little time =)
20:28 pameyer very true :)
20:29 pdurbin nightowl313: you have a lot of energy. You're probably learning faster than the rest of us. :)
20:29 pameyer doesn't seem to stop - just today I learned something new about solr tokenizers and custom metadata blocks
20:32 nightowl313 lol, well I have so much more to learn! and, seems like everything is needed now! what are solr tokenizers?
20:35 pameyer things that cause me weirdness with search API and metadata values with "-" in them :)
20:36 pameyer I'd been copy/pasting old solr schema blocks, and that didn't work too well when I wanted exact match searches with things that solr split up
20:40 nightowl313 that sounds interesting
20:42 pdurbin pameyer: exact match searches work better with "string" than with "text" in Solr.
20:42 pameyer pdurbin: exactly :)
20:42 pdurbin We use "string" for facets, for example.
20:45 pameyer they mostly work ok with string, as long as you don't have a solr token boundary character.  was contemplating adding to the guides, but it seemed like another pameyer's doing it wrong again thing ;)
20:45 pameyer mostly work ok with "text_en", I'd mean - yet another typo
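To illustrate the tokenizer issue pameyer ran into (field names here are hypothetical, not actual entries in Dataverse's Solr schema):

```xml
<!-- Hypothetical schema.xml fragment. A "string" field indexes the whole
     value as one token, so an exact-match search for "ABC-123" works.
     A "text_en" field is tokenized, so "ABC-123" gets split at the "-"
     and an exact-match query against the full value fails. -->
<field name="myBlockFieldExact" type="string" stored="true" indexed="true" multiValued="true"/>
<field name="myBlockFieldFullText" type="text_en" stored="true" indexed="true" multiValued="true"/>
```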
20:47 nightowl313 =)
20:48 pdurbin If you want to add it to the guides, go for it. :)
20:49 pdurbin It's been a pretty lively Friday afternoon in here but I'm stepping away from the screen soon. I hope everyone has a lovely weekend.
20:53 pameyer stepping away from screens is good :)
20:54 nightowl313 have a great weekend! thanks for the help!
20:57 pdurbin left #dataverse
21:14 nightowl313 oops have another question ... is it better to separate out normal file upload traffic from large file uploads to separate buckets/stores? I know the guide mentions the possibility but just wondering if there is a best practice or recommendation for that?
21:16 nightowl313 i suppose it might be difficult to anticipate which projects might have large files and enforce changing stores if file sizes are big vs small...
22:01 pameyer that's a good question - I don't know enough s3 (or dataverse+s3) to have useful ideas about it though
22:09 nightowl313 yea, now I'm thinking that will just be a battle we will fight later! Thanks!
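For the record, separate stores are defined with per-store JVM options, so a dedicated large-file store would look roughly like this (the store id `bigdata` and bucket name are placeholders; the option names follow the pattern in the configuration guide but should be checked against it):

```shell
# Sketch: a second S3 store alongside the default, selectable per collection.
./asadmin create-jvm-options "-Ddataverse.files.bigdata.type=s3"
./asadmin create-jvm-options "-Ddataverse.files.bigdata.label=BigData"
./asadmin create-jvm-options "-Ddataverse.files.bigdata.bucket-name=my-large-file-bucket"
# Direct upload/download keeps large transfers out of Payara itself:
./asadmin create-jvm-options "-Ddataverse.files.bigdata.upload-redirect=true"
```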

