Time
Nick
Message
06:23
nightowl313 joined #dataverse
07:07
Virgile joined #dataverse
07:58
Virgile joined #dataverse
09:13
Virgile joined #dataverse
11:51
Virgile joined #dataverse
11:52
donsizemore joined #dataverse
11:57
Virgile joined #dataverse
12:06
Virgile joined #dataverse
12:29
yoh joined #dataverse
13:04
donsizemore
"This release includes tech preview functionality to run Jakarta EE 9 applications on Payara Server and Payara Micro." https://github.com/payara/Payara/releases/tag/payara-server-5.2020.5
14:04
pdurbin joined #dataverse
14:04
pdurbin
donsizemore: nice
14:05
pdurbin
poikilotherm: ^^
14:05
donsizemore
@pdurbin i'm trying 5.2020.5 in the trsa-ansible role i'm writing for akio
14:05
pdurbin
cool
14:05
pdurbin
I'm curious if it "just works" or if we'll need to change javax stuff to jakarta in imports. The namespace change.
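(For context, the Jakarta EE 9 namespace change means application code has to rewrite its EE imports from javax.* to jakarta.*. A minimal sketch, not taken from the Dataverse code base, of what that looks like for a JAX-RS resource:)

```java
// Sketch only: illustrates the Jakarta EE 9 package rename, not actual Dataverse code.
// Under Jakarta EE 8 and earlier (what Dataverse on Payara 5 targets today) the imports are:
//   import javax.ws.rs.GET;
//   import javax.ws.rs.Path;
// Under Jakarta EE 9 the APIs are unchanged, but every EE package moves from javax.* to jakarta.*:
import jakarta.ws.rs.GET;
import jakarta.ws.rs.Path;

@Path("info")
public class InfoResource {

    // Same annotation, same behavior; only the package prefix differs.
    @GET
    public String version() {
        return "ok";
    }
}
```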
14:54
pdurbin
donsizemore: man, everybody is asking about moving datasets from one Dataverse installation to another: https://groups.google.com/g/dataverse-community/c/PfKIZFxFZhE/m/_itkuvz8BAAJ
14:58
donsizemore
yes
15:29
pameyer joined #dataverse
15:33
pameyer
think the "move" things are coming up often enough that it's worth an API for "this is the round-trip'able version"?
15:34
pameyer
... and whenever I look at google groups, I'm reminded that I'm turning into that user who dislikes every UI redesign
16:09
pdurbin
who moved my cheese
16:09
pameyer
and why is everything animated
16:10
pdurbin
heh
16:40
donsizemore
I absolutely think it's worth it.
16:40
donsizemore
and be careful with that cheese. i'm vegetarian and my doctor is on me about my cholesterol
16:41
pdurbin
Yes to an API but it needs some thought/design.
16:42
pdurbin
What about using BagIt as an interchange format?
16:42
donsizemore
that's definitely on Jon and Jim's radar(S)
16:42
pdurbin
Yeah, Jim brought it up on the call.
16:44
pameyer
I'd wonder if BagIt might be too standardized for an interchange format - custom metadata, that kind of thing
16:44
* donsizemore
ducks
16:44
pdurbin
Good point. Dunno.
16:44
pameyer
... did I miss an argument about it? ;)
16:45
donsizemore
I try to stay out of metadata-land
16:45
pdurbin
It's too perilous.
16:46
pameyer
APIs having thought/design is always good :) I'd mainly been thinking something along the lines of GET dataverse $x (modified dataset api), POST dataverse $y (existing dataset api)
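(A rough sketch of that round trip using the native JSON export/import endpoints, not the BagIt mechanism mentioned below. Hostnames, the DOI, the collection alias, and the token are placeholders, and in practice the export payload may need massaging into the shape the import endpoint expects, which is part of the gap being discussed:)

```bash
# Hypothetical round trip between two installations via the native API.
# Placeholders: source.example.edu, target.example.edu, $API_TOKEN, the DOI, "target-collection".

# GET a JSON export of the dataset from the source installation:
curl "https://source.example.edu/api/datasets/export?exporter=dataverse_json&persistentId=doi:10.5072/FK2/EXAMPLE" \
  > dataset.json

# POST it into a collection on the target installation, keeping the same PID
# (the :import endpoint requires a superuser API token; the exported JSON may
# need trimming to the create/import schema first):
curl -H "X-Dataverse-key: $API_TOKEN" -X POST \
  -H "Content-Type: application/json" \
  --data-binary @dataset.json \
  "https://target.example.edu/api/dataverses/target-collection/datasets/:import?pid=doi:10.5072/FK2/EXAMPLE&release=no"
```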
16:46
donsizemore
speaking of, in my crappy lil' trsa-ansible role i'm writing, i've added a "drop_db_if_exists" group_var
16:46
donsizemore
do you think a drop_db_if_exists would be helpful for testing and safe enough for dataverse-ansible?
16:47
donsizemore
to my knowledge, that super-user toggle called by the install script is the last non-idempotent part of the role
16:48
pdurbin
Nice. It wouldn't be hard to fix that (make it enable/disable). And I drop my database all the time on my laptop but I don't use ansible there.
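(As a sketch of what such a guarded task could look like; the variable and database names are illustrative, not taken from dataverse-ansible or trsa-ansible:)

```yaml
# Hypothetical guarded drop for a test-only convenience flag.
# "drop_db_if_exists" and "dataverse_db_name" are illustrative variable names.
- name: Drop the existing database so the installer starts clean (test systems only)
  postgresql_db:
    name: "{{ dataverse_db_name }}"
    state: absent
  become: true
  become_user: postgres
  when: drop_db_if_exists | default(false) | bool
```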
16:49
pdurbin
pameyer: well the GET is already in place to get a BagIt representation of a dataset. But I don't know much about it. Jim coded it up.
16:49
pameyer
I've gotten the impression folks use dataverse-ansible in prod - any ideas if that's an install thing, or ongoing thing?
16:49
pameyer
pdurbin: my knowledge of BagIt is pretty minimal too
18:34
yoh joined #dataverse
19:01
yoh joined #dataverse
19:11
donsizemore joined #dataverse
19:18
nightowl313 joined #dataverse
19:20
donsizemore
@nightowl313 any luck with the uploads?
19:22
nightowl313
still working on it .. wondering if there is anything else that needs to be done with the s3 bucket ... we have all s3 permissions set for the user that dataverse uses, and the cors ... I literally copied the ones in the guide (so it is wide open .. test system) .. but no other permissions on it
19:22
nightowl313
i suspect it is the cors, like you said, but i'm pretty unfamiliar with cors (as with everything else!)
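(For reference, a wide-open CORS configuration of the sort the installation guide describes for direct upload looks roughly like this; the bucket name is a placeholder and the exact rules for your Dataverse version should be taken from the guide itself:)

```bash
# Sketch of a permissive CORS setup for a test bucket; not a verbatim copy of the guide's rules.
cat > cors.json <<'EOF'
{
  "CORSRules": [
    {
      "AllowedOrigins": ["*"],
      "AllowedHeaders": ["*"],
      "AllowedMethods": ["GET", "PUT"],
      "ExposeHeaders": ["ETag"]
    }
  ]
}
EOF

# Apply it to the bucket (placeholder bucket name):
aws s3api put-bucket-cors --bucket my-test-bucket --cors-configuration file://cors.json
```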
19:23
nightowl313
I'm going to look for errors in the log ... I did look and didn't see anything, but will run it and try again
19:23
nightowl313
doing that now
19:23
Jim58 joined #dataverse
19:24
nightowl313
i got on to ask about a comment on here yesterday ... I think someone said that harvard's dataverse stores about 43T worth of data ... my team thought that seemed very small ... is that just the harvard data (ie: not the other orgs in there)?
19:24
pdurbin
Other orgs too. Free hosting for the world, within limits. :)
19:25
nightowl313
wow ... i think we assumed with all of those orgs that you would have 100's of TB of data ... how do you limit it?
19:25
nightowl313
our very first "client" is wanting to host 10-100TB of data!
19:26
pdurbin
Well, I think we've always said people can't upload more than 1 TB total. And the file size limit was 2.5 GB for a long time. Not sure what it is now.
19:26
Jim58
Hi all - re: s3 - any clues in the browser console? It should have info about CORS if that's the problem.
19:27
nightowl313
testing it now ... had an appointment this morning and just getting back to it ... thanks for the responses ... i didn't see anything in the logs but testing that and the console again
19:29
donsizemore
@nightowl313 historically Social Science data was quite small in size
19:30
pdurbin
before Twitter came along
19:36
nightowl313
anyone know if there are any orgs using dataverse that host more than, say 50TB of data? 100's of TBs? I'm concerned that our expectations of what we can provide (eventually) may be somewhat inflated :-)
19:37
nightowl313
but I guess that would really be up to our ability to host that much data ... which we are working on with the provost and funding sources
19:37
nightowl313
but, can the dataverse application support that if we have the storage capacity?
19:38
pdurbin
I think the key thing to having Dataverse support that much data is to NOT push it all through Glassfish/Payara. That is to say, using the direct upload/download to S3 is a win.
19:38
Jim58
if you're doing s3 with direct up and down, the scaling issue for dataverse is # of files rather than size.
19:39
pdurbin
True. Dataverse doesn't do so well with thousands of files in a dataset. It works but it's kinda meh in my opinion.
19:40
donsizemore
@nightowl313 can you upload a file as the glassfish user with 'aws s3 cp file s3://bucket/path' ?
19:40
nightowl313
direct upload for initial upload right? so there are issues with having that many files per dataset even after it is all there?
19:40
donsizemore
@nightowl313 sorry, the payara user.
19:41
nightowl313
so, I'm getting a 403 forbidden error in the console when trying to upload ... will try to copy a file with aws cli per DS suggestion
19:41
nightowl313
upload error: undefined upid=0, Error 403: Forbidden
19:41
Jim58
is Dataverse running as the user with the aws credentials? (A test with normal upload would show whether that's ok)
19:43
nightowl313
it should be ... the .aws config file is in that account, and file uploads work fine when I don't have upload/download enabled
19:43
nightowl313
it just gets this error when I enable direct upload
19:44
nightowl313
but will try direct copy while logged in as the dataverse user
19:44
Jim58
@pameyer - FWIW: The Bag mechanism handles custom metadatablocks. The import part, which I just got ~working, only handles going to a new instance with the same metadatablocks at this point.
19:45
nightowl313
i mean payara user =)
19:45
Jim58
In the console - does the network tab show which call is getting the 403 - presumably the call to s3? And is there any info in the response tab there?
19:45
pdurbin
Jim58: that's awesome about the bagit stuff
19:46
nightowl313
the put is getting the 403 ... there is a post, an options, and then the put (error) and then post
19:47
nightowl313
it says "no-referrer-when-downgrade"
19:48
Jim58
are Dataverse and s3 both https / both http? Or mixed?
19:49
nightowl313
dataverse is https ... is there something specific that needs to be done to the s3 bucket to make it https? if so I probably didn't do it =)
19:50
Jim58
not if it's aws - and the network tab should show what the full URL was
19:51
nightowl313
the put command on the network tab is using https://<my bucket name> + the file location + a bunch of characters
19:51
nightowl313
does my aws user need any other permissions other than s3 all?
19:52
Jim58
I don't think so - and it should be the same for normal and direct.
19:53
nightowl313
actually it has PutObject, GetObjectAcl, GetObject, ListBucket, GetBucketAcl, DeleteObject, HeadBucket, GetBucketLocation, GetBucketPolicy
19:53
pameyer
@Jim58 - cool, thanks
19:53
nightowl313
i forgot we limited it
19:55
Jim58
any aws:Referer policies set up? (Default should be OK but if you limited those...)
19:56
nightowl313
i don't think so .. didn't specifically do anything that I know of
19:57
pameyer
is this a standard/standard-ish apache/ajp/payara setup?
19:57
Jim58
I think the one PUT call that fails should be OK if you just have PutObject, so I think the permissions are OK. (Dataverse needs more than PutObject internally)
19:57
pameyer
I'm wondering about a possible external https -> web server -> internal http -> app server
19:58
nightowl313
i used dataverse-ansible! which is magical and we owe our entire dataverse to it
19:59
nightowl313
i configured the s3 part manually though .. and some other things, but the core was set up with that
20:02
Jim58
is it possible to access your test Dataverse from out here?
20:02
pameyer
dataverse-ansible uses apache/ajp, so that's one thing that's not the problem
20:04
nightowl313
verified that I could copy a file directly from the payara user account to the s3 bucket
20:04
nightowl313
our test dv is currently public ... https://dataverse-test.lib.asu.edu
20:05
nightowl313
but we have it configured for sso (shib) ... I can create a local account
20:06
nightowl313
there was discussion about having it available for some of our research teams to have a "sandbox"
20:06
Jim58
So direct download works, which would suggest the basic creds/bucket are all OK - I guess that points more towards permission issues.
20:06
dataverse-user joined #dataverse
20:07
nightowl313
i can try changing the permissions to "*" .. that is how we had it before
20:08
dataverse-user
hi
20:09
Jim58
might be worthwhile - perhaps putObjectAcl or some other permission is also needed for a PUT to work.
20:10
Jim58
hi dataverse-user
20:10
nightowl313
doing that now ... sorry to take over the whole chat! I'll try that and check back ... thank you all!!!!!
20:11
Jim58
if that doesn't work, if I can get an account, I can see if I can spot anything else in the browser - good luck!
20:15
pameyer
would awselb be a possible problem?
20:15
nightowl313
that was it!!!!
20:15
pdurbin
* fixed it?
20:16
pameyer
given the timestamps, I'm guessing that was Jim58's permission suggestion :)
20:16
nightowl313
yes giving * permissions to the user ... it works! (this is directly from the dataverse file upload interface)
20:17
nightowl313
the aws user in iam, that is
20:18
nightowl313
you all are the greatest! thanks so much for working through another thing with me ... i'm going to lurk on here every day and see if I can help with anything
20:18
Jim58
Yay! - I'm not sure what else might be needed besides PutObjectAcl (and not sure it's that), but I'd think you should be able to cut it down from * . The only other things I can think might relate to that PUT would be handling signatures or adding tags/metadata. If there are perms for those you may need them.
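(A hypothetical starting point for narrowing back down from "*". The exact minimum set of S3 actions needed for direct upload isn't established in this conversation, and the bucket name is a placeholder:)

```bash
# Hypothetical IAM policy sketch for tightening permissions back up after the "*" test.
# The action list is a guess based on the permissions discussed above plus PutObjectAcl;
# verify the minimum set for your setup.
cat > dataverse-s3-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "BucketLevelActions",
      "Effect": "Allow",
      "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
      "Resource": "arn:aws:s3:::my-test-bucket"
    },
    {
      "Sid": "ObjectLevelActions",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject",
        "s3:GetObjectAcl",
        "s3:PutObjectAcl"
      ],
      "Resource": "arn:aws:s3:::my-test-bucket/*"
    }
  ]
}
EOF
```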
20:19
pdurbin
nightowl313: maybe you could open an issue about how we should document which permissions are needed.
20:23
nightowl313
i will do that! I think aws may have an analyzer tool to help identify permissions needed as well .. I may see if I can find that too
20:25
pameyer
I'm not sure where, but there should be logs somewhere for what calls were made to the bucket
20:26
nightowl313
oh right! I will look at the cloudwatch logs .. still learning all of those aws services, too!
20:26
nightowl313
And now, to tackle uploads outside of dataverse! Thanks so much all!
20:28
nightowl313
so much to learn ... so little time =)
20:28
pameyer
very true :)
20:29
pdurbin
nightowl313: you have a lot of energy. You're probably learning faster than the rest of us. :)
20:29
pameyer
doesn't seem to stop - just today I learned something new about solr tokenizers and custom metadata blocks
20:32
nightowl313
lol, well I have so much more to learn! and, seems like everything is needed now! what are solr tokenizers?
20:35
pameyer
things that cause me weirdness with search API and metadata values with "-" in them :)
20:36
pameyer
I'd been copy/pasting old solr schema blocks, and that didn't work too well when I wanted exact match searches with things that solr split up
20:40
nightowl313
that sounds interesting
20:42
pdurbin
pameyer: exact match searches work better with "string" than with "text" in Solr.
20:42
pameyer
pdurbin: exactly :)
20:42
pdurbin
We use "string" for facets, for example.
20:45
pameyer
they mostly work ok with string, as long as you don't have a solr token boundary character. was contemplating adding to the guides, but it seemed like another pameyer's doing it wrong again thing ;)
20:45
pameyer
mostly work ok with "text_en", I'd mean - yet another typo
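(A sketch of the distinction being discussed, with made-up field names rather than entries from the stock Dataverse schema.xml:)

```xml
<!-- Illustrative only; these field names are not from the stock Dataverse schema.xml. -->

<!-- "text_en" is tokenized, so a value such as "ABC-123" gets split on the "-"
     and whole-value exact matches through the search API can behave unexpectedly: -->
<field name="myCustomField" type="text_en" stored="true" indexed="true" multiValued="true"/>

<!-- "string" indexes the value as a single untokenized term, which is why it
     works better for facets and exact-match searches: -->
<field name="myCustomFieldExact" type="string" stored="true" indexed="true" multiValued="true"/>
```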
20:47
nightowl313
=)
20:48
pdurbin
If you want to add it to the guides, go for it. :)
20:49
pdurbin
It's been a pretty lively Friday afternoon in here but I'm stepping away from the screen soon. I hope everyone has a lovely weekend.
20:53
pameyer
stepping away from screens is good :)
20:54
nightowl313
have a great weekend! thanks for the help!
20:57
pdurbin left #dataverse
21:14
nightowl313
oops have another question ... is it better to separate out normal file upload traffic from large file uploads to separate buckets/stores? I know the guide mentions the possibility but just wondering if there is a best practice or recommendation for that?
21:16
nightowl313
i suppose it might be difficult to anticipate which projects might have large files and enforce changing stores if file sizes are big vs small...
22:01
pameyer
that's a good question - I don't know enough s3 (or dataverse+s3) to have useful ideas about it though
22:09
nightowl313
yea, now I'm thinking that will just be a battle we will fight later! Thanks!