Time
Nick
Message
06:23
nightowl313 joined #dataverse
07:07
Virgile joined #dataverse
07:58
Virgile joined #dataverse
09:13
Virgile joined #dataverse
11:51
Virgile joined #dataverse
11:52
donsizemore joined #dataverse
11:57
Virgile joined #dataverse
12:06
Virgile joined #dataverse
12:29
yoh joined #dataverse
13:04
donsizemore
"This release includes tech preview functionality to run Jakarta EE 9 applications on Payara Server and Payara Micro." https://github.com/payara/Payara/releases/tag/payara-server-5.2020.5
14:04
pdurbin joined #dataverse
14:04
pdurbin
donsizemore: nice
14:05
pdurbin
poikilotherm: ^^
14:05
donsizemore
@pdurbin i'm trying 5.2020.5 in the trsa-ansible role i'm writing for akio
14:05
pdurbin
cool
14:05
pdurbin
I'm curious if it "just works" or if we'll need to change javax stuff to jakarta in imports. The namespace change.
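(For context, the Jakarta EE 9 namespace change means application code has to rewrite its EE imports from javax.* to jakarta.*. A minimal sketch, not taken from the Dataverse code base, of what that looks like for a JAX-RS resource:)

```java
// Sketch only: illustrates the Jakarta EE 9 package rename, not actual Dataverse code.
// Under Jakarta EE 8 and earlier (what Dataverse on Payara 5 targets today) the imports are:
//   import javax.ws.rs.GET;
//   import javax.ws.rs.Path;
// Under Jakarta EE 9 the APIs are unchanged, but every EE package moves from javax.* to jakarta.*:
import jakarta.ws.rs.GET;
import jakarta.ws.rs.Path;

@Path("info")
public class InfoResource {

    // Same annotation, same behavior; only the package prefix differs.
    @GET
    public String version() {
        return "ok";
    }
}
```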
14:54
pdurbin
donsizemore: man, everybody is asking about moving datasets from one Dataverse installation to another: https://groups.google.com/g/dataverse-community/c/PfKIZFxFZhE/m/_itkuvz8BAAJ
14:58
donsizemore
yes
15:29
pameyer joined #dataverse
15:33
pameyer
think the "move" things are coming up often enough that it's worth an API for "this is the round-trip'able version"?
15:34
pameyer
... and whenever I look at google groups, I'm reminded that I'm turning into that user who dislikes every UI redesign
16:09
pdurbin
who moved my cheese
16:09
pameyer
and why is everything animated
16:10
pdurbin
heh
16:40
donsizemore
I absolutely think it's worth it.
16:40
donsizemore
and be careful with that cheese. i'm vegetarian and my doctor is on me about my cholesterol
16:41
pdurbin
Yes to an API but it needs some thought/design.
16:42
pdurbin
What about using BagIt as an interchange format?
16:42
donsizemore
that's definitely on Jon and Jim's radar(S)
16:42
pdurbin
Yeah, Jim brought it up on the call.
16:44
pameyer
I'd wonder if BagIt might be too standardized for an interchange format - custom metadata, that kind of thing
16:44
* donsizemore
ducks
16:44
pdurbin
Good point. Dunno.
16:44
pameyer
... did I miss an argument about it? ;)
16:45
donsizemore
I try to stay out of metadata-land
16:45
pdurbin
It's too perilous.
16:46
pameyer
APIs having thought/design is always good :) I'd mainly been thinking something along the lines of GET dataverse $x (modified dataset api), POST dataverse $y (existing dataset api)
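(A rough sketch of that round trip using the native JSON export/import endpoints, not the BagIt mechanism mentioned below. Hostnames, the DOI, the collection alias, and the token are placeholders, and in practice the export payload may need massaging into the shape the import endpoint expects, which is part of the gap being discussed:)

```bash
# Hypothetical round trip between two installations via the native API.
# Placeholders: source.example.edu, target.example.edu, $API_TOKEN, the DOI, "target-collection".

# GET a JSON export of the dataset from the source installation:
curl "https://source.example.edu/api/datasets/export?exporter=dataverse_json&persistentId=doi:10.5072/FK2/EXAMPLE" \
  > dataset.json

# POST it into a collection on the target installation, keeping the same PID
# (the :import endpoint requires a superuser API token; the exported JSON may
# need trimming to the create/import schema first):
curl -H "X-Dataverse-key: $API_TOKEN" -X POST \
  -H "Content-Type: application/json" \
  --data-binary @dataset.json \
  "https://target.example.edu/api/dataverses/target-collection/datasets/:import?pid=doi:10.5072/FK2/EXAMPLE&release=no"
```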
16:46
donsizemore
speaking of, in my crappy lil' trsa-ansible role i'm writing, i've added a "drop_db_if_exists" group_var
16:46
donsizemore
do you think a drop_db_if_exists would be helpful for testing and safe enough for dataverse-ansible?
16:47
donsizemore
to my knowledge, that super-user toggle called by the install script is the last non-idempotent part of the role
16:48
pdurbin
Nice. It wouldn't be hard to fix that (make it enable/disable). And I drop my database all the time on my laptop but I don't use ansible there.
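(As a sketch of what such a guarded task could look like; the variable and database names are illustrative, not taken from dataverse-ansible or trsa-ansible:)

```yaml
# Hypothetical guarded drop for a test-only convenience flag.
# "drop_db_if_exists" and "dataverse_db_name" are illustrative variable names.
- name: Drop the existing database so the installer starts clean (test systems only)
  postgresql_db:
    name: "{{ dataverse_db_name }}"
    state: absent
  become: true
  become_user: postgres
  when: drop_db_if_exists | default(false) | bool
```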
16:49
pdurbin
pameyer: well the GET is already in place to get a BagIt representation of a dataset. But I don't know much about it. Jim coded it up.
16:49
pameyer
I've gotten the impression folks use dataverse-ansible in prod - any ideas if that's an install thing, or ongoing thing?
16:49
pameyer
pdurbin: my knowledge of BagIt is pretty minimal too
18:34
yoh joined #dataverse
19:01
yoh joined #dataverse
19:11
donsizemore joined #dataverse
19:18
nightowl313 joined #dataverse
19:20
donsizemore
@nightowl313 any luck with the uploads?
19:22
nightowl313
still working on it .. wondering if there is anything else that needs to be done with the s3 bucket ... we have all s3 permissions set for the user that dataverse uses, and the cors ... I literally copied the ones in the guide (so it is wide open .. test system) .. but no other permissions on it
19:22
nightowl313
i suspect it is the cors, like you said, but i'm pretty unfamiliar with cors (as with everything else!)
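(For reference, a wide-open CORS configuration of the sort the installation guide describes for direct upload looks roughly like this; the bucket name is a placeholder and the exact rules for your Dataverse version should be taken from the guide itself:)

```bash
# Sketch of a permissive CORS setup for a test bucket; not a verbatim copy of the guide's rules.
cat > cors.json <<'EOF'
{
  "CORSRules": [
    {
      "AllowedOrigins": ["*"],
      "AllowedHeaders": ["*"],
      "AllowedMethods": ["GET", "PUT"],
      "ExposeHeaders": ["ETag"]
    }
  ]
}
EOF

# Apply it to the bucket (placeholder bucket name):
aws s3api put-bucket-cors --bucket my-test-bucket --cors-configuration file://cors.json
```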
19:23
nightowl313
I'm going to look for errors in the log ... I did look and didn't see anything, but will run it and try again
19:23
nightowl313
doing that now
19:23
Jim58 joined #dataverse
19:24
nightowl313
i got on to ask about a comment on here yesterday ... I think someone said that harvard's dataverse stores about 43T worth of data ... my team thought that seemed very small ... is that just the harvard data (ie: not the other orgs in there)?
19:24
pdurbin
Other orgs too. Free hosting for the world, within limits. :)
19:25
nightowl313
wow ... i think we assumed with all of those orgs that you would have 100's of TB of data ... how do you limit it?
19:25
nightowl313
our very first "client" is wanting to host 10-100TB of data!
19:26
pdurbin
Well, I think we've always said people can't upload more than 1 TB total. And the file size limit was 2.5 GB for a long time. Not sure what it is now.
19:26
Jim58
Hi all - re: s3 - any clues in the browser console? It should have info about CORS if that's the problem.
19:27
nightowl313
testing it now ... had an appointment this morning and just getting back to it ... thanks for the responses ... i didn't see anything in the logs but testing that and the console again
19:29
donsizemore
@nightowl313 historically Social Science data was quite small in size
19:30
pdurbin
before Twitter came along
19:36
nightowl313
anyone know if there are any orgs using dataverse that host more than, say 50TB of data? 100's of TBs? I'm concerned that our expectations of what we can provide (eventually) may be somewhat inflated :-)
19:37
nightowl313
but I guess that would really be up to our ability to host that much data ... which we are working on with the provost and funding sources
19:37
nightowl313
but, can the dataverse application support that if we have the storage capacity?
19:38
pdurbin
I think the key thing to having Dataverse support that much data is to NOT push it all through Glassfish/Payara. That is to say, using the direct upload/download to S3 is a win.
19:38
Jim58
if you're doing s3 with direct up and down, the scaling issue for dataverse is # of files rather than size.
19:39
pdurbin
True. Dataverse doesn't do so well with thousands of files in a dataset. It works but it's kinda meh in my opinion.
19:40
donsizemore
@nightowl313 can you upload a file as the glassfish user with 'aws s3 cp file s3://bucket/path' ?
19:40
nightowl313
direct upload for initial upload right? so there are issues with having that many files per dataset even after it is all there?
19:40
donsizemore
@nightowl313 sorry, the payara user.
19:41
nightowl313
so, I'm getting a 403 forbidden error in the console when trying to upload ... will try to copy a file with aws cli per DS suggestion
19:41
nightowl313
upload error: undefined upid=0, Error 403: Forbidden
19:41
Jim58
is Dataverse running as the user with the aws credentials? (A test with normal upload would show whether that's ok)
19:43
nightowl313
it should be ... the .aws config file is in that account, and file uploads work fine when I don't have upload/download enabled
19:43
nightowl313
it just gets this error when I enable direct upload
19:44
nightowl313
but will try direct copy while logged in as the dataverse user
19:44
Jim58
@pameyer - FWIW: The Bag mechanism handles custom metadatablocks. The import part, which I just got ~working, only handles going to a new instance with the same metadatablocks at this point.
19:45
nightowl313
i mean payara user =)
19:45
Jim58
In the console - does the network tab show which call is getting the 403 - presumably the call to s3? And is there any info in the response tab there?
19:45
pdurbin
Jim58: that's awesome about the bagit stuff
19:46
nightowl313
the put is getting the 403 ... there is a post, an options, and then the put (error) and then post
19:47
nightowl313
it says "no-referrer-when-downgrade"
19:48
Jim58
are Dataverse and s3 both https / both http? Or mixed?
19:49
nightowl313
dataverse is https ... is there something specific that needs to be done to the s3 bucket to make it https? if so I probably didn't do it =)
19:50
Jim58
not if it's aws - and the network tab should show what the full URL was
19:51
nightowl313
the put command on the network tab is using https://<my bucket name> + the file location + a bunch of characters
19:51
nightowl313
does my aws user need any other permissions other than s3 all?
19:52
Jim58
I don't think so - and it should be the same for normal and direct.
19:53
nightowl313
actually it has PutObject, GetObjectAcl, GetObject, ListBucket, GetBucketAcl, DeleteObject, HeadBucket, GetBucketLocation, GetBucketPolicy
19:53
pameyer
@Jim58 - cool, thanks
19:53
nightowl313
i forgot we limited it
19:55
Jim58
any aws:Referer policies set up? (Default should be OK but if you limited those...)
19:56
nightowl313
i don't think so .. didn't specifically do anything that I know of
19:57
pameyer
is this a standard/standard-ish apache/ajp/payara setup?
19:57
Jim58
I think the one PUT call that fails should be OK if you just have PutObject, so I think the permissions are OK. (Dataverse needs more than PutObject internally)
19:57
pameyer
I'm wondering about a possible external https -> web server -> internal http -> app server
19:58
nightowl313
i used dataverse-ansible! which is magical and we owe our entire dataverse to it
19:59
nightowl313
i configured the s3 part manually though .. and some other things, but the core was set up with that
20:02
Jim58
is it possible to access your test Dataverse from out here?
20:02
pameyer
dataverse-ansible uses apache/ajp, so that's one thing that's not the problem
20:04
nightowl313
verified that I could copy a file directly from the payara user account to the s3 bucket
20:04
nightowl313
our test dv is currently public ... https://dataverse-test.lib.asu.edu
20:05
nightowl313
but we have it configured for sso (shib) ... I can create a local account
20:06
nightowl313
there was discussion about having it available for some of our research teams to have a "sandbox"
20:06
Jim58
So direct download works, which would suggest the basic creds/bucket are all OK - I guess that points more towards permission issues.
20:06
dataverse-user joined #dataverse
20:07
nightowl313
i can try changing the permissions to "*" .. that is how we had it before
20:08
dataverse-user
hi
20:09
Jim58
might be worthwhile - perhaps putObjectAcl or some other permission is also needed for a PUT to work.
20:10
Jim58
hi dataverse-user
20:10
nightowl313
doing that now ... sorry to take over the whole chat! I'll try that and check back ... thank you all!!!!!
20:11
Jim58
if that doesn't work, if I can get an account, I can see if I can spot anything else in the browser - good luck!
20:15
pameyer
would awselb be a possible problem?
20:15
nightowl313
that was it!!!!
20:15
pdurbin
* fixed it?
20:16
pameyer
given the timestamps, I'm guessing that was Jim58's permission suggestion :)
20:16
nightowl313
yes giving * permissions to the user ... it works! (this is directly from the dataverse file upload interface)
20:17
nightowl313
the aws user in iam, that is
20:18
nightowl313
you all are the greatest! thanks so much for working through another thing with me ... i'm going to lurk on here every day and see if I can help with anything
20:18
Jim58
Yay! - I'm not sure what else might be needed besides PutObjectAcl (and not sure it's that), but I'd think you should be able to cut it down from * . The only other things I can think might relate to that PUT would be handling signatures or adding tags/metadata. If there are perms for those you may need them.
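(A hypothetical starting point for narrowing back down from "*". The exact minimum set of S3 actions needed for direct upload isn't established in this conversation, and the bucket name is a placeholder:)

```bash
# Hypothetical IAM policy sketch for tightening permissions back up after the "*" test.
# The action list is a guess based on the permissions discussed above plus PutObjectAcl;
# verify the minimum set for your setup.
cat > dataverse-s3-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "BucketLevelActions",
      "Effect": "Allow",
      "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
      "Resource": "arn:aws:s3:::my-test-bucket"
    },
    {
      "Sid": "ObjectLevelActions",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject",
        "s3:GetObjectAcl",
        "s3:PutObjectAcl"
      ],
      "Resource": "arn:aws:s3:::my-test-bucket/*"
    }
  ]
}
EOF
```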
20:19
pdurbin
nightowl313: maybe you could open an issue about how we should document which permissions are needed.
20:23
nightowl313
i will do that! I think aws may have an analyzer tool to help identify permissions needed as well .. I may see if I can find that too
20:25
pameyer
I'm not sure where, but there should be logs somewhere for what calls were made to the bucket
20:26
nightowl313
oh right! I will look at the cloudwatch logs .. still learning all of those aws services, too!
20:26
nightowl313
And now, to tackle uploads outside of dataverse! Thanks so much all!
20:28
nightowl313
so much to learn ... so little time =)
20:28
pameyer
very true :)
20:29
pdurbin
nightowl313: you have a lot of energy. You're probably learning faster than the rest of us. :)
20:29
pameyer
doesn't seem to stop - just today I learned something new about solr tokenizers and custom metadata blocks
20:32
nightowl313
lol, well I have so much more to learn! and, seems like everything is needed now! what are solr tokenizers?
20:35
pameyer
things that cause me weirdness with search API and metadata values with "-" in them :)
20:36
pameyer
I'd been copy/pasting old solr schema blocks, and that didn't work too well when I wanted exact match searches with things that solr split up
20:40
nightowl313
that sounds interesting
20:42
pdurbin
pameyer: exact match searches work better with "string" than with "text" in Solr.
20:42
pameyer
pdurbin: exactly :)
20:42
pdurbin
We use "string" for facets, for example.
20:45
pameyer
they mostly work ok with string, as long as you don't have a solr token boundary character. was contemplating adding to the guides, but it seemed like another pameyer's doing it wrong again thing ;)
20:45
pameyer
mostly work ok with "text_en", I'd mean - yet another typo
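(A sketch of the distinction being discussed, with made-up field names rather than entries from the stock Dataverse schema.xml:)

```xml
<!-- Illustrative only; these field names are not from the stock Dataverse schema.xml. -->

<!-- "text_en" is tokenized, so a value such as "ABC-123" gets split on the "-"
     and whole-value exact matches through the search API can behave unexpectedly: -->
<field name="myCustomField" type="text_en" stored="true" indexed="true" multiValued="true"/>

<!-- "string" indexes the value as a single untokenized term, which is why it
     works better for facets and exact-match searches: -->
<field name="myCustomFieldExact" type="string" stored="true" indexed="true" multiValued="true"/>
```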
20:47
nightowl313
=)
20:48
pdurbin
If you want to add it to the guides, go for it. :)
20:49
pdurbin
It's been a pretty lively Friday afternoon in here but I'm stepping away from the screen soon. I hope everyone has a lovely weekend.
20:53
pameyer
stepping away from screens is good :)
20:54
nightowl313
have a great weekend! thanks for the help!
20:57
pdurbin left #dataverse
21:14
nightowl313
oops have another question ... is it better to separate out normal file upload traffic from large file uploads to separate buckets/stores? I know the guide mentions the possibility but just wondering if there is a best practice or recommendation for that?
21:16
nightowl313
i suppose it might be difficult to anticipate which projects might have large files and enforce changing stores if file sizes are big vs small...
22:01
pameyer
that's a good question - I don't know enough s3 (or dataverse+s3) to have useful ideas about it though
22:09
nightowl313
yea, now I'm thinking that will just be a battle we will fight later! Thanks!