Time | Nick | Message
06:42
Virgile joined #dataverse
07:55
juancorr joined #dataverse
11:01
Virgile joined #dataverse
11:18
Virgile joined #dataverse
12:20
yoh joined #dataverse
12:26
donsizemore joined #dataverse
14:31
pdurbin joined #dataverse
14:50
donsizemore
@pdurbin good morning
14:52
donsizemore
@pdurbin you know how you don't like marathon commits?
14:53
pameyer joined #dataverse
14:59
donsizemore
O ye @pdurbin @pameyer hive mind. To export a dataset I'm calling /api/datasets/export?exporter=dataverse_json&persistentId=doi:10.5072/FK2/AT8YFM. Does this look right?
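For reference, a complete call along those lines might look like this (a sketch; $SERVER_URL is a placeholder for the installation's base URL, and published datasets typically export without a token):
# fetch the dataverse_json export and save it to a file
curl "$SERVER_URL/api/datasets/export?exporter=dataverse_json&persistentId=doi:10.5072/FK2/AT8YFM" > export.json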
15:00
donsizemore
then I call http://guides.dataverse.org/en/latest/api/native-api.html#import-a-dataset-into-a-dataverse (and I'm going to submit a PR to fix that example)
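Per that guide page, the import is a POST to the target collection; a hedged sketch ($API_TOKEN, $SERVER_URL, and the "root" collection alias are assumptions):
# import a dataset JSON file under an existing PID, without releasing it
curl -H "X-Dataverse-key: $API_TOKEN" -X POST "$SERVER_URL/api/dataverses/root/datasets/:import?pid=doi:10.5072/FK2/AT8YFM&release=no" --upload-file dataset.json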
15:00
donsizemore
which dies "Error parsing dataset json. Json:"
15:02
pameyer
my first thought is that it might be an upload-file vs. post-body thing - I get those mixed up pretty frequently
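The mix-up pameyer means, illustrated with curl (the endpoint and form-field name here are hypothetical placeholders):
# -F sends a multipart/form-data form field named "file"
curl -F "file=@dataset.json" "$SERVER_URL/api/some/endpoint"
# --upload-file (or -T) sends the file contents as the raw request body
curl --upload-file dataset.json "$SERVER_URL/api/some/endpoint"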
15:02
pameyer
might be a dumb question - why go through the exporter vs the native API GET?
15:02
donsizemore
that's the PR part. The example says upload-file, but -F data= seems to send it
15:03
donsizemore
to answer your question: because I couldn't find instructions on how to do this, so I'm stumbling through step by step
15:05
donsizemore
API GET produces similar-looking JSON at a glance and import throws the same error: Unexpected char 45 at (line no=1, column no=2, offset=1)
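For comparison, the native API GET in question would be roughly (placeholders as above):
# fetch the native JSON representation of the dataset
# add -H "X-Dataverse-key: $API_TOKEN" if the dataset is unpublished
curl "$SERVER_URL/api/datasets/:persistentId/?persistentId=doi:10.5072/FK2/AT8YFM" > native.json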
15:13
pdurbin
error parsing dataset... that's no good
15:14
donsizemore
the exported dataverse_json is totally different from the desired example import JSON in the guide
15:14
pdurbin
I think the import docs provide a sample JSON file.
15:14
pdurbin
Is it XML?
15:15
* pdurbin
ducks
15:15
pdurbin
They should at least resemble each other. Cousins if not siblings.
15:17
donsizemore
stripping up to the leading `{"datasetVersion` seemed like a good first step but import dies on line 1 col 2
15:17
pdurbin
Can you run the JSON through jq to make sure it's valid?
15:18
pdurbin
(or similar)
15:18
pdurbin
cat foo.json | jq .
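Since jq exits nonzero on a parse error, a quick pass/fail check could be:
# "jq empty" parses the file and prints nothing; the exit code signals validity
jq empty foo.json && echo "valid JSON" || echo "parse error"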
15:18
donsizemore
looks right to me
15:18
donsizemore
just the ordering of elements is totally different from what import wants
15:19
pdurbin
well, order doesn't matter in json
15:19
pameyer
I wouldn't have thought that the order of elements in a JSON dictionary would make a difference
15:19
pameyer
... messages crossing
15:20
pdurbin
I document the stripping out of datasetVersion in the docs on how to edit a dataset via the native API. It's a pain.
15:20
pdurbin
(Which is partially why we now support partial edits.)
15:21
pdurbin
donsizemore: for fun do you want to try importing the sample file from the docs? Maybe import is broken.
15:21
donsizemore
oh oh, if I cat the exported dataverse_json through jq, redirect that into a file, and call it with --upload-file, I get further (but now a ton of null pointers)
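In other words, something like this pipeline (a sketch of the step that got further; variables as before, $PID standing in for the persistent identifier):
# normalize the export with jq, then send it as the raw request body
curl "$SERVER_URL/api/datasets/export?exporter=dataverse_json&persistentId=$PID" | jq . > clean.json
curl -H "X-Dataverse-key: $API_TOKEN" -X POST "$SERVER_URL/api/dataverses/root/datasets/:import?pid=$PID&release=no" --upload-file clean.json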
15:22
* pameyer
looking for my import scripts
15:22
donsizemore
java.lang.IllegalStateException: During synchronization a new object was found through a relationship that was not marked cascade PERSIST: [Dataset id:261 ].
15:23
donsizemore
and max(id) is 260
15:23
donsizemore15 joined #dataverse
15:23
pdurbin
yikes
15:24
donsizemore15
i stole Danny's dataset from https://dataverse5.odum.unc.edu/dataset.xhtml?persistentId=doi:10.5072/FK2/AT8YFM
15:24
pdurbin
If you can get the API to emit a 500 error, that's a paddlin'. Please feel free to create an issue. And please upload a file that produces it.
15:25
donsizemore15
and am trying to import it into an EC2 instance I was using to test those dataverse-ansible merges, just so I can tell Thu-Mai and Jon whether we can give a group of folks their datasets once they stand up Dataverse #64(!)
15:26
pdurbin
Mmmm, Dataverse #64.
15:26
pdurbin
donsizemore15: wait, did you try the sample file from the docs?
15:28
donsizemore15
i will, lemme make a dummy storage identifier to match
15:29
pdurbin
k
15:29
pameyer
definitely was --upload-file
15:30
donsizemore15
that succeeds
15:31
donsizemore15
so, Leonid thinks I should export with export?exporter=dataverse_json (which was the first thing I did)
15:31
donsizemore15
if sample.json succeeds and dataverse_json doesn't... does that mean an issue?
15:32
pdurbin
Maybe the issue is about how to get the equivalent of sample.json?
15:32
donsizemore15
that was my question.
15:33
donsizemore15
I've tried the native API GET and export?exporter=dataverse_json. If someone could (at some point) steer me toward the proper workflow I'll be happy to write this up in the guides for future generations
15:33
donsizemore15
today I mostly want to know what to tell Jon and Thu-Mai
15:36
pdurbin
Well, there are two ways to export that JSON. The most straightforward way is to click on JSON in the UI under Metadata exports. The other way is documented in the API Guide somewhere. I'm not sure if they produce the exact same output or not.
15:37
donsizemore15
they're off by 9 characters.
15:40
pdurbin
:)
15:41
pdurbin
Off By 9 Characters is the name of my band!
15:42
pameyer
another dumb question - `curl $exporter_api | jq .datasetVersion > importer_input.json` doesn't do the trick?
15:43
pameyer
I'm not sure if the top-level "citation" key under datasetVersion would cause problems for the import api or not; but I'd guess it wouldn't be needed
15:43
donsizemore15
i tried stripping down to datasetVersion, will try again with your command after lunch (and thanks)
15:44
pameyer
good luck (and enjoy lunch)
15:45
pameyer
I found my importer stuff; but the input json was a custom endpoint rather than one of the dataverse ones - so may not be as helpful here as it could be
16:26
donsizemore15
it occurred to me over lunch I wasn't setting Content-Type: application/json
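Adding that header would look something like this (same assumed variables as above):
# declare the request body as JSON explicitly
curl -H "Content-Type: application/json" -H "X-Dataverse-key: $API_TOKEN" -X POST "$SERVER_URL/api/dataverses/root/datasets/:import?pid=$PID&release=no" --upload-file clean.json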
16:41
donsizemore15
@pameyer trying to import the json produced by 'jq .datasetVersion' throws a 500 error
16:51
donsizemore15
yah, `cat sample.json | jq .datasetVersion | jq -r 'keys[]' | wc -l` gives 7
16:51
donsizemore15
yee, `cat D8BEAL_full.json | jq .datasetVersion | jq -r 'keys[]' | wc -l` gives 16
17:02
donsizemore15
removed the citation contents and still got java.lang.IllegalStateException: During synchronization a new object was found through a relationship that was not marked cascade PERSIST: [Dataset id:272 ].
17:19
pdurbin
woof
17:19
pdurbin
So what are you gonna tell Jon and Thu-Mai?
17:22
pameyer
`cat 499-export_f.json | jq -r 'keys[]' | wc -l` -> 1
17:23
pameyer
so my guess is that my initial `jq .datasetVersion` bit was wrong
17:26
donsizemore15
your .datasetVersion prints the sub-keys. The example and my "full" dataverse_json export (which Leonid says is the best JSON Dataverse can export) were closer, key-wise, though wildly disparate
17:27
pameyer
yeah; it was bad - definitely needs the top-level datasetVersion key
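A corrected jq invocation that keeps that wrapper key might be (an inference from this exchange, not a verified recipe):
# keep datasetVersion as the top-level key instead of unwrapping it
curl "$exporter_api" | jq '{datasetVersion: .datasetVersion}' > importer_input.json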
17:30
pameyer
will take a bit for my dev environment to spin up to see if the old json still works - I'd guess it would, but don't know for sure
17:34
donsizemore15
you don't need to go to that trouble
17:35
pameyer
_shouldn't_ be too much trouble
17:37
pdurbin
famous last words
17:43
pameyer
yeah
17:47
donsizemore15
https://www.hashicorp.com/blog/announcing-waypoint
17:49
pdurbin
Developers *do* just want to deploy.
18:18
pameyer
once I cut the custom block, my old examples still work on 7112-payara5_aio-5a77f49f7
18:18
pameyer
so not the latest code, but not too far from develop, in between 5.0 and 5.1
18:19
pdurbin
that's good
18:38
nightowl313 joined #dataverse
18:39
nightowl313
hi all ... got a question for anyone who can answer ... we just made our dataverse live and we already have a request for adding a large amount of data (i.e., 10TB or more) ... is the best method of getting that data in fast to use direct upload?
18:41
pdurbin
10TB is a lot, but yes, direct upload to S3, I'd think.
18:42
donsizemore15
@nightowl313 have them upgrade to RoadRunner extra-zippy first!
18:42
nightowl313
eeks ... wasn't quite ready to just jump in with that much data but we do have the request ... now I just need to get direct upload working :-|
18:43
nightowl313
haha don will ask them to!
18:43
donsizemore15
if you're already on S3, you turn on direct upload as a JVM option, and you'll need to set CORS on the bucket
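Roughly, and hedged since the option name depends on the store's id (here "s3" is assumed) and the exact CORS rules are in the big-data-support guide:
# enable direct upload for the S3 store via a JVM option
./asadmin create-jvm-options "-Ddataverse.files.s3.upload-redirect=true"
# apply a CORS policy to the bucket (cors.json would allow GET/PUT with ETag exposed)
aws s3api put-bucket-cors --bucket my-dataverse-bucket --cors-configuration file://cors.json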
18:43
nightowl313
has any other org used it for this amount of data? probably won't be all that at first, but probably a lot
18:43
donsizemore15
http://guides.dataverse.org/en/latest/developers/big-data-support.html?highlight=upload%20direct#s3-direct-upload-and-download
18:44
nightowl313
not sure yet who is paying for it but we're working on that
18:44
donsizemore15
I understand Harvard has 43TB+
18:44
nightowl313
i've done all of this except for the cors ...
18:45
nightowl313
i think you were helping me with that before! I gave up just to get dataverse installed .. time to pick it back up very quickly!
18:45
donsizemore15
for some reason CORS keeps getting removed from my test bucket... haven't pinned down how that's happening, but I don't own the bucket, so...
18:48
pdurbin
nightowl313: especially since pameyer is here I'd feel remiss if I didn't point out that another option for big data is rsync. Not sure if you've seen the docs on that. They're in the dev guide.
18:49
nightowl313
oh ... of course ... I have used rclone ... but, if I transfer from the current source to the s3 bucket using rclone .. how does it then get associated with a dataset?
18:51
pameyer
pdurbin: I appreciate the hat tip for rsync, but for an s3 dataverse install and 10TB of data it might not be the way to go (would need 10TB temp space for the DCM, and _might_ need public-only - I don't remember if s3/DCM requires that or not)
18:52
pdurbin
I don't remember either. I don't think anyone is using that combo.
18:52
pameyer
and 10TB of client-side checksum calculations will take a while
18:53
pdurbin
yeah
18:53
pdurbin
nightowl313: maybe just stick with direct to s3 upload :)
18:54
pameyer
I'm out of the loop for direct s3 upload, but if that lets you use the aws cli and s3 sync, I'd go with that over a browser
18:55
pdurbin
It's supported by DVUploader, which is a command line app.
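A typical DVUploader invocation, sketched from memory of its README (the jar name/version and flags here are assumptions):
# upload everything under mydatafiles/ to the given dataset
java -jar DVUploader-v1.1.0.jar -server=$SERVER_URL -key=$API_TOKEN -did=doi:10.5072/FK2/AT8YFM mydatafiles/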
18:58
pameyer
I should, but don't, remember if that's got a sync primitive
19:02
pdurbin
Hmm, dunno. I'm sure it's calling S3 APIs under the hood. I could try to summon Jim if you want.
19:03
nightowl313
thinking since this is our first dataverse ever ... we will ask the research team to take it slow with uploading things ... i think they are willing to work with us to figure things out
19:03
nightowl313
but would love direction on the best course to try with it ... i'm such a newbie! this is all exciting and scary
19:04
pdurbin
Published research data! It's a thing! :)
19:06
nightowl313
=)
19:09
pameyer
pdurbin: thanks, but not important enough to try a Jim summoning - just curious
19:11
pdurbin
Ok. He's in Slack these days. Was letting him know earlier that donsizemore15 broke his pull request. :)
19:11
pameyer
nightowl313: having a testing/staging setup is very helpful for figuring out "am I going to break production with this" type things
19:13
pameyer
I'm not sure if it's something you have available or not (sometimes it's more trouble than it's worth); but it was an unexpected thing I learned a while back
19:13
nightowl313
so just reading through all of this ... I should configure S3 direct upload and then use DVUploader (which utilizes command-line utilities such as the AWS CLI and rclone) to work with the direct upload?
19:14
nightowl313
we do have a staging site... will def try there first!
19:14
nightowl313
I told the research team it will be a few days before we are ready to start uploading anything
19:14
pdurbin
With direct S3 upload you can use either DVUploader or the normal Dataverse web interface.
19:16
nightowl313
ah okay ... looking at how it works now .. thank you all! so thankful for this chat/community
19:17
pdurbin
Sure. And it's been nice to see you on community calls.
19:18
nightowl313
i hope i can help with answering some of them some day! for now I am just a taker! lol
19:20
pdurbin
nah, you're good
19:21
pdurbin
We'd be bored without the questions. :)
19:23
nightowl313
oh i'm sure I'll have a lot more! =)
19:26
pdurbin
You've made it so I don't have to answer my standard question... When do you want your installation on the map? ... Thanks for that.
19:26
* pdurbin
looks at pameyer
19:27
nightowl313
i think it's on there! yay! We added it last week! that was the most exciting part of this for me!
19:28
pdurbin
Please feel free to help us fill in any more details about your installation. There's a spreadsheet linked from https://github.com/IQSS/dataverse-installations
19:28
nightowl313
yes, there's one in AZ ... although if you click our link there is no data! (that's what we're working on now)
19:29
nightowl313
okay, will add more to that ...
19:29
pdurbin
here's a direct link to the spreadsheet: https://docs.google.com/spreadsheets/d/1bfsw7gnHlHerLXuk7YprUT68liHfcaMxs1rFciA-mEo/edit#gid=0
19:30
pdurbin
please feel free to request edit access
19:30
pdurbin
and make a pull request, but that comes later
19:32
nightowl313
oh i mean, we need to get some data into our dataverse! haha ... requesting edit access for the spreadsheet
19:34
pdurbin
Hmm, I don't see the request yet. Maybe it takes a while.
19:56
nightowl313
i did hit "send" a couple of times ... I think this happened before
19:58
pdurbin
🤷
20:01
nightowl313 left #dataverse
20:36
nightowl313 joined #dataverse
20:54
nightowl313 left #dataverse
21:02
pdurbin
I hate to end on a shrug so I'll say goodnight all. See ya tomorrow.
21:02
pdurbin left #dataverse