
IRC log for #dataverse, 2020-10-15

Connect via chat.dataverse.org to discuss Dataverse (dataverse.org, an open source web application for sharing, citing, analyzing, and preserving research data) with users and developers.


All times shown according to UTC.

Time S Nick Message
06:42 Virgile joined #dataverse
07:55 juancorr joined #dataverse
11:01 Virgile joined #dataverse
11:18 Virgile joined #dataverse
12:20 yoh joined #dataverse
12:26 donsizemore joined #dataverse
14:31 pdurbin joined #dataverse
14:50 donsizemore @pdurbin good morning
14:52 donsizemore @pdurbin you know how you don't like marathon commits?
14:53 pameyer joined #dataverse
14:59 donsizemore O ye @pdurbin @pameyer hive mind. To export a dataset I'm calling /api/datasets/export?exporter=dataverse_json&persistentId=doi:10.5072/FK2/AT8YFM. Does this look right?
15:00 donsizemore then I call http://guides.dataverse.org/en/latest/api/native-api.html#import-a-dataset-into-a-dataverse (and I'm going to submit a PR to fix that example)
15:00 donsizemore which dies with "Error parsing dataset json. Json:"
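For context, the two calls under discussion look roughly like this; $SERVER, $API_TOKEN, and the target collection alias "root" are stand-in values, not values from the log:

    # export the dataset metadata as dataverse_json
    curl "$SERVER/api/datasets/export?exporter=dataverse_json&persistentId=doi:10.5072/FK2/AT8YFM" > export.json

    # import it into a collection, per the guide linked above
    curl -H "X-Dataverse-key: $API_TOKEN" -X POST \
      "$SERVER/api/dataverses/root/datasets/:import?pid=doi:10.5072/FK2/AT8YFM&release=no" \
      --upload-file export.json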
15:02 pameyer my first thought is that it might be an upload-file vs post body thing - I get those mixed up pretty frequently
15:02 pameyer might be a dumb question - why go through the exporter vs the native API GET?
15:02 donsizemore that's the PR part. the example says --upload-file but -F data= seems to send it
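The distinction being batted around: -F builds a multipart/form-data request, while --upload-file sends the file as the raw request body. A minimal illustration, with hypothetical endpoint and file names:

    # multipart form field (what -F data= does)
    curl -F "data=@dataset.json" "$SERVER/some/endpoint"

    # raw request body (what --upload-file does)
    curl -X POST --upload-file dataset.json "$SERVER/some/endpoint"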
15:03 donsizemore to answer your question: because I can't find instructions on how to do this, so I'm stumbling step by step
15:05 donsizemore API GET produces similar-looking JSON at a glance and import throws the same error: Unexpected char 45 at (line no=1, column no=2, offset=1)
15:13 pdurbin error parsing dataset... that's no good
15:14 donsizemore the exported dataverse_json is totally different from the desired example import JSON in the guide
15:14 pdurbin I think the import docs provide a sample JSON file.
15:14 pdurbin Is it XML?
15:15 * pdurbin ducks
15:15 pdurbin They should at least resemble each other. Cousins if not siblings.
15:17 donsizemore stripping everything up to the leading {"datasetVersion seemed like a good first step, but import dies on line 1 col 2
15:17 pdurbin Can you run the JSON through jq to make sure it's valid?
15:18 pdurbin (or similar)
15:18 pdurbin cat foo.json | jq .
15:18 donsizemore looks right to me
15:18 donsizemore just the ordering of elements is totally different from what import wants
15:19 pdurbin well, order doesn't matter in json
15:19 pameyer I wouldn't have thought that order of elements in a json dictionary would make a difference
15:19 pameyer ... messages crossing
15:20 pdurbin I document the stripping out of datasetVersion in the docs on how to edit a dataset via the native API. It's a pain.
15:20 pdurbin (Which is partially why we now support partial edits.)
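The edit workflow pdurbin is referring to looks roughly like this (endpoint as documented in the native API guide; $SERVER, $API_TOKEN, and $PID are stand-ins):

    # grab just the datasetVersion object from the export
    curl "$SERVER/api/datasets/export?exporter=dataverse_json&persistentId=$PID" \
      | jq .datasetVersion > version.json

    # edit version.json, then PUT it back as the draft version
    curl -H "X-Dataverse-key: $API_TOKEN" -X PUT \
      "$SERVER/api/datasets/:persistentId/versions/:draft?persistentId=$PID" \
      --upload-file version.json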
15:21 pdurbin donsizemore: for fun do you want to try importing the sample file from the docs? Maybe import is broken.
15:21 donsizemore oh oh if i cat the export-dataverse_json through jq, redirect that into a file and call it with --upload-file i get further (but now a ton of null-pointers)
15:22 * pameyer looking for my import scripts
15:22 donsizemore java.lang.IllegalStateException: During synchronization a new object was found through a relationship that was not marked cascade PERSIST: [Dataset id:261 ].
15:23 donsizemore and max(id) is 260
15:23 donsizemore15 joined #dataverse
15:23 pdurbin yikes
15:24 donsizemore15 i stole Danny's dataset from https://dataverse5.odum.unc.edu/dataset.xhtml?persistentId=doi:10.5072/FK2/AT8YFM
15:24 pdurbin If you can get the API to emit a 500 error, that's a paddlin'. Please feel free to create an issue. And please upload a file that produces it.
15:25 donsizemore15 and am trying to import it into an EC2 instance I was using to test those dataverse-ansible merges, just so I can tell Thu-Mai and Jon whether we can give a group of folks their datasets once they stand up Dataverse #64(!)
15:26 pdurbin Mmmm, Dataverse #64.
15:26 pdurbin donsizemore15: wait, did you try the sample file from the docs?
15:28 donsizemore15 i will, lemme make a dummy storage identifier to match
15:29 pdurbin k
15:29 pameyer definitely was --upload-file
15:30 donsizemore15 that succeeds
15:31 donsizemore15 so, leonid thinks I should export with export?exporter=dataverse_json (which was the first thing I did)
15:31 donsizemore15 if sample.json succeeds and dataverse_json doesn't... does that mean an issue?
15:32 pdurbin Maybe the issue is about how to get the equivalent of sample.json?
15:32 donsizemore15 that was my question.
15:33 donsizemore15 I've tried Native API GET and export?exporter=dataverse_json. If someone could (at some point) steer me toward the proper workflow I'll be happy to write this up in the guides for future generations
15:33 donsizemore15 today I mostly want to know what to tell Jon and Thu-Mai
15:36 pdurbin Well, there are two ways to export that JSON. The most straightforward way is to click on JSON in the UI under Metadata exports. The other way is documented in the API Guide somewhere. I'm not sure if they produce the same exact output or not.
15:37 donsizemore15 they're off by 9 characters.
15:40 pdurbin :)
15:41 pdurbin Off By 9 Characters is the name of my band!
15:42 pameyer another dumb question - `curl $exporter_api | jq .datasetVersion > importer_input.json` doesn't do the trick?
15:43 pameyer I'm not sure if the top-level "citation" key under datasetVersion would cause problems for the import api or not; but I'd guess it wouldn't be needed
15:43 donsizemore15 i tried stripping down to datasetVersion, will try again with your command after lunch (and thanks)
15:44 pameyer good luck (and enjoy lunch)
15:45 pameyer I found my importer stuff; but the input json was a custom endpoint rather than one of the dataverse ones - so may not be as helpful here as it could be
16:26 donsizemore15 it occurred to me over lunch I wasn't setting Content-Type: application/json
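If that turns out to matter, adding the header would look like this (same stand-in values as in the sketches above):

    curl -H "X-Dataverse-key: $API_TOKEN" -H "Content-Type: application/json" -X POST \
      "$SERVER/api/dataverses/root/datasets/:import?pid=$PID&release=no" \
      --upload-file dataset.json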
16:41 donsizemore15 @pameyer trying to import the json produced by 'jq .datasetVersion' throws a 500 error
16:51 donsizemore15 yah: cat sample.json | jq .datasetVersion | jq -r 'keys[]' | wc -l gives 7
16:51 donsizemore15 yee: cat D8BEAL_full.json | jq .datasetVersion | jq -r 'keys[]' | wc -l gives 16
17:02 donsizemore15 removed the citation contents and still got java.lang.IllegalStateException: During synchronization a new object was found through a relationship that was not marked cascade PERSIST: [Dataset id:272 ].
17:19 pdurbin woof
17:19 pdurbin So what are you gonna tell Jon and Thu-Mai?
17:22 pameyer cat 499-export_f.json | jq -r 'keys[]' | wc -l -> 1
17:23 pameyer so my guess is that my initial `jq .datasetVersion` bit was wrong
17:26 donsizemore15 your .datasetVersion prints the sub-keys. the example and my "full" dataverse_json export (which Leonid says is the best JSON Dataverse can export) were closer, key-wise, though wildly disparate
17:27 pameyer yeah; it was bad - definitely needs the top-level datasetVersion key
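One way to keep that top-level wrapper while dropping everything else, sketched with jq (file names hypothetical):

    # keep the top-level key instead of extracting its contents
    jq '{datasetVersion: .datasetVersion}' D8BEAL_full.json > importer_input.json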
17:30 pameyer will take a bit for my dev environment to spin up to see if the old json still works - I'd guess it would, but don't know for sure
17:34 donsizemore15 you don't need to go to that trouble
17:35 pameyer _shouldn't_ be too much trouble
17:37 pdurbin famous last words
17:43 pameyer yeah
17:47 donsizemore15 https://www.hashicorp.com/blog/announcing-waypoint
17:49 pdurbin Developers *do* just want to deploy.
18:18 pameyer once I cut the custom block, my old examples still work on 7112-payara5_aio-5a77f49f7
18:18 pameyer so not the latest code, but not too far from develop in between 5.0 and 5.1
18:19 pdurbin that's good
18:38 nightowl313 joined #dataverse
18:39 nightowl313 hi all ... got a question for anyone who can answer ... we just made our dataverse live and we already have a request for adding a large amount of data (i.e. 10TB or more) ... is direct upload the best method of getting that data in fast?
18:41 pdurbin 10T is a lot but yes, direct upload to S3, I'd think.
18:42 donsizemore15 @nightowl313 have them upgrade to RoadRunner extra-zippy first!
18:42 nightowl313 eeks .. wasn't quite ready to just jump in with that much data but we do have the request ... now I just need to get direct upload working :-|
18:43 nightowl313 haha don will ask them to!
18:43 donsizemore15 if you're already in S3, you turn on upload-direct as a jvm-option and you'll need to set CORS on the bucket
18:43 nightowl313 has any other org used it for this amount of data? probably won't be all that at first, but probably a lot
18:43 donsizemore15 http://guides.dataverse.org/en/latest/developers/big-data-support.html?highlight=upload%20direct#s3-direct-upload-and-download
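Per that page, the setup is roughly one JVM option per S3 store plus a CORS rule on the bucket; the store id "s3" and the bucket name below are assumptions:

    # enable direct upload on an existing S3 store
    ./asadmin create-jvm-options "-Ddataverse.files.s3.upload-redirect=true"

    # apply a CORS policy to the bucket (cors.json per the guide)
    aws s3api put-bucket-cors --bucket my-dataverse-bucket --cors-configuration file://cors.json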
18:44 nightowl313 not sure yet who is paying for it but we're working on that
18:44 donsizemore15 I understand Harvard has 43TB+
18:44 nightowl313 i've done all of this except for the cors ...
18:45 nightowl313 i think you were helping me with that before! I gave up just to get dataverse installed .. time to pick it back up very quickly!
18:45 donsizemore15 for some reason cors keeps getting removed from my test bucket... haven't pinned down how that's happening but i don't own the bucket so
18:48 pdurbin nightowl313: especially since pameyer is here I'd feel remiss if I didn't point out that another option for big data is rsync. Not sure if you've seen the docs on that. They're in the dev guide.
18:49 nightowl313 oh ... of course ... I have used rclone ... but, if I transfer from the current source to the s3 bucket using rclone .. how does it then get associated with a dataset?
18:51 pameyer pdurbin: I appreciate the hat tip for rsync, but for an s3 dataverse install and 10TB of data it might not be the way to go (would need 10TB temp space for the DCM, and _might_ need public-only - I don't remember if s3/DCM requires that or not)
18:52 pdurbin I don't remember either. I don't think anyone is using that combo.
18:52 pameyer and 10TB of client-side checksum calculations will take a while
18:53 pdurbin yeah
18:53 pdurbin nightowl313: maybe just stick with direct to s3 upload :)
18:54 pameyer I'm out of the loop for direct s3 upload, but if that lets you use the aws cli and s3 sync, I'd go with that over a browser
18:55 pdurbin It's supported by DVUploader, which is a command line app.
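A rough DVUploader invocation; the jar version, server URL, DOI, and the -directupload flag are from memory of the DVUploader README, so treat them as assumptions:

    java -jar DVUploader-v1.0.8.jar -directupload \
      -server=https://dataverse.example.edu -key=$API_TOKEN \
      -did=doi:10.5072/FK2/EXAMPLE bigfiles/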
18:58 pameyer I should, but don't, remember if that's got a sync primitive
19:02 pdurbin Hmm, dunno. I'm sure it's calling S3 APIs under the hood. I could try to summon Jim if you want.
19:03 nightowl313 thinking since this is our first dataverse of all ... we will ask the research team to take it slow with uploading things ... i think they are willing to work with us to figure things out
19:03 nightowl313 but would love direction on the best course to try with it ... i'm such a newbie! this is all exciting and scary
19:04 pdurbin Published research data! It's a thing! :)
19:06 nightowl313 =)
19:09 pameyer pdurbin: thanks, but not important enough to try a Jim summoning - just curious
19:11 pdurbin Ok. He's in Slack these days. Was letting him know earlier that donsizemore15 broke his pull request. :)
19:11 pameyer nightowl313: having a testing/staging setup is very helpful for figuring out "am I going to break production with this" type things
19:13 pameyer I'm not sure if it's something you have available or not (sometimes it's more trouble than it's worth); but it was an unexpected thing I learned a while back
19:13 nightowl313 so just reading through all of this ... I should configure S3 direct upload and then use DVUploader (which utilizes command-line utilities such as the aws cli and rclone) to work with the direct upload?
19:14 nightowl313 we do have a staging site... will def try there first!
19:14 nightowl313 I told the research team it will be a few days before we are ready to start uploading anything
19:14 pdurbin With direct S3 upload you can use either DVUploader or the normal Dataverse web interface.
19:16 nightowl313 ah okay ... looking at how it works now .. thank you all! so thankful for this chat/community
19:17 pdurbin Sure. And it's been nice to see you on community calls.
19:18 nightowl313 i hope i can help with answering some of them some day! for now I am just a taker! lol
19:20 pdurbin nah, you're good
19:21 pdurbin We'd be bored without the questions. :)
19:23 nightowl313 oh i'm sure I'll have a lot more! =)
19:26 pdurbin You've made it so I don't have to answer my standard question... When do you want your installation on the map? ... Thanks for that.
19:26 * pdurbin looks at pameyer
19:27 nightowl313 i think it's on there! yay! We added it last week! that was the most exciting part of this for me!
19:28 pdurbin Please feel free to help us fill in any more details about your installation. There's a spreadsheet linked from https://github.com/IQSS/dataverse-installations
19:28 nightowl313 yes, there's one in AZ ... although if you click our link there is no data! (that's what we're working on now)
19:29 nightowl313 okay, will add more to that ...
19:29 pdurbin here's a direct link to the spreadsheet: https://docs.google.com/spreadsheets/d/1bfsw7gnHlHerLXuk7YprUT68liHfcaMxs1rFciA-mEo/edit#gid=0
19:30 pdurbin please feel free to request edit access
19:30 pdurbin and make a pull request, but that comes later
19:32 nightowl313 oh i mean, we need to get some data into our dataverse! haha ... requesting edit access for the spreadsheet
19:34 pdurbin Hmm, I don't see the request yet. Maybe it takes a while.
19:56 nightowl313 i did hit "send" a couple of times ... I think this happened before
19:58 pdurbin 🤷
20:01 nightowl313 left #dataverse
20:36 nightowl313 joined #dataverse
20:54 nightowl313 left #dataverse
21:02 pdurbin I hate to end on a shrug so I'll say goodnight all. See ya tomorrow.
21:02 pdurbin left #dataverse
