Time | Nick | Message
14:44 |  | iqlogbot joined #dvn
14:44 |  | Topic for #dvn is now http://thedata.org - The Dataverse Network Project | logs at http://irclog.iq.harvard.edu/dvn/today
14:45 | jwhitney | pdurbin: yes, although we do want to allow multiple files: https://docs.google.com/file/d/0B8Zfl4GMgyejMlhFOUU5M0p4c3M/edit
14:51 | jwhitney | pdurbin: (these are just mockups: the file description form has some fields that should describe the study, instead)
14:51 | jwhitney | pdurbin: (they're a bit out of date)
14:58 | pdurbin | jwhitney: hmm, ok
14:58 | pdurbin | you're back!
14:59 | pdurbin | jwhitney: sorry, that was for iqlogbot :) ... logging is back http://irclog.iq.harvard.edu/dvn/2013-07-29
14:59 | jwhitney | pdurbin: :)
15:00 | pdurbin | jwhitney: I think it would be great if you played with the SWORD API as it stands right now. I can point you to the curl commands. It's still very rough but it'll give you an idea of its current state
15:01 | pdurbin | https://github.com/IQSS/dvn/tree/develop/tools/scripts/data-deposit-api contains all the scripts and I'm happy to walk you through them
15:03 | jwhitney | pdurbin: yep, ok.
15:03 | pdurbin | jwhitney: the biggest thing that's on my mind is... what will the binary file you send look like? You mentioned simple zip... After I receive the zip, I should unzip it and look for files inside and ingest them one by one? And also look for a metadata file in there?
15:04 | jwhitney | pdurbin: that's one approach: study metadata in the atom entry, then include file-level metadata in the zip
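
To make the packaging idea concrete, here is a minimal sketch of what such a deposit could look like, with study-level metadata in the Atom entry and per-file metadata riding inside the zip. All names below (the zip contents, the files.xml idea) are illustrative assumptions, not an agreed format:

    # Sent first: an Atom entry carrying the study-level metadata (title, authors, ...)
    atom-entry.xml
    # Sent second, as the media resource: a zip of the actual content
    example.zip
    ├── data.dta        # the dataset itself (Stata, in this example)
    ├── readme.txt      # documentation explaining the dataset
    └── files.xml       # hypothetical per-file metadata (description, data type, ...)
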
15:05 | pdurbin | right now my implementation takes whatever binary file is sent and attempts to ingest it. So if you send an Rdata file it will be ingested as Rdata. Same for a Stata file, I assume, but I haven't tried this yet.
15:05 | pdurbin | but it sounds like I should always expect a zip instead?
15:05 | jwhitney | pdurbin: I think so, yes: even if there is only one file, there may be associated metadata
15:06 | jwhitney | pdurbin: something like DSpace's simple archive format, maybe https://wiki.duraspace.org/display/DSDOC3x/Importing+and+Exporting+Items+via+Simple+Archive+Format#ImportingandExportingItemsviaSimpleArchiveFormat-ItemImporterandExporter
15:08 | pdurbin | jwhitney: ok. Alex also seemed interested in BagIt: http://en.wikipedia.org/wiki/BagIt
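
For readers who have not met either format: a BagIt "bag" is just a payload directory plus a few tag files. A minimal sketch (the payload file names are illustrative):

    example-bag/
    ├── bagit.txt            # declares the BagIt version and tag-file encoding
    ├── bag-info.txt         # optional free-form metadata about the bag
    ├── manifest-md5.txt     # one "<md5 checksum>  <path>" line per payload file
    └── data/                # the payload itself
        ├── data.dta
        └── readme.txt
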
15:09 | pdurbin | (I've only barely heard of both of these formats.)
15:09 | jwhitney | pdurbin: bagit seems more straightforward
15:09 | pdurbin | straightforward is good :)
15:10 | pdurbin | jwhitney: do you think we should formally support BagIt? or just use it as a model for now?
15:10 | jwhitney | pdurbin: I've worked with the DSpace format, have only read about bagit.
15:13 | pdurbin | jwhitney: the way upload works in DVN now is that you can upload a single file and specify the format (Rdata vs. Stata vs. etc.)
15:13 | pdurbin | or you can upload a zip file that has a bunch of files in it
15:14 | pdurbin | which gets unzipped... and all the files get ingested
15:14 | pdurbin | so it's very simple
15:14 | jwhitney | pdurbin: Ok.
15:15 | pdurbin | that would probably be the easiest thing for me to support out of the gate
15:15 |  | posixeleni joined #dvn
15:15 | pdurbin | which is called "simple zip" in the SWORD spec
15:15 | pdurbin | posixeleni: hi!
15:16 | posixeleni | hi folks! just wanted clarification on how OJS would handle the supplementary files that you send over to DVN
15:16 | pdurbin | jwhitney: are you saying you're familiar with METSDSpaceSIP? That's also in the SWORD spec as an example
15:16 | posixeleni | so if I understand it correctly: OJS will allow authors to deposit multiple files
15:17 | posixeleni | Then when it comes time to send it to DVN it is packaged into a simple zip file and sent via API?
15:17 | pdurbin | posixeleni: right on both counts
15:17 | posixeleni | cool sorry to interrupt!
15:17 | jwhitney | posixeleni: not at all!
15:18 | pdurbin | posixeleni: sorry, iqlogbot was broken but jwhitney or I will paste the whole chat to a Google Doc when we're done
15:18 |  | posixeleni joined #dvn
15:18 | posixeleni | thanks so much!
15:19 | pdurbin | I was saying that right now I'm just ingesting whatever binary file is sent ... but it sounds like I need to switch to expecting a zip file, which I will unzip ... and then ingest the files one by one
15:19 | pdurbin | jwhitney: right?
15:20 | jwhitney | pdurbin: yes, if OJS needs to send along file-level metadata, which it seems it does
15:20 | jwhitney | pdurbin: data type, at the very least.
15:21 | pdurbin | jwhitney: well, even if metadata is not necessary... right now you would have to send files one by one
15:21 | pdurbin | which we probably don't want
15:24 | jwhitney | what's typical? I think OJS has to allow the possibility of multiple files, but if most articles will only have a single file...
15:25 | pdurbin | posixeleni: it's quite common for studies to have multiple files, right?
15:25 | posixeleni | more common than not, since they will have the dataset and then a different file for documentation explaining the dataset (readme)
15:25 | jwhitney | right, ok
15:26 | pdurbin | I feel like everything I've read and watched so far suggests that a zip gets sent across during a binary deposit in the SWORD protocol.
15:28 | pdurbin | It was easier for me to simply accept any file as-is (not zipped) and attach it to a study, but again, I think I should change this... I should advertise via SWORD that I accept "simple zip" and then accept a zip file and unzip it
15:29 | pdurbin | jwhitney: ready for a quick walk-through of the curl commands?
15:29 | jwhitney | pdurbin: ok, sure
15:29 | pdurbin | great. the starting point is https://github.com/IQSS/dvn/tree/develop/tools/scripts/data-deposit-api
15:30 | jwhitney | pdurbin: yep, I have been walking through your scripts
15:30 | pdurbin | the create-study-deposit-data script is a wrapper around a bunch of shell scripts that call curl: https://github.com/IQSS/dvn/blob/develop/tools/scripts/data-deposit-api/create-study-deposit-data
15:31 | pdurbin | to explain each curl command, the wrapper script does the following:
15:31 | pdurbin | 1. retrieve the service document using credentials for the journal dataverse in question
15:32 | pdurbin | 2. create a study based on an "atom entry" XML file
15:32 | pdurbin | 3. list studies (should be incremented by one each time)
15:32 | pdurbin | 4. add a file to the study that was just created
15:33 | pdurbin | 5. make sure error handling is working (sorry, I threw this in just for myself)
15:33 | jwhitney | pdurbin: ok, great.
15:33 | pdurbin | 6. retrieve the SWORD statement for the study
15:34 | pdurbin | I'm definitely not sure I'm implementing all this correctly, but it's a start :)
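
As a rough map of those steps, here is what the sequence might look like as bare curl calls. Every URL is a placeholder (the real IRIs come from the service document and the deposit receipts), and step 5, the error-handling check, is omitted:

    BASE="https://dvn.example.edu/api/data-deposit"   # placeholder base URL
    AUTH="$USERNAME:$PASSWORD"                        # journal dataverse credentials

    # 1. retrieve the service document (lists the collections you may deposit into)
    curl -u "$AUTH" "$BASE/service-document"

    # 2. create a study by POSTing an Atom entry to the collection IRI
    curl -u "$AUTH" --data-binary @atom-entry.xml \
         -H "Content-Type: application/atom+xml;type=entry" \
         "$BASE/collection/dataverse/JOURNAL_ALIAS"

    # 3. list the studies in the collection (an Atom feed; the count should grow by one)
    curl -u "$AUTH" "$BASE/collection/dataverse/JOURNAL_ALIAS"

    # 4. add a file (or zip) to the new study -- see the SimpleZip example earlier

    # 6. retrieve the SWORD statement for the study
    curl -u "$AUTH" "$BASE/statement/study/hdl:1902.1/12345"
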
15:34 | jwhitney | :)
15:34 | pdurbin | you'll see "fakeIRI" and such in some places
15:34 | pdurbin | so it's a bit of a moving target
15:35 | jwhitney | pdurbin: just to make sure we're on the same page: you said, "... everything I've read and watched so far suggests that a zip gets sent across during a binary deposit in the SWORD protocol"
15:35 | pdurbin | and if you think I'm doing anything wrong spec-wise, please let me know! I want to make sure I'm implementing SWORD correctly
15:35 | jwhitney | pdurbin: do you feel that adding content in a zip to an existing resource is not quite in line w/ the spec?
15:35 | jwhitney | & same here!
15:36 | jwhitney | errr. want to make sure I'm sending content in a way that makes sense...
15:36 | pdurbin | jwhitney: it does feel strange... right now, adding the file to the study is a "replace" from a SWORD perspective
15:37 | jwhitney | pdurbin: that's true.
15:37 | pdurbin | because a PUT is a replace
15:38 | pdurbin | I tried to get clarification on this from the SWORD mailing list: [sword-app-tech] POST atom entry, then PUT media resource - http://www.mail-archive.com/sword-app-tech@lists.sourceforge.net/msg00331.html
15:38 | pdurbin | but so far it's just me writing to myself :(
15:38 | pdurbin | jwhitney: something for you to chew on is... how do we replace a file that has been added to a study?
15:39 | pdurbin | jwhitney: or... how do we replace 2 out of 5 files that have been added to a study?
15:44 | pdurbin | jwhitney: do you send all 5 files over again via PUT and I do a "replace" on the DVN side?
15:44 | pdurbin | it gets interesting :)
15:44 | jwhitney | pdurbin: that's what I've been thinking through... if I do that, can I provide information for you to know what's changed?
15:45 | pdurbin | jwhitney: you mean so you can send only the 2 changed files instead of all 5?
15:47 | jwhitney | pdurbin: no, I meant I'd send all content, but with enough metadata for you to know what's changed...
15:48 | jwhitney | pdurbin: but that's awkwad
15:48 | jwhitney | 'awkward'
15:48 | pdurbin | ah. sure. well... an md5sum for each of the 5 files would help in this case, right?
15:48 | jwhitney | yes
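
For example, with standard tools OJS could ship a BagIt-style checksum manifest alongside the content, and the receiving side could diff it against the previously stored one to see which files actually changed (the file and manifest names are illustrative):

    # sending side, before zipping: one "<md5sum>  <path>" line per payload file
    md5sum data/* > manifest-md5.txt

    # receiving side: lines that differ from the stored manifest mark the changed files
    diff previous/manifest-md5.txt manifest-md5.txt
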
15:48 | pdurbin | jwhitney: in Boston we say "awkwad" ... and "hazadous" ;)
15:48 | jwhitney | jwhitney: literal lol.
15:49 | jwhitney | pdurbin: oops, self-referential comment.
15:50 | pdurbin | jwhitney: one thing I'm wondering is whether you plan to persist in your database a unique identifier for each study and file that corresponds to the dataverse for a journal
15:51 | jwhitney | pdurbin: yes, we have to know if a study's been created for an article
15:52 | pdurbin | right. and studies have persistent, unique identifiers such as hdl:1902.1/12345
15:52 | pdurbin | jwhitney: but what about files? ... the best I could do is expose the database id of each file... and then we have to think about how studies can have multiple versions...
15:54 | pdurbin | jwhitney: anyway, for now we can focus on creating new stuff... but obviously I have a lot of questions about how stuff gets updated :)
15:55 | posixeleni | jwhitney: as do I!
15:56 | pdurbin | posixeleni: :) ... well I think this is on your agenda for the meeting in only a few hours :)
15:56 | jwhitney | sorry, minor interruption...
15:57 | jwhitney | & I'm concerned about storing metadata OJS-side that is updated DV-side.
15:58 | jwhitney | so after OJS has created studies and files in DV, I think it makes sense to store IDs and refresh as requested OJS-side.
15:59 | pdurbin | jwhitney: so... with the example of updating 2 of 5 files, what would happen?
16:01 | pdurbin | jwhitney: we can talk about it during the meeting if it's too much to type :)
16:03 | jwhitney | pdurbin: a fast think-through: author requests a view of files. OJS fetches study / file metadata from DV. Author edits study / file metadata. OJS PUTs new study metadata to the Edit-IRI. OJS PUTs a new package of content to the EM-IRI.
16:04 | jwhitney | pdurbin: glaring holes?
16:04 | jwhitney | pdurbin: assuming the package includes enough information for the server to identify changed files.
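
A hedged sketch of that round trip in curl, mapped onto SWORD v2's update operations (a PUT to the Edit-IRI replaces the study metadata; a PUT to the EM-IRI replaces the media resource, i.e. the whole fileset travels again), reusing the placeholder $BASE and $AUTH from the earlier sketch:

    # replace the study metadata: PUT a new Atom entry to the Edit-IRI
    curl -u "$AUTH" -X PUT --data-binary @atom-entry.xml \
         -H "Content-Type: application/atom+xml;type=entry" \
         "$BASE/edit/study/hdl:1902.1/12345"

    # replace the fileset: PUT a new package to the EM-IRI
    curl -u "$AUTH" -X PUT --data-binary @example.zip \
         -H "Content-Type: application/zip" \
         -H "Content-Disposition: filename=example.zip" \
         -H "Packaging: http://purl.org/net/sword/package/SimpleZip" \
         "$BASE/edit-media/study/hdl:1902.1/12345"
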
16:08 | pdurbin | jwhitney: so would I give you a view of the files via a SWORD statement? ... I need to look at the spec some more
16:08 | jwhitney | pdurbin: like here? http://swordapp.github.io/SWORDv2-Profile/SWORDProfile.html#protocoloperations_retrievingcontent_feed
16:09 | pdurbin | jwhitney: ah. so not a statement. yes, that looks right
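
The linked section describes retrieving the media resource as an Atom feed with one entry per file, which would serve as the "view of files" here. A hedged sketch, with the same placeholder URL and variables as before and the content negotiation the profile describes:

    # ask for the media resource as an Atom feed describing its parts (one entry per file)
    curl -u "$AUTH" -H "Accept: application/atom+xml;type=feed" \
         "$BASE/edit-media/study/hdl:1902.1/12345"
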
16:12 | pdurbin | jwhitney: thanks :)
16:18 | jwhitney | pdurbin: you're welcome, although I meant the '?' -- a double-check on my understanding of the spec.
16:20 | pdurbin | jwhitney: sure. but I think you're right
17:39 | pdurbin | here's the entire conversation (including the earlier part not logged by iqlogbot) in a Google Doc: https://docs.google.com/document/d/1XbaVsDTML0RohCtY5WNJMpFcC6xYhCjyFKyKA-cQBVI/edit?usp=sharing
19:21 |  | jwhitney joined #dvn
19:30 | pdurbin | jwhitney: good meeting. :) ... not sure if you saw but I did create that Google Doc with the rest of the chat from this morning
19:30 | jwhitney | pdurbin: no, but thanks