[12:09] Nemo_bis: im working on the uploader
[12:09] http://archive.org/details/wiki-androidjimsh
[12:11] check the fields
[12:11] what to do with those wikis with no license metadata in api query?
[12:13] also, appending date to item name is ok, right?
[12:13] the curl command you sent me doesnt include it, example http://archive.org/details/wiki-androidjimsh
[12:15] emijrp: looking
[12:16] emijrp: no, item identifier shouldn't contain date
[12:16] but that may produce collisions
[12:16] distinct users uploading dumps for a wiki on different dates
[12:17] emijrp: how so? I think it's better if we put different dates in the same item
[12:17] you only need error handling, but I think IA doesn't let you disrupt anything
[12:17] such items will need to be uploaded by a wikiteam collection admin (like you, me and underscor IIRC)
[12:18] SketchCow: opinion?
[12:19] Nemo_bis: also, description is almost empty
[12:19] it's early morning there
[12:19] emijrp: where are you fetching the description from
[12:19] emijrp: the item id pattern is what Alexis told us to use
[12:19] no date in it
[12:20] emijrp: as for the license info, what API info are you using? I think there are (or have been) multiple places across the MW releases
[12:22] http://android.jim.sh/api.php?action=query&meta=siteinfo&siprop=general|rightsinfo&format=xml
[12:22] emijrp: I think that when you can't find a proper license URL you should just take whatever you find (non-URLs, names in the rightsinfo fields, even main page or Project:About or Project:Copyright) and dump it in the "rights" field, and in any case tag the item so that we can check it later
[12:23] emijrp: according to docs that didn't exist before MW 1.15+
[12:25] or just a link to the mainpage
[12:25] emijrp: also note https://bugzilla.wikimedia.org/show_bug.cgi?id=29918#c1
[12:25] http://wikipapers.referata.com/wiki/Main_Page#footer
[12:26] link is bad, might disappear any time
[12:27] hmm is wikistats.wmflabs.org down?
[12:27] you said link Project:About
[12:28] no, I'd just include its content
[12:28] whatever it is, can't harm
[12:28] (if there's no license URL)
[12:30] btw I'm still without a proper PC right now but I have dumps on an external HDD and connection now works (actually it's freer than usual), so I can upload lots of stuff
[12:46] if you upload a file with the same name to an item, are they overwritten
[12:46] ?
[12:48] emijrp: this I don't know
[12:49] the description is not
[12:49] emijrp: description and other metadata is never overwritten unless you add ignore bucket something
[12:50] emijrp: https://www.mediawiki.org/w/index.php?title=Special:Contributions/Nemo_bis&offset=&limit=5&target=Nemo+bis
[12:51] So, I think licenseurl should contain whatever "url" is, additionally if it's not a link to creativecommons.org or fsf.org you should fetch it and add it to the "rights" field
[12:51] additionally, it would probably be best to add to "rights" whatever MediaWiki:Copyright contains even if the wiki uses a license, to get cases like the referata wiki footer you linked
[12:52] some people add the copyright info to main page, project:about, project:copyright, etc, it is a mess
[12:55] you can run a 100-wiki batch, and check the log, how many wikis haven't got copyright metadata
[12:57] yes I can do such a batch when the script is ready
[12:57] is it?
[13:00] emijrp: which of those pages are you fetching right now?
[13:01] none, just api metadata
[13:02] emijrp: can you add at least mediawiki:copyright?
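
[Editor's note: a minimal sketch of the siteinfo/rightsinfo lookup and the MediaWiki:Copyright fetch being discussed, assuming Python 3 and the standard library only; the example wiki URL is the one from the log, the function names are hypothetical, and this is an illustration rather than the code actually added to uploader.py.]

    # Sketch: read license metadata from a MediaWiki API, as discussed above.
    # rightsinfo only exists in MediaWiki 1.15+, hence the later fallbacks.
    import json
    import urllib.error
    import urllib.parse
    import urllib.request

    def get_rights_info(api_url):
        """Return (license_url, license_text) from meta=siteinfo, or (None, None)."""
        params = urllib.parse.urlencode({
            'action': 'query',
            'meta': 'siteinfo',
            'siprop': 'general|rightsinfo',
            'format': 'json',
        })
        with urllib.request.urlopen(api_url + '?' + params, timeout=30) as r:
            data = json.loads(r.read().decode('utf-8'))
        rights = data.get('query', {}).get('rightsinfo', {})
        return rights.get('url') or None, rights.get('text') or None

    def get_copyright_message(index_url):
        """Fetch the raw wikitext of MediaWiki:Copyright via ?action=raw, if customised."""
        url = index_url + '?' + urllib.parse.urlencode(
            {'title': 'MediaWiki:Copyright', 'action': 'raw'})
        try:
            with urllib.request.urlopen(url, timeout=30) as r:
                return r.read().decode('utf-8', errors='replace').strip() or None
        except urllib.error.HTTPError:
            return None  # 404 when the message has never been customised

    # e.g. get_rights_info('http://android.jim.sh/api.php')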
[13:02] shouldn't be hard with API or ?action=raw
[13:02] no, im going to fetch from the mainpage, i think it's better
[13:03] emijrp: what's that?
[13:04] http://amsterdam.campuswiki.nl/nl/Hoofdpagina look at the html code
[13:04] at the bottom
[13:04] ah the footer
[13:04] but this doesn't let you discover if it's customised or not
[13:04] you should include it only if it's custom, otherwise it's worthless crap which doesn't let us find wikis which actually need info
[13:05] that is from old mediawiki
[13:05] prior 1.12
[13:06] 1.15*
[13:06] emijrp: ah, so you're going to use it only in those cases, because there's no API?
[13:06] yes
[13:06] ok
[13:07] when there's API, it's still worth including MediaWiki:copyright iff existing
[13:21] in originalurl field, api or mainpage link?
[13:22] Nemo_bis:
[13:23] emijrp: hm?
[13:24] are there any wikiteam tools for DokuWiki?
[13:24] dump it in the "rights" IA field
[13:24] balrog_: not yet
[13:24] :/ ok
[13:24] AFAIK
[13:25] Nemo_bis: i mean the originalurl field
[13:25] http://archive.org/details/wiki-amsterdamcampuswikinl_wiki
[13:29] emijrp: API
[13:30] that's what we decided, I don't remember all the reasons but one of them is that the script path is non-trivial
[13:31] emijrp: so you manage to extract the licenseurl even if there's no API now?
[13:32] i copy the copyright info from the html footer, and link to #footer
[13:32] anyway, there are wikis without copyright info at all
[13:33] first api, if it fails then html, if it fails then skip the dump
[13:33] emijrp: skip??
[13:34] if there is no license data, upload anyway?
[13:34] yes
[13:34] but tag it in some way so that we can review it later
[13:34] ok
[13:35] add unknowncopyright keyword
[13:35] for instance NO_COPYRIGHT_INFO in rights field
[13:35] yeah whatever
[13:37] emijrp: are you adding the footer content to the 'rights' field even if you fetch the license from the API?
[13:37] no
[13:38] emijrp: it would be better, if MediaWiki:copyright is custom
[13:38] (or just the raw text of the message)
[13:39] emijrp: the copyright message in the footer was introduced with Domas Mituzas 2006-01-22 00:49:58 +0000 670) 'copyright' => 'Content is available under $1.',
[13:44] ah no it just was in another file
[13:46] yes, but $1 is defined in LocalSettings.php
[13:46] and showed in the footer
[13:46] you cant read the mediawiki:copyright licence
[13:46] shown* license*
[13:50] Nemo_bis: tell me a field for dump date
[13:51] emijrp: I know you can't, but the text can still contain info
[13:51] like in your referata wiki
[13:51] while the link is already provided by the API
[13:51] now looking for a proper field
[13:54] emijrp: 'addeddate' would seem ok but I still have to dig
[13:55] 'date'
[13:57] emijrp: no, that's for the date of creation
[13:57] but the content is created before the dump
[13:57] creation of what?
[13:57] creation of dump
[13:57] it would be like replacing the date of publication of a book with the date of scanning
[13:57] no, creation of the work
[13:58] downloaddate, backupdate...
[13:58] dumpdate
[13:59] ?
[14:00] emijrp: better to use existing ones
[14:01] addeddate?
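
[Editor's note: the fallback order agreed above — API first, then the HTML footer, otherwise upload anyway with an unknowncopyright marker — might look roughly like the sketch below. It assumes Python, the IA field names discussed in the log (licenseurl, rights, originalurl), and that the keyword would land in IA's subject field; none of this is quoted from the actual uploader.]

    def build_license_metadata(api_url, license_url, license_text, footer_text):
        """Combine whatever license info was found into IA item metadata.

        license_url/license_text come from the siteinfo query (may be None on
        pre-1.15 wikis); footer_text is whatever was scraped from the main page
        #footer (may also be None).
        """
        meta = {'originalurl': api_url}          # API link, as decided at 13:29
        rights = license_text or footer_text     # API first, HTML footer second
        if license_url:
            meta['licenseurl'] = license_url
        if rights:
            meta['rights'] = rights
        else:
            # no copyright info at all: upload anyway, but tag it for later review
            meta['rights'] = 'NO_COPYRIGHT_INFO'
            meta['subject'] = 'wiki;wikiteam;unknowncopyright'
        return meta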
[14:01] no, addeddate is wrong
[14:02] it's already used by the system on upload
[14:02] soo let's see
[14:02] emijrp: how about a last-updated-date
[14:03] it could contain the date of the last dump
[14:03] anyway if more than 1 dump is allowed per item, date is nonsense
[14:04] date is on filenames
[14:04] emijrp: it's necessarily nonsense for wikis
[14:04] no field for date
[14:04] it should be replaced with something like 2001-2012
[14:05] emijrp: however, a way to see when the dump has been updated last would be useful
[14:06] at some point we could use it to select wikis to be redownloaded
[14:06] ok
[14:14] i think you can use it Nemo_bis http://code.google.com/p/wikiteam/source/browse/trunk/uploader.py
[14:14] you need the keys.txt file
[14:14] download it in the same directory you have the dumps
[14:14] and listxxx.txt is required
[14:15] im going to add instructions here http://code.google.com/p/wikiteam/w/edit/TaskForce
[14:19] emijrp: so it uploads only what one has in the list?
[14:19] (just to check)
[14:19] yes
[14:19] oki
[14:19] * Nemo_bis eager to try
[14:22] instructions http://code.google.com/p/wikiteam/wiki/TaskForce#footer
[14:23] try
[14:28] Nemo_bis: works?
[14:31] balrog_: dokuwiki is not supported, really only mediawiki is supported
[14:31] I've used the mediawiki tool and it works well
[14:37] do you know any dokuwiki wikifarm?
[14:38] well, I was thinking of the MESS wiki, but that's a mess
[14:38] because it was lost with no backups
[14:38] so it might have to be scraped from IA :(
[14:44] sure
[14:51] Nemo_bis is fapping watching the script to upload a fuckton of wikis
[15:08] emijrp: been busy, now I should try
[15:10] emijrp: first batch includes a 100 GB wiki, it won't be overwritten, right?
[15:11] how overwritten? the uploader script just reads from the hard disk
[15:12] emijrp: I mean on archive.org
[15:12] I don't want to reupload that huge dump
[15:12] i dont know
[15:12] hmm
[15:12] then just remove it from listxxx.txt
[15:12] let's try another batch first
[15:12] right
[15:13] remove from the list those dumps you uploaded in the past
[15:13] I'm so smart, I had already removed it
[15:13] but do not make an svn commit
[15:14] you should use different directories for svn and the taskforce downloads
[15:14] IndexError: list index out of range
[15:14] sure
[15:14] hmm
[15:14] wait
[15:15] paste the entire error
[15:15] no, I was just stupid
[15:16] emijrp: http://archive.org/details/wiki-citwikioberlinedu
[15:17] we used to include the dots though
[15:17] no -wikidump.7z ?
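
[Editor's note: one way to avoid re-uploading an existing 100 GB dump is to ask archive.org whether the file is already in the item before starting. A sketch assuming Python 3 and the public archive.org metadata endpoint; this is not what uploader.py did at the time, and the example identifier/filename are placeholders.]

    # Sketch: check whether a file already exists in an archive.org item
    # before uploading it again. Illustrative only.
    import json
    import urllib.error
    import urllib.request

    def already_uploaded(identifier, filename):
        """True if `filename` is already listed in item `identifier` on archive.org."""
        url = 'https://archive.org/metadata/' + identifier
        try:
            with urllib.request.urlopen(url, timeout=30) as r:
                data = json.loads(r.read().decode('utf-8'))
        except urllib.error.URLError:
            return False  # network error: assume not uploaded, caller may retry
        files = data.get('files') or []  # a nonexistent item returns {}
        return any(f.get('name') == filename for f in files)

    # e.g. already_uploaded('wiki-examplewiki', 'examplewiki-history.xml.7z')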
[15:17] dunno if it was a good idea
[15:17] emijrp: http://www.us.archive.org/log_show.php?task_id=117329967
[15:17] you should see it now
[15:19] afk for some mins
[15:31] hmpf IOError: [Errno socket error] [Errno 110] Connection timed out
[15:31] now let's see what happens if I just rerun
[15:32] emijrp: ^
[15:33] file gets overwritten
[15:34] no -wikidump.7z
[15:34] there isn't any probably
[15:34] i have used the script to upload 3 small wikis and all is ok
[15:35] perhaps i may move the urllib2 query inside the try/except
[15:35] to avoid time out errors
[15:35] probably some wikis are dead now
[15:36] hm there's an images directory but no archive
[15:37] later I'll have to check all the archives, recompress if needed and reupload
[15:37] so I seriously need the script not to reupload stuff which is already on archive.org
[15:37] it timed out again
[15:38] http://p.defau.lt/?as62zqO_kO6K8Duh_jCG1Q
[15:40] yep cloverfield.despoiler.org/api.php
[15:40] no response
[15:40] it didn't fail before
[15:41] http://archive.org/details/wiki-cloverfielddespoilerorg
[15:42] erratic server
[15:42] update the uploader script, i added a try/except
[15:43] emijrp: how about reupload?
[15:44] ?
[15:45] emijrp: does it still reupload stuff already uploaded?
[15:45] yes
[15:45] hmpf
[15:45] is there any command to skip? in curl?
[15:46] i thought you didnt upload dumps
[15:46] thats why you needed the script
[15:46] sure but for instance now I have to manually edit the list of files every time I have to rerun it
[15:47] and for this wiki I must reupload the 250 MB archive to also upload the small history 7z
[15:47] ok
[15:48] perhaps it would be easier to use the script to put the metadata in a csv file for their bulkuploader?
[15:48] there must be some command though :/
[15:49] the uploader creates a log with the dump status
[15:49] i will read it first, to skip those uploaded before
[15:58] Nemo_bis: try now, it skips uploaded dumps
[15:58] but you must keep the .log
[15:59] emijrp: ok
[15:59] emijrp: are you putting the "lang" from siteinfo in the "language" field?
[16:04] nope
[16:04] i will add it later
[16:47] Quick question.
[16:48] Who is James Michael Dupont? One of you guys?
[16:59] SketchCow: somehow
[17:00] he's a Wikipedian who's archiving some speedy deleted articles to a Wikia wiki
[17:00] (from en.wikipedia)
[17:02] SketchCow: is he flooding IA?
[17:02] he wanted to add XML exports of articles to the wikiteam collection but it should be in a subcollection at least (if any)
[17:20] He is not.
[17:20] He just asked for access to stuff.
[17:20] I can give him a subcollection
[17:20] just making sure we're in the relatively correct location.
[17:24] SketchCow: we don't know him very well, of course you own the house but IMvHO better if he's not admin of wikiteam
[17:24] a subcollection would be good
[17:28] I've traded a lot of emails, I trust him.
[17:28] but a subcollection would organize it better
[17:29] yep
[17:30] Give him a subcollection, then
[17:30] I'll link it into archiveteam's wikiteam as a whole
[18:33] Nemo_bis: language must be 'en' or 'English'?
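
[Editor's note: the "skip what's in the .log" approach emijrp describes at 15:49-15:58 could be sketched roughly as below, in Python. The one-line-per-dump, tab-separated log format is an assumption for illustration, not the exact format uploader.py wrote.]

    # Sketch of skipping dumps already recorded in a previous run's log.
    import os

    def load_done(log_path='uploader.log'):
        """Return the set of dumps marked as uploaded in an earlier run."""
        done = set()
        if os.path.exists(log_path):
            with open(log_path, encoding='utf-8') as log:
                for line in log:
                    parts = line.rstrip('\n').split('\t')
                    if len(parts) >= 2 and parts[1] == 'ok':
                        done.add(parts[0])
        return done

    def record(wiki, status, log_path='uploader.log'):
        """Append the result for one dump, so a rerun can skip it."""
        with open(log_path, 'a', encoding='utf-8') as log:
            log.write('%s\t%s\n' % (wiki, status))

    # rerun loop (pseudo-usage):
    # done = load_done()
    # for wiki in wikis:
    #     if wiki in done:
    #         continue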
[18:33] api returns 'en'
[18:33] and making a converter is a pain
[18:36] i can convert the basic ones
[18:39] ok, new version of the uploader, detecting languages
[18:39] update
[18:47] Hey, emijrp
[18:48] We're giving wikideletion guy a subcollection pointing to the wikiteam collection
[18:48] So he can add the stuff but it's not in the main thing and he's not an admin
[18:48] Also, I closed new accounts on archiveteam.org while we clean
[18:49] ok SketchCow, about wikideletion guy
[18:50] about at.org, i wouldn't care so much about deleting user accounts but about deleting the spam pages created by them
[18:51] also, an antispam extension is needed, the main issue is bot registrations, and extensions may be triggered in those cases, or when adding an external link, etc
[18:52] http://www.mediawiki.org/wiki/Extension:ConfirmEdit + http://www.mediawiki.org/wiki/Extension:QuestyCaptcha
[18:53] Well, I want deleted accounts AND deleting spam pages.
[18:54] $wgCaptchaQuestions[] = array( 'question' => "...", 'answer' => "..." );
[18:54] $wgCaptchaTriggers['createaccount'] = true;
[19:15] SketchCow: also we are grabbing Wikimedia Commons entirely month by month http://archive.org/details/wikimediacommons-200507
[19:15] 12 months * 7 years = 84 items
[19:16] !
[19:16] not sure if a subcollection is desired
[19:16] Yeah, it will be.
[19:16] But for now, just keep going.
[19:17] http://archive.org/search.php?query=%22wikimedia%20commons%20grab%22
[19:18] http://archive.org/details/wikimediacommons exists already, apparently.
[19:20] but that is an item, not a subcollection, right?
[19:20] it was created by Hydriz probably, but he saw it is many TB for a single item, and separated it into months
[19:22] It's a subcollection now.
[19:22] I'll move the crap over
[19:24] Done.
[19:24] It'll filter into place across the next 5-40 minutes.
[19:27] http://archive.org/search.php?query=collection%3Awikimediacommons&sort=-publicdate
[19:30] it may be a subcollection inside wikiteam?
[19:31] It is.
[19:31] http://archive.org/details/wikimediacommons
[19:32] if you click on the all items, they're there; archive.org has these walkers that go through and shore up collections later.
[19:32] http://ia601202.us.archive.org/zipview.php?zip=/22/items/wikimediacommons-200607/2006-07-01.zip&file=2006%2F07%2F01%2FHooters_Girls_calendar_signing_at_Kandahar.jpg
[19:32] SAVED FOREVER
[19:32] ah ok, it is now shown here http://archive.org/details/wikiteam may be lag
[19:32] not*
[19:33] Yeah, not lag.
[19:33] See, a thing goes through over time and shores up collection sets, cleans messes, etc.
[19:33] ok
[19:33] That's one of archive.org's weirdnesses that's culturally difficult for folks - how stuff kind of fades into view
[19:33] It's not 1/0
[19:35] the zip explorer is very useful and cool in this case
[19:35] and the .xml files are opened in firefox too, so you can read the image description and license
[19:35] pure win
[19:41] anyway, some files were lost on the wikimedia servers http://upload.wikimedia.org/wikipedia/commons/archive/b/b8/20050415212201!SMS_Bluecher.jpg and they are saved as empty files in the grab
[20:06] Frau Bluecher!!! (whinny)
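
[Editor's note: the lost originals mentioned at 19:41 end up as empty files in the grab; a quick way to list them for review before zipping and uploading, assuming Python 3. Illustration only, not part of the actual grab scripts, and the example path is a placeholder.]

    # Sketch: list zero-byte files in a downloaded tree (e.g. one day of the
    # Commons grab), so placeholders for files lost on the servers can be reviewed.
    import os

    def empty_files(root):
        """Yield paths of all 0-byte files under `root`."""
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                if os.path.getsize(path) == 0:
                    yield path

    # e.g. for path in empty_files('2005/04/15'): print(path)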
[20:49] emijrp: AFAIK language codes work
[20:58] emijrp: launched the script for 3557 more wikis
[20:58] not all of them actually downloaded of course
[20:58] cool
[20:59] emijrp: it would be nice if you could redirect the curl output so that I can see the progress info
[20:59] I can monitor it with nethogs but it's not quite the same
[21:00] do it yourself and commit
[21:00] emijrp: don't know exactly what output you're expecting in the logs
[21:00] anyway it's not urgent
[21:01] SketchCow: what's the command to prevent curl/s3 from replacing an existing file with the same filename?
[21:02] -nc --no-clobber
[21:03] so easy? :p
[21:05] chronomex: are you? I mean when one uploads with s3
[21:05] *you sure
[21:05] oh
[21:05] I was thinking wget
[21:05] sorry
[21:06] chronomex: I want to avoid overwriting existing files/items in IA with my local ones
[21:09] good, saturating bandwidth to IA for once
[21:11] emijrp: it's not the uploader's fault but I wonder if we should do something to catch the duplicates I overlooked: http://archive.org/details/wiki-docwikigumstixcom http://archive.org/details/wiki-docwikigumstixorg
[21:12] emijrp: a pretty feature would be downloading the wiki logo and uploading it to the item as well
[21:12] (surely not a priority :p)
[21:13] k
[21:19] Nemo_bis: add it as a TODO comment at the top of uploader.py
[21:19] emijrp: ok
[21:31] Nemo_bis: the speedydeletion guy does not upload the entire history, just the last revision
[21:32] http://speedydeletion.wikia.com/wiki/Salvador_De_Jesus_%28deleted_05_Sep_2008_at_20:47%29
[21:32] http://deletionpedia.dbatley.com/w/index.php?title=Salvador_De_Jesus_%28deleted_05_Sep_2008_at_20:47%29&action=history
[21:33] http://speedydeletion.wikia.com/wiki/Salvador_De_Jesus_%28deleted_05_Sep_2008_at_20:47%29?action=history
[21:33] i think he really pastes the text instead of importing
[21:34] hmmmmmmmm
[21:45] who is alphacorp again?
[21:45] underscor: Hydriz
[21:45] thx
[21:46] hm? http://p.defau.lt/?puN_G_zKXbv1lz9TfSliPg
[21:47] Nemo_bis: is that a space in the identifier?
[21:48] underscor: it shouldn't, let me check
[21:49] underscor: I don't think so http://archive.org/details/wiki-editionorg_w
[21:49] woah, weird
[21:50] it started and then failed
[21:50] delete and retry?
[21:50] it's two very small files, maybe it didn't manage to finish the first before the second
[21:50] ah
[21:50] yeah
[21:50] need a longer pause
[21:52] underscor: it didn't even manage to set the collection and other metadata http://archive.org/details/wiki-editionorg_w
[22:06] underscor: weird, just 300 KB more and it works http://archive.org/details/wiki-emiswikioecnk12ohus
[22:16] underscor: we also have some escaping problem, fancy fixing it while emijrp is offline? :) http://archive.org/details/wiki-encitizendiumorg
[22:16] shouldn't be too hard
[22:19] sigh http://archive.org/details/wiki-enecgpediaorg http://archive.org/details/wiki-en.ecgpedia.org
[22:19] Nemo_bis: Working on it
[22:19] the fix is easy, but trying to figure out where we munge the data
[22:22] underscor: from the siteinfo API query IIRC
[22:22] otherwise meta HTML tags?
[22:22] no, no
[22:22] not your fault
[22:23] somewhere in the s3 pipline, it's double entity-ized
[22:23] pipeline*
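
[Editor's note: "double entity-ized" means text like "&amp;amp;" where "&" was HTML-escaped twice. A sketch of undoing that, assuming Python 3's html module; where exactly the s3 pipeline munges the metadata is left open in the log, so this only illustrates the symptom and the obvious repair, not the actual fix.]

    # Sketch: undo double HTML-entity escaping (the "&amp;amp;" symptom above).
    # Repeated unescaping until stable is a blunt instrument, fine for checking
    # item metadata; it is not a substitute for fixing the escaping bug itself.
    import html

    def unescape_fully(text, max_rounds=3):
        """Unescape HTML entities until the string stops changing."""
        for _ in range(max_rounds):
            unescaped = html.unescape(text)
            if unescaped == text:
                break
            text = unescaped
        return text

    # html.escape(html.escape('R&D')) -> 'R&amp;amp;D'
    # unescape_fully('R&amp;amp;D')   -> 'R&D'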