12:09 <emijrp> Nemo_bis: I'm working on the uploader
12:09 <emijrp> http://archive.org/details/wiki-androidjimsh
12:11 <emijrp> check the fields
12:11 <emijrp> what to do with those wikis with no license metadata in the API query?
12:13 <emijrp> also, appending the date to the item name is ok, right?
12:13 <emijrp> the curl command you sent me doesn't include it, example http://archive.org/details/wiki-androidjimsh
12:15 <Nemo_bis> emijrp: looking
12:16 <Nemo_bis> emijrp: no, the item identifier shouldn't contain the date
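(As an aside: the identifier scheme implied by the examples in this log, 'wiki-' plus the hostname with punctuation stripped and no date, would look roughly like this in Python 2; the exact rule is an assumption, not wikiteam's verbatim code.)

    import re
    import urlparse

    def item_identifier(api_url):
        # http://android.jim.sh/api.php -> wiki-androidjimsh
        domain = urlparse.urlparse(api_url).netloc
        return 'wiki-' + re.sub(r'[^A-Za-z0-9]', '', domain).lower()

    print item_identifier('http://android.jim.sh/api.php')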
12:16 <emijrp> but that may produce collisions
12:16 <emijrp> distinct users uploading dumps for a wiki on different dates
12:17 <Nemo_bis> emijrp: how so? I think it's better if we put different dates in the same item
12:17 <Nemo_bis> you only need error handling, but I think IA doesn't let you disrupt anything
12:17 <Nemo_bis> such items will need to be uploaded by a wikiteam collection admin (like you, me and underscor IIRC)
12:18 <emijrp> SketchCow: opinion?
12:19 <emijrp> Nemo_bis: also, the description is almost empty
12:19 <Nemo_bis> it's early morning there
12:19 <Nemo_bis> emijrp: where are you fetching the description from?
12:19 <Nemo_bis> emijrp: the item id pattern is what Alexis told us to use
12:19 <Nemo_bis> no date in it
12:20 <Nemo_bis> emijrp: as for the license info, what API info are you using? I think there are (or have been) multiple places across the MW releases
12:22 <emijrp> http://android.jim.sh/api.php?action=query&meta=siteinfo&siprop=general|rightsinfo&format=xml
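(A minimal Python 2 sketch of that siteinfo query, using urllib2 like the rest of the uploader; the rightsinfo 'url' and 'text' attributes are what the MediaWiki API returns.)

    import urllib2
    import xml.etree.ElementTree as ET

    def get_rights_info(api_url):
        # returns (license URL, license name/text); either may be None
        xml_data = urllib2.urlopen(
            api_url + '?action=query&meta=siteinfo&siprop=general|rightsinfo&format=xml',
            timeout=30).read()
        rightsinfo = ET.fromstring(xml_data).find('.//rightsinfo')
        if rightsinfo is None:
            return None, None
        return rightsinfo.get('url'), rightsinfo.get('text')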
12:22 <Nemo_bis> emijrp: I think that when you can't find a proper license URL you should just take whatever you find (non-URLs, names in the rightsinfo fields, even the main page or Project:About or Project:Copyright), dump it in the "rights" field, and in any case tag the item so that we can check it later
12:23 <Nemo_bis> emijrp: according to the docs, that didn't exist before MW 1.15
12:25 <emijrp> or just a link to the main page
12:25 <Nemo_bis> emijrp: also note https://bugzilla.wikimedia.org/show_bug.cgi?id=29918#c1
12:25 <emijrp> http://wikipapers.referata.com/wiki/Main_Page#footer
12:26 <Nemo_bis> the link is bad, might disappear any time
12:27 <Nemo_bis> hmm is wikistats.wmflabs.org down?
12:27 <emijrp> you said to link Project:About
12:28 <Nemo_bis> no, I'd just include its content
12:28 <Nemo_bis> whatever it is, can't harm
12:28 <Nemo_bis> (if there's no license URL)
12:30 <Nemo_bis> btw I'm still without a proper PC right now but I have dumps on an external HDD and the connection now works (actually it's freer than usual), so I can upload lots of stuff
12:46 <emijrp> if you upload a file with the same name to an item, are they overwritten
12:46 <emijrp> ?
12:48 <Nemo_bis> emijrp: this I don't know
12:49 <emijrp> the description is not
12:49 <Nemo_bis> emijrp: description and other metadata are never overwritten unless you add that ignore-bucket something
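(The "ignore bucket something" is, as far as I know, the x-archive-ignore-preexisting-bucket header of IA's S3-like API, which makes a PUT redefine an existing item's metadata. A sketch in the style of uploader.py, which shells out to curl; the keys.txt layout of access key and secret key on two lines is an assumption.)

    import os

    # assumed layout: line 1 = IA S3 access key, line 2 = secret key
    accesskey, secretkey = open('keys.txt').read().strip().split('\n')[:2]

    os.system('curl --location '
              '--header "x-amz-auto-make-bucket:1" '
              '--header "x-archive-ignore-preexisting-bucket:1" '
              '--header "authorization: LOW %s:%s" '
              '--upload-file wiki-example-wikidump.7z '
              'http://s3.us.archive.org/wiki-example/wiki-example-wikidump.7z'
              % (accesskey, secretkey))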
12:50 <Nemo_bis> emijrp: https://www.mediawiki.org/w/index.php?title=Special:Contributions/Nemo_bis&offset=&limit=5&target=Nemo+bis
12:51 <Nemo_bis> So, I think licenseurl should contain whatever "url" is; additionally, if it's not a link to creativecommons.org or fsf.org, you should fetch it and add it to the "rights" field
12:51 <Nemo_bis> additionally, it would probably be best to add to "rights" whatever MediaWiki:Copyright contains even if the wiki uses a license, to get cases like the referata wiki footer you linked
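(That rule as a Python 2 sketch; get_rights_info is the sketch above, and matching only on the creativecommons.org / fsf.org host suffix is my own simplification.)

    import urllib2
    import urlparse

    def license_fields(license_url, rights_text):
        fields = {'licenseurl': license_url or '', 'rights': rights_text or ''}
        if license_url:
            host = urlparse.urlparse(license_url).netloc
            if not host.endswith(('creativecommons.org', 'fsf.org')):
                # unfamiliar license page: keep its content in "rights" too
                try:
                    fields['rights'] += '\n' + urllib2.urlopen(license_url,
                                                               timeout=30).read()
                except urllib2.URLError:
                    pass
        return fields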
12:52 <emijrp> some people add the copyright info to the main page, Project:About, Project:Copyright, etc.; it's a mess
12:55 <emijrp> you can run a batch of 100, and check the log for how many wikis haven't got copyright metadata
12:57 <Nemo_bis> yes, I can do such a batch when the script is ready
12:57 <Nemo_bis> is it?
13:00 <Nemo_bis> emijrp: which of those pages are you fetching right now?
13:01 <emijrp> none, just API metadata
13:02 <Nemo_bis> emijrp: can you add at least MediaWiki:Copyright?
13:02 <Nemo_bis> shouldn't be hard with the API or ?action=raw
13:02 <emijrp> no, I'm going to fetch <li id="copyright"></li> from the main page, I think it's better
13:03 <Nemo_bis> emijrp: what's that?
13:04 <emijrp> http://amsterdam.campuswiki.nl/nl/Hoofdpagina look at the HTML code
13:04 <emijrp> at the bottom
13:04 <Nemo_bis> ah, the footer
13:04 <Nemo_bis> but this doesn't let you discover if it's customised or not
13:04 <Nemo_bis> you should include it only if it's custom, otherwise it's worthless crap which doesn't let us find the wikis which actually need info
13:05 <emijrp> that <li> is from old MediaWiki
13:05 <emijrp> prior to 1.12
13:06 <emijrp> 1.15*
13:06 <Nemo_bis> emijrp: ah, so you're going to use it only in those cases, because there's no API?
13:06 <emijrp> yes
13:06 <Nemo_bis> ok
13:07 <Nemo_bis> when there's an API, it's still worth including MediaWiki:Copyright iff existing
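(Both sources in one Python 2 sketch: MediaWiki:Copyright via ?action=raw where it exists, and the <li id="copyright"> footer of the main page for old pre-1.15 wikis without a usable API; the helper names are mine.)

    import re
    import urllib2

    def mediawiki_copyright(index_url):
        # index_url is the wiki's index.php
        try:
            raw = urllib2.urlopen(index_url + '?title=MediaWiki:Copyright&action=raw',
                                  timeout=30).read()
            return raw.strip() or None
        except urllib2.URLError:  # includes the 404 of a missing page
            return None

    def footer_copyright(mainpage_url):
        try:
            html = urllib2.urlopen(mainpage_url, timeout=30).read()
        except urllib2.URLError:
            return None
        m = re.search(r'(?s)<li id="copyright">(.*?)</li>', html)
        return m.group(1).strip() if m else None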
13:21 <emijrp> in the originalurl field, the API or the main page link?
13:22 <emijrp> Nemo_bis:
13:23 <Nemo_bis> emijrp: hm?
13:24 <balrog_> are there any wikiteam tools for DokuWiki?
13:24 <Nemo_bis> dump it in the "rights" IA field
13:24 <Nemo_bis> balrog_: not yet
13:24 <balrog_> :/ ok
13:24 <Nemo_bis> AFAIK
13:25 <emijrp> Nemo_bis: I mean the originalurl field
13:25 <emijrp> http://archive.org/details/wiki-amsterdamcampuswikinl_wiki
13:29 <Nemo_bis> emijrp: the API
13:30 <Nemo_bis> that's what we decided, I don't remember all the reasons but one of them is that the script path is non-trivial
13:31 <Nemo_bis> emijrp: so you manage to extract the licenseurl even if there's no API now?
13:32 <emijrp> I copy the copyright info from the HTML footer, and link to #footer
13:32 <emijrp> anyway, there are wikis without copyright info at all
13:33 <emijrp> first the API; if that fails, then the HTML; if that fails, skip the dump
13:33 <Nemo_bis> emijrp: skip??
13:34 <emijrp> if there is no license data, upload anyway?
13:34 <Nemo_bis> yes
13:34 <Nemo_bis> but tag it in some way so that we can review it later
13:34 <emijrp> ok
13:35 <emijrp> add an unknowncopyright keyword
13:35 <Nemo_bis> for instance NO_COPYRIGHT_INFO in the rights field
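(The agreed pipeline in one place, as a Python 2 sketch on top of the helpers above: API first, footer second, and when both come up empty, upload anyway but tag for review. The subject string is an assumption.)

    def license_metadata(api_url, mainpage_url):
        try:
            url, text = get_rights_info(api_url)      # sketched earlier
        except Exception:
            url, text = None, None
        if not (url or text):
            text = footer_copyright(mainpage_url)     # sketched earlier
        if not (url or text):
            # no license info anywhere: upload anyway, but mark for review
            return {'rights': 'NO_COPYRIGHT_INFO',
                    'subject': 'wiki;wikiteam;unknowncopyright'}
        return {'licenseurl': url or '', 'rights': text or ''}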
13:35 <Nemo_bis> yeah, whatever
13:37 <Nemo_bis> emijrp: are you adding the footer content to the 'rights' field even if you fetch the license from the API?
13:37 <emijrp> no
13:38 <Nemo_bis> emijrp: it would be better, if MediaWiki:Copyright is custom
13:38 <Nemo_bis> (or just the raw text of the message)
13:39 <Nemo_bis> emijrp: the copyright message in the footer was introduced with Domas Mituzas 2006-01-22 00:49:58 +0000 670) 'copyright' => 'Content is available under $1.',
13:44 <Nemo_bis> ah no, it just was in another file
13:46 <emijrp> yes, but $1 is defined in LocalSettings.php
13:46 <emijrp> and showed in the footer
13:46 <emijrp> you can't read the MediaWiki:Copyright licence
13:46 <emijrp> shown* license*
13:50 <emijrp> Nemo_bis: tell me a field for the dump date
13:51 <Nemo_bis> emijrp: I know you can't, but the text can still contain info
13:51 <Nemo_bis> like in your referata wiki
13:51 <Nemo_bis> while the link is already provided by the API
13:51 <Nemo_bis> now looking for a proper field
13:54 <Nemo_bis> emijrp: 'addeddate' would seem ok but I still have to dig
13:55 <emijrp> 'date'
13:57 <Nemo_bis> emijrp: no, that's for the date of creation
13:57 <Nemo_bis> but the content is created before the dump
13:57 <emijrp> creation of what?
13:57 <emijrp> creation of the dump
13:57 <Nemo_bis> it would be like replacing the date of publication of a book with the date of scanning
13:58 <Nemo_bis> no, creation of the work
13:58 <emijrp> downloaddate, backupdate...
13:59 <emijrp> dumpdate
14:00 <emijrp> ?
14:01 <Nemo_bis> emijrp: better to use existing ones
14:01 <emijrp> addeddate?
14:02 <Nemo_bis> no, addeddate is wrong
14:02 <Nemo_bis> it's already used by the system on upload
14:02 <Nemo_bis> so, let's see
14:03 <Nemo_bis> emijrp: how about a last-updated-date
14:03 <Nemo_bis> it could contain the date of the last dump
14:04 <emijrp> anyway, if more than one dump is allowed per item, date is nonsense
14:04 <emijrp> the date is in the filenames
14:04 <Nemo_bis> emijrp: it's necessarily nonsense for wikis
14:04 <emijrp> no field for date
14:04 <Nemo_bis> it should be replaced with something like 2001-2012
14:05 <Nemo_bis> emijrp: however, a way to see when the dump was last updated would be useful
14:06 <Nemo_bis> at some point we could use it to select wikis to be redownloaded
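(Custom metadata travels to IA's S3 endpoint in x-archive-meta-* headers, so a last-updated-date could be set at upload time like this; Python 2 sketch, and the field name is the one proposed here, not a standard IA field.)

    import time

    def meta_headers(metadata):
        # turn a metadata dict into the curl --header arguments
        return ' '.join('--header "x-archive-meta-%s:%s"' % (k, v)
                        for k, v in metadata.items())

    print meta_headers({'last-updated-date': time.strftime('%Y-%m-%d'),
                        'mediatype': 'web'})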
14:06
π
|
emijrp |
ok |
14:14
π
|
emijrp |
i think you can use it Nemo_bis http://code.google.com/p/wikiteam/source/browse/trunk/uploader.py |
14:14
π
|
emijrp |
you need keys.txt file |
14:14
π
|
emijrp |
download it in the same directory you have the dumps |
14:14
π
|
emijrp |
and listxxx.txt is required |
14:15
π
|
emijrp |
im going to add instructions here http://code.google.com/p/wikiteam/w/edit/TaskForce |
14:19
π
|
Nemo_bis |
emijrp: so it uploads only what one has in the list? |
14:19
π
|
Nemo_bis |
(just to check) |
14:19
π
|
emijrp |
yes |
14:19
π
|
Nemo_bis |
oki |
14:19
π
|
* |
Nemo_bis eager to try |
14:22
π
|
emijrp |
instructions http://code.google.com/p/wikiteam/wiki/TaskForce#footer |
14:23
π
|
emijrp |
try |
14:28
π
|
emijrp |
Nemo_bis: works? |
14:31
π
|
emijrp |
balrog_: dokuwiki is not supported, really only mediawiki is supported |
14:31
π
|
balrog_ |
I've used the mediawiki tool and it works well |
14:37
π
|
emijrp |
do you know any dokuwiki wikifarm? |
14:38
π
|
balrog_ |
well, I was thinking of the MESS wiki, but that's a mess |
14:38
π
|
balrog_ |
because it was lost with no backups |
14:38
π
|
balrog_ |
so it might have to be scraped from IA :( |
14:44
π
|
emijrp |
sure |
14:51
π
|
emijrp |
Nemo_bis is fapping watching the script to upload a fuckton of wikis |
15:08
π
|
Nemo_bis |
emijrp: been busy, now I should tri |
15:10
π
|
Nemo_bis |
emijrp: first batch includes a 100 GB wiki, it won't overwritten right? |
15:11
π
|
emijrp |
how overwritten? uploader script just read from harddisk |
15:12
π
|
Nemo_bis |
emijrp: I mean on archive.org |
15:12
π
|
Nemo_bis |
I don't want to reupload that huge dump |
15:12
π
|
emijrp |
i dont know |
15:12
π
|
Nemo_bis |
hmm |
15:12
π
|
emijrp |
then just remove it from listxxx.txt |
15:12
π
|
Nemo_bis |
let's try another batch first |
15:12
π
|
Nemo_bis |
right |
15:13
π
|
emijrp |
remove from list those dumps you uploaded in the past |
15:13
π
|
Nemo_bis |
I'm so smart, I had already removed it |
15:13
π
|
emijrp |
but do not make svn commit |
15:14
π
|
emijrp |
you should use different directories for svn and the taskforce downlaods |
15:14
π
|
Nemo_bis |
IndexError: list index out of range |
15:14
π
|
Nemo_bis |
sure |
15:14
π
|
Nemo_bis |
hmm |
15:14
π
|
Nemo_bis |
wait |
15:15
π
|
emijrp |
paste the entire error |
15:15
π
|
Nemo_bis |
no, I was just stupid |
15:16
π
|
Nemo_bis |
emijrp: http://archive.org/details/wiki-citwikioberlinedu |
15:17
π
|
Nemo_bis |
we used to include the dots though |
15:17
π
|
emijrp |
no -wikidump.7z ? |
15:17
π
|
Nemo_bis |
dunno if it was a good idea |
15:17
π
|
Nemo_bis |
emijrp: http://www.us.archive.org/log_show.php?task_id=117329967 |
15:17
π
|
Nemo_bis |
you should see it now |
15:19
π
|
Nemo_bis |
afk for some mins |
15:31
π
|
Nemo_bis |
hmpf IOError: [Errno socket error] [Errno 110] Connection timed out |
15:31
π
|
Nemo_bis |
now let's see what happens if I just rerun |
15:32
π
|
Nemo_bis |
emijrp: ^ |
15:33
π
|
Nemo_bis |
file gets overwitten |
15:34
π
|
emijrp |
no -wikidump.7z |
15:34
π
|
Nemo_bis |
there isn't any probably |
15:34
π
|
emijrp |
i have used the script to upload 3 small wikis and all is ok |
15:35
π
|
emijrp |
perhaps i may move the urllib2 query inside the try catch |
15:35
π
|
emijrp |
to avoid time out errors |
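(What moving the urllib2 query inside the try amounts to: a dead or slow wiki costs only its own metadata instead of killing the whole batch. Python 2 sketch.)

    import socket
    import urllib2

    def fetch_siteinfo(api_url):
        try:
            return urllib2.urlopen(api_url + '?action=query&meta=siteinfo&format=xml',
                                   timeout=60).read()
        except (urllib2.URLError, socket.timeout) as e:
            # log and carry on with the next wiki in the batch
            print 'Error fetching %s (%s), skipping metadata' % (api_url, e)
            return None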
15:35 <emijrp> probably some wikis are dead now
15:36 <Nemo_bis> hm, there's an images directory but no archive
15:37 <Nemo_bis> later I'll have to check all the archives, recompress if needed and reupload
15:37 <Nemo_bis> so I seriously need the script not to reupload stuff which is already on archive.org
15:37 <Nemo_bis> it timed out again
15:38 <Nemo_bis> http://p.defau.lt/?as62zqO_kO6K8Duh_jCG1Q
15:40 <emijrp> yep, cloverfield.despoiler.org/api.php
15:40 <emijrp> no response
15:40 <Nemo_bis> it didn't fail before
15:41 <Nemo_bis> http://archive.org/details/wiki-cloverfielddespoilerorg
15:42 <emijrp> erratic server
15:42 <emijrp> update the uploader script, I added a try/except
15:43 <Nemo_bis> emijrp: how about reuploads?
15:44 <emijrp> ?
15:45 <Nemo_bis> emijrp: does it still reupload stuff already uploaded?
15:45 <emijrp> yes
15:45 <Nemo_bis> hmpf
15:45 <emijrp> is there any command to skip? in curl?
15:46 <emijrp> I thought you didn't upload dumps
15:46 <emijrp> that's why you needed the script
15:46 <Nemo_bis> sure, but for instance now I have to manually edit the list of files every time I have to rerun it
15:47 <Nemo_bis> and for this wiki I must reupload the 250 MB archive to also upload the small history 7z
15:47 <emijrp> ok
15:48 <Nemo_bis> perhaps it would be easier to use the script to put the metadata in a CSV file for their bulk uploader?
15:48 <Nemo_bis> there must be some command though :/
15:49 <emijrp> the uploader creates a log with the dump status
15:49 <emijrp> I will read it first, to skip those uploaded before
15:58 <emijrp> Nemo_bis: try now, it skips uploaded dumps
15:58 <emijrp> but you must keep the .log
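(The skip logic as a Python 2 sketch; the log name and the 'wiki;status' line format are assumptions, not necessarily uploader.py's real format.)

    import os

    def load_uploaded(logname='uploader-log.txt'):
        # collect the wikis already marked as successfully uploaded
        done = set()
        if os.path.exists(logname):
            for line in open(logname):
                parts = line.strip().split(';')
                if len(parts) >= 2 and parts[1] == 'ok':
                    done.add(parts[0])
        return done

    # in the main loop:
    # if wiki in load_uploaded(): continue  # don't re-upload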
15:59 <Nemo_bis> emijrp: ok
15:59 <Nemo_bis> emijrp: are you putting the "lang" from siteinfo in the "language" field?
16:04 <emijrp> nope
16:04 <emijrp> I will add it later
16:47 <SketchCow> Quick question.
16:48 <SketchCow> Who is James Michael Dupont? One of you guys?
16:59 <Nemo_bis> SketchCow: somehow
17:00 <Nemo_bis> he's a Wikipedian who's archiving some speedy-deleted articles to a Wikia wiki
17:00 <Nemo_bis> (from en.wikipedia)
17:02 <Nemo_bis> SketchCow: is he flooding IA?
17:02 <Nemo_bis> he wanted to add XML exports of articles to the wikiteam collection but it should be in a subcollection at least (if any)
17:20 <SketchCow> He is not.
17:20 <SketchCow> He just asked for access to stuff.
17:20 <SketchCow> I can give him a subcollection
17:20 <SketchCow> just making sure we're in the relatively correct location.
17:24 <Nemo_bis> SketchCow: we don't know him very well; of course you own the house, but IMvHO better if he's not an admin of wikiteam
17:24 <Nemo_bis> a subcollection would be good
17:28 <underscor> I've traded a lot of emails, I trust him.
17:28 <underscor> but a subcollection would organize it better
17:29 <Nemo_bis> yep
17:30 <SketchCow> Give him a subcollection, then
17:30 <SketchCow> I'll link it into archiveteam's wikiteam as a whole
18:33 <emijrp> Nemo_bis: language must be 'en' or 'English'?
18:33 <emijrp> the api returns 'en'
18:33 <emijrp> and making a converter is a pain
18:36 <emijrp> I can convert the basic ones
18:39 <emijrp> ok, new version of the uploader, detecting languages
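(A converter for the basic ones needs little more than a dict mapping MediaWiki's ISO codes to names for IA's language field; the table below is illustrative, not the uploader's actual list.)

    LANGUAGES = {
        'de': 'German', 'en': 'English', 'es': 'Spanish', 'fr': 'French',
        'it': 'Italian', 'ja': 'Japanese', 'nl': 'Dutch', 'pl': 'Polish',
        'pt': 'Portuguese', 'ru': 'Russian', 'zh': 'Chinese',
    }

    def language_name(code):
        # 'en-gb' -> 'English'; unknown codes fall back to the raw code
        return LANGUAGES.get(code.lower().split('-')[0], code)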
18:39 <emijrp> update
18:47 <SketchCow> Hey, emijrp
18:48 <SketchCow> We're giving the wikideletion guy a subcollection pointing to the wikiteam collection
18:48 <SketchCow> So he can add the stuff but it's not in the main thing and he's not an admin
18:48 <SketchCow> Also, I closed new accounts on archiveteam.org while we clean
18:49 <emijrp> ok SketchCow, about the wikideletion guy
18:50 <emijrp> about at.org, I wouldn't care so much about deleting the user accounts as about deleting the spam pages created by them
18:51 <emijrp> also, an antispam extension is needed; the main issue is bot registrations, and the extensions can be triggered in those cases, or when adding an external link, etc.
18:52 <emijrp> http://www.mediawiki.org/wiki/Extension:ConfirmEdit + http://www.mediawiki.org/wiki/Extension:QuestyCaptcha
18:53 <SketchCow> Well, I want deleted accounts AND deleted spam pages.
18:54 <emijrp> $wgCaptchaQuestions[] = array( 'question' => "...", 'answer' => "..." );
18:54 <emijrp> $wgCaptchaTriggers['createaccount'] = true;
19:15 <emijrp> SketchCow: also, we are grabbing Wikimedia Commons entirely, month by month http://archive.org/details/wikimediacommons-200507
19:15 <emijrp> 12 months * 7 years = 84 items
19:16 <SketchCow> !
19:16 <emijrp> not sure if a subcollection is desired
19:16 <SketchCow> Yeah, it will be.
19:16 <SketchCow> But for now, just keep going.
19:17 <emijrp> http://archive.org/search.php?query=%22wikimedia%20commons%20grab%22
19:18 <SketchCow> http://archive.org/details/wikimediacommons exists already, apparently.
19:20 <emijrp> but that is an item, not a subcollection, right?
19:20 <emijrp> it was created by Hydriz probably, but he saw it would be many TB for a single item, and separated it into months
19:22 <SketchCow> It's a subcollection now.
19:22 <SketchCow> I'll move the crap over
19:24 <SketchCow> Done.
19:24 <SketchCow> It'll filter into place across the next 5-40 minutes.
19:27 <SketchCow> http://archive.org/search.php?query=collection%3Awikimediacommons&sort=-publicdate
19:30 <emijrp> can it be a subcollection inside wikiteam?
19:31 <SketchCow> It is.
19:31 <SketchCow> http://archive.org/details/wikimediacommons
19:32 <SketchCow> if you click on all the items, they're there; archive.org has these walkers that go through and shore up collections later.
19:32 <SketchCow> http://ia601202.us.archive.org/zipview.php?zip=/22/items/wikimediacommons-200607/2006-07-01.zip&file=2006%2F07%2F01%2FHooters_Girls_calendar_signing_at_Kandahar.jpg
19:32 <SketchCow> SAVED FOREVER
19:32 <emijrp> ah ok, it is now shown here http://archive.org/details/wikiteam may be lag
19:33 <emijrp> not*
19:33 <SketchCow> Yeah, not lag.
19:33 <SketchCow> See, a thing goes through over time and shores up collection sets, cleans messes, etc.
19:33 <emijrp> ok
19:33 <SketchCow> That's one of archive.org's weirdnesses that's culturally difficult for folks - how stuff kind of fades into view
19:35 <SketchCow> It's not 1/0
19:35 <emijrp> the zip explorer is very useful and cool in this case
19:35 <emijrp> and the .xml files are opened in firefox too, so you can read the image description and license
19:41 <emijrp> pure win
20:06 <emijrp> anyway, some files were lost in the wikimedia servers http://upload.wikimedia.org/wikipedia/commons/archive/b/b8/20050415212201!SMS_Bluecher.jpg and they are saved as empty files in the grab
20:49 <SketchCow> Frau Bluecher!!! (whinny)
20:58 <Nemo_bis> emijrp: AFAIK language codes work
20:58 <Nemo_bis> emijrp: launched the script for 3557 more wikis
20:58 <Nemo_bis> not all of them actually downloaded, of course
20:59 <emijrp> cool
20:59 <Nemo_bis> emijrp: it would be nice if you could redirect the curl output so that I can see the progress info
21:00 <Nemo_bis> I can monitor it with nethogs but it's not quite the same
21:00 <emijrp> do it yourself and commit
21:00 <Nemo_bis> emijrp: don't know exactly what output you're expecting in the logs
21:01 <Nemo_bis> anyway it's not urgent
21:02 <Nemo_bis> SketchCow: what's the command to prevent curl/s3 from replacing an existing file with the same filename?
21:03 <chronomex> -nc --no-clobber
21:05 <Nemo_bis> so easy? :p
21:05 <Nemo_bis> chronomex: are you? I mean when one uploads with s3
21:05 <Nemo_bis> *you sure
21:05 <chronomex> oh
21:05 <chronomex> I was thinking wget
21:06 <chronomex> sorry
21:09 <Nemo_bis> chronomex: I want to avoid overwriting existing files/items on IA with my local ones
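(curl has no server-side no-clobber for uploads, so one workaround is to ask archive.org whether the file already exists before uploading. Python 2 sketch; that /download/ answers with a redirect for existing files is my assumption.)

    import httplib

    def already_uploaded(identifier, filename):
        conn = httplib.HTTPConnection('archive.org')
        conn.request('HEAD', '/download/%s/%s' % (identifier, filename))
        status = conn.getresponse().status
        conn.close()
        # assumed: 302 (redirect to a datanode) when the file exists,
        # 404 when it doesn't
        return status in (200, 302)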
21:09
π
|
Nemo_bis |
good, saturating bandwidth to IA for once |
21:11
π
|
Nemo_bis |
emijrp: it's not the uploader's fault but I wonder if we should do something to catch the duplicates I overlooked: http://archive.org/details/wiki-docwikigumstixcom http://archive.org/details/wiki-docwikigumstixorg |
21:12
π
|
Nemo_bis |
emijrp: a pretty feature would be downloading the wiki logo and upload it to the item as well |
21:12
π
|
Nemo_bis |
(surely not prioritary :p) |
21:13
π
|
emijrp |
k |
21:19
π
|
emijrp |
Nemo_bis: add it as a TODO comment in top of uploader.py |
21:19
π
|
Nemo_bis |
emijrp: ok |
21:31
π
|
emijrp |
Nemo_bis: the speedydeletion guy does not upload the entire history, just the last revision |
21:32
π
|
emijrp |
http://speedydeletion.wikia.com/wiki/Salvador_De_Jesus_%28deleted_05_Sep_2008_at_20:47%29 |
21:32
π
|
emijrp |
http://deletionpedia.dbatley.com/w/index.php?title=Salvador_De_Jesus_%28deleted_05_Sep_2008_at_20:47%29&action=history |
21:33
π
|
emijrp |
http://speedydeletion.wikia.com/wiki/Salvador_De_Jesus_%28deleted_05_Sep_2008_at_20:47%29?action=history |
21:33
π
|
emijrp |
i think he really paste the text instead of import |
21:34
π
|
Nemo_bis |
hmmmmmmmm |
21:45
π
|
underscor |
who is alphacorp again? |
21:45
π
|
Nemo_bis |
underscor: Hydriz |
21:45
π
|
underscor |
thx |
21:46
π
|
Nemo_bis |
hm? http://p.defau.lt/?puN_G_zKXbv1lz9TfSliPg |
21:47
π
|
underscor |
Nemo_bis: is that a space in the identifier? |
21:48
π
|
Nemo_bis |
underscor: it shouldn't, let me check |
21:49
π
|
Nemo_bis |
underscor: I don't think so http://archive.org/details/wiki-editionorg_w |
21:49
π
|
underscor |
woah, weird |
21:50
π
|
underscor |
it started and then failed |
21:50
π
|
underscor |
delete and retry? |
21:50
π
|
Nemo_bis |
it's two very small files, maybe it didn't manage to finish the first before the second |
21:50
π
|
underscor |
ah |
21:50
π
|
underscor |
yeah |
21:50
π
|
underscor |
need a longer pause |
21:52
π
|
Nemo_bis |
underscor: it didn't even manage to set the collection and other metadata http://archive.org/details/wiki-editionorg_w |
22:06
π
|
Nemo_bis |
underscor: weird, just 300 KB more and it works http://archive.org/details/wiki-emiswikioecnk12ohus |
22:16
π
|
Nemo_bis |
underscor: we also have some escaping problem, fancy fixing it while emijrp is offline? :) http://archive.org/details/wiki-encitizendiumorg |
22:16
π
|
Nemo_bis |
shouldn't be too hard |
22:19
π
|
Nemo_bis |
sigh http://archive.org/details/wiki-enecgpediaorg http://archive.org/details/wiki-en.ecgpedia.org |
22:19
π
|
underscor |
Nemo_bis: Working on it |
22:19
π
|
underscor |
the fix is easy, but trying to figure out where we munge the data |
22:22
π
|
Nemo_bis |
underscor: from siteinfo API query IIRC |
22:22
π
|
Nemo_bis |
otherwise meta HTML tags? |
22:22
π
|
underscor |
no, no |
22:22
π
|
underscor |
not your fault |
22:23
π
|
underscor |
somewhere in the s3 pipline, it's double entity-ized |
22:23
π
|
underscor |
pipeline* |