12:09 <emijrp> Nemo_bis: I'm working on the uploader
12:09 <emijrp> http://archive.org/details/wiki-androidjimsh
12:11 <emijrp> check the fields
12:11 <emijrp> what to do with those wikis with no license metadata in the API query?
12:13 <emijrp> also, appending the date to the item name is ok, right?
12:13 <emijrp> the curl command you sent me doesn't include it, example http://archive.org/details/wiki-androidjimsh
12:15 <Nemo_bis> emijrp: looking
12:16 <Nemo_bis> emijrp: no, the item identifier shouldn't contain the date
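(As an aside: the identifier scheme implied by the examples in this log, 'wiki-' plus the hostname with punctuation stripped and no date, would look roughly like this in Python 2; the exact rule is an assumption, not wikiteam's verbatim code.)

    import re
    import urlparse

    def item_identifier(api_url):
        # http://android.jim.sh/api.php -> wiki-androidjimsh
        domain = urlparse.urlparse(api_url).netloc
        return 'wiki-' + re.sub(r'[^A-Za-z0-9]', '', domain).lower()

    print item_identifier('http://android.jim.sh/api.php')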
12:16 <emijrp> but that may produce collisions
12:16 <emijrp> distinct users uploading dumps for a wiki on different dates
12:17 <Nemo_bis> emijrp: how so? I think it's better if we put different dates in the same item
12:17 <Nemo_bis> you only need error handling, but I think IA doesn't let you disrupt anything
12:17 <Nemo_bis> such items will need to be uploaded by a wikiteam collection admin (like you, me and underscor IIRC)
12:18 <emijrp> SketchCow: opinion?
12:19 <emijrp> Nemo_bis: also, the description is almost empty
12:19 <Nemo_bis> it's early morning there
12:19 <Nemo_bis> emijrp: where are you fetching the description from?
12:19 <Nemo_bis> emijrp: the item id pattern is what Alexis told us to use
12:19 <Nemo_bis> no date in it
12:20 <Nemo_bis> emijrp: as for the license info, what API info are you using? I think there are (or have been) multiple places across the MW releases
12:22 <emijrp> http://android.jim.sh/api.php?action=query&meta=siteinfo&siprop=general|rightsinfo&format=xml
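(A minimal Python 2 sketch of that siteinfo query, using urllib2 like the rest of the uploader; the rightsinfo 'url' and 'text' attributes are what the MediaWiki API returns.)

    import urllib2
    import xml.etree.ElementTree as ET

    def get_rights_info(api_url):
        # returns (license URL, license name/text); either may be None
        xml_data = urllib2.urlopen(
            api_url + '?action=query&meta=siteinfo&siprop=general|rightsinfo&format=xml',
            timeout=30).read()
        rightsinfo = ET.fromstring(xml_data).find('.//rightsinfo')
        if rightsinfo is None:
            return None, None
        return rightsinfo.get('url'), rightsinfo.get('text')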
12:22 <Nemo_bis> emijrp: I think that when you can't find a proper license URL you should just take whatever you find (non-URLs, names in the rightsinfo fields, even the main page or Project:About or Project:Copyright), dump it in the "rights" field, and in any case tag the item so that we can check it later
12:23 <Nemo_bis> emijrp: according to the docs, that didn't exist before MW 1.15
12:25 <emijrp> or just a link to the main page
12:25 <Nemo_bis> emijrp: also note https://bugzilla.wikimedia.org/show_bug.cgi?id=29918#c1
12:25 <emijrp> http://wikipapers.referata.com/wiki/Main_Page#footer
12:26 <Nemo_bis> the link is bad, might disappear any time
12:27 <Nemo_bis> hmm is wikistats.wmflabs.org down?
12:27 <emijrp> you said to link Project:About
12:28 <Nemo_bis> no, I'd just include its content
12:28 <Nemo_bis> whatever it is, can't harm
12:28 <Nemo_bis> (if there's no license URL)
12:30 <Nemo_bis> btw I'm still without a proper PC right now but I have dumps on an external HDD and the connection now works (actually it's freer than usual), so I can upload lots of stuff
12:46 <emijrp> if you upload a file with the same name to an item, are they overwritten
12:46 <emijrp> ?
12:48 <Nemo_bis> emijrp: this I don't know
12:49 <emijrp> the description is not
12:49 <Nemo_bis> emijrp: description and other metadata are never overwritten unless you add that ignore-bucket something
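(The "ignore bucket something" is, as far as I know, the x-archive-ignore-preexisting-bucket header of IA's S3-like API, which makes a PUT redefine an existing item's metadata. A sketch in the style of uploader.py, which shells out to curl; the keys.txt layout of access key and secret key on two lines is an assumption.)

    import os

    # assumed layout: line 1 = IA S3 access key, line 2 = secret key
    accesskey, secretkey = open('keys.txt').read().strip().split('\n')[:2]

    os.system('curl --location '
              '--header "x-amz-auto-make-bucket:1" '
              '--header "x-archive-ignore-preexisting-bucket:1" '
              '--header "authorization: LOW %s:%s" '
              '--upload-file wiki-example-wikidump.7z '
              'http://s3.us.archive.org/wiki-example/wiki-example-wikidump.7z'
              % (accesskey, secretkey))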
12:50 <Nemo_bis> emijrp: https://www.mediawiki.org/w/index.php?title=Special:Contributions/Nemo_bis&offset=&limit=5&target=Nemo+bis
12:51 <Nemo_bis> So, I think licenseurl should contain whatever "url" is; additionally, if it's not a link to creativecommons.org or fsf.org, you should fetch it and add it to the "rights" field
12:51 <Nemo_bis> additionally, it would probably be best to add to "rights" whatever MediaWiki:Copyright contains even if the wiki uses a license, to get cases like the referata wiki footer you linked
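(That rule as a Python 2 sketch; get_rights_info is the sketch above, and matching only on the creativecommons.org / fsf.org host suffix is my own simplification.)

    import urllib2
    import urlparse

    def license_fields(license_url, rights_text):
        fields = {'licenseurl': license_url or '', 'rights': rights_text or ''}
        if license_url:
            host = urlparse.urlparse(license_url).netloc
            if not host.endswith(('creativecommons.org', 'fsf.org')):
                # unfamiliar license page: keep its content in "rights" too
                try:
                    fields['rights'] += '\n' + urllib2.urlopen(license_url,
                                                               timeout=30).read()
                except urllib2.URLError:
                    pass
        return fields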
12:52 <emijrp> some people add the copyright info to the main page, Project:About, Project:Copyright, etc.; it's a mess
12:55 <emijrp> you can run a batch of 100, and check the log for how many wikis haven't got copyright metadata
12:57 <Nemo_bis> yes, I can do such a batch when the script is ready
12:57 <Nemo_bis> is it?
13:00 <Nemo_bis> emijrp: which of those pages are you fetching right now?
13:01 <emijrp> none, just API metadata
13:02 <Nemo_bis> emijrp: can you add at least MediaWiki:Copyright?
13:02 <Nemo_bis> shouldn't be hard with the API or ?action=raw
13:02 <emijrp> no, I'm going to fetch <li id="copyright"></li> from the main page, I think it's better
13:03 <Nemo_bis> emijrp: what's that?
13:04 <emijrp> http://amsterdam.campuswiki.nl/nl/Hoofdpagina look at the HTML code
13:04 <emijrp> at the bottom
13:04 <Nemo_bis> ah, the footer
13:04 <Nemo_bis> but this doesn't let you discover if it's customised or not
13:04 <Nemo_bis> you should include it only if it's custom, otherwise it's worthless crap which doesn't let us find the wikis which actually need info
13:05 <emijrp> that <li> is from old MediaWiki
13:05 <emijrp> prior to 1.12
13:06 <emijrp> 1.15*
13:06 <Nemo_bis> emijrp: ah, so you're going to use it only in those cases, because there's no API?
13:06 <emijrp> yes
13:06 <Nemo_bis> ok
13:07 <Nemo_bis> when there's an API, it's still worth including MediaWiki:Copyright iff existing
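(Both sources in one Python 2 sketch: MediaWiki:Copyright via ?action=raw where it exists, and the <li id="copyright"> footer of the main page for old pre-1.15 wikis without a usable API; the helper names are mine.)

    import re
    import urllib2

    def mediawiki_copyright(index_url):
        # index_url is the wiki's index.php
        try:
            raw = urllib2.urlopen(index_url + '?title=MediaWiki:Copyright&action=raw',
                                  timeout=30).read()
            return raw.strip() or None
        except urllib2.URLError:  # includes the 404 of a missing page
            return None

    def footer_copyright(mainpage_url):
        try:
            html = urllib2.urlopen(mainpage_url, timeout=30).read()
        except urllib2.URLError:
            return None
        m = re.search(r'(?s)<li id="copyright">(.*?)</li>', html)
        return m.group(1).strip() if m else None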
13:21 <emijrp> in the originalurl field, the API or the main page link?
13:22 <emijrp> Nemo_bis:
13:23 <Nemo_bis> emijrp: hm?
13:24 <balrog_> are there any wikiteam tools for DokuWiki?
13:24 <Nemo_bis> dump it in the "rights" IA field
13:24 <Nemo_bis> balrog_: not yet
13:24 <balrog_> :/ ok
13:24 <Nemo_bis> AFAIK
13:25 <emijrp> Nemo_bis: I mean the originalurl field
13:25 <emijrp> http://archive.org/details/wiki-amsterdamcampuswikinl_wiki
13:29 <Nemo_bis> emijrp: the API
13:30 <Nemo_bis> that's what we decided, I don't remember all the reasons but one of them is that the script path is non-trivial
13:31 <Nemo_bis> emijrp: so you manage to extract the licenseurl even if there's no API now?
13:32 <emijrp> I copy the copyright info from the HTML footer, and link to #footer
13:32 <emijrp> anyway, there are wikis without copyright info at all
13:33 <emijrp> first the API; if that fails, then the HTML; if that fails, skip the dump
13:33 <Nemo_bis> emijrp: skip??
13:34 <emijrp> if there is no license data, upload anyway?
13:34 <Nemo_bis> yes
13:34 <Nemo_bis> but tag it in some way so that we can review it later
13:34 <emijrp> ok
13:35 <emijrp> add an unknowncopyright keyword
13:35 <Nemo_bis> for instance NO_COPYRIGHT_INFO in the rights field
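(The agreed pipeline in one place, as a Python 2 sketch on top of the helpers above: API first, footer second, and when both come up empty, upload anyway but tag for review. The subject string is an assumption.)

    def license_metadata(api_url, mainpage_url):
        try:
            url, text = get_rights_info(api_url)      # sketched earlier
        except Exception:
            url, text = None, None
        if not (url or text):
            text = footer_copyright(mainpage_url)     # sketched earlier
        if not (url or text):
            # no license info anywhere: upload anyway, but mark for review
            return {'rights': 'NO_COPYRIGHT_INFO',
                    'subject': 'wiki;wikiteam;unknowncopyright'}
        return {'licenseurl': url or '', 'rights': text or ''}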
13:35 <Nemo_bis> yeah, whatever
13:37 <Nemo_bis> emijrp: are you adding the footer content to the 'rights' field even if you fetch the license from the API?
13:37 <emijrp> no
13:38 <Nemo_bis> emijrp: it would be better, if MediaWiki:Copyright is custom
13:38 <Nemo_bis> (or just the raw text of the message)
13:39 <Nemo_bis> emijrp: the copyright message in the footer was introduced with Domas Mituzas 2006-01-22 00:49:58 +0000 670) 'copyright' => 'Content is available under $1.',
13:44 <Nemo_bis> ah no, it just was in another file
13:46 <emijrp> yes, but $1 is defined in LocalSettings.php
13:46 <emijrp> and showed in the footer
13:46 <emijrp> you can't read the MediaWiki:Copyright licence
13:46 <emijrp> shown* license*
13:50 <emijrp> Nemo_bis: tell me a field for the dump date
13:51 <Nemo_bis> emijrp: I know you can't, but the text can still contain info
13:51 <Nemo_bis> like in your referata wiki
13:51 <Nemo_bis> while the link is already provided by the API
13:51 <Nemo_bis> now looking for a proper field
13:54 <Nemo_bis> emijrp: 'addeddate' would seem ok but I still have to dig
13:55 <emijrp> 'date'
13:57 <Nemo_bis> emijrp: no, that's for the date of creation
13:57 <Nemo_bis> but the content is created before the dump
13:57 <emijrp> creation of what?
13:57 <emijrp> creation of the dump
13:57 <Nemo_bis> it would be like replacing the date of publication of a book with the date of scanning
13:58 <Nemo_bis> no, creation of the work
13:58 <emijrp> downloaddate, backupdate...
13:59 <emijrp> dumpdate
14:00 <emijrp> ?
14:01 <Nemo_bis> emijrp: better to use existing ones
14:01 <emijrp> addeddate?
14:02 <Nemo_bis> no, addeddate is wrong
14:02 <Nemo_bis> it's already used by the system on upload
14:02 <Nemo_bis> so, let's see
14:03 <Nemo_bis> emijrp: how about a last-updated-date
14:03 <Nemo_bis> it could contain the date of the last dump
14:04 <emijrp> anyway, if more than one dump is allowed per item, date is nonsense
14:04 <emijrp> the date is in the filenames
14:04 <Nemo_bis> emijrp: it's necessarily nonsense for wikis
14:04 <emijrp> no field for date
14:04 <Nemo_bis> it should be replaced with something like 2001-2012
14:05 <Nemo_bis> emijrp: however, a way to see when the dump was last updated would be useful
14:06 <Nemo_bis> at some point we could use it to select wikis to be redownloaded
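(Custom metadata travels to IA's S3 endpoint in x-archive-meta-* headers, so a last-updated-date could be set at upload time like this; Python 2 sketch, and the field name is the one proposed here, not a standard IA field.)

    import time

    def meta_headers(metadata):
        # turn a metadata dict into the curl --header arguments
        return ' '.join('--header "x-archive-meta-%s:%s"' % (k, v)
                        for k, v in metadata.items())

    print meta_headers({'last-updated-date': time.strftime('%Y-%m-%d'),
                        'mediatype': 'web'})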
14:06
π
|
emijrp |
ok |
14:14
π
|
emijrp |
i think you can use it Nemo_bis http://code.google.com/p/wikiteam/source/browse/trunk/uploader.py |
14:14
π
|
emijrp |
you need keys.txt file |
14:14
π
|
emijrp |
download it in the same directory you have the dumps |
14:14
π
|
emijrp |
and listxxx.txt is required |
14:15
π
|
emijrp |
im going to add instructions here http://code.google.com/p/wikiteam/w/edit/TaskForce |
14:19
π
|
Nemo_bis |
emijrp: so it uploads only what one has in the list? |
14:19
π
|
Nemo_bis |
(just to check) |
14:19
π
|
emijrp |
yes |
14:19
π
|
Nemo_bis |
oki |
14:19
π
|
* |
Nemo_bis eager to try |
14:22
π
|
emijrp |
instructions http://code.google.com/p/wikiteam/wiki/TaskForce#footer |
14:23
π
|
emijrp |
try |
14:28
π
|
emijrp |
Nemo_bis: works? |
14:31
π
|
emijrp |
balrog_: dokuwiki is not supported, really only mediawiki is supported |
14:31
π
|
balrog_ |
I've used the mediawiki tool and it works well |
14:37
π
|
emijrp |
do you know any dokuwiki wikifarm? |
14:38
π
|
balrog_ |
well, I was thinking of the MESS wiki, but that's a mess |
14:38
π
|
balrog_ |
because it was lost with no backups |
14:38
π
|
balrog_ |
so it might have to be scraped from IA :( |
14:44
π
|
emijrp |
sure |
14:51
π
|
emijrp |
Nemo_bis is fapping watching the script to upload a fuckton of wikis |
15:08
π
|
Nemo_bis |
emijrp: been busy, now I should tri |
15:10
π
|
Nemo_bis |
emijrp: first batch includes a 100 GB wiki, it won't overwritten right? |
15:11
π
|
emijrp |
how overwritten? uploader script just read from harddisk |
15:12
π
|
Nemo_bis |
emijrp: I mean on archive.org |
15:12
π
|
Nemo_bis |
I don't want to reupload that huge dump |
15:12
π
|
emijrp |
i dont know |
15:12
π
|
Nemo_bis |
hmm |
15:12
π
|
emijrp |
then just remove it from listxxx.txt |
15:12
π
|
Nemo_bis |
let's try another batch first |
15:12
π
|
Nemo_bis |
right |
15:13
π
|
emijrp |
remove from list those dumps you uploaded in the past |
15:13
π
|
Nemo_bis |
I'm so smart, I had already removed it |
15:13
π
|
emijrp |
but do not make svn commit |
15:14
π
|
emijrp |
you should use different directories for svn and the taskforce downlaods |
15:14
π
|
Nemo_bis |
IndexError: list index out of range |
15:14
π
|
Nemo_bis |
sure |
15:14
π
|
Nemo_bis |
hmm |
15:14
π
|
Nemo_bis |
wait |
15:15
π
|
emijrp |
paste the entire error |
15:15
π
|
Nemo_bis |
no, I was just stupid |
15:16
π
|
Nemo_bis |
emijrp: http://archive.org/details/wiki-citwikioberlinedu |
15:17
π
|
Nemo_bis |
we used to include the dots though |
15:17
π
|
emijrp |
no -wikidump.7z ? |
15:17
π
|
Nemo_bis |
dunno if it was a good idea |
15:17
π
|
Nemo_bis |
emijrp: http://www.us.archive.org/log_show.php?task_id=117329967 |
15:17
π
|
Nemo_bis |
you should see it now |
15:19
π
|
Nemo_bis |
afk for some mins |
15:31
π
|
Nemo_bis |
hmpf IOError: [Errno socket error] [Errno 110] Connection timed out |
15:31
π
|
Nemo_bis |
now let's see what happens if I just rerun |
15:32
π
|
Nemo_bis |
emijrp: ^ |
15:33
π
|
Nemo_bis |
file gets overwitten |
15:34
π
|
emijrp |
no -wikidump.7z |
15:34
π
|
Nemo_bis |
there isn't any probably |
15:34
π
|
emijrp |
i have used the script to upload 3 small wikis and all is ok |
15:35
π
|
emijrp |
perhaps i may move the urllib2 query inside the try catch |
15:35
π
|
emijrp |
to avoid time out errors |
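(What moving the urllib2 query inside the try amounts to: a dead or slow wiki costs only its own metadata instead of killing the whole batch. Python 2 sketch.)

    import socket
    import urllib2

    def fetch_siteinfo(api_url):
        try:
            return urllib2.urlopen(api_url + '?action=query&meta=siteinfo&format=xml',
                                   timeout=60).read()
        except (urllib2.URLError, socket.timeout) as e:
            # log and carry on with the next wiki in the batch
            print 'Error fetching %s (%s), skipping metadata' % (api_url, e)
            return None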
15:35 <emijrp> probably some wikis are dead now
15:36 <Nemo_bis> hm, there's an images directory but no archive
15:37 <Nemo_bis> later I'll have to check all the archives, recompress if needed and reupload
15:37 <Nemo_bis> so I seriously need the script not to reupload stuff which is already on archive.org
15:37 <Nemo_bis> it timed out again
15:38 <Nemo_bis> http://p.defau.lt/?as62zqO_kO6K8Duh_jCG1Q
15:40 <emijrp> yep, cloverfield.despoiler.org/api.php
15:40 <emijrp> no response
15:40 <Nemo_bis> it didn't fail before
15:41 <Nemo_bis> http://archive.org/details/wiki-cloverfielddespoilerorg
15:42 <emijrp> erratic server
15:42 <emijrp> update the uploader script, I added a try/except
15:43 <Nemo_bis> emijrp: how about reuploads?
15:44 <emijrp> ?
15:45 <Nemo_bis> emijrp: does it still reupload stuff already uploaded?
15:45 <emijrp> yes
15:45 <Nemo_bis> hmpf
15:45 <emijrp> is there any command to skip? in curl?
15:46 <emijrp> I thought you didn't upload dumps
15:46 <emijrp> that's why you needed the script
15:46 <Nemo_bis> sure, but for instance now I have to manually edit the list of files every time I have to rerun it
15:47 <Nemo_bis> and for this wiki I must reupload the 250 MB archive to also upload the small history 7z
15:47 <emijrp> ok
15:48 <Nemo_bis> perhaps it would be easier to use the script to put the metadata in a CSV file for their bulk uploader?
15:48 <Nemo_bis> there must be some command though :/
15:49 <emijrp> the uploader creates a log with the dump status
15:49 <emijrp> I will read it first, to skip those uploaded before
15:58 <emijrp> Nemo_bis: try now, it skips uploaded dumps
15:58 <emijrp> but you must keep the .log
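(The skip logic as a Python 2 sketch; the log name and the 'wiki;status' line format are assumptions, not necessarily uploader.py's real format.)

    import os

    def load_uploaded(logname='uploader-log.txt'):
        # collect the wikis already marked as successfully uploaded
        done = set()
        if os.path.exists(logname):
            for line in open(logname):
                parts = line.strip().split(';')
                if len(parts) >= 2 and parts[1] == 'ok':
                    done.add(parts[0])
        return done

    # in the main loop:
    # if wiki in load_uploaded(): continue  # don't re-upload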
15:59 <Nemo_bis> emijrp: ok
15:59 <Nemo_bis> emijrp: are you putting the "lang" from siteinfo in the "language" field?
16:04 <emijrp> nope
16:04 <emijrp> I will add it later
16:47 <SketchCow> Quick question.
16:48 <SketchCow> Who is James Michael Dupont? One of you guys?
16:59 <Nemo_bis> SketchCow: somehow
17:00 <Nemo_bis> he's a Wikipedian who's archiving some speedy-deleted articles to a Wikia wiki
17:00 <Nemo_bis> (from en.wikipedia)
17:02 <Nemo_bis> SketchCow: is he flooding IA?
17:02 <Nemo_bis> he wanted to add XML exports of articles to the wikiteam collection but it should be in a subcollection at least (if any)
17:20 <SketchCow> He is not.
17:20 <SketchCow> He just asked for access to stuff.
17:20 <SketchCow> I can give him a subcollection
17:20 <SketchCow> just making sure we're in the relatively correct location.
17:24 <Nemo_bis> SketchCow: we don't know him very well; of course you own the house, but IMvHO better if he's not an admin of wikiteam
17:24 <Nemo_bis> a subcollection would be good
17:28 <underscor> I've traded a lot of emails, I trust him.
17:28 <underscor> but a subcollection would organize it better
17:29 <Nemo_bis> yep
17:30 <SketchCow> Give him a subcollection, then
17:30 <SketchCow> I'll link it into archiveteam's wikiteam as a whole
18:33 <emijrp> Nemo_bis: language must be 'en' or 'English'?
18:33 <emijrp> the api returns 'en'
18:33 <emijrp> and making a converter is a pain
18:36 <emijrp> I can convert the basic ones
18:39 <emijrp> ok, new version of the uploader, detecting languages
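(A converter for the basic ones needs little more than a dict mapping MediaWiki's ISO codes to names for IA's language field; the table below is illustrative, not the uploader's actual list.)

    LANGUAGES = {
        'de': 'German', 'en': 'English', 'es': 'Spanish', 'fr': 'French',
        'it': 'Italian', 'ja': 'Japanese', 'nl': 'Dutch', 'pl': 'Polish',
        'pt': 'Portuguese', 'ru': 'Russian', 'zh': 'Chinese',
    }

    def language_name(code):
        # 'en-gb' -> 'English'; unknown codes fall back to the raw code
        return LANGUAGES.get(code.lower().split('-')[0], code)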
18:39 <emijrp> update
18:47 <SketchCow> Hey, emijrp
18:48 <SketchCow> We're giving the wikideletion guy a subcollection pointing to the wikiteam collection
18:48 <SketchCow> So he can add the stuff but it's not in the main thing and he's not an admin
18:48 <SketchCow> Also, I closed new accounts on archiveteam.org while we clean
18:49 <emijrp> ok SketchCow, about the wikideletion guy
18:50 <emijrp> about at.org, I wouldn't care so much about deleting the user accounts as about deleting the spam pages created by them
18:51 <emijrp> also, an antispam extension is needed; the main issue is bot registrations, and the extensions can be triggered in those cases, or when adding an external link, etc.
18:52 <emijrp> http://www.mediawiki.org/wiki/Extension:ConfirmEdit + http://www.mediawiki.org/wiki/Extension:QuestyCaptcha
18:53 <SketchCow> Well, I want deleted accounts AND deleted spam pages.
18:54 <emijrp> $wgCaptchaQuestions[] = array( 'question' => "...", 'answer' => "..." );
18:54 <emijrp> $wgCaptchaTriggers['createaccount'] = true;
19:15 <emijrp> SketchCow: also, we are grabbing Wikimedia Commons entirely, month by month http://archive.org/details/wikimediacommons-200507
19:15 <emijrp> 12 months * 7 years = 84 items
19:16 <SketchCow> !
19:16 <emijrp> not sure if a subcollection is desired
19:16 <SketchCow> Yeah, it will be.
19:16 <SketchCow> But for now, just keep going.
19:17 <emijrp> http://archive.org/search.php?query=%22wikimedia%20commons%20grab%22
19:18 <SketchCow> http://archive.org/details/wikimediacommons exists already, apparently.
19:20 <emijrp> but that is an item, not a subcollection, right?
19:20 <emijrp> it was created by Hydriz probably, but he saw it would be many TB for a single item, and separated it into months
19:22 <SketchCow> It's a subcollection now.
19:22 <SketchCow> I'll move the crap over
19:24 <SketchCow> Done.
19:24 <SketchCow> It'll filter into place across the next 5-40 minutes.
19:27 <SketchCow> http://archive.org/search.php?query=collection%3Awikimediacommons&sort=-publicdate
19:30 <emijrp> can it be a subcollection inside wikiteam?
19:31 <SketchCow> It is.
19:31 <SketchCow> http://archive.org/details/wikimediacommons
19:32 <SketchCow> if you click on all the items, they're there; archive.org has these walkers that go through and shore up collections later.
19:32 <SketchCow> http://ia601202.us.archive.org/zipview.php?zip=/22/items/wikimediacommons-200607/2006-07-01.zip&file=2006%2F07%2F01%2FHooters_Girls_calendar_signing_at_Kandahar.jpg
19:32 <SketchCow> SAVED FOREVER
19:32 <emijrp> ah ok, it is now shown here http://archive.org/details/wikiteam may be lag
19:33 <emijrp> not*
19:33 <SketchCow> Yeah, not lag.
19:33 <SketchCow> See, a thing goes through over time and shores up collection sets, cleans messes, etc.
19:33 <emijrp> ok
19:33 <SketchCow> That's one of archive.org's weirdnesses that's culturally difficult for folks - how stuff kind of fades into view
19:35 <SketchCow> It's not 1/0
19:35 <emijrp> the zip explorer is very useful and cool in this case
19:35 <emijrp> and the .xml files are opened in firefox too, so you can read the image description and license
19:41 <emijrp> pure win
20:06 <emijrp> anyway, some files were lost in the wikimedia servers http://upload.wikimedia.org/wikipedia/commons/archive/b/b8/20050415212201!SMS_Bluecher.jpg and they are saved as empty files in the grab
20:49 <SketchCow> Frau Bluecher!!! (whinny)
20:58 <Nemo_bis> emijrp: AFAIK language codes work
20:58 <Nemo_bis> emijrp: launched the script for 3557 more wikis
20:58 <Nemo_bis> not all of them actually downloaded, of course
20:59 <emijrp> cool
20:59 <Nemo_bis> emijrp: it would be nice if you could redirect the curl output so that I can see the progress info
21:00 <Nemo_bis> I can monitor it with nethogs but it's not quite the same
21:00 <emijrp> do it yourself and commit
21:00 <Nemo_bis> emijrp: don't know exactly what output you're expecting in the logs
21:01 <Nemo_bis> anyway it's not urgent
21:02 <Nemo_bis> SketchCow: what's the command to prevent curl/s3 from replacing an existing file with the same filename?
21:03 <chronomex> -nc --no-clobber
21:05 <Nemo_bis> so easy? :p
21:05 <Nemo_bis> chronomex: are you? I mean when one uploads with s3
21:05 <Nemo_bis> *you sure
21:05 <chronomex> oh
21:05 <chronomex> I was thinking wget
21:06 <chronomex> sorry
21:09 <Nemo_bis> chronomex: I want to avoid overwriting existing files/items on IA with my local ones
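(curl has no server-side no-clobber for uploads, so one workaround is to ask archive.org whether the file already exists before uploading. Python 2 sketch; that /download/ answers with a redirect for existing files is my assumption.)

    import httplib

    def already_uploaded(identifier, filename):
        conn = httplib.HTTPConnection('archive.org')
        conn.request('HEAD', '/download/%s/%s' % (identifier, filename))
        status = conn.getresponse().status
        conn.close()
        # assumed: 302 (redirect to a datanode) when the file exists,
        # 404 when it doesn't
        return status in (200, 302)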
21:09
π
|
Nemo_bis |
good, saturating bandwidth to IA for once |
21:11
π
|
Nemo_bis |
emijrp: it's not the uploader's fault but I wonder if we should do something to catch the duplicates I overlooked: http://archive.org/details/wiki-docwikigumstixcom http://archive.org/details/wiki-docwikigumstixorg |
21:12
π
|
Nemo_bis |
emijrp: a pretty feature would be downloading the wiki logo and upload it to the item as well |
21:12
π
|
Nemo_bis |
(surely not prioritary :p) |
21:13
π
|
emijrp |
k |
21:19
π
|
emijrp |
Nemo_bis: add it as a TODO comment in top of uploader.py |
21:19
π
|
Nemo_bis |
emijrp: ok |
21:31
π
|
emijrp |
Nemo_bis: the speedydeletion guy does not upload the entire history, just the last revision |
21:32
π
|
emijrp |
http://speedydeletion.wikia.com/wiki/Salvador_De_Jesus_%28deleted_05_Sep_2008_at_20:47%29 |
21:32
π
|
emijrp |
http://deletionpedia.dbatley.com/w/index.php?title=Salvador_De_Jesus_%28deleted_05_Sep_2008_at_20:47%29&action=history |
21:33
π
|
emijrp |
http://speedydeletion.wikia.com/wiki/Salvador_De_Jesus_%28deleted_05_Sep_2008_at_20:47%29?action=history |
21:33
π
|
emijrp |
i think he really paste the text instead of import |
21:34
π
|
Nemo_bis |
hmmmmmmmm |
21:45
π
|
underscor |
who is alphacorp again? |
21:45
π
|
Nemo_bis |
underscor: Hydriz |
21:45
π
|
underscor |
thx |
21:46
π
|
Nemo_bis |
hm? http://p.defau.lt/?puN_G_zKXbv1lz9TfSliPg |
21:47
π
|
underscor |
Nemo_bis: is that a space in the identifier? |
21:48
π
|
Nemo_bis |
underscor: it shouldn't, let me check |
21:49
π
|
Nemo_bis |
underscor: I don't think so http://archive.org/details/wiki-editionorg_w |
21:49
π
|
underscor |
woah, weird |
21:50
π
|
underscor |
it started and then failed |
21:50
π
|
underscor |
delete and retry? |
21:50
π
|
Nemo_bis |
it's two very small files, maybe it didn't manage to finish the first before the second |
21:50
π
|
underscor |
ah |
21:50
π
|
underscor |
yeah |
21:50
π
|
underscor |
need a longer pause |
21:52
π
|
Nemo_bis |
underscor: it didn't even manage to set the collection and other metadata http://archive.org/details/wiki-editionorg_w |
22:06
π
|
Nemo_bis |
underscor: weird, just 300 KB more and it works http://archive.org/details/wiki-emiswikioecnk12ohus |
22:16
π
|
Nemo_bis |
underscor: we also have some escaping problem, fancy fixing it while emijrp is offline? :) http://archive.org/details/wiki-encitizendiumorg |
22:16
π
|
Nemo_bis |
shouldn't be too hard |
22:19
π
|
Nemo_bis |
sigh http://archive.org/details/wiki-enecgpediaorg http://archive.org/details/wiki-en.ecgpedia.org |
22:19
π
|
underscor |
Nemo_bis: Working on it |
22:19
π
|
underscor |
the fix is easy, but trying to figure out where we munge the data |
22:22
π
|
Nemo_bis |
underscor: from siteinfo API query IIRC |
22:22
π
|
Nemo_bis |
otherwise meta HTML tags? |
22:22
π
|
underscor |
no, no |
22:22
π
|
underscor |
not your fault |
22:23
π
|
underscor |
somewhere in the s3 pipline, it's double entity-ized |
22:23
π
|
underscor |
pipeline* |