[00:08] *** randomdes has quit IRC (Ping timeout: 268 seconds)
[00:10] *** randomdes has joined #wikiteam
[01:14] *** ta9le has quit IRC (Quit: Connection closed for inactivity)
[07:23] *** midas1 has quit IRC (Read error: Operation timed out)
[07:23] *** svchfoo3 has quit IRC (Read error: Operation timed out)
[07:24] *** midas1 has joined #wikiteam
[07:25] *** svchfoo3 has joined #wikiteam
[07:25] *** svchfoo1 sets mode: +o svchfoo3
[11:24] *** svchfoo3 has quit IRC (Read error: Operation timed out)
[11:26] *** ta9le has joined #wikiteam
[11:27] *** svchfoo3 has joined #wikiteam
[11:28] *** svchfoo1 sets mode: +o svchfoo3
[12:50] *** netchup has joined #wikiteam
[12:50] *** netchup has quit IRC (Client Quit)
[12:53] *** netchup has joined #wikiteam
[12:53] <netchup> hi
[12:54] *** netchupp has joined #wikiteam
[12:54] *** netchupp has quit IRC (Client Quit)
[12:55] <netchup> i am new here, dumps generated by dumpgenerator.py are visible via WB Machine?
[12:58] <netchup> JAA: can we talk? uzerus here :)
[12:59] <JAA> Oh, hey netchup, long time no see. Your ArchiveBot job for those school websites is *still* running (almost seven months now).
[12:59] <JAA> I have no idea regarding this project, I'm just idling here.
[13:00] <netchup> ah, ok :) can you just give me the id of ArchiveBot job?
[13:01] <JAA> netchup: That's a5l5nek576o746i75mvi27j31. It's pretty close to finishing though, should be done within August.
[13:04] <netchup> Nice to see that :) now i know i did good job, few k domains are saved :) but i cannot find all of them, did what i can
[13:07] <Nemo_bis> netchup: no, dumps need MediaWiki to be parsed into HTML
[13:09] <JAA> Nemo_bis: Was WikiTeam ever archiving in WARC format?
[13:10] <Nemo_bis> No, never
[13:10] <Nemo_bis> Although at some point someone in ArchiveTeam launched some warrior project for wikis (I have no idea what was in it or what came out of it)
[13:11] <JAA> Huh
[13:11] <JAA> I asked mainly because the AT wiki prominently says "WikiTeam (WARC format)" on the homepage.
[13:11] <Nemo_bis> Yeah, it could be that whoever put that job up never wrapped it up
[13:12] <Nemo_bis> It's been in the warrior for a long time (is it still?)
[13:12] <Nemo_bis> To download all the possible HTML representations that MediaWiki offers for the data in our dumps it would probably take billions of HTTP requests
[13:14] <JAA> I see.
[13:15] <JAA> Yeah, it would definitely be big.
[13:15] <JAA> And it would require careful ignores, e.g. Special:Log, Special:RecentChanges, Special:NewImages (I think that one's an extension), etc.
[13:16] <Nemo_bis> I'd say 5.2 billions at a minimum https://wikistats.wmflabs.org/
[13:16] <Nemo_bis> Considering one per page (+ most recent history) and one per revision (+diff)
[13:17] <JAA> Yeah, though I'd say that the edit page is more useful than the diff one.
[13:18] <Nemo_bis> But the revisions are not usable without the diff
[13:18] <JAA> How so?
[13:18] <Nemo_bis> By looking at the final HTML you have no idea what the edit changed
[13:18] <Nemo_bis> Unless you run some diffing locally
[13:19] <JAA> Yeah, that's true, you'd have to run a diff afterwards.
[13:19] <JAA> We ignored diffs in ArchiveBot anyway because it doubles (approximately) the number of URLs that need to be retrieved.
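
The "careful ignores" and skipped diffs discussed here translate, in practice, into URL patterns excluded from a MediaWiki crawl. A minimal sketch in Python; the regexes are illustrative only, not ArchiveBot's actual ignore set:

import re

# Illustrative ignore patterns for crawling a MediaWiki site into WARC.
# These are a sketch, not ArchiveBot's actual ruleset.
IGNORE_PATTERNS = [
    r"[?&]title=Special:(Log|RecentChanges|NewImages)\b",  # dynamic special pages
    r"/Special:(Log|RecentChanges|NewImages)\b",           # same pages via short URLs
    r"[?&]diff=\d+",                                       # diff views roughly double the URL count
]

def should_ignore(url):
    """Return True if the URL matches any ignore pattern."""
    return any(re.search(pattern, url) for pattern in IGNORE_PATTERNS)

# Example: should_ignore("https://example.org/index.php?title=Special:RecentChanges") -> True
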
[13:19] <Nemo_bis> Wikia alone would be 1G pages multiplied by some 2 MB (their HTML is incredibly bloated) uncompressed
[13:21] <Nemo_bis> While the XML is "only" 850 GiB https://archive.org/details/wikia_dump_20180602
[13:21] <Nemo_bis> Speaking of which, a good use of a server with some hundreds GiB of disk would be to run again that Wikia archival with the latest WikiTeam code :)
[13:21] <Nemo_bis> Part of the latest archival was done with some bugs re API requests
[13:37] <JAA> arkiver: Do you know more about WikiTeam and WARCs?
[13:47] <JAA> Nemo_bis: That could be done in chunks, right? I.e. it wouldn't be necessary to store all 850 GiB at the same time.
[13:47] <JAA> As far as I can see, it's one dump file per wiki anyway.
[13:48] <JAA> So grab [0-9] wikis, compress, upload, delete, grab "a" wikis, compress, upload, delete, etc.
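
That chunked workflow could be sketched roughly as below. Everything beyond the log is an assumption: a wikia_api_urls.txt with one api.php URL per line, WikiTeam's dumpgenerator.py in the working directory and run under Python 2, dump directories named *-wikidump, and uploads via the `ia` CLI with made-up item identifiers.

import glob
import itertools
import os
import shutil
import subprocess
from urllib.parse import urlparse

def bucket(api_url):
    """Group wikis by the first character of their hostname: digits together, then a, b, c, ..."""
    host = urlparse(api_url).netloc.lower()
    return "0-9" if not host or host[0].isdigit() else host[0]

with open("wikia_api_urls.txt") as f:
    urls = sorted((line.strip() for line in f if line.strip()), key=bucket)

for letter, chunk in itertools.groupby(urls, key=bucket):
    for api in chunk:
        # One XML + images dump per wiki.
        subprocess.run(["python2", "dumpgenerator.py", "--api=" + api, "--xml", "--images"])
    # Compress whatever this chunk produced, upload it, then free the disk.
    archive = "wikia_dumps_%s.7z" % letter
    dumps = glob.glob("*-wikidump")
    subprocess.run(["7z", "a", archive] + dumps, check=True)
    subprocess.run(["ia", "upload", "wikia-dump-chunk-%s" % letter, archive], check=True)
    for path in dumps:
        shutil.rmtree(path)
    os.remove(archive)
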
[14:11] <Nemo_bis> In theory yes, but given a wiki might take a totally random amount of space from 100 KB to 250 GB that's rarely useful.
[14:11] <Nemo_bis> Then I'm open to other people experimenting with their own methods. :)
[14:11] <JAA> An individual wiki, sure, but a large number of wikis should be fairly predictable (comparatively, at least).
[14:12] <Nemo_bis> dunno
[14:12] <Nemo_bis> I have no evidence of size being randomly distributed
[14:13] <Nemo_bis> With some effort it can be enough to have 300 GiB of disk or so
[14:13] <Nemo_bis> Personally I preferred to spend some dozens € more on disk space and save myself hours of pointless work
[14:14] <JAA> Yeah, makes sense.
[14:15] <JAA> How were the Wikia wikis discovered, by the way?
[14:16] <Nemo_bis> There's an API to list them all
[14:16] <JAA> Ah, nice.
[14:16] <Nemo_bis> https://github.com/WikiTeam/wikiteam/blob/master/listsofwikis/mediawiki/wikia.py
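
For illustration, walking such a paginated listing API might look like the sketch below. The endpoint, parameter names, and response fields are assumptions, not necessarily what the linked wikia.py does.

import requests

# Assumed endpoint and parameters; the real Wikia/Fandom API may differ.
API = "https://community.fandom.com/api/v1/Wikis/List"

def list_wikis(limit=250):
    """Yield wiki base URLs by paging through the (assumed) listing endpoint."""
    batch = 1
    while True:
        resp = requests.get(API, params={"limit": limit, "batch": batch}, timeout=60)
        resp.raise_for_status()
        items = resp.json().get("items", [])
        if not items:
            break
        for wiki in items:
            yield wiki.get("url")  # assumed field carrying the wiki's base URL
        batch += 1

if __name__ == "__main__":
    with open("wikia_wikis.txt", "w") as out:
        for url in list_wikis():
            if url:
                out.write(url + "\n")
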
[15:36] <netchup> how items are added to wiki warrior?
[15:43] <arkiver> <Nemo_bis> Although at some point someone in ArchiveTeam launched some warrior project for wikis (I have no idea what was in it or what came out of it)
[15:43] <arkiver> that was me
[15:43] <arkiver> we archived external links
[15:44] <arkiver> https://archive.org/details/archiveteam_wiki
[15:45] <arkiver> not much happened with the project, but I'm totally for getting it running again
[15:45] <arkiver> with more wikis to archive external URLs from and maybe archive all wikis themselves?
[15:46] <arkiver> If diffs only double the number of URLs, then we could get them anyway, since we don't really have a deadline for this
[15:46] <netchup> arkiver: can i help you someway?
[15:46] <arkiver> not sure
[15:47] <arkiver> Nemo_bis: JAA: are you fine with this being the 'official' channel of the warrior project too?
[15:47] <arkiver> I'll try to get some stuff restarted today, code should be pretty much there already for wikimedias
[15:48] <arkiver> err
[15:48] <arkiver> mediawikis
[15:48] <JAA> Ah, only external links, I see. I'll add that to our wiki somewhere, since it's currently not clear at all what the "WARC format" is referring to.
[15:48] <netchup> fine, cos urlteam warrior has stopped, for now only newsgrabber is working
[15:49] <arkiver> JAA: I'd like to start archiving wikis themselves too
[15:49] * arkiver is afk for ~30 min
[15:49] <netchup> what about dumps? can we use them for archiving wikis?
[15:50] <netchup> i mean the manual job that has been done
[15:51] <netchup> some of wikis probably cannot be accessible in WB, and are dead. Dumps we grabbed can help with that i think
[15:51] <netchup> (i am not a programmer, so if you can please clear that)
[15:52] <JAA> netchup: We won't fabricate data for the Wayback Machine. But someone could take the dumps, set them up again somewhere as a wiki, and then we could archive those. That would be distinct from the original wiki though.
[15:58] <netchup> for archiving wikis, you probably need a list of them
[15:59] <netchup> this is where i can help a little
[16:02] <Nemo_bis> arkiver: sure, there's no problem in using this channel :)
[16:03] <Nemo_bis> I'm not sure how the WARC collections pageviews are calculated, but 21M is not bad https://archive.org/details/archiveteam_wiki&tab=about
[16:04] <Nemo_bis> I wonder why they became vanishingly small in 2017 compared to 2016. Wayback machine superseding them with its own archives?
[16:04] <Nemo_bis> Or just bots. 99,99 % of views come from California (?)
[16:06] <astrid> strange
[16:34] <arkiver> I'm not totally sure about this, but all the data you get through the Wayback Machine first undergoes some processing to for example rewrite URLs
[16:34] <arkiver> That means someone using the Wayback Machine will not directly download something from the WARCs, but indirectly through IA servers.
[16:34] <astrid> right
[16:35] <astrid> it's processed to insert the javascript header thingy, among other stuff
[16:35] <arkiver> Yes
[16:35] <arkiver> So that would be where the 99.99% California downloads comes from
[16:36] <astrid> i'm seeing lots of california downloads also for non-warc collections
[16:36] <astrid> shrug emoji
[16:37] <arkiver> As for the drop in views, URLs closest to a requested timestamp are sent to a user. IA has never stopped archiving, while we have not been working on these URLs anymore
[16:37] <arkiver> astrid: Possibly search related, I'll ask at IA.
[16:38] <astrid> cool :)
[16:38] <astrid> i'm curious but not enough to bother someone
[20:03] *** balrog_ has joined #wikiteam
[20:09] *** balrog has quit IRC (Ping timeout: 960 seconds)
[20:09] *** balrog_ is now known as balrog
[22:05] *** netchup has quit IRC (Quit: http://www.mibbit.com ajax IRC Client)
[22:11] <JAA> https://wikiapiary.com/wiki/Websites "Total pages -8,982,570,193,088,097,861"
[22:11] <JAA> Nice
[22:11] <JAA> The total number of edits also seems slightly unrealistic: 552,816,861,540,660,473
[22:20] <JAA> arkiver: Can you change the description of https://archive.org/details/archiveteam_wiki ? Since that's the archive of external links, not the wikis itself...
[22:20] <JAA> I'm updating our wiki page on WikiTeam right now.
[23:23] *** ta9le has quit IRC (Quit: Connection closed for inactivity)