[00:08] *** randomdes has quit IRC (Ping timeout: 268 seconds)
[00:10] *** randomdes has joined #wikiteam
[01:14] *** ta9le has quit IRC (Quit: Connection closed for inactivity)
[07:23] *** midas1 has quit IRC (Read error: Operation timed out)
[07:23] *** svchfoo3 has quit IRC (Read error: Operation timed out)
[07:24] *** midas1 has joined #wikiteam
[07:25] *** svchfoo3 has joined #wikiteam
[07:25] *** svchfoo1 sets mode: +o svchfoo3
[11:24] *** svchfoo3 has quit IRC (Read error: Operation timed out)
[11:26] *** ta9le has joined #wikiteam
[11:27] *** svchfoo3 has joined #wikiteam
[11:28] *** svchfoo1 sets mode: +o svchfoo3
[12:50] *** netchup has joined #wikiteam
[12:50] *** netchup has quit IRC (Client Quit)
[12:53] *** netchup has joined #wikiteam
[12:53] hi
[12:54] *** netchupp has joined #wikiteam
[12:54] *** netchupp has quit IRC (Client Quit)
[12:55] i am new here, dumps generated by dumpgenerator.py are visible via WB Machine ?
[12:58] JAA: can we talk? uzerus here :)
[12:59] Oh, hey netchup, long time no see. Your ArchiveBot job for those school websites is *still* running (almost seven months now).
[12:59] I have no idea regarding this project, I'm just idling here.
[13:00] ah, ok :) can you just give me the id of ArchiveBot job ?
[13:01] netchup: That's a5l5nek576o746i75mvi27j31. It's pretty close to finishing though, should be done within August.
[13:04] Nice to see that :) now i know i did good job, few k domains are saved :) but i cannot find all of them, did what i can
[13:07] netchup: no, dumps need MediaWiki to be parsed into HTML
[13:09] Nemo_bis: Was WikiTeam ever archiving in WARC format?
[13:10] No, never
[13:10] Although at some point someone in ArchiveTeam launched some warrior project for wikis (I have no idea what was in it or what came out of it)
[13:11] Huh
[13:11] I asked mainly because the AT wiki prominently says "WikiTeam (WARC format)" on the homepage.
[13:11] Yeah, it could be that whoever put that job up never wrapped it up
[13:12] It's been in the warrior for a long time (is it still?)
[13:12] To download all the possible HTML representations that MediaWiki offers for the data in our dumps it would probably take billions of HTTP requests
[13:14] I see.
[13:15] Yeah, it would definitely be big.
[13:15] And it would require careful ignores, e.g. Special:Log, Special:RecentChanges, Special:NewImages (I think that one's an extension), etc.
[13:16] I'd say 5.2 billions at a minimum https://wikistats.wmflabs.org/
[13:16] Considering one per page (+ most recent history) and one per revision (+diff)
[13:17] Yeah, though I'd say that the edit page is more useful than the diff one.
[13:18] But the revisions are not usable without the diff
[13:18] How so?
[13:18] By looking at the final HTML you have no idea what the edit changed
[13:18] Unless you run some diffing locally
[13:18] Yeah, that's true, you'd have to run a diff afterwards.
[13:19] We ignored diffs in ArchiveBot anyway because it doubles (approximately) the number of URLs that need to be retrieved.
[13:19] Wikia alone would be 1G pages multiplied by some 2 MB (their HTML is incredibly bloated) uncompressed
[13:19] While the XML is "only" 850 GiB https://archive.org/details/wikia_dump_20180602
[13:21] Speaking of which, a good use of a server with some hundreds GiB of disk would be to run again that Wikia archival with the latest WikiTeam code :)
[13:21] Part of the latest archival was done with some bugs re API requests
[13:37] arkiver: Do you know more about WikiTeam and WARCs?
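A back-of-the-envelope version of the 13:16 estimate, sketched in Python since that is what the WikiTeam tooling uses. The page and revision totals below are placeholders chosen only so the result lands near the quoted 5.2 billion; the real figures live at https://wikistats.wmflabs.org/. Only the counting scheme (one request per page plus its latest history, one per revision plus its diff) and the 13:19 Wikia figures come from the conversation.

    # Rough request count for mirroring every HTML view MediaWiki offers,
    # following the scheme from 13:16: one request per page plus its most
    # recent history, one per revision plus its diff.
    total_pages = 1_300_000_000      # placeholder, not the real wikistats total
    total_revisions = 1_300_000_000  # placeholder, not the real wikistats total
    requests = 2 * total_pages + 2 * total_revisions
    print(f"~{requests / 1e9:.1f} billion HTTP requests")  # ~5.2 with these placeholders

    # Uncompressed size of a Wikia-only crawl, using the 13:19 figures:
    # roughly 1G pages at ~2 MB of HTML each.
    wikia_pages = 1_000_000_000
    avg_html_bytes = 2 * 1024 * 1024
    print(f"~{wikia_pages * avg_html_bytes / 1024**5:.1f} PiB of raw HTML")  # ~1.9 PiB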
[13:47] Nemo_bis: That could be done in chunks, right? I.e. it wouldn't be necessary to store all 850 GiB at the same time.
[13:47] As far as I can see, it's one dump file per wiki anyway.
[13:48] So grab [0-9] wikis, compress, upload, delete, grab "a" wikis, compress, upload, delete, etc.
[14:11] In theory yes, but given a wiki might take a totally random amount of space from 100 KB to 250 GB that's rarely useful.
[14:11] Then I'm open to other people experimenting with their own methods. :)
[14:11] An individual wiki, sure, but a large number of wikis should be fairly predictable (comparatively, at least).
[14:12] dunno
[14:12] I have no evidence of size being randomly distributed
[14:13] With some effort it can be enough to have 300 GiB of disk or so
[14:13] Personally I preferred to spend some dozens € more on disk space and save myself hours of pointless work
[14:14] Yeah, makes sense.
[14:15] How were the Wikia wikis discovered, by the way?
[14:16] There's an API to list them all
[14:16] Ah, nice.
[14:16] https://github.com/WikiTeam/wikiteam/blob/master/listsofwikis/mediawiki/wikia.py
[15:36] how items are added to wiki warrior?
[15:43] Although at some point someone in ArchiveTeam launched some warrior project for wikis (I have no idea what was in it or what came out of it)
[15:43] that was me
[15:43] we archived external links
[15:44] https://archive.org/details/archiveteam_wiki
[15:45] not much happened with the project, but I'm totally for getting it running again
[15:45] with more wikis to archive external URLs from and maybe archive all wikis themselves?
[15:46] If diffs only double the number of URLs, then we could get them anyway, since we don't really have a deadline for this
[15:46] arkiver: can i help you someway ?
[15:46] not sure
[15:47] Nemo_bis: JAA: are you fine with this being the 'official' channel of the warrior project too?
[15:47] I'll try to get some stuff restarted today, code should be pretty much there already for wikimedias
[15:48] err
[15:48] mediawikis
[15:48] Ah, only external links, I see. I'll add that to our wiki somewhere, since it's currently not clear at all what the "WARC format" is referring to.
[15:48] fine, cos urlteam warrior has stopped, for now only newsgrabber is working
[15:49] JAA: I'd like to start archiving wikis themselves too
[15:49] * arkiver is afk for ~30 min
[15:49] what about dumps ? can we use them for archiving wikis?
[15:50] i mean the manual job that has been done
[15:51] some of wikis probably cannot be accessible in WB, and are dead. Dumps we grabbed can help with that i think
[15:51] (i am not a programmer, so if you can please clear that)
[15:52] netchup: We won't fabricate data for the Wayback Machine. But someone could take the dumps, set them up again somewhere as a wiki, and then we could archive those. That would be distinct from the original wiki though.
[15:58] for archiving wikis, you probably need a list of them
[15:59] this is where i can help a little
[16:02] arkiver: sure, there's no problem in using this channel :)
[16:03] I'm not sure how the WARC collections pageviews are calculated, but 21M is not bad https://archive.org/details/archiveteam_wiki&tab=about
[16:04] I wonder why they became vanishingly small in 2017 compared to 2016. Wayback machine superseding them with its own archives?
[16:04] Or just bots. 99,99 % of views come from California (?)
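A minimal sketch of the bucketed grab/compress/upload/delete loop floated at 13:47-13:48, again in Python. The list file name, bucket directories, and IA item identifiers here are made up; the dumpgenerator.py options are the standard --api/--xml/--images ones, and "7z" plus "ia" (the internetarchive command-line client) are assumed to be installed. The WikiTeam repo's own launcher.py and uploader.py already cover much of this workflow, so treat the sketch as an illustration of the chunking idea rather than a replacement for them.

    import os
    import shutil
    import subprocess
    from itertools import groupby

    DUMPGENERATOR = os.path.abspath("dumpgenerator.py")  # assumes it sits next to this script

    def bucket(api_url):
        # "[0-9]" wikis first, then "a" wikis, "b" wikis, ... as suggested at 13:48.
        host = api_url.split("//", 1)[-1]
        return "0-9" if host[:1].isdigit() else host[:1].lower()

    # Hypothetical input: one API URL per line, e.g. as produced by
    # listsofwikis/mediawiki/wikia.py from the WikiTeam repo.
    with open("wikia_list.txt") as f:
        wikis = [line.strip() for line in f if line.strip()]
    wikis.sort(key=bucket)

    for key, group in groupby(wikis, key=bucket):
        workdir = f"bucket_{key}"
        os.makedirs(workdir, exist_ok=True)
        for api_url in group:
            # Drop --images if disk space is the main concern.
            subprocess.run(["python", DUMPGENERATOR, f"--api={api_url}",
                            "--xml", "--images"], cwd=workdir, check=False)
        archive = f"wikia_{key}.7z"
        subprocess.run(["7z", "a", archive, workdir], check=True)
        # "ia" is the internetarchive CLI; the item name is made up.
        subprocess.run(["ia", "upload", f"wikia_bucket_{key}", archive], check=True)
        os.remove(archive)
        shutil.rmtree(workdir)  # free the disk before the next bucket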
[16:06] strange
[16:34] I'm not totally sure about this, but all the data you get through the Wayback Machine first undergoes some processing, for example to rewrite URLs
[16:34] That means someone using the Wayback Machine will not directly download something from the WARCs, but indirectly through IA servers.
[16:34] right
[16:35] it's processed to insert the javascript header thingy, among other stuff
[16:35] Yes
[16:35] So that would be where the 99.99% California downloads come from
[16:36] i'm seeing lots of california downloads also for non-warc collections
[16:36] shrug emoji
[16:37] As for the drop in views, URLs closest to a requested timestamp are sent to a user. IA has never stopped archiving, while we have not been working on these URLs anymore
[16:37] astrid: Possibly search related, I'll ask at IA.
[16:37] cool :)
[16:38] i'm curious but not enough to bother someone
[20:03] *** balrog_ has joined #wikiteam
[20:09] *** balrog has quit IRC (Ping timeout: 960 seconds)
[20:09] *** balrog_ is now known as balrog
[22:05] *** netchup has quit IRC (Quit: http://www.mibbit.com ajax IRC Client)
[22:11] https://wikiapiary.com/wiki/Websites "Total pages -8,982,570,193,088,097,861"
[22:11] Nice
[22:11] The total number of edits also seems slightly unrealistic: 552,816,861,540,660,473
[22:20] arkiver: Can you change the description of https://archive.org/details/archiveteam_wiki ? Since that's the archive of external links, not the wikis themselves...
[22:20] I'm updating our wiki page on WikiTeam right now.
[23:23] *** ta9le has quit IRC (Quit: Connection closed for inactivity)
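A quick sanity check on the WikiApiary totals quoted at 22:11, to show why they read as broken counters rather than real numbers. The arithmetic only establishes that the "total pages" figure is negative and within a few percent of the smallest signed 64-bit value; whether the underlying cause is a wrapped counter or garbage reported by individual wikis is a guess.

    # WikiApiary totals as quoted at 22:11.
    total_pages = -8_982_570_193_088_097_861
    total_edits = 552_816_861_540_660_473

    int64_min = -2**63
    print(int64_min)                # -9223372036854775808
    print(total_pages / int64_min)  # ~0.97: within a few percent of the 64-bit floor
    print(f"{total_edits:.1e}")     # ~5.5e+17 edits, far beyond anything plausible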