#wikiteam 2018-08-02,Thu

Time Nickname Message
00:08 πŸ”— randomdes has quit IRC (Ping timeout: 268 seconds)
00:10 πŸ”— randomdes has joined #wikiteam
01:14 πŸ”— ta9le has quit IRC (Quit: Connection closed for inactivity)
07:23 πŸ”— midas1 has quit IRC (Read error: Operation timed out)
07:23 πŸ”— svchfoo3 has quit IRC (Read error: Operation timed out)
07:24 πŸ”— midas1 has joined #wikiteam
07:25 πŸ”— svchfoo3 has joined #wikiteam
07:25 πŸ”— svchfoo1 sets mode: +o svchfoo3
11:24 πŸ”— svchfoo3 has quit IRC (Read error: Operation timed out)
11:26 πŸ”— ta9le has joined #wikiteam
11:27 πŸ”— svchfoo3 has joined #wikiteam
11:28 πŸ”— svchfoo1 sets mode: +o svchfoo3
12:50 πŸ”— netchup has joined #wikiteam
12:50 πŸ”— netchup has quit IRC (Client Quit)
12:53 πŸ”— netchup has joined #wikiteam
12:53 πŸ”— netchup hi
12:54 πŸ”— netchupp has joined #wikiteam
12:54 πŸ”— netchupp has quit IRC (Client Quit)
12:55 πŸ”— netchup i am new here, are dumps generated by dumpgenerator.py visible via the WB Machine?
12:58 πŸ”— netchup JAA: can we talk? uzerus here :)
12:59 πŸ”— JAA Oh, hey netchup, long time no see. Your ArchiveBot job for those school websites is *still* running (almost seven months now).
12:59 πŸ”— JAA I have no idea regarding this project, I'm just idling here.
13:00 πŸ”— netchup ah, ok :) can you just give me the id of ArchiveBot job ?
13:01 πŸ”— JAA netchup: That's a5l5nek576o746i75mvi27j31. It's pretty close to finishing though, should be done within August.
13:04 πŸ”— netchup Nice to see that :) now i know i did good job, few k domains are saved :) but i cannot find all of them, did what i can
13:07 πŸ”— Nemo_bis netchup: no, dumps need MediaWiki to be parsed into HTML
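For context on that answer: dumpgenerator.py produces a MediaWiki XML export containing raw wikitext and metadata rather than rendered HTML, which is why it cannot simply be replayed in the Wayback Machine. A minimal sketch of reading such a dump, assuming a hypothetical file name and a typical export schema version:

```python
import xml.etree.ElementTree as ET

# Namespace of the MediaWiki XML export schema; the exact version (0.10 here)
# differs between MediaWiki releases, so treat it as an assumption.
NS = "{http://www.mediawiki.org/xml/export-0.10/}"

# "examplewiki-history.xml" is a placeholder file name, not a real dump.
for _, elem in ET.iterparse("examplewiki-history.xml"):
    if elem.tag == NS + "page":
        title = elem.findtext(NS + "title")
        text = elem.findtext(NS + "revision/" + NS + "text") or ""
        print(title, "->", text[:60].replace("\n", " "))  # raw wikitext, not HTML
        elem.clear()
```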
13:09 πŸ”— JAA Nemo_bis: Was WikiTeam ever archiving in WARC format?
13:10 πŸ”— Nemo_bis No, never
13:10 πŸ”— Nemo_bis Although at some point someone in ArchiveTeam launched some warrior project for wikis (I have no idea what was in it or what came out of it)
13:11 πŸ”— JAA Huh
13:11 πŸ”— JAA I asked mainly because the AT wiki prominently says "WikiTeam (WARC format)" on the homepage.
13:11 πŸ”— Nemo_bis Yeah, it could be that whoever put that job up never wrapped it up
13:12 πŸ”— Nemo_bis It's been in the warrior for a long time (is it still?)
13:12 πŸ”— Nemo_bis To download all the possible HTML representations that MediaWiki offers for the data in our dumps it would probably take billions of HTTP requests
13:14 πŸ”— JAA I see.
13:15 πŸ”— JAA Yeah, it would definitely be big.
13:15 πŸ”— JAA And it would require careful ignores, e.g. Special:Log, Special:RecentChanges, Special:NewImages (I think that one's an extension), etc.
13:16 πŸ”— Nemo_bis I'd say 5.2 billion at a minimum https://wikistats.wmflabs.org/
13:16 πŸ”— Nemo_bis Considering one per page (+ most recent history) and one per revision (+diff)
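A back-of-the-envelope version of that estimate, using the multipliers from the two lines above (one request per page plus its history view, one per revision plus its diff); the totals below are arbitrary placeholders, not figures from wikistats:

```python
# Rough request count for mirroring a wiki's HTML views.
def estimate_requests(pages, revisions):
    page_views = 2 * pages          # rendered page + most recent history
    revision_views = 2 * revisions  # old revision view + its diff
    return page_views + revision_views

# Arbitrary placeholder totals, NOT real numbers from https://wikistats.wmflabs.org/:
print(f"{estimate_requests(pages=1_000_000, revisions=20_000_000):,}")  # 42,000,000
```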
13:17 πŸ”— JAA Yeah, though I'd say that the edit page is more useful than the diff one.
13:18 πŸ”— Nemo_bis But the revisions are not usable without the diff
13:18 πŸ”— JAA How so?
13:18 πŸ”— Nemo_bis By looking at the final HTML you have no idea what the edit changed
13:18 πŸ”— Nemo_bis Unless you run some diffing locally
13:18 πŸ”— JAA Yeah, that's true, you'd have to run a diff afterwards.
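A sketch of the local diffing mentioned here, using Python's standard difflib on two stored revisions; the revision texts are invented examples:

```python
import difflib

# Two archived revisions of the same page (placeholder content).
rev_1 = "Hello world.\nThis is the first revision.\n"
rev_2 = "Hello world.\nThis is the second revision.\nWith an extra line.\n"

# Reconstruct "what the edit changed" without fetching any diff pages.
for line in difflib.unified_diff(rev_1.splitlines(), rev_2.splitlines(),
                                 fromfile="rev_1", tofile="rev_2", lineterm=""):
    print(line)
```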
13:19 πŸ”— JAA We ignored diffs in ArchiveBot anyway because it doubles (approximately) the number of URLs that need to be retrieved.
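A sketch of the kind of URL ignores being discussed (the Special: pages from earlier plus diff links); the pattern list is purely illustrative, not ArchiveBot's actual ignore set:

```python
import re

# Illustrative ignore patterns for crawling a MediaWiki site.
IGNORE_PATTERNS = [
    r"Special:(Log|RecentChanges|NewImages)\b",  # noisy special pages (pretty or ?title= URLs)
    r"[?&]diff=\d+",                             # revision diffs, roughly doubling the URL count
]
IGNORE_RES = [re.compile(p) for p in IGNORE_PATTERNS]

def should_ignore(url: str) -> bool:
    return any(r.search(url) for r in IGNORE_RES)

print(should_ignore("https://example.org/w/index.php?title=Special:RecentChanges"))  # True
print(should_ignore("https://example.org/wiki/Main_Page"))                           # False
```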
13:19 πŸ”— Nemo_bis Wikia alone would be 1G pages multiplied by some 2 MB (their HTML is incredibly bloated) uncompressed
13:19 πŸ”— Nemo_bis While the XML is "only" 850 GiB https://archive.org/details/wikia_dump_20180602
13:21 πŸ”— Nemo_bis Speaking of which, a good use of a server with some hundreds GiB of disk would be to run again that Wikia archival with the latest WikiTeam code :)
13:21 πŸ”— Nemo_bis Part of the latest archival was done with some bugs re API requests
13:37 πŸ”— JAA arkiver: Do you know more about WikiTeam and WARCs?
13:47 πŸ”— JAA Nemo_bis: That could be done in chunks, right? I.e. it wouldn't be necessary to store all 850 GiB at the same time.
13:47 πŸ”— JAA As far as I can see, it's one dump file per wiki anyway.
13:48 πŸ”— JAA So grab [0-9] wikis, compress, upload, delete, grab "a" wikis, compress, upload, delete, etc.
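A minimal sketch of that chunked workflow; the dump/compress/upload steps are placeholders standing in for dumpgenerator.py, a compressor, and an archive.org upload rather than actual WikiTeam tooling:

```python
import string

def dump_wiki(wiki):
    print("dumping", wiki)                    # placeholder for running dumpgenerator.py

def compress_and_upload(wikis):
    print("uploading", len(wikis), "dumps")   # placeholder for 7z + archive.org upload

def delete_local(wikis):
    print("freeing disk for", len(wikis), "dumps")

def archive_in_chunks(all_wikis):
    # Group wikis by the first character of their name so only one group's
    # dumps need to sit on disk at any time.
    for prefix in string.digits + string.ascii_lowercase:
        chunk = [w for w in all_wikis if w.lower().startswith(prefix)]
        if not chunk:
            continue
        for wiki in chunk:
            dump_wiki(wiki)
        compress_and_upload(chunk)
        delete_local(chunk)

archive_in_chunks(["alpha.example.org", "beta.example.org", "7days.example.org"])
```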
14:11 πŸ”— Nemo_bis In theory yes, but given a wiki might take a totally random amount of space from 100 KB to 250 GB that's rarely useful.
14:11 πŸ”— Nemo_bis Then I'm open to other people experimenting with their own methods. :)
14:11 πŸ”— JAA An individual wiki, sure, but a large number of wikis should be fairly predictable (comparatively, at least).
14:12 πŸ”— Nemo_bis dunno
14:12 πŸ”— Nemo_bis I have no evidence of size being randomly distributed
14:13 πŸ”— Nemo_bis With some effort it can be enough to have 300 GiB of disk or so
14:13 πŸ”— Nemo_bis Personally I preferred to spend some dozens € more on disk space and save myself hours of pointless work
14:14 πŸ”— JAA Yeah, makes sense.
14:15 πŸ”— JAA How were the Wikia wikis discovered, by the way?
14:16 πŸ”— Nemo_bis There's an API to list them all
14:16 πŸ”— JAA Ah, nice.
14:16 πŸ”— Nemo_bis https://github.com/WikiTeam/wikiteam/blob/master/listsofwikis/mediawiki/wikia.py
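A rough sketch of what that discovery could look like; the endpoint URL and pagination parameters below are assumptions for illustration only, and the actual logic is in the wikia.py script linked above:

```python
import requests

# Assumed listing endpoint and parameter names -- check wikia.py for the real ones.
API = "https://www.wikia.com/api/v1/Wikis/List"

def iter_wikia_wikis(limit=250):
    batch = 1
    while True:
        resp = requests.get(API, params={"limit": limit, "batch": batch}, timeout=60)
        resp.raise_for_status()
        items = resp.json().get("items", [])
        if not items:
            break
        for item in items:
            yield item.get("url")
        batch += 1

for url in iter_wikia_wikis():
    print(url)
```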
15:36 πŸ”— netchup how items are added to wiki warrior?
15:43 πŸ”— arkiver <Nemo_bis> Although at some point someone in ArchiveTeam launched some warrior project for wikis (I have no idea what was in it or what came out of it)
15:43 πŸ”— arkiver that was me
15:43 πŸ”— arkiver we archived external links
15:44 πŸ”— arkiver https://archive.org/details/archiveteam_wiki
15:45 πŸ”— arkiver not much happened with the project, but I'm totally for getting it running again
15:45 πŸ”— arkiver with more wikis to archive external URLs from and maybe archive all wikis themselves?
15:46 πŸ”— arkiver If diffs only double the number of URLs, then we could get them anyway, since we don't really have a deadline for this
15:46 πŸ”— netchup arkiver: can i help you someway ?
15:46 πŸ”— arkiver not sure
15:47 πŸ”— arkiver Nemo_bis: JAA: are you fine with this being the 'official' channel of the warrior project too?
15:47 πŸ”— arkiver I'll try to get some stuff restarted today, code should be pretty much there already for wikimedias
15:48 πŸ”— arkiver err
15:48 πŸ”— arkiver mediawikis
15:48 πŸ”— JAA Ah, only external links, I see. I'll add that to our wiki somewhere, since it's currently not clear at all what the "WARC format" is referring to.
15:48 πŸ”— netchup fine, cos urlteam warrior has stopped, for now only newsgrabber is working
15:49 πŸ”— arkiver JAA: I'd like to start archiving wikis themselves too
15:49 πŸ”— * arkiver is afk for ~30 min
15:49 πŸ”— netchup what about dumps ? can we use them for archiving wikis?
15:50 πŸ”— netchup i mean the manual job that has been done
15:51 πŸ”— netchup some of wikis probably cannot be accessible in WB, and are dead. Dumps we grabbed can help with that i think
15:51 πŸ”— netchup (i am not a programmer, so if you can please clear that)
15:52 πŸ”— JAA netchup: We won't fabricate data for the Wayback Machine. But someone could take the dumps, set them up again somewhere as a wiki, and then we could archive those. That would be distinct from the original wiki though.
15:58 πŸ”— netchup for archiving wikis, you probably need a list of them
15:59 πŸ”— netchup this is where i can help a little
16:02 πŸ”— Nemo_bis arkiver: sure, there's no problem in using this channel :)
16:03 πŸ”— Nemo_bis I'm not sure how the WARC collections pageviews are calculated, but 21M is not bad https://archive.org/details/archiveteam_wiki&tab=about
16:04 πŸ”— Nemo_bis I wonder why they became vanishingly small in 2017 compared to 2016. Wayback machine superseding them with its own archives?
16:04 πŸ”— Nemo_bis Or just bots. 99,99 % of views come from California (?)
16:06 πŸ”— astrid strange
16:34 πŸ”— arkiver I'm not totally sure about this, but all the data you get through the Wayback Machine first undergoes some processing, for example to rewrite URLs
16:34 πŸ”— arkiver That means someone using the Wayback Machine will not directly download something from the WARCs, but indirectly through IA servers.
16:34 πŸ”— astrid right
16:35 πŸ”— astrid it's processed to insert the javascript header thingy, among other stuff
16:35 πŸ”— arkiver Yes
16:35 πŸ”— arkiver So that would be where the 99.99% California downloads comes from
16:36 πŸ”— astrid i'm seeing lots of california downloads also for non-warc collections
16:36 πŸ”— astrid shrug emoji
16:37 πŸ”— arkiver As for the drop in views, URLs closest to a requested timestamp are sent to a user. IA has never stopped archiving, while we have not been working on these URLs anymore
16:37 πŸ”— arkiver astrid: Possibly search related, I'll ask at IA.
16:37 πŸ”— astrid cool :)
16:38 πŸ”— astrid i'm curious but not enough to bother someone
20:03 πŸ”— balrog_ has joined #wikiteam
20:09 πŸ”— balrog has quit IRC (Ping timeout: 960 seconds)
20:09 πŸ”— balrog_ is now known as balrog
22:05 πŸ”— netchup has quit IRC (Quit: http://www.mibbit.com ajax IRC Client)
22:11 πŸ”— JAA https://wikiapiary.com/wiki/Websites "Total pages -8,982,570,193,088,097,861"
22:11 πŸ”— JAA Nice
22:11 πŸ”— JAA The total number of edits also seems slightly unrealistic: 552,816,861,540,660,473
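Those totals look like broken counters rather than real figures; the negative page count at least fits inside a signed 64-bit integer, which would be consistent with an overflowing or corrupted aggregate:

```python
# Quick sanity check on the quoted "Total pages" value.
INT64_MIN, INT64_MAX = -2**63, 2**63 - 1
total_pages = -8_982_570_193_088_097_861
print(INT64_MIN <= total_pages <= INT64_MAX)  # True: plausible int64 garbage
```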
22:20 πŸ”— JAA arkiver: Can you change the description of https://archive.org/details/archiveteam_wiki ? Since that's the archive of external links, not the wikis itself...
22:20 πŸ”— JAA I'm updating our wiki page on WikiTeam right now.
23:23 πŸ”— ta9le has quit IRC (Quit: Connection closed for inactivity)
