[00:05] *** lelo_paul has quit IRC (Read error: Connection reset by peer)
[00:05] *** paul_lelo has joined #archiveteam-bs
[00:05] *** decay has quit IRC (Read error: Operation timed out)
[00:07] *** decay has joined #archiveteam-bs
[00:12] *** t2t2 has quit IRC (Read error: Operation timed out)
[00:12] *** Fletcher has quit IRC (Read error: Operation timed out)
[00:13] *** t2t2 has joined #archiveteam-bs
[00:13] *** Jordan has quit IRC (Ping timeout: 250 seconds)
[00:16] *** Jordan has joined #archiveteam-bs
[00:21] *** Fletcher has joined #archiveteam-bs
[00:21] *** SmileyG has joined #archiveteam-bs
[00:22] *** Smiley has quit IRC (Read error: Operation timed out)
[00:23] *** hawc145 has joined #archiveteam-bs
[00:26] *** HCross has quit IRC (Ping timeout: 370 seconds)
[00:39] *** BitHippo has quit IRC (Ping timeout: 268 seconds)
[00:40] *** dashcloud has quit IRC (Ping timeout: 244 seconds)
[00:42] *** dashcloud has joined #archiveteam-bs
[00:45] *** vitzli has joined #archiveteam-bs
[01:09] *** Specular has joined #archiveteam-bs
[01:11] *** BartoCH has quit IRC (Ping timeout: 260 seconds)
[01:18] *** BartoCH has joined #archiveteam-bs
[01:41] *** Specular has quit IRC (Ping timeout: 370 seconds)
[01:43] *** Specular has joined #archiveteam-bs
[02:09] *** zenguy has quit IRC (Read error: Operation timed out)
[02:18] *** kvieta has quit IRC (Ping timeout: 260 seconds)
[02:20] SketchCow: found a bug on gifcities - the single result here links to a broken URL: http://gifcities.org/#/foobar
[02:21] *** kvieta has joined #archiveteam-bs
[02:32] *** zenguy has joined #archiveteam-bs
[02:33] *** kvieta has quit IRC (Read error: Operation timed out)
[02:35] *** dx has joined #archiveteam-bs
[02:57] that site is amazing
[03:08] would have expected more results for some queries though
[03:18] *** pikhq has quit IRC (Ping timeout: 255 seconds)
[03:19] *** pikhq has joined #archiveteam-bs
[03:19] *** Start has quit IRC (Read error: Connection reset by peer)
[03:20] *** Start has joined #archiveteam-bs
[03:41] *** Stiletto has joined #archiveteam-bs
[04:04] *** kvieta has joined #archiveteam-bs
[04:12] *** ndiddy has quit IRC (Read error: Connection reset by peer)
[04:20] *** vitzli has quit IRC (Quit: Leaving)
[04:21] *** Sk1d has quit IRC (Ping timeout: 194 seconds)
[04:27] *** Sk1d has joined #archiveteam-bs
[04:45] *** Specular has quit IRC (Ping timeout: 370 seconds)
[04:46] *** Specular has joined #archiveteam-bs
[05:01] looks like the 19631007 issue of Aviation Week doesn't work on their site
[05:02] like no images load
[05:02] http://archive.aviationweek.com/issue/19631007
[05:06] so i uploaded 79k abc.net.au/news/2004 urls
[05:30] *** RichardG has joined #archiveteam-bs
[05:43] *** Specular_ has joined #archiveteam-bs
[05:46] *** Specular has quit IRC (Ping timeout: 370 seconds)
[06:07] *** wp494 has quit IRC (Read error: Operation timed out)
[06:11] *** superkuh has quit IRC (Quit: the neuronal action potential is an electrical manipulation of reversible abrupt phase changes in the lipid bilaye)
[06:16] *** wp494 has joined #archiveteam-bs
[06:25] *** superkuh has joined #archiveteam-bs
[06:33] *** superkuh has quit IRC (Quit: the neuronal action potential is an electrical manipulation of reversible abrupt phase changes in the lipid bilaye)
[06:35] *** Aranje has quit IRC (Quit: Three sheets to the wind)
[06:39] *** superkuh has joined #archiveteam-bs
[06:47] *** GE has joined #archiveteam-bs
[06:51] do dane you're a workhorse
[06:51] godane
[06:52] *** superkuh has quit IRC (Quit: the neuronal action potential is an electrical manipulation of reversible abrupt phase changes in the lipid bilaye)
[06:54] *** superkuh has joined #archiveteam-bs
[06:55] *** RichardG_ has joined #archiveteam-bs
[06:55] *** RichardG has quit IRC (Read error: Connection reset by peer)
[07:01] *** RichardG_ has quit IRC (Ping timeout: 244 seconds)
[07:01] *** RichardG has joined #archiveteam-bs
[07:01] *** closure has quit IRC (Ping timeout: 244 seconds)
[07:02] *** fie has quit IRC (Ping timeout: 244 seconds)
[07:02] *** Frogging has quit IRC (Ping timeout: 244 seconds)
[07:02] *** espes__ has quit IRC (Ping timeout: 244 seconds)
[07:03] *** edsu has quit IRC (Ping timeout: 244 seconds)
[07:03] *** alfiepate has quit IRC (Ping timeout: 244 seconds)
[07:04] *** closure has joined #archiveteam-bs
[07:04] *** Frogging has joined #archiveteam-bs
[07:05] *** jk[SVP] has quit IRC (Ping timeout: 244 seconds)
[07:05] *** alfie has joined #archiveteam-bs
[07:06] *** jk[SVP] has joined #archiveteam-bs
[07:07] *** Mathias` has quit IRC (Ping timeout: 244 seconds)
[07:07] *** Baljem has quit IRC (Ping timeout: 244 seconds)
[07:08] *** espes__ has joined #archiveteam-bs
[07:08] *** RichardG has quit IRC (Read error: Operation timed out)
[07:08] *** Madthias has joined #archiveteam-bs
[07:09] *** closure has quit IRC (Ping timeout: 244 seconds)
[07:09] *** superkuh has quit IRC (Remote host closed the connection)
[07:11] *** edsu has joined #archiveteam-bs
[07:11] *** swebb sets mode: +o edsu
[07:12] *** chfoo has quit IRC (Read error: Operation timed out)
[07:12] *** chfoo has joined #archiveteam-bs
[07:13] *** closure has joined #archiveteam-bs
[07:13] *** Baljem has joined #archiveteam-bs
[07:14] *** SilSte has quit IRC (Read error: Operation timed out)
[07:15] *** RichardG has joined #archiveteam-bs
[07:15] *** Fletcher has quit IRC (Read error: Operation timed out)
[07:16] *** espes___ has joined #archiveteam-bs
[07:17] *** Fletcher has joined #archiveteam-bs
[07:17] *** Whopper has joined #archiveteam-bs
[07:17] *** kristian_ has joined #archiveteam-bs
[07:18] *** espes__ has quit IRC (Read error: Connection reset by peer)
[07:18] *** JW_work1 has joined #archiveteam-bs
[07:18] *** JW_work has quit IRC (Read error: Connection reset by peer)
[07:19] *** obskyr has joined #archiveteam-bs
[07:22] *** Whopper_ has quit IRC (Ping timeout: 633 seconds)
[07:23] *** brayden_ has quit IRC (Ping timeout: 633 seconds)
[07:24] *** decay has quit IRC (Read error: Connection reset by peer)
[07:24] *** decay has joined #archiveteam-bs
[07:25] *** SilSte has joined #archiveteam-bs
[07:26] *** GE has quit IRC (Remote host closed the connection)
[07:28] *** obskyr has quit IRC (Ping timeout: 506 seconds)
[07:34] *** GE has joined #archiveteam-bs
[07:40] *** GE has quit IRC (Remote host closed the connection)
[07:41] *** superkuh has joined #archiveteam-bs
[08:02] *** superkuh has quit IRC (Quit: the neuronal action potential is an electrical manipulation of reversible abrupt phase changes in the lipid bilaye)
[08:11] *** superkuh has joined #archiveteam-bs
[09:10] so i have uploaded 57k items this month
[09:10] it will most likely pass 60k before the month ends
[09:13] *** ravetcofx has quit IRC (Ping timeout: 506 seconds)
[09:29] *** BlueMaxim has quit IRC (Quit: Leaving)
[09:43] *** SilSte has quit IRC (Quit: No Ping reply in 180 seconds.)
[09:44] *** SilSte has joined #archiveteam-bs
[09:48] *** Cameron_D has quit IRC (Ping timeout: 370 seconds)
[09:50] *** icebrain has joined #archiveteam-bs
[09:51] hi! I'm running a warrior, but most of my jobs are sitting idle waiting to upload (it seems the server is overloaded). I have the disk space, is there any way to keep it pulling?
[09:52] *** xmc has quit IRC (Read error: Operation timed out)
[09:52] *** RichardG has quit IRC (Ping timeout: 370 seconds)
[09:53] *** SilSte has quit IRC (Quit: No Ping reply in 180 seconds.)
[09:53] *** SilSte has joined #archiveteam-bs
[09:55] *** xmc has joined #archiveteam-bs
[09:55] *** swebb sets mode: +o xmc
[09:56] *** Cameron_D has joined #archiveteam-bs
[09:57] *** GE has joined #archiveteam-bs
[09:58] *** SilSte has quit IRC (Quit: No Ping reply in 180 seconds.)
[09:59] *** SilSte has joined #archiveteam-bs
[10:03] *** Specular_ has quit IRC (Quit: Leaving)
[10:11] *** SilSte has quit IRC (Quit: No Ping reply in 180 seconds.)
[10:11] *** SilSte has joined #archiveteam-bs
[10:16] *** VADemon has joined #archiveteam-bs
[10:18] *** vineguy has joined #archiveteam-bs
[10:21] *** vineguy has quit IRC (Client Quit)
[10:21] *** SilSte has quit IRC (Quit: No Ping reply in 180 seconds.)
[10:21] *** SilSte has joined #archiveteam-bs
[10:28] icebrain: https://spit.mixtape.moe/view/raw/228c47ed
[10:28] Sent just after you left #panoramio :D
[10:29] #paranormio *
[10:33] *** SilSte has quit IRC (Remote host closed the connection)
[10:34] *** SilSte has joined #archiveteam-bs
[10:35] Aoede, Medowar0: thanks!
[10:37] icebrain: False, there is.
[10:38] Given that there is a 60 second delay between tries, increasing concurrency for both download and upload helps. Or hacking the code to try every 20 seconds.
[10:38] Only use rsync > 1 if you get good upload, else you would block others.
[10:39] I think that using -9 for rsync compression might have some effect, would like to know myself whether it stalls on I/O, CPU, or something else.
[10:45] Yoshimura: thanks, but won't that just put more pressure on the target servers? my objective was more to keep pulling and cache locally until it could upload, not necessarily upload sooner.
[10:45] Not really, the server does impose limits, so it would merely give you more chance.
[10:46] But yeah, the dumbest thing is to increase the number of threads for download.
[10:46] If it was not python I would fix it, but given all the stuff that is done, if I did rewrite it in a different language it would likely not be accepted anyways.
[10:57] Yoshimura: This is a stupid egoistic way, because it does not increase the overall grab speed, just your portion of the grab.
[10:58] Medowar0 somewhat, but then explain to me why some people can get terabytes and some barely a few hundred GB?
[11:00] And yes, I did state it is dumb. And it does not have to be egoistic, it has to do with motivation and sense of purpose and being useful.
[11:00] Because I am running 160 Instances 24/7, since the start of the project. Yes, currently the targets are overloaded, but when I started everything, there was still capacity. And I already reduced the total concurrent.
[11:01] 160 instances, one could say same thing, that's egoistic.
[11:01] Stupid egoistic way. Nothing personal. It's just a different way to get "preferred".
[11:01] yes, as I said, I already reduced it. I started with ~500.
[11:01] But... Please I would like to know what is the bottleneck?
[11:02] the rsync Targets.
[11:02] What type of resource I meant.
[11:02] storage.
[11:02] ok. thanks
[11:04] Kenshin has 60TB banked somewhere, HCross and Kaz are the current targets and constantly uploading, Fos is full, lysantor has 24TB banked.
[11:09] *** SilSte has quit IRC (Quit: No Ping reply in 180 seconds.)
[11:12] *** SilSte has joined #archiveteam-bs
[11:19] *** brayden_ has joined #archiveteam-bs
[11:19] *** swebb sets mode: +o brayden_
[11:23] *** SilSte has quit IRC (Quit: No Ping reply in 180 seconds.)
[11:25] *** SilSte has joined #archiveteam-bs
[11:29] Yeah I know, the upload sucks.
[11:29] Never got to the core of what the bottleneck is, the petastore or the s3 protocol or something else.
[11:30] I have ~900GB I can dedicate to it, I'm assuming it's not enough to make it worthwhile to set up a target server?
[11:30] It could be if you got enough upload.
[11:31] I am not the one to talk though, and in comparison that is small.
[11:31] Just do not mind me, I'm nuts.
[11:32] speedtest.net says 88mbps down/90mbps up
[11:34] Well those beasts are incomparable. I was not here for a while. But last time the problem was the protocol and transcontinental transfer.
[11:35] So having a better solution based on udp than s3, even if it was an intermediate server, would solve that. Russia -> US East = 3Mbit max.
[11:35] And you can have 1Gbps on each side, TCP sucks for long distance and wide pipe.
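(Editor's aside on the long-distance TCP point above.) A single TCP connection cannot move more than one window of data per round trip, so its throughput is bounded by window / RTT no matter how fast the links at either end are. The sketch below works that arithmetic through. The 64 KiB window (no window scaling) and the 150 ms round-trip time are illustrative assumptions, not measurements from the channel, and the function names are hypothetical.

```python
def max_tcp_throughput_mbps(window_bytes: int, rtt_ms: float) -> float:
    """Upper bound on one TCP connection's throughput: window / RTT, in Mbit/s."""
    bits_per_round_trip = window_bytes * 8
    round_trips_per_second = 1000.0 / rtt_ms
    return bits_per_round_trip * round_trips_per_second / 1_000_000


def window_needed_bytes(link_mbps: float, rtt_ms: float) -> float:
    """Bandwidth-delay product: bytes that must be in flight to fill the link."""
    bytes_per_second = link_mbps * 1_000_000 / 8
    return bytes_per_second * (rtt_ms / 1000.0)


if __name__ == "__main__":
    # Assumed values: 65535-byte window, 150 ms intercontinental RTT.
    print(f"{max_tcp_throughput_mbps(65535, 150):.2f} Mbit/s ceiling per connection")
    # Window a 1 Gbit/s path would need at the same assumed RTT.
    print(f"{window_needed_bytes(1000, 150) / 1_000_000:.2f} MB in flight to fill 1 Gbit/s")
```

Under those assumed numbers the ceiling is a few Mbit/s per connection, which is in the same range as the "3Mbit max" figure quoted above. That is also why larger windows, several parallel streams, or a nearby intermediate server can help.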
[11:38] *** kristian_ has quit IRC (Quit: Leaving)
[11:39] I am rethinking my life, if someone would accept my work, so that it would have impact I would start rewriting warrior and stuff. The only thing that might have to be left in place is dupe detection, or have to be tested, as it does use specific stuff to process pages, so total compatibility would be hard. But running grab site myself, stuff gets overloaded, eating too much CPU. Warrior does...
[11:39] ...wait for I/O after each file fetch, etc. I resorted to using a ramdisk; it sped up both up and down. Luckily tmpfs can be swapped and only uses what it needs.
[11:40] That said, if there would be interest let me know please.
[11:50] *** SilSte has quit IRC (Remote host closed the connection)
[11:51] *** SilSte has joined #archiveteam-bs
[11:55] *** SilSte has quit IRC (Client Quit)
[12:01] *** SilSte has joined #archiveteam-bs
[12:04] *** SilSte has quit IRC (Quit: No Ping reply in 180 seconds.)
[12:04] *** SilSte has joined #archiveteam-bs
[12:16] *** SilSte has quit IRC (Quit: No Ping reply in 180 seconds.)
[12:17] *** SilSte has joined #archiveteam-bs
[12:25] *** SilSte has quit IRC (Quit: No Ping reply in 180 seconds.)
[12:26] *** SilSte has joined #archiveteam-bs
[12:27] *** VADemon has quit IRC (Quit: left4dead)
[12:40] *** SilSte has quit IRC (Quit: No Ping reply in 180 seconds.)
[12:45] *** SilSte has joined #archiveteam-bs
[12:48] *** SilSte has quit IRC (Quit: No Ping reply in 180 seconds.)
[12:49] *** SilSte has joined #archiveteam-bs
[12:53] *** SilSte has quit IRC (Quit: No Ping reply in 180 seconds.)
[12:54] *** SilSte has joined #archiveteam-bs
[12:57] *** SilSte has quit IRC (Client Quit)
[13:01] *** SilSte has joined #archiveteam-bs
[13:16] *** SilSte has quit IRC (Quit: No Ping reply in 180 seconds.)
[13:17] *** SilSte has joined #archiveteam-bs
[13:19] *** Yoshimura has quit IRC (Remote host closed the connection)
[13:22] *** SilSte has quit IRC (Quit: No Ping reply in 180 seconds.)
[13:23] *** SilSte has joined #archiveteam-bs
[13:28] *** SilSte has quit IRC (Client Quit)
[13:40] *** SilSte has joined #archiveteam-bs
[13:54] *** SilSte has quit IRC (Quit: No Ping reply in 180 seconds.)
[13:54] *** SilSte has joined #archiveteam-bs
[14:03] *** SilSte has quit IRC (Quit: No Ping reply in 180 seconds.)
[14:04] *** RichardG has joined #archiveteam-bs
[14:04] *** SilSte has joined #archiveteam-bs
[14:08] *** SilSte has quit IRC (Client Quit)
[14:09] *** SilSte has joined #archiveteam-bs
[14:23] *** Start has quit IRC (Quit: Disconnected.)
[14:28] *** GE has quit IRC (Remote host closed the connection)
[14:31] i need to get my fileserver up again... 40TB laying around doing nothing
[14:43] *** RichardG has quit IRC (Ping timeout: 370 seconds)
[14:55] *** Yoshimura has joined #archiveteam-bs
[14:56] *** SilSte has quit IRC (Quit: No Ping reply in 180 seconds.)
[14:58] *** SilSte has joined #archiveteam-bs
[15:00] okay, I'm back
[15:00] vine, what's going on
[15:02] twitter is shutting them down or something?
[15:03] *** SilSte has quit IRC (Quit: No Ping reply in 180 seconds.)
[15:05] *** SilSte has joined #archiveteam-bs
[15:09] yup
[15:12] *** SilSte has quit IRC (Quit: No Ping reply in 180 seconds.)
[15:12] *** SilSte has joined #archiveteam-bs
[15:16] *** SilSte has quit IRC (Client Quit)
[15:16] *** SilSte has joined #archiveteam-bs
[15:28] *** SilSte has quit IRC (Quit: No Ping reply in 180 seconds.)
[15:29] *** SilSte has joined #archiveteam-bs
[15:41] *** kristian_ has joined #archiveteam-bs
[15:49] *** SilSte has quit IRC (Remote host closed the connection)
[15:50] *** SilSte has joined #archiveteam-bs
[15:54] *** SilSte has quit IRC (Remote host closed the connection)
[15:55] *** SilSte has joined #archiveteam-bs
[16:01] *** SilSte has quit IRC (Quit: No Ping reply in 180 seconds.)
[16:08] https://vine.co/v/5u1dDA00uHw
[16:09] *** bsmith093 has quit IRC (Ping timeout: 255 seconds)
[16:30] *** GE has joined #archiveteam-bs
[16:32] *** RichardG has joined #archiveteam-bs
[16:49] *** kristian_ has quit IRC (Quit: Leaving)
[16:59] *** Shakespea has joined #archiveteam-bs
[16:59] Hi all - http://www.bbc.co.uk/news/technology-37788052
[16:59] Another service going away :(
[17:04] that's number 10
[17:09] I think I did quite well to be number two then
[17:09] :D
[17:11] Do WARCs downloaded by archivebot get injected to wayback?
[17:12] yes
[17:12] Thanks
[17:20] *** Shakespea has quit IRC (Quit: ChatZilla 0.9.92 [Firefox 52.0a1/20161028030204])
[17:25] That reminds me I forgot to check my personal archival. How do I gain +v for the bot, again? It feels futile to archive stuff, while you archive knowledge but people cannot find it, cause it does not get to wayback.
[17:29] *** Stilett0 has joined #archiveteam-bs
[17:29] *** superkuh has quit IRC (Quit: the neuronal action potential is an electrical manipulation of reversible abrupt phase changes in the lipid bilaye)
[17:31] *** Stiletto has quit IRC (Read error: Operation timed out)
[17:56] *** RichardG has quit IRC (Read error: Operation timed out)
[17:56] *** RichardG has joined #archiveteam-bs
[18:01] *** Start has joined #archiveteam-bs
[18:20] *** Start has quit IRC (Quit: Disconnected.)
[18:24] *** ndiddy has joined #archiveteam-bs
[18:30] *** ravetcofx has joined #archiveteam-bs
[18:36] *** Start has joined #archiveteam-bs
[18:53] *** bsmith093 has joined #archiveteam-bs
[18:57] *** Start has quit IRC (Quit: Disconnected.)
[19:11] *** DragonDav has joined #archiveteam-bs
[19:15] I can't even begin to imagine the amount of storage needed for all the Vines
[19:16] *** DragonDav is now known as Dragon
[19:35] Get some vines, average, multiply, done.
[19:36] Most sites have stuff in TBs, although from the front it might seem like they are gigantic.
[19:43] Oh yeah with Vine you know exactly how long a video is going to be, and the rough filesize so can get a pretty good extimate
[19:43] * estimate
[19:47] Largest one I've got so far is 3.17MB
[19:50] Random selection of 5 videos yields 1.88MB average file size
[19:53] ah they have supported 140s long videos since this year though which will skew some stats
[19:55] Not sure on videos uploaded
[19:55] 100TB or more then.
[19:56] 39 million as of February, so 2MB * 50 million = 100TB
[19:56] Less than I thought
[19:57] Well, in reality it might be more, plus there are pages. So my guesstimate is 100-200TB. Which is not that much when you consider how "valuable" a resource it is.
[19:58] Yeah my estimate there is based on a really small selection of videos
[19:58] I'm thinking about rewriting the archivebot and stuff, what do you think?
[19:58] I can't find any official stats on videos uploaded on their blog
[19:59] Not like the YouTube press page anyway...
[20:02] *** JW_work has joined #archiveteam-bs
[20:03] idk if Vine is worth hundreds of TB personally
[20:03] Definitely is.
[20:04] Or at least the popular ones are.
[20:04] maybe the popular ones for historical purposes, yeah. I doubt there's much value in the vast majority of these 7 second clips
[20:05] *** JW_work1 has quit IRC (Read error: Operation timed out)
[20:05] but I never actually used Vine and it's not up to any of us anyway :p
[20:06] who are you and what have you done with the archiveteam member named Frogging
[20:06] the one thing we have going for us on the vine front is that videos are quite short unlike twitch
[20:07] that said it wouldn't really matter if there's more volume anyway
[20:07] xmc: did I say something out of character? :p
[20:08] should I put vine in video hosting or social
[20:08] sep332: the gdocs link in the Vine page is dead
[20:08] I'm thinking video but I see reasons for social as well
[20:08] (for the navbox, that is)
[20:11] *** JW_work has quit IRC (Quit: Leaving.)
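(Editor's aside on the Vine storage estimate above.) The "get some vines, average, multiply" method is just average file size times video count. The sketch below reruns the arithmetic with the figures quoted in the channel: a 1.88 MB average from a sample of 5 videos, 39 million videos as of February, and the rounded-up 2 MB and 50 million used for the 100TB figure. It assumes decimal units (1 TB = 1,000,000 MB), and the function name is hypothetical.

```python
def estimate_total_tb(avg_video_mb: float, video_count: int) -> float:
    """Total size in decimal terabytes from an average size in megabytes."""
    mb_per_tb = 1_000_000  # decimal units: 1 TB = 1,000,000 MB
    return avg_video_mb * video_count / mb_per_tb


if __name__ == "__main__":
    # Rounded figures used in the channel: 2 MB per video, 50 million videos.
    print(f"rounded estimate: {estimate_total_tb(2, 50_000_000):.0f} TB")
    # Raw figures quoted: 1.88 MB sample average, 39 million videos.
    print(f"raw estimate: {estimate_total_tb(1.88, 39_000_000):.2f} TB")
```

With a sample of only five videos the average is very uncertain, and the 140 second videos mentioned above would pull it upward, so the 100-200TB range given in the channel is a reasonable way to state the result.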
[20:12] screw it I'll toss it to a poll
[20:13] *** JW_work has joined #archiveteam-bs
[20:16] i'd go with social
[20:16] same
[20:20] social
[20:32] *** BlueMaxim has joined #archiveteam-bs
[21:01] Kaz: thanks, had an extra dot somehow. Fixed
[21:06] Umm is there a way to get a list of pages saved in wayback?
[21:08] *** Fletcher has quit IRC (Ping timeout: 244 seconds)
[21:08] *** will has quit IRC (Ping timeout: 244 seconds)
[21:08] *** Kenshin has quit IRC (Ping timeout: 244 seconds)
[21:08] *** closure has quit IRC (Ping timeout: 244 seconds)
[21:08] *** useretail has quit IRC (Ping timeout: 244 seconds)
[21:08] *** i0npulse has quit IRC (Ping timeout: 244 seconds)
[21:08] *** Medowar has quit IRC (Ping timeout: 244 seconds)
[21:08] *** purplebot has quit IRC (Ping timeout: 244 seconds)
[21:08] *** Rye has quit IRC (Ping timeout: 244 seconds)
[21:08] *** will has joined #archiveteam-bs
[21:08] *** Medowar has joined #archiveteam-bs
[21:08] *** Kenshin has joined #archiveteam-bs
[21:09] *** closure has joined #archiveteam-bs
[21:09] *** Rye has joined #archiveteam-bs
[21:10] *** purplebot has joined #archiveteam-bs
[21:11] *** useretail has joined #archiveteam-bs
[21:13] *** i0npulse has joined #archiveteam-bs
[21:24] *** Fletcher has joined #archiveteam-bs
[21:53] *** RichardG has quit IRC (Read error: Operation timed out)
[21:53] *** RichardG has joined #archiveteam-bs
[22:00] *** GE has quit IRC (Quit: zzz)
[22:09] Yoshimura: list of pages in Wayback - see the CDX server interface (see IA page on the wiki for links)
[22:10] I mean, can one get the list, not only query? (load on server)
[22:11] you mean the list of 253 billion pages?
[22:11] 273 billion
[22:11] sorry, i lost track counting
[22:11] Yes.
[22:11] no.
[22:11] there isn't a separately downloadable list that I know of, no
[22:11] https://archive.org/details/waybackcdx are those?
[22:12] but ... why?
[22:12] "These shards are not publicly downloadable. "
[22:12] Damn. Archival stuff, more intelligent than just fetching stuff and uploading what is already there.
[22:13] *** BartoCH has quit IRC (Ping timeout: 260 seconds)
[22:13] The checksums and pages are of interest to be specific. And also sizes.
[22:13] it's generally felt worth uploading stuff again, for multiple reasons:
[22:13] pages = links
[22:13] 1) gives evidence the stuff was still present at the new date
[22:14] stuff changes over time, and with some exceptions such as large collections of multi-gigabyte files, some duplication isn't fatal
[22:14] I am talking large scale, yes.
[22:14] don't look at the number of copies of jquery-min*.js in wayback if you have OCD
[22:14] 2) while they still have plenty of space, extra duplication of stuff that was interesting enough to grab multiple times is useful
[22:14] *** BartoCH has joined #archiveteam-bs
[22:15] i mean you could make a bloom filter on the contents of the cdx'es
[22:15] ^ this
[22:15] but it's a lot of work to save a little bit of bandwidth
[22:15] A lot.
[22:15] and a lot of bandwidth on the front-side!
[22:15] 3) if/when they get a space crunch, being able to unexpectedly get more merely by removing said duplication is also neat
[22:16] have fun tuning your false positive rate if you decide to go for a Bloom filter
[22:16] That is simple.
[22:16] I do agree that it would be neat to have a publicly downloadable set of *hashes* of all the content in the Wayback machine, though
[22:17] in any case we decided to skip that and it's been doing okay for about three years
[22:17] (although, like improved searching, that increases the risk of people objecting to the inclusion of particular bits of content)
[22:17] or 5-6 years if you include all AT activity
[22:17] Can I just hammer the API about all resources?
[22:17] duplication isn't really an endemic Archive Team problem; it is relatively easy to say "hey you really don't need to grab those gigabytes of ISOs again"
[22:18] if you do it slowly enough (i.e. say, one query per minute) I doubt anyone would care
[22:18] fetch diversity is a bigger one, IMO
[22:18] * JW_work agrees about a need for greater fetch diversity
[22:18] we get *so* *much* *fucking* *shit* about Western technopolitics
[22:18] did I mention that we don't tend to get much from the entire other half of the world
[22:19] the amount of non-English-language stuff we are missing knowing about is … really not good
[22:19] I meant 10-100 per second with memoization
[22:19] Or in case of CDX API wildcards would help of course
[22:20] This would lead to the diversity yes.
[22:20] 100 queries per second would probably get noticed (although it might not).
[22:20] Given my health and my life I am rethinking everything, even my own existence.
[22:20] wait, how would extracting a list of the current contents of the Wayback Machine help with diversity?
[22:20] *** kristian_ has joined #archiveteam-bs
[22:21] when interacting with archive.org: the general rule is, don't make things fall over, don't make it slow for other people, and the archive won't get in your way
[22:21] If I could get info about what pages are missing one can more easily fetch those.
[22:21] but the problem is knowing what's out there, not what's missing from Wayback
[22:22] we just discussed why grabbing stuff that has already been grabbed is harmless
[22:22] If I know what is out there, I can look if it is missing programmatically
[22:22] right, but how do you find what is out there?
[22:23] well, the APIs exist and if you keep the request rate down you'll probably be ok
[22:23] enjoy
[22:23] If you talk large scale it means not grabbing other stuff with those resources.
[22:23] The API gives me hashes, right?
[22:23] JW_work: Crawl data, users, etc. There are multiple facets
[22:23] the CDX api does give "digests", which are (IIRC) sha1 with a weird format
[22:24] okay, -BS ... any francophone people here?
[22:25] Yes, digest. Base32 is not weird, just uncommon.
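(Editor's aside on the digest and Bloom filter discussion above.) The sketch below shows the two pieces being talked about: a Base32-encoded SHA-1 digest of a payload, which is the form described above for the CDX "digest" field, and a small Bloom filter sized from a target false positive rate using the standard formulas m = -n ln p / (ln 2)^2 and k = (m/n) ln 2. The class and function names are hypothetical, the double-hashing scheme built on SHA-256 is one common choice rather than anything the channel specified, and nothing here talks to archive.org.

```python
import base64
import hashlib
import math


def cdx_style_digest(payload: bytes) -> str:
    """Base32-encoded SHA-1 of the payload bytes (32 characters, no padding)."""
    return base64.b32encode(hashlib.sha1(payload).digest()).decode("ascii")


class BloomFilter:
    """Set membership with no false negatives and a tunable false positive rate."""

    def __init__(self, expected_items: int, false_positive_rate: float):
        # Standard sizing: m bits and k hash functions for n items at rate p.
        bits = -expected_items * math.log(false_positive_rate) / (math.log(2) ** 2)
        self.num_bits = max(1, math.ceil(bits))
        self.num_hashes = max(1, round(self.num_bits / expected_items * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, key: str):
        # Double hashing: derive k bit positions from two 64-bit halves of SHA-256.
        raw = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(raw[:8], "big")
        h2 = int.from_bytes(raw[8:16], "big") | 1  # force odd so the stride is never zero
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))


if __name__ == "__main__":
    seen = BloomFilter(expected_items=1000, false_positive_rate=0.01)
    digest = cdx_style_digest(b"<html>example page</html>")
    seen.add(digest)
    print(digest, digest in seen)
    print(seen.num_bits, "bits,", seen.num_hashes, "hash functions")
```

At 1% false positives the filter needs roughly 9.6 bits per item, so even a compact filter over hundreds of billions of captures would run to hundreds of gigabytes. That is consistent with the "a lot of work to save a little bit of bandwidth" remark above.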
[22:25] I don't see how knowing more about what is *in* the Wayback Machine would help you find material that we (ArchiveTeam) don't already know about. It seems like making contacts with other people/groups would be the only way to do that.
[22:26] It helps to discern what is there and what is not when one gets hands on datasets.
[22:32] agreed - but the first job is getting one's hands on more datasets. Once that happens (or even, once specific possibilities are known), *then* figuring out how much is duplicated matters - but not until then
[22:36] *** kniffy has joined #archiveteam-bs
[22:45] So much talking in here.
[22:47] zzz
[22:55] so archive.org only has 273 billion pages when it was 491 billion pages this past july: https://web.archive.org/web/20160713070126/https://archive.org/
[22:56] i expect it's 491B copies of pages, and 273 distinct urls
[22:56] ok
[22:56] its just weird to me
[22:57] yeah it is
[22:57] i'd expect them to be consistent
[22:57] anyways i may be adding close to half a million urls just from abc.net.au
[22:58] yow
[23:00] 89549 urls are in abc.net.au 2006 news sitemap
[23:01] wayback has less than 2000 urls from abc.net.au/news/2006*
[23:02] just the 2003 sitemap urls was more than wayback machine had from 2003 to 2009 urls
[23:03] so i think wayback machine was drunk when going after abc.net.au
[23:06] *** computerf has quit IRC (Bye.)
[23:12] *** computerf has joined #archiveteam-bs
[23:15] As I understand, the "wide" crawls only go a very limited depth down into any particular website - if abc.net.au didn't get included in a higher-priority crawl until after 2009, that would explain it.
[23:15] (assuming my understanding is correct, which it certainly might not be)
[23:18] 502 on Aug 14; 505 on Sept 14; 510 on Oct 14, which is the most recent crawl (and oddly, archive.org can't be saved with Wayback Save)
[23:18] i think the urls changed to the current format in 2011
[23:18] (all in billions)
[23:19] and the current page (that lists the 273 billion number) now links to the blog post that explains the apparent drop
[23:20] it's 273 billion "webpages" and 510 billion "time-stamped web objects" (aka "web captures")
[23:21] wow, so that implies that there are 237 billion captures that *aren't* HTML, plain text or PDF. That's a lot of copies of jQuery. :-)
[23:22] images?
[23:22] (and 404 errors)
[23:22] wait no
[23:22] yes, images
[23:22] ignore the wait, no then
[23:26] (I don't have access to my email right now, so I'll mention this here: http://blog.archive.org/2016/10/24/faqs-for-some-new-features-available-in-the-beta-wayback-machine/#comment-355327 is a spam comment and should probably get removed) Somebody, please forward this on to info@archive
[23:28] *** epicfacet has joined #archiveteam-bs
[23:28] hi
[23:29] so yeah, let us know if it is back
[23:29] we'll archive it again
[23:29] any idea how large the site is when it's fully online?
[23:29] JW_work: I don't see any spammy comments?
[23:29] I only see two comments, though the page implies there's 4
[23:29] the site was huge.
[23:30] it had a backup of the original AoS forums, about 100k members, etc
[23:30] Kaz - the one from onlinepluz is what I was referring to.
[23:30] the game just fell out of popularity in the last year or two
[23:30] how many posts do you think?
[23:30] I have no idea.
[23:30] ok
[23:30] *** JW_work has quit IRC (Quit: Leaving.)
[23:30] well let us know if it's back and we'll have a look at it
[23:31] epicfacet: is http://www.aceofspades.com/community/index related?
[23:31] or is that a different game?
[23:32] what happened is a company bought the rights to the game about a year or two ago, and made that. BnS was a continuation of the original game because so many people didn't like the paid version
[23:32] when they bought the game, they actually shut the original forums down w/o notice. someone was somehow able to grab a copy and put it on there
[23:33] so yeah, quite a history with shutdowns for this game
[23:46] i just saved this: https://web.archive.org/web/*/http://mpegmedia.abc.net.au/local/sydney/201301/r1058576_12374885.mp3
[23:46] mp3 came from this article: https://web.archive.org/web/20130117142641/http://www.abc.net.au/local/stories/2013/01/14/3669278.htm?site=sydney
[23:48] again that sort of feels like a fail since we did that Aaron Swartz collection
[23:49] but at least wayback had a copy of the article from that time
[23:55] *** epicfacet has quit IRC (Quit: Page closed)
[23:55] *** VADemon has joined #archiveteam-bs