#archiveteam-bs 2016-10-28,Fri


Time Nickname Message
00:05 🔗 lelo_paul has quit IRC (Read error: Connection reset by peer)
00:05 🔗 paul_lelo has joined #archiveteam-bs
00:05 🔗 decay has quit IRC (Read error: Operation timed out)
00:07 🔗 decay has joined #archiveteam-bs
00:12 🔗 t2t2 has quit IRC (Read error: Operation timed out)
00:12 🔗 Fletcher has quit IRC (Read error: Operation timed out)
00:13 🔗 t2t2 has joined #archiveteam-bs
00:13 🔗 Jordan has quit IRC (Ping timeout: 250 seconds)
00:16 🔗 Jordan has joined #archiveteam-bs
00:21 🔗 Fletcher has joined #archiveteam-bs
00:21 🔗 SmileyG has joined #archiveteam-bs
00:22 🔗 Smiley has quit IRC (Read error: Operation timed out)
00:23 🔗 hawc145 has joined #archiveteam-bs
00:26 🔗 HCross has quit IRC (Ping timeout: 370 seconds)
00:39 🔗 BitHippo has quit IRC (Ping timeout: 268 seconds)
00:40 🔗 dashcloud has quit IRC (Ping timeout: 244 seconds)
00:42 🔗 dashcloud has joined #archiveteam-bs
00:45 🔗 vitzli has joined #archiveteam-bs
01:09 🔗 Specular has joined #archiveteam-bs
01:11 🔗 BartoCH has quit IRC (Ping timeout: 260 seconds)
01:18 🔗 BartoCH has joined #archiveteam-bs
01:41 🔗 Specular has quit IRC (Ping timeout: 370 seconds)
01:43 🔗 Specular has joined #archiveteam-bs
02:09 🔗 zenguy has quit IRC (Read error: Operation timed out)
02:18 🔗 kvieta has quit IRC (Ping timeout: 260 seconds)
02:20 🔗 joepie91 SketchCow: found a bug on gifcities - the single result here links to a broken URL: http://gifcities.org/#/foobar
02:21 🔗 kvieta has joined #archiveteam-bs
02:32 🔗 zenguy has joined #archiveteam-bs
02:33 🔗 kvieta has quit IRC (Read error: Operation timed out)
02:35 🔗 dx has joined #archiveteam-bs
02:57 🔗 Specular that site is amazing
03:08 🔗 Specular would have expected more results for some queries though
03:18 🔗 pikhq has quit IRC (Ping timeout: 255 seconds)
03:19 🔗 pikhq has joined #archiveteam-bs
03:19 🔗 Start has quit IRC (Read error: Connection reset by peer)
03:20 🔗 Start has joined #archiveteam-bs
03:41 🔗 Stiletto has joined #archiveteam-bs
04:04 🔗 kvieta has joined #archiveteam-bs
04:12 🔗 ndiddy has quit IRC (Read error: Connection reset by peer)
04:20 🔗 vitzli has quit IRC (Quit: Leaving)
04:21 🔗 Sk1d has quit IRC (Ping timeout: 194 seconds)
04:27 🔗 Sk1d has joined #archiveteam-bs
04:45 🔗 Specular has quit IRC (Ping timeout: 370 seconds)
04:46 🔗 Specular has joined #archiveteam-bs
05:01 🔗 godane looks like the 19631007 issue of Aviation Week doesn't work on their site
05:02 🔗 godane like no images load
05:02 🔗 godane http://archive.aviationweek.com/issue/19631007
05:06 🔗 godane so i uploaded 79k abc.net.au/news/2004 urls
05:30 🔗 RichardG has joined #archiveteam-bs
05:43 🔗 Specular_ has joined #archiveteam-bs
05:46 🔗 Specular has quit IRC (Ping timeout: 370 seconds)
06:07 🔗 wp494 has quit IRC (Read error: Operation timed out)
06:11 🔗 superkuh has quit IRC (Quit: the neuronal action potential is an electrical manipulation of reversible abrupt phase changes in the lipid bilaye)
06:16 🔗 wp494 has joined #archiveteam-bs
06:25 🔗 superkuh has joined #archiveteam-bs
06:33 🔗 superkuh has quit IRC (Quit: the neuronal action potential is an electrical manipulation of reversible abrupt phase changes in the lipid bilaye)
06:35 🔗 Aranje has quit IRC (Quit: Three sheets to the wind)
06:39 🔗 superkuh has joined #archiveteam-bs
06:47 🔗 GE has joined #archiveteam-bs
06:51 🔗 jsp12345 godane you're a work horse
06:52 🔗 superkuh has quit IRC (Quit: the neuronal action potential is an electrical manipulation of reversible abrupt phase changes in the lipid bilaye)
06:54 🔗 superkuh has joined #archiveteam-bs
06:55 🔗 RichardG_ has joined #archiveteam-bs
06:55 🔗 RichardG has quit IRC (Read error: Connection reset by peer)
07:01 🔗 RichardG_ has quit IRC (Ping timeout: 244 seconds)
07:01 🔗 RichardG has joined #archiveteam-bs
07:01 🔗 closure has quit IRC (Ping timeout: 244 seconds)
07:02 🔗 fie has quit IRC (Ping timeout: 244 seconds)
07:02 🔗 Frogging has quit IRC (Ping timeout: 244 seconds)
07:02 🔗 espes__ has quit IRC (Ping timeout: 244 seconds)
07:03 🔗 edsu has quit IRC (Ping timeout: 244 seconds)
07:03 🔗 alfiepate has quit IRC (Ping timeout: 244 seconds)
07:04 🔗 closure has joined #archiveteam-bs
07:04 🔗 Frogging has joined #archiveteam-bs
07:05 🔗 jk[SVP] has quit IRC (Ping timeout: 244 seconds)
07:05 🔗 alfie has joined #archiveteam-bs
07:06 🔗 jk[SVP] has joined #archiveteam-bs
07:07 🔗 Mathias` has quit IRC (Ping timeout: 244 seconds)
07:07 🔗 Baljem has quit IRC (Ping timeout: 244 seconds)
07:08 🔗 espes__ has joined #archiveteam-bs
07:08 🔗 RichardG has quit IRC (Read error: Operation timed out)
07:08 🔗 Madthias has joined #archiveteam-bs
07:09 🔗 closure has quit IRC (Ping timeout: 244 seconds)
07:09 🔗 superkuh has quit IRC (Remote host closed the connection)
07:11 🔗 edsu has joined #archiveteam-bs
07:11 🔗 swebb sets mode: +o edsu
07:12 🔗 chfoo has quit IRC (Read error: Operation timed out)
07:12 🔗 chfoo has joined #archiveteam-bs
07:13 🔗 closure has joined #archiveteam-bs
07:13 🔗 Baljem has joined #archiveteam-bs
07:14 🔗 SilSte has quit IRC (Read error: Operation timed out)
07:15 🔗 RichardG has joined #archiveteam-bs
07:15 🔗 Fletcher has quit IRC (Read error: Operation timed out)
07:16 🔗 espes___ has joined #archiveteam-bs
07:17 🔗 Fletcher has joined #archiveteam-bs
07:17 🔗 Whopper has joined #archiveteam-bs
07:17 🔗 kristian_ has joined #archiveteam-bs
07:18 🔗 espes__ has quit IRC (Read error: Connection reset by peer)
07:18 🔗 JW_work1 has joined #archiveteam-bs
07:18 🔗 JW_work has quit IRC (Read error: Connection reset by peer)
07:19 🔗 obskyr has joined #archiveteam-bs
07:22 🔗 Whopper_ has quit IRC (Ping timeout: 633 seconds)
07:23 🔗 brayden_ has quit IRC (Ping timeout: 633 seconds)
07:24 🔗 decay has quit IRC (Read error: Connection reset by peer)
07:24 🔗 decay has joined #archiveteam-bs
07:25 🔗 SilSte has joined #archiveteam-bs
07:26 🔗 GE has quit IRC (Remote host closed the connection)
07:28 🔗 obskyr has quit IRC (Ping timeout: 506 seconds)
07:34 🔗 GE has joined #archiveteam-bs
07:40 🔗 GE has quit IRC (Remote host closed the connection)
07:41 🔗 superkuh has joined #archiveteam-bs
08:02 🔗 superkuh has quit IRC (Quit: the neuronal action potential is an electrical manipulation of reversible abrupt phase changes in the lipid bilaye)
08:11 🔗 superkuh has joined #archiveteam-bs
09:10 🔗 godane so i have uploaded 57k items this month
09:10 🔗 godane it will most likely pass 60k before the month ends
09:13 🔗 ravetcofx has quit IRC (Ping timeout: 506 seconds)
09:29 🔗 BlueMaxim has quit IRC (Quit: Leaving)
09:43 🔗 SilSte has quit IRC (Quit: No Ping reply in 180 seconds.)
09:44 🔗 SilSte has joined #archiveteam-bs
09:48 🔗 Cameron_D has quit IRC (Ping timeout: 370 seconds)
09:50 🔗 icebrain has joined #archiveteam-bs
09:51 🔗 icebrain hi! I'm running a warrior, but most of my jobs are sitting idle waiting to upload (it seems the server is overloaded). I have the disk space, is there any way to keep it pulling?
09:52 🔗 xmc has quit IRC (Read error: Operation timed out)
09:52 🔗 RichardG has quit IRC (Ping timeout: 370 seconds)
09:53 🔗 SilSte has quit IRC (Quit: No Ping reply in 180 seconds.)
09:53 🔗 SilSte has joined #archiveteam-bs
09:55 🔗 xmc has joined #archiveteam-bs
09:55 🔗 swebb sets mode: +o xmc
09:56 🔗 Cameron_D has joined #archiveteam-bs
09:57 🔗 GE has joined #archiveteam-bs
09:58 🔗 SilSte has quit IRC (Quit: No Ping reply in 180 seconds.)
09:59 🔗 SilSte has joined #archiveteam-bs
10:03 🔗 Specular_ has quit IRC (Quit: Leaving)
10:11 🔗 SilSte has quit IRC (Quit: No Ping reply in 180 seconds.)
10:11 🔗 SilSte has joined #archiveteam-bs
10:16 🔗 VADemon has joined #archiveteam-bs
10:18 🔗 vineguy has joined #archiveteam-bs
10:21 🔗 vineguy has quit IRC (Client Quit)
10:21 🔗 SilSte has quit IRC (Quit: No Ping reply in 180 seconds.)
10:21 🔗 SilSte has joined #archiveteam-bs
10:28 🔗 Aoede icebrain: https://spit.mixtape.moe/view/raw/228c47ed
10:28 🔗 Aoede Sent just after you left #panoramio :D
10:29 🔗 Aoede #paranormio *
10:33 🔗 SilSte has quit IRC (Remote host closed the connection)
10:34 🔗 SilSte has joined #archiveteam-bs
10:35 🔗 icebrain Aoede, Medowar0: thanks!
10:37 🔗 Yoshimura icebrain: False, there is.
10:38 🔗 Yoshimura Given that there is a 60 second delay between tries, increasing concurrency for both download and upload helps. Or hacking the code to retry every 20 seconds.
10:38 🔗 Yoshimura Only use rsync concurrency > 1 if you have good upload bandwidth, else you would block others.
10:39 🔗 Yoshimura I think that using -9 for rsync compression might have some effect; I would like to know myself whether it stalls on I/O, CPU, or something else.
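The 60-second retry cycle Yoshimura describes can be sketched like this (illustrative only; the real warrior pipeline is built on the seesaw library, and `try_upload` here is a hypothetical stand-in for its rsync upload step):

```python
import time

def upload_with_retry(try_upload, delay=60):
    """Keep retrying an upload until a slot frees up.

    `try_upload` should return True on success. The stock 60-second
    delay between attempts is why shortening it, or running more
    concurrent pipelines, wins you more attempts per minute against
    a busy upload target.
    """
    while not try_upload():
        time.sleep(delay)
```

With `delay=20` you get three times as many attempts per minute, which is the "hacking the code" option above.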
10:45 🔗 icebrain Yoshimura: thanks, but won't that just put more pressure on the target servers? my objective was more to keep pulling and cache locally until it could upload, not necessarily upload sooner.
10:45 🔗 Yoshimura Not really, the server does impose limits, so it would merely give you more chance.
10:46 🔗 Yoshimura But yeah, the dumbest thing is to increase the number of threads for download.
10:46 🔗 Yoshimura If it were not Python I would fix it, but given all the stuff that is done, a rewrite in a different language would likely not be accepted anyway.
10:57 🔗 Medowar0 Yoshimura: This is a stupid egoistic way, because it does not increase the overall grab speed, just your portion of the grab.
10:58 🔗 Yoshimura Medowar0 somewhat, but then explain to me why some people can get terabytes and some barely a few hundred GB?
11:00 🔗 Yoshimura And yes, I did state it is dumb. And it does not have to be egoistic; it has to do with motivation and a sense of purpose and being useful.
11:00 🔗 Medowar0 Because I am running 160 Instances 24/7, since the start of the project. Yes, currently the targets are overloaded, but when I started everything, there was still capacity. And I already reduced the total concurrent.
11:01 🔗 Yoshimura 160 instances, one could say same thing, that's egoistic.
11:01 🔗 Yoshimura Stupid egoistic way. Nothing personal. It's just a different way to get "preferred".
11:01 🔗 Medowar0 yes, as I said, I already reduced it. I started with ~500.
11:01 🔗 Yoshimura But... Please I would like to know what is the bottleneck?
11:02 🔗 Medowar0 the rsync Targets.
11:02 🔗 Yoshimura What type of resource I meant.
11:02 🔗 Medowar0 storage.
11:02 🔗 Yoshimura ok. thanks
11:04 🔗 Medowar0 Kenshin has 60TB banked somewhere, HCross and Kaz are the current targets and constantly uploading, Fos is full, lysantor has 24TB banked.
11:09 🔗 SilSte has quit IRC (Quit: No Ping reply in 180 seconds.)
11:12 🔗 SilSte has joined #archiveteam-bs
11:19 🔗 brayden_ has joined #archiveteam-bs
11:19 🔗 swebb sets mode: +o brayden_
11:23 🔗 SilSte has quit IRC (Quit: No Ping reply in 180 seconds.)
11:25 🔗 SilSte has joined #archiveteam-bs
11:29 🔗 Yoshimura Yeah I know, the upload sucks.
11:29 🔗 Yoshimura Never got to the core what is the bottleneck, the petastore or the s3 protocol or something else.
11:30 🔗 icebrain I have ~900GB I can dedicate to it, I'm assuming it's not enough to make it worthwhile to set up a target server?
11:30 🔗 Yoshimura It could be if you got enough upload.
11:31 🔗 Yoshimura I am not the one to talk though, and in comparison that is small.
11:31 🔗 Yoshimura Just do not mind me, I'm nuts.
11:32 🔗 icebrain speedtest.net says 88mbps down/90mbps up
11:34 🔗 Yoshimura Well those beasts are incomparable. I was not here for a while. But last time the problem was the protocol and transcontinental transfer.
11:35 🔗 Yoshimura So having a better solution based on UDP rather than S3, even if it was an intermediate server, would solve that. Russia -> US East = 3Mbit max.
11:35 🔗 Yoshimura And you can have 1Gbps on each side; TCP sucks for a long-distance, wide pipe.
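The 3 Mbit figure is consistent with the classic TCP window-size ceiling: without window scaling, throughput cannot exceed window / RTT, regardless of link capacity. A back-of-the-envelope check (the 170 ms round-trip time is an assumed Russia-to-US-East figure, not measured here):

```python
# TCP throughput ceiling = receive window / round-trip time.
window_bytes = 64 * 1024   # classic 64 KiB window, no window scaling
rtt_seconds = 0.17         # assumed transcontinental RTT (~170 ms)

throughput_bps = window_bytes * 8 / rtt_seconds
print(f"{throughput_bps / 1e6:.1f} Mbit/s")  # roughly 3 Mbit/s
```

Window scaling (RFC 7323) lifts this limit, but a single long-haul TCP stream still suffers from loss-based congestion control, which is why UDP-based transfer tools do better on wide, long pipes.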
11:38 🔗 kristian_ has quit IRC (Quit: Leaving)
11:39 🔗 Yoshimura I am rethinking my life; if someone would accept my work, so that it would have impact, I would start rewriting the warrior and stuff. The only thing that might have to be left in place is dupe detection, or it would have to be tested, as it does use specific stuff to process pages, so total compatibility would be hard. But running grab-site myself, stuff gets overloaded, eating too much CPU. The warrior does wait for I/O after each file fetch, etc. I resorted to using a ramdisk; it sped up both up and down. Luckily tmpfs can be swapped and only uses what it needs.
11:40 🔗 Yoshimura That said, if there would be interest let me know please.
11:50 🔗 SilSte has quit IRC (Remote host closed the connection)
11:51 🔗 SilSte has joined #archiveteam-bs
11:55 🔗 SilSte has quit IRC (Client Quit)
12:01 🔗 SilSte has joined #archiveteam-bs
12:04 🔗 SilSte has quit IRC (Quit: No Ping reply in 180 seconds.)
12:04 🔗 SilSte has joined #archiveteam-bs
12:16 🔗 SilSte has quit IRC (Quit: No Ping reply in 180 seconds.)
12:17 🔗 SilSte has joined #archiveteam-bs
12:25 🔗 SilSte has quit IRC (Quit: No Ping reply in 180 seconds.)
12:26 🔗 SilSte has joined #archiveteam-bs
12:27 🔗 VADemon has quit IRC (Quit: left4dead)
12:40 🔗 SilSte has quit IRC (Quit: No Ping reply in 180 seconds.)
12:45 🔗 SilSte has joined #archiveteam-bs
12:48 🔗 SilSte has quit IRC (Quit: No Ping reply in 180 seconds.)
12:49 🔗 SilSte has joined #archiveteam-bs
12:53 🔗 SilSte has quit IRC (Quit: No Ping reply in 180 seconds.)
12:54 🔗 SilSte has joined #archiveteam-bs
12:57 🔗 SilSte has quit IRC (Client Quit)
13:01 🔗 SilSte has joined #archiveteam-bs
13:16 🔗 SilSte has quit IRC (Quit: No Ping reply in 180 seconds.)
13:17 🔗 SilSte has joined #archiveteam-bs
13:19 🔗 Yoshimura has quit IRC (Remote host closed the connection)
13:22 🔗 SilSte has quit IRC (Quit: No Ping reply in 180 seconds.)
13:23 🔗 SilSte has joined #archiveteam-bs
13:28 🔗 SilSte has quit IRC (Client Quit)
13:40 🔗 SilSte has joined #archiveteam-bs
13:54 🔗 SilSte has quit IRC (Quit: No Ping reply in 180 seconds.)
13:54 🔗 SilSte has joined #archiveteam-bs
14:03 🔗 SilSte has quit IRC (Quit: No Ping reply in 180 seconds.)
14:04 🔗 RichardG has joined #archiveteam-bs
14:04 🔗 SilSte has joined #archiveteam-bs
14:08 🔗 SilSte has quit IRC (Client Quit)
14:09 🔗 SilSte has joined #archiveteam-bs
14:23 🔗 Start has quit IRC (Quit: Disconnected.)
14:28 🔗 GE has quit IRC (Remote host closed the connection)
14:31 🔗 Midas i need to get my fileserver up again... 40TB laying around doing nothing
14:43 🔗 RichardG has quit IRC (Ping timeout: 370 seconds)
14:55 🔗 Yoshimura has joined #archiveteam-bs
14:56 🔗 SilSte has quit IRC (Quit: No Ping reply in 180 seconds.)
14:58 🔗 SilSte has joined #archiveteam-bs
15:00 🔗 Kaz okay, I'm back
15:00 🔗 Kaz vine, what's going on
15:02 🔗 ranma twitter is shutting them down or something?
15:03 🔗 SilSte has quit IRC (Quit: No Ping reply in 180 seconds.)
15:05 🔗 SilSte has joined #archiveteam-bs
15:09 🔗 Kaz yup
15:12 🔗 SilSte has quit IRC (Quit: No Ping reply in 180 seconds.)
15:12 🔗 SilSte has joined #archiveteam-bs
15:16 🔗 SilSte has quit IRC (Client Quit)
15:16 🔗 SilSte has joined #archiveteam-bs
15:28 🔗 SilSte has quit IRC (Quit: No Ping reply in 180 seconds.)
15:29 🔗 SilSte has joined #archiveteam-bs
15:41 🔗 kristian_ has joined #archiveteam-bs
15:49 🔗 SilSte has quit IRC (Remote host closed the connection)
15:50 🔗 SilSte has joined #archiveteam-bs
15:54 🔗 SilSte has quit IRC (Remote host closed the connection)
15:55 🔗 SilSte has joined #archiveteam-bs
16:01 🔗 SilSte has quit IRC (Quit: No Ping reply in 180 seconds.)
16:08 🔗 zhongfu https://vine.co/v/5u1dDA00uHw
16:09 🔗 bsmith093 has quit IRC (Ping timeout: 255 seconds)
16:30 🔗 GE has joined #archiveteam-bs
16:32 🔗 RichardG has joined #archiveteam-bs
16:49 🔗 kristian_ has quit IRC (Quit: Leaving)
16:59 🔗 Shakespea has joined #archiveteam-bs
16:59 🔗 Shakespea Hi all - http://www.bbc.co.uk/news/technology-37788052
16:59 🔗 Shakespea Another service going away :(
17:04 🔗 Medowar0 thats number 10
17:09 🔗 Kaz I think I did quite well to be number two then
17:09 🔗 Kaz :D
17:11 🔗 Aoede Do WARCs downloaded by archivebot get injected to wayback?
17:12 🔗 Kaz yes
17:12 🔗 Aoede Thanks
17:20 🔗 Shakespea has quit IRC (Quit: ChatZilla 0.9.92 [Firefox 52.0a1/20161028030204])
17:25 🔗 Yoshimura That reminds me I forgot to check my personal archival. How do I gain +v for the bot, again? It feels futile to archive stuff: you archive knowledge but people cannot find it, because it does not get into wayback.
17:29 🔗 Stilett0 has joined #archiveteam-bs
17:29 🔗 superkuh has quit IRC (Quit: the neuronal action potential is an electrical manipulation of reversible abrupt phase changes in the lipid bilaye)
17:31 🔗 Stiletto has quit IRC (Read error: Operation timed out)
17:56 🔗 RichardG has quit IRC (Read error: Operation timed out)
17:56 🔗 RichardG has joined #archiveteam-bs
18:01 🔗 Start has joined #archiveteam-bs
18:20 🔗 Start has quit IRC (Quit: Disconnected.)
18:24 🔗 ndiddy has joined #archiveteam-bs
18:30 🔗 ravetcofx has joined #archiveteam-bs
18:36 🔗 Start has joined #archiveteam-bs
18:53 🔗 bsmith093 has joined #archiveteam-bs
18:57 🔗 Start has quit IRC (Quit: Disconnected.)
19:11 🔗 DragonDav has joined #archiveteam-bs
19:15 🔗 will I can't even begin to imagine the amount of storage needed for all the Vines
19:16 🔗 DragonDav is now known as Dragon
19:35 🔗 Yoshimura Get some vines, average, multiply, done.
19:36 🔗 Yoshimura Most sites have stuff in TBs, although from the front it might seem like they are gigantic.
19:43 🔗 will Oh yeah with Vine you know exactly how long a video is going to be, and the rough filesize, so can get a pretty good estimate
19:47 🔗 will Largest one I've got so far is 3.17MB
19:50 🔗 will Random selection of 5 videos yields 1.88MB average file size
19:53 🔗 will ah they have supported 140s long videos since this year though which will skew some stats
19:55 🔗 will Not sure on videos uploaded
19:55 🔗 Yoshimura 100TB or more then.
19:56 🔗 Yoshimura 39 million as of February, so 2MB * 50 million = 100TB
19:56 🔗 will Less than I thought
19:57 🔗 Yoshimura Well, in reality it might be more, plus there are pages. So my guesstimate is 100-200TB. Which is not that much when you consider how "valuable" a resource it is.
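Yoshimura's estimate can be written out explicitly (all inputs are the rough figures from the conversation: a ~2 MB average from will's small sample, and 39 million vines as of February rounded up to 50 million):

```python
avg_clip_bytes = 2 * 10**6      # ~2 MB average, from a 5-video sample
estimated_clips = 50 * 10**6    # 39M reported in February, rounded up

total_bytes = avg_clip_bytes * estimated_clips
total_tb = total_bytes / 10**12
print(f"~{total_tb:.0f} TB")    # ~100 TB, before pages and overhead
```

The 140-second videos added later that year would pull the average up, which is where the 100-200 TB range comes from.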
19:58 🔗 will Yeah my estimate there is based on a really small selection of videos
19:58 🔗 Yoshimura I'm thinking about rewriting the archivebot and stuff, what do you think?
19:58 🔗 will I can't find any official stats on videos uploaded on their blog
19:59 🔗 will Not like the YouTube press page anyway...
20:02 🔗 JW_work has joined #archiveteam-bs
20:03 🔗 Frogging idk if Vine is worth hundreds of TB personally
20:03 🔗 Yoshimura Definitely is.
20:04 🔗 Yoshimura Or at least the popular ones are.
20:04 🔗 Frogging maybe the popular ones for historical purposes, yeah. I doubt there's much value in the vast majority of these 7 second clips
20:05 🔗 JW_work1 has quit IRC (Read error: Operation timed out)
20:05 🔗 Frogging but I never actually used Vine and it's not up to any of us anyway :p
20:06 🔗 xmc who are you and what have you done with the archiveteam member named Frogging
20:06 🔗 wp494 the one thing we have going for us on the vine front is that videos are quite short unlike twitch
20:07 🔗 wp494 that said it wouldn't really matter if there's more volume anyway
20:07 🔗 Frogging xmc: did I say something out of character? :p
20:08 🔗 wp494 should I put vine in video hosting or social
20:08 🔗 Kaz sep332: the gdocs link in the Vine page is dead
20:08 🔗 wp494 I'm thinking video but I see reasons for social as well
20:08 🔗 wp494 (for the navbox, that is)
20:11 🔗 JW_work has quit IRC (Quit: Leaving.)
20:12 🔗 wp494 screw it I'll toss it to a poll
20:13 🔗 JW_work has joined #archiveteam-bs
20:16 🔗 xmc i'd go with social
20:16 🔗 Yoshimura same
20:20 🔗 ranma social
20:32 🔗 BlueMaxim has joined #archiveteam-bs
21:01 🔗 sep332 Kaz: thanks, had an extra dot somehow. Fixed
21:06 🔗 Yoshimura Umm is there a way to get list of pages saved in wayback?
21:08 🔗 Fletcher has quit IRC (Ping timeout: 244 seconds)
21:08 🔗 will has quit IRC (Ping timeout: 244 seconds)
21:08 🔗 Kenshin has quit IRC (Ping timeout: 244 seconds)
21:08 🔗 closure has quit IRC (Ping timeout: 244 seconds)
21:08 🔗 useretail has quit IRC (Ping timeout: 244 seconds)
21:08 🔗 i0npulse has quit IRC (Ping timeout: 244 seconds)
21:08 🔗 Medowar has quit IRC (Ping timeout: 244 seconds)
21:08 🔗 purplebot has quit IRC (Ping timeout: 244 seconds)
21:08 🔗 Rye has quit IRC (Ping timeout: 244 seconds)
21:08 🔗 will has joined #archiveteam-bs
21:08 🔗 Medowar has joined #archiveteam-bs
21:08 🔗 Kenshin has joined #archiveteam-bs
21:09 🔗 closure has joined #archiveteam-bs
21:09 🔗 Rye has joined #archiveteam-bs
21:10 🔗 purplebot has joined #archiveteam-bs
21:11 🔗 useretail has joined #archiveteam-bs
21:13 🔗 i0npulse has joined #archiveteam-bs
21:24 🔗 Fletcher has joined #archiveteam-bs
21:53 🔗 RichardG has quit IRC (Read error: Operation timed out)
21:53 🔗 RichardG has joined #archiveteam-bs
22:00 🔗 GE has quit IRC (Quit: zzz)
22:09 🔗 JW_work Yoshimura: list of pages in Wayback — see the CDX server interface (see IA page on the wiki for links)
22:10 🔗 Yoshimura I mean, can one get the list, not only query? (load on server)
22:11 🔗 xmc you mean the list of 253 billion pages?
22:11 🔗 yipdw 273 billion
22:11 🔗 xmc sorry, i lost track counting
22:11 🔗 Yoshimura Yes.
22:11 🔗 xmc no.
22:11 🔗 JW_work there isn't a separately downloadable list that I know of, no
22:11 🔗 Yoshimura https://archive.org/details/waybackcdx are those?
22:12 🔗 xmc but ... why?
22:12 🔗 JW_work "These shards are not publicly downloadable. "
22:12 🔗 Yoshimura Damn. Archival stuff, more intelligent than just fetching stuff and uploading what is already there.
22:13 🔗 BartoCH has quit IRC (Ping timeout: 260 seconds)
22:13 🔗 Yoshimura The checksums and pages are of interest to be specific. And also sizes.
22:13 🔗 JW_work it's generally felt worth uploading stuff again, for multiple reasons:
22:13 🔗 Yoshimura pages = links
22:13 🔗 JW_work 1) gives evidence the stuff was still present at the new date
22:14 🔗 yipdw stuff changes over time, and with some exceptions such as large collections of multi-gigabyte files, some duplication isn't fatal
22:14 🔗 Yoshimura I am talking large scale, yes.
22:14 🔗 yipdw don't look at the number of copies of jquery-min*.js in wayback if you have OCD
22:14 🔗 JW_work 2) while they still have plenty of space, extra duplication of stuff that was interesting enough to grab multiple times is useful
22:14 🔗 BartoCH has joined #archiveteam-bs
22:15 🔗 xmc i mean you could make a bloom filter on the contents of the cdx'es
22:15 🔗 Yoshimura ^ this
22:15 🔗 xmc but it's a lot of work to save a little bit of bandwidth
22:15 🔗 Yoshimura A lot.
22:15 🔗 xmc and a lot of bandwidth on the front-side!
22:15 🔗 JW_work 3) if/when they get a space crunch, being able to unexpectedly get more merely by removing said duplication is also neat
22:16 🔗 yipdw have fun tuning your false positive rate if you decide to go for a Bloom filter
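The sizing math behind xmc's and yipdw's points is standard; a sketch of it and of the filter itself follows (the double-hashing scheme is one common construction, not anything the CDX tooling actually ships):

```python
import hashlib
import math

def bloom_params(n, p):
    """Optimal bit count m and hash count k for n items at false-positive rate p."""
    m = math.ceil(-n * math.log(p) / math.log(2) ** 2)
    k = round(m / n * math.log(2))
    return m, k

class BloomFilter:
    def __init__(self, n, p=0.01):
        self.m, self.k = bloom_params(n, p)
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, item):
        # Kirsch-Mitzenmacher double hashing: two base hashes combine
        # into k independent-enough index functions.
        digest = hashlib.sha1(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # May return a false positive, never a false negative.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```

At a 1% false-positive rate this costs about 9.6 bits per URL, so a filter over hundreds of billions of captures would still run to hundreds of gigabytes — roughly why it's "a lot of work to save a little bit of bandwidth".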
22:16 🔗 Yoshimura That is simple.
22:16 🔗 JW_work I do agree that it would be neat to have a publically downloadable set of *hashes* of all the content in the Wayback machine, though
22:17 🔗 yipdw in any case we decided to skip that and it's been doing okay for about three years
22:17 🔗 JW_work (although, like improved searching, that increases the risk of people objecting to the inclusion of particular bits of content)
22:17 🔗 yipdw or 5-6 years if you include all AT activity
22:17 🔗 Yoshimura Can I just hammer the API about all resources?
22:17 🔗 yipdw duplication isn't really an endemic Archive Team problem; it is relatively easy to say "hey you really don't need to grab those gigabytes of ISOs again"
22:18 🔗 JW_work if you do it slowly enough (i.e. say, one query per minute) I doubt anyone would care
22:18 🔗 yipdw fetch diversity is a bigger one, IMO
22:18 🔗 * JW_work agrees about a need for greater fetch diversity
22:18 🔗 yipdw we get *so* *much* *fucking* *shit* about Western technopolitics
22:18 🔗 yipdw did I mention that we don't tend to get much from the entire other half of the world
22:19 🔗 JW_work the amount of non-English-language stuff we are missing knowing about is … really not good
22:19 🔗 Yoshimura I meant 10-100 per second with memoization
22:19 🔗 Yoshimura Or in case of CDX API wildcards would help of course
22:20 🔗 Yoshimura This would lead to the diversity yes.
22:20 🔗 JW_work 100 queries per second would probably get noticed (although it might not).
22:20 🔗 Yoshimura Given my health and my life I am rethinking everything, even my own existence.
22:20 🔗 JW_work wait, how would extracting a list of the current contents of the Wayback Machine help with diversity?
22:20 🔗 kristian_ has joined #archiveteam-bs
22:21 🔗 xmc when interacting with archive.org: the general rule is, don't make things fall over, don't make it slow for other people, and the archive won't get in your way
22:21 🔗 Yoshimura If I could get info about what pages are missing one can more easily fetch those.
22:21 🔗 JW_work but the problem is knowing what's out there, not what's missing from Wayback
22:22 🔗 JW_work we just discussed why grabing stuff that has already been grabbed is harmless
22:22 🔗 Yoshimura If I know what is out there, I can look if it is missing programmatically
22:22 🔗 JW_work right, but how do you find what is out there?
22:23 🔗 yipdw well, the APIs exist and if you keep the request rate down you'll probably be ok
22:23 🔗 yipdw enjoy
22:23 🔗 Yoshimura If you talk large scale it means not grabbing other stuff with those resources.
22:23 🔗 Yoshimura The API gives me hashes, right?
22:23 🔗 Yoshimura JW_work: Crawl data, users, etc. There are multiple facets
22:23 🔗 JW_work the CDX api does give "digests", which are (IIRC) sha1 with a weird format
22:24 🔗 kristian_ okay, -BS ... any francophone people here?
22:25 🔗 Yoshimura Yes, digest. Base32 is not weird, just uncommon.
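For reference, the CDX "digest" field is the SHA-1 of the response payload, base32-encoded rather than the usual hex, which is easy to reproduce (the query URL in the comment is the public CDX endpoint; per the etiquette above, query it slowly):

```python
import base64
import hashlib

def wayback_digest(body: bytes) -> str:
    """SHA-1 of the payload, base32-encoded: the CDX 'digest' field."""
    return base64.b32encode(hashlib.sha1(body).digest()).decode()

# The digest of an empty body, a value that shows up often in CDX dumps:
print(wayback_digest(b""))  # 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ

# A matching CDX query looks like (fetch it yourself, slowly):
# https://web.archive.org/cdx/search/cdx?url=example.com&output=json&limit=5
```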
22:25 🔗 JW_work I don't see how knowing more about what is *in* the Wayback Machine would help you find material that we (ArchiveTeam) don't already know about. It seems like making contacts with other people/groups would be the only way to do that.
22:26 🔗 Yoshimura It helps to discern what is there and what is not when one gets one's hands on datasets.
22:32 🔗 JW_work agreed — but the first job is getting one's hands on more datasets. Once that happens (or even, once specific possibilities are known), *then* figuring out how much is duplicated matters — but not until then
22:36 🔗 kniffy has joined #archiveteam-bs
22:45 🔗 SketchCow So much talking in here.
22:47 🔗 kniffy zzz
22:55 🔗 godane so archive.org only has 273 billion pages when it was 491 billion pages this past july: https://web.archive.org/web/20160713070126/https://archive.org/
22:56 🔗 xmc i expect it's 491B copies of pages, and 273B distinct urls
22:56 🔗 godane ok
22:56 🔗 godane its just weird to me
22:57 🔗 xmc yeah it is
22:57 🔗 xmc i'd expect them to be consistent
22:57 🔗 godane anyways i maybe adding close to half a million urls just from abc.net.au
22:58 🔗 xmc yow
23:00 🔗 godane 89549 urls are in abc.net.au 2006 news sitemap
23:01 🔗 godane wayback has less than 2000 urls from abc.net.au/news/2006*
23:02 🔗 godane just the 2003 sitemap urls were more than the wayback machine had for 2003 to 2009
23:03 🔗 godane so i think wayback machine was drunk when going after abc.net.au
23:06 🔗 computerf has quit IRC (Bye.)
23:12 🔗 computerf has joined #archiveteam-bs
23:15 🔗 JW_work As I understand, the "wide" crawls only go a very limited depth down into any particular website — if abc.net.au didn't get included in a higher-priority crawl until after 2009, that would explain it.
23:15 🔗 JW_work (assuming my understanding is correct, which it certainly might not be)
23:18 🔗 JW_work 502 on Aug 14; 505 on Sept 14; 510 on Oct 14, which is the most recent crawl (and oddly, archive.org can't be saved with Wayback Save)
23:18 🔗 godane i think the urls change to the current format in 2011
23:18 🔗 JW_work (all in billions)
23:19 🔗 JW_work and the current page (that lists the 273 billion number) now links to the blog post that explains the apparent drop
23:20 🔗 JW_work it's 273 billion "webpages" and 510 billion "time-stamped web objects" (aka "web captures")
23:21 🔗 JW_work wow, so that implies that there are 237 billion captures that *aren't* HTML, plain text or PDF. That's a lot of copies of jQuery. :-)
23:22 🔗 Kaz images?
23:22 🔗 JW_work (and 404 errors)
23:22 🔗 Kaz wait no
23:22 🔗 JW_work yes, images
23:22 🔗 Kaz ignore the wait, no then
23:26 🔗 JW_work (I don't have access to my email right now, so I'll mention this here: http://blog.archive.org/2016/10/24/faqs-for-some-new-features-available-in-the-beta-wayback-machine/#comment-355327 is a spam comment and should probably get removed) Somebody, please forward this on to info@archive
23:28 🔗 epicfacet has joined #archiveteam-bs
23:28 🔗 arkiver hi
23:29 🔗 arkiver so yeah, let us know if it is back
23:29 🔗 arkiver we'll archive it again
23:29 🔗 arkiver any idea how large the site is when it's fully online?
23:29 🔗 Kaz JW_work: I don't see any spammy comments?
23:29 🔗 Kaz I only see two comments, though the page implies there's 4
23:29 🔗 epicfacet the site was huge.
23:30 🔗 epicfacet it had a backup of the original AoS forums, about 100k members, ect
23:30 🔗 JW_work Kaz — the one from onlinepluz is what I was referring to.
23:30 🔗 epicfacet the game just fell out of popularity in the last year or two
23:30 🔗 arkiver how many posts do you think?
23:30 🔗 epicfacet I have no idea.
23:30 🔗 arkiver ok
23:30 🔗 JW_work has quit IRC (Quit: Leaving.)
23:30 🔗 arkiver well let us know if it's back and we'll have a look at it
23:31 🔗 Kaz epicfacet: is http://www.aceofspades.com/community/index related?
23:31 🔗 Kaz or is that a different game?
23:32 🔗 epicfacet what happened is a company bought the rights to the game about a year or two ago, and made that. BnS was a continuation of the original game because so many people didn't like the paid version
23:32 🔗 epicfacet when they bought the game, they actually shut the original forums down w/o notice. someone was somehow able to grab a copy and put it on there
23:33 🔗 epicfacet so yeah, quite a history with shutdowns for this game
23:46 🔗 godane i just saved this: https://web.archive.org/web/*/http://mpegmedia.abc.net.au/local/sydney/201301/r1058576_12374885.mp3
23:46 🔗 godane mp3 came from this article: https://web.archive.org/web/20130117142641/http://www.abc.net.au/local/stories/2013/01/14/3669278.htm?site=sydney
23:48 🔗 godane again that sort of feels like a fail since we did that Aaron Swartz collection
23:49 🔗 godane but at least wayback had a copy of the article from that time
23:55 🔗 epicfacet has quit IRC (Quit: Page closed)
23:55 🔗 VADemon has joined #archiveteam-bs
