[00:04] *** bzc6p_ has left
[00:08] *** useretail has joined #archiveteam
[00:09] *** db48x has quit IRC (Ping timeout: 258 seconds)
[00:12] *** wp494 has quit IRC ()
[00:14] *** wp494 has joined #archiveteam
[00:29] :(
[00:47] *** achip has joined #archiveteam
[00:53] *** wp494 has quit IRC ()
[01:09] *** Ymgve has quit IRC ()
[01:36] *** josephroo has joined #archiveteam
[01:50] *** wp494 has joined #archiveteam
[01:53] *** dashcloud has quit IRC (Read error: Operation timed out)
[01:56] *** dashcloud has joined #archiveteam
[01:59] *** dashcloud has quit IRC (Read error: Operation timed out)
[02:01] *** achip has quit IRC (Remote host closed the connection)
[02:02] *** dashcloud has joined #archiveteam
[02:20] *** BlueMaxim has quit IRC (Ping timeout: 335 seconds)
[02:34] *** mistym has joined #archiveteam
[02:35] *** achip has joined #archiveteam
[02:44] *** DFJustin has quit IRC (IMHOSTFU)
[02:45] *** BlueMaxim has joined #archiveteam
[02:52] *** DFJustin has joined #archiveteam
[02:52] *** swebb sets mode: +o DFJustin
[02:54] *** Nertsy has quit IRC (Ping timeout: 335 seconds)
[02:56] *** primus104 has quit IRC (Leaving.)
[03:20] is there an archive.org channel?
[03:28] #internetarchive
[03:28] on efnet?
[03:29] same network as this one
[03:29] thanks
[03:33] *** kyan has joined #archiveteam
[03:55] *** mistym has quit IRC (Remote host closed the connection)
[03:58] *** Nertsy has joined #archiveteam
[04:14] *** okeuday has quit IRC (Ping timeout: 246 seconds)
[04:14] *** okeuday has joined #archiveteam
[04:15] *** wp494_ has joined #archiveteam
[04:18] *** wp494 has quit IRC (Read error: Operation timed out)
[04:24] *** mistym has joined #archiveteam
[04:29] *** achip has quit IRC (Remote host closed the connection)
[04:39] *** Froggypwn has joined #archiveteam
[04:46] *** Froggypwn has quit IRC (~ Trillian Astra - www.trillian.im ~)
[04:48] *** Daloader_ has quit IRC (Read error: Connection reset by peer)
[04:48] *** Daloader_ has joined #archiveteam
[04:54] *** kyan has quit IRC (Ping timeout: 480 seconds)
[04:55] *** kyan has joined #archiveteam
[05:01] *** aaaaaaaaa has quit IRC (Leaving)
[05:17] *** Swizzle has joined #archiveteam
[05:50] *** signius has quit IRC (Read error: Operation timed out)
[05:50] *** signius has joined #archiveteam
[06:07] *** [1]Swizzl has joined #archiveteam
[06:10] *** Swizzle has quit IRC (Read error: Operation timed out)
[06:10] *** [1]Swizzl is now known as Swizzle
[06:54] *** db48x has joined #archiveteam
[07:18] *** Swizzle has quit IRC (Quit: HydraIRC -> http://www.hydrairc.com <- Wibbly Wobbly IRC)
[07:19] Who wants it:
[07:19] http://archiveteam.org/index.php?title=Scoop
[07:47] *** db48x has quit IRC (Read error: Operation timed out)
[07:59] *** APerti has joined #archiveteam
[08:11] *** mistym_ has joined #archiveteam
[08:16] *** mistym has quit IRC (Read error: Operation timed out)
[08:31] *** Daloader_ has quit IRC (Quit: Leaving)
[08:37] *** Ctrl-S has joined #archiveteam
[08:44] *** dashcloud has quit IRC (Read error: Operation timed out)
[08:47] *** dashcloud has joined #archiveteam
[08:49] *** brayden has quit IRC (Read error: Operation timed out)
[09:01] *** schbirid has joined #archiveteam
[09:02] *** brayden has joined #archiveteam
[09:08] *** primus104 has joined #archiveteam
[09:50] *** primus104 has quit IRC (Leaving.)
[10:00] *** mistym_ has quit IRC (Remote host closed the connection)
[10:01] *** Ymgve has joined #archiveteam
[10:21] *** BlueMaxim has quit IRC (Quit: Leaving)
[10:42] *** dashcloud has quit IRC (Read error: Operation timed out)
[10:45] *** dashcloud has joined #archiveteam
[11:11] *** MMovie1 has joined #archiveteam
[11:13] *** MMovie has quit IRC (Read error: Operation timed out)
[11:16] *** MMovie1 has quit IRC (Client Quit)
[11:16] *** MMovie has joined #archiveteam
[11:30] *** wp494_ is now known as wp494
[12:20] *** Emcy_ has quit IRC (Ping timeout: 480 seconds)
[12:22] *** BiggieJo1 has joined #archiveteam
[12:27] *** BiggieJon has quit IRC (Read error: Operation timed out)
[12:33] FOS is not 100% enjoying, but is dealing with MS Clip Art pretty well.
[12:34] Disk space usage on the machine in that drive is holding up nicely, mostly due to automatic processes now shoving things out.
[12:47] APerti: I think the original PC version of "Sid Meier’s Pirates!" was like that. You needed to boot from the game floppy, and I seem to recall that it was unreadable in DOS.
[13:27] *** Start has joined #archiveteam
[13:29] *** Ctrl-S has quit IRC (Ping timeout: 845 seconds)
[13:30] *** dashcloud has quit IRC (Read error: Operation timed out)
[13:33] *** dashcloud has joined #archiveteam
[13:38] *** Start has quit IRC (Ping timeout: 265 seconds)
[13:47] *** Ctrl-S has joined #archiveteam
[13:51] *** brayden has quit IRC (Ping timeout: 606 seconds)
[13:51] *** primus104 has joined #archiveteam
[13:56] *** brayden has joined #archiveteam
[14:23] *** phuzion_ has quit IRC (Read error: Operation timed out)
[14:26] *** phuzion has joined #archiveteam
[14:46] *** ruukasu has quit IRC (Quit: WeeChat 1.0.1)
[14:50] *** ruukasu has joined #archiveteam
[15:38] *** Start has joined #archiveteam
[15:39] *** Start has quit IRC (Client Quit)
[15:39] *** Start has joined #archiveteam
[15:50] *** Start has quit IRC (Ping timeout: 252 seconds)
[15:51] *** APerti has quit IRC (Ping timeout: 370 seconds)
[16:05] *** Emcy has joined #archiveteam
[16:06] SketchCow: i'm looking at scoop
[16:06] looks like the .xml.gz are really not gzip
[16:18] *** lhobas_ has joined #archiveteam
[16:20] *** wacky has joined #archiveteam
[16:25] *** db48x has joined #archiveteam
[16:26] *** xk_id has joined #archiveteam
[16:36] *** Lord_Nigh has quit IRC (Read error: Operation timed out)
[16:42] *** Ristovski has joined #archiveteam
[16:44] Hello, are you guys planning on archiving the pastebin public paste list?
[16:49] there was a project or script doing that- not sure what the current status is though- maybe someone else here remembers more about it
[16:49] I see
[16:50] could be an archive on internet archive
[17:04] *** Lord_Nigh has joined #archiveteam
[17:21] *** signius has quit IRC (Read error: Operation timed out)
[17:22] *** signius has joined #archiveteam
[17:26] *** robink has joined #archiveteam
[17:33] Ristovski: hi
[17:33] I was doing that, it eventually got banned and/or broke and/or otherwise stopped working
[17:33] haven't gotten around to fixing it yet
[17:34] Ristovski: https://github.com/joepie91/pastebin-scrape/tree/develop
[17:34] Ristovski: if you're bored, feel free to grab the code and see if it works on another system, and/or fix it where necessary, and I'll happily spin it up again :)
[17:34] joepie91, I have created a pastebin crawler myself, just wanted to see if the guys here already plan on doing something like this
[17:35] joepie91, mine works like yours, but more low-level
[17:35] no fancy io pipes and such
[17:36] Ristovski: low level in what sense? :P
[17:36] my friend made one, that basically uses a sql db and has a server-client architecture, so multiple boxes can scrape and throw it in the same db etc
[17:36] with a searchable interface
[17:36] joepie91, I made it in like 10 mins, you get the idea :D
[17:37] raylee: that's massive overkill for pastebin :P
[17:38] okay, so, raylee, basically; a single box can easily scrape all of pastebin
[17:39] the paste volume isn't /that/ high
[17:39] distributed architecture is a bit overkill, really
[17:39] and just more potential points of failure :P
[17:39] he was using distributed architecture mainly to work around bans
[17:39] i believe he got one of his ips permabanned
[17:39] raylee, as in, distribute the requests between more clients so it doesn't hit the requests-per-client so easily?
[17:40] might be what happened to my box also, but it took a number of months
[17:40] (not exaggerating)
[17:41] Ristovski, yes
[17:41] the distribution being more for redundancy than capacity
[17:41] yeah
[17:43] *** RichardG has joined #archiveteam
[17:45] sweet, a pastebin scraper
[17:45] I made one for myself, how did you get past the ratelimits on raw pastes?
[17:46] Ctrl-S: by not hitting it? :P
[17:46] you just need to not go too fast
[17:46] you can archive every single paste without hitting the ratelimiter
[17:47] I ended up just saving the view page
[17:47] also, raylee, Ristovski, Ctrl-S, for context, this is the collection of historical pastes I have: https://archive.org/details/pastebinpastes
[17:47] A bunch of writers on 4chan use pastebin to host their stories
[17:49] how big is the collection in total?
[17:49] 223 RESULTS
[17:49] :)
[17:49] I mean in GB
[17:49] for most of those days, it has all the pastes
[17:49] no idea, but it's tiny
[17:49] just for that day?
[17:50] arbitrary day: 16.3M gzipped
[17:50] or does it scan each possible paste?
[17:50] I think it extracts to 200MB per day or so, at most
[17:50] Ctrl-S: what do you mean?
[17:50] and does it follow links it finds in pastes to other pastes?
[17:50] no, it just scrapes the 'latest pastes' list and fetches each one on a loop
[17:50] so it grabs pastes as they are posted
[17:50] I'd just give you the code to my script if my raid hadn't just died
[17:50] before they have a chance to get deleted, really
[17:52] And this is run continuously?
[17:53] also this looks wrong: https://github.com/joepie91/pastebin-scrape/blob/develop/start.py
[17:53] zmq is imported repeatedly with different messages
[17:53] lines 12-16
[17:56] oh yeah, that's a bug, lol
[17:56] one of those things i hadn't gotten around to fixing yet
[18:04] *** aMunster has quit IRC (Read error: Operation timed out)
[18:11] *** APerti has joined #archiveteam
[18:36] So we have ALL the pastes, or just some of them?
[18:40] *** aaaaaaaaa has joined #archiveteam
[18:50] *** db48x has quit IRC (Ping timeout: 258 seconds)
[18:54] *** aMunster has joined #archiveteam
[18:56] Ctrl-S: all the public pastes in the timespan where the scraper was functioning
[18:57] *** RichardG has quit IRC (Ping timeout: 186 seconds)
[19:01] *** lytv has quit IRC (Quit: Leaving)
[19:03] *** RichardG has joined #archiveteam
[19:13] *** lytv has joined #archiveteam
[19:19] *** BlueMaxim has joined #archiveteam
[19:34] *** Start has joined #archiveteam
[19:34] *** Nertsy` has joined #archiveteam
[19:39] *** mistym has joined #archiveteam
[19:39] *** Nertsy has quit IRC (Ping timeout: 370 seconds)
[19:55] *** Start has quit IRC (Ping timeout: 369 seconds)
[20:38] *** db48x has joined #archiveteam
[20:51] *** mistym has quit IRC (Remote host closed the connection)
[20:59] *** db48x has quit IRC (Ping timeout: 258 seconds)
[21:16] *** Baljem_ is now known as Baljem
[22:05] *** db48x has joined #archiveteam
[22:26] *** aaaaaaaaa has quit IRC (Leaving)
[22:26] *** db48x has quit IRC (Ping timeout: 258 seconds)
[22:57] *** Ristovski has quit IRC (Quit: Leaving)
[23:49] *** VonCloud_ has joined #archiveteam
[23:50] *** VonCloud_ is now known as VonGuar
[23:50] *** VonGuar is now known as VonGuard
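Note on the scraping approach discussed between 17:45 and 17:50: the idea described there (poll the 'latest pastes' list, fetch each paste not seen before, and go slowly enough to stay under the rate limiter) can be sketched roughly as below. This is a minimal sketch under stated assumptions, not the code from the pastebin-scrape repository. The names scrape_once, list_latest, fetch_paste and delay_seconds are hypothetical, and the actual HTTP requests are left to the caller because the log does not say which endpoints were used.

```python
import time
from typing import Callable, Dict, Iterable, Set


def scrape_once(
    list_latest: Callable[[], Iterable[str]],   # hypothetical: returns IDs from the 'latest pastes' list
    fetch_paste: Callable[[str], str],          # hypothetical: returns the body of one paste
    seen: Set[str],                             # IDs already archived; updated in place
    delay_seconds: float = 1.0,                 # assumed pause between fetches, not a documented limit
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, str]:
    """Fetch every paste on the latest list that has not been seen yet."""
    saved: Dict[str, str] = {}
    for paste_id in list_latest():
        if paste_id in seen:
            continue  # already archived on an earlier pass
        saved[paste_id] = fetch_paste(paste_id)
        seen.add(paste_id)
        sleep(delay_seconds)  # deliberately slow, to avoid hitting the rate limiter
    return saved


def scrape_forever(list_latest, fetch_paste, store, poll_seconds: float = 60.0) -> None:
    """Run scrape_once in a loop so pastes are grabbed as they are posted."""
    seen: Set[str] = set()
    while True:
        for paste_id, body in scrape_once(list_latest, fetch_paste, seen).items():
            store(paste_id, body)  # hypothetical callback, e.g. append to a daily archive file
        time.sleep(poll_seconds)
```

Keeping the set of seen IDs across passes is what lets a single box keep up: each pass only pays for pastes that appeared since the previous one.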