#archiveteam 2014-12-24,Wed

Time Nickname Message
00:04 bzc6p_ has left
00:08 useretail has joined #archiveteam
00:09 db48x has quit IRC (Ping timeout: 258 seconds)
00:12 wp494 has quit IRC ()
00:14 wp494 has joined #archiveteam
00:29 xmc :(
00:47 achip has joined #archiveteam
00:53 wp494 has quit IRC ()
01:09 Ymgve has quit IRC ()
01:36 josephroo has joined #archiveteam
01:50 wp494 has joined #archiveteam
01:53 dashcloud has quit IRC (Read error: Operation timed out)
01:56 dashcloud has joined #archiveteam
01:59 dashcloud has quit IRC (Read error: Operation timed out)
02:01 achip has quit IRC (Remote host closed the connection)
02:02 dashcloud has joined #archiveteam
02:20 BlueMaxim has quit IRC (Ping timeout: 335 seconds)
02:34 mistym has joined #archiveteam
02:35 achip has joined #archiveteam
02:44 DFJustin has quit IRC (IMHOSTFU)
02:45 BlueMaxim has joined #archiveteam
02:52 DFJustin has joined #archiveteam
02:52 swebb sets mode: +o DFJustin
02:54 Nertsy has quit IRC (Ping timeout: 335 seconds)
02:56 primus104 has quit IRC (Leaving.)
03:20 mhazinsk is there an archive.org channel?
03:28 xmc #internetarchive
03:28 mhazinsk on efnet?
03:29 xmc same network as this one
03:29 mhazinsk thanks
03:33 kyan has joined #archiveteam
03:55 mistym has quit IRC (Remote host closed the connection)
03:58 Nertsy has joined #archiveteam
04:14 okeuday has quit IRC (Ping timeout: 246 seconds)
04:14 okeuday has joined #archiveteam
04:15 wp494_ has joined #archiveteam
04:18 wp494 has quit IRC (Read error: Operation timed out)
04:24 mistym has joined #archiveteam
04:29 achip has quit IRC (Remote host closed the connection)
04:39 Froggypwn has joined #archiveteam
04:46 Froggypwn has quit IRC (~ Trillian Astra - www.trillian.im ~)
04:48 Daloader_ has quit IRC (Read error: Connection reset by peer)
04:48 Daloader_ has joined #archiveteam
04:54 kyan has quit IRC (Ping timeout: 480 seconds)
04:55 kyan has joined #archiveteam
05:01 aaaaaaaaa has quit IRC (Leaving)
05:17 Swizzle has joined #archiveteam
05:50 signius has quit IRC (Read error: Operation timed out)
05:50 signius has joined #archiveteam
06:07 [1]Swizzl has joined #archiveteam
06:10 Swizzle has quit IRC (Read error: Operation timed out)
06:10 [1]Swizzl is now known as Swizzle
06:54 db48x has joined #archiveteam
07:18 Swizzle has quit IRC (Quit: HydraIRC -> http://www.hydrairc.com <- Wibbly Wobbly IRC)
07:19 SketchCow Who wants it:
07:19 SketchCow http://archiveteam.org/index.php?title=Scoop
07:47 db48x has quit IRC (Read error: Operation timed out)
07:59 APerti has joined #archiveteam
08:11 mistym_ has joined #archiveteam
08:16 mistym has quit IRC (Read error: Operation timed out)
08:31 Daloader_ has quit IRC (Quit: Leaving)
08:37 Ctrl-S has joined #archiveteam
08:44 dashcloud has quit IRC (Read error: Operation timed out)
08:47 dashcloud has joined #archiveteam
08:49 brayden has quit IRC (Read error: Operation timed out)
09:01 schbirid has joined #archiveteam
09:02 brayden has joined #archiveteam
09:08 primus104 has joined #archiveteam
09:50 primus104 has quit IRC (Leaving.)
10:00 mistym_ has quit IRC (Remote host closed the connection)
10:01 Ymgve has joined #archiveteam
10:21 BlueMaxim has quit IRC (Quit: Leaving)
10:42 dashcloud has quit IRC (Read error: Operation timed out)
10:45 dashcloud has joined #archiveteam
11:11 MMovie1 has joined #archiveteam
11:13 MMovie has quit IRC (Read error: Operation timed out)
11:16 MMovie1 has quit IRC (Client Quit)
11:16 MMovie has joined #archiveteam
11:30 wp494_ is now known as wp494
12:20 Emcy_ has quit IRC (Ping timeout: 480 seconds)
12:22 BiggieJo1 has joined #archiveteam
12:27 BiggieJon has quit IRC (Read error: Operation timed out)
12:33 SketchCow FOS is not 100% enjoying, but is dealing with MS Clip Art pretty well.
12:34 SketchCow Disk space usage on the machine in that drive is holding up nicely, mostly due to automatic processes now shoving things out.
12:47 SadDM APerti: I think the original PC version of "Sid Meier's Pirates!" was like that. You needed to boot from the game floppy, and I seem to recall that it was unreadable in DOS.
13:27 Start has joined #archiveteam
13:29 Ctrl-S has quit IRC (Ping timeout: 845 seconds)
13:30 dashcloud has quit IRC (Read error: Operation timed out)
13:33 dashcloud has joined #archiveteam
13:38 Start has quit IRC (Ping timeout: 265 seconds)
13:47 Ctrl-S has joined #archiveteam
13:51 brayden has quit IRC (Ping timeout: 606 seconds)
13:51 primus104 has joined #archiveteam
13:56 brayden has joined #archiveteam
14:23 phuzion_ has quit IRC (Read error: Operation timed out)
14:26 phuzion has joined #archiveteam
14:46 ruukasu has quit IRC (Quit: WeeChat 1.0.1)
14:50 ruukasu has joined #archiveteam
15:38 Start has joined #archiveteam
15:39 Start has quit IRC (Client Quit)
15:39 Start has joined #archiveteam
15:50 Start has quit IRC (Ping timeout: 252 seconds)
15:51 APerti has quit IRC (Ping timeout: 370 seconds)
16:05 Emcy has joined #archiveteam
16:06 godane SketchCow: i'm looking at scoop
16:06 godane looks like the .xml.gz are really not gzip
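One way to check an observation like godane's is to look at the first two bytes of a file: a real gzip stream always starts with 0x1f 0x8b. The following is a minimal sketch of that check, not anything from the Scoop dump itself; the function names and the sample data are made up for illustration.

```python
import gzip

# Every gzip stream begins with these two magic bytes (RFC 1952).
GZIP_MAGIC = b"\x1f\x8b"


def looks_like_gzip(data: bytes) -> bool:
    """Return True if the buffer starts with the gzip magic number."""
    return data[:2] == GZIP_MAGIC


def read_maybe_gzip(data: bytes) -> bytes:
    """Decompress real gzip data; return anything else unchanged."""
    if looks_like_gzip(data):
        return gzip.decompress(data)
    return data


# Hypothetical sample data standing in for a mislabeled .xml.gz file.
plain_xml = b"<?xml version='1.0'?><site/>"
real_gzip = gzip.compress(plain_xml)

if __name__ == "__main__":
    print(looks_like_gzip(plain_xml))  # plain XML with a .gz name fails the check
    print(looks_like_gzip(real_gzip))
```

A file named .xml.gz that fails this check can simply be read as plain XML, which is what read_maybe_gzip falls back to.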
16:18 lhobas_ has joined #archiveteam
16:20 wacky has joined #archiveteam
16:25 db48x has joined #archiveteam
16:26 xk_id has joined #archiveteam
16:36 Lord_Nigh has quit IRC (Read error: Operation timed out)
16:42 Ristovski has joined #archiveteam
16:44 Ristovski Hello, are you guys planning on archiving the pastebin public paste list?
16:49 dashcloud there was a project or script doing that- not sure what the current status is though- maybe someone else here remembers more about it
16:49 Ristovski I see
16:50 dashcloud could be an archive on internet archive
17:04 Lord_Nigh has joined #archiveteam
17:21 signius has quit IRC (Read error: Operation timed out)
17:22 signius has joined #archiveteam
17:26 robink has joined #archiveteam
17:33 joepie91 Ristovski: hi
17:33 joepie91 I was doing that, it eventually got banned and/or broke and/or otherwise stopped working
17:33 joepie91 haven't gotten around to fixing it yet
17:34 joepie91 Ristovski: https://github.com/joepie91/pastebin-scrape/tree/develop
17:34 joepie91 Ristovski: if you're bored, feel free to grab the code and see if it works on another system, and/or fix it where necessary, and I'll happily spin it up again :)
17:34 Ristovski joepie91, I have created a pastebin crawler myself, just wanted to see if the guys here already plan on doing something like this
17:35 Ristovski joepie91, mine works like yours, but more low-level
17:35 Ristovski no fancy io pipes and such
17:36 joepie91 Ristovski: low level in what sense? :P
17:36 raylee my friend made one, that basically uses a sql db and has a server-client architecture, so multiple boxes can scrape and throw it in the same db etc
17:36 raylee with a searchable interface
17:36 Ristovski joepie91, I made it in like 10 mins, you get the idea :D
17:37 joepie91 raylee: that's massive overkill for pastebin :P
17:38 joepie91 okay, so, raylee, basically; a single box can easily scrape all of pastebin
17:39 joepie91 the paste volume isn't /that/ high
17:39 joepie91 distributed architecture is a bit overkill, really
17:39 joepie91 and just more potential points of failure :P
17:39 raylee he was using distributed architecture mainly to work around bans
17:39 raylee i believe he got one of his ips permabanned
17:39 Ristovski raylee, as in, distribute the requests between more clients so it doesn't hit the requests-per-client limit so easily?
17:40 joepie91 might be what happened to my box also, but it took a number of months
17:40 joepie91 (not exaggerating)
17:41 raylee Ristovski, yes
17:41 raylee the distribution being more for redundancy than capacity
17:41 Ristovski yeah
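The setup raylee describes, several boxes writing into one database, mostly needs the store to ignore pastes that another box already saved. Here is a rough sketch of that idea. It is not the friend's actual code, which was not shared in the channel; it uses an in-memory SQLite database as a stand-in for a real shared SQL server, and the table, column, and worker names are invented.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open the paste store and create the table if it is missing."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pastes ("
        " paste_id TEXT PRIMARY KEY,"
        " body TEXT NOT NULL,"
        " scraped_by TEXT NOT NULL)"
    )
    return conn


def save_paste(conn, paste_id, body, worker):
    """Insert a paste; return False if some worker already stored it."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO pastes (paste_id, body, scraped_by)"
        " VALUES (?, ?, ?)",
        (paste_id, body, worker),
    )
    conn.commit()
    # rowcount is 1 for a fresh insert and 0 when the primary key existed.
    return cur.rowcount == 1


def count_pastes(conn):
    """Return how many distinct pastes are stored."""
    return conn.execute("SELECT COUNT(*) FROM pastes").fetchone()[0]
```

With the paste ID as primary key, two boxes that scrape the same paste cannot create duplicates, so redundancy between scrapers costs nothing in storage.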
17:43 RichardG has joined #archiveteam
17:45 Ctrl-S sweet, a pastebin scraper
17:45 Ctrl-S I made one for myself, how did you get past the ratelimits on raw pastes?
17:46 joepie91 Ctrl-S: by not hitting it? :P
17:46 joepie91 you just need to not go too fast
17:46 joepie91 you can archive every single paste without hitting the ratelimiter
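"Not going too fast" can be as simple as enforcing a minimum gap between requests. The sketch below shows one way to do that; it is not taken from joepie91's scraper, the class name is made up, and the interval that keeps a scraper under pastebin's limit is not stated anywhere in this log, so any value passed in is an assumption.

```python
import time


class Throttle:
    """Enforce a minimum interval between consecutive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        # clock and sleep are injectable so the class can be tested
        # without actually waiting.
        self._clock = clock
        self._sleep = sleep
        self._last = None

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling wait() right before each request keeps the request rate at or below one per min_interval seconds, whatever the rest of the loop is doing.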
17:47 Ctrl-S I ended up just saving the view page
17:47 joepie91 also, raylee, Ristovski, Ctrl-S, for context, this is the collection of historical pastes I have: https://archive.org/details/pastebinpastes
17:47 Ctrl-S A bunch of writers on 4chan use pastebin to host their stories
17:49 Ctrl-S how big is the collection in total?
17:49 joepie91 223 RESULTS
17:49 joepie91 :)
17:49 Ctrl-S I mean in GB
17:49 joepie91 for most of those days, it has all the pastes
17:49 joepie91 no idea, but it's tiny
17:49 Ctrl-S just for that day?
17:50 joepie91 arbitrary day: 16.3M gzipped
17:50 Ctrl-S or does it scan each possible paste?
17:50 joepie91 I think it extracts to 200MB per day or so, at most
17:50 joepie91 Ctrl-S: what do you mean?
17:50 Ctrl-S and does it follow links it finds in pastes to other pastes?
17:50 joepie91 no, it just scrapes the 'latest pastes' list and fetches each one on a loop
17:50 joepie91 so it grabs pastes as they are posted
17:50 Ctrl-S I'd just give you the code to my script if my raid hadn't just died
17:50 joepie91 before they have a chance to get deleted, really
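The loop joepie91 describes (poll the 'latest pastes' list, fetch whatever is new, repeat) can be sketched roughly as follows. This is an illustration only, not the code from pastebin-scrape: the two callables that talk to the site are passed in as parameters, and their names, like everything else here, are hypothetical.

```python
def scrape_once(list_recent, fetch_paste, seen, store):
    """One pass over the listing: fetch every paste not seen before.

    list_recent() returns the paste IDs currently on the 'latest' list.
    fetch_paste(paste_id) returns the body of a single paste.
    """
    new_ids = []
    for paste_id in list_recent():
        if paste_id in seen:
            continue  # already archived on an earlier pass
        store[paste_id] = fetch_paste(paste_id)
        seen.add(paste_id)
        new_ids.append(paste_id)
    return new_ids


def run(list_recent, fetch_paste, store, passes, pause=lambda: None):
    """Repeat scrape_once, calling pause() between passes."""
    seen = set()
    for _ in range(passes):
        scrape_once(list_recent, fetch_paste, seen, store)
        pause()
    return store
```

Because the listing overlaps from one pass to the next, the seen set is what stops the same paste from being fetched twice. It also means the scraper only ever requests pastes that appeared on the list, which matches the answer that it does not follow links found inside pastes.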
17:52 Ctrl-S And this is run continuously?
17:53 Ctrl-S also this looks wrong: https://github.com/joepie91/pastebin-scrape/blob/develop/start.py
17:53 Ctrl-S zmq is imported repeatedly with different messages
17:53 Ctrl-S lines 12-16
17:56 joepie91 oh yeah, that's a bug, lol
17:56 joepie91 one of those things i hadn't gotten around to fixing yet
18:04 aMunster has quit IRC (Read error: Operation timed out)
18:11 APerti has joined #archiveteam
18:36 Ctrl-S So we have ALL the pastes, or just some of them?
18:40 aaaaaaaaa has joined #archiveteam
18:50 db48x has quit IRC (Ping timeout: 258 seconds)
18:54 aMunster has joined #archiveteam
18:56 joepie91 Ctrl-S: all the public pastes in the timespan where the scraper was functioning
18:57 RichardG has quit IRC (Ping timeout: 186 seconds)
19:01 lytv has quit IRC (Quit: Leaving)
19:03 RichardG has joined #archiveteam
19:13 lytv has joined #archiveteam
19:19 BlueMaxim has joined #archiveteam
19:34 Start has joined #archiveteam
19:34 Nertsy` has joined #archiveteam
19:39 mistym has joined #archiveteam
19:39 Nertsy has quit IRC (Ping timeout: 370 seconds)
19:55 Start has quit IRC (Ping timeout: 369 seconds)
20:38 db48x has joined #archiveteam
20:51 mistym has quit IRC (Remote host closed the connection)
20:59 db48x has quit IRC (Ping timeout: 258 seconds)
21:16 Baljem_ is now known as Baljem
22:05 db48x has joined #archiveteam
22:26 aaaaaaaaa has quit IRC (Leaving)
22:26 db48x has quit IRC (Ping timeout: 258 seconds)
22:57 Ristovski has quit IRC (Quit: Leaving)
23:49 VonCloud_ has joined #archiveteam
23:50 VonCloud_ is now known as VonGuar
23:50 VonGuar is now known as VonGuard
