Time |
Nickname |
Message |
00:04
π
|
|
bzc6p_ has left |
00:08
π
|
|
useretail has joined #archiveteam |
00:09
π
|
|
db48x has quit IRC (Ping timeout: 258 seconds) |
00:12
π
|
|
wp494 has quit IRC () |
00:14
π
|
|
wp494 has joined #archiveteam |
00:29
π
|
xmc |
:( |
00:47
π
|
|
achip has joined #archiveteam |
00:53
π
|
|
wp494 has quit IRC () |
01:09
π
|
|
Ymgve has quit IRC () |
01:36
π
|
|
josephroo has joined #archiveteam |
01:50
π
|
|
wp494 has joined #archiveteam |
01:53
π
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
01:56
π
|
|
dashcloud has joined #archiveteam |
01:59
π
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
02:01
π
|
|
achip has quit IRC (Remote host closed the connection) |
02:02
π
|
|
dashcloud has joined #archiveteam |
02:20
π
|
|
BlueMaxim has quit IRC (Ping timeout: 335 seconds) |
02:34
π
|
|
mistym has joined #archiveteam |
02:35
π
|
|
achip has joined #archiveteam |
02:44
π
|
|
DFJustin has quit IRC (IMHOSTFU) |
02:45
π
|
|
BlueMaxim has joined #archiveteam |
02:52
π
|
|
DFJustin has joined #archiveteam |
02:52
π
|
|
swebb sets mode: +o DFJustin |
02:54
π
|
|
Nertsy has quit IRC (Ping timeout: 335 seconds) |
02:56
π
|
|
primus104 has quit IRC (Leaving.) |
03:20
π
|
mhazinsk |
is there an archive.org channel? |
03:28
π
|
xmc |
#internetarchive |
03:28
π
|
mhazinsk |
on efnet? |
03:29
π
|
xmc |
same network as this one |
03:29
π
|
mhazinsk |
thanks |
03:33
π
|
|
kyan has joined #archiveteam |
03:55
π
|
|
mistym has quit IRC (Remote host closed the connection) |
03:58
π
|
|
Nertsy has joined #archiveteam |
04:14
π
|
|
okeuday has quit IRC (Ping timeout: 246 seconds) |
04:14
π
|
|
okeuday has joined #archiveteam |
04:15
π
|
|
wp494_ has joined #archiveteam |
04:18
π
|
|
wp494 has quit IRC (Read error: Operation timed out) |
04:24
π
|
|
mistym has joined #archiveteam |
04:29
π
|
|
achip has quit IRC (Remote host closed the connection) |
04:39
π
|
|
Froggypwn has joined #archiveteam |
04:46
π
|
|
Froggypwn has quit IRC (~ Trillian Astra - www.trillian.im ~) |
04:48
π
|
|
Daloader_ has quit IRC (Read error: Connection reset by peer) |
04:48
π
|
|
Daloader_ has joined #archiveteam |
04:54
π
|
|
kyan has quit IRC (Ping timeout: 480 seconds) |
04:55
π
|
|
kyan has joined #archiveteam |
05:01
π
|
|
aaaaaaaaa has quit IRC (Leaving) |
05:17
π
|
|
Swizzle has joined #archiveteam |
05:50
π
|
|
signius has quit IRC (Read error: Operation timed out) |
05:50
π
|
|
signius has joined #archiveteam |
06:07
π
|
|
[1]Swizzl has joined #archiveteam |
06:10
π
|
|
Swizzle has quit IRC (Read error: Operation timed out) |
06:10
π
|
|
[1]Swizzl is now known as Swizzle |
06:54
π
|
|
db48x has joined #archiveteam |
07:18
π
|
|
Swizzle has quit IRC (Quit: HydraIRC -> http://www.hydrairc.com <- Wibbly Wobbly IRC) |
07:19
π
|
SketchCow |
Who wants it: |
07:19
π
|
SketchCow |
http://archiveteam.org/index.php?title=Scoop |
07:47
π
|
|
db48x has quit IRC (Read error: Operation timed out) |
07:59
π
|
|
APerti has joined #archiveteam |
08:11
π
|
|
mistym_ has joined #archiveteam |
08:16
π
|
|
mistym has quit IRC (Read error: Operation timed out) |
08:31
π
|
|
Daloader_ has quit IRC (Quit: Leaving) |
08:37
π
|
|
Ctrl-S has joined #archiveteam |
08:44
π
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
08:47
π
|
|
dashcloud has joined #archiveteam |
08:49
π
|
|
brayden has quit IRC (Read error: Operation timed out) |
09:01
π
|
|
schbirid has joined #archiveteam |
09:02
π
|
|
brayden has joined #archiveteam |
09:08
π
|
|
primus104 has joined #archiveteam |
09:50
π
|
|
primus104 has quit IRC (Leaving.) |
10:00
π
|
|
mistym_ has quit IRC (Remote host closed the connection) |
10:01
π
|
|
Ymgve has joined #archiveteam |
10:21
π
|
|
BlueMaxim has quit IRC (Quit: Leaving) |
10:42
π
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
10:45
π
|
|
dashcloud has joined #archiveteam |
11:11
π
|
|
MMovie1 has joined #archiveteam |
11:13
π
|
|
MMovie has quit IRC (Read error: Operation timed out) |
11:16
π
|
|
MMovie1 has quit IRC (Client Quit) |
11:16
π
|
|
MMovie has joined #archiveteam |
11:30
π
|
|
wp494_ is now known as wp494 |
12:20
π
|
|
Emcy_ has quit IRC (Ping timeout: 480 seconds) |
12:22
π
|
|
BiggieJo1 has joined #archiveteam |
12:27
π
|
|
BiggieJon has quit IRC (Read error: Operation timed out) |
12:33
π
|
SketchCow |
FOS is not 100% enjoying, but is dealing with MS Clip Art pretty well. |
12:34
π
|
SketchCow |
Disk space usage on the machine in that drive is holding up nicely, mostly due to automatic processes now shoving things out. |
12:47
π
|
SadDM |
APerti: I think the original PC version of "Sid Meier's Pirates!" was like that. You needed to boot from the game floppy, and I seem to recall that it was unreadable in DOS. |
13:27
π
|
|
Start has joined #archiveteam |
13:29
π
|
|
Ctrl-S has quit IRC (Ping timeout: 845 seconds) |
13:30
π
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
13:33
π
|
|
dashcloud has joined #archiveteam |
13:38
π
|
|
Start has quit IRC (Ping timeout: 265 seconds) |
13:47
π
|
|
Ctrl-S has joined #archiveteam |
13:51
π
|
|
brayden has quit IRC (Ping timeout: 606 seconds) |
13:51
π
|
|
primus104 has joined #archiveteam |
13:56
π
|
|
brayden has joined #archiveteam |
14:23
π
|
|
phuzion_ has quit IRC (Read error: Operation timed out) |
14:26
π
|
|
phuzion has joined #archiveteam |
14:46
π
|
|
ruukasu has quit IRC (Quit: WeeChat 1.0.1) |
14:50
π
|
|
ruukasu has joined #archiveteam |
15:38
π
|
|
Start has joined #archiveteam |
15:39
π
|
|
Start has quit IRC (Client Quit) |
15:39
π
|
|
Start has joined #archiveteam |
15:50
π
|
|
Start has quit IRC (Ping timeout: 252 seconds) |
15:51
π
|
|
APerti has quit IRC (Ping timeout: 370 seconds) |
16:05
π
|
|
Emcy has joined #archiveteam |
16:06
π
|
godane |
SketchCow: i'm looking at scoop |
16:06
π
|
godane |
looks like the .xml.gz are really not gzip |
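A quick way to check godane's observation is to look at the first two bytes of each file, since every gzip stream starts with the magic number 0x1f 0x8b. The sketch below is only an illustration: the helper names are made up here, and it assumes the mislabeled files are plain, uncompressed data.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream (RFC 1952)


def looks_like_gzip(data: bytes) -> bool:
    """Return True if the data starts with the gzip magic number."""
    return data[:2] == GZIP_MAGIC


def read_maybe_gzip(data: bytes) -> bytes:
    """Decompress real gzip data; return anything else unchanged.

    Assumption: a file that is not gzip is already plain content.
    """
    if looks_like_gzip(data):
        return gzip.decompress(data)
    return data
```

In practice one would read each .xml.gz file in binary mode and pass its bytes to read_maybe_gzip before parsing.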
16:18
π
|
|
lhobas_ has joined #archiveteam |
16:20
π
|
|
wacky has joined #archiveteam |
16:25
π
|
|
db48x has joined #archiveteam |
16:26
π
|
|
xk_id has joined #archiveteam |
16:36
π
|
|
Lord_Nigh has quit IRC (Read error: Operation timed out) |
16:42
π
|
|
Ristovski has joined #archiveteam |
16:44
π
|
Ristovski |
Hello, are you guys planning on archiving the pastebin public paste list? |
16:49
π
|
dashcloud |
there was a project or script doing that- not sure what the current status is though- maybe someone else here remembers more about it |
16:49
π
|
Ristovski |
I see |
16:50
π
|
dashcloud |
could be an archive on internet archive |
17:04
π
|
|
Lord_Nigh has joined #archiveteam |
17:21
π
|
|
signius has quit IRC (Read error: Operation timed out) |
17:22
π
|
|
signius has joined #archiveteam |
17:26
π
|
|
robink has joined #archiveteam |
17:33
π
|
joepie91 |
Ristovski: hi |
17:33
π
|
joepie91 |
I was doing that, it eventually got banned and/or broke and/or otherwise stopped working |
17:33
π
|
joepie91 |
haven't gotten around to fixing it yet |
17:34
π
|
joepie91 |
Ristovski: https://github.com/joepie91/pastebin-scrape/tree/develop |
17:34
π
|
joepie91 |
Ristovski: if you're bored, feel free to grab the code and see if it works on another system, and/or fix it where necessary, and I'll happily spin it up again :) |
17:34
π
|
Ristovski |
joepie91, I have created a pastebin crawler myself, just wanted to see if the guys here already plan on doing something like this |
17:35
π
|
Ristovski |
joepie91, mine works like yours, but more low-level |
17:35
π
|
Ristovski |
no fancy io pipes and such |
17:36
π
|
joepie91 |
Ristovski: low level in what sense? :P |
17:36
π
|
raylee |
my friend made one, that basically uses a sql db and has a server-client architecture, so multiple boxes can scrape and throw it in the same db etc |
17:36
π
|
raylee |
with a searchable interface |
17:36
π
|
Ristovski |
joepie91, I made it in like 10 mins, you get the idea :D |
17:37
π
|
joepie91 |
raylee: that's massive overkill for pastebin :P |
17:38
π
|
joepie91 |
okay, so, raylee, basically; a single box can easily scrape all of pastebin |
17:39
π
|
joepie91 |
the paste volume isn't /that/ high |
17:39
π
|
joepie91 |
distributed architecture is a bit overkill, really |
17:39
π
|
joepie91 |
and just more potential points of failure :P |
17:39
π
|
raylee |
he was using distributed architecture mainly to work around bans |
17:39
π
|
raylee |
i believe he got one of his ips permabanned |
17:39
π
|
Ristovski |
raylee, as in, distribute the requests between more clients so it doesn't hit the requests-per-client limit so easily? |
17:40
π
|
joepie91 |
might be what happened to my box also, but it took a number of months |
17:40
π
|
joepie91 |
(not exaggerating) |
17:41
π
|
raylee |
Ristovski, yes |
17:41
π
|
raylee |
the distribution being more for redundancy than capacity |
17:41
π
|
Ristovski |
yeah |
17:43
π
|
|
RichardG has joined #archiveteam |
17:45
π
|
Ctrl-S |
sweet, a pastebin scraper |
17:45
π
|
Ctrl-S |
I made one for myself, how did you get past the ratelimits on raw pastes? |
17:46
π
|
joepie91 |
Ctrl-S: by not hitting it? :P |
17:46
π
|
joepie91 |
you just need to not go too fast |
17:46
π
|
joepie91 |
you can archive every single paste without hitting the ratelimiter |
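joepie91's point about not going too fast can be sketched as a minimum-interval throttle that is called before each request. This is not the code from pastebin-scrape. The class name and the interval value are assumptions, and the chat does not say what rate the site actually tolerates.

```python
import time


class Throttle:
    """Enforce a minimum interval between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between requests (assumed value)
        self._clock = clock               # injectable so the class can be tested
        self._sleep = sleep
        self._last = None                 # time of the previous request, if any

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

A scraper would create one Throttle and call wait() right before every fetch, so slow responses never cause extra sleeping and fast ones are spaced out.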
17:47
π
|
Ctrl-S |
I ended up just saving the view page |
17:47
π
|
joepie91 |
also, raylee, Ristovski, Ctrl-S, for context, this is the collection of historical pastes I have: https://archive.org/details/pastebinpastes |
17:47
π
|
Ctrl-S |
A bunch of writers on 4chan use pastebin to host their stories |
17:49
π
|
Ctrl-S |
how big is the collection in total? |
17:49
π
|
joepie91 |
223 RESULTS |
17:49
π
|
joepie91 |
:) |
17:49
π
|
Ctrl-S |
I mean in GB |
17:49
π
|
joepie91 |
for most of those days, it has all the pastes |
17:49
π
|
joepie91 |
no idea, but it's tiny |
17:49
π
|
Ctrl-S |
just for that day? |
17:50
π
|
joepie91 |
arbitrary day: 16.3M gzipped |
17:50
π
|
Ctrl-S |
or does it scan each possible paste? |
17:50
π
|
joepie91 |
I think it extracts to 200MB per day or so, at most |
17:50
π
|
joepie91 |
Ctrl-S: what do you mean? |
17:50
π
|
Ctrl-S |
and does it follow links it finds in pastes to other pastes? |
17:50
π
|
joepie91 |
no, it just scrapes the 'latest pastes' list and fetches each one on a loop |
17:50
π
|
joepie91 |
so it grabs pastes as they are posted |
17:50
π
|
Ctrl-S |
I'd just give you the code to my script if my raid hadn't just died |
17:50
π
|
joepie91 |
before they have a chance to get deleted, really |
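The approach joepie91 describes (scrape the 'latest pastes' list, fetch each new paste, repeat) could look roughly like the single pass below. It is a sketch under stated assumptions, not the actual pastebin-scrape code: list_recent_ids, fetch_paste and store are hypothetical callables supplied by the caller, so no real URL or site API is assumed here.

```python
def poll_once(list_recent_ids, fetch_paste, seen, store):
    """Fetch every paste on the recent list that has not been saved yet.

    list_recent_ids() returns paste IDs from the 'latest pastes' list,
    fetch_paste(paste_id) returns the paste body, store(paste_id, body)
    saves it, and seen is a set of IDs that are already archived.
    All three callables are hypothetical placeholders.
    Returns the list of IDs fetched in this pass.
    """
    fetched = []
    for paste_id in list_recent_ids():
        if paste_id in seen:
            continue  # already archived on an earlier pass
        store(paste_id, fetch_paste(paste_id))
        seen.add(paste_id)
        fetched.append(paste_id)
    return fetched
```

Running poll_once in a loop with a pause between passes grabs pastes shortly after they are posted, and the seen set keeps overlapping lists from causing duplicate fetches.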
17:52
π
|
Ctrl-S |
And this is run continuously? |
17:53
π
|
Ctrl-S |
also this looks wrong: https://github.com/joepie91/pastebin-scrape/blob/develop/start.py |
17:53
π
|
Ctrl-S |
zmq is imported repeatedly with different messages |
17:53
π
|
Ctrl-S |
lines 12-16 |
17:56
π
|
joepie91 |
oh yeah, that's a bug, lol |
17:56
π
|
joepie91 |
one of those things i hadn't gotten around to fixing yet |
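One common way to avoid repeating an import with several different error messages is to route every dependency check through a single helper. The contents of start.py are not shown in this log, so the following is only a guess at the shape of a fix. The require helper and its message format are invented for illustration.

```python
import importlib


def require(module_name, hint):
    """Import a module once, exiting with one clear message if it is missing.

    Hypothetical helper: module_name is the import name, hint is the
    install advice shown to the user.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise SystemExit(f"Missing dependency {module_name!r}: {hint}") from exc
```

With this, a script would call something like require("zmq", "install pyzmq") exactly once near the top and reuse the returned module.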
18:04
π
|
|
aMunster has quit IRC (Read error: Operation timed out) |
18:11
π
|
|
APerti has joined #archiveteam |
18:36
π
|
Ctrl-S |
So we have ALL the pastes, or just some of them? |
18:40
π
|
|
aaaaaaaaa has joined #archiveteam |
18:50
π
|
|
db48x has quit IRC (Ping timeout: 258 seconds) |
18:54
π
|
|
aMunster has joined #archiveteam |
18:56
π
|
joepie91 |
Ctrl-S: all the public pastes in the timespan where the scraper was functioning |
18:57
π
|
|
RichardG has quit IRC (Ping timeout: 186 seconds) |
19:01
π
|
|
lytv has quit IRC (Quit: Leaving) |
19:03
π
|
|
RichardG has joined #archiveteam |
19:13
π
|
|
lytv has joined #archiveteam |
19:19
π
|
|
BlueMaxim has joined #archiveteam |
19:34
π
|
|
Start has joined #archiveteam |
19:34
π
|
|
Nertsy` has joined #archiveteam |
19:39
π
|
|
mistym has joined #archiveteam |
19:39
π
|
|
Nertsy has quit IRC (Ping timeout: 370 seconds) |
19:55
π
|
|
Start has quit IRC (Ping timeout: 369 seconds) |
20:38
π
|
|
db48x has joined #archiveteam |
20:51
π
|
|
mistym has quit IRC (Remote host closed the connection) |
20:59
π
|
|
db48x has quit IRC (Ping timeout: 258 seconds) |
21:16
π
|
|
Baljem_ is now known as Baljem |
22:05
π
|
|
db48x has joined #archiveteam |
22:26
π
|
|
aaaaaaaaa has quit IRC (Leaving) |
22:26
π
|
|
db48x has quit IRC (Ping timeout: 258 seconds) |
22:57
π
|
|
Ristovski has quit IRC (Quit: Leaving) |
23:49
π
|
|
VonCloud_ has joined #archiveteam |
23:50
π
|
|
VonCloud_ is now known as VonGuar |
23:50
π
|
|
VonGuar is now known as VonGuard |