Time |
Nickname |
Message |
00:29
🔗
|
|
primus104 has quit IRC (Leaving.) |
01:19
🔗
|
|
lbft has joined #archiveteam-bs |
01:23
🔗
|
|
Kazzy has quit IRC (Read error: Operation timed out) |
01:33
🔗
|
|
Kazzy has joined #archiveteam-bs |
01:40
🔗
|
|
Kazzy has quit IRC (Ping timeout: 265 seconds) |
01:40
🔗
|
|
twrist has quit IRC (And now, for my next magic trick..) |
01:44
🔗
|
|
Kazzy has joined #archiveteam-bs |
01:48
🔗
|
|
mistym has quit IRC (Remote host closed the connection) |
01:49
🔗
|
|
Kazzy has quit IRC (Ping timeout: 272 seconds) |
01:51
🔗
|
|
Kazzy has joined #archiveteam-bs |
01:57
🔗
|
|
Kazzy has quit IRC (Ping timeout: 365 seconds) |
02:02
🔗
|
|
Kazzy has joined #archiveteam-bs |
02:05
🔗
|
|
mistym has joined #archiveteam-bs |
02:08
🔗
|
|
DopefishJ has joined #archiveteam-bs |
02:08
🔗
|
|
swebb sets mode: +o DopefishJ |
02:10
🔗
|
|
dashcloud has quit IRC (Ping timeout: 265 seconds) |
02:10
🔗
|
|
DFJustin has quit IRC (Ping timeout: 265 seconds) |
02:10
🔗
|
|
Sellyme_ has quit IRC (Ping timeout: 265 seconds) |
02:10
🔗
|
|
Sellyme has joined #archiveteam-bs |
02:13
🔗
|
|
dashcloud has joined #archiveteam-bs |
02:14
🔗
|
|
Kazzy has quit IRC (Ping timeout: 272 seconds) |
02:17
🔗
|
|
amerrykan has quit IRC (Quit: Quitting) |
02:31
🔗
|
|
DopefishJ has quit IRC (Quit: IMHOSTFU) |
02:31
🔗
|
|
DFJustin has joined #archiveteam-bs |
02:31
🔗
|
|
swebb sets mode: +o DFJustin |
02:45
🔗
|
|
amerrykan has joined #archiveteam-bs |
03:08
🔗
|
|
schbirid2 has quit IRC (Read error: Operation timed out) |
03:09
🔗
|
|
schbirid2 has joined #archiveteam-bs |
03:40
🔗
|
|
ex-parrot has joined #archiveteam-bs |
04:30
🔗
|
|
Kazzy has joined #archiveteam-bs |
04:40
🔗
|
|
ex-parrot has quit IRC (Leaving.) |
04:51
🔗
|
|
aaaaaaaaa has quit IRC (Leaving) |
05:27
🔗
|
|
beardicus has quit IRC (Read error: Operation timed out) |
05:29
🔗
|
|
beardicus has joined #archiveteam-bs |
06:03
🔗
|
|
ex-parrot has joined #archiveteam-bs |
06:50
🔗
|
Kazzy |
wow, bumpy night over at ramnode. I rejoined 5 times overnight? |
07:04
🔗
|
|
Famicoman has quit IRC (Ping timeout: 369 seconds) |
07:08
🔗
|
|
robink has quit IRC (Ping timeout: 492 seconds) |
07:49
🔗
|
|
primus104 has joined #archiveteam-bs |
08:04
🔗
|
|
primus104 has quit IRC (Leaving.) |
08:19
🔗
|
|
robink has joined #archiveteam-bs |
08:36
🔗
|
midas |
bumpy night at efnet |
09:11
🔗
|
joepie91 |
^ |
09:11
🔗
|
joepie91 |
isn't it always :) |
09:16
🔗
|
midas |
since xs4all delinked it is |
09:28
🔗
|
godane |
hey everyone |
09:28
🔗
|
joepie91 |
lol |
09:28
🔗
|
godane |
i'm up at weird hours today |
09:29
🔗
|
godane |
but i have been getting up at 5 or 6 pm then going to bed at 8 am the last 2 days |
09:29
🔗
|
godane |
anyways 1972 pdfs of times news is uploaded |
09:32
🔗
|
midas |
meh |
09:32
🔗
|
midas |
uploading 2TB tar file via torrents > torrent file is over 150MB and too large to load :p |
09:33
🔗
|
|
mistym has quit IRC (Remote host closed the connection) |
09:34
🔗
|
godane |
try bigger pieces? |
09:34
🔗
|
godane |
like up it to 64 or 128mb chunks |
09:34
🔗
|
midas |
hm |
09:35
🔗
|
godane |
that would save on file size of torrent |
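A rough sketch of why bigger pieces shrink the .torrent, assuming a classic (v1) torrent that stores one 20-byte SHA-1 digest per piece. The 2 * 10**12 byte total and the 256 KiB starting piece size are assumptions for illustration; the log only says the tar is about 2TB and that the first .torrent was over 150MB.

```python
SHA1_BYTES = 20  # a v1 .torrent stores one 20-byte SHA-1 digest per piece


def piece_count(total_bytes: int, piece_bytes: int) -> int:
    """Number of pieces needed to cover total_bytes (integer ceiling division)."""
    return -(-total_bytes // piece_bytes)


def pieces_field_bytes(total_bytes: int, piece_bytes: int) -> int:
    """Size of the concatenated piece hashes, which dominates the .torrent size."""
    return piece_count(total_bytes, piece_bytes) * SHA1_BYTES


TOTAL = 2 * 10**12  # assumed payload size; the real tar will differ slightly

for label, piece in (("256 KiB", 256 * 2**10), ("256 MiB", 256 * 2**20)):
    print(label, piece_count(TOTAL, piece), "pieces,",
          pieces_field_bytes(TOTAL, piece), "bytes of hashes")
```

Under these assumptions the small-piece torrent carries roughly 150 MB of hashes, while the 256 MiB variant needs only a few thousand pieces and well under 1 MB.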
09:35
🔗
|
midas |
yeah |
09:36
🔗
|
godane |
may become a pain for people with slow connections but this is a 2tb tar file |
09:37
🔗
|
midas |
well it's just a upload to IA, so care for people with slow connections :p |
09:37
🔗
|
godane |
if the torrent is being uploaded to IA then they should be able to get it |
09:38
🔗
|
midas |
current issue is that rtorrent wont even load it on my side because it's huge :p |
09:39
🔗
|
godane |
ok |
09:40
🔗
|
midas |
changed it to the max size of mktorrent, lets see what it does |
09:47
🔗
|
joepie91 |
midas: gimme torrent file |
09:47
🔗
|
joepie91 |
I want to see whether qbittorrent can handle it |
09:47
🔗
|
joepie91 |
:P |
09:48
🔗
|
joepie91 |
aside; there's a torrent client SPECIFICALLY for seeding large (amounts of) torrent files |
09:48
🔗
|
joepie91 |
another aside: are you sure you want to upload a single 2TB TAR to IA :P |
10:02
🔗
|
midas |
yep :p |
10:03
🔗
|
midas |
already deleted torrent, creating new one |
10:22
🔗
|
arkiver |
<midas>uploading 2TB tar file via torrents > torrent file is over 150MB and too large to load :p |
10:22
🔗
|
arkiver |
remember to set the size when you upload |
10:33
🔗
|
midas |
arkiver: what do you mean? this is too big to load in rtorrent |
10:35
🔗
|
arkiver |
Oh, I didn't know a torrent could be too big |
10:36
🔗
|
joepie91 |
oh hum, that's a good point |
10:36
🔗
|
joepie91 |
I hope IA can deal with this... |
10:36
🔗
|
joepie91 |
as in, grab the sizehint from the torrent.. |
10:37
🔗
|
arkiver |
we'll see |
10:43
🔗
|
midas |
in theory it should |
10:44
🔗
|
arkiver |
midas: do you already have it uploaded? the torrent? |
10:44
🔗
|
midas |
not yet, still hashing |
10:44
🔗
|
arkiver |
ah, ok |
10:44
🔗
|
midas |
Hashed 6444 of 7468 pieces. |
10:45
🔗
|
arkiver |
I'd like to see the deriving progress of the torrent in IA, so if you have it uploaded, please share the identifier with us :) |
10:45
🔗
|
arkiver |
just be sure to provide a size hint while uploading the torrent |
10:46
🔗
|
midas |
sure :) |
11:02
🔗
|
midas |
https://catalogd.archive.org/log/353604251 |
11:05
🔗
|
|
Bobby_ has joined #archiveteam-bs |
11:19
🔗
|
midas |
now we wait, a lot |
11:22
🔗
|
midas |
probably first need to wait until archive.org actually connects to dht/tracker before it does anything :p |
11:23
🔗
|
arkiver |
looks good, seems to be working |
11:23
🔗
|
midas |
not seeing that much outbound traffic yet |
11:24
🔗
|
arkiver |
I'm always starting the torrent when the torrent is already started by IA |
11:24
🔗
|
arkiver |
I don't know a lot about torrents. But you might want to try refreshing the trackers or stopping and starting the torrent |
11:24
🔗
|
|
brayden has quit IRC (Read error: Operation timed out) |
11:25
🔗
|
midas |
ill just wait |
11:25
🔗
|
arkiver |
ok |
11:25
🔗
|
arkiver |
how long did it take you to download that ftp? |
11:28
🔗
|
godane |
btw i'm finally uploading more of theblaze.com site |
11:29
🔗
|
godane |
i'm doing it by month cause trying to grab everything in wp-content path causes it to crash if i try to do the full year of 2011 |
11:36
🔗
|
midas |
arkiver: couple of weeks |
11:37
🔗
|
midas |
in total my ftp dump is now 22TB |
11:59
🔗
|
midas |
verification of data, waiting |
12:24
🔗
|
joepie91 |
FYI: Groupon is attempting to steamroll GNOME's trademarks by registering their own and releasing their own hardware and "operating system" named GNOME: http://gnome.org/groupon |
12:24
🔗
|
|
primus104 has joined #archiveteam-bs |
12:28
🔗
|
midas |
joepie91: added to archivebot for safekeeping |
12:28
🔗
|
joepie91 |
midas: yes, aware |
12:28
🔗
|
joepie91 |
that's where I originally heard of it |
12:28
🔗
|
joepie91 |
before freenode /notice'd it out |
12:28
🔗
|
joepie91 |
:) |
12:29
🔗
|
midas |
lol |
12:29
🔗
|
joepie91 |
I get a disturbing amount of my news from #archivebot |
12:29
🔗
|
joepie91 |
lol |
12:29
🔗
|
joepie91 |
it's pretty much my "shit ACTUALLY worth reading" newsfeed |
12:29
🔗
|
midas |
same |
12:29
🔗
|
midas |
via twitter and the channel, it's amazing |
12:31
🔗
|
joepie91 |
hehe, indeed |
12:31
🔗
|
arkiver |
bankruptcies ^ :) |
12:31
🔗
|
joepie91 |
midas: it's kinda funny when people ask me "what news sites do you read? because <insert mainstream site> kinda sucks..." |
12:32
🔗
|
joepie91 |
and I just go "uh... well... none, really... I just pick up stuff here and there..." |
12:32
🔗
|
midas |
err. well... i read https://twitter.com/archivebot |
12:36
🔗
|
arkiver |
midas: I see it didn't start yet |
12:36
🔗
|
arkiver |
maybe it's creating the 1.8 TB file before starting to download? |
12:36
🔗
|
midas |
still loading data in transmission on my side it seems |
12:36
🔗
|
arkiver |
oh ok |
12:36
🔗
|
midas |
519.9 GB of 2.00 TB (25.9%) |
12:37
🔗
|
midas |
the issue with a atombox, single core work is sorta hard for it |
12:37
🔗
|
arkiver |
haha, that will take some time |
12:38
🔗
|
midas |
yep |
12:39
🔗
|
midas |
oh well, when it starts it should be fine |
12:39
🔗
|
arkiver |
I'm looking forward to seeing if it will actually get in, the full 2 TB |
12:40
🔗
|
midas |
same |
12:50
🔗
|
joepie91 |
midas: any particular reason you're using transmission? |
12:51
🔗
|
midas |
rtorrent didnt want to load it |
12:51
🔗
|
midas |
and headless box |
12:52
🔗
|
joepie91 |
midas: hold on |
12:52
🔗
|
* |
midas grabs table |
12:52
🔗
|
joepie91 |
lol |
12:52
🔗
|
joepie91 |
midas: there's a torrent client *specifically* for seeding many or large torrents |
12:52
🔗
|
joepie91 |
it does just that and nothing else |
12:52
🔗
|
joepie91 |
trying to find it for you |
12:52
🔗
|
midas |
ok |
12:52
🔗
|
midas |
still holding on to table, just to be sure |
12:54
🔗
|
joepie91 |
midas: http://www.pps.univ-paris-diderot.fr/~jch/software/hekate/ |
12:54
🔗
|
joepie91 |
:) |
12:54
🔗
|
joepie91 |
it's headless |
12:54
🔗
|
joepie91 |
and as I understand it, it's extremely low-overhead |
12:54
🔗
|
joepie91 |
so it should be perfect for this kind of thing |
12:55
🔗
|
joepie91 |
assuming your .torrent file has already been created, of course |
12:55
🔗
|
midas |
lol yep |
12:55
🔗
|
joepie91 |
okay :P |
12:55
🔗
|
joepie91 |
then it should do fine |
12:55
🔗
|
joepie91 |
protip: don't shut down transmission yet |
12:55
🔗
|
joepie91 |
both are single-core afaik so you can just run them both and see whether hekate is any faster |
12:55
🔗
|
midas |
lol nope ill leave her running :p |
12:55
🔗
|
joepie91 |
if it's not, then at least you won't lose your 500 gigs of work |
12:55
🔗
|
joepie91 |
lol |
12:55
🔗
|
midas |
650 now, amazing isnt it :p |
12:56
🔗
|
joepie91 |
lol |
12:56
🔗
|
joepie91 |
but yeah, try hekate :D |
12:56
🔗
|
midas |
doing that now |
12:56
🔗
|
midas |
BOOO! |
12:56
🔗
|
midas |
fatal: remote error: access denied or repository not exported: /hekate |
12:57
🔗
|
midas |
oh my |
12:57
🔗
|
midas |
the first link is just dead |
12:57
🔗
|
midas |
maybe we should just grab it in case of fire |
12:57
🔗
|
joepie91 |
heh |
12:58
🔗
|
joepie91 |
"in case of fire, please !a" |
12:58
🔗
|
midas |
yep :p |
12:58
🔗
|
joepie91 |
holy shit looks like my IA processing code is working so far |
12:58
🔗
|
|
Bobby__ has joined #archiveteam-bs |
12:58
🔗
|
joepie91 |
node.js streams are amazing :D |
12:58
🔗
|
joepie91 |
... speaking of the devil |
12:58
🔗
|
midas |
screenshots yet? :p |
12:58
🔗
|
joepie91 |
ohai Bobby_ |
12:58
🔗
|
joepie91 |
er |
12:58
🔗
|
joepie91 |
Bobby__ |
12:59
🔗
|
joepie91 |
midas: screenshots? it's a script :P |
12:59
🔗
|
joepie91 |
no GUI |
12:59
🔗
|
midas |
lol :p |
12:59
🔗
|
joepie91 |
I've now gotten to the point that I can give it a collection name |
12:59
🔗
|
joepie91 |
and it'll download all of the WARC index files in every item of that collection |
12:59
🔗
|
Bobby__ |
Hi, I got kicked as Bobby and Bobby_, so now I'm Bobby__ ... :P |
12:59
🔗
|
joepie91 |
gunzip them |
12:59
🔗
|
joepie91 |
and chomp them into lines |
12:59
🔗
|
joepie91 |
next up: filtering for the lines we want and downloading the WARCs |
12:59
🔗
|
joepie91 |
or well, WARC segments really |
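The "filter the index lines" step might look roughly like the sketch below. It assumes the index is a space-separated CDX file whose first line is a legend such as ` CDX N b a m s k r M S V g`, where `a` is the original URL, `m` the MIME type, `S` the compressed record size, `V` the compressed offset and `g` the WARC filename. The sample rows and the `/user/alice` filter are made up for illustration; the real Hyves indexes may use a different column set, which is why the code reads the legend instead of hard-coding positions.

```python
def parse_cdx(lines):
    """Yield one dict per CDX row, keyed by the field letters from the legend line."""
    fields = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(" CDX "):
            fields = line.split()[1:]  # drop the literal "CDX" marker
            continue
        if not line or fields is None:
            continue
        yield dict(zip(fields, line.split(" ")))


def wanted(rows, needle):
    """Keep HTML records whose original URL contains needle.

    Yields (warc filename, compressed offset, compressed size).
    """
    for row in rows:
        if needle in row.get("a", "") and row.get("m") == "text/html":
            yield row["g"], int(row["V"]), int(row["S"])


# Hypothetical sample data, not taken from any real index.
SAMPLE = [
    " CDX N b a m s k r M S V g",
    "com,example)/user/alice 20131101000000 http://example.com/user/alice "
    "text/html 200 AAAA - - 1234 5678 example-0001.warc.gz",
    "com,example)/static/logo.png 20131101000001 http://example.com/static/logo.png "
    "image/png 200 BBBB - - 999 6912 example-0001.warc.gz",
]

for hit in wanted(parse_cdx(SAMPLE), "/user/alice"):
    print(hit)
```

Keeping the offset and size next to the filename means the later download step can ask for just those bytes.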
12:59
🔗
|
joepie91 |
hm. |
13:00
🔗
|
* |
joepie91 thinks |
13:00
🔗
|
midas |
inb4 joepie91 does input % and disconnects from the internet |
13:00
🔗
|
joepie91 |
what? :P |
13:00
🔗
|
midas |
match all, downloads archive.org :p |
13:01
🔗
|
garyrh |
heh, only 20PB or so |
13:01
🔗
|
|
Bobby_ has quit IRC (Read error: Operation timed out) |
13:01
🔗
|
midas |
actually more 45PB :p |
13:02
🔗
|
arkiver |
20039 TB |
13:03
🔗
|
garyrh |
the 20pb is the deduped data iirc |
13:04
🔗
|
godane |
i added the domain names to the keywords of the first archivebot grab: https://archive.org/details/archiveteam_archivebot_go_001 |
13:04
🔗
|
arkiver |
IA has 22403 TB, 20039 TB filled, 1074 TB Free |
13:04
🔗
|
godane |
this may be the best way for us to know what the hell we have archived |
13:04
🔗
|
arkiver |
51980 TB in total with backups |
13:04
🔗
|
godane |
with archivebot |
13:05
🔗
|
midas |
i hate symlinks in FTP's |
13:05
🔗
|
midas |
ftp://ftp.sunet.se/pub/databases/object-oriented/obst/current/gzip/gzip/gzip/gzip/gzip/gzip |
13:05
🔗
|
godane |
arkiver: how long does it take for IA to fill 1PB? |
13:06
🔗
|
joepie91 |
[14:00] <midas> match all, downloads archive.org :p |
13:06
🔗
|
joepie91 |
heh |
13:06
🔗
|
joepie91 |
:3 |
13:06
🔗
|
arkiver |
godane: depends https://archive.org/~tracey/mrtg/df-year.png |
13:06
🔗
|
arkiver |
sometimes a month |
13:06
🔗
|
arkiver |
sometimes 4 months |
13:07
🔗
|
midas |
depends how mad we get and start archiving websites like twitpic |
13:07
🔗
|
midas |
:p |
13:07
🔗
|
arkiver |
haha |
13:07
🔗
|
arkiver |
currently free space is dropping very fast |
13:08
🔗
|
godane |
domains added to pack 152: https://archive.org/details/archiveteam_archivebot_go_152 |
13:10
🔗
|
godane |
so another 1PB maybe added in december i guess |
13:11
🔗
|
arkiver |
probably yes |
13:17
🔗
|
godane |
i'm only thinking of holiday break and stuff |
13:18
🔗
|
godane |
i don't think people at IA want to get the call to install 1PB drive on christmas eve |
13:19
🔗
|
godane |
:P |
13:20
🔗
|
midas |
lol :p |
13:20
🔗
|
midas |
mostly because it's the size of a 42U rack |
13:21
🔗
|
midas |
not really a external drive you can carry around |
13:21
🔗
|
godane |
i know |
13:21
🔗
|
midas |
i know you know :) |
13:22
🔗
|
|
brayden has joined #archiveteam-bs |
13:35
🔗
|
joepie91 |
godane: but it's a christmas present! :) |
13:39
🔗
|
|
primus104 has quit IRC (Read error: Connection reset by peer) |
13:42
🔗
|
|
sankin has joined #archiveteam-bs |
13:43
🔗
|
godane |
i'm at 47k in my inbox |
13:45
🔗
|
godane |
what happened with this item: https://archive.org/details/archiveteam_archivebot_go_20141107000005 |
13:45
🔗
|
godane |
it has a dvd in it |
13:47
🔗
|
godane |
also some of the recent archivebot items only have json but no warc.gz |
13:47
🔗
|
godane |
like this one: https://archive.org/details/archiveteam_archivebot_go_20141106000004 |
13:48
🔗
|
godane |
some urls only have json for some reason |
13:49
🔗
|
|
BlueMaxim has quit IRC (Read error: Operation timed out) |
14:13
🔗
|
joepie91 |
Bobby__: I assume you're going to be around for a while still? |
14:14
🔗
|
|
sankin has quit IRC (Leaving.) |
14:18
🔗
|
|
sankin has joined #archiveteam-bs |
14:22
🔗
|
midas |
i hope the marriage isnt like tomorrow :p |
14:29
🔗
|
Bobby__ |
Hey Joepie91 and Midas :) No, They are planning on getting married next summer, so I'll definitely stick around. There are many more sources I'm pulling pictures and stories from, so I can keep working quite a while longer before I can dive into the realms of hyves:P |
14:34
🔗
|
joepie91 |
Bobby__: alright |
14:34
🔗
|
joepie91 |
I'm currently working on code to basically parse out the entire hyves grab efficiently |
14:34
🔗
|
joepie91 |
but I've hit a bit of a snag (how does one selectively read out of gzipped files remotely) so it may take me a while |
14:35
🔗
|
joepie91 |
:p |
14:35
🔗
|
joepie91 |
also I'm writing this as a sort of educational example of "how do you automatically process a large archiveteam megawarc dump" for other projects, so that makes things a little more complicated as well since I can't do quick hacks, heh |
14:40
🔗
|
Bobby__ |
Haha, well, I am grateful for everything y'all do:) |
14:54
🔗
|
|
SN4T14_ has joined #archiveteam-bs |
14:54
🔗
|
|
lysobit has quit IRC (Read error: No route to host) |
14:55
🔗
|
|
Zebranky_ has quit IRC (Read error: No route to host) |
14:55
🔗
|
|
dashcloud has quit IRC (Read error: No route to host) |
14:58
🔗
|
|
Zebranky has joined #archiveteam-bs |
14:58
🔗
|
|
Bobby_ has joined #archiveteam-bs |
14:58
🔗
|
|
brayden_ has joined #archiveteam-bs |
14:59
🔗
|
|
brayden has quit IRC (Read error: No route to host) |
15:00
🔗
|
|
dashcloud has joined #archiveteam-bs |
15:00
🔗
|
|
lysobit has joined #archiveteam-bs |
15:00
🔗
|
|
SN4T14__ has quit IRC (Read error: No route to host) |
15:02
🔗
|
|
Bobby__ has quit IRC (Ping timeout: 378 seconds) |
15:05
🔗
|
schbirid2 |
oh god i looked at gaia online and want to bleach my eyes now |
15:08
🔗
|
antomatic |
Save all the chibis! |
15:11
🔗
|
godane |
i'm at 272k items uploaded |
15:15
🔗
|
espes__ |
joepie91: yeah that's the thing |
15:15
🔗
|
espes__ |
we should get better at pulling useful data from archiveteam dumps |
15:16
🔗
|
espes__ |
oh, and can't you do all what you said with like, curl /search | jq | xargs curl | zcat | grep? :P |
15:16
🔗
|
espes__ |
(pull all the indexes, that is) |
15:17
🔗
|
espes__ |
oh right, "without hacks" :P |
15:18
🔗
|
joepie91 |
espes__: well, you see, the problem here is |
15:19
🔗
|
joepie91 |
gzip doesn't *really* support random access |
15:19
🔗
|
joepie91 |
and I wasn't planning on running through X times 50GB megawarc files |
15:19
🔗
|
joepie91 |
to find a user |
15:19
🔗
|
joepie91 |
lol |
15:19
🔗
|
midas |
CRAP |
15:19
🔗
|
midas |
ERROR: No space left on device (/t/_ftp.atnf.csiro.au//2014.11.ftp.atnf.csiro.au.tar) |
15:19
🔗
|
joepie91 |
there is supposedly a way to do random access in gzip, though |
15:19
🔗
|
joepie91 |
so I'm learning gzip now |
15:19
🔗
|
joepie91 |
lol |
15:19
🔗
|
joepie91 |
midas: gave it a sizehint? |
15:19
🔗
|
garyrh |
WHAT HAVE YOU DONE MIDAS |
15:20
🔗
|
midas |
joepie91: this was a big hint: Downloading 1.8 TB |
15:20
🔗
|
joepie91 |
midas: not what I mean |
15:20
🔗
|
joepie91 |
did you specify a sizehint during upload |
15:20
🔗
|
joepie91 |
lol |
15:20
🔗
|
midas |
i know what you mean |
15:20
🔗
|
midas |
nu |
15:20
🔗
|
midas |
no* |
15:20
🔗
|
joepie91 |
well there's your problem |
15:20
🔗
|
joepie91 |
:p |
15:20
🔗
|
midas |
no really? :P |
15:22
🔗
|
midas |
hm now just a way to fix it |
15:23
🔗
|
espes__ |
joepie91: so that's the other thing. inspecting everything with range queries on the warcs would be like, 50 million requests |
15:23
🔗
|
espes__ |
which is maybe a lot? |
15:23
🔗
|
joepie91 |
espes__: not really |
15:24
🔗
|
joepie91 |
if you cache the host that the file lives on so that you don't have to bother the /download/ redirector all the time, it should be less taxing than downloading the WARCs wholesale |
15:24
🔗
|
joepie91 |
HTTP overhead should be minimal and you're still not reading more data than if you were to download them entirely |
15:24
🔗
|
joepie91 |
the only thing that would happen is a minor increase in random reads, but that should be negligible |
15:25
🔗
|
joepie91 |
let's not forget that, as far as I understand it, the above is basically what the wayback machine does, except without the HTTP requests :P |
15:25
🔗
|
joepie91 |
but afaik the wayback serves pages directly from the warc.gz files |
15:25
🔗
|
joepie91 |
using offset reads |
15:25
🔗
|
joepie91 |
(which is what the CDXes are for) |
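A minimal sketch of such an offset read, assuming the usual record-at-a-time layout in which each record of a `.warc.gz` is its own gzip member, so the bytes from `offset` to `offset + length - 1` decompress on their own. `fetch_record` and its `url` argument are hypothetical; the URL would have to point at the storage host directly, and the server has to honor HTTP `Range` requests.

```python
import gzip
import urllib.request


def range_header(offset: int, length: int) -> dict:
    """HTTP Range header covering exactly one compressed record (end is inclusive)."""
    return {"Range": f"bytes={offset}-{offset + length - 1}"}


def read_record(blob: bytes, offset: int, length: int) -> bytes:
    """Decompress one record out of an in-memory megawarc-style blob."""
    return gzip.decompress(blob[offset:offset + length])


def fetch_record(url: str, offset: int, length: int) -> bytes:
    """Fetch and decompress one record over HTTP (hypothetical helper, untested
    against archive.org; url is whatever host actually serves the file)."""
    req = urllib.request.Request(url, headers=range_header(offset, length))
    with urllib.request.urlopen(req) as resp:
        return gzip.decompress(resp.read())


# Local demonstration: two independently compressed "records" glued together.
members = [gzip.compress(b"record one"), gzip.compress(b"record two")]
blob = b"".join(members)
print(read_record(blob, len(members[0]), len(members[1])))
```

The local demonstration avoids the network entirely; only `fetch_record` would ever touch a server.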
15:25
🔗
|
schbirid2 |
http://starfrosch.ch/2014/11/11/hot-100-download-charts/ |
15:25
🔗
|
joepie91 |
I'm not awfully familiar with the archive internals, but afaik this is how it all works behind the scenes |
15:25
🔗
|
espes__ |
I think the wayback does actually do http queries to storage |
15:26
🔗
|
joepie91 |
really? that'd surprise me |
15:26
🔗
|
joepie91 |
anyway, tl;dr I'd just be recreating the behaviour of the wayback machine with potentially less overhead |
15:27
🔗
|
joepie91 |
now that you mention it actually... wayback lookups are counted in the download counts for WARCs |
15:27
🔗
|
joepie91 |
so perhaps you're right and it just uses the public HTTP interface |
15:27
🔗
|
espes__ |
so it's still nearly equivalent to doing 50 million queries to the wayback :P |
15:28
🔗
|
joepie91 |
yes, without the overhead of the wayback |
15:28
🔗
|
joepie91 |
:P |
15:28
🔗
|
joepie91 |
I mean, don't forget that the wayback loads static assets and needs to carry out date range searches |
15:28
🔗
|
joepie91 |
or date proximity searches, rather |
15:28
🔗
|
joepie91 |
that's a good bit heavier than doing a few reqs to known offsets in a known file on a known host, heh |
15:29
🔗
|
joepie91 |
and I'd just be pulling HTML of specific pages, too - I don't care about the static assets at this point |
15:29
🔗
|
joepie91 |
I parse through the CDXes first to find items of inetrest |
15:29
🔗
|
joepie91 |
interest * |
15:29
🔗
|
espes__ |
probably only 10 mil with logic assuming most paginated requests are contiguous? |
15:29
🔗
|
joepie91 |
they seem to be contiguous from what I've seen so far - sorted by user, even |
15:29
🔗
|
joepie91 |
which I presume is because of small WARCs being packed up into a megawarc |
15:30
🔗
|
joepie91 |
anyway, the CDX is in the same order as the actual WARC records |
15:30
🔗
|
joepie91 |
and my code handles the requests in the order that they are found in the CDX |
15:30
🔗
|
joepie91 |
so that should minimize disk seeks and such |
15:31
🔗
|
joepie91 |
basically, I'm not very concerned about this taxing the archive too much |
15:31
🔗
|
joepie91 |
and if it does, I'm sure Jason will be screaming in my ear in a few hours |
15:31
🔗
|
joepie91 |
into* |
15:31
🔗
|
joepie91 |
:) |
15:32
🔗
|
espes__ |
yeah, no idea if 100qps is "a lot" |
15:40
🔗
|
midas |
and for a couple of hours joepie91 :P |
15:40
🔗
|
joepie91 |
midas: possibly :P |
15:40
🔗
|
joepie91 |
well |
15:40
🔗
|
joepie91 |
since the internet arcade thing |
15:41
🔗
|
joepie91 |
he can't really complain about anybody taking down the archive |
15:41
🔗
|
joepie91 |
heh |
15:41
🔗
|
joepie91 |
events that can successfully take down the archive: |
15:41
🔗
|
joepie91 |
* internet arcade |
15:41
🔗
|
joepie91 |
* fire |
15:41
🔗
|
|
aaaaaaaaa has joined #archiveteam-bs |
15:41
🔗
|
midas |
* power shortages |
15:41
🔗
|
midas |
very effective |
15:42
🔗
|
midas |
anyway, not sure how im going to rerun this |
15:42
🔗
|
midas |
btw you know what was strange joepie91 |
15:42
🔗
|
midas |
it died after 4MB |
15:43
🔗
|
midas |
4.36MB to be correct |
15:43
🔗
|
joepie91 |
midas: what do you mean? |
15:44
🔗
|
midas |
it only grabbed 4.3MB from the torrent and after that IA commited suicide |
15:44
🔗
|
joepie91 |
aside; per gzip specification, the correct value for the "OS" byte in the member header is "13" when compressing on an Acorn RISCOS system, and "5" when on Atari TOS - "Commodore" is conspicuously absent from the specification |
15:44
🔗
|
arkiver |
I believe a torrent creates the full size when it gets a byte |
15:44
🔗
|
joepie91 |
(I am not joking) |
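For reference, the OS byte sits at offset 9 of the fixed 10-byte gzip member header. The sketch below reads it and maps it through the table from the gzip specification (RFC 1952); the header bytes in the demo are hand-built for illustration, not taken from a real file.

```python
GZIP_OS = {
    0: "FAT filesystem (MS-DOS, OS/2, NT/Win32)",
    1: "Amiga",
    2: "VMS (or OpenVMS)",
    3: "Unix",
    4: "VM/CMS",
    5: "Atari TOS",
    6: "HPFS filesystem (OS/2, NT)",
    7: "Macintosh",
    8: "Z-System",
    9: "CP/M",
    10: "TOPS-20",
    11: "NTFS filesystem (NT)",
    12: "QDOS",
    13: "Acorn RISCOS",
    255: "unknown",
}


def gzip_os_name(header: bytes) -> str:
    """Return the OS name recorded in a gzip member header."""
    if len(header) < 10 or header[:2] != b"\x1f\x8b":
        raise ValueError("not a gzip member header")
    return GZIP_OS.get(header[9], f"unassigned ({header[9]})")


def fake_header(os_byte: int) -> bytes:
    """Hand-built header: magic, CM=8 (deflate), no flags, zero mtime, XFL=0, OS."""
    return bytes([0x1F, 0x8B, 8, 0, 0, 0, 0, 0, 0, os_byte])


print(gzip_os_name(fake_header(13)))
print(gzip_os_name(fake_header(5)))
```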
15:44
🔗
|
arkiver |
so IA started creating a 1.8 TB file |
15:44
🔗
|
arkiver |
after the 4.3 MB was sended |
15:45
🔗
|
midas |
hm |
15:47
🔗
|
ersi |
sendededed |
15:47
🔗
|
joepie91 |
ok.... |
15:47
🔗
|
* |
joepie91 stares |
15:47
🔗
|
joepie91 |
so I'm reading the gzip spec |
15:47
🔗
|
joepie91 |
and I ran across this |
15:47
🔗
|
joepie91 |
"This contains a Cyclic Redundancy Check value of the uncompressed data computed according to CRC-32 algorithm used in the ISO 3309 standard and in section 8.1.1.6.2 of ITU-T recommendation V.42. (See http://www.iso.ch for ordering ISO documents. See gopher://info.itu.ch for an online version of ITU-T V.42.)" |
15:47
🔗
|
joepie91 |
brb start my gopher client |
15:47
🔗
|
ersi |
:D |
15:53
🔗
|
midas |
lol |
15:53
🔗
|
midas |
2.00 TB (7,468 pieces @ 256.0 MiB) |
15:55
🔗
|
midas |
11 days of uploading |
15:55
🔗
|
midas |
diz gonna be fun |
15:57
🔗
|
godane |
i was not sure you could set it to 256mb |
16:11
🔗
|
arkiver |
midas: how did you make it derive right? |
16:12
🔗
|
arkiver |
derive well* |
16:23
🔗
|
|
aaaaaaaaa has quit IRC (Leaving) |
16:25
🔗
|
|
botpie91 has joined #archiveteam-bs |
16:30
🔗
|
|
aaaaaaaaa has joined #archiveteam-bs |
16:35
🔗
|
schbirid2 |
https://www.google.com/search?q=withgoogle.com |
17:00
🔗
|
godane |
i'm starting to download aol files again |
17:02
🔗
|
godane |
we can brute force gaia online it looks like: http://www.gaiaonline.com/forum/f.889/ |
17:03
🔗
|
|
Bobby_ has quit IRC (Read error: Operation timed out) |
17:04
🔗
|
godane |
www.gaiaonline.com/forum/t.79246943/ |
17:04
🔗
|
godane |
it redirects to the real page link in their index: http://www.gaiaonline.com/forum/my-little-pony/my-little-pony-rules-guidelines-updated-9-5-2012/t.79246943/ |
17:05
🔗
|
schbirid2 |
very awesome http://listen.hatnote.com/#fr,en |
17:06
🔗
|
|
Bobby_ has joined #archiveteam-bs |
17:32
🔗
|
|
primus104 has joined #archiveteam-bs |
17:51
🔗
|
SmileyG |
Soooo anyone looked at grabbing kickstarters... |
17:58
🔗
|
|
primus104 has quit IRC (Leaving.) |
18:04
🔗
|
Bobby_ |
As in, the website? |
18:09
🔗
|
SketchCow |
An interesting idea. |
18:11
🔗
|
Bobby_ |
Sorry for asking so many questions all the time, but what would that mean exactly, grabbing kickstarters?:) |
18:12
🔗
|
joepie91 |
Bobby_: it's fine - I suspect he means archiving pages of kickstarter campaigns |
18:12
🔗
|
joepie91 |
cc SmileyG |
18:13
🔗
|
joepie91 |
which seems like a good idea to me given the tendency for them to vanish and/or be changed and/or for them to fail |
18:16
🔗
|
|
kyan has joined #archiveteam-bs |
18:17
🔗
|
Bobby_ |
Thank you joepie, the more I read here, the more this is starting to interest me, but there is so much jargon, that I continuously have questions:P |
18:20
🔗
|
|
kyan has quit IRC (Client Quit) |
18:22
🔗
|
|
kyan has joined #archiveteam-bs |
18:23
🔗
|
kyan |
Looks like the chat logs at http://badcheese.com/~steve/atlogs/ have died |
18:25
🔗
|
aaaaaaaaa |
swebb used to have a log bot in here, but it disappeared a few days ago. However, there is another logging service. |
18:26
🔗
|
aaaaaaaaa |
http://archive.fart.website/bin/irclogger_logs |
19:02
🔗
|
|
mistym has joined #archiveteam-bs |
19:03
🔗
|
|
Bobby__ has joined #archiveteam-bs |
19:04
🔗
|
joepie91 |
Bobby__: feel free to ask, as long as it's something that you can't trivially Google all will be fine :) |
19:04
🔗
|
joepie91 |
and the jargon is actually not all that extensive, most things mean exactly what they sound like |
19:05
🔗
|
joepie91 |
they're just... not always conventional things to do, heh |
19:05
🔗
|
joepie91 |
eg. "let's download Hyves" is something that'd make you wonder whether there's some kind of jargon there, but it meant exactly that :P |
19:06
🔗
|
joepie91 |
aside; looks like I figured out the gzip issue, I was over-engineering, heh |
19:06
🔗
|
joepie91 |
apparently you can just feed an incomplete chunk of a gzip file into gzip, and it'll just decompress it and barf out a warning |
19:06
🔗
|
|
Bobby_ has quit IRC (Ping timeout: 504 seconds) |
19:06
🔗
|
joepie91 |
saying that it's missing data |
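The same behaviour can be sketched in Python with a streaming decompressor, which hands back whatever it could decode instead of failing on a truncated input. This assumes the chunk starts at the beginning of a gzip member and is only cut short at the end; a chunk that starts mid-stream is a different problem. `gunzip_prefix` is a name made up for this sketch.

```python
import gzip
import zlib


def gunzip_prefix(chunk: bytes):
    """Decompress as much as possible from a possibly truncated gzip member.

    Returns (data, complete); complete is False when the member's end
    (trailer included) was not reached.
    """
    decomp = zlib.decompressobj(zlib.MAX_WBITS | 16)  # | 16 selects gzip framing
    data = decomp.decompress(chunk)
    return data, decomp.eof


original = b"".join(b"line %d\n" % i for i in range(10000))
full = gzip.compress(original)

partial, complete = gunzip_prefix(full[: len(full) // 2])
print(len(partial), "bytes recovered, complete:", complete)
```

The price of stopping early is that the CRC-32 in the member trailer never gets checked, so the recovered bytes are unverified.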
19:06
🔗
|
Bobby__ |
Thanks Joepie, my internet drops every once in a while, thats why I relog as often as I do. But yeah, downloading hyves sounds pretty weird:P |
19:07
🔗
|
joepie91 |
heh, your internet is nowhere near as bad as mine was back on KPN... |
19:07
🔗
|
* |
joepie91 sighs |
19:07
🔗
|
joepie91 |
couldn't stay connected to IRC for more than 5 minutes :| |
19:08
🔗
|
Bobby__ |
Wow, okay, yeah, that would suck:P |
19:09
🔗
|
joepie91 |
Bobby__: the worst part was that it was a FttH (fiber to the home) connection we were paying 75 euro a month for... |
19:09
🔗
|
joepie91 |
anyway. KPN is awful in general |
19:09
🔗
|
joepie91 |
or well, their network / customer support is |
19:09
🔗
|
Bobby__ |
Haha, true that:P |
19:10
🔗
|
joepie91 |
XS4ALL now, which is waaaaaaay better, despite being technically part of KPN |
19:13
🔗
|
Bobby__ |
I believe this is an international problem, providers not living up to expectations and promises. |
19:20
🔗
|
joepie91 |
Bobby__: yep, but KPN is a special kind of terrible |
19:20
🔗
|
joepie91 |
:P |
19:30
🔗
|
Bobby__ |
whaha:P |
19:37
🔗
|
|
primus104 has joined #archiveteam-bs |
19:49
🔗
|
|
aaaaaaaaa has quit IRC (Leaving) |
19:55
🔗
|
|
aaaaaaaaa has joined #archiveteam-bs |
20:00
🔗
|
SmileyG |
hell it's bigger than international, it's a universal problem |
20:01
🔗
|
joepie91 |
SmileyG: did you hear about SaturnusCom?! they raised their prices by 20 krimoleans last month! |
20:01
🔗
|
joepie91 |
:) |
20:02
🔗
|
SmileyG |
:D |
20:02
🔗
|
garyrh |
Goddamn Vogons. |
20:03
🔗
|
SmileyG |
you read one bit of poetry.... |
20:06
🔗
|
|
ex-parro1 has joined #archiveteam-bs |
20:39
🔗
|
|
BlueMaxim has joined #archiveteam-bs |
20:56
🔗
|
kyan |
aaaaaaaaa, thanks :) |
20:58
🔗
|
|
Bobby_ has joined #archiveteam-bs |
21:05
🔗
|
|
Bobby__ has quit IRC (Ping timeout: 504 seconds) |
21:17
🔗
|
joepie91 |
okay guys |
21:17
🔗
|
joepie91 |
for a change in tune |
21:18
🔗
|
joepie91 |
for once, a shutdown that doesn't suck: https://blog.andyet.com/2014/11/11/goodbye-andbang |
21:44
🔗
|
|
mistym has quit IRC (Remote host closed the connection) |
21:49
🔗
|
|
sankin has quit IRC (Leaving.) |
21:57
🔗
|
|
garyrh has quit IRC (Remote host closed the connection) |
21:58
🔗
|
|
cbb has joined #archiveteam-bs |
22:20
🔗
|
|
cbb has quit IRC (Quit: Nettalk6 - www.ntalk.de) |
22:22
🔗
|
|
garyrh has joined #archiveteam-bs |
22:58
🔗
|
|
Bobby__ has joined #archiveteam-bs |
23:00
🔗
|
|
Bobby__ has quit IRC (Client Quit) |
23:00
🔗
|
|
aaaaaaaaa has quit IRC (Read error: Operation timed out) |
23:04
🔗
|
|
Bobby_ has quit IRC (Ping timeout: 378 seconds) |
23:11
🔗
|
|
Kazzy has quit IRC (Ping timeout: 265 seconds) |
23:11
🔗
|
|
Kazzy has joined #archiveteam-bs |
23:13
🔗
|
Kazzy |
zzz, shitty cheap hosts |
23:15
🔗
|
|
kyan has quit IRC (Quit: This computer has gone to sleep) |
23:15
🔗
|
joepie91 |
Kazzy: RamNode NL, yes? |
23:16
🔗
|
Kazzy |
that's the one |
23:16
🔗
|
Kazzy |
multiple times a day, disconnects for a few minutes, comes back |
23:16
🔗
|
joepie91 |
Kazzy: not seeing any connectivity issues to my ramnode NL VPS whatsoever... |
23:16
🔗
|
joepie91 |
and haven't seen any since I got it |
23:16
🔗
|
Kazzy |
i'm connected again now, i just timed out though |
23:16
🔗
|
|
kyan has joined #archiveteam-bs |
23:16
🔗
|
Kazzy |
control panel was inaccessible for me at the time, too |
23:16
🔗
|
joepie91 |
thats rather strange... |
23:17
🔗
|
joepie91 |
anyway, hardly related to a host being cheap :P |
23:17
🔗
|
Kazzy |
http://status.ramnode.com/1281198 |
23:18
🔗
|
joepie91 |
Kazzy: seems to be isolated to that particular node? |
23:18
🔗
|
|
aaaaaaaaa has joined #archiveteam-bs |
23:19
🔗
|
Kazzy |
there's a lot i can see from http://status.ramnode.com/ currently showing as 'down', the graphs on those ones have similar downtime to mine |
23:19
🔗
|
Kazzy |
not all of them though |
23:19
🔗
|
Kazzy |
maybe i'll complain tomorrow, see if i can get put on a better node |
23:20
🔗
|
joepie91 |
huh, weird |
23:20
🔗
|
joepie91 |
that's a very, very strange graph |
23:20
🔗
|
joepie91 |
or well |
23:20
🔗
|
joepie91 |
"graph" |
23:21
🔗
|
joepie91 |
anyway, shit happens |
23:21
🔗
|
Kazzy |
'green line with specks of kazzy being sad' |
23:22
🔗
|
joepie91 |
haha |
23:22
🔗
|
joepie91 |
redundancy! :P |
23:22
🔗
|
Kazzy |
whatever anyway, i paid $15 for a year, I'm not too bothred |
23:22
🔗
|
Kazzy |
bothered* |
23:23
🔗
|
joepie91 |
lol |
23:23
🔗
|
joepie91 |
Kazzy: honestly, unless you pay >$100 a month, you're probably not going to find any kind of server or VM with flawless network :P |
23:24
🔗
|
joepie91 |
network dips happen on shared infra, as long as they're not a regular occurrence all is fine |
23:24
🔗
|
joepie91 |
(I do have certain offenders in mind for that 'regular occurrence' thing...) |
23:29
🔗
|
|
kyan has quit IRC (Quit: This computer has gone to sleep) |
23:30
🔗
|
|
kyan has joined #archiveteam-bs |
23:39
🔗
|
kyan |
I'm not impressed... YouTube says they will be deleting my messages, and to download the archive they make available |
23:40
🔗
|
kyan |
It includes a CSV file supposedly containing my received messages. It's empty. |
23:46
🔗
|
xmc |
A+ |
23:46
🔗
|
xmc |
that kind of shit is why i always turn on all email notifications that I can |
23:48
🔗
|
kyan |
yeah I did thank goodness, but I want my f*cking CSV anyway. This emoji middle finger goes out to anyone from youtube who is reading this. 🖕 |
23:49
🔗
|
SmileyG |
D: |
23:58
🔗
|
midas |
for some strange reason people who work for yahoo, youtube etc dont join this channel that often :< |
23:58
🔗
|
midas |
well, the people who work there do join sometimes |
23:58
🔗
|
midas |
that came out wrong |