Time |
Nickname |
Message |
00:33
🔗
|
|
godane has joined #archiveteam |
00:43
🔗
|
arkiver |
SketchCow: I'd like to start writing a project for tumblr |
00:44
🔗
|
nightpool |
arkiver: xmc was going to start one I think |
00:44
🔗
|
nightpool |
we were just talking about it in -bs |
00:44
🔗
|
xmc |
i haven't gotten around to it, so if you do it first then you win |
00:44
🔗
|
xmc |
i have some ideas about how to do it that might be valuable, but they're in scrollback of #-bs already |
00:45
🔗
|
xmc |
it would require two projects and a tiny bit of serverside code but yipdw is willing |
01:00
🔗
|
|
WinterFox has joined #archiveteam |
01:07
🔗
|
|
DoomTay has joined #archiveteam |
01:08
🔗
|
|
Coderjoe has quit IRC (Read error: Operation timed out) |
01:08
🔗
|
|
rossdylan has quit IRC (Read error: Operation timed out) |
01:28
🔗
|
|
Coderjoe has joined #archiveteam |
01:50
🔗
|
|
godane has quit IRC (Read error: Operation timed out) |
02:25
🔗
|
|
philpem has quit IRC (Ping timeout: 260 seconds) |
02:36
🔗
|
|
Aranje has quit IRC (Ping timeout: 260 seconds) |
03:49
🔗
|
|
Coderjoe has quit IRC (Read error: Operation timed out) |
04:12
🔗
|
|
Aranje has joined #archiveteam |
04:35
🔗
|
|
Coderjoe has joined #archiveteam |
04:46
🔗
|
|
Sk1d has quit IRC (Ping timeout: 194 seconds) |
04:52
🔗
|
|
Sk1d has joined #archiveteam |
05:04
🔗
|
|
TC02 has quit IRC (Ping timeout: 246 seconds) |
05:27
🔗
|
|
TC02 has joined #archiveteam |
05:31
🔗
|
|
DoomTay has quit IRC (Quit: Page closed) |
05:31
🔗
|
|
n00b484 has joined #archiveteam |
05:32
🔗
|
n00b484 |
it seems to be working now but can't connect using mIRC |
05:35
🔗
|
|
n00b484 has quit IRC (Client Quit) |
05:40
🔗
|
|
TC01 has quit IRC (Ping timeout: 260 seconds) |
05:53
🔗
|
|
yipdw has quit IRC (Read error: Operation timed out) |
05:54
🔗
|
|
TC01 has joined #archiveteam |
05:54
🔗
|
|
Aranje has quit IRC (Quit: Three sheets to the wind) |
05:56
🔗
|
|
JesseW has joined #archiveteam |
06:08
🔗
|
|
yipdw has joined #archiveteam |
06:33
🔗
|
|
tomwsmf has quit IRC (Ping timeout: 258 seconds) |
07:07
🔗
|
|
metal_cam has joined #archiveteam |
07:13
🔗
|
|
Emcy has quit IRC (Read error: Operation timed out) |
07:14
🔗
|
|
TC02 has quit IRC (Ping timeout: 246 seconds) |
07:21
🔗
|
|
TC02 has joined #archiveteam |
07:30
🔗
|
|
TC02 has quit IRC (Ping timeout: 246 seconds) |
07:31
🔗
|
|
TC02 has joined #archiveteam |
07:31
🔗
|
|
JesseW has quit IRC (Ping timeout: 370 seconds) |
08:32
🔗
|
|
W1nterFox has joined #archiveteam |
08:35
🔗
|
|
WinterFox has quit IRC (Ping timeout: 492 seconds) |
09:12
🔗
|
|
philpem has joined #archiveteam |
09:16
🔗
|
|
schbirid has joined #archiveteam |
09:48
🔗
|
|
pfallenop has quit IRC (Ping timeout: 260 seconds) |
09:54
🔗
|
arkiver |
xmc: I'll have a look at the logs of #-bs. I'm not sure how spread out it is over the logs, so if you are around, maybe you could give me a small overview of the idea? |
09:54
🔗
|
arkiver |
I'm basically thinking a warrior project to create the WARCs for the wayback machine. |
09:54
🔗
|
arkiver |
We'll be extracting new tumblr sites as we archive them |
10:00
🔗
|
arkiver |
I see the main idea is to separate archiving images and blogs |
10:02
🔗
|
arkiver |
I've started a little on the warrior project now |
10:04
🔗
|
arkiver |
We should be able to do some test runs soon |
10:08
🔗
|
arkiver |
xmc: what would be the reason for separating the grab of images from the other files? |
10:13
🔗
|
|
Scuttle has joined #archiveteam |
10:13
🔗
|
Scuttle |
hum...I have an archivebot pulling data from one of my sites, how do I find out what's going on? :) |
10:14
🔗
|
arkiver |
hi |
10:14
🔗
|
|
GLaDOS has quit IRC (Read error: Connection reset by peer) |
10:14
🔗
|
arkiver |
What is your site? |
10:14
🔗
|
|
GLaDOS has joined #archiveteam |
10:15
🔗
|
arkiver |
The dashboard of ArchiveBot can be found here http://archivebot.com/ |
10:17
🔗
|
Scuttle |
randomwaffle.gbs.fm |
10:17
🔗
|
arkiver |
Looks like it's on the dashboard |
10:17
🔗
|
Scuttle |
right |
10:18
🔗
|
Scuttle |
if someone wants, I can rsync the whole site somewhere |
10:18
🔗
|
arkiver |
not sure what the grab is doing now though |
10:18
🔗
|
Scuttle |
well, downloading everything it seems :) |
10:19
🔗
|
|
pfallenop has joined #archiveteam |
10:19
🔗
|
Igloo |
Yeah it's going to put them onto the internet archive |
10:19
🔗
|
Igloo |
Notes say you're the last surviving waffleimages mirror? |
10:19
🔗
|
Scuttle |
may very well be |
10:19
🔗
|
Igloo |
How big is the repo? |
10:19
🔗
|
Scuttle |
around 330 gigs |
10:20
🔗
|
Igloo |
We can delay the crawl / make it less resource intensive if it's causing you problems |
10:21
🔗
|
Scuttle |
ah, that's no problem, I just noticed my access-logs were a lot bigger than they used to be :D |
10:21
🔗
|
Igloo |
aha :) |
10:21
🔗
|
Igloo |
Seems someone wants to preserve it forever |
10:21
🔗
|
Igloo |
So it got added to the crawlers to download & upload to the Internet Archive / viewable in the Wayback Machine |
10:21
🔗
|
|
terg has joined #archiveteam |
10:21
🔗
|
Scuttle |
aight |
10:22
🔗
|
Scuttle |
it's mostly forum-linked pics though I think... |
10:22
🔗
|
Igloo |
182GB done, so about halfway |
10:22
🔗
|
Scuttle |
and that would be broken anyways since I don't have access to the waffleimages-domain |
10:22
🔗
|
Igloo |
No other notes :-/ |
10:24
🔗
|
terg |
post the KAT raid, apart from proxies and such, is there any database lying about of KAT torrent info? |
10:24
🔗
|
Igloo |
There is a torrent (ironically) kicking around somewhere |
10:25
🔗
|
terg |
very unfortunate, I wonder if it'd be a good idea to do regular (incremental if possible) archivals of large torrent indexes |
10:25
🔗
|
terg |
whereabouts should I look? |
10:25
🔗
|
arkiver |
I think that is a good idea |
10:26
🔗
|
arkiver |
I'm planning on getting something going to go by all torrent sites |
10:26
🔗
|
arkiver |
we already have a good archive of rutracker |
10:26
🔗
|
Igloo |
Good idea, Scale is an issue |
10:26
🔗
|
arkiver |
But let's move this to #archiveteam-bs |
10:26
🔗
|
terg |
gotcha |
11:04
🔗
|
|
Atom-- has quit IRC (Read error: Operation timed out) |
11:04
🔗
|
|
winterfox has joined #archiveteam |
11:05
🔗
|
|
Emcy has joined #archiveteam |
11:06
🔗
|
|
W1nterFox has quit IRC (Ping timeout: 492 seconds) |
11:07
🔗
|
|
Emcy has quit IRC (Client Quit) |
11:30
🔗
|
|
Emcy has joined #archiveteam |
11:57
🔗
|
|
Emcy_ has joined #archiveteam |
12:05
🔗
|
|
Sanqui has left . |
12:05
🔗
|
|
Sanqui has joined #archiveteam |
12:06
🔗
|
|
Coderjoe has quit IRC (Read error: Operation timed out) |
12:06
🔗
|
|
terg has quit IRC (My Mac has gone to sleep. ZZZzzz…) |
12:10
🔗
|
|
Emcy has quit IRC (Read error: Operation timed out) |
12:16
🔗
|
|
Coderjoe has joined #archiveteam |
12:32
🔗
|
|
BartoCH has quit IRC (Ping timeout: 260 seconds) |
12:35
🔗
|
|
BartoCH has joined #archiveteam |
12:49
🔗
|
|
BlueMaxim has quit IRC (Read error: Operation timed out) |
12:49
🔗
|
|
BlueMaxim has joined #archiveteam |
13:06
🔗
|
|
atomotic has joined #archiveteam |
13:13
🔗
|
|
kristian_ has joined #archiveteam |
13:20
🔗
|
|
Coderjoe has quit IRC (Read error: Operation timed out) |
13:23
🔗
|
|
Coderjoe has joined #archiveteam |
13:29
🔗
|
|
atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com) |
13:40
🔗
|
|
vOYtEC has joined #archiveteam |
13:49
🔗
|
|
GLaDOS has quit IRC (Quit: Oh crap, I died.) |
13:49
🔗
|
|
GLaDOS has joined #archiveteam |
13:56
🔗
|
|
REiN^ has quit IRC (Ping timeout: 244 seconds) |
13:58
🔗
|
|
redlob has quit IRC (ZNC - http://znc.in) |
14:03
🔗
|
|
redlob has joined #archiveteam |
14:14
🔗
|
|
BlueMaxim has quit IRC (Quit: Leaving) |
14:15
🔗
|
|
REiN^ has joined #archiveteam |
15:05
🔗
|
|
DoomTay has joined #archiveteam |
15:30
🔗
|
xmc |
arkiver: split the images off because otherwise you'll get all the images copied into every blog's warc. that will multiply your grab size by like fifty |
15:51
🔗
|
arkiver |
IA is currently working on something to deduplicate WARCs. |
15:51
🔗
|
arkiver |
Duplicate records will be replaced by revisit records |
15:51
🔗
|
arkiver |
I'll ask around, but size might not matter too much |
15:52
🔗
|
arkiver |
Bandwidth is more of a problem |
15:52
🔗
|
Kazzy |
Jumping in on this - I'm assuming that means IA will only keep 1 copy of everything, even if the same file is uploaded in every WARC? |
15:53
🔗
|
arkiver |
If the same file is uploaded in 50 WARCs, 49 WARCs will have revisit records and 1 WARC will hold the actual file |
15:53
🔗
|
arkiver |
(If I understood IA's idea correctly) |
15:53
🔗
|
Kazzy |
revisit records being just a pointer to the actual file? |
15:53
🔗
|
arkiver |
Yeah |
15:54
🔗
|
DoomTay |
Does this mean no more cases of multiple timestamps for things that haven't changed at all? |
15:54
🔗
|
Kazzy |
awesome, had always wondered if there was an easy way to do that, though it sounds like a ton of processing work. Makes sense for it to be done on IA's end really |
15:54
🔗
|
arkiver |
As far as I know, just a redirect to another record, without making the URL and timestamp in the Wayback Machine look like it is redirected |
15:54
🔗
|
arkiver |
DoomTay: no, see above ^ |
15:55
🔗
|
arkiver |
The idea isn't totally clear yet though, still being discussed, so things might change |
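The revisit-record idea arkiver describes can be sketched roughly as follows. This is only an illustration under stated assumptions, not IA's actual design (which the channel says is still being discussed): the `Record` class, the `deduplicate` function, and the example URLs are hypothetical, and the digest is a plain hex SHA-1 rather than the encoding a real WARC writer would use.

```python
import hashlib
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass
class Record:
    """Hypothetical, simplified stand-in for a WARC record."""
    url: str
    record_type: str            # "response" holds the payload, "revisit" points elsewhere
    digest: str                 # payload digest used to detect duplicates
    payload: Optional[bytes]    # None for revisit records
    refers_to: Optional[str]    # URL of the record that holds the payload


def deduplicate(captures: Iterable[Tuple[str, bytes]]) -> List[Record]:
    """Keep the first copy of each payload; turn later copies into revisit records."""
    first_seen = {}  # digest -> URL of the first record that stored that payload
    records = []
    for url, payload in captures:
        digest = "sha1:" + hashlib.sha1(payload).hexdigest()
        if digest in first_seen:
            # Same bytes already stored: keep only a pointer to the original.
            records.append(Record(url, "revisit", digest, None, first_seen[digest]))
        else:
            first_seen[digest] = url
            records.append(Record(url, "response", digest, payload, None))
    return records


if __name__ == "__main__":
    # The same image fetched while grabbing 50 different (made-up) blogs.
    captures = [(f"http://blog{i}.example/img.jpg", b"same-bytes") for i in range(50)]
    records = deduplicate(captures)
    stored = sum(r.record_type == "response" for r in records)
    revisits = sum(r.record_type == "revisit" for r in records)
    print(stored, revisits)
```

With 50 identical captures this keeps one stored payload and 49 pointers, which matches the "49 WARCs will have revisit records and 1 WARC will hold the actual file" description above.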
16:00
🔗
|
DoomTay |
Okay, so after looking at what a revisit record is, it looks like it IS some form of redundancy removal |
16:00
🔗
|
DoomTay |
Yay |
16:01
🔗
|
DoomTay |
Though I doubt this would save more than, say, a few gigs worth of filespace |
16:08
🔗
|
Kazzy |
huh |
16:09
🔗
|
Kazzy |
Replacing a whole file and just throwing a pointer in saves tons |
16:09
🔗
|
Kazzy |
Even with just AT's stuff, there's an absolute TON of duplication, due to the nature of what we do |
16:09
🔗
|
Kazzy |
When you scale that up to IA, that's terabytes, at least |
16:11
🔗
|
DoomTay |
Hell, maybe petabytes |
16:12
🔗
|
DoomTay |
Speaking of files, anyone know how copies with matching digests can still have different lengths? Is that actually the length of the WARC? |
16:12
🔗
|
DoomTay |
Like with http://web.archive.org/cdx/search/cdx?url=http://www.doomworld.com/batman/main.JPG&output=json |
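To look into DoomTay's question, one option is to group the rows of a CDX query like the one above by digest and compare the reported lengths. The sketch below assumes the `output=json` response is a list of rows whose first row is the column header; the function names and the `SAMPLE` values are invented for illustration and are not real captures. One plausible explanation, stated here as an assumption rather than something confirmed in the channel, is that the length column describes the compressed WARC record (headers included) rather than the payload alone, so identical payloads could still report different lengths.

```python
import json
from collections import defaultdict

# Invented sample in the shape of a CDX "output=json" response (header row first).
SAMPLE = json.dumps([
    ["urlkey", "timestamp", "original", "mimetype", "statuscode", "digest", "length"],
    ["com,example)/a.jpg", "20010101000000", "http://example.com/a.jpg", "image/jpeg", "200", "AAAA", "1200"],
    ["com,example)/a.jpg", "20020101000000", "http://example.com/a.jpg", "image/jpeg", "200", "AAAA", "1234"],
    ["com,example)/a.jpg", "20030101000000", "http://example.com/a.jpg", "image/jpeg", "200", "BBBB", "900"],
])


def lengths_by_digest(cdx_json_text):
    """Map each digest to a list of (timestamp, length) pairs, in response order."""
    rows = json.loads(cdx_json_text)
    if not rows:
        return {}
    header, data = rows[0], rows[1:]
    ts_col = header.index("timestamp")
    digest_col = header.index("digest")
    length_col = header.index("length")
    grouped = defaultdict(list)
    for row in data:
        grouped[row[digest_col]].append((row[ts_col], int(row[length_col])))
    return dict(grouped)


def differing_lengths(grouped):
    """Return only the digests that were reported with more than one distinct length."""
    result = {}
    for digest, entries in grouped.items():
        lengths = sorted({length for _, length in entries})
        if len(lengths) > 1:
            result[digest] = lengths
    return result


if __name__ == "__main__":
    print(differing_lengths(lengths_by_digest(SAMPLE)))
```

Feeding it the body of a real CDX response instead of `SAMPLE` would list the digests whose captures disagree on length.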
16:15
🔗
|
schbirid |
wget has a dedup flag for warc btw |
16:16
🔗
|
Kazzy |
that only goes so far though schbirid, I guess that works for ArchiveBot, but not warrior projects |
17:02
🔗
|
|
kristian_ has quit IRC (Leaving) |
17:29
🔗
|
|
Scuttle has left Leaving |
17:50
🔗
|
|
JesseW has joined #archiveteam |
18:12
🔗
|
|
tomwsmf has joined #archiveteam |
18:38
🔗
|
|
metal_cam is now known as metalcamp |
19:05
🔗
|
|
Start has quit IRC (Read error: Connection reset by peer) |
19:05
🔗
|
|
Start has joined #archiveteam |
19:16
🔗
|
schbirid |
no idea what it means but http://ddl-warez.to/ has a notice "only 68 days left" |
19:17
🔗
|
DoomTay |
....and it has freaking CloudFlare |
19:19
🔗
|
|
JesseW has quit IRC (Ping timeout: 370 seconds) |
19:21
🔗
|
schbirid |
of course |
20:09
🔗
|
|
maseck has quit IRC (Quit: No Ping reply in 180 seconds.) |
20:09
🔗
|
|
maseck has joined #archiveteam |
20:50
🔗
|
Medowar |
...and there we go http://www.bbc.com/news/business-36879831 |
20:50
🔗
|
Medowar |
.title |
20:51
🔗
|
|
kristian_ has joined #archiveteam |
20:51
🔗
|
Kazzy |
.title http://www.bbc.co.uk/news/business-36879831 |
20:51
🔗
|
Kazzy |
sod it, "Verizon 'agrees $5bn Yahoo deal'" |
21:07
🔗
|
HCross2 |
Oh God. Yahoo and AOL having a baby |
21:08
🔗
|
|
Kazzy is now known as Kaz |
21:10
🔗
|
DoomTay |
This is gonna be fun... |
21:26
🔗
|
|
Actium has joined #archiveteam |
21:52
🔗
|
|
godane has joined #archiveteam |
21:54
🔗
|
Nemo_bis |
Supercookies for everyone! |
21:54
🔗
|
|
Emcy_ has quit IRC (Read error: Operation timed out) |
21:55
🔗
|
|
Emcy_ has joined #archiveteam |
22:00
🔗
|
|
pguth_ has quit IRC (Remote host closed the connection) |
22:00
🔗
|
|
pguth_ has joined #archiveteam |
22:02
🔗
|
|
metalcamp has quit IRC (Ping timeout: 244 seconds) |
22:02
🔗
|
|
Emcy_ has quit IRC (Read error: Operation timed out) |
22:04
🔗
|
|
Emcy_ has joined #archiveteam |
22:10
🔗
|
|
kristian_ has quit IRC (Leaving) |
22:15
🔗
|
|
winterfox has quit IRC (Ping timeout: 492 seconds) |
22:33
🔗
|
|
ndiddy has joined #archiveteam |
22:37
🔗
|
|
Coderjoe has quit IRC (Ping timeout: 260 seconds) |
22:38
🔗
|
|
Coderjoe has joined #archiveteam |
22:51
🔗
|
|
dashcloud has joined #archiveteam |
23:01
🔗
|
|
pguth_ has quit IRC (Remote host closed the connection) |
23:01
🔗
|
|
pguth_ has joined #archiveteam |
23:11
🔗
|
|
kristian_ has joined #archiveteam |
23:19
🔗
|
|
Coderjoe has quit IRC (Ping timeout: 260 seconds) |
23:22
🔗
|
|
Swaxx has joined #archiveteam |
23:23
🔗
|
Swaxx |
hi anyone here? |
23:23
🔗
|
* |
Actium says hi and goes back into hiding |
23:24
🔗
|
Swaxx |
how can i post a link in a forumpost? |
23:25
🔗
|
DoomTay |
I don't think this is the place for that |
23:25
🔗
|
Swaxx |
ow okay, |
23:25
🔗
|
Swaxx |
do you know the IRC address to extratorrents? |
23:28
🔗
|
|
Swaxx has quit IRC (Quit: Page closed) |
23:34
🔗
|
|
Coderjoe has joined #archiveteam |
23:37
🔗
|
|
BlueMaxim has joined #archiveteam |
23:55
🔗
|
|
closure has joined #archiveteam |