Time |
Nickname |
Message |
00:27
🔗
|
|
SmileyG has quit IRC (Remote host closed the connection) |
00:28
🔗
|
|
Marc has quit IRC (Ping timeout: 240 seconds) |
00:30
🔗
|
|
BlueMaxim has quit IRC (Ping timeout: 265 seconds) |
00:31
🔗
|
|
Start has quit IRC (Read error: Connection reset by peer) |
00:31
🔗
|
|
BlueMaxim has joined #archiveteam |
00:31
🔗
|
|
Marc has joined #archiveteam |
00:33
🔗
|
|
Start has joined #archiveteam |
00:34
🔗
|
|
wp494 has quit IRC (Read error: Operation timed out) |
00:34
🔗
|
arkiver |
Discovery of blogger is going to start tomorrow |
00:35
🔗
|
arkiver |
because of the captchas popping up if we go too fast, we'll need a lot of ips |
00:35
🔗
|
arkiver |
if you know people who can help us out, please let them know |
00:37
🔗
|
|
Smiley has joined #archiveteam |
00:39
🔗
|
SketchCow |
Ask Kenshin |
00:41
🔗
|
arkiver |
SketchCow: just for confirmation, are we really going to download the full blogger? |
00:41
🔗
|
arkiver |
that will be very very big |
00:49
🔗
|
|
wp494 has joined #archiveteam |
00:51
🔗
|
ohhdemgir |
do it |
00:51
🔗
|
ohhdemgir |
arkiver, how big guesstimate |
00:52
🔗
|
|
GLaDOS has joined #archiveteam |
00:52
🔗
|
|
swebb sets mode: +o GLaDOS |
01:11
🔗
|
|
aaaaaaaaa has quit IRC (Read error: Operation timed out) |
01:13
🔗
|
|
Specular has joined #archiveteam |
01:14
🔗
|
|
aaaaaaaaa has joined #archiveteam |
01:15
🔗
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
01:21
🔗
|
|
dashcloud has joined #archiveteam |
01:21
🔗
|
|
Spring has quit IRC (Read error: Operation timed out) |
01:31
🔗
|
|
signius has quit IRC (Read error: Operation timed out) |
01:32
🔗
|
|
primus104 has quit IRC (Leaving.) |
01:35
🔗
|
|
Spring has joined #archiveteam |
01:40
🔗
|
|
Specular has quit IRC (Ping timeout: 370 seconds) |
01:46
🔗
|
|
signius has joined #archiveteam |
02:06
🔗
|
|
Specular has joined #archiveteam |
02:16
🔗
|
|
Spring has quit IRC (Read error: Operation timed out) |
02:19
🔗
|
SketchCow |
We're going to try and download a lot of it |
02:19
🔗
|
SketchCow |
With an eye towards blogs that match "sex", "eros", "nude" |
02:26
🔗
|
|
Ymgve has quit IRC () |
02:39
🔗
|
|
RedType_ has quit IRC (Remote host closed the connection) |
02:58
🔗
|
|
mistym has quit IRC (Remote host closed the connection) |
03:29
🔗
|
|
mistym has joined #archiveteam |
03:38
🔗
|
|
dashcloud has quit IRC (hub.dk irc.homelien.no) |
03:38
🔗
|
|
NovaKing has quit IRC (hub.dk irc.homelien.no) |
03:38
🔗
|
|
maltris has quit IRC (hub.dk irc.homelien.no) |
03:38
🔗
|
|
ionpulse has quit IRC (hub.dk irc.homelien.no) |
03:38
🔗
|
|
pikhq has quit IRC (hub.dk irc.homelien.no) |
03:38
🔗
|
|
altlabel has quit IRC (hub.dk irc.homelien.no) |
03:38
🔗
|
|
Jogie has quit IRC (hub.dk irc.homelien.no) |
03:42
🔗
|
|
maltris_ has joined #archiveteam |
03:45
🔗
|
xmc |
paging ohhdemgir |
03:45
🔗
|
xmc |
^ |
03:45
🔗
|
xmc |
it's what he was born for! |
03:45
🔗
|
|
NovaKing_ has joined #archiveteam |
03:47
🔗
|
S[h]O[r]T |
im ready to gear up a shit ton of machines to help :) |
03:54
🔗
|
|
dashcloud has joined #archiveteam |
04:03
🔗
|
|
Spring has joined #archiveteam |
04:04
🔗
|
|
SN4T14 has joined #archiveteam |
04:11
🔗
|
|
Specular has quit IRC (Read error: Operation timed out) |
04:51
🔗
|
|
Spring has quit IRC (Ping timeout: 362 seconds) |
04:58
🔗
|
Kenshin |
arkiver: i'll step in once you stabilize the code :) |
05:00
🔗
|
|
mistym has quit IRC (Remote host closed the connection) |
05:23
🔗
|
|
aaaaaaaaa has quit IRC (Leaving) |
05:38
🔗
|
|
nwf has quit IRC (WeeChat 1.0.1) |
05:39
🔗
|
|
nwf has joined #archiveteam |
05:45
🔗
|
|
mistym has joined #archiveteam |
05:46
🔗
|
Start |
i'm wondering when we should start projects for angelfire and tripod |
05:47
🔗
|
Start |
that way we can have all three major 90s web hosts backed up |
05:48
🔗
|
Start |
also, did the geocities archive include sites from geocities japan? |
05:49
🔗
|
xmc |
Start: lycos, but that's on tripod now |
05:54
🔗
|
|
nwf has quit IRC (Read error: Operation timed out) |
05:54
🔗
|
|
nwf has joined #archiveteam |
05:57
🔗
|
|
mistym has quit IRC (Remote host closed the connection) |
06:03
🔗
|
|
RedType has joined #archiveteam |
06:12
🔗
|
|
RedType has quit IRC (Quit: Lost terminal) |
06:19
🔗
|
|
RedType has joined #archiveteam |
06:23
🔗
|
|
mistym has joined #archiveteam |
06:33
🔗
|
|
RedType has quit IRC (Client Quit) |
07:04
🔗
|
|
Muad-Dib has quit IRC (Ping timeout: 260 seconds) |
07:08
🔗
|
|
Muad-Dib has joined #archiveteam |
07:13
🔗
|
|
pikhq has joined #archiveteam |
07:13
🔗
|
|
altlabel has joined #archiveteam |
07:13
🔗
|
|
ionpulse has joined #archiveteam |
07:21
🔗
|
|
Jogie has joined #archiveteam |
07:37
🔗
|
|
Emcy_ has quit IRC (Read error: Connection reset by peer) |
08:12
🔗
|
|
mistym has quit IRC (Remote host closed the connection) |
08:18
🔗
|
|
primus104 has joined #archiveteam |
08:25
🔗
|
|
acridAxid has quit IRC (Quit: Quitting) |
08:28
🔗
|
|
khaoohs_ has joined #archiveteam |
08:29
🔗
|
|
dashcloud has quit IRC (west.us.hub irc.eversible.com) |
08:29
🔗
|
|
wp494 has quit IRC (west.us.hub irc.eversible.com) |
08:29
🔗
|
|
rejon has quit IRC (west.us.hub irc.eversible.com) |
08:29
🔗
|
|
Famicoma1 has quit IRC (west.us.hub irc.eversible.com) |
08:29
🔗
|
|
balrog has quit IRC (west.us.hub irc.eversible.com) |
08:29
🔗
|
|
lrkj has quit IRC (west.us.hub irc.eversible.com) |
08:29
🔗
|
|
slash` has quit IRC (west.us.hub irc.eversible.com) |
08:29
🔗
|
|
Baljem has quit IRC (west.us.hub irc.eversible.com) |
08:29
🔗
|
|
ats has quit IRC (west.us.hub irc.eversible.com) |
08:29
🔗
|
|
espes__ has quit IRC (west.us.hub irc.eversible.com) |
08:29
🔗
|
|
Mayonaise has quit IRC (west.us.hub irc.eversible.com) |
08:29
🔗
|
|
marvinw has quit IRC (west.us.hub irc.eversible.com) |
08:29
🔗
|
|
khaoohs has quit IRC (west.us.hub irc.eversible.com) |
08:29
🔗
|
|
Froggypwn has quit IRC (west.us.hub irc.eversible.com) |
08:29
🔗
|
|
oli has quit IRC (west.us.hub irc.eversible.com) |
08:29
🔗
|
|
Cameron_D has quit IRC (west.us.hub irc.eversible.com) |
08:29
🔗
|
|
gibigiana has quit IRC (west.us.hub irc.eversible.com) |
08:29
🔗
|
|
okeuday has quit IRC (west.us.hub irc.eversible.com) |
08:29
🔗
|
|
ohhdemgir has quit IRC (west.us.hub irc.eversible.com) |
08:29
🔗
|
|
ryan___ has quit IRC (west.us.hub irc.eversible.com) |
08:29
🔗
|
|
thefinn93 has quit IRC (west.us.hub irc.eversible.com) |
08:29
🔗
|
|
eprillios has quit IRC (west.us.hub irc.eversible.com) |
08:29
🔗
|
|
Rickster has quit IRC (west.us.hub irc.eversible.com) |
08:29
🔗
|
|
kanzure has quit IRC (west.us.hub irc.eversible.com) |
08:29
🔗
|
|
fenn has quit IRC (west.us.hub irc.eversible.com) |
08:29
🔗
|
|
xmc has quit IRC (west.us.hub irc.eversible.com) |
08:30
🔗
|
|
acridAxid has joined #archiveteam |
08:30
🔗
|
|
wp494_ has joined #archiveteam |
08:30
🔗
|
|
wp494_ has quit IRC (Excess Flood) |
08:30
🔗
|
|
wp494_ has joined #archiveteam |
08:31
🔗
|
|
Rickster` has joined #archiveteam |
08:32
🔗
|
|
kanzure_ has joined #archiveteam |
08:32
🔗
|
|
gibigian1 has joined #archiveteam |
08:32
🔗
|
|
lrkj_ has joined #archiveteam |
08:36
🔗
|
|
acridAxid has quit IRC (Read error: Operation timed out) |
08:39
🔗
|
|
ryan__ has joined #archiveteam |
08:41
🔗
|
|
acridAxid has joined #archiveteam |
08:44
🔗
|
|
Rickster` is now known as Rickster |
08:44
🔗
|
|
rejon has joined #archiveteam |
08:44
🔗
|
|
balrog has joined #archiveteam |
08:44
🔗
|
|
slash` has joined #archiveteam |
08:44
🔗
|
|
thefinn93 has joined #archiveteam |
08:44
🔗
|
|
Baljem has joined #archiveteam |
08:44
🔗
|
|
ats has joined #archiveteam |
08:44
🔗
|
|
espes__ has joined #archiveteam |
08:44
🔗
|
|
Mayonaise has joined #archiveteam |
08:44
🔗
|
|
marvinw has joined #archiveteam |
08:44
🔗
|
|
oli has joined #archiveteam |
08:44
🔗
|
|
Cameron_D has joined #archiveteam |
08:44
🔗
|
|
okeuday has joined #archiveteam |
08:44
🔗
|
|
xmc has joined #archiveteam |
08:44
🔗
|
|
irc.eversible.com sets mode: +oo balrog xmc |
08:44
🔗
|
|
swebb sets mode: +o balrog |
08:44
🔗
|
|
swebb sets mode: +o xmc |
08:44
🔗
|
|
balrog sets mode: +o Lord_Nigh |
08:48
🔗
|
|
fenn has joined #archiveteam |
08:51
🔗
|
|
Emcy_ has joined #archiveteam |
08:51
🔗
|
|
fenn has quit IRC (west.us.hub irc.eversible.com) |
08:51
🔗
|
|
rejon has quit IRC (west.us.hub irc.eversible.com) |
08:51
🔗
|
|
balrog has quit IRC (west.us.hub irc.eversible.com) |
08:51
🔗
|
|
slash` has quit IRC (west.us.hub irc.eversible.com) |
08:51
🔗
|
|
Baljem has quit IRC (west.us.hub irc.eversible.com) |
08:51
🔗
|
|
ats has quit IRC (west.us.hub irc.eversible.com) |
08:51
🔗
|
|
espes__ has quit IRC (west.us.hub irc.eversible.com) |
08:51
🔗
|
|
Mayonaise has quit IRC (west.us.hub irc.eversible.com) |
08:51
🔗
|
|
marvinw has quit IRC (west.us.hub irc.eversible.com) |
08:51
🔗
|
|
oli has quit IRC (west.us.hub irc.eversible.com) |
08:51
🔗
|
|
Cameron_D has quit IRC (west.us.hub irc.eversible.com) |
08:51
🔗
|
|
okeuday has quit IRC (west.us.hub irc.eversible.com) |
08:51
🔗
|
|
thefinn93 has quit IRC (west.us.hub irc.eversible.com) |
08:51
🔗
|
|
xmc has quit IRC (west.us.hub irc.eversible.com) |
08:53
🔗
|
|
oli_ has joined #archiveteam |
08:54
🔗
|
|
espes___ has joined #archiveteam |
08:55
🔗
|
|
schbirid has joined #archiveteam |
08:55
🔗
|
|
marvinw_ has joined #archiveteam |
08:55
🔗
|
|
Baljem_ has joined #archiveteam |
09:06
🔗
|
|
oli_ is now known as oli |
09:06
🔗
|
|
eprillios has joined #archiveteam |
09:10
🔗
|
|
dashcloud has joined #archiveteam |
09:10
🔗
|
|
rejon has joined #archiveteam |
09:10
🔗
|
|
balrog has joined #archiveteam |
09:10
🔗
|
|
thefinn93 has joined #archiveteam |
09:10
🔗
|
|
Mayonaise has joined #archiveteam |
09:10
🔗
|
|
Cameron_D has joined #archiveteam |
09:10
🔗
|
|
xmc has joined #archiveteam |
09:10
🔗
|
|
irc.eversible.com sets mode: +oo balrog xmc |
09:10
🔗
|
|
swebb sets mode: +o balrog |
09:10
🔗
|
|
swebb sets mode: +o xmc |
09:18
🔗
|
|
ats has joined #archiveteam |
09:19
🔗
|
|
fenn has joined #archiveteam |
09:21
🔗
|
|
primus104 has quit IRC (Leaving.) |
09:38
🔗
|
antomatic |
That was quick: |
09:38
🔗
|
antomatic |
http://www.engadget.com/2015/02/27/google-reverses-blogger-porn-ban/ |
09:46
🔗
|
|
eprillios has quit IRC (Ping timeout: 506 seconds) |
09:52
🔗
|
|
eprillios has joined #archiveteam |
09:59
🔗
|
|
Famicoman has joined #archiveteam |
10:09
🔗
|
espes___ |
I like to imagine it's because of SketchCow shouting at them yesterday |
10:12
🔗
|
arkiver |
ohhdemgir: full blogger would be many 10's of TB's I think |
10:13
🔗
|
arkiver |
Kenshin S[h]O[r]T: thanks! I'll keep you informed |
10:13
🔗
|
arkiver |
so do we still want blogger or do we close the project now? http://www.engadget.com/2015/02/27/google-reverses-blogger-porn-ban/ |
10:14
🔗
|
Kenshin |
if they're going to keep things and only target commercial porn, i feel there's no rush to archive it |
10:16
🔗
|
espes___ |
full blogger |
10:16
🔗
|
espes___ |
>300TB |
10:18
🔗
|
Atluxity |
well... why not have blogger on our project-list anyway? If we get to it then why not have it as a large project warriors can work with when there is nothing else? |
10:18
🔗
|
|
swebb has quit IRC (Read error: Operation timed out) |
10:19
🔗
|
Atluxity |
would it be hard to de-duplicate if they announce a shutdown later? |
10:19
🔗
|
Atluxity |
just do an incremental grab then? |
10:19
🔗
|
espes___ |
'cause storage is expensive |
10:22
🔗
|
|
swebb has joined #archiveteam |
10:23
🔗
|
|
Ymgve has joined #archiveteam |
10:25
🔗
|
arkiver |
deduplicating is hard with 300 TB |
10:26
🔗
|
arkiver |
and the storage is a problem, but if they announce to go away a week before shutdown we'll not have enough time to save everything |
10:26
🔗
|
arkiver |
so a slow constant proect |
10:26
🔗
|
arkiver |
project* would be good |
10:34
🔗
|
fenn |
why is deduplicating hard? |
10:35
🔗
|
ersi |
You need to crunch a lot and keep a lot of data in memory |
10:35
🔗
|
ersi |
tldr "resource intensive and complex task" |
10:35
🔗
|
fenn |
is it not just a matter of comparing hashes? |
10:35
🔗
|
fenn |
either perceptual hash or md5 |
10:37
🔗
|
fenn |
file size is a decent first pass too |
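[Editor's sketch] fenn's suggestion (file size as a first pass, then a hash comparison) might look roughly like the following. This is only an illustration under stated assumptions: it walks a local directory, the names `file_digest` and `find_duplicates` are hypothetical, and it does not reflect how any Archive Team pipeline actually stores or deduplicates data. At the 300 TB scale discussed here, the size and hash index itself becomes the hard part, which is ersi's point.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Group files under root that share both size and MD5 digest."""
    # First pass: bucket by file size, which is cheap (no file reads).
    by_size = defaultdict(list)
    for path in Path(root).rglob("*"):
        if path.is_file():
            by_size[path.stat().st_size].append(path)

    # Second pass: hash only the files whose size collides with another file.
    duplicates = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate
        by_hash = defaultdict(list)
        for path in paths:
            by_hash[file_digest(path)].append(path)
        duplicates.extend(sorted(group) for group in by_hash.values() if len(group) > 1)
    return duplicates
```

Calling `find_duplicates("some/dir")` returns a list of groups, each group being a sorted list of paths with identical content (up to MD5 collisions).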
11:32
🔗
|
Atluxity |
depending on storage solution, some have it built in |
11:33
🔗
|
Atluxity |
but without knowing the storage in detail it is hard to plan for |
11:34
🔗
|
Atluxity |
implementing in pipeline software would certainly be a challenge |
11:34
🔗
|
Atluxity |
but maybe a constant big project could be better than nothing at all? |
12:00
🔗
|
|
slash` has joined #archiveteam |
12:25
🔗
|
|
primus104 has joined #archiveteam |
12:32
🔗
|
antomatic |
I agree with Atluxity, it'd be really good to have a big, ongoing, unhurried project that can serve as a backstop for hungry warriors with nothing else to do |
12:35
🔗
|
antomatic |
Google/Blogger have demonstrated that their whims are arbitrary and changeable |
12:35
🔗
|
antomatic |
so a pre-emptive grab, over time, seems like a worthwhile investmenet. |
12:35
🔗
|
antomatic |
*investment |
12:38
🔗
|
antomatic |
As regards deduplicating, perhaps the project could keep track of the time/date of each blog's grab, so that future grabs (if done) can use the blogger features to grab 'everything since' that date, etc |
12:40
🔗
|
antomatic |
it should be a solvable problem |
12:48
🔗
|
Atluxity |
ah, I did not know of such a feature |
12:50
🔗
|
antomatic |
something like /search?updated-min=yyyy-mm-ddThh:mm:ssZ&max-results=499 |
12:51
🔗
|
arkiver |
That's a good idea |
12:51
🔗
|
arkiver |
that's not so hard to implement |
12:52
🔗
|
arkiver |
if SketchCow thinks it's good to do and we are good on space, I think we should do that |
12:52
🔗
|
arkiver |
update every month |
12:57
🔗
|
antomatic |
Mm, that works - e.g. "everything since Jan 1st 2014" is expressed like: http://buzz.blogger.com/search?updated-min=2014-01-01T00:00:00Z&max-results=499 |
12:58
🔗
|
antomatic |
that's blogger.com but the same works on blogspot.com |
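[Editor's sketch] antomatic's "everything since" idea could be turned into a small URL builder along these lines. The function name `updated_since_url` is hypothetical; the `updated-min` and `max-results` parameters and the 499 limit are taken from the messages above, and whether Blogger still honours them is not something this sketch verifies. `last_grab` is assumed to be a timezone-aware datetime.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode


def updated_since_url(blog_base, last_grab, max_results=499):
    """Build a Blogger search URL for posts updated since last_grab.

    last_grab should be timezone-aware; it is converted to UTC before formatting.
    """
    stamp = last_grab.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    # Keep ':' unescaped so the timestamp matches the form quoted in the channel.
    query = urlencode({"updated-min": stamp, "max-results": max_results}, safe=":")
    return f"{blog_base.rstrip('/')}/search?{query}"


print(updated_since_url("http://buzz.blogger.com",
                        datetime(2014, 1, 1, tzinfo=timezone.utc)))
```

Storing the grab time per blog and feeding it back into this function on the next pass would give the monthly incremental update arkiver mentions.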
13:04
🔗
|
|
BlueMaxim has quit IRC (Quit: Leaving) |
13:11
🔗
|
|
wp494_ has quit IRC (Quit: LOUD UNNECESSARY QUIT MESSAGES) |
13:11
🔗
|
|
wp494 has joined #archiveteam |
13:21
🔗
|
|
signius has quit IRC (Read error: Operation timed out) |
13:24
🔗
|
|
ohhdemgir has joined #archiveteam |
13:34
🔗
|
|
signius has joined #archiveteam |
13:46
🔗
|
|
sankin has joined #archiveteam |
14:50
🔗
|
|
sankin has quit IRC (Leaving.) |
14:58
🔗
|
|
sankin has joined #archiveteam |
15:01
🔗
|
ohhdemgir |
so whos writing the discovery for blogger? |
15:02
🔗
|
|
primus104 has quit IRC (Leaving.) |
15:04
🔗
|
Kazzy |
ohhdemgir: arkiver is/was writing disco scripts |
15:22
🔗
|
|
khaoohs_ has quit IRC (Ping timeout: 306 seconds) |
15:28
🔗
|
arkiver |
discovery scripts are ready |
15:28
🔗
|
arkiver |
not started yet, but ready |
15:32
🔗
|
|
mistym has joined #archiveteam |
15:50
🔗
|
|
mistym has quit IRC (Remote host closed the connection) |
15:53
🔗
|
|
mhazinsk has quit IRC (Ping timeout: 186 seconds) |
15:56
🔗
|
|
mhazinsk has joined #archiveteam |
16:11
🔗
|
|
mistym has joined #archiveteam |
16:15
🔗
|
SketchCow |
I take partial credit for the reversal, because they used some of my language |
16:16
🔗
|
SketchCow |
My attitude on Blogger grabs is: |
16:16
🔗
|
SketchCow |
- Grab the oldest blogs (most likely to die in some arbitrary cull) |
16:16
🔗
|
SketchCow |
- Grab the biggest blogs (less likely, but good to have) |
16:17
🔗
|
SketchCow |
- Grab blogs with words like "erotic", "sex-positive" or "adult-oriented" in the search results, as those are now shown to be second class citizens |
16:17
🔗
|
SketchCow |
That can go on for a while |
16:17
🔗
|
SketchCow |
Also, getting discovery going and making frameworks is also important. Having record of what was out there can be really helpful for researchers. |
16:19
🔗
|
arkiver |
Ok, we'll run the discovery and filter for those sex oriented words and I'll run a discovery to check which blogs request a verification of age |
16:20
🔗
|
Start |
we could also discover blogs by abusing the next blog button on the navbar |
16:20
🔗
|
arkiver |
So the idea to grab everything and update regularly is off? |
16:22
🔗
|
arkiver |
Biggest blogs is biggest as in a lot of visitors or biggest as in a lot of posts? We can check the number of posts easy, but number of visitors would be a bit harder. That would have to be user submitted |
16:26
🔗
|
ohhdemgir |
arkiver, ready.. so lets start? |
16:31
🔗
|
|
khaoohs has joined #archiveteam |
16:31
🔗
|
|
oli_ has joined #archiveteam |
16:33
🔗
|
|
oli has quit IRC (hub.se efnet.port80.se) |
16:33
🔗
|
|
Rickster has quit IRC (hub.se efnet.port80.se) |
16:33
🔗
|
|
Muad-Dib has quit IRC (hub.se efnet.port80.se) |
16:33
🔗
|
|
GLaDOS has quit IRC (hub.se efnet.port80.se) |
16:33
🔗
|
|
WubTheCap has quit IRC (hub.se efnet.port80.se) |
16:33
🔗
|
|
Sue_ has quit IRC (hub.se efnet.port80.se) |
16:33
🔗
|
|
nox has quit IRC (hub.se efnet.port80.se) |
16:33
🔗
|
|
danneh_ has quit IRC (hub.se efnet.port80.se) |
16:33
🔗
|
|
LittUp has quit IRC (hub.se efnet.port80.se) |
16:33
🔗
|
|
deathy has quit IRC (hub.se efnet.port80.se) |
16:33
🔗
|
|
russss has quit IRC (hub.se efnet.port80.se) |
16:33
🔗
|
|
lhobas has quit IRC (hub.se efnet.port80.se) |
16:36
🔗
|
|
aaaaaaaaa has joined #archiveteam |
16:37
🔗
|
|
Sue__ has joined #archiveteam |
16:42
🔗
|
SketchCow |
Sorry, biggest in terms of popularity. |
16:42
🔗
|
SketchCow |
These are just arbitrary things, just to do instead of a full deep scan. |
16:48
🔗
|
Atluxity |
I like the general idea of having a big project to work on in more idle time for our warriors |
16:48
🔗
|
ersi |
Who doesn't :) |
16:49
🔗
|
Atluxity |
it is not motivating for a user running a warrior to see it being idle |
16:49
🔗
|
|
oli_ is now known as oli |
16:49
🔗
|
ersi |
That has been obvious for quite some time, yeah :) |
16:49
🔗
|
|
WubTheCap has joined #archiveteam |
17:08
🔗
|
|
Spring has joined #archiveteam |
17:12
🔗
|
|
mistym has quit IRC (Remote host closed the connection) |
17:21
🔗
|
Kenshin |
one note btw, there are blogs that use their own domains |
17:21
🔗
|
Kenshin |
but still use blogger's platform |
17:27
🔗
|
|
Start_ has joined #archiveteam |
17:27
🔗
|
|
Start has quit IRC (Read error: Connection reset by peer) |
17:27
🔗
|
|
Start_ is now known as Start |
17:36
🔗
|
Start |
once we've got more of the upcoming projects out of the way, i'd like to start projects for some of the isp web hosts we've found at #webroasting |
17:36
🔗
|
Start |
not sure which order they'd be done in |
17:37
🔗
|
Start |
maybe oldest/biggest/most decayed first |
17:38
🔗
|
|
primus104 has joined #archiveteam |
17:46
🔗
|
|
primus104 has quit IRC (Leaving.) |
17:59
🔗
|
|
xmc sets mode: +o swebb |
18:06
🔗
|
|
mistym has joined #archiveteam |
18:07
🔗
|
|
mistym_ has joined #archiveteam |
18:16
🔗
|
|
mistym has quit IRC (Ping timeout: 600 seconds) |
18:27
🔗
|
|
primus104 has joined #archiveteam |
18:34
🔗
|
|
slash` has quit IRC (Ping timeout: 512 seconds) |
18:36
🔗
|
|
balrog has quit IRC (Ping timeout: 512 seconds) |
18:37
🔗
|
|
balrog has joined #archiveteam |
18:37
🔗
|
|
swebb sets mode: +o balrog |
18:37
🔗
|
|
thefinn93 has quit IRC (Ping timeout: 606 seconds) |
18:43
🔗
|
|
Nickname has joined #archiveteam |
18:43
🔗
|
Nickname |
http://science.slashdot.org/story/15/02/25/2313241/argonne-national-laboratory-shuts-down-online-ask-a-scientist-program |
18:44
🔗
|
Nickname |
NEWTON is (soon to be was) an online repository of science questions submitted by school children from around the world. |
18:44
🔗
|
yipdw |
got it |
18:44
🔗
|
yipdw |
http://archive.fart.website/archivebot/viewer/?q=newton |
18:44
🔗
|
|
thefinn93 has joined #archiveteam |
18:48
🔗
|
Nickname |
@yipdw: Sorry for asking then. I was unable to find it in the wiki. |
18:48
🔗
|
Nickname |
Did you really get everything? |
18:48
🔗
|
yipdw |
no apologies needed, just wanted to point it out |
18:48
🔗
|
yipdw |
I haven't done a thorough check but glancing over the logs it doesn't look dreadfully bad |
18:49
🔗
|
yipdw |
if you'd like to verify, you can use https://github.com/ikreymer/webarchiveplayer |
18:50
🔗
|
yipdw |
download https://archive.org/download/archiveteam_archivebot_go_20150227000003/www.newton.dep.anl.gov-inf-20150226-022927-cm6eh-00000.warc.gz and load it into the player |
18:50
🔗
|
yipdw |
actually i'll do that now |
18:53
🔗
|
Nickname |
so the archivebot http://archive.fart.website/archivebot/viewer/?q=newton , shows files that are already saved on archive.org? |
18:55
🔗
|
yipdw |
Nickname: yeah, it's an index of the archivebot collection in IA, which in turn gets ingested into the Wayback Machine eventually |
18:55
🔗
|
yipdw |
Wayback works but there are details that I've found that other tools currently render better |
18:55
🔗
|
yipdw |
infinite-scroll for example |
18:56
🔗
|
yipdw |
so pywb/webarchiveplayer/etc |
18:58
🔗
|
Nickname |
What should I even look for? (Isn't there a Tool to check for broken links or detect suspicious external/internal Links?)(Oh infinite-scroll how I hate thee...) |
19:00
🔗
|
Smiley |
wget can check for broken links. |
19:01
🔗
|
yipdw |
Nickname: I suggest as a first step downloading the WARC and looking through it manually, comparing it against Newton |
19:01
🔗
|
yipdw |
once it looks like the copy is reasonably faithful it's good up to the point that you trust our spiders |
19:02
🔗
|
yipdw |
the crawler is https://github.com/chfoo/wpull |
19:02
🔗
|
yipdw |
I guess you could run a link check on the loaded WARC, I don't know of any tools to do that |
19:02
🔗
|
yipdw |
it's also ambiguous -- a 404 in the WARC could very well be a 404 on the captured site |
19:04
🔗
|
Nickname |
hmm. The Internet Archive's sites are offline for scheduled maintenance and upgrades. Oh well. I'll check back later. |
19:07
🔗
|
Nickname |
Does the WARC capture 404-at-crawl Events? |
19:09
🔗
|
Nickname |
(Capturing the received Error-Page could also be useful, for checking Purposes.) |
19:11
🔗
|
|
RedType has joined #archiveteam |
19:23
🔗
|
|
slash` has joined #archiveteam |
19:25
🔗
|
|
RedType_ has joined #archiveteam |
19:26
🔗
|
|
RedType_ has quit IRC (Client Quit) |
19:28
🔗
|
|
RedType_ has joined #archiveteam |
19:34
🔗
|
SketchCow |
Internet Archive's having a little sadness |
19:34
🔗
|
SketchCow |
Leonard Nimoy's gone, what's the point |
19:39
🔗
|
Smiley |
nod |
19:39
🔗
|
Smiley |
turn off the lights on your way out |
19:40
🔗
|
SketchCow |
So |
19:40
🔗
|
SketchCow |
Someone passed me information and wants it some way confidential |
19:40
🔗
|
Smiley |
?LOL |
19:40
🔗
|
SketchCow |
So I'm going to paraphrase it |
19:41
🔗
|
Smiley |
...k.... |
19:41
🔗
|
* |
Smiley sounds the sirens |
19:41
🔗
|
SketchCow |
Last.fm is going to switch codebases in the first two weeks of April |
19:41
🔗
|
SketchCow |
Opinion of these folks is... it's not going too well |
19:42
🔗
|
|
RedType has quit IRC (Quit: Lost terminal) |
19:42
🔗
|
SketchCow |
Core music data likely to survive, but some user generated material may die |
19:42
🔗
|
SketchCow |
From latter: |
19:42
🔗
|
SketchCow |
There are 1m+ forum posts spanning nearly 11 years across global |
19:42
🔗
|
SketchCow |
forums (which you can see at http://www.last.fm/forum) and group |
19:42
🔗
|
SketchCow |
forums. The good news is all the forums are accessible by incrementing |
19:42
🔗
|
SketchCow |
the ID at the end of http://www.last.fm/forum/<id>. They are spread |
19:42
🔗
|
SketchCow |
across a mostly-continuous ID namespace. |
19:42
🔗
|
SketchCow |
I'm also slightly concerned about user journals (e.g. |
19:42
🔗
|
SketchCow |
http://www.last.fm/user/Russ/journal), but that's more difficult as |
19:42
🔗
|
SketchCow |
there's no easy way of enumerating users, short of crawling similar |
19:42
🔗
|
SketchCow |
user/friends lists. |
19:42
🔗
|
SketchCow |
I'd also suggest archiving blog.last.fm during this switchover as |
19:42
🔗
|
SketchCow |
hilarity is likely to ensue. |
19:42
🔗
|
SketchCow |
... |
19:42
🔗
|
SketchCow |
That's all. |
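[Editor's sketch] The forum part of the tip above (forums reachable by incrementing the ID at the end of http://www.last.fm/forum/<id>) amounts to generating a URL list for a crawler. A minimal illustration follows; the function name `forum_urls` is hypothetical, the actual ID range is not given in the channel and must be found by discovery, and because the namespace is only "mostly-continuous", some generated URLs should be expected to return errors.

```python
def forum_urls(first_id, last_id, base="http://www.last.fm/forum"):
    """Yield one forum URL per ID in the inclusive range first_id..last_id."""
    for forum_id in range(first_id, last_id + 1):
        yield f"{base}/{forum_id}"


# Example with a made-up range; real bounds would come from discovery.
for url in forum_urls(1, 3):
    print(url)
```

The resulting list could be fed to a crawler as seed URLs. User journals would need a different approach, since the tip notes there is no easy way to enumerate users.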
19:43
🔗
|
SketchCow |
So I think a project is worth it |
19:43
🔗
|
SketchCow |
I suggest #lastchance.fm |
19:46
🔗
|
garyrh_ |
their blog is in archivebot |
19:46
🔗
|
garyrh_ |
last post was jan. 2014 |
19:47
🔗
|
xmc |
last.fm's name is self-parodying |
19:48
🔗
|
garyrh_ |
didntlast.fm |
19:49
🔗
|
xmc |
wontlast |
19:50
🔗
|
garyrh_ |
camelast.fm, etc. etc |
19:52
🔗
|
|
Stilett0 has joined #archiveteam |
19:52
🔗
|
|
Stilett0 has left |
20:13
🔗
|
|
kyan has quit IRC (Quit: Leaving) |
20:17
🔗
|
|
BlueMaxim has joined #archiveteam |
20:17
🔗
|
sep332 |
Google has updated their updated policy. doesn't look (quite) as bad anymore |
20:17
🔗
|
sep332 |
https://productforums.google.com/forum/m/#!category-topic/blogger/jAep2mLabQY |
20:18
🔗
|
sep332 |
i'm guessing we're grabbing anyway, huh? |
20:21
🔗
|
xmc |
no rules no masters #yolo |
20:29
🔗
|
sep332 |
oh well ok then |
20:30
🔗
|
|
lag has joined #archiveteam |
20:38
🔗
|
|
Nickname has quit IRC (Quit: Page closed) |
20:38
🔗
|
SketchCow |
We went over how we're doing blogger. |
20:38
🔗
|
SketchCow |
Grab old blogs, grab sexy and erotic blogs |
20:39
🔗
|
SketchCow |
Don't go nuts, but save from them because they're obviously not so hot |
20:41
🔗
|
sep332 |
thanks. i saw the scrollback from 10 hours ago but missed the update 4 hours ago |
20:47
🔗
|
|
lag2 has joined #archiveteam |
20:51
🔗
|
|
lag has quit IRC (Ping timeout: 258 seconds) |
20:59
🔗
|
|
RedType_ has quit IRC (Quit: leaving) |
20:59
🔗
|
|
RedType has joined #archiveteam |
21:10
🔗
|
|
mistym_ has quit IRC (Remote host closed the connection) |
21:56
🔗
|
|
mistym has joined #archiveteam |
22:01
🔗
|
|
sankin has quit IRC (Leaving.) |
22:09
🔗
|
tephra_ |
codehaus (http://www.codehaus.org/) is shutting down |
22:16
🔗
|
|
lag2 has quit IRC (Quit: Leaving) |
22:28
🔗
|
|
cbb2 has joined #archiveteam |
23:17
🔗
|
|
cbb2 has quit IRC (hub.dk irc.efnet.pl) |
23:17
🔗
|
|
primus104 has quit IRC (hub.dk irc.efnet.pl) |
23:17
🔗
|
|
schbirid has quit IRC (hub.dk irc.efnet.pl) |
23:17
🔗
|
|
Zebranky_ has quit IRC (hub.dk irc.efnet.pl) |
23:17
🔗
|
|
Fusl has quit IRC (hub.dk irc.efnet.pl) |
23:17
🔗
|
|
cbb2 has joined #archiveteam |
23:17
🔗
|
|
primus104 has joined #archiveteam |
23:17
🔗
|
|
schbirid has joined #archiveteam |
23:17
🔗
|
|
Zebranky_ has joined #archiveteam |
23:17
🔗
|
|
Fusl has joined #archiveteam |
23:24
🔗
|
|
BlueMaxim has quit IRC (Read error: Operation timed out) |
23:25
🔗
|
|
Ymgve has quit IRC () |
23:27
🔗
|
|
db48x` has joined #archiveteam |
23:32
🔗
|
|
Spring has quit IRC (Quit: Leaving) |
23:32
🔗
|
|
Ymgve has joined #archiveteam |
23:37
🔗
|
|
BlueMaxim has joined #archiveteam |
23:45
🔗
|
|
Jonimus has joined #archiveteam |
23:50
🔗
|
|
T31m_ has joined #archiveteam |
23:50
🔗
|
|
nico_32_ has joined #archiveteam |
23:50
🔗
|
|
nico_32 has quit IRC (Read error: Connection reset by peer) |
23:50
🔗
|
|
BlueMaxim has quit IRC (Read error: Connection reset by peer) |
23:51
🔗
|
|
Daloader_ has joined #archiveteam |
23:53
🔗
|
|
BlueMaxim has joined #archiveteam |
23:57
🔗
|
|
T31m_ has quit IRC (Read error: Operation timed out) |
23:58
🔗
|
|
BlueMaxim has quit IRC (Read error: Connection reset by peer) |
23:58
🔗
|
|
T31M has quit IRC (Read error: Operation timed out) |
23:59
🔗
|
|
BlueMaxim has joined #archiveteam |
23:59
🔗
|
|
khaoohs has quit IRC (Read error: Connection reset by peer) |
23:59
🔗
|
|
khaoohs has joined #archiveteam |