| Time |
Nickname |
Message |
|
00:18
🔗
|
|
DoomTay has joined #archiveteam-bs |
|
00:19
🔗
|
|
Stiletto has joined #archiveteam-bs |
|
00:19
🔗
|
|
tomwsmf-a has joined #archiveteam-bs |
|
00:24
🔗
|
|
DiscantX has joined #archiveteam-bs |
|
00:30
🔗
|
|
JesseW has joined #archiveteam-bs |
|
00:53
🔗
|
godane |
i'm not doing the examiner.com website |
|
00:53
🔗
|
godane |
mostly cause it's too big |
|
00:53
🔗
|
godane |
even when doing daily sitemap dumps of it |
|
00:54
🔗
|
godane |
there are like 1000+ URLs per day from that website |
|
00:57
🔗
|
|
VADemon has quit IRC (Quit: left4dead) |
|
00:57
🔗
|
|
DiscantX has quit IRC (Read error: Operation timed out) |
|
01:12
🔗
|
|
JesseW has quit IRC (Ping timeout: 370 seconds) |
|
01:23
🔗
|
|
Stiletto has quit IRC (Ping timeout: 244 seconds) |
|
01:24
🔗
|
|
Coderjoe has quit IRC (Read error: Operation timed out) |
|
01:28
🔗
|
|
Coderjoe has joined #archiveteam-bs |
|
01:37
🔗
|
DoomTay |
Well ArchiveBot is doing it anyway, thanks to SketchCow |
|
02:01
🔗
|
|
coretx has quit IRC (Read error: Operation timed out) |
|
02:02
🔗
|
|
RichardG has quit IRC (Read error: Operation timed out) |
|
02:02
🔗
|
|
RichardG has joined #archiveteam-bs |
|
02:04
🔗
|
|
coretx has joined #archiveteam-bs |
|
02:05
🔗
|
|
JesseW has joined #archiveteam-bs |
|
02:10
🔗
|
|
tomwsmf-a has quit IRC (Read error: Operation timed out) |
|
02:18
🔗
|
|
Stiletto has joined #archiveteam-bs |
|
02:45
🔗
|
|
RichardG has quit IRC (Read error: Operation timed out) |
|
02:45
🔗
|
|
RichardG has joined #archiveteam-bs |
|
03:09
🔗
|
|
RichardG has quit IRC (Read error: Operation timed out) |
|
03:09
🔗
|
|
RichardG has joined #archiveteam-bs |
|
03:33
🔗
|
|
Coderjoe has quit IRC (Read error: Operation timed out) |
|
03:52
🔗
|
|
RichardG has quit IRC (Ping timeout: 370 seconds) |
|
03:54
🔗
|
|
Swizzle has quit IRC (Quit: Leaving) |
|
03:57
🔗
|
|
RichardG has joined #archiveteam-bs |
|
04:01
🔗
|
|
Coderjoe has joined #archiveteam-bs |
|
04:05
🔗
|
|
Sk1d has quit IRC (Ping timeout: 194 seconds) |
|
04:08
🔗
|
|
RichardG has quit IRC (Ping timeout: 260 seconds) |
|
04:11
🔗
|
|
Sk1d has joined #archiveteam-bs |
|
04:12
🔗
|
|
RichardG has joined #archiveteam-bs |
|
04:27
🔗
|
ranma |
www.asstr.org isn't run by IA/Jason Scott/someone in AT, is it? x) |
|
04:27
🔗
|
ranma |
(alt.sex.stories text repository) |
|
04:28
🔗
|
Frogging |
that's been around forever |
|
04:28
🔗
|
Frogging |
I doubt it |
|
04:28
🔗
|
ranma |
ah |
|
04:29
🔗
|
* |
ranma watches CITIES ON THE EDGE OF NEVER: Life in the Trenches of the Web in 2012 (JS talk for some posh UK conference) |
|
04:30
🔗
|
|
GLaDOS has quit IRC (Ping timeout: 260 seconds) |
|
04:31
🔗
|
JesseW |
ranma: #archivebot has grabbed copies of it more than once, I think, though. |
|
04:31
🔗
|
Frogging |
that's good |
|
04:31
🔗
|
Frogging |
:p |
|
04:31
🔗
|
ranma |
lol |
|
04:32
🔗
|
Frogging |
they've got a lot of nifty stuff on there |
|
04:32
🔗
|
Frogging |
heh. heh |
|
04:33
🔗
|
ranma |
yes. my first memory of a.s.s content was the Smurf Smuckfest story |
|
04:33
🔗
|
* |
ranma coughs |
|
04:34
🔗
|
ranma |
probably on aol :x |
|
04:36
🔗
|
ranma |
has the old video content on AOL ever been backed up? or was it mercilessly been nuked? |
|
04:36
🔗
|
ranma |
i converted Final Fantasy 7 videos to RM5 and uploaded |
|
04:36
🔗
|
ranma |
*has it |
|
04:36
🔗
|
ranma |
*has it been |
|
04:37
🔗
|
Frogging |
http://www.archiveteam.org/index.php?title=AOL |
|
04:39
🔗
|
ranma |
has the files section been backed up? |
|
04:40
🔗
|
ranma |
or hard to say? |
|
04:42
🔗
|
pikhq |
I am *really* curious who's actually running asstr.org, actually... |
|
04:43
🔗
|
ranma |
maybe one of those DNS history sites caught non-anonymized info |
|
04:43
🔗
|
pikhq |
There's nominally a nonprofit backing it, but that could just be the result of a particularly dedicated single person. |
|
04:43
🔗
|
JesseW |
pikhq: it says there is a team of a couple of people |
|
04:43
🔗
|
pikhq |
Well then. |
|
04:44
🔗
|
ranma |
a furry couple |
|
04:44
🔗
|
ranma |
there were no furries at Denver Comic Con this year :'( |
|
04:44
🔗
|
pikhq |
ranma: Literally, or just guessing? |
|
04:44
🔗
|
ranma |
offensively guessing |
|
04:45
🔗
|
ranma |
how small of a site does AT go after? |
|
04:45
🔗
|
ranma |
and off the radar |
|
04:45
🔗
|
yipdw |
1 page |
|
04:45
🔗
|
yipdw |
archivebot was built for that use case |
|
04:45
🔗
|
JesseW |
and as long as it is public, obscure is fine |
|
04:45
🔗
|
yipdw |
yeah |
|
04:45
🔗
|
yipdw |
private sites or sites that really seem like they should be private, well |
|
04:46
🔗
|
yipdw |
this is where I get into shouting matches so I'm just gonna stop there |
|
04:46
🔗
|
pikhq |
There might be other considerations, but the general heuristic is: is it public information? If so, archive it. |
|
04:46
🔗
|
ranma |
amateur private photo shoots at a comic con? |
|
04:46
🔗
|
yipdw |
uh |
|
04:46
🔗
|
ranma |
-private |
|
04:47
🔗
|
yipdw |
i dunno it depends on what the shoots are |
|
04:47
🔗
|
ranma |
just con-goers |
|
04:47
🔗
|
ranma |
probably non-notable |
|
04:47
🔗
|
yipdw |
oh, I had a different conception of what you meant |
|
04:47
🔗
|
ranma |
i have to work out my let |
|
04:48
🔗
|
ranma |
let's encrypt cert for the folder that JUST has the footage |
|
04:48
🔗
|
ranma |
meanwhile, the folder only had the DCC16 folder of this gallery: https://yourmom.likesbuttse.xxx/gallery-naughty/ (rest is nsfw) |
|
04:48
🔗
|
ranma |
https://yourmom.likesbuttse.xxx/gallery-naughty/ |
|
04:49
🔗
|
yipdw |
i've just seen some shit go down at comic-cons that *really* shouldn't be archived because it would just be a massive dick move |
|
04:49
🔗
|
ranma |
er yeah |
|
04:49
🔗
|
ranma |
ah okay |
|
04:49
🔗
|
yipdw |
but that doesn't necessarily apply to your case so *shrug* |
|
04:49
🔗
|
yipdw |
I dunno, I guess a good question to ask yourself is "would someone be harmed with a permanent and eventually searchable record of this" |
|
04:50
🔗
|
ranma |
probably not. unless they're applying for top secret+ clearance |
|
04:51
🔗
|
JesseW |
and if it is your own content, there's no need to involve archiveteam in it at all -- you are perfectly capable of uploading it to any number of additional places yourself |
|
04:51
🔗
|
ranma |
yeah, i question the value |
|
04:52
🔗
|
ranma |
except for one or two con-goers |
|
04:52
🔗
|
ranma |
does AT back up flickr from time to time? |
|
04:52
🔗
|
ranma |
or TIA |
|
04:52
🔗
|
yipdw |
i suspect we will have to eventually |
|
04:52
🔗
|
JesseW |
all of flickr? hardly |
|
04:52
🔗
|
JesseW |
TIA? |
|
04:52
🔗
|
DoomTay |
TIA? |
|
04:52
🔗
|
ranma |
IA |
|
04:52
🔗
|
yipdw |
or ask Yahoo! real kindly to save it somewhere before they blow it up |
|
04:52
🔗
|
ranma |
;o |
|
04:52
🔗
|
DoomTay |
TumblrInAction? |
|
04:52
🔗
|
DoomTay |
Oh |
|
04:53
🔗
|
JesseW |
Three Inch Acronym? |
|
04:53
🔗
|
ranma |
Three Ingot Acronym |
|
04:53
🔗
|
JesseW |
regarding backup of IA, see http://iabak.archiveteam.org/ |
|
04:53
🔗
|
Frogging |
TumblrInAction was my first thought |
|
04:53
🔗
|
Frogging |
:p |
|
04:56
🔗
|
ranma |
speaking of which, how big was the Tumblr backup? |
|
04:56
🔗
|
Frogging |
I'm not aware there is a tumblr backup.. |
|
04:57
🔗
|
ranma |
http://www.archiveteam.org/index.php?title=Tumblr |
|
04:57
🔗
|
ranma |
"test project" |
|
04:57
🔗
|
ranma |
http://www.archiveteam.org/index.php?title=Projects#Warrior_projects |
|
04:58
🔗
|
Frogging |
"Not saved yet" |
|
04:58
🔗
|
JesseW |
I've been intermittently making snapshots of particular tumblr blogs as I come across them, with archivebot -- I'm always glad for more suggestions. |
|
04:58
🔗
|
JesseW |
I wasn't aware of a test project |
|
04:58
🔗
|
ranma |
ah |
|
04:59
🔗
|
JesseW |
It looks like the test was 4 years ago |
|
04:59
🔗
|
ranma |
oh, i missed the "result" column. just assumed the fact that it was in a green box and that "archive posted" meant that it was completed |
|
04:59
🔗
|
JesseW |
by alard, who isn't regularly involved with AT currently (AFAIK) |
|
05:00
🔗
|
JesseW |
apparently it was 133gb |
|
05:00
🔗
|
JesseW |
according to https://archive.org/details/archiveteam-tumblr-test |
|
05:00
🔗
|
ranma |
if i'm reading it correctly, RapidShare was 2TB? |
|
05:01
🔗
|
ranma |
http://tracker.archiveteam.org/rapidsharedisco/ |
|
05:01
🔗
|
DoomTay |
Woof! |
|
05:01
🔗
|
ranma |
http://www.archiveteam.org/index.php?title=RapidShare |
|
05:05
🔗
|
|
metalcamp has joined #archiveteam-bs |
|
05:14
🔗
|
ranma |
ssl cert updated, but probably not notable https://pics.yougave.me/gallery/ |
|
05:18
🔗
|
JesseW |
ranma: why not just upload a copy elsewhere (i.e. IA, flickr, etc)? |
|
05:19
🔗
|
JesseW |
they seem like perfectly nice pictures |
|
05:20
🔗
|
ranma |
i'd rather not if not in an organized, someone anonymous large archive |
|
05:20
🔗
|
ranma |
*somewhat |
|
05:20
🔗
|
ranma |
but if Flickr will eventually be crawled, i can do that! :D |
|
05:21
🔗
|
JesseW |
ah, that makes more sense |
|
05:22
🔗
|
JesseW |
although, if you dump them in an item on IA with a one-off email address, and minimal metadata (especially if you compress them with something unusual) they'll be pretty well lost for a good long while |
|
05:23
🔗
|
JesseW |
and if you want to be even more sure they are lost, encrypt them with a relatively short key -- that way someone would have to actively bother to decrypt them (which will presumably be trivial eventually, but not for a while) |
|
05:24
🔗
|
JesseW |
also, doesn't the con have a place to submit photos taken there (many cons do)? |
|
05:31
🔗
|
|
Aranje has quit IRC (Quit: Three sheets to the wind) |
|
05:55
🔗
|
HCross |
anyone else getting a constant ImportError: cannot import name RetryError |
|
05:55
🔗
|
HCross |
error, since the recent update of internetarchive |
|
06:03
🔗
|
HCross |
^ never mind, I cocked up |
|
06:13
🔗
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
|
06:16
🔗
|
|
dashcloud has joined #archiveteam-bs |
|
06:54
🔗
|
|
BlueMaxim has quit IRC (Quit: Leaving) |
|
07:01
🔗
|
|
RichardG has quit IRC (Quit: Keyboard not found, press F1 to continue) |
|
07:13
🔗
|
|
JesseW has quit IRC (Ping timeout: 370 seconds) |
|
07:27
🔗
|
|
RichardG has joined #archiveteam-bs |
|
07:55
🔗
|
|
DoomTay has quit IRC (Quit: Page closed) |
|
08:50
🔗
|
|
DiscantX has joined #archiveteam-bs |
|
08:57
🔗
|
|
zhongfu_ has joined #archiveteam-bs |
|
08:57
🔗
|
|
zhongfu has quit IRC (Ping timeout: 260 seconds) |
|
09:04
🔗
|
|
zhongfu_ has quit IRC (Ping timeout: 260 seconds) |
|
09:04
🔗
|
|
DiscantX has quit IRC (Read error: Operation timed out) |
|
09:05
🔗
|
|
zhongfu has joined #archiveteam-bs |
|
09:12
🔗
|
|
DiscantX has joined #archiveteam-bs |
|
09:26
🔗
|
|
BlueMaxim has joined #archiveteam-bs |
|
09:51
🔗
|
|
zhongfu has quit IRC (Remote host closed the connection) |
|
10:06
🔗
|
|
Sum has quit IRC (Ping timeout: 246 seconds) |
|
10:07
🔗
|
|
Sum has joined #archiveteam-bs |
|
10:14
🔗
|
|
zhongfu has joined #archiveteam-bs |
|
10:20
🔗
|
|
Sum has quit IRC (Ping timeout: 246 seconds) |
|
10:32
🔗
|
|
zhongfu has quit IRC (Quit: No Ping reply in 180 seconds.) |
|
10:32
🔗
|
|
GLaDOS has joined #archiveteam-bs |
|
10:34
🔗
|
|
zhongfu has joined #archiveteam-bs |
|
10:58
🔗
|
|
Sum has joined #archiveteam-bs |
|
11:03
🔗
|
|
Sum has quit IRC (Quit: Leaving) |
|
12:05
🔗
|
|
BlueMaxim has quit IRC (Quit: Leaving) |
|
12:10
🔗
|
|
BlueMaxim has joined #archiveteam-bs |
|
12:23
🔗
|
|
DiscantX has quit IRC (Read error: Operation timed out) |
|
13:38
🔗
|
|
BlueMaxim has quit IRC (Quit: Leaving) |
|
13:48
🔗
|
|
VADemon has joined #archiveteam-bs |
|
14:17
🔗
|
|
Start has quit IRC (Quit: Disconnected.) |
|
15:21
🔗
|
|
r3c0d3x has quit IRC (Ping timeout: 260 seconds) |
|
15:23
🔗
|
|
r3c0d3x has joined #archiveteam-bs |
|
15:54
🔗
|
|
Start has joined #archiveteam-bs |
|
15:59
🔗
|
|
Start has quit IRC (Quit: Disconnected.) |
|
16:15
🔗
|
|
DoomTay has joined #archiveteam-bs |
|
16:18
🔗
|
|
JesseW has joined #archiveteam-bs |
|
16:37
🔗
|
Frogging |
arkiver: do you ever use something like BeautifulSoup to parse pages in warrior projects? |
|
16:37
🔗
|
Frogging |
or just simple text searches |
|
16:40
🔗
|
|
JesseW has quit IRC (Ping timeout: 370 seconds) |
|
16:41
🔗
|
arkiver |
I never use BeautifulSoup |
|
16:43
🔗
|
arkiver |
Everything is extracted using pattern matching in lua or regex in Python |
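A minimal sketch of the regex-in-Python approach arkiver mentions, assuming the goal is to pull link targets out of a fetched page without an HTML parser. The pattern, the `extract_links` name, and the sample markup are illustrative assumptions and are not taken from any actual warrior project.

```python
import re

# Hypothetical pattern: capture the value of any quoted href attribute.
# Real projects would tune this per site; this is only an illustration.
HREF_RE = re.compile(r'href=["\']([^"\']+)["\']', re.IGNORECASE)


def extract_links(html):
    """Return every quoted href value found in the page text, in order."""
    return HREF_RE.findall(html)


if __name__ == "__main__":
    sample = '<a href="http://example.com/a">A</a> <A HREF=\'/b\'>B</A>'
    print(extract_links(sample))
```

Plain pattern matching like this is cheap and tolerant of broken markup, at the cost of having to be adjusted whenever the target site changes its HTML.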
|
16:48
🔗
|
|
dashcloud has quit IRC (Ping timeout: 244 seconds) |
|
16:49
🔗
|
|
dashcloud has joined #archiveteam-bs |
|
16:56
🔗
|
godane |
so i found this website: http://www.houstonlgbthistory.org/ |
|
16:56
🔗
|
godane |
it's in archivebot right now |
|
16:56
🔗
|
godane |
may have tons of pdfs |
|
18:45
🔗
|
|
Start has joined #archiveteam-bs |
|
18:48
🔗
|
|
REiN^ has joined #archiveteam-bs |
|
19:36
🔗
|
|
dashcloud has quit IRC (Ping timeout: 244 seconds) |
|
19:37
🔗
|
|
VADemon has quit IRC (Quit: left4dead) |
|
19:39
🔗
|
|
DiscantX has joined #archiveteam-bs |
|
19:40
🔗
|
|
dashcloud has joined #archiveteam-bs |
|
19:46
🔗
|
|
Start has quit IRC (Quit: Disconnected.) |
|
19:47
🔗
|
|
Start has joined #archiveteam-bs |
|
19:52
🔗
|
|
Start has quit IRC (Quit: Disconnected.) |
|
20:07
🔗
|
|
DiscantX has quit IRC (Read error: Operation timed out) |
|
20:08
🔗
|
|
mutoso has quit IRC (Quit: leaving) |
|
20:18
🔗
|
|
mutoso has joined #archiveteam-bs |
|
20:37
🔗
|
|
dxrt has quit IRC (Read error: Operation timed out) |
|
20:38
🔗
|
|
jspiros has quit IRC (Read error: Operation timed out) |
|
20:41
🔗
|
|
dxrt has joined #archiveteam-bs |
|
21:22
🔗
|
|
robink has quit IRC (Ping timeout: 633 seconds) |
|
21:30
🔗
|
|
bzc6p has joined #archiveteam-bs |
|
21:30
🔗
|
|
swebb sets mode: +o bzc6p |
|
21:36
🔗
|
HCross |
yipdw, are you recruiting more pipelines atm? |
|
21:45
🔗
|
|
jspiros has joined #archiveteam-bs |
|
21:59
🔗
|
yipdw |
HCross: no |
|
22:00
🔗
|
HCross |
ok |
|
22:10
🔗
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
|
22:13
🔗
|
|
dashcloud has joined #archiveteam-bs |
|
22:23
🔗
|
|
bzc6p has left |
|
22:35
🔗
|
yipdw |
so if someone is interested in looking at the DNS-error-with-url-list thing |
|
22:36
🔗
|
yipdw |
you will want to look at pipeline/archivebot/seesaw/tasks.py:273-314 |
|
22:36
🔗
|
yipdw |
that's the DownloadUrlFile task. the other part, and this is the part that i have not yet understood well enough to make a fix, is seesaw retry behavior |
|
22:36
🔗
|
yipdw |
i suspect there is a max retries limit somewhere but I haven't been able to find it |
|
22:44
🔗
|
|
metalcamp has quit IRC (Ping timeout: 244 seconds) |
|
22:47
🔗
|
|
aschmitz_ has quit IRC (Read error: Operation timed out) |
|
22:48
🔗
|
|
aschmitz_ has joined #archiveteam-bs |
|
22:49
🔗
|
FalconK |
yipdw: well it's going to get an exception on line 285 requests.get(timeout=none, ...) |
|
22:50
🔗
|
FalconK |
so it will go to the handler at 301 and do self.schedule_retry(item) unconditionally |
|
22:50
🔗
|
FalconK |
so there's the bug |
|
22:51
🔗
|
FalconK |
the right thing to do is probably add a field to item for number of times retried, increment it on each retry, and have it not schedule_retry if the counter is greater than some arbitrary constant |
|
22:53
🔗
|
FalconK |
I'm not sure if you just fall out when that happens, or if you must call complete_item |
|
22:53
🔗
|
FalconK |
because I don't know much about python RetryableTask |
|
22:53
🔗
|
FalconK |
** Task |
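A rough sketch of the bounded-retry idea FalconK describes: keep a retry counter on the item, increment it on each failure, and stop rescheduling once it exceeds an arbitrary constant. It does not use the real seesaw or ArchiveBot classes. The item is modelled as a plain dict, and `handle_failure`, `MAX_RETRIES`, `retry_count`, and the `give_up` callback are all hypothetical names. What the pipeline should actually do when giving up (fail, abort, or complete the item) is still the open question in the discussion above.

```python
MAX_RETRIES = 5  # arbitrary cap; an assumption, not a value from ArchiveBot


def handle_failure(item, schedule_retry, give_up, max_retries=MAX_RETRIES):
    """Count this failure on the item, then retry or give up.

    `item` is a dict standing in for a pipeline item. `schedule_retry`
    and `give_up` are callables supplied by the caller; they stand in
    for whatever the real task would invoke.
    Returns True if a retry was scheduled, False if the item was abandoned.
    """
    item["retry_count"] = item.get("retry_count", 0) + 1
    if item["retry_count"] > max_retries:
        give_up(item)
        return False
    schedule_retry(item)
    return True


if __name__ == "__main__":
    item = {"url_file": "http://example.invalid/urls.txt"}
    for _ in range(MAX_RETRIES + 1):
        retried = handle_failure(
            item,
            schedule_retry=lambda i: print("retry", i["retry_count"]),
            give_up=lambda i: print("giving up after", i["retry_count"], "failures"),
        )
```

Keeping the counter on the item rather than on the task means each job is capped independently, which matches the "add a field to item" suggestion.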
|
23:18
🔗
|
DoomTay |
Anyone heard of PostGhost? |
|
23:18
🔗
|
DoomTay |
Tweet archive that just shut down today |
|
23:18
🔗
|
DoomTay |
I actually had no idea it existed until now |
|
23:18
🔗
|
yipdw |
FalconK: yeah, the exit strategy is what I haven't figured out yet |
|
23:19
🔗
|
|
robink has joined #archiveteam-bs |
|
23:30
🔗
|
DoomTay |
I'm about halfway through archiving artist pages on portalgraphics. |
|
23:31
🔗
|
DoomTay |
It's staggering how much wasn't saved beforehand, even though the site in its current form has been around since ~2010-2011 |
|
23:39
🔗
|
|
tomwsmf-a has joined #archiveteam-bs |
|
23:41
🔗
|
FalconK |
yipdw: the most intuitive thing to me seems to be to treat it as though it were aborted |
|
23:41
🔗
|
FalconK |
our definition of success is pretty squishy though |
|
23:42
🔗
|
FalconK |
oh, why on earth might one of my WARCs in opensource https://archive.org/details/archiveteam_archivebot_go_falconk_uprisingradio_org_20160427 have almost 70k views? |
|
23:46
🔗
|
arkiver |
because it's popular? |
|
23:46
🔗
|
|
Start has joined #archiveteam-bs |
|
23:46
🔗
|
arkiver |
Guess it's some important site you saved there |
|
23:47
🔗
|
FalconK |
guess so but it's in opensource and theoretically not in wayback. |
|
23:47
🔗
|
arkiver |
I see |
|
23:47
🔗
|
arkiver |
Everything with mediatype 'web' goes into the wayback machine |
|
23:47
🔗
|
arkiver |
also if it is in opensource |
|
23:47
🔗
|
FalconK |
oh |
|
23:48
🔗
|
arkiver |
it just takes up to a month or so to get in the wayback macine |
|
23:48
🔗
|
arkiver |
machine* |
|
23:48
🔗
|
arkiver |
where in a web collection it takes a day or so |
|
23:48
🔗
|
FalconK |
so there is no need for me to annoy IA people with requests to move my content into the archivebot collection then |
|
23:48
🔗
|
arkiver |
well, it might be nice to have it moved to a web collection |
|
23:48
🔗
|
arkiver |
but to have it in the wayback machine, no |
|
23:49
🔗
|
FalconK |
if I had permission I would upload it straight there, but such is not forthcoming |
|
23:59
🔗
|
|
DoomTay has quit IRC (Quit: Page closed) |