Time |
Nickname |
Message |
00:18
🔗
|
|
DoomTay has joined #archiveteam-bs |
00:19
🔗
|
|
Stiletto has joined #archiveteam-bs |
00:19
🔗
|
|
tomwsmf-a has joined #archiveteam-bs |
00:24
🔗
|
|
DiscantX has joined #archiveteam-bs |
00:30
🔗
|
|
JesseW has joined #archiveteam-bs |
00:53
🔗
|
godane |
i'm not doing the examiner.com website |
00:53
🔗
|
godane |
mostly cause its too big |
00:53
🔗
|
godane |
even when doing daily sitemap dumps of it |
00:54
🔗
|
godane |
there are like 1000+ urls per day from that website |
00:57
🔗
|
|
VADemon has quit IRC (Quit: left4dead) |
00:57
🔗
|
|
DiscantX has quit IRC (Read error: Operation timed out) |
01:12
🔗
|
|
JesseW has quit IRC (Ping timeout: 370 seconds) |
01:23
🔗
|
|
Stiletto has quit IRC (Ping timeout: 244 seconds) |
01:24
🔗
|
|
Coderjoe has quit IRC (Read error: Operation timed out) |
01:28
🔗
|
|
Coderjoe has joined #archiveteam-bs |
01:37
🔗
|
DoomTay |
Well ArchiveBot is doing it anyway, thanks to SketchCow |
02:01
🔗
|
|
coretx has quit IRC (Read error: Operation timed out) |
02:02
🔗
|
|
RichardG has quit IRC (Read error: Operation timed out) |
02:02
🔗
|
|
RichardG has joined #archiveteam-bs |
02:04
🔗
|
|
coretx has joined #archiveteam-bs |
02:05
🔗
|
|
JesseW has joined #archiveteam-bs |
02:10
🔗
|
|
tomwsmf-a has quit IRC (Read error: Operation timed out) |
02:18
🔗
|
|
Stiletto has joined #archiveteam-bs |
02:45
🔗
|
|
RichardG has quit IRC (Read error: Operation timed out) |
02:45
🔗
|
|
RichardG has joined #archiveteam-bs |
03:09
🔗
|
|
RichardG has quit IRC (Read error: Operation timed out) |
03:09
🔗
|
|
RichardG has joined #archiveteam-bs |
03:33
🔗
|
|
Coderjoe has quit IRC (Read error: Operation timed out) |
03:52
🔗
|
|
RichardG has quit IRC (Ping timeout: 370 seconds) |
03:54
🔗
|
|
Swizzle has quit IRC (Quit: Leaving) |
03:57
🔗
|
|
RichardG has joined #archiveteam-bs |
04:01
🔗
|
|
Coderjoe has joined #archiveteam-bs |
04:05
🔗
|
|
Sk1d has quit IRC (Ping timeout: 194 seconds) |
04:08
🔗
|
|
RichardG has quit IRC (Ping timeout: 260 seconds) |
04:11
🔗
|
|
Sk1d has joined #archiveteam-bs |
04:12
🔗
|
|
RichardG has joined #archiveteam-bs |
04:27
🔗
|
ranma |
www.asstr.org isn't run by IA/Jason Scott/someone in AT, is it? x) |
04:27
🔗
|
ranma |
(alt.sex.stories text repository) |
04:28
🔗
|
Frogging |
that's been around forever |
04:28
🔗
|
Frogging |
I doubt it |
04:28
🔗
|
ranma |
ah |
04:29
🔗
|
* |
ranma watches CITIES ON THE EDGE OF NEVER: Life in the Trenches of the Web in 2012 (JS talk for some posh UK conference) |
04:30
🔗
|
|
GLaDOS has quit IRC (Ping timeout: 260 seconds) |
04:31
🔗
|
JesseW |
ranma: #archivebot has grabbed copies of it more than once, I think, though. |
04:31
🔗
|
Frogging |
that's good |
04:31
🔗
|
Frogging |
:p |
04:31
🔗
|
ranma |
lol |
04:32
🔗
|
Frogging |
they've got a lot of nifty stuff on there |
04:32
🔗
|
Frogging |
heh. heh |
04:33
🔗
|
ranma |
yes. my first memory of a.s.s content was the Smurf Smuckfest story |
04:33
🔗
|
* |
ranma coughs |
04:34
🔗
|
ranma |
probably on aol :x |
04:36
🔗
|
ranma |
has the old video content on AOL ever been backed up? or was it mercilessly been nuked? |
04:36
🔗
|
ranma |
i converted Final Fantasy 7 videos to RM5 and uploaded |
04:36
🔗
|
ranma |
*has it |
04:36
🔗
|
ranma |
*has it been |
04:37
🔗
|
Frogging |
http://www.archiveteam.org/index.php?title=AOL |
04:39
🔗
|
ranma |
has the files section been backed up? |
04:40
🔗
|
ranma |
or hard to say? |
04:42
🔗
|
pikhq |
I am *really* curious who's actually running asstr.org, actually... |
04:43
🔗
|
ranma |
maybe one of those DNS history sites caught non-anonymized info |
04:43
🔗
|
pikhq |
There's nominally a nonprofit backing it, but that could just be the result of a particularly dedicated single person. |
04:43
🔗
|
JesseW |
pikhq: it says there is a team of a couple of people |
04:43
🔗
|
pikhq |
Well then. |
04:44
🔗
|
ranma |
a furry couple |
04:44
🔗
|
ranma |
there were no furries at Denver Comic Con this year :'( |
04:44
🔗
|
pikhq |
ranma: Literally, or just guessing? |
04:44
🔗
|
ranma |
offensively guessing |
04:45
🔗
|
ranma |
how small of a site does AT go after? |
04:45
🔗
|
ranma |
and off the radar |
04:45
🔗
|
yipdw |
1 page |
04:45
🔗
|
yipdw |
archivebot was built for that use case |
04:45
🔗
|
JesseW |
and as long as it is public, obscure is fine |
04:45
🔗
|
yipdw |
yeah |
04:45
🔗
|
yipdw |
private sites or sites that really seem like they should be private, well |
04:46
🔗
|
yipdw |
this is where I get into shouting matches so I'm just gonna stop there |
04:46
🔗
|
pikhq |
There might be other considerations, but the general heuristic is: is it public information? If so, archive it. |
04:46
🔗
|
ranma |
amateur private photo shoots at a comic con? |
04:46
🔗
|
yipdw |
uh |
04:46
🔗
|
ranma |
-private |
04:47
🔗
|
yipdw |
i dunno it depends on what the shoots are |
04:47
🔗
|
ranma |
just con-goers |
04:47
🔗
|
ranma |
probably non-notable |
04:47
🔗
|
yipdw |
oh, I had a different conception of what you meant |
04:47
🔗
|
ranma |
i have to work out my let |
04:48
🔗
|
ranma |
let's encrypt cert for the folder that JUST has the footage |
04:48
🔗
|
ranma |
meanwhile, the folder only had the DCC16 folder of this gallery: https://yourmom.likesbuttse.xxx/gallery-naughty/ (rest is nsfw) |
04:48
🔗
|
ranma |
https://yourmom.likesbuttse.xxx/gallery-naughty/ |
04:49
🔗
|
yipdw |
i've just seen some shit go down at comic-cons that *really* shouldn't be archived because it would just be a massive dick move |
04:49
🔗
|
ranma |
er yeah |
04:49
🔗
|
ranma |
ah okay |
04:49
🔗
|
yipdw |
but that doesn't necessarily apply to your case so *shrug* |
04:49
🔗
|
yipdw |
I dunno, I guess a good question to ask yourself is "would someone be harmed with a permanent and eventually searchable record of this" |
04:50
🔗
|
ranma |
probably not. unless they're applying for top secret+ clearance |
04:51
🔗
|
JesseW |
and if it is your own content, there's no need to involve archiveteam in it at all -- you are perfectly capable of uploading it to any number of additional places yourself |
04:51
🔗
|
ranma |
yeah, i question the value |
04:52
🔗
|
ranma |
except for one or two con-goers |
04:52
🔗
|
ranma |
does AT back up flickr from time to time? |
04:52
🔗
|
ranma |
or TIA |
04:52
🔗
|
yipdw |
i suspect we will have to eventually |
04:52
🔗
|
JesseW |
all of flickr? hardly |
04:52
🔗
|
JesseW |
TIA? |
04:52
🔗
|
DoomTay |
TIA? |
04:52
🔗
|
ranma |
IA |
04:52
🔗
|
yipdw |
or ask Yahoo! real kindly to save it somewhere before they blow it up |
04:52
🔗
|
ranma |
;o |
04:52
🔗
|
DoomTay |
TumblrInAction? |
04:52
🔗
|
DoomTay |
Oh |
04:53
🔗
|
JesseW |
Three Inch Acronym? |
04:53
🔗
|
ranma |
Three Ingot Acronym |
04:53
🔗
|
JesseW |
regarding backup of IA, see http://iabak.archiveteam.org/ |
04:53
🔗
|
Frogging |
TumblrInAction was my first thought |
04:53
🔗
|
Frogging |
:p |
04:56
🔗
|
ranma |
speaking of which, how big was the Tumblr backup? |
04:56
🔗
|
Frogging |
I'm not aware there is a tumblr backup.. |
04:57
🔗
|
ranma |
http://www.archiveteam.org/index.php?title=Tumblr |
04:57
🔗
|
ranma |
"test project" |
04:57
🔗
|
ranma |
http://www.archiveteam.org/index.php?title=Projects#Warrior_projects |
04:58
🔗
|
Frogging |
"Not saved yet" |
04:58
🔗
|
JesseW |
I've been intermittently making snapshots of particular tumblr blogs as I come across them, with archivebot -- I'm always glad for more suggestions. |
04:58
🔗
|
JesseW |
I wasn't aware of a test project |
04:58
🔗
|
ranma |
ah |
04:59
🔗
|
JesseW |
It looks like the test was 4 years ago |
04:59
🔗
|
ranma |
oh, i missed the "result" column. just assumed the fact that it was in a green box and that "archive posted" meant that it was completed |
04:59
🔗
|
JesseW |
by alard, who isn't regularly involved with AT currently (AFAIK) |
05:00
🔗
|
JesseW |
apparently it was 133gb |
05:00
🔗
|
JesseW |
according to https://archive.org/details/archiveteam-tumblr-test |
05:00
🔗
|
ranma |
if i'm reading it correctly, RapidShare was 2TB? |
05:01
🔗
|
ranma |
http://tracker.archiveteam.org/rapidsharedisco/ |
05:01
🔗
|
DoomTay |
Woof! |
05:01
🔗
|
ranma |
http://www.archiveteam.org/index.php?title=RapidShare |
05:05
🔗
|
|
metalcamp has joined #archiveteam-bs |
05:14
🔗
|
ranma |
ssl cert updated, but probably not notable https://pics.yougave.me/gallery/ |
05:18
🔗
|
JesseW |
ranma: why not just upload a copy elsewhere (i.e. IA, flickr, etc)? |
05:19
🔗
|
JesseW |
they seem like perfectly nice pictures |
05:20
🔗
|
ranma |
i'd rather not if not in an organized, someone anonymous large archive |
05:20
🔗
|
ranma |
*somewhat |
05:20
🔗
|
ranma |
but if Flickr will eventually be crawled, i can do that! :D |
05:21
🔗
|
JesseW |
ah, that makes more sense |
05:22
🔗
|
JesseW |
although, if you dump them in an item on IA with a one-off email address, and minimal metadata (especially if you compress them with something unusual) they'll be pretty well lost for a good long while |
05:23
🔗
|
JesseW |
and if you want to be even more sure they are lost, encrypt them with a relatively short key -- that way someone would have to actively bother to decrypt them (which will presumably be trivial eventually, but not for a while) |
05:24
🔗
|
JesseW |
also, doesn't the con have a place to submit photos taken there (many cons do)? |
05:31
🔗
|
|
Aranje has quit IRC (Quit: Three sheets to the wind) |
05:55
🔗
|
HCross |
anyone else getting a constant ImportError: cannot import name RetryError |
05:55
🔗
|
HCross |
error, since the recent update of internetarchive |
06:03
🔗
|
HCross |
^ never mind, I cocked up |
06:13
🔗
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
06:16
🔗
|
|
dashcloud has joined #archiveteam-bs |
06:54
🔗
|
|
BlueMaxim has quit IRC (Quit: Leaving) |
07:01
🔗
|
|
RichardG has quit IRC (Quit: Keyboard not found, press F1 to continue) |
07:13
🔗
|
|
JesseW has quit IRC (Ping timeout: 370 seconds) |
07:27
🔗
|
|
RichardG has joined #archiveteam-bs |
07:55
🔗
|
|
DoomTay has quit IRC (Quit: Page closed) |
08:50
🔗
|
|
DiscantX has joined #archiveteam-bs |
08:57
🔗
|
|
zhongfu_ has joined #archiveteam-bs |
08:57
🔗
|
|
zhongfu has quit IRC (Ping timeout: 260 seconds) |
09:04
🔗
|
|
zhongfu_ has quit IRC (Ping timeout: 260 seconds) |
09:04
🔗
|
|
DiscantX has quit IRC (Read error: Operation timed out) |
09:05
🔗
|
|
zhongfu has joined #archiveteam-bs |
09:12
🔗
|
|
DiscantX has joined #archiveteam-bs |
09:26
🔗
|
|
BlueMaxim has joined #archiveteam-bs |
09:51
🔗
|
|
zhongfu has quit IRC (Remote host closed the connection) |
10:06
🔗
|
|
Sum has quit IRC (Ping timeout: 246 seconds) |
10:07
🔗
|
|
Sum has joined #archiveteam-bs |
10:14
🔗
|
|
zhongfu has joined #archiveteam-bs |
10:20
🔗
|
|
Sum has quit IRC (Ping timeout: 246 seconds) |
10:32
🔗
|
|
zhongfu has quit IRC (Quit: No Ping reply in 180 seconds.) |
10:32
🔗
|
|
GLaDOS has joined #archiveteam-bs |
10:34
🔗
|
|
zhongfu has joined #archiveteam-bs |
10:58
🔗
|
|
Sum has joined #archiveteam-bs |
11:03
🔗
|
|
Sum has quit IRC (Quit: Leaving) |
12:05
🔗
|
|
BlueMaxim has quit IRC (Quit: Leaving) |
12:10
🔗
|
|
BlueMaxim has joined #archiveteam-bs |
12:23
🔗
|
|
DiscantX has quit IRC (Read error: Operation timed out) |
13:38
🔗
|
|
BlueMaxim has quit IRC (Quit: Leaving) |
13:48
🔗
|
|
VADemon has joined #archiveteam-bs |
14:17
🔗
|
|
Start has quit IRC (Quit: Disconnected.) |
15:21
🔗
|
|
r3c0d3x has quit IRC (Ping timeout: 260 seconds) |
15:23
🔗
|
|
r3c0d3x has joined #archiveteam-bs |
15:54
🔗
|
|
Start has joined #archiveteam-bs |
15:59
🔗
|
|
Start has quit IRC (Quit: Disconnected.) |
16:15
🔗
|
|
DoomTay has joined #archiveteam-bs |
16:18
🔗
|
|
JesseW has joined #archiveteam-bs |
16:37
🔗
|
Frogging |
arkiver: do you ever use something like BeautifulSoup to parse pages in warrior projects? |
16:37
🔗
|
Frogging |
or just simple text searches |
16:40
🔗
|
|
JesseW has quit IRC (Ping timeout: 370 seconds) |
16:41
🔗
|
arkiver |
I never use BeautifulSoup |
16:43
🔗
|
arkiver |
Everything is extracted using pattern matching in lua or regex in Python |
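A minimal sketch of the regex approach arkiver mentions, in Python. This is not code from any warrior project: the pattern, the `extract_urls` helper, and the sample HTML are all hypothetical and only illustrate pulling URLs out of a page without an HTML parser.

```python
import re

# Hypothetical pattern (an assumption, not taken from any real project):
# captures the value of href/src attributes quoted with " or '.
URL_ATTR = re.compile(r'''(?:href|src)\s*=\s*["']([^"']+)["']''', re.IGNORECASE)


def extract_urls(html):
    """Return attribute URLs in document order, skipping duplicates."""
    seen = set()
    urls = []
    for match in URL_ATTR.finditer(html):
        url = match.group(1)
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


# Made-up sample input for illustration only.
SAMPLE = (
    '<a href="http://example.com/a">a</a>'
    "<img SRC='/img/b.png'>"
    '<a href="http://example.com/a">again</a>'
)

found = extract_urls(SAMPLE)
```

A regex like this is fast and tolerant of broken markup, but it will miss unquoted attributes and can match text inside comments or scripts, which is the usual trade-off against a parser such as BeautifulSoup.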
16:48
🔗
|
|
dashcloud has quit IRC (Ping timeout: 244 seconds) |
16:49
🔗
|
|
dashcloud has joined #archiveteam-bs |
16:56
🔗
|
godane |
so i found this website: http://www.houstonlgbthistory.org/ |
16:56
🔗
|
godane |
its in archivebot right now |
16:56
🔗
|
godane |
may have tons of pdfs |
18:45
🔗
|
|
Start has joined #archiveteam-bs |
18:48
🔗
|
|
REiN^ has joined #archiveteam-bs |
19:36
🔗
|
|
dashcloud has quit IRC (Ping timeout: 244 seconds) |
19:37
🔗
|
|
VADemon has quit IRC (Quit: left4dead) |
19:39
🔗
|
|
DiscantX has joined #archiveteam-bs |
19:40
🔗
|
|
dashcloud has joined #archiveteam-bs |
19:46
🔗
|
|
Start has quit IRC (Quit: Disconnected.) |
19:47
🔗
|
|
Start has joined #archiveteam-bs |
19:52
🔗
|
|
Start has quit IRC (Quit: Disconnected.) |
20:07
🔗
|
|
DiscantX has quit IRC (Read error: Operation timed out) |
20:08
🔗
|
|
mutoso has quit IRC (Quit: leaving) |
20:18
🔗
|
|
mutoso has joined #archiveteam-bs |
20:37
🔗
|
|
dxrt has quit IRC (Read error: Operation timed out) |
20:38
🔗
|
|
jspiros has quit IRC (Read error: Operation timed out) |
20:41
🔗
|
|
dxrt has joined #archiveteam-bs |
21:22
🔗
|
|
robink has quit IRC (Ping timeout: 633 seconds) |
21:30
🔗
|
|
bzc6p has joined #archiveteam-bs |
21:30
🔗
|
|
swebb sets mode: +o bzc6p |
21:36
🔗
|
HCross |
yipdw, are you recruiting more pipelines atm? |
21:45
🔗
|
|
jspiros has joined #archiveteam-bs |
21:59
🔗
|
yipdw |
HCross: no |
22:00
🔗
|
HCross |
ok |
22:10
🔗
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
22:13
🔗
|
|
dashcloud has joined #archiveteam-bs |
22:23
🔗
|
|
bzc6p has left |
22:35
🔗
|
yipdw |
so if someone is interested in looking at the DNS-error-with-url-list thing |
22:36
🔗
|
yipdw |
you will want to look at pipeline/archivebot/seesaw/tasks.py:273-314 |
22:36
🔗
|
yipdw |
that's the DownloadUrlFile task. the other part, and this is the part that i have not yet understood well enough to make a fix, is seesaw retry behavior |
22:36
🔗
|
yipdw |
i suspect there is a max retries limit somewhere but I haven't been able to find it |
22:44
🔗
|
|
metalcamp has quit IRC (Ping timeout: 244 seconds) |
22:47
🔗
|
|
aschmitz_ has quit IRC (Read error: Operation timed out) |
22:48
🔗
|
|
aschmitz_ has joined #archiveteam-bs |
22:49
🔗
|
FalconK |
yipdw: well it's going to get an exception on line 285 requests.get(timeout=None, ...) |
22:50
🔗
|
FalconK |
so it will go to the handler at 301 and do self.schedule_retry(item) unconditionally |
22:50
🔗
|
FalconK |
so there's the bug |
22:51
🔗
|
FalconK |
the right thing to do is probably add a field to item for number of times retried, increment it on each retry, and have it not schedule_retry if the counter is greater than some arbitrary constant |
22:53
🔗
|
FalconK |
I'm not sure if you just fall out when that happens, or if you must call complete_item |
22:53
🔗
|
FalconK |
because I don't know much about python RetryableTask |
22:53
🔗
|
FalconK |
** Task |
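A rough sketch of the retry cap FalconK describes: keep a counter on the item, increment it on each failure, and stop scheduling retries once it passes a constant. It deliberately does not use seesaw. `MAX_RETRIES`, the `"retries"` key, `handle_fetch_error`, and the two callbacks are hypothetical stand-ins, and what "giving up" should actually do in the pipeline is left open, as it is in the discussion.

```python
MAX_RETRIES = 5  # arbitrary cap; an assumption, not a value from the pipeline


def handle_fetch_error(item, schedule_retry, give_up):
    """Count a failure on the item and either retry or give up.

    item is a dict-like object; schedule_retry and give_up are callables
    standing in for whatever the real pipeline uses (hypothetical here).
    Returns True if a retry was scheduled, False if the item was dropped.
    """
    retries = item.get("retries", 0) + 1
    item["retries"] = retries
    if retries > MAX_RETRIES:
        give_up(item)
        return False
    schedule_retry(item)
    return True


# Simulate a URL list that never resolves: every attempt fails.
item = {"url_file": "http://example.invalid/urls.txt"}
events = []
while handle_fetch_error(
    item,
    schedule_retry=lambda i: events.append("retry"),
    give_up=lambda i: events.append("give_up"),
):
    pass
```

With the cap at 5, the loop schedules five retries and then calls the give-up callback once, instead of retrying forever.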
23:18
🔗
|
DoomTay |
Anyone heard of PostGhost? |
23:18
🔗
|
DoomTay |
Tweet archive that just shut down today |
23:18
🔗
|
DoomTay |
I actually had no idea it existed until now |
23:18
🔗
|
yipdw |
FalconK: yeah, the exit strategy is what I haven't figured out yet |
23:19
🔗
|
|
robink has joined #archiveteam-bs |
23:30
🔗
|
DoomTay |
I'm about halfway through archiving artist pages on portalgraphics. |
23:31
🔗
|
DoomTay |
It's staggering how much wasn't saved beforehand, even though the site in its current form has been around since ~2010-2011 |
23:39
🔗
|
|
tomwsmf-a has joined #archiveteam-bs |
23:41
🔗
|
FalconK |
yipdw: the most intuitive thing to me seems to be to treat it as though it were aborted |
23:41
🔗
|
FalconK |
our definition of success is pretty squishy though |
23:42
🔗
|
FalconK |
oh, why on earth might one of my WARCs in opensource https://archive.org/details/archiveteam_archivebot_go_falconk_uprisingradio_org_20160427 have almost 70k views? |
23:46
🔗
|
arkiver |
because it's popular? |
23:46
🔗
|
|
Start has joined #archiveteam-bs |
23:46
🔗
|
arkiver |
Guess it's some important site you saved there |
23:47
🔗
|
FalconK |
guess so but it's in opensource and theoretically not in wayback. |
23:47
🔗
|
arkiver |
I see |
23:47
🔗
|
arkiver |
Everything with mediatype 'web' goes into the wayback machine |
23:47
🔗
|
arkiver |
also if it is in opensource |
23:47
🔗
|
FalconK |
oh |
23:48
🔗
|
arkiver |
it just takes up to a month or so to get in the wayback macine |
23:48
🔗
|
arkiver |
machine* |
23:48
🔗
|
arkiver |
where in a web collection it takes a day or so |
23:48
🔗
|
FalconK |
so there is no need for me to annoy IA people with requests to move my content into the archivebot collection then |
23:48
🔗
|
arkiver |
well, it might be nice to have it moved to a web collection |
23:48
🔗
|
arkiver |
but to have it in the wayback machine, no |
23:49
🔗
|
FalconK |
if I had permission I would upload it straight there, but such is not forthcoming |
23:59
🔗
|
|
DoomTay has quit IRC (Quit: Page closed) |