#archiveteam-bs 2016-07-08,Fri


Time Nickname Message
00:18 🔗 DoomTay has joined #archiveteam-bs
00:19 🔗 Stiletto has joined #archiveteam-bs
00:19 🔗 tomwsmf-a has joined #archiveteam-bs
00:24 🔗 DiscantX has joined #archiveteam-bs
00:30 🔗 JesseW has joined #archiveteam-bs
00:53 🔗 godane i'm not doing the examiner.com website
00:53 🔗 godane mostly cause its too big
00:53 🔗 godane even when doing daily sitemap dumps of it
00:54 🔗 godane there is like 1000+ urls per a day from that website
00:57 🔗 VADemon has quit IRC (Quit: left4dead)
00:57 🔗 DiscantX has quit IRC (Read error: Operation timed out)
01:12 🔗 JesseW has quit IRC (Ping timeout: 370 seconds)
01:23 🔗 Stiletto has quit IRC (Ping timeout: 244 seconds)
01:24 🔗 Coderjoe has quit IRC (Read error: Operation timed out)
01:28 🔗 Coderjoe has joined #archiveteam-bs
01:37 🔗 DoomTay Well ArchiveBot is doing it anyway, thanks to SketchCow
02:01 🔗 coretx has quit IRC (Read error: Operation timed out)
02:02 🔗 RichardG has quit IRC (Read error: Operation timed out)
02:02 🔗 RichardG has joined #archiveteam-bs
02:04 🔗 coretx has joined #archiveteam-bs
02:05 🔗 JesseW has joined #archiveteam-bs
02:10 🔗 tomwsmf-a has quit IRC (Read error: Operation timed out)
02:18 🔗 Stiletto has joined #archiveteam-bs
02:45 🔗 RichardG has quit IRC (Read error: Operation timed out)
02:45 🔗 RichardG has joined #archiveteam-bs
03:09 🔗 RichardG has quit IRC (Read error: Operation timed out)
03:09 🔗 RichardG has joined #archiveteam-bs
03:33 🔗 Coderjoe has quit IRC (Read error: Operation timed out)
03:52 🔗 RichardG has quit IRC (Ping timeout: 370 seconds)
03:54 🔗 Swizzle has quit IRC (Quit: Leaving)
03:57 🔗 RichardG has joined #archiveteam-bs
04:01 🔗 Coderjoe has joined #archiveteam-bs
04:05 🔗 Sk1d has quit IRC (Ping timeout: 194 seconds)
04:08 🔗 RichardG has quit IRC (Ping timeout: 260 seconds)
04:11 🔗 Sk1d has joined #archiveteam-bs
04:12 🔗 RichardG has joined #archiveteam-bs
04:27 🔗 ranma www.asstr.org isn't run by IA/Jason Scott/someone in AT, is it? x)
04:27 🔗 ranma (alt.sex.stories text repository)
04:28 🔗 Frogging that's been around forever
04:28 🔗 Frogging I doubt it
04:28 🔗 ranma ah
04:29 🔗 * ranma watches CITIES ON THE EDGE OF NEVER: Life in the Trenches of the Web in 2012 (JS talk for some posh UK conference)
04:30 🔗 GLaDOS has quit IRC (Ping timeout: 260 seconds)
04:31 🔗 JesseW ranma: #archivebot has grabbed copies of it more than once, I think, though.
04:31 🔗 Frogging that's good
04:31 🔗 Frogging :p
04:31 🔗 ranma lol
04:32 🔗 Frogging they've got a lot of nifty stuff on there
04:32 🔗 Frogging heh. heh
04:33 🔗 ranma yes. my first memory of a.s.s content was the Smurf Smuckfest story
04:33 🔗 * ranma coughs
04:34 🔗 ranma probably on aol :x
04:36 🔗 ranma has the old video content on AOL ever been backed up? or was it mercilessly been nuked?
04:36 🔗 ranma i converted Final Fantasy 7 videos to RM5 and uploaded
04:36 🔗 ranma *has it
04:36 🔗 ranma *has it been
04:37 🔗 Frogging http://www.archiveteam.org/index.php?title=AOL
04:39 🔗 ranma have the files section been backed up?
04:40 🔗 ranma or hard to say?
04:42 🔗 pikhq I am *really* curious who's actually running asstr.org, actually...
04:43 🔗 ranma maybe one of those DNS history sites caught non-anonymized info
04:43 🔗 pikhq There's nominally a nonprofit backing it, but that could just be the result of a particularly dedicated single person.
04:43 🔗 JesseW pikhq: it says there is a team of a couple of people
04:43 🔗 pikhq Well then.
04:44 🔗 ranma a furry couple
04:44 🔗 ranma there were no furries at Denver Comic Con this year :'(
04:44 🔗 pikhq ranma: Literally, or just guessing?
04:44 🔗 ranma offensively guessing
04:45 🔗 ranma how small of a site does AT go after?
04:45 🔗 ranma and off the radar
04:45 🔗 yipdw 1 page
04:45 🔗 yipdw archivebot was built for that use case
04:45 🔗 JesseW and as long as it is public, obscure is fine
04:45 🔗 yipdw yeah
04:45 🔗 yipdw private sites or sites that really seem like they should be private, well
04:46 🔗 yipdw this is where I get into shouting matches so I'm just gonna stop there
04:46 🔗 pikhq There might be other considerations, but the general heuristic is: is it public information? If so, archive it.
04:46 🔗 ranma amateur private photo shoots at a comic con?
04:46 🔗 yipdw uh
04:46 🔗 ranma -private
04:47 🔗 yipdw i dunno it depends on what the shoots are
04:47 🔗 ranma just con-goers
04:47 🔗 ranma probably non-notable
04:47 🔗 yipdw oh, I had a different conception of what you meant
04:47 🔗 ranma i have to work out my let
04:48 🔗 ranma let's encrypt cert for the folder that JUST has the footage
04:48 🔗 ranma meanwhile, the folder only had the DCC16 folder of this gallery: https://yourmom.likesbuttse.xxx/gallery-naughty/ (rest is nsfw)
04:48 🔗 ranma https://yourmom.likesbuttse.xxx/gallery-naughty/
04:49 🔗 yipdw i've just seen some shit go down at comic-cons that *really* shouldn't be archived because it would just be a massive dick move
04:49 🔗 ranma er yeah
04:49 🔗 ranma ah okay
04:49 🔗 yipdw but that doesn't necessarily apply to your case so *shrug*
04:49 🔗 yipdw I dunno, I guess a good question to ask yourself is "would someone be harmed with a permanent and eventually searchable record of this"
04:50 🔗 ranma probably not. unless they're applying for top secret+ clearance
04:51 🔗 JesseW and if it is your own content, there's no need to involve archiveteam in it at all -- you are perfectly capable of uploading it to any number of additional places yourself
04:51 🔗 ranma yeah, i question the value
04:52 🔗 ranma except for one or two con-goers
04:52 🔗 ranma does AT back up flickr from time to time?
04:52 🔗 ranma or TIA
04:52 🔗 yipdw i suspect we will have to eventually
04:52 🔗 JesseW all of flickr? hardly
04:52 🔗 JesseW TIA?
04:52 🔗 DoomTay TIA?
04:52 🔗 ranma IA
04:52 🔗 yipdw or ask Yahoo! real kindly to save it somewhere before they blow it up
04:52 🔗 ranma ;o
04:52 🔗 DoomTay TumblrInAction?
04:52 🔗 DoomTay Oh
04:53 🔗 JesseW Three Inch Acronynm?
04:53 🔗 ranma Three Ingot Acronym
04:53 🔗 JesseW regarding back of IA, see http://iabak.archiveteam.org/
04:53 🔗 Frogging TumblrInAction was my first thought
04:53 🔗 Frogging :p
04:56 🔗 ranma speaking of which, how big was the Tumblr backup?
04:56 🔗 Frogging I'm not aware there is a tumblr backup..
04:57 🔗 ranma http://www.archiveteam.org/index.php?title=Tumblr
04:57 🔗 ranma "test project"
04:57 🔗 ranma http://www.archiveteam.org/index.php?title=Projects#Warrior_projects
04:58 🔗 Frogging "Not saved yet"
04:58 🔗 JesseW I've been intermittently making snapshots of particular tumblr blogs as I come across them, with archivebot -- I'm always glad for more suggestions.
04:58 🔗 JesseW I wasn't aware of a test project
04:58 🔗 ranma ah
04:59 🔗 JesseW It looks like the test was 4 years ago
04:59 🔗 ranma oh, i missed the "result" column. just assumed the fact that it was in a green box and that "archive posted" meant that it was completed
04:59 🔗 JesseW by alard, who isn't regularly involved with AT currently (AFAIK)
05:00 🔗 JesseW apparently it was 133gb
05:00 🔗 JesseW according to https://archive.org/details/archiveteam-tumblr-test
05:00 🔗 ranma if i'm reading it correctly, RapidShare was 2TB?
05:01 🔗 ranma http://tracker.archiveteam.org/rapidsharedisco/
05:01 🔗 DoomTay Woof!
05:01 🔗 ranma http://www.archiveteam.org/index.php?title=RapidShare
05:05 🔗 metalcamp has joined #archiveteam-bs
05:14 🔗 ranma ssl cert updated, but probably not notable https://pics.yougave.me/gallery/
05:18 🔗 JesseW ranma: why not just upload a copy elsewhere (i.e. IA, flickr, etc)?
05:19 🔗 JesseW they seem like perfectly nice pictures
05:20 🔗 ranma i'd rather not if not in an organized, someone anonymous large archive
05:20 🔗 ranma *somewhat
05:20 🔗 ranma but if Flickr will eventually be crawled, i can do that! :D
05:21 🔗 JesseW ah, that makes more sense
05:22 🔗 JesseW although, if you dump them in an item on IA with a one-off email address, and minimal metadata (especially if you compress them with something unusual) they'll be pretty well lost for a good long while
05:23 🔗 JesseW and if you want to be even more sure they are lost, encrypt them with a relatively short key -- that way someone would have to actively bother to decrypt them (which will presumably be trivial eventually, but not for a while)
05:24 🔗 JesseW also, doesn't the con have a place to submit photos taken there (many cons do)?
05:31 🔗 Aranje has quit IRC (Quit: Three sheets to the wind)
05:55 🔗 HCross anyone else getting a constant ImportError: cannot import name RetryError
05:55 🔗 HCross error, since the recent update of internetarchive
06:03 🔗 HCross ^ never mind, I cocked up
06:13 🔗 dashcloud has quit IRC (Read error: Operation timed out)
06:16 🔗 dashcloud has joined #archiveteam-bs
06:54 🔗 BlueMaxim has quit IRC (Quit: Leaving)
07:01 🔗 RichardG has quit IRC (Quit: Keyboard not found, press F1 to continue)
07:13 🔗 JesseW has quit IRC (Ping timeout: 370 seconds)
07:27 🔗 RichardG has joined #archiveteam-bs
07:55 🔗 DoomTay has quit IRC (Quit: Page closed)
08:50 🔗 DiscantX has joined #archiveteam-bs
08:57 🔗 zhongfu_ has joined #archiveteam-bs
08:57 🔗 zhongfu has quit IRC (Ping timeout: 260 seconds)
09:04 🔗 zhongfu_ has quit IRC (Ping timeout: 260 seconds)
09:04 🔗 DiscantX has quit IRC (Read error: Operation timed out)
09:05 🔗 zhongfu has joined #archiveteam-bs
09:12 🔗 DiscantX has joined #archiveteam-bs
09:26 🔗 BlueMaxim has joined #archiveteam-bs
09:51 🔗 zhongfu has quit IRC (Remote host closed the connection)
10:06 🔗 Sum has quit IRC (Ping timeout: 246 seconds)
10:07 🔗 Sum has joined #archiveteam-bs
10:14 🔗 zhongfu has joined #archiveteam-bs
10:20 🔗 Sum has quit IRC (Ping timeout: 246 seconds)
10:32 🔗 zhongfu has quit IRC (Quit: No Ping reply in 180 seconds.)
10:32 🔗 GLaDOS has joined #archiveteam-bs
10:34 🔗 zhongfu has joined #archiveteam-bs
10:58 🔗 Sum has joined #archiveteam-bs
11:03 🔗 Sum has quit IRC (Quit: Leaving)
12:05 🔗 BlueMaxim has quit IRC (Quit: Leaving)
12:10 🔗 BlueMaxim has joined #archiveteam-bs
12:23 🔗 DiscantX has quit IRC (Read error: Operation timed out)
13:38 🔗 BlueMaxim has quit IRC (Quit: Leaving)
13:48 🔗 VADemon has joined #archiveteam-bs
14:17 🔗 Start has quit IRC (Quit: Disconnected.)
15:21 🔗 r3c0d3x has quit IRC (Ping timeout: 260 seconds)
15:23 🔗 r3c0d3x has joined #archiveteam-bs
15:54 🔗 Start has joined #archiveteam-bs
15:59 🔗 Start has quit IRC (Quit: Disconnected.)
16:15 🔗 DoomTay has joined #archiveteam-bs
16:18 🔗 JesseW has joined #archiveteam-bs
16:37 🔗 Frogging arkiver: do you ever use something like BeautifulSoup to parse pages in warrior projects?
16:37 🔗 Frogging or just simple text searches
16:40 🔗 JesseW has quit IRC (Ping timeout: 370 seconds)
16:41 🔗 arkiver I never use BeautifulSoup
16:43 🔗 arkiver Everything is extracted using pattern matching in lua or regex in Python
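For illustration only, a minimal sketch of the regex approach arkiver describes, in Python. This is not code from any warrior project; the pattern, the function name `extract_links`, and the sample HTML are assumptions made up for the example.

```python
import re

# Hypothetical pattern: capture the value of every double-quoted href attribute.
# Real project scripts would use patterns tailored to the target site.
HREF_RE = re.compile(r'href="([^"]+)"')


def extract_links(html):
    """Return all href values found in `html`, in document order."""
    return HREF_RE.findall(html)


sample = '<a href="/page/1">one</a> <a href="http://example.com/2">two</a>'
links = extract_links(sample)
```

A plain pattern like this is brittle against unusual markup (single quotes, unquoted attributes), which is the usual trade-off against a full parser such as BeautifulSoup.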
16:48 🔗 dashcloud has quit IRC (Ping timeout: 244 seconds)
16:49 🔗 dashcloud has joined #archiveteam-bs
16:56 🔗 godane so i found this website: http://www.houstonlgbthistory.org/
16:56 🔗 godane its in archivebot right now
16:56 🔗 godane may have tons of pdfs
18:45 🔗 Start has joined #archiveteam-bs
18:48 🔗 REiN^ has joined #archiveteam-bs
19:36 🔗 dashcloud has quit IRC (Ping timeout: 244 seconds)
19:37 🔗 VADemon has quit IRC (Quit: left4dead)
19:39 🔗 DiscantX has joined #archiveteam-bs
19:40 🔗 dashcloud has joined #archiveteam-bs
19:46 🔗 Start has quit IRC (Quit: Disconnected.)
19:47 🔗 Start has joined #archiveteam-bs
19:52 🔗 Start has quit IRC (Quit: Disconnected.)
20:07 🔗 DiscantX has quit IRC (Read error: Operation timed out)
20:08 🔗 mutoso has quit IRC (Quit: leaving)
20:18 🔗 mutoso has joined #archiveteam-bs
20:37 🔗 dxrt has quit IRC (Read error: Operation timed out)
20:38 🔗 jspiros has quit IRC (Read error: Operation timed out)
20:41 🔗 dxrt has joined #archiveteam-bs
21:22 🔗 robink has quit IRC (Ping timeout: 633 seconds)
21:30 🔗 bzc6p has joined #archiveteam-bs
21:30 🔗 swebb sets mode: +o bzc6p
21:36 🔗 HCross yipdw, are you recruiting more pipelines atm?
21:45 🔗 jspiros has joined #archiveteam-bs
21:59 🔗 yipdw HCross: no
22:00 🔗 HCross ok
22:10 🔗 dashcloud has quit IRC (Read error: Operation timed out)
22:13 🔗 dashcloud has joined #archiveteam-bs
22:23 🔗 bzc6p has left
22:35 🔗 yipdw so if someone is interested in looking at the DNS-error-with-url-list thing
22:36 🔗 yipdw you will want to look at pipeline/archivebot/seesaw/tasks.py:273-314
22:36 🔗 yipdw that's the DownloadUrlFile task. the other part, and this is the part that i have not yet understood well enough to make a fix, is seesaw retry behavior
22:36 🔗 yipdw i suspect there is a max retries limit somewhere but I haven't been able to find it
22:44 🔗 metalcamp has quit IRC (Ping timeout: 244 seconds)
22:47 🔗 aschmitz_ has quit IRC (Read error: Operation timed out)
22:48 🔗 aschmitz_ has joined #archiveteam-bs
22:49 🔗 FalconK yipdw: well it's going to get an exception on line 285 requests.get(timeout=none, ...)
22:50 🔗 FalconK so it will go to the handler at 301 and do self.schedule_retry(item) unconditionally
22:50 🔗 FalconK so there's the bug
22:51 🔗 FalconK the right thing to do is probably add a field to item for number of times retried, increment it on each retry, and have it not schedule_retry if the counter is greater than some arbitrary constant
22:53 🔗 FalconK I'm not sure if you just fall out when that happens, or if you must call complete_item
22:53 🔗 FalconK because I don't know much about python RetryableTask
22:53 🔗 FalconK ** Task
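A rough sketch of FalconK's suggestion above: keep a retry counter on the item, increment it on each failure, and stop calling the retry scheduler once it passes a limit. This does not use seesaw's actual API. `item` is a plain dict standing in for a pipeline item, and `handle_fetch_failure`, `schedule_retry`, `give_up`, `retry_count` and `MAX_RETRIES` are all hypothetical names. What the exit path should really do (fail, abort, or complete the item) is the part left open in the discussion.

```python
MAX_RETRIES = 3  # hypothetical limit, the "arbitrary constant" from the discussion


def handle_fetch_failure(item, schedule_retry, give_up, max_retries=MAX_RETRIES):
    """Count one failed attempt on `item`, then either retry or give up.

    `schedule_retry` and `give_up` are caller-supplied callbacks.
    Returns True if a retry was scheduled, False if the item was abandoned.
    """
    item["retry_count"] = item.get("retry_count", 0) + 1
    if item["retry_count"] > max_retries:
        give_up(item)  # placeholder for whatever the correct exit turns out to be
        return False
    schedule_retry(item)
    return True


def run_demo():
    """Simulate a URL list that can never be fetched (for example a DNS error)."""
    item = {"url_file": "http://unresolvable.invalid/urls.txt"}
    retries, abandoned = [], []
    # Each loop iteration stands for one more failed download attempt.
    while handle_fetch_failure(item, retries.append, abandoned.append):
        pass
    return len(retries), len(abandoned), item["retry_count"]
```

With the default limit the demo schedules three retries and abandons the item on the fourth failure, instead of retrying unconditionally.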
23:18 🔗 DoomTay Anyone heard of PostGhost?
23:18 🔗 DoomTay Tweet archive that just shut down today
23:18 🔗 DoomTay I actually had no idea it existed until now
23:18 🔗 yipdw FalconK: yeah, the exit strategy is what I haven't figured out yet
23:19 🔗 robink has joined #archiveteam-bs
23:30 🔗 DoomTay I'm about halfway through archiving artist pages on portalgraphics.
23:31 🔗 DoomTay It's staggering how much wasn't saved beforehand, even though the site in its current form has been around since ~2010-2011
23:39 🔗 tomwsmf-a has joined #archiveteam-bs
23:41 🔗 FalconK yipdw: the most intuitive thing to me seems to be to treat it as though it were aborted
23:41 🔗 FalconK our definition of success is pretty squishy though
23:42 🔗 FalconK oh, why on earth might one of my WARCs in opensource https://archive.org/details/archiveteam_archivebot_go_falconk_uprisingradio_org_20160427 have almost 70k views?
23:46 🔗 arkiver because it's popular?
23:46 🔗 Start has joined #archiveteam-bs
23:46 🔗 arkiver Guess it's some important site you saved there
23:47 🔗 FalconK guess so but it's in opensource and theoretically not in wayback.
23:47 🔗 arkiver I see
23:47 🔗 arkiver Everything with mediatype 'web' goes into the wayback machine
23:47 🔗 arkiver also if it is in opensource
23:47 🔗 FalconK oh
23:48 🔗 arkiver it just takes up to a month or so to get in the wayback macine
23:48 🔗 arkiver machine*
23:48 🔗 arkiver where in a web collection it takes a day or so
23:48 🔗 FalconK so there is no need for me to annoy IA people with requests to move my content into the archivebot collection then
23:48 🔗 arkiver well, it might be nice to have it moved to a web collections
23:48 🔗 arkiver but to have it in the wayback machine, no
23:49 🔗 FalconK if I had permission I would upload it straight there, but such is not forthcoming
23:59 🔗 DoomTay has quit IRC (Quit: Page closed)
