[00:18] *** DoomTay has joined #archiveteam-bs
[00:19] *** Stiletto has joined #archiveteam-bs
[00:19] *** tomwsmf-a has joined #archiveteam-bs
[00:24] *** DiscantX has joined #archiveteam-bs
[00:30] *** JesseW has joined #archiveteam-bs
[00:53] i'm not doing the examiner.com website
[00:53] mostly cause its too big
[00:53] even when doing daily sitemap dumps of it
[00:54] there is like 1000+ urls per a day from that website
[00:57] *** VADemon has quit IRC (Quit: left4dead)
[00:57] *** DiscantX has quit IRC (Read error: Operation timed out)
[01:12] *** JesseW has quit IRC (Ping timeout: 370 seconds)
[01:23] *** Stiletto has quit IRC (Ping timeout: 244 seconds)
[01:24] *** Coderjoe has quit IRC (Read error: Operation timed out)
[01:28] *** Coderjoe has joined #archiveteam-bs
[01:37] Well ArchiveBot is doing it anyway, thanks to SketchCow
[02:01] *** coretx has quit IRC (Read error: Operation timed out)
[02:02] *** RichardG has quit IRC (Read error: Operation timed out)
[02:02] *** RichardG has joined #archiveteam-bs
[02:04] *** coretx has joined #archiveteam-bs
[02:05] *** JesseW has joined #archiveteam-bs
[02:10] *** tomwsmf-a has quit IRC (Read error: Operation timed out)
[02:18] *** Stiletto has joined #archiveteam-bs
[02:45] *** RichardG has quit IRC (Read error: Operation timed out)
[02:45] *** RichardG has joined #archiveteam-bs
[03:09] *** RichardG has quit IRC (Read error: Operation timed out)
[03:09] *** RichardG has joined #archiveteam-bs
[03:33] *** Coderjoe has quit IRC (Read error: Operation timed out)
[03:52] *** RichardG has quit IRC (Ping timeout: 370 seconds)
[03:54] *** Swizzle has quit IRC (Quit: Leaving)
[03:57] *** RichardG has joined #archiveteam-bs
[04:01] *** Coderjoe has joined #archiveteam-bs
[04:05] *** Sk1d has quit IRC (Ping timeout: 194 seconds)
[04:08] *** RichardG has quit IRC (Ping timeout: 260 seconds)
[04:11] *** Sk1d has joined #archiveteam-bs
[04:12] *** RichardG has joined #archiveteam-bs
[04:27] www.asstr.org isn't run by IA/Jason Scott/someone in AT, is it? x)
[04:27] (alt.sex.stories text repository)
[04:28] that's been around forever
[04:28] I doubt it
[04:28] ah
[04:29] * ranma watches CITIES ON THE EDGE OF NEVER: Life in the Trenches of the Web in 2012 (JS talk for some posh UK conference)
[04:30] *** GLaDOS has quit IRC (Ping timeout: 260 seconds)
[04:31] ranma: #archivebot has grabbed copies of it more than once, I think, though.
[04:31] that's good
[04:31] :p
[04:31] lol
[04:32] they've got a lot of nifty stuff on there
[04:32] heh. heh
[04:33] yes. my first memory of a.s.s content was the Smurf Smuckfest story
[04:33] * ranma coughs
[04:34] probably on aol :x
[04:36] has the old video content on AOL ever been backed up? or was it mercilessly been nuked?
[04:36] i converted Final Fantasy 7 videos to RM5 and uploaded
[04:36] *has it
[04:36] *has it been
[04:37] http://www.archiveteam.org/index.php?title=AOL
[04:39] have the files section been backed up?
[04:40] or hard to say?
[04:42] I am *really* curious who's actually running asstr.org, actually...
[04:43] maybe one of those DNS history sites caught non-anonymized info
[04:43] There's nominally a nonprofit backing it, but that could just be the result of a particularly dedicated single person.
[04:43] pikhq: it says there is a team of a couple of people
[04:43] Well then.
[04:44] a furry couple
[04:44] there were no furries at Denver Comic Con this year :'(
[04:44] ranma: Literally, or just guessing?
[04:44] offensively guessing
[04:45] how small of a site does AT go after?
[04:45] and off the radar
[04:45] 1 page
[04:45] archivebot was built for that use case
[04:45] and as long as it is public, obscure is fine
[04:45] yeah
[04:45] private sites or sites that really seem like they should be private, well
[04:46] this is where I get into shouting matches so I'm just gonna stop there
[04:46] There might be other considerations, but the general heuristic is: is it public information? If so, archive it.
[04:46] amateur private photo shoots at a comic con?
[04:46] uh
[04:46] -private
[04:47] i dunno it depends on what the shoots are
[04:47] just con-goers
[04:47] probably non-notable
[04:47] oh, I had a different conception of what you meant
[04:47] i have to work out my let
[04:48] let's encrypt cert for the folder that JUST has the footage
[04:48] meanwhile, the folder only had the DCC16 folder of this gallery: https://yourmom.likesbuttse.xxx/gallery-naughty/ (rest is nsfw)
[04:48] https://yourmom.likesbuttse.xxx/gallery-naughty/
[04:49] i've just seen some shit go down at comic-cons that *really* shouldn't be archived because it would just be a massive dick move
[04:49] er yeah
[04:49] ah okay
[04:49] but that doesn't necessarily apply to your case so *shrug*
[04:49] I dunno, I guess a good question to ask yourself is "would someone be harmed with a permanent and eventually searchable record of this"
[04:50] probably not. unless they're applying for top secret+ clearance
[04:51] and if it is your own content, there's no need to involve archiveteam in it at all -- you are perfectly capable of uploading it to any number of additional places yourself
[04:51] yeah, i question the value
[04:52] except for one or two con-goers
[04:52] does AT back up flickr from time to time?
[04:52] or TIA
[04:52] i suspect we will have to eventually
[04:52] all of flickr? hardly
[04:52] TIA?
[04:52] TIA?
[04:52] IA
[04:52] or ask Yahoo! real kindly to save it somewhere before they blow it up
[04:52] ;o
[04:52] TumblrInAction?
[04:52] Oh
[04:53] Three Inch Acronynm?
[04:53] Three Ingot Acronym
[04:53] regarding back of IA, see http://iabak.archiveteam.org/
[04:53] TumblrInAction was my first thought
[04:53] :p
[04:56] speaking of which, how big was the Tumblr backup?
[04:56] I'm not aware there is a tumblr backup..
[04:57] http://www.archiveteam.org/index.php?title=Tumblr
[04:57] "test project"
[04:57] http://www.archiveteam.org/index.php?title=Projects#Warrior_projects
[04:58] "Not saved yet"
[04:58] I've been intermittently making snapshots of particular tumblr blogs as I come across them, with archivebot -- I'm always glad for more suggestions.
[04:58] I wasn't aware of a test project
[04:58] ah
[04:59] It looks like the test was 4 years ago
[04:59] oh, i missed the "result" column. just assumed the fact that it was in a green box and that "archive posted" meant that it was completed
[04:59] by alard, who isn't regularly involved with AT currently (AFAIK)
[05:00] apparently it was 133gb
[05:00] according to https://archive.org/details/archiveteam-tumblr-test
[05:00] if i'm reading it correctly, RapidShare was 2TB?
[05:01] http://tracker.archiveteam.org/rapidsharedisco/
[05:01] Woof!
[05:01] http://www.archiveteam.org/index.php?title=RapidShare
[05:05] *** metalcamp has joined #archiveteam-bs
[05:14] ssl cert updated, but probably not notable https://pics.yougave.me/gallery/
[05:18] ranma: why not just upload a copy elsewhere (i.e. IA, flickr, etc)?
[05:19] they seem like perfectly nice pictures
[05:20] i'd rather not if not in an organized, someone anonymous large archive
[05:20] *somewhat
[05:20] but if Flickr will eventually be crawled, i can do that! :D
[05:21] ah, that makes more sense
[05:22] although, if you dump them in an item on IA with a one-off email address, and minimal metadata (especially if you compress them with something unusual) they'll be pretty well lost for a good long while
[05:23] and if you want to be even more sure they are lost, encrypt them with a relatively short key -- that way someone would have to actively bother to decrypt them (which will presumably be trivial eventually, but not for a while)
[05:24] also, doesn't the con have a place to submit photos taken there (many cons do)?
[05:31] *** Aranje has quit IRC (Quit: Three sheets to the wind)
[05:55] anyone else getting a constant ImportError: cannot import name RetryError
[05:55] error, since the recent update of internetarchive
[06:03] ^ never mind, I cocked up
[06:13] *** dashcloud has quit IRC (Read error: Operation timed out)
[06:16] *** dashcloud has joined #archiveteam-bs
[06:54] *** BlueMaxim has quit IRC (Quit: Leaving)
[07:01] *** RichardG has quit IRC (Quit: Keyboard not found, press F1 to continue)
[07:13] *** JesseW has quit IRC (Ping timeout: 370 seconds)
[07:27] *** RichardG has joined #archiveteam-bs
[07:55] *** DoomTay has quit IRC (Quit: Page closed)
[08:50] *** DiscantX has joined #archiveteam-bs
[08:57] *** zhongfu_ has joined #archiveteam-bs
[08:57] *** zhongfu has quit IRC (Ping timeout: 260 seconds)
[09:04] *** zhongfu_ has quit IRC (Ping timeout: 260 seconds)
[09:04] *** DiscantX has quit IRC (Read error: Operation timed out)
[09:05] *** zhongfu has joined #archiveteam-bs
[09:12] *** DiscantX has joined #archiveteam-bs
[09:26] *** BlueMaxim has joined #archiveteam-bs
[09:51] *** zhongfu has quit IRC (Remote host closed the connection)
[10:06] *** Sum has quit IRC (Ping timeout: 246 seconds)
[10:07] *** Sum has joined #archiveteam-bs
[10:14] *** zhongfu has joined #archiveteam-bs
[10:20] *** Sum has quit IRC (Ping timeout: 246 seconds)
[10:32] *** zhongfu has quit IRC (Quit: No Ping reply in 180 seconds.)
[10:32] *** GLaDOS has joined #archiveteam-bs
[10:34] *** zhongfu has joined #archiveteam-bs
[10:58] *** Sum has joined #archiveteam-bs
[11:03] *** Sum has quit IRC (Quit: Leaving)
[12:05] *** BlueMaxim has quit IRC (Quit: Leaving)
[12:10] *** BlueMaxim has joined #archiveteam-bs
[12:23] *** DiscantX has quit IRC (Read error: Operation timed out)
[13:38] *** BlueMaxim has quit IRC (Quit: Leaving)
[13:48] *** VADemon has joined #archiveteam-bs
[14:17] *** Start has quit IRC (Quit: Disconnected.)
[15:21] *** r3c0d3x has quit IRC (Ping timeout: 260 seconds)
[15:23] *** r3c0d3x has joined #archiveteam-bs
[15:54] *** Start has joined #archiveteam-bs
[15:59] *** Start has quit IRC (Quit: Disconnected.)
[16:15] *** DoomTay has joined #archiveteam-bs
[16:18] *** JesseW has joined #archiveteam-bs
[16:37] arkiver: do you ever use something like BeautifulSoup to parse pages in warrior projects?
[16:37] or just simple text searches
[16:40] *** JesseW has quit IRC (Ping timeout: 370 seconds)
[16:41] I never use BeautifulSoup
[16:43] Everything is extracted using pattern matching in lua or regex in Python
[16:48] *** dashcloud has quit IRC (Ping timeout: 244 seconds)
[16:49] *** dashcloud has joined #archiveteam-bs
[16:56] so i found this website: http://www.houstonlgbthistory.org/
[16:56] its in archivebot right now
[16:56] may have tons of pdfs
[18:45] *** Start has joined #archiveteam-bs
[18:48] *** REiN^ has joined #archiveteam-bs
[19:36] *** dashcloud has quit IRC (Ping timeout: 244 seconds)
[19:37] *** VADemon has quit IRC (Quit: left4dead)
[19:39] *** DiscantX has joined #archiveteam-bs
[19:40] *** dashcloud has joined #archiveteam-bs
[19:46] *** Start has quit IRC (Quit: Disconnected.)
[19:47] *** Start has joined #archiveteam-bs
[19:52] *** Start has quit IRC (Quit: Disconnected.)
[20:07] *** DiscantX has quit IRC (Read error: Operation timed out)
[20:08] *** mutoso has quit IRC (Quit: leaving)
[20:18] *** mutoso has joined #archiveteam-bs
[20:37] *** dxrt has quit IRC (Read error: Operation timed out)
[20:38] *** jspiros has quit IRC (Read error: Operation timed out)
[20:41] *** dxrt has joined #archiveteam-bs
[21:22] *** robink has quit IRC (Ping timeout: 633 seconds)
[21:30] *** bzc6p has joined #archiveteam-bs
[21:30] *** swebb sets mode: +o bzc6p
[21:36] yipdw, are you recruiting more pipelines atm?
[21:45] *** jspiros has joined #archiveteam-bs
[21:59] HCross: no
[22:00] ok
[22:10] *** dashcloud has quit IRC (Read error: Operation timed out)
[22:13] *** dashcloud has joined #archiveteam-bs
[22:23] *** bzc6p has left
[22:35] so if someone is interested in looking at the DNS-error-with-url-list thing
[22:36] you will want to look at pipeline/archivebot/seesaw/tasks.py:273-314
[22:36] that's the DownloadUrlFile task. the other part, and this is the part that i have not yet understood well enough to make a fix, is seesaw retry behavior
[22:36] i suspect there is a max retries limit somewhere but I haven't been able to find it
[22:44] *** metalcamp has quit IRC (Ping timeout: 244 seconds)
[22:47] *** aschmitz_ has quit IRC (Read error: Operation timed out)
[22:48] *** aschmitz_ has joined #archiveteam-bs
[22:49] yipdw: well it's going to get an exception on line 285 requests.get(timeout=none, ...)
[22:50] so it will go to the handler at 301 and do self.schedule_retry(item) unconditionally
[22:50] so there's the bug
[22:51] the right thing to do is probably add a field to item for number of times retried, increment it on each retry, and have it not schedule_retry if the counter is greater than some arbitrary constant
[22:53] I'm not sure if you just fall out when that happens, or if you must call complete_item
[22:53] because I don't know much about python RetryableTask
[22:53] ** Task
[23:18] Anyone heard of PostGhost?
[23:18] Tweet archive that just shut down today
[23:18] I actually had no idea it existed until now
[23:18] FalconK: yeah, the exit strategy is what I haven't figured out yet
[23:19] *** robink has joined #archiveteam-bs
[23:30] I'm about halfway through archiving artist pages on portalgraphics. It
It [23:31] 'It's staggering how much wasn't saved beforehand, even though the site in its current form has been aroun since ~2010-2011 [23:39] *** tomwsmf-a has joined #archiveteam-bs [23:41] yipdw: the most intuitive thing to me seems to be to treat it as though it were aborted [23:41] our definition of success is pretty squishy though [23:42] oh, why on earth might one of my WARCs in opensource https://archive.org/details/archiveteam_archivebot_go_falconk_uprisingradio_org_20160427 have almost 70k views? [23:46] because it's popular? [23:46] *** Start has joined #archiveteam-bs [23:46] Guess it's some important site you saved there [23:47] guess so but it's in opensource and theoretically not in wayback. [23:47] I see [23:47] Everything with mediatype 'web' goes into the wayback machine [23:47] also if it is in opensource [23:47] oh [23:48] it just takes up to a month or so to get in the wayback macine [23:48] machine* [23:48] where in a web collection it takes a day or so [23:48] so there is no need for me to annoy IA people with requests to move my content into the archivebot collection then [23:48] well, it might be nice to have it moved to a web collections [23:48] but to have it in the wayback machine, no [23:49] if I had permission I would upload it straight there, but such is not forthcoming [23:59] *** DoomTay has quit IRC (Quit: Page closed)