[01:22] *** brayden_ has joined #archiveteam-bs [01:22] *** swebb sets mode: +o brayden_ [01:26] *** brayden has quit IRC (Read error: Operation timed out) [03:25] I wonder if archive wants video files from a university course I just took... [03:39] *** pizzaiolo has quit IRC (pizzaiolo) [03:56] Hm, looks like the only active Warrior project right now is #urlteam . I'll go add more shorteners to urlteam. [04:17] *** Sk1d has quit IRC (Ping timeout: 250 seconds) [04:24] *** Sk1d has joined #archiveteam-bs [04:24] *** Sk1d has quit IRC (Connection Closed) [04:35] *** ploop has joined #archiveteam-bs [04:37] Somebody2: so far I've been writing a new script every time I want to archive files from a site, but they're always very far from perfect and stop working every now and again and require constant maintenance [04:38] additionally i have no idea how i should be handling various errors so if my internet cuts out for a few seconds or something i end up with the script either crashing or missing files [04:38] *** BlueMaxim has joined #archiveteam-bs [04:39] and it occurred to me that downloading webpages is not something that i should be having problems with, since plenty of other people's software does it without issue [04:41] well, you've come to the right place. [04:41] the easy part is figuring out that i need to download x.com/fileid/x where x is {1..5000000} and maybe do some mime detection to give it a good filename or something [04:42] but somehow i struggle with http, which should be the easier part [04:42] Look over the docs for wpull; there's also grab-site that offers an interface over it. [04:43] You may also find the code for the Warrior projects informative; those are in the ArchiveTeam github organization. [04:44] I don't personally do a whole lot of that exact thing, so I'm probably not the best person to answer really detailed questions. [04:47] *** Aranje has quit IRC (Quit: Three sheets to the wind) [04:51] this looks interesting [04:53] I hope so. 
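[Editor's note: the sequential-ID fetch loop described above — with retries so a brief connection drop neither crashes the run nor silently skips files, and MIME detection for filenames — can be sketched as below. `BASE_URL`, the retry counts, and the helper names are hypothetical, not from the log.]

```python
import mimetypes
import time
import urllib.error
import urllib.request

BASE_URL = "https://x.com/fileid/{}"  # placeholder pattern from the conversation


def filename_for(file_id, content_type):
    """Pick a filename from the server-reported MIME type, falling back to .bin."""
    ext = mimetypes.guess_extension(content_type.split(";")[0].strip()) or ".bin"
    return f"{file_id}{ext}"


def fetch(file_id, retries=5, backoff=2.0):
    """Download one file, retrying transient network errors with a growing delay."""
    url = BASE_URL.format(file_id)
    for attempt in range(retries):
        try:
            with urllib.request.urlopen(url, timeout=30) as resp:
                data = resp.read()
                ctype = resp.headers.get("Content-Type", "application/octet-stream")
                return filename_for(file_id, ctype), data
        except (urllib.error.URLError, OSError):
            if attempt == retries - 1:
                raise  # give up loudly instead of silently missing the file
            time.sleep(backoff * (attempt + 1))
```

[In practice wpull or grab-site, as suggested above, already handle retries, politeness, and WARC output; this sketch only illustrates the error-handling shape.]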
:-) It serves us pretty well. [07:26] there is a thunderstorm outside [07:26] *** GE has joined #archiveteam-bs [07:26] like monsoon like rain is going on where i live [07:44] *** Jonison has joined #archiveteam-bs [07:53] *** schbirid has joined #archiveteam-bs [08:05] *** espes___ has joined #archiveteam-bs [08:06] *** will has quit IRC (Ping timeout: 250 seconds) [08:07] *** luckcolor has quit IRC (Remote host closed the connection) [08:08] *** midas has quit IRC (hub.se irc.underworld.no) [08:08] *** Jonimus has quit IRC (hub.se irc.underworld.no) [08:08] *** JensRex has quit IRC (hub.se irc.underworld.no) [08:08] *** Lord_Nigh has quit IRC (hub.se irc.underworld.no) [08:08] *** alfiepate has quit IRC (hub.se irc.underworld.no) [08:08] *** Riviera has quit IRC (hub.se irc.underworld.no) [08:08] *** espes__ has quit IRC (hub.se irc.underworld.no) [08:08] *** tammy_ has quit IRC (hub.se irc.underworld.no) [08:08] *** i0npulse has quit IRC (hub.se irc.underworld.no) [08:08] *** purplebot has quit IRC (hub.se irc.underworld.no) [08:08] *** Rai-chan has quit IRC (hub.se irc.underworld.no) [08:08] *** medowar has quit IRC (hub.se irc.underworld.no) [08:08] *** Hecatz has quit IRC (hub.se irc.underworld.no) [08:09] *** LordNigh2 has joined #archiveteam-bs [08:09] *** luckcolor has joined #archiveteam-bs [08:09] *** will has joined #archiveteam-bs [08:10] *** alfie has joined #archiveteam-bs [08:11] I think #noanswers needs requeuing, 70k items out [08:17] *** midas1 has joined #archiveteam-bs [08:17] *** Jonimoose has joined #archiveteam-bs [08:17] *** swebb sets mode: +o Jonimoose [08:23] *** LordNigh2 is now known as Lord_Nigh [08:53] *** GE has quit IRC (Remote host closed the connection) [09:12] *** Jonison has quit IRC (Read error: Connection reset by peer) [09:18] *** Jonison has joined #archiveteam-bs [09:19] *** Somebody2 has quit IRC (Read error: Operation timed out) [09:20] *** Jonimoose has quit IRC (west.us.hub irc.Prison.NET) [09:21] *** xmc has quit IRC 
(Read error: Operation timed out) [09:21] *** Somebody2 has joined #archiveteam-bs [09:24] *** midas1 is now known as midas [09:26] *** xmc has joined #archiveteam-bs [09:26] *** swebb sets mode: +o xmc [09:43] *** deathy has quit IRC (Remote host closed the connection) [09:43] *** HCross2 has quit IRC (Remote host closed the connection) [09:47] *** JAA has joined #archiveteam-bs [09:52] *** deathy has joined #archiveteam-bs [09:57] Server: IIS/4.1 [09:57] X-Powered-By: Visual Basic 2.0 on Rails [09:57] I lol'd [10:20] *** HCross2 has joined #archiveteam-bs [10:28] *** JAA has quit IRC (Quit: Page closed) [10:34] *** Jonimoose has joined #archiveteam-bs [10:34] *** irc.Prison.NET sets mode: +o Jonimoose [10:34] *** swebb sets mode: +o Jonimoose [10:36] *** purplebot has joined #archiveteam-bs [10:36] *** Rai-chan has joined #archiveteam-bs [10:36] *** medowar has joined #archiveteam-bs [10:36] *** Hecatz has joined #archiveteam-bs [10:39] *** i0npulse has joined #archiveteam-bs [10:39] *** tammy_ has joined #archiveteam-bs [11:03] *** JensRex has joined #archiveteam-bs [11:03] *** dashcloud has quit IRC (Read error: Connection reset by peer) [11:04] *** dashcloud has joined #archiveteam-bs [11:32] Upload of the first chunk of data.gov has begun - 1.5TB at 55Mbps [11:33] Anyone know if I can use the IA python tool to upload more than 1 file to an item at a time please? 
[12:30] *** pizzaiolo has joined #archiveteam-bs [13:05] *** BlueMaxim has quit IRC (Quit: Leaving) [14:02] *** JensRex has quit IRC (Remote host closed the connection) [14:03] *** JensRex has joined #archiveteam-bs [14:20] *** Yurume has quit IRC (Remote host closed the connection) [14:20] *** antomati_ is now known as antomatic [14:24] *** Ravenloft has quit IRC (Read error: Operation timed out) [14:31] *** Yurume has joined #archiveteam-bs [14:44] *** Dark_Star has quit IRC (Read error: Operation timed out) [14:44] *** hook54321 has quit IRC (Ping timeout: 250 seconds) [14:44] *** godane has quit IRC (Ping timeout: 250 seconds) [14:44] *** kanzure has quit IRC (Ping timeout: 250 seconds) [14:44] *** kanzure has joined #archiveteam-bs [14:44] *** alembic has quit IRC (Ping timeout: 260 seconds) [14:47] *** godane has joined #archiveteam-bs [14:58] *** logchfoo0 starts logging #archiveteam-bs at Tue May 02 14:58:53 2017 [14:58] *** logchfoo0 has joined #archiveteam-bs [14:59] *** hook54321 has joined #archiveteam-bs [15:00] *** alembic has joined #archiveteam-bs [15:07] *** Ctrl-S___ has joined #archiveteam-bs [15:12] *** kvieta has quit IRC (Ping timeout: 370 seconds) [15:12] *** GE has joined #archiveteam-bs [15:13] *** nightpool has joined #archiveteam-bs [15:26] *** icedice has joined #archiveteam-bs [15:26] *** icedice2 has joined #archiveteam-bs [15:31] *** yipdw has quit IRC (Read error: Operation timed out) [15:33] *** me_ has joined #archiveteam-bs [15:36] *** icedice2 has quit IRC (Quit: Leaving) [17:28] HCross2: yes, just give it a list of items [17:28] or a directory where it can find all items [17:28] files* [17:48] I meant concurrent - I fed it a directory and off it went [17:49] So I point it at a directory and it uploads say 5 files at once [17:55] *** GE has quit IRC (Remote host closed the connection) [18:02] *** namespace has joined #archiveteam-bs [18:02] But yeah. 
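[Editor's note: the "point it at a directory and it uploads say 5 files at once" behaviour can be sketched with a thread pool. The real call would be the `internetarchive` library's `upload(identifier, files=[...])`; it is stubbed out here so the sketch stays self-contained, and the item name is hypothetical.]

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def upload_one(item_id, path):
    # Real uploads would call internetarchive.upload(item_id, files=[str(path)]);
    # stubbed so the sketch runs without credentials or network access.
    return (item_id, path.name)


def upload_dir(item_id, directory, workers=5):
    """Upload every file in `directory` to one IA item, `workers` at a time."""
    paths = sorted(p for p in Path(directory).iterdir() if p.is_file())
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order while running uploads concurrently
        return list(pool.map(lambda p: upload_one(item_id, p), paths))
```

[Whether concurrent uploads to a single item are a good idea is worth checking against current IA guidance; the library itself, fed a list of files, appears to send them one after another.]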
[18:02] It's not so much that piracy sites have no cultural value, quite the contrary they're some of the largest 'open' repositories of cultural value out there. [18:02] traditionally we don't care much about legal risk, because the real risk seems low [18:03] They're just radioactive to touch. [18:03] Yeah but. [18:03] Piracy sites are one of the cases where it's not. [18:03] Especially if they just shut down because someone else was suing them or whatever. [18:03] i see no evidence, only fear [18:04] * namespace shrugs [18:04] Not gonna argue this when it's not even my decision lol. [18:05] it's the decision of every member for themselves, of whether they want to participate in that sort of project [18:06] we've archived shitloads of pirated everything and nothing has happened so far [18:06] we've even archived people being scared about it in irc! [18:06] hehe [18:07] i think we've received a few takedowns on things, but no other fallout [18:08] i know that a ftpsite i archived got darked [18:08] FEEEAR [18:08] Did someone call for fear? I work in fear. [18:09] yes, hello, fear department, we need a delivery [18:09] Did you want regular fear or extra spicy fear [18:09] well what did the requisition form say [18:09] come ON we have standardized forms for a *reason* [18:10] Form unintelligible, blood streaks covering checkboxes [18:10] While people are here: is there a list of people who have access to the tracker for different projects? 
Yahoo Answers needs a requeue and I'm not sure who is best to ping [18:10] Ping arkiver or yipdw or I'm not sure who else [18:12] *** me_ is now known as yipdw [18:12] the claims page is 500ing out [18:12] one sec [18:13] yahooanswers has admins set as arkiver and medowar, for the record [18:14] (they, and anyone set as global-admin, can jiggle it) [18:14] oh [18:15] it's because someone named pronerdJay has something like 100,000 claims and the page is going FML [18:15] i haven't come across something so quintessentially AT in a while [18:15] er, maybe it's closer to 50,000 [18:15] either way [18:16] haha [18:16] $ ruby release-claims.rb yahooanswers pronerdJay [18:16] /home/yipdw/.rvm/gems/ruby-2.3.3/gems/activesupport-3.2.5/lib/active_support/values/time_zone.rb:270: warning: circular argument reference - now [18:16] /home/yipdw/.rvm/gems/ruby-2.3.3/gems/redis-2.2.2/lib/redis.rb:215:in `block in hgetall': stack level too deep (SystemStackError) [18:16] fuck Rub [18:16] y [18:16] that's the rub [18:16] wait what how is that stack trace possible [18:16] is hgetall recursing to build a hash?? [18:17] oh, no, it uses Hash[] and passes the reply in using a splat [18:17] fuck Ruby [18:17] archiveteam: finding bugs in standard system tools since 2009 [18:17] I think newer versions of redis-rb fix this [18:20] oh, but that script is using the tracker gem bundle and I can't update it without affecting the world [18:20] bleh I'll write something [18:21] Is Yahoo Answers going down? 
[18:21] I have some places where Yahoo Answers can go [18:22] icedice: Yahoo Answers is being grabbed preemptively in case Verizon decides to can it [18:22] Ah, right [18:22] Yahoo sold out to Verizon [18:23] ok, it looks like release-stale worked [18:23] the spice is flowing again on yahooanswers and I'm getting out of jwz mode [18:24] Thanks yipdw [18:24] yipdw: we already have a way of handling too many out items [18:25] Requeue on the Workarounds page [18:25] there's a few scripts that seem to work, release-claims just can't handle firepower of that magnitude [18:25] oh, right [18:25] I guess that page does the same as release-stale, huh [18:27] I guess so [18:44] https://archive.org/details/pulpmagazinearchive?&sort=-publicdate&and[]=addeddate:2017* [18:44] I'm uploading 10,000 zines [18:44] Should I ask permission [18:44] * SketchCow bites nails [19:06] *** ndiddy has quit IRC () [19:06] Even more data.gov has just started the slow march up to the IA [19:15] SketchCow: lolno [19:36] BTW the tracker also has stale items for yuku, almost a year old [19:39] *** GE has joined #archiveteam-bs [19:59] Is there any way to find the Imgur link that was posted in OP's (now deleted) post? [19:59] https://www.reddit.com/r/webhosting/comments/4w6d63/buyshared_gets_mentioned_a_lot_when_it_comes_to/ [19:59] Nothing on Archive.org [20:02] icedice: It looks like this may be a mirror of the original post: https://webdesignersolutions.wordpress.com/2016/08/04/buyshared-gets-mentioned-a-lot-when-it-comes-to-cheap-shared-hosting-heres-the-uptime-log-since-february-for-an-account-i-have-with-them-via-rwebhosting/ [20:06] Thanks! 
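[Editor's note: the failure above — materializing ~100k claims at once, where old redis-rb's `Hash[*reply]` splat overflowed the stack — is the classic argument for processing a huge set in fixed-size batches (e.g. via an incremental HSCAN) rather than one giant call. A generic sketch, not the actual tracker code; the helper names are invented.]

```python
def release_in_batches(claims, release_fn, batch_size=1000):
    """Release a huge claim set in fixed-size batches instead of all at once.

    `claims` may be a lazy iterator (e.g. pages from HSCAN), so the full
    set never has to exist in memory or in a single call's argument list.
    """
    batch = []
    released = 0
    for item in claims:
        batch.append(item)
        if len(batch) >= batch_size:
            release_fn(batch)
            released += len(batch)
            batch = []
    if batch:  # flush the final partial batch
        release_fn(batch)
        released += len(batch)
    return released
```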
[20:30] *** schbirid has quit IRC (Quit: Leaving) [20:32] *** kvieta has joined #archiveteam-bs [20:46] *** kvieta has quit IRC (Read error: Operation timed out) [20:54] *** Ravenloft has joined #archiveteam-bs [20:56] *** kvieta has joined #archiveteam-bs [21:04] *** tuluu_ has joined #archiveteam-bs [21:04] *** tuluu has quit IRC (Ping timeout: 250 seconds) [21:07] *** Jonison has quit IRC (Read error: Connection reset by peer) [21:10] *** ndiddy has joined #archiveteam-bs [21:58] *** espes__ has joined #archiveteam-bs [21:59] *** espes___ has quit IRC (Ping timeout: 250 seconds) [22:02] *** midas has quit IRC (Ping timeout: 250 seconds) [22:02] *** Gfy has quit IRC (Ping timeout: 250 seconds) [22:03] *** mls has quit IRC (Ping timeout: 250 seconds) [22:03] *** midas has joined #archiveteam-bs [22:04] *** tsr has quit IRC (Ping timeout: 250 seconds) [22:05] *** Gfy has joined #archiveteam-bs [22:06] *** andai has quit IRC (Ping timeout: 250 seconds) [22:08] *** Kaz has quit IRC (Ping timeout: 250 seconds) [22:10] *** GE has quit IRC (Remote host closed the connection) [22:11] *** Aoede has quit IRC (Ping timeout: 250 seconds) [22:11] *** hook54321 has quit IRC (Ping timeout: 250 seconds) [22:11] *** C4K3 has quit IRC (Ping timeout: 250 seconds) [22:13] *** tsr has joined #archiveteam-bs [22:13] *** HP_ has joined #archiveteam-bs [22:13] *** C4K3 has joined #archiveteam-bs [22:14] *** hook54321 has joined #archiveteam-bs [22:14] *** andai has joined #archiveteam-bs [22:14] *** HP has quit IRC (Ping timeout: 250 seconds) [22:14] *** nightpool has quit IRC (Ping timeout: 250 seconds) [22:15] *** Kaz has joined #archiveteam-bs [22:16] *** mls has joined #archiveteam-bs [22:17] *** andai has quit IRC (Ping timeout: 250 seconds) [22:17] *** SN4T14 has quit IRC (Ping timeout: 250 seconds) [22:17] *** SN4T14 has joined #archiveteam-bs [22:21] *** mls has quit IRC (Ping timeout: 250 seconds) [22:21] *** mls has joined #archiveteam-bs [22:22] *** Aoede has joined 
#archiveteam-bs [22:22] *** andai has joined #archiveteam-bs [22:27] *** nightpool has joined #archiveteam-bs [22:46] *** Aoede has quit IRC (Ping timeout: 250 seconds) [22:48] *** Aoede has joined #archiveteam-bs [22:57] *** andai has quit IRC (Ping timeout: 250 seconds) [22:58] *** andai has joined #archiveteam-bs [23:05] *** sun_rise has joined #archiveteam-bs [23:06] I have questions about what is/is not appropriate for archiveteam/bot and not sure where to pose them [23:06] here is a good place to ask [23:09] Three people I know have been sued for defamation over 'survivor' websites by institutions they alleged abused them/others as children. Two of them were forced to settle and remove the content from the web. [23:09] archive it [23:09] this is 100% okay [23:10] unless they want it removed, which, well, doesn't sound like they do [23:12] "it", in this case, is going to be a lot bigger than just the 'survivor' websites. I am interested in crawling the 'industry' sites as well. My original plan was to do this on my own and I started researching best practices for this sort of thing. I was really pleasantly surprised to find Archiveteam/bot. [23:12] It's an amazing service and I don't want to abuse it. The crawl I started yesterday pointed at a single domain has already grown much larger than I was expecting. [23:14] yep, that'll happen [23:14] if you want, you can next time run your jobs with --no-offsite-links [23:14] by default archivebot will fetch every page on the site you submit, and every page that is linked to [23:14] in order to present context [23:14] (along with images and script and stylesheets used on these pages) [23:14] I think, for this job, that was probably the appropriate setting - I didn't realize this until after it started running, though. [23:15] mm, possibly [23:20] Ultimately I'm going to be interested in hundreds of domains that this site points to or that I have collected elsewhere that are relevant to this topic. 
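[Editor's note: the default behaviour described above — recurse freely within the submitted site, fetch linked offsite pages once for context, and skip them entirely under `--no-offsite-links` — reduces to a simple per-link decision. A simplification (real crawlers also weigh subdomains, ignore patterns, and page requisites); the function name is invented.]

```python
from urllib.parse import urlparse


def classify_link(seed_url, link_url, offsite_links=True):
    """Return 'recurse', 'fetch-once', or 'skip' for a discovered link."""
    seed_host = urlparse(seed_url).hostname
    link_host = urlparse(link_url).hostname
    if link_host == seed_host:
        return "recurse"        # on-site: fetch it and follow its links too
    if offsite_links:
        return "fetch-once"     # offsite: grab the page for context only
    return "skip"               # --no-offsite-links: stay on the domain
```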
I doubt any single one of them will end up as large as this - they seem to mostly be fairly lean wordpress product page type sites. I guess what I'm after is a general sense of what *wouldn't* be appropriate for archivebot. At what point should I be using something else? [23:20] Is there some standard/threshold of general interest or threatened status? If I end up trying to crawl from a list of sites - should that be done in chunks? How do I ensure my jobs don't spiral out of control? [23:21] If I made a donation to offset my usage is there some guide to how much things generally cost? [23:21] feel free to use archivebot [23:21] you sound like someone who's fairly conscious of the resources they're using [23:22] if you look on the dashboard and you have more jobs running than anyone else, you might want to rethink how you're going about doing things [23:22] that said, everyone who cares about something fills up the queue eventually [23:23] we have a cost shameboard that kind of tries to be a forever-cost of data storage [23:23] I saw this but wasn't sure how quickly that would fill up. There are some high scorers! [23:23] but if you throw some chum towards https://archive.org/donate/ it'll probably be fine [23:23] hehe [23:24] I noticed there are 2 warc files associated with my crawl that have already been uploaded to archive.org. Will those continue to be uploaded in chunks? [23:24] yep [23:24] whenever the pipeline cuts off the warc file and starts a new one, the uploader sends the finished warc file off to IA [23:24] if I do a crawl from a pastebin list of domains will they show up in the same IA folder or separate per domain? 
[23:25] jobs go into warc files named by the url you submit, regardless of whether you use it as a list of urls or a single website [23:26] if you're doing less than a few dozen sites, i'd suggest one !a per site [23:26] like, one day i did all the campaign websites for my city's election [23:28] *** dashcloud has quit IRC (Remote host closed the connection) [23:29] we've asked before about what wouldn't be appropriate and sketchcow weighed in: [23:29] In another channel, regarding uploading stuff of dubious value or duplication to archive.org: [23:29] General archive rule: gigabytes fine, tens of gigabytes problematic, hundreds of gigabytes bad. [23:29] I am going to go ahead and define dubious value that the uploader can't even begin to dream up a use. [23:29] If the uploader can't even come up with a use case, that's dubious value. [23:29] Example: 14gb quicktime movie aimed at a blank wall for an hour, no change [23:30] *** BlueMaxim has joined #archiveteam-bs [23:31] so if it's in any way useful and it's not already archived, go hog wild, if it's gonna be mainly duplicated data then be careful about getting up into tens or hundreds of gigs [23:32] small sites don't matter except don't do so many at the same time that there aren't any archivebot slots free for emergencies [23:33] *** dashcloud has joined #archiveteam-bs [23:33] this is admittedly hampered by the fact that we don't actually have a readout for the number of free slots [23:33] so submitting a list of urls might be more polite? 
[23:34] or come in and feed one in every so often as previous ones finish [23:35] I'm thinking I can prioritize the stuff that I most fear being lost right now and get to crawling 'the enemy' later when I have a better grasp of how big these things get [23:35] having a ton of sites on one job can be a problem because the jobs do crash from time to time [23:40] what I usually do before putting a site through archivebot is bring the site up in the wayback machine and see if the site has been crawled pretty well already or not [23:41] if the most recent crawl is from ages ago or you click a couple links and they come up "this page has not been archived" then it's due for a go [23:48] ok
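[Editor's note: the "check the Wayback Machine first" step above can be automated against the Wayback availability API at `https://archive.org/wayback/available?url=...`, which returns JSON describing the closest archived snapshot. A sketch; the function names are invented, and the network call is separated out so the parsing can be exercised offline.]

```python
import json
import urllib.parse
import urllib.request

API = "https://archive.org/wayback/available?url={}"


def parse_availability(reply):
    """Extract (archived?, snapshot_url, timestamp) from the API's JSON reply."""
    snap = reply.get("archived_snapshots", {}).get("closest")
    if not snap or not snap.get("available"):
        return (False, None, None)
    return (True, snap.get("url"), snap.get("timestamp"))


def check_site(url):
    """Query the live API (network access required)."""
    with urllib.request.urlopen(API.format(urllib.parse.quote(url, safe=""))) as r:
        return parse_availability(json.load(r))
```

[An old timestamp on the closest snapshot, or no snapshot at all, corresponds to the "due for a go" judgement described above; spot-checking a few deep links by hand is still worthwhile, since the API only reports top-level coverage for the URL you ask about.]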