[01:22] *** brayden_ has joined #archiveteam-bs [01:22] *** swebb sets mode: +o brayden_ [01:26] *** brayden has quit IRC (Read error: Operation timed out) [03:25] I wonder if archive wants video files from a university course I just took... [03:39] *** pizzaiolo has quit IRC (pizzaiolo) [03:56] Hm, looks like the only active Warrior project right now is #urlteam . I'll go add more shorteners to urlteam. [04:17] *** Sk1d has quit IRC (Ping timeout: 250 seconds) [04:24] *** Sk1d has joined #archiveteam-bs [04:24] *** Sk1d has quit IRC (Connection Closed) [04:35] *** ploop has joined #archiveteam-bs [04:37] Somebody2: so far I've been writing a new script every time I want to archive files from a site, but they're always very far from perfect and stop working every now and again and require constant maintenance [04:38] additionally i have no idea how i should be handling various errors so if my internet cuts out for a few seconds or something i end up with the script either crashing or missing files [04:38] *** BlueMaxim has joined #archiveteam-bs [04:39] and it occurred to me that downloading webpages is not something that i should be having problems with, since plenty of other people's software does it without issue [04:41] well, you've come to the right place. [04:41] the easy part is figuring out that i need to download x.com/fileid/x where x is {1..5000000} and maybe do some mime detection to give it a good filename or something [04:42] but somehow i struggle with http, which should be the easier part [04:42] Look over the docs for wpull; there's also grab-site that offers an interface over it. [04:43] You may also find the code for the Warrior projects informative; those are in the ArchiveTeam github organization. [04:44] I don't personally do a whole lot of that exact thing, so I'm probably not the best person to answer really detailed questions. [04:47] *** Aranje has quit IRC (Quit: Three sheets to the wind) [04:51] this looks interesting [04:53] I hope so. 
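[Editor's note: the sequential-ID fetch loop described above — with retries so a brief connection drop neither crashes the run nor silently skips files, and MIME detection for filenames — can be sketched as below. `BASE_URL`, the retry counts, and the helper names are hypothetical, not from the log.]

```python
import mimetypes
import time
import urllib.error
import urllib.request

BASE_URL = "https://x.com/fileid/{}"  # placeholder pattern from the conversation


def filename_for(file_id, content_type):
    """Pick a filename from the server-reported MIME type, falling back to .bin."""
    ext = mimetypes.guess_extension(content_type.split(";")[0].strip()) or ".bin"
    return f"{file_id}{ext}"


def fetch(file_id, retries=5, backoff=2.0):
    """Download one file, retrying transient network errors with a growing delay."""
    url = BASE_URL.format(file_id)
    for attempt in range(retries):
        try:
            with urllib.request.urlopen(url, timeout=30) as resp:
                data = resp.read()
                ctype = resp.headers.get("Content-Type", "application/octet-stream")
                return filename_for(file_id, ctype), data
        except (urllib.error.URLError, OSError):
            if attempt == retries - 1:
                raise  # give up loudly instead of silently missing the file
            time.sleep(backoff * (attempt + 1))
```

[In practice wpull or grab-site, as suggested above, already handle retries, politeness, and WARC output; this sketch only illustrates the error-handling shape.]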
:-) It serves us pretty well. [07:26] there is a thunderstorm outside [07:26] *** GE has joined #archiveteam-bs [07:26] like monsoon like rain is going on where i live [07:44] *** Jonison has joined #archiveteam-bs [07:53] *** schbirid has joined #archiveteam-bs [08:05] *** espes___ has joined #archiveteam-bs [08:06] *** will has quit IRC (Ping timeout: 250 seconds) [08:07] *** luckcolor has quit IRC (Remote host closed the connection) [08:08] *** midas has quit IRC (hub.se irc.underworld.no) [08:08] *** Jonimus has quit IRC (hub.se irc.underworld.no) [08:08] *** JensRex has quit IRC (hub.se irc.underworld.no) [08:08] *** Lord_Nigh has quit IRC (hub.se irc.underworld.no) [08:08] *** alfiepate has quit IRC (hub.se irc.underworld.no) [08:08] *** Riviera has quit IRC (hub.se irc.underworld.no) [08:08] *** espes__ has quit IRC (hub.se irc.underworld.no) [08:08] *** tammy_ has quit IRC (hub.se irc.underworld.no) [08:08] *** i0npulse has quit IRC (hub.se irc.underworld.no) [08:08] *** purplebot has quit IRC (hub.se irc.underworld.no) [08:08] *** Rai-chan has quit IRC (hub.se irc.underworld.no) [08:08] *** medowar has quit IRC (hub.se irc.underworld.no) [08:08] *** Hecatz has quit IRC (hub.se irc.underworld.no) [08:09] *** LordNigh2 has joined #archiveteam-bs [08:09] *** luckcolor has joined #archiveteam-bs [08:09] *** will has joined #archiveteam-bs [08:10] *** alfie has joined #archiveteam-bs [08:11] I think #noanswers needs requeuing, 70k items out [08:17] *** midas1 has joined #archiveteam-bs [08:17] *** Jonimoose has joined #archiveteam-bs [08:17] *** swebb sets mode: +o Jonimoose [08:23] *** LordNigh2 is now known as Lord_Nigh [08:53] *** GE has quit IRC (Remote host closed the connection) [09:12] *** Jonison has quit IRC (Read error: Connection reset by peer) [09:18] *** Jonison has joined #archiveteam-bs [09:19] *** Somebody2 has quit IRC (Read error: Operation timed out) [09:20] *** Jonimoose has quit IRC (west.us.hub irc.Prison.NET) [09:21] *** xmc has quit IRC 
(Read error: Operation timed out) [09:21] *** Somebody2 has joined #archiveteam-bs [09:24] *** midas1 is now known as midas [09:26] *** xmc has joined #archiveteam-bs [09:26] *** swebb sets mode: +o xmc [09:43] *** deathy has quit IRC (Remote host closed the connection) [09:43] *** HCross2 has quit IRC (Remote host closed the connection) [09:47] *** JAA has joined #archiveteam-bs [09:52] *** deathy has joined #archiveteam-bs [09:57] Server: IIS/4.1 [09:57] X-Powered-By: Visual Basic 2.0 on Rails [09:57] I lol'd [10:20] *** HCross2 has joined #archiveteam-bs [10:28] *** JAA has quit IRC (Quit: Page closed) [10:34] *** Jonimoose has joined #archiveteam-bs [10:34] *** irc.Prison.NET sets mode: +o Jonimoose [10:34] *** swebb sets mode: +o Jonimoose [10:36] *** purplebot has joined #archiveteam-bs [10:36] *** Rai-chan has joined #archiveteam-bs [10:36] *** medowar has joined #archiveteam-bs [10:36] *** Hecatz has joined #archiveteam-bs [10:39] *** i0npulse has joined #archiveteam-bs [10:39] *** tammy_ has joined #archiveteam-bs [11:03] *** JensRex has joined #archiveteam-bs [11:03] *** dashcloud has quit IRC (Read error: Connection reset by peer) [11:04] *** dashcloud has joined #archiveteam-bs [11:32] Upload of the first chunk of data.gov has begun - 1.5TB at 55Mbps [11:33] Anyone know if I can use the IA python tool to upload more than 1 file to an item at a time please? 
[12:30] *** pizzaiolo has joined #archiveteam-bs [13:05] *** BlueMaxim has quit IRC (Quit: Leaving) [14:02] *** JensRex has quit IRC (Remote host closed the connection) [14:03] *** JensRex has joined #archiveteam-bs [14:20] *** Yurume has quit IRC (Remote host closed the connection) [14:20] *** antomati_ is now known as antomatic [14:24] *** Ravenloft has quit IRC (Read error: Operation timed out) [14:31] *** Yurume has joined #archiveteam-bs [14:44] *** Dark_Star has quit IRC (Read error: Operation timed out) [14:44] *** hook54321 has quit IRC (Ping timeout: 250 seconds) [14:44] *** godane has quit IRC (Ping timeout: 250 seconds) [14:44] *** kanzure has quit IRC (Ping timeout: 250 seconds) [14:44] *** kanzure has joined #archiveteam-bs [14:44] *** alembic has quit IRC (Ping timeout: 260 seconds) [14:47] *** godane has joined #archiveteam-bs [14:58] *** logchfoo0 starts logging #archiveteam-bs at Tue May 02 14:58:53 2017 [14:58] *** logchfoo0 has joined #archiveteam-bs [14:59] *** hook54321 has joined #archiveteam-bs [15:00] *** alembic has joined #archiveteam-bs [15:07] *** Ctrl-S___ has joined #archiveteam-bs [15:12] *** kvieta has quit IRC (Ping timeout: 370 seconds) [15:12] *** GE has joined #archiveteam-bs [15:13] *** nightpool has joined #archiveteam-bs [15:26] *** icedice has joined #archiveteam-bs [15:26] *** icedice2 has joined #archiveteam-bs [15:31] *** yipdw has quit IRC (Read error: Operation timed out) [15:33] *** me_ has joined #archiveteam-bs [15:36] *** icedice2 has quit IRC (Quit: Leaving) [17:28] HCross2: yes, just give it a list of items [17:28] or a directory where it can find all items [17:28] files* [17:48] I meant concurrent - I fed it a directory and off it went [17:49] So I point it at a directory and it uploads say 5 files at once [17:55] *** GE has quit IRC (Remote host closed the connection) [18:02] *** namespace has joined #archiveteam-bs [18:02] But yeah. 
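[Editor's note: the "point it at a directory and it uploads say 5 files at once" behaviour can be sketched with a thread pool. The real call would be the `internetarchive` library's `upload(identifier, files=[...])`; it is stubbed out here so the sketch stays self-contained, and the item name is hypothetical.]

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def upload_one(item_id, path):
    # Real uploads would call internetarchive.upload(item_id, files=[str(path)]);
    # stubbed so the sketch runs without credentials or network access.
    return (item_id, path.name)


def upload_dir(item_id, directory, workers=5):
    """Upload every file in `directory` to one IA item, `workers` at a time."""
    paths = sorted(p for p in Path(directory).iterdir() if p.is_file())
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order while running uploads concurrently
        return list(pool.map(lambda p: upload_one(item_id, p), paths))
```

[Whether concurrent uploads to a single item are a good idea is worth checking against current IA guidance; the library itself, fed a list of files, appears to send them one after another.]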
[18:02] It's not so much that piracy sites have no cultural value, quite the contrary they're some of the largest 'open' repositories of cultural value out there. [18:02] traditionally we don't care much about legal risk, because the real risk seems low [18:03] They're just radioactive to touch. [18:03] Yeah but. [18:03] Piracy sites are one of the cases where it's not. [18:03] Especially if they just shut down because someone else was suing them or whatever. [18:03] i see no evidence, only fear [18:04] * namespace shrugs [18:04] Not gonna argue this when it's not even my decision lol. [18:05] it's the decision of every member for themselves, of whether they want to participate in that sort of project [18:06] we've archived shitloads of pirated everything and nothing has happened so far [18:06] we've even archived people being scared about it in irc! [18:06] hehe [18:07] i think we've received a few takedowns on things, but no other fallout [18:08] i know that a ftpsite i archived got darked [18:08] FEEEAR [18:08] Did someone call for fear? I work in fear. [18:09] yes, hello, fear department, we need a delivery [18:09] Did you want regular fear or extra spicy fear [18:09] well what did the requisition form say [18:09] come ON we have standardized forms for a *reason* [18:10] Form unintelligible, blood streaks covering checkboxes [18:10] While people are here: is there a list of people who have access to the tracker for different projects? 
Yahoo Answers needs a requeue and I'm not sure who is best to ping [18:10] Ping arkiver or yipdw or I'm not sure who else [18:12] *** me_ is now known as yipdw [18:12] the claims page is 500ing out [18:12] one sec [18:13] yahooanswers has admins set as arkiver and medowar, for the record [18:14] (they, and anyone set as global-admin, can jiggle it) [18:14] oh [18:15] it's because someone named pronerdJay has something like 100,000 claims and the page is going FML [18:15] i haven't come across something so quintessentially AT in a while [18:15] er, maybe it's closer to 50,000 [18:15] either way [18:16] haha [18:16] $ ruby release-claims.rb yahooanswers pronerdJay [18:16] /home/yipdw/.rvm/gems/ruby-2.3.3/gems/activesupport-3.2.5/lib/active_support/values/time_zone.rb:270: warning: circular argument reference - now [18:16] /home/yipdw/.rvm/gems/ruby-2.3.3/gems/redis-2.2.2/lib/redis.rb:215:in `block in hgetall': stack level too deep (SystemStackError) [18:16] fuck Rub [18:16] y [18:16] that's the rub [18:16] wait what how is that stack trace possible [18:16] is hgetall recursing to build a hash?? [18:17] oh, no, it uses Hash[] and passes the reply in using a splat [18:17] fuck Ruby [18:17] archiveteam: finding bugs in standard system tools since 2009 [18:17] I think newer versions of redis-rb fix this [18:20] oh, but that script is using the tracker gem bundle and I can't update it without affecting the world [18:20] bleh I'll write something [18:21] Is Yahoo Answers going down? 
[18:21] I have some places where Yahoo Answers can go [18:22] icedice: Yahoo Answers is being grabbed preemptively in case Verizon decides to can it [18:22] Ah, right [18:22] Yahoo sold out to Verizon [18:23] ok, it looks like release-stale worked [18:23] the spice is flowing again on yahooanswers and I'm getting out of jwz mode [18:24] Thanks yipdw [18:24] yipdw: we already have a way of handling too many out items [18:25] Requeue on the Workarounds page [18:25] there's a few scripts that seem to work, release-claims just can't handle firepower of that magnitude [18:25] oh, right [18:25] I guess that page does the same as release-stale, huh [18:27] I guess so [18:44] https://archive.org/details/pulpmagazinearchive?&sort=-publicdate&and[]=addeddate:2017* [18:44] I'm uploading 10,000 zines [18:44] Should I ask permission [18:44] * SketchCow bites nails [19:06] *** ndiddy has quit IRC () [19:06] Even more data.gov has just started the slow march up to the IA [19:15] SketchCow: lolno [19:36] BTW the tracker also has stale items for yuku, almost a year old [19:39] *** GE has joined #archiveteam-bs [19:59] Is there any way to find the Imgur link that was posted in OP's (now deleted) post? [19:59] https://www.reddit.com/r/webhosting/comments/4w6d63/buyshared_gets_mentioned_a_lot_when_it_comes_to/ [19:59] Nothing on Archive.org [20:02] icedice: It looks like this may be a mirror of the original post: https://webdesignersolutions.wordpress.com/2016/08/04/buyshared-gets-mentioned-a-lot-when-it-comes-to-cheap-shared-hosting-heres-the-uptime-log-since-february-for-an-account-i-have-with-them-via-rwebhosting/ [20:06] Thanks! 
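[Editor's note: the failure above — materializing ~100k claims at once, where old redis-rb's `Hash[*reply]` splat overflowed the stack — is the classic argument for processing a huge set in fixed-size batches (e.g. via an incremental HSCAN) rather than one giant call. A generic sketch, not the actual tracker code; the helper names are invented.]

```python
def release_in_batches(claims, release_fn, batch_size=1000):
    """Release a huge claim set in fixed-size batches instead of all at once.

    `claims` may be a lazy iterator (e.g. pages from HSCAN), so the full
    set never has to exist in memory or in a single call's argument list.
    """
    batch = []
    released = 0
    for item in claims:
        batch.append(item)
        if len(batch) >= batch_size:
            release_fn(batch)
            released += len(batch)
            batch = []
    if batch:  # flush the final partial batch
        release_fn(batch)
        released += len(batch)
    return released
```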
[20:30] *** schbirid has quit IRC (Quit: Leaving) [20:32] *** kvieta has joined #archiveteam-bs [20:46] *** kvieta has quit IRC (Read error: Operation timed out) [20:54] *** Ravenloft has joined #archiveteam-bs [20:56] *** kvieta has joined #archiveteam-bs [21:04] *** tuluu_ has joined #archiveteam-bs [21:04] *** tuluu has quit IRC (Ping timeout: 250 seconds) [21:07] *** Jonison has quit IRC (Read error: Connection reset by peer) [21:10] *** ndiddy has joined #archiveteam-bs [21:58] *** espes__ has joined #archiveteam-bs [21:59] *** espes___ has quit IRC (Ping timeout: 250 seconds) [22:02] *** midas has quit IRC (Ping timeout: 250 seconds) [22:02] *** Gfy has quit IRC (Ping timeout: 250 seconds) [22:03] *** mls has quit IRC (Ping timeout: 250 seconds) [22:03] *** midas has joined #archiveteam-bs [22:04] *** tsr has quit IRC (Ping timeout: 250 seconds) [22:05] *** Gfy has joined #archiveteam-bs [22:06] *** andai has quit IRC (Ping timeout: 250 seconds) [22:08] *** Kaz has quit IRC (Ping timeout: 250 seconds) [22:10] *** GE has quit IRC (Remote host closed the connection) [22:11] *** Aoede has quit IRC (Ping timeout: 250 seconds) [22:11] *** hook54321 has quit IRC (Ping timeout: 250 seconds) [22:11] *** C4K3 has quit IRC (Ping timeout: 250 seconds) [22:13] *** tsr has joined #archiveteam-bs [22:13] *** HP_ has joined #archiveteam-bs [22:13] *** C4K3 has joined #archiveteam-bs [22:14] *** hook54321 has joined #archiveteam-bs [22:14] *** andai has joined #archiveteam-bs [22:14] *** HP has quit IRC (Ping timeout: 250 seconds) [22:14] *** nightpool has quit IRC (Ping timeout: 250 seconds) [22:15] *** Kaz has joined #archiveteam-bs [22:16] *** mls has joined #archiveteam-bs [22:17] *** andai has quit IRC (Ping timeout: 250 seconds) [22:17] *** SN4T14 has quit IRC (Ping timeout: 250 seconds) [22:17] *** SN4T14 has joined #archiveteam-bs [22:21] *** mls has quit IRC (Ping timeout: 250 seconds) [22:21] *** mls has joined #archiveteam-bs [22:22] *** Aoede has joined 
#archiveteam-bs [22:22] *** andai has joined #archiveteam-bs [22:27] *** nightpool has joined #archiveteam-bs [22:46] *** Aoede has quit IRC (Ping timeout: 250 seconds) [22:48] *** Aoede has joined #archiveteam-bs [22:57] *** andai has quit IRC (Ping timeout: 250 seconds) [22:58] *** andai has joined #archiveteam-bs [23:05] *** sun_rise has joined #archiveteam-bs [23:06] I have questions about what is/is not appropriate for archiveteam/bot and not sure where to pose them [23:06] here is a good place to ask [23:09] Three people I know have been sued for defamation over 'survivor' websites by institutions they alleged abused them/others as children. Two of them were forced to settle and remove the content from the web. [23:09] archive it [23:09] this is 100% okay [23:10] unless they want it removed, which, well, doesn't sound like they do [23:12] "it", in this case, is going to be a lot bigger than just the 'survivor' websites. I am interested in crawling the 'industry' sites as well. My original plan was to do this on my own and I started researching best practices for this sort of thing. I was really pleasantly surprised to find Archiveteam/bot. [23:12] It's an amazing service and I don't want to abuse it. The crawl I started yesterday pointed at a single domain has already grown much larger than I was expecting. [23:14] yep, that'll happen [23:14] if you want, you can next time run your jobs with --no-offsite-links [23:14] by default archivebot will fetch every page on the site you submit, and every page that is linked to [23:14] in order to present context [23:14] (along with images and script and stylesheets used on these pages) [23:14] I think, for this job, that was probably the appropriate setting - I didn't realize this until after it started running, though. [23:15] mm, possibly [23:20] Ultimately I'm going to be interested in hundreds of domains that this site points to or that I have collected elsewhere that are relevant to this topic. 
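[Editor's note: the default behaviour described above — recurse freely within the submitted site, fetch linked offsite pages once for context, and skip them entirely under `--no-offsite-links` — reduces to a simple per-link decision. A simplification (real crawlers also weigh subdomains, ignore patterns, and page requisites); the function name is invented.]

```python
from urllib.parse import urlparse


def classify_link(seed_url, link_url, offsite_links=True):
    """Return 'recurse', 'fetch-once', or 'skip' for a discovered link."""
    seed_host = urlparse(seed_url).hostname
    link_host = urlparse(link_url).hostname
    if link_host == seed_host:
        return "recurse"        # on-site: fetch it and follow its links too
    if offsite_links:
        return "fetch-once"     # offsite: grab the page for context only
    return "skip"               # --no-offsite-links: stay on the domain
```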
I doubt any single one of them will end up as large as this - they seem to mostly be fairly lean wordpress product page type sites. I guess what I'm after is a general sense of what *wouldn't* be appropriate for archivebot. At what point should I be using something else? [23:20] Is there some standard/threshold of general interest or threatened status? If I end up trying to crawl from a list of sites - should that be done in chunks? How do I ensure my jobs don't spiral out of control? [23:21] If I made a donation to offset my usage is there some guide to how much things generally cost? [23:21] feel free to use archivebot [23:21] you sound like someone who's fairly conscious of the resources they're using [23:22] if you look on the dashboard and you have more jobs running than anyone else, you might want to rethink how you're going about doing things [23:22] that said, everyone who cares about something fills up the queue eventually [23:23] we have a cost shameboard that kind of tries to be a forever-cost of data storage [23:23] I saw this but wasn't sure how quickly that would fill up. There are some high scorers! [23:23] but if you throw some chum towards https://archive.org/donate/ it'll probably be fine [23:23] hehe [23:24] I noticed there are 2 warc files associated with my crawl that have already been uploaded to archive.org. Will those continue to be uploaded in chunks? [23:24] yep [23:24] whenever the pipeline cuts off the warc file and starts a new one, the uploader sends the finished warc file off to IA [23:24] if I do a crawl from a pastebin list of domains will they show up in the same IA folder or separate per domain? 
[23:25] jobs go into warc files named by the url you submit, regardless of whether you use it as a list of urls or a single website [23:26] if you're doing less than a few dozen sites, i'd suggest one !a per site [23:26] like, one day i did all the campaign websites for my city's election [23:28] *** dashcloud has quit IRC (Remote host closed the connection) [23:29] we've asked before about what wouldn't be appropriate and sketchcow weighed in: [23:29] In another channel, regarding uploading stuff of dubious value or duplication to archive.org: [23:29] General archive rule: gigabytes fine, tens of gigabytes problematic, hundreds of gigabytes bad. [23:29] I am going to go ahead and define dubious value that the uploader can't even begin to dream up a use. [23:29] If the uploader can't even come up with a use case, that's dubious value. [23:29] Example: 14gb quicktime movie aimed at a blank wall for an hour, no change [23:30] *** BlueMaxim has joined #archiveteam-bs [23:31] so if it's in any way useful and it's not already archived, go hog wild, if it's gonna be mainly duplicated data then be careful about getting up into tens or hundreds of gigs [23:32] small sites don't matter except don't do so many at the same time that there aren't any archivebot slots free for emergencies [23:33] *** dashcloud has joined #archiveteam-bs [23:33] this is admittedly hampered by the fact that we don't actually have a readout for the number of free slots [23:33] so submitting a list of urls might be more polite? 
[23:34] or come in and feed one in every so often as previous ones finish [23:35] I'm thinking I can prioritize the stuff that I most fear being lost right now and get to crawling 'the enemy' later when I have a better grasp of how big these things get [23:35] having a ton of sites on one job can be a problem because the jobs do crash from time to time [23:40] what I usually do before putting a site through archivebot is bring the site up in the wayback machine and see if the site has been crawled pretty well already or not [23:41] if the most recent crawl is from ages ago or you click a couple links and they come up "this page has not been archived" then it's due for a go [23:48] ok
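[Editor's note: the "check the Wayback Machine first" step above can be automated against the Wayback availability API at `https://archive.org/wayback/available?url=...`, which returns JSON describing the closest archived snapshot. A sketch; the function names are invented, and the network call is separated out so the parsing can be exercised offline.]

```python
import json
import urllib.parse
import urllib.request

API = "https://archive.org/wayback/available?url={}"


def parse_availability(reply):
    """Extract (archived?, snapshot_url, timestamp) from the API's JSON reply."""
    snap = reply.get("archived_snapshots", {}).get("closest")
    if not snap or not snap.get("available"):
        return (False, None, None)
    return (True, snap.get("url"), snap.get("timestamp"))


def check_site(url):
    """Query the live API (network access required)."""
    with urllib.request.urlopen(API.format(urllib.parse.quote(url, safe=""))) as r:
        return parse_availability(json.load(r))
```

[An old timestamp on the closest snapshot, or no snapshot at all, corresponds to the "due for a go" judgement described above; spot-checking a few deep links by hand is still worthwhile, since the API only reports top-level coverage for the URL you ask about.]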