#archiveteam-bs 2017-05-02,Tue


Time Nickname Message
01:22 🔗 brayden_ has joined #archiveteam-bs
01:22 🔗 swebb sets mode: +o brayden_
01:26 🔗 brayden has quit IRC (Read error: Operation timed out)
03:25 🔗 Odd0002 I wonder if archive wants video files from a university course I just took...
03:39 🔗 pizzaiolo has quit IRC (pizzaiolo)
03:56 🔗 Somebody2 Hm, looks like the only active Warrior project right now is #urlteam . I'll go add more shorteners to urlteam.
04:17 🔗 Sk1d has quit IRC (Ping timeout: 250 seconds)
04:24 🔗 Sk1d has joined #archiveteam-bs
04:24 🔗 Sk1d has quit IRC (Connection Closed)
04:35 🔗 ploop has joined #archiveteam-bs
04:37 🔗 ploop Somebody2: so far I've been writing a new script every time I want to archive files from a site, but they're always very far from perfect and stop working every now and again and require constant maintenance
04:38 🔗 ploop additionally i have no idea how i should be handling various errors so if my internet cuts out for a few seconds or something i end up with the script either crashing or missing files
04:38 🔗 BlueMaxim has joined #archiveteam-bs
04:39 🔗 ploop and it occurred to me that downloading webpages is not something that i should be having problems with, since plenty of other people's software does it without issue
04:41 🔗 Somebody2 well, you've come to the right place.
04:41 🔗 ploop the easy part is figuring out that i need to download x.com/fileid/x where x is {1..5000000} and maybe do some mime detection to give it a good filename or something
04:42 🔗 ploop but somehow i struggle with http, which should be the easier part
04:42 🔗 Somebody2 Look over the docs for wpull; there's also grab-site that offers an interface over it.
04:43 🔗 Somebody2 You may also find the code for the Warrior projects informative; those are in the ArchiveTeam github organization.
04:44 🔗 Somebody2 I don't personally do a whole lot of that exact thing, so I'm probably not the best person to answer really detailed questions.
04:47 🔗 Aranje has quit IRC (Quit: Three sheets to the wind)
04:51 🔗 ploop this looks interesting
04:53 🔗 Somebody2 I hope so. :-) It serves us pretty well.
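ploop's core problem above (a crawl script that crashes or silently skips files when the connection drops) can be handled with automatic retries instead of ad-hoc error handling. A minimal sketch in Python, assuming the `requests` library; `BASE_URL` is a placeholder standing in for the "x.com/fileid/x" pattern mentioned, and the extension table is an illustrative choice, not anything from the discussion:

```python
import mimetypes
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Placeholder standing in for the "x.com/fileid/x" pattern above.
BASE_URL = "https://example.com/fileid/{}"

# Small explicit mapping so common types get predictable names;
# anything unknown falls back to .bin.
KNOWN_EXT = {"image/jpeg": ".jpg", "image/png": ".png",
             "application/pdf": ".pdf", "text/html": ".html"}

def filename_for(file_id, content_type):
    """Pick a filename from the id plus the response's Content-Type."""
    mime = (content_type or "").split(";")[0].strip()
    ext = KNOWN_EXT.get(mime) or mimetypes.guess_extension(mime) or ".bin"
    return "{}{}".format(file_id, ext)

def make_session(retries=5):
    """A session that retries transient failures with backoff,
    so a few seconds of dropped internet doesn't kill the crawl."""
    session = requests.Session()
    retry = Retry(total=retries, backoff_factor=1,
                  status_forcelist=[429, 500, 502, 503, 504])
    adapter = HTTPAdapter(max_retries=retry)
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session

if __name__ == "__main__":
    session = make_session()
    for file_id in range(1, 5000001):
        resp = session.get(BASE_URL.format(file_id), timeout=30)
        if resp.status_code == 404:
            continue  # gaps in the id space are normal
        resp.raise_for_status()
        name = filename_for(file_id, resp.headers.get("Content-Type"))
        with open(name, "wb") as f:
            f.write(resp.content)
```

This is roughly what wpull and grab-site do for you already (plus WARC output and politeness controls), which is why Somebody2 points there first.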
07:26 🔗 godane there is a thunderstorm outside
07:26 🔗 GE has joined #archiveteam-bs
07:26 🔗 godane like monsoon like rain is going on where i live
07:44 🔗 Jonison has joined #archiveteam-bs
07:53 🔗 schbirid has joined #archiveteam-bs
08:05 🔗 espes___ has joined #archiveteam-bs
08:06 🔗 will has quit IRC (Ping timeout: 250 seconds)
08:07 🔗 luckcolor has quit IRC (Remote host closed the connection)
08:08 🔗 midas has quit IRC (hub.se irc.underworld.no)
08:08 🔗 Jonimus has quit IRC (hub.se irc.underworld.no)
08:08 🔗 JensRex has quit IRC (hub.se irc.underworld.no)
08:08 🔗 Lord_Nigh has quit IRC (hub.se irc.underworld.no)
08:08 🔗 alfiepate has quit IRC (hub.se irc.underworld.no)
08:08 🔗 Riviera has quit IRC (hub.se irc.underworld.no)
08:08 🔗 espes__ has quit IRC (hub.se irc.underworld.no)
08:08 🔗 tammy_ has quit IRC (hub.se irc.underworld.no)
08:08 🔗 i0npulse has quit IRC (hub.se irc.underworld.no)
08:08 🔗 purplebot has quit IRC (hub.se irc.underworld.no)
08:08 🔗 Rai-chan has quit IRC (hub.se irc.underworld.no)
08:08 🔗 medowar has quit IRC (hub.se irc.underworld.no)
08:08 🔗 Hecatz has quit IRC (hub.se irc.underworld.no)
08:09 🔗 LordNigh2 has joined #archiveteam-bs
08:09 🔗 luckcolor has joined #archiveteam-bs
08:09 🔗 will has joined #archiveteam-bs
08:10 🔗 alfie has joined #archiveteam-bs
08:11 🔗 t2t2 I think #noanswers needs requeuing, 70k items out
08:17 🔗 midas1 has joined #archiveteam-bs
08:17 🔗 Jonimoose has joined #archiveteam-bs
08:17 🔗 swebb sets mode: +o Jonimoose
08:23 🔗 LordNigh2 is now known as Lord_Nigh
08:53 🔗 GE has quit IRC (Remote host closed the connection)
09:12 🔗 Jonison has quit IRC (Read error: Connection reset by peer)
09:18 🔗 Jonison has joined #archiveteam-bs
09:19 🔗 Somebody2 has quit IRC (Read error: Operation timed out)
09:20 🔗 Jonimoose has quit IRC (west.us.hub irc.Prison.NET)
09:21 🔗 xmc has quit IRC (Read error: Operation timed out)
09:21 🔗 Somebody2 has joined #archiveteam-bs
09:24 🔗 midas1 is now known as midas
09:26 🔗 xmc has joined #archiveteam-bs
09:26 🔗 swebb sets mode: +o xmc
09:43 🔗 deathy has quit IRC (Remote host closed the connection)
09:43 🔗 HCross2 has quit IRC (Remote host closed the connection)
09:47 🔗 JAA has joined #archiveteam-bs
09:52 🔗 deathy has joined #archiveteam-bs
09:57 🔗 JAA Server: IIS/4.1
09:57 🔗 JAA X-Powered-By: Visual Basic 2.0 on Rails
09:57 🔗 JAA I lol'd
10:20 🔗 HCross2 has joined #archiveteam-bs
10:28 🔗 JAA has quit IRC (Quit: Page closed)
10:34 🔗 Jonimoose has joined #archiveteam-bs
10:34 🔗 irc.Prison.NET sets mode: +o Jonimoose
10:34 🔗 swebb sets mode: +o Jonimoose
10:36 🔗 purplebot has joined #archiveteam-bs
10:36 🔗 Rai-chan has joined #archiveteam-bs
10:36 🔗 medowar has joined #archiveteam-bs
10:36 🔗 Hecatz has joined #archiveteam-bs
10:39 🔗 i0npulse has joined #archiveteam-bs
10:39 🔗 tammy_ has joined #archiveteam-bs
11:03 🔗 JensRex has joined #archiveteam-bs
11:03 🔗 dashcloud has quit IRC (Read error: Connection reset by peer)
11:04 🔗 dashcloud has joined #archiveteam-bs
11:32 🔗 HCross2 Upload of the first chunk of data.gov has begun - 1.5TB at 55Mbps
11:33 🔗 HCross2 Anyone know if I can use the IA python tool to upload more than 1 file to an item at a time please?
12:30 🔗 pizzaiolo has joined #archiveteam-bs
13:05 🔗 BlueMaxim has quit IRC (Quit: Leaving)
14:02 🔗 JensRex has quit IRC (Remote host closed the connection)
14:03 🔗 JensRex has joined #archiveteam-bs
14:20 🔗 Yurume has quit IRC (Remote host closed the connection)
14:20 🔗 antomati_ is now known as antomatic
14:24 🔗 Ravenloft has quit IRC (Read error: Operation timed out)
14:31 🔗 Yurume has joined #archiveteam-bs
14:44 🔗 Dark_Star has quit IRC (Read error: Operation timed out)
14:44 🔗 hook54321 has quit IRC (Ping timeout: 250 seconds)
14:44 🔗 godane has quit IRC (Ping timeout: 250 seconds)
14:44 🔗 kanzure has quit IRC (Ping timeout: 250 seconds)
14:44 🔗 kanzure has joined #archiveteam-bs
14:44 🔗 alembic has quit IRC (Ping timeout: 260 seconds)
14:47 🔗 godane has joined #archiveteam-bs
14:58 🔗 logchfoo0 starts logging #archiveteam-bs at Tue May 02 14:58:53 2017
14:58 🔗 logchfoo0 has joined #archiveteam-bs
14:59 🔗 hook54321 has joined #archiveteam-bs
15:00 🔗 alembic has joined #archiveteam-bs
15:07 🔗 Ctrl-S___ has joined #archiveteam-bs
15:12 🔗 kvieta has quit IRC (Ping timeout: 370 seconds)
15:12 🔗 GE has joined #archiveteam-bs
15:13 🔗 nightpool has joined #archiveteam-bs
15:26 🔗 icedice has joined #archiveteam-bs
15:26 🔗 icedice2 has joined #archiveteam-bs
15:31 🔗 yipdw has quit IRC (Read error: Operation timed out)
15:33 🔗 me_ has joined #archiveteam-bs
15:36 🔗 icedice2 has quit IRC (Quit: Leaving)
17:28 🔗 arkiver HCross2: yes, just give it a list of items
17:28 🔗 arkiver or a directory where it can find all items
17:28 🔗 arkiver files*
17:48 🔗 HCross2 I meant concurrent - I fed it a directory and off it went
17:49 🔗 HCross2 So I point it at a directory and it uploads say 5 files at once
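HCross2's actual question is concurrency, which the IA tool's directory mode doesn't address by itself: each upload call is blocking. One way to get "5 files at once" is to fan the calls out through a thread pool. A hedged sketch; `upload_fn` is a stand-in for whatever does the real work (e.g. a wrapper around the `internetarchive` library's `upload()`), since concurrent upload is not a documented built-in of the tool:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def upload_concurrently(item, paths, upload_fn, max_workers=5):
    """Upload several files to one item in parallel by running
    upload_fn(item, path) in a thread pool.
    Returns {path: result or exception}."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(upload_fn, item, p): p for p in paths}
        for fut in as_completed(futures):
            path = futures[fut]
            try:
                results[path] = fut.result()
            except Exception as exc:  # one failed file shouldn't stop the rest
                results[path] = exc
    return results
```

Threads suit this because the work is I/O-bound; the 55 Mbps link, not the CPU, is the bottleneck.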
17:55 🔗 GE has quit IRC (Remote host closed the connection)
18:02 🔗 namespace has joined #archiveteam-bs
18:02 🔗 namespace But yeah.
18:02 🔗 namespace It's not so much that piracy sites have no cultural value; quite the contrary, they're some of the largest 'open' repositories of cultural value out there.
18:02 🔗 xmc traditionally we don't care much about legal risk, because the real risk seems low
18:03 🔗 namespace They're just radioactive to touch.
18:03 🔗 namespace Yeah but.
18:03 🔗 namespace Piracy sites are one of the cases where it's not.
18:03 🔗 namespace Especially if they just shut down because someone else was suing them or whatever.
18:03 🔗 xmc i see no evidence, only fear
18:04 🔗 * namespace shrugs
18:04 🔗 namespace Not gonna argue this when it's not even my decision lol.
18:05 🔗 xmc it's the decision of every member for themselves, of whether they want to participate in that sort of project
18:06 🔗 DFJustin we've archived shitloads of pirated everything and nothing has happened so far
18:06 🔗 xmc we've even archived people being scared about it in irc!
18:06 🔗 xmc hehe
18:07 🔗 xmc i think we've received a few takedowns on things, but no other fallout
18:08 🔗 xmc i know that a ftpsite i archived got darked
18:08 🔗 SketchCow FEEEAR
18:08 🔗 SketchCow Did someone call for fear? I work in fear.
18:09 🔗 xmc yes, hello, fear department, we need a delivery
18:09 🔗 SketchCow Did you want regular fear or extra spicy fear
18:09 🔗 xmc well what did the requisition form say
18:09 🔗 xmc come ON we have standardized forms for a *reason*
18:10 🔗 SketchCow Form unintelligible, blood streaks covering checkboxes
18:10 🔗 MrRadar While people are here: is there a list of people who have access to the tracker for different projects? Yahoo Answers needs a requeue and I'm not sure who is best to ping
18:10 🔗 SketchCow Ping arkiver or yipdw or I'm not sure who else
18:12 🔗 me_ is now known as yipdw
18:12 🔗 yipdw the claims page is 500ing out
18:12 🔗 yipdw one sec
18:13 🔗 xmc yahooanswers has admins set as arkiver and medowar, for the record
18:14 🔗 xmc (they, and anyone set as global-admin, can jiggle it)
18:14 🔗 yipdw oh
18:15 🔗 yipdw it's because someone named pronerdJay has something like 100,000 claims and the page is going FML
18:15 🔗 yipdw i haven't come across something so quintessentially AT in a while
18:15 🔗 yipdw er, maybe it's closer to 50,000
18:15 🔗 yipdw either way
18:16 🔗 xmc haha
18:16 🔗 yipdw $ ruby release-claims.rb yahooanswers pronerdJay
18:16 🔗 yipdw /home/yipdw/.rvm/gems/ruby-2.3.3/gems/activesupport-3.2.5/lib/active_support/values/time_zone.rb:270: warning: circular argument reference - now
18:16 🔗 yipdw /home/yipdw/.rvm/gems/ruby-2.3.3/gems/redis-2.2.2/lib/redis.rb:215:in `block in hgetall': stack level too deep (SystemStackError)
18:16 🔗 yipdw fuck Rub
18:16 🔗 yipdw y
18:16 🔗 xmc that's the rub
18:16 🔗 yipdw wait what how is that stack trace possible
18:16 🔗 yipdw is hgetall recursing to build a hash??
18:17 🔗 yipdw oh, no, it uses Hash[] and passes the reply in using a splat
18:17 🔗 yipdw fuck Ruby
18:17 🔗 xmc archiveteam: finding bugs in standard system tools since 2009
18:17 🔗 yipdw I think newer versions of redis-rb fix this
18:20 🔗 yipdw oh, but that script is using the tracker gem bundle and I can't update it without affecting the world
18:20 🔗 yipdw bleh I'll write something
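The crash yipdw hits comes from redis-rb building a hash via `Hash[*reply]`: splatting a reply with ~100,000 entries pushes every element onto the call stack at once, hence the SystemStackError. The splat-free fix is to pair the flat key/value list iteratively. Illustrated in Python (the original bug is Ruby, so this shows the pattern, not the actual redis-rb patch):

```python
def pairs_to_dict(flat):
    """Turn a flat [k1, v1, k2, v2, ...] reply into a dict without
    splatting the whole list into one call's argument list."""
    it = iter(flat)
    # zip pulls from the same iterator twice per step: key, then value.
    return dict(zip(it, it))
```

This stays O(n) in heap memory and constant in stack depth, so a 100,000-claim reply is no different from a 4-element one.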
18:21 🔗 icedice Is Yahoo Answers going down?
18:21 🔗 yipdw I have some places where Yahoo Answers can go
18:22 🔗 MrRadar icedice: Yahoo Answers is being grabbed preemptively in case Verizon decides to can it
18:22 🔗 icedice Ah, right
18:22 🔗 icedice Yahoo sold out to Verizon
18:23 🔗 yipdw ok, it looks like release-stale worked
18:23 🔗 yipdw the spice is flowing again on yahooanswers and I'm getting out of jwz mode
18:24 🔗 MrRadar Thanks yipdw
18:24 🔗 arkiver yipdw: we already have a way of handling too many out items
18:25 🔗 arkiver Requeue on the Workarounds page
18:25 🔗 yipdw there's a few scripts that seem to work, release-claims just can't handle firepower of that magnitude
18:25 🔗 yipdw oh, right
18:25 🔗 yipdw I guess that page does the same as release-stale, huh
18:27 🔗 arkiver I guess so
18:44 🔗 SketchCow https://archive.org/details/pulpmagazinearchive?&sort=-publicdate&and[]=addeddate:2017*
18:44 🔗 SketchCow I'm uploading 10,000 zines
18:44 🔗 SketchCow Should I ask permission
18:44 🔗 * SketchCow bites nails
19:06 🔗 ndiddy has quit IRC ()
19:06 🔗 HCross2 Even more data.gov has just started the slow march up to the IA
19:15 🔗 namespace SketchCow: lolno
19:36 🔗 t2t2 BTW the tracker also has stale items for yuku, almost a year old
19:39 🔗 GE has joined #archiveteam-bs
19:59 🔗 icedice Is there any way to find the Imgur link that was posted in OP's (now deleted) post?
19:59 🔗 icedice https://www.reddit.com/r/webhosting/comments/4w6d63/buyshared_gets_mentioned_a_lot_when_it_comes_to/
19:59 🔗 icedice Nothing on Archive.org
20:02 🔗 MrRadar icedice: It looks like this may be a mirror of the original post: https://webdesignersolutions.wordpress.com/2016/08/04/buyshared-gets-mentioned-a-lot-when-it-comes-to-cheap-shared-hosting-heres-the-uptime-log-since-february-for-an-account-i-have-with-them-via-rwebhosting/
20:06 🔗 icedice Thanks!
20:30 🔗 schbirid has quit IRC (Quit: Leaving)
20:32 🔗 kvieta has joined #archiveteam-bs
20:46 🔗 kvieta has quit IRC (Read error: Operation timed out)
20:54 🔗 Ravenloft has joined #archiveteam-bs
20:56 🔗 kvieta has joined #archiveteam-bs
21:04 🔗 tuluu_ has joined #archiveteam-bs
21:04 🔗 tuluu has quit IRC (Ping timeout: 250 seconds)
21:07 🔗 Jonison has quit IRC (Read error: Connection reset by peer)
21:10 🔗 ndiddy has joined #archiveteam-bs
21:58 🔗 espes__ has joined #archiveteam-bs
21:59 🔗 espes___ has quit IRC (Ping timeout: 250 seconds)
22:02 🔗 midas has quit IRC (Ping timeout: 250 seconds)
22:02 🔗 Gfy has quit IRC (Ping timeout: 250 seconds)
22:03 🔗 mls has quit IRC (Ping timeout: 250 seconds)
22:03 🔗 midas has joined #archiveteam-bs
22:04 🔗 tsr has quit IRC (Ping timeout: 250 seconds)
22:05 🔗 Gfy has joined #archiveteam-bs
22:06 🔗 andai has quit IRC (Ping timeout: 250 seconds)
22:08 🔗 Kaz has quit IRC (Ping timeout: 250 seconds)
22:10 🔗 GE has quit IRC (Remote host closed the connection)
22:11 🔗 Aoede has quit IRC (Ping timeout: 250 seconds)
22:11 🔗 hook54321 has quit IRC (Ping timeout: 250 seconds)
22:11 🔗 C4K3 has quit IRC (Ping timeout: 250 seconds)
22:13 🔗 tsr has joined #archiveteam-bs
22:13 🔗 HP_ has joined #archiveteam-bs
22:13 🔗 C4K3 has joined #archiveteam-bs
22:14 🔗 hook54321 has joined #archiveteam-bs
22:14 🔗 andai has joined #archiveteam-bs
22:14 🔗 HP has quit IRC (Ping timeout: 250 seconds)
22:14 🔗 nightpool has quit IRC (Ping timeout: 250 seconds)
22:15 🔗 Kaz has joined #archiveteam-bs
22:16 🔗 mls has joined #archiveteam-bs
22:17 🔗 andai has quit IRC (Ping timeout: 250 seconds)
22:17 🔗 SN4T14 has quit IRC (Ping timeout: 250 seconds)
22:17 🔗 SN4T14 has joined #archiveteam-bs
22:21 🔗 mls has quit IRC (Ping timeout: 250 seconds)
22:21 🔗 mls has joined #archiveteam-bs
22:22 🔗 Aoede has joined #archiveteam-bs
22:22 🔗 andai has joined #archiveteam-bs
22:27 🔗 nightpool has joined #archiveteam-bs
22:46 🔗 Aoede has quit IRC (Ping timeout: 250 seconds)
22:48 🔗 Aoede has joined #archiveteam-bs
22:57 🔗 andai has quit IRC (Ping timeout: 250 seconds)
22:58 🔗 andai has joined #archiveteam-bs
23:05 🔗 sun_rise has joined #archiveteam-bs
23:06 🔗 sun_rise I have questions about what is/is not appropriate for archiveteam/bot and not sure where to pose them
23:06 🔗 xmc here is a good place to ask
23:09 🔗 sun_rise Three people I know have been sued for defamation over 'survivor' websites by institutions they alleged abused them/others as children. Two of them were forced to settle and remove the content from the web.
23:09 🔗 xmc archive it
23:09 🔗 xmc this is 100% okay
23:10 🔗 xmc unless they want it removed, which, well, doesn't sound like they do
23:12 🔗 sun_rise "it", in this case, is going to be a lot bigger than just the 'survivor' websites. I am interested in crawling the 'industry' sites as well. My original plan was to do this on my own and I started researching best practices for this sort of thing. I was really pleasantly surprised to find Archiveteam/bot.
23:12 🔗 sun_rise It's an amazing service and I don't want to abuse it. The crawl I started yesterday pointed at a single domain has already grown much larger than I was expecting.
23:14 🔗 xmc yep, that'll happen
23:14 🔗 xmc if you want, you can next time run your jobs with --no-offsite-links
23:14 🔗 xmc by default archivebot will fetch every page on the site you submit, and every page that is linked to
23:14 🔗 xmc in order to present context
23:14 🔗 xmc (along with images and script and stylesheets used on these pages)
23:14 🔗 sun_rise I think, for this job, that was probably the appropriate setting - I didn't realize this until after it started running, though.
23:15 🔗 xmc mm, possibly
23:20 🔗 sun_rise Ultimately I'm going to be interested in hundreds of domains that this site points to or that I have collected elsewhere that are relevant to this topic. I doubt any single one of them will end up as large as this - they seem to mostly be fairly lean wordpress product page type sites. I guess what I'm after is a general sense of what *wouldn't* be appropriate for archivebot. At what point should I be using something else?
23:20 🔗 sun_rise Is there some standard/threshold of general interest or threatened status? If I end up trying to crawl from a list of sites - should that be done in chunks? How do I ensure my jobs don't spiral out of control?
23:21 🔗 sun_rise If I made a donation to offset my usage is there some guide to how much things generally cost?
23:21 🔗 xmc feel free to use archivebot
23:21 🔗 xmc you sound like someone who's fairly conscious of the resources they're using
23:22 🔗 xmc if you look on the dashboard and you have more jobs running than anyone else, you might want to rethink how you're going about doing things
23:22 🔗 xmc that said, everyone who cares about something fills up the queue eventually
23:23 🔗 xmc we have a cost shameboard that kind of tries to be a forever-cost of data storage
23:23 🔗 sun_rise I saw this but wasn't sure how quickly that would fill up. There are some high scorers!
23:23 🔗 xmc but if you throw some chum towards https://archive.org/donate/ it'll probably be fine
23:23 🔗 xmc hehe
23:24 🔗 sun_rise I noticed there are 2 warc files associated with my crawl that have already been uploaded to archive.org. Will those continue to be uploaded in chunks?
23:24 🔗 xmc yep
23:24 🔗 xmc whenever the pipeline cuts off the warc file and starts a new one, the uploader sends the finished warc file off to IA
23:24 🔗 sun_rise if I do a crawl from a pastebin list of domains will they show up in the same IA folder or separate per domain?
23:25 🔗 xmc jobs go into warc files named by the url you submit, no matter whether you use it as a list of urls or a single website
23:26 🔗 xmc if you're doing less than a few dozen sites, i'd suggest one !a per site
23:26 🔗 xmc like, one day i did all the campaign websites for my city's election
23:28 🔗 dashcloud has quit IRC (Remote host closed the connection)
23:29 🔗 DFJustin we've asked before about what wouldn't be appropriate and sketchcow weighed in:
23:29 🔗 DFJustin <SketchCow> In another channel, regarding uploading stuff of dubious value or duplication to archive.org:
23:29 🔗 DFJustin <SketchCow> General archive rule: gigabytes fine, tens of gigabytes problematic, hundreds of gigabytes bad.
23:29 🔗 DFJustin <SketchCow> I am going to go ahead and define dubious value that the uploader can't even begin to dream up a use.
23:29 🔗 DFJustin <SketchCow> If the uploader can't even come up with a use case, that's dubious value.
23:29 🔗 DFJustin <SketchCow> Example: 14gb quicktime movie aimed at a blank wall for an hour, no change
23:30 🔗 BlueMaxim has joined #archiveteam-bs
23:31 🔗 DFJustin so if it's in any way useful and it's not already archived, go hog wild, if it's gonna be mainly duplicated data then be careful about getting up into tens or hundreds of gigs
23:32 🔗 DFJustin small sites don't matter, except don't do so many at the same time that there aren't any archivebot slots free for emergencies
23:33 🔗 dashcloud has joined #archiveteam-bs
23:33 🔗 DFJustin this is admittedly hampered by the fact that we don't actually have a readout for the number of free slots
23:33 🔗 sun_rise so submitting a list of urls might be more polite?
23:34 🔗 DFJustin or come in and feed one in every so often as previous ones finish
23:35 🔗 sun_rise I'm thinking I can prioritize the stuff that I most fear being lost right now and get to crawling 'the enemy' later when I have a better grasp of how big these things get
23:35 🔗 DFJustin having a ton of sites on one job can be a problem because the jobs do crash from time to time
23:40 🔗 DFJustin what I usually do before putting a site through archivebot is bring the site up in the wayback machine and see if the site has been crawled pretty well already or not
23:41 🔗 DFJustin if the most recent crawl is from ages ago or you click a couple links and they come up "this page has not been archived" then it's due for a go
23:48 🔗 sun_rise ok
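DFJustin's pre-flight check (look the site up in the Wayback Machine and see whether recent crawls exist) can be scripted against the Wayback CDX API. The endpoint below is real; treat the exact parameter set, including the negative `limit` for "last N captures", as an assumption to verify against the CDX server docs:

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"

def cdx_query_url(site, limit=5):
    """Build a CDX query for the most recent captures of a site."""
    params = urlencode({"url": site, "output": "json",
                        "limit": -limit,  # negative limit: last N captures
                        "fl": "timestamp,original,statuscode"})
    return "{}?{}".format(CDX_ENDPOINT, params)

def latest_captures(site, limit=5):
    """Fetch recent capture rows; the first row returned is a header."""
    with urlopen(cdx_query_url(site, limit)) as resp:
        rows = json.loads(resp.read().decode("utf-8"))
    return rows[1:] if rows else []

if __name__ == "__main__":
    for ts, url, status in latest_captures("example.com"):
        print(ts, status, url)
```

If the newest timestamp is years old, or the list is empty, the site is due for an ArchiveBot run by the rule of thumb above.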
