#archiveteam-bs 2017-05-02,Tue


***brayden_ has joined #archiveteam-bs
swebb sets mode: +o brayden_
brayden has quit IRC (Read error: Operation timed out)
[01:22]
........................ (idle for 1h59mn)
Odd0002I wonder if archive wants video files from a university course I just took... [03:25]
***pizzaiolo has quit IRC (pizzaiolo) [03:39]
.... (idle for 17mn)
Somebody2Hm, looks like the only active Warrior project right now is #urlteam . I'll go add more shorteners to urlteam. [03:56]
..... (idle for 21mn)
***Sk1d has quit IRC (Ping timeout: 250 seconds) [04:17]
Sk1d has joined #archiveteam-bs
Sk1d has quit IRC (Connection Closed)
[04:24]
ploop has joined #archiveteam-bs [04:35]
ploopSomebody2: so far I've been writing a new script every time I want to archive files from a site, but they're always very far from perfect and stop working every now and again and require constant maintenance
additionally, I have no idea how I should be handling various errors, so if my internet cuts out for a few seconds or something, I end up with the script either crashing or missing files
[04:37]
***BlueMaxim has joined #archiveteam-bs [04:38]
ploopand it occurred to me that downloading webpages is not something that I should be having problems with, since plenty of other people's software does it without issue [04:39]
Somebody2well, you've come to the right place. [04:41]
ploopthe easy part is figuring out that I need to download x.com/fileid/x where x is {1..5000000} and maybe do some MIME detection to give it a good filename or something
but somehow I struggle with HTTP, which should be the easier part
[04:41]
Somebody2Look over the docs for wpull; there's also grab-site that offers an interface over it.
You may also find the code for the Warrior projects informative; those are in the ArchiveTeam github organization.
I don't personally do a whole lot of that exact thing, so I'm probably not the best person to answer really detailed questions.
[04:42]
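
(A minimal sketch of the loop ploop describes: walk a numeric ID range, retry transient network errors with backoff, and pick a filename extension via MIME detection. It assumes the third-party requests library; x.com/fileid/N and the 1..5000000 range are placeholders taken from the chat, not a real target.)

    import mimetypes
    import time

    import requests

    session = requests.Session()

    def fetch(file_id, retries=5):
        """Fetch one file, retrying transient errors with backoff."""
        for attempt in range(retries):
            try:
                resp = session.get(f"https://x.com/fileid/{file_id}", timeout=30)
                if resp.status_code == 404:
                    return  # no such file; skip instead of crashing
                resp.raise_for_status()
            except requests.RequestException:
                time.sleep(2 ** attempt)  # connection blip? back off and retry
                continue
            # guess an extension from the Content-Type header
            ctype = resp.headers.get("Content-Type", "").split(";")[0].strip()
            ext = mimetypes.guess_extension(ctype) or ".bin"
            with open(f"{file_id}{ext}", "wb") as f:
                f.write(resp.content)
            return
        print(f"giving up on {file_id}")

    for file_id in range(1, 5_000_001):
        fetch(file_id)

(As Somebody2 notes, wpull and grab-site already handle these concerns — retries, resumption, WARC output — without a hand-rolled per-site script.)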
***Aranje has quit IRC (Quit: Three sheets to the wind) [04:47]
ploopthis looks interesting [04:51]
Somebody2I hope so. :-) It serves us pretty well. [04:53]
............................... (idle for 2h33mn)
godanethere is a thunderstorm outside [07:26]
***GE has joined #archiveteam-bs [07:26]
godanelike monsoon like rain is going on where i live [07:26]
.... (idle for 18mn)
***Jonison has joined #archiveteam-bs [07:44]
schbirid has joined #archiveteam-bs [07:53]
espes___ has joined #archiveteam-bs
will has quit IRC (Ping timeout: 250 seconds)
luckcolor has quit IRC (Remote host closed the connection)
midas has quit IRC (hub.se irc.underworld.no)
Jonimus has quit IRC (hub.se irc.underworld.no)
JensRex has quit IRC (hub.se irc.underworld.no)
Lord_Nigh has quit IRC (hub.se irc.underworld.no)
alfiepate has quit IRC (hub.se irc.underworld.no)
Riviera has quit IRC (hub.se irc.underworld.no)
espes__ has quit IRC (hub.se irc.underworld.no)
tammy_ has quit IRC (hub.se irc.underworld.no)
i0npulse has quit IRC (hub.se irc.underworld.no)
purplebot has quit IRC (hub.se irc.underworld.no)
Rai-chan has quit IRC (hub.se irc.underworld.no)
medowar has quit IRC (hub.se irc.underworld.no)
Hecatz has quit IRC (hub.se irc.underworld.no)
LordNigh2 has joined #archiveteam-bs
luckcolor has joined #archiveteam-bs
will has joined #archiveteam-bs
alfie has joined #archiveteam-bs
[08:05]
t2t2I think #noanswers needs requeuing, 70k items out [08:11]
***midas1 has joined #archiveteam-bs
Jonimoose has joined #archiveteam-bs
swebb sets mode: +o Jonimoose
[08:17]
LordNigh2 is now known as Lord_Nigh [08:23]
....... (idle for 30mn)
GE has quit IRC (Remote host closed the connection) [08:53]
.... (idle for 19mn)
Jonison has quit IRC (Read error: Connection reset by peer) [09:12]
Jonison has joined #archiveteam-bs
Somebody2 has quit IRC (Read error: Operation timed out)
Jonimoose has quit IRC (west.us.hub irc.Prison.NET)
xmc has quit IRC (Read error: Operation timed out)
Somebody2 has joined #archiveteam-bs
midas1 is now known as midas
xmc has joined #archiveteam-bs
swebb sets mode: +o xmc
[09:18]
.... (idle for 17mn)
deathy has quit IRC (Remote host closed the connection)
HCross2 has quit IRC (Remote host closed the connection)
JAA has joined #archiveteam-bs
[09:43]
deathy has joined #archiveteam-bs [09:52]
JAAServer: IIS/4.1
X-Powered-By: Visual Basic 2.0 on Rails
I lol'd
[09:57]
..... (idle for 23mn)
***HCross2 has joined #archiveteam-bs [10:20]
JAA has quit IRC (Quit: Page closed) [10:28]
Jonimoose has joined #archiveteam-bs
irc.Prison.NET sets mode: +o Jonimoose
swebb sets mode: +o Jonimoose
purplebot has joined #archiveteam-bs
Rai-chan has joined #archiveteam-bs
medowar has joined #archiveteam-bs
Hecatz has joined #archiveteam-bs
i0npulse has joined #archiveteam-bs
tammy_ has joined #archiveteam-bs
[10:34]
..... (idle for 24mn)
JensRex has joined #archiveteam-bs
dashcloud has quit IRC (Read error: Connection reset by peer)
dashcloud has joined #archiveteam-bs
[11:03]
...... (idle for 28mn)
HCross2Upload of the first chunk of data.gov has begun - 1.5TB at 55Mbps
Anyone know if I can use the IA python tool to upload more than 1 file to an item at a time please?
[11:32]
............ (idle for 57mn)
***pizzaiolo has joined #archiveteam-bs [12:30]
........ (idle for 35mn)
BlueMaxim has quit IRC (Quit: Leaving) [13:05]
............ (idle for 57mn)
JensRex has quit IRC (Remote host closed the connection)
JensRex has joined #archiveteam-bs
[14:02]
.... (idle for 17mn)
Yurume has quit IRC (Remote host closed the connection)
antomati_ is now known as antomatic
Ravenloft has quit IRC (Read error: Operation timed out)
[14:20]
Yurume has joined #archiveteam-bs [14:31]
Dark_Star has quit IRC (Read error: Operation timed out)
hook54321 has quit IRC (Ping timeout: 250 seconds)
godane has quit IRC (Ping timeout: 250 seconds)
kanzure has quit IRC (Ping timeout: 250 seconds)
kanzure has joined #archiveteam-bs
alembic has quit IRC (Ping timeout: 260 seconds)
godane has joined #archiveteam-bs
[14:44]
logchfoo0 starts logging #archiveteam-bs at Tue May 02 14:58:53 2017
logchfoo0 has joined #archiveteam-bs
hook54321 has joined #archiveteam-bs
alembic has joined #archiveteam-bs
[14:58]
Ctrl-S___ has joined #archiveteam-bs [15:07]
kvieta has quit IRC (Ping timeout: 370 seconds)
GE has joined #archiveteam-bs
nightpool has joined #archiveteam-bs
[15:12]
icedice has joined #archiveteam-bs
icedice2 has joined #archiveteam-bs
[15:26]
yipdw has quit IRC (Read error: Operation timed out)
me_ has joined #archiveteam-bs
icedice2 has quit IRC (Quit: Leaving)
[15:31]
....................... (idle for 1h52mn)
arkiverHCross2: yes, just give it a list of items
or a directory where it can find all the files
[17:28]
..... (idle for 20mn)
HCross2I meant concurrent - I fed it a directory and off it went
So I point it at a directory and it uploads, say, 5 files at once?
[17:48]
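
(For reference, the tool in question is the internetarchive Python library. A single upload() call accepts a list of files for one item but sends them sequentially; for the "5 files at once" behaviour HCross2 describes, one option is a thread pool with one upload() call per file. A hedged sketch: "my-item" and the directory name are placeholders, and the retries argument assumes a library version that supports it.)

    import os
    from concurrent.futures import ThreadPoolExecutor

    from internetarchive import upload

    directory = "data-gov-chunk1"  # placeholder local directory
    paths = [os.path.join(directory, name)
             for name in sorted(os.listdir(directory))]

    def upload_one(path):
        # each call is an independent HTTP PUT into the same item
        upload("my-item", files=[path], retries=5)
        return path

    # five concurrent uploads, mirroring "say 5 files at once"
    with ThreadPoolExecutor(max_workers=5) as pool:
        for done in pool.map(upload_one, paths):
            print("uploaded", done)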
***GE has quit IRC (Remote host closed the connection) [17:55]
namespace has joined #archiveteam-bs [18:02]
namespaceBut yeah.
It's not so much that piracy sites have no cultural value; quite the contrary, they're some of the largest 'open' repositories of cultural value out there.
[18:02]
xmctraditionally we don't care much about legal risk, because the real risk seems low [18:02]
namespaceThey're just radioactive to touch.
Yeah but.
Piracy sites are one of the cases where it's not.
Especially if they just shut down because someone else was suing them or whatever.
[18:03]
xmci see no evidence, only fear [18:03]
namespacenamespace shrugs
Not gonna argue this when it's not even my decision lol.
[18:04]
xmcit's the decision of every member for themselves, of whether they want to participate in that sort of project [18:05]
DFJustinwe've archived shitloads of pirated everything and nothing has happened so far [18:06]
xmcwe've even archived people being scared about it in irc!
hehe
i think we've received a few takedowns on things, but no other fallout
i know that a ftpsite i archived got darked
[18:06]
SketchCowFEEEAR
Did someone call for fear? I work in fear.
[18:08]
xmcyes, hello, fear department, we need a delivery [18:09]
SketchCowDid you want regular fear or extra spicy fear [18:09]
xmcwell what did the requisition form say
come ON we have standardized forms for a *reason*
[18:09]
SketchCowForm unintelligible, blood streaks covering checkboxes [18:10]
MrRadarWhile people are here: is there a list of people who have access to the tracker for different projects? Yahoo Answers needs a requeue and I'm not sure who is best to ping [18:10]
SketchCowPing arkiver or yipdw or I'm not sure who else [18:10]
***me_ is now known as yipdw [18:12]
yipdwthe claims page is 500ing out
one sec
[18:12]
xmcyahooanswers has admins set as arkiver and medowar, for the record
(they, and anyone set as global-admin, can jiggle it)
[18:13]
yipdwoh
it's because someone named pronerdJay has something like 100,000 claims and the page is going FML
I haven't come across something so quintessentially AT in a while
er, maybe it's closer to 50,000
either way
[18:14]
xmchaha [18:16]
yipdw$ ruby release-claims.rb yahooanswers pronerdJay
/home/yipdw/.rvm/gems/ruby-2.3.3/gems/activesupport-3.2.5/lib/active_support/values/time_zone.rb:270: warning: circular argument reference - now
/home/yipdw/.rvm/gems/ruby-2.3.3/gems/redis-2.2.2/lib/redis.rb:215:in `block in hgetall': stack level too deep (SystemStackError)
fuck Ruby
[18:16]
xmcthat's the rub [18:16]
yipdwwait what how is that stack trace possible
is hgetall recursing to build a hash??
oh, no, it uses Hash[] and passes the reply in using a splat
fuck Ruby
[18:16]
xmcarchiveteam: finding bugs in standard system tools since 2009 [18:17]
yipdwI think newer versions of redis-rb fix this
oh, but that script is using the tracker gem bundle and I can't update it without affecting the world
bleh I'll write something
[18:17]
icediceIs Yahoo Answers going down? [18:21]
yipdwI have some places where Yahoo Answers can go [18:21]
MrRadaricedice: Yahoo Answers is being grabbed preemptively in case Verizon decides to can it [18:22]
icediceAh, right
Yahoo sold out to Verizon
[18:22]
yipdwok, it looks like release-stale worked
the spice is flowing again on yahooanswers and I'm getting out of jwz mode
[18:23]
MrRadarThanks yipdw [18:24]
arkiveryipdw: we already have a way of handling too many out items
Requeue on the Workarounds page
[18:24]
yipdwthere's a few scripts that seem to work, release-claims just can't handle firepower of that magnitude
oh, right
I guess that page does the same as release-stale, huh
[18:25]
arkiverI guess so [18:27]
.... (idle for 17mn)
SketchCowhttps://archive.org/details/pulpmagazinearchive?&sort=-publicdate&and[]=addeddate:2017*
I'm uploading 10,000 zines
Should I ask permission
SketchCow bites nails
[18:44]
..... (idle for 22mn)
***ndiddy has quit IRC () [19:06]
HCross2Even more data.gov has just started the slow march up to the IA [19:06]
namespaceSketchCow: lolno [19:15]
..... (idle for 21mn)
t2t2BTW the tracker also has stale items for yuku, almost a year old [19:36]
***GE has joined #archiveteam-bs [19:39]
..... (idle for 20mn)
icediceIs there any way to find the Imgur link that was posted in OP's (now deleted) post?
https://www.reddit.com/r/webhosting/comments/4w6d63/buyshared_gets_mentioned_a_lot_when_it_comes_to/
Nothing on Archive.org
[19:59]
MrRadaricedice: It looks like this may be a mirror of the original post: https://webdesignersolutions.wordpress.com/2016/08/04/buyshared-gets-mentioned-a-lot-when-it-comes-to-cheap-shared-hosting-heres-the-uptime-log-since-february-for-an-account-i-have-with-them-via-rwebhosting/ [20:02]
icediceThanks! [20:06]
..... (idle for 24mn)
***schbirid has quit IRC (Quit: Leaving)
kvieta has joined #archiveteam-bs
[20:30]
kvieta has quit IRC (Read error: Operation timed out) [20:46]
Ravenloft has joined #archiveteam-bs
kvieta has joined #archiveteam-bs
[20:54]
tuluu_ has joined #archiveteam-bs
tuluu has quit IRC (Ping timeout: 250 seconds)
Jonison has quit IRC (Read error: Connection reset by peer)
ndiddy has joined #archiveteam-bs
[21:04]
.......... (idle for 48mn)
espes__ has joined #archiveteam-bs
espes___ has quit IRC (Ping timeout: 250 seconds)
midas has quit IRC (Ping timeout: 250 seconds)
Gfy has quit IRC (Ping timeout: 250 seconds)
mls has quit IRC (Ping timeout: 250 seconds)
midas has joined #archiveteam-bs
tsr has quit IRC (Ping timeout: 250 seconds)
Gfy has joined #archiveteam-bs
andai has quit IRC (Ping timeout: 250 seconds)
Kaz has quit IRC (Ping timeout: 250 seconds)
GE has quit IRC (Remote host closed the connection)
Aoede has quit IRC (Ping timeout: 250 seconds)
hook54321 has quit IRC (Ping timeout: 250 seconds)
C4K3 has quit IRC (Ping timeout: 250 seconds)
tsr has joined #archiveteam-bs
HP_ has joined #archiveteam-bs
C4K3 has joined #archiveteam-bs
hook54321 has joined #archiveteam-bs
andai has joined #archiveteam-bs
HP has quit IRC (Ping timeout: 250 seconds)
nightpool has quit IRC (Ping timeout: 250 seconds)
Kaz has joined #archiveteam-bs
mls has joined #archiveteam-bs
andai has quit IRC (Ping timeout: 250 seconds)
SN4T14 has quit IRC (Ping timeout: 250 seconds)
SN4T14 has joined #archiveteam-bs
mls has quit IRC (Ping timeout: 250 seconds)
mls has joined #archiveteam-bs
Aoede has joined #archiveteam-bs
andai has joined #archiveteam-bs
[21:58]
nightpool has joined #archiveteam-bs [22:27]
.... (idle for 19mn)
Aoede has quit IRC (Ping timeout: 250 seconds)
Aoede has joined #archiveteam-bs
[22:46]
andai has quit IRC (Ping timeout: 250 seconds)
andai has joined #archiveteam-bs
[22:57]
sun_rise has joined #archiveteam-bs [23:05]
sun_riseI have questions about what is/is not appropriate for archiveteam/bot, and I'm not sure where to pose them [23:06]
xmchere is a good place to ask [23:06]
sun_riseThree people I know have been sued for defamation over 'survivor' websites by institutions they alleged abused them/others as children. Two of them were forced to settle and remove the content from the web. [23:09]
xmcarchive it
this is 100% okay
unless they want it removed, which, well, doesn't sound like they do
[23:09]
sun_rise"it", in this case, is going to be a lot bigger than just the 'survivor' websites. I am interested in crawling the 'industry' sites as well. My original plan was to do this own my own and I started researching best practices for this sort of thing. I was really pleasantly surprised to find Archiveteam/bot.
It's an amazing service and I don't want to abuse it. The crawl I started yesterday pointed at a single domain has already grown much larger than I was expecting.
[23:12]
xmcyep, that'll happen
if you want, you can next time run your jobs with --no-offsite-links
by default archivebot will fetch every page on the site you submit, and every page that is linked to
in order to present context
(along with the images, scripts, and stylesheets used on those pages)
[23:14]
sun_riseI think, for this job, that was probably the appropriate setting - I didn't realize this until after it started running, though. [23:14]
xmcmm, possibly [23:15]
sun_riseUltimately I'm going to be interested in hundreds of domains that this site points to or that I have collected elsewhere that are relevant to this topic. I doubt any single one of them will end up as large as this - they seem to mostly be fairly lean wordpress product page type sites. I guess what I'm after is a general sense of what *wouldn't* be appropriate for archivebot. At what point should I be using something else?
Is there some standard/threshold of general interest or threatened status? If I end up trying to crawl from a list of sites - should that be done in chunks? How do I ensure my jobs don't spiral out of control?
If I made a donation to offset my usage is there some guide to how much things generally cost?
[23:20]
xmcfeel free to use archivebot
you sound like someone who's fairly conscious of the resources they're using
if you look on the dashboard and you have more jobs running than anyone else, you might want to rethink how you're going about doing things
that said, everyone who cares about something fills up the queue eventually
we have a cost shameboard that tries to estimate the forever-cost of storing the data
[23:21]
sun_riseI saw this but wasn't sure how quickly that would fill up. There are some high scorers! [23:23]
xmcbut if you throw some chum towards https://archive.org/donate/ it'll probably be fine
hehe
[23:23]
sun_riseI noticed there are 2 warc files associated with my crawl that have already been uploaded to archive.org. Will those continue to be uploaded in chunks? [23:24]
xmcyep
whenever the pipeline cuts off the warc file and starts a new one, the uploader sends the finished warc file off to IA
[23:24]
sun_riseif I do a crawl from a pastebin list of domains will they show up in the same IA folder or separate per domain? [23:24]
xmcjobs go into warc files named by the url you submit, regardless of whether you use it as a list of urls or a single website
if you're doing fewer than a few dozen sites, I'd suggest one !a per site
like, one day i did all the campaign websites for my city's election
[23:25]
***dashcloud has quit IRC (Remote host closed the connection) [23:28]
DFJustinwe've asked before about what wouldn't be appropriate and sketchcow weighed in:
<SketchCow> In another channel, regarding uploading stuff of dubious value or duplication to archive.org:
<SketchCow> General archive rule: gigabytes fine, tens of gigabytes problematic, hundreds of gigabytes bad.
<SketchCow> I am going to go ahead and define dubious value that the uploader can't even begin to dream up a use.
<SketchCow> If the uploader can'te ven come up with a use case, that's dubious value.
<SketchCow> Example: 14gb quicktime movie aimed at a blank wall for an hour, no change
[23:29]
***BlueMaxim has joined #archiveteam-bs [23:30]
DFJustinso if it's in any way useful and it's not already archived, go hog wild; if it's gonna be mainly duplicated data, then be careful about getting up into tens or hundreds of gigs
small sites don't matter, except don't do so many at the same time that there aren't any archivebot slots free for emergencies
[23:31]
***dashcloud has joined #archiveteam-bs [23:33]
DFJustinthis is admittedly hampered by the fact that we don't actually have a readout for the number of free slots [23:33]
sun_riseso submitting a list of urls might be more polite? [23:33]
DFJustinor come in and feed one in every so often as previous ones finish [23:34]
sun_riseI'm thinking I can prioritize the stuff that I most fear being lost right now and get to crawling 'the enemy' later when I have a better grasp of how big these things get [23:35]
DFJustinhaving a ton of sites on one job can be a problem because the jobs do crash from time to time [23:35]
what I usually do before putting a site through archivebot is bring the site up in the wayback machine and see if the site has been crawled pretty well already or not
if the most recent crawl is from ages ago or you click a couple links and they come up "this page has not been archived" then it's due for a go
[23:40]
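
(DFJustin's pre-check can be automated with the Wayback Machine availability API, which reports the closest snapshot for a URL. A minimal sketch, assuming the requests library; example.com is a placeholder.)

    import requests

    def latest_snapshot(url):
        """Return the timestamp of the closest Wayback capture, or None."""
        resp = requests.get("https://archive.org/wayback/available",
                            params={"url": url}, timeout=30)
        closest = resp.json().get("archived_snapshots", {}).get("closest")
        if closest and closest.get("available"):
            return closest["timestamp"]  # e.g. "20170502120000"
        return None

    ts = latest_snapshot("http://example.com/")
    print("last capture:", ts or "never archived - due for a go")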
sun_riseok [23:48]
