#archiveteam-bs 2017-06-21,Wed


Who | What | When
***j08nY has quit IRC (Quit: Leaving) [00:00]
..... (idle for 22mn)
arkiverMrRadar2: what URL? [00:22]
***sheaf has quit IRC (Quit: sheaf) [00:25]
............ (idle for 58mn)
bitBaron has joined #archiveteam-bs [01:23]
crusherit does seem like something's up with imzy
i'm using the warrior script, i haven't seen any of my threads actually upload any data
my urlte.am and eroshare threads are archiving just fine though...
[01:35]
................... (idle for 1h31mn)
***th1x has joined #archiveteam-bs [03:11]
dashcloud has quit IRC (Remote host closed the connection)
dashcloud has joined #archiveteam-bs
[03:17]
.................. (idle for 1h25mn)
Sk1d has quit IRC (Ping timeout: 194 seconds) [04:43]
Sk1d has joined #archiveteam-bs [04:48]
crusher has quit IRC (Ping timeout: 268 seconds) [04:57]
Aranje has quit IRC (Quit: Three sheets to the wind) [05:10]
Aranje has joined #archiveteam-bs [05:19]
th1x has quit IRC (Read error: Operation timed out) [05:25]
..... (idle for 21mn)
Aranje has quit IRC (Three sheets to the wind) [05:46]
..... (idle for 23mn)
bitBaron has quit IRC (Quit: My computer has gone to sleep. ZZZzzz…) [06:09]
schbirid has joined #archiveteam-bs [06:22]
.... (idle for 18mn)
kyounko has joined #archiveteam-bs
voidsta has quit IRC (Remote host closed the connection)
voidsta- has joined #archiveteam-bs
voidsta- has quit IRC (Client Quit)
voidsta- has joined #archiveteam-bs
[06:40]
voidsta- is now known as voidsta
Jonison has joined #archiveteam-bs
[06:49]
...... (idle for 25mn)
j08nY has joined #archiveteam-bs [07:17]
...... (idle for 27mn)
SHODAN_UI has joined #archiveteam-bs [07:44]
.... (idle for 15mn)
jtn2 has joined #archiveteam-bs [07:59]
.... (idle for 15mn)
jrwr has quit IRC (Read error: Operation timed out)
robogoat has quit IRC (Read error: Operation timed out)
robogoat has joined #archiveteam-bs
[08:14]
..... (idle for 24mn)
j08nY has quit IRC (Quit: Leaving) [08:39]
....... (idle for 31mn)
SHODAN_UI has quit IRC (Remote host closed the connection) [09:10]
BlueMaxim has quit IRC (Read error: Operation timed out) [09:19]
.............. (idle for 1h9mn)
logchfoo2 starts logging #archiveteam-bs at Wed Jun 21 10:28:19 2017
logchfoo2 has joined #archiveteam-bs
[10:28]
......... (idle for 42mn)
SHODAN_UI has joined #archiveteam-bs [11:10]
........ (idle for 38mn)
victorbje has joined #archiveteam-bs [11:48]
........ (idle for 39mn)
C4K3_ has joined #archiveteam-bs
C4K3 has quit IRC (Ping timeout: 260 seconds)
[12:27]
icedice has joined #archiveteam-bs
th1x has joined #archiveteam-bs
[12:36]
.... (idle for 19mn)
MrRadararkiver: For example https://www.imzy.com/api/accounts/profiles/daylen?check=true
I can send you one of the partial WARCs if that would help
They all have that ?check=true parameter
[12:56]
***vbdc has joined #archiveteam-bs [13:10]
vbdcgetting rate-limited when doing the upload, 120 connections seems like a small amount. Anything I can do to help work around this bottleneck? [13:11]
..... (idle for 23mn)
timmcMrRadar: If I'm logged in and view my profile, the ?check=true API call comes back with empty 200 OK; for someone else's profile it is an empty 401. I suspect this is an authenticated API call and isn't suitable for archiving.
The 206 when unauthenticated is weird, though. I can check with weffey...
[13:34]
***vbdc has quit IRC (Ping timeout: 268 seconds) [13:36]
timmcarkiver, MrRadar: weffey says any ?check call should be skipped -- it's an auth'd call to see if an object exists (and whether the user has permissions to it) without getting the full payload. [13:42]
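The skip rule weffey describes could be sketched as a simple URL predicate; this is a hypothetical helper, not the actual project script, and the parameter name is taken from the URLs discussed above:

```python
from urllib.parse import urlparse, parse_qs

def should_skip(url):
    """Skip authenticated ?check existence probes, per weffey's guidance.

    Any Imzy API call carrying a 'check' query parameter only tests
    whether an object exists (and whether the caller may see it), so
    it returns no archivable payload.
    """
    query = parse_qs(urlparse(url).query)
    return "check" in query
```

A crawler would call this on each discovered URL and drop the ones that match before queueing them.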
***crusher_ has joined #archiveteam-bs [13:47]
MrRadarvbdc: That's just the limit for FOS
We've tried raising it in the past but it actually slows down due to the server's disk IO getting saturated
We could possibly add another rsync target
If someone wants to volunteer one
[13:48]
.... (idle for 15mn)
victorbjeMrRadar: what are the requirements for adding a rsync target? Might be able to host one [14:04]
.... (idle for 17mn)
***icedice has quit IRC (Read error: error:1408F119:SSL routines:SSL3_GET_RECORD:decryption failed or bad record mac)
icedice has joined #archiveteam-bs
[14:21]
MrRadarvictorbje: IIRC at least 500 Mbit Internet and several TB storage. arkiver can give you more details [14:28]
crusher_i wonder how hard it would be to redirect the path the warrior uses to cache the scraped files based on file size
My biggest bottleneck right now is my lack of RAID disks and / or that it's a pretty slow drive
so i was thinking of using a ramdisks to cache all the small files and selectively throw large ones to disk
[14:35]
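crusher_'s idea amounts to a size-based routing rule; a minimal sketch, where the threshold and both paths are made-up values, not anything the warrior actually uses:

```python
# Hypothetical cutoff: files at or below this size go to the ramdisk.
RAMDISK_THRESHOLD = 1 * 1024 * 1024  # 1 MiB

def choose_cache_dir(file_size,
                     ramdisk="/mnt/ramdisk",
                     disk="/var/cache/warrior"):
    """Route small files to a ramdisk and large ones to the slow disk."""
    return ramdisk if file_size <= RAMDISK_THRESHOLD else disk
```

The hard part in practice (as the discussion below notes) is that item sizes vary wildly, so the ramdisk must be provisioned for outliers, not the average.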
JAASo according to the Tilt API, the US and Australia are states, Canada and the UK are provinces, and Ireland's a county (no, not "country"). ¯\_(ツ)_/¯ [14:39]
crusher_lol [14:41]
MrRadarcrusher_: if you're referring to FOS, that actually wouldn't help too much. FOS runs the "megawarc factory" which combines individual items together into "megawarcs", so when there are projects with tons of small files it uses pretty much all the I/O resources it can [14:41]
crusher_I see.
on average how big can those file-balls get?
[14:41]
MrRadar40 GB is the usual size
Some projects with extra-large items use 80 GB
[14:42]
crusher_*spits cereal* that's what a warrior spits out per thread?
that doesn't sound right...
[14:42]
yipdwno [14:42]
MrRadarNo, that's the size of a megawarc
Individual items can range from the KB to a dozen or so GB depending on the project
[14:43]
***bitBaron has joined #archiveteam-bs [14:43]
crusher_that makes more sense [14:44]
MrRadarYou can get a sense of it from here: http://fos.textfiles.com/pipeline.html
The "inbox" is the items waiting to be megawarc'ed
The outbox are megawarcs waiting to be uploaded to IA
Another useful page here lists the items as they are uploaded: http://fos.textfiles.com/ARCHIVETEAM/
[14:44]
crusher_interesting
so on the client side, a ramdisk could be useful for small files provided the connection isn't saturated, correct?
[14:47]
JAANice. I've never seen that before. [14:48]
MrRadarFrom what I understand, the data is saved to a temporary file, gzipped, and then concatenated on to the end of the result WARC file
So I'm not sure how much a ramdisk would help, especially since Linux in general has a very good disk cache
(As long as it has enough free memory)
[14:49]
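MrRadar's description matches how gzipped WARCs work: each record is compressed as an independent gzip member, so appending a record is plain byte concatenation. A toy illustration of that property (not the megawarc tooling itself):

```python
import gzip
import io

def gzip_member(data: bytes) -> bytes:
    """Compress one record as a standalone gzip member."""
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode="wb") as f:
        f.write(data)
    return buf.getvalue()

# Concatenating members yields a stream gzip can still read end to end,
# which is why records can be appended to a WARC without recompressing it.
combined = gzip_member(b"record one\n") + gzip_member(b"record two\n")
```

Decompressing `combined` yields both records back to back, since gzip readers consume members sequentially.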
crusher_so in essence, i should allocate more ram to the warriors and let them do their thing
whatever it takes to give the poor HDD some breathing room
[14:51]
MrRadarYes, it's worth a try. I think it does flush the file out to disk when it's done downloading it but when it gets read back it should be reading from the cache [14:51]
crusher_right now it's getting hammered to 100% with non-sequential I/O
probably because it's running 10 warriors but....
(shhh details)
[14:52]
***odemg has quit IRC (Read error: Operation timed out) [14:55]
crusher_looking at the ram usage, i can see of the 400MB allocated, it's only using between 60-82 Megs
(400 each)
[15:07]
MrRadarCheck out what top says inside the VM [15:08]
***bitBaron has quit IRC (Quit: My computer has gone to sleep. ZZZzzz…)
odemg has joined #archiveteam-bs
[15:12]
JAAcrusher_: You can also try running the scripts directly to reduce the overhead. [15:18]
MrRadarYeah, having 1 kernel schedule the I/O for everything would probably do a better job than 10 kernels that aren't aware of what the others are doing [15:18]
crusher_yeah... [15:20]
Kazscrolling, catching up [15:21]
***odemg has quit IRC (Read error: Operation timed out) [15:22]
Kazcrusher_: ram disk won't really provide any benefit
I've seen projects where we've maxed out multiple gigabit links constantly; you wouldn't be able to get it into ram, megawarc it, then offload quickly enough
[15:22]
***odemg has joined #archiveteam-bs [15:22]
crusher_i'm talking client (warrior) side [15:23]
Kazah, warrior side I've run stuff in ram before
as long as you run under capacity - knowing that some items will obviously be a long way outside the average, and cause issues
[15:23]
MrRadarFor example, items for Eroshare have *huge* variation in size. From a few MB to 15+ GB [15:24]
crusher_that's what i'm currently using most of the warriors on [15:25]
MrRadarBut even for projects like Yahoo Answers which is mostly in the neighborhood of 100 MB per item I've had a few GB-sized items [15:25]
crusher_it seems to be the only current project that isn't 100% saturated or done
i'd help with newsgrabber, but the warrior seems to freeze and do nothing on that one...
[15:25]
Kazhmm
what does the webui show when it's frozen?
[15:28]
crusher_current project screen is blank [15:28]
Kazoh hm [15:28]
crusher_available shows it is working on newsgrabber [15:28]
Kaznewsgrabber has a ton of requirements, possible that the warrior install script doesn't get it all [15:29]
crusher_hmm. [15:29]
so if i was to run them in the host OS, how difficult of a process would that be [15:34]
Kaznot too much work, all the setup instructions are in the git repo [15:35]
***LastNinja has quit IRC (Read error: Operation timed out) [15:38]
crusher_dumb question, but there are 12 pages of projects... Which one should i be looking for? [15:44]
JAAFor people using OpenVPN: https://guidovranken.wordpress.com/2017/06/21/the-openvpn-post-audit-bug-bonanza/
crusher_: I'd start from the wiki homepage, where the currently active projects are listed.
[15:44]
crusher_ah, so there's no way to run them warrior-like in the host
it's all manual?
[15:45]
JAANot sure what you mean by "all", but yes, more things are manual than in the warrior. The code doesn't update automatically, for example, and there is no "ArchiveTeam's Choice" equivalent for scripts. [15:46]
crusher_right [15:46]
JAAFor most projects, you clone the corresponding git repository and execute something like run-pipeline pipeline.py --concurrent N NICK
Once you have the dependencies installed, that is.
You may also want to use --disable-web-server, depending on your setup.
[15:47]
crusher_if you had a spare i5 with 8Gigs of ram and a 300 / 300 internet connection, what would you run? [15:48]
MrRadarIf you run multiple pipelines and you *do* want to run the web UI make sure you assign each pipeline a different port
I'd run URLTeam, Yahoo Answers, Eroshare, and maybe Imzy (though I suspect that needs a script update)
[15:48]
crusher_how many concurrent runs for each?
this machine is 100% available
[15:50]
JAAI'm running 10 on URLTeam and 3 on Yahoo Answers. [15:51]
Kazgithub.com/archiveteam/newsgrabber-warrior [15:51]
JAAYahoo bans pretty quickly if you go too fast. [15:51]
crusher_something i noticed with the urlteam on warrior was that it was constantly running out of tasks to do [15:52]
JAAImzy was running fine with 6 concurrent threads before, but I haven't tried again since the latest updates. [15:52]
Kaznewsgrabber will never run out of jobs, that's part of the fun [15:53]
victorbjeis the "time left" the time until the service shuts down or until all items are done at current speed?
in the web ui, top right
[15:59]
***bitBaron has joined #archiveteam-bs [16:00]
MrRadarUntil the service shuts down
Though they sometimes stay up for hours or even days past their official shutdown time
[16:01]
crusher_or shut down sooner than they said they would [16:02]
MrRadarOr occasionally they whitelist us to let us access the service after the official shutdown [16:02]
victorbjeall right, thanks [16:07]
..... (idle for 20mn)
timmcSpeaking of which, I heard from weffey that Imzy will probably go dark at *around* 06:00 UTC on 2017-06-23, depending on other scheduling constraints. [16:27]
crusher_well, either the script broke or there's nothing left to grab [16:28]
MrRadarcrusher_: The Imzy script needs an update to ignore the ?check=true URLs and then a requeue
I'm sure arkiver will get around to it when he has time
[16:28]
............ (idle for 55mn)
JAAhttp://sarahcandersen.com/post/162085779429 ;-) [17:24]
MrRadarAn interesting sci-fi story on that subject from the author of the story Arrival was based on: http://subterraneanpress.com/magazine/fall_2013/the_truth_of_fact_the_truth_of_feeling_by_ted_chiang
It's food for thought
[17:30]
..... (idle for 22mn)
***jrwr has joined #archiveteam-bs [17:53]
kisspunchIs there a tool to download wordpress sites? (other than wget) [17:58]
Froggingwhat would such a tool do that wget doesn't do? [18:05]
schbiridwpull :P [18:09]
kisspunchsame sort of thing as tumblr tools I've seen. a big one is dealing with pagination changing between crawls and tags. any kind of parsing (here are the comments, here is the post, here are the tags) would be extra. i'd also be happy being pointed at a good wget config, though.
so, if I visit a site again after 1 blog post, I'd rather not download the ENTIRE blog worth of index, which is what will happen right now
[18:09]
Frogginghmm, good point [18:13]
kisspunchThis is actually a problem with lots of stuff, not just wordpress :( [18:15]
...... (idle for 29mn)
***SHODAN_UI has quit IRC (Remote host closed the connection) [18:44]
jrwrWas watching the slides on how Jason got sued for two billion dollars, found this on the IA https://archive.org/details/ModeleskiCompOrder
Pretty much he has been marked insane by the courts
[18:48]
RedTypeFounder Paul Andrew Mitchell, an advanced systems development consultant for 35 years, has spent the past sixteen years since 1990 A.D. doing a detailed investigation of the United States Constitution, federal statute laws, and the important court cases.
AD
[18:49]
MrRadarLOL [18:49]
jrwrYep
I'm reading the whole thing now
holy shit
the top of page 3 is fucking gold
GOLD
[18:50]
MrRadarAnother "entertaining" "sued by an insane idiot" story is the time "game studio" Digital Homocide sued Jim Sterling for $10M+ for trashing one of their garbage shovelware games: https://www.youtube.com/watch?v=qS-LXvhy1Do [18:52]
jrwrSketchCo1: This guy is a hoot [18:52]
timmc"Defendant Mitchell shall undertake formal competency restoration procedures at a qualified federal medical center" <-- what does *that* mean? [18:54]
***kisspunch has quit IRC (Quit: ZNC - http://znc.in) [18:54]
MrRadarIn hindsight, I'm sure it was frustrating for him to deal with this baloney :/
When the case was ongoing
[18:54]
jrwrRight [18:54]
***kisspunch has joined #archiveteam-bs [18:54]
timmcI feel bad for both parties but for different reasons. [18:55]
jrwrhe was deemed a "Mass Mailer" by the courts as well
https://www.plainsite.org/dockets/1z7yzelvr/washington-western-district-court/usa-v-modeleski/
my god
there is so much content
[18:55]
***powerKitt has joined #archiveteam-bs [18:55]
jrwrwait that whole site links back to the IA
thats interesting
[18:55]
powerKittIA spat out an rsync error trying to transfer the files for something I uploaded via torrent, and I stupidly deleted the files off my hard drive since I thought it was done. Is there anyway they can be recovered from a backup on IA's end?
https://catalogd.archive.org/log_show.php?task_id=682448038&full=1
[18:56]
kisspunchok wait re: the topic we found a way to scrape arbitrary dominos orders by enumerating urls... [19:03]
jrwrwat [19:03]
xmclol [19:04]
MrRadarLOL [19:04]
kisspunchyeah we were trying to automate ordering pizza like sensible programmers and typo-d something, and got someone else's order? [19:04]
jrwrmake a warrior
it will be good data for the future
[19:04]
crusher_find the pizza order and time that gets it to you the fastest [19:05]
kisspunchHmm so re: valhalla, I don't feel like the approach will work, because it ultimately needs you to run some weird VM and it's a pain. Would anyone object if I wrote a (compatible) Windows program? [19:11]
jrwrfor news grabber? [19:11]
kisspunchjrwr: for IA.BAK
It's definitely trading off total space available and reliability of that space
But I feel like increasing redundancy can compensate for worse reliability? It's not clear, transfers aren't free if anyone has numbers to plug in
Also there are decent arguments against writing a 'compatible' program
I'm thinking here of the success of things like @Home, which is something like "double click to install, press OK" and then it runs forever across reboots by default
[19:11]
jrwrnot a bad idea kisspunch
spreading the love is key
I though it really just used git + some magic
[19:17]
kisspunchI thought it was using git-annex [19:18]
jrwrYa [19:18]
kisspunchAnyway yes step 2 is writing the program [19:18]
jrwrYep [19:18]
kisspunchI wanted to sound out peeps for whether they will object even once my program works though :) [19:18]
jrwrNo, We always love anything new
just don't expect much support besides the basics
[19:18]
kisspunchThat's totally fine [19:19]
jrwrI approve, but I do suggest making it cross plat as well [19:19]
kisspunchI generally like that sort of thing, but any particular reason? [19:19]
***icedice has quit IRC (Ping timeout: 260 seconds) [19:20]
.... (idle for 18mn)
schbiridi am a geo guy, if you want a map
of those pizzas
[19:38]
.... (idle for 19mn)
***ruunyan has quit IRC (Read error: Operation timed out)
ruunyan has joined #archiveteam-bs
[19:59]
.... (idle for 15mn)
crusher_how many machines does mundus2018 have... [20:15]
***powerKitt has quit IRC (Quit: Page closed)
schbirid has quit IRC (Quit: Leaving)
kisspunch has quit IRC (Quit: ZNC - http://znc.in)
kisspunch has joined #archiveteam-bs
Jonison2 has joined #archiveteam-bs
Jonison has quit IRC (Ping timeout: 260 seconds)
[20:25]
..... (idle for 21mn)
Jonison2 has quit IRC (Quit: Leaving)
SHODAN_UI has joined #archiveteam-bs
[20:52]
.... (idle for 19mn)
_Crusher_ has joined #archiveteam-bs
crusher_ has quit IRC (Quit: Page closed)
_Crusher_ is now known as crusher
[21:12]
Jonison has joined #archiveteam-bs [21:18]
MrRadarCould someone with a Japanese IP address help me grab a file? It's geoip filtered for some reason.
File is here: http://dambo.mydns.jp/uploader/giga/file/GigaPp8347.wav.html
"Password" is YM1980BD
[21:20]
***crusher2 has joined #archiveteam-bs [21:21]
crusherCan you repost that link again [21:21]
MrRadarhttp://dambo.mydns.jp/uploader/giga/file/GigaPp8347.wav.html
There's a copy on YouTube but I'd prefer to get the original uncompressed version if possible
[21:21]
crusherI'll have it in.... About half an hour [21:24]
MrRadarThanks!
"Password" is YM1980BD in case you missed that too
[21:24]
crusherI saw that, just was hoping to avoid typing the url into the browser :P [21:25]
...... (idle for 27mn)
***crusher has quit IRC (Ping timeout: 492 seconds) [21:52]
Crusher has joined #archiveteam-bs [21:59]
SHODAN_UI has quit IRC (Remote host closed the connection)
Crusher_ has joined #archiveteam-bs
Crusher has quit IRC (Read error: Connection reset by peer)
[22:08]
Jonison has quit IRC (Ping timeout: 260 seconds) [22:19]
..... (idle for 22mn)
FroggingImgur has started redirecting direct links on the desktop [22:41]
MrRadarE.g. from i.imgur.com/blah to imgur.com/blah ? [22:47]
***sheaf has joined #archiveteam-bs [22:55]
Froggingsort of... I think it's more of a server-side rewrite http://www.fastquake.com/images/screen-imgurredir-20170621-183414.png [22:57]
***sheaf has quit IRC (Remote host closed the connection) [22:58]
timmcRedirecting from direct image view to image embedded in page? [22:59]
Froggingyes [22:59]
timmcYeah, they've been trying to get away from being hotlink-friendly. [22:59]
Froggingit's concerning to me [23:00]
timmcRunning an image host is basically a sucker's game. I'm surprised they've lasted this long, honestly. [23:00]
Froggingsame
they might be on the way down
[23:00]
Crusher_Any idea why the urlte.am warrior likes to report "no items available"
It seems like that would be something you'd expect to have loads available
[23:03]
Froggingjust wait, you'll get items
the tracker doesn't generate items as fast as people take them
[23:06]
crusher2i do, it's just that my machine is outpacing it
oh i see
so in other words, pick a different project, this one's covered
right?
[23:06]
FroggingI'm not sure [23:10]
crusher2Is the vine one still shut down? [23:10]
Frogginglooks like it http://tracker.archiveteam.org/vine/ [23:11]
arkiverI requeued imzy
and also queued posts
[23:11]
FroggingI've been told though that doing URLTeam is useful. I don't know the details of how the tracker works or where the bottleneck really is
I just know that with high concurrency it often can't get items
but it still runs most of the time
[23:11]
crusher2arkiver: i'm still getting the same Server returned 0 error [23:13]
MrRadararkiver: Did you see the comments from earlier about ignoring ?check=true URLs? [23:14]
crusher2how would i go about doing that?
and no not really
[23:15]
arkiverMrRadar: we're already skipping those if a 206 is received
I tested it and it really should work
[23:15]
MrRadarOK. I'll make sure my scripts are fully updated [23:15]
arkiveryeah, server is being a little hammered right now
I'm making an update though to skip some URLs
[23:15]
crusher2all my threads are sleeping from the error
aside from a couple that say they are being limited
[23:19]
arkiveror maybe there's something wrong with your connection
mine do connect
[23:20]
crusher2hmm.
i can do eroshare and urlte.am just fine
[23:20]
arkiverI have updated imzy [23:21]
crusher2i see it
am i the only one with a bunch of batch scripts to do basic control over the warriors? xD
odd...
arkiver: Nope. Still the same post reboot
[23:22]
MrRadarDamn, I just came up with the perfect Imzy channel name: #thelasyimzy (reference to the 2007 film The Last Mimzy)
*#thelastimzy
[23:28]
Frogginghaha that's good [23:28]
MrRadarOf course the project is nearly done at this point [23:28]
crusher2well, i can still *connect* to their site, so i'm not blocked or anything
the only errors im getting are a pair of 422 on their splash page for some gifs
i'm loading up wireshark to see if that tells me anything
it must be something on my end, there are other warriors getting through...
[23:29]
MrRadarMan, the huge items from Eroshare are really blocking up FOS's rsync connections. [23:33]
crusher2i've still got two that are going to take another hour [23:34]
***lucysun has joined #archiveteam-bs [23:40]
lucysuncan someone help me find archives of aol forums or chat rooms from 1995 and before - does this even exist? [23:41]
DFJustinlucysun: archive team downloaded a bunch of the file collections from aol groups, some of them have logs I think https://archive.org/search.php?query=subject%3A%22aol+files%22&page=2
er https://archive.org/search.php?query=subject%3A%22aol+files%22
[23:45]
crusher2arkiver: i still have to narrow it down to see if these packets are for the imzy warriors or not,
but i'm getting loads of FCS errors from an ip that points right at archive.org
specifically a map telling me how many books were scanned in the last 12 hours.
would you like a short packet capture?
[23:52]
***antomatic has quit IRC (Read error: Operation timed out) [23:57]
