#archiveteam-bs 2017-06-12,Mon


***j08nY has quit IRC (Quit: Leaving)
Ravenloft has joined #archiveteam-bs
[00:17]
...... (idle for 25mn)
tsp_Going through old stuff, I have about 200 Offline Explorer backup files containing sites that no longer exist on the live web (I haven't looked at wayback/reocities yet or tried to track them down if they moved). Anyone want these? [00:44]
joepie91tsp_: the answer is always yes :) you can just upload them to an item on archive.org
tsp_: ideally with as much information as you can remember about how you created them
[00:58]
jrwrif you need any help with it tsp_ let me know [01:08]
arkiverthe project for imzy is started [01:12]
jrwrthe warrior? [01:13]
arkiveryes
https://tracker.archiveteam.org/imzy/
FOS is the target
[01:13]
jrwrif it gets bogged down you can use jrwr.io [01:14]
arkiverthanks [01:14]
jrwrwheres the github [01:14]
arkiverwill keep that in mind
imzy-grab
[01:14]
jrwryou have free use of that server until the 6th of the month
then it expires
[01:14]
arkiversounds good [01:14]
jrwrwat
Resource Limit Is Reached
on the AT wiki
[01:14]
***wp494 has joined #archiveteam-bs [01:15]
jrwrSpinning up Virtual Machines [01:17]
arkivergreenie: we have started
with archiving imzy
[01:17]
jrwr20 VMs about to land [01:17]
arkiverwell, the site is a little slow
they'll mostly be limited by the tracker
[01:18]
jrwrjrwr turns off 15 of them [01:18]
MrRadarIs there a channel for Imzy? [01:18]
jrwrno
we can all hang out in here :p
[01:19]
MrRadarWhat's the recommended concurrent? [01:19]
jrwr6 [01:19]
MrRadarOK. I'll start there [01:19]
arkiverarkiver is afk
ping me if you see anything strange
[01:19]
***tfgbd_znc has quit IRC (Read error: Operation timed out) [01:20]
tsp_Thanks.
jrwr: Can I Upload them under one item? If so, what identifier do I use?
[01:32]
jrwropensource
web
and just use a name others can use to find it
[01:34]
tsp_I don't know how people will find it, it's about 200 individual files with different sites in them. [01:36]
jrwrare they named well?
the files
like what domain they came off of
[01:37]
tsp_Not really [01:37]
jrwrHrm
"Collection of archived websites"
and just upload them
[01:38]
tsp_tsp_ nods [01:39]
jrwrany details you can put in the desc helps [01:39]
tsp_I can extract a list of base URLs from most of them, there's a .dat inside the archives which can tell me where they came from. But it's more difficult to extract what settings I used to archive them. [01:39]
jrwrif you can prepare the list of domains/URLs and put them in the desc
they will be indexed so if someone goes looking for it
[01:40]
tsp_tsp_ nods [01:41]
jrwryou can start the upload now if you want
you can edit the desc after the fact
[01:41]
tsp_Ok. How do I start uploading, the ia command line utility? [01:42]
jrwrthere is [01:42]
tsp_Can I also rename and delete files after the fact? Some of these might not be dead. [01:42]
jrwrYou can
but I don't suggest it, since its always good to have some kind of backup in case
http://internetarchive.readthedocs.io/en/latest/cli.html
the web interface does work pretty well
[01:43]
tsp_I can also upload another 200 live ones ripped from the same time, I just separated out the dead ones a while ago. [01:45]
jrwrnever hurts [01:45]
tsp_I don't think the web interface lets me upload hundreds of files in bulk, though. [01:45]
jrwrno, it does not like that
the ia cli command does
[01:46]
tsp_I still want to keep live and dead separated, I'll focus on the dead ones since they're the most important then figure out what to do with the rest. [01:47]
jrwrYep
if you like saving sites, I do suggest warcproxy
https://github.com/internetarchive/warcprox
[01:47]
tsp_I forgot about that. I download fanfiction sites sometimes in bulk, will it work for, say, 10 GB of data next run? [01:48]
jrwrif you are browsing websites
that proxy saves it in a way that will flat out import into wayback machine
I wonder if there is an effort to save onion sites to IA
[01:48]
xmci don't know of one, but i would very much approve [01:51]
jrwrI would use a fork of archivebot
since it has all the ignore lists
[01:55]
tsp_I'll just call it MiscWebsiteRips or something. [02:04]
jrwr: I guess I want texts or data (probably data since these are archives) and not web, since they're not warc files? [02:18]
jrwrits still from the web
so its best if you put them under web and community texts
works the best really
[02:19]
tsp_Oh, so ia upload MiscWebsiteRips *.boe --metadata=mediatype:web
How would I get it under opensource?
[02:27]
Somebody2tsp_: it will default to being in the opensource collection, I think. [02:31]
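(A minimal sketch of the bulk-upload invocation being discussed. The identifier, the `*.boe` file pattern, and the metadata flags are the hypothetical ones from this conversation; this just builds the command string, it does not talk to archive.org:)

```python
import glob
import shlex

def build_ia_upload_command(identifier, pattern, metadata):
    """Construct an `ia upload` command line for a batch of files.

    `metadata` is a dict of key/value pairs, each emitted as a
    --metadata=key:value flag as discussed above.
    """
    parts = ["ia", "upload", identifier]
    parts.extend(sorted(glob.glob(pattern)))
    for key, value in metadata.items():
        parts.append("--metadata={}:{}".format(key, value))
    return " ".join(shlex.quote(p) for p in parts)

cmd = build_ia_upload_command(
    "misc_website_rips", "*.boe",
    {"mediatype": "web", "collection": "opensource"})
print(cmd)
```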
tsp_Is there a preference to identifier names? MiscWebsiteRips vs misc_website_rips vs misc-website-rips? [02:35]
***pizzaiolo has quit IRC (pizzaiolo) [02:37]
Somebody2tsp_: not between those choices, no. Probably the underscore version is safest.
as there are file systems without case-sensitivity, and shell tools that grumble at dashes.
I'd personally recommend uploading each of the sites into *separate* items, rather than one, though.
That way you can have separate metadata values for each site
but uploading *anything* is much better than not.
[02:38]
tsp_Somebody2: Can I split them out later? Right now they're kind of a mess, and I'd have to go through and figure out when things were downloaded, which ones have images, only partial sites, etc
Oh, there's an ia move command, so I can.
[02:41]
xmci think that only moves files within items
but i'm not sure
[02:42]
Somebody2You absolutely should upload them as one item if it would take work to separate them.
tsp_:
There's nothing at *all* wrong with immediately uploading a "raw" item with the pile-of-bits you currently have ...
[02:43]
xmcoh, huh, it might let you move files from one item to another! [02:44]
Somebody2... and then later uploading cleaned up, split up items with better metadata later.
If/when IA wants to de-duplicate the data they store, they can and will do so -- we uploaders should not worry about it
(unless you are getting up into the tens of terabytes range)
[02:45]
jrwrxmc: archivebot is "interesting" to setup [02:46]
xmcheh i bet [02:47]
tsp_The dead collection is only 2.6 GB, so not much. [02:47]
Somebody2tsp_: Yeah, at 3 GB, don't hesitate to upload a combined version now and split ones later. [02:48]
jrwrxmc: Ill have it run in #TorArchiveBot
since I don't want to rewrite the commands so it can exist in #ArchiveBot
[02:53]
xmcsounds good to me! [02:53]
Somebody2jrwr: wait, why are you running a copy of archivebot? [02:55]
jrwrI want to make a Tor version of it
to archive tor websites and maybe other darknets
[02:55]
MrRadararkiver: I'm getting an infinite sequence of HTTP 206 responses for a URL for Imzy: 87=206 https://www.imzy.com/api/accounts/profiles/mcnulty?check=true [03:06]
.... (idle for 16mn)
Somebody2jrwr: cool. I'm a bit doubtful about how many darknet websites there are that are sufficiently public that I'd support archiving them into a public repository like archive.org
but it's good to have available
[03:22]
xmcidk, archive everything on .onion that you can touch [03:30]
Somebody2xmc: presumably not the entirety of facebook... :-) [03:35]
***gui7 has quit IRC (Read error: Operation timed out) [03:37]
xmcwell, sure [03:38]
..... (idle for 23mn)
FroggingThis is off-topic, but does anyone know what to do about an insect that is crawling around inside an LCD
Like behind the panel
I've turned it off hoping it'll leave, dunno what else can be done. Probably nothing short of taking it apart :/
[04:01]
........... (idle for 50mn)
***Sk1d has quit IRC (Ping timeout: 250 seconds) [04:53]
Yurume has quit IRC (Remote host closed the connection)
Sk1d has joined #archiveteam-bs
[04:58]
Yurume has joined #archiveteam-bs [05:07]
Lord_NighSomebody2: I'm hearing comments which imply "something" happened regarding the robots.txt thing i mentioned earlier, and none of the IA staff are allowed to talk about it [05:16]
.... (idle for 16mn)
xmcLord_Nigh: what kind of something [05:32]
Lord_Nighnot sure. just comments from ia staff on twitter saying they can't talk about it, when asked flat out
the faq was never updated to note that retroactive blocking is once again possible (and in fact expanded) either
[05:32]
xmcthe reveal of this has been completely bungled and i'm not very pleased [05:34]
Lord_Nighsee https://twitter.com/TheMogMiner/status/873950228994502658
themogminer got it wrong about the faq being revised; the faq was NEVER revised (or it wasn't 2 weeks ago, i didn't look since) and still implies that retroactive blocking isn't possible
[05:35]
the faqs page looks like it was updated, but is even less clear about removal and retroactive stuff than before
no, i'm wrong
its the same from 2 years ago:
"How can I have my site's pages excluded from the Wayback Machine?
You can send an email request for us to review to info@archive.org with the URL (web address) in the text of your message. "
before 2 years ago it was:
"How can I have my site's pages excluded from the Wayback Machine?
You can exclude your site from display in the Wayback Machine by placing a simple robots.txt file on your Web server.
Here are directions on how to automatically exclude your site. If you cannot place the robots.txt file, opt not to, or have further questions, email us at info at archive dot org.
If you are emailing to ask that your website not be archived, please note that you'll need to include the url (web address) in the text of your message. "
the "directions on how to automatically exclude your site" was a link to https://web.archive.org/web/20130606003203/http://archive.org/about/exclude.php
faq from 6/2013 is https://web.archive.org/web/20130606003203/http://archive.org/about/faqs.php
so the faq was never updated to restore the old verbiage, and the exclude.php is a 404 currently
[05:45]
FroggingI thought they were making the robots.txt handling more permissive, not less
at least that's what I got from people talking about it in here
[05:49]
Lord_Nighthe info on exclude.php in the archive isn't valid anymore either, since archive.org respects user agent of "*" now, it doesn't explicitly look for "ia_archiver" like it did 2 or 3 years ago [05:50]
Frogging(permissive meaning blocking fewer sites) [05:50]
Lord_NighFrogging: exactly! this is really mysterious
for 2 or 3 years it only respected the robots.txt that was present when the site was archived, and did not allow 'retroactive' blocking at all
i.e. if it archived on june 2012 and there was a robots.txt on that day blocking "ia_archiver" it was blocked in the archive
but if later or earlier the robots.txt didn't exist or didnt block "ia_archiver" it would happily let the site display
the behavior now is if robots.txt blocks any sort of robot by "*" user agent (and probably also "ia_archiver" though I have not tested that), and that file is currently live, it will block EVERY ARCHIVED VERSION OF THAT SITE, EVER
this is a major problem with malicious squatters, who have gone right back to abusing the system like they did 3 years ago
so this is why i am majorly concerned, especially since the https://twitter.com/TheMogMiner/status/873950228994502658 text implies there may be a legal/court order preventing IA staff from talking about it, let alone fixing it
[05:50]
xmcyes
i'm going to bed
[05:54]
Lord_Nighand since i haven't heard any noise on slashdot or ycombinator about this, i assume it is a sealed court order, or even possibly an NSL
which is even creepier
[05:55]
yipdwcould even be Majestic 12 influence
🤔
[05:56]
pikhqI'd be inferring something to do with legal system bustedness, that's for sure. [05:56]
yipdwlol that tweet [05:58]
Froggingreally doubt it's an NSL, I'm sure there are channels for that and they do not involve robots.txt [05:58]
yipdwinferring conspiracies from "not allowed to comment"
Warren Spector missed the mark on Deus Ex
[05:58]
pikhqYeah, I'm loving that they're inferring *anything* more involved than "something something legal system".
(hell, it might not even be a sealed court order, just a court order we're ignorant of, and "not allowed to comment" is more "IA's lawyers would hate me for saying anything")
(or a settlement, or, well. Anything, really. Not much to go off of.)
[05:59]
Lord_Nighsealed settlement would suck since that means it can never change again
I'm probably reading way too much into those comments
[06:01]
pikhqI'd not be confident reading much more into it than "there is probably a lawyer involved in this". [06:02]
.... (idle for 19mn)
tsp_I started uploading misc_website_rips with ia, but it stopped here: error uploading The Fanfic Vault.boe: We encountered an internal error. Please try again. - uploadItem.py: uploading id = misc_website_rips using contribSubmit!
I didn't see anything in the docs about what to do next, most importantly, will it resume if I run the same command again with the list of previously-uploaded files?
[06:21]
........ (idle for 37mn)
***SHODAN_UI has joined #archiveteam-bs [06:59]
schbirid has joined #archiveteam-bs [07:04]
.......... (idle for 46mn)
j08nY has joined #archiveteam-bs [07:50]
..... (idle for 21mn)
BlueMaxim has quit IRC (Quit: Leaving) [08:11]
............ (idle for 56mn)
SHODAN_UI has quit IRC (Remote host closed the connection) [09:07]
........... (idle for 53mn)
pizzaiolo has joined #archiveteam-bs [10:00]
........ (idle for 39mn)
j08nY has quit IRC (Read error: Operation timed out) [10:39]
......... (idle for 43mn)
SHODAN_UI has joined #archiveteam-bs [11:22]
......... (idle for 40mn)
j08nY has joined #archiveteam-bs [12:02]
.......... (idle for 49mn)
dcmorton has quit IRC (Quit: ZNC - http://znc.in) [12:51]
dcmorton has joined #archiveteam-bs [12:56]
.......... (idle for 49mn)
Odd0002 has quit IRC (Remote host closed the connection) [13:45]
.................... (idle for 1h35mn)
jrwrtsp_: correct it will resume
"If you think the Archive is a bad actor you are 12 pineapples short of a luau"
Word-Smithing at its finest
[15:20]
***BartoCH has quit IRC (Ping timeout: 260 seconds) [15:35]
BartoCH has joined #archiveteam-bs [15:46]
pizzaiolo has quit IRC (Quit: pizzaiolo) [15:52]
.... (idle for 19mn)
Pudsey has joined #archiveteam-bs [16:11]
.... (idle for 16mn)
bsmith093tapedrive: if i could make a suggestion, archive.org/details/fanfictiondotnet_repack much better version than one giant tar file. that was my bad. [16:27]
tapedriveI've actually downloaded that (the zip of the entire item) a few weeks ago, and I'll be integrating that into my archive and database - along with the 2012 scrapes - which will allow us to see how many stories we're actually missing overall.
And also thanks for doing that scrape, bsmith093, it's been a lifesaver to me many times
[16:30]
***Pudsey has quit IRC (Remote host closed the connection)
pizzaiolo has joined #archiveteam-bs
[16:39]
tuluu has joined #archiveteam-bs [16:48]
tuluu has quit IRC (Ping timeout: 260 seconds)
tuluu has joined #archiveteam-bs
[16:59]
mls_ has quit IRC (Read error: Connection reset by peer) [17:13]
godanehttps://archive.org/details/godaneinbox?sort=-publicdate&&and[]=subject%3A%22Sports%20Illustrated%22 [17:14]
***mls_ has joined #archiveteam-bs [17:14]
JAAbsmith093: Although you're right in that several smaller files are easier to handle, you can browse the contents of tar and zip archives on the IA and only download individual files from it. For example, https://archive.org/download/fanfictiondotnet_repack/Fanfiction_Q.zip/ (Don't attempt to do this with the larger zips though; my browser was not amused.) [17:25]
***Smiley has joined #archiveteam-bs [17:31]
MrRadararkiver: It looks like the Imzy project doesn't work for the Warrior VM since the www.imzy.com domain requires TLS 1.2 [17:34]
***Simpbrain has quit IRC (Ping timeout: 506 seconds) [17:35]
arkiverah... [17:37]
***pizzaiolo has quit IRC (Quit: pizzaiolo) [17:40]
...... (idle for 28mn)
icedice has joined #archiveteam-bs [18:08]
tsp_jrwr: My item is unavailable due to content issues, not sure why. It looks like i'll have to email IA to find out why, maybe it didn't like one of the files. [18:09]
***SHODAN_UI has quit IRC (Remote host closed the connection) [18:11]
PurpleSymtsp_: You can replace /details/ with /history/ in the item’s URL. Usually there is a reason in the logs. [18:23]
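(The URL trick PurpleSym describes, as a one-line sketch:)

```python
def history_url(details_url):
    """Swap /details/ for /history/ to reach an item's task log."""
    return details_url.replace("/details/", "/history/", 1)

print(history_url("https://archive.org/details/misc_website_rips"))
```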
tsp_PurpleSym: I'm there (http://archive.org/history/misc_web_rips). Doesn't seem like there are any tasks, though I can't tell if "server readonly -- tasks waiting for harddrive fix" is a task or not. [18:29]
PurpleSymThat item does not exist.
You probably meant misc_website_rips: https://catalogd.archive.org/log/682094076
[18:30]
tsp_Sorry, yeah, just noticed that. [18:32]
Antivirus got it. Is there anything I can do about it now that I know what the problem is?
PurpleSym: Thanks, didn't know about history.
[18:37]
MrRadarYour best bet is probably to create a new item without the offending data [18:37]
tsp_MrRadar: With the understanding that any new sites I add might disable the new item due to malware? [18:42]
MrRadarYou might want to switch to 1 item per site then [18:42]
tsp_The biggest issue with that is coming up with identifiers. [18:44]
MrRadarsitename_grabdate ? [18:45]
***tuluu has quit IRC (Remote host closed the connection)
tuluu has joined #archiveteam-bs
[18:45]
tsp_That might work, if i can get grabdates out of these files. [18:47]
.... (idle for 15mn)
JAAarkiver: We may want to reduce the Imzy rate limit a bit. I see tons of timeouts.
And 504s
[19:02]
..... (idle for 20mn)
Better now [19:22]
bsmith093tapedrive: i also just uploaded this one recently https://archive.org/details/Fanfictiondotnet1011dump
roughly 750k more stories, with a metadata db using the same schema
[19:30]
tapedrivebsmith093: I'm extracting that one now ;) (taking forever on a 256mb Raspberry Pi XD )
Thanks for your great work!
[19:30]
bsmith093im also doing fictionpress and ao3 in that format, but as a simple scrape like the others. Redundancy is good, though, and i only started this because it was really easy to give fanficfare a giant list. [19:32]
tapedriveI'm doing it in the same format as you to make it easy to merge them (which is what I'm doing now) and work out how many are actually missing [19:33]
bsmith093tapedrive: to save some time, 130k of the first million stories on ffnet are all thats left of the oldest group. [19:34]
tapedriveYeah - I have a mysql database of all of them now
Saying if they've been archived, deleted, or still need to be archived
And a script that runs looking at fanfiction.net for new stories and updates, and updates the database accordingly.
Scrapers query the database and get new stories to archive/rearchive
[19:34]
tsp_I have a fanfiction.net db I've been collecting for a while, kind of an oddball format but it can be parsed. If someone can generate a list of story ids that are missing, I can go through and check which ones I have. [19:42]
tapedriveIn a few days (hopefully) I'll have a list of all the ones that I don't have (which will include the 2012 scrape and bsmith093's multiple scrapes) [19:43]
nightpoolwhat's going on with fanfiction.net? [19:44]
tapedriveNothing, yet.
But as far as I can find out the site's run by either a single person or a small team, that never reply to emails, tweets or comments.
And every month the site has different problems, which sometimes get fixed, and more often don't.
[19:44]
tsp_I mostly collect Harry Potter, but a few others got in as well. [19:46]
nightpooloh, yeah, I mean, it's good that it's being archived.
I was just wondering if there was a proximate cause to discussion
[19:47]
tapedriveIt's only come up because my crawl has finished recently. Nothing's happened with the site [19:48]
***medowar has joined #archiveteam-bs [19:50]
wp494_ has joined #archiveteam-bs
luckcolor has quit IRC (Read error: Operation timed out)
dboard has quit IRC (Read error: Operation timed out)
luckcolor has joined #archiveteam-bs
Petri152 has quit IRC (Read error: Operation timed out)
dboard has joined #archiveteam-bs
JAA has quit IRC (Read error: Operation timed out)
Lord_Nigh has quit IRC (Read error: Operation timed out)
ndiddy has quit IRC (Read error: Operation timed out)
K4k has quit IRC (Read error: Operation timed out)
mundus20- has quit IRC (Write error: Broken pipe)
aschmitz has quit IRC (Read error: Operation timed out)
mhazinsk has quit IRC (Read error: Operation timed out)
jrwr has quit IRC (Read error: Operation timed out)
will has quit IRC (Read error: Operation timed out)
C4K3 has quit IRC (Read error: Operation timed out)
ZexaronS has quit IRC (Read error: Operation timed out)
Stilett0 has quit IRC (Read error: Operation timed out)
Stilett0 has joined #archiveteam-bs
K4k has joined #archiveteam-bs
PotcFdk has quit IRC (Read error: Operation timed out)
mundus201 has joined #archiveteam-bs
rocode has quit IRC (Read error: Operation timed out)
rocode has joined #archiveteam-bs
jrwr has joined #archiveteam-bs
ZexaronS has joined #archiveteam-bs
ndiddy has joined #archiveteam-bs
mhazinsk has joined #archiveteam-bs
Lord_Nigh has joined #archiveteam-bs
will has joined #archiveteam-bs
wp494 has quit IRC (Read error: Operation timed out)
PotcFdk has joined #archiveteam-bs
JAA has joined #archiveteam-bs
Petri152 has joined #archiveteam-bs
C4K3 has joined #archiveteam-bs
[19:55]
SHODAN_UI has joined #archiveteam-bs [20:14]
...... (idle for 25mn)
tsp_How should I turn a URL into an identifier? I can tell which page I started from, for example: http://home.att.net/~polliwog-press/pollistoryindex.htm
If I do some manual editing, I can get it to home.att.net__polliwog-press_2009-12-12, however I'm not sure if I should turn ~ into _ or remove it altogether to avoid the double _.
[20:39]
xmcarchivebot turns it into an item named after just the hostname and the date of capture
(and the pipeline)
so you could call it archiveteam-ondemand_home.att.net_2016-06-12 or whatever
[20:43]
tsp_There might be a few home.att.net rips in my collection, of different people. They're each in one .boe file. [20:44]
xmcarchivebot sometimes puts multiple crawl jobs in the same IA item, which isn't an issue [20:45]
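(A sketch of the naming scheme being worked out above, using tsp_'s home.att.net example. The sanitisation rules here are illustrative, not an IA requirement; non-identifier characters such as `/~` collapse to a single underscore, which sidesteps the double-underscore question:)

```python
import re
from urllib.parse import urlparse

def make_identifier(start_url, grab_date):
    """Build a sitename_grabdate-style item identifier from a start URL.

    Anything outside [a-z0-9._-] collapses to one underscore, so
    home.att.net/~polliwog-press -> home.att.net_polliwog-press.
    """
    parsed = urlparse(start_url)
    path = parsed.path.rsplit("/", 1)[0]  # drop the start page itself
    raw = (parsed.netloc + path).lower()
    sanitized = re.sub(r"[^a-z0-9._-]+", "_", raw).strip("_")
    return "{}_{}".format(sanitized, grab_date)

print(make_identifier(
    "http://home.att.net/~polliwog-press/pollistoryindex.htm",
    "2009-12-12"))
```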
***th1x has joined #archiveteam-bs [20:46]
tsp_I tried to put all my backups in one item, but the antivirus scanner found something and killed the entire item instead of the one file. [20:46]
xmchuh
email info@archive.org and tell them it's a false positive
[20:47]
tsp_Maybe it's not. But I'd rather just have that one file excluded rather than the entire thing.
I'll email, thanks.
[20:47]
xmc: But as someone else said, it might be a better idea to upload all of them as individual items. But I'm finding that kind of difficult to work with atm. [20:52]
***schbirid has quit IRC (Quit: Leaving) [21:05]
Lord_Nighhttps://archive.org/about/faqs.php <- some of the content removal requests in the forum at the bottom of that page are weird... one page was about what looks like a foreign-language 'carfacts' equivalent site for a specific vehicle, which was not actually taken down...
guessing someone was trying to hide some prior damage or something
and the archive.org admins wisely decided to ignore the request
huh, this might be related to the robots.txt thing: https://archive.org/post/1074464/robotstxt-processing-failure although that might just be 'collateral damage' from the recent change and not a bug?
i.e. if www.example.com and example.com serve different robots.txt files, both will use the one from www.example.com
[21:07]
............ (idle for 55mn)
***SHODAN_UI has quit IRC (Remote host closed the connection)
wp494_ is now known as wp494
[22:04]
Gilfoyle has quit IRC (Read error: Operation timed out) [22:14]
......... (idle for 40mn)
BlueMaxim has joined #archiveteam-bs [22:54]
.... (idle for 19mn)
nyany has quit IRC (Remote host closed the connection) [23:13]
jrwryipdw: you around by any chance [23:27]
***nyany has joined #archiveteam-bs [23:28]
...... (idle for 28mn)
dashcloud has quit IRC (Read error: Operation timed out) [23:56]
