#archiveteam-bs 2017-12-16,Sat


Who | What | When
***JAA sets mode: +bb Ya_ALLAH!*@* *!*@185.143.4* [01:03]
......... (idle for 42mn)
CoolCanuk has joined #archiveteam-bs [01:45]
.... (idle for 16mn)
DopefishJ is now known as DFJustin [02:01]
odemg has quit IRC (Ping timeout: 250 seconds)
odemg has joined #archiveteam-bs
[02:10]
pizzaiolo has quit IRC (pizzaiolo) [02:21]
......... (idle for 44mn)
ZexaronS has quit IRC (Read error: Operation timed out)
Stilett0 has quit IRC (Read error: Operation timed out)
[03:05]
ndiddy has joined #archiveteam-bs [03:16]
Stilett0 has joined #archiveteam-bs
Stilett0 is now known as Stiletto
[03:21]
.... (idle for 17mn)
robogoatSomebody2: Ok, can you explain something, regarding darking and crawl history?
It is possible for someone to own domain A, have content on domain A, be fine with IA archiving it,
and then sell/let the domain lapse, and the subsequent owner/squatter puts up a prohibitive robots.txt
It's my understanding that the material is then non-accessible,
is it "darked"?
[03:38]
...... (idle for 26mn)
***qw3rty111 has joined #archiveteam-bs
qw3rty119 has quit IRC (Read error: Operation timed out)
[04:05]
CoolCanuk has quit IRC (Quit: Connection closed for inactivity) [04:14]
vantecBeen out of the loop for a bit, but think this mostly still stands: https://blog.archive.org/2017/04/17/robots-txt-meant-for-search-engines-dont-work-well-for-web-archives/ [04:21]
.............. (idle for 1h7mn)
***bithippo has quit IRC (Read error: Connection reset by peer) [05:28]
.... (idle for 16mn)
wacky has quit IRC (Read error: Operation timed out)
wacky_ has joined #archiveteam-bs
[05:44]
.............. (idle for 1h6mn)
Somebody2robogoat: First of all, note that I'm not employed by IA, and in fact have only visited once. This is just an outside, curious onlooker's view.
With that out of the way -- dark'ing applies to *items* (a jargon term) on archive.org, not web pages in the Wayback Machine.
The Wayback Machine is an *interface* to a whole bunch of WARC files, stored in various items on IA.
The WARC files are what actually contain the HTML (and URLs, and dates) that are displayed through the Wayback Machine.
If an item is darked, it can't be included in the Wayback Machine (or any other interface, like the TV News viewer, or the Emularity).
[06:50]
***Stiletto has quit IRC (Ping timeout: 250 seconds) [06:56]
Somebody2So none of the items containing WARCs that contain web pages visible through the Wayback Machine are darked (unless I'm missing something).
However -- that doesn't mean you can download the WARC files yourself, directly (with some exceptions).
Most of the items containing WARCs used in the Wayback Machine are "private", which means while you can see the file names, and sizes, and hashes, ...
... you can't actually download the actual files without special permission (which the software that runs the Wayback Machine has).
The WARCs produced by ArchiveTeam generally are *NOT* private -- although we prefer not to talk about this too loudly, to avoid people complaining.
A recently added feature of the Wayback Machine provides links from a particular web page to the item containing the WARC it came from.
Actually, it looks like it only links to the *collection* containing the item containing the WARC, sorry.
So, now, to get back to robots.txt -- the Wayback Machine does (currently) include a feature to disable access to URLs whose most recent robots.txt file Disallows them.
The details of exactly how this operates (i.e. which Agent names does it recognize, how does it parse different Allow and Disallow lines, ...
... what does it do if there is no robots.txt file) are subtle, changing, and undocumented.
And robots.txt files do *NOT* apply to themselves, so you can always see the contents of all the robots.txt files IA has captured for a domain.
(unless there was a specific complaint sent to IA asking for the domain to be excluded, which they also honor)
But the robots.txt logic doesn't apply at ALL to the underlying items -- so *if* you can download them, you can still access the data that way.
Hopefully that answers the question.
(and sorry everyone else for the literal wall of text)
There is also a robots.txt feature included in the Save Page Now feature, but that's a separate thing.
[06:56]
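Somebody2's point that robots.txt captures themselves stay visible is easy to check against the public Wayback CDX API. A minimal sketch (not from the chat; example.com is a stand-in domain) that lists the captures IA holds of a domain's robots.txt:

```python
# Minimal sketch: list Wayback Machine captures of a domain's robots.txt via
# the public CDX API. "example.com" is a placeholder domain.
import json
import urllib.request

CDX = "https://web.archive.org/cdx/search/cdx"
query = "?url=example.com/robots.txt&output=json&fl=timestamp,statuscode,digest&limit=10"

with urllib.request.urlopen(CDX + query) as resp:
    rows = json.load(resp)

# With output=json the first row is the field-name header, the rest are captures.
for timestamp, status, digest in rows[1:]:
    print(timestamp, status, digest)
```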
***bwn has quit IRC (Read error: Connection reset by peer) [07:22]
.... (idle for 15mn)
DFJustin has quit IRC (Remote host closed the connection) [07:37]
DFJustin has joined #archiveteam-bs
swebb sets mode: +o DFJustin
bwn has joined #archiveteam-bs
[07:44]
.... (idle for 17mn)
ZexaronS has joined #archiveteam-bs
mr_archiv has quit IRC (Quit: WeeChat 1.6)
mr_archiv has joined #archiveteam-bs
mr_archiv has quit IRC (Client Quit)
mr_archiv has joined #archiveteam-bs
[08:01]
.......... (idle for 49mn)
Mateon1 has quit IRC (Read error: Connection reset by peer)
Mateon1 has joined #archiveteam-bs
[08:54]
.... (idle for 16mn)
MrDignity has quit IRC (Remote host closed the connection)
MrDignity has joined #archiveteam-bs
[09:11]
............. (idle for 1h1mn)
schbirid has joined #archiveteam-bs [10:12]
nyany has quit IRC (Leaving) [10:17]
..... (idle for 23mn)
JAA sets mode: +bb BestPrize!*@* *!pointspri@* [10:40]
MrDignity has quit IRC (Remote host closed the connection)
MrDignity has joined #archiveteam-bs
[10:53]
.................. (idle for 1h26mn)
kimmer12 has joined #archiveteam-bs
BlueMaxim has quit IRC (Quit: Leaving)
schbirid has quit IRC (Quit: Leaving)
kimmer1 has quit IRC (Ping timeout: 633 seconds)
[12:19]
...... (idle for 28mn)
dashcloud has quit IRC (No Ping reply in 180 seconds.)
dashcloud has joined #archiveteam-bs
[12:54]
Stilett0 has joined #archiveteam-bs [13:00]
.......... (idle for 46mn)
dashcloud has quit IRC (Read error: Operation timed out)
dashcloud has joined #archiveteam-bs
[13:46]
.......... (idle for 45mn)
Specular has joined #archiveteam-bs [14:33]
Specularis there any known way of converting Web Archive files saved from Safari to the MHT format? [14:34]
.... (idle for 18mn)
***godane has quit IRC (Quit: Leaving.) [14:52]
.... (idle for 17mn)
kimmer1 has joined #archiveteam-bs
kimmer12 has quit IRC (Ping timeout: 633 seconds)
[15:09]
..... (idle for 23mn)
kimmer12 has joined #archiveteam-bs
kimmer12 has quit IRC (Remote host closed the connection)
[15:35]
kimmer1 has quit IRC (Ping timeout: 633 seconds) [15:42]
..... (idle for 24mn)
Specularsomehow my earlier search queries were too specific, and I just found this. Mac only but will test later. https://langui.net/webarchive-to-mht/
oh it's commercial. Typical Mac apps, ahaha.
[16:06]
........... (idle for 51mn)
***Specular has quit IRC (Quit: Leaving) [16:58]
pizzaiolo has joined #archiveteam-bs [17:09]
......... (idle for 43mn)
ola_norsk has joined #archiveteam-bs [17:52]
ola_norskhow might one go about 'archiving a person' on the internet archive? I'm thinking of the youtuber Charles Green, a.k.a. 'Angry Grandpa' [17:54]
***kimmer1 has joined #archiveteam-bs [18:04]
Somebody2ola_norsk: You can't archive a person. But you could archive the work they have posted online. I'd use Archivebot and youtube-dl [18:12]
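As a rough illustration of the youtube-dl half of that suggestion, here is a hedged sketch using the youtube_dl Python package; the channel URL, output template and option choices are placeholders, not anything settled in the channel.

```python
# Hedged sketch of archiving a channel's uploads with youtube-dl; the URL and
# output template below are placeholders.
import youtube_dl  # pip install youtube-dl

opts = {
    "outtmpl": "%(uploader)s/%(upload_date)s - %(title)s - %(id)s.%(ext)s",
    "writeinfojson": True,                 # keep per-video metadata alongside the media
    "writethumbnail": True,
    "download_archive": "downloaded.txt",  # lets interrupted runs resume without re-downloading
    "ignoreerrors": True,                  # skip removed/private videos instead of aborting
}

with youtube_dl.YoutubeDL(opts) as ydl:
    ydl.download(["https://www.youtube.com/user/EXAMPLE_CHANNEL"])
```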
ola_norskSomebody2: could those various items, from e.g the fandom wikia, twitter, to youtube videos etc; later be made into e.g a 'Collection' ?
Somebody2: without having to be one item, i mean
[18:14]
Somebody2Yes, once you upload them, send an email to info@archive suggesting they be made into a collection, and someone will likely do it eventually. [18:15]
ola_norskty [18:15]
***godane has joined #archiveteam-bs [18:16]
ola_norskbtw, would the items need a certain meta-tag?
other than topics, i mean
[18:17]
JAAFor WARCs, you need to set the mediatype to web. Anything else is optional and can be changed post-upload (mediatype can only be set at item creation). But the more metadata, the better! :-) [18:20]
ola_norskokidoki [18:20]
JAA(If you forget to set the mediatype correctly on upload, send them an email instead of trying to work around it by creating a new item or whatever; they can change it, I believe.) [18:21]
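A minimal sketch of that upload flow with the internetarchive Python package (the 'ia' tool discussed below); only the mediatype rule comes from JAA's advice, while the identifier, filename and descriptive metadata are invented, and configured credentials (`ia configure`) are assumed.

```python
# Minimal sketch: upload a WARC with mediatype set correctly at item creation.
# Identifier, filename and descriptive fields are placeholders; assumes
# credentials have been configured with `ia configure`.
from internetarchive import upload

metadata = {
    "mediatype": "web",                   # must be right at item creation time
    "title": "Example WARC upload",
    "subject": ["archiveteam", "warc"],   # topics; these can be edited later
}

responses = upload(
    "example-warc-item-20171216",         # hypothetical item identifier
    files=["example.warc.gz"],
    metadata=metadata,
)
print([r.status_code for r in responses])
```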
***pizzaiolo has quit IRC (Read error: Operation timed out) [18:22]
ola_norskspeaking of which, i messed that up on this item https://archive.org/details/vidme_AfterPrisonJoe :/
and it seems to have messed up the media format detection. I did re-download the videos locally though.
[18:22]
***pizzaiolo has joined #archiveteam-bs [18:26]
ola_norskJAA: would it be a good idea to simply tar.gz the vidme videos, add that to the item; and then send an email asking for all the item's content to be replaced by the content of the tar.gz [18:27]
JAAI doubt it.
Why don't you just use the 'ia' tool (Python package internetarchive)?
[18:27]
ola_norski do use that
but, there seems to be a bug that's preventing changing metadata.
[18:28]
JAAHm? [18:30]
ola_norskJAA: https://github.com/jjjake/internetarchive/issues/228
but, maybe it's resolved already. I've not tested yet.
[18:30]
ouch, might i have accidentally closed the issue? Or do they timeout on github after some time :/ [18:41]
JAA: what version of 'ia' are you on?
ola_norsk is on 1.7.4
[18:46]
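For completeness, a post-upload metadata edit with the same package looks roughly like this; a sketch assuming configured credentials, with placeholder values (and older versions may be affected by the issue #228 discussed above).

```python
# Sketch of editing an existing item's metadata with the internetarchive
# package; assumes credentials configured via `ia configure`, and the title
# and subject values are placeholders.
from internetarchive import get_item

item = get_item("vidme_AfterPrisonJoe")
resp = item.modify_metadata({
    "title": "AfterPrisonJoe (vid.me mirror)",
    "subject": ["vidme", "video"],
})
print(resp.status_code)  # 200 means the metadata write was accepted
```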
....... (idle for 32mn)
how do i submit 'angrygrandpa.wikia.com' to get grabbed by archivebot?
nevermind
[19:18]
.... (idle for 19mn)
***dashcloud has quit IRC (Quit: No Ping reply in 180 seconds.)
dashcloud has joined #archiveteam-bs
[19:38]
.... (idle for 18mn)
dashcloud has quit IRC (Read error: Connection reset by peer)
dashcloud has joined #archiveteam-bs
[19:56]
..... (idle for 20mn)
BlueMaxim has joined #archiveteam-bs [20:20]
.... (idle for 15mn)
JAAola_norsk: Sorry, had to leave. I've been using 1.7.1 and 1.7.3.
ola_norsk, Somebody2: If you archive a site through ArchiveBot, you can't easily add just that site's archives to a collection afterwards though, because the archives are generally spread over multiple items and each of those items also contains loads of other jobs' archives.
It might be better to use grab-site/wpull and upload the WARCs yourself. Then you can create clean items for each site you archive or whatever, and these can easily be added to a collection (or multiple) afterwards.
[20:35]
Somebody2That is a good point, thank you. [20:42]
PurpleSymWhat’s the #archivebot IRC logs password? [20:47]
JAAQuery [20:56]
PurpleSymAnd username, JAA? [20:57]
ola_norskJAA: ah. i will see if i have the space for that wikia. (though, it's already submitted to archivebot as a job) :/ I will use !abort if my harddrive runs out
JAA: so there's no real way to recall a specific archivebot job/task?
[20:59]
JAAola_norsk: You could upload the WARCs already while you're still grabbing it. That's what ArchiveBot pipelines do as well.
What do you mean by "recall"?
[21:00]
ola_norskJAA: to make that specific archivebot task into an item on ia
JAA: a warc item, i mean, with topics etc.
[21:01]
JAAIn theory, you could download the files and reupload them to a new item. Or use 'ia copy', which does a server-side copy I believe. Whether that's a good idea though is another question entirely... [21:02]
ola_norskim n00b at using archivebot i'm afraid :/ [21:03]
JAAI guess someone from IA could move the files to a separate item. But again I'm not sure whether they do that.
But yeah, that's all manual.
Some pipelines upload files directly, and there you sort-of have one item per job (though it doesn't contain all relevant files and may sometimes contain other jobs as well).
But other than that...
[21:03]
ola_norska warc item that's manually uploaded as an item at a later time though, could that use data already grabbed through archivebot? Or would it be causing duplicate-hell? [21:05]
JAAI don't know how IA handles duplicates. [21:06]
ola_norskaye, me neither [21:06]
JAAThat's why I wonder if it's a good idea.
If they deduplicate the files, then it would probably be fine.
Maybe someone else knows more about this.
[21:07]
ola_norsk"somewhere, in the deep cellars of internet archive; There's a single gnome set to the task of checksum'ing all files and writing symlinks" lol
:D
[21:08]
***Mateon1 has quit IRC (Read error: Operation timed out)
Mateon1 has joined #archiveteam-bs
[21:12]
Somebody2IA does not (*YET*) deduplicate.
(AFAIK)
[21:15]
***jschwart has joined #archiveteam-bs [21:20]
...... (idle for 29mn)
dashcloud has quit IRC (Read error: Operation timed out)
dashcloud has joined #archiveteam-bs
[21:49]
ola_norski have no idea. if there's no pressing need, it's ok i guess. And I'm thinking they would if need be, at least on items that haven't been altered for quite a while. [21:51]
godanededuplication i can see being done on video and pdf items
i don't think it would work with warc files
[21:52]
ola_norskaye
not with derived/recompressed files either i think. not unless the original was checked beforehand
godane: i guess with warc it would need checking content; and patching the stuff and/or link list in the warcs
[21:53]
godanesort of my thought [21:58]
ola_norskaye [21:58]
ezunless ia unpacks all warcs on their side [21:59]
godanei was thinking it would check important web urls that have the same checksum and make a derived warc archive to only store it once [21:59]
ezwarc is generally rather unfortunate thing to do for bulk file formats
(not sure about the wisdom of reinventing zip files, either)
[22:00]
godaneso it would be derive warc either way
something like this for warc makes more sense if we are doing my librarybox project idea
[22:00]
ola_norskgot link? [22:01]
godanecause then people can host full archives of cbsnews.com for example without it taking 100gb [22:01]
ezthe thing is that on mass scale, dupes dont happen that often in general
so its often not worth the time bothering with it, especially for small items
[22:02]
ola_norskbut e.g for twitter, users might upload memes that are just copied from other sites. i don't know if twitter alters all images; but that could cause duplicates i think
if, twitter gives each uploaded image it's own file and filename, i mean
[22:04]
ezsomewhat
theres been study for this for 4chan, which granted, isnt representative sample of twitter
but as far as actual md5 stored _live_, only 10-15% were dupes
[22:05]
ola_norskok [22:06]
ezhowever over time, there were indeed >50% dupes in certain time periods
exactly as you say, some image got really popular and got reposted over and over
[22:06]
***pizzaiolo has quit IRC (Read error: Operation timed out) [22:08]
ola_norskin any case, it's something that's fixable in the future though i would guess. E.g picking through all the shit and finding e.g the most recurring, or high-quality image; either by md5 or image regocnitions
image recognition*
[22:08]
***pizzaiolo has joined #archiveteam-bs [22:09]
ezwell, this isnt that exact study i've seen, but mentions the distribution of dupes per post https://i.imgur.com/S9pxJqV.png
ola_norsk: it could be done of course. note that if you were to do what you say, you'd also build index for reverse image search
which would be a really handy thing for IA to have
needless to say, it depends if IA wants to diversify as a search engine
[22:10]
ola_norskwell, the meta.xml is there, containing the md5's i think :D [22:12]
ezmd5s useless for search [22:12]
ola_norskaye, but to find duplicates i mean [22:12]
ezon md5 level, the dupes dont happen often enough given a random kitchen sink of files
it definitely makes sense for certain datasets
[22:12]
***pizzaiolo has quit IRC (Client Quit) [22:13]
ezlike those ad laden pirate rapidshare upload servers, they sure do md5. they have this huge sample of highly specific content, and the files often are same copies.
megaupload actually got nailed for this legally
[22:13]
***pizzaiolo has joined #archiveteam-bs [22:13]
ezthey dmcad the link, but not the md5
deduping 10k 1MB image files and deduping 10k 1GB files is what makes the difference
as you get random mix of both, its obvious which part of the set to focus on
[22:13]
ola_norskit would be slow work i guess [22:15]
ezdepends on how the system works, really
most data stores working on per-file basis often compute hash on streaming upload, and make a symlink when hash found at that time, too
but more general setups often dont have the luxury of having a hook fire per each new file
[22:15]
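A toy sketch of the per-file scheme ez describes (hash on ingest, one stored copy per hash, symlinks for repeats); the on-disk layout and paths are invented for illustration.

```python
# Toy content-addressed dedup: hash the incoming stream, keep one blob per
# hash, and symlink repeated uploads to the existing blob. Layout is invented.
import hashlib
import os
import shutil

STORE = "store"               # content-addressed blobs live here, named by hash
os.makedirs(STORE, exist_ok=True)

def ingest(src_path: str, dest_path: str) -> str:
    h = hashlib.sha1()
    with open(src_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)                    # hash while "streaming" the upload in
    blob = os.path.join(STORE, h.hexdigest())
    if not os.path.exists(blob):               # first time this content is seen
        shutil.copyfile(src_path, blob)
    if os.path.lexists(dest_path):
        os.remove(dest_path)
    os.symlink(os.path.abspath(blob), dest_path)  # repeats become symlinks
    return h.hexdigest()
```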
ola_norskwhat if there was a distributed tool for the Warrior, that picked through the IA items' xml and looked for duplicate md5's ?
(of only original files that is)
[22:16]
ezits possible to fetch the hashes from ia directly via api [22:18]
JAAjoepie91: FYI, I'm porting your anti-CloudFlare code to Python. [22:18]
eznot everything is available tho. as long it has xml sha1, you can api fetch it
build an offline database too, etc
[22:19]
ola_norskez: it's quite a number of items on ia though. But if it was a delegated slow and steady task
ez: like a background task or something
[22:20]
ezno i mean you can have the hashes offline
as comparably small structure
unfortunately its really awkward to get it at this moment
[22:21]
ola_norskez: i mean making a full list of duplicate files. where e.g the 'parent item' is by first date [22:22]
ezola_norsk: a bloom filter with reasonable collision rate is like 10 bits per item
regardless of number of items
not sure if ia supports search by hash
last time i checked the api (some years ago) it didnt
[22:22]
ola_norskthe md5 hashes are in the xml of (each?) item [22:25]
ezif it still doesnt, you'd need to store hash, as well as xml id to locate its context as you say
which would bloat the database a great deal
ola_norsk: yea
the idea is that i run a scan over my filesystem and compare every file to a filter i scraped from the ia api
[22:25]
ola_norskit's not going anywhere though. So it could basically be done slow as shit don't you think? [22:26]
ezand upload only files which dont match. this is because querying IA with 500M+ files is not realistic
so bloom filter would work fine for uploads of random crap and making more or less sure its not a dupe
but it wont tell you *where* your files are on ia, unless you abuse the api with fulltext search and what not
[22:26]
ola_norski have no idea man :)
does the logs say?
the history of items?
ola_norsk 's brain is broken and beered :/
i'm guessing IA would put some gnome to work the day their harddrive is full :D
(which i'm guessing is not tomorrow lol) :D
[22:28]
ezjust restrict some classes of uploads when space starts running short
but yea, space can be done on cheap if you have the scale
[22:32]
ola_norskdoh, restriction is bad :/
might they as well check if that file already exist?
[22:33]
ezas i said, at those scales, the content is so diverse it happens rather infrequently, especially if your files are comparably small (ie a lot of small items of diverse content) [22:34]
ola_norsk...maybe they already do.. :0 [22:34]
***odemg has quit IRC (Read error: Connection reset by peer) [22:34]
ezits easy to do for single files, but not quite sure about warc [22:35]
ola_norskaye, it would need unpacking and stuff
and e.g youtube videos that are mkv-combined would need to be split into audio and video i guess, then compared
[22:35]
ezthe thing is i've seen deduping rather infrequently in large setups like this - the restrictions on flexibility of what you can do (you now need some sort of fast hash index to check against, you need some symlink mechanism now)
youtube doesnt dedupe im pretty sure
not by the source video anyway
since 99% of content they get is original uploads. most dupes they'd otherwise get usually gets struck by contentid
1% is the long trail of short meme videos reposted over and over and what not, but its just a tiny part of long trail
[22:37]
ola_norski sometimes upload by getting videos with youtube-dl, and it seems that often combines 2 files, audio and video, into a single file.. would that make a different md5 sum?
(without repacking/recoding, i mean)
[22:39]
ez(its easy to test - each reupload yields new reencode on yt, and the encode is even slightly different as it contains timestamp in mp4 header)
ola_norsk: yes, highest quality is available only via HLS
curiously, its an artificial restriction, as other parts of google infra which use yt infra get mp4 1080/4k just fine
the restriction is specific to yt, i suppose in a bid to frustrate trivial ripping attempts via browser extensions
[22:40]
ola_norskbut, i think what i mean is; If i run 'youtube-dl' on the same youtube video twice..Then e.g the audio and video (often webm and mp4), before they are merged into MKV file, would be the very same files each time? or? [22:44]
ezola_norsk: on and off its possible to abuse youtube apis to get the actual original mp4, but it changes 2 times now (works for google+, only for your own videos when logged in)
*has changed
so definitely not something to rely on
but if i were as crazy to archive yt, i'd definitely try to rip the original files, not the re-encodes
[22:44]
***odemg has joined #archiveteam-bs [22:45]
ola_norskno hehe :D i'm just talking example as to how to detect duplicate videos :D
or duplicate uploads in general
[22:45]
ezola_norsk: depends what you command ytdl to do
generally if you ask it same file, same format, you overwhelmingly get the exact same file
but google re-encodes those from time to time
[22:46]
ola_norskaye, most often it just combines audio and video, 2 different files, into a KVM file. And, i'm thinking if the kvm file were split again, into those two a/v files, the md5 would be the same in two instances of where youtube-dl were used to download the same video.
and, by that, duplicate video uploads could be detected
the generated merged file (kvm etc) would be different, but the two contained files would be the same, since there's no re-encoding occurring
[22:48]
ezhum, that sounds elaborate? [22:51]
ola_norskaye, i have a headache :D [22:51]
ezwhy not just tell ytdl to rip everything to mkv from the get-go? [22:51]
ola_norski just meant in relation to detecting duplicates in IA items (where e.g 2 youtube videos are uploaded twice)
where md5sum of the two kvm files is not an option
since each download would cause two different kvms to be made locally by the downloaders
if, however, it's possible to split a kvm into the audio and video files contained.. i'm thinking those two would yield identical md5
[22:54]
ola_norsk ran out of duplication-detection smartness :/ [23:02]
ezi have no idea what kvm file is [23:05]
***nyany has joined #archiveteam-bs [23:10]
ola_norskez: usually when i use youtube-dl it downloads audio and video separately, then combines the two into a kvm file [23:16]
ezwhats an kvm file?
oh
mkv
[23:16]
ola_norskah, sorry, yes
mkv
[23:17]
ezyea its a bit annoying, mostly because the ffmpeg transcoder is non-deterministic
i think it puts timestamp in the mkv header or something silly like that
s/transcoder/muxer/
[23:18]
ola_norskdoes it alter the two audio and video files though? or can they be split from the mkv?
either way, if that is possible; then detecting duplicate mkv files uploaded to ia is possible
even if the md5 sum differs between two mkv files containing the same content
[23:19]
ezthe raw bitstream is kept as-is [23:22]
ola_norskok [23:23]
ezmeaning the mux is as "original" as sent in the google HLS track
but the container metadata are often unstable on account of different versions of software muxing slightly differently (think seek metadata and such)
[23:23]
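Given that the copied bitstreams are untouched while the container metadata wobbles, one way to test ola_norsk's idea is to hash each stream under stream copy with ffmpeg's md5 muxer, so the container never enters the hash. A sketch, assuming ffmpeg is on PATH and a placeholder filename; whether two youtube-dl runs really yield byte-identical streams is exactly the open question above.

```python
# Sketch: hash the copied (not re-encoded) video and audio streams of an mkv
# with ffmpeg's "md5" muxer, so unstable container metadata never enters the
# hash. Assumes ffmpeg on PATH; "example.mkv" is a placeholder.
import subprocess

def stream_md5(path: str, selector: str) -> str:
    """selector is an ffmpeg -map specifier, e.g. '0:v:0' or '0:a:0'."""
    result = subprocess.run(
        ["ffmpeg", "-v", "error", "-i", path,
         "-map", selector, "-c", "copy", "-f", "md5", "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip().split("=", 1)[-1]  # output looks like "MD5=<hex>"

print(stream_md5("example.mkv", "0:v:0"))  # video bitstream hash
print(stream_md5("example.mkv", "0:a:0"))  # audio bitstream hash
```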
ola_norskif the framecount doesn't differ though, and neither does the frames, that could be a further step?
ez: the extent of my knowledge is rather spent when it comes to codecs :D
[23:27]
eztl;dr is that you cant rely on what ytdl gives you as muxed output
perhaps if you just ask for the 720p mp4, as that one isnt remuxed (yet?)
[23:28]
ola_norski'm thinking some kind of image/frame simularity detection then i guess
rather*
[23:30]
***jschwart has quit IRC (Quit: Konversation terminated!) [23:32]
ola_norskthe md5 is in the item xml though, so i guess that is where one would have to start to find duplicates on ia [23:32]
ez: it is as you say elaborate. So i'm glad i don't have to do it :D
ez: (and so should everyone else be, of me not doing it lol) ;)
[23:39]
***BlueMaxim has quit IRC (Quit: Leaving) [23:41]
