|zino||jrwr, not that it helps, but I'm sending all my sympathy and good thoughts.||[00:05]|
|jrwr||I'm going to hack up some cgi-bin
and compile php7 on this box
wouldn't be the first time I've worked around shit like this (LOOKING AT YOU, FERAL)
|........... (idle for 51mn)|
|so I turned on file based cache
it should help /some/
I'm working with apache some to get this working, it's being a PITA; I suspect an apache module is doing this
|***||BnAboyZ has joined #archiveteam-bs||[01:08]|
|.... (idle for 19mn)|
|Somebody2||BTW, regarding WARC uploads going into the Wayback Machine -- I've now gotten confirmation that it is still a trusted-uploaders-only process (which isn't surprising).
JAA is trusted, and ivan as well, presumably.
|jrwr||So SketchCow, I'm stuck: the PHP is too old to update mediawiki, Apache is not behaving with the CGI override due to mod_security being forced on, and overall the entire account is limited to 7 (confirmed) concurrent connections (that's what's causing the resource limit pages currently)
I've added the static file cache and it is helping
|SketchCow||There'll be some roughness as we figure out what to do.||[01:44]|
it's using 2.6 as its kernel....
|SketchCow||But if I have intelligent requests for the host, I'm sure they can help.||[01:44]|
|jrwr||Ok, so the main one is: can I have my limits increased for the number of CGI scripts run at one time? I keep getting resource limit errors on top of this error log: [Wed Dec 20 20:41:37 2017] [error] mod_hostinglimits:Error on LVE enter: LVE(527) HANDLER(application/x-httpd-php5) HOSTNAME(archiveteam.org) URL(/index.php) TID(318310) errno (7) Read more: http://e.cloudlinux.com/MHL-E2BIG min_uid (0)||[01:46]|
|SketchCow||Well, assemble them all in one place for me.
I mean after a day of looking it COMPLETELY over
And then I'll bring it to TQ and see what they think
|SketchCow||No sense in piecemealing
Also, let me glance at the cpanel
Ya, it's pretty much those two issues I have with it; I'm compiling them into a google sheet for tracking
|jrwr||I'm poking the poor wiki very hard
what's up, jacketcha
|jrwr||Well, the WARC standard is pretty simple overall
have some node
I was planning to add it to my chrome extension
im logging off for now SketchCow, See ya in the morning
|***||robink has quit IRC (Ping timeout: 246 seconds)||[02:13]|
|jacketcha||ok, so can somebody explain to me how warc files work
sorry for being dumb
i honestly have no idea
|Frogging||do you have a more specific question?||[02:16]|
|jacketcha||How is the data structured? I am going to assume that it isn't just copying in the HTML source code after the headers are added.||[02:20]|
|Frogging||It stores the full response headers and body
That includes responses containing binary data, HTML, CSS, plain text, whatever
|jacketcha||Is there any specific order to that?||[02:22]|
|Frogging||Records can be in any order AFAIK||[02:24]|
|jacketcha||Great, so it'll work just fine with asynchronous saving. Thanks, that was actually really helpful.||[02:25]|
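The structure Frogging describes (the full HTTP response, headers and body, wrapped inside a record with its own headers) can be sketched in plain Python. The URL and payload below are made-up placeholders, and this is a minimal illustration of the WARC/1.0 record layout, not a replacement for a real library like warcio:

```python
import uuid
from datetime import datetime, timezone

def make_response_record(uri: str, http_response: bytes) -> bytes:
    """Build a minimal WARC/1.0 response record: WARC headers,
    a blank line, the raw HTTP response, then two CRLF separators."""
    headers = [
        ('WARC-Type', 'response'),
        ('WARC-Target-URI', uri),
        ('WARC-Date', datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')),
        ('WARC-Record-ID', f'<urn:uuid:{uuid.uuid4()}>'),
        ('Content-Type', 'application/http; msgtype=response'),
        ('Content-Length', str(len(http_response))),
    ]
    head = 'WARC/1.0\r\n' + ''.join(f'{k}: {v}\r\n' for k, v in headers)
    return head.encode() + b'\r\n' + http_response + b'\r\n\r\n'

# Raw HTTP response (status line, headers, body) -- placeholder content.
http = b'HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html>hi</html>'
record = make_response_record('http://example.com/', http)
```

A .warc.gz is just a stream of such records, typically each gzipped individually, which is why record order doesn't matter.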
|Frogging||there's WARC 1.1 (which is the latest version) linked there too||[02:29]|
|***||bithippo has joined #archiveteam-bs||[02:38]|
|bithippo||Thinking about grabbing Imgur. All of it. Anything I should keep in mind prior to putting it in cold storage?
(iterating over every permutation of image urls based on how Imgur generates image urls)
|jacketcha||Is the way Imgur generates urls known?
Better question, is the source of data for the RNG Imgur uses known?
|***||robink has joined #archiveteam-bs||[02:48]|
"Choosing 5 characters from 26 lowercase letters + 26 uppercase letters + 10 numerical digits leaves us with 916,132,832 possible combinations (62^5). Upgrading to 7 characters gives us 3,521,614,606,208 (3.52 trillion) possibilities."
404->check back in the future, 200->WARC gz
Etag header on a request is the MD5 of the image
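If the ETag really is the image's MD5, as claimed above (treat that as an assumption about Imgur's behaviour, not a documented guarantee), a dedupe check is one line of hashing:

```python
import hashlib

def matches_etag(body: bytes, etag: str) -> bool:
    # ETag values are usually wrapped in quotes; strip them before comparing.
    return hashlib.md5(body).hexdigest() == etag.strip('"')

# Placeholder payload and ETag, not a real Imgur response.
ok = matches_etag(b'hello', '"5d41402abc4b2a76b9719d911017c592"')
```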
|jacketcha||That is still 3 trillion web requests||[02:53]|
Alternatives? Besides waiting until Imgur runs out of runway and then it's more pressing :/
(ie Twitpic 2.0)
My only frustration is that the URL isn't deterministic from a hash of the image, so it's possible an image exists, is deleted, and then replaced without any way to know
|jacketcha||Unless it was already archived
Look, it's a good idea in practice, but here's the thing
|jacketcha||Imgur gets around 17.3611111111 new images per second
That would place it at around 2687000000 images today
|jacketcha||It gets worse
That means you have a roughly 0.076300228743465619423705658937702969914635168416766807608280491104220976146849283303277891127907878357259164917744009861455363906114286574925748247085571170136781627322552461470824865159611513732687195% chance of getting an image every time you send a request
Don't be fooled by the high precision, the accuracy of your plan is very low.
But, there is a way to make it higher.
Much higher, in fact
If you can figure out the source of the data that Imgur uses for its random number generation algorithms, you can at least grab the newest images
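The arithmetic above is easy to re-derive; note the image count (2,687,000,000) is jacketcha's estimate from the stated upload rate, not a verified figure:

```python
ALPHABET = 26 + 26 + 10          # a-z, A-Z, 0-9 -> 62 characters

ids_5 = ALPHABET ** 5            # 916,132,832 five-character IDs
ids_7 = ALPHABET ** 7            # 3,521,614,606,208 seven-character IDs

images = 2_687_000_000           # estimated total images (from the discussion)
hit_rate = images / ids_7        # chance a random 7-char ID resolves
print(f'{hit_rate:.4%}')         # 0.0763%
```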
|bithippo||Sounds workable using their API to get latest image paths and then working backwards||[03:06]|
But if it is randomly generated, even pseudorandomly generated, you're still screwed.
you could just email them
|bithippo||"Hi. I will take one Imgur pls. Will be over with hard drives shortly."
Appreciate the input!
If a Warrior or ArchiveBot finds a WARC file, is it added to the collection of WARC files or is it added into the WARC file of the site it is located on?
|godane||i'm up to 18k items this month
this year has been slower than last year
i just hope i can get it past 100k for the year
it's 97,682 so far
maybe I should start counting mine
and putting it in actual warc files
|***||pizzaiolo has quit IRC (Remote host closed the connection)||[03:33]|
|robink has quit IRC (Ping timeout: 246 seconds)||[03:45]|
|Somebody2||jacketcha: if a HTTP request returns a WARC file, and that HTTP request and response is being stored into a WARC file,
then, yes, you'll have nested WARC-formatted data
AFAIK, no WARC-recording tool will automatically un-nest it (and that would probably not be a good idea in any case)
so that means that there possibly could be an archive of the entire internet floating around the wayback machine somewhere, but nobody would ever know because it was nested.
See, this is one of the things
You are asking... well, you're asking for a college course in how WARC works
It's sort of on topic and sort of off
It's certainly sucking all the air out of the room
It's nice to see people talking
|jacketcha||so nested warc files are basically politics
|***||bithippo has quit IRC (Ping timeout: 260 seconds)||[04:11]|
You're wandering into a welding shop going "So.... why cold welds"
|jacketcha||that seems very accurate||[04:16]|
|***||bithippo has joined #archiveteam-bs||[04:20]|
|kyounko has joined #archiveteam-bs||[04:33]|
|..... (idle for 22mn)|
|qw3rty117 has joined #archiveteam-bs||[04:55]|
|qw3rty116 has quit IRC (Read error: Operation timed out)||[05:01]|
|..... (idle for 21mn)|
|bithippo has quit IRC (Quit: Page closed)||[05:22]|
|Stiletto has quit IRC (Read error: Operation timed out)||[05:29]|
|Stilett0 has joined #archiveteam-bs
BlueMaxim has quit IRC (Read error: Operation timed out)
BlueMaxim has joined #archiveteam-bs
|...... (idle for 26mn)|
|wp494 has quit IRC (Ping timeout: 250 seconds)
wp494 has joined #archiveteam-bs
|zgrant has left
wp494 has quit IRC (Quit: LOUD UNNECESSARY QUIT MESSAGES)
wp494 has joined #archiveteam-bs
|kimmer1 has joined #archiveteam-bs
midas2 has quit IRC (Ping timeout: 1212 seconds)
kimmer12 has quit IRC (Ping timeout: 633 seconds)
|midas2 has joined #archiveteam-bs||[06:47]|
|........ (idle for 37mn)|
|wp494_ has joined #archiveteam-bs||[07:24]|
|ZexaronS has quit IRC (Read error: Connection reset by peer)
ZexaronS has joined #archiveteam-bs
wp494 has quit IRC (Read error: Operation timed out)
|wp494_ has quit IRC (Ping timeout: 633 seconds)
odemg has quit IRC (Read error: Operation timed out)
|odemg has joined #archiveteam-bs||[07:46]|
|wp494 has joined #archiveteam-bs||[07:53]|
|robink has joined #archiveteam-bs||[08:03]|
|jacketcha||I wonder how many times jquery has been archived
By this point, there must be at least a hundred copies made of it each week
|............. (idle for 1h2mn)|
|***||jacketcha has quit IRC (Read error: Connection reset by peer)
jacketcha has joined #archiveteam-bs
|Mateon1 has quit IRC (Ping timeout: 245 seconds)
Mateon1 has joined #archiveteam-bs
|.... (idle for 19mn)|
|jrwr||SWEET BABY JESUS
somehow, I got php7 to run
on this holy shit old host
|PurpleSym||Is this a dedicated server, jrwr?||[09:39]|
|jrwr||not even close
it's a shared host running linux 2.6 on an old cpanel
running a good old apache + php 5.3
I overrode mod_security and mod_suphp all to fuck to get PHP scripts to run with a custom statically linked php binary I made
|PurpleSym||Wtf? 2.6 EOL’d years ago.||[09:41]|
|jrwr||I'm making do with what I have
it's where it's staying
I'm making its own little world on this webhost
jrwr compiles memcached
comes the fun part
I'm going to update mediawiki
i can't even update windows without doing a ritual to please the tech gods
|jrwr||this is dark magic
php does not like doing this
I have a plan
to compile php with memcached, and then run a little memcached server so mediawiki can cache objects
you don't even need php
just do what I do and use 437 IFTTT applets as your server
with a touch of github pages
this is the archiveteam wiki I'm working on
|jacketcha||Is the ArchiveTeam wiki archived?||[09:55]|
|jacketcha||You know what I want to try? My school has unlimited storage on all google accounts under its organization. I wonder how far they would let me push that.||[09:57]|
|jrwr||Its staying where it is for now
|jacketcha||is it because you missed a semicolon somewhere but there isn't a really good php linter yet
i just remembered i have midterms
|.... (idle for 18mn)|
that's a ton better
|***||jacketcha has quit IRC (Remote host closed the connection)||[10:19]|
|jrwr||Archive team is now running on mediawiki 1.30.0||[10:19]|
|***||jacketcha has joined #archiveteam-bs
jacketcha has quit IRC (Remote host closed the connection)
|jacket has joined #archiveteam-bs||[10:29]|
|fie has joined #archiveteam-bs||[10:43]|
|fie has quit IRC (Read error: Connection reset by peer)||[10:48]|
|.... (idle for 19mn)|
|pizzaiolo has joined #archiveteam-bs||[11:07]|
|jrwr||Igloo: better huh?||[11:14]|
|Igloo||jrwr: miles and miles||[11:16]|
Response times are sub 200ms
Before they were 1400ms
|JAA||jrwr: Well done! Much, much better. <3||[11:20]|
|JAA||Somebody2: Ah, makes sense. Thanks for checking.||[11:21]|
|jrwr||I woke up from a strange dream at 3am (flying an airplane and somewhat crashing it)
And then had a brainwave on how to get php working correctly
Been up since then, work is going to be hell today
|.......... (idle for 46mn)|
|***||BlueMaxim has quit IRC (Leaving)||[12:08]|
|....... (idle for 32mn)|
|Igloo||Ok, So it looks like we can iterate through the numbers||[12:40]|
|JAA||For user profiles, maybe. For characters, no way.||[12:41]|
|Igloo||212 million users? Unlikely||[12:41]|
|JAA||But it should be fairly simple to scrape them from https://www.saintsrow.com/community/characters/mostrecent
The question is, how do we get the actual characters (not just the images)?
|Smiley||is there a log of what has gone before?
I have a 'archiveteam' account registered along with my personal one
not sure why
maybe i just suggested scraping for SR3
|Igloo||https://www.saintsrow.com/users/show/212300001 appears to be the lowest. https://www.saintsrow.com/users/show/213056573 latest
2300001 → 3056573, so ~756,000 user profiles?
Those are easy
|........ (idle for 35mn)|
|jrwr||SketchCow: email fixed
confirmed working with password resets being sent to a gmail account
you can see when I dropped in the php changes
|..... (idle for 20mn)|
|joepie91: it doesn't work like that
this whole box is from 2011
|joepie91||ah, just an old cpanel then that doesn't support it, or?||[13:53]|
I have methods and apis
im patching it in
|***||icedice has joined #archiveteam-bs||[14:05]|
|icedice has quit IRC (Ping timeout: 250 seconds)||[14:13]|
|JAA||jrwr: LOL, that graph is beautiful!||[14:13]|
|.... (idle for 15mn)|
its going up on my wall
|........... (idle for 50mn)|
fucking A rating!
|MrRadar2||:D :D :D https://i.imgur.com/CloHYLR.png||[15:20]|
|jrwr||with Strict Transport Security (HSTS) on (left it pretty short just in case)||[15:20]|
|Igloo||Just need a redirect now ;-)||[15:20]|
|jrwr||na, not going to enforce it
HSTS is enough
|***||zgrant has joined #archiveteam-bs||[15:23]|
SketchCow: SSL is now installed
hehe, all my home stuff with LE gets A rating too
Which is bonza
|SketchCow||I think that's all I can think of
Someone proposed some sort of theme upgrade
But it all seems just fine to me now.
I /might/ get bored and add in a new editor but the new editor requires all kinds of crazy
|SketchCow||If people come up with things, we'll consider them now that it's possible
Generally, someone complaining they can't work on the wiki because they miss a gewgaw is focused on the wrong things.
I am using the file based cache built into mw
so bots and stuff all get served static pages
I feel like I just refurbed my 1984 Chrysler lebaron convertible (I own one) https://drive.google.com/file/d/1AQqXNiluKTk5xuCYStfVexiH1LLUOYaLLQ/view?usp=sharing
runs great, and talks to you
|I love the old DEC speech Synths
sound better than software ones
|........ (idle for 35mn)|
|godane||so another box of tapes i bought is shipped||[16:28]|
|..... (idle for 22mn)|
|***||dd0a13f37 has joined #archiveteam-bs||[16:50]|
|godane||so this happened http://mashable.com/2017/12/20/sesame-street-irc-macarthur-grant-refugee-middle-east/#fIS9la5_bSq7||[16:53]|
|...... (idle for 25mn)|
|***||schbirid has joined #archiveteam-bs
jacket has quit IRC (Read error: Connection reset by peer)
jacket has joined #archiveteam-bs
|dd0a13f37||aria2c is a mystery
if I have it use 1 connection or 10, I still get about 2r/s
if I split it up across 6 command windows, 12r/s
might have to do with the fact that it's split across multiple IPs though
Anyone know a good tool to do this automatically? Split up http requests over multiple proxies?
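No ready-made tool gets named in the channel; a minimal sketch of the idea (rotate URLs across a proxy pool, fetch in parallel so per-connection throttling hurts less) could look like this. The proxy addresses are hypothetical placeholders:

```python
import itertools
from concurrent.futures import ThreadPoolExecutor
from urllib.request import ProxyHandler, build_opener

# Placeholder HTTP proxy endpoints -- substitute real ones.
PROXIES = ['http://127.0.0.1:8118', 'http://127.0.0.1:8119']

def assign(urls, proxies):
    """Round-robin: pair each URL with the next proxy in the list."""
    return list(zip(urls, itertools.cycle(proxies)))

def fetch(url, proxy):
    # Each request is routed through its assigned proxy.
    opener = build_opener(ProxyHandler({'http': proxy, 'https': proxy}))
    return opener.open(url, timeout=30).read()

def crawl(urls, workers=6):
    # Parallel workers, each bound to one (url, proxy) pair.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda pair: fetch(*pair), assign(urls, PROXIES)))
```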
|....... (idle for 34mn)|
|***||bithippo has joined #archiveteam-bs||[17:59]|
|......... (idle for 43mn)|
|ola_norsk||is there some sort of code available to look at on how IA get urls from warcs?
or does it convert first, somehow, then do _that_ ?
i was kind of expecting warcs to be a kind of archive with an index, not all data in a single file :/
e.g. containing something i could open in gedit etc..
|dd0a13f37||Can someone help me?
I'm trying to archive a site with aria2c. When using multiple concurrent connections, I get about 2r/s. The speed is around 50 kbit, despite my internet connection being much faster.
When using multiple instances over multiple IPs, it's not much better. The individual speeds for some drop down to 5kbit.
|bithippo||Recommend using https://github.com/ludios/grab-site to archive instead||[18:54]|
|dd0a13f37||I am currently running 50 instances of aria2 with 20 concurrent connections each. I get 5r/s total.||[18:54]|
|bithippo||If you need a WARC file, request/response headers, etc.||[18:54]|
|dd0a13f37||This is abysmally low, what gives?||[18:54]|
|bithippo||You are most likely being throttled by IP||[18:54]|
|dd0a13f37||But I'm spreading it over 50 different IPs.||[18:54]|
|bithippo||(even if distributing across multiple IPs)
Sliding window of bytes in the webserver config. Initial requests are fast, subsequent requests slow down if you try to firehose
What's the hostname?
Or my hostname?
|bithippo||Nope, site hostname. Checking something.||[18:55]|
|dd0a13f37||So how can it throttle them to 2-5 KiB/s when using 50 different IPs, but 50 KiB/s when using 1?||[18:55]|
|bithippo||Are these all anonymous web requests? Or are you signed in/setting a cookie to be logged in to fetch data?||[18:57]|
|dd0a13f37||These are all anonymous requests from tor exit nodes. No cookies are stored.||[18:58]|
|bithippo||Could be throttling by tor IPs. I did that at my last gig on our Nginx servers.
(Tor requests were notoriously bad scraping actors in our case)
|dd0a13f37||But I used tor IPs before too. And those were at 50KiB/s, not 2-5KiB/s.||[18:58]|
|bithippo||I don't have a good answer unfortunately :/ Lots of variables that could be causing it. What's the purpose of using Tor to perform the requests?||[18:59]|
|dd0a13f37||The only logical explanation is my connection being the bottleneck, but that would put it at around 2 mbit, which is way too slow
Because I don't want to get in any trouble for the scraping, and they could ban my IP
Now it jumped up to 12 resp/s, which was my previous peak when using 6 different IPs.
|bithippo||Is a cloud provider VM out of the question with a slow concurrency rate?
2 requests per second, say.
|dd0a13f37||It could be on their end too.
I could just leave the computer on over night, but I would prefer not to pay any money.
Could it be they just serve 12 connections at the same time?
Now down to 7 again. Sure is a mystery what is going on...
And now back up to 14. At this speed, it will take 6 hours, which is slow but acceptable.
|bithippo||I can rip it for you and provide a torrent file when I'm done.||[19:07]|
|dd0a13f37||It went up to 34 now.
How? With grab-site?
They might block the IP being used in that case
|bithippo||10 second wait between requests
It'll take a while, but it'll finish eventually.
Have to head out, leave me a note here if that's a plan
|dd0a13f37||10 seconds would take half a year for the whole site, and it would change during the time, so I don't think that's a good idea
But if it continues to be this fast then it should be done in a few hours, which is good.
|Well, the only logical explanation is some advanced throttling algorithm in place. I can't find any other explanation for why it's so slow.||[19:21]|
|https://pastebin.com/VkCS1yJ1 It apparently got faster over time, with a peak of 46 resp/s, before slowing down.||[19:27]|
|.... (idle for 18mn)|
|ola_norsk||any sqlite geniuses savvy with very basic sqlite relational database structure in here who wouldn't mind if i ask some questions?
|dd0a13f37||I have a basic knowledge, shoot||[19:46]|
|ola_norsk||dd0a13f37: thanks. if you please take a look at the SQL here, (just page search for 'sqlqueries') https://github.com/DuckHP/twario-warrior-tool/blob/master/src/twario/sqlitetwario.py
i'm sure that sql could be done better, and i think you'll agree. Sadly my sql is quite shit
to optimize storage etc..i mean
and speed etc
|dd0a13f37||You can add a constraint for TweetUserId so it has to have a corresponding entry in Users
And Users should either have id INTEGER PRIMARY KEY, or TweetUserID should be a username
|ola_norsk||yeah been thinking that so i made a 'users' table
|dd0a13f37||Display name isn't stable, but it might be overkill to provision for that
search for foreign key constraint
|ola_norsk||aye, i'm not even sure yet if 'tweep' reads display names||[19:52]|
Well, if you want to archive avatars etc it might be neat to have. You could have three tables, but it might be overkill
tweets - tweet text, date, username
users - username (not unique), date, avatar, displayname
or wait, that makes two
|***||Valentine has quit IRC (Read error: Connection reset by peer)||[19:54]|
|ola_norsk||avatar might be doable||[19:54]|
|dd0a13f37||And then just do SELECT * FROM users WHERE username = ... LIMIT 1||[19:54]|
|dd0a13f37||not sure about the syntax||[19:55]|
|ola_norsk||the requests sql i can figure out i think, but i suck at schema/structure :/||[19:58]|
|***||BnAboyZ has quit IRC (Quit: The Lounge - https://thelounge.github.io)||[19:58]|
|ola_norsk||i'll check out that link you posted. thanks||[19:59]|
|dd0a13f37||Well, have one users table that for each username can have multiple entries (e.g. if they change their avatar you get a new entry with same username)
And one tweets table, since they are immutable
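The two-table layout dd0a13f37 describes (immutable tweets, plus a users table allowed to hold multiple rows per username as profiles change) might look like this in sqlite3; the column names are illustrative, not taken from twario:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('PRAGMA foreign_keys = ON')  # SQLite leaves FK checks off by default
conn.executescript('''
CREATE TABLE users (
    id INTEGER PRIMARY KEY,
    username TEXT NOT NULL,          -- not unique: one row per profile change
    displayname TEXT,
    avatar TEXT,
    seen_date TEXT
);
CREATE TABLE tweets (
    tweet_id INTEGER PRIMARY KEY,    -- tweets are immutable
    user_id INTEGER NOT NULL REFERENCES users(id),
    text TEXT,
    date TEXT
);
''')
conn.execute("INSERT INTO users (username, displayname) VALUES ('jack', 'jack')")
conn.execute("INSERT INTO tweets (tweet_id, user_id, text) VALUES (1, 1, 'hi')")
```

The "latest profile" lookup then becomes `SELECT * FROM users WHERE username = ? ORDER BY id DESC LIMIT 1`.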
|***||Valentine has joined #archiveteam-bs||[20:00]|
|ola_norsk||e.g. if a tweet is identical, just have a 'content' table perhaps, and reference that?||[20:03]|
|dd0a13f37||If you're ever building your own scraper, it seems like mobile.twitter.com is more pleasant to work with
|ola_norsk||i'm just "re-doing" a tool called 'tweep'||[20:03]|
|dd0a13f37||As in, modifying it?||[20:04]|
|ola_norsk||aye, this: https://github.com/haccer/tweep ..it seems to work quite well, but could use some tweaking||[20:05]|
|dd0a13f37||If you want a complete archive, you could probably crawl pretty nicely. Start off by the timeline, then see what accounts and hashtags you find. Then traverse those accounts and hashtags, see what accounts and hashtags you find.
>The --fruit feature will display Tweets that might contain sensitive info
|ola_norsk||aye, have not tested that yet, but i've been thinking of removing it
basically 'user' and 'search words' is my focus
not exactly too keen on archiving 'doxing' tweets
|dd0a13f37||Well, why bother taking it out? Just don't use it, or remove all documentation references to it if you're really concerned.||[20:08]|
|dd0a13f37||I think mobile.twitter.com is better. It shows 30 tweets/page instead of 20, and the pages are faster to download||[20:09]|
|hook54321||JAA: Are you still grabbing the Catalonia cameras that update every 5 minutes or so?||[20:10]|
|ola_norsk||dd0a13f37: it seems to require a signin/account||[20:10]|
I think so, at least.
Let me check.
|ola_norsk||dd0a13f37: i deliberately made myself banned from twitter :/||[20:10]|
|hook54321||I need to start recording the cameras I was recording again||[20:10]|
|JAA||Yep, it's still grabbing... something.||[20:11]|
|dd0a13f37||mobile.twitter.com doesn't need an account||[20:11]|
|JAA||Haven't looked at the content in a long time though.||[20:11]|
|dd0a13f37||https://mobile.twitter.com/jack?max_id=938593014343024639 works just fine for me||[20:11]|
|ola_norsk||dd0a13f37: so it's just because i'm using desktop browser then?
dd0a13f37: that link worked btw
dd0a13f37: doh, i got a "join today to see it all" when scrolling
|hook54321||JAA: Should I grab this whole youtube channel? https://www.youtube.com/user/gencat||[20:13]|
|dd0a13f37||I am using tor browser with JS disabled.||[20:14]|
|ola_norsk||dd0a13f37: i think if 'twario/tweep' is made a bit less aggressive, it wouldn't need to be 'torified'||[20:16]|
|dd0a13f37||https://mobile.twitter.com/jack?max_id=743833014343024639 I can go quite a bit back
Why castrate your perfectly working tweet scraping tool? Requests can use proxies, or multiple.
|ola_norsk||dd0a13f37: with the original 'tweep' it seemed to stop half a year or so back in time on a search word
they could, but it would eventually get noticed i think if it's running continuously :/
|dd0a13f37||different users go differently far back https://mobile.twitter.com/realDonaldTrump?max_id=793833014343024639||[20:18]|
|ola_norsk||i don't mean users, but e.g one word||[20:19]|
|dd0a13f37||There are many tor exit nodes.||[20:19]|
|ola_norsk||how could a python script be _fully_ torified? If it could be done without using a virtual machine, that would be cool :D||[20:21]|
|dd0a13f37||torsocks python ./myscript||[20:21]|
|dd0a13f37||Or you can just have requests use a proxy
torsocks -i for guaranteed fresh ip
|***||BnAboyZ has joined #archiveteam-bs||[20:24]|
|hook54321||What collection should I upload that channel to? There's like 400 videos....||[20:27]|
|ola_norsk||dd0a13f37: will definitely test that. And i'm guessing just the tiny bit of extra time storing to an sqlitedb counts as a tiny bit of it being nice-ifyed :D||[20:28]|
|dd0a13f37||The time the request takes will, unless you're using twisted/multithreading||[20:29]|
|ola_norsk||dd0a13f37: the 'local capture time' column in tweets i think i put in for exactly that purpose, since JAA pointed out that 'tweep' itself does not seem to be correct at keeping times
|dd0a13f37||The mobile search url query string is ... interesting...
(that was one link)
|dd0a13f37||the base64 encoded part is some kind of bitmask||[20:31]|
i think i've seen that garbled shit before, when tweep crashed :/
when adding the 'logging' module, that looks exactly like the output given on the line where it stopped
because there is no base64 encoding or anything of the sort in tweep
|ola_norsk||maybe i have the output..one sec
seems i've deleted the log, I'll risk trying to run the same command one more time. brb
I enabled the new editor toolbar in the wiki (cc SketchCow )
|ola_norsk||dd0a13f37: could it be compressed stuff, like in 'header: gz' crap?||[20:40]|
|dd0a13f37||No, it's base64
run base64 -d | xxd, then paste it in
You'll see most of the bytes only have one bit set
|dd0a13f37||Since it doesn't do anything if you change the numbers at the beginning (max id), the max_id parameter is in there too||[20:42]|
|ola_norsk||dd0a13f37: all i know is that mess of "AAAAAAAA" was the end of the log line when i last tested tweep. And also where it apparently failed.||[20:42]|
|dd0a13f37||Not at the beginning, since that's the same across requests||[20:43]|
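The `base64 -d | xxd` inspection above is easy to replicate in Python. The token here is a made-up stand-in (small bytes with single bits set), not a real Twitter cursor value:

```python
import base64

# Hypothetical cursor-like token: bytes where mostly one bit is set.
token = base64.b64encode(bytes([0, 1, 2, 4, 8])).decode()   # 'AAECBAg='
raw = base64.b64decode(token)

# xxd-style: render each byte in binary to eyeball which flags are set.
bits = ' '.join(f'{b:08b}' for b in raw)
print(bits)
```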
|ola_norsk||i'm running the same command now, and will pastebin (when) it fails||[20:43]|
|jrwr||SketchCow: its snazzy||[20:44]|
|ola_norsk||dd0a13f37: for all i know it might've been some nasty character(s) that did it||[20:45]|
|jrwr||makes it a /little/ simpler to edit pages||[20:46]|
|***||bithippo has quit IRC (Quit: Page closed)||[20:48]|
|BartoCH has quit IRC (Ping timeout: 260 seconds)||[20:54]|
|dd0a13f37||Oh, regular twitter has that same AAAAAAAAAAA mess, just not as the requested URL||[20:59]|
|JAA||I think it does. Load a page in your browser (with JS enabled), enable dev console, scroll to the bottom, check out the requests that happen in the background.
I think tweep just tries to imitate what the browser would do.
it has not crashed yet here like last time, but seems like a lot of people love to tweet 'netneutrality' these days. So it's not even done with this month. I think last time it crashed at about this year's may tweets
holy shit people have been tweeting 'netneutrality' lol
|dd0a13f37||Twitter gets 6k tweets/sec, with 20 tweets/request archiving this is in the realm of possibility||[21:05]|
|ola_norsk||dd0a13f37: aye :D and using webarchive.io, or wget/curl with requests to web.archive.org/save/ is quite futile :D||[21:06]|
|dd0a13f37||You would need to do a few hundred requests per second. The problem is archiving all those avatars, if you saturate a 1gbit/s line you can afford to archive 20kbit avatars assuming no overhead or IP bans
But the avatars aren't very important, are they?
|ola_norsk||reconstructing the links to the tweets is more important||[21:07]|
|dd0a13f37||That's possible too, all the info is in the HTML||[21:08]|
|ola_norsk||and 'tweep' captures the id
|dd0a13f37||Does tweep have a mode where it can just show you all the tweets being done?||[21:08]|
|ola_norsk||it does by default||[21:09]|
|dd0a13f37||Without narrowing down to a hashtag? Does it get 100%?||[21:09]|
|ola_norsk||i don't know. It does a fuck of a lot of tweets though :D
if why it stopped could be worked out, i bet it could do 100%
|jrwr||for the wikidump nerds: at 04:00 on Friday, a copy of the wiki's XML + images is uploaded to the IA
for good measure
|ola_norsk||right now i'm just doing "python tweep -s 'netneutrality' > tweets.txt" ..to see if it eventually stops like last time. For all i know, piping to a textfile is what did it.||[21:12]|
|dd0a13f37||But can you just run python tweep > t.txt?||[21:13]|
|SketchCow||AND THE WINNER OF THE "DO YOU WANT THIS" SWEEPSTAKES FOR DECEMBER 21 IS||[21:14]|
|ola_norsk||with '-s <search word>' , yes||[21:14]|
|SketchCow||...hundreds of gigs of funeral recordings in mp3||[21:14]|
|***||BartoCH has joined #archiveteam-bs||[21:14]|
|dd0a13f37||But without -s parameter?
You could cheat and just use the X most common words, but that's not a nice solution
|ola_norsk||then it asks for parameters i think .. it's either '-u (user)' or '-s (word(s))'
either one of those is required.. _i think_
|dd0a13f37||Then a full scrape is difficult, or at least harder||[21:16]|
|jrwr||SketchCow: your collection never ceases to amaze me||[21:16]|
|JAA||jrwr: Yay, finally. The last such dumps were uploaded in 2014 or 15.||[21:16]|
|jrwr||they get dumped here after processing https://www.archiveteam.org/dumps/||[21:17]|
|ola_norsk||dd0a13f37: with a 'users' table it could be easier though perhaps..or :D||[21:17]|
|jrwr||only keeps one||[21:17]|
|jrwr||the backup log for it is in https://www.archiveteam.org/backup.log||[21:17]|
|dd0a13f37||Well, there's still a few users who are never mentioned by others, never use certain hashtags, and never use certain words||[21:18]|
|ola_norsk||dd0a13f37: not to mention banned, yet mentioned, and private..etc. i guess
dd0a13f37: i don't have much experience using tweep, so i don't even know how it behaves on finding disabled accounts, or banned users :/
|dd0a13f37||If you're scraping in realtime, that doesn't matter.. it would be one hell of a tweet to get banned in under 5 milliseconds||[21:22]|
|ola_norsk||dd0a13f37: i think it just goes back in time from point of start||[21:22]|
|dd0a13f37||You'll never keep up, better to go in realtime||[21:23]|
|ola_norsk||dd0a13f37: that would need some genius at python threads i think..and perhaps faster bandwidth than mine :D
|dd0a13f37||twisted-http is fast, no?||[21:24]|
|***||jacketcha has joined #archiveteam-bs
pizzaiolo has quit IRC (Read error: Operation timed out)
|ola_norsk||dd0a13f37: i do not know..tweep uses 'request' / 'urllib(3?)' i think||[21:25]|
|dd0a13f37||one results page is 8.5k gzipped, contains 20 tweets, at 6k tweets/sec this gives 20 mbit/s||[21:26]|
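dd0a13f37's 20 mbit/s figure is easy to check; all three inputs (page size, tweets per page, global tweet rate) are their estimates from the discussion:

```python
page_bytes = 8500                # one gzipped results page, ~8.5 kB (estimate)
tweets_per_page = 20
tweet_rate = 6000                # tweets/sec across all of Twitter (estimate)

pages_per_sec = tweet_rate / tweets_per_page         # 300 requests/sec
mbit_per_sec = pages_per_sec * page_bytes * 8 / 1e6  # ≈ 20.4 Mbit/s
```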
|ola_norsk||dd0a13f37: it wouldn't surprise me if in certain scenarios a hashtag were quicker than i could process||[21:26]|
|dd0a13f37||Yeah, requests without threading.||[21:26]|
|***||jacket has quit IRC (Ping timeout: 248 seconds)||[21:27]|
|dd0a13f37||But I think caching will make such attempts impossible, if you do the same query multiple times you'll get the same result||[21:27]|
|ola_norsk||when using crontab wget, i had to cut time from 5 minutes to 3 minutes between each web.archive.org/save/ request..just to have a chance||[21:27]|
|dd0a13f37||You can do those in the background. Fire and forget. But IA won't like it||[21:29]|
|ola_norsk||dd0a13f37: going "upwards" in time in a twitter feed is most likely the best solution. But my grasp of how to do that..is weak :D||[21:29]|
|dd0a13f37||I think archiving twitter is an insanity project anyway, better to just wait for library of congress to get their shit together||[21:30]|
|ola_norsk||dd0a13f37: i just focus on hashtags, like netneutrality :D||[21:30]|
|dd0a13f37||that's probably possible||[21:30]|
|ola_norsk||dd0a13f37: entire twitter, or twitter by even years or months..yeah, some congress would have to do that :D||[21:31]|
|jrwr||and now I rest from poking the wiki really hard over the last 24hr||[21:31]|
|***||jacketcha has quit IRC (Read error: Operation timed out)||[21:33]|
|ola_norsk||dd0a13f37: tweets containing 'netneutrality' have been scrolling on my screen since i started the command, and i'm still on 2017-12-19 :/
dd0a13f37: though i expect it will speed up when getting past the 14th a bit
|dd0a13f37||You could archive faster if you modify it to use twisted||[21:34]|
|ola_norsk||just by the protocol stuff or using threading?||[21:35]|
>Twisted is an event-driven networking engine
|ola_norsk||so it's beautifulsoup that's the bottleneck, or?||[21:37]|
and that it's not using requests with threads
|ola_norsk||could it "re-use" already established connections? because that is one thing that pisses me off about tweep. It seems to do one connection per damn tweet||[21:40]|
|ola_norsk||..or at least try, like wget
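Connection reuse is what HTTP keep-alive provides; a minimal standard-library sketch of the idea (the hostname and path here are illustrative, not tweep's actual endpoints):

```python
import http.client

# One persistent connection; each request() reuses the same TCP/TLS
# socket instead of reconnecting per tweet. (requests.Session gives
# the same behaviour, plus connection pooling, if tweep were changed
# to use a single session instead of one connection per request.)
conn = http.client.HTTPSConnection("twitter.com", timeout=10)
# conn.request("GET", "/search?q=%23netneutrality")  # illustrative path
# body = conn.getresponse().read()
conn.close()
```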
|jrwr||anyway JAA I figured once a week is a good backup for a low traffic wiki||[21:41]|
|dd0a13f37||or apparently asyncio is recommended||[21:42]|
|JAA||Yeah, sounds reasonable.
ola_norsk, dd0a13f37: It would probably be easiest to reimplement the whole thing based on aiohttp or similar.
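The win from an event-driven rewrite is concurrency without threads; a stand-in sketch using only asyncio, with `asyncio.sleep` playing the role of network I/O (a real scraper would use aiohttp's `ClientSession` instead):

```python
import asyncio
import time

async def fetch(page):
    # Stand-in for a network request; with aiohttp this body would be
    # roughly `async with session.get(url) as resp: return await resp.text()`.
    await asyncio.sleep(0.1)
    return page

async def main():
    # 20 "requests" run concurrently, so the whole batch takes ~0.1 s
    # instead of 20 * 0.1 s as it would with sequential blocking calls.
    return await asyncio.gather(*(fetch(p) for p in range(20)))

start = time.monotonic()
pages = asyncio.run(main())
elapsed = time.monotonic() - start
print(len(pages), round(elapsed, 1))
```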
|ola_norsk||ola_norsk taking notes of all :D||[21:44]|
|JAA||I've written scrapers with aiohttp before, it's really nice.||[21:44]|
|ola_norsk||got git? :D||[21:45]|
|JAA||HTTP/2 support would be even better.
No, haven't shared it yet.
It's on my list for the holidays, uploading all my grabs and the corresponding code.
|ola_norsk||JAA: feel free to punch in some stuff :) https://github.com/DuckHP/twario-warrior-tool
i have to get the database thingy working first i guess, before i do anything else :/
(that, and making sure it doesn't freeze)
|JAA||What did you change so far?
Also, port to Python 3 please.
|ola_norsk||JAA: i've barely (not really) touched tweep itself so far :/||[21:48]|
|ola_norsk||bah...I've not python'ed in years..2.7 is new to me :D||[21:49]|
|JAA||Oh please, Python 3 was released in 2008. :-P||[21:50]|
|ola_norsk||why is there not a python script that converts e.g. 'print "shit"'?||[21:51]|
That should handle the most obvious stuff.
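Such a script did exist: `2to3` shipped with CPython releases of that era and handles exactly the print-statement case (among many others) via AST-based fixers. A toy regex version of its simplest fix, just to show the idea, not a substitute for the real tool:

```python
import re

# Toy sketch of what `2to3` automates: rewriting a Python 2 `print`
# statement as a function call. `2to3` does this properly on the AST;
# this regex only handles the simplest one-line case.
def fix_print(line):
    m = re.match(r'^(\s*)print\s+(.+)$', line)
    if m and not line.strip().startswith("print("):
        return f"{m.group(1)}print({m.group(2)})"
    return line

print(fix_print('print "shit"'))  # -> print("shit")
```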
|***||icedice has joined #archiveteam-bs||[21:53]|
|ola_norsk||the need to port an interpreted language...that's a travesty in itself :(||[21:54]|
|JAA||Well, it's necessary because they cleaned up a ton of poorly designed stuff in Python 3.||[21:54]|
|dd0a13f37||it truly boggles the mind
yet c can remain source compatible for 28 years and counting
|JAA||Yeah, let's compare C to Python...
And I doubt that C was as stable in the early stages of development.
Let's discuss that again when Python is 45 years old.
|dd0a13f37||But python is old by now. The 2 to 3 migration was a complete catastrophe.||[21:56]|
|JAA||Well yeah, many of those things (e.g. string vs. unicode distinction) should've been fixed earlier.
But they waited and accumulated all those things and then made one big backwards-incompatible release.
Which makes sense, otherwise you'd have to keep changing the code all the time.
|hook54321||Should these videos be uploaded to community video, or a different collection? https://www.youtube.com/user/gencat||[21:58]|
|JAA||Anyway, this is getting way too offtopic for this channel.||[21:58]|
|dd0a13f37||C was standardized in 1989, and k&r was released in 1978 - 11 years||[21:58]|
|dd0a13f37||python was released in 1991
it was not standardized by 2002
|***||JAA changes topic to: Lengthy Archive Team and archive discussions here | Offtopic: #archiveteam-ot | <godane> SketchCow: your porn tapes are getting digitized right now
schbirid has quit IRC (Quit: Leaving)
|dd0a13f37||Or, to be fair, python2 was released in 2000, and it wasn't standardized by 2011||[21:59]|
|dd0a13f37||I didn't know we had an offtopic channel for the offtopic channel||[22:01]|
|JAA||This isn't the offtopic channel, it was always about lengthy discussions (because #archiveteam is limited to announcements).
And -ot is new, just opened last week I think.
|DFJustin||oh great another channel||[22:02]|
|***||icedice2 has joined #archiveteam-bs||[22:06]|
|JAA||hook54321: Community video sounds reasonable to me. Are you uploading each video as its own item? If so, you should probably ask info@ to create a collection of all of them in the end.||[22:08]|
|***||icedice has quit IRC (Ping timeout: 250 seconds)||[22:08]|
|hook54321||Each of them their own item, yeah. I'm using tubeup to do it. I'll email info@ when it's done I guess.||[22:09]|
|***||ola_norsk has quit IRC (R.I.P dear known Python :( https://youtu.be/uy9Mc_ozoP4)||[22:10]|
|Smiley||is -bs not the off topic channel tho?!
Smiley so confuse. fuck that.
|dd0a13f37||When do Igloo's pipelines upload? As part of
Archiveteam: Archivebot GO Pack?
All pipelines do, except astrid's and FalconK's.
|dd0a13f37||So why can't I find a certain !ao job in it?||[22:16]|
|JAA||Let's go to #archivebot.||[22:16]|
|***||pizzaiolo has joined #archiveteam-bs||[22:19]|
|...... (idle for 27mn)|
|icedice has joined #archiveteam-bs
icedice2 has quit IRC (Ping timeout: 245 seconds)
icedice2 has joined #archiveteam-bs
icedice has quit IRC (Ping timeout: 245 seconds)
icedice2 has quit IRC (Client Quit)
kristian_ has joined #archiveteam-bs
icedice has joined #archiveteam-bs
|icedice2 has joined #archiveteam-bs
icedice has quit IRC (Ping timeout: 245 seconds)
|......... (idle for 40mn)|
|jacketcha has joined #archiveteam-bs||[23:48]|
|jacketcha||Hey, does anybody know if there is a node.js implementation of the Warrior program?||[23:56]|